Basic concepts of Data Representation


1.  Abstract and System defined data types

  • Data type
  • Primitive data type
  • Abstract data type
  • Polymorphic data type

2.  Representation, Primitive data structures

  • Data structures
    • Linear data structures
      • Arrays
        • Linear
        • Two dimensional
      • Linked List
        • Linear
        • Doubly
        • Circular
        • Stacks
        • Queues
    • Non-linear data structures
      • Trees
        • Binary trees
        • Binary search trees
        • Heaps
      • Graphs
      • Hash tables
  • Common operations on Data structures
    • Traversal
    • Searching
    • Insertion
    • Deletion
    • Sorting
    • Merging

Data type: A type stands for a set of values.  A set of values that correspond to a specific type, say a set of integers belong to the type integer, are collectively referred by using a data type.  A data type specifies the type of data required to be stored in a variable.

In ‘C’, variables must be declared before their use and must be declared with a relevant data type, which specifies the range of values that may be stored in the variable of that data type.  In ‘C’, data type specifies the amount of memory to be allocated.  For example: char allocates 1 byte, int – 2bytes, float – 4bytes, double – 8 bytes and long double – 10bytes on a specific machine.  Data types may be Primitive (fundamental), Abstract or Polymorphic data types.

The following are the characteristics of data types:

  • A domain is defined – the set of all possible values.
  • A set of allowable operations on those values is also defined.
  • A value is atomic within the domain.

Primitive Data Type: A primitive data type say char, int, float, double is the data type that is used to refer to a single value such as integer, float, character etc.  ‘C’ provides above four basic data types.  A variable in ‘C’ may be declared to be of primitive data type, such as int x; declares a variable x of type integer and allocates 2 bytes of memory for storing an integer value.

Abstract Data Type: ADT may be defined as a set of data values and associated operations that are precisely specified independent of any particular implementation.  Thus an Abstract Data Type is an organized collection of information and a set of operations used to manage that information.  The set of operations defines the interface of the ADT.  As long as the ADT fulfills the conditions of the interface, it doesn’t really matter how the ADT is implemented.  Since, in ADT, the data values and operations are defined with mathematical precision, rather than as an implementation in a computer language, we may reason about effects of the operations, relations to other abstract data types whether a program implements the data type etc.  One of the simplest abstract data type is the stack data type for which functions might be provided to create an empty stack, to push values onto a stack and to pop values from a stack.

The basic difference between abstract data type (ADT) and concrete data type is that the latter allow us to look at the concrete representation, whereas the former hide the representation from us.  An ADT may be pure ADT or Updatable ADT.  A pure ADT is one where all operations are pure functions.  This means that operations have no side effects.  In particular, they do not modify or update there input arguments.  They just use these arguments to generate output, which are fresh values of ADT (or of other types).  Most concrete types are pure.  For example, no operation on integers actually modifies an integer.  Instead, all the operations like ‘+’ produce fresh outputs.

An updatable ADT is one where some operations actually change values of the ADT.  For example, suppose we had an operation called ‘pop’ that took a stack as an argument and modified it.  (“in place”, “destructively”) by removing the highest priority item.  This operation would be considered impure and the whole ADT would then be impure also.  An ADT may be user defined ADT.

We know that an Abstract Data Type is a data type that satisfies the following two conditions:

1.  The representation, or definition, of the type and the operations are contained in a single syntactic unit.

2.  The representation of objects of the type is hidden from the program units that use the type, so only direct operations possible on those objects are those provided in the type’s definition.

A user defined Abstract Data Type should provide:

1.  A type definition that allows program units to declare variables of the type, but hides the representation of these variables.

2.  A set of operations for manipulating objects of the type.

An example of a user defined abstract data type is structure.  ‘C’ provides four basic types: int, char, float and double.  However, ‘C’ also provides the programmer with the ability to define his/her own types.  Structure is one such example.  A structure is an aggregate of different parts, where each part is of some existing type.

struct abc

{int x;

float y;

};

The above structure definition does not create any variables, rather it creates a new type.  Variables of this type may be created in a similar way to variables of a built in type.

struct abc a;

The typedef keyword allows us to create new type names for our new types.

For example:

typedef struct abc AB;

where AB is a new type name that can now be used to create new types.

AB b;

Polymorphic Data Types:  A polymorphic data type is a data type, which can be transformed to any distinct data type as required. An item of polymorphic data type is created by the use of a generic pointer.  A generic pointer is simply a byte address without an associated type.  In other words, a generic pointer does not point to an object of a specific type; it just points to some location in the memory of the computer.  In ANSI C, a generic pointer is declared as a void pointer.  It is only when the actual operations must be performed on the data that generic pointers are cast to pointers of specific types and de-referenced.

Data Structures:  The following are the characteristic features of data structures:

1.  It contains component data items, which may be atomic or another data structure (still a domain).

2.  A set of operations on one or more of the component items.

3.  Defines rules as to how components relates to each other and to the structure as a whole (assertions).

A data structure may be static or dynamic.  A static data structure has a fixed size.  This meaning is different from the meaning of static modifier.  Arrays are static; once we define the number of elements it can hold, the number doesn’t change.  A dynamic data structure grows and shrinks at execution time as required by its contents.  A dynamic data structure is implemented using links.

Data structures may further be categorized into linear data structures and non-linear data structures.  In linear data structures every component has a unique predecessor and successor, except first and last elements, whereas in case of non-linear data structures, no such restriction is there as elements may be arranged in any desired fashion restricted by the way we use to represent such types.

Sequential Access vs. Direct Access: In sequential access, to access the n-th element, you must access the preceding (n-1) elements.  In direct (random) access, any element can be accessed without accessing its predecessor or successor.

Linear Data structures include arrays, structures, linked lists, stacks and queues.  Non-linear data structures include trees, binary trees, graphs and digraphs.

Linear Data Structures:

1.  Arrays:  An array is said to be

  • a collection of a fixed number of component values, each of which are of the same type, atomic and structured.
  • a set of index (subscript) values that are non-negative integers (consecutive).
  • there is a 1-1 relationship between such index and an array component (element).  Index 0 selects the first element, 1 the second and so on.

Difference in Array and Structure:

  1. An array is a homogenous structure, whereas structure are heterogenous structures.
  2. A component of an array is accessed by its positions in the array, where a component of a structure is accessed by an identifier (the name).
  3. Because array components are accessed by position, an array is a structured composite type.

An array may be one dimensional or two dimensional.  A one-dimensional array contains elements row-wise, from representation point of view.  It contains only one subscript.  The subscript value begins with zero.  A two dimensional array contains two subscripts, one for row and the other for column.  Thus a two dimensional array may be considered as a tabular representation of elements where elements are arranged in rows and columns.  Two subscripts are used combinedly one for row position and other for column position, to refer to an element in a 2D array.

2.  Linked List:  A linked list is a linear list of elements linked/joined together with links, where link is maintained by keeping a reference/address of the next element.  The address of first element is sustained for traversal or other operations to be performed on the list.  A linked list is a dynamic structure.  It may grow or shrink dynamically.  A linked list may be singly linked list, doubly linked list, circular linked list or doubly circular linked list.  Header lists may also be created that contains a header node.

A singly linked list is unidirectional since elements are added up or removed from left to right.  A doubly linked list is an improved data structure form of singly linked list, since it contains both forward and backward links/reference.  Thus, traversal may easily be done in either of the direction.  A circular linked list is one in which the last node points to the first node, thereby making a loop or circle.  Circular lists are useful in situations where processing is required to be done repetitively in a loop.  A doubly circular linked list is that which maintains forward as well as backward reference and also contains loop by keeping the address of first node in last and last node in first.

All these list categories have their own merits and demerits depending upon the situation in which they are implemented.

3.  Stacks:  A stack ADT is also linear like a list or a queue.  In a stack, items are added and removed from only one end of the stack, called the top of the stack.  It is therefore a LIFO (Last In First Out) data structure.  Examples of stacks are – a stack of plates in a cupboard, a stack of files in the drawer of a table.  Stacks are often drawn vertically.  Some operations that may be performed on stack are:

  • Push: adds an item to the top of the stack
  • Pop: removes an item from the top of the stack
  • Peek(top): retrieves the top item without removing it
  • Empty: returns true if the stack is empty, false otherwise
A stack can be represented by using a singly linked list, it doesn’t matter whether the references point from the top toward the bottom and vice-versa.  A stack may also be represented by an array, but the new item should be placed in the next available place in the array rather than at the end of the array.
4.  Queues:  A queue is similar to a list but adds items only to the rear of the list and removes them only from the front.  It is called a FIFO data structure -First In First Out.  An example of a queue is a line of people at a ticket window.  We can define the following operations for a queue:
  • enqueue: adds an item to the rear of the queue
  • dequeue (or serve): removes an item from the front of the queue
  • empty: returns true if the queue is empty, false otherwise

Queues often are helpful in simulations or any situation in which items get backed up while awaiting processing.  A queue can be represented by an array or using a linked list.   A linked list representation is more efficient if the references point from the front toward the rear of the queue.
Non-Linear Data Structures:

1.  Trees:  A tree is a non-linear data structure that consists of a root node and potentially many levels of additional nodes that form a hierarchy.  Nodes that have no children are called leaf nodes.

(i)  Binary Trees:  A binary tree is defined recursively.  Either it is empty (the base case) or it consists of a root and two subtrees, each of which is a binary tree.  Binary tree and trees are represented using references as dynamic links, though it is possible to used fixed representation like arrays.

(ii)  Binary Search Trees:  A Binary Search Tree is a binary tree that satisfies the following conditions:

  • All values less than the key value (root) form a part of left subtree.
  • All values greater than the key value (root) form a part of right subtree.
  • The left and right subtrees are also binary trees.

(iii)  Heap:  Heap is a complete binary tree with n nodes such that:

  • the root node contains the largest element
  • both the left and the right sub-trees of the root are heaps once again

2.  Graphs:  A graph is a non-linear data structure.  Unlike a tree or binary tree, a graph does not have a root.  Any node in a graph can be connected to any other node by an edge.  In a directed graph or digraph, each edge has a specific direction.  Edges with direction sometimes are called arcs.

Both graphs and digraphs can be represented using dynamic links or using arrays.  As always, the representation should facilitate the intended operations and make them convenient to implement.

3.  Hash Tables:  A hash table is used to represent the hash values that are generated using a hash function.  Generally, the size of the hash table is determined in terms of the data elements it is going to sustain in itself.

Common operations on Data Structures:

  1. Traversal
  2. Searching
  3. Insertion
  4. Deletion
  5. Sorting
  6. Merging
1.  Traversal:  It refers to reading the elements represented by a given data structure.  The traversal in case of a singly linked list is from the first node until NULL is found, since it is a unidirectional list.  Reversal traversal is achieved easily in a doubly linked list, since backward references are maintained.  In a circular linked list, traversal can begin from any node that is a part of the list.  In case of tree, traversal can be either in inorder, preorder or postorder form.  Similarly, in graph each vertex node may be visited when the graph is traversed.  Thus, simply stating, visiting each element in a given data structure is termed as traversal operation.
2.  Searching:  Searching is the process of locating a user defined value from a given list of values.  Search operation may be performed on elements that are arranged/sorted as well as on unsorted elements.  The search operation is considered efficient if it is performed on sorted elements.
3.  Insertion:  Insertion refers to inserting a new element in a given data structure.  Insertion in an array requires subscript identification where this new element is to be stored.  In case of linked list, insertion may be performed at beginning, end, before/after/at element, before/after/at position.  Similar cases are applicable for other categories of linked list.  In case of trees, insertion may be either as root, left child or right child depending upon the conditions given.  In graphs, insertion of a node requires setting edges appropriately such that the new element inserted becomes a part of the graph.
4.  Deletion:  Deletion is the process of removing an element from a given data structure.  The basic procedure followed for deletion is: locate the element in the given data structure you wish to delete.  If the element is found, then assign its address in some other pointer of the same type.  Make appropriate pointer adjustments to remove all links that connect this node in a given data structure.  Then free the memory allocated to this element.  Freeing the memory explicitly helps in better memory management, since memory issues are resolved and used memory when freed is again made available for use by deletion process.
5.  Sorting:  Sorting refers to a proper arrangement of data elements in the given data structure.  Arrangement of elements may be done either in ascending order or descending order.  Sorting sometimes may prove inefficient, when it involves swapping many items simultaneously for a wide range to input.  Complexity of a search algorithm and its efficiency may be computed and checked as to which sorting algorithm out of the available sorting techniques such as bubble sort, merge sort, quick sort etc. is efficient under what set of given conditions.  Detailed sorting techniques will be described later as a separate item.
6.  Merging:  Merging is the process of combining two given data structures say linked list or arrays such that an appropriate combination is formed by the merging process.  If merging is performed such that the resultant data structure is in sorted form, then two linked lists are merged such that the linked so obtained after merging is the sorted linked list.  However merging may be done in many other ways as appropriate under given set of conditions.