Comparing and making informed decisions#

Content#

Recap of abstract data types introduced
Comparison of implementations
The Java Collections API: The good, the bad and the ugly
Determining the best abstract data type and implementation for a task

Disclaimers#

This session may be heavy in places but will give you some useful insights
Points regarding memory are not absolute due to JVMs (Java Virtual Machines) handling memory differently. Other programming languages may also behave differently.

A note on Typesetting#

Any text in monospaced font explicitly denotes either a Java class / interface (e.g. ArrayList / List) or source code.
There are a number of times these words are used in a more general sense and are not specific to Java.

Abstract Data Types#

List
Set
Map / Dictionary
Graph
Queue
Priority Queue
Double-ended queue
Stack

Some other Abstract Data Types#

Multiset / Bag
Same as a Set, but multiple equal values allowed
Multimap
Same as a Map, but a key can have multiple values
Double-ended priority queue
Double-ended queue, based on a priority

Which data structure is suitable#

Task:
Check a user’s login session is active
Information available:
A user’s session ID and a collection of active session IDs

Question:
What data structure is best suited to the task?

Task:
Retrieve a user’s details by their username
Information available:
A user’s username

Question:
What data structure is best suited to the task?

List - Expected Behaviour#

Duplicates allowed: Yes
Order: Determined by the insertion point
Actions:
- append / prepend to the list
- retrieve from a list
- empty check
Use cases:
- Many, when we want a collection of objects with duplicates and dynamic sizing
- e.g. Storing adjacent nodes in a graph

Set - Expected Behaviour#

Duplicate allowed: No
Order: Depends on implementation. In a standard set, order is not guaranteed.
Actions:
- add value to set
- contains value
- check if empty
Example use cases:
- Retaining visited objects in a depth/breadth-first search of a graph
- Any task where we want a collection but no duplicates

Map - Expected Behaviour#

Duplicate keys allowed: No
Duplicate values allowed: Yes
Order: Depends on implementation
Actions:
- add mapping (key → value)
- retrieve value by key
- empty check
Example use case:
- Lookup tables (mostly)

Graph - Expected Behaviour#

Duplicate allowed: Each ‘vertex’ is unique
Order: Depends on the traversal algorithm
Actions:
- add / remove vertices
- add / remove edges
- get vertex value
- update vertex value
- get vertex neighbours
- adjacency between vertices
Example use cases:
- Social network friendship connections

Queue - Expected Behaviour#

Duplicates allowed: Yes
Order: First In First Out (FIFO)
Actions:
- add to the back of the queue (enqueue)
- remove from the front of the queue (dequeue)
- peek at the front of the queue
- check if empty
Example use cases:
- Process transactions
- Breadth-first search of a graph

Priority Queue - Expected Behaviour#

Duplicates allowed: Yes
Order: Best In First Out (BIFO)
Actions:
- add to the queue with priority
- remove highest-priority from the queue
- peek at the highest-priority element in the queue
- check if empty
Example use cases:
- Process transactions based on priority
- Pathfinding (Dijkstra, A*)

Stack - Expected Behaviour#

Duplicates allowed: Yes
Order: Last In First out (LIFO)
Actions:
- add to the top of the stack (push)
- remove from the top of the stack (pop)
- peek at the top of the stack
- check if empty
Example use cases:
- Undo/redo functionality/Depth-first search of a graph
- Iterative in-order traversal of a binary search tree

List implementations#

There are two common list implementations:
- Array based (ArrayList and Vector)
- Linked list (LinkedList)
Which is best?
- Array-based implementations offer O(1) retrieval for access of any element in the list. However, the array must be resized when full O(n).
- Linked lists offer O(1) time complexity for appending/prepending and retrieval for the head and tail and resizing is not needed.

List implementations Time complexity big-O#

Operation	Array list	Linked list
append	O(1) / O(n) if resize needed	O(1)
prepend	O(n)	O(1)
get first	O(1)	O(1)
get last	O(1)	O(1)
get by index	O(1)	O(n)
remove first	O(n)	O(1)
remove last	O(1)	O(1)
remove by index	O(n)	O(n)
size	O(1)	O(1)

Arrays and Memory Random Access#

Arrays are random access(direct access) structures
This makes arrays O(1) for accessing any element by its index
How?
- An array is allocated: size × data type size memory
- We know the start location of the array
- So, jumping to the correct memory location is determined by: start location + (index × data type size)
- This is constant O(1), i.e., not depends on n

Arrays and Memory#

Arrays are usually a contiguous block of memory for storing values (references in the case of reference types)
A lot of space can be wasted, due to needing free space to allow the insertion of new elements
Object references in Java require 32 bits (for 32-bit systems) or 64 bits (for 64-bit systems).
The following array:

1
Object[] array = new Object[1000];

Requires 64 ×1,000 bits of contiguous memory allocation = 640,000 bits (80,000 bytes) even if the list is empty

Not drastic, but still a waste of memory (1,000 is a very small list in terms of real systems)

Arrays and Memory Crude Example#

Consider the following:

1
public class Obj {
2
    private double a, b;
3
}
4
public class Main {
5
    public static void main(String[] args) {
6
        Obj[] arr = new Obj[20];
7
        for (int i = 0; i < 4; i++) {
8
            arr[i] = new Obj();
9
        }
10
    }
11
}

Each block in this example is 8 bytes (64 bits)
- Grey → Unused array space (but space still allocated)
- Green → Used
- White → Unused

Objects require 8 bytes ‘housekeeping’ then space for references/ primitives (in this case object Obj has two double primitives, each 8 bytes)

Arrays and Memory Pros and Cons#

Although space is ‘wasted’, arrays benefit from modern CPU caching and pre-fetching during iteration as the memory is in one contiguous block
The primary issue with arrays is finding contiguous memory sufficient for storing the array.
Imagine a list intended to cater for 1 billion objects = 8 bytes * 1 billion = 8 GB of contiguous memory and this is just to store the object references!

Consider that the above array was resized when attempting to add the ½ billion to 1 item (original array size ½ billion). Until garbage collection occurs, 4GB for the original array + 8GB for the new array is required
Moral: big data requires lots of memory and arrays ‘struggle’ with a large amount of data due to their contiguous data and the need to resize in order not to waste memory
The max size of an array-based list is Integer.MAX_VALUE → 2,147,483,647
Some of the problems above could be worked around by linking multiple arrays, but this introduces overheads

List iteration#

1
public static void displayList(List<Object> list) {
2
    for (int i = 0; i < list.size(); i++){
3
        System.out.println(list.get(i));
4
    }
5
}
6
List iteration
7
List<Object> arrayList = new ArrayList<>();
8
List<Object> linkedList = new LinkedList<>();
9
…
10
displayList(arrayList);  // Time complexity for
11
displayList(linkedList); // these two calls?

Array list iterator implementation#

1
class ArrayListIterator<E> implements Iterator<E> {
2
    int index = 0;
3
    @Override
4
    public boolean hasNext() {
5
        return index < array.length;
6
    }
7
    @Override
8
    public E next() {
9
        return array[index++];
10
    }
11
};

Linked list iterator implementation#

1
class Node<E> {
2
    Node<E> next;
3
    E data;
4
}

1
class LinkedListIterator<E> implements Iterator<E> {
2
    Node<E> current = head;
3
    @Override
4
    public boolean hasNext() {
5
        return current != null;
6
    }
7
    @Override
8
    public E next() {
9
        E result = current.data;
10
        current = current.next;
11
        return result;
12
    }
13
};

Enhanced for loop Syntactical sugar#

1
for (Object obj : iterable) {
2
    System.out.println(obj);
3
}

The enhanced for loop above is just syntactical sugar for:

⬇️

1
for (Iterator<Object> iter = iterable.iterator(); iter.hasNext(); /** incrementer not needed **/) {
2
    System.out.println(iter.next());
3
}

List iteration - Take Two#

1
public static void displayList(List<Object> list) {
2
    for (Object obj : list) {
3
        System.out.println(obj);
4
    }
5
}
6
List<Object> arrayList = new ArrayList<>();
7
List<Object> linkedList = new LinkedList<>();
8
…
9
displayList(arrayList);   // Time complexity for
10
displayList(linkedList);  // these two calls?

Sets#

Question:
How do you retrieve a specific element from a set?

Answer: That’s a silly question.

The only way you could retrieve a specific element from a set is by having the element to begin with.

Sets - What are they good for?#

Checking the presence of an element within a set, e.g. whether a vertex has already been visited in a depth-first / breadth-first search of a Graph
Faster than a list, O(1) hash set or O(log n) tree set
Avoiding duplicates
Checking whether a subset exists in another set
Obtaining intersections/differences (e.g. Venn diagram)
Efficiently maintaining non-index order (sorted sets such as tree set [i.e. binary search tree])

Set implementations Time complexity big-O#

Operation	Hash set	Tree set
add	O(1) / O(n) if resize needed	O(log n)
remove	O(1)	O(log n)
contains	O(1)	O(log n)
size	O(1)	O(1)

Hash set wins hands-down for the basic operations, but the order is chaotic.
Whenever an order is required, a tree set should be used

Tree Sets#

You have been taught that tree set implementations (i.e. binary search trees) offer O(log n) time complexity
On average this is the case for a self-balancing binary search tree, but the worst-case time complexity is actually O(h), where h is the height of the tree.
A balanced tree with 1024 to 2047 (inclusive) items will have a height of 10.

$2^{10} = 1024 \hspace{2em} 2^{11} - 1 = 2047$ $\log_{2}{1024} = 10 \hspace{2em} \log_{2}{2047} = 10.999 ~ \text{(less than 11)}$

Hence, O(log n) as log n == h, if balanced

If you are struggling with logarithms, think about it this way:

level 0 = 1 node (root)
level 1 = 2 * previous level nodes = 2 nodes
level 2 = 2 * previous level nodes = 4 nodes
level 3 = 2 * previous level nodes = 8 nodes
level 4 = 2 * previous level nodes = 16 nodes
……………
level 9 = 2 * previous level nodes = 512 nodes
level 10 = 2 * previous level nodes = 1024 nodes
[height == 10]

1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 + 512 + 1024 = 2,047

If a balanced tree were to contain double the number of elements (i.e. 2 × 2047 = 4094 elements), the height of the tree would only increase by 1.

⬇️ This means that despite the number of elements doubling, the number of iterations required (on average) to search for an element in the tree only increases by 1. ⬇️ The benefit of tree structures and logarithmic algorithms only becomes truly apparently as n moves towards a larger number

Question:
For a balanced binary search tree containing 1 billion elements, at most how many iterations would be required (on average) to find any element in the tree?

Answer:
$\log_{2}{1,000,000,000} = 29.897 \approx 29$ iterations (floored result above)

Comparison to a list:
O(n) = 1,000,000,000 iterations

~ 34.483 million times more than the balanced binary search tree

Digression: Calculate logarithms#

1
int log2Est(int n) {
2
    return n > 1 ? (1 + log2Est(n/2)) : 0;
3
}

1
int log2Est(int n) {
2
    assert n > 0;
3
    int y = 0;
4
    while (n > 1) {n = n/2; y++;}
5
    return y;
6
}

the number of times you can divide it by 2 before you go below 1

Map implementations Time complexity big-O#

Operation	Hash map	Tree map
put	O(1) / O(n) if resize needed	O(log n)
get	O(1)	O(log n)
remove	O(1)	O(log n)
contains	O(1)	O(log n)
size	O(1)	O(1)

In reality, a Map is just a Set of keys that map to a Collection of Objects.
In the Java API, Maps are used to implement the Sets…

JAVA SET Implementation, It’s just a MAP…#

e.g. Inside HashSet<E>:

1
private transient HashMap<E,Object> map;
2
// Dummy value to associate with an Object in the
3
// backing Map
4
private static final Object PRESENT = new Object();
5
public boolean add(E e) {
6
    return map.put(e, PRESENT)==null;
7
}

Java Collections API#

The Java Collections API provides the following implementations of common abstract data types (some omitted)
- List → ArrayList, LinkedList and Vector
- Set → HashSet, LinkedHashSet and TreeSet
- Map → HashMap, LinkedHashMap and TreeMap
- Graph → You wish
- Queue → LinkedList, ArrayDeque
- Priority queue → PriorityQueue
- Double-ended queue → LinkedList, ArrayDeque
- Stack → Stack, ArrayDeque, LinkedList

Java Collections API Issues#

Design decisions and changes to the Java language have resulted in some notable issues with the Collections API:
- No Stack interface. There is a class Stack, which extends Vector, an old ‘thread safe’ list implementation, which introduces overhead.
- HashSet uses HashMap internally. This creates extra bloat, since although there is just a value, a key value pair is stored (2 $\times$ object references)
- TreeSet uses TreeMap internally. Less efficient, as per the HashSet problem above
- Legacy ‘Object’ methods, where generics should be used. Introduces type safety issues.

As a developer you should be aware that the Collections API is designed to be flexible. This has efficiency and performance consequences.
Use of generic types or Object means no direct storing of primitive types.
Consider an ArrayList for storing integers (the problem applies to all of the Collections API implementations…)
- ArrayList’s underlying array is: Object[] elementData; On a 64-bit system, this means each element requires 8 bytes.
- Additionally the wrapper class Integer incurs the overhead of being a reference type (16 bytes).
- A primitive int only requires 4 bytes.

Comparison
Object[]	int[]
8 bytes × array length + 16 bytes per `Integer` wrapper	4 bytes × array length, stored directly in the array
100 elements in a full array = 800 bytes + 1,600 bytes Total = 2,400 bytes	100 elements in a full array Total = 400 bytes

As a result, a full ArrayList of Integer elements uses 6 times more memory than an efficient implementation designed explicitly for storing int primitives.(This is ignoring the additional overhead of the static attributes in the Integer class and auto-boxing / unboxing).

Java Collections API Conclusion#

The Collections API is great if you want to a straight-out-of-the-box implementation of your favorite abstract data type
However when time and space complexity is of utmost importance, you should consider a custom implementation
For the most part you will be fine using the Collections API and will benefit from the range of utilities available and interoperability with other libraries

Which data structure is suitable#

Task:
Check a user’s login session is active
Information available:
A user’s session ID and a collection of active session IDs

Question:
What data structure is best suited to the task?

Example List implementation:

1
boolean isSessionActive(List<String> active, String user) {
2
    boolean found = false;
3
    Iterator<String> iter = active.iterator();
4
    while (!found && iter.hasNext()) {
5
        found = iter.next().equals(user);
6
    }
7
    return found;
8
}

Example Set implementation:

1
boolean isSessionActive(Set<String> active, String user) {
2
    return active.contains(user);
3
}

Conclusion:
- Both are equally functional, but a List would offer O(n) regardless of implementation, whereas a hash-table based Set implementation would offer O(1).
- Note: A SortedSet implementation such as TreeSet would offer O(log n).

Task:
Retrieve a user’s details by their username
Information available:
A user’s username

Question:
What data structure is best suited to the task?

List?
- Maybe, but would require an O(n) search of the list and check that the User object’s username matched what we were looking for (also assumes the username is stored in this object)
Set?
- No. The Set would contain the User objects, so to get it we would need to have the User object to begin with…
Queue? Stack?
- No. Sure if the interface allows us to do a linear search, but we’ve established this is O(n)
Map?
- A mapping of the username → User object would work a treat!
- If we don’t care about order (or rarely care) then a Hash map implementation is the way to go O(1)

Sometimes the right data structure isn’t immediately apparently.
We should first program to an interface (e.g. Set / List) and then change the implementation to suit the use the case.
For example:
1. If I want to store a key value pair, a Map is the obvious choice. It’s specifically designed for this use. Then
2. But should I use a Tree map or a Hash map?

In most cases the answer will be a Hash map. It’s quick with O(1) time complexity on most operations.
If however, the contents of the Map rarely change and we frequently want to display the keys in order, then the Tree map is a better choice.
Hash maps also waste memory, since the underlying implementation is an array. Java’s HashMap (using default load factor 0.75) always wastes at least 0.25 * table size * reference size(8 bytes) bytes

The implications of a poor decision#

Consider the algorithm for a breadth-first traversal of a graph:

1
class Node {
2
    List<Node> adjacent;
3
    String value;
4
}

Consider the algorithm for a breadth-first traversal of a graph:

1
void bft(Node start) {
2
    List<Node> nodes = new ArrayList<>();
3
    List<Node> visited = new ArrayList<>();
4
    nodes.add(start);
5
    while (!nodes.isEmpty()) {
6
        Node current = nodes.remove(0);
7
        if (!visited.contains(current))
8
            System.out.println(current.value);
9
        visited.add(current);
10
        for (Node n : current.adjacent) {
11
            if (!visited.contains(n))
12
                nodes.add(n);
13
        }
14
    }
15
}

1
void bft(Node start) {
2
    List<Node> nodes = new ArrayList<>();
3
    List<Node> visited = new ArrayList<>();
4
    nodes.add(start);
5
    while (!nodes.isEmpty()) { // isEmpty O(1)
6
        Node current = nodes.remove(0); // remove O(n)
7
        if (!visited.contains(current)) // contains O(n)
8
            System.out.println(current.value);
9
        visited.add(current); // add O(1)
10
        for (Node n : current.adjacent) {
11
            if (!visited.contains(n)) // contains O(n)
12
                nodes.add(n); // add O(1)
13
        }
14
    }
15
}

The implications of a better decision#

1
void bft(Node start) {
2
    Queue<Node> nodes = new LinkedList<>();
3
    Set<Node> visited = new HashSet<>();
4
    nodes.add(start);
5
    while (!nodes.isEmpty()) { // isEmpty O(1)
6
        Node current = nodes.remove(0); // remove O(1)
7
        if (!visited.contains(current)) // contains O(n)
8
            System.out.println(current.value);
9
        visited.add(current); // add O(1)
10
        for (Node n : current.adjacent) {
11
            if (!visited.contains(n)) // contains O(1)
12
                nodes.add(n); // add O(1)
13
        }
14
    }
15
}

音乐

音乐

Data Structures and Algorithms Week10: Comparing and making informed decisions

Comparing and making informed decisions#

Content#

Disclaimers#

A note on Typesetting#

Abstract Data Types#

Some other Abstract Data Types#

Which data structure is suitable#

List - Expected Behaviour#

Set - Expected Behaviour#

Map - Expected Behaviour#

Graph - Expected Behaviour#

Queue - Expected Behaviour#

Priority Queue - Expected Behaviour#

Stack - Expected Behaviour#

List implementations#

List implementations Time complexity big-O#

Arrays and Memory Random Access#

Arrays and Memory#

Arrays and Memory Crude Example#

Arrays and Memory Pros and Cons#

List iteration#

Array list iterator implementation#

Linked list iterator implementation#

Enhanced for loop Syntactical sugar#

List iteration - Take Two#

Sets#

Sets - What are they good for?#

Set implementations Time complexity big-O#

Tree Sets#

Digression: Calculate logarithms#

Map implementations Time complexity big-O#

JAVA SET Implementation, It’s just a MAP…#

Java Collections API#

Java Collections API Issues#

Java Collections API Conclusion#

Which data structure is suitable#

The implications of a poor decision#

The implications of a better decision#

支持与分享

评论区

音乐

文章目录