My personal notes on the book Algorithms, 4th edition, by Sedgewick & Wayne.
1. Fundamentals
1.2 Data abstraction
- Data type: a set of values and a set of operations on these values.
- Abstract data type (ADT): a data type whose representation is hidden from the client.
1.3 Bags, queues and stacks
- Many data types involve a collection of objects, and the operations revolve around adding, removing or examining objects in the collection.
- APIs: each data type is shown below in Java:
Bag
- A collection where removing items is not supported.
- It has the ability to collect items and to iterate through them (the order is unspecified).
public class Bag<Item> implements Iterable<Item> {
Bag()
void add(Item item)
boolean isEmpty()
int size()
}
FIFO queue
- A collection based on the first-in-first-out policy (FIFO).
- The relative order of items is preserved.
public class Queue<Item> implements Iterable<Item> {
Queue()
void enqueue(Item item)
Item dequeue()
boolean isEmpty()
int size()
}
LIFO stack
- A collection based on the last-in-first-out policy (LIFO).
- When a client iterates the items in a stack, they’re processed in the reverse order in which they were added.
public class Stack<Item> implements Iterable<Item> {
Stack()
void push(Item item)
Item pop()
boolean isEmpty()
int size()
}
Linked lists
- A fundamental data structure that’s an appropriate choice for representing the data in the implementation of a collection’s ADT.
“A linked list is a recursive data structure that’s either empty (null) or a reference to a node having a generic item and a reference to a linked list.”
- A linked list represents a sequence of items; a sequence of items can also be represented by an array.
- Basic operations (a sketch follows this list):
- Remove from the beginning.
- Insert at the beginning.
- Insert at the end.
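- A minimal sketch of these operations, assuming the book's nested-node style (the class and method names here are illustrative, not the book's):
public class LinkedListDemo<Item> {
    // Nested node class: an item plus a reference to the next node.
    private class Node {
        Item item;
        Node next;
    }

    private Node first; // first node of the list (null if empty)

    // Insert at the beginning: the new node becomes the first node.
    public void insertAtBeginning(Item item) {
        Node oldFirst = first;
        first = new Node();
        first.item = item;
        first.next = oldFirst;
    }

    // Remove from the beginning: advance first past the old first node.
    public Item removeFromBeginning() {
        Item item = first.item; // assumes the list is non-empty
        first = first.next;
        return item;
    }

    // Insert at the end: walk to the last node and append.
    public void insertAtEnd(Item item) {
        Node node = new Node();
        node.item = item;
        if (first == null) { first = node; return; }
        Node current = first;
        while (current.next != null) current = current.next;
        current.next = node;
    }
}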
Data structures
- Collections can be represented in two ways:
- Sequential allocation (arrays).
- Pros: index provides immediate access;
- Cons: size is fixed.
- Linked allocation (linked lists).
- Pros: Uses space proportional to size;
- Cons: needs reference to access an item.
1.4 Analysis of algorithms
- Mathematical analysis is applied to develop concise models of costs and to do experimental studies to validate these models. This process is based on the scientific method.
- The experiments we design must be reproducible, and the hypotheses must be falsifiable.
- Observations
- How to make quantitative measurements of the running time of our program? Run it!
- Problem size: the difficulty of the computational task (generally the size of the input). The running time usually increases with problem size, but the real question is by how much it increases.
- How long will my program take, as a function of the input size? Just plot the data: $$f(input) = time$$, $$f(250) = 0.00$$, $$f(500) = 0.00$$, $$f(1000) = 0.1$$, $$f(2000) = 0.8$$, $$f(4000) = 6.4$$, $$f(8000) = 51.1$$, …
- We can also plot the function as a log-log plot.
- A straight line in a log-log plot is equivalent to the hypothesis that the data fits the equation $$T(n) = a*n^b$$. This is known as a power law.
- Mathematical models
- Knuth says the total running time of a program is determined by two primary factors:
- The cost of executing each statement.
- The frequency of execution of each statement.
- The frequency of execution must be analyzed in each program.
- The analysis of this frequency can lead to complicated and lengthy mathematical expressions:
$$n(n-1)(n-2)/6 = n^3/6 - n^2/2 + n/3$$
- The terms after the leading term $$\frac{n^3}{6}$$ are relatively small, and to simplify the formula, we ignore them by using tilde approximations (~).
- Tilde approximation: we write $$\sim f(n)$$ to represent any function that, when divided by $$f(n)$$, approaches 1 as $$n$$ grows, and we write $$g(n) \sim f(n)$$ to indicate that $$g(n)/f(n)$$ approaches 1 as $$n$$ grows.
- Example: $$n^2/2 - n/2 \sim n^2/2$$, which has order of growth $$n^2$$.
- Order of growth
- 1: constant
- n: linear
- n^2: quadratic
- 2^n: exponential
- lg n: logarithmic
- n*lg n: linearithmic
- n^3: cubic
- PS: $$\lg$$ denotes $$\log_2$$.
- Approximate running time: only the instructions that are executed most frequently (the “inner loop”) play a role in the final total.
- Mathematical model of running time
- Develop an input model;
- Identify the inner loop;
- Define a cost model that includes operations in the inner loop;
- Determine the frequency of execution of those operations.
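- A worked instance of these steps, based on the book's ThreeSum example (input model: an array of n ints; inner loop: the if statement; cost model: executions of that statement):
public class ThreeSum {
    // Counts the triples that sum to exactly 0.
    // The if statement in the inner loop executes
    // n(n-1)(n-2)/6 ~ n^3/6 times, so the order of growth
    // of the running time is cubic.
    public static int count(int[] a) {
        int n = a.length;
        int count = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = j + 1; k < n; k++)
                    if (a[i] + a[j] + a[k] == 0) // inner loop
                        count++;
        return count;
    }
}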
- Order of growth classifications
- Constant: the program executes a fixed number of operations, no matter the size of n.
- Logarithmic: a program whose running time’s order of growth is logarithmic, barely slower than a constant-time program.
- Linear: the running time is proportional to n (one loop).
- Linearithmic: order of growth n*log n.
- Quadratic: order of growth n^2 (two nested loops).
- Cubic: order of growth n^3 (three nested loops).
- Exponential: running time proportional to 2^n or higher.
2. Sorting
2.1 Elementary sorts
- Sorting cost model: when studying algorithms, we count compares and exchanges. For algorithms that do not use exchanges, we count array accesses.
- Selection sort: one of the simplest sorting algorithms. First, find the smallest item in the array and exchange it with the first entry. Then, find the next smallest item and exchange it with the second entry. Continue in this way until the entire array is sorted.
- Signature properties: running time is insensitive to input and data movement is minimal.
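- A sketch of selection sort in the book's style (using compareTo directly instead of the book's less()/exch() helpers):
public class Selection {
    // Selection sort: find the smallest remaining item and exchange
    // it with the first unsorted entry. Uses ~N^2/2 compares and
    // exactly N exchanges, regardless of the input order.
    public static void sort(Comparable[] a) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            int min = i; // index of the smallest entry in a[i..n-1]
            for (int j = i + 1; j < n; j++)
                if (a[j].compareTo(a[min]) < 0) min = j;
            Comparable tmp = a[i]; a[i] = a[min]; a[min] = tmp;
        }
    }
}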
- Insertion sort:
- Larger items are moved one position to the right to make room for inserting the current item before them.
- The running time depends on the initial order of the input items. The worst case is $$\sim N^2/2$$ compares and exchanges; the best case is $$N-1$$ compares and 0 exchanges.
- It works well for certain types of non-random arrays that often arise in practice.
- It’s efficient for sorting partially sorted arrays, where the number of inversions is low (few pairs of entries are out of order).
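- A sketch of insertion sort in the same style:
public class Insertion {
    // Insertion sort: insert a[i] into the sorted subarray a[0..i-1]
    // by exchanging it with larger entries one position at a time.
    public static void sort(Comparable[] a) {
        int n = a.length;
        for (int i = 1; i < n; i++) {
            for (int j = i; j > 0 && a[j].compareTo(a[j - 1]) < 0; j--) {
                Comparable tmp = a[j]; a[j] = a[j - 1]; a[j - 1] = tmp;
            }
        }
    }
}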
- Comparing two sorting algorithms: use the scientific method:
- Implement and debug the algorithms.
- Analyze their basic properties.
- Formulate a hypothesis about comparative performance.
- Run experiments to validate the hypothesis.
- Shellsort
- It’s an extension of insertion sort that gains speed by exchanging entries that are far apart, producing partially sorted arrays that can be sorted efficiently.
- It creates an h-sorted array, which consists of many interleaved sorted subsequences.
- No increment sequence is provably ideal; the $$3x+1$$ sequence (1, 4, 13, 40, …) works well in practice (used in the sketch below).
- Shellsort has acceptable running time even for large arrays, requires a small amount of code and uses no extra space.
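- A sketch of shellsort with the 3x+1 increment sequence:
public class Shell {
    // Shellsort: h-sort the array for a decreasing sequence of
    // increments h, ending with h = 1 (plain insertion sort).
    public static void sort(Comparable[] a) {
        int n = a.length;
        int h = 1;
        while (h < n / 3) h = 3 * h + 1; // 1, 4, 13, 40, 121, ...
        while (h >= 1) {
            // h-sort: insertion sort with stride h.
            for (int i = h; i < n; i++) {
                for (int j = i; j >= h && a[j].compareTo(a[j - h]) < 0; j -= h) {
                    Comparable tmp = a[j]; a[j] = a[j - h]; a[j - h] = tmp;
                }
            }
            h /= 3;
        }
    }
}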
2.2 Mergesort
- The algorithms in this section are based on a simple operation known as merging: combining two ordered arrays to make one larger ordered array.
- Mergesort: to sort an array, divide it in two halves, sort the two halves (recursively), and then merge the results.
- Its running time is proportional to $$N \log N$$ to sort any array of size $$N$$.
- Abstract in-place merge: merges the two adjacent ordered subarrays $$a[lo..mid]$$ and $$a[mid+1..hi]$$ back into $$a$$, using an auxiliary array.
- Top-down mergesort
- Mergesort is a prototype of the divide-and-conquer paradigm.
- Running time can be improved with some modifications to the implementation.
- Use insertion sort for small arrays (subarrays).
- Test whether the array is already in order.
- Eliminate the copy to the auxiliary array.
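- A sketch of the basic top-down version (without the improvements listed above):
public class Merge {
    private static Comparable[] aux; // auxiliary array for merges

    public static void sort(Comparable[] a) {
        aux = new Comparable[a.length];
        sort(a, 0, a.length - 1);
    }

    // Divide a[lo..hi] in half, sort the halves recursively, merge.
    private static void sort(Comparable[] a, int lo, int hi) {
        if (hi <= lo) return;
        int mid = lo + (hi - lo) / 2;
        sort(a, lo, mid);
        sort(a, mid + 1, hi);
        merge(a, lo, mid, hi);
    }

    // Abstract in-place merge: merge a[lo..mid] with a[mid+1..hi],
    // copying both subarrays to aux[] so the result lands back in a[].
    private static void merge(Comparable[] a, int lo, int mid, int hi) {
        for (int k = lo; k <= hi; k++) aux[k] = a[k];
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if      (i > mid)                      a[k] = aux[j++];
            else if (j > hi)                       a[k] = aux[i++];
            else if (aux[j].compareTo(aux[i]) < 0) a[k] = aux[j++];
            else                                   a[k] = aux[i++];
        }
    }
}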
- Bottom-up mergesort
- Another way to implement mergesort is to organize the merges so that we first merge all the tiny subarrays in one pass, then do a second pass to merge those subarrays in pairs, and so forth, continuing until we do a merge that encompasses the whole array.
- It’s the method of choice for sorting data organized in a linked list. It rearranges the links to sort in place (without creating new nodes).
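- A sketch of the bottom-up variant, written as an extra method for the Merge class above (it reuses aux and merge()):
// Bottom-up mergesort: pass over the array doing merges of
// subarrays of size 1, 2, 4, ... until one merge spans the array.
public static void sortBottomUp(Comparable[] a) {
    int n = a.length;
    aux = new Comparable[n];
    for (int len = 1; len < n; len *= 2)              // subarray size
        for (int lo = 0; lo < n - len; lo += 2 * len) // subarray start
            merge(a, lo, lo + len - 1, Math.min(lo + 2 * len - 1, n - 1));
}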
2.3 Quicksort
- Easy to implement, works for a variety of different kinds of input data and is fast.
- It sorts in place and on average requires time proportional to $$N \log N$$ to sort an array of length $$N$$.
- The algorithm: a divide-and-conquer method for sorting. It partitions an array and then sorts the two subarrays recursively. The partition function puts the partitioning item into its final position, with no larger entry to its left and no smaller entry to its right (see the sketch below).
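- A sketch of the basic recursion and the standard partition on the first entry:
public class Quick {
    public static void sort(Comparable[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(Comparable[] a, int lo, int hi) {
        if (hi <= lo) return;
        int j = partition(a, lo, hi); // a[j] is in its final position
        sort(a, lo, j - 1);
        sort(a, j + 1, hi);
    }

    // Partition on v = a[lo]: scan i from the left for an entry >= v
    // and j from the right for an entry <= v, exchanging them, until
    // the scans cross; then put v into its final position j.
    private static int partition(Comparable[] a, int lo, int hi) {
        Comparable v = a[lo];
        int i = lo, j = hi + 1;
        while (true) {
            while (a[++i].compareTo(v) < 0) if (i == hi) break;
            while (v.compareTo(a[--j]) < 0) if (j == lo) break;
            if (i >= j) break;
            Comparable tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
        Comparable tmp = a[lo]; a[lo] = a[j]; a[j] = tmp;
        return j;
    }
}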
2.4 Priority queues
- They are useful when the collection of items does not need to be fully sorted to be processed.
- It must support two operations: remove the maximum and insert.
- Insert: add the new key at the end of the array, increment the size of the heap, and then apply “swim”.
- Remove the maximum: take the largest key off the top, put the item from the end of the heap at the top, decrement the size of the heap, and then apply “sink”.
- Using a binary heap allows us to perform these operations in $$\log N$$ time.
- A binary tree is heap-ordered if the key in each node is larger than or equal to the keys in that node’s two children (if any).
- A binary heap is a collection of keys arranged in a complete heap-ordered tree, represented in level order in an array (not using the first entry).
- Reheapify: restores the heap order when a key is inserted or removed.
- Swim: bring a large key to the top.
- Sink: bring a small key to the bottom.
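- A sketch of a max-oriented binary-heap priority queue putting these pieces together (1-based array, with pq[0] unused, as in the book's representation):
public class MaxPQ<Key extends Comparable<Key>> {
    private Key[] pq;  // heap-ordered complete binary tree
    private int n = 0; // keys live in pq[1..n]

    @SuppressWarnings("unchecked")
    public MaxPQ(int capacity) {
        pq = (Key[]) new Comparable[capacity + 1];
    }

    // Insert: add the key at the end of the heap, then swim it up.
    public void insert(Key v) {
        pq[++n] = v;
        swim(n);
    }

    // Remove the maximum: take the root, move the last key to the
    // top, shrink the heap, then sink the new root down.
    public Key delMax() {
        Key max = pq[1];
        exch(1, n--);
        pq[n + 1] = null; // avoid loitering
        sink(1);
        return max;
    }

    // Swim: exchange a too-large key with its parent (at k/2)
    // until heap order is restored.
    private void swim(int k) {
        while (k > 1 && less(k / 2, k)) { exch(k / 2, k); k /= 2; }
    }

    // Sink: exchange a too-small key with its larger child
    // (at 2k or 2k+1) until heap order is restored.
    private void sink(int k) {
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && less(j, j + 1)) j++; // pick the larger child
            if (!less(k, j)) break;
            exch(k, j);
            k = j;
        }
    }

    private boolean less(int i, int j) { return pq[i].compareTo(pq[j]) < 0; }
    private void exch(int i, int j) { Key t = pq[i]; pq[i] = pq[j]; pq[j] = t; }
}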
- Heapsort
- We insert the items to be sorted into a minimum-oriented priority queue, then repeatedly remove the minimum to remove them all in order.
- It involves two phases: heapify and sortdown.
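- A sketch of the in-place version, which uses a max-oriented heap directly in the array (assuming 1-based indexing, with a[0] unused):
public class Heap {
    public static void sort(Comparable[] a) {
        int n = a.length - 1; // keys in a[1..n]
        // Heapify: sinking from n/2 down to 1 builds a max-heap.
        for (int k = n / 2; k >= 1; k--)
            sink(a, k, n);
        // Sortdown: move the max to the end, shrink, restore the heap.
        while (n > 1) {
            exch(a, 1, n--);
            sink(a, 1, n);
        }
    }

    private static void sink(Comparable[] a, int k, int n) {
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && less(a, j, j + 1)) j++;
            if (!less(a, k, j)) break;
            exch(a, k, j);
            k = j;
        }
    }

    private static boolean less(Comparable[] a, int i, int j) {
        return a[i].compareTo(a[j]) < 0;
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i]; a[i] = a[j]; a[j] = t;
    }
}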
2.5 Applications
- Sorting is useful because it’s much easier to search for an item in a sorted array than in an unsorted one.
- Stability: a sort method is stable if it preserves the relative order of equal keys in the array.
- Which sorting algorithm should I use?
- It depends heavily on details of the application and implementation.
- In most practical situations, quicksort is the method of choice.
- Reductions
- A situation where an algorithm developed for one problem is used to solve another.
- Examples: finding duplicates, permutation (ranking), priority-queue reductions, median and order statistics.
3. Searching
3.1 Symbol tables
- Its primary purpose is to associate a value with a key. The client can insert key-value pairs and later search for a given key. The basic operations are PUT and GET.
- Ordered symbol tables: ordered keys can provide efficient implementations of PUT and GET. The API must be extended.
- Sequential search in an unordered linked list: we search by considering the keys in the table one after another.
- Binary search in ordered array: the underlying data structure is a pair of parallel arrays, one for the keys and one for the values.
- Keeping the keys in an ordered array lets us use the binary search method.
- It’s typically far better than sequential search, and is the method of choice in numerous practical applications.
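- A sketch of the rank() method at the core of this implementation: binary search that returns the number of keys smaller than the given key (the demo in main() is illustrative):
public class BinarySearchDemo {
    // Binary search on a sorted array: halve [lo, hi] until the key
    // is found or the interval is empty. Returns the number of keys
    // smaller than key (the insertion point on a miss).
    public static int rank(String[] keys, String key) {
        int lo = 0, hi = keys.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(keys[mid]);
            if      (cmp < 0) hi = mid - 1;
            else if (cmp > 0) lo = mid + 1;
            else return mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        String[] keys = { "A", "C", "E", "H", "M", "R", "S", "X" };
        System.out.println(rank(keys, "M")); // 4 (hit at index 4)
        System.out.println(rank(keys, "B")); // 1 (one smaller key)
    }
}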
- More efficient implementations: binary search trees, red-black trees, and hash tables.
3.2 Binary search trees
- They combine the flexibility of insertion in a linked list with the efficiency of search in an ordered array.
- Basic implementation
- Node count in a subtree: $$size(x) = size(x.left) + size(x.right) + 1$$
- Search: each step discards one of the node’s two subtrees; in a balanced tree, the subtree size is roughly halved in each iteration.
- Insert: it’s not much more difficult to implement than search.
- Recursion is the easiest way to implement these functions, but it makes correctness harder to verify (a sketch follows).
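- A sketch of get() and put(), omitting the subtree counts behind the size() formula above:
public class BST<Key extends Comparable<Key>, Value> {
    private Node root;

    private class Node {
        Key key; Value val;
        Node left, right;
        Node(Key key, Value val) { this.key = key; this.val = val; }
    }

    // Search: compare with the node's key and go left or right,
    // until a hit (return the value) or a null link (a miss).
    public Value get(Key key) {
        Node x = root;
        while (x != null) {
            int cmp = key.compareTo(x.key);
            if      (cmp < 0) x = x.left;
            else if (cmp > 0) x = x.right;
            else return x.val;
        }
        return null;
    }

    // Insert (recursive): search for the key; hang a new node on the
    // null link where the search ends, or update the value on a hit.
    public void put(Key key, Value val) {
        root = put(root, key, val);
    }

    private Node put(Node x, Key key, Value val) {
        if (x == null) return new Node(key, val);
        int cmp = key.compareTo(x.key);
        if      (cmp < 0) x.left = put(x.left, key, val);
        else if (cmp > 0) x.right = put(x.right, key, val);
        else x.val = val;
        return x;
    }
}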
- Analysis
- The running times depend on the shapes of the trees, which, in turn, depend on the order in which keys are inserted.
- A perfectly balanced tree with $$N$$ nodes has $$\sim \lg N$$ nodes between the root and each null link. An unbalanced tree can have up to $$N$$ nodes on the search path.
3.3 Balanced search trees
- Plain binary search trees have poor worst-case performance. In balanced search trees, the costs are guaranteed to be logarithmic.
- 2-3 search trees: they allow the nodes in the tree to hold more than one key.
- A perfectly balanced 2-3 search tree is one whose null links are all the same distance from the root.
- Red-black binary search trees
- Red links: bind together two 2-nodes to represent 3-nodes.
- Black links: bind together the 2-3 tree.
- They are almost perfectly balanced.
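- A sketch of the red-black representation and the left rotation that keeps red links leaning left (omitting the full insert logic and the subtree counts):
public class RedBlackBST<Key extends Comparable<Key>, Value> {
    private static final boolean RED = true;
    private static final boolean BLACK = false;

    private class Node {
        Key key; Value val;
        Node left, right;
        boolean color; // color of the link from the parent to this node
        Node(Key key, Value val, boolean color) {
            this.key = key; this.val = val; this.color = color;
        }
    }

    // Null links are black by convention.
    private boolean isRed(Node x) {
        return x != null && x.color == RED;
    }

    // Rotate a right-leaning red link to lean left. Preserves both
    // the symmetric order and the perfect black-link balance.
    private Node rotateLeft(Node h) {
        Node x = h.right;
        h.right = x.left;
        x.left = h;
        x.color = h.color; // x inherits h's link color
        h.color = RED;     // the rotated link is red
        return x;
    }
}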
3.4 Hash tables
- Search algorithms that use hashing consist of two separate parts:
- Compute a hash function that transforms the search key into an array index.
- A collision-resolution process that deals with keys hashing to the same index.
- Hashing is an example of time-space tradeoff. It balances the usage of time and memory.
- Hashing allows implementations of SEARCH and INSERT that require constant (amortized) time per operation in a symbol table.
- Hash functions: a hash function transforms keys into array indices. If we have an array that can hold $$M$$ key-value pairs, we need a hash function that can transform any given key into an index in that array: an integer in the range $$[0, M-1]$$.
- A different hash function is needed for each key type we use (e.g. modular hashing).
- Requirements for a good hash function: consistency, efficiency and uniform key distribution.
- Hashing with separate chaining: collision resolution can be solved by building a linked list of the key-value pairs whose keys hash to each index.
- Hashing with linear probing: store $$N$$ key-value pairs in a hash table of size $$M > N$$, relying on empty entries in the table to help with collision resolution (on a collision, probe the next entry).
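- A sketch of modular hashing, the most common choice for this mapping:
public class ModularHashDemo {
    // Modular hashing: map a key's hashCode() to an index in [0, M-1].
    // Masking with 0x7fffffff zeroes the sign bit, so the remainder
    // is guaranteed to be non-negative.
    private static int hash(Object key, int m) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int m = 97; // table size; primes help spread the keys
        System.out.println(hash("separate chaining", m));
        System.out.println(hash("linear probing", m));
    }
}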
3.5 Applications
- Symbol tables play a critical role in many algorithms.
- Symbol tables can be used to implement sparse vectors.
4. Graphs
- Pairwise connections between items play a critical role in a vast array of computational applications.
- Graphs are used to model these connections and the related situations.
4.1 Undirected graphs
- In an undirected graph, edges are nothing more than connections between vertices.
- A graph is a set of vertices and a collection of edges that each connect a pair of vertices.
- Vertex naming: we use the names 0 through V-1 for the vertices in a V-vertex graph.
- Connection naming: V-W is an edge that connects V and W (same as W-V).
- A graph is defined independently of a drawing.
- Anomalies
- A self-loop is an edge that connects a vertex to itself.
- Two edges that connect the same pair of vertices are parallel.
- Path: a sequence of vertices connected by edges.
- Simple path: no repeated vertices.
- Cycle: a path with at least one edge whose first and last vertices are the same.
- Length: number of edges of a path or cycle.
- A graph is connected if there’s a path from every vertex to every other vertex in the graph.
- An acyclic graph is a graph with no cycles.
- A tree is an acyclic connected graph. A disjoint set of trees is called a forest.
- Graph density: the proportion of possible pairs of vertices that are connected by edges.
- Sparse graph: few edges present.
- Dense graph: few edges missing.
Depth-first search
- Analogous to searching a maze, using Trémaux exploration.
- To visit a vertex:
- Mark it as having been visited.
- Visit (recursively) all the vertices that are adjacent to it and that have not yet been marked.
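- A sketch of recursive depth-first search; to stay self-contained it takes adjacency lists as an int[][] rather than the book's Graph class:
public class DepthFirstSearch {
    private final boolean[] marked; // marked[v] = was v visited?

    public DepthFirstSearch(int[][] adj, int s) {
        marked = new boolean[adj.length];
        dfs(adj, s);
    }

    // Visit v: mark it, then recursively visit all the unmarked
    // vertices adjacent to it.
    private void dfs(int[][] adj, int v) {
        marked[v] = true;
        for (int w : adj[v])
            if (!marked[w]) dfs(adj, w);
    }

    public boolean marked(int v) { return marked[v]; }
}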
Breadth-first search
- Solves the question: “Is there a path from s to a given target vertex v? If so, find a shortest such path.”
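- A sketch of breadth-first search in the same adjacency-list style; using a FIFO queue means vertices are reached in order of their distance from s, so edgeTo[] records shortest paths:
import java.util.ArrayDeque;
import java.util.Queue;

public class BreadthFirstPaths {
    private final boolean[] marked;
    private final int[] edgeTo; // last edge on a shortest s-v path

    public BreadthFirstPaths(int[][] adj, int s) {
        marked = new boolean[adj.length];
        edgeTo = new int[adj.length];
        Queue<Integer> queue = new ArrayDeque<>();
        marked[s] = true;
        queue.add(s);
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : adj[v]) {
                if (!marked[w]) {
                    marked[w] = true; // mark before enqueueing
                    edgeTo[w] = v;
                    queue.add(w);
                }
            }
        }
    }

    public boolean hasPathTo(int v) { return marked[v]; }
}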
4.2 Directed graphs
- In directed graphs, edges are one-way: the pair of vertices that define each edge is an ordered pair that specifies a one-way adjacency.
4.3 Minimum spanning trees
- An edge-weighted graph is a graph model where we associate weights or costs with each edge.
- Classical algorithms
- Prim’s algorithm
- Kruskal’s algorithm
4.4 Shortest paths
- This model considers edge-weighted digraphs.
- Main problem: find the lowest-cost way to get from one vertex to another.
- A shortest path from vertex s to vertex t in an edge-weighted digraph is a directed path from s to t where no other path has a lower weight.
- Classical algorithms
- Dijkstra’s algorithm
- Topological sort
- Bellman-Ford algorithm
5. Strings
5.1 String sorts
- Sort approaches
- LSD: least-significant digit, examines the characters in the keys in a right-to-left order.
- MSD: most-significant digit, examines the characters in the keys in a left-to-right order.
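- A sketch of LSD string sort for fixed-length keys: one stable pass of key-indexed counting per character position, right to left (the stability of each pass is what makes the sort correct):
public class LSD {
    public static void sort(String[] a, int w) { // keys have length w
        int n = a.length;
        int r = 256; // extended-ASCII alphabet size
        String[] aux = new String[n];
        for (int d = w - 1; d >= 0; d--) {
            int[] count = new int[r + 1];
            for (int i = 0; i < n; i++)  // count character frequencies
                count[a[i].charAt(d) + 1]++;
            for (int c = 0; c < r; c++)  // frequencies -> start indices
                count[c + 1] += count[c];
            for (int i = 0; i < n; i++)  // distribute, stably
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < n; i++)  // copy back
                a[i] = aux[i];
        }
    }
}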
5.2 Tries
- Symbol-table implementations can be developed for efficient search using string keys.
- A trie is a search tree built from the characters of the string keys, which lets us use the characters of the search key to guide the search.
5.3 Substring search
- Substring search: given a text string of length $$N$$ and a pattern string of length $$M$$, find an occurrence of the pattern within the text.
- Classic algorithms:
- Brute force substring search
- Knuth-Morris-Pratt substring search (uses a deterministic finite automaton, DFA)
- Boyer-Moore substring search
- Rabin-Karp fingerprint search
5.4 Regular expressions
- String patterns can be described using regular expressions.
- Pattern matching is implemented with nondeterministic finite automata (NFA).
5.5 Data compression
- Huffman compression: a method for lossless data compression.
- LZW compression: a universal lossless data compression algorithm.