Having equipped ourselves with the tools of probability theory, we now turn to the exciting realm of randomized algorithms. Randomized algorithms are algorithms that incorporate randomness into their decision-making process. Unlike deterministic algorithms, which always follow the same sequence of steps for a given input, randomized algorithms make random choices during their execution, leading to probabilistic behavior. This seemingly simple addition of randomness can lead to algorithms that are faster, simpler, or more efficient than their deterministic counterparts, especially for problems that are computationally challenging.
Advantages of Randomization
Randomized algorithms offer several potential advantages:
- Simplicity: Randomized algorithms can often be simpler to design and implement compared to deterministic algorithms for the same problem. Introducing randomness can sometimes bypass complex case analysis or intricate data structures required in deterministic approaches.
- Efficiency: Randomized algorithms can be faster on average, or even in the worst case, than the best known deterministic algorithms for certain problems. Randomness can help algorithms avoid worst-case scenarios and achieve better typical-case performance.
- Robustness: Randomized algorithms can be more robust against adversarial inputs. In situations where an adversary can choose inputs to specifically target weaknesses in a deterministic algorithm, randomization can make it difficult for the adversary to predict and exploit the algorithm’s behavior.
- Breaking Symmetry: Randomization can be used to break symmetry in distributed systems and parallel algorithms, allowing for efficient coordination and decision-making among multiple agents.
Types of Randomized Algorithms
Randomized algorithms are broadly classified into two main types based on their probabilistic behavior:
- Monte Carlo Algorithms: Monte Carlo algorithms are randomized algorithms that may produce an incorrect answer with a certain probability. However, they are typically fast, with a guaranteed runtime, often polynomial. The probability of error can be controlled and reduced by repeating the algorithm multiple times.
- Las Vegas Algorithms: Las Vegas algorithms are randomized algorithms that always produce a correct answer, but their runtime is a random variable. The expected runtime of a Las Vegas algorithm is typically finite and often polynomial. While the runtime can vary, we are guaranteed to get the correct output eventually.
The choice between Monte Carlo and Las Vegas algorithms often depends on the specific application and the trade-off between runtime and correctness. Monte Carlo algorithms are suitable when a small probability of error is acceptable, while Las Vegas algorithms are preferred when correctness is paramount, even at the cost of potentially longer runtimes in some cases.
Reducing Error Probability (2.8.1)
A key technique in randomized algorithms is error reduction. For Monte Carlo algorithms that have a probability of error, we can reduce this error probability to an arbitrarily small level by repeating the algorithm multiple times and combining the results. The basic idea is to exploit the independence of repeated runs to amplify the probability of getting a correct answer.
Theorem 2.72: Let A be a randomized algorithm that may produce an incorrect answer with probability at most p (where p < 1/2). We can construct a new randomized algorithm A' by running algorithm A independently t times and taking the majority answer (or some other suitable combination rule). The error probability of A' can be made arbitrarily small; it is bounded by (4p(1-p))^(t/2), which tends to 0 exponentially fast in t because p < 1/2.
This theorem provides a general method for reducing the error probability of Monte Carlo algorithms. By repeating the algorithm multiple times and taking a majority vote (or other suitable combination rule), we can exponentially decrease the probability of error with only a linear increase in runtime (proportional to the number of repetitions t).
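To make the amplification concrete, here is a small Python sketch (not part of the original text): it repeats a hypothetical Monte Carlo test with single-run accuracy 0.75 and returns the majority answer; the function names and the error rate are invented for illustration.

import random
from collections import Counter

def amplify(monte_carlo_alg, x, t):
    # Run the Monte Carlo algorithm t times on input x and return the majority answer.
    # If a single run errs with probability p < 1/2, the majority answer is wrong
    # only with probability that decreases exponentially in t.
    answers = [monte_carlo_alg(x) for _ in range(t)]
    return Counter(answers).most_common(1)[0][0]

def noisy_is_even(n):
    # Hypothetical Monte Carlo test: answers correctly with probability 0.75.
    correct = (n % 2 == 0)
    return correct if random.random() < 0.75 else not correct

print(amplify(noisy_is_even, 42, t=101))  # prints True with very high probability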
Randomized Selection: QuickSelect (2.8.2)
A classic example of a Las Vegas algorithm is QuickSelect, a randomized algorithm for finding the k-th smallest element (the element of rank k) in an unsorted array. QuickSelect uses the same randomized partitioning step as QuickSort, but instead of recursively sorting both subarrays, it recurses only on the subarray that contains the desired k-th smallest element.
Algorithm: QuickSelect(A, l, r, k)
Algorithm: QuickSelect(A, l, r, k)
Input: Array A, subarray indices l, r, rank k
Output: k-th smallest element in A[l...r]
if l == r:
return A[l] // Base case: single element
if l < r:
p = Uniform({l, l+1, ..., r}) // Choose pivot index randomly
t = Partition(A, l, r, p) // Partition around pivot A[p]
if t == l + k - 1: // Pivot is k-th smallest
return A[t]
else if t > l + k - 1: // k-th smallest is in left subarray
return QuickSelect(A, l, t-1, k)
else: // k-th smallest is in right subarray
return QuickSelect(A, t+1, r, k - (t - l + 1))

QuickSelect works by randomly choosing a pivot element, partitioning the array around the pivot, and recursively searching in the appropriate subarray. The key to its efficiency lies in the random pivot selection, which ensures that, on average, the subarray size is reduced significantly in each recursive step.
Expected Runtime of QuickSelect: The expected runtime of QuickSelect is linear in the size of the input array, O(n). While the worst-case runtime can be quadratic, O(n²), this occurs only with very low probability due to the randomization. QuickSelect is a Las Vegas algorithm because it always returns the correct k-th smallest element, but its runtime is a random variable.
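The following Python sketch mirrors the pseudocode above but, for brevity, builds new sublists with list comprehensions instead of partitioning in place; this keeps the expected linear running time while using extra memory, so it is an illustration rather than the in-place algorithm itself.

import random

def quickselect(A, k):
    # Return the k-th smallest element (k = 1, ..., len(A)) of the list A.
    # Random pivot choice, then recurse only into the part that contains rank k.
    pivot = random.choice(A)
    smaller = [x for x in A if x < pivot]
    equal = [x for x in A if x == pivot]
    larger = [x for x in A if x > pivot]
    if k <= len(smaller):
        return quickselect(smaller, k)
    elif k <= len(smaller) + len(equal):
        return pivot  # the pivot itself has rank k
    else:
        return quickselect(larger, k - len(smaller) - len(equal))

print(quickselect([7, 2, 9, 4, 4, 1], 3))  # prints 4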
Randomized Primality Testing: Miller-Rabin (2.8.3)
Primality testing, the problem of determining whether a given number is prime, is a fundamental problem in number theory and cryptography. While deterministic polynomial-time algorithms for primality testing exist (e.g., the AKS primality test), randomized algorithms, such as the Miller-Rabin primality test, are widely used in practice due to their simplicity and efficiency.
The Miller-Rabin test is a Monte Carlo algorithm that provides a probabilistic answer to the primality question. It is based on Fermat’s Little Theorem and properties of modular arithmetic. The test iteratively checks certain conditions that hold for prime numbers. If any condition fails, the number is declared composite (definitely not prime). If all conditions pass for a chosen number of rounds, the number is declared probably prime.
Algorithm: Miller-Rabin Primality Test
Algorithm: Miller-Rabin(n)
Input: Integer n
Output: 'probably prime' or 'composite'
if n == 2:
    return 'probably prime' // 2 is prime
else if n is even or n == 1:
    return 'composite' // 1 and even numbers > 2 are composite
Choose a ∈ {2, 3, ..., n-1} randomly and uniformly
Compute k, d ∈ ℤ with n - 1 = d * 2^k and d odd
x = a^d mod n // Compute a^d mod n
if x == 1 or x == n - 1:
    return 'probably prime' // Possibly prime
repeat k-1 times:
    x = x^2 mod n // Square x modulo n
    if x == 1:
        return 'composite' // n is certainly composite
    if x == n - 1:
        return 'probably prime' // Possibly prime
return 'composite' // n is certainly composite

Error Probability of Miller-Rabin: If the Miller-Rabin test declares a number composite, it is guaranteed to be composite (no false negatives). However, if the test declares a number probably prime, there is a small probability that it is actually composite (false positive). The error probability of a single round of Miller-Rabin is at most 1/4. By repeating the test multiple times with independent random choices of a, the error probability can be reduced exponentially.
Theorem 1.75: By repeating the Miller-Rabin test t times with independent random choices of a, the probability of error (declaring a composite number probably prime) can be reduced to at most (1/4)^t.
Miller-Rabin is a Monte Carlo algorithm with one-sided error (false positives are possible, but false negatives are not). It is widely used in practice due to its efficiency and low error probability. For cryptographic applications, a large number of rounds are typically performed to ensure a sufficiently low error probability.
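As an illustration, a compact Python version of the test with repetition built in might look as follows (a sketch, not a reference implementation from the text); it uses Python's built-in pow for modular exponentiation, and the number of rounds is a parameter.

import random

def miller_rabin(n, rounds=20):
    # Returns False only for numbers that are certainly composite; returns True
    # for 'probably prime' with error probability at most (1/4)**rounds.
    if n == 2 or n == 3:
        return True
    if n < 2 or n % 2 == 0:
        return False
    d, k = n - 1, 0
    while d % 2 == 0:       # write n - 1 = d * 2^k with d odd
        d //= 2
        k += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue        # this round gives no evidence of compositeness
        for _ in range(k - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break       # possibly prime, go to the next round
        else:
            return False    # certainly composite
    return True             # probably prime

print(miller_rabin(561))     # almost certainly False (561 is a Carmichael number)
print(miller_rabin(104729))  # True (104729 is prime)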
Target Shooting: Estimating Probabilities through Sampling (2.8.4)
In situations where calculating probabilities directly is difficult, Monte Carlo methods provide a powerful approach to estimate probabilities through random sampling. Target Shooting is a Monte Carlo technique for estimating the probability of an event by repeatedly sampling outcomes from the sample space and counting the proportion of samples that fall within the event.
Algorithm: Target-Shooting
Algorithm: Target-Shooting(U, S, N)
Input: Universe U, subset S ⊆ U, number of samples N
Output: Estimate of |S|/|U|
Choose u1, u2, ..., uN ∈ U randomly, uniformly, and independently
return (1/N) * Σ_{i=1}^{N} I_S(ui) // Proportion of samples in S

The Target-Shooting algorithm works by randomly selecting elements from the universe U and counting how many of these elements fall into the target set S. The proportion of samples in S provides an estimate of the ratio |S|/|U|, which is often related to the probability of interest (the probability that a uniformly random element of U lies in S).
Accuracy of Target Shooting: The accuracy of the Target-Shooting algorithm depends on the number of samples N. A larger number of samples generally leads to a more accurate estimate. Theorem 2.79 provides a bound on the number of samples needed to achieve a desired level of accuracy with a certain probability.
Theorem 2.79: To estimate |S|/|U| within a relative error of ε with probability at least 1 - δ, the Target-Shooting algorithm requires a number of samples N proportional to (|U|/|S|) · (1/ε²) · ln(1/δ).
This theorem quantifies the trade-off between accuracy, confidence, and sampling effort in Target Shooting. Higher accuracy (smaller ε) and higher confidence (smaller δ) require more samples, and the number of samples also depends inversely on the ratio |S|/|U|. When the target set S is small relative to the universe U (rare events), more samples are needed to achieve a reliable estimate.
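A minimal Python sketch of the estimator (the universe, target set, and sample size below are invented for illustration; the true ratio in this example is about 1/7):

import random

def target_shooting(sample_from_U, is_in_S, N):
    # Estimate |S|/|U| by drawing N uniform samples from U and
    # returning the fraction that lands in S.
    hits = sum(1 for _ in range(N) if is_in_S(sample_from_U()))
    return hits / N

# Example: U = {1, ..., 10^6}, S = multiples of 7.
estimate = target_shooting(lambda: random.randint(1, 10**6),
                           lambda u: u % 7 == 0,
                           N=100_000)
print(estimate)  # close to 1/7 ≈ 0.1429 with high probability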
Finding Duplicates: Hashing and Bloom Filters (2.8.5)
The section concludes with applications of randomized algorithms to the problem of finding duplicates in a large dataset. Given a dataset of n items, the goal is to efficiently identify pairs of identical items.
Hashing for Duplicate Detection
Hashing provides a randomized approach to duplicate detection. By using a hash function, we can map each item to a hash value (a smaller integer). If two items have the same hash value, they are potentially duplicates (hash collision). We can then efficiently check for equality only among items with the same hash value, reducing the number of pairwise comparisons needed.
Algorithm: Duplicate Detection using Hashing
- Hash Mapping: Create a hash map (or hash table) to store items and their corresponding hash values. For each item x_i in the dataset, compute its hash value h(x_i) and store the pair (h(x_i), i) in the hash map, where i is the index of the item.
- Sorting by Hash Value: Sort the entries in the hash map based on their hash values (the first component of the pairs).
- Duplicate Check: Iterate through the sorted entries and compare items with identical hash values for equality. Report any pairs of equal items as duplicates.
Analysis of Hashing for Duplicates: The expected number of collisions (items with the same hash value that are not duplicates) can be controlled by choosing a hash function with good distribution properties and a sufficiently large range of hash values. The overall expected runtime is dominated by the sorting step, resulting in a time complexity of O(n log n), and the space complexity is O(n) for storing the hash map.
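A Python sketch of the same idea follows; it groups indices with a dictionary instead of sorting the (hash value, index) pairs, which leads to the same comparisons, and Python's built-in hash stands in for the hash function.

from collections import defaultdict

def find_duplicates(items):
    # Map each hash value to the indices of the items that produced it,
    # then compare items for equality only within the same bucket.
    buckets = defaultdict(list)
    for i, x in enumerate(items):
        buckets[hash(x)].append(i)
    duplicates = []
    for indices in buckets.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                if items[i] == items[j]:  # confirm equality, not just a collision
                    duplicates.append((i, j))
    return duplicates

print(find_duplicates(["ape", "bee", "cat", "bee", "ape"]))  # [(0, 4), (1, 3)]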
Bloom Filters for Approximate Duplicate Detection
Bloom filters are probabilistic data structures that provide a space-efficient way to check for the potential presence of an element in a set. They are particularly useful for very large datasets where memory is a constraint. Bloom filters allow for false positives (reporting an element as present when it is not), but the false positive rate can be made small by choosing the filter parameters appropriately, and they produce no false negatives (if an element is in the set, the Bloom filter will always report it as present).
Algorithm: Duplicate Detection using Bloom Filters
- Bloom Filter Construction: Create a Bloom filter B consisting of m bits, initially all set to 0, and choose k independent hash functions h_1, ..., h_k, each mapping items to bit positions in {0, 1, ..., m-1}.
- Bloom Filter Population and Duplicate Check: Maintain a list L of potential duplicates, initially empty, and process the items of the dataset in order. For each item x_i, first check whether all bits B[h_j(x_i)], j = 1, ..., k, are already set to 1; if yes, add the index i to L. Then set B[h_j(x_i)] = 1 for all j = 1, ..., k.
- Verification: Iterate through the list L. For each index i in L, compare x_i with all preceding items to identify actual duplicates; this removes the false positives introduced by the Bloom filter.
Bloom filters provide a trade-off between space efficiency and accuracy. By controlling the Bloom filter size m and the number of hash functions k, we can adjust the false positive rate and the memory usage. Bloom filters are widely used in applications where approximate duplicate detection is acceptable, such as web caching, network routing, and database indexing.
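The sketch below illustrates the check-before-insert pass in Python; simulating the k hash functions by salting Python's built-in hash, as well as the parameters m and k, are illustrative choices rather than prescriptions from the text, and the returned candidate indices would still need the verification step described above.

import random

class BloomFilter:
    # A minimal Bloom filter with m bits and k (salted) hash functions.
    def __init__(self, m, k):
        self.m = m
        self.bits = [0] * m
        self.salts = [random.getrandbits(64) for _ in range(k)]

    def _positions(self, x):
        return [hash((salt, x)) % self.m for salt in self.salts]

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def might_contain(self, x):
        return all(self.bits[pos] for pos in self._positions(x))

def find_duplicate_candidates(items, m=10_000, k=5):
    # Return indices of items flagged as potential duplicates (false positives possible).
    bf = BloomFilter(m, k)
    candidates = []
    for i, x in enumerate(items):
        if bf.might_contain(x):  # all bits already set: seen before, or a false positive
            candidates.append(i)
        bf.add(x)
    return candidates

print(find_duplicate_candidates(["ape", "bee", "cat", "bee", "ape"]))  # [3, 4] (barring false positives)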
This section has provided a glimpse into the world of randomized algorithms, showcasing their potential for efficiency, simplicity, and robustness. We have explored fundamental techniques like error reduction, randomized selection, primality testing, and Monte Carlo estimation, along with applications to practical problems like duplicate detection. Randomized algorithms are a powerful and versatile tool in computer science, and their importance continues to grow as we tackle increasingly complex and data-intensive problems.