In this lecture we’ll show a clever algorithm for computing the median of an unsorted list of numbers in linear time.
Given an unsorted list of size , find the median of . For concreteness we’ll define the median as the smallest element of the list.
Just as in inductive proofs where one often needs to strengthen the inductive hypothesis and thus prove a stronger statement, here we will give an algorithm for a more general problem:
Given an unsorted list of size and an integer k, find the smallest element of .
Sort the list and return the element of the sorted array. This can be done in by a standard sorting algorithm, such as MergeSort.
Median of Medians Algorithm [BFPRT 1973]
We’ll present an time algorithm that was devised by Blum, Floyd, Pratt, Rivest and Tarjan in 1973. The high-level strategy is divide-and-conquer, following the same scheme as in QuickSort. As in QuickSort we choose a pivot p. We then partition the input array A into three sets based on the pivot: and . Whereas QuickSort then recursively sorts the two subarrays and , in our setting we only need to search in at most one subarray. The choice of the pivot is the main challenge in the design of the fast algorithm.
Here is pseudocode for the outline of the algorithm.
- Choose a pivot (How do we do this? This will be our main task.)
- Partition into three lists:
- If , return Select(, k)
If , return If , return Select(, )
Here is an example:
Consider the input and suppose we use pivot .
Then: and , and therefore:
- if , then recurse on Select().
- if or , then output .
- if , then recurse on Select().
In QuickSort the choice of a bad pivot can lead to a time algorithm. In our case we also need to be clever with the choice of the pivot. For example if the pivot is always the smallest element in the array then we only remove one element from our problem at each iteration, which gives an algorithm! This follows from solving the recurrence .
Recall the recurrence for MergeSort: which solves to . We’re aiming for an time algorithm, and hence we’re hoping for a recurrence such as: which solves to . Notice that if our one subproblem was still of size (3/4)n or even .99n, then our running time is still .
In fact, for any constant the recurrence:
solves to .
Our goal is to choose a pivot that guarantees we eliminate a constant fraction of the elements at each iteration. We will aim to find a pivot that guarantees that the subproblems are of size . Hence we make the following definition.
Definition (Good Pivot). We say a pivot is good if and .
Finding a good pivot
How do we find a good pivot?
Simple Randomized Scheme. When in doubt, act randomly! Choose an element r uniformly at random from A and set this random element r to be the pivot p.
What is the probability that r is a good pivot?
Consider a sorted version of A (this is just for the analysis, we don’t actually sort A). Notice that in sorted A, all elements between n/4 and 3n/4 are good pivots. Thus there are n/2 good pivots. Therefore:
Pr(r is a good pivot) = .
Notice that we can easily check if a proposed pivot p is good or not: simply compute and and check their sizes. Thus we have the following randomized procedure to find a good pivot.
- Choose a random element r from A.
- Check if r is a good pivot or not. If it’s a good pivot set p=r and use it. If not then repeat step 1.
Since r is a good pivot with probability 1/2 then in expectation we have to repeat the procedure two times to find a good pivot. Hence this procedure yields an algorithm whose expected running time is .
Deterministic scheme. Notice that if we find a good pivot in time then our recurrence will be so we still have roughly time left at our disposal. We’ll use the fact that for constants a and b where a+b<1 the following recurrence:
T(n) = T(an) + T(bn) + O(n) solves to T(n) = O(n).
We will aim for an approach that has a=.75 and b=.2 (In your homework you will prove that this case solves to O(n) time). So we will use this additional time of T(.2n) to find a good pivot. This suggests a recursive approach: Choose a subset S of A where , recursively find the median of S, and set p=median(S).
We want to choose a subset S that is representative of the entire set A. Suppose we set S to be the first n/5 elements of A. In the worst case S will be the n/5 smallest elements of A and then p=median(S) will be the (n/10)-th smallest element of A and hence we’ll have: . This gives the recurrence which does not solve to O(n) because 1/5 + 9/10 > 1.
So we need to find a set S that is more representative of the entire set A. We want a set S where for every , there are few elements of A which are ≤ x and a few which are ≥ x. Consider the first 5 elements of A. We want to choose a good representative for these 5 elements. Let’s sort these 5 elements (this takes only O(1) time since there are only 5 elements). Take the median m of these 5 elements. That’s a good representative since at least 3 elements are ≤ m and at least 3 elements are ≥ x. So let’s continue with this approach.
Break A into n/5 groups of 5 elements each (assume n is a power of 5 for simplicity so all groups are the same size). Call the groups Sort each of these groups and let denote the median of the i-th group for . Let denote the set of all n/5 medians. Finally, recursively run our median finding procedure to find the median of S, and set p=median(S). We will prove that this p is a good pivot. We now have our complete algorithm, here is the pseudocode:
- Partition into groups of elements each.
- For to :
- Sort and let Median
- = FastSelect()
- Partition into three lists , ,
- If , return FastSelect()
If , return If , return FastSelect()
It just remains to prove that p is a good pivot.
Running time analysis.
Claim 1: Pivot selected above is a good pivot
Without loss of generality, assume the groups are sorted by their medians i.e. . Note we are not doing this in the algorithm, we are simply relabelling the names of the groups in the proof. Now we have that the median of S is .
Let’s analyze how many elements of A are guaranteed to be ≤ p.
We have that half of S is ≤ p, i.e. .
Also, for each of the first n/10 groups , we have that is greater than or equal to at least 3 elements in .
Thus, is greater than or equal to at least elements of .
Thus, the number of elements strictly greater than .
Using a similar argument, the number of elements smaller than .
This proves that is a good pivot.
Claim 2: FastSelect() runs in time in the worst case
Let denote the time to run FastSelect on a list of size .
Time required to construct
Time required to find pivot
Time required to partition into
Since p is a good pivot, the time required to recursively solve one subproblem for or is at most (actually it’s if you recall the proof of Claim 1).
Thus, we have the recurrence relation .
Since , we have that .