Sum some sums! - Life, the Universe, and Everything: The adventures of William (Billy) Moses

The other day a friend got an email back from a Google recruiter asking if he’d like to interview.

Being a TA for the MIT’s introductory algorithms class, and generally knowledgable about algorithms (I often help prep friends for algo interviews), he asked if I could help him prepare.

I was more than happy to oblige.

Problem 1: 1 number missing

Naturally, I had to begin our session with a classic interview problem: given an unsorted list of \(n\) numbers from \(1\) to \(n\) (inclusive) with one number missing, find the missing number!

Of course he immediately knew the standard solution:

The sum of the the numbers should equal the following (by simple math): \(\sum_{i=1}^n i = \frac{n(n+1)}{2}\)

Thus if we sum the numbers in the array, the difference between the expected sum and the actual sum will be our missing number!

Problem 2: 2 numbers missing

Since he got this right away I decided to up the ante a little bit. Now instead of one missing number, lets suppose we had two.

Here, he got a little stuck, to which I gave him the advice: what other operators could we have used for the 1-missing-number case. He came up with xor (as I had hoped).

Instead of subtracting the sum of our array from the expected sum, we could’ve xor’d our array with the expected xor (i.e. the xor of \(1\) through \(n\)).

For ease, let’s call our two missing numbers \(a\) and \(b\) Suppose now we do both of these operations on our array. Now we know \(a+b\) and \(a ~\text{xor}~ b\).

Now the question is, how do we invert (sum(a,b), xor(a,b)), ignoring for the moment, the question as to whether this is indeed invertible. Miguel thought for a while trying to find a nice formula (channeling his inner math major), until I reminded him that since we already spend \(O(n)\) time finding these, we can just as well use \(O(n)\) time to do the inversion.

Given that, we came up with the following psedocode which simply iterates through possible \((a, b)\) values, choosing \(b\) from one of our constraints so we only have to search \(O(n)\) pairs.

def find_missing(xs, n):
  expected_sum = sum(range(1,n+1))
  expected_xor = xor(range(1,n+1))
  true_sum = sum(xs)
  true_xor = xor(xs)
  a_plus_b = expected_sum - true_sum
  a_xor_b  = expected_xor ^ true_xor
  for a in range(1, n+1):
    b = a_xor_b ^ a
    if a + b == a_plus_b:
      return (a, b)
  raise

Sum/xor to find 2 missing numbers.

Problem 3: \(k\) numbers missing

Keep going, I said! What if now we have \(k\) numbers missing. I decided to be especially generous and even give him as much time/space as he wanted.

Naturally, his first instinct was to merge-sort the list, looking for gaps between values. Such an algorithm runs in \(O(n \log n)\) worst case, but uses \(O(n)\) space.

Deciding that my generousity had run out, I asked him to do it in less space.

He offered up a quicksort instead which is \(O(n \log n)\) expected time, and indeed in place (\(O(1)\) extra space). However it was \(O(n^2)\) worst case (unless we do something fancy like Median-of-Medians, but I wasn’t planning on covering that at the time).

I then suggested a heap sort, to which Miguel replied “don’t you need extra space for a heap”. And it turns out you don’t!

If you make a max heap out of your array (which can be done in \(O(n)\) time), then repetitively swap out the max to the current end of your heap (and heapify-down to fix), you can do an in-place heap-sort!

It also turns out you can even do merge sort in-place, but that’s much more complicated.

Problem 4: \(k\) numbers missing, \(O(n)\) worst-case

Ok, enough fiddling around with suboptimal runtimes, let’s get this problem back to \(O(n)\).

Comparison sorts on bounded integers are for chumps. Lets use a counting sort-esque way of solving the problem.

First, let’s create a list of counts – representing the number of times we’ve seen any given value. We then iterate through all elements of our array, incrementing the appropriate count. Finally, we return any value with a 0 count that we were expecting to see.

def find_missing(xs, n):
  counts = [0]*(n+1)
  for x in xs:
    counts[x] += 1
  output = []
  for x in range(1, n+1):
    if counts[x] == 0:
      output.append(x)
  return output

Counting-sort to find \(k\) missing numbers.

Problem 5: \(k\) numbers missing, \(O(n)\) worst-case, \(O(k)\) extra space

This solution is nice and all (getting back to our best time bound), but I’m still not happy with that space bound.

Let’s fix that with a quick-and-dirty divide-and-conquer.

Specifically, suppose we’re looking through a list of numbers which ought to contain all the numbers from low to high.

Let’s run one round of the \(O(n)\) partition function from quicksort, using the average of low and high as our pivot.

def partition(xs, left, right, value):
  while True:
    while xs[left] <= value:
      left += 1
    while xs[right] > value:
      right -= 1

    if left < right
      xs[left], xs[right] = xs[right], xs[left]
    else:
      break

  return left

Partition function from quicksort.

Here’s where the interesting part comes in. Originally, we expected to have a total of \(k\) “gaps” in our original array. Now we’ve partitioned our array into pieces of roughly equal size, where any gaps less than our pivot are on the left of the pivot, and any gaps greater than our pivot are on the right side of our pivot.

In fact, we can quickly determine how many such gaps exist on either side, by simply checking the index of our partition!

Our overall algorithm now becomes the following: if there are no “gaps” on one of the partitions, recurse into the other partition. Otherwise, call the find_missing function on both sides, except now we’re looking for fewer gaps on both sides. As a quick and dirty base-case, we can simply call one of our 1 element missing functions above as a base case, but that’s slightly less fun.

def find_k_missing(xs, lindex, hindex, lvalue, hvalue, k):
  if k == 1:
    return [find_1_missing(xs, lindex, hindex, lvalue, hvalue)]
  mid = (lvalue + hvalue)/2
  split = partition(xs, lindex, hindex, mid)
  left_gaps  = (mid - lindex)  - (split - lindex)
  right_gaps = (hindex - mid) - (hindex - split)

  missing = []
  if left_gaps != 0:
    missing.appendAll(find_k_missing(xs, lindex, split, lvalue, mid, left_gaps))

  if right_gaps != 0:
    missing.appendAll(find_k_missing(xs, split, hindex, mid, hvalue, right_gaps))

  return missing

Here’s where the analysis gets interesting.

Clearly our routine works in \(O(n)\) time when we just have one element missing.

Let’s now prove the rest by induction. Suppose our function finds up to \(k-1\) elements missing in \(O(n)\) time. Does this imply we can find up to \(k\) elements in \(O(n)\) time?

In reality, we only have two cases we need to worry about. The first case is when our partition procedure results in \(m>0\) gaps on one side, and \(k-m>0\) gaps on the other. This takes \(O(n)\) time. However, at this point we simply run our algorithm for finding \(m\), and \(k-m\) gaps on either side. The original partition takes \(O(n)\) time and by our inductive hypothesis, each of the two subprocedures take \(O(n)\) time, meaning the entire thing runs in \(O(n)\). Great!

Now for the case where all the gaps are on one side. We know not to check the other side of the partition and recurse. I can write out the worst-case recurrence relation for this as follows:

\[T(n) = T(n/2) + O(n)\]

Either by the master method, or visualizing this as a geometric series, we see that the worst-case runtime of this procedure is just \(O(n)\).

And there you have it, a nice way of solving the \(k\) missing values problems in \(O(n)\) time without an auxillary array.

But Billy, you might say, how much space does this use?

I won’t go through a full-blown version of the space analysis, but you should be able to convince yourself that with proper tail-recursion eliminiation, or implementing this iteratively, you end up needing only \(O(k)\) extra space (worst case one recursive call for every gap).

Since we need \(O(k)\) space to output our final array this is the best we can do!

Using higher-order polynomials

Another common way to solve the \(k\) missing elements is to use various formulas for higher order polynomials. For instance we would compute not only the sum of all the elements, but also the sum of squares:

\[\sum_{i=1}^n i^2 = \frac{n(n+1)(2n+1)}{6}\]

There exist similar formulas for any positive integer power we choose.

In the same manner as our sum/xor formula above, if we were to assume we were missing two numbers \(a\) and \(b\), we would be able to derive \(a+b\) and \(a^2+b^2\). Using the quadratic formula, we can quickly derive the answer.

As \(k\) grows, we can use the same tecnique, to figure out the sum of different powers of our missing numbers. Unfortuantely, after \(4\), there isnt a nice quadratic formula way to solve this (though I believe there is a randomized algorithm to do so).

Either way, while this sort of thing works, it seems much sketchier to me because finding a solution to the set of equations becomes difficult with increasing \(k\) (whereas the divide-and-conquer solution doesn’t get more difficult with greater \(k\)).