Tuesday, September 30, 2014

Find duplicates in a array using only O(n) time and O(1) space

This question comes from a stackoverflow question. The original question is stated as:

Q> Given a size n integer array with all elements in [0, n-1]. Any number can appear for any times. Design an algorithm that can detect all duplicate elements in this array. Try to do this in O(n) time and O(1) space. Assume the original string can be modified.

A>


Algorithm 1:


It is trivial to come up with (O(n), O(n)) or (O(n^2), O(1)) algorithm. For me, figuring out (O(n), O(1)) algorithm is no easy task. Once I saw an algorithm that uses element sign bit as a flag to indicate whether a certain number has been visited or not. C++ code as follow:
vector<int> findDup(int a[], int len) {
 vector<int> ret;
 for (int i = 0; i<len; i++) {
  if(a[abs(a[i])] > 0) a[abs(a[i])] *= -1;
  else ret.push_back(abs(a[i]));
 }
 return ret;
}
with playable version here.

This method uses the property that all elements in the array are positive. Thus sign bit can be used as flag to denote whether we have visited it or not.

This method is not perfect. There are several drawbacks:

  1. It uses sign bit, so in a sense this is not O(1) space at all. What if the original array if defined as unsigned int array?
  2. If we have a 0 in the array we cannot detect it, by the requirement we need to be able to accomodate 0.
  3. If we have one number repeated for more than once, the output vector will contain multiple copy of this number.

Algorithm 2:


Stackoverflow user caf came up with this awesome answer here, while an improved version here is proposed by user j_random_hacker.

Caf's answer is also based on modifying the original array. But instead of modifying the sign bit, caf came up with another idea. Caf uses the index of the element as a flag.

The idea is as follow:
Suppose we are given the following array:
We scan from the left to right and treat a[i]'s value as a "pointer" to a[i]th element in the array. We will try to make a[a[i]] = a[i]. It works in this way:
In the top graph, we have a[a[i]] = j, and a[j] = a[a[i]] = k. To make a[j] = j, we only need to swap a[i]'s value, which is j, with a[j]'s value, which I denote as k for now. After swap we get the bottom figure.

Now what? now we have a[a[j]] = j, but check a[i] again, we have a new "pointer" a[i] = k. We will just go ahead and check a[a[i]] = a[k] = l, check whether k and l matches. This process will not complete until we have a[a[i]] = a[i].

Let's use the first figure to demonstrate this process. 
First swap:

Now a[0] = 3, check a[3] and repeat this process again.

Now a[0] = 1, check a[1].
a[1] = 1, thus this process for a[0] stops.

We haven't done yet. Now we check a[1], make sure its pointer also satisfies a[a[i]] = a[i]. Well it's already been done, we move on.

At index 2, well that also looks good. No swap needed.

We move on to index 3, 4 and 5, everything looks good. Eventually we will have:

Now what? Well it is quite apparent now. It can be seen that if a[i] <> i, then a[i] is a duplicate.

Code:
vector<int> findDup2(int a[], int len) {
 vector<int> ret;
 
 for(int i = 0; i < len; i++) {
  while(a[a[i]] != a[i]) {
   swap(a[a[i]], a[i]);
  }
 }
 
 for (int i = 0; i<len; i++) {
  if (a[i] != i) ret.push_back(a[i]);
 }
 return ret;
}
Playable version here.

As Caf's comment suggests, this may looks like O(n^2) but it is actually not. Every swap will fix one a[j] and makes a[j] equals to j. If this relation holds already then no swap is needed. Thus at most N-1 swaps are needed to fix all elements. Complexity O(n).


Algorithm 2plus:


From the output we can see an apparent problem. We did find out all numbers that are duplicate. But the output is not quite what we want. If a number is repeated n (n>2) times, it will show up for n-2 times in the output.

j_random_hacker solves this problem perfectly. He still uses the property a[a[i]] = a[i] (Suddenly it reminds of two star programming...). But for any found duplicate number, this property will be used only once. If this number is a duplicate, which means a[i] <> i, we check to see whether a[a[i]] == a[i] still holds. If it holds, we break it to prevent other duplicates to use this property, such that other duplicates will not be able to be recognized as a duplicate. 

But how do we break it? we set a[a[i]] to what? Suppose a[a[i]] = k, we need to be sure that the a[a[i]] itself will not be recognized as a duplicate, which means a[a[a[i]]] = a[k] (this is confusing... three star programming) cannot equals to k.  Simply put, we need to find a k that satisfies a[k]<>k. We already know a[i] <> i, thus k = i.

I know the above paragraph is confusing. Let's work on the above example. Hopefully it will clarify j_random_hack's brilliant idea.

OK, so again we starts with 
since we still want that nice a[a[i]] = a[i] property. But this time, we change the output part to:
vector<int> findDup2(int a[], int len) {
 vector<int> ret;
 
 for(int i = 0; i < len; i++) {
  while(a[a[i]] != a[i]) {
   swap(a[a[i]], a[i]);
  }
 }
 
 for (int i = 0; i<len; i++) {
  if (a[i] != i && a[a[i]] == a[i]) {
   ret.push_back(a[i]);
   a[a[i]] = i;
  }
 }
 return ret;
}

The changes are on line 11 and line 13. Let's apply this output code to the above graphical result.

For index 0, it is a duplicate since a[0] <> 0. Based on Caf's original code, we immediately implies a[a[0]] == a[0] = 1. Let's break this property, and change a[a[0]] to 0. We get:

For index 1, it is the original copy, but now we recognize it also as a duplicate since a[1] = 0 <> 1. But, based on j_random_hack's idea, since a[a[i]] <> a[i], this index used to be a duplicate but we have destroyed its property (a[k] <> k). We will not output it.

For index 2, it used to be a duplicate, but property destroyed just like a[1]. No output.

For index 3 and 4, they are not duplicates at all (a[i] == i), No output.

For index 5, we repeat what we have done for index 0, and get the following result:
Index 5 is recognized as a duplicate, original copy a[4] has been corrupted. Output a[5] = 4.

Playable final version can be found here.


Conclusion:

It all starts with the idea of using index as flag. This is way harder than using sign bit as flag. To use index as flag, we use a[i]'s value as a new index and see how we can maintain loop invariant a[a[i]] = a[i]. The followup algorithm moved even further, it uses both a[j] <> j to indicate a duplicate, and use a[a[i]] == a[i] as a breakable flag to indicate whether we have visited this element or not. (Shivering with awe....)


Followups:


  1. Find missing elements. Total array size n. Elements are 0, 1, ... n-1 except that 2 numbers are missing. These two numbers have been replaced by 2 other numbers in range [0, n-1]. Find these two missing numbers and replacement numbers. (Original array modifiable) Solution.
  2. Find the first non repeated character in a string. Suppose we are not handling unicode characters, just 8-bit ascii code. (Expand it to 256 length string, use the above algorithm2plus to break the original string, third scan to pick out the first a[i]=i character. (O(n), O(256)) Solution)


No comments:

Post a Comment