Tuesday, September 30, 2014

Find duplicates in a array using only O(n) time and O(1) space

This question comes from a stackoverflow question. The original question is stated as:

Q> Given a size n integer array with all elements in [0, n-1]. Any number can appear for any times. Design an algorithm that can detect all duplicate elements in this array. Try to do this in O(n) time and O(1) space. Assume the original string can be modified.

A>


Algorithm 1:


It is trivial to come up with (O(n), O(n)) or (O(n^2), O(1)) algorithm. For me, figuring out (O(n), O(1)) algorithm is no easy task. Once I saw an algorithm that uses element sign bit as a flag to indicate whether a certain number has been visited or not. C++ code as follow:
vector<int> findDup(int a[], int len) {
 vector<int> ret;
 for (int i = 0; i<len; i++) {
  if(a[abs(a[i])] > 0) a[abs(a[i])] *= -1;
  else ret.push_back(abs(a[i]));
 }
 return ret;
}
with playable version here.

This method uses the property that all elements in the array are positive. Thus sign bit can be used as flag to denote whether we have visited it or not.

This method is not perfect. There are several drawbacks:

  1. It uses sign bit, so in a sense this is not O(1) space at all. What if the original array if defined as unsigned int array?
  2. If we have a 0 in the array we cannot detect it, by the requirement we need to be able to accomodate 0.
  3. If we have one number repeated for more than once, the output vector will contain multiple copy of this number.

Algorithm 2:


Stackoverflow user caf came up with this awesome answer here, while an improved version here is proposed by user j_random_hacker.

Caf's answer is also based on modifying the original array. But instead of modifying the sign bit, caf came up with another idea. Caf uses the index of the element as a flag.

The idea is as follow:
Suppose we are given the following array:
We scan from the left to right and treat a[i]'s value as a "pointer" to a[i]th element in the array. We will try to make a[a[i]] = a[i]. It works in this way:
In the top graph, we have a[a[i]] = j, and a[j] = a[a[i]] = k. To make a[j] = j, we only need to swap a[i]'s value, which is j, with a[j]'s value, which I denote as k for now. After swap we get the bottom figure.

Now what? now we have a[a[j]] = j, but check a[i] again, we have a new "pointer" a[i] = k. We will just go ahead and check a[a[i]] = a[k] = l, check whether k and l matches. This process will not complete until we have a[a[i]] = a[i].

Let's use the first figure to demonstrate this process. 
First swap:

Now a[0] = 3, check a[3] and repeat this process again.

Now a[0] = 1, check a[1].
a[1] = 1, thus this process for a[0] stops.

We haven't done yet. Now we check a[1], make sure its pointer also satisfies a[a[i]] = a[i]. Well it's already been done, we move on.

At index 2, well that also looks good. No swap needed.

We move on to index 3, 4 and 5, everything looks good. Eventually we will have:

Now what? Well it is quite apparent now. It can be seen that if a[i] <> i, then a[i] is a duplicate.

Code:
vector<int> findDup2(int a[], int len) {
 vector<int> ret;
 
 for(int i = 0; i < len; i++) {
  while(a[a[i]] != a[i]) {
   swap(a[a[i]], a[i]);
  }
 }
 
 for (int i = 0; i<len; i++) {
  if (a[i] != i) ret.push_back(a[i]);
 }
 return ret;
}
Playable version here.

As Caf's comment suggests, this may looks like O(n^2) but it is actually not. Every swap will fix one a[j] and makes a[j] equals to j. If this relation holds already then no swap is needed. Thus at most N-1 swaps are needed to fix all elements. Complexity O(n).


Algorithm 2plus:


From the output we can see an apparent problem. We did find out all numbers that are duplicate. But the output is not quite what we want. If a number is repeated n (n>2) times, it will show up for n-2 times in the output.

j_random_hacker solves this problem perfectly. He still uses the property a[a[i]] = a[i] (Suddenly it reminds of two star programming...). But for any found duplicate number, this property will be used only once. If this number is a duplicate, which means a[i] <> i, we check to see whether a[a[i]] == a[i] still holds. If it holds, we break it to prevent other duplicates to use this property, such that other duplicates will not be able to be recognized as a duplicate. 

But how do we break it? we set a[a[i]] to what? Suppose a[a[i]] = k, we need to be sure that the a[a[i]] itself will not be recognized as a duplicate, which means a[a[a[i]]] = a[k] (this is confusing... three star programming) cannot equals to k.  Simply put, we need to find a k that satisfies a[k]<>k. We already know a[i] <> i, thus k = i.

I know the above paragraph is confusing. Let's work on the above example. Hopefully it will clarify j_random_hack's brilliant idea.

OK, so again we starts with 
since we still want that nice a[a[i]] = a[i] property. But this time, we change the output part to:
vector<int> findDup2(int a[], int len) {
 vector<int> ret;
 
 for(int i = 0; i < len; i++) {
  while(a[a[i]] != a[i]) {
   swap(a[a[i]], a[i]);
  }
 }
 
 for (int i = 0; i<len; i++) {
  if (a[i] != i && a[a[i]] == a[i]) {
   ret.push_back(a[i]);
   a[a[i]] = i;
  }
 }
 return ret;
}

The changes are on line 11 and line 13. Let's apply this output code to the above graphical result.

For index 0, it is a duplicate since a[0] <> 0. Based on Caf's original code, we immediately implies a[a[0]] == a[0] = 1. Let's break this property, and change a[a[0]] to 0. We get:

For index 1, it is the original copy, but now we recognize it also as a duplicate since a[1] = 0 <> 1. But, based on j_random_hack's idea, since a[a[i]] <> a[i], this index used to be a duplicate but we have destroyed its property (a[k] <> k). We will not output it.

For index 2, it used to be a duplicate, but property destroyed just like a[1]. No output.

For index 3 and 4, they are not duplicates at all (a[i] == i), No output.

For index 5, we repeat what we have done for index 0, and get the following result:
Index 5 is recognized as a duplicate, original copy a[4] has been corrupted. Output a[5] = 4.

Playable final version can be found here.


Conclusion:

It all starts with the idea of using index as flag. This is way harder than using sign bit as flag. To use index as flag, we use a[i]'s value as a new index and see how we can maintain loop invariant a[a[i]] = a[i]. The followup algorithm moved even further, it uses both a[j] <> j to indicate a duplicate, and use a[a[i]] == a[i] as a breakable flag to indicate whether we have visited this element or not. (Shivering with awe....)


Followups:


  1. Find missing elements. Total array size n. Elements are 0, 1, ... n-1 except that 2 numbers are missing. These two numbers have been replaced by 2 other numbers in range [0, n-1]. Find these two missing numbers and replacement numbers. (Original array modifiable) Solution.
  2. Find the first non repeated character in a string. Suppose we are not handling unicode characters, just 8-bit ascii code. (Expand it to 256 length string, use the above algorithm2plus to break the original string, third scan to pick out the first a[i]=i character. (O(n), O(256)) Solution)


Monday, September 22, 2014

Longest palindrome substring (Manacher's linear algorithm)

Intro


Before we start, I strongly recommend you read this post from Professor Johan Jeuring. It details a very important palindrome property: mirror property, aka "palindrome in palindrome". Once you read it you will understand the following algorithm better. Be sure to comment if you find my explanation of Manacher's algorithm does not make sense at all.

References: Dr. Jeuring used Haskell to describe his idea, which make it hard for me to understand. But if you are a Haskell guru, check out his post. If you prefer python, here is another awesome tutorial. The idea in this post comes from felix's posts (Chinese, English), which use C to describe Manacher's algorithm. Here I'm demonstrating felix's idea in my way using C++ with step by step walkthrough as detailed as possible.

Question: Given a string s, find its longest palindrome substring (not subsequence)


Step1: Preprocess


In Manacher's algorithm, palindrome is not defined by starting and ending indices, instead, palindrome center location and palindrome "radius" is used to represent a palindrome. 

For example, given string "aba", longest palindrome can be represented by center location at 'b' with a radius 2 (counting 'b' itself). Given string "abba", longest palindrome can be represented by center location between two 'b's with radius 2.

You can easily see when the longest palindrome has even length, it would be hard for us to represent its center by using character index. Thus what we would like to do is add placeholders between each character pair. We call the resulting string padded string of original string.

Sidenote: If you have read felix's post, because felix was handling standard C string with a '\0' ending char, felix added another placeholder at the beginning of original string. We are using C++ std::string, thus adding placeholders between character pairs is good enough.

Use felix's example string "12212321", we construct the padded string as:
where '#' represents a placeholder.

Now we can redefine the representation of a palindrome substring. I can immediate see one quite long palindrome from index 0 to index 8. For this palindrome substring "#1#2#2#1#", it is centered at index 4 with a radius of 5 (#1#2 # 2#1#). We can define an array representing the radius of the longest palindrome substring centered at current location, I will call it radius array. To be consistent with felix's post, I will use P[i] to represent it. Notice each character itself can be called a palindrome substring, so radius can be at least 1.

If we manually construct this radius array, we will have: 
I'll verify one of this P[i]. Take i = 7. The longest palindrome substring with s[i] = '1' as its center is s[4, 10] = '#2#1#2#'. Radius (1 #2#) has length 4.

Now it is easy to see when P[i] gets maximum value, we find the longest palindrome substring in the padded string.

You might see constructing P[i] is not hard at all, we simple visit each index and expand P[i] from 1 to the longest it can get by comparing s[i+P[i]] and s[i-P[i]]. This is a feasible way to solve this problem but that will have a time complexity of O(n^2). Manacher's algorithm can do better. But the idea is the same. Manacher's algorithm provides a better way to construct P[i] and reduces time complexity to O[n].


Step2: Iteration


Now let's show how Manacher constructs P[i].

Two variables are defined:
ind: where the latest longest palindrome substring centers at.
mx: the index of the character that is immediately after current longest palindrome substring.

Take string "ababa" for example. Scan from left to right. If we are standing at the first 'b', we can see current ind is 1 (longest palindrome substring at index 0 is 'a' itself), how about mx? current longest palindrome substring is aba with length 3, mx is the character immediately after "aba", which is the second 'b'. Now mx = 3.

At each index, three operations will be performed.

2.1 Precalculate P[i]. 

Here is when we will use palindrome substring's mirror property. I will assume you have already read Professor Jeuring's post. But just to refresh your memory, I'll repeat it based on my interpretation:

For a palindrome string residing in a longer palindrome string, its length must be the same as its mirror palindrome string's length. This mirror palindrome string is defined with respect to longer palindrome string's center.

Formularize this statement, we have multiple cases:

2.1.1. Current character does not resides in the latest palindrome substring.
Case: "abacde" 

Scan from left to right, when we are at index 3, we find our latest longest palindrome substring is s[0-2] = "aba". Index 3 is not part of it. Thus index 3 will have P[i] = 1. But wait! "aca" is also palindrome! Calm down. Title 2.1 says precalculate P[i]. We will update P[3] to correct value soon.

Codify case 2.1.1, using our predefined variables ind and mx, we can have:
if (i >= mx) P[i] = 1;
Just fyi, I'll repeat the definition of mx again: the index of the character that is immediately after current longest palindrome substring.

2.1.2 Current character reside in the latest palindrome substring.
Case: "xabaxabayz"
Scan from left to right, when we are at index 6, we find our latest longest palindrome substring is s[1-7] = "abaxaba" with center at 4. Now ind = 4, mx = 8.

Index 6's mirror index about axis 4 is 2*4-6 = 2. It has P[2] = 3. So we set P[6] to 3 based on mirror property. Wait, that is not correct! P[6] should be 2 just by observation! Why? because longest palindrome centered at 2, which is "xabax", does not totally reside in the green area. But, here comes the fun part, its "sub-palindrome" "aba" totally reside in the green area. Thus in this case, instead of using P[2*ind - i], we will use the shorter yet safer version "aba", which has radius (mx-i).

The above case is an extreme case. Sometimes we are not that unlucky and the mirror index will have its longest palindrome substring totally reside in the green area. So considering both safety and efficiency, we will use:
if (i < mx) {
	int mirrorPi = p[2 * ind - i];
	int safePi = mx - i;
	p[i] = mirrorPi < safePi ? mirrorPi : safePi;
}
(Here when mirrorPi is less than (mx-i), mirroPi is safer. I just use saferPi as a variable name that is easier to understand.)

2.2 Expand P[i]. 


Once we get over 2.1, the remaining steps are easier to understand. In the last section we were cautious, we are trying to find the safest P[i] to start with. Next we will expand P[i] as large as possible to make sure we don't miss anything.
while (i + p[i] < s.size() && i - p[i] >= 0 \
       && s[i + p[i]] == s[i - p[i]])
	p[i]++;
We can still use 2.1.1's case as an example. We will expand index 3's P[i] to 2 since "aca" is the longest parlindrome string centered at index 3.


2.3 Update mx and ind. 


Once we finish expansion at index i, we have found out the longest possible palindrome substring centered at i. We might find a longer palindrome substring then the current longest palindrome substring. Since we will keep using mx and ind in step 2.1, proper maintenance is needed to make sure we always have mx and ind representing the property of current longest palindrome substring. This is done by verifying mx is still the largest mx.
if (i + p[i] > mx) {
	mx = i + p[i];
	ind = i;
}

Step3: Cleanup


Since padding characters have been added to the original string, clean up work is need to restore final result without padded placeholders. In the following code, core concept is demonstrated. Corner cases such as empty string has not been handled. Cleanup has not be realized either. (Lazy OP).

pair<int int> manacher(string s) {
	// assuming string is in the format of #s[0]#s[1]#...#s[n]# 
	// format with length larger than 3 (original unpadded string
	// has more than one char).

	// preparation phase
	int* p = new int[s.size()]();
	p[0] = 1; p[1] = 2;
	int mx = 3;	// mx: store the index immediate after current
	            //longest palindrome
	int ind = 1; // ind: store current longest palindrome
	             //center index.

	for (int i = 2; i < s.size(); i++) {
		// step1: precalculate p[i]
		if (i < mx) {
			int mirrorPi = p[2 * ind - i];
			int safePi = mx - i;
			p[i] = mirrorPi < safePi ? mirrorPi : safePi;
		}
		else {
			p[i] = 1;
		}

		// step2: expand
		while (i + p[i] < s.size() \
			   && i - p[i] >= 0 \
			   && s[i + p[i]] == s[i - p[i]])
			p[i]++;

		// step3: update mx and ind
		if (i + p[i] > mx) {
			mx = i + p[i];
			ind = i;
		}
	}

	return make_pair(p[ind], ind);
}

You can test this code here.

Monday, September 15, 2014

Word break II

Reference link: https://oj.leetcode.com/problems/word-break-ii/

Question:
Given a string s and a dictionary of words dict, add spaces in s to construct a sentence where each word is a valid dictionary word.
Return all such possible sentences.
For example, given
s = "catsanddogs",
dict = ["cat", "cats", "and", "sand", "dog", "dogs"].

A solution is ["cats and dogs", "cat sand dogs"].

My thoughts:
A tough question for me. Dynamic programming is definitely necessary. But i will start from my brute force method.
To brute force this question is easy. DFS is needed. For the target array, construct a stack (to simulate DFS) to store last valid partition. Use a index to check stk.top() to index substring.

  • If the substring is valid in the dictionary:
    • If index at string end: A successful partition has been found. push current index to stack, output current partition combination. Pop top, set index to current top+1.
    • Else: Push index to stack, index++.
  • Else:
    • If index at string end: Set current index to stk.top()+1. Pop top.
    • Else: Increment index.
A illustration of this brute force process is attached (only key steps):
Here st is the stack, gray rows are successful rows, orange blocks are current index block.
As one can easily see, substring "DOGS" has been visited at least twice in this process because I did not memoize the results of "CATSAND". For a long input string, brute force method takes longer than a minute to finish. But, if I already have the result of "CATSAND", I can simply add "DOGS" string to all possible solutions of "CATSAND" and voila.

Now the recursive structure is clear. For a string S with length L, if I want to get all combinations of S(-1, L-1], I can check S(-1, sep] since this can be memoized just like what we see in the above figure. If S(-1, sep] can be represented by a series of dict elements, then we can forward to check S(sep, L-1] to see if it is in the dictionary. If both are true, we get a hit. To speed this process up, we can check S(sep, L-1] first since this part does not involve recursive steps.

An illustration of the above process with speed up is:

Notice here I do not have to memoize S(sep, L-1]. This is because if (sep, L-1] can be divided to (sep, sep2] + (sep2, L-1], then if I set sep to sep2, this case can collapse to case (-1, sep2] + (sep2, L-1] case. 

Code is as follow:
unordered_map<int, vector<string>> um;

vector<string> helper(string& s, int i, unordered_set<string> &dict) {
    if(um[i].size()) return um[i];
    vector<string> ret;
    if(i == 0) {
        if(dict.count(s.substr(0, 1))) ret.push_back(s.substr(0, 1) + " ");
        return ret;
    }
    if(dict.count(s.substr(0, i+1))) ret.push_back(s.substr(0, i+1) + " ");
    for(int sep = 1; sep <= i; sep++) {
        if(!dict.count(s.substr(sep, i-sep+1))) continue;
        auto tmp = helper(s, sep-1, dict);
        if(!tmp.size()) continue;
        for(auto &e:tmp) e += s.substr(sep, i-sep+1) + " ";
        ret.insert(ret.end(), tmp.begin(), tmp.end());
    }
    um[i] = ret;
    return ret;
}

vector<string> wordBreak(string s, unordered_set<string> &dict) {
    if(!s.size()) return *(new vector<string>);
    auto ret = helper(s, s.size()-1, dict);
    for(auto &e:ret) {
        if(e.size())e.pop_back();
    }
    return ret;
}

Friday, September 12, 2014

Linked list cycle II (Update)

https://oj.leetcode.com/problems/linked-list-cycle-ii/

This question looks more like a mathematical competition question than a algorithm problem. Discrete math is not in my favorite toolset so figuring out a good way to solve this problem took me a long time.

Algorithm worse than O(n) is trival. Solving it in O(n) time is when it get interesting. The question itself can be divided into two parts:
  1. Detect whether there is a cycle in the singly linked list.
  2. If there is a cycle, where does this cycle starts.
To answer question 1, think of the turtle and the rabbit, if the rabbit is fast enough, and there is loop in the linked list, rabbit can revisit turtle again. 

The second question is tricky. Suppose in question 1, turtle has speed of 1 and rabbit has speed of 2. When turtle and rabbit meet each other, turtle has visited k node, rabbit has visited 2k node. Suppose the existing cycle has perimeter of r, and rabbit has run n more loops than turtle, we can come up with an easy equation:

2k-k=n*r

Suppose the linked list has the following structure:

where non-cycle part has length s, turtle path k=s+m . Now we can come up with this:

                          2n*r = s+m+n*r
=>                        n*r = s+m
=>                        (s+m)%r = 0

Now it gets tricky. You can easily find n is not a number that we can calculate.
Other posts claim we can set n = 1, so we can have s = r-m, which means if a turtle starts at the original starting point, another turtle starts at the rendezvous point, when they meet, you will find the starting point of the circle.

I think we can think it like this, If we let the rabbit run one more loop, this relative position will not change. so now it is easy to see r = m+s based on:

[(n+1)*r} % r = (s+m)%r = 0

Code:
/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     ListNode *next;
 *     ListNode(int x) : val(x), next(NULL) {}
 * };
 */
class Solution {
public:
    ListNode *detectCycle(ListNode *head) {
        // special cases
        if(!head || !head->next) return NULL;
        if(!head->next->next) return NULL;
        if(head->next->next == head) return head;
        // now linked list has at least 3 nodes
        
        // ready? set
        int tortoiseTravelled = 1;
        ListNode* tortoise = head->next;
        ListNode* rabbit = head->next->next;
        
        // go!
        while(rabbit && rabbit->next && rabbit!=tortoise) {
            rabbit = rabbit->next->next;
            tortoise = tortoise->next;
            ++tortoiseTravelled;
        }
        
        if(!rabbit || !rabbit->next) return NULL;
        // now we found a cycle. Starting from current location,
        // let's find the starting point of loop.
        ListNode* surveyer = rabbit;
        
        ListNode* surveyer2 = head;
        while(surveyer != surveyer2) {
            surveyer2 = surveyer2->next;
            surveyer = surveyer->next;
        }
        
        return surveyer;
    }
};

Followup (From link):

The followup questions extend one linked list to two linked lists. Questions are:

  1. How to determine whether two acyclic singly linked lists have crossed section.
  2. If we don't know whether those two singly linked lists are acyclic or not, how to determine whether they have crossed section.
  3. In case 2, if those two linked lists have crossed section, how to find the starting node of the crossed section.
For question 1: 

Think about crossed section. Singly linked list can only have crossed section like this:

It is impossible for list 1 and list 2 to separate after they cross each other. Thus:

  1. A natural way to solve question 1 is to check whether list 1's ending node is the same as list 2's ending node.
  2. Or connect list 1's starting node to list 2's ending node, manually constructing a loop if list 1 and list 2 cross each other. If a loop is constructed then list 1 and list 2 do have cross section.

For question 2. 

The difference between question 1 and 2 is whether linked list 1 or 2 has loop or not. Divide this into 3(2) cases.

  1. List 1 has loop but list 2 does not have loop. 
  2. List 1 and 2 all have loop.
Bear in mind that we are handling a singly linked list. Having loop in only one list means these two lists cannot have cross section. If a cross section is found, then they will all be trapped in the loop. Case 1 means list 1 and 2 does not have cross section. Check the previous section for "checking whether a linked list has loop".
For case 2, we only need to make sure they have same loop. Find a loop node in list 1 as A and loop node in list 2 as B, check whether we can reach B from A. Problem solved.


For question 3. 

This one is tricky. Since we don't know whether they have loop or not. We only know they have cross section. Divide this into 2 cases.
  1. They don't have loop.
  2. They all have loops.
Case 1 can be solved just as what we did in linked list cycle II. We manually construct a loop and find the starting point of the loop. Problem solved.

Case 2 Type 1. When lists enter the loop, they have the same entry point.


Since we already know a loop exists for both list 1 and list 2, and all the nodes after rendezvous point pt are common nodes, if we can figure out list length of list 1 and list 2, we can easily get to rendezvous point using two pointers from list 1's head and list 2's head.

Suppose list 1 is longer with length(list1) and list 2 is shorter with length(list2), if list 1 pointer starts first (to compensate list length difference), after list 1 pointer travels length(list1) - length(list2) nodes, start another pointer from list 2, when these two pointer converge, we find point pt. This strategy can be applied to case 1 too.

Now the question goes back to how to determine list length given the list has a loop. Back to the first figure, we want to know s+(r-m)+m = s+r. s is what we calculated in the original leetcode problem, r can be achieved during s calculation. Problem solved.

Case 2 Type 2. When lists enter the loop, they have different entry points.
Why this case is different from Case 2 Type 1? Given linked list 1: {a,b,c,d,b,c,d,...} and {e,c,d,b,c,d,b,...}, they all have length 4. Use case 2 type 1 strategy we will get caught in a dead loop. Why this happens? Case 2 Type 1 will work only if from rendezvous point pt to entry point e exists for both list 1 and list 2. Thus when length difference is compensated they will travel in the loop side by side. For type 2, without pt->e part, they will never rendezvous, causing a big difference.

The strategy is still simple. We only need to first confirm we are in this case 2 type 2, then find out list 1's entry point e1, list 2's entry point e2. Problem solved.