Monday, September 22, 2014

Longest palindrome substring (Manacher's linear algorithm)

Intro


Before we start, I strongly recommend you read this post from Professor Johan Jeuring. It details a very important palindrome property: mirror property, aka "palindrome in palindrome". Once you read it you will understand the following algorithm better. Be sure to comment if you find my explanation of Manacher's algorithm does not make sense at all.

References: Dr. Jeuring used Haskell to describe his idea, which make it hard for me to understand. But if you are a Haskell guru, check out his post. If you prefer python, here is another awesome tutorial. The idea in this post comes from felix's posts (Chinese, English), which use C to describe Manacher's algorithm. Here I'm demonstrating felix's idea in my way using C++ with step by step walkthrough as detailed as possible.

Question: Given a string s, find its longest palindrome substring (not subsequence)


Step1: Preprocess


In Manacher's algorithm, palindrome is not defined by starting and ending indices, instead, palindrome center location and palindrome "radius" is used to represent a palindrome. 

For example, given string "aba", longest palindrome can be represented by center location at 'b' with a radius 2 (counting 'b' itself). Given string "abba", longest palindrome can be represented by center location between two 'b's with radius 2.

You can easily see when the longest palindrome has even length, it would be hard for us to represent its center by using character index. Thus what we would like to do is add placeholders between each character pair. We call the resulting string padded string of original string.

Sidenote: If you have read felix's post, because felix was handling standard C string with a '\0' ending char, felix added another placeholder at the beginning of original string. We are using C++ std::string, thus adding placeholders between character pairs is good enough.

Use felix's example string "12212321", we construct the padded string as:
where '#' represents a placeholder.

Now we can redefine the representation of a palindrome substring. I can immediate see one quite long palindrome from index 0 to index 8. For this palindrome substring "#1#2#2#1#", it is centered at index 4 with a radius of 5 (#1#2 # 2#1#). We can define an array representing the radius of the longest palindrome substring centered at current location, I will call it radius array. To be consistent with felix's post, I will use P[i] to represent it. Notice each character itself can be called a palindrome substring, so radius can be at least 1.

If we manually construct this radius array, we will have: 
I'll verify one of this P[i]. Take i = 7. The longest palindrome substring with s[i] = '1' as its center is s[4, 10] = '#2#1#2#'. Radius (1 #2#) has length 4.

Now it is easy to see when P[i] gets maximum value, we find the longest palindrome substring in the padded string.

You might see constructing P[i] is not hard at all, we simple visit each index and expand P[i] from 1 to the longest it can get by comparing s[i+P[i]] and s[i-P[i]]. This is a feasible way to solve this problem but that will have a time complexity of O(n^2). Manacher's algorithm can do better. But the idea is the same. Manacher's algorithm provides a better way to construct P[i] and reduces time complexity to O[n].


Step2: Iteration


Now let's show how Manacher constructs P[i].

Two variables are defined:
ind: where the latest longest palindrome substring centers at.
mx: the index of the character that is immediately after current longest palindrome substring.

Take string "ababa" for example. Scan from left to right. If we are standing at the first 'b', we can see current ind is 1 (longest palindrome substring at index 0 is 'a' itself), how about mx? current longest palindrome substring is aba with length 3, mx is the character immediately after "aba", which is the second 'b'. Now mx = 3.

At each index, three operations will be performed.

2.1 Precalculate P[i]. 

Here is when we will use palindrome substring's mirror property. I will assume you have already read Professor Jeuring's post. But just to refresh your memory, I'll repeat it based on my interpretation:

For a palindrome string residing in a longer palindrome string, its length must be the same as its mirror palindrome string's length. This mirror palindrome string is defined with respect to longer palindrome string's center.

Formularize this statement, we have multiple cases:

2.1.1. Current character does not resides in the latest palindrome substring.
Case: "abacde" 

Scan from left to right, when we are at index 3, we find our latest longest palindrome substring is s[0-2] = "aba". Index 3 is not part of it. Thus index 3 will have P[i] = 1. But wait! "aca" is also palindrome! Calm down. Title 2.1 says precalculate P[i]. We will update P[3] to correct value soon.

Codify case 2.1.1, using our predefined variables ind and mx, we can have:
if (i >= mx) P[i] = 1;
Just fyi, I'll repeat the definition of mx again: the index of the character that is immediately after current longest palindrome substring.

2.1.2 Current character reside in the latest palindrome substring.
Case: "xabaxabayz"
Scan from left to right, when we are at index 6, we find our latest longest palindrome substring is s[1-7] = "abaxaba" with center at 4. Now ind = 4, mx = 8.

Index 6's mirror index about axis 4 is 2*4-6 = 2. It has P[2] = 3. So we set P[6] to 3 based on mirror property. Wait, that is not correct! P[6] should be 2 just by observation! Why? because longest palindrome centered at 2, which is "xabax", does not totally reside in the green area. But, here comes the fun part, its "sub-palindrome" "aba" totally reside in the green area. Thus in this case, instead of using P[2*ind - i], we will use the shorter yet safer version "aba", which has radius (mx-i).

The above case is an extreme case. Sometimes we are not that unlucky and the mirror index will have its longest palindrome substring totally reside in the green area. So considering both safety and efficiency, we will use:
if (i < mx) {
	int mirrorPi = p[2 * ind - i];
	int safePi = mx - i;
	p[i] = mirrorPi < safePi ? mirrorPi : safePi;
}
(Here when mirrorPi is less than (mx-i), mirroPi is safer. I just use saferPi as a variable name that is easier to understand.)

2.2 Expand P[i]. 


Once we get over 2.1, the remaining steps are easier to understand. In the last section we were cautious, we are trying to find the safest P[i] to start with. Next we will expand P[i] as large as possible to make sure we don't miss anything.
while (i + p[i] < s.size() && i - p[i] >= 0 \
       && s[i + p[i]] == s[i - p[i]])
	p[i]++;
We can still use 2.1.1's case as an example. We will expand index 3's P[i] to 2 since "aca" is the longest parlindrome string centered at index 3.


2.3 Update mx and ind. 


Once we finish expansion at index i, we have found out the longest possible palindrome substring centered at i. We might find a longer palindrome substring then the current longest palindrome substring. Since we will keep using mx and ind in step 2.1, proper maintenance is needed to make sure we always have mx and ind representing the property of current longest palindrome substring. This is done by verifying mx is still the largest mx.
if (i + p[i] > mx) {
	mx = i + p[i];
	ind = i;
}

Step3: Cleanup


Since padding characters have been added to the original string, clean up work is need to restore final result without padded placeholders. In the following code, core concept is demonstrated. Corner cases such as empty string has not been handled. Cleanup has not be realized either. (Lazy OP).

pair<int int> manacher(string s) {
	// assuming string is in the format of #s[0]#s[1]#...#s[n]# 
	// format with length larger than 3 (original unpadded string
	// has more than one char).

	// preparation phase
	int* p = new int[s.size()]();
	p[0] = 1; p[1] = 2;
	int mx = 3;	// mx: store the index immediate after current
	            //longest palindrome
	int ind = 1; // ind: store current longest palindrome
	             //center index.

	for (int i = 2; i < s.size(); i++) {
		// step1: precalculate p[i]
		if (i < mx) {
			int mirrorPi = p[2 * ind - i];
			int safePi = mx - i;
			p[i] = mirrorPi < safePi ? mirrorPi : safePi;
		}
		else {
			p[i] = 1;
		}

		// step2: expand
		while (i + p[i] < s.size() \
			   && i - p[i] >= 0 \
			   && s[i + p[i]] == s[i - p[i]])
			p[i]++;

		// step3: update mx and ind
		if (i + p[i] > mx) {
			mx = i + p[i];
			ind = i;
		}
	}

	return make_pair(p[ind], ind);
}

You can test this code here.

1 comment:

  1. So I tried Manacher's algorithm on this question:
    https://oj.leetcode.com/problems/palindrome-partitioning-ii/ and successfully reduced overall complexity from O(N^3) to O(N^2) with space complexity O(3N). Even though this oj question has a better solution but it serves as a good practice of Manacher's algorithm. Code: http://ideone.com/jzewOk. AC time about 130ms. (Optimal solution takes about 50ms)

    ReplyDelete