Thursday, October 9, 2014

Distinct Subsequence

Leetcode link.

Q: Given a string S and a string T, count the number of distinct subsequences of T in S.
Example: S = "rabbbit", T = "rabbit". Return 3.

This question itself is already hard to understand. I failed in creating a DP formula for it (fired). But with KMP in mind, I come up with another method that use finite state machine to do this problem. Complexity(O(m*n), O(n)), where m and n are length of S and T.

Other guy's method (DP based):


Use num[i][j] to save the number of distinct subsequences of T(0, j) in S(0, i).
Then

  1. if T[j] != S[i], Formula num[i][j] = num[i-1][j].
  2. else we have two options for the new char S[i], either we take it (num[i-1][j-1]) or we don't take it (num[i-1][j]). Formula num[i][j] = num[i-1][j-1] + num[i-1][j].
Space optimization technique can be used just like what I did in this post. Final complexity (O(m*n), O(n)). The following code is borrowed from this post(Chinese).

int numDistinct(string S, string T) {  
    // Start typing your C/C++ solution below  
    // DO NOT write int main() function  
    int match[200];  
    if(S.size() < T.size()) return 0;  
    match[0] = 1;  
    for(int i=1; i <= T.size(); i++)  
        match[i] = 0;  
    for(int i=1; i<= S.size(); i ++)  
        for(int j =T.size(); j>=1; j--)  
            if(S[i-1] == T[j-1])  
                match[j]+= match[j-1];  
    return match[T.size()];  
}  


My state based method:


So each char in the T string is treated as a state. This state thing basically means at current state, if  I received the char I want from S, I can either take this char and progress to the next state, or I do not take this char. Example:

For target T = "rabbit" . I have state 0 to 5 standing for current matching state. If my current state is at 4, which is character "i", and I receive a char "t" from S, I can either progress to the next state "t", or I disregard this "t" and wait for the next "t" to appear. Basically similar to the DP formula:  num[i][j] = num[i-1][j-1] + num[i-1][j].

I construct a state count array ct of size T.size() and initialized as all 0 except for the first state ct[0] = 1.  This means at character i, how many states are waiting for the same T[i] in string S to appear. This seems hard to explain. Let me draw some figures.



So now, 1 state is waiting for character "r" from string S. No one is waiting for "a" to "t".

Send a "r" from string S.

This "r" can either be consumed or not. If consumed, we will wait for an "a" then, if not, we will still wait for a "r":

Send an "a" from string S.

First we consider the one waiting for an "a", as last time, it can choose to consume this "a" or not consume this "a". As to the one waiting for an "r", it will not progress. (Similar to DP formula num[i][j] = num[i-1][j]).


Send a "b" from string S.

This is the same as last time. I will just skip it and only paste the drawing.


Send another "b" from string S.

Now this is the fun part, I now have state 2 and state 3 all waiting for this coming "b". For state 2, if I consume this "b", it will be converted to state 3. But this newly converted state 3 cannot consume "b" again since I only received one "b" from S. On the other hand, those already sitting at state 3 can consume this "b" and get converted to state 4. Thus I will do state 3 first, and then do state 2.
For state 3 waiting for "b":
 (Green arrow means consume this "b", red means do not consume this "b"):

Then state 2 waiting for "b":
This is just like memory optimization, from last state to the fist state instead of first state to the last.
Final ct after this step:

Send the third "b" from string S:
Those at state 4 waiting for "i", at state 1 waiting for "a", at state 0 waiting for "r" will not be affected. Use similar analysis method, one can come up with this final ct array:

Send an "i" from string S:
Same as send "a" or send "r". Final ct array:

Send a "t" from string S:
Now those state waiting for "t", if they decide to consume this "t", then we will have 3 subsequences that matches the requirement. If they decide not to consume this "t", we leave it in the ct array.
Final ct array should be exactly the same as the above one, while all the "overflow" states get captured in a return variable.

Now we are at the end of string S. Loop ends.

Code looks so similar to DP version. (in fact, you can say I am using DP since this ct array is basically the memoization step).

int numDistinct(string S, string T) {
    //corner cases
    if (!S.size() || !T.size() || S.size() < T.size()) return 0;

    //state counter array
    int* ct = new int[T.size()]();
    ct[0] = 1;
    int ret = 0;
    for (int i = 0; i<S.size(); i++)
        for (int j = T.size() - 1; j >= 0; j--)
            if (S[i] == T[j])
                if (j == T.size() - 1) ret += ct[j];
                else ct[j + 1] += ct[j];
    
    return ret;
}

Playable version can be found here.


No comments:

Post a Comment