The Challenge
So here's the puzzle you're provided: Given a string pattern tell if another given string matches the pattern. The character "." is a wild card that can only match one character and a "*" is a wild card that can match as many characters as possible. Nothing matches a missing character.
For the pattern "a.c*f", for example, "abcdef" would match as would "azcrwgfdjkgfdkjhdfjkfhf" but "bbcdef" would not.
How would you go about solving this problem? When I first set out to solve this problem I got horribly stuck. I tried to use just a for loop and then got totally lost in that the possible permutations branched all over the place. It seemed nigh intractable (for me of course since I knew this was solvable since regex does even more coolness than this).
Side note: That is a great way to tell what computer science challenges deserve your attention most. If you literally can't even come close to tackling a problem understanding how to finally conquer it will make you a better developer by an order of magnitude each problem IMO.
Think about this for a bit cuz up next is the solution.
A Speedy Savior
So first off, one solution would be to create a string for every possible permutation of the pattern (up to the test string's length) and check if one of those patterns matches your test string. There are two issues with this approach: A) It takes up a lot of space and B) it would take a LOT of time.
To prove it would take a lot of time consider that our solution would essentially need to consider every letter in the place of the wild card. That's 26 additional runs PER single wild card... in the case of the asterisk that could be 26^N where N is the number of letters your wild card might take up. In the larger example above, with that technique, there'd be at least 766,467,265,200,361,890,474,622,976 permutations just to the wild card alone.
Thankfully as it turns out there's a better way: Dynamic programming! Ok so what is dynamic programming and why does it apply here? Basically, dynamic programming can be helpful whenever you have a problem that breaks down into individual subproblems that can be combined together to form a complete answer. Once you've solved all of the individual subproblems you start from the finishing line and make your program assume that you found your optimal solution and try to work out what the previous step must've been given that the current step is optimal.
What some of you might be thinking is... what if my string doesn't match? Assuming correctness seems wrong in that case huh? You definitely don't want to return a false positive but that won't happen. What will happen is that if you're program isn't able to match the strings, then the algorithm will be left incomplete and will never reach the block of code that says to return true. Essentially we only say there is a match if the algorithm starts at our finish line and ends up at the starting line we specify. If that path is broken, then our algorithm will return false.
Another benefit of this method is that the worst case runtime is O(mn) or for you non-computer science types, the number of steps the algorithm will take is approximately equal to the length of the pattern string multiplied by the length of the test string (the string we're matching against). In our large sample above that would be about 115 steps to figure out the solution (at the worst)!
Forming a Recurring Subsequence Problem
This is really the hardest part. You need to think about the problem from several different angles until you can see it as a sum of several smaller problems. Once you see this subsolution though the programming just pops out at you.
In our example with the strings, I had been thinking about how I could break this up for the better of the day in the back of head during idle cycles. Finally when I sat down to draw out some ideas (I'm very visual) I realized that I could create a table with the pattern string forming the columns and the test string forming the rows. Once that was done I marked the cells that matched with a T (for True) and an F if they didn't. I saw that in cases where I expected a match, I could traverse the table from the upper left corner to the lower right by traversing cells diagonally (or vertically in the case of an asterisk or a period).
I know this is hard to visualize so I'm going to *try* to give a visual here... bare with me. :)
1 is True and 0 is False BTW...
Pattern: ".a*.j*"
Test string: "cadeajmn"
. | a | * | . | j | * | |
c | 1 | 0 | 1 | 1 | 0 | 1 |
a | 1 | 1 | 1 | 1 | 0 | 1 |
d | 1 | 0 | 1 | 1 | 0 | 1 |
e | 1 | 0 | 1 | 1 | 0 | 1 |
a | 1 | 0 | 1 | 1 | 0 | 1 |
j | 1 | 0 | 1 | 1 | 1 | 1 |
m | 1 | 0 | 1 | 1 | 0 | 1 |
n | 1 | 0 | 1 | 1 | 0 | 1 |
That is the test that made me have my "a-ha" moment. So now looking at this primitative visualization let's see how our pattern matching rules show up here. First notice that the asterisks and periods have their entire columns set to true. The reason for this is that they can stand in for any letter so given appropriate positioning they **could** be true in any of those instances. I got a little tripped up here. I thought that this matrix should represent the fact that my period can only represent one letter but that turned out to be unnecessarily complicated during this phase. The 2nd phase, as you'll see, makes that much easier.
From here see if you can start at the finish line though and work your way back to the start. Now while it *is* possible in this table that you could move from cell to cell horizontally, our phase 2 rules won't allow for this. If we did allow this that would mean that two pattern characters could match to one test character and hopefully you can see why that makes no sense in our scenario. Also remember that you can only traverse cells vertically if you're in an asterisk column (since that means that that single pattern character matched multiple test strings and that's the only pattern character capable of doing that by definition).
Note that we know if our traversal was successful because at some point, on some path, we will arrive at cell (0,0). If we miss the starting line and go off the table, then we don't have a match.
These last two paragraphs we've basically listed out the rules for our recursive algorithm. More clearly, they are as follows:
Given a cell (i, j) where i is the column, j is the row
- if we reach cell (0,0) AND this cell's value is true we have a match
- if we reach a cell where either i or j are negative we have no match (for this path, regardless of the value)
- if none of the above but the current cell is true then test cell (i-1,j-1) (the cell to the upper left)
- if that doesn't have a complete path and we are in an asterisk column try the row above
Here are those same rules but in code:
private static bool _HasCompletePath(IList<List<bool>> matrix, int patternIndex, int stringIndex, string pattern){ var result = false; if(patternIndex < 0 || stringIndex < 0) return false; if (patternIndex == 0 && stringIndex == 0 && matrix[patternIndex][stringIndex]) { result = true; } else if(matrix[patternIndex][stringIndex]) { var tempResult = _HasCompletePath(matrix, patternIndex - 1, stringIndex - 1, pattern); if(!tempResult && _HasCompletePath(matrix, patternIndex, stringIndex-1, pattern) && pattern[patternIndex] == '*') { tempResult = true; } result |= tempResult; } return result;}
Wrapping up
I don't expect this to make perfect sense to someone who has never written a dynamic programming algorithm in their life. My main hope is that by me sharing the insights I gained while I'm still new to this that this will help others to learn this stuff a little easier. If nothing else, it should at least provide a new non-formalized non-mathematical perspective (which is REALLY difficult to find on the web).
Full source code is available here: http://github.com/jcbozonier/Dynamic-Programming-Sample/tree/master