Coding Quickie- Classification Tree Learning

Have you ever seen a large table of data and thought, "Somebody just vomited numbers on my monitor?" Some times it can be very hard to make sense out of tables like this and even more so when the tables are large. There is at least one machine learning technique that will help you make sense of the data: Classification Trees.

Classification trees work by pouring over your data column by column and then hierarchically grouping your data by the unique values in each column. As an example imagine if you had a table of data like this:

In code I write it like this:

How would you organize the data to make it more understandable?
Note that this example is a bit impractical because the table is pretty small but the mechanics will work the same. One caveat: the algorithm is working off of already enumerated values in a table. There's no continuous values like 8.132, or names, or IDs. In trying to use this in a real world scenario I'm finding myself preprocessing my data to create the buckets so I can run this algorithm.

The classification algorithm I learned tonight starts by looking at each column of data and then deciding which seems like it will give us the most information gain. Once it determines that it partitions the table and carries on recursively until we've come up with a full classification tree. You might be wondering how exactly does the algorithm determine which split will give us the most information. That's thanks to a calculation known as Shannon Entropy. It's outside the scope of this post but feel free to read more about it here.

Here's the code I ported from Python to Ruby:

With the algorithm I created, the following hierarchy is created:

This is one of the few problems that very neatly fits a recursive solution. Why this one? Well a couple reasons:
  1. You'll notice we're building a tree from this hierarchy. Trees and many of the solutions that involve them are self similar meaning, whatever you do at one level of the tree you'll need to do at every level. In this case, that would be partitioning our dataset into many sub-datasets.
  2. The recursion depth is fairly well limited by the context of the problem. You'll only ever run into it on datasets with columns that number in the thousands. If that's you, you'll need to unwind the recursion into a stack based design. Have fun! :)
This code is pretty tacky for Ruby but I'm posting it anyway. Suggest how I can clean it up!

To give credit where it's due I sat down with Machine Learning in Action tonight and while I didn't copy the algorithm word for word, it was extremely helpful. This is one of the few books on an advanced topic that seems to able to convey the author's knowledge very well. Some of the more experienced among you might notice that the way I handled evaluating the right way to slice the data is different from the standard way (at least when compared to my book). I'm ok with that but I might clean it up and straighten it out if the mood strikes me.

That's it! It's been a while since I blogged and this was the first thing at the top of my mind.

Using Graphs to Build Your Own Ruby Pattern Matcher

There's More Than One Way To Skin a Rat!

A while back I posted an article on using dynamic programming to implement a string matching algorithm I cobbled together. That article is here: http://justinbozonier.posterous.com/string-pattern-matching-welcome-to-dynamic-pr

Tonight, I'm writing about solving that same problem but by using a graph. It was a solution suggested by a few people and then finally a couple of days ago my boss (Kelly Leahy) brought up the problem once more as I was asking for suggestions of CS concepts he thinks I should tackle and we got back on the subject of that pattern matcher. If it was mentioned so much in the past why did I jump on it now? Quite simply, I've been learning Ruby lately at the insistence of James Thigpen. He's been ranting and raving (on top of many others) so I thought this would be a great first project.

Tools and Techniques Used

I tried my best to stick to pure TDD and unit testing with a context/spec naming style and test structure using RSpec. For my IDE I tried using RadRails, RubyMine, and NetBeans. Ultimately I chose NetBeans. For some reason it turned out to be the easiest for me. **shrug**

This is my very first Ruby project ever and I typically do C# development so if you see some ugly .NET-isms... please flame away so I might learn ;D

Stating the Problem

The problem to solve is given a pattern such as "a.c*z" we need a way to determine if a string like "abcdefghijklmnopqrstuvwxyz" matches it (it does) and that a string like "afdgkjbdkfdstg" does not. Our pattern matching grammar will have only three basic elements. A literal element that matches a character in a string exactly (ex. a, b, c, ..., x, y, z, 0-9, etc), a match any element which matches any single character (a period ie. "."), and a match all element (an asterisk "*") which matches zero or more occurences of any character or combination thereof.

It's like a REALLY simple regex evaluator.

And I know some people are going to ask, why not just use a regular expression? They're super sexy and simple to use in Ruby. Why reinvent the wheel?

I have this crazy idea that even though frameworks are pretty awesome at abstracting the difficult detais away from our view, living in ignorance of the basic data structures and algorithms that these frameworks are constructed of, that just leaves us using these things not because they save us time, but because we are dependent on them to get the job done. I'm all for frameworks, but I want to use them because I understand why I should use them, not because I can't understand how to live without them.

What Makes Solving This Difficult?

One of the key things that makes this a somewhat difficult problem to solve if you've never encountered it before is probably best shown as an example.

Imagine you have the following test string:
"*a*a*a"

now imagine how you would test that the following string matches it:
"aaaaababaaaaaaaaa"

See the problem is the * matches everything including the a so how do you know whether to treat the a as a part of the asterisk or as a literal? The answer that Kelly helped me to formulate is that you treat it as both. Basically you allow your solution to branch and when you come to the end of the string you're testing, you look to see if any of the branched paths finally completed the pattern.

I designed the program so that every character in the test pattern represents a given transition from one state to another. There are also three different types of states. A starting state where every match algorithm begins life, a terminal state which is where all matches will end life, and a standard state that represents an interim step from character to character.

So given our string pattern above, the program I wrote would generate the following graph:
  • Start State
    • Transitions to itself on any character (the asterisk)
    • Also, transitions to the next standard state via a literal transition edge that matches on the letter "a"
  • Literal Transition State
    • Transitions to itself on any character (the asterisk)
    • Also, transitions to the next standard state via a literal transition edge that matches on the letter "a"
  • Literal Transition State
    • Transitions to itself on any character (the asterisk)
    • Also, transitions to the next standard state via a literal transition edge that matches on the letter "a"
  • Literal Transition State
    • transitions to the terminal state via an always transition edge.
  • Terminal State
Getting Down to Brass Tacks

Let's look at the transitions and states in code to try to make some sense out of what has been a fairly abstract read up until now. First the graph states:

As you can see the states really do nothing aside from managing their transitions. I do a poor job of encapsulating these from the rest of my string matching world (ie. the objects that use this array could technically add to it directly) and that should be marked for refactoring later (whether or not it is is my lil secret! ;)

And then the transitions:

These are even simpler! The only bit of logic to a transition is it's test to determine whether or not the provided character can be transitioned on. 

Hopefully from just seeing this code you can get a pretty good feeling for how these things might fit together.

What gets more complicated is the building of the graph. I like to imagine that specs can help to explain the code they test somewhat so let me share some of those first:

My tests are awkwardly named so I could use some naming help if you care to leave a comment! :)

This is the class they test:

Here's a quick high level explanation:
  • First, create the root of the whole graph. This is ALWAYS a start state.
  • Store that as the state we're currently adding transitions to. (line 9)
  • Next, for each character in the test string we are matching against the string pattern create the appropriate transition.
  • Add the newly created transition to the current state
  • Set the current state to point to the destination of the new state provided by our new transition.
  • Rinse, lather, repeat...
  • Once that's done ALWAYS create a terminal state as the last state.
  • Attach it to our current state by adding a transition always transition to the current state that points at the final state.
Now about this transition always transition. It feels dirty and like needing to know the start and end state should be unecessary. Even in the current state needing to know the start state is probably unecessary, but the final state... I'm not sure how to handle that generically like the other states. Thoughts?

Also, notice the code block on line 11! That was my first sampling of Ruby awesomeness. I should totally have been able to avoid writing my own string reader by using ruby's string.each_char |c| {} code block but I learned about that after wrapping up and I didn't care to do more refactoring at the time. :)

Notice how I lied about the difference between a "." and a "*"? They're both the same transitions but they have different destination states. Here it is again:

The "*" transition acts as more of a loop back into the same state it left. The "." moves us forward to the next state.

Meat and Potatoes Time!!

Now all of that was built with the ultimate focus on getting the construction of the actual matcher to be as easy as possible. Here are the tests for how the matcher is ultimately used:

God it's painful showing off code that I'm sure could've been written more elegantly! I always convince myself it's for the best to not go back over it an edit it so that way it gives a more honest view of my coding process and doesn't turn into Justin just trying to look pretty.

So without further ado this is the entire implementation of the code that actually utilizes the graph to match a string against a given pattern:

Here's a general overview of the steps involved so that you can hopefully glean as much from my code as possible:
  • First "compile" the test pattern provided by the user into a graph representation.
  • Next make that graph the first state to be evaluated (a graph is really just represented by a start node that acts as the root.
  • Next we read the next character from the test string provided by the user.
  • We every transition of every state that has been queued for evaluation.
  • For each transition that tests true on its can_do method grab that transition's destination state and place that into our next states list/array.
  • Once we've done that for every state and transition until we've ran out of test string to process there should be at least one state in our next states queue that is ready to transition to the terminal state (assuming there is a match).
  • If there is then return true! We have a winner!
  • Else it's false and these strings don't match according to the given pattern.
Was It Good For You?

I try to share these kinds of learning experiences because so many times in the past I've been so thankful when someone else did. Maybe you're like me and don't have a CS background/degree and are trying to find ways to make data structures and algorithms more practical/applicable. It can be hard to take theory and find a way to apply it. Well anyway, hopefully it helps somebody.

Also, Ruby is a pretty nice language. I might even enjoy it more than Python (I like having an end keyword instead of ambiguous white space ;).

If the mood strikes you, you can find the whole project on my github account here: http://github.com/jcbozonier/DfaStringMatcher

Also, I know I call this a DFA in the github title but it isn't. A deterministic finite automata would have no cycles, whereas my graph does.

So that's that. If this was the least bit helpful let me know by leaving a comment and if you feel like I got something wrong or left something else out entirely leave a comment to that effect as well. Thanks for reading!