Have you ever seen a large table of data and thought, "Somebody just vomited numbers on my monitor?" Some times it can be very hard to make sense out of tables like this and even more so when the tables are large. There is at least one machine learning technique that will help you make sense of the data: Classification Trees.
Classification trees work by pouring over your data column by column and then hierarchically grouping your data by the unique values in each column. As an example imagine if you had a table of data like this:
In code I write it like this:
How would you organize the data to make it more understandable?
Note that this example is a bit impractical because the table is pretty small but the mechanics will work the same. One caveat: the algorithm is working off of already enumerated values in a table. There's no continuous values like 8.132, or names, or IDs. In trying to use this in a real world scenario I'm finding myself preprocessing my data to create the buckets so I can run this algorithm.
The classification algorithm I learned tonight starts by looking at each column of data and then deciding which seems like it will give us the most information gain. Once it determines that it partitions the table and carries on recursively until we've come up with a full classification tree. You might be wondering how exactly does the algorithm determine which split will give us the most information. That's thanks to a calculation known as Shannon Entropy. It's outside the scope of this post but feel free to read more about it here.
Here's the code I ported from Python to Ruby:
With the algorithm I created, the following hierarchy is created:
This is one of the few problems that very neatly fits a recursive solution. Why this one? Well a couple reasons:
- You'll notice we're building a tree from this hierarchy. Trees and many of the solutions that involve them are self similar meaning, whatever you do at one level of the tree you'll need to do at every level. In this case, that would be partitioning our dataset into many sub-datasets.
- The recursion depth is fairly well limited by the context of the problem. You'll only ever run into it on datasets with columns that number in the thousands. If that's you, you'll need to unwind the recursion into a stack based design. Have fun! :)
This code is pretty tacky for Ruby but I'm posting it anyway. Suggest how I can clean it up!
To give credit where it's due I sat down with Machine Learning in Action tonight and while I didn't copy the algorithm word for word, it was extremely helpful. This is one of the few books on an advanced topic that seems to able to convey the author's knowledge very well. Some of the more experienced among you might notice that the way I handled evaluating the right way to slice the data is different from the standard way (at least when compared to my book). I'm ok with that but I might clean it up and straighten it out if the mood strikes me.
That's it! It's been a while since I blogged and this was the first thing at the top of my mind.