What Did I Learn to Do?
Over the past couple of weeks I learned about two of the more popular data clustering algorithms: K-Means Clustering and Density-Based Clustering (loosely conforming to DBSCAN). In this post I'll show my own implementation of DBSCAN and how I used it to split an image into separate files, each containing a distinct feature of the original. Here's a quick example to better explain: my algorithm identified all of the white space in the Cheezburger logo and extracted it into a new image file. It looks like this:
Before:
After (gray areas are transparent now):
It also extracted the text:
It should be noted I'll just be discussing the naive implementation of the algorithm. It takes FOREVER to process larger images. I'll also discuss why that is and share some abstract thoughts on how it could be improved.
Why Am I Learning This Now?
The main goal of this exercise is not to learn to do image processing nor is it to learn to write the best or fastest clustering algorithm. The point of this experiment was to learn enough about clustering that I could write my own naive implementation. It's one thing to say I know something and it's a whole other to know enough to be able to craft it.
What is Data Clustering?
Data clustering is something anyone who has ever looked at a scatter plot before has done. It's where you look for the areas where many points seem to group together and label them as separate groups. A good example is on Wikipedia, here's the image they use:
They have already separated the data into two groups for us. This is also a great example of why one might choose to use a density based data clustering algorithm over something like a k-means algorithm.
K-means works best at identifying circular groupings like the blue group above, and it's also very straightforward to implement. Its biggest downside, though, is that the red cluster in the above graph would be split into multiple groups, because K-means can't account for the long, curved shape. A density-based clustering algorithm can gracefully handle both, but it gets a bit more tricky: the algorithm essentially becomes a graph traversal problem that, given large numbers of points, makes the simpler recursive graph traversal solutions intractable.
How I Applied All of This to Images
As I've been learning more about data analysis I keep relearning a saying I've heard before: "Images are a great data source for machine learning." In this case I chose to use images as a multi-dimensional data source using each pixel as an "ImagePoint" in a 6-dimensional space. Red, green, and blue were the first three dimensions. The last three were the alpha transparency value as well as the X and Y coordinates of the pixel in the image.
My main hypothesis was that I should see images separate by colors somewhat based upon their relative visual location. The Cheezburger logo above is a perfect example of my expectation.
Step By Step Through Code
Before we dive into the code there are two concepts you should be aware of that are key to how I implemented DBSCAN:
- Smallest Cluster - the minimum number of points necessary to consider a group of points a cluster.
- Cluster Distance - the maximum distance between two points for them to be "dense" enough to be considered for inclusion in a cluster.
The names used above are not de facto names but my own, taken from the code.
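In standard DBSCAN terminology these correspond to minPts and eps. A minimal sketch of how they might be used, in Python for illustration (the real code is C#, and the constant values here are made up):

```python
# The two knobs that control clustering, expressed as plain constants.
# "Smallest Cluster" corresponds to DBSCAN's minPts; "Cluster Distance" to eps.
SMALLEST_CLUSTER = 4      # minimum neighbors to form a cluster
CLUSTER_DISTANCE = 10.0   # maximum distance for two points to be "dense"

def neighbors_of(point, all_points, distance):
    """All points within CLUSTER_DISTANCE of `point` (including itself)."""
    return [p for p in all_points if distance(p, point) <= CLUSTER_DISTANCE]
```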
Step 1. Load All Pixels Into Memory
To start with I load all of the image's pixels into memory and convert them to the ImagePoint objects I discussed earlier. For the sake of this conversation let's just establish that function call looks like this:
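The original snippet isn't reproduced here; as a rough sketch of its shape, in Python for illustration (the real code is C# and the names are hypothetical):

```python
# Hypothetical sketch of the loading step. Takes a flat list of
# (r, g, b, a) tuples in row-major order and returns 6-dimensional
# points plus the image's width and height.
def load_image_points(pixels, width, height):
    points = []
    for y in range(height):
        for x in range(width):
            r, g, b, a = pixels[y * width + x]
            points.append((r, g, b, a, x, y))
    return points, width, height
```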
I get back the image dimensions so that once I write each cluster's image data to separate files I can position the images in the same places they would have appeared in the original in order to help me see how it worked.
Step 2. Cluster!
This step isn't quite as trivial as the previous one. I'll write out the pseudo-code first and then show you the original code.
- Take each unvisited image point in the original image space
- Mark the point as visited and then find all neighboring points
- Neighboring points are found by calculating the distance between the given point and all other points and only keeping the ones that are within the Cluster Distance.
- Then, if we have more neighboring points than the Smallest Cluster size we've found ourselves a new cluster!
- Create a new cluster and add it to our list of clusters and then also add the given image point to the new cluster.
- Now we can expand our newly created cluster!
- From here we start exploring every image point our given image point is densely connected to. Basically, it's time to follow the bread crumbs to the end of the road.
- Now let's go to all of the neighboring points we haven't visited yet and add them to our "connected image points to be examined" list.
- Using this list we'll keep track of the points we started with as well as any new ones that need to be examined. As we find points that meet our criteria we'll be adding them to our current cluster
- We also need to mark these points as having been queued to be visited so that they aren't added to our list multiple times and waste time.
- Now, while we still have connected image points to be examined
- Grab one of them and if it hasn't already been visited, let's start our visit!
- Start the visit by marking the image point as being visited
- Add this point to our cluster
- Find all of the neighboring image points to this image point that is itself a neighboring image point to the image point we've gotten at step 1! Complicated much?
- Add all of these newest neighboring image points to our "connected image points to be examined" list IF:
- They have not already been visited
- They are not already queued for a visit
- There are more of them than the Smallest Cluster size.
- As we add the image points to the "connected image points to be examined" list mark them as "Queue for Visit".
There are a couple of key simplifying assumptions I'm making. One is that a point only need be visited once. Also that once a single point in our outer loop identifies a new cluster, that cluster will be built in its entirety with the inner loop. This means that we never have to check whether or not a point has already been added to a cluster.
Here is the definition of the ImagePoint class:
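The class definition isn't reproduced here; a sketch of the shape it takes, in Python for brevity (the original is C#, and the bookkeeping field names are my guesses):

```python
import math

class ImagePoint:
    """A pixel as a point in 6-D space: color (r, g, b, a) plus position (x, y)."""

    def __init__(self, r, g, b, a, x, y):
        self.r, self.g, self.b, self.a = r, g, b, a
        self.x, self.y = x, y
        self.visited = False           # bookkeeping used by the clustering loop
        self.queued_for_visit = False

    def distance_to(self, other):
        # Plain Euclidean distance over all six dimensions. Dropping x and y
        # here would cluster by color alone, as discussed below.
        return math.sqrt(
            (self.r - other.r) ** 2 + (self.g - other.g) ** 2 +
            (self.b - other.b) ** 2 + (self.a - other.a) ** 2 +
            (self.x - other.x) ** 2 + (self.y - other.y) ** 2
        )
```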
Especially important is the distance function. Just by adding or removing factors from the distance computation, I could change the algorithm's behavior dramatically. For example, if I removed the X and Y coordinates, I'd end up with clusters of pixels that have smoothly transitioned colors (sort of).
Anyway, now the actual code that utilizes this:
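The original C# isn't reproduced here, so here's a direct Python transliteration of the pseudo-code above, for illustration (points are plain dicts carrying their own visited/queued flags):

```python
def cluster(points, cluster_distance, smallest_cluster, distance):
    """Naive DBSCAN over `points`. Every neighbor search scans all n points,
    which is exactly the O(n^2) behavior discussed later in the post."""
    clusters = []
    for point in points:
        if point.get("visited"):
            continue
        point["visited"] = True
        neighbors = [p for p in points if distance(p, point) <= cluster_distance]
        if len(neighbors) < smallest_cluster:
            continue  # not dense enough to seed a cluster
        current = [point]          # new cluster containing the seed point
        clusters.append(current)
        # Expand the cluster: follow the bread crumbs to the end of the road.
        to_examine = [p for p in neighbors if not p.get("visited")]
        for p in to_examine:
            p["queued"] = True     # don't add the same point to the list twice
        while to_examine:
            p = to_examine.pop()
            if p.get("visited"):
                continue
            p["visited"] = True
            current.append(p)
            p_neighbors = [q for q in points if distance(q, p) <= cluster_distance]
            if len(p_neighbors) >= smallest_cluster:
                for q in p_neighbors:
                    if not q.get("visited") and not q.get("queued"):
                        q["queued"] = True
                        to_examine.append(q)
    return clusters
```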
Step 3: Create an image for each cluster
This is as simple as looping through all of our clusters and writing each pixel contained within to a Bitmap on disk. C# makes this fairly straightforward (one reason I didn't do this in Ruby, actually). That doesn't mean I couldn't have done it in Ruby, just that it wasn't immediately apparent how.
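To sketch the idea without C#'s Bitmap class, here's a stdlib-only Python version that writes one cluster to a binary PPM file. PPM has no alpha channel, so the "transparent" areas are filled with a gray background instead (the function name and background choice are mine):

```python
def write_cluster_ppm(cluster, width, height, path, background=(128, 128, 128)):
    """Write one cluster's pixels into a full-size PPM image. Pixels not in
    the cluster get the background color, so each cluster stays positioned
    where it appeared in the original image."""
    canvas = [list(background) for _ in range(width * height)]
    for (r, g, b, a, x, y) in cluster:
        canvas[y * width + x] = [r, g, b]
    with open(path, "wb") as f:
        f.write(b"P6\n%d %d\n255\n" % (width, height))
        f.write(bytes(v for px in canvas for v in px))
```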
What Could Be Done Better?
My algorithm takes on the order of hours to run on any medium-sized image. Some simple profiling has shown that my reliance on
image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray()
is where the bottleneck lies (really, within the Where expression). If anyone has any concrete tips on how I could cache that data to decrease the run time, I'm all ears. (EDIT: Apparently using a KD-Tree helps significantly with nearest-neighbor searches like this. I foresee yet another blog post coming soon!)
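To give a flavor of why the k-d tree suggestion helps, here's a toy 2-D version in Python with radius queries. Instead of scanning all n points per lookup, it prunes entire subtrees whose splitting plane lies farther away than the query radius (this is a minimal sketch, not the implementation from the post, and production code would use an existing library):

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree over 2-D points, alternating split axis by depth."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def radius_query(node, target, radius, found=None):
    """Collect all points within `radius` of `target`."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis = node["point"], node["axis"]
    if math.dist(point, target) <= radius:
        found.append(point)
    diff = target[axis] - point[axis]
    # Always search the side of the split the target falls on...
    near, far = ("left", "right") if diff <= 0 else ("right", "left")
    radius_query(node[near], target, radius, found)
    # ...and only cross the splitting plane if the radius reaches it.
    if abs(diff) <= radius:
        radius_query(node[far], target, radius, found)
    return found
```

Each DBSCAN neighbor search then becomes a `radius_query` with Cluster Distance as the radius, turning the per-point scan into an O(log n) descent on average.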
Final Results!
As a final test I ran the algorithm over this meme:
And here are some samples of the images it was broken into:
A slice of the pie chart
The meme watermark:
That's all! Thanks for reading!