My Adventure in The Windy City

You may have seen that I took an impromptu family-ish weekend vacation to Chicago earlier this month. While there I had the opportunity to celebrate Savannah's birthday with her and her family and check out the city as we tried to figure out if we could imagine living there.

Oh and I had a little meeting with the fine folks at GrubHub.

My Time with Cheezburger

If you've been following me, you've seen my professional path shift from web development and computer science to machine learning and ultimately now to data, web analytics, and business intelligence. I was lucky enough to work for a company like Cheezburger that supported the kind of organic growth I needed to find the role I feel happiest in. Being a software engineer, while I could do it well enough and enjoyed it, felt too hollow to me when I wasn't immersed in data about how my features were doing. If you've worked with me for any length of time then you know I'm not working on anything unless I truly get "the vision" and know what metrics we're trying to move. If my team didn't have that information... well, we would by the time I finally acquiesced. To me, the software was something to be minimized: a means to an end to get to see results.

When I was asked if I'd enjoy working on Cheezburger's (Notorious) Business Intelligence Team I was ecstatic. I learned an immense amount from the mentoring of both Delen Heisman and Loren Bast (who's now the Director of B.I. at Cheezburger). Delen taught me more about statistics and experiment design than I thought I'd ever care to know, and Loren taught me the importance of weaving a coherent story with data. I eventually wound up being labeled "Sr. Developer/Analyst" and was more proud of having analyst in my title than I thought I would be.

How does that lead to GrubHub?

When I left GrubHub's offices I wouldn't let Savannah's family even ask me questions about the job and I was afraid to admit to Savannah how much I wanted it. I was too excited. No place is perfect but damn. Let's just say I walked away very impressed and very happy to have had the day to talk with many of them.

So I had no choice.

Starting February 4th, I am joining GrubHub as their first Product Optimization Specialist. That's basically a fancy way of saying I'll be combing over the data from their products and services and looking for ways to improve them in a quantitative fashion. Everything from standard funnel analysis to split testing. I'll be helping to come up with the leanest experiments I can in order to effect the maximum improvement. I'll have a direct impact on the company's success and revenue in a real and quantifiable way and I'm immensely excited about that. In case you can't tell, I'm also very proud to have been offered the opportunity.

This means I'll no longer be a remote employee and will be relocating to Chicago. Savannah will be closer to her family (who live in Grand Rapids, MI) and her sister (who lives in Chicago). I'll be closer to Corey Haines #swoon and apparently there's a Fry's Electronics just outside the city limits!!

Adventure is ahead! Stay tuned.

First Draft of a Statistical Testing Overview

This is not yet complete or proofread... I could use any help I can get to do that.

I'm looking to improve it to about 80-90% accuracy so it can function as a great quick-glance guide, and I also want to make the information more digestible and engaging via some better graphic/information design.

Planned additions: Examples of data for each test, illustrations of the shapes of the distributions, some sort of a handy index to find the right test for the job, and more.

Also, if it somehow manages to be helpful as it currently is, please let me know! Thanks!


The latest version:

Visualizing Multidimensional Data in Two Dimensions

Recently I ran into an issue where I had multidimensional data and didn't have an effective way to visualize it. Those who are less visual and more abstract are content to explore multidimensional data mathematically. If you're like me, however, you're visual and you want to see the shape of your data.

Enter star coordinates.
What are they? Here's my quick explanation, then I'll direct you to the paper that taught me about them. Say you have a three-dimensional point (3, 4, 5). You would go a distance of three along what you would normally call the x-axis. Next, and here is where it gets more interesting, pretend you're the point, turn (2 * pi) / dimension_count radians (three dimensions in this case), and then walk 4 units in this direction. Turn again and walk 5 units.

Theory is always great, but here's how I implemented it and what my version looks like when graphing 4 dimensions.

The first thing I worked on was plotting a single point. I ended up with this solution:

Two things are involved: using some trigonometry to figure out how far to move along the x and y axes for each dimension, and aggregating those movements into a single point. You'll see in the code I have something I call "axial_rotation"; that's how much I need to rotate as I step through each dimension of the point.
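
To make that concrete, here's a rough Python sketch of just the projection step (this is only the core idea, not my actual implementation; the full demo code is further down):

import math

def star_coordinates(point):
    # Each dimension gets its own axis, rotated a fixed amount from the last one.
    axial_rotation = (2 * math.pi) / len(point)
    x = y = 0.0
    for dimension, value in enumerate(point):
        angle = dimension * axial_rotation
        x += value * math.cos(angle)  # walk `value` units along this dimension's axis
        y += value * math.sin(angle)
    return x, y

print(star_coordinates((3, 4, 5)))  # the example point from earlier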

I didn't end up finding anything worthwhile in my data but maybe this technique will help me in the future... or maybe even help you.

I'm not 100% certain it's perfect. If anyone happens upon this and finds an error I'd love to know about it.

Here's the full demo code:

Identifying Features in Images with Cluster Analysis

What Did I Learn to Do?

Over the past couple of weeks I learned about two of the more popular data clustering algorithms: K-Means Clustering and Density Based Clustering (loosely conforming to DBSCAN). In this post I'll be showing my own implementation of DBSCAN and how I used it to split images into separate files, each containing assorted features from the original image. Here's a quick example to better explain: my algorithm identified all of the white space in the Cheezburger logo and separated it into a new image file. It looks like this:

Before:
After (gray areas are transparent now):

It also extracted the text:

It should be noted I'll just be discussing the naive implementation of the algorithm. It takes FOREVER to process larger images. I will also discuss why that is and share some abstract thoughts on how it could be improved.
Why Am I Learning This Now?

The main goal of this exercise is not to learn to do image processing, nor is it to learn to write the best or fastest clustering algorithm. The point of this experiment was to learn enough about clustering that I could write my own naive implementation. It's one thing to say I know something and a whole other thing to know enough to be able to craft it.

What is Data Clustering?

Data clustering is something anyone who has ever looked at a scatter plot has done. It's where you look for the areas where many points seem to group together and label them as separate groups. A good example is on Wikipedia; here's the image they use:
They have already separated the data into two groups for us. This is also a great example of why one might choose to use a density-based data clustering algorithm over something like a k-means algorithm.

K-means works best at identifying circular groupings like the blue group above. K-means is also very straightforward to create. The biggest downside to that algorithm, though, is that the red cluster in the above graph would be split into multiple groups because K-means can't account for the long, curved shape. A density-based clustering algorithm, however, can gracefully handle both, but it gets a bit more tricky: the algorithm essentially becomes a graph traversal problem that, given large numbers of points, prevents the simpler recursive graph traversal solutions from being tractable.
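
If you want to see the difference for yourself, here's a quick scikit-learn sketch (my own toy example using the "moons" dataset as a stand-in for the curved red cluster, not the actual Wikipedia data):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interlocking crescents: non-convex shapes that k-means struggles with.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means draws a straight boundary that mixes the two crescents together,
# while DBSCAN follows the density and recovers each crescent as its own cluster.
print(set(kmeans_labels), set(dbscan_labels))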

How I Applied All of This to Images

As I've been learning more about data analysis I keep relearning a saying I've heard before: "Images are a great data source for machine learning." In this case I chose to use images as a multi-dimensional data source using each pixel as an "ImagePoint" in a 6-dimensional space. Red, green, and blue were the first three dimensions. The last three were the alpha transparency value as well as the X and Y coordinates of the pixel in the image.
My main hypothesis was that I should see images separate by colors somewhat based upon their relative visual location. The Cheezburger logo above is a perfect example of my expectation.

Step By Step Through Code

Before we dive into the code there are two concepts you should be aware of that are key to how I implemented DBSCAN:
  1. Smallest Cluster- The minimum number of points necessary to consider a group of points a cluster.
  2. Cluster Distance- The maximum distance between two points for them to be "dense" enough to be considered for inclusion into a cluster.
The names used above are not de facto names but my own from the code.

Step 1. Load All Pixels Into Memory

To start with I load all of the image's pixels into memory and convert them to the ImagePoint objects I discussed earlier. For the sake of this conversation let's just establish that the function call looks like this:
I get back the image dimensions so that once I write each cluster's image data to separate files I can position the images in the same places they would have appeared in the original, to help me see how it worked.
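
My real loader is C#, but the rough shape of it in Python (using Pillow and the ImagePoint class sketched further down) would be something like this:

from PIL import Image

def load_image_points(path):
    # Every pixel becomes a 6-dimensional point: (R, G, B, A, X, Y).
    image = Image.open(path).convert("RGBA")
    width, height = image.size
    pixels = image.load()
    image_points = []
    for y in range(height):
        for x in range(width):
            r, g, b, a = pixels[x, y]
            image_points.append(ImagePoint(r, g, b, a, x, y))
    return image_points, width, height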

Step 2. Cluster!

This step isn't quite as trivial as the previous one. I'll write out the pseudo-code first and then show you the original code.
  1. Take each unvisited image point in the original image space
  2. Mark the point as visited and then find all neighboring points
  3. Neighboring points are found by calculating the distance between the given point and all other points and only keeping the ones that are within the Cluster Distance.
  4. Then, if we have more neighboring points than the Smallest Cluster size we've found ourselves a new cluster!
  5. Create a new cluster and add it to our list of clusters and then also add the given image point to the new cluster.
  6. Now we can expand our newly created cluster!
  7. From here we start exploring every image point our given image point is densely connected to. Basically, it's time to follow the bread crumbs to the end of the road.
  8. Now let's go to all of the neighboring points we haven't visited yet and add them to our "connected image points to be examined" list.
  9. Using this list we'll keep track of the points we started with as well as any new ones that need to be examined. As we find points that meet our criteria we'll be adding them to our current cluster
  10. We also need to mark these points as having been queued to be visited so that they aren't added to our list multiple times and waste time.
  11. Now, while we still have connected image points to be examined
  12. Grab one of them and if it hasn't already been visited, let's start our visit!
  13. Start the visit by marking the image point as being visited
  14. Add this point to our cluster
  15. Find all of the neighboring image points to this image point that is itself a neighboring image point to the image point we've gotten at step 1! Complicated much?
  16. Add all of these newest neighboring image points to our "connected image points to be examined" list IF:
    1. They have not already been visited
    2. They are not already queued for a visit
    3. There are more of them than the Smallest Cluster size.
  17. As we add the image points to the "connected image points to be examined" list mark them as "Queue for Visit".
There are a couple of key simplifying assumptions I'm making. One is that a point need only be visited once. Also that once a single point in our outer loop identifies a new cluster, that cluster will be built in its entirety with the inner loop. This means that we never have to check whether or not a point has already been added to a cluster.
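
In Python, that pseudo-code comes out to roughly the following. This is a sketch of the approach rather than my actual C# code, and it assumes the ImagePoint objects carry visited/queued flags and a distance_to method (sketched just below):

def cluster_image_points(image_points, cluster_distance, smallest_cluster):
    # Naive DBSCAN-style clustering following the pseudo-code above.
    clusters = []
    for point in image_points:
        if point.visited:
            continue
        point.visited = True
        neighbors = [p for p in image_points
                     if p.distance_to(point) <= cluster_distance]
        if len(neighbors) < smallest_cluster:
            continue  # not dense enough to seed a new cluster
        cluster = [point]
        clusters.append(cluster)
        to_examine = [p for p in neighbors if not p.visited and not p.queued]
        for p in to_examine:
            p.queued = True
        while to_examine:
            candidate = to_examine.pop()
            if candidate.visited:
                continue
            candidate.visited = True
            cluster.append(candidate)
            candidate_neighbors = [p for p in image_points
                                   if p.distance_to(candidate) <= cluster_distance]
            if len(candidate_neighbors) >= smallest_cluster:
                for p in candidate_neighbors:
                    if not p.visited and not p.queued:
                        p.queued = True
                        to_examine.append(p)
    return clusters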

Here is the definition of the ImagePoint class:

Especially important is the distance function. If I wanted to, I could change the algorithm dramatically just by adding or removing factors from the distance computation. For example, I could choose to remove the X and Y coordinates and then I'd end up with clusters of pixels that have smoothly transitioned colors (sort of).
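
My actual class is C#, but a Python sketch of the same idea looks something like this (plain Euclidean distance over all six dimensions, used here as a stand-in for the real formula):

import math

class ImagePoint:
    # A pixel treated as a point in 6-dimensional space.
    def __init__(self, red, green, blue, alpha, x, y):
        self.red, self.green, self.blue, self.alpha = red, green, blue, alpha
        self.x, self.y = x, y
        self.visited = False  # bookkeeping flags used by the clustering loop
        self.queued = False

    def distance_to(self, other):
        # Dropping the x/y terms here is the tweak mentioned above that
        # would cluster purely on color instead of color plus position.
        return math.sqrt((self.red - other.red) ** 2
                         + (self.green - other.green) ** 2
                         + (self.blue - other.blue) ** 2
                         + (self.alpha - other.alpha) ** 2
                         + (self.x - other.x) ** 2
                         + (self.y - other.y) ** 2)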

Anyway, now the actual code that utilizes this:

Step 3: Create an image for each cluster

This is as simple as looping through all of our clusters and writing each pixel contained within to a Bitmap on disk. C# makes this fairly straightforward (one reason I didn't do this in Ruby, actually). That doesn't mean I couldn't have done it in Ruby, just that it wasn't immediately apparent how.
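
For anyone following along in Python rather than C#, the equivalent of that Bitmap loop with Pillow is roughly:

from PIL import Image

def write_cluster_images(clusters, width, height, prefix="cluster"):
    for index, cluster in enumerate(clusters):
        # Start from a fully transparent canvas the size of the original image.
        output = Image.new("RGBA", (width, height), (0, 0, 0, 0))
        for point in cluster:
            # Keep each pixel in the position it occupied in the original.
            output.putpixel((point.x, point.y),
                            (point.red, point.green, point.blue, point.alpha))
        output.save(prefix + "_" + str(index) + ".png")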

What Could Be Done Better?

My algorithm takes on the order of hours to run on any medium sized image. Some simple profiling has shown that my reliance on 

image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray()

is where the bottleneck lies (really within the Where expression). If anyone has any concrete tips around how I could cache that data to decrease the run time, I'm all ears. (EDIT: Apparently using a KD-Tree helps significantly with nearest neighbor searches like this. I foresee yet another blog post coming soon!)
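
I haven't actually rewritten anything yet, but in Python the KD-Tree idea looks something like this with SciPy (a sketch reusing the image_points and cluster_distance names from earlier):

import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over all of the 6-dimensional points...
coords = np.array([[p.red, p.green, p.blue, p.alpha, p.x, p.y] for p in image_points])
tree = cKDTree(coords)

# ...then each neighborhood lookup becomes a cheap ball query
# instead of a linear scan over every pixel in the image.
neighbor_indexes = tree.query_ball_point(coords[0], r=cluster_distance)
neighbors = [image_points[i] for i in neighbor_indexes]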

Final Results!

As a final test I ran the algorithm over this meme:


And here are some samples of the images it was broken into:

A slice of the pie chart:


The meme watermark:

That's all! Thanks for reading!

Coding Quickie- Classification Tree Learning

Have you ever seen a large table of data and thought, "Somebody just vomited numbers on my monitor?" Sometimes it can be very hard to make sense out of tables like this, and even more so when the tables are large. There is at least one machine learning technique that will help you make sense of the data: Classification Trees.

Classification trees work by poring over your data column by column and then hierarchically grouping your data by the unique values in each column. As an example, imagine you had a table of data like this:

In code I write it like this:

How would you organize the data to make it more understandable?
Note that this example is a bit impractical because the table is pretty small, but the mechanics work the same. One caveat: the algorithm works off of already enumerated values in a table. There are no continuous values like 8.132, or names, or IDs. In trying to use this in a real world scenario I'm finding myself preprocessing my data to create the buckets so I can run this algorithm.

The classification algorithm I learned tonight starts by looking at each column of data and deciding which seems like it will give us the most information gain. Once it determines that, it partitions the table and carries on recursively until we've come up with a full classification tree. You might be wondering how exactly the algorithm determines which split will give us the most information. That's thanks to a calculation known as Shannon Entropy. It's outside the scope of this post but feel free to read more about it here.
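
My implementation is in Ruby, but to make the entropy piece concrete, here's a small Python sketch of how information gain picks the split column (it assumes each row's last element is the class label):

import math
from collections import Counter

def shannon_entropy(rows):
    # Entropy of the class labels, assumed to be the last column of each row.
    labels = [row[-1] for row in rows]
    total = len(rows)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def best_column_to_split(rows):
    # Pick the column whose partition yields the largest information gain.
    base_entropy = shannon_entropy(rows)
    best_gain, best_column = 0.0, None
    for column in range(len(rows[0]) - 1):  # every column except the label
        new_entropy = 0.0
        for value in set(row[column] for row in rows):
            subset = [row for row in rows if row[column] == value]
            new_entropy += (len(subset) / len(rows)) * shannon_entropy(subset)
        gain = base_entropy - new_entropy
        if gain > best_gain:
            best_gain, best_column = gain, column
    return best_column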

Here's the code I ported from Python to Ruby:

With the algorithm I created, the following hierarchy is created:

This is one of the few problems that very neatly fits a recursive solution. Why this one? Well, a couple of reasons:
  1. You'll notice we're building a tree from this hierarchy. Trees and many of the solutions that involve them are self-similar, meaning whatever you do at one level of the tree you'll need to do at every level. In this case, that would be partitioning our dataset into many sub-datasets.
  2. The recursion depth is fairly well limited by the context of the problem. You'll only ever run into trouble on datasets with columns that number in the thousands. If that's you, you'll need to unwind the recursion into a stack-based design. Have fun! :)
This code is pretty tacky for Ruby but I'm posting it anyway. Suggest how I can clean it up!

To give credit where it's due, I sat down with Machine Learning in Action tonight and while I didn't copy the algorithm word for word, it was extremely helpful. This is one of the few books on an advanced topic that seems able to convey the author's knowledge very well. Some of the more experienced among you might notice that the way I handled evaluating the right way to slice the data is different from the standard way (at least when compared to my book). I'm ok with that but I might clean it up and straighten it out if the mood strikes me.

That's it! It's been a while since I blogged and this was the first thing at the top of my mind.

What I Learned at StrataConf 2012

Recently I went to StrataConf (http://strataconf.com/stratany2012) to learn more about this crazy world of data I'm slowly slipping further into. I made several mind maps that I've posted at the end of this blog post.
Key Takeaways
  1. Hadoop is huge in the data mining space. Like HUGE.
  2. Data scientists get overly fixated on playing with their data, like programmers do on coding. It wasn't caught until the company was basically about to die. BOTH times. Two solutions:
    1. Appoint a kind of "canary" who isn't emotionally involved. Listening to the canary becomes the next hard problem.
    2. Have a hypothesis before diving into data!
  3. Data scientists need to be "Scrappy". For coders "Hacker" is a synonym.
    1. Steps
      1. Analyze- Take the time to understand your model and look at the data. No black boxes.
      2. Anticipate- Build a data viewer and proactively look for bugs. Bugs are the enemy. STOP THEM.
      3. Improvise- "Don't indulge in any unnecessary, sophisticated moves..." -Bruce Lee
      4. Adapt- Error data is GREAT data. Don't just give up... Understand.
    2. What's a "Data Scientist"?
      1. The Venn diagram:
  4. Real-time data- Event-oriented queries via Esper. Your algorithms shouldn't require rerunning the whole calculation on new data.

And last but not least! Mind maps!

Data Quickie- 911 Police Responses for Seattle Plotted in R

Lately I've been playing in R and have my own R statistics lab day every Saturday. Well today I downloaded Seattle's data set of >438k 911 police responses for the city and plotted them in R. You can find the data I used and more here: http://data.seattle.gov/

Here is the plot I created in R from this data:

First, compare my plot to this Google Map: 
View Larger Map

I've read about Facebook doing something similar and seeing maps formed but I've never experienced it for myself. It was awesome to see it happen.

Also (and very unsurprisingly) notice how the darkest area with the most incidents is in downtown and how abruptly the density changes at the city limits.

Here's how to do it.

  1. Download this specific dataset in CSV format: http://data.seattle.gov/Public-Safety/Seattle-Police-Department-911-Incident-Response/3k2p-39jp
  2. Load it into R by executing the following command: police_911 <- read.csv(file.choose())
  3. Find where you downloaded the file and open it
  4. Then I always use summary to quickly look at the data and get my bearings: summary(police_911)
  5. Notice the Longitude and Latitude columns? Already neatly parsed for us! Time to plot!
  6. plot(police_911$Longitude, police_911$Latitude, pch=20, cex=.01)

That's it! The pch and cex parameters allow me to set the point shape and size respectively. By executing length(police_911$Latitude) I can find out how many rows there are in that column... 438,512. Scheisse!

Well that's it. Just a quick update about something I found to be interesting.

Update: Here's a map with Fire responses added to it as well. Blue is police and red is fire.

TDD for Business Value

A New Hope

Software Craftsmanship, SOLID principles, eXtreme Programming, the list of all the "best practice" guides I've learned over the years goes on and on. They are all extremely valuable to me. They also need to make room for a new style of development being ushered in by the ideas expressed in The Lean Startup. The best way to label it is as "the way Bozo is coding for now".

What exactly am I talking about? Temporary code written fast and with little thought. Code meant to elicit learning and then promptly discarded and removed from production. Wait, what? Let me take a step back.
I've been using a new process for split testing features. It flows like this:
  1. Formulate a hypothesis AKA have a testable opinion. I am nothing if not opinionated.
  2. Design a test. i.e. Let's move this thing here, and that thing there. List independent variables, dependent variables, covariates, and expected benefits AKA $BLING$. Make an assumption that a certain segment of my customer base is representative (enough) of my full customer base.
  3. Code up something that only works for that segment... maybe it even only half works. Fuck any kind of automated testing unless it makes this step faster.
  4. Test what's been written and decide if it's "Good Enough" to get us reliable results.
  5. Release the test to production
  6. Once the test is over, delete all the code from production.
Why Is This Magical?

Step three is the one that's gonna make me sound like an unknowing ass hat. Allow me to explain:

If you're going to throw away code and aren't going to need to maintain it the rules of the Agile game change dramatically.

Why do we write automated tests? We accept that the slight cost increase now is worth it in the long run because the code will be around for a while and fellow teammates will need to interact with it. tl;dr It's a way of keeping our costs low. Great. Seriously.

So once we accept that this code will only live in production for a day, and I can reasonably say no one will have to understand it, why would I write automated tests? I don't.

Also, notice step two. If you severely limit the customer segments you're targeting it means you can take a bunch of shortcuts like:
Someone please ask me, "But what if there's A LOT of code for the test?" Two things:
  1. Your test is probably too damn big and testing more than isolated changes. (Notice the italics. #GuidelinesNotSteadfastRules)
  2. Remember step three? You wrote that code in a shitty unsustainable manner... BUT. You'll be able to rationalize taking the time to refactor after this. Lemme tell you why.
Know The Value of Your Work

You just developed, shipped, and validated the value of some code you are planning to put into production within two days. You now know how much money the feature makes your business and that means you also now know how much money your company will lose every time the feature breaks. Suddenly you're not just pushing out features because you're bored or trying to keep your team from looking idle, you're pushing out features because they make your company more money.

This is the new TDD, guys. This is Red, Green, Refactor for business value.
  1. Red- Think about what you're doing enough to understand how you can test it.
  2. Green- Find out it makes you money and delete the code.
  3. Refactor- Write the code well and in a disciplined manner knowing that you've proven it's worth the time.
Not Worth Testing

If a change is so small it isn't worth testing to see if it affects anything then WTF are we doing here? Who the hell is prioritizing that feature? Push back and bring this up. Branding might be a reason. I'm not sure how I feel about that. I can sympathize with design and... yeah. Separate blog post.

If a change is such a sure thing that you KNOW it will bring in major money then the cost to test it is probably insignificant given the long term value.

Have You Actually Tried This?

Yes.

Event Sourcing in Javascript

What is event sourcing (from a Bozo's perspective)?

Event sourcing is essentially the practice of storing a system's data in its most natural form: events. Rather than worrying about my database tables and codifying my data model into a DB, I instead store all of the events that have happened. So in DB land we're talking one table called Events with some columns of metadata about the event (to be covered later) and a data column. The data column is a blob field that the business objects can interpret to replay the event happening to them.
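
My project is JavaScript, but just to make the shape concrete, here's a tiny Python sketch of the single-Events-table idea plus replay (the names are mine, not from the real code):

import uuid
from datetime import datetime, timezone

class EventStore:
    def __init__(self):
        self.events = []  # stand-in for the one Events table

    def append(self, event_type, data):
        event = {
            "id": str(uuid.uuid4()),                             # metadata columns...
            "type": event_type,
            "occurred_at": datetime.now(timezone.utc).isoformat(),
            "data": data,                                        # ...and the opaque data blob
        }
        self.events.append(event)
        return event

    def replay(self, apply):
        # Rebuild current state by playing every event, in order, against an apply function.
        for event in self.events:
            apply(event)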

Why event sourcing?
There are a ton of benefits as well as some costs to choosing this architectural style. For myself, on this project, I chose it to learn more about event sourcing through experience and I don't fancy myself enough of an expert to give guidance here. The reasons I am interested in it are:
  • A data model that integrates logging- I step through exactly what my user did.
  • Optimizations become views of data- Is calculating your users state too slow? Need SQL? No problem. Run through all of the events and store them in OR tables for fast querying. Think of it as a cache. When a new event is added that invalidates the cache, play that event against your SQL tables as well and you're good to go.
  • Migrations may be easier(?)- Did your conceptual or data model change? No problem, blow away those SQL tables and build new ones based off of the events.
  • A data model that can handle multiple changes to the same data at the same time and no information loss. Because we're just collecting events, the UI may not show the data, but it's still in the system and able to be retrieved if need be.
  • Undo supported by default- Make a change but want to take it back? Go back to a previous moment in time demarcated by... EVENTS. :)
My Experiment
I'm building a small inventory system for a school cafeteria and in an effort to remain engaged and finish it I thought I'd add the wrinkle of letting this be my first pure event sourcing system. I've posted the code online here: https://github.com/jcbozonier/HeeHaw

Interesting results so far...
As I said at the start of this post, I'm doing this as a javascript RIA. The server in my example will be nothing more than Yet Another Event Store. That's been a somewhat surprising result. How I'm going to store the data and everything... just an implementation concern. Since right now the whole app can be brought back to its current state using just the event stream, I know that all I need to do to provide persistence is to send the events to the server... or the browser's local storage... or I can store them in gists via Github's API... the list goes on.

In the meantime I can run a fully functional version of the system without DB/server access as long as I'm content with no cross session state persistence. MongoDB/Heroku might be pretty damn quick to deploy on but DropBox is even faster.

A negative interesting result? It's taken me a long time to do a fairly CRUDy app. Bending my mind around storing these events and driving the system off of them has been painful to say the least. Also, most of my events bubble up from my event store directly to my views without any real logic needing to be done. This will change as the app becomes more complex, but I still just wanted to point it out.

You might be wondering, what does a javascript event store look like?

Like this: 

And what do the events look like? Here's an example:

That's the event that fires when an inventory item is added to the system. One interesting aspect of this is the "generate_guid()" function call. I'm used to letting the DB handle that detail for me, but it always bugged me that my business objects couldn't handle that. Now that I use GUIDs I can. Just generate a random ID and assign it to an object. There's a chance of a collision, but there are 3.4x10^38 different possibilities. There are articles on the efficacy of GUIDs/UUIDs; please Google for them at your leisure.
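
For comparison, in Python the same trick is a one-liner with the standard library (my generate_guid() does the JavaScript equivalent):

import uuid

# Client-side ID generation: no round trip to the database needed.
new_item_id = str(uuid.uuid4())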

More coming shortly
As I make more progress on the app I'm building, I'll add another post. In the meantime I wanted to throw something out here to record my thoughts.

From Crazy Idea to Customer in a Weekend

I launched a small business called http://GeekRations.com and got a paying customer in a weekend. Full disclosure, that was my only customer. I'm still learning and likewise I'm not completely sure how to go about finding my next one. This was still a huge milestone for me this year. A paying customer. That was my goal.

You'll see a lot of parallels to The Lean Startup in this and that's totally cool. I'm not, however, trying to adhere to some methodology. Experience has taught me that that's the Wrong Thing. The zen I've been able to pick out of The Lean Startup and Customer Development is really just that a solid business model is testable. It's not a black art. There's no reason or excuse to go months building a product or service without talking to potential customers. For developers, you think these ideas are just for pointy-haired business people? I bet you've tried to start your own OSS project and garner some community support but couldn't. This shit applies to you too. We all care if our work is valuable, and we all want to know ASAP if it isn't so we don't waste our time.

Let's learn what we need to learn and iterate. Let's admit that no one cares and ask them what they do care about. Let's expect customers to buy now and when they don't let's ask them why.

Here's what I did:


  1. (Friday Night) Picked a market- I'm a programmer and a geek. All of my friends are. I have 400+ followers on Twitter who are as well. I also spend time on HackerNews which is mainly geeks.
  2. (Friday Night) Gauged interest- I put up the most cheesy generic Unbounce landing page possible. A couple sentences about my idea and split tested two pages. I wanted to know if my sense of humor would prevent people from signing up.  It didn't.  I announced it to every geeky community I am a part of. I had about a 5% conversion rate. 
  3. (Saturday) Built a single page website w/ 3 price points- I can design if I try **REALLY** hard... Fuck that. I hit http://themeforest.net/ like a baws. Grabbed a template I could use. It was a bit too feminine for me but I thought, fuck it, we'll see if it works. Next I got a PayPal pay now button and then threw it all onto Heroku to be hosted for free. Also added Facebook and Twitter buttons so I could have some idea as to whether or not people were excited by the idea.
  4. (Sunday) Visitor Feedback- I began to see tweets from several people that they didn't really understand what the random gifts might be. They had no idea what they were getting into. In response to this I added a thin strip of images of things I could see myself sending to customers.
  5. (Sunday Night) Purchase of mid-price point.

One issue I have heard since then is that my product is more of a luxury item and a lot of the people who really like this idea don't have the money to spend so frivolously. A possible pivot might be to sell something like this to girlfriends who don't know what to give to their geeky boyfriends. Not sure though. I have a couple little businesses and next year's goals will require me to focus intently on one of them. I'm trying to understand which has the most likelihood of succeeding and being something I enjoy being immersed in.

Any advice (from experience) or other thoughts? Leave a comment and let's start a conversation.