This is not yet complete or proofread... I could use any help I can get with that.
The latest version:
Over the past couple of weeks I learned about two of the more popular data clustering algorithms: K-Means Clustering and Density Based Clustering (loosely conforms to DBSCAN). In this post I'll show my own implementation of DBSCAN and how I used it to split images into separate files, each containing assorted features from the original image. Here's a quick example to better explain: my algorithm identified all of the white space in the Cheezburger logo and separated it into a new image file. It looks like this:
- Smallest Cluster- The minimum number of points necessary to consider a group of points a cluster.
- Cluster Distance- The maximum distance between two points for them to be "dense" enough to be considered for inclusion into a cluster.
- Take each unvisited image point in the original image space
- Mark the point as visited and then find all neighboring points
- Neighboring points are found by calculating the distance between the given point and all other points and only keeping the ones that are within the Cluster Distance.
- Then, if we have more neighboring points than the Smallest Cluster size we've found ourselves a new cluster!
- Create a new cluster and add it to our list of clusters and then also add the given image point to the new cluster.
- Now we can expand our newly created cluster!
- From here we start exploring every image point our given image point is densely connected to. Basically, it's time to follow the bread crumbs to the end of the road.
- Now let's go to all of the neighboring points we haven't visited yet and add them to our "connected image points to be examined" list.
- Using this list we'll keep track of the points we started with as well as any new ones that need to be examined. As we find points that meet our criteria we'll be adding them to our current cluster
- We also need to mark these points as having been queued to be visited so that they aren't added to our list multiple times and waste time.
- Now, while we still have connected image points to be examined
- Grab one of them and if it hasn't already been visited, let's start our visit!
- Start the visit by marking the image point as being visited
- Add this point to our cluster
- Find all of the neighbors of this image point (which is itself a neighbor of the image point we picked up in step 1). Complicated much?
- Add all of these newest neighboring image points to our "connected image points to be examined" list IF:
- They have not already been visited
- They are not already queued for a visit
- There are more of them than the Smallest Cluster size.
- As we add the image points to the "connected image points to be examined" list, mark them as "Queued for Visit".
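The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the implementation from this post: the names `neighbors_of` and `density_cluster` are made up for the example, points are assumed to be plain (x, y) tuples, and "enough neighbors" is assumed to mean "at least the Smallest Cluster size, counting the point itself", which is the usual DBSCAN convention.

```python
from math import hypot


def neighbors_of(points, index, cluster_distance):
    """Brute force: indices of all points within cluster_distance of points[index]."""
    px, py = points[index]
    return [i for i, (x, y) in enumerate(points)
            if hypot(x - px, y - py) <= cluster_distance]


def density_cluster(points, cluster_distance, smallest_cluster):
    """Return a list of clusters, each a list of indices into points."""
    visited = [False] * len(points)
    queued = [False] * len(points)
    clusters = []
    for start in range(len(points)):
        if visited[start]:
            continue
        visited[start] = True
        neighbors = neighbors_of(points, start, cluster_distance)
        if len(neighbors) < smallest_cluster:
            continue  # assumption: too sparse to seed a cluster
        cluster = [start]
        clusters.append(cluster)
        # The "connected image points to be examined" list.
        to_examine = []
        for n in neighbors:
            if not visited[n] and not queued[n]:
                queued[n] = True
                to_examine.append(n)
        while to_examine:
            current = to_examine.pop()
            if visited[current]:
                continue
            visited[current] = True
            cluster.append(current)
            current_neighbors = neighbors_of(points, current, cluster_distance)
            # Only keep following the bread crumbs from dense points.
            if len(current_neighbors) >= smallest_cluster:
                for n in current_neighbors:
                    if not visited[n] and not queued[n]:
                        queued[n] = True
                        to_examine.append(n)
    return clusters
```

Like the steps it follows, this sketch never adds an already-visited point to a later cluster, so a point first judged too sparse stays out even if it sits on the edge of a cluster found afterwards. Textbook DBSCAN would pick such border points up.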
This line:
image_points.Where(x => x.DistanceTo(image_point) <= cluster_distance).ToArray()
is where the bottleneck lies (really within the Where expression, which measures the distance from the given point to every other point on every call). If anyone has any concrete tips on how I could cache that data to decrease the run time, I'm all ears. (EDIT: Apparently using a KD-Tree helps significantly with nearest neighbor searches like this. I foresee yet another blog post coming soon!)
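To give a feel for why a KD-Tree helps, here is a small Python sketch of a two-dimensional tree with a radius query. It is only an illustration under my own assumptions: `build_kd_tree` and `points_within` are hypothetical names, points are (x, y) tuples, and the tree is stored as nested tuples rather than anything tuned for speed.

```python
from math import hypot


def build_kd_tree(points, depth=0):
    """Build a 2-D tree as nested tuples: (point, left_subtree, right_subtree)."""
    if not points:
        return None
    axis = depth % 2  # alternate between splitting on x and on y
    ordered = sorted(points, key=lambda p: p[axis])
    middle = len(ordered) // 2
    return (ordered[middle],
            build_kd_tree(ordered[:middle], depth + 1),
            build_kd_tree(ordered[middle + 1:], depth + 1))


def points_within(node, target, radius, depth=0, found=None):
    """Collect every stored point within radius of target."""
    if found is None:
        found = []
    if node is None:
        return found
    point, left, right = node
    if hypot(point[0] - target[0], point[1] - target[1]) <= radius:
        found.append(point)
    axis = depth % 2
    delta = target[axis] - point[axis]
    # Descend into a side only if the search circle can reach it;
    # skipping whole subtrees is where the time is saved.
    if delta <= radius:
        points_within(left, target, radius, depth + 1, found)
    if delta >= -radius:
        points_within(right, target, radius, depth + 1, found)
    return found
```

The tree is built once up front, and each neighbor lookup then walks only the branches that overlap the search radius instead of measuring the distance to every point.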
As a final test I ran the algorithm over this meme:
And here are some samples of the images it was broken into:
A slice of the pie chart:
The meme watermark:
That's all! Thanks for reading!
Have you ever seen a large table of data and thought, "Somebody just vomited numbers on my monitor"? Sometimes it can be very hard to make sense of tables like this, even more so when the tables are large. There is at least one machine learning technique that will help you make sense of the data: Classification Trees.
- You'll notice we're building a tree from this hierarchy. Trees, and many of the solutions that involve them, are self-similar, meaning whatever you do at one level of the tree you'll need to do at every level. In this case, that would be partitioning our dataset into many sub-datasets.
- The recursion depth is fairly well limited by the context of the problem. You'll only ever run into a recursion limit on datasets with columns that number in the thousands. If that's you, you'll need to unwind the recursion into a stack-based design. Have fun! :)
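As a rough illustration of that self-similar partitioning, here is a Python sketch. It rests on assumptions of mine: rows are dictionaries, columns are split in the order given, and `partition` and `build_tree` are hypothetical names. A real classification tree would also pick the next column to split on using some purity measure, which this sketch skips.

```python
from collections import Counter, defaultdict


def partition(rows, column):
    """Split rows into sub-datasets keyed by the value found in column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def build_tree(rows, columns, label):
    """Apply the same partitioning step at every level of the tree."""
    labels = Counter(row[label] for row in rows)
    # Stop when there is nothing left to split on or every row agrees.
    if not columns or len(labels) == 1:
        return dict(labels)
    first, rest = columns[0], columns[1:]
    # Each level consumes one column, so the recursion depth is bounded
    # by the number of columns.
    return {(first, value): build_tree(subset, rest, label)
            for value, subset in partition(rows, first).items()}
```

Leaves are label counts, and inner nodes map a (column, value) pair to the subtree built from the matching sub-dataset.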
- Hadoop is huge in the data mining space. Like HUGE.
- Data scientists get overly fixated on playing with their data like programmers do on coding. It wasn't caught until the company was basically about to die. BOTH times. Three solutions:
- Appoint a kind of "canary" who isn't emotionally involved. Listening to the canary becomes the next hard problem.
- Have a hypothesis before diving into data!
- Data scientists need to be "Scrappy". For coders "Hacker" is a synonym.
- Analyze- Take the time to understand your model and look at the data. No black boxes.
- Anticipate- Build a data viewer and proactively look for bugs. Bugs are the enemy. STOP THEM.
- Improvise- "Don't indulge in any unnecessary, sophisticated moves..." -Bruce Lee
- Adapt- Error data is GREAT data. Don't just give up... Understand.
- What's a "Data Scientist"?
- The Venn diagram:
- Real time data- Event oriented queries via Esper. Your algorithms shouldn't require rerunning the whole calculation on new data.
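To make the "don't rerun the whole calculation" point concrete, here is a tiny Python sketch of a statistic that updates one event at a time. It is not Esper and has nothing to do with its query language. It is just Welford's running mean and variance, and `RunningStats` is a name I made up for the example.

```python
class RunningStats:
    """Mean and population variance that update one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        """Fold one new value in without revisiting any earlier values."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0
```

Each new event costs a handful of arithmetic operations, no matter how much data has already been seen.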
And last but not least! Mind maps!
Lately I've been playing in R and have my own R statistics lab day every Saturday. Well, today I downloaded Seattle's data set of more than 438k 911 police responses for the city and plotted them in R. You can find the data I used and more here: http://data.seattle.gov/
Here is the plot I created in R from this data:
First, compare my plot to this Google Map:
I've read about Facebook doing something similar and seeing maps formed but I've never experienced it for myself. It was awesome to see it happen.
Also (and very unsurprisingly) notice how the darkest area with the most incidents is in downtown and how abruptly the density changes at the city limits.
Here's how to do it.
- Download this specific dataset in CSV format: http://data.seattle.gov/Public-Safety/Seattle-Police-Department-911-Incident-Response/3k2p-39jp
- Load it into R by executing the following command: police_911 <- read.csv(file.choose())
- Find where you downloaded the file and open it
- Then I always use summary to quickly look at the data and get my bearings: summary(police_911)
- Notice the Longitude and Latitude columns? Already neatly parsed for us! Time to plot!
- plot(police_911$Longitude, police_911$Latitude, pch=20, cex=.01)
That's it! The pch and cex parameters allow me to set the point shape and size respectively. By executing length(police_911$Latitude) I can find out how many rows there are in that column... 438,512. Scheisse!
Well that's it. Just a quick update about something I found to be interesting.
Update: Here's a map with Fire responses added to it as well. Blue is police and red is fire.
Software Craftsmanship, SOLID principles, eXtreme Programming, the list of all the "best practice" guides I've learned over the years goes on and on. They are all extremely valuable to me. They also need to make room for a new style of development being ushered in by the ideas expressed in The Lean Startup. The best way to label it is as "the way Bozo is coding for now".
- Formulate a hypothesis AKA have a testable opinion. I am nothing if not opinionated.
- Design a test, i.e. let's move this thing here and that thing there. List independent variables, dependent variables, covariates, and expected benefits AKA $BLING$. Make an assumption that a certain segment of my customer base is representative (enough) of my full customer base.
- Code up something that only works for that segment... maybe it even only half works. Fuck any kind of automated testing unless it makes this step faster.
- Test what's been written and decide if it's "Good Enough" to get us reliable results.
- Release the test to production
- Once the test is over, delete all the code from production.
- Only run this on non-mobile browsers
- Only run in Safari
- Only run if this user came from page "http://fookyou.com/fookme"
- Your test is probably too damn big and testing more than isolated changes. (Notice the italics. #GuidelinesNotSteadfastRules)
- Remember step three? You wrote that code in a shitty unsustainable manner... BUT. You'll be able to rationalize taking the time to refactor after this. Lemme tell you why.
- Red- Think about what you're doing enough to understand how you can test it.
- Green- Find out it makes you money and delete the code.
- Refactor- Write the code well and in a disciplined manner knowing that you've proven it's worth the time.
What is event sourcing (from a Bozo's perspective)?
- A data model that integrates logging- I step through exactly what my user did.
- Optimizations become views of data- Is calculating your user's state too slow? Need SQL? No problem. Run through all of the events and store the results in OR tables for fast querying. Think of it as a cache. When a new event is added that invalidates the cache, play that event against your SQL tables as well and you're good to go.
- Migrations may be easier(?)- Did your conceptual or data model change? No problem, blow away those SQL tables and build new ones based off of the events.
- A data model that can handle multiple changes to the same data at the same time with no information loss. Because we're just collecting events, the UI may not show the data, but it's still in the system and able to be retrieved if need be.
- Undo supported by default- Make a change but want to take it back? Go back to a previous moment in time demarcated by... EVENTS. :)
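Here is a toy Python sketch of those ideas, from a Bozo's perspective. Everything in it is an assumption made for illustration: the `EventStore` class, the `apply_event` function, and the two event kinds (`field_set` and `field_cleared`) are invented, and the "view" is a plain dictionary standing in for the SQL tables.

```python
def apply_event(state, event):
    """Play one event against a state dictionary."""
    if event["kind"] == "field_set":
        state[event["field"]] = event["value"]
    elif event["kind"] == "field_cleared":
        state.pop(event["field"], None)
    return state


class EventStore:
    def __init__(self):
        self.events = []  # append-only log: nothing is ever overwritten
        self.view = {}    # cached current state, standing in for SQL tables

    def append(self, kind, **data):
        event = {"kind": kind, **data}
        self.events.append(event)
        apply_event(self.view, event)  # keep the cache in step with the log

    def replay(self, upto=None):
        """Rebuild state from the first upto events (all of them if None)."""
        state = {}
        for event in self.events[:upto]:
            apply_event(state, event)
        return state

    def rebuild_view(self):
        """Migration: throw the cached view away and rebuild it from events."""
        self.view = self.replay()
```

Undo falls out of `replay`: asking for the state after the first two events gives the view as it stood before the third one happened, and the third event is still sitting in the log if it is ever needed.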
I launched a small business called http://GeekRations.com and got a paying customer in a weekend. Full disclosure, that was my only customer. I'm still learning and likewise I'm not completely sure how to go about finding my next one. This was still a huge milestone for me this year. A paying customer. That was my goal.
You'll see a lot of parallels to The Lean Startup in this and that's totally cool. I'm not, however, trying to adhere to some methodology. Experience has taught me that that's the Wrong Thing. The zen I've been able to pick out of The Lean Startup and Customer Development is really just that a solid business model is testable. It's not a black art. There's no reason or excuse to go months building a product or service without talking to potential customers. For developers, you think these ideas are just for pointy-haired business people? I bet you've tried to start your own OSS project and garner some community support but couldn't. This shit applies to you too. We all care if our work is valuable, and we all want to know ASAP if it isn't so we don't waste our time.
Let's learn what we need to learn and iterate. Let's admit that no one cares and ask them what they do care about. Let's expect customers to buy now and when they don't let's ask them why.
Here's what I did:
- (Friday Night) Picked a market- I'm a programmer and a geek. All of my friends are. I have 400+ followers on Twitter who are as well. I also spend time on HackerNews which is mainly geeks.
- (Friday Night) Gauged interest- I put up the most cheesy generic Unbounce landing page possible. A couple sentences about my idea and split tested two pages. I wanted to know if my sense of humor would prevent people from signing up. It didn't. I announced it to every geeky community I am a part of. I had about a 5% conversion rate.
- (Saturday) Built a single page website w/ 3 price points- I can design if I try **REALLY** hard... Fuck that. I hit http://themeforest.net/ like a baws. Grabbed a template I could use. It was a bit too feminine for me but I thought fuck it we'll see if it works. Next I got a PayPal pay now button and then threw it all onto Heroku to be hosted for free. Also added Facebook and Twitter buttons so I could have some idea as to whether or not people were excited by the idea.
- (Sunday) Visitor Feedback- I began to see tweets from several people that they didn't really understand what the random gifts might be. They had no idea what they were getting into. In response to this I added a thin strip of images of things I could see myself sending to customers.
- (Sunday Night) Purchase of mid-price point.
One issue I have heard since then is that my product is more of a luxury item and a lot of the people who really like this idea don't have the money to spend so frivolously. A possible pivot might be to sell something like this to girlfriends who don't know what to give to their geeky boyfriends. Not sure though. I have a couple little businesses and next year's goals will require me to focus intently on one of them. I'm trying to understand which has the most likelihood of succeeding and being something I enjoy being immersed in.
Any advice (from experience) or other thoughts? Leave a comment and let's start a conversation.