How I Came Across A Real Live Data Scientist!
I was fortunate enough to be able to attend this year's Strangeloop conference (http://thestrangeloop.com/). Hilary Mason, data scientist extraordinaire, gave the opening keynote, entitled "Machine Learning: A Love Story". As soon as she said we'd need a little bit of math to get through the presentation, I knew it was gonna be good. After a healthy background on failed attempts at machine learning across the twentieth century, she got into Bayesian statistics and then related it back to her work at bit.ly.
That's when I decided my goal for the weekend was to get her to hack on something, anything, related to data mining with me. Check her out on Twitter (@hmason) or at her website: http://www.hilarymason.com/
Graciously, she agreed and we set up the time and place. We ended up with around ten people in total hacking for about an hour in a small cafe here in St. Louis. I published the final product here: http://github.com/jcbozonier/Strangeloop-Data-Visualization
and Hilary is hosting the visualization here:
That's the background and this is what came of it for me.
Answers Are Easy, Asking the Right Questions Is Hard
I've been self-studying data analysis for a few months in my spare time, and it can be so confusing trying to know what I'm doing right or wrong. It's not like programming, where I can tell if I have the right answer... it's more or less just me deciding whether the answer feels right. That's really hard for me.
By grouping up with Hilary I was hoping to get some insight into her professional workflow and the tools she uses; I also wanted to get a feel for her general approach and mindset for answering a given question with her data-fu.
The question we ultimately decided to work on was "What does the Strangeloop social network look like on Twitter?" In other words, who's talking to whom, and how much? Our shared mental model for the problem was essentially a graph of nodes connected by undirected edges, where an edge indicated that two people had communicated via Twitter. Hilary had already grabbed Protovis, along with a sample of using it to create a force-directed layout, so it was a perfect fit for answering that question.
Three Steps
Today I learned to think about data analysis as three main steps or phases (since the steps can get a little large).
1. Get Data- Get the data. In whatever form is easiest, just gather all of the data you'll need and get it on disk. Don't worry about how nice and neat it is.
2. Prune it- Now you can take that mass of data and start to think about what portions of it you can use. The pruning phase is your chance to trim your data down and focus it a bit. This is where you eliminate all aspects of the data except for the ones you'll want to visualize.
3. Glam it up- Here's where you figure out what you'll need to do to get your data into a visualizable form.
1. Getting Data From Twitter
To get our data I wrote a script that used Twitter's search API to download all tweets containing the hashtag #strangeloop. Since the results are paged, my code had to loop through about 15 pages until it had exhausted Twitter's records.
The code was pretty simple but effective. A minimal sketch of it, assuming Twitter's public search.twitter.com JSON endpoint, looks something like this:
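```ruby
require 'open-uri'
require 'cgi'
require 'json'

# Page through Twitter's search API and save each raw JSON response
# to disk. (The endpoint parameters and file names here are a best
# reconstruction, not the exact original script.)
page = 1
loop do
  url = "http://search.twitter.com/search.json?q=#{CGI.escape('#strangeloop')}&rpp=100&page=#{page}"
  body = open(url).read

  results = JSON.parse(body)['results']
  break if results.nil? || results.empty?

  File.open("strangeloop_page_#{page}.json", 'w') { |f| f.write(body) }
  page += 1
end
```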
There may be errors or corner cases, and that's fine. None of this is code I would unit test until it became apparent that I should. The main task at hand is to get data, and in this case at least, success is binary: it's easy to tell if some part of the code went wrong. Also, I need to be able to work quickly enough that I can stay in the flow of the problem at hand. I'm really just hacking at Twitter, trying to get the data I want into a file on disk. If I have to do part of it by hand, that's fine.
2. Pruning The Data To Fit My Mental Model
I chose to download the data as JSON because I assumed that would be a pretty simple format to integrate with. Now that Ruby 1.9 comes with a JSON module out of the box, it totally was! Well... pretty much.
Once I had downloaded all of the data, I manually massaged each of the 15 JSON result objects to leave behind only their tweets and none of the metadata surrounding the search. Once that was done, I had a file containing 1400-1500 JSON tweet objects in a single JSON array.
Now, during our group session I didn't actually write this portion of the solution. That was David Joyner (follow him on Twitter at @djoyner), and he delivered the end result to Hilary in CSV format via Python. I've recoded it here because there was a bug in the code we wrote to create the data we visualized, and I needed a way to regenerate the data once the bug was fixed. Since I didn't have his Python script, I just opted to rewrite what he had done.
From here I just tried to get the data loaded up into Ruby via the JSON module. I load the saved JSON from disk with code along these lines:
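```ruby
require 'json'

# The file name is whatever you saved the combined tweet array as;
# 'strangeloop_tweets.json' is just a stand-in.
tweets = JSON.parse(File.read('strangeloop_tweets.json'))
puts "Loaded #{tweets.length} tweets"
```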
My approach once again was very hack-oriented: write a little bit of Ruby, verify it worked via the command line, then iterate by adding another step or two and repeating. It's like TDD but with much less thought... just hacking and feeling my way around the problem space.
3. Glamming It Up For Protovis
To recap: so far you've seen me get the data downloaded into a parseable form, David loading it from disk, and then David again doing the original work of pulling the data into a set of undirected edges between people talking to one another. I rewrote that step too, for lack of his code, along with Hilary's code that converted his data into something Protovis could use. To make the graph really interesting, we also decided to add up the number of times a given edge was used, which you'll see being computed in a sketch like this:
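```ruby
require 'json'

tweets = JSON.parse(File.read('strangeloop_tweets.json'))

# Count how many times each undirected pair of users talked.
# 'from_user' and 'text' are the field names the old search API
# returned; the file name is the same stand-in as above.
edge_counts = Hash.new(0)
tweets.each do |tweet|
  sender = tweet['from_user'].downcase
  tweet['text'].scan(/@(\w+)/).flatten.each do |mentioned|
    pair = [sender, mentioned.downcase].sort  # sort so direction doesn't matter
    edge_counts[pair] += 1 unless pair.first == pair.last
  end
end
```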
David Joyner was also kind enough to send me his original Python code, which does essentially the same thing.
The thought was that the more active a person was on Twitter, the more they influenced the network. This could cause someone who was really chatty to get over-emphasized in the visualization but in our case it worked out well.
So OK, we had all of this data, but it wasn't in the form Protovis needed to show our awesome visualization. Hilary figured this out by downloading a sample project from the Protovis website. The data needed to be put in this form:
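```javascript
// Abridged shape of the Protovis force-directed sample data;
// the user names here are just illustrative.
var strangeloop = {
  nodes: [
    {nodeName: "hmason", group: 1},
    {nodeName: "jcbozonier", group: 1}
    // ... one entry per Twitter user ...
  ],
  links: [
    {source: 1, target: 0, value: 3}
    // ... one entry per pair of users who talked ...
  ]
};
```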
If you scroll through that a ways you'll eventually see some data that looks like this:
{source:72, target:27, value:1},
Nice, eh? Those numbers are basically saying: draw a line from the node at index 72 of our list of nodes to the node at index 27. That complicated things a bit, but Hilary got through it with some code I imagine wasn't too dramatically different from this:
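```ruby
# Protovis links refer to nodes by their position in the nodes array,
# so first assign every distinct user an index. (This is my rewrite
# working from the edge_counts hash above, not Hilary's actual code.)
index_for = {}
edge_counts.keys.flatten.uniq.each_with_index do |name, i|
  index_for[name] = i
end

File.open('strangeloop_data.js', 'w') do |file|
  file.puts 'var strangeloop = {'
  file.puts '  nodes:['
  index_for.each_key { |name| file.puts "    {nodeName:#{name.inspect}, group:1}," }
  file.puts '  ],'
  file.puts '  links:['
  edge_counts.each do |(a, b), count|
    file.puts "    {source:#{index_for[a]}, target:#{index_for[b]}, value:#{count}},"
  end
  file.puts '  ]'
  file.puts '};'
end
```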
I basically just create a hash that maps each Twitter user's name to its index number, then look the index up when I'm generating that portion of the file.
Biggest Take Away: Baby Steps
There was definitely a fair amount of work here, and without all of the teamwork we wouldn't have been able to get this done in the 45 minutes it took us. Part of the teamwork was just figuring out what components of work we had in front of us. The three steps I laid out in this article are how I saw us tackling the problem, but there were many other, much more iterative steps I left out.
When I do more data analysis in the future I plan to just work it through piece by piece and not get overwhelmed by all of the different components that will need to come together in the end.
The Other Biggest Take Away: Get Data By Any Means Necessary
It's easy, as a programmer, for me to get bogged down in thoughts of "quality". Even Hilary was apologizing for the extremely hacked-together code she had written. Ultimately though, it really doesn't matter here. The code will not be run continuously, and hell, it may never even be run again! If the code falls apart and blows up, I can quickly rewrite it. I'm my own customer in this sense. I can tolerate errors and I can fix them on the fly. When I'm exploring a problem space, the most important thing for me is to reduce the friction of my thought process. If I think best hacking together code, then awesome. Once I can get my data, I'm done. I don't care about robustness... I just need it to work right now.
I'm harping on this point because it's such a dramatic shift from the way I see production code for my day job. Code I write for work needs to be understood by a whole team, solid against unconsidered use cases, reliable, etc. Code I write to get me data really quick, I just need the data.
While Hilary is a Pythonista, at one point I remember her commenting on programming language choice and saying something to the effect of "It doesn't matter, they all work well." She was so calm about it... it was almost zen-like. After having so many passionate talks about programming languages with other programmers, it was very refreshing to interact with someone who had a definite preference but was able to keep her eye on the prize: the data, and more importantly, the answers the data held.
Next Steps
I'd like to work on a way to tell which of the people I follow on Twitter are valuable and which I should stop following. Essentially a classifier I guess. On top of that I'd like to write another one to recommend people I should follow based on their similarity to other people I do follow (and who are valuable)... We'll see. I've got another project that desperately needs my time right now. If you happen to write this though or know of anyone who has, let me know!