The Three Sexy Skills of Data Geeks

by Michael E. Driscoll | May 27, 2009

Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

While I was prepping for tonight’s talk at the Google I/O Ignite event, this quote inspired me to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.

Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills.  I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas:  statistics, data munging, and data visualization.  (In parentheses next to each, I’ve put the salient character trait needed to acquire it).

Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn. It’s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was developed as recently as 2004). I expect to be on its learning curve my entire life. This being the case, people who possess a solid grasp of modern statistics are rare. And yet problems that require its application continue to multiply. The text that I was exposed to in graduate school, and that I find to be an unparalleled survey, is Hastie, Tibshirani, and Friedman’s Elements of Statistical Learning.

Skill #2: Data Munging (Suffering). The second critical skill mentioned above is “data munging.” In data geek circles (you can find us with a Twitter search for #rstats), this refers to the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real-world data is messy. At best it’s inconsistently delimited or packed into an unnecessarily complex XML schema. At worst, it’s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.

A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even JavaScript). This is problem solving with programming, and quite different from statistics. An aspiration towards elegance (in the form of a perfect XSLT filter, for example) is rarely rewarded, and often punished. A decade ago, I thought that the world’s data would soon be well-structured, and my talent for syntactical incantations of regular expressions would be a moot skill. I was wrong. (Perhaps there’s an analogy with the paper industry: the growing volume of data means we’ll likely need more regular expressions before we need fewer.)
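To make the coffee-to-regex conversion concrete, here is a minimal sketch of taming "inconsistently delimited" data in Python. The sample lines and the three-field layout (name, age, state) are made up for illustration; real files will need a pattern tuned to whatever mess they actually contain.

```python
import re

# Hypothetical sample of "inconsistently delimited" records: the field
# separators vary from line to line (comma, semicolon, pipe, or tab).
RAW_LINES = [
    "alice, 34 ;NY",
    "bob|27|  CA",
    "carol\t41\tTX",
]

# One pattern covering every delimiter seen above, plus surrounding spaces.
DELIMITER = re.compile(r"\s*[,;|\t]\s*")


def parse_line(line):
    """Split one messy line into a (name, age, state) tuple."""
    name, age, state = DELIMITER.split(line.strip())
    return name, int(age), state


records = [parse_line(line) for line in RAW_LINES]
for record in records:
    print(record)
```

Note that this is not elegant: a line with a missing field simply raises an error, which in practice is often what you want, since it points you at the next flavor of mess.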

Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).
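As a rough illustration of slicing and dicing from a scripting language, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and its rows are invented for the example; a real data store would of course already exist.

```python
import sqlite3

# In-memory database standing in for a "persistent data store"; the table
# and column names are invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("west", "widget", 10.0),
        ("west", "gadget", 5.0),
        ("east", "widget", 7.5),
        ("east", "widget", 2.5),
    ],
)

# Slice: keep only widgets. Dice: total the amounts by region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE product = ? GROUP BY region ORDER BY region",
    ("widget",),
).fetchall()

totals = dict(rows)
print(totals)
conn.close()
```

Letting SQL do the filtering and grouping, and keeping Python for the glue, is usually the less painful division of labor.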

And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like 96 nodes of Postgres, snow and Rmpi, Hadoop and MapReduce, and on Amazon EC2 to boot.
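For readers who haven't met MapReduce, here is a toy, single-process sketch of the idea in plain Python: a word count split into map, shuffle, and reduce steps. It is not Hadoop and runs nothing in parallel; the chunk contents and function names are assumptions chosen for illustration. On a real cluster, each chunk would be mapped on a different node.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: each string stands in for a chunk of a large file
# that would live on a different node in a real cluster.
CHUNKS = ["data geeks munge data", "geeks love data"]


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Combine all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)


mapped = [pair for chunk in CHUNKS for pair in map_phase(chunk)]
word_counts = {word: reduce_phase(vals) for word, vals in shuffle(mapped).items()}
print(word_counts)
```

The appeal of the pattern is that map_phase never needs to see more than one chunk, and reduce_phase never needs to see more than one key, which is what makes spreading the work across machines possible.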

Skill #3: Visualization (Storytelling). This third and last skill that Professor Varian refers to is the easiest to believe one has.  Most of us have had exposure to basic chart-making widgets of Excel (and to date myself, tools like Harvard Graphics). But a little knowledge is a dangerous thing: these software tools are often insufficient when faced with the visualization of large, multivariate data sets.

Here it’s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals. The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst’s understanding of the data. These may consist of scatter plot matrices and histograms, where labels and colors are left at minimal defaults. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.
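An exploratory graphic can be very crude and still do its job. As a minimal sketch, the Python below bins some made-up measurements and prints a text histogram with no titles or colors. The values and the bin width are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical measurements; in practice these would come from the data
# set being explored.
VALUES = [1.2, 1.9, 2.1, 2.4, 2.8, 3.3, 3.7, 4.0, 4.1, 7.5]
BIN_WIDTH = 1.0


def histogram(values, bin_width):
    """Count how many values fall in each half-open bin [k*w, (k+1)*w)."""
    counts = Counter(int(v // bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}


bins = histogram(VALUES, BIN_WIDTH)
for left_edge, count in bins.items():
    # Deliberately bare output: just enough to eyeball the distribution.
    print(f"{left_edge:4.1f} | {'#' * count}")
```

Empty bins are skipped here, so the lone value in the 7.0 bin shows up right after the 4.0 bin. Even so, a glance is enough to suggest a hypothesis worth checking: is that point an outlier or a data-entry error?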

The second kind of data visualization is intended to communicate to a wider audience, and its goal is to visually advocate for a hypothesis. While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill, with separate tools. (R is excellent for static visualizations, but cannot compete with the kinds of rich interactive visualizations that tools like Processing and Flare make possible.) Luckily, successful collaboration often occurs between data analysts and designers, the occasional fracas notwithstanding.

The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince, whether the subject is an academic discovery or a business proposal.

Put All Three Skills Together: Sexy. Thus with the Age of Data upon us, those who can model, munge, and visually communicate data (call us statisticians or data geeks) are a hot commodity. I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to nerds seeking revenge. But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted. They even built the social web to prove it. I believe the same could happen to statistics and data geeks too.


(Update Aug-2009: If you liked this post, consider voting for it at the 2010 SXSW Conference.)

comments

46 Responses to “The Three Sexy Skills of Data Geeks”

  1. Michael Barton on May 27th, 2009

    That’s a nice well written post, I enjoyed reading it.
    I think visualisation comes under the heading of communication in general. I like to think I am reasonably decent at creating visualisations in R, but I suffer because I have a poor writing ability.

  2. Kesava M on May 27th, 2009

    Great graphic to illustrate the story :)

  3. Daniel Waisberg on May 27th, 2009

    Hi Michael,

    That was a very interesting post. However, I think it is missing a very important Skill #4: decision-making. It means understanding the business behind the numbers and providing recommendations based on the analysis.

    I have been dealing with numbers for a few years and I keep seeing people that do not understand numbers taking decisions. But I think that the world would be a better place if statisticians understood business so that they could be part of the whole process: Statistics, Data Munging, Visualization, Decision-making.

    Thanks.

  4. Aleks on May 28th, 2009

    Nicely written! My post on this was quite laconic, and lacked your guidelines.

  5. the three sexy skills of data geeks « Thinking Out Loud on May 28th, 2009

    […] the three sexy skills of data geeks By mdinh I saw this post and thought it was interesting – http://dataspora.com/blog/sexy-data-geeks/ […]

  6. The Three Sexy Skills of Data Geeks : Dataspora Blog « Netcrema - creme de la social news via digg + delicious + stumpleupon + reddit on May 28th, 2009

    […] The Three Sexy Skills of Data Geeks : Dataspora Blogdataspora.com […]

  7. Dan on May 28th, 2009

    @Daniel Waisberg:

    I question your assumption that a data geek would inevitably want to work in business. Indeed, I think some of the most interesting data most in need of munging, analysis and visualization is (obviously) biological.

    In fact, I’d argue that the needs of the biology revolution of the last few decades have largely created the “data geek.” (though this might be controversial, because physicists have been doing this stuff since Galileo).

    Visualizing molecules, parsing genomes … for me it was this that really grew the field and made it an obviously important skillset.

  8. O'Reilly Radar on May 28th, 2009

    Four short links: 28 May 2009…

    Viral Epidemics Poised to go Mobile — Albert-Laszlo Barabasi (author of Linked: How Everything Is Connected To Everything Else) modelled mobile phone virus epidemiology for NSF and concluded that (in accordance with experience) no single OS has criti…

  9. Eduardo Flores on May 28th, 2009

    Cool… Inspirational for a data-geek wannabe

  10. Four short links: 28 May 2009 | Tech-monkey.info Blogs on May 28th, 2009

    […] Three Sexy Skills of Geeks — statistics, data munging, and visualization. I’m reading Visualizing Data right now and expect the universe to bury me in bootie before the day is out. “Processing: it’s cheaper than couple’s therapy and you can post pictures of it on the Internet without being fired.” (via mattb on Twitter) Mini-Bash’s Favorite TV Show Is… [Note] […]

  11. Todd S on May 28th, 2009

    Thanks for helping to let the world know how powerful us statisticians and mathematicians are! Very good read.

  12. Jason K. on May 28th, 2009

    Michael: very well put. I especially like how you described Skill #3, because if the analysis is to have an impact it needs to influence others. Storytelling summarizes that idea very nicely.

  13. Hadley on May 28th, 2009

    This is a great list, and these are the skills that I really hope that you’d gain if you did a master’s or PhD in statistics - unfortunately there are currently few statistics departments that cover the full range.

  14. Pete Skomoroch on May 28th, 2009

    Thanks for the mention Mike. As another datapoint, even Rachel Maddow seems to be following data geeks:

    http://twitter.com/maddow/status/1318133327

  15. Andrew on May 29th, 2009

    Really awesome post. I’m so happy I found an alternative to anorexia :)

  16. GerbertGuild » Blog Archive » #32: 11:11 is the New Black on May 29th, 2009

    […] friend passed along a blogpost by Dataspora called “Sexy Data Geeks“.  It’s a great short synopsis of necessary skills for a data geek.  The […]

  17. my so-called blog » links for 2009-05-29 on May 29th, 2009

    […] The Three Sexy Skills of Data Geeks : Dataspora Blog Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it). (tags: visualization statistics visualisation information datamining) […]

  18. My daily readings 05/30/2009 « Strange Kite on May 30th, 2009

    […] The Three Sexy Skills of Data Geeks : Dataspora Blog […]

  19. Robert on May 31st, 2009

    Woot for the insights and the links. Grok out man

  20. Blake Zenger on June 1st, 2009

    For over 20 years I’ve always been sheepish to tell friends and family that one of my favorite classes in college was statistics — but no longer. I’ve got sexy skills… and it feels good! Data visualization even sounds cool.

  21. David Sterry on June 1st, 2009

    This.

    It’s why I read this blog. I like the part about the data munging because I enjoy extracting order from chaos. Regex building is fun and yes coffee is the chief reagent. Thanks.

  22. Sara Wood on June 1st, 2009

    Totally sexy. Great post!

  23. Atul Kulkarni on June 3rd, 2009

    Hi,
    I feel the most important thing in data analysis is the ability to judge the situation with respect to certain aspects of the world to which the data belongs.
    So, without the skills and good knowledge of philosophy all these skills would remain underutilized.

    Regards

  24. Michael Driscoll on June 3rd, 2009

    @Aleks - Thanks for linking to your original article on what introductory statistics should teach. Perhaps it could be called “Statistics in the Wild” versus the “cooking-show” statistics course that is traditionally taught.

    @Dan - For sure, the biological sciences have done a lot to usher in the era of Big Data — in part b/c genomics data (principally sequence and annotations), as far as I can tell, is a lot more heterogeneous than what’s found in other domains (from finance to physics).

    @Daniel and @Atul - I think the common thread you both point out is that having domain expertise is the paramount skill. Several recent articles point out how quants on Wall St. lacked enough domain knowledge — about, as you say, “certain aspects of the world to which the data belongs” — with catastrophic consequences.

  25. Rise of the Data Scientist | FlowingData on June 4th, 2009

    […] up to Varian’s now-popular quote among data fans, Michael Discroll of Dataspora, discusses the three sexy skills of data geeks. I won’t rehash the post, but here are the three skills that Michael […]

  26. Rise of the Data Scientist | weloveyourwalls design blog on June 5th, 2009

    […] up to Varian’s now-popular quote among data fans, Michael Discroll of Dataspora, discusses the three sexy skills of data geeks. I won’t rehash the post, but here are the three skills that Michael […]

  27. Tal Galili on June 5th, 2009

    I loved your post, loved it.
    Great outline and informative links - thank you for writing it.

    P.S.: I also heard Hal Varian talk about the same thing in his interview with EconTalk. A very good listen:
    http://www.econtalk.org/archives/2008/07/varian_on_techn.html

  28. Bogdan on June 7th, 2009

    Didn’t understand how the statistic getting connected to data munging.

  29. The Guardian OpenPlatform DataStore – Just a Toy, or a Trusted Resource? « OUseful.Info, the blog… on June 8th, 2009

    […] the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy […]

  30. Via a comment via a link. The three sexy skills of data geeks. on June 8th, 2009

    […] post the three sexy skills of data geeks is excellent. Here is the concluding paragraph, but read all of it. Put All Three Skills Together: […]

  31. Via a comment via a link. The three sexy skills of data geeks. « Vendorprisey on June 8th, 2009

    […] post the three sexy skills of data geeks is excellent. Here is the concluding paragraph, but read all of it. Put All Three Skills Together: […]

  32. adoption curve dot net » Blog Archive » links for 2009-06-08 on June 8th, 2009

    […] The Three Sexy Skills of Data Geeks : Dataspora Blog Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly: “The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” (tags: data visualisation statistics skills) […]

  33. Duncan Stuart on June 9th, 2009

    Good article - and it implicitly challenges the McKinsey way of having standardised cookie-cutter models to explain the universe (which keeps changing.) I refer to my team as “data stylists” - though life being what it is, we’re often more like barbers: two chairs, no waiting - slice and dice.

  34. Ed on June 25th, 2009

    Hello

    Great article and website.

    I am an English major, and I think I want to be a data geek. I took stats in high school and college, and liked it.

    Right now I’m finishing service in the Peace Corps. I have seen firsthand the damage that untested assumptions, and not collecting the right data, can do. Lots of wasted time and money.

    How can I get these skills? Where do I start?

    thanks

  35. Michael E. Driscoll on June 25th, 2009

    Ed - I would consider pursuing a Masters degree in a field you’re interested in — perhaps Public Policy or Public Health based on your background — in a program that has a quantitative bent.

    I frankly found attending lectures and taking exams painful, but the truth is that you need a good foundation to build on, and formal schooling is often the shortest path to gaining that.

    I’d advise you to search around for scholars or people doing the kind of research that interests you, find out where they teach, and try to take a class from them.

    Here are a few examples of the kind of people that might interest you:

    * Andrew Gelman is a Prof. at Columbia, is a widely regarded data guru, and runs the Applied Statistics Center.

    * Gary King at Harvard is a leader in using quantitative methods in public policy.

    * A good friend out here in SF is Chris Albon a Ph.D. candidate at UC Davis, frequent blogger, and a certifiable data geek.

  36. Ed on June 27th, 2009

    Many thanks for the Information!

  37. Wanted: Data Geek/Visual Designer « Communicating with Health Data on July 5th, 2009

    […] 5, 2009 · Leave a Comment I was motivated to get back to blogging by a recent post from Michael E. Driscoll.  (See his great mashup of scatter plots with Andy Warhol’s Marilyn Monroe.) Driscoll takes a […]

  38. Three Skills of Data Geeks: Redux « Open Data on July 14th, 2009

    […] Skills of Data Geeks: Redux Michael Driscoll recently wrote a nice blog article entitled the Three Skills of the Data Geeks in the Dataspora Blog.  He lists this as studying, data munging and “story-telling”.  […]

  39. The Future Of Work: It’s Data, Baby on August 12th, 2009

    […] post “Three Sexy Skills Of Data Geeks,” which explains statistics, data munging and visualization — or studying, suffering […]

  40. More on a Data-driven World: Linkage « billpetti on August 15th, 2009

    […] post “Three Sexy Skills Of Data Geeks,” which explains statistics, data munging and visualization — or studying, suffering and […]

  41. biopunk.hu » Blog Archive » data scientist on August 24th, 2009

    […] on the blog called Flowing Data, which is worth following in any case. The Dataspora blog, meanwhile, goes into more detail on what is needed for a “data geek” […]

  42. Valdis Krebs on August 28th, 2009

    This client of mine put together those three skills in their first ever social network analysis project — I mentored them a bit, but it was their effort & passion that made it happen…

    http://orgnet.com/slumlords.html

  43. The Three Sexy Skills of Data Geeks « hit & link on August 28th, 2009

    […] Three Sexy Skills of Data Geeks : Dataspora Blog – http://dataspora.com/blog/sexy-data-geeks/ […]

  44. Cactus Acide » » L’observatoire du neuromancien 08/30/2009 on August 30th, 2009

    […] The Three Sexy Skills of Data Geeks : Dataspora Blog […]

  45. SQL is Dead. Long Live SQL. : Dataspora Blog on November 25th, 2009

    […] The Three Sexy Skills of Data Geeks […]

  46. Three Skills of Data Geeks: Redux « Open Data Group on January 1st, 2010

    […] Driscoll recently wrote a nice blog article entitled the Three Skills of the Data Geeks in the Dataspora Blog.  He lists this as studying, data munging and “story-telling”.  […]
