The Three Sexy Skills of Data Geeks
Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
In prepping for tonight’s talk at the Google I/O Ignite event, this quote inspired me to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.
Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it).
Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn. It’s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only recently developed in 2004). I expect to be on its learning curve my entire life. This being the case, people who possess a solid grasp of modern statistics are rare. And yet problems that require its application continue to multiply. The text that I was exposed to in graduate school and find to be an unparalleled survey is Hastie, Tibshirani, and Friedman’s Elements of Statistical Learning.
Skill #2: Data Munging (Suffering). The second critical skill mentioned above is “data munging.” Among data geek circles (you can find us with a Twitter search for #rstats), this refers to the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy. At best it’s inconsistently delimited or packed into an unnecessarily complex XML schema. At worst, it’s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.
A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even JavaScript). This is problem solving with programming, and quite different from statistics. An aspiration towards elegance (in the form of a perfect XSLT filter, for example) is rarely rewarded, and often punished. A decade ago, I thought that the world’s data would soon be well-structured, and my talent for syntactical incantations of regular expressions would be a moot skill. I was wrong. (Perhaps there’s an analogy with the paper industry: the growing volume of data means we’ll likely need more regular expressions before we need fewer.)
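As a minimal sketch of what this looks like in practice, assume a hypothetical file whose three fields are separated by an inconsistent mix of commas, semicolons, pipes, and tabs. The sample lines and the `parse_line` helper are invented for illustration; real data will need its own pattern.

```python
import re

# Hypothetical messy input: the same three fields, delimited three different ways.
RAW_LINES = [
    "alice, 34 ;NY",
    "bob\t27|CA",
    "  carol;41 ,  TX  ",
]

# One or more delimiter characters (comma, semicolon, pipe, tab), plus any
# surrounding whitespace, are treated as a single field separator.
DELIMITER = re.compile(r"\s*[,;|\t]+\s*")


def parse_line(line):
    """Split one messy line into (name, age, state), converting age to int."""
    name, age, state = DELIMITER.split(line.strip())
    return name, int(age), state


records = [parse_line(line) for line in RAW_LINES]
```

A line with the wrong number of fields raises a `ValueError` here, which is usually what you want while proofing data: fail loudly rather than guess.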
Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).
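To illustrate, here is a small sketch using Python’s built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and its rows are hypothetical; the same query pattern would apply to any SQL store.

```python
import sqlite3

# An in-memory database stands in for a real persistent data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("west", "widget", 10.0),
        ("west", "gadget", 5.0),
        ("east", "widget", 7.5),
        ("east", "widget", 2.5),
    ],
)

# Filter to a single product, then total the amounts per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE product = ? GROUP BY region ORDER BY region",
    ("widget",),
).fetchall()
conn.close()
```

Compared with the regular-expression work above, the query is short precisely because the data is already structured.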
And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like 96 nodes of Postgres, snow and Rmpi, Hadoop and MapReduce, and on Amazon EC2 to boot.
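The idea behind MapReduce can be sketched in a few lines, with the caveat that this toy version runs serially in one process and is not Hadoop’s API. The chunks and function names are hypothetical; on a real cluster each chunk would live on a different machine and the map calls would run in parallel.

```python
from collections import Counter
from functools import reduce

# Hypothetical data, already split into chunks as a cluster would shard it.
CHUNKS = [
    "data geeks munge data",
    "geeks visualize data",
]


def map_chunk(chunk):
    """Map step: count words within one chunk, independently of the others."""
    return Counter(chunk.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


partial_counts = map(map_chunk, CHUNKS)
word_counts = reduce(merge_counts, partial_counts, Counter())
```

Because each map call touches only its own chunk, the work can be spread across as many nodes as there are chunks, and only the small partial counts need to travel over the network.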
Skill #3: Visualization (Storytelling). This third and last skill that Professor Varian refers to is the easiest to believe one has. Most of us have had exposure to basic chart-making widgets of Excel (and to date myself, tools like Harvard Graphics). But a little knowledge is a dangerous thing: these software tools are often insufficient when faced with the visualization of large, multivariate data sets.
Here it’s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals. The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst’s understanding of the data. These may consist of scatter plot matrices and histograms, where labels and colors are minimally set by default. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.
A second kind of data visualization is intended to communicate to a wider audience, and its goal is to visually advocate for a hypothesis. While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill, with separate tools. (R is excellent for static visualizations, but cannot compete with the kinds of rich interactive visualizations that tools like Processing and Flare make possible.) Luckily, successful collaboration often occurs between data analysts and designers, the occasional fracas notwithstanding.
The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince: whether it’s an academic discovery or a business proposal.
Put All Three Skills Together: Sexy. Thus with the Age of Data upon us, those who can model, munge, and visually communicate data (call us statisticians or data geeks) are a hot commodity. I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to nerds seeking revenge. But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted. They even built the social web to prove it. I believe the same could happen to statistics and data geeks too.
(Update Aug-2009: If you liked this post, consider voting for it at the 2010 SXSW Conference).
46 Responses to “The Three Sexy Skills of Data Geeks”
That’s a nice well written post, I enjoyed reading it.
I think visualisation comes under the heading of communication in general. I like to think I am reasonably decent at creating visualisations in R, but I suffer because I have a poor writing ability.
Great graphic to illustrate the story
Hi Michael,
that was a very interesting post. However, I think it is missing a very important Skill #4: decision-making. It means understanding the business behind the numbers and providing recommendations based on the analysis.
I have been dealing with numbers for a few years and I keep seeing people that do not understand numbers taking decisions. But I think that the world would be a better place if statisticians understood business so that they could be part of the whole process: Statistics, Data Munging, Visualization, Decision-making.
Thanks.
Nicely written! My post on this was quite laconic, and lacked your guidelines.
[…] the three sexy skills of data geeks By mdinh I saw this post and thought it was interesting – http://dataspora.com/blog/sexy-data-geeks/ […]
[…] The Three Sexy Skills of Data Geeks : Dataspora Blogdataspora.com […]
@Daniel Waisberg:
I question your assumption that a data geek would inevitably want to work in business. Indeed, I think some of the most interesting data most in need of munging, analysis and visualization is (obviously) biological.
In fact, I’d argue that the needs of the biology revolution of the last few decades have largely created the “data geek.” (Though this might be controversial, because physicists have been doing this stuff since Galileo.)
Visualizing molecules, parsing genomes … for me it was this that really grew the field and made it an obviously important skillset.
Four short links: 28 May 2009…
Viral Epidemics Poised to go Mobile — Albert-Laszlo Barabasi (author of Linked: How Everything Is Connected To Everything Else) modelled mobile phone virus epidemiology for NSF and concluded that (in accordance with experience) no single OS has criti…
Cool… Inspirational for a data-geek wannabe
[…] Three Sexy Skills of Geeks — statistics, data munging, and visualization. I’m reading Visualizing Data right now and expect the universe to bury me in bootie before the day is out. “Processing: it’s cheaper than couple’s therapy and you can post pictures of it on the Internet without being fired.” (via mattb on Twitter) Mini-Bash’s Favorite TV Show Is… [Note] […]
Thanks for helping to let the world know how powerful us statisticians and mathematicians are! Very good read.
Michael: very well put. I especially like how you described Skill #3, because if the analysis is to have an impact it needs to influence others. Storytelling summarizes that idea very nicely.
This is a great list, and these are the skills that I really hope that you’d gain if you did a master’s or PhD in statistics - unfortunately there are currently few statistics departments that cover the full range.
Thanks for the mention Mike. As another datapoint, even Rachel Maddow seems to be following data geeks:
http://twitter.com/maddow/status/1318133327
Really awesome post. I’m so happy I found an alternative to anorexia
[…] friend passed along a blogpost by Dataspora called “Sexy Data Geeks“. It’s a great short synopsis of necessary skills for a data geek. The […]
[…] The Three Sexy Skills of Data Geeks : Dataspora Blog Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it). (tags: visualization statistics visualisation information datamining) […]
[…] The Three Sexy Skills of Data Geeks : Dataspora Blog […]
Woot for the insights and the links. Grok out man
For over 20 years I’ve always been sheepish to tell friends and family that one of my favorite classes in college was statistics — but no longer. I’ve got sexy skills… and it feels good! Data visualization even sounds cool.
This.
It’s why I read this blog. I like the part about the data munging because I enjoy extracting order from chaos. Regex building is fun and yes coffee is the chief reagent. Thanks.
Totally sexy. Great post!
Hi,
I feel the most important thing in data analysis is the ability to judge the situation with respect to certain aspects of the world to which the data belongs.
So, without the skills and good knowledge of philosophy all these skills would remain underutilized.
Regards
@Aleks - Thanks for linking to your original article on what introductory statistics should teach. Perhaps it could be called “Statistics in the Wild” versus the “cooking-show” statistics course that is traditionally taught.
@Dan - For sure, the biological sciences have done a lot to usher in the era of Big Data — in part b/c genomics data (principally sequence and annotations), as far as I can tell, is a lot more heterogeneous than what’s found in other domains (from finance to physics).
@Daniel and @Atul - I think the common thread you both point out is that having domain expertise is the paramount skill. Several recent articles point out how quants on Wall St. lacked enough domain knowledge — about, as you say, “certain aspects of the world to which the data belongs” — with catastrophic consequences.
[…] up to Varian’s now-popular quote among data fans, Michael Driscoll of Dataspora, discusses the three sexy skills of data geeks. I won’t rehash the post, but here are the three skills that Michael […]
I loved your post, loved it.
Great outline and informative links - thank you for writing it.
P.S. I also heard Hal Varian talk about the same thing in his interview with EconTalk. A very good listen:
http://www.econtalk.org/archives/2008/07/varian_on_techn.html
I didn’t understand how statistics is connected to data munging.
[…] the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy […]
[…] post the three sexy skills of data geeks is excellent. Here is the concluding paragraph, but read all of it. Put All Three Skills Together: […]
[…] The Three Sexy Skills of Data Geeks : Dataspora Blog Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly: “The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” (tags: data visualisation statistics skills) […]
Good article - and it implicitly challenges the McKinsey way of having standardised cookie-cutter models to explain the universe (which keeps changing.) I refer to my team as “data stylists” - though life being what it is, we’re often more like barbers: two chairs, no waiting - slice and dice.
Hello
Great article and website.
I am an English major, and I think I want to be a data geek. I took stats in high school and college, and liked it.
Right now I’m finishing service in the Peace Corps. I have seen firsthand the damage that untested assumptions, and not collecting the right data, can do. Lots of wasted time and money.
How can I get these skills? Where do I start?
thanks
Ed - I would consider pursuing a Masters degree in a field you’re interested in — perhaps Public Policy or Public Health based on your background — in a program that has a quantitative bent.
I frankly found attending lectures and taking exams painful, but the truth is that you need a good foundation to build on, and formal schooling is often the shortest path to gaining that.
I’d advise you to search around for scholars or people doing the kind of research that interests you, find out where they teach, and try to take a class from them.
Here are a few examples of the kind of people that might interest you:
* Andrew Gelman is a professor at Columbia, is a widely regarded data guru, and runs the Applied Statistics Center.
* Gary King at Harvard is a leader in using quantitative methods in public policy.
* A good friend out here in SF is Chris Albon, a Ph.D. candidate at UC Davis, frequent blogger, and a certifiable data geek.
Many thanks for the Information!
[…] 5, 2009 · Leave a Comment I was motivated to get back to blogging by a recent post from Michael E. Driscoll. (See his great mashup of scatter plots with Andy Warhol’s Marilyn Monroe.) Driscoll takes a […]
[…] Skills of Data Geeks: Redux Michael Driscoll recently wrote a nice blog article entitled the Three Skills of the Data Geeks in the Dataspora Blog. He lists this as studying, data munging and “story-telling”. […]
[…] post “Three Sexy Skills Of Data Geeks,” which explains statistics, data munging and visualization — or studying, suffering […]
[…] post “Three Sexy Skills Of Data Geeks,” which explains statistics, data munging and visualization — or studying, suffering and […]
[…] on the blog called Flowing Data, which is worth following in any case. The dataspora blog, meanwhile, goes into more detail on what is necessary for the “data geek” […]
This client of mine put together those three skills in their first ever social network analysis project — I mentored them a bit, but it was their effort & passion that made it happen…
http://orgnet.com/slumlords.html
[…] Three Sexy Skills of Data Geeks : Dataspora Blog – http://dataspora.com/blog/sexy-data-geeks/ […]
[…] The Three Sexy Skills of Data Geeks […]
[…] Driscoll recently wrote a nice blog article entitled the Three Skills of the Data Geeks in the Dataspora Blog. He lists this as studying, data munging and “story-telling”. […]