The Three Sexy Skills of Data Geeks
Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
While prepping for tonight’s talk at the Google I/O Ignite event, I was inspired by this quote to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.
Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more to do with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it.)
Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn.
Dataviz Salon SF #2: Maps, Grammars, & Models
A few nights ago the talented folks at Stamen Design hosted us at their studios for our second dataviz salon in San Francisco. (Special thanks to Tom Carden and Michal Migurski for inviting us). Four talks were given, which I’ll review in turn.
- Stamen: Reaching through Maps
- Protovis: A Declarative, Open Source Graphical Toolkit
- A Mathematician’s View: A Visualization is a Hypothesis
- UUorld: Multidimensional Extrusion Maps
Stamen: Reaching through Maps
Eric Rodenbeck (Stamen) started by highlighting several mapping visualizations that Stamen has been hacking on, both recently and in the past, including Cabspotting in San Francisco, Crimespotting in Oakland, and Olympic Stadium spotting in London.
Eric showed how Stamen has attempted to move away from what Schuyler Erle has dubbed “red dot fever”, whereby the overlaid data can overwhelm our visual attention, and toward allowing various data layers to “reach through” the maps.
For example, the London Olympic maps provide a mixture of schematic, satellite, and webcam images. These various drill-downs of detail are not all exposed, but rather collaged. Even more interesting was a movable ‘lens’ that, as it is moved over regions of a map, reveals another layer (reminiscent of a polarized-light based mural at Boston’s MoS). In these ways, additional layers of data are only selectively brought into focus (echoing a design pattern in Japanese gardening, mie gakure, meaning “seen and unseen”).
Color: The Cinderella of dataviz
“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.” — Envisioning Information, Edward Tufte, Graphics Press, 1990
Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.
Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.
While obsessing about how to implement color on the Dataspora Labs’ PitchFX viewer, I began with a basic motivating question:
People who love scatter plots & connecting dots

We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop Shane Booth, dataviz wiz Lee Byron, computational journalist Brad Stenger, data wrangler Pete Skomoroch, and any/all data enthusiast Brendan O’Connor.
I was going to blog all about it — but Tom Carden of Stamen Design already has a great write-up.
… Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to the following:
How Google and Facebook are using R

(March 26th Update: Video now available)
Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF.
The panel comprised four recognized R users from industry:
- Bo Cowgill, Google
- Itamar Rosenn, Facebook
- David Smith, Revolution Computing
- Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)
The panelists were asked to explain how they use R for predictive analytics within their firms, to assess its strengths and weaknesses as a tool, and to provide a case study. What follows is my summary with comments.
Is Big Data at a tipping point?
(5/18/09 update - included an overdue reference to linked data!)
Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks (what he calls phase transitions) by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons. Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?
It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads; it emerges suddenly. This rapid phase transition, from relatively unconnected to mostly connected, occurs right around the point where we have about half as many threads as buttons (see figure). This is the tipping point of the system: where a few threads make a big difference.
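The button experiment is easy to try for yourself. What follows is a minimal sketch, not Kauffman’s own method: it assumes each thread joins two distinct buttons picked uniformly at random, and it tracks clusters with a union-find structure. The function name `largest_cluster_fractions` and its parameters are made up for this illustration.

```python
import random


def largest_cluster_fractions(n_buttons=400, n_threads=800, seed=0):
    """Tie random threads between buttons, one at a time.

    Returns a list whose i-th entry is the share of all buttons that
    sit in the largest cluster after i + 1 threads have been tied.
    """
    rng = random.Random(seed)
    parent = list(range(n_buttons))  # union-find parent pointers
    size = [1] * n_buttons           # cluster size, valid at each root

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    largest = 1
    fractions = []
    for _ in range(n_threads):
        a, b = rng.sample(range(n_buttons), 2)  # two distinct buttons
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the smaller cluster under the larger one.
            if size[root_a] < size[root_b]:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
            size[root_a] += size[root_b]
            largest = max(largest, size[root_a])
        fractions.append(largest / n_buttons)
    return fractions


if __name__ == "__main__":
    fractions = largest_cluster_fractions()
    for threads in (100, 200, 300, 400, 800):
        print(threads, "threads:", round(fractions[threads - 1], 2))
```

Running it with a few different seeds should show the largest cluster staying small while threads are scarce, then swallowing most of the floor soon after the thread count passes roughly half the button count. The exact numbers vary from run to run.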
A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes, from sales to customer service to shipping, have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action (sales lead, mouse click, shipping update) is stored. The result: organizations are overwhelmed by what feels like a tsunami of data.
The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data is being unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.
At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because the lack of shared format and metadata standards makes it hard for data to flow from one place to another: comparing the SEC’s financial data with that of Europe requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).
Yet there’s a slow thaw underway, as evidenced by a number of initiatives: Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, the Tim Berners-Lee-inspired LinkedData.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater.
How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?
What can Darwin’s finches tell us about the downturn?
Newspaper articles paint the markets in metaphors like “difficult climate” and “harsh landscape”, but these clichéd phrases have a kernel of truth. Thinking about markets as natural environments reveals that selective forces are at work. It also predicts when they work. In the natural world, as the story of Darwin’s finches tells us, selection acts in times of crisis: drought, famine, and disease. For our markets, that time is now.
(Aside: I confess that relating the economic crisis to Darwin is a symptom of an academic bad habit: namely, mapping every phenomenon onto the intellectual giant of one’s field. Somewhere there is a psychologist blogging about Freud and the economy).
What I’ll be presenting at O’Reilly Money Tech 2009
(April 2009 Update: Unfortunately, The Money Tech Conference was indefinitely postponed, but fortunately I will be presenting a version of this talk in July at OSCON 2009).
I’ve been invited to speak at O’Reilly’s Money Tech conference this coming February 4-6 in New York City and thought I’d share the abstract for my talk here. I’ll likely be in New York for several days, so if you’d like to get together to chat about data, drop me a line!
My talk is entitled “Open Source Analytics: Visualization and Predictive Modeling of Big Data with the R Programming Language”
How do you measure a major league slugger?
I gave a talk last month at SAP Labs in Palo Alto, along with Jim Porzak of ResponSys, introducing the R Statistical Language to a Business Intelligence interest group. The goal was to highlight how open source tools, like R, can be used to build predictive models. The example I gave centered around baseball and a simple question: how do you measure a baseball slugger?
Michael Lewis, in Moneyball, described how the baseball analyst Bill James was frustrated by the fact that major league hitters were consistently rated by their batting averages. James wrote:
The case for open source data visualization
When I was in graduate school, the most closely studied part of the scientific publications we read was not the results, but the methods sections. (It was also, incidentally, often the hardest section to write for one’s own publications.) Methods sections are wonderful because they allow you to verify that someone else’s work is correct — by reproducing it yourself. But more importantly, methods sections allow you to build upon the work of others. They are the open source code of science.
Unfortunately, for all but a small fraction of data visualizations on the web, there are no methods sections being published. This is a shame, because it slows the free flow of ideas and prevents the creative extension of other people’s work.
Three conditions must be met for a data visualization to be considered open and reproducible:


