The Seven Secrets of Successful Data Scientists
At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.
What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.”
1. Choose The Right-Sized Tool
Or, as I like to say, you don’t need a chainsaw to cut butter.
If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I’ve just endorsed that dull knife called Excel).
In fact, Excel’s and Emacs’ program-by-example keyboard macros can be fantastic tools for quick and dirty data clean-up.
The Data Singularity, Part II: Human-Sizing Big Data
“There are no more promising or important targets for basic scientific research than understanding how human minds… solve problems and make decisions effectively.” - Herbert Simon
In my previous post, I discussed the forces behind what I’m calling The Data Singularity. My basic thesis is that as information-generating processes become more frictionless (as humans are excised from information read-write loops), the velocity and volume of data in the world are increasing, and at an exponential rate.
But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?
The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can’t scale up. But these tools share a common goal: scaling down data, and making it human-sized. That’s the “reduce” part of MapReduce, the single statistic from analysis, or the hundred pixel line from one hundred million events.
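To make "scaling down" concrete, here is a minimal sketch of a reduce step that folds a stream of events into a few human-sized numbers. The event shape `(user, latency_ms)` and the function name `summarize` are assumptions made up for illustration; they are not taken from any particular tool mentioned in this post.

```python
from collections import Counter


def summarize(events, top_n=3):
    """Reduce an iterable of (user, latency_ms) events to a small summary.

    The event layout is hypothetical. Events are consumed one at a time,
    so the input can be a generator that never fits in memory at once.
    """
    counts = Counter()   # events seen per user
    total = 0.0          # running sum of latencies
    n = 0                # running count of events
    for user, latency_ms in events:
        counts[user] += 1
        total += latency_ms
        n += 1
    return {
        "events": n,
        "mean_latency_ms": total / n if n else 0.0,
        "top_users": counts.most_common(top_n),
    }


if __name__ == "__main__":
    sample = [("alice", 10), ("bob", 30), ("alice", 20)]
    print(summarize(sample))
```

However many events flow in, what comes out is a count, a mean, and a short leaderboard: something a person can actually read.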
What’s happening today isn’t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.
VI. How Satellite Data Paralyzed the CIA
In the early 1970s, the CIA began relying more on global satellite reconnaissance imagery for its intelligence operations. But according to one history, this massive, rich data didn’t accelerate the pace of US intelligence: it slowed it down.
Why? Because confronted with this firehose, CIA leaders attempted to analyze every image, chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre. The CIA didn’t adjust their decision-making to this new scale, and they were drowned by it.
Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.
VII. People Still Pull the Big Levers
That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being “data-driven”, I’d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking & humming away, slinging bits and earning money.
But no such company exists. What “data-driven” really means is that the executives & employees use data as inputs for making decisions. Companies may be data-fueled, but they’re people-driven.
VIII. Human-sizing Big Data: Filter & Crunch
The Data Singularity is Here
In this blog post I’ll attempt to sketch the forces behind what I’m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.
In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren’t even at the terminal node of action. International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.
Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage, all of which are dropping exponentially.
The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on). The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.
So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.
But before I discuss these consequences, I’d like to expand on the premise. The world wasn’t always drowning in this data deluge, so how did we get here?
I. Data at the Speed of Speech
SQL is Dead. Long Live SQL!
“The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.”– E.F. Codd, 1969
“Database research has produced a number of good results, but the relational database is not one of them.” – Henry Baker, 1991
Outside of programming language flame wars, few questions raise the hackles of hackers more than: “how should I store my data?”
I will argue here that, as in many such debates, the answer is: it depends on what you’re doing.
While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases has been much exaggerated. E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.
3NF Crusaders vs NoSQL Rebels
While the current critique of relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists. So, to review the players and their positions:
On our right are the relational curmudgeons, the kind of folks who pen manifestos and crusade against NULL values. They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it. They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis. Ideally that means third-normal form, but more liberal branches of the church exist.
How XML Threatens Big Data
Confessions from a Massive, Nightmarish Data Project
Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like Genbank. The problem was that the repositories, each holding millions of records, all had their own formats. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.
So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.
Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM.
We had a deadline to meet (and, mon dieu, a 35 hour work-week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
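For illustration only, here is a minimal Python sketch of the same idea: stream records out of an XML file one at a time and emit flat tab-delimited rows ready for a bulk loader. This is not the original Perl, and the element names (`record`, `id`, `taxon`, `sequence`) and the function name `xml_to_tsv` are hypothetical stand-ins.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_tsv(xml_source, out, record_tag="record",
               fields=("id", "taxon", "sequence")):
    """Write one tab-delimited row per <record> element; return the row count.

    xml_source is a filename or a binary file object. The tag and field
    names are assumptions for this sketch, not a real repository schema.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    # iterparse yields each element as its closing tag is read, so the
    # whole document tree never has to be built before work begins.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            writer.writerow([(elem.findtext(f) or "").strip() for f in fields])
            count += 1
            elem.clear()  # release this record's children once written
    return count


if __name__ == "__main__":
    sample = (
        b"<records>"
        b"<record><id>1</id><taxon>Homo sapiens</taxon><sequence>ACGT</sequence></record>"
        b"<record><id>2</id><taxon>Mus musculus</taxon><sequence>TTGA</sequence></record>"
        b"</records>"
    )
    buf = io.StringIO()
    xml_to_tsv(io.BytesIO(sample), buf)
    print(buf.getvalue(), end="")
```

The flat output gives up nesting and self-description, but each row can be parsed with a single split on tabs, which is the kind of input bulk loaders tend to handle quickly.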
Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.
In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. Finally, I generalize to three rules that advocate a more liberal approach to data.
The Rise of the Data Web
The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents, and human actors, will play an ever smaller role.
As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.
Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. Similarly, we’ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.
The web we experience will continue to be dominated by documents: e-mail, blogs, and news. And while many sites are data-centric (Google Maps, Weather.com, and Yahoo Finance), it’s the web that we can’t see that is surging with data. It’s not about us, it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.
But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.
The Three Sexy Skills of Data Geeks
Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
In prepping for tonite’s talk at the Google IO Ignite event, this quote inspired me to muse about how sex appeal and statistics might go together: so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.
Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it.)
Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn.
Dataviz Salon SF #2: Maps, Grammars, & Models
A few nights ago the talented folks at Stamen Design hosted us at their studios for our second dataviz salon in San Francisco. (Special thanks to Tom Carden and Michal Migurski for inviting us). Four talks were given, which I’ll review in turn.
- Stamen: Reaching through Maps
- Protovis: A Declarative, Open Source Graphical Toolkit
- A Mathematician’s View: A Visualization is a Hypothesis
- UUorld: Multidimensional Extrusion Maps
Stamen: Reaching through Maps
Eric Rodenbeck (Stamen) started by highlighting several mapping visualizations that Stamen has been hacking on recently and in the past, including Cabspotting in San Francisco, Crimespotting in Oakland, and Olympic Stadium spotting in London.
Eric showed how Stamen has attempted to move away from what Schuyler Erle has dubbed “red dot fever”, whereby the overlaid data can overwhelm our visual attention, and toward allowing various data layers to “reach through” the maps.
For example, the London Olympic maps provide a mixture of schematic, satellite, and webcam images. These various drill-downs of detail are not all exposed, but rather collaged. Even more interesting was a movable ‘lens’ that, as it is moved over regions of a map, reveals another layer (reminiscent of a polarized-light based mural at Boston’s MoS). In these ways, additional layers of data are only selectively brought into focus (echoing a design pattern in Japanese gardening, mie gakure, meaning “seen and unseen”).
Color: The Cinderella of dataviz
“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.” — Envisioning Information, Edward Tufte, Graphics Press, 1990
Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.
Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.
While obsessing about how to implement color on the Dataspora Labs’ PitchFX viewer I began with a basic motivating question: Read more
People who love scatter plots & connecting dots

We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop Shane Booth, dataviz wiz Lee Byron, computational journalist Brad Stenger, data wrangler Pete Skomoroch, and any/all data enthusiast Brendan O’Connor.
I was going to blog all about it — but Tom Carden of Stamen Design already has a great write-up.
… Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to the following:
