John Snow and Big Data; Now, NASA and Really Big Data

In 1854 John Snow was living in a London that was literally awash in sewage. Cholera epidemics, which we now know to be caused by a specific pathogen in drinking water, were both routine and catastrophic. They were so common at the time that the city’s population growth had flattened out, and the likelihood of an infant reaching the age of three was less than one in five.

Londoners were helpless to stop this terrible scourge. The germ theory was not yet established, and physicians continued to consider “miasmas,” or toxic vapors, the source of cholera. Snow was skeptical of the miasma theory, and his approach to the 1854 epidemic was nothing short of brilliant; it was early “big-data” science. Without the benefit of a real working hypothesis, Snow began to map the geographic locations of deaths from cholera, and discovered a large cluster of deaths at a certain point on Broad Street, in the St. James Parish region of London. In collaboration with a local physician and clergyman, he tracked the source of the outbreak to a water source (the famous Broad Street pump), which in turn had been cross-contaminated by a nearby cesspool. That cesspool was used by the family of no less than the index case of the cholera outbreak. Snow and colleagues had, with a map of London and some colored pins, found the source of the cholera outbreak, and its mechanism of disease, without the benefit of the first principle of what we now know to be microbiology.

One of the great handicaps of reading history is that we don’t fully appreciate that the people living through a historical event didn’t know how the story would end. It took a good while for the scientific community to put it all together, but they did. Snow went on to “quantify” the anesthetics used in that time (ether and chloroform) in ways that made their administration far safer and more effective.

This summer NASA announced a public-private crowdfunding effort, part of a larger “Asteroid Grand Challenge,” “to detect, map, and classify objects in space that may be an impact risk for our planet.”

This comes despite the fact that no viable means is currently available to change the trajectory of a large space object. There are really two problems here, of course: first find the object, a non-trivial objective in large-data parsing, and then figure out how to exploit it. But that hasn’t stopped people from crowdfunding an asteroid mining project. It is classic Yankee entrepreneurialism: what doesn’t kill you might make you rich.

The whole thing has a bit of John Snow about it.

What is interesting about big data, at its core, is that it often sidesteps the hypothesis stage, and sometimes even the end-game. It’s not trivial science. Nobody working in meteorology, astronomy, or bioinformatics sniffs at the concept of Big Data. But casual readers often confuse “Big Data” with data-intensive computation. Big Data can be computationally intense, but it is more than that. It is an attempt to find the solution without really having a hypothesis or working model. And that’s as revolutionary as John Snow taking the handle off the Broad Street pump.

For a good primer on “Big Data” science, I recommend “The Fourth Paradigm,” by Hey and Tansley.

Or, for the skeptics, try a nice piece in the New Yorker by Gary Marcus, “Steamrolled by Big Data.”

Either way, it’ll be fun to see how it turns out.

—Rob Carnes