Mining the web to predict the future: what about ‘long data’?

A week ago Wired ran an interesting opinion piece by Samuel Arbesman, an applied mathematician and network scientist, on why we shouldn’t ignore the value of long data in the era of big data. I recommend reading the entire piece, but I have included some highlights in this post.

On what he means by long data:

But no matter how big that data is or what insights we glean from it, it is still just a snapshot: a moment in time. That’s why I think we need to stop getting stuck only on big data and start thinking about long data. By “long” data, I mean datasets that have massive historical sweep — taking you from the dawn of civilization to the present day. The kinds of datasets you see in Michael Kremer’s “Population growth and technological change: one million BC to 1990,”

What the value of long data is:

So we need to add long data to our big data toolkit. But don’t assume that long data is solely for analyzing “slow” changes. Fast changes should be seen through this lens, too — because long data provides context. Of course, big datasets provide some context too. We know for example if something is an aberration or is expected only after we understand the frequency distribution; doing that analysis well requires massive numbers of datapoints.

Big data puts slices of knowledge in context. But to really understand the big picture, we need to place a phenomenon in its longer, more historical context.
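
As a small aside on the frequency-distribution point in that quote: deciding whether an observation is an aberration or simply expected is, in code, just a matter of placing it within the empirical distribution of past values, and that only works well with lots of history. The numbers below are invented for illustration; this is a minimal sketch, not anything from the article.

```python
# Purely illustrative numbers: a long run of historical observations and one
# new value we want to judge against them.
import numpy as np

rng = np.random.default_rng(42)
historical = rng.normal(loc=100, scale=15, size=100_000)

def empirical_percentile(history, value):
    """Fraction of historical observations at or below `value`."""
    return float(np.mean(history <= value))

new_observation = 155
p = empirical_percentile(historical, new_observation)
print(f"{new_observation} sits at the {p:.1%} point of the historical distribution")
# A value far out in either tail (say beyond the 99.9th percentile) looks like
# an aberration; one near the middle is just expected variation. The judgement
# only becomes reliable with a large number of historical datapoints.
```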

I like the idea of adding context to big data by placing more current datasets within larger historical ones. It fits the general understanding that we can only think about the future if we understand our past.

The idea of long data actually underpins some recent developments reported by the New York Times. Last week it was announced that researchers from Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths, and maybe even prevent them. I am aware that 22 years is not a look back at the ‘dawn of civilization’, but 22 years of data has great historical value and is ‘longer’, in this historical sense, than the datasets we usually consider ‘big’ data.

Eric Horvitz of Microsoft Research and Kira Radinsky of the Technion-Israel Institute of Technology also published a research paper titled “Mining the Web to Predict Future Events” (PDF). One example from the project examined how news about natural disasters like storms and droughts could be used to predict cholera outbreaks in Angola. Following those weather events, “alerts about a downstream risk of cholera could have been issued nearly a year in advance”.
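
To make that a bit more concrete, here is a deliberately simplified sketch of the kind of pipeline the paper describes: learn, from historical news, whether precursor events such as storms and droughts tend to precede an outbreak. The feature names, the tiny training set and the use of scikit-learn’s logistic regression are my own illustrative assumptions; the actual system works over 22 years of archives with far richer event representations.

```python
# Hypothetical example: each row describes events reported in a region during
# a time window, labelled with whether an outbreak followed within the
# prediction horizon. The data is made up for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

history = [
    ({"storm": 1, "drought": 1, "flood": 0}, 1),
    ({"storm": 0, "drought": 1, "flood": 0}, 0),
    ({"storm": 1, "drought": 0, "flood": 1}, 1),
    ({"storm": 0, "drought": 0, "flood": 0}, 0),
    ({"storm": 1, "drought": 1, "flood": 1}, 1),
    ({"storm": 0, "drought": 0, "flood": 1}, 0),
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([events for events, _ in history])
y = [outbreak for _, outbreak in history]

model = LogisticRegression().fit(X, y)

# Score a new window of news: storm and drought reported, no flood.
current_window = vec.transform([{"storm": 1, "drought": 1, "flood": 0}])
print("Estimated outbreak risk:", model.predict_proba(current_window)[0, 1])
```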

(more on the project at GigaOm & Technology Review)

The researchers also describe the advantages of letting software handle this type of research: software can learn patterns, research tirelessly, access far more news sources and work without bias (although that last one is up for debate if you ask me).

Learning from the past to predict the future is what predictive analytics is all about. What should ‘long data’ look like from a business perspective?

 

Big Data, Crowdsourcing and Gamification: making Data Science a sport

Adding crowdsourcing and gamification to Big Data? That’s the combination at the heart of start-up Kaggle, a platform for predictive modelling and analytics competitions. The idea is quite simple: companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models. So far, organizations such as NASA, Wikipedia, Deloitte and Allstate have used Kaggle and its competitions. By far the most lucrative prize on Kaggle is a $3 million reward offered by Heritage Provider Network to the person who can most accurately forecast which patients will be admitted to a hospital within the next year by looking at their past insurance claims data. More than 1,000 people have downloaded the anonymized data, which covers four years of hospital visits, and they have until April 2013 to post answers.

Data competitions
This crowdsourcing approach is especially interesting for companies experimenting with big data, or companies that are eager to find out what the data they already own can tell them. Kaggle offers a community of data scientists from quantitative fields such as computer science, statistics, econometrics, maths and physics to crunch the numbers for you. There are three ways companies can use Kaggle to put their data to work:

Identify is all about posting a sample of your dataset to the Kaggle community and letting members explore the data, post comments and conduct analyses. Winning ideas are determined from a pool of the highest-voted proposals by a panel of judges consisting of data scientists from the host organization and the Kaggle data science team. In Analyze mode it all revolves around your data and a specific question: set out the data mission (publicly, or privately with selected contenders) and a prize, and let the community come up with answers. Finally, in Implement mode the Kaggle engine enables you to take the winning model(s) from your competition and integrate them into existing systems.

208% prediction improvement
Dunnhumby, a U.K. firm that does analytics for supermarket chains, was looking to build a model to predict when supermarket shoppers will next visit the store and how much they will spend. Players in this data competition (also take a look at the competition page) were given a data set with details of every visit made by 100,000 customers over a year; customers were identified only by a number and the amount they spent on a given date. Based on that one year of purchasing data, players had to predict when each of the 100,000 customers would next visit the store and how much they would spend on that visit.

Around 2,000 entries were submitted to the $10,000 prize competition over the course of two months. The winning entry, by Alexander D’yakonov, a 32-year-old associate professor of mathematics at Moscow State University, used a method that gave more weight to recent visits when predicting the next one, and was 208% more accurate than the existing benchmark.
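
The write-ups don’t spell out the winning model, but the core idea of weighting recent behaviour more heavily is easy to sketch. The decay factor, the toy visit history and the exponentially weighted averages below are my own illustrative assumptions, not the competitor’s actual method.

```python
# Illustrative history for one customer: the days on which they shopped and
# how much they spent each time (invented numbers).
import numpy as np

visit_days = np.array([3, 10, 24, 31, 45, 52, 60])
spend = np.array([22.5, 18.0, 35.0, 20.0, 27.5, 30.0, 25.0])

gaps = np.diff(visit_days)                         # days between consecutive visits
weights = 0.8 ** np.arange(len(gaps) - 1, -1, -1)  # most recent gap gets weight 1.0

predicted_gap = np.average(gaps, weights=weights)
predicted_next_visit = visit_days[-1] + predicted_gap

spend_weights = 0.8 ** np.arange(len(spend) - 1, -1, -1)
predicted_spend = np.average(spend, weights=spend_weights)

print(f"Expected next visit around day {predicted_next_visit:.0f}, "
      f"spending roughly {predicted_spend:.2f}")
```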

Kaggle offers a way to gain insights into the data you already own and to put that data to work in the future. A lot of businesses struggle with a lack of expertise and experience when it comes to big data; with Kaggle, data experts are within close reach. Also, the most talented people often work for the biggest companies, so Kaggle is especially interesting for smaller companies that do not have an in-house data scientist. Kaggle offers a community of experts that you can tap into without having to hire anyone. And if you do decide to hire a data scientist, take a look at the Kaggle leaderboard: the top 10 should have some interesting candidates for the position.

Is Big Social becoming smarter than all the sociologists in the world?

For the next two to three years, Big Social will still be about web analytics, visitor tracking and making sense of all the data coming from social media feeds. Central to these activities will be tracking sentiment, identifying influencers, building rich data profiles and locating patterns in human behavior. But are we getting better at these activities and, if so, in what way? Could we come up with a mathematical approach to analyzing humans? A formula that accurately knows our preferences?

In their paper “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig and Fernando Pereira present an important argument in this context:

“Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.”

They argue, and validate, that if a dataset is large enough, it works just like an ‘actual’ formula. You could design a complex model to calculate how many people get the flu, but wading through Google search results delivers the same or even better results. Another example is a strategy for pricing used products: one could come up with an economic model that calculates the right pricing strategy for each product, but leveraging years of data from a site like eBay is probably more accurate. In fact, when text needs to be interpreted, one is better off performing a statistical calculation over web-scale data than trying to analyse sentences and derive meaning from them.
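
As a toy illustration of that ‘data instead of a formula’ argument, here is a minimal sketch that prices a used product by averaging its most similar historical sales rather than applying an economic model. The sales records, features and nearest-neighbour rule are invented for the example; a real system would query years of marketplace data.

```python
# Hypothetical past sales: (age_in_years, condition 0-1, sold_price).
past_sales = [
    (1, 0.9, 410.0),
    (2, 0.8, 350.0),
    (3, 0.6, 260.0),
    (4, 0.5, 215.0),
    (5, 0.3, 140.0),
]

def estimate_price(age, condition, k=3):
    """Average the k most similar historical sales (simple nearest neighbours)."""
    def distance(record):
        past_age, past_condition, _ = record
        return abs(past_age - age) + abs(past_condition - condition)

    nearest = sorted(past_sales, key=distance)[:k]
    return sum(price for _, _, price in nearest) / len(nearest)

# No pricing theory involved: the answer comes straight from the data.
print("Estimated price:", estimate_price(age=2, condition=0.7))
```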

This might sound trivial, but it means that data is capable of providing answers that until recently could only be found with the help of complex models. On top of that, the answers are now available without any need to know the model behind them. It feels a bit like using a calculator to find the square root of two without knowing what a root actually is. Is Big Social becoming smarter than all the sociologists in the world?