All The Data And Still Not Enough: How Data Scientists Predict The Future

It may seem strange, but even with mass amounts of data available, it is very difficult to predict future behavior. 

Claudia Perlich should know. As Chief Scientist for the digital advertising company dstillery, Perlich “designs, develops, analyzes and optimizes the machine learning that drives digital advertising.”

In doing so, she’s all too familiar with one huge paradox that comes with using data to predetermine results: Big Data may be sexy, but sexy doesn’t guarantee you get the information you want, even (and perhaps especially) when you need it the most.

Big Data, predicting clicks, and pleasing the ad oracle

As we’ve explored previously in our Data Skeptics series at Curiousmatic, mass amounts of information do not always, or even commonly, provide perfect answers to the problems we want to solve with them.

At the latest Data Skeptics meetup in NYC, Perlich explained to a room of keen listeners how data scientists like herself use the figures available through Big Data to create predictive models that can anticipate, for example, the behavior of consumers when shown online advertisements.

From regular and thorough data mining, we may assume advertisers know everything about us. But the truth is, they don’t know enough.

What is our budget? Are we at risk of a disease? How will we respond to an ad for a luxury car? The information needed to accurately answer these questions is just not there, or else not suitable.

For example, if self-reported local data (through, say, Foursquare) were accurate, Perlich says, “30 percent of people would be traveling at the speed of light, and 10,000 people standing on top of each other” at the same location.


Can you trust this data as is? Certainly not. The data is illogical.
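
One way to picture the "speed of light" problem is to compute the travel speed implied by two consecutive check-ins from the same person. The sketch below is only an illustration, not Perlich's or dstillery's method: the function names, the check-in tuple layout, and the 1,000 km/h cutoff (roughly airliner speed) are all assumptions made for the example.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumed cutoff, roughly the speed of an airliner


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def implausible_hops(checkins, max_kmh=MAX_PLAUSIBLE_KMH):
    """Return consecutive check-in pairs whose implied speed is too high.

    checkins: list of (timestamp, lat, lon) tuples for ONE person
    (a hypothetical layout chosen for this sketch).
    """
    ordered = sorted(checkins, key=lambda c: c[0])
    flagged = []
    for prev, curr in zip(ordered, ordered[1:]):
        hours = (curr[0] - prev[0]).total_seconds() / 3600.0
        dist = haversine_km(prev[1], prev[2], curr[1], curr[2])
        if hours == 0:
            # Same instant: flag anything more than about 1 km apart.
            too_fast = dist > 1.0
        else:
            too_fast = dist / hours > max_kmh
        if too_fast:
            flagged.append((prev, curr))
    return flagged
```

A person who "checks in" in New York and then in London five minutes later would be flagged, while two check-ins at the same London spot an hour apart would not. Flagged pairs are not necessarily fraud. They simply mark data that cannot be trusted as is.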

Data regarding website traffic can be similarly skewed, in this case due to bots that inflate pageviews (an issue that can cost advertisers millions). Can you trust that a pageview comes from a person and not a bot? Certainly not, unless you can accurately distinguish the two.

Luckily, data scientists can track bot behavior to show when views are from bots and not people.


Curiously, from Perlich’s experience, bots tend to hang out almost exclusively at women’s health and wrestling news websites.

Digital advertising works in mysterious ways

“Don’t ask me to optimize clicks based on CTR (click-through rate),” Perlich says. If you want high click-through rates, Perlich jokes, “show the ad on a flashlight app to a bunch of people fumbling around in the dark.”

Because Perlich’s job is to optimize digital advertising, she has a fair number of keen insights to offer on some of the everyday digital ads we enjoy or despise, even outside the scope of her job.

Here are two you may have experienced:

1. Ever stream a TV show online and have the same ad shown to you again and again?

This is because each ad play counts as a separate impression on you (the target audience). Third parties report impressions as percentages of demographics reached, instead of looking at the individual impact. Perlich calls this incomplete truth “pleasing the ad oracle.”

Think that this may be counter-productive? It doesn’t matter. If online ad placements have an adverse effect or none at all, that’s rarely something the agencies want to hear.

2. Ever buy something online, only to see an ad for that very same thing targeted at you?

That’s because of tracking cookies, and though it may seem unnecessary for you to see an ad for something you’ve bought already, as a one-time customer you are much more likely to buy again than a random person.

“Re-targeting,” Perlich concludes, “is profitable.”
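
The gap described in the first example, between counting ad plays and counting the people who saw them, can be made concrete with a little arithmetic. This is a minimal sketch under assumed inputs: the viewer IDs and the function name are hypothetical, and it does not reflect how any particular third party actually reports impressions.

```python
from collections import Counter


def reach_and_frequency(impressions):
    """Summarize an impression log.

    impressions: iterable of viewer IDs, one entry per ad play
    (hypothetical IDs, for illustration only).
    Returns (total_impressions, unique_viewers, average_plays_per_viewer).
    """
    per_viewer = Counter(impressions)
    total = sum(per_viewer.values())
    unique = len(per_viewer)
    average = total / unique if unique else 0.0
    return total, unique, average
```

Given a log of four plays in which one viewer saw the ad three times and another saw it once, the function reports four impressions but only two people reached, with an average of two plays each. A report built on the first number alone looks twice as good as one built on the second.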

The art of making do with second best

In spite of a lack of foresight (supernatural, scientific, or otherwise), predictive solutions are found every day. Perlich calls it the “art of making do with second best.” What ends up happening is that you reframe the problem ever so slightly to make a model that works.

The only way to know the true impact of an ad before you run it is to have a time machine, and until science masters time and space, such a feat is fundamentally impossible. So what’s the solution? You pretend the ad doesn’t matter. You change the question to address what data is available.

This is the case not just for advertising but for a multitude of tasks that predictive modeling is used for, like diagnosing disease, city planning, or catching criminals.

At the end of the day, the point is this: even mass amounts of data, collected from your specific movements around the Internet, can’t predict your future any more than the stars can. But unlike psychics, who are content to study tea leaves, data scientists like Perlich find and test innovative ways to get around impossibilities and make do with what they have.

“It’s all a matter of how creative you are at cheating,” Perlich says.

All of the data may not be enough, but it’s a lot more accurate than fortune telling if you’re smart enough to bend the rules.

Jennifer Markert