Data Skeptics And The Deep Web: The Explorations Of A Data Scientist

Photo courtesy of the NYC Data Skeptics Meetup, modified by Curiousmatic. 

The pulsing underbelly of the Internet, known as the Deep Web, holds vast and unchecked amounts of dark and often dirty data. With little structure or regulation, skepticism is an important approach to delving into seas of information, beneath which villains operate in tides of secrecy.

Data Skeptics is a Curiousmatic series that covers intriguing Internet and data topics worthy of close and careful examination. The series explores various spaces within the vast and mysterious world of data, from which we derive insight for curious minds through a lens of informed skepticism.

When did skepticism become a bad word, associated with furrowed brows, frown lines and naysayers? It is not the same as cynicism, and an even further cry from denial.

By its philosophical definition, skepticism is the theory that certain knowledge is impossible. Such is the approach of most data skeptics, who are well aware of the nuances and vulnerabilities in the belly of the beast that is Big Data.

Image courtesy of Marius B via Flickr.

Perhaps data science consultant Cathy O’Neil describes it best in her paper (pdf), On Being a Data Skeptic, when she says skeptics maintain “a consistently inquisitive attitude toward facts, opinions, or (especially) beliefs stated as facts.” Skepticism, she says, puts the science in data science; it’s the necessary middle ground between unfounded confidence and dismissive doubt.

It’s this attitude that allows our relationships with behemoth concepts like Big Data, in which trust is often placed blindly, to grow. Much like religion, data is nuanced and contextual, shaped by human psychology and endless modes of interpretation.

Importantly, it’s also deeper, more extensive, and more muddled than one might presume. Today, from the angle of inquisitive skepticism, we’re taking a dive into the noisiest girth of it all: the Deep Web.

Dr. Jerry And The Deep Web

In New York City, Data Skeptics meet every several months to explore and engage with the gritty details obscured by the hyped up and growing omnipresence of Big Data.

Image courtesy of infocux Technologies via Flickr.

Enter Dr. Jerry Smith, an entrepreneurial data scientist. Boasting 16 years of experience in machine learning, predictive modeling, and analytics, Dr. Smith is Chief Data Scientist for Capgemini’s Advance Digital Intelligence (ADI) group and Data Science & Analytics (DSA) group.

He would know a thing or two about adventuring into the beyond, having died for 90 seconds half a decade ago. God put him on earth to find answers in data, he says, though he wasn’t dead for long enough to get the full message.

Dr. Jerry also flew the A-6 Intruder for the U.S. Navy (the plane was the inspiration for the film Flight of the Intruder) and operated submarine-based nuclear reactors.

“In our hearts, we’re explorers,” Smith says of data scientists like himself. “We look at ones and zeros to find out what is being asked, and what the answers are.”

Smith is here to focus on open source intelligence, specifically that of the Deep Web and the criminals who operate there.

Dark space, fire, and ice

For a refresher, the Deep Web consists of information that is open source, but not indexed by search engines, and therefore largely inaccessible. It also dwarfs the surface web considerably, at 7,500 terabytes as opposed to just 19.

In poetic contrast, Smith compares the Deep Web both to the freezing, submerged majority of an iceberg of knowledge (representative of the Internet) and to Dante’s circles of Hell. The Deep Web is for Bad Guys, he explains, which makes sense, considering Dante’s Inferno says the worst offenders “are submerged, while in a fiery hell, up to their necks in ice.”


The level closest to the surface contains websites like Reddit, foreign social media channels, and message boards. The furthest level — level 5 — is what Smith calls “theoretical space,” meaning it’s believed to exist but is in essence untouchable.

In between is where the shady business happens: It’s here you’ll find (or more likely, won’t find) your “Bad Guys,” who, according to Smith’s definition, can be described as “those not on your side.” Information is made private and untraceable here through encrypted anonymity networks like Tor.

The activities performed in this dark space (also called the Darknet) are as monstrous as they come, with classes of crime including identity theft, a lucrative assassination industry, home-grown terrorism, sex trafficking, and various underworld bazaars (offspring of the late Silk Road) that allow the buying and selling of illegal drugs and weapons.

All of this, Smith says, is virtually unseen by law enforcement, in spite of its exponential growth in all quadrants.

Seek and you shall find

How does a skeptic look at the Deep Web and its dark data, in relation to data science? Data, whether dark or otherwise, isn’t just information, nor is it infallible — without human psychology and sociology, it’s incomplete. Correlations that scientists or marketers draw, as we know, can be spurious; it’s causation that counts, and which is trickiest to pin down.

Smith introduces two definitions that help illustrate this concept:

Apophenia: the spontaneous perception of connections and meaningfulness of unrelated phenomena; a popular term among skeptics of all types.

1202: An error code used by NASA to indicate “information overload,” which almost aborted Apollo 11’s lunar landing in 1969.

In relation to the Deep Web, these definitions represent the challenges data scientists like Smith face when it’s their job to mine complex data sets for answers. Smith, not unlike a digital detective or vigilante, wades through the inherent anarchy of the Deep Web to find correlation and causation between data and the all-too-human miscreants behind it, from the meth dealers in his own neighborhood to violent activists in Europe.

But the challenge of the Deep Web is its lack of structure. Smith asks hypothetically, “How does one process unstructured data?” The answer? “You process it.”

Image courtesy of Bart via Flickr.

The tools and capabilities used to mine, crawl, and process dark data take multidimensional space and “squish it out” systematically in a way that explains behavior dynamics. Of this process, Smith tells us “you can turn time into space;” he continues with an addendum: “That’s some Doctor Who shit.”

As an example of a typical investigation, Smith tells us a story of his dissection of the social networking surrounding Greenpeace Italia. For this project, Smith and his colleagues performed a Social Network Analysis (SNA). Using tools such as Netviss and Gephi, and measures such as betweenness centrality and modularity, they mapped and narrowed its social relationships to uncover a more dangerous target: the activist group Art of Resistance.
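For readers curious about what a measure like betweenness centrality captures, here is a minimal sketch in Python. It is not Smith's method or data: the graph, the node names, and the helper function are all hypothetical, invented purely for illustration. The function uses Brandes' algorithm to count, for each node, how many shortest paths between other nodes pass through it. A node that bridges two otherwise separate clusters scores high, which is one way an analyst might narrow a large network down to a few connectors worth a closer look.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an unweighted, undirected graph.

    `graph` maps each node to the set of its neighbours.
    Implements Brandes' algorithm with one breadth-first search per node.
    """
    centrality = {node: 0.0 for node in graph}
    for source in graph:
        order = []                                  # nodes in BFS visiting order
        predecessors = {node: [] for node in graph}
        num_paths = {node: 0 for node in graph}     # shortest paths from source
        num_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in distance:               # first time we reach w
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                share = num_paths[v] / num_paths[w]
                dependency[v] += share * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each undirected pair was counted once from each end, so halve the totals.
    return {node: score / 2 for node, score in centrality.items()}


# Hypothetical toy network: two tight clusters joined by a single "broker".
toy_network = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2", "broker"},
    "broker": {"a3", "b1"},
    "b1": {"broker", "b2", "b3"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
}

scores = betweenness_centrality(toy_network)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, score)
```

On this toy graph the hypothetical "broker" node scores highest, since every shortest path between the two clusters runs through it. Modularity, the other measure mentioned, groups nodes into communities and is not shown here.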

Art of Resistance was revealed to be an organization that persuades pacifist groups to engage in violent behavior. Smith knew this not only from probing the Deep Web’s first layer, but because he encountered the group in person: Disguised as a journalist, he was able to approach the group, then witness and record their intentions unfolding into action during a rally.

The latter encounter is perhaps not so typical.

Even in such cases where clarity is achieved, Smith warns that the more you think you know, the more you don’t know what you don’t know: knowledge of data grants us false confidence about what we aren’t aware of. Skeptics should not be so easily fooled.

The role of motives in shaping outcomes isn’t explicitly stated, but Smith doesn’t shy away from speculating about ours: “You’re here because you’re fascinated by bad people. And you want to find them.” He even hypothesizes that there may, statistically, be several among us.

All hope abandon, ye who enter here

Whether only one or legions of us dip in our hands or toes, or submerge completely into the dark kinetics of the Deep Web, the bulk and weight of it affects us all. Just because we may be swimming on the surface doesn’t mean a monster won’t take a bite. After all, the tip of an iceberg sank the Titanic. What could the rest of it have done?

The Deep Web isn’t what one pictures when they think of Big Data, nor is it even what many data scientists are willing to take into consideration. Yet it’s bigger by far, and as such, worthy of critical awareness at the least and active processing at best.

Think of it this way: Just the facade of a body can’t possibly speak to its messy psyche or the quality of its organs. Similarly, the visible 0.03 percent of the Internet isn’t wholly reliable without an understanding of the remaining noise and depravity.

Image courtesy of NASA’s Marshall Space Flight Center via Flickr.

There’s no reason to think, however, that this mass of chaotic information can’t be tackled. If a 1202 information overload failed to stop a trip to the moon, data scientists can forge through, and better with a healthy dose of skepticism than without it.

The Deep Web isn’t for everyone, atmospherically. But we’re sure the skeptic in you has an equally bottomless well of questions — the extent of which can only be answered by descending ring-by-ring yourself, amidst the ice and flames.

Whether you’ll find answers is another matter entirely. Certainly, not all, but some, are worth searching for. Just remember — humanity’s capacity to know has a ceiling, but our avidity to keep seeking, even in the darkest of places, is infinite.

Tune in next month for the second segment of our Data Skeptics series. 

For more information:
The NYC Data Skeptics Meetup page.
Dr Jerry Smith’s website.
Video of Smith’s entire lecture.

Jennifer Markert