# Data's Use in the 21st Century

Posted by **Cameron Davidson-Pilon** on

The technological challenges and achievements of the 20th Century handed society powerful tools: nuclear power, airplanes and automobiles, the digital computer, radio, the internet and imaging technologies, to name only a handful. Each of these technologies disrupted the system, and each can be argued to be a Black Swan (*à la* Nassim Taleb). In fact, for each technology, one could find a company killed by it and a company that made its billions from it.

What these technologies have in common is that they are all *deterministic engineering solutions*. By that, I mean they were created by techniques from mathematics, physics and engineering: modeled in a mathematical language, guided by the calculus of physics, and constrained and brought to life by engineering. I argue that these types of problems, problems of deterministic modeling, are problems that our fathers had the luxury of solving.

### The universe of all problems

Consider the universe of all problems. It is a large universe, no doubt. For simplicity, think of it as a three-dimensional space, though in reality it is infinite-dimensional. The subspace of all problems that can be solved by deterministic modeling, like building a bridge or modeling an airplane, constitutes a line (or a higher-dimensional equivalent of a line, if you wish) in our three-dimensional space. In mathematics, a line in 3-D space has *measure 0*: its contribution to filling the space is negligible.
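For readers who want the measure-0 claim spelled out, here is a short sketch of the standard argument. The symbols $S$, $\ell$, $\varepsilon$, $T_\varepsilon$ and $\lambda_3$ (three-dimensional Lebesgue measure) are notation introduced here for illustration. Take a segment $S$ of length $\ell$ and enclose it in a solid cylinder $T_\varepsilon$ of radius $\varepsilon$ whose axis is $S$:

$$
\lambda_3(S) \;\le\; \lambda_3(T_\varepsilon) \;=\; \pi \varepsilon^2 \ell \;\longrightarrow\; 0 \quad \text{as } \varepsilon \to 0 .
$$

So every segment has volume 0. A full line is a countable union of unit-length segments $S_k$, and countable subadditivity gives

$$
\lambda_3\Big(\bigcup_{k \in \mathbb{Z}} S_k\Big) \;\le\; \sum_{k \in \mathbb{Z}} \lambda_3(S_k) \;=\; 0 .
$$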

The problems we solved in the 20th Century, like flight, radio and digital computers, lie on this line. Using our current mathematics and engineering knowledge, we are reaching the limits of exploring that line: after all, the easy problems have already been solved (this brings to mind the tautology "Science is hard because all the easy problems are solved"). Much on that line is still undiscovered, but major progress has slowed considerably (see previous sentence). Improvements are marginal.

What I am arguing is that our previous problem-solving steps of 1. model, 2. apply mathematics, 3. ???, 4. profit do not have the same power they used to have when applied to current-day, and future, problems. We need to, and are starting to, explore problems off of this measure-0 line, like the red dot in the figure above. So, if this line characterizes all deterministic modeling problems, what problems might lie off this line? *Statistical problems*.

### 21st Century problems are statistical problems

Statistical problems describe the space we haven't explored yet. Statistical problems are not new: they are likely as old as deterministic problems. What is new is our ability to solve them. Spearheaded by the (constantly increasing) tidal wave of data, practitioners are able to solve *new problems* otherwise thought impossible. Consider the development of a spellchecker: in a deterministic approach, an algorithm for spell checking would have needed to incorporate context and complicated ideas from the language's grammar (I shudder at the nested `if` statements), specific to that one language; whereas a statistical approach can be written in under 20 lines. The difference between the two approaches is that the latter takes advantage of the presence of a large corpus of text -- a very lenient assumption.
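To make the contrast concrete, here is a minimal sketch of a statistical spellchecker, in the spirit of Peter Norvig's well-known spelling corrector. It is not necessarily the implementation this post has in mind. The tiny `CORPUS` string and the names `edits1` and `correct` are assumptions made for illustration; a real corrector would count words from a large body of text.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration); real use needs a large text corpus.
CORPUS = """the quick brown fox jumps over the lazy dog
the spelling of a word is corrected by counting how often words appear
statistical methods use data and the data does the work"""

WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    # Sort so that ties between equally frequent candidates resolve deterministically.
    candidates = sorted(w for w in edits1(word) if w in WORD_COUNTS)
    if not candidates:
        return word
    return max(candidates, key=WORD_COUNTS.__getitem__)
```

Nothing here encodes grammar or anything specific to English beyond the alphabet: swap in a corpus (and alphabet) for another language and the same code applies, which is the point of the statistical approach.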

This isn't another *big data* article, but it's hard to overestimate, let alone imagine, what we will be doing with these casual data sets. Fields like medicine, which previously relied on small sample sizes to make important *one-size-fits-all* decisions, will evolve into a very personal affair. By investigating traffic data, dynamic solutions can be built that mimic past successes. Aided by machine learning, specifically recommendation engines, companies can evoke desires we had never previously thought about. Ideas like multi-armed bandits will motivate UI and AI development.
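As a rough illustration of the multi-armed bandit idea, here is a minimal sketch of one common strategy, Thompson sampling with Beta posteriors over Bernoulli rewards. The function name `thompson_sampling`, the simulated `true_rates`, and the uniform Beta(1, 1) prior are all assumptions made for this example; think of each arm as a UI variant with an unknown click-through rate.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Thompson sampling on Bernoulli arms; return per-arm (wins, losses)."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior, an assumption of this sketch).
        samples = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = samples.index(max(samples))  # play the arm with the best draw
        # Simulated feedback: the arm pays out with its (hidden) true rate.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


if __name__ == "__main__":
    wins, losses = thompson_sampling([0.1, 0.5, 0.9], n_rounds=2000)
    for i, (w, l) in enumerate(zip(wins, losses)):
        print(f"arm {i}: pulled {w + l} times, {w} successes")
```

The algorithm never "solves" the problem in the deterministic sense: it keeps spending a small share of pulls on arms that look worse, and in exchange it concentrates most pulls on the arm the data favors.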

### What is a solution to a statistical problem?

Of course, there is a tradeoff. To speak of *solving a statistical problem* is silly. Whereas in our 20th Century past we either solved the deterministic problem or did not, i.e. found or did not find a solution given the physical and engineering constraints, in statistical problems we are subject to some fraction of failure (which is why we don't build probabilistic bridges). This can be described by another visualization. The space of deterministic engineering problems can be found lying on the ends of the unit interval $[0,1]$. A 1 represents a problem that can be, or is, *solved* completely, e.g. we successfully designed a method of flight. The space in between 0 and 1 is occupied by statistical problems. For example, spell checking cannot be *solved* in the traditional sense, but it can be accurate 95% of the time. Harder statistical problems are those that involve *reflexivity*, that is, your actions will affect the outcome (problems like the stock market and ad-targeting, where consumers can become desensitized to your optimized ad). Finally, problems that are unsolvable using current science and technology are assigned to 0.

### Have we made any 21st Century breakthroughs yet?

The 20th Century breakthroughs were not breakthroughs at the time of their discoveries, minus a few exceptions. The technologies took years to percolate through society, and did so only after they became cheap enough for public consumption. Therefore, if we were to ask whether we have already invented any breakthrough technologies of the 21st Century, we should search liberally through what we have done. I would suggest that yes, we have made a breakthrough: quality information search. Tech giants like Google and Microsoft are working on these technologies, but they are still in their infancy. Furthermore, this technology is the most naive technology given our data supply. Imagine you were the world's first librarian, and you have just received thousands of books. The first, and most naive, thing you would do is organize the books, i.e. make queries of the books easier. Returning to the present, we are at the stage where we have overcome our own data-indexing problem. In fact, we have gone a step further: we can return not only *all* results, but *damn good* results too. Imagine an alternative internet where you browsed by selecting more and more specific topics from accordion menus until you reached a desired webpage -- this is a possible realization of the internet's organization that luckily did not occur.

### Conclusion

I am sticking my neck out, so I should point out a weakness in my overall argument. Previous authors may have been saying for years that we are at the end of so-and-so technology, right before a big new breakthrough. I, of course, cannot imagine these breakthroughs (else they would exist), hence I underestimate future deterministic solutions. There are still great advances possible (teleportation and quantum computers come to mind) that will be classified as breakthrough deterministic tech.

Simply put, I claim we will start to see novel technological applications of data to statistical problems that we cannot even fathom right now (who could have imagined nuclear power in a pre-nuclear society?). These technologies will be as revolutionary as radar was to man. So, why haven't you picked up Bayesian Methods for Hackers yet?

### Addendum

EDIT: I was probably too hasty in discounting possible ventures of new technologies. I think there is a second venture that complements data-driven technological advances: the biotech industry. The fictional biotech I can think of now includes nanobiotechnology, brain-machine interfaces and genetic engineering. Each of these techs, and probably all biotech advances, will produce massive amounts of data as a byproduct. This is why I said that data-driven tech and biotech will complement each other.

- Tags: data, data science
