Big Data

Big data is a broad and somewhat loosely defined term. In general it refers to data sets that grow so large that traditional tools cannot store, index, search, analyze, and retrieve them effectively. When I think back to my early years at IBM, working with Ted Codd on System R, the first implementation of the relational database model, and later DB2, I remember the dreaded “query from hell”: an SQL statement that could bring the world’s most powerful mainframe to a crawl, taking hours or even days to complete. That was using traditional tools to search only a gigabyte of information. Since the 1980s the world has doubled its capacity to store data roughly every 40 months, to the point that today we create more than a zettabyte of new data every year. What is a zettabyte of data, and how can we put it in perspective so we can visualize it? Let’s start with a more common unit: a gigabyte. A gigabyte of printed data would fill more than ten yards of books sitting side by side on a shelf. A zettabyte is equal to a trillion gigabytes, so a zettabyte of printed data would form a shelf of books billions of miles long, several times the distance from Earth to Saturn (roughly 800 million miles). Seagate alone has now shipped a cumulative three zettabytes of storage, and it is estimated that by 2025 the world will be creating 175 zettabytes of data a year.
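The scale comparison above is easy to check with back-of-the-envelope arithmetic. The sketch below takes the ten-yards-of-books-per-gigabyte figure from the text as its assumption; everything else is plain unit conversion:

```python
# Rough scale of a zettabyte, assuming ~10 yards of shelved
# printed books per gigabyte (the estimate used in the text).

GB_PER_ZB = 1e12          # a zettabyte is a trillion gigabytes
YARDS_PER_GB = 10         # assumed books-per-gigabyte figure
YARDS_PER_MILE = 1760

shelf_yards = GB_PER_ZB * YARDS_PER_GB
shelf_miles = shelf_yards / YARDS_PER_MILE

EARTH_TO_SATURN_MILES = 8e8   # roughly 800 million miles

print(f"1 ZB is about {shelf_miles:,.0f} miles of books")
print(f"That is about {shelf_miles / EARTH_TO_SATURN_MILES:.1f} "
      f"Earth-to-Saturn distances")
```

At ten yards per gigabyte the shelf works out to several billion miles, which is why the stack runs well past Saturn rather than just reaching it.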

In the world of big data, you could easily be analyzing a zettabyte of data and need the answer in seconds, not hours, days, or months. This exponential shift from searching gigabytes or even terabytes of information to searching zettabytes not only requires new file systems, databases, search engines, processors, storage, and memory, but also creates vast new possibilities in research. A great fictional example is the TV show Person of Interest, in which a supercomputer performs complex predictive analysis, processing zettabytes of information in real time to determine life-and-death risks to random individuals hours or days before the actual danger. While this may seem far-fetched, a real-life example is underway in a California police department today, where crimes are being statistically predicted and subsequently stopped before they occur. New research in cancer, meteorology, genomics, complex physics, chemistry, cryptanalysis, and hundreds of other areas is possible with big data, where meaningful patterns and results could not be found in smaller data sets. This is the real significance of big data: new information can be found that simply is not visible in smaller data sets. Think of it in terms of what we learned when we became able to look inside the atom. Before subatomic research we thought the atom was where the physical world stopped, but once we could look deeper we found endless new information. It is the same with big data. Looking at a terabyte of cellular phone calls would probably reveal no useful patterns, but looking at a zettabyte of cellular phone calls will produce endless patterns of information that can help us do everything from reducing dropped calls to stopping terrorist attacks before they happen.

Consider the new NSA data center built in Utah in recent years. This is the most aggressive attempt by the NSA to build a data center capable of storing and analyzing all the words and images flying across the world’s telecommunications networks. The NSA wants to intercept and analyze 20 terabytes of real-time communications every minute, and to store, index, retrieve, and analyze all of the data it intercepts, both now and in the future. Even beyond zettabytes, it plans to give this one data center a yottabyte (1,000 zettabytes) of storage capacity. Whether this kind of technology is used for NSA surveillance, medical research, or advertising optimization, it will fundamentally change the world of IT.
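The gap between that intake rate and that storage target is easier to feel with a little arithmetic. A quick sketch, assuming decimal units (a terabyte as 10^12 bytes, a yottabyte as 10^24 bytes) and the 20-terabytes-per-minute figure quoted above:

```python
# How long would 20 TB/minute take to fill a yottabyte of storage?
TB = 1e12                  # decimal terabyte, in bytes
YB = 1e24                  # decimal yottabyte (1,000 zettabytes)

intake_per_minute = 20 * TB
minutes_per_year = 60 * 24 * 365

intake_per_year = intake_per_minute * minutes_per_year   # bytes/year
print(f"Yearly intake: {intake_per_year / 1e18:.1f} exabytes")

years_to_fill = YB / intake_per_year
print(f"Years to fill one yottabyte: {years_to_fill:,.0f}")
```

Twenty terabytes a minute comes to roughly ten exabytes a year, so a yottabyte of capacity would take tens of thousands of years to exhaust at that rate; the planned storage dwarfs even this aggressive collection target.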

© 2021 Jack B. Blount
