Basics: How Big is your Data?
Aug 31, 2020

Last time we promised a buffet of data-related topics, so here we are, keeping our promise (a default powered by our team’s environment). Advancing from abstract notions such as data-drivenness, we thought a fair path towards the main course would be an introduction to BigData. In most cases, talking about sizes gets a bit giggly, pointing towards the “Men’s Health” or “Cosmopolitan” best-seller shelves.

Alas, today we’re going to talk about data sizes – a less entertaining topic, though quite exciting for those invested in building data-centric architectures in data-driven environments. We can all agree that BigData processing implies massive ingestion of information, regardless of its nature. The main quarrel is: how much must be ingested before you earn that much-desired ‘BigData’ badge?

Before we dig in – what is BigData?

Before grabbing our rulers, weights, dials and pocket watches, a proper definition must be fitted around the “Big Data” term. After all, we must understand what we’re measuring before declaring a verdict.

When we turn to Wikipedia, we find one of the most generic definitions, cited in almost every other source and well known to editors and techies alike:

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

– honestly, I’d give this one the cold shoulder since it doesn’t do much to sort out this muddle, if not for one proper directive: size is not the main culprit here. Instead, our Jack Ketch sets his aim on complexity, his axe longing for yet another appropriate slicing.

The “Big” here is relative. Think of your last happy hour (we all miss those), where the alcohol in each brew represents data complexity. You’ve got your friend Jimmy, who’s dunking whiskey shots daily, drinking like a fish; your friend Samantha, who enjoys her occasional glass of wine; and Kenny, a beer-loving acquaintance who dunks barrels until he finds the original root of his beverage. While all of them are gulping alcohol (a.k.a. data complexity), each of them experiences a unique tub-thumping journey.

Jimmy is doing shots – in terms of volume it isn’t much, but compared to Kenny’s and Samantha’s drinks, his liquor is quite complex, and his body will have to spend a lot more resources processing that shot’s complexity rather than its volume. In terms of data, he might process a mere 500GB (the capacity of the SSD in a modern laptop), but its complexity might take a huge toll on resources (at the operational, technical and hardware-provisioning levels).

Samantha will probably have 2 to 4 glasses of some ruby-red goodness. Compared to Jimmy, her liver will most likely be grateful the next day, since wine is a bit more diluted than whiskey. Going back to our data analogy, Sam might process some well-organised data that measures up to a few hundred terabytes, but since it’s a bit more structured (compared to Jimmy’s batch), her efforts towards data-centricity take an easier toll at the operational and technical levels, whilst still consuming considerable hardware resources and time.

Kenny, compared to both of our friends here, will have to dunk 5 to 6 pints to catch up (depending, of course, on how much alcohol is in his favourite root beverage). Like Samantha, Kenny will have to ingest a bit more liquor, but his dataset is organised in tiny bits, giving his liver a well-deserved break; his bladder (physical resources), though, will take a heavier toll.

Anyone concerned about their health, sipping an occasional glass of water, falls outside this analogy: there is neither complexity nor volume to their ephemeral data-drink, so you can’t really pin a BigData use-case on their drinking habit. Considering the above, it’s only safe to assume that:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

The best way to look at the “BigData Processing” term is by considering both volume and complexity, rather than just quantifiable margins. Of course, you can think about it from different perspectives. Some might even say that any data analysed by Spark, Hadoop or any other advanced-insight platform is BigData – running the risk of omitting other means of processing analytics, such as Programming with Big Data in R (pbdR).

When Size Matters Most

Since a definition alone won’t nail this conundrum down, let’s get into the wild and explore our virtual jungle – introducing some “average sizes” to the BigData perspective, in an effort to deliver a sense of just how much information the average data-driven pioneer has to store, analyze and integrate today.

Now, for tapping that barrel, let’s eye some metrics that place BigData on a scale:

  • IDC forecasts the global datasphere will reach 175 zettabytes by 2025 (that’s 175 billion 1TB thumb drives);
  • All YouTube videos combined are estimated at 800 petabytes and counting;
  • According to Forbes, Americans use 4,416,720 GB of internet data every minute;
  • Wordstream states that, on average, about 500 million tweets are sent every day.
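Those unit prefixes get slippery fast, so here’s a quick back-of-the-envelope check on the first two figures – a plain-Python sketch, where the decimal SI unit sizes (1 TB = 10¹², 1 PB = 10¹⁵, 1 ZB = 10²¹ bytes) are our own assumption:

```python
# Decimal (SI) storage units, in bytes
TB = 10**12
PB = 10**15
ZB = 10**21

datasphere_2025 = 175 * ZB   # IDC's 2025 forecast
youtube_total = 800 * PB     # estimated size of all YouTube videos

# How many 1TB thumb drives would the 2025 datasphere fill?
drives = datasphere_2025 // TB
print(f"{drives:,} one-terabyte thumb drives")  # 175,000,000,000

# And how much of that datasphere is YouTube?
share = youtube_total / datasphere_2025
print(f"YouTube is roughly {share:.6%} of it")
```

Run it and you’ll see that even all of YouTube is a rounding error next to the forecast datasphere – a handy reminder of just how far apart these prefixes sit.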

That’s not all, especially with fields such as the Internet of Things becoming a reality in the Retail Sector. Here, IDC’s estimate might very well fall way behind.

Regardless of whether your religious beliefs lie with complexity, or you’d rather sacrifice a lamb to the volume quantifier, one thing is certain: when you’re dealing with a whale, it’s going to be Big. Hence the exception that proves the rule: in defining BigData, sometimes size does matter.

To begin with, merely storing that much becomes a serious challenge; now imagine making sense of all this ingestion flux. As we advance towards the next millennium, data will get even bigger and we might need to adjust our definitions – it is no surprise IDC measured its prognosis in zettabytes. BigData’s progression, unlike Moore’s Law (the exponential growth of computing power), is far less predictable.

Each of us, industry professionals and 12-year-old TikTokers alike, generates enough data to fill any logical space there is. 100 years from now, it’s estimated we’re going to store 42 yottabytes (that’s 42,000 zettabytes) every year – where are we going to store all that? If we don’t change anything and keep using the same approaches run by companies such as Amazon and Google, we might have to build enough data centres to cover the surface of Jupiter 12 times over.
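To see why that yottabyte figure is so alarming, here’s the same back-of-the-envelope arithmetic in Python (SI units again; the 365-day year and the daily-rate comparison are our own simplifications, not figures from the sources above):

```python
ZB = 10**21                      # zettabyte, in bytes
YB = 10**24                      # yottabyte, in bytes

yearly = 42 * YB                 # projected storage per year, 100 years out
print(yearly // ZB)              # 42000 -- i.e. 42,000 zettabytes a year

# At that rate, how long would it take to produce IDC's
# entire 2025 datasphere (175 ZB)?
per_day = yearly / 365 / ZB      # roughly 115 ZB per day
days = 175 / per_day
print(f"about {days:.1f} days")  # about 1.5 days
```

In other words, at that projected rate we’d churn out the whole 2025 datasphere roughly every day and a half – no wonder the storage question keeps people up at night.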

It’s theorised that DNA itself might hold the answer. Harvard researchers have been able to write entire libraries (petabytes of data) into just a few grams of genetic material – so there’s a chance our next digital revolution could be more alive than we imagine. Of course, this doesn’t mean we know how we’re going to manipulate, centralise, analyse or exploit all this data – but it will surely redefine the notion of “saving” the world.


Regardless of how we define BigData today, one thing is certain: while most of this data will definitely be catalogued as useless, inaccurate, or hard to organise, the quantity of data will lead to changes in the quality of how we live and understand our world – probably in ways we can’t even imagine.

There’s more coming on this, as other concepts require a wee bit more column inches, but for the time being this should be more than enough to give you an idea of what BigData is. Next week we’re planning a proper dissection piece, so make sure to join in. Until then, use that comment box below and make sure to like and share our piece (if you found it useful, of course).