Data Streaming – the treadmill of BigData

When you run a business focused on the “right here, right now” premise, going Big on Data might not be enough. If your business model relies on processing terabytes, petabytes or even exabytes, you’re most likely familiar with the term “batch processing”, mostly because it plays a pretty big role in how much you’ll spend on operations this month.

You might not only be familiar with batch processing, but also feel comfortable continuing to use it, simply because it works for you. On the other hand, if you need near real-time analytics, this processing model will probably fail you, and this is where streaming data processing comes in handy.

A classic definition of Data Streaming

The term “Data Streaming” is pretty much its own definition: this model involves thousands (or more) of data sources simultaneously generating small-sized input (think kilobytes per record).

This allows data to be processed in sliding time windows, following a sequential, incremental, record-by-record processing method that supports a wide variety of analytics: filtering, aggregations, correlations and/or sampling. Here is where BigData steps on a treadmill and powers up visibility at the service-usage or customer-behavior level. As a result, billing, reacting to “force majeure” situations or planning against event-driven conditions (based on machine-generated input) becomes cheaper in resources and operations, and reaction times shrink in ever-emerging situations.
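To make the sliding-window, record-by-record idea concrete, here is a minimal Python sketch, independent of any particular framework. The class name, the 60-second default window and the averaging metric are our own illustrative choices:

```python
from collections import deque
from datetime import datetime, timedelta

class SlidingWindowAverage:
    """Maintains a running average over the records of the last `window_seconds`."""

    def __init__(self, window_seconds=60):
        self.window = timedelta(seconds=window_seconds)
        self.records = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Ingest one record, then evict records that fell out of the time window.
        self.records.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.window
        while self.records and self.records[0][0] < cutoff:
            _, old_value = self.records.popleft()
            self.total -= old_value

    def average(self):
        return self.total / len(self.records) if self.records else 0.0
```

Each arriving record updates the metric incrementally, so no pass over the full history is ever needed, which is precisely what keeps this cheaper than recomputing in batch.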

Understanding Data Streaming vs Batch processing

While batch processing focuses on computing arbitrary queries over different sets of data, stream processing works by ingesting data sequences and updating metrics incrementally, allowing summary statistics and reports against newly arrived records. In other words, Data Streaming gives you direct access to a flow of records, while batch processing relies on the last state of those records, which need to be analysed and stored first. At a more functional level, here are the main differences between these two models:

  • Data scope – batch processing queries all (or most) of the stored dataset; streaming queries the most recent record or a rolling time window.

  • Data size – batch processing works on large, bounded batches; streaming works on individual records or micro-batches of a few records.

  • Latency – batch results arrive in minutes to hours; streaming results arrive in seconds or even milliseconds.

  • Analysis – batch processing suits complex, deep analytics; streaming suits simple response functions, aggregates and rolling metrics.
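The contrast between incrementally updating a metric and recomputing it over stored data can be sketched in a few lines of Python. Both function and class names below are our own; the running mean is just a stand-in for any summary statistic:

```python
class IncrementalMean:
    """Streaming style: updates the mean in O(1) work per arriving record."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Welford-style incremental update; no stored history required.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

def batch_mean(records):
    """Batch style: recomputes over the full dataset, which must be stored first."""
    return sum(records) / len(records)
```

Both produce the same final number, but the streaming version keeps the metric fresh after every record, while the batch version only answers once all records have been collected.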

Don’t get us wrong: batch processing still excels at data persistence, which is why in most cases it is wise to maintain two processing layers, a real-time one and a batch one. The purpose of the above comparison is simply to illustrate how building a data streaming layer on top of your batch processing one might help you have your cake and eat it too.

Main challenges outlined by Data Streaming

We must point out that stream processing can be tricky, since you must maintain two separate layers within the Data Streaming setup. The first, commonly called the storage layer, needs to order records and ensure data consistency for each one received, enabling fast and sustainable reads and writes of large streams of data. The second, commonly called the processing layer, is responsible for consuming data from the storage layer, running calculations on that data, and then notifying the storage layer which records are perishable, or no longer needed, and can be deleted. The dilemmas of scale and persistence also need to be addressed.
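The interplay between the two layers can be sketched in plain Python. This is a toy, single-process model under our own naming (real systems put a durable log such as Kafka in the storage role); it only shows the contract described above: the storage layer keeps an ordered log, the processing layer consumes sequentially and tells storage what can be trimmed:

```python
class StorageLayer:
    """Keeps an ordered, append-only log of records (a log-broker-like sketch)."""

    def __init__(self):
        self.base = 0    # absolute offset of log[0]
        self.log = []

    def append(self, record):
        self.log.append(record)

    def read(self, offset):
        """Return records at absolute offsets >= offset, in order."""
        return self.log[max(0, offset - self.base):]

    def trim(self, offset):
        """Discard records below `offset` once downstream no longer needs them."""
        keep = max(0, offset - self.base)
        self.log = self.log[keep:]
        self.base += keep

class ProcessingLayer:
    """Consumes records sequentially and maintains a running metric (a sum here)."""

    def __init__(self, storage):
        self.storage = storage
        self.offset = 0
        self.total = 0

    def poll(self):
        for record in self.storage.read(self.offset):
            self.total += record
            self.offset += 1
        # Tell the storage layer that processed records may be deleted.
        self.storage.trim(self.offset)
```

Note how the processing layer, not the storage layer, decides when records become perishable; that separation is what the two-layer split buys you.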

Fortunately, all of the above challenges can be addressed, since there are several well-designed frameworks, such as Spark, Storm or Apache Flume, that enable a relatively easy process of deploying Data Streaming applications for your business. You might just need some expertise to get started with them.

The benefits of deploying a stream layer to your BigData application

Running a streaming layer will, first of all, cut down on the resources consumed by data analysis tasks.

  • Less horsepower – It might sound repetitive, but processing smaller, well-addressed records requires less horsepower than similar operations performed by batch processing.

  • Near real-time analytics – If data is required now, there is no faster way to acquire it than running a data streaming layer. In addition, you’ll get more specifics and make use of a wider analytic pool.

  • Power of predictive maintenance – When machine input is involved, data streaming is a marvel. Since you receive diagnostic messages in no time, you can plan against certain machine events or be notified that a device needs maintenance before any damage occurs.
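As a tiny illustration of the predictive-maintenance point, here is a hypothetical Python generator that raises an alert the moment a telemetry reading crosses a threshold; the function name, record shape and threshold are all our own assumptions:

```python
def maintenance_alerts(readings, threshold=80.0):
    """Yield an alert as soon as a reading crosses the threshold (streaming style).

    `readings` is any iterable of (device_id, temperature) pairs, consumed
    record by record, so alerts fire without waiting for a batch to close.
    """
    for device_id, temperature in readings:
        if temperature > threshold:
            yield f"ALERT: {device_id} at {temperature}C, schedule maintenance"
```

Because the generator is lazy, it works just as well over a live feed as over a list, which is the whole appeal: the alert is produced while the batch equivalent would still be accumulating data.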

Besides these three main benefits, data streaming has advantages beyond real-time analytics; however, a series of posts would be required to cover that area. To keep up with our weekly posts, follow us on Twitter, LinkedIn or Facebook, or simply reach out if you’re in need of an in-depth analysis.

About the Author:

Adobe CS Geek, Illustrator fan and font-face gambler playing around with words, colors and wire-frames @ EBS-Integrator