Fast Data and Actionable Data will replace big data (2024)

“Fast data” and “actionable data” will replace big data, according to some experts. The argument is that big isn’t necessarily better when it comes to data, and that businesses don’t use a fraction of the data they have access too. Instead, the idea suggests companies should focus on asking the right questions and making use of the data they have — big or otherwise

The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored. Experts proposes that instead of simply storing that data to be analyzed later, perhaps we've reached the point where it can be analyzed as it's ingested while still maintaining extremely high intake rates using tools such as Apache Kafka.

Less than a dozen years ago, it was nearly impossible to imagine analyzing petabytes of historical data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost commonplace. Open source technologies like Hadoop reimagined how to efficiently process petabytes upon petabytes of data using commodity and virtualized hardware, making this capability available cheaply to developers everywhere. As a result, the field of big data emerged.

A similar revolution is happening with so-called fast data. First, let's define fast data. Big data is often created by data that is generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, or sensor data. Often these events occur thousands to tens of thousands of times per second. No wonder this type of data is commonly referred to as a "fire hose." When we talk about fire hoses in big data, we're not measuring volume in the typical gigabytes, terabytes, and petabytes familiar to data warehouses. We're measuring volume in terms of time: the number of megabytes per second, gigabytes per hour, or terabytes per day. We're talking about velocity as well as volume, which gets at the core of the difference between big data and the data warehouse. Big data isn't just big; it's also fast.

Capturing value in fast data

The best way to capture the value of incoming data is to react to it the instant it arrives. If you are processing incoming data in batches, you've already lost time and, thus, the value of that data. To process data arriving at tens of thousands to millions of events per second, you will need two technologies: First, a streaming system capable of delivering events as fast as they come in; and second, a data store capable of processing each item as fast as it arrives.

Delivering the fast data

Two popular streaming systems have emerged over the past few years: Apache Storm and Apache Kafka. Originally developed by the engineering team at Twitter, Storm can reliably process unbounded streams of data at rates of millions of messages per second. Kafka, developed by the engineering team at LinkedIn, is a high-throughput distributed message queue system. Both streaming systems address the need of processing fast data. Kafka, however, stands apart. Kafka was designed to be a message queue and to solve the perceived problems of existing technologies. It's sort of an über-queue with unlimited scalability, distributed deployments, multitenancy, and strong persistence. An organization could deploy one Kafka cluster to satisfy all of its message queueing needs. Still, at its core, Kafka delivers messages. It doesn't support processing or querying of any kind.

Processing the fast data

Messaging is only part of a solution. Traditional relational databases tend to be limited in performance. Some may be able to store data at high rates, but fall over when they are expected to validate, enrich, or act on data as it is ingested. NoSQL systems have embraced clustering and high performance, but sacrifice much of the power and safety that traditional SQL-based systems offered. For basic fire hose processing, NoSQL solutions may satisfy your business needs. However, if you are executing complex queries and business logic operations per event, in-memory NewSQL solutions can satisfy your needs for both performance and transactional complexity. Like Kafka, some NewSQL systems are built around shared-nothing clustering. Load is distributed among cluster nodes for performance. Data is replicated among cluster nodes for safety and availability. To handle increasing loads, nodes can be transparently added to the cluster. Nodes can be removed -- or fail -- and the rest of the cluster will continue to function. Both the database and the message queue are designed without single points of failure. These features are the hallmarks of systems designed for scale.

In addition, Kafka and some NewSQL systems have the ability to leverage clustering and dynamic topology to scale, without eschewing strong guarantees. Kafka provides message-ordering guarantees, while some in-memory processing engines provide serializable consistency and ACID semantics. Both systems use cluster-aware clients to deliver more features or to simplify configuration. Finally, both achieve redundant durability through disks on different machines, rather than RAID or other local storage schemes.

Big data plumbers toolkit

Look for a system with the redundancy and scalability benefits of native shared-nothing clustering.

Look for a system that leans on in-memory storage and processing to achieve high per-node throughput.

Look for a system that offers processing at ingestion time. Can the system perform conditional logic? Can it query gigabytes or more of existing state to inform decisions?

Look for a system that isolates operations and makes strong guarantees about its operations. This allows users to write simpler code and focus on business problems, rather than handling concurrency problems or data divergence. Beware of systems that offer strong consistency but at greatly reduced performance.Systems with these properties are emerging from the NewSQL, NoSQL, and Hadoop communities, but different systems make different trade-offs, often based on their starting assumptions. For organizations that want to act on fast data in real time, these tools can remove much of the complexity involved in understanding data with velocity.

Kafka provides a safe and highly available way to move data between myriad producers and consumers, while offering performance and robustness to put admins at ease. An in-memory database can offer a full relational engine with powerful transaction logic, counting, and aggregation, all with enough scalability to meet any load. More than acting as a relational database, this system should serve as a processing engine complementary to Kafka's messaging infrastructure.