With the increasing number of open-source frameworks such as Apache Flink, Apache Spark, Apache Storm, and cloud frameworks such as Google Dataflow, creating realtime data-processing jobs has become quite easy. The APIs are well defined, and the standard concepts such as Map-Reduce follow almost similar semantics across all frameworks.

However, still today, a developer starting in the realtime data processing world struggles with some of the peculiarities of this domain. Due to this, they unknowingly create a path that leads to rather common errors in the application.

Let’s take a look at a few of the odd concepts which you might need to conquer while designing your realtime application.

Event time

The timestamp at which the source generates the data is called Event time, whereas the timestamp at which your application processes data is known as Processing time. Not distinguishing between these timestamps is the cause of most common pitfall in the realtime data-streaming applications.

Let’s elaborate on it.

Data queues are prone to delays due to several issues such as high GC in brokers or too much data leading to backpressure. I’ll denote an event as (E,P) where E is event timestamp (HH:MM:SS format), and P is processing timestamp. In an ideal world, E==P, but that doesn’t happen anywhere.

Let’s assume we receive the following data

('05:00:00', '05:00:02'), ('05:00:01', '05:00:03'),      ('05:00:01', '05:00:03'), ('05:00:01', '05:00:05'),
('05:00:02', '05:00:05'), ('05:00:02', '05:00:05')

Now let’s assume there is a program that counts the number of events received each second. Based on event time, the program returns

[05:00:00, 05:00:01) = 1
[05:00:01, 05:00:02) = 3
[05:00:02, 05:00:03) = 2

However, based on processing time the output is

[05:00:00, 05:00:01) = 0
[05:00:01, 05:00:02) = 0
[05:00:02, 05:00:03) = 1
[05:00:03, 05:00:04) = 2
[05:00:04, 05:00:05) = 0
[05:00:05, 05:00:06) = 3

As you can see, both of these are entirely different results.

Unusual delays in data streams

Most of the realtime data applications consume data from a distributed queue such as Apache Kafka, RabbitMQ, Pub/Sub, etc. The data in the queue is generated by other services such as the clickstream from the consumer app or logs from a database.

The issue queues are susceptible to delays. An event generated can arrive in your job even in some tens of milliseconds or can take more than an hour (extreme back pressure) in the worst case. The data can be delayed due to the following reasons —

  • High load on Kafka
  • Producer buffering data in their servers
  • Slow consumer due to backpressure in your application

Not assuming there will ever be a delay in data is a pitfall. A developer should always have tools to measure latency in the data. e.g. In Kafka, you should keep a check on the offset lag.

You should also monitor the backpressure as well as the latency (i.e. the difference between event time and processing time) in your jobs. Not having these will lead to unexpected misses in data e.g. a 10 min. time window can appear to have no data and the window 10 min. after it to have twice as much the expected values.

Joins

In a batch data-processing systems, joining two datasets is relatively trivial. In a streaming world, the situation becomes a bit cumbersome.

//The dataset is in the format (timestamp, key, value)

//Datasteam 1
(05:00:00, A, value A), (05:00:01, B, value B),
(05:00:04, C, value C), (05:00:04, D, value D)

//Datastream 2
(05:00:00, A, value A'), (05:00:02, B, value B'), 
(05:00:00, C, value C')
Both data streams represented on a single time scale
Both data streams represented on a single time scale

We now join both of the data streams on their keys. For simplicity, we’ll be doing an inner join.

Key A — Both value A & value A’ arrive at the same time. Thus we can easily combine them in a single function and emit the output

Key B — value B comes 1 second earlier than value B`. Hence we need to wait for at least 1 second on datastream 1 for the join to work. Thus, You need to consider the following-

  • Where will you store the data for that 1 second?
  • What if that 1 second is not a fixed delay and changes irregularly rising to 10 minutes in the worst case?

Key C — value C arrives 4 seconds later than value C’. This is the same situation as before, but now you have an irregular delay in both datastream 1 and 2 with no fixed pattern of which stream will give the value 1.

Key D — value D arrives, but no value D’ is observed. Consider the following-

  • How long do you wait for value D`?
  • What if value D` can come any time from at least 5 seconds to anywhere close to 1 hour?
  • What if this was an outer join and you had to decide when to emit value D alone?
  • What if in the previous case after 1-minute of emitting value D, the value D` arrives?
Join in a streaming application
Join in a streaming application

The answer to all of the above questions will depend upon your use case. The important part is to consider all of these questions rather than ignoring the complexity of streaming systems.

Configuration

In a standard microservice, the configuration is present inside the job or in a DB. You can do the same in data streaming applications. However, you need to consider the following before going on with this approach.

How frequently are you going to access the config?

If the config needs to be accessed for each event and the number of events is a lot (more than a million RPM), then you can also try alternative approaches. One is to store the config inside your job state. This can be done in Flink and Spark using stateful processing. The config can be populated in-state using a file reader or another stream in Kafka.

In a streaming world, making a DB call for each event can slow down your application and lead to backpressure. The choice is to either use a fast DB or eliminate network calls by storing state inside the application.

How large is your config?

If the config is very large, you should only use an in-application state if the config can be split across multiple servers e.g., A config that holds some threshold value per user. Such a config can be split across several machines bases on the user id key. This helps in reducing the storage per server.

Prefer a DB if the config can’t be split across nodes. Otherwise, all the data will need to be routed to a single server which contains the config and then re-distributed again. The only server containing the config acts as a bottleneck in the scenario.

Config present in a single server leading to a bottleneck
Config present in a single server leading to a bottleneck

Designing real-time data streaming applications can seem easy but developers make a lot of mistakes like the ones mentioned above especially if they come from the microservices world.

The important part is to understand the basics of the data streams and how to process a single stream and then move on to the complex applications dealing with multiple joins, realtime configuration updates, etc.

One of the most important books in this domain is

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems.