Apache Flink is an open source stream processing framework for running real-time data processing pipelines in a fault-tolerant way at a scale of millions of events per second.

The key point is that it does this with modest resources and at latencies in the millisecond range.

So how does it manage that and what makes it better than other solutions in the same domain?

Low latency on minimal resources

Flink is based on the Dataflow model, i.e. it processes elements as they arrive rather than collecting them into micro-batches (the approach taken by Spark Streaming).

Micro-batches can contain a huge number of elements, and the resources needed to process those elements at once can be substantial. For a sparse data stream (one in which data arrives only in bursts at irregular intervals), this becomes a major pain point.

With Flink you also don’t need to go through the trial and error of tuning the micro-batch size so that the processing time of a batch doesn’t exceed its accumulation time. If it does, batches start to queue up and eventually all processing comes to a halt.
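The queueing effect described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Spark or Flink code: the function name `scheduling_delays` and its parameters are made up for illustration, and it assumes a single worker that processes one batch at a time with a fixed processing time.

```python
def scheduling_delays(batch_interval_ms, processing_time_ms, num_batches):
    """Return how long each micro-batch waits before processing starts.

    Assumptions (illustrative only): a batch becomes ready at the end of
    every interval, and one worker processes batches strictly in order.
    """
    delays = []
    worker_free_at = 0  # time at which the worker finishes its current batch
    for i in range(num_batches):
        ready_at = (i + 1) * batch_interval_ms   # batch i has finished accumulating
        start_at = max(ready_at, worker_free_at)  # wait if the worker is still busy
        delays.append(start_at - ready_at)
        worker_free_at = start_at + processing_time_ms
    return delays
```

When the processing time is below the batch interval, every delay stays at zero. When it is above the interval, each batch waits longer than the one before it, which is the backlog that eventually stalls the job.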

The Dataflow model allows Flink to process millions of records per minute at millisecond latencies on a single machine (Flink’s managed memory and custom serialisation also contribute, but more on that in the next article). Here are some benchmarks.

Variety of Sources and Sinks

Flink provides seamless connectivity to a variety of data sources and sinks.

Some of these include:

Fault tolerance

Flink provides robust fault tolerance through checkpointing (periodically saving internal state to external storage such as HDFS).

Flink’s checkpointing can also be made incremental (saving only the changes rather than the whole state), which greatly reduces the amount of data written to HDFS and the I/O duration. The checkpointing overhead is almost negligible, which lets users keep large state inside Flink applications.

Flink also provides a high-availability setup through ZooKeeper. This re-spawns the job when the driver (known as the JobManager in Flink) crashes due to an error.
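The idea of saving only the changes can be sketched with a toy key-value state. This is not how Flink implements it (Flink’s incremental checkpoints work with the RocksDB state backend at the level of its files); the class `IncrementalCheckpointer` and its methods are hypothetical and exist only to show the principle, assuming state values are simple immutable values.

```python
_MISSING = object()  # sentinel for "key was not in the previous snapshot"


class IncrementalCheckpointer:
    """Toy illustration: store per-checkpoint deltas instead of full copies."""

    def __init__(self):
        self._last_snapshot = {}
        self.checkpoints = []  # one delta per checkpoint, in order

    def checkpoint(self, state):
        # Keys that are new or whose value changed since the last checkpoint.
        changed = {
            k: v for k, v in state.items()
            if self._last_snapshot.get(k, _MISSING) != v
        }
        # Keys that existed at the last checkpoint but are gone now.
        removed = [k for k in self._last_snapshot if k not in state]
        delta = {"changed": changed, "removed": removed}
        self.checkpoints.append(delta)
        self._last_snapshot = dict(state)
        return delta

    def restore(self):
        # Rebuild the latest state by replaying every delta in order.
        state = {}
        for delta in self.checkpoints:
            state.update(delta["changed"])
            for k in delta["removed"]:
                state.pop(k, None)
        return state
```

After the first checkpoint, each delta contains only the keys that were touched, so a large state with few updates produces small checkpoints. The trade-off in this sketch is that restoring has to replay all deltas.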

High level API

Unlike Apache Storm (which also follows a dataflow model), Flink provides an extremely simple high-level API in the form of Map/Reduce, Filter, Window, GroupBy, Sort and Join operations.

This gives developers a lot of flexibility and speeds up development when writing new jobs.

Stateful processing

Sometimes an operation needs config or data beyond the current record to do its work. A simple example is counting the number of records of type Y in a stream X. This counter is known as the state of the operation.

Flink provides a simple API for interacting with state much as you would interact with a Java object. State can be backed by memory, the filesystem or RocksDB; it is checkpointed and therefore fault tolerant. In the example above, if your application restarts, the counter value is preserved.
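The counter example can be sketched in a few lines. This is plain Python rather than Flink’s state API, and the names `TypeCounter`, `snapshot` and `restore` are hypothetical; it assumes each record is a dict with a `"type"` field and that a snapshot is a JSON string kept somewhere durable.

```python
import json


class TypeCounter:
    """Counts records of one type and can survive a restart via snapshots."""

    def __init__(self, wanted_type, count=0):
        self.wanted_type = wanted_type
        self.count = count  # this is the operator's state

    def process(self, record):
        # Update the state only for records of the type we care about.
        if record["type"] == self.wanted_type:
            self.count += 1
        return self.count

    def snapshot(self):
        # Serialise the state so it can be written to durable storage.
        return json.dumps({"wanted_type": self.wanted_type, "count": self.count})

    @classmethod
    def restore(cls, snapshot):
        # Recreate the operator from a previously saved snapshot.
        data = json.loads(snapshot)
        return cls(data["wanted_type"], data["count"])
```

If the process dies after a snapshot is taken, a new `TypeCounter` built with `restore` carries on from the saved count instead of starting from zero. In Flink, the framework takes and restores these snapshots for you.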

Exactly once processing

Apache Flink provides exactly-once processing, like Kafka 0.11 and above, with minimal overhead and almost no development effort. This is not trivial to achieve in other streaming solutions such as Spark Streaming and Storm, and is not supported in Apache Samza.


SQL Support

Like Spark Streaming, Flink also provides a SQL interface, which makes writing a job easier for people without a programming background. Flink SQL is maturing quickly and is already used by companies such as Uber and Alibaba for analytics on real-time data.

Environment Support

A Flink job can run on a distributed system or on a local machine. It can run on Mesos, YARN and Kubernetes as well as in standalone mode (e.g. in Docker containers). Since Flink 1.4, Hadoop is no longer a prerequisite, which opens up many more places to run a Flink job.

Awesome community

Flink has a great developer community, which means frequent new features and bug fixes as well as great tools that further reduce developer effort. Some of these tools are:

  • Flink TensorFlow: run TensorFlow graphs as a Flink process
  • Flink HTM: anomaly detection on a stream in Flink
  • Tink: a temporal graph library built on top of Flink

Flink SQL and Complex Event Processing (CEP) were also initially developed by Alibaba and contributed back to Flink.

Note: Spark 2.3 has started offering experimental support for continuous processing, rather than micro-batching, in Structured Streaming. Check it out here. I’ll run some benchmarks using yahoo-streaming-benchmarks and post the results in the next article.

Connect with me on LinkedIn or Facebook, or drop a mail to [email protected] to share your feedback.