Did you know that more than one-third of Fortune 500 companies use Kafka? Top travel companies, banks, insurance agencies, and telecom companies all rely on Kafka for real-time data streaming or for collecting big data.
When it comes to streaming data, companies want something scalable, fast, and durable, and Kafka happens to be a popular one-stop solution for all of these.
But why choose Kafka?
Its simplicity is its best feature. If you want stability, reliability, durability, and a flexible publish-subscribe/queue that scales easily with your consumers, you will have a solid asset in hand.
In this blog post, we will discuss everything you need to know about Kafka. So stay tuned.
What is Kafka?
The first time I heard about Kafka, I immediately thought of Franz Kafka, the great German-language writer and novelist whose works I hold in high regard for their portrayal of the absurd and the surreal.
In technical terms, Kafka is an open-source distributed event streaming platform. Originally developed at LinkedIn for managing event streaming flows, Kafka was later open-sourced and is now maintained by the Apache Software Foundation.
Kafka follows a publisher-subscriber model for data streaming and is mostly deployed in a distributed configuration: a Kafka cluster consisting of multiple Kafka brokers.
But before diving deeper into Kafka internals and architecture, we will define a few terms related to Kafka. As the great French philosopher Voltaire famously said,
“If you wish to converse with me, define your terms”
Publisher Subscriber Model
In a publisher-subscriber model, publishers publish messages under certain classes (in Kafka these are called topics), whereas consumers subscribe to certain classes of messages. A typical publisher-subscriber model is implemented using a messaging queue or a broker: one or more publishers publish messages to one or more queues, and the consumers subscribed to those classes consume the messages from the broker.
This model provides great flexibility: it enforces loose coupling by design, so publishers and consumers can be configured independently without being concerned with each other's implementation.
In Kafka, this model allows producers and consumers to remain oblivious to each other (it decouples them) and hides the implementation details of read and write operations to and from the broker.
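As a rough illustration of this decoupling (a minimal sketch, assuming a single broker running on localhost:9092; the topic name pubsub-demo is made up for this example), a producer publishes to a topic without knowing who, if anyone, is reading, and any number of consumers can independently subscribe to the same stream:

# Publish a few messages to the topic; the producer knows nothing about the consumers
$ bin/kafka-console-producer.sh --topic pubsub-demo --bootstrap-server localhost:9092
>hello subscribers
>another message

# In one or more separate terminals, each consumer independently reads the full stream
$ bin/kafka-console-consumer.sh --topic pubsub-demo --from-beginning --bootstrap-server localhost:9092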
Distributed System
When we define Kafka as a distributed system, we first have to understand what a distributed system really is.
A distributed system has multiple nodes, usually hosted on independent physical systems, working and coordinating with each other over a network to achieve one or more objectives. Distributed computing leverages the parallelism of such a system to achieve properties like concurrency, availability, fault tolerance, etc.
Why Kafka?
Kafka is used in the context of data integration as well as real-time stream processing.
We are more accustomed to hearing ETL (Extract, Transform, Load) in association with data integration. Giants like Netflix, Facebook, or LinkedIn process petabytes of data daily, which powers their search, recommendation engines, and analytics.
The question is how they are able to handle such a huge amount of data. An obvious answer is a lot of hardware. But what kind of software do they run on their server farms to make all of this possible?
Traditionally batch ETL processes have been enough to serve most of our data warehouse needs. A traditional software system typically generates data from 1…n sources that include various applications, relational/nonrelational databases, APIs, etc. ETL is the glue in the data integration process that is responsible for seamless data transfer to the warehouse.
With storage becoming cheaper and faster every passing day, and with CPUs capable of processing huge amounts of data, we have entered the age of data analytics. With the emergence of big data and the requirements of modern systems, we now log events and web traffic that are ultimately fed into a number of storage systems (S3, GCS, etc.) and data processing systems (e.g. Spark) to build services like real-time analytics, reporting systems, data lakes, and data warehouses.
Common use cases of Kafka, as stated in the official Kafka docs, are the following:
- Messaging
- Website Activity Tracking
- Metrics
- Log Aggregation
- Stream Processing
- Event Sourcing
- Commit Log
Kafka Overview
Apache Zookeeper
Kafka, by design, requires another service, Zookeeper, to manage its broker/cluster topology; Zookeeper is a core dependency for running Kafka. Zookeeper is a service specially designed to help distributed processes coordinate with each other.
Since Kafka is a distributed streaming application deployed in clusters, Zookeeper takes the pain out of managing leader election for Kafka brokers and topic-partition pairs. Whenever a new node is added, removed, or goes offline, Zookeeper synchronizes this information to all the nodes/brokers in the Kafka cluster so that every node is aware of the overall cluster configuration.
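If you are curious about what Zookeeper actually tracks, the zookeeper-shell tool bundled with Kafka lets you peek at the znodes it maintains (a quick sketch, assuming Zookeeper is running locally on its default port 2181):

# List the ids of the brokers currently registered with Zookeeper
$ bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids

# See which broker is currently acting as the cluster controller
$ bin/zookeeper-shell.sh localhost:2181 get /controller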
Kafka Broker
Kafka server, broker, and node all refer to the same concept; the terms are synonymous.
A broker, in literal terms, is an individual or entity that executes buying and selling orders on behalf of others. A Kafka broker receives messages from producers, stores them in partitions organized by topic, and allows consumer clients to read the messages from it. A group of Kafka brokers can be deployed as a cluster managed by Zookeeper.
Topics and Partitions
Kafka organizes data streams into logical structures called topics. A topic is an abstraction associated with a unique string id that is the same across the whole Kafka cluster.
Why does Kafka store data in partitions?
Answer: To achieve concurrency, which enables Kafka to scale for high-speed reads and writes.
The producer writes a message to a topic, and Kafka determines, on the basis of the message key, which partition the message should go to. Partitions are ordered, immutable storage entities to which producers can only append data. Each record is assigned a sequential id called an offset that identifies the record's location within a partition.
Partitions allow Kafka to scale a topic across multiple Kafka nodes. Since there can be multiple producers writing to the same topic, Kafka achieves write scalability by distributing the messages across multiple servers, effectively acting as a load balancer.
On the consumer end, Kafka utilizes the partition mechanism to allow low-latency read access to a topic. Kafka maintains each consumer's offset in an internal Kafka topic, which allows it to remember how far every consumer has read in a topic.
We can have as many consumers as we like reading from a topic at the same time, without worrying about maintaining commit logs or job history as we usually do in incremental ETLs.
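To see partitioning in action, here is a minimal sketch assuming a broker on localhost:9092; the topic name orders and the keys are made up for illustration. Records that share a key always land in the same partition, which is what preserves per-key ordering:

# Create a topic with three partitions (a single replica is fine for a local setup)
$ bin/kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

# Inspect the partition layout, leaders, and replicas
$ bin/kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092

# Produce keyed records in key:value form; all records with key user42 go to the same partition
$ bin/kafka-console-producer.sh --topic orders --bootstrap-server localhost:9092 --property parse.key=true --property key.separator=:
>user42:order-created
>user42:order-paid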
Kafka APIs
In addition to command-line tooling for management and administration tasks, Kafka has five core APIs for Java and Scala:
- The Admin API
Manage and inspect topics, brokers, and other Kafka objects.
- The Producer API
Used to publish streams of events to one or more Kafka topics.
- The Consumer API
Allows applications to subscribe to (read) one or more topics and to process the stream of events produced to them.
- The Kafka Streams API
Provides higher-level functions for efficient handling of streaming data: transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. Messages are read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams into output streams.
- The Kafka Connect API
Build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka.
For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you don’t have to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
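As a quick way to try the Connect API without writing any code, the Kafka distribution bundles a standalone Connect runner and a simple file connector; the property files referenced below ship in the config directory of the download (a sketch, assuming a broker is already running locally):

# Stream lines from a local file into a Kafka topic using the bundled FileStreamSource connector
$ bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties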
Requirements and Installation
Both Kafka and Zookeeper require Java (a Java Virtual Machine) to run, so make sure you have Java version >= 8 installed. (link)
Kafka works best on Unix-like operating systems, e.g. Linux. Because Kafka buffers data in memory to achieve low latency, it effectively has a minimum memory requirement. If you are getting an insufficient-memory error, you can either upgrade the physical memory or increase the virtual memory (swap space in Linux) so that you have at least 6 GB of memory available.
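On Linux, one simple way to add swap space is with a swap file (a sketch; the 4 GB size is only an example — pick a size that brings your total memory to at least 6 GB):

# Create and enable a 4 GB swap file
$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

# Verify total memory and swap
$ free -h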
Installation
$ wget https://dlcdn.apache.org/kafka/2.8.0/kafka_2.12-2.8.0.tgz
$ tar -xzf kafka_2.12-2.8.0.tgz
$ cd kafka_2.12-2.8.0
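Before creating topics, both Zookeeper and the Kafka broker need to be running. The bundled property files provide a reasonable single-node configuration (run each command in its own terminal):

# Start Zookeeper first
$ bin/zookeeper-server-start.sh config/zookeeper.properties

# Then start the Kafka broker
$ bin/kafka-server-start.sh config/server.properties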
Create a Kafka Topic
$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
Write events to the Kafka topic using the Kafka producer client
$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
>Example event 1
>Example event 2
Read events from the Kafka topic using the Kafka consumer client
$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
Example event 1
Example event 2
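Kafka tracks how far each consumer group has read. To see this, you can consume with a named group and then inspect its offsets (the group name quickstart-group is made up for this sketch):

# Consume as part of a named consumer group so Kafka records the group's offsets
$ bin/kafka-console-consumer.sh --topic quickstart-events --group quickstart-group --bootstrap-server localhost:9092

# Inspect the group's current offset and lag for each partition
$ bin/kafka-consumer-groups.sh --describe --group quickstart-group --bootstrap-server localhost:9092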
Conclusion
To get the most out of Kafka, you may want a Kafka consultation right away. Even though it looks like an easy solution on the surface, there are complex details that are best explained and handled by experts.
As Kafka consultants, we help you use Kafka streams to get better, clearer insights from your data. For multi-region deployments, our experts help set up Kafka for full data recovery in case of technical disasters. Expert input also makes it easier to build big data apps that analyze and integrate high-velocity data sources.
So why worry about it all on your own when you can have a helping hand? Esketchers is a solid team of expert Kafka consultants. With a solid strategy, you can uncover a lot of interesting Kafka use cases in banking and other industries.
Get in touch about our Kafka services to get started.