Apache Kafka powers LinkedIn, where it was conceived and built. It has since been open sourced under the Apache banner and is now one of the most widely used platforms in the big data stack, powering a number of enterprises.
Purpose
Apache Kafka is a publish-subscribe, replicated, distributed, massive-scale, real-time messaging platform.
Kafka lets you:
- Publish and subscribe to streams of records from and to a number of sources and sinks.
- Store stream records in a fault-tolerant manner for a configured retention period, with a configurable number of copies (see the sketch after this list).
- Process streams of records in real time.
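To make the configurable retention period and replica count concrete, here is a minimal sketch using the Java AdminClient; the broker address, topic name and the numbers chosen are assumptions, not recommendations.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, 2 copies of each partition, records retained for 7 days.
            NewTopic topic = new NewTopic("page-views", 3, (short) 2)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```

The retention period is set per topic via retention.ms, while the replication factor is supplied when the topic is created.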
Design
Apache Kafka has been designed from the ground up for simplicity, delivering high throughput with low latency at massive volume. It is distributed, leveraging Apache ZooKeeper for node coordination to achieve fault tolerance and data replication. Kafka's persistence is based on log files on disk, leaving memory management to the operating system (a page-cache-centric architecture). This allows Kafka to retain messages for the configured duration.
Kafka concepts
- Producers publish messages to topics. Topics are split into partitions; a producer can write to a specific partition, and a consumer can read from a specific partition.
- Consumers poll messages from topics. Consumer groups are supported – multiple consumers in the same group share consumption of a topic, with each partition read by one member of the group at a time (a minimal producer/consumer sketch follows this list).
- The Connect API supports connecting to disparate systems and technologies.
- The Streams API supports stream processing.
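A minimal sketch of the producer, consumer and consumer-group notions with the Java client is shown below; the broker address, topic name page-views and group id analytics are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceConsumeExample {
    public static void main(String[] args) {
        // Producer: publishes a record to the "page-views" topic; the record key determines the partition.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
        }

        // Consumer: joins the "analytics" consumer group and polls the same topic.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "analytics");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d key=%s value=%s%n",
                        record.partition(), record.key(), record.value());
            }
        }
    }
}
```

Records with the same key land on the same partition, so per-key ordering is preserved within a partition.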
Use-cases
The primary use case of Kafka is building real-time streaming pipelines between disconnected systems. It also supports stream processing, to transform stream records or act on them (a minimal Kafka Streams sketch follows the list below).
- As a message broker within an enterprise, connecting systems with low latency and high throughput.
- Log aggregation – Stream logs from multiple instances and feed them to downstream systems for analysis.
- Data collection – Aggregate streams of data feeds, such as website usage data or IoT device data, for real-time processing or reporting.
- Event bus – Collect events generated by varied sources and deliver them to consuming systems.
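For the stream-processing side of such pipelines, a Kafka Streams application can transform records as they flow from one topic to another. The sketch below assumes hypothetical topic names raw-events and clean-events and a local broker; it simply upper-cases each record value.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercasePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from the input topic, transform each record value, write to the output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("raw-events");
        source.mapValues(value -> value.toUpperCase()).to("clean-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```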
Non-functional aspects
- Scales horizontally.
- Fault tolerant, with a built-in leader election mechanism.
- Data is replicated across broker nodes, with a configurable number of replicas.
- Security – Authentication of clients (producers and consumers) with Kafka brokers is available, as is authentication support for ZooKeeper. SSL connections are supported (see the sketch below).
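As a rough sketch of the client-side security settings, a producer or consumer can be pointed at a TLS-enabled listener with properties along these lines; the host, port, file paths and passwords are placeholders.

```java
import java.util.Properties;

public class SecureClientConfig {
    public static Properties sslProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");              // TLS listener (address and port are placeholders)
        props.put("security.protocol", "SSL");                                  // encrypt traffic between client and broker
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");                       // placeholder password
        // For mutual TLS (client authentication), a keystore is supplied as well:
        props.put("ssl.keystore.location", "/path/to/client.keystore.jks");     // placeholder path
        props.put("ssl.keystore.password", "changeit");                         // placeholder password
        return props;
    }
}
```

These properties are merged with the usual serializer/deserializer settings when constructing a producer or consumer.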