Over the past five years, I have helped design and operate streaming data pipelines processing trillions of events for some of the world's most demanding production workloads. Every time I engage with a new enterprise team struggling to reduce their analytics latency from hours to seconds, I see the same set of architectural mistakes being made. This article distills the most important lessons I have learned — and the patterns that consistently deliver the sub-10ms latency that modern business applications require.
Before diving into specific patterns, it is worth being precise about what we mean by "low-latency." In the context of streaming data pipelines, latency has multiple components: ingestion latency (time from event creation to when it enters the pipeline), processing latency (time to apply transformations and enrichments), and delivery latency (time from processing completion to when results are available in the destination). Optimizing any one of these in isolation while ignoring the others is a common source of frustration when end-to-end performance fails to meet expectations.
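One concrete way to keep these components honest is to timestamp events at each boundary and report the breakdown separately, rather than only measuring end to end. A minimal sketch in Python — the timestamp field names here are illustrative, not from any particular framework:

```python
# Sketch: decomposing end-to-end latency into its components from
# per-event timestamps. Field names (created_at, ingested_at,
# processed_at, delivered_at) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EventTimestamps:
    created_at: float    # event creation time (epoch seconds)
    ingested_at: float   # event entered the pipeline
    processed_at: float  # transformations and enrichment complete
    delivered_at: float  # results visible in the destination

def latency_breakdown(ts: EventTimestamps) -> dict:
    """Return each latency component in milliseconds."""
    return {
        "ingestion_ms": (ts.ingested_at - ts.created_at) * 1000,
        "processing_ms": (ts.processed_at - ts.ingested_at) * 1000,
        "delivery_ms": (ts.delivered_at - ts.processed_at) * 1000,
        "end_to_end_ms": (ts.delivered_at - ts.created_at) * 1000,
    }
```

Emitting all four numbers per pipeline stage makes it obvious which component is eating the latency budget when end-to-end performance degrades.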
The choice of message broker is the most consequential architectural decision you will make when designing a streaming pipeline. Apache Kafka remains the dominant choice for enterprise deployments with good reason: its log-based architecture provides exceptional throughput, replayability, and consumer group semantics that are difficult to replicate elsewhere.
However, Kafka's performance characteristics are highly sensitive to configuration. Default settings are tuned for safety, not throughput. In production deployments, the following Kafka producer configurations dramatically reduce end-to-end latency:
Setting linger.ms=0 eliminates producer batching delay at the cost of slightly reduced throughput — acceptable for latency-sensitive pipelines but inappropriate for bulk ingestion scenarios. acks=1 rather than acks=all reduces producer latency by removing the synchronous wait for replica acknowledgment; use this only when occasional message loss is acceptable, such as in analytics contexts where a single missed event has negligible impact. compression.type=lz4 reduces network bandwidth without meaningful CPU overhead on modern hardware.
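To make the tradeoff concrete, here is a minimal sketch of the two producer profiles in Python. The config keys are standard Kafka producer settings (as accepted by clients such as confluent-kafka); the broker address and the bulk-ingestion values are placeholders, and no broker connection is made here:

```python
# Sketch: latency-tuned vs. bulk-ingestion Kafka producer configs.
# Keys are standard Kafka producer settings; values for the bulk
# profile are illustrative starting points, not recommendations.

LATENCY_TUNED_PRODUCER_CONFIG = {
    "bootstrap.servers": "broker1:9092",  # placeholder address
    "linger.ms": 0,             # send immediately; no batching delay
    "acks": 1,                  # leader-only ack; tolerates rare loss
    "compression.type": "lz4",  # cheap compression, less time on the wire
}

def producer_config(latency_sensitive: bool) -> dict:
    """Return a producer config tuned for latency or for bulk throughput."""
    cfg = dict(LATENCY_TUNED_PRODUCER_CONFIG)
    if not latency_sensitive:
        # Bulk ingestion: batch aggressively, wait for all in-sync replicas.
        cfg.update({"linger.ms": 20, "acks": "all", "batch.size": 262144})
    return cfg
```

Keeping both profiles side by side in code, rather than hand-editing one config, makes the safety/latency tradeoff explicit and reviewable.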
For deployments requiring single-digit millisecond producer acknowledgment, consider Kafka's newer features: KRaft mode eliminates the ZooKeeper dependency that adds round-trip latency for partition leader elections, and the introduction of tiered storage in Kafka 3.6 enables separation of hot and cold data without sacrificing consumer performance on recent records.
Poor partitioning strategy is the second most common cause of unexpectedly high pipeline latency at scale. Kafka's parallelism model is inherently tied to partition count — a topic with 12 partitions can be consumed by at most 12 parallel consumer threads within a consumer group. This becomes a bottleneck when event volumes grow beyond what a single consumer thread can process within your latency budget.
The right partition count depends on your target throughput, expected event sizes, and consumer processing latency. A useful rule of thumb: start with a partition count that allows each partition to receive no more than 10MB/s of data at expected peak load, and ensure your consumer instances can each process all messages from their assigned partitions within your latency SLA during peak throughput.
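That rule of thumb is simple enough to encode directly. A sketch, assuming the 10MB/s-per-partition ceiling described above (the threshold is a starting point to be validated against your own consumers, not a universal constant):

```python
# Sketch: rule-of-thumb partition sizing — cap per-partition inflow
# at ~10 MB/s at expected peak load. The default ceiling mirrors the
# heuristic in the text; tune it for your hardware and consumers.
import math

def min_partitions(peak_throughput_mb_s: float,
                   max_mb_s_per_partition: float = 10.0) -> int:
    """Smallest partition count keeping per-partition load under the cap."""
    return max(1, math.ceil(peak_throughput_mb_s / max_mb_s_per_partition))
```

Remember that partitions are easy to add but painful to remove, and that repartitioning breaks key-to-partition affinity, so it pays to size with projected rather than current peak throughput.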
Partition key selection is equally important. Partitioning by customer ID or entity ID enables ordered processing within an entity context and allows efficient state management in stateful stream processors. However, cardinality extremes can create hot partitions — a small number of high-volume customers whose events all map to the same partition, creating a bottleneck that affects tail latency for the entire consumer group. Monitor partition lag per partition, not just aggregate consumer group lag, to detect hot partition scenarios early.
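A minimal sketch of that per-partition monitoring in Python — the 3x-of-mean threshold is an illustrative heuristic of my own, not a standard, and a real deployment would feed this from consumer-group lag metrics:

```python
# Sketch: flag hot partitions from per-partition consumer lag.
# Aggregate group lag can look healthy while one partition falls
# behind; comparing each partition to the mean surfaces that early.

def hot_partitions(lag_by_partition: dict[int, int],
                   ratio_threshold: float = 3.0) -> list[int]:
    """Return partitions whose lag exceeds ratio_threshold * mean lag."""
    if not lag_by_partition:
        return []
    mean_lag = sum(lag_by_partition.values()) / len(lag_by_partition)
    if mean_lag == 0:
        return []
    return [p for p, lag in sorted(lag_by_partition.items())
            if lag > ratio_threshold * mean_lag]
```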
The most significant latency challenges in enterprise streaming pipelines arise from stateful operations: windowed aggregations, joins between streams, and enrichment lookups against reference data. Apache Flink's checkpoint-based state management is the industry standard for exactly-once stateful processing, but its performance characteristics require careful tuning.
Flink's RocksDB state backend is the correct choice for production deployments with state sizes exceeding a few hundred megabytes per task manager. The in-memory HashMapStateBackend delivers lower read/write latency but does not support state larger than available JVM heap, making it unsuitable for large-scale production deployments. When using RocksDB, enable block cache optimization and configure appropriate compaction settings — default RocksDB settings are optimized for disk-based key-value storage, not in-process streaming computation.
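For reference, a minimal flink-conf.yaml fragment along these lines. The option keys are from recent Flink 1.x releases — verify the names against your version — and the values are starting points, not recommendations:

```yaml
state.backend: rocksdb
state.backend.incremental: true                 # checkpoint only changed SST files
state.backend.rocksdb.memory.managed: true      # let Flink size block cache / write buffers
state.backend.rocksdb.predefined-options: FLASH_SSD_OPTIMIZED
state.checkpoints.dir: s3://my-bucket/checkpoints   # placeholder path
```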
For enrichment lookups — for example, enriching a transaction event with customer profile data from a database — the naive approach of synchronous database queries within the stream processor is almost always a latency disaster. Each synchronous lookup adds the round-trip database latency to every event's processing time. The correct pattern is to maintain a local, in-memory cache populated via a change data capture (CDC) stream from the reference database. This transforms enrichment from a synchronous remote call into a local in-process lookup, typically reducing enrichment latency from tens of milliseconds to single-digit microseconds.
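A sketch of the pattern in Python. The CDC event shape (op/key/after fields) is illustrative rather than tied to any particular CDC tool, and a production cache would also handle warm-up, staleness metrics, and out-of-order changes:

```python
# Sketch: in-process enrichment cache kept fresh by a CDC stream.
# apply_cdc_event would be driven by a consumer of the CDC topic;
# the event field names here are illustrative assumptions.
import threading

class CdcEnrichmentCache:
    def __init__(self):
        self._profiles: dict[str, dict] = {}
        self._lock = threading.Lock()

    def apply_cdc_event(self, event: dict) -> None:
        """Apply one change event: create/update carry the row after-image."""
        op, key = event["op"], event["key"]
        with self._lock:
            if op == "delete":
                self._profiles.pop(key, None)
            else:
                self._profiles[key] = event["after"]

    def enrich(self, txn: dict) -> dict:
        """Local, in-memory lookup instead of a synchronous DB round trip."""
        with self._lock:
            profile = self._profiles.get(txn["customer_id"], {})
        return {**txn, "customer_profile": profile}
```

The key design point is that the remote call happens once per reference-data change, on the CDC path, rather than once per event on the hot path.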
Exactly-once semantics in distributed streaming systems come with a real cost: they require coordination between producers, brokers, and consumers that adds latency and reduces throughput compared to at-least-once processing. Understanding when exactly-once is actually necessary — versus when idempotent consumers make it unnecessary — can dramatically simplify pipeline architecture and improve performance.
Financial transaction processing, inventory management, and any pipeline that writes to a database that maintains account balances or counters genuinely requires exactly-once semantics. A billing pipeline that accidentally double-processes a payment event is a serious production incident.
Many analytics pipelines, however, can tolerate occasional duplicate events with negligible business impact. A dashboard showing page views may show 0.001% more views than actually occurred due to a transient consumer failure and reprocessing — this is operationally acceptable. Designing these pipelines with at-least-once semantics while ensuring downstream consumers are idempotent can reduce end-to-end latency by 30-40% compared to a full exactly-once pipeline.
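A sketch of such an idempotent consumer in Python, deduplicating on an event ID with a bounded in-memory set. Real deployments would more likely use a TTL'd external store (e.g. Redis) or a unique-key constraint in the sink database; the names here are illustrative:

```python
# Sketch: at-least-once delivery made effectively-once by an
# idempotent sink that deduplicates on event_id. The bounded
# OrderedDict is a stand-in for a TTL'd store or DB constraint.
from collections import OrderedDict

class IdempotentSink:
    def __init__(self, max_ids: int = 1_000_000):
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_ids = max_ids
        self.page_views = 0  # illustrative side effect

    def process(self, event: dict) -> bool:
        """Apply side effects once per event_id; skip redeliveries."""
        eid = event["event_id"]
        if eid in self._seen:
            return False  # duplicate from reprocessing; no side effects
        self._seen[eid] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict oldest remembered ID
        self.page_views += 1
        return True
```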
The most sophisticated streaming architecture in the world will fail to meet its latency SLA if the team operating it cannot detect degradation before it impacts production workloads. Effective observability for streaming pipelines requires instrumentation at every stage of the pipeline, with millisecond-resolution metrics rather than the minute-level aggregation that most monitoring tools default to.
The three most important metrics to track in a low-latency streaming pipeline are: consumer group lag (the number of unconsumed messages waiting in each partition), processing latency percentiles (p50, p95, p99 end-to-end latency for each pipeline stage), and checkpoint duration for stateful Flink jobs (checkpoint failures or degradation are often the first sign of approaching resource exhaustion).
Set latency SLO alerting at the p99 level, not p50. The median latency of a high-throughput streaming pipeline is often excellent while the tail latency is terrible — and tail latency is what your application actually experiences during traffic spikes, which is precisely when low latency matters most.
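As a sketch, nearest-rank percentiles are enough to express that alerting rule. The 10ms SLO is this article's target, and the helper names are illustrative; a production system would compute percentiles in its metrics backend rather than in application code:

```python
# Sketch: alert on p99 latency against the SLO, not on the median.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; sufficient for an alerting sketch."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

def latency_breach(samples_ms: list[float], slo_ms: float = 10.0) -> bool:
    """True when tail latency (p99) exceeds the SLO, even if p50 is fine."""
    return percentile(samples_ms, 99) > slo_ms
```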
Schema changes are one of the most common causes of unexpected pipeline latency incidents. A producer team that adds a required field — one without a default value — to an Avro schema without a coordinated consumer update will cause consumer deserialization failures, and the resulting consumer restart cycle can introduce minutes of lag that takes hours to drain.

Enforce schema registry usage for all production Kafka topics. Confluent Schema Registry with full compatibility mode prevents incompatible schema changes from being published without consumer updates. Establish a schema change process that requires consumer team sign-off before any producer deploys a schema update.
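One check such a process can automate is the added-required-field pitfall described above. A sketch, treating Avro schemas as plain dicts rather than going through a registry client (a registry's full compatibility check covers many more cases; this illustrates only the one rule):

```python
# Sketch: pre-deploy check for one Avro compatibility rule — a field
# added to a record must carry a default, or readers using the new
# schema cannot decode records written with the old one.

def added_fields_without_defaults(old_schema: dict, new_schema: dict) -> list[str]:
    """Return names of fields added in new_schema that lack a default."""
    old_names = {f["name"] for f in old_schema["fields"]}
    return [f["name"] for f in new_schema["fields"]
            if f["name"] not in old_names and "default" not in f]
```

Wiring a check like this into the producer's CI pipeline catches the mistake before it ever reaches the registry, let alone production consumers.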
Achieving and maintaining sub-10ms end-to-end latency in production streaming pipelines is entirely achievable with modern open-source infrastructure — but it requires deliberate architectural decisions at every layer of the stack. The teams I have seen succeed are those that treat their streaming infrastructure with the same engineering rigor they apply to their production databases: monitoring everything, planning for failure, and investing in operational tooling that makes the system transparent and controllable.
At Rapidata, we have codified many of these patterns into our managed platform, so teams can benefit from production-proven streaming infrastructure without needing to replicate months of operational learning. If you are building or scaling a real-time data pipeline, we would love to share more — reach out at info@rapidata.us.