Top 6 Worst Apache Kafka JIRA Bugs

Entomophiliac, Anna McDonald (Principal Customer Success Technical Architect, Confluent) has seen her fair share of Apache Kafka® bugs. For her annual holiday roundup of the most noteworthy Kafka bugs, Anna tells Kris Jenkins about some of the scariest, most surprising, and most enlightening corner cases that make you ask, “Ah, so that’s how it really works?”She shares a lot of interesting details about how batching works, the replication protocol, how Kafka’s networking stack dances with Linux’s one, and which is the most important Scala class to read, if you’re only going to read one.In particular, Anna gives Kris details about a bug that he’s been thinking about lately – sticky partitioner (KAFKA-10888). When a Kafka producer sends several records to the same partition at around the same time, the partition can get overloaded. As a result, if too many records get processed at once, they can get stuck causing an unbalanced workload. Anna goes on to explain that the fix required keeping track of the number of offsets/messages written to each partition, and then batching to force more balanced distributions.She found another bug that occurs when Kafka server triggers TCP Congestion Control in some conditions (KAFKA-9648). Anna explains that when Kafka server restarts and then executes the preferred replica leader, lots of replica leaders trigger cluster metadata updates. Then, all clients establish a server connection at the same time that lots TCP requests are waiting in the TCP sync queue.The third bug she talks about (KAFKA-9211), may cause TCP delays after upgrading…. Oh, that’s a nasty one. She goes on to tell Kris about a rare bug (KAFKA-12686) in Partition.scala where there’s a race condition between the handling of an AlterIsrResponse and a LeaderAndIsrRequest. This rare scenario involves the delay of AlterIsrResponse when lots of ISR and leadership changes occur due to broker restarts.Bugs five (KAFKA-12964) and six (KAFKA-14334) are no better, but you’ll have to plug in your headphones and listen in to explore the ghoulish adventures of Anna McDonald as she gives a nightmarish peek into her world of JIRA bugs. It’s just what you might need this holiday season!EPISODE LINKSKAFKA-10888: Sticky partition leads to uneven product msg, resulting in abnormal delays in some partitionsKAFKA-9648: Add configuration to adjust listen backlog size for AcceptorKAFKA-9211: Kafka upgrade 2.3.0 may cause tcp delay ack(Congestion Control)KAFKA-12686: Race condition in AlterIsr response handlingKAFKA-12964: Corrupt segment recovery can delete new producer state snapshotsKAFKA-14334: DelayedFetch purgatory not completed when appending as followerOptimizing for Low Latency and High ThroughputDiagnose and Debug Apache Kafka IssuesWatch the videoJoin the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperUse PODCAST100 to get $100 of free Confluent Cloud usage (details

Om Podcasten

Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community—from Kafka connectors and distributed systems, to data mesh, data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join our hosts as they stream through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®️, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.