How I managed to replicate my Kafka Cluster efficiently

Ben Yaakobi
3 min read · Jul 16, 2019

Replicating a Kafka cluster sounds pretty easy but turns out to be tough. It really is a simple task, but no tool seems to be simple enough or good enough. At my workplace, I tried to find a tool that would provide enough simplicity but could also keep up with the speed of the messages.

In our case, there are 5 topics that need to be replicated. While most of them are pretty simple, there’s one (let’s call it problem) that can reach, with a lot of small files, 500MiB per minute. The files are small (about 85KiB each), but when so many of them arrive every minute, finding the right tool can be hard.

I did see some posts around the Internet from people who gave up on finding this majestic tool and wrote their own. But I really wanted to find something that already reaches the full potential of Kafka.

Our first choice for this was Kafka Connect by Confluent with the Replicator plugin. We tested it a lot (although it’s not very well documented, at all). It did the job for the PoC, but CPU usage hit the maximum (with 8 cores). Eventually, we discovered that it’s not free either. I’m sure it could be configured to work more efficiently, but it seemed like we would need full support from Confluent, which would cost my company more money.

Since we wanted a free solution (because our requirements are not really complicated), we turned to uReplicator by Uber. However, it wasn’t documented well enough either. So after a little thinking, I thought: “why not NiFi?”

So we tested NiFi. The initial results were good, but not good enough because of the problem topic which, as explained above, receives a lot of small messages. So how did we reduce a lag of approximately 392,717,716 messages to zero in a few minutes?

Well, the naive way of developing this dataflow is to just connect ConsumeKafka to PublishKafka and fill in the connection details, right? Well, the NiFi developers kept another thing up their sleeves: a property named “Message Demarcator”, which packs a bunch of messages into one single FlowFile. The number of messages that go into one FlowFile is controlled by a property named “Max Poll Records”.
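For reference, here’s roughly what the relevant processor properties looked like on our flow (a sketch: the broker addresses, demarcator string, and poll size here are illustrative values, not the exact ones we ran with):

```
ConsumeKafka
  Kafka Brokers      : source-broker:9092        # illustrative address
  Topic Name(s)      : problem
  Group ID           : replicator
  Message Demarcator : <some unique byte sequence>
  Max Poll Records   : 2500

PublishKafka
  Kafka Brokers      : target-broker:9092        # illustrative address
  Topic Name         : problem
  Message Demarcator : <the same unique byte sequence>
```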

[Screenshot: the “Message Demarcator” & “Max Poll Records” properties. Please note that “Message Demarcator” should be unique so it won’t mix up with the actual messages!]

When configured on both ConsumeKafka and PublishKafka, they build up a pretty intense replicator. It can easily reach up to 25GiB! And it only cost us 1.25 cores!

But pay close attention: “Message Demarcator” must be unique because of how it works. NiFi puts the demarcator between your messages in the FlowFile; that’s how it tells one message from another. So if you choose an easy demarcator, say “a”, and you have a message containing the letter “a”, say “bac”, it will count “b” as one message and “c” as a different message!
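To make that concrete, here’s a minimal sketch in plain Java (not NiFi’s actual code) of what splitting a FlowFile on a bad demarcator does:

```java
import java.util.Arrays;

public class DemarcatorDemo {
    public static void main(String[] args) {
        // One real message that happens to contain the demarcator character.
        String flowFileContent = "bac";
        // A bad, non-unique demarcator.
        String demarcator = "a";
        // Splitting on the demarcator, as the publisher conceptually does,
        // prints [b, c]: two bogus messages instead of the original "bac".
        System.out.println(Arrays.toString(flowFileContent.split(demarcator)));
    }
}
```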

[Screenshot: the replicator template during the lag-reduction process]

So back to business: if you take a look at the picture above, it processed 114,592 FlowFiles. Each FlowFile contains about 2,500 messages. If you do the math, you’ll find out that’s about 286,480,000 messages! Easy!

Of course, there’s a disadvantage to this: the message key is not maintained. In our case, we didn’t care about the message key; we just needed any key. So we set the message key to the source partition and set “Partitioner class” to “RoundRobinPartitioner”.
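In PublishKafka terms, that boils down to something like this (a sketch; I’m assuming the key is wired from the kafka.partition attribute that ConsumeKafka writes on each FlowFile):

```
PublishKafka
  Kafka Key         : ${kafka.partition}    # source partition, set by ConsumeKafka
  Partitioner class : RoundRobinPartitioner
```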

So if you don’t care about your key and you’re looking for a great replicator, I think NiFi would do the job!
