Apache NiFi lets us query whole tables using pagination. But what if we have a custom query that we can't express as a table?

This was the question I had to answer when I was assigned to pull data from a MySQL database, push it to another database, and then perform an aggregation on that other database. Normally, I could just use ExecuteSQL, but in this case the returned data is far too large and can cause an OutOfMemoryError. I could have used pagination via GenerateTableFetch, but unfortunately it only works on actual tables, not on custom queries. …
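One workaround worth sketching (this is a general SQL technique, not necessarily the approach the full article lands on) is to wrap the custom query in a subquery and page through it with LIMIT/OFFSET yourself, which is roughly what GenerateTableFetch does for plain tables. All names here are illustrative:

```python
def paginate_query(custom_query: str, page_size: int, pages: int):
    """Wrap an arbitrary SELECT in a subquery and page through it
    with LIMIT/OFFSET, mimicking GenerateTableFetch for a custom
    query. Each yielded string is one page-sized statement that
    ExecuteSQL could run without loading the whole result set."""
    for page in range(pages):
        offset = page * page_size
        yield (f"SELECT * FROM ({custom_query}) AS paged "
               f"LIMIT {page_size} OFFSET {offset}")


statements = list(paginate_query(
    "SELECT t.id, SUM(t.amount) AS total FROM t GROUP BY t.id",
    page_size=1000,
    pages=3,
))
```

Note that plain LIMIT/OFFSET paging is only stable if the inner query has a deterministic order; in practice you would add an ORDER BY on a unique column.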

Design patterns are important even in Apache NiFi. In this article, we will review two design patterns and the use of the Funnel.

Apache NiFi is a DataFlow management tool. It basically lets you design code without writing it; you only configure it. Every good developer knows that good code uses design patterns to make it reusable, more readable, and easier to maintain. In NiFi, patterns can even add new functionality!

Let us review an example.

Retry Counter

Say we have the following simple flow:

This flow consumes messages from one Kafka cluster and publishes them to another Kafka cluster. So what if a message fails? Well, it would retry indefinitely. But what if we want it to retry only 3 times…
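The retry-counter pattern the title refers to is usually built from an UpdateAttribute processor that increments a counter attribute on each failure, plus a RouteOnAttribute processor that routes back to retry or on to a failure handler. Here is a minimal simulation of that routing logic; the attribute name and relationship names are illustrative, not taken from the article:

```python
MAX_RETRIES = 3

def route_on_retry(flowfile_attrs: dict) -> str:
    """Simulate one failure pass through the retry-counter pattern:
    increment the 'retry.count' attribute (as UpdateAttribute would)
    and decide the outgoing relationship (as RouteOnAttribute would).
    NiFi attributes are strings, hence the str/int conversions."""
    count = int(flowfile_attrs.get("retry.count", "0")) + 1
    flowfile_attrs["retry.count"] = str(count)
    return "retry" if count <= MAX_RETRIES else "give_up"


attrs = {}
outcomes = [route_on_retry(attrs) for _ in range(4)]
```

After three failed attempts the FlowFile is routed away instead of looping forever, which is exactly the behavior the plain flow lacks.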

As I've already stated in previous articles, Apache NiFi has a lot to offer. This time, I'll review the DNS side of it.

I've been given a task: to use CaptureChangeMySQL to track delete operations in certain tables of a certain MySQL database using the binlog. The problem is that we don't use GTID in our database. This means that if we use multiple servers and the processor switches to the next server, it will continue from the same offset, even though it's a different server. So if that does happen, we might miss a few deletions.

The general solution to these problems at my job is to use a hostname that changes its IP address in accordance with the most accurate server to…

Optimizing any application can turn out to be a tough mission. It depends on a variety of factors: memory, CPU, the application's resource consumption, and many others.

In NiFi, various optimizations can be made. In this article, I will review NiFi's application-level optimization and the difficulties behind it.

As you may know, there's an option in NiFi to adjust the Maximum Timer Driven Thread Count, as we've already seen in my previous article. This time, we'll review how to adjust it correctly.

So, what if we have a rather simple task that simply copies data from…

Replicating a Kafka cluster sounds pretty easy but can turn out to be tough. It really is a simple task, yet no tool seems simple enough or good enough. At my workplace, I tried to find the tool that would offer enough simplicity while still keeping up with the speed of the messages.

In our case, there are 5 topics that need to be replicated. While most of them are pretty simple, there's one (let's call it problem) that can reach 500 MiB per minute, made up of a lot of small files. …

What is this expression language? Is it completely exception-proof as long as the syntax is valid? And how did we use it in our own project?

One of the greatest features NiFi has given us is the expression language. This feature allows us to program our dataflow dynamically, changing the behavior according to the FlowFile being transferred.

An expression always starts with a dollar sign and an opening curly brace ("${") and ends with a closing curly brace ("}"). The first token is resolved either from the FlowFile's attributes (first priority) or from the Variable Registry. Then we can append a colon and a function. Example: ${variable:function()}.

NiFi allows us to use a variety of functions that allow us to be extremely dynamic with…
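The lookup order described above (FlowFile attribute first, then Variable Registry) is easy to get wrong, so here is a tiny simulation of just that precedence rule. This is a model for illustration, not NiFi's actual resolver:

```python
def resolve(name: str, flowfile_attributes: dict, variable_registry: dict):
    """Model the documented lookup order for ${name}: a value on the
    FlowFile's attributes wins over one in the Variable Registry;
    an unknown name resolves to None (NiFi would treat it as empty)."""
    if name in flowfile_attributes:
        return flowfile_attributes[name]
    return variable_registry.get(name)


# The attribute shadows the registry entry with the same name.
value = resolve("path", {"path": "/from/flowfile"}, {"path": "/from/registry"})
```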

How does NiFi decide to schedule processors? How does NiFi not fail when a processor does? What are “yield” and “penalty”? How can NiFi be so fast with all of these processors running and not overloading the system?

NiFi allows configuring the number of threads a processor will run on ("Concurrent Tasks"). In fact, what actually runs is the "onTrigger" function of the Processor interface, which needs to be prepared to run on several threads and be thread-safe as well. Of course, there are exceptions, marked with the "@TriggerSerially" annotation, that must run on one thread only (like ListHDFS, which writes to a ZooKeeper-based cache that is slow on writes).

the “Concurrent Tasks” configuration property

So, if I set "Concurrent Tasks" to 100, will my processor run on 100 threads? Not necessarily. NiFi has a configuration property named "Maximum Timer…
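The point the excerpt is setting up can be reduced to a one-liner: every processor's tasks are drawn from the shared timer-driven thread pool, so a single processor's real parallelism is capped by that pool. A deliberately simplified sketch (it ignores other processors competing for the same pool):

```python
def effective_parallelism(concurrent_tasks: int,
                          max_timer_driven_threads: int) -> int:
    """Upper bound on how many copies of onTrigger can actually run
    at once: the processor's own Concurrent Tasks setting, capped by
    the instance-wide Maximum Timer Driven Thread Count pool."""
    return min(concurrent_tasks, max_timer_driven_threads)


# 100 concurrent tasks against a default-sized pool of 10 threads:
bound = effective_parallelism(100, 10)
```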

When people hear that Apache NiFi is I/O intensive, they usually think that for each processor a FlowFile passes through, a read/write is made from/to disk. Actually, it's more complicated than that.

A FlowFile in NiFi is more than just a file on the disk. A FlowFile is constructed of two parts, backed by two stores: the Content Repository and the FlowFile Repository.

The FlowFile Repository

The FlowFile Repository only holds metadata about the FlowFile (mainly its attributes and which queue currently holds it). Not so complicated, is it? Well, it is. FlowFiles are held in a HashMap in memory, which makes it a lot faster to transfer them and deal with them. So how come the JVM memory is not exceeded? Well, NiFi has a special way of dealing with this: swapping. Not the OS…
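The split the excerpt describes can be pictured as two small records: the in-memory FlowFile holds attributes plus a pointer into the Content Repository, never the bytes themselves. A toy model (field names are illustrative, not NiFi's actual classes):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ContentClaim:
    """A pointer into the Content Repository on disk; the payload
    bytes themselves are NOT held in the FlowFile object."""
    container: str
    offset: int
    length: int

@dataclass
class FlowFileRecord:
    """The in-memory part tracked by the FlowFile Repository:
    attributes plus a claim referencing the content."""
    attributes: dict = field(default_factory=dict)
    content: Optional[ContentClaim] = None


ff = FlowFileRecord(
    attributes={"filename": "orders.csv", "path": "/in"},
    content=ContentClaim(container="repo-partition-0", offset=0, length=1024),
)
```

This is why adding or changing an attribute is cheap (memory only), while modifying content means writing a new claim to disk.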

How can we synchronize different nodes that are running on different machines? Or, more precisely, how can we do it with Apache ZooKeeper?

Say we have an application that is constructed of multiple nodes. Now, we have a task that we want only one of the nodes to perform. So how can it be done with Apache ZooKeeper?

First, what is Apache ZooKeeper? Well, "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." (according to its site). It's basically a filesystem where files (ZNodes) are also directories, hence can have child ZNodes. All ZooKeeper nodes hold a snapshot of the current state in memory (and in a persistent store), resulting in very fast reads (and slow writes). Each ZNode can contain up to 1 MB of data. ZooKeeper doesn't allow any concurrent writes.

So how does this work?

When a client sends a request to the ZooKeeper cluster to write to a ZNode or…
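The excerpt is cut off here, but the standard ZooKeeper recipe for "only one node performs the task" is leader election: each node creates an ephemeral sequential znode under an election path, and the node owning the lowest sequence number wins. A toy in-memory model of just that selection rule (no real ZooKeeper involved; names are illustrative):

```python
def elect_leader(arrival_order):
    """Toy model of ZooKeeper-style leader election: each node, in
    the order it connects, 'creates' an ephemeral sequential znode
    /election/task-NNNNNNNNNN, and the node holding the lowest
    sequence number is the one that performs the task."""
    znodes = {node: f"/election/task-{seq:010d}"
              for seq, node in enumerate(arrival_order)}
    return min(znodes, key=znodes.get)


# node2 connected first, so it holds the lowest-numbered znode.
leader = elect_leader(["node2", "node1", "node3"])
```

In a real deployment the ephemeral flag matters: if the leader's session dies, its znode vanishes and the next-lowest node takes over automatically.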

So I was asked to build a task that executes one Apache Spark job, let's call it job1, once an hour, and another Apache Spark job, let's call it job2, every 4 hours.

However, the request was a bit more complicated than that. Job2 should only be executed if the last run of job1 was successful (meaning, it did not fail).

I was asked to do this so I could compare it to Apache Airflow. Of course, Apache NiFi is not suited for this purpose (since it is a dataflow management tool, not a workflow management tool)…
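The scheduling rule in the excerpt is simple enough to state as a predicate, which is a useful way to check any flow built for it. A sketch under assumptions the article doesn't spell out (job2 fires on hours divisible by 4, and "last time" means the most recent job1 run):

```python
def should_run_job2(hour: int, last_job1_succeeded: bool) -> bool:
    """job1 runs every hour; job2 runs every 4 hours, but only if
    the most recent job1 run succeeded. The alignment of the 4-hour
    tick to hours divisible by 4 is an illustrative assumption."""
    return hour % 4 == 0 and last_job1_succeeded
```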

Ben Yaakobi
