About Apache ZooKeeper & Cluster Synchronization

How can we synchronize nodes that are running on different machines, and more precisely, how can we do it with Apache ZooKeeper?

Ben Yaakobi
4 min read · Jan 27, 2019

Say we have an application that is composed of multiple nodes. Now, we have a task that we want only one of those nodes to perform. So how can it be done with Apache ZooKeeper?

First, what is Apache ZooKeeper? Well, “ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services” (according to its site). It’s basically a filesystem where files (ZNodes) are also directories, and hence can have child ZNodes. All ZooKeeper servers hold a snapshot of the current state in memory (and in a persistent store), resulting in very fast reads (and slower writes). Each ZNode can contain up to 1MB of data, and ZooKeeper does not allow any concurrent writes.
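To make that concrete, here is a tiny sketch with the ZooKeeper Java client (the connection string and paths are invented for illustration):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is just an example).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // A ZNode acts like a file: it holds data (up to 1MB)...
        zk.create("/app-config", "maxJobs=5".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // ...and like a directory: it can have child ZNodes.
        zk.create("/app-config/flags", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Reads are served from the server's in-memory snapshot.
        System.out.println(new String(zk.getData("/app-config", false, null)));

        zk.close();
    }
}
```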

So how does this work?

When a client sends a request to the ZooKeeper cluster to write to a ZNode or to create a new one, the request is received by whichever server the client is currently connected to. The request is then forwarded to the leader ZooKeeper node, since only the leader can process write requests. The leader then distributes the request to the follower nodes, which should return an acknowledgement. It is enough for a majority of the cluster to acknowledge the request for it to be processed (a quorum).
For example: in a four-node cluster, three nodes need to acknowledge the request.

And how does this help us?
It gives us the guarantee that no two nodes can create or overwrite the same ZNode at the same time.
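A minimal sketch of how that guarantee is typically put to work (the connection string and the “/task-lock” path are invented for the example): every node races to create the same ZNode, and exactly one create succeeds.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RunOnce {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        try {
            // Every node attempts this; ZooKeeper lets exactly one succeed.
            zk.create("/task-lock", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("We own the lock: perform the one-off task.");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Another node got there first: skip the task.");
        }
    }
}
```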

ZNode creation in ZooKeeper can have one of four creation modes:
Ephemeral: “The znode will be deleted upon the client’s disconnect.”
Persistent: “The znode will not be automatically deleted upon client’s disconnect.”
Ephemeral Sequential: “The znode will be deleted upon the client’s disconnect, and its name will be appended with a monotonically increasing number.”
Persistent Sequential: “The znode will not be automatically deleted upon client’s disconnect, and its name will be appended with a monotonically increasing number.”
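In the Java API these map directly onto the CreateMode enum. A quick sketch (paths invented for illustration) showing how a sequential create returns the actual, numbered path:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreationModes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Persistent: survives the end of our session.
        zk.create("/jobs", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral: deleted automatically when our session ends.
        zk.create("/jobs/worker", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential: ZooKeeper appends the counter and returns the real path.
        String path = zk.create("/jobs/job-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println(path); // e.g. /jobs/job-0000000000
    }
}
```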

The interesting mode (for us) is ephemeral sequential. This, in fact, can help us build a leader selector based on a waiting list: each create produces a ZNode with the requested name, followed by a monotonically increasing number.

Let’s take an example:
Say we chose the name “my_lock”. We have node-2 of our application create a ZNode named “my_lock” with creation mode EPHEMERAL_SEQUENTIAL. If we take a look, we will see that the ZNode was created with the name “my_lock0000000001”. Then we have node-1 of our application create a ZNode named “my_lock”, and we will see that it was created under the name “my_lock0000000002”. Now, if node-2 crashes, “my_lock0000000001” is deleted, and upon removal of the first “my_lock” ZNode, “my_lock0000000002” becomes first in line, so node-1 can become the new leader.
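Sketched with the plain ZooKeeper client (the connection string and the “/leaders” parent path are assumptions, and the parent is expected to already exist), the waiting list looks roughly like this:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WaitingList {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Each application node registers itself in the waiting list.
        // ZooKeeper appends the sequence number and returns the real path,
        // e.g. /leaders/my_lock0000000001.
        String myPath = zk.create("/leaders/my_lock", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The node owning the lowest-numbered ZNode is the leader; when it
        // crashes, its ephemeral ZNode disappears and the next one is up.
        List<String> children = zk.getChildren("/leaders", false);
        Collections.sort(children);
        boolean leader = myPath.endsWith(children.get(0));
        System.out.println(leader ? "I am the leader" : "Waiting for my turn");
    }
}
```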

Take another example from Apache NiFi. Apache NiFi uses the LeaderSelector class from Apache Curator Recipes to assign the Primary Node and Cluster Coordinator roles.
If we take a look at its code, we will see that it uses an InterProcessMutex to lock the role and wraps it with a listener, both to allow waiting for the lock to be released and to allow the node to requeue itself for the leader role. That way, if the node happens to lose its connection, it can requeue itself and wait once again for its turn to be leader, so we will always have a node in the leader role. Or, in our case, a Primary Node or Cluster Coordinator. It is important to mention, however, that the plain LeaderSelector by itself is currently not aware of losing its ZNode after acquiring leadership (unlike NiFi’s wrapper), and it seems this is not going to change. So if you would like the option to move the leader position to the next node of your application by deleting the ZNode, it is not possible using the plain LeaderSelector class.

Illustration of a two-node NiFi cluster (screenshots from the original post):

As you can see, there are two locks under /nifi/leaders/Primary Node.
node-1 is the current Primary Node.
node-1 is also the first lock owner.
By deleting its ZNode, node-2 should become the Primary Node.
node-1 has already created a new ZNode with the sequential number 9265, requeueing itself.
As we foresaw, node-2 is now the Primary Node.
