VerneMQ can be easily clustered. Clients can then connect to any cluster node and receive messages from any other cluster nodes. However, the MQTT specification gives certain guarantees that are hard to fulfill in a distributed environment, especially when network partitions occur. We'll discuss the way VerneMQ deals with network partitions in its own subsection
Before you go ahead and experience the full power of clustering VerneMQ, be aware of its stateful character. An MQTT broker is a stateful application and a VerneMQ cluster is a stateful cluster.
What does this mean in detail? It means that clustered VerneMQ nodes will share information about connected clients and sessions but also meta-information about the cluster itself.
For instance, if you stop a cluster node, the VerneMQ cluster will not just forget about it. It will know that there's a node missing and it will keep looking for it. It will know there's a netsplit situation and it will heal the partition when the node comes back up. But if the missing nodes never comes back there's an eternal netsplit. (still resolvable by making the missing nodes explicitly leave).
This doesn't mean that a VerneMQ cluster cannot dynamically grow and shrink. But it means you have to tell the cluster what you intend to do, by using join and leave commands.
If you want a cluster node to leave the cluster, well... use the
vmq-admin cluster leave command. If you want a node to join a cluster, well... use the
vmq-admin cluster join command.
Makes sense? Go ahead and create your first VerneMQ cluster!
vmq-admin cluster join discovery-node=<OtherClusterNode>
vmq-admin cluster leave node=<NodeThatShouldGo> (only the first step!)
A cluster leave will actually do a lot more work, and gives you some options to choose. The node leaving the cluster will go to great length trying to migrate its existing queues to other nodes. As queues (online or offline) are live processes in a VerneMQ node, it will only exit after it has migrated them.
Let's look at the steps in detail:
vmq-admin cluster leave node=<NodeThatShouldGo>
This first step will only stop the MQTT Listeners of the node to ensure that no new connections are accepted. It will not interrupt the existing connections, and behind the scenes the node will not leave the cluster yet. Existing clients are still able to publish and receive messages at this point.
The idea is to give a grace period with the hope that existing clients might re-connect (to another node). If you have decided that this period is over (after 5 minutes or 1 day is up to you), you proceed with step 2: disconnecting the rest of the clients.
vmq-admin cluster leave node=<NodeThatShouldGo> -k
-k flag will delete the MQTT Listeners of the leaving node, taking down all live connections. If this is what you want from the beginning, you can do this right away as a first step.
Now, queue migration is triggered by clients re-connecting to other nodes. They will claim their queue and it will get migrated. Still, there might be some offline queues remaining on the leaving node, because they were pre-existing or because some clients do not re-connect and do not reclaim their queues.
VerneMQ will throw an exception if there are remaining offline queues after a configurable timeout. The default is 60 seconds, but you can set it as an option to the cluster leave command. As soon as the exception shows in console or console.log, you can actually retry the cluster leave command (including setting a migration timeout (
-t), and an interval in seconds (
-i) indicating how often information on the migration progress should be printed to the console.log):
vmq-admin cluster leave node=<NodeThatShouldGo> -k -i 5 -t 120
After this timeout VerneMQ will forcefully migrate the remaining offline queues to other cluster nodes in a round robin manner. After doing that, it will stop the leaving VerneMQ node.
So, case A was the happy case. You left the cluster with your node in a controlled manner, and everything worked, including a complete queue (and message) transfer to other nodes.
Let's look at the second possibility where the node is already down. Your cluster is still counting on it though and possibly blocking new subscription for that reason, so you want to make the node leave.
To do this, use the same command(s) as in the first case. There is one important consequence to note: by making a stopped node leave, you basically throw away persistant queue content, as VerneMQ won't be able to migrate or deliver it.
Let's repeat that to make sure:
vmq-admin cluster show