VerneMQ can be easily clustered. Clients can then connect to any cluster node and receive messages from any other cluster node. However, the MQTT specification gives certain guarantees that are hard to fulfill in a distributed environment, especially when network partitions occur. We'll discuss the way VerneMQ deals with network partitions in its own subsection.
vmq-admin cluster join discovery-node=<OtherClusterNode>
vmq-admin cluster leave node=<NodeThatShouldGo> (only the first step!)
A cluster leave will actually do a lot more work, and gives you some options to choose from. The node leaving the cluster will go to great lengths to migrate its existing queues to other nodes. As queues (online or offline) are live processes in a VerneMQ node, the node will only exit after it has migrated them.
Let's look at the steps in detail:
vmq-admin cluster leave node=<NodeThatShouldGo>
This first step will only stop the MQTT Listeners of the node to ensure that no new connections are accepted. It will not interrupt the existing connections, and behind the scenes the node will not leave the cluster yet. Existing clients are still able to publish and receive messages at this point.
The idea is to give a grace period in the hope that existing clients will re-connect (to another node). Once you decide that this period is over (whether that is after 5 minutes or 1 day is up to you), you proceed with step 2: disconnecting the rest of the clients.
vmq-admin cluster leave node=<NodeThatShouldGo> -k
The -k flag will delete the MQTT Listeners of the leaving node, taking down all live connections. If this is what you want from the beginning, you can do this right away as a first step.
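The two steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the helper names `run` and `two_step_leave`, the `DRY_RUN` switch, and the node name in the usage note are hypothetical and not part of VerneMQ. Only the two `vmq-admin cluster leave` invocations come from this guide.

```shell
#!/bin/sh
# Sketch of the two-step leave. With DRY_RUN=1 (the default in this sketch)
# every command is printed instead of executed, so the script is safe to try.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 is the leaving node, $2 the grace period in seconds.
two_step_leave() {
    node="$1"
    grace_seconds="$2"
    # Step 1: stop the MQTT listeners; existing connections stay up.
    run vmq-admin cluster leave "node=$node"
    # Grace period during which clients may re-connect to another node.
    run sleep "$grace_seconds"
    # Step 2: delete the listeners and take down the remaining connections.
    run vmq-admin cluster leave "node=$node" -k
}
```

Calling `two_step_leave VerneMQ@10.0.0.2 300` in dry-run mode prints the three commands in order. How long the grace period should be is your decision; 300 seconds is just a placeholder.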
Now, queue migration is triggered by clients re-connecting to other nodes. They will claim their queue and it will get migrated. Still, there might be some offline queues remaining on the leaving node, because they were pre-existing or because some clients do not re-connect and do not reclaim their queues.
VerneMQ will throw an exception if there are remaining offline queues after a configurable timeout. The default is 60 seconds, but you can set it as an option to the cluster leave command. As soon as the exception shows in console or console.log, you can actually retry the cluster leave command (including setting a migration timeout (-t), and an interval in seconds (-i) indicating how often information on the migration progress should be printed to the console.log):
vmq-admin cluster leave node=<NodeThatShouldGo> -k -i 5 -t 120
After this timeout VerneMQ will forcefully migrate the remaining offline queues to other cluster nodes in a round-robin manner. After doing that, it will stop the leaving VerneMQ node.
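To make "round robin" concrete, here is a minimal shell sketch of how items can be spread evenly over a list of targets. It is not VerneMQ's actual migration code; the function name `round_robin` and the queue and node names in the usage note are made up for illustration.

```shell
#!/bin/sh
# Illustration only: assign each queue to the next target node in turn,
# wrapping around to the first node when the list is exhausted.
# $1 is a space-separated list of target nodes, the rest are queue names.
round_robin() {
    targets="$1"
    shift
    # Count the target nodes.
    count=0
    for t in $targets; do
        count=$((count + 1))
    done
    [ "$count" -gt 0 ] || return 1
    i=0
    for queue in "$@"; do
        # 1-based position of the target for this queue.
        pick=$((i % count + 1))
        n=0
        for t in $targets; do
            n=$((n + 1))
            if [ "$n" -eq "$pick" ]; then
                echo "$queue -> $t"
            fi
        done
        i=$((i + 1))
    done
}
```

For example, `round_robin "nodeB nodeC" q1 q2 q3` assigns `q1` to `nodeB`, `q2` to `nodeC`, and `q3` to `nodeB` again, so no single remaining node receives all of the leftover queues.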
So, case A was the happy case. You left the cluster with your node in a controlled manner, and everything worked, including a complete queue (and message) transfer to other nodes.
Let's look at the second possibility, where the node is already down. Your cluster is still counting on it though, and is possibly blocking new subscriptions for that reason, so you want to make the node leave.
To do this, use the same command(s) as in the first case. There is one important consequence to note: by making a stopped node leave, you basically throw away persistent queue content, as VerneMQ won't be able to migrate or deliver it.
Let's repeat that to make sure:
vmq-admin cluster show
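One way to use this command after a leave is to check whether the node still appears in its output. The sketch below assumes only that the output contains the node names as text; the helper name `node_listed` and the node name in the usage note are hypothetical.

```shell
#!/bin/sh
# Reads the output of `vmq-admin cluster show` on stdin and succeeds if the
# given node name appears in it as a whole word. The -w option is not part of
# POSIX grep, but is supported by the common GNU, BSD, and BusyBox versions.
node_listed() {
    grep -F -w -q -- "$1"
}
```

A possible invocation is `vmq-admin cluster show | node_listed VerneMQ@10.0.0.2 && echo "still in cluster"`. Reading from stdin keeps the helper independent of how and where `vmq-admin` is run.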