How to handle lost connections and the router server as a single point of failure

I’m trying to build a scalable Crossbar router, and there are a few topics I could not find in our forum.

1/ How can I make sure the WebSocket connection from an edge device or an application to the router is persistent? For example, if an edge device loses its connection to the router, it will miss messages published on its topics by other applications. Likewise, if an application loses its connection, it may miss signals from the edge device before it reconnects.

Are there any pointers/discussions on resolving the above issue?

2/ Another problem is that the router becomes a single point of failure. If the router server goes down, we could have a backup server, but then we would have to ask all of the server’s clients to re-register their topics, functions, etc.

Are there any pointers/discussions on resolving this issue?


I will attempt to give some guidance on the first topic. Trying to keep a TCP connection persistent is “very hard”. So, edge devices do have to handle re-connecting. The connection may go away for many reasons (including “router is re-started”, but that’s just one). Upon re-connection, state needs to be re-synchronized.
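To make the re-connect advice a bit more concrete, here is a minimal sketch of a reconnect loop with exponential backoff, using only the standard library. The names `keep_connected`, `connect` and `on_join` are hypothetical placeholders, not Crossbar or Autobahn APIs: `connect` is assumed to raise `ConnectionError` when the router is unreachable, and `on_join` is assumed to re-synchronize state and then run until the link drops.

```python
import asyncio
import random


def backoff_delays(initial=1.0, factor=2.0, max_delay=60.0, jitter=0.1):
    """Yield an endless series of retry delays in seconds, capped at max_delay."""
    delay = initial
    while True:
        # Jitter spreads out clients that all lost the router at the same moment.
        yield delay * (1.0 + random.uniform(-jitter, jitter))
        delay = min(delay * factor, max_delay)


async def keep_connected(connect, on_join, max_attempts=None, sleep=asyncio.sleep):
    """Connect, hand the session to on_join, and retry whenever the link fails.

    connect  -- hypothetical coroutine returning a session (raises ConnectionError)
    on_join  -- hypothetical coroutine that re-subscribes / re-registers and runs
                until the connection is lost
    """
    delays = backoff_delays()
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            session = await connect()
            delays = backoff_delays()   # a successful connect resets the backoff
            await on_join(session)      # state is rebuilt on every (re)connect
        except ConnectionError:
            pass                        # router unreachable or link dropped
        await sleep(next(delays))
```

The point of the design is that `on_join` runs after every successful connect, not just the first one, so subscriptions and registrations are always re-established.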

For “publish / subscribe” topics, you can have history (or just a “last published value”). If your application is designed such that this last value is the current state, that can be sufficient. If the state is “too large” to publish each time, you can subscribe to state updates and immediately call some sort of “get_current_state” RPC.
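The "subscribe, then fetch the current state" pattern could be sketched roughly as follows. This is only an illustration under some assumptions: the topic and procedure URIs are made up, the session object is assumed to offer `subscribe(handler, topic)` and `call(procedure)` coroutines (the same shape as an Autobahn session), and the application is assumed to attach a monotonically increasing version number to both updates and snapshots so that stale updates can be dropped.

```python
import asyncio


class StateMirror:
    """Local copy of remote state, rebuilt after every (re)connect."""

    def __init__(self):
        self.state = {}
        self.version = -1       # version of the last applied snapshot or update
        self._pending = []      # updates that arrived before the snapshot
        self._synced = False

    def on_update(self, version, key, value):
        """Event handler for the (assumed) state-update topic."""
        if not self._synced:
            self._pending.append((version, key, value))
        elif version > self.version:
            self.state[key] = value
            self.version = version

    async def resync(self, session):
        """Subscribe first, then fetch the snapshot, so no update falls in between."""
        self._synced = False
        self._pending.clear()
        # Hypothetical URIs; use whatever your application defines.
        await session.subscribe(self.on_update, "com.example.state.on_update")
        snapshot = await session.call("com.example.get_current_state")
        self.state = dict(snapshot["state"])
        self.version = snapshot["version"]
        self._synced = True
        # Replay buffered updates; those the snapshot already covers are skipped.
        pending, self._pending = self._pending, []
        for update in sorted(pending, key=lambda u: u[0]):
            self.on_update(*update)
```

Subscribing before calling the RPC matters: doing it the other way round leaves a window in which an update can be published after the snapshot was taken but before the subscription exists.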

See the Crossbar documentation for more about event history. To keep just the last event on a topic, see retain and get_retained in the publish/subscribe options and the related example.
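To illustrate what retained events buy you, here is a toy in-memory model of the semantics. It is not Crossbar's implementation and `ToyBroker` is a made-up class; it only mimics the behaviour of publishing with `retain` and subscribing with `get_retained` (in Autobahn|Python these are set through `PublishOptions(retain=True)` and `SubscribeOptions(get_retained=True)`).

```python
class ToyBroker:
    """In-memory stand-in for a router, modelling only retained events."""

    def __init__(self):
        self._retained = {}      # topic -> last payload published with retain=True
        self._subscribers = {}   # topic -> list of handlers

    def publish(self, topic, payload, retain=False):
        if retain:
            self._retained[topic] = payload
        for handler in self._subscribers.get(topic, []):
            handler(payload)

    def subscribe(self, topic, handler, get_retained=False):
        self._subscribers.setdefault(topic, []).append(handler)
        # A late (or re-connecting) subscriber gets the retained event right away.
        if get_retained and topic in self._retained:
            handler(self._retained[topic])


broker = ToyBroker()
broker.publish("com.example.door", "open", retain=True)   # nobody is listening yet

seen = []
broker.subscribe("com.example.door", seen.append, get_retained=True)
```

After the last line `seen` already holds the retained value, even though the subscriber joined after the publish. For an edge device that re-connects, this is enough when the last published value is the current state.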

It is extremely unlikely that Crossbar will ever have “persistent TCP connections”, especially across re-starts of the router itself. This typically requires specialized hardware – and Crossbar keeps a lot of state besides the TCP connections themselves.

On the other topic (single point of failure): currently, a single realm runs in a single process. We are actively implementing “router to router links” that will allow a single logical realm to be served by multiple processes (on separate machines). Such failover will still rely on edge devices re-connecting (if the router process they’ve attached to goes down) – however, they will be able to do so immediately. There are other pieces of the “scaling” picture that are already implemented, such as “proxy workers”, which can offload much of the work the router process would otherwise do (TLS termination, authentication, etc.). These can already run in multiple processes.

So, to summarize: your design does need to anticipate re-connections, and Crossbar + WAMP provide some tools (including event history and retained events) to help with this. Scaling a single realm across multiple processes is coming soon, but will ultimately rely on re-connecting to divert traffic to other processes / machines. Additionally, there are existing features to help scale across multiple cores.

Hope this helps!