[bisq-contrib] Reliability and monitoring of key nodes on the Bisq network

Chris Beams chris at beams.io
Tue Dec 19 10:10:54 UTC 2017


Over the last few weeks, we have attempted to decentralize the operation of key network resources, including Bisq’s seed nodes, price nodes and Bitcoin full nodes.

This effort has largely failed thus far, because we have been unable to keep these resources up and running in a reliable fashion. These reliability issues have caused critical arbitration messages to get dropped, price feeds to become unavailable (making trading effectively impossible), and may have contributed to a handful of cases where trading fees have been lost (which we as arbitrators are now reimbursing traders for).

Several weeks ago, we started an effort to monitor these key network resources, in order to make sure that operators know when something is down or malfunctioning.

This monitoring effort has largely failed too. There were far too many messages in the beginning, desensitizing everyone involved and leading people (understandably) to ignore them. Message volume has come down a bit, and the contents of individual messages have become more helpful, but all in all it is still not enough. It is not making the difference it needs to.

As a stopgap measure, I have just renamed the #pricenode, #seednode and #fullnode channels to #pricenode-monitoring, #seednode-monitoring and #fullnode-monitoring, respectively. These channels will continue to carry the automated notification messages, but please take any discussion about actually fixing issues, or anything else regarding the maintenance and operation of these nodes, to the new, noise-free #pricenode, #seednode and #fullnode channels. This will help ensure that everyone involved can stay up to date with the most important conversations going on around these resources, without having them get lost in the monitoring messages.

Slack must never become a noisy place that people get trained to ignore. It is damaging on multiple levels when useless stuff gets pumped into it. The first problem is what I already mentioned: people just start to ignore things, and it erodes trust that Slack is an effective place to get work done. The second problem is that we’re on Slack’s free plan, which keeps only a 10,000-message history. When dozens of automated messages come into multiple channels every day, these low-value automated messages push out older, higher-value human messages, making it impossible for us to see or search back through them. Moving to a paid plan is something we can consider in the future, but right now it would be fairly expensive (around $1500 a year with our current set of members). I don’t mind the idea of paying for this valuable service, but I certainly do not want to do it because of this entirely avoidable issue of noisy notification messages.

Let’s get to the heart of the matter, though. Why has this been so difficult? Why have we spent weeks failing to get the kind of reliability we need, and failing to get the kind of monitoring we need to make that happen?

I am not close to this work, and I am not operating any nodes right now, so I cannot say with certainty exactly where the problems lie. What I can say is that I think we have not been nearly proactive enough about squashing these problems. Manfred has been implementing a new monitoring solution over the last couple of days, and this is a total waste of his time. He should be (a) fixing critical bugs in the Bisq client that have been cropping up with increased usage and (b) coding the DAO. He should be working on nothing else, if we can possibly manage it.

Ultimately, I think the problem is that we have put the cart before the horse. We’ve implemented monitoring without a precise definition of what needs to be monitored and what constitutes a critical failure that must be fixed. This has led to all kinds of monitoring messages, many of which are just not important, and it has led to operators doing things in a variety of ways with no clear guidance as to the way things *should* be done. Fortunately, though, we have a structure in place that is designed to make exactly these kinds of requirements and instructions clear: role specifications.

I suggest the following course of action:

1. @Emzy, your full nodes are known to be reliable. You appear to have a configuration and hosting arrangement that works. Would you please take ownership of writing the Bitcoin Fullnode Operator role specification? See @csacher’s previous email that was just published to this list for instructions on how to do this. In this spec, we should spell out what it looks like to run a reliable full node. It doesn’t have to be elaborate, but it should make it as clear as possible what settings should be used, what kind of uptime full nodes should have, and so forth (see the first sketch after this list for one way such settings could be written down).

2. Monitoring of fullnodes should be based directly on this fullnode operator specification. Monitoring messages should reflect violations of that specification, and nothing else; the second sketch after this list illustrates the shape of a check like that. If an operator sees that their node is in violation of these QoS terms, that should be a "drop everything and fix it" kind of moment. We should not issue any kind of warnings or informational messages beyond these. If we find out that there’s something we should be monitoring that is not in the spec, then we should put it in the spec, and then implement monitoring to cover it. Somebody needs to step up and own doing this, and needs to own it completely. This stuff needs to just work. It cannot require ongoing intervention from Manfred or myself. Writing emails like this does not scale.

3. The same spec-it-out-and-then-implement-monitoring-to-cover-it approach needs to be applied to seed nodes and price nodes, too (the monitoring sketch below covers a price node check as well). Somebody needs to own this as well, and it should be the people who know running seed nodes and price nodes best. It makes sense that Manfred should / must be involved in the specification process, but please, we must not burden him with implementing the monitoring as well.
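
To make point 1 a little more concrete: the spec could include a reference bitcoin.conf that operators copy as a starting point. The snippet below is only a sketch of what that might look like; the option names are real Bitcoin Core settings, but the specific values are placeholders of mine, and it is the spec itself that should pin down the real ones.

    # Sketch of a reference bitcoin.conf for Bisq full node operators.
    # Values below are illustrative placeholders, not recommendations.
    listen=1             # accept inbound P2P connections
    maxconnections=125   # leave headroom for connecting SPV clients
    dbcache=2048         # database cache in MB; size to the host's RAM
    disablewallet=1      # a dedicated full node needs no wallet
    peerbloomfilters=1   # keep bloom filtering on for BitcoinJ/SPV clients
    server=1             # enable RPC, if RPC-based health checks are wanted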
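
And to make points 2 and 3 concrete: once the specs exist, the monitoring itself can be very simple, a script that walks through each QoS term and says something only when one of them is violated. The sketch below (Python, with placeholder hostnames, a made-up price node URL and a print statement standing in for the real alert channel) is meant to show the shape, not to be the implementation. The key property is that every check maps one-to-one to a line in a role spec, so a message in the channel always means "something in the spec is broken".

    #!/usr/bin/env python3
    # Sketch of spec-driven monitoring: one check per QoS term in the role
    # spec, one alert per violation, silence otherwise. The hostnames, the
    # price node URL and the alert mechanism are placeholders.
    import socket
    import urllib.request

    FULL_NODES = ["fullnode1.example.com"]                              # placeholder hosts
    PRICE_NODES = ["http://pricenode1.example.com/getAllMarketPrices"]  # placeholder URL
    TIMEOUT_SECONDS = 10                                                # illustrative QoS value

    def full_node_reachable(host, port=8333):
        """Spec term: the node must accept inbound P2P connections on port 8333."""
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
                return True
        except OSError:
            return False

    def price_node_healthy(url):
        """Spec term: the price feed must answer HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
                return resp.status == 200
        except OSError:
            return False

    def alert(message):
        """Stand-in for whatever actually notifies the operator (Slack webhook, email, ...)."""
        print("SPEC VIOLATION: " + message)

    if __name__ == "__main__":
        # Alert only on violations of the spec; say nothing when all is well.
        for host in FULL_NODES:
            if not full_node_reachable(host):
                alert("full node %s not reachable on port 8333" % host)
        for url in PRICE_NODES:
            if not price_node_healthy(url):
                alert("price node %s not responding within %ds" % (url, TIMEOUT_SECONDS))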


In any case, going forward, I will vote against compensation for any monitoring work that does not fulfill the following requirements:

a. Monitoring messages must be explicitly based on the role specification for the type of node being monitored.
b. Monitoring messages must effectively notify operators when their nodes are in violation of their spec, and about nothing else.


Let’s please discuss what we’re going to do now, and let’s please have that conversation here in this email thread. Who is going to own each aspect of this? Who is going to work together to make this happen? It is important to be explicit and set expectations here. We simply cannot afford to have unreliable services, and we cannot tolerate solutions—monitoring or otherwise—that slow us down.

Thanks,

- Chris


P.S. If I have mischaracterized or otherwise gotten anything wrong in the communication above, my apologies. Call it out and correct me. The most important thing is that we begin having frank, honest discussions out in the open when things are not working, such that we can fix them as quickly and effectively as possible. Mistakes and failures are OK, but not learning from them is unacceptable. And we can't learn as a team unless we communicate as a team, so let’s get in the business of talking freely about what doesn’t work.

