[bisq-contrib] Reliability and monitoring of key nodes on the Bisq network

Tomas Kanocz tomas.kanocz at cnl.sk
Tue Dec 19 11:18:53 UTC 2017


Hi Chris,

I am sorry, but sadly I do not have the cycles to help out with any setup
right now; perhaps my experience leading a service delivery team in a 20k+
network might help instead. To be honest, this is only the second email I
have received about this (I did not hear about the other problems). That
might be because my inbox is spammed all day with a ton of emails, or
because these issues were communicated on a different channel. So having it
discussed on this thread is a good idea!

The Slack solution sounds a bit cumbersome to me; Slack is not designed for
such a thing, IMHO. What it sounds like you need is a ticketing system with
different severities specified. The idea is that when the system raises a
sev 1 issue, it sends it to all the owners of sev 1 (people who are
knowledgeable enough to do the job quickly). So you simply need teams that
handle the different severity levels. When an issue happens and a sev 1
email (or a message on some other communication channel) is sent out, the
person who has the time to deal with it marks the issue as theirs in the
ticketing system. A good ticketing system has to keep history, you have to
have a way to document the issue so others can learn from it, and you have
to be able to define roles in it. If you want SLAs, then you need teams who
are "on call" during time periods specified in advance, e.g. persons A, B
and C are "on call" on Monday 8 EST - 16 EST. That means they have to be
able to jump to a computer and grab the issue within a specified amount of
time (e.g. 30 minutes). If person C cannot be on call that day, they have
to get someone else on board. The idea is then to have multiple tiers of
support, so the really heavy guns only come in if necessary. This would
give the tier 2 and tier 3 people a learning curve, as resolving issues
always gives you experience.
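
To make the routing idea concrete, here is a minimal sketch of how severity
owners and an on-call rota could be wired together. The names, severities
and the 30-minute window are illustrative assumptions, not a proposal for
specific people or tools.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class OnCallSlot:
    person: str
    weekday: int     # 0 = Monday
    start_hour: int  # inclusive, in the team's agreed timezone
    end_hour: int    # exclusive

# Hypothetical severity ownership: sev 1 goes to the "heavy guns",
# sev 2/3 to the tier 2 and tier 3 people who are still learning.
SEVERITY_OWNERS = {
    1: {"person_a", "person_b"},
    2: {"person_c", "person_d"},
    3: {"person_e", "person_f"},
}

# Hypothetical rota: persons A and B cover Monday 8-16 and 16-24.
ROTA = [
    OnCallSlot("person_a", weekday=0, start_hour=8,  end_hour=16),
    OnCallSlot("person_b", weekday=0, start_hour=16, end_hour=24),
]

RESPONSE_MINUTES = 30  # the SLA window mentioned above

def who_to_page(severity, now):
    """Return the people to notify for this severity at this moment."""
    owners = SEVERITY_OWNERS.get(severity, set())
    on_call = {slot.person for slot in ROTA
               if slot.weekday == now.weekday()
               and slot.start_hour <= now.hour < slot.end_hour}
    # Prefer owners who are on call; fall back to all owners so a
    # sev 1 never goes unseen.
    return sorted(owners & on_call) or sorted(owners)

if __name__ == "__main__":
    print(who_to_page(1, datetime.now(timezone.utc)))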

So if you want to be able to maintain such a complex network (it will only
get more complex), you need processes set up first, documentation for the
roles (as you describe above) and a good ticketing system. You can work on
the other stuff only afterwards. I guess the key idea here is that hardware
providers (e.g. node operators) get automatic support from the network, so
people who have the hardware but not necessarily the knowledge to fix every
issue get that knowledge as a service (for free, since they are giving
hardware resources to the network). Once they meet the expectations, they
can be put under monitoring.

Of course, the art is in not over-complicating all the process parts and
stuff like that.

Is the whole monitoring solution you guys have in place now documented
somewhere? Is there a repo/place where a newcomer can go and learn how
things work in a relatively short time? Is there a document defining node
types and the services the network runs/needs?

These are perhaps the questions we should have better answers to if we want
to be effective in creating a minimum quality standard for these services.

This is a discussion, so sorry for bombarding it with a lot of questions;
they are all intended to push us forward, as Bisq is a cool project.


Tomas

On Tue, Dec 19, 2017 at 11:10 AM, Chris Beams <chris at beams.io> wrote:

> Over the last few weeks, we have attempted to decentralize the operation
> of key network resources, including Bisq’s seed nodes, price nodes
> and Bitcoin full nodes.
>
> This effort has largely failed thus far, because we have been unable to
> keep these resources up and running in a reliable fashion. These
> reliability issues have caused critical arbitration messages to get
> dropped, price feeds to become unavailable (making trading effectively
> impossible), and may have contributed to a handful of cases where trading
> fees have been lost (which we as arbitrators are now reimbursing traders
> for).
>
> Several weeks ago, we started an effort to monitor these key network
> resources, in order to make sure that operators know when something is down
> or malfunctioning.
>
> This monitoring effort has largely failed too. There were far too many
> messages in the beginning, desensitizing everyone involved and resulting
> in people (understandably) ignoring these messages. Message volumes have
> reduced a bit now, and the contents of individual messages have become
> more helpful, but all in all, it is still not enough. It is not making the
> difference it needs to.
>
> As a stopgap measure, I have just renamed the #pricenode, #seednode and
> #fullnode channels to #pricenode-monitoring, #seednode-monitoring and
> #fullnode-monitoring, respectively. These channels will continue to contain
> automated notification messages, but please take any discussion about
> actually fixing issues, or anything else regarding the maintenance and
> operation of these nodes to the new, noise-free #pricenode, #seednode and
> #fullnode channels. This will help ensure that everyone involved can stay
> up to date with the most important conversations going on around these
> resources, and not have them get lost in the monitoring messages.
>
> Slack must never become a noisy place that people get trained to ignore.
> It is damaging on multiple levels when we have useless stuff being pumped
> into it. The first problem is what I already mentioned: people just start
> to ignore things, and it erodes trust that Slack is an effective place
> to get work done. The second problem is that we’re on a free plan with
> Slack, and we only have a 10,000 message history. When we have dozens
> of automated messages coming into multiple channels every day, these
> low-value automated messages push out older, higher-value human messages,
> making it impossible for us to see them / search back through them, etc.
> Moving to a paid plan with Slack is something we can consider doing in the
> future, but right now, it would be fairly expensive (around $1500 a year
> with our current set of members). I don’t mind the idea of paying for this
> valuable service, but I certainly do not want to do it because of this
> entirely avoidable issue of noisy notification messages.
>
> Let’s get to the heart of the matter, though. Why has this been so
> difficult? Why have we spent weeks failing to get the kind of reliability
> we need, and failing to get the kind of monitoring we need to make that
> happen?
>
> I am not close to this work, and I am not operating any nodes right now,
> so I cannot say with certainty exactly where the problems lie. What I can
> say is that I think we have not been nearly proactive enough about
> squashing these problems. Manfred has been implementing a new
> monitoring solution over the last couple days, and this is a total waste of
> his time. He should be (a) fixing critical bugs in the Bisq client that
> have been cropping up with increased usage and (b) coding the DAO. He should
> be working on nothing else, if we can possibly manage it.
>
> Ultimately, I think the problem is that we have the cart before the horse.
> We've implemented monitoring without a precise definition of what needs to
> get monitored, and what constitutes critical failures that must get fixed.
> This has led to all kinds of monitoring messages, many of which are just
> not important, and it’s led to operators doing things in a variety of ways
> with no clear guidance as to the way things *should* be done. Fortunately,
> though, we have a structure in place that is designed to make exactly these
> kinds of requirements and instructions clear: *role specifications.*
>
> I suggest the following course of action:
>
> 1. @Emzy, your full nodes are known to be reliable. You appear to have a
> configuration and hosting arrangement that works. Would you please
> take ownership of writing the Bitcoin Fullnode Operator role specification?
> See @csacher’s previous email that was just published to this list
> for instructions on how to do this. In this spec, we should spell out what
> it looks like to run a reliable full node. This doesn’t have to be
> elaborate, but this document should make it as clear as possible what
> settings should be used, what kind of uptime full nodes should have, and so
> forth.
>
> 2. Monitoring of fullnodes should be based *directly* on this fullnode
> operator specification. Monitoring messages should reflect *violations* of
> that specification, and nothing else. If an operator sees that their node
> is in violation of these QoS terms, that should be a "drop everything and
> fix it" kind of moment (see the sketch after this list for the shape such
> a check could take). We should not issue any kind of warnings or
> informational messages beyond these. If we find out that there’s something
> we should be monitoring that is not in the spec, then we should put it in
> the spec, and then implement monitoring to cover it. Somebody needs to step
> up and own doing this, and needs to own it completely. This stuff needs to
> just work. It cannot require ongoing intervention from Manfred or myself.
> Writing emails like this does not scale.
>
> 3. The same kind of spec-it-out-and-then-implement-monitoring-to-cover-it
> approach needs to be done for seed nodes and price nodes, too. Somebody
> needs to own this as well, and it should be the people who know how to run
> seed nodes and price nodes best. It makes sense that Manfred should / must
> be involved in the specification process, but please, we must not have him
> burdened with implementing monitoring as well.
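>
> As a minimal sketch of what such spec-driven monitoring could look like,
> here is a check that stays silent unless the node violates the spec and
> names the violated requirement when it does. The endpoint, JSON field and
> thresholds are illustrative assumptions, not taken from any actual spec.
>
> import json
> import urllib.request
>
> # Hypothetical, spec-derived limits; the real numbers belong in the
> # operator role specification, not in the monitoring code.
> MAX_RESPONSE_SECONDS = 5
> MAX_BLOCKS_BEHIND = 1
>
> def check_node(status_url, network_height):
>     """Return a list of spec violations; empty means stay silent."""
>     violations = []
>     try:
>         with urllib.request.urlopen(status_url,
>                                     timeout=MAX_RESPONSE_SECONDS) as resp:
>             status = json.load(resp)
>     except Exception as exc:
>         return ["unreachable within %ss: %s" % (MAX_RESPONSE_SECONDS, exc)]
>     behind = network_height - status.get("blockHeight", 0)
>     if behind > MAX_BLOCKS_BEHIND:
>         violations.append("%d blocks behind the network tip" % behind)
>     return violations
>
> # Alert only on violations; in practice this would page the operator
> # instead of printing.
> for problem in check_node("http://node.example.com/status", 500000):
>     print("SPEC VIOLATION:", problem)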
>
>
> In any case, going forward, I will vote against the compensation of
> any monitoring work that does not fulfill the following requirements:
>
> a. Monitoring messages must be explicitly based on the role specification
> for the type of node being monitored.
> b. Monitoring messages must effectively notify operators when their nodes
> are in violation of their spec, and about nothing else.
>
>
> Let’s please discuss what we’re going to do now, and let’s please have
> that conversation here in this email thread. Who is going to own each
> aspect of this? Who is going to work together to make this happen? It is
> important to be explicit and set expectations here. We simply cannot afford
> to have unreliable services, and we cannot tolerate solutions—monitoring or
> otherwise—that slow us down.
>
> Thanks,
>
> - Chris
>
>
> P.S. If I have mischaracterized or otherwise gotten anything wrong in the
> communication above, my apologies. Call it out and correct me. The most
> important thing is that we begin having frank, honest discussions out in
> the open when things are not working, such that we can fix them as quickly
> and effectively as possible. Mistakes and failures are OK, but not learning
> from them is unacceptable. And we can't learn as a team unless we
> communicate as a team, so let’s get in the business of talking freely about
> what doesn’t work.
>
>
>