[bisq-network/proposals] Bisq Network Monitor Revisited (#62)

Wed Dec 12 12:52:47 UTC 2018

> _This is a Bisq Network proposal. Please familiarize yourself with the [submission and review process](https://docs.bisq.network/proposals.html)._

*Abstract: Tor and P2P network issues do and will affect the performance and acceptance of Bisq, and a monitoring system greatly assists in finding their cause. In practice, the [current monitoring system](http://seedmonitor.0-2-1.net/) still leaves us with a lot of tedious guesswork. I propose a fresh monitoring solution which is properly designed for the task at hand (unlike the current one, which had to be built in a hurry). The solution features monitoring node(s) in the P2P network gathering metrics, while an external service (i.e. [Prometheus](https://prometheus.io/)) takes care of history and presentation. The presentation can be suitable for developers and users alike, provide detailed insight into Bisq's network layer, let us grasp the value of Bisq to the world, and prepare Bisq for the future.*

# Introduction

Bisq is getting bigger and bigger, and thus issues in Bisq's network layer appear more frequently. In late 2017, for example, the network simply did not perform (https://github.com/bisq-network/bisq/issues/1172). In an attempt to understand why, @ManfredKarrer created a [monitoring tool](http://seedmonitor.0-2-1.net/) in a hurry. However, while clearly showing the situation, the monitor did not help very much in understanding the cause, let alone foster strategies to prevent such a situation in the future. Eventually, the network magically recovered. Since then, Bisq has grown even bigger.

# Challenge

An ideal monitoring solution has to serve multiple purposes. First of all, it should support developers in finding fixes for pending issues (https://github.com/bisq-network/bisq/issues/1241, https://github.com/bisq-network/bisq/issues/1299). Second, it should support developers in analyzing and understanding the network. A better understanding of the network lets us anticipate upcoming issues and maybe stop them in their tracks before they become actual problems. Third, numerical performance values let us evaluate the effectiveness of network tweaks more clearly and make informed decisions on whether to keep them. Furthermore, numerical performance values can be fed to some sort of attack detection mechanism which triggers countermeasures on demand. Leaving the realm of development, a historical display of numerical performance values allows users to get an idea of why their offer is taking so long to be published and maybe pick a time when the network is less busy (https://github.com/bisq-network/bisq/issues/1575). And last but not least, the collection of metrics lets people get an idea of the value Bisq brings to the world.

The [current monitoring solution](http://seedmonitor.0-2-1.net/), unfortunately, is very limited (because it was created in a hurry to get hold of actual pending issues). First of all, there is no historical data: a dev cannot correlate historical data with other sources ([Tor metrics](https://metrics.torproject.org/), for example) to conclude whether or not it is Bisq's fault when the network suffers from performance loss. Second, the current monitoring solution provides neither Tor-only performance values nor network load values. The only value available is a roundtrip time metric which might indicate performance loss, but does not say whether it is caused by poor Tor performance, by a high network load, or by congestion caused by a way-too-high network load, even though the latter should have been visible before congestion actually kicked in. Third and last, the data presentation is not suitable for people other than developers. The statistics site is static and has to be manually refreshed to get up-to-date data, there is no historical data, and the metrics displayed are too cryptic to be understood by the average (Bisq-affine) Joe. All in all, the current monitoring solution leaves us guessing what Bisq's network layer looks like inside, whether Tor is blocking our requests due to its DoS protection, or whether our optimizations really optimize things. Hence, the current monitoring solution does not come near the ideal solution sketched above.

# Proposal

It is time to create a proper monitoring solution (https://github.com/bisq-network/bisq/issues/1361). From a technical point of view, the shiny new monitoring solution of course has to conform to the usual [bullsh*t bingo](https://www.wikihow.com/Play-Bullshit-Bingo): it has to be clean, modular, extensible, easy to deploy, low in maintenance effort, built on existing solutions where possible, quick to reach an initial release, etc.

Having that in mind, I propose
1) starting fresh and designing the monitoring solution from the ground up. A clean start does not trick us into reusing approaches just because they are already there, without thinking about their usefulness and effectiveness. Furthermore, by discarding old code we do not pull deprecated and/or dead code into a shiny new project, especially since the existing monitoring solution was created in a hurry.
2) making the new monitoring solution easy to deploy and operate. I have had my share of application servers, Tomcats and IT departments. Thus, I suggest that instead of wasting time fighting these, we invest a little more development time in creating a simple executable which everyone who is able to run the Bisq client is able to run.
3) making the monitor highly modular and configurable so an operator can easily pick the (sub)set of metrics to run. Furthermore, a simple Java-properties-based configuration should be used to control how these metrics behave.
4) focusing on extensibility. We might have a good idea of which metrics we need right now. However, there certainly are metrics we have not thought of yet. Such future metrics have to be addable without rewriting the whole monitor. Furthermore, a new-to-Bisq developer should find her way around the monitoring tool quite easily. I am not saying that the actual metrics themselves have to be super clean, just that a developer can add a set of new metrics without having to spend a lot of time solving riddles.
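
To make point 3 a little more concrete, here is a minimal sketch of how a properties-style configuration could select metrics. It is written in Python only to keep it short, and it handles just the plain `key=value` subset of the properties format. The metric names (`TorStartupTime`, `P2PRoundTripTime`) and the keys (`enabled`, `run.interval`) are assumptions made for illustration, not a settled format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical configuration in Java-properties syntax. Metric names and keys
# are illustrative assumptions, not part of the proposal.
EXAMPLE_PROPERTIES = """
# intervals are given in seconds
TorStartupTime.enabled=true
TorStartupTime.run.interval=120
P2PRoundTripTime.enabled=false
P2PRoundTripTime.run.interval=60
"""


@dataclass
class MetricConfig:
    enabled: bool = False
    interval_seconds: int = 0


def parse_properties(text: str) -> Dict[str, str]:
    """Parse the simple `key=value` subset of the properties format."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def metric_configs(properties: Dict[str, str]) -> Dict[str, MetricConfig]:
    """Group `<Metric>.<setting>` keys into one MetricConfig per metric."""
    configs: Dict[str, MetricConfig] = {}
    for key, value in properties.items():
        metric, _, setting = key.partition(".")
        config = configs.setdefault(metric, MetricConfig())
        if setting == "enabled":
            config.enabled = value.lower() == "true"
        elif setting == "run.interval":
            config.interval_seconds = int(value)
    return configs


def enabled_metrics(configs: Dict[str, MetricConfig]) -> List[str]:
    """Return the names of all metrics the operator switched on."""
    return sorted(name for name, config in configs.items() if config.enabled)
```

With the example above, `enabled_metrics(metric_configs(parse_properties(EXAMPLE_PROPERTIES)))` yields only `TorStartupTime`, so an operator switches a metric on or off by editing a single line.
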

Following these proposals, we IMHO should be able to create a monitoring solution which properly allows for a deeper look inside Bisq's network layer while not being outdated by tomorrow.

Please find a big picture of the proposed monitoring solution in the illustration below.

![monitor-bigpicture](https://user-images.githubusercontent.com/1070734/49868712-83db8880-fe0e-11e8-8723-d946246bbb81.png)

There are two main components to the monitoring solution. First, *Monitoring Node*s are inserted into Bisq's P2P network. These nodes only gather data; they do not keep a historical record of it. Second, a *Monitoring Service* scrapes the *Monitoring Node*(s) for data, keeps a historical record and visualizes the data. Offline discussions settled on [Prometheus](https://prometheus.io/) as the monitoring service. Please note that the underlying Tor network is part of every connection in the illustration above.

## Monitor Node(s)

Please find an architectural overview of a *Monitor Node* in the illustration below.

![monitor-architecture](https://user-images.githubusercontent.com/1070734/49868609-3b23cf80-fe0e-11e8-9791-bddac1d108a3.png)

A *Monitor Node* uses a *Scheduler* as its central component. The *Scheduler* executes *Metric*s and supplies them with their share of *Configuration*. The minimal *Configuration* for each *Metric* states whether the *Metric* is enabled and, if so, at which intervals it is to be run. The collected data is offered as a [Prometheus job exporter](https://prometheus.io/docs/introduction/overview/) (maybe we need to add a Pushgateway) via a Tor Hidden Service. The *Monitor Node* is to be run as a simple executable from the command line (and thus can be easily turned into a system service). Furthermore, on Linux systems, the *Monitor Node* can be instructed by a `USR1` signal (`kill -USR1`) to reload its configuration and react to changes (enable/disable metrics, change intervals) without restarting the executable. This matters because we expect to run multiple Tor binaries, and restarting the whole thing would take an awful lot of time while also losing running-average data.
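
The scheduler and reload behaviour described above could be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual design: the class name `Scheduler`, the `load_config` callback and the `(enabled, interval_seconds)` tuple layout are all hypothetical. Reloading replaces only the configuration, so results gathered so far are kept.

```python
import signal
import time


class Scheduler:
    """Runs enabled metrics at their configured intervals (hypothetical design)."""

    def __init__(self, metrics, load_config, clock=time.monotonic):
        self.metrics = metrics          # name -> callable performing one measurement
        self.load_config = load_config  # callable returning name -> (enabled, interval_seconds)
        self.clock = clock              # injectable so the schedule can be tested
        self.last_run = {}              # name -> time of the most recent run
        self.results = {}               # name -> latest value, kept across reloads
        self.config = load_config()

    def reload(self, signum=None, frame=None):
        """Re-read the configuration without touching collected results."""
        self.config = self.load_config()

    def run_pending(self):
        """Run every enabled metric whose interval has elapsed; return their names."""
        now = self.clock()
        ran = []
        for name, (enabled, interval) in self.config.items():
            if not enabled or name not in self.metrics:
                continue
            last = self.last_run.get(name)
            if last is None or now - last >= interval:
                self.results[name] = self.metrics[name]()
                self.last_run[name] = now
                ran.append(name)
        return ran


def install_reload_handler(scheduler):
    """Reload on `kill -USR1 <pid>`; SIGUSR1 is not available on Windows."""
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, scheduler.reload)
        return True
    return False
```

A main loop would call `install_reload_handler(scheduler)` once and then alternate between `scheduler.run_pending()` and a short `time.sleep`. Passing the clock in as a parameter is a design choice that makes the schedule testable without waiting for real time to pass.
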

## Monitoring Service

The *Monitoring Service* collects the data provided by the *Monitor Node*(s), keeps a historical record and by some means also provides a GUI. This service is to be provided by the open-source monitoring solution [Prometheus](https://prometheus.io/). It is well-maintained and active, takes a few minutes to set up and handles recording and displaying data quite nicely.
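
Prometheus collects data by scraping plain-text samples over HTTP. As a rough sketch of what a node might hand out, the following Python function renders gauge samples in the Prometheus text exposition format. The metric name `bisq_tor_roundtrip_seconds` and the `seednode` label are made up for illustration, help texts are assumed to be single-line without backslashes, and a real node would more likely rely on an official Prometheus client library than on hand-written formatting.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family; `samples` is a list of (labels, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        # Samples without labels are written as a bare metric name.
        selector = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{selector} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauge(
        "bisq_tor_roundtrip_seconds",          # hypothetical metric name
        "Tor roundtrip time to a seed node.",
        [({"seednode": "node1"}, 2.5)],        # hypothetical label and value
    ), end="")
```
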

# Implementation Details

Please note the priority list and/or timeline below for how I propose to get the monitoring system up and running. Each release is meant to go into production. The *Babysteps* release is meant to complement the existing [monitor](http://seedmonitor.0-2-1.net/) as it primarily adds Tor metrics. As some of the work is already done, I believe that we can provide *Babysteps* as early as January. The *Showing Off* release is then meant to supersede Manfred's monitor, as by then the new monitor includes all metrics provided by Manfred's monitor. The *Settled* release then focuses on making the value of Bisq to the world visible in some way.

- [ ] create basic infrastructure for a monitoring node
- [ ] create Prometheus instance
- [ ] benchmark Tor startup
- [ ] benchmark Hidden Service startup
- [ ] benchmark Tor roundtrip time to seednodes
- [ ] **release: Babysteps**
- [ ] p2p RTT (using ping/pong)
- [ ] p2p network load (messages per timeslot)
- [ ] p2p network load histogram (messagetype per timeslot)
- [ ] **release: Showing Off**
- [ ] estimate the p2p network size (peer count)
- [ ] estimate the number of open offers per timeslot
- [ ] estimate the number of successful trades per timeslot
- [ ] **release: Settled**
- [ ] benchmark Hidden Service startup on system Tor
- [ ] ?
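
The two p2p network load items in the list above (messages per timeslot and message type per timeslot) could be derived from the same observations. Here is a small Python sketch of the bucketing, assuming the monitoring node sees messages as `(unix_timestamp, message_type)` pairs. The function names and the message type names used with it are hypothetical.

```python
from collections import Counter, defaultdict


def load_histogram(messages, slot_seconds):
    """Count messages per type within each timeslot.

    `messages` is an iterable of (unix_timestamp, message_type) pairs; the
    result maps each slot's start time to a Counter of message types.
    """
    histogram = defaultdict(Counter)
    for timestamp, message_type in messages:
        # Round the timestamp down to the start of its timeslot.
        slot_start = int(timestamp // slot_seconds) * slot_seconds
        histogram[slot_start][message_type] += 1
    return dict(histogram)


def total_load(histogram):
    """Collapse the histogram into plain messages-per-timeslot counts."""
    return {slot: sum(counts.values()) for slot, counts in histogram.items()}
```

Keeping the per-type histogram as the primary structure means the plain load figure is just a sum over it, so both metrics stay consistent with each other.
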

# Aftermath

The proposed monitoring solution should pave the way for Bisq's future quite a bit. With the solution, we are able to understand Bisq's network layer better and thus take better care of it.

Please feel free to suggest further Metrics and how they fit in the list above. Please also feel free to raise any questions and concerns! It is a rather big project and more minds usually perform better in creating a more complete picture.
