[bisq-network/roles] Seednode Operator (#15)

Manfred Karrer notifications at github.com
Sat Jan 12 13:49:10 UTC 2019


We had a severe incident yesterday with all seed nodes. 

The reason was that I updated the --maxMemory program argument from 512 to 1024 MB. My servers have 4 GB RAM and run 2 nodes each, so I thought that should be OK. It was not: it caused out-of-memory errors, and the nodes became stuck (requiring kill -9 to stop them). 

I increased the maxMemory setting because I saw that the nodes restarted every 2-3 hours (earlier it was about once a day). The seed nodes check the memory they consume, and if it hits maxMemory they automatically restart. That is a work-around for a potential memory leak which seems to occur only on Linux (and/or seed nodes). At least on OSX with the normal Bisq app I could never reproduce it; I could even run the app with about 100 connections, which never worked on my Linux boxes. So I assume some OS setting is causing it. We researched it a bit in the past but never found the real reason (we never dedicated enough effort - we should prioritize that old issue in the near future). 
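
For illustration, a minimal sketch of what such a memory watchdog could look like (this is not the actual Bisq code; the class name, check interval, and restart hook are assumptions):

    import java.util.Timer;
    import java.util.TimerTask;

    public class MemoryWatchdog {
        private final long maxMemoryMb;          // mirrors the --maxMemory argument (MB)
        private final Runnable gracefulRestart;  // hook that stops and relaunches the node

        public MemoryWatchdog(long maxMemoryMb, Runnable gracefulRestart) {
            this.maxMemoryMb = maxMemoryMb;
            this.gracefulRestart = gracefulRestart;
        }

        public void start() {
            // Check used heap periodically (the 2-minute interval is an assumption)
            new Timer("memory-watchdog", true).scheduleAtFixedRate(new TimerTask() {
                @Override
                public void run() {
                    Runtime rt = Runtime.getRuntime();
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                    if (usedMb > maxMemoryMb) {
                        // Work-around for the suspected leak: restart proactively
                        // instead of running into an OutOfMemoryError
                        gracefulRestart.run();
                    }
                }
            }, 0, 2 * 60 * 1000);
        }
    }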

The situation was discovered late at night when a user posted a GH issue saying that he had no arbitrators. Checking the monitor page alarmed me: basically all nodes were without data and most were not responsive. From the stats at my hoster I saw that the situation had started somewhere in the last 12-24 hours.

The 2 nodes from Mike and Stephan had been responsive (as they did not change anything) but were also missing data (they restart every few hours as well and therefore connect to other seeds to gather the data - as the other seeds lost data over time, they also became corrupted).

It was a lesson that it is not a good idea to change too much at once and to change all seeds at the same time! 
The good thing is that it recovered quite quickly in the end, and the network is quite resilient even when all seeds fail (which was more or less the case here).

To recover, I started one seed locally and removed all other seed addresses (in the code), so after a while it connected to persisted peers (normal Bisq apps). From those it received the data still present in the network. I then used that seed as a dedicated seed (via --seedNodes) for the other seeds to start up again, so all my seeds got filled with data again. Mike's and Stephan's seeds needed a few hours until they were up to date again after restarting (so the too-fast restart interval was a benefit here).
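
For reference, restarting the other seeds against the recovery seed could look roughly like this (the launcher name and onion address below are placeholders, not a real invocation; the flags are the ones mentioned in this post):

    # Point a seed at the dedicated recovery seed instead of the hard-coded list
    ./bisq-seednode --seedNodes=recoveryseedexample.onion:8000 \
        --maxConnections=30 --maxMemory=1024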

I upgraded my servers to 8 GB (4 GB per node) and will now test more carefully how far I can go with the --maxConnections and --maxMemory settings. Currently I run 4 nodes with --maxConnections=30 --maxMemory=1024 and 2 with --maxConnections=25 --maxMemory=750.
Stephan told me he already had 4 GB and --maxConnections=30 --maxMemory=1024, which seems to be a safe setting. Mike has not responded so far, but I assume he has lower settings, as his node recovered quite fast (it restarted faster).

What we should do:

- Better Alert/Monitoring
We need to get an alert from the monitoring in severe cases like this. Passively looking at the monitor page is not enough. Alerts have to be good enough not to produce false positives (unlike the email alerts we receive from our simple Tor connection monitoring, which I tend to ignore because 99.9% of the time there is nothing severe).

- Improvements in code for more resilience 
When a node starts up it connects to a few seed nodes for initial data; that was added for more resilience in case one seed node is out of date. We should extend that to include normal persisted non-seed-node peers as well, so if all seeds are failing (like in this incident) the network still exchanges the live data at startup (see the sketch after this list). Only first-time users would have a problem then.

- Investigate memory increase/limitations
Investigate the reason for the memory increase (it might be an OS setting, like a limit on some network resources).
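
As a rough sketch of the resilience improvement above (the types and method names are assumptions for illustration, not the actual Bisq P2P code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class InitialDataRequester {
        // Hypothetical stand-in for Bisq's peer/request abstraction
        interface Peer { void requestInitialData(); }

        private final List<Peer> seedNodes;
        private final List<Peer> persistedPeers;

        public InitialDataRequester(List<Peer> seedNodes, List<Peer> persistedPeers) {
            this.seedNodes = seedNodes;
            this.persistedPeers = persistedPeers;
        }

        public void start() {
            // Current behavior: ask a few seed nodes for initial data
            List<Peer> candidates = new ArrayList<>(seedNodes);
            // Proposed extension: also ask a couple of persisted non-seed peers,
            // so the startup data exchange still works when all seeds are failing
            List<Peer> fallback = new ArrayList<>(persistedPeers);
            Collections.shuffle(fallback);
            candidates.addAll(fallback.subList(0, Math.min(2, fallback.size())));
            candidates.forEach(Peer::requestInitialData);
        }
    }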


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/bisq-network/roles/issues/15#issuecomment-453749038