[bisq-network/bisq] Track p2p data store files using Git LFS (#4114)

Florian Reimair notifications at github.com
Fri Apr 3 11:31:53 UTC 2020

I believe we need some more context to fully grasp the topic:

## Why are these datastores in place?

Given a fresh install, Bisq needs to sync its local copy of our distributed database. It could do so by simply asking the network and downloading the data from there, but we do not have the bandwidth for that, as our P2P network is still very small.

Thus, every release ships with a snapshot of the distributed database, so fresh Bisq installs do not have to download huge amounts of data (we might be sitting at around the 100MB mark) just to get started.

(Granted, the user has just downloaded that data via the installer, so technically it makes no difference at the user level. However, given a too-small network and, in general, upstream << downstream, fetching it from the network could take a while. So we went for hosting it via the installers.)

## Does changing the storage location alter Bisq functionality?

No, the database snapshots are still going to be included in the release binaries. Only builds from source might have to be adjusted. There is no need to change Bisq's functionality and inner workings.

## Further options I can think of:

5. only add the updated data, as proposed as a side effect in https://github.com/bisq-network/projects/issues/25. Of course, this is a short-term countermeasure and only slows the problem down. But the slowdown could be substantial: going from adding a database with `size(t) = size(t-1) + newdata` to adding a database with `size(t) = newdata` reduces the order of growth significantly.
Again, this is NOT a solution, but it might buy us time to think about the issue properly.

6. use GitHub releases to host the database snapshots
7. use GitHub packages to host the database snapshots
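
To put rough numbers on option 5, here is a minimal sketch of the two growth patterns. It assumes that every committed snapshot is stored in full (no effective delta compression for the binary files) and uses a made-up figure of 10 MB of new data per release; `cumulative_repo_size` is a hypothetical helper, not something that exists in Bisq.

```python
def cumulative_repo_size(new_data_per_release, full_snapshots):
    """Total size added to the repository over all releases.

    full_snapshots=True  models size(t) = size(t-1) + newdata
    full_snapshots=False models size(t) = newdata
    Assumption: each committed snapshot is stored whole.
    """
    total = 0
    snapshot = 0
    for new_data in new_data_per_release:
        snapshot += new_data  # the full database keeps growing either way
        total += snapshot if full_snapshots else new_data
    return total


# Hypothetical numbers: 10 releases with 10 MB of new data each.
releases = [10] * 10
full = cumulative_repo_size(releases, full_snapshots=True)
delta = cumulative_repo_size(releases, full_snapshots=False)
print(f"full snapshots: {full} MB, deltas only: {delta} MB")
```

With a constant amount of new data per release, the first variant grows quadratically in the number of releases and the second one linearly, which is why it only buys time rather than solving the problem.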

## Historical data vs. live data

The database (and its snapshots) evolves as follows (I just made up the names):

`hd(release) = hd(release-1) + livedata`

with `release` being the point in time at which a new Bisq version is released (e.g. v1.2.9). This release ships the historical data `hd(bisq version)`. `livedata` is the data that comes in between releases as Bisq business goes on: trades are conducted and new accounts are created. (Note that offers are not included, as they are gone once they have been taken.) Before `livedata` gets compiled into `historical data`, it only lives in the P2P network.

For example:
- historical data `hd(v1.2.9)` is shipped with release v1.2.9
- `hd(v1.2.9)` consists of `hd(v1.2.8)` plus the `livedata` that accumulated in the time between the two releases
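
As a toy model of the relation above, the snapshots can be treated as plain sets of entry keys. This is only a sketch: the entry names and the helpers `next_snapshot` and `missing_from_snapshot` are made up for illustration and say nothing about how Bisq actually stores its data.

```python
def next_snapshot(previous_hd, livedata):
    """hd(release) = hd(release-1) + livedata, modelled as a set union."""
    return frozenset(previous_hd) | frozenset(livedata)


def missing_from_snapshot(shipped_hd, network_data):
    """Entries that exist in the network but not in a shipped snapshot."""
    return set(network_data) - set(shipped_hd)


# Hypothetical entries, for illustration only.
hd_v128 = frozenset({"trade-1", "account-1"})
livedata = {"trade-2", "account-2"}   # arrived between v1.2.8 and v1.2.9
hd_v129 = next_snapshot(hd_v128, livedata)

# After the release, new entries keep arriving in the P2P network,
# so the shipped snapshot is incomplete from day one.
network_now = set(hd_v129) | {"trade-3"}
missing = missing_from_snapshot(hd_v129, network_now)
print(sorted(missing))
```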

(Please note that, as of now, upgrading to a new Bisq version does not use the database snapshot shipped with that version.)

That being said, I understand your (4.) is aimed at `livedata`:

> 4. **Re-think our approach to distributing data stores entirely.** So the question arises, why put all of this on the seed nodes in the first place? Why not have the entire network of Bisq nodes share this data with one another

- Well, the entire network is sharing data with one another. That is how it works and how it is designed.
- seednodes do NOT distribute the database snapshots; these snapshots are shipped with the installer binaries.

- as time goes on, a database snapshot shipped with the installer starts missing objects immediately (because life goes on and trades happen and ...)
- the Bisq app, however, needs all the data, especially the latest, to work properly
- thus, Bisq performs the following steps before it shows the actual GUI to the user
  1. ask 2 seednodes for preliminary data (> 4MB request)
  2. ask 2 seednodes for updated data (4MB again) after hidden service is online in case there has been something happening between preliminary request and the hidden service coming online (note that having v3 HS now and their very low publishing time, we might think of loosing the preliminary request (see what I did there? it finally comes together :1st_place_medal: ) )
  3. ask 2 seednodes for peers
- by doing it this way, the Bisq network eliminates issues with source file availability. Just think back to the good ol' days with eMule when you had downloaded 99% of the ubuntu4.iso and there just wasn't a source around for the very last 4k block of data you were missing. As our P2P network is still very small, we might encounter that as well.
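
The startup steps above can be sketched roughly as follows. Everything here (`SeedNode`, `get_data`, `get_peers`, `bootstrap`) is a hypothetical stand-in, not Bisq's actual classes or wire protocol. The sketch also assumes that a data request carries the keys the client already holds, so that the seed node can answer with only the missing entries.

```python
import random


class SeedNode:
    """Hypothetical stand-in for a seed node; not Bisq's real API."""

    def __init__(self, data, peers):
        self.data = set(data)
        self.peers = list(peers)

    def get_data(self, known_keys):
        # Assumption: the request lists the keys the client already has.
        return self.data - set(known_keys)

    def get_peers(self):
        return list(self.peers)


def bootstrap(shipped_snapshot, seed_nodes, rng=random):
    """Run the three startup requests, each against 2 seed nodes."""
    local = set(shipped_snapshot)

    # 1. preliminary data request
    for node in rng.sample(seed_nodes, 2):
        local |= node.get_data(local)

    # ... the hidden service comes online here ...

    # 2. updated data request, to catch anything that arrived in between
    for node in rng.sample(seed_nodes, 2):
        local |= node.get_data(local)

    # 3. ask for peers
    peers = set()
    for node in rng.sample(seed_nodes, 2):
        peers.update(node.get_peers())

    return local, peers


# Hypothetical data: both seed nodes hold the complete, current data set.
seeds = [
    SeedNode({"a", "b", "c"}, ["peer-1"]),
    SeedNode({"a", "b", "c"}, ["peer-2"]),
]
local, peers = bootstrap({"a"}, seeds, random.Random(0))
```

Because every seed node is expected to hold everything, any two of them are enough to bring the client up to date, which is the availability property described above.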

- other approaches have been suggested; one of the most promising, imho, is to reorder the current steps and split the data load:
  1. ask for peers
  2. ask these peers + seed nodes for data, but do not ask all data from all of them

  which of course is not trivial, because we do not know which data we are missing...
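
One conceivable way to split the load without knowing what is missing is to partition the key space instead of the missing set: each source is asked only for the entries whose hash falls into "its" bucket. The sketch below is purely illustrative and not an existing or proposed Bisq mechanism; `bucket_of`, `split_requests` and `answer` are made-up names, and it assumes every source holds all entries of its bucket, which is exactly the availability problem mentioned above.

```python
import hashlib


def bucket_of(key, n_buckets):
    """Map an entry key to a bucket via its SHA-256 hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def split_requests(sources, local_keys):
    """Give each source one bucket plus the local keys inside that bucket."""
    n = len(sources)
    plan = {
        source: {"bucket": i, "n_buckets": n, "known": set()}
        for i, source in enumerate(sources)
    }
    for key in local_keys:
        plan[sources[bucket_of(key, n)]]["known"].add(key)
    return plan


def answer(source_data, request):
    """What a source returns: entries of its bucket the requester lacks."""
    return {
        key
        for key in source_data
        if bucket_of(key, request["n_buckets"]) == request["bucket"]
        and key not in request["known"]
    }


# Hypothetical setup: 50 entries in the network, 20 of them known locally.
all_data = {f"entry-{i}" for i in range(50)}
local = {f"entry-{i}" for i in range(20)}
sources = ["seed-1", "peer-1", "peer-2"]

plan = split_requests(sources, local)
received = set()
for source in sources:
    received |= answer(all_data, plan[source])
```

Each request then only carries the known keys of one bucket, and each source only serves a slice of the data. If a peer lacks entries from its bucket, they stay missing, so a real design would still need a fallback to the seed nodes.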

## Dev Call

I have a feeling we should discuss this on a dev call someday. Needs some more info though.
