Backup and Restore
While all components in the system are designed to be crash-fault-tolerant, there is always a chance of failures from which the system cannot immediately recover, e.g. due to misconfiguration or bugs. In such cases, we will need to restore a component, a full validator node, or even the whole network, from backups or from dumps from components that are still operational.
Backup of Node Identities
Once your validator node is up and onboarded, please make sure to back up its node identities. Note that this information is highly sensitive and contains the private keys of your participant, so store it in a secure location, such as a Secret Manager. At the same time, it is crucial for maintaining your identity (and thus, e.g., access to your Canton Coin holdings), so it must be backed up outside of the cluster.
Your identities may be fetched from your node through the following endpoint:
curl "https://wallet.validator.YOUR_HOSTNAME/api/validator/v0/admin/participant/identities" -H "authorization: Bearer <token>"
where <token> is an OAuth2 Bearer Token with enough claims to access the Validator app, as obtained from your OAuth provider. For context, see the Authentication section.
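As an illustration, the response can be written to a timestamped file and uploaded to a secret manager. The sketch below assumes Google Cloud Secret Manager and a pre-created secret named participant-identities (both assumptions); adapt the upload step to whatever secret store you use:
# Fetch the identities dump and store it under a UTC-timestamped name (illustrative file layout)
ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
curl -sSLf "https://wallet.validator.YOUR_HOSTNAME/api/validator/v0/admin/participant/identities" \
  -H "authorization: Bearer <token>" > "identities-${ts}.json"
# Upload as a new secret version; "participant-identities" is an assumed, pre-created secret
gcloud secrets versions add participant-identities --data-file="identities-${ts}.json"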
If you are using a docker-compose deployment, replace https://wallet.validator.YOUR_HOSTNAME with http://wallet.localhost.
If you are running the docker-compose deployment with no auth, you can use the utility Python script get-token.py to generate a token for the curl command by running python get-token.py administrator (requires pyjwt).
Backups of Postgres instances
To backup a validator node, please make sure that all Postgres instances are backed up at least every 4 hours. Note that there is a strict order requirement between the backups: the backup of the apps postgres instance must be taken at a point in time strictly earlier than that of the participant. Please make sure the apps instance backup is completed before starting the participant one. We will provide guidelines on retention of older backups at a later point in time.
If you are running your own Postgres instances in the cluster, backups can be taken either using tools like pg_dump, or through snapshots of the underlying Persistent Volume. Similarly, if you are using Cloud-hosted Postgres, you can either use tools like pg_dump or the backup tools provided by the Cloud provider.
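As a minimal sketch of the ordering requirement above, a pg_dump-based backup might dump the apps (validator) database first and only start the participant dump once that has completed. The hostnames and database names below are assumptions; substitute your own:
# Dump the apps (validator) database first; the participant dump must start strictly after this completes.
# Credentials are assumed to be provided via PGPASSWORD or ~/.pgpass.
ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
pg_dump -h apps-postgres.example.com -U cnadmin validator > "validator-${ts}.dump"
# Only once the apps dump has finished, dump the participant database
pg_dump -h participant-postgres.example.com -U cnadmin participant > "participant-${ts}.dump"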
If you are running a docker-compose deployment, you can run the following commands to backup the Postgres databases. This will create two dump files, one for the participant and one for the validator app.
docker exec -i compose-postgres-splice-1 pg_dump -U cnadmin validator > "${backup_dir}"/validator-"$(date -u +"%Y-%m-%dT%H:%M:%S%:z")".dump
active_participant_db=$(docker exec compose-participant-1 bash -c 'echo $CANTON_PARTICIPANT_POSTGRES_DB')
docker exec compose-postgres-splice-1 pg_dump -U cnadmin "${active_participant_db}" > "${backup_dir}"/"${active_participant_db}"-"$(date -u +"%Y-%m-%dT%H:%M:%S%:z")".dump
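Before relying on the dumps, it can be worth a quick sanity check that they are non-empty and look like pg_dump output (illustrative):
# A valid plain-format dump is non-empty and starts with a "PostgreSQL database dump" header
ls -lh "${backup_dir}"/*.dump
head -n 5 "${backup_dir}"/validator-*.dump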
Restoring a validator from backups
Assuming backups have been taken as described above, the entire node can be restored from them with the following steps:
Scale down all components in the validator node to 0 replicas.
Restore the storage and DBs of all components from the backups. The exact process for this depends on the storage and DBs used by the components, and is not documented here.
Once all storage has been restored, scale all components in the validator node back up to 1 replica (see the kubectl sketch below).
NOTE: Currently, you have to manually re-onboard any users that were onboarded after the backup was taken.
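For Kubernetes deployments, the scale-down and scale-up steps might look as follows, assuming all components run as Deployments in a single namespace named validator (an assumption; adapt to your manifests):
# Scale every deployment in the (assumed) validator namespace down to 0 replicas
kubectl scale deployment --all --replicas=0 -n validator
# ... restore storage and databases from backups here ...
# Scale everything back up once the restore is complete
kubectl scale deployment --all --replicas=1 -n validator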
If you are running a docker-compose deployment, you can restore the Postgres databases as follows:
Stop the validator and participant using:
./stop.sh
Wipe out the existing database volume:
docker volume rm compose_postgres-splice
Start only the postgres container:
docker compose up -d postgres-splice
Check whether postgres is ready (rerun this command until it succeeds):
docker exec compose-postgres-splice-1 pg_isready
Restore the validator database (assuming validator_dump_file contains the filename of the dump from which you wish to restore):
docker exec -i compose-postgres-splice-1 psql -U cnadmin validator < $validator_dump_file
Restore the participant database (assuming participant_dump_file contains the filename of the dump from which you wish to restore, and migration_id contains the latest migration ID):
docker exec -i compose-postgres-splice-1 psql -U cnadmin participant-$migration_id < $participant_dump_file
Stop the postgres instance:
docker compose down
Start your validator as usual.
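Before the final docker compose down in the steps above, you can optionally verify that the restore succeeded by confirming the databases accept connections and contain tables (illustrative):
# List tables in each restored database; non-empty output indicates the restore took effect
docker exec compose-postgres-splice-1 psql -U cnadmin validator -c '\dt'
docker exec compose-postgres-splice-1 psql -U cnadmin "participant-$migration_id" -c '\dt'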
Disaster recovery from loss of the CometBFT storage layer of the global synchronizer
In case of a complete disaster, where the complete CometBFT layer of the network is lost beyond repair, the SVs will follow a process somewhat similar to the migration dumps used for Synchronizer Upgrades with Downtime to recover the network to a consistent state from before the disaster. Correspondingly, validators will need to follow a process similar to the one described in Synchronizer Upgrades with Downtime. The main difference, from a validator’s perspective, is that the existing synchronizer is assumed to be unusable for any practical purpose, so validators cannot catch up if they are behind. Moreover, the timestamp from which the network will be recovering will most probably be earlier than the time of the incident, so some data loss is expected.
The steps at the high level are:
All SVs agree on the timestamp from which they will be recovering, and follow the disaster recovery process for SVs.
Validator operators wait until the SVs have signaled that the restore procedure has been successful, and to which timestamp they have restored.
Validator operators create a dump file through their validator app.
Validator operators copy the dump file to their validator app’s PVC and restart the app to restore the data.
Technical Details
We recommend first familiarizing yourself with the migration process, as the disaster recovery process is similar. In case of disaster, the SVs will inform you of the need to recover, and indicate the timestamp from which the network will be recovering.
The following steps will produce a data dump through the validator app, consisting of your node’s private identities as well as the Active Contract Set (ACS) as of the required timestamp. That data dump will then be stored on the validator app’s PVC, and the validator app and participant will be configured to consume it and restore the data from it.
Before you fetch a data dump from the validator app, please make sure that your participant was healthy around the timestamp that the SVs have provided. The data dump can then be fetched from the validator app by running the following command:
curl -sSLf "https://wallet.validator.YOUR_HOSTNAME/api/validator/v0/admin/domain/data-snapshot?timestamp=<timestamp>&force=true" -H "authorization: Bearer <token>" -X GET -H "Content-Type: application/json" > dump_response.json
cat dump_response.json | jq '.data_snapshot' > dump.json
where <token> is an OAuth2 Bearer Token with enough claims to access the Validator app, as obtained from your OAuth provider, and <timestamp> is the timestamp provided by the SVs, in the format "2024-04-17T19:12:02Z".
If the curl command fails with a 400 error, that typically means that your participant has been pruned beyond the chosen timestamp, and your node cannot generate the requested dump. If it fails with a 429, that means the timestamp is too late for your participant to create a dump for, i.e. your participant has not caught up to a late enough point before the disaster. Either way, you will need to go through a process of recreating your validator and recovering your balance, which will be documented soon.
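A sketch of wrapping the fetch so that the two failure modes are distinguished follows; the curl flags are standard, everything else is illustrative:
# Capture the HTTP status so 400 (pruned past the timestamp) and 429 (not caught up) can be told apart
status=$(curl -sSL -o dump_response.json -w "%{http_code}" \
  "https://wallet.validator.YOUR_HOSTNAME/api/validator/v0/admin/domain/data-snapshot?timestamp=<timestamp>&force=true" \
  -H "authorization: Bearer <token>" -X GET -H "Content-Type: application/json")
case "$status" in
  200) jq '.data_snapshot' dump_response.json > dump.json ;;
  400) echo "participant pruned beyond the chosen timestamp" >&2 ;;
  429) echo "participant has not caught up to the chosen timestamp" >&2 ;;
  *)   echo "unexpected status: $status" >&2 ;;
esac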
This file can now be copied to the Validator app’s PVC:
kubectl cp dump.json validator/<validator_pod_name>:/domain-upgrade-dump/domain_migration_dump.json
where <validator_pod_name> is the full name of the pod running the validator app.
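If you do not know the pod name, it can typically be looked up first; the label selector below is an assumption and may differ in your deployment:
# Look up the validator app pod name in the validator namespace; the app label is illustrative
validator_pod_name=$(kubectl get pods -n validator -l app=validator-app -o jsonpath='{.items[0].metadata.name}')
kubectl cp dump.json validator/${validator_pod_name}:/domain-upgrade-dump/domain_migration_dump.json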
Migrating the Data
Please follow the instructions in the Deploying the Validator App and Participant section to update the configuration of the validator app and participant to consume the migration dump file.
For docker-compose validator deployments, the process is similar with the following modifications:
For the endpoint for fetching the data dump, replace https://wallet.validator.YOUR_HOSTNAME with http://wallet.localhost.
If you are running your validator without auth, you can use the utility Python script get-token.py to generate a token for the curl command by running python get-token.py administrator (requires pyjwt).
Copy the dump file to the validator’s docker volume using:
docker run --rm -v "domain-upgrade-dump:/volume" -v "$(pwd):/backup" alpine sh -c "cp /backup/dump.json /volume/domain_migration_dump.json"
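To double-check that the file landed where the validator app expects it, you can list the volume contents (illustrative):
# List the contents of the domain-upgrade-dump volume to confirm the dump file is in place
docker run --rm -v "domain-upgrade-dump:/volume" alpine ls -l /volume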