Day-2 operations
Everything below is codified in the repo’s deploy/ routines — idempotent,
re-entrant bash with every step logged to deploy/logs/. The golden rule for all
of it: the order is allowlist-first, because the dstack KMS will not release
keys to a CVM whose compose hash isn’t approved on chain.
Updating a node’s compose or sealed env

    CLUSTER=<cluster> MEMBER_IMPL=<impl> COMPOSE=deploy/compose/node-1.yaml \
      deploy/node-pathA.sh attestmesh-node-1 update

The update routine:
- Runs a prepare-only update to learn the new compose hash without touching the CVM.
- Checks allowedComposeHashes on the cluster and, if needed, allowlists the new hash (a cluster-owner transaction).
- Applies the update: the CVM reboots into the new compose and passes the KMS boot gate because the hash is already approved.
The node re-registers nothing: its member record, keys, and identity carry over.
It re-meshes and re-acquires the CSK on its way back to healthy.
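
The allowlist-first ordering of the update routine can be sketched as a small bash function. This is a minimal illustration, not the repo's implementation: the four helper names (`prepare_compose_hash`, `is_hash_allowlisted`, `allowlist_hash`, `apply_update`) are hypothetical stand-ins for whatever the deploy/ routines do internally.

```shell
#!/usr/bin/env bash
# Sketch of the allowlist-first ordering. The four helpers are hypothetical
# stand-ins, not functions taken from the repo's deploy/ scripts:
#   prepare_compose_hash <node>   prints the new compose hash, CVM untouched
#   is_hash_allowlisted <hash>    exits 0 if the cluster already allows it
#   allowlist_hash <hash>         sends the cluster-owner transaction
#   apply_update <node>           applies the update; the CVM reboots

update_allowlist_first() {
  local node="$1" compose_hash

  # 1. Learn the hash first; nothing on the CVM changes yet.
  compose_hash="$(prepare_compose_hash "$node")" || return 1
  if [ -z "$compose_hash" ]; then
    echo "no compose hash for $node" >&2
    return 1
  fi

  # 2. Approve the hash on chain only if it is missing, so re-runs stay idempotent.
  if ! is_hash_allowlisted "$compose_hash"; then
    allowlist_hash "$compose_hash" || return 1
  fi

  # 3. Only now reboot into the new compose, with the hash already approved.
  apply_update "$node"
}
```

The point of the shape is that a failed allowlist transaction returns before `apply_update` is reached, so the CVM never reboots into a compose the KMS would refuse.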
Rolling a new sidecar image
Images are tagged :latest and published by CI on every push to sidecar/**.
Because the compose doesn’t change, no allowlist step is needed — restart to
re-pull:

    deploy/node-pathA.sh attestmesh-node-1 restart
    deploy/node-pathA.sh attestmesh-node-1 mesh-verify

Roll one node at a time and mesh-verify between them if you want the mesh to
stay continuously available; a simultaneous roll of every node also recovers, just
with a brief gap.
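
A one-node-at-a-time roll could be scripted along these lines. This is a sketch under assumptions: it presumes that `restart` and `mesh-verify` exit non-zero on failure, and the `NODE_SCRIPT` override and the stop-on-first-failure policy are choices of this example rather than documented behavior.

```shell
#!/usr/bin/env bash
# Rolling restart sketch: restart one node, wait for mesh-verify, then move on.
# Assumes both subcommands exit non-zero on failure (not confirmed by the text).
# NODE_SCRIPT is a hypothetical override; it defaults to the routine used above.
NODE_SCRIPT="${NODE_SCRIPT:-deploy/node-pathA.sh}"

rolling_restart() {
  local node
  for node in "$@"; do
    echo "rolling $node"
    if ! "$NODE_SCRIPT" "$node" restart; then
      echo "restart failed on $node" >&2
      return 1
    fi
    # Do not touch the next node until this one is back in the mesh.
    if ! "$NODE_SCRIPT" "$node" mesh-verify; then
      echo "mesh-verify failed on $node" >&2
      return 1
    fi
  done
}

# Usage (node names are examples):
#   rolling_restart attestmesh-node-1 attestmesh-node-2 attestmesh-node-3
```

Stopping at the first failure leaves the remaining nodes on the old image, which keeps the rest of the mesh serving while the failed node is investigated.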
Operating the shared indexer
The indexer is one instance per chain, serving every cluster; never deploy it per cluster. It is a stock dstack app (not a cluster member), so its updates have no cluster allowlist step:

    deploy/indexer.sh attestmesh-indexer-1 update   # new compose/env + restart
    deploy/indexer.sh attestmesh-indexer-1 verify   # health via the gateway + registry read-back

Re-registration (deploy/indexer.sh <name> register) is only needed when the
signing key or endpoint changes — for example, deploying a replacement CVM. The
registry record is owner-controlled; subscribing sidecars re-discover it
automatically and re-verify every push against the new key.
If the indexer is ever down, nothing breaks: sidecars log the failed subscription, fall back to their own chain-log polling, and resubscribe when it returns.
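
The indexer update flow from this section might be wrapped as below. The `update`, `verify`, and `register` subcommands come from the text; the `INDEXER_SCRIPT` override, the `REREGISTER` switch, and the choice to register before verifying are assumptions of this sketch.

```shell
#!/usr/bin/env bash
# Indexer update sketch. INDEXER_SCRIPT and REREGISTER are hypothetical
# conveniences; running register before verify is an assumed ordering, chosen
# so that the registry read-back in verify sees the new record.
INDEXER_SCRIPT="${INDEXER_SCRIPT:-deploy/indexer.sh}"

update_indexer() {
  local name="$1"

  "$INDEXER_SCRIPT" "$name" update || return 1

  # Register only when the signing key or endpoint changed.
  if [ "${REREGISTER:-0}" = "1" ]; then
    "$INDEXER_SCRIPT" "$name" register || return 1
  fi

  "$INDEXER_SCRIPT" "$name" verify
}

# Usage: update_indexer attestmesh-indexer-1
```

Leaving `REREGISTER` unset matches the common case, where the key and endpoint are unchanged and the existing registry record stays valid.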
Adding a node to a live cluster
Identical to first bring-up; the routine is the same for node 3 as for node 1:

    CLUSTER=<cluster> MEMBER_IMPL=<impl> COMPOSE=deploy/compose/node-1.yaml \
      deploy/node-pathA.sh attestmesh-node-3 all

Existing members discover the newcomer from chain state on their next reconcile pass (within seconds when the indexer push wakes them), add the WireGuard peer, and serve it the CSK once its heartbeats verify.
Health checklist
When something looks wrong, in order:
- /healthz via the gateway: which phase is it stuck in?
- CVM logs (npx phala cvms logs <cvm>): the sidecar logs every phase transition, registration attempt, peer configuration, envelope, and CSK step.
- Chain state: memberCount, memberIdOf(app_id), cskCommitment, and IndexerRegistry.current() answer most “is it registered/configured?” questions directly.
- Field notes: the failure modes we have actually hit, with their signatures and fixes.
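
The first two checklist items can be gathered in one pass with a helper like this. It is a rough sketch: `GATEWAY_URL` is a hypothetical variable holding the node's gateway base URL, since the URL shape is not given here, and chain-state reads are left out because the tooling for them is not specified.

```shell
#!/usr/bin/env bash
# Diagnostics sketch covering the first two checklist items, in order.
# GATEWAY_URL is a hypothetical variable (gateway base URL for the node).
collect_diagnostics() {
  local cvm="$1"

  if [ -z "${GATEWAY_URL:-}" ]; then
    echo "set GATEWAY_URL first" >&2
    return 2
  fi

  echo "== /healthz via the gateway =="
  # -f makes curl exit non-zero on an HTTP error status; -sS hides the
  # progress meter but still prints errors.
  curl -fsS "$GATEWAY_URL/healthz" || echo "healthz request failed"
  echo

  echo "== CVM logs =="
  npx phala cvms logs "$cvm" || echo "could not fetch logs for $cvm"
}

# Usage: GATEWAY_URL=<gateway-url> collect_diagnostics <cvm>
```

A failed health request does not abort the function, because the logs are usually most useful exactly when /healthz is unreachable.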