Field notes
AttestMesh’s modules were “implemented and unit-tested” well before they worked in production. Getting from there to a live mainnet mesh surfaced ten bugs, none of which a unit test could have caught — every one lived at a boundary with the real world: the TEE runtime, the bundler, the gateway, the chain’s size, restart semantics. They are documented here because they are exactly the bugs the next attested-systems project will meet.
1. The TEE API drifted under us — twice
dstack 0.5.x removed the /DeriveKey guest-agent endpoint in favor of /GetKey.
The sidecar found this on its first live boot; the indexer found it again months
later because it carried its own copy of the client.
Lesson: runtime APIs you don’t control get re-verified on a live system, and duplicated clients duplicate the failure.
2. The paymaster rejected a dummy signature
Sponsorship gas estimation requires a placeholder signature, and Alchemy rejects odd-length hex. Our dummy signature had an odd number of hex digits. The fix was a single character, found only by submitting a real operation.
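A cheap guard for this class of bug is to check that any hex payload encodes whole bytes before it leaves the process. The sketch below is illustrative only: `is_byte_aligned_hex` is a hypothetical helper, not part of AttestMesh or of any bundler SDK.

```rust
/// Hypothetical helper: returns true if `s` is a 0x-prefixed hex string
/// that encodes a whole number of bytes (an even number of hex digits).
fn is_byte_aligned_hex(s: &str) -> bool {
    match s.strip_prefix("0x") {
        Some(body) => body.len() % 2 == 0 && body.bytes().all(|b| b.is_ascii_hexdigit()),
        None => false,
    }
}
```

Asserting this on the placeholder signature in a unit test would have flagged the odd-length string without a round trip to the bundler.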
3. ECDSA recovery ids: 0/1 vs 27/28
The dstack KMS chain signs with recovery ids 0/1; OpenZeppelin’s ECDSA reverts on
v < 27. On-chain verification of real attestation proofs failed until the
sidecar normalized them.
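The normalization can be sketched as below, assuming the signature is laid out as 65 bytes of `r || s || v`. The function name and error type are hypothetical; the sidecar's actual code is not shown in these notes.

```rust
/// Hypothetical sketch: rewrites a trailing recovery id of 0/1 to 27/28
/// in a 65-byte `r || s || v` signature. Values already at 27/28 are left
/// alone; anything else is reported instead of being passed on-chain.
fn normalize_recovery_id(sig: &mut [u8; 65]) -> Result<(), String> {
    match sig[64] {
        0 | 1 => {
            sig[64] += 27;
            Ok(())
        }
        27 | 28 => Ok(()),
        other => Err(format!("unexpected recovery id {other}")),
    }
}
```

Rejecting unknown values rather than guessing keeps a malformed signature from turning into an opaque on-chain revert.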
4. A webhook route pointing at the wrong worker
Sponsorship denied with opaque 401s: the custom domain’s Cloudflare route still
targeted a different project’s worker. The webhook code was fine; the routing
wasn’t. Now deploy/webhook.sh asserts the route on every deploy.
Lesson: encode infra invariants into deploy routines, not memories.
5. The 4337 nonce that silently became zero
The nonce fetch called a bundler RPC method that doesn’t exist and swallowed the
error into nonce = 0. Registration (genuinely nonce 0) worked, masking it;
every subsequent operation failed with AA25 invalid account nonce. Fixed by
reading EntryPoint.getNonce via eth_call and propagating failures.
Lesson: unwrap_or(default) on an RPC result is a time bomb; the masked case
always ships.
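A minimal sketch of the "propagate, don't default" shape follows. It assumes the RPC layer hands back the 32-byte return word of `EntryPoint.getNonce` as a 0x-prefixed hex string, and that the account uses nonce key 0, so the value fits in the low 64 bits. `NonceError` and `parse_nonce` are hypothetical names, and the `Result<String, String>` input stands in for whatever the real client returns.

```rust
#[derive(Debug, PartialEq)]
enum NonceError {
    /// The RPC call itself failed (for example, "method not found").
    Rpc(String),
    /// The call succeeded but the payload was not a usable nonce word.
    Malformed(String),
}

/// Hypothetical sketch: decodes the 32-byte word returned by an `eth_call`
/// to `EntryPoint.getNonce`. Every failure is returned to the caller;
/// nothing is collapsed into a default of zero.
fn parse_nonce(rpc_result: Result<String, String>) -> Result<u64, NonceError> {
    let raw = rpc_result.map_err(NonceError::Rpc)?;
    let hex = raw
        .strip_prefix("0x")
        .ok_or_else(|| NonceError::Malformed(raw.clone()))?;
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(NonceError::Malformed(raw.clone()));
    }
    // Assumption: nonce key 0, so the upper 192 bits (48 hex digits) are zero.
    let (high, low) = hex.split_at(48);
    if high.bytes().any(|b| b != b'0') {
        return Err(NonceError::Malformed(raw.clone()));
    }
    u64::from_str_radix(low, 16).map_err(|_| NonceError::Malformed(raw.clone()))
}
```

With this shape, a missing RPC method surfaces as `NonceError::Rpc` on the first call instead of masquerading as a legitimate nonce of 0.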
6. A convergence rule that could never be satisfied
Heartbeats reported each node’s connected set excluding itself; the convergence check compared views against the live set, which includes everyone. A two-node mesh could mathematically never converge. The module’s own unit tests had the correct semantics; the integration passed the wrong data.
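One way to reconcile the two definitions is to add each node back into its own reported view before comparing. The sketch below assumes integer node ids and a hypothetical `converged` function; it is not the module's actual check.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: `views` maps each node to the peers it reported
/// (excluding itself). The mesh counts as converged when every live node's
/// view, with the node itself added back, equals the full live set.
fn converged(live: &BTreeSet<u32>, views: &BTreeMap<u32, BTreeSet<u32>>) -> bool {
    live.iter().all(|node| match views.get(node) {
        Some(peers) => {
            let mut view = peers.clone();
            view.insert(*node); // heartbeats omit the reporting node
            view == *live
        }
        // A live node that has not reported yet blocks convergence.
        None => false,
    })
}
```

In a two-node mesh, node 1 reports `{2}` and node 2 reports `{1}`. Compared raw against the live set `{1, 2}`, neither view can ever match; with the self-insert, both do.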
7. The CSK restart deadlock
After a restart, the TEE’s sealed store came back empty, and with the on-chain
commitment already set, the originator took the “pull from a peer” path along with
everyone else. Every node waited for every other node: distributed deadlock in
pulling-csk. The fix leaned on determinism: the originator re-derives the CSK
from its TEE seed and verifies it against the commitment before falling back.
Lesson: “persisted” is a claim about someone else’s system. Determinism is the stronger recovery primitive.
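The recovery order can be sketched as a small decision function. Everything here is hypothetical: the enum, the function name, the closures standing in for TEE derivation and the commitment check, and the choice to verify the sealed value as well as the re-derived one.

```rust
#[derive(Debug, PartialEq)]
enum CskRecovery {
    Sealed([u8; 32]),
    Rederived([u8; 32]),
    PullFromPeer,
}

/// Hypothetical sketch of the restart path. `rederive` stands in for
/// deriving the CSK from the TEE seed; `matches_commitment` stands in for
/// checking a candidate against the on-chain commitment.
fn recover_csk(
    sealed: Option<[u8; 32]>,
    is_originator: bool,
    rederive: impl Fn() -> [u8; 32],
    matches_commitment: impl Fn(&[u8; 32]) -> bool,
) -> CskRecovery {
    // Assumption: a sealed value is only trusted if it matches the commitment.
    if let Some(key) = sealed {
        if matches_commitment(&key) {
            return CskRecovery::Sealed(key);
        }
    }
    // The originator can recover without any peer being up.
    if is_originator {
        let key = rederive();
        if matches_commitment(&key) {
            return CskRecovery::Rederived(key);
        }
    }
    CskRecovery::PullFromPeer
}
```

The point of the ordering is that at least one node has a recovery path that does not depend on another node, which breaks the wait cycle.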
8. The genesis scan that would never finish
The indexer’s first catch-up paged from block 0: ~47 million blocks on Base. Its
health and gRPC listeners only start after catch-up, so it sat dark for what
would have been days. Bounded by INDEXER_START_BLOCK (the factory’s deploy
block, found automatically by binary-searching getCode), catch-up takes ~23
seconds.
Lesson: test data has no history. Mainnet does.
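The deploy-block search can be sketched as a standard lower-bound binary search. The `has_code` closure is a stand-in for an `eth_getCode` call at a given block that returns non-empty bytecode; the sketch assumes that predicate is monotonic (no code before deployment, code at every block after). `find_deploy_block` is a hypothetical name.

```rust
/// Hypothetical sketch: returns the first block in `0..=latest` for which
/// `has_code(block)` is true, or `None` if there is no code at `latest`.
/// Assumes `has_code` flips from false to true exactly once.
fn find_deploy_block(latest: u64, has_code: impl Fn(u64) -> bool) -> Option<u64> {
    if !has_code(latest) {
        return None;
    }
    let (mut lo, mut hi) = (0u64, latest);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if has_code(mid) {
            hi = mid; // deployed at or before mid
        } else {
            lo = mid + 1; // deployed after mid
        }
    }
    Some(lo)
}
```

For a chain of about 47 million blocks this needs on the order of 26 probes, which is why it is cheap enough to run as part of every deploy.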
9. ANSI codes in scraped logs
The deploy routine that scrapes the indexer’s boot-derived pubkey from CVM logs matched nothing: pretty-format tracing interleaves color escapes between the label and the hex. Strip codes before grepping.
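A minimal sketch of the stripping step, handling only CSI sequences (ESC, `[`, parameters, then a final byte in `@` through `~`), which covers color codes. The function name and the sample log line in the usage note are made up for illustration.

```rust
/// Hypothetical sketch: removes ANSI CSI escape sequences from a log line.
/// Other escape families (OSC and so on) are not handled here.
fn strip_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\u{1b}' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // Skip parameter and intermediate bytes up to and including
            // the final byte of the sequence.
            for d in chars.by_ref() {
                if ('@'..='~').contains(&d) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

For an input such as `"\x1b[2mpubkey\x1b[0m\x1b[2m=\x1b[0m0xabc"`, the cleaned line is `pubkey=0xabc`, which a plain label match can find.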
10. The gateway speaks h2 but says http/1.1
The dstack gateway proxies gRPC fine (grpcurl, which forces h2, worked
perfectly), but its ALPN answer says http/1.1, so tonic refused with
HTTP/2 was not negotiated. Fixed with ClientTlsConfig::assume_http2(true):
trust the verified behavior over the advertised ALPN result.
Lesson: when a transport works under one client and not another, diff the negotiation, not the payload.
The meta-lesson
Five of the ten (1, 5, 7, 8, 10) share a single shape: a dependency behaved differently in production than its interface implied, and the failure surfaced somewhere else. The mitigations that actually worked were: verbose logging at every boundary, deploy routines that assert their own preconditions, fallbacks that don’t share the primary’s failure mode (log polling under the indexer, re-derivation under the sealed store), and treating every live deployment as the final integration test it really is.