Field notes

AttestMesh’s modules were “implemented and unit-tested” well before they worked in production. Getting from there to a live mainnet mesh surfaced ten bugs, none of which a unit test could have caught — every one lived at a boundary with the real world: the TEE runtime, the bundler, the gateway, the chain’s size, restart semantics. They are documented here because they are exactly the bugs the next attested-systems project will meet.

1. The guest-agent endpoint that had been removed

dstack 0.5.x removed the /DeriveKey guest-agent endpoint in favor of /GetKey. The sidecar found this on its first live boot; the indexer found it again months later because it carried its own copy of the client.

Lesson: runtime APIs you don’t control get re-verified on a live system, and duplicated clients duplicate the failure.

2. The paymaster rejected a dummy signature

Gas estimation for sponsorship requires a placeholder signature, and Alchemy rejects odd-length hex, which the dummy value was. The fix was a single character, found only by submitting a real operation.
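One way to make this class of mistake impossible is to build the placeholder from bytes rather than typing the hex by hand. The sketch below assumes a 65-byte, ECDSA-shaped dummy; the function names and the exact bytes are illustrative, and what a given account or paymaster actually expects may differ.

```rust
/// Hex-encode bytes with a 0x prefix. Each byte becomes exactly two digits,
/// so the result always has an even number of hex characters.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(2 + bytes.len() * 2);
    out.push_str("0x");
    for b in bytes {
        out.push_str(&format!("{:02x}", b));
    }
    out
}

/// Hypothetical placeholder shaped like (r || s || v). The byte values are an
/// assumption for illustration, not the values AttestMesh uses.
fn placeholder_signature() -> String {
    let mut sig = [0xffu8; 65];
    sig[64] = 0x1c; // v = 28
    to_hex(&sig)
}

/// Cheap pre-flight check to run before sending an operation for estimation.
fn is_well_formed_hex(s: &str) -> bool {
    match s.strip_prefix("0x") {
        Some(body) => body.len() % 2 == 0 && body.chars().all(|c| c.is_ascii_hexdigit()),
        None => false,
    }
}
```

Running `is_well_formed_hex` on every outgoing hex field turns an opaque provider rejection into a local, explainable failure.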

3. Recovery ids the on-chain verifier rejected

The dstack KMS chain signs with recovery ids 0/1; OpenZeppelin’s ECDSA reverts on v < 27. On-chain verification of real attestation proofs failed until the sidecar normalized the recovery byte to 27/28.
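The normalization itself is small. A minimal sketch, assuming the signature is a 65-byte (r || s || v) array with the recovery byte last; the function name is hypothetical:

```rust
/// Map a trailing recovery id of 0/1 to the Ethereum-style 27/28.
/// Values already in 27/28 pass through; anything else is an error
/// rather than a silent guess.
fn normalize_v(mut sig: [u8; 65]) -> Result<[u8; 65], String> {
    let v = sig[64];
    sig[64] = match v {
        0 | 1 => v + 27,
        27 | 28 => v,
        other => return Err(format!("unexpected recovery id {}", other)),
    };
    Ok(sig)
}
```

Rejecting unknown values instead of passing them through keeps a malformed proof from reaching the contract and failing there with less context.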

4. A webhook route pointing at the wrong worker

Sponsorship denied with opaque 401s: the custom domain’s Cloudflare route still targeted a different project’s worker. The webhook code was fine; the routing wasn’t. Now deploy/webhook.sh asserts the route on every deploy.

Lesson: encode infra invariants into deploy routines, not memories.

5. The 4337 nonce that silently became zero

The nonce fetch called a bundler RPC method that doesn’t exist — and swallowed the error into nonce = 0. Registration (genuinely nonce 0) worked, masking it; every subsequent operation failed with AA25 invalid account nonce. Fixed by reading EntryPoint.getNonce via eth_call and propagating failures.

Lesson: unwrap_or(default) on an RPC result is a time bomb; the masked case always ships.
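The shape of the fix can be sketched as follows. It assumes the caller has already performed the `eth_call` to `EntryPoint.getNonce` and holds either the returned 32-byte hex word or a transport error; `NonceError` and both function names are hypothetical. In ERC-4337 the returned word packs a 192-bit key above a 64-bit sequence, and this sketch returns only the sequence.

```rust
#[derive(Debug, PartialEq)]
enum NonceError {
    /// The RPC call itself failed (unknown method, timeout, ...).
    Rpc(String),
    /// The call succeeded but the payload was not a 32-byte hex word.
    Malformed(String),
}

/// Decode the ABI-encoded uint256 and return its low 64 bits (the sequence).
fn decode_nonce_word(hex_word: &str) -> Result<u64, NonceError> {
    let body = match hex_word.strip_prefix("0x") {
        Some(b) => b,
        None => return Err(NonceError::Malformed(hex_word.to_string())),
    };
    if body.len() != 64 || !body.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err(NonceError::Malformed(hex_word.to_string()));
    }
    // The last 16 hex digits are the 64-bit sequence number.
    u64::from_str_radix(&body[48..], 16)
        .map_err(|_| NonceError::Malformed(hex_word.to_string()))
}

/// Propagate failures; there is deliberately no default value anywhere.
fn nonce_from_rpc(result: Result<String, String>) -> Result<u64, NonceError> {
    let word = result.map_err(NonceError::Rpc)?;
    decode_nonce_word(&word)
}
```

With this shape, a call to a nonexistent RPC method surfaces as `NonceError::Rpc` at the call site instead of a plausible-looking zero.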

6. A convergence rule that could never be satisfied

Heartbeats reported each node’s connected set excluding itself; the convergence check compares views against the live set, which includes everyone. A two-node mesh could mathematically never converge. The module’s own unit tests had the correct semantics — the integration passed the wrong data.
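The mismatch is easy to see in code. The sketch below is an illustration under assumed types (`NodeId` as `u32`, heartbeats as pairs of reporter and connected set), not the module's actual interface: adding the reporter back into its own view is what makes equality with the live set possible.

```rust
use std::collections::BTreeSet;

type NodeId = u32;

/// The mesh as seen by `reporter`: its connected peers plus itself.
fn full_view(reporter: NodeId, connected: &BTreeSet<NodeId>) -> BTreeSet<NodeId> {
    let mut view = connected.clone();
    view.insert(reporter);
    view
}

/// Converged when every live node has reported and every report, with the
/// reporter added back, equals the live set.
fn converged(live: &BTreeSet<NodeId>, heartbeats: &[(NodeId, BTreeSet<NodeId>)]) -> bool {
    let reporters: BTreeSet<NodeId> = heartbeats.iter().map(|(id, _)| *id).collect();
    reporters == *live
        && heartbeats
            .iter()
            .all(|(id, peers)| full_view(*id, peers) == *live)
}
```

In a two-node mesh each node reports a connected set of size one while the live set has size two, so comparing the raw sets can never succeed.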

7. The sealed store that came back empty

After a restart, the TEE’s sealed store came back empty. With the on-chain commitment already set, the originator took the “pull from a peer” path along with everyone else, so every node waited for every other node: a distributed deadlock in pulling-csk. The fix leaned on determinism: the originator re-derives the CSK from its TEE seed and verifies it against the commitment before falling back.

Lesson: “persisted” is a claim about someone else’s system. Determinism is the stronger recovery primitive.
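The recovery order described above can be sketched as a small decision function. Everything named here is hypothetical: `CskSource`, `recover_csk`, and the `derive_from_seed` and `commit` closures stand in for the TEE key derivation and the commitment function, neither of which is specified in these notes.

```rust
#[derive(Debug, PartialEq)]
enum CskSource {
    Sealed(Vec<u8>),
    Rederived(Vec<u8>),
    PullFromPeer,
}

/// Prefer the sealed copy; if it is gone and this node is the originator,
/// re-derive and accept the result only if it matches the on-chain
/// commitment; otherwise fall back to pulling from a peer.
fn recover_csk<D, C>(
    sealed: Option<Vec<u8>>,
    is_originator: bool,
    onchain_commitment: &[u8],
    derive_from_seed: D,
    commit: C,
) -> CskSource
where
    D: Fn() -> Vec<u8>,
    C: Fn(&[u8]) -> Vec<u8>,
{
    if let Some(csk) = sealed {
        return CskSource::Sealed(csk);
    }
    if is_originator {
        let csk = derive_from_seed();
        if commit(&csk).as_slice() == onchain_commitment {
            return CskSource::Rederived(csk);
        }
    }
    CskSource::PullFromPeer
}
```

The commitment check is what makes re-derivation safe: a node whose seed no longer produces the committed key falls through to the peer path instead of serving a wrong key.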

8. The genesis scan that would never finish

The indexer’s first catch-up paged from block 0 — ~47 million blocks on Base. Its health and gRPC listeners only start after catch-up, so it sat dark for what would have been days. Bounded by INDEXER_START_BLOCK (the factory’s deploy block, found automatically by binary-searching getCode), catch-up takes ~23 seconds.

Lesson: test data has no history. Mainnet does.
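Finding the deploy block by binary search can be sketched as below. The `has_code` closure is a stand-in for an `eth_getCode` call at a given block that reports whether the result is non-empty; the sketch assumes code presence is monotonic (once deployed, never removed) and that the node can serve historical state.

```rust
/// Smallest block in [lo, hi] at which `has_code` returns true, or None if
/// the contract has no code even at `hi`.
fn find_deploy_block<F>(mut lo: u64, mut hi: u64, mut has_code: F) -> Option<u64>
where
    F: FnMut(u64) -> bool,
{
    if lo > hi || !has_code(hi) {
        return None;
    }
    // Invariant: has_code(hi) is true, and every block below lo has no code.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if has_code(mid) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    Some(lo)
}
```

Over a range of roughly 47 million blocks this needs a few dozen probes, which is why the lookup can run on every deploy instead of being hard-coded.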

9. Color escapes in the log scrape

The deploy routine that scrapes the indexer’s boot-derived pubkey from CVM logs matched nothing: pretty-format tracing interleaves color escapes between the label and the hex. Strip the escape codes before grepping.
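A dependency-free sketch of the stripping step follows. It handles only CSI sequences (ESC, `[`, parameters, one final byte), which covers color codes; the `pubkey=` label in the usage example is made up for illustration and is not the indexer's actual log format.

```rust
/// Remove ANSI CSI escape sequences: ESC '[' followed by parameter bytes and
/// a single final byte in the range '@'..='~'.
fn strip_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\u{1b}' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // Skip up to and including the final byte of the sequence.
            for p in chars.by_ref() {
                if ('@'..='~').contains(&p) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}

/// Return the whitespace-delimited token that follows `label`, if any.
fn value_after_label<'a>(line: &'a str, label: &str) -> Option<&'a str> {
    let start = line.find(label)? + label.len();
    line[start..].split_whitespace().next()
}
```

For a line such as `"\u{1b}[3mpubkey\u{1b}[0m\u{1b}[2m=\u{1b}[0m0xabc123"`, searching the raw text for `pubkey=` finds nothing, while `value_after_label(&strip_ansi(line), "pubkey=")` yields `0xabc123`.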

10. The gateway speaks h2 but says http/1.1

The dstack gateway proxies gRPC fine, and grpcurl (which forces h2) worked perfectly, but the gateway’s ALPN answer says http/1.1, so tonic refused with “HTTP/2 was not negotiated”. Fixed with ClientTlsConfig::assume_http2(true): trust the verified behavior over the protocol advertised during the TLS handshake.

Lesson: when a transport works under one client and not another, diff the negotiation, not the payload.


Five of the ten (1, 5, 7, 8, 10) share a single shape: a dependency behaved differently in production than its interface implied, and the failure surfaced somewhere else. The mitigations that actually worked were: verbose logging at every boundary, deploy routines that assert their own preconditions, fallbacks that don’t share the primary’s failure mode (log polling under the indexer, re-derivation under the sealed store), and treating every live deployment as the final integration test it really is.