DOCS CANONICAL · GENERATED FROM REPO RELEASE v0.2.0-beta.1 SOURCE github.com/cubby-network/platform

Operations · Source: docs/ROLLBACK.md

Rollback & recovery runbook

When something goes wrong on a change Cubby executed, the platform has already captured enough evidence to reconstruct what happened and, in most cases, recover automatically. This runbook is for the operator holding the pager when a test user's (or your own) change has left the network in an unexpected state.

Three cases, in order of severity:

  1. The workflow rolled itself back: safest, most common, almost always self-resolving.
  2. The workflow failed mid-execute and is stuck: the canary applied but verification failed and automatic rollback didn't finish, so you have to drive the rollback yourself.
  3. The workflow thinks it succeeded but the network is broken: you can't trust the harness's verdict, so you need to reconstruct what happened and remediate manually.

Every one of these assumes you can read the signed evidence chain. If the chain itself is damaged or unavailable, jump to §4 chain recovery first.

Where to find the evidence

<repo-root>/var/evidence/
  bundle-<prefix>.json          # the signed bundle (per stage + final)
  bundle-<prefix>.manifest.json # the signature manifest (alg, key_id, prev_sha256, signature)
  chain.tip                     # the hash of the most recent bundle — the tip of the chain

Every transition the workflow state machine makes writes a new bundle whose prev_sha256 points at the previous one. chain.tip is authoritative for "what was the last thing this deployment signed."
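
As a rough illustration of that linkage, the sketch below builds a three-bundle chain in memory and then checks it. It assumes the hash is SHA-256 over the raw bundle bytes and that the first bundle has no predecessor; the real verifier may hash a canonical payload instead. `build_chain` and `link_errors` are hypothetical helpers, not part of Cubby.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_chain(payloads):
    """Hypothetical writer: link each bundle to the hash of the one before it."""
    chain, prev = [], None
    for payload in payloads:
        chain.append((payload, {"prev_sha256": prev}))
        prev = sha256_hex(payload)
    return chain, prev  # prev is now the value chain.tip would hold


def link_errors(chain, tip):
    """Return a list of linkage problems; an empty list means the chain is intact."""
    errors, prev = [], None
    for index, (payload, manifest) in enumerate(chain):
        if manifest.get("prev_sha256") != prev:
            errors.append(f"bundle {index}: prev_sha256 does not match the previous bundle")
        prev = sha256_hex(payload)
    if prev != tip:
        errors.append("chain.tip does not match the hash of the last bundle")
    return errors


states = ("CANARY_EXECUTING", "FULL_EXECUTING", "CLOSED")
payloads = [json.dumps({"workflow_state": s}).encode("utf-8") for s in states]
chain, tip = build_chain(payloads)
print(link_errors(chain, tip))  # prints [] for an intact chain
```

Replacing the bytes of any bundle in the middle changes its hash, so the next bundle's `prev_sha256` no longer matches and the break shows up at that successor.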

Inspect a bundle:

jq '.' var/evidence/bundle-<prefix>.json | less

Every bundle contains:

  • intent_id, workflow_state, actor (who triggered the transition)
  • snapshot_before / snapshot_after — the device state captured around the change
  • commands + rollback_commands — exactly what Cubby asked the device to run
  • validation_report, precheck, verify — each stage's structured outcome
  • policy_decision, approval — what was approved and by whom

Verify the chain end-to-end:

cubby verify-chain   # or: python -m apps.cli.main verify-chain

This returns {ok: true, total, legacy_count, chain_errors: []} when the chain is intact. Any entry in chain_errors is a red flag; see §4.
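
If you script around this check, a strict gate along the following lines may help. It is only a sketch: it assumes the result has already been parsed into a dictionary with the fields listed above, and says nothing about how the CLI emits it. `chain_is_clean` is a hypothetical helper, and the shape of the `chain_errors` entry in the example is invented for illustration.

```python
def chain_is_clean(result: dict) -> bool:
    """Clean only if ok is exactly True AND chain_errors is present and empty."""
    return result.get("ok") is True and result.get("chain_errors") == []


clean = {"ok": True, "total": 12, "legacy_count": 0, "chain_errors": []}
broken = {
    "ok": False,
    "total": 12,
    "legacy_count": 0,
    # Entry shape is an assumption; the real error records may differ.
    "chain_errors": [{"bundle_id": "<bundle-id>", "error": "prev_sha256 mismatch"}],
}

print(chain_is_clean(clean), chain_is_clean(broken))
```

Requiring both conditions means a missing or malformed `chain_errors` field fails closed rather than passing silently.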

1. Workflow rolled itself back

Symptoms: workflow_state = ROLLED_BACK, the most recent bundle contains a rollback phase, and re-querying the device shows it back in its pre-change state.

What to do: read the bundle to understand why. The verify block will tell you what invariant failed. Common reasons Cubby rolls back:

  • Target VLAN not in the device's allowed_vlans
  • Interface mode is trunk but the intent was access
  • Post-change snapshot shows a neighboring device lost LLDP on the affected link
  • Verification timeout (device didn't re-advertise the expected state)

Recovery: usually none needed, because the network is back to the pre-change state. If the root cause is environmental (for example, the allowed_vlans list is out of date), fix the fixture and retry the intent. The new run gets a fresh intent_id, a fresh signed chain, and the old ROLLED_BACK bundle stays in the audit trail as context.

2. Workflow stuck mid-execute

Symptoms: workflow_state is CANARY_EXECUTING / FULL_EXECUTING / ROLLBACK_PENDING and isn't advancing. The process may have crashed, the device may have become unreachable, or the persist hook failed.

What to do:

  1. Identify the last confirmed state. Read the highest-numbered bundle for this intent_id:

     ```sh
     jq 'select(.intent_id == "<intent-id>") | {workflow_state, phase, commands, timestamp}' \
       var/evidence/bundle-*.json
     ```

     The workflow_state in the latest bundle is the last state that was signed, i.e., the last state the harness persisted successfully.
  2. Inspect the forward commands the harness was executing. They're in commands[] on the bundle that matches CANARY_EXECUTING or FULL_EXECUTING. Each entry has phase, command, description, and the device it targets.
  3. Pull the rollback block from the same bundle — rollback_commands[]. These are the commands Cubby would have executed if it had reached ROLLBACK_PENDING on its own. You can apply them manually (SSH to the device, paste the commands in order) to restore the pre-change state.
  4. Snapshot the device after manual rollback and compare against snapshot_before in the bundle. If they match, the device is recovered.
  5. Mark the workflow failed. From the CLI:

     ```sh
     cubby mark-failed --intent <intent-id> --reason "manual rollback applied"
     ```

     This writes a final bundle transitioning the workflow to FAILED → CLOSED with an audit entry. The chain now closes cleanly and future evidence verification passes.
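
Steps 1, 3, and 4 can be sketched in Python for operators who prefer a script to ad hoc jq. This is a minimal sketch under stated assumptions: bundles are already loaded as dictionaries, `timestamp` is an ISO-8601 UTC string (so string order is time order), each `rollback_commands[]` entry has a `command` field like the forward commands do, and snapshots are flat key/value maps. The helper names and the sample data are hypothetical.

```python
def latest_bundle(bundles, intent_id):
    """Return the most recently signed bundle for an intent, or None."""
    matching = [b for b in bundles if b.get("intent_id") == intent_id]
    return max(matching, key=lambda b: b["timestamp"]) if matching else None


def snapshot_diff(before, after):
    """Return {key: (before_value, after_value)} for every key that differs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


# Illustrative data only; real bundles carry many more fields.
pre_change = {"Gi1/0/1.mode": "access", "Gi1/0/1.vlan": 10}
bundles = [
    {"intent_id": "intent-a", "workflow_state": "CANARY_EXECUTING",
     "timestamp": "2024-01-01T10:00:00Z", "snapshot_before": pre_change,
     "rollback_commands": [{"command": "interface Gi1/0/1"},
                           {"command": "switchport access vlan 10"}]},
    {"intent_id": "intent-a", "workflow_state": "FULL_EXECUTING",
     "timestamp": "2024-01-01T10:05:00Z", "snapshot_before": pre_change,
     "rollback_commands": [{"command": "interface Gi1/0/1"},
                           {"command": "switchport access vlan 10"}]},
    {"intent_id": "intent-b", "workflow_state": "CLOSED",
     "timestamp": "2024-01-01T11:00:00Z", "snapshot_before": {},
     "rollback_commands": []},
]

last = latest_bundle(bundles, "intent-a")
for entry in last["rollback_commands"]:
    print(entry["command"])  # apply in this order on the device

# State read back from the device after the manual rollback (hypothetical).
observed = {"Gi1/0/1.mode": "access", "Gi1/0/1.vlan": 10}
print(snapshot_diff(last["snapshot_before"], observed))  # empty dict means recovered
```

An empty diff corresponds to step 4's "they match"; any remaining key tells you exactly which piece of device state still differs from the pre-change snapshot.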

3. Workflow claims success but network is broken

Symptoms: workflow_state = CLOSED, verify was ok, but users are reporting an outage or the monitoring is red.

This is the hardest case. The harness's verification invariant passed on whatever it checked, but the real-world impact is different from what the invariant captured. Common shapes:

  • The intent was valid but the upstream routing was misconfigured and Cubby's snapshot doesn't include that scope.
  • A stale state fixture meant Cubby verified against intended state that didn't match observed state.
  • The change itself was correct but a concurrent manual change on a different device conflicted with it.

What to do:

  1. Assume the harness's self-verification is no longer reliable. Don't trust the verify.ok: true; go to the device directly with a read-only tool and compare.
  2. Read the intent. The bundle's intent block tells you exactly what the operator asked for. Determine whether the request itself was wrong (user error) or whether Cubby's translation of the request into commands was wrong (code bug).
  3. Apply the rollback manually from the bundle's rollback_commands[], same as §2.
  4. File the evidence bundle with the incident. The bundle is the single most valuable artifact for a post-mortem: it captures the intent, the plan, the pre-state, the commands, and the post-state. Attach it to the incident ticket.
  5. If a code bug caused the mis-translation, open an issue and tag with workflow-safety. The team should reproduce against the same fixture that produced the wrong plan.

4. Chain recovery

Symptoms: cubby verify-chain reports chain_errors != [], or var/evidence/ was accidentally deleted / overwritten.

Conceptual recovery:

  • chain.tip stores the SHA-256 of the last bundle. Every new bundle signs its own prev_sha256 = chain.tip_at_write_time. A break means one of: chain.tip is pointing at a bundle that isn't present, a bundle was deleted mid-chain, or two parallel writers forked the chain (shouldn't happen given the file lock around writes).
  • The safe play is always to fork a new segment rather than rewrite history. Cubby supports this via the NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS env var: list the bundle IDs where the verifier should tolerate a prev_sha256 mismatch, and the chain verification treats them as explicit segment boundaries.
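
To make the segment-boundary idea concrete, here is a sketch of a verifier that tolerates a `prev_sha256` mismatch only at explicitly listed bundle IDs. Several things are assumptions: the env var is treated as comma-separated (the separator is not documented here), each record carries its own `sha256` for brevity, and `parse_reset_ids` and `unexplained_breaks` are hypothetical helpers rather than Cubby functions.

```python
def parse_reset_ids(raw: str) -> set:
    """Split a comma-separated list of bundle IDs (assumed format)."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def unexplained_breaks(records, reset_ids):
    """Return bundle IDs whose prev_sha256 mismatch is NOT a declared boundary.

    `records` is ordered oldest first; each has bundle_id, sha256, prev_sha256.
    """
    breaks, prev = [], None
    for record in records:
        mismatch = record["prev_sha256"] != prev
        if mismatch and record["bundle_id"] not in reset_ids:
            breaks.append(record["bundle_id"])
        prev = record["sha256"]
    return breaks


# Bundle b3 (hash "cc") was deleted, so b4 points at a bundle that is not present.
records = [
    {"bundle_id": "b1", "sha256": "aa", "prev_sha256": None},
    {"bundle_id": "b2", "sha256": "bb", "prev_sha256": "aa"},
    {"bundle_id": "b4", "sha256": "dd", "prev_sha256": "cc"},
    {"bundle_id": "b5", "sha256": "ee", "prev_sha256": "dd"},
]

print(unexplained_breaks(records, set()))                  # ['b4']
print(unexplained_breaks(records, parse_reset_ids("b4")))  # []
```

Note that the boundary only excuses the mismatch at `b4` itself: `b5` still has to link to `b4`, so the new segment is verified as strictly as the old one.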

Recovery steps:

  1. Do not delete bundles. Even broken ones carry the signed intent + plan + snapshots — you may want them for the post-mortem.
  2. Identify the break point. cubby verify-chain will show which bundle_id the prev_sha256 check failed at. That's the first bundle of the new segment.
  3. Mark the segment boundary. Set the env var:

     ```sh
     export NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS="<bundle-id-of-first-bundle-in-new-segment>"
     ```

     Future verify-chain runs accept the known break and pass.
  4. Rotate the signing key. If the chain break is from anything other than a planned reset (test fork, manual file deletion), rotate the evidence HMAC key so any prior bundles signed under the old key are now legacy-signed-only:

     ```sh
     export NETOPS_EVIDENCE_LEGACY_KEY_IDS="<old-key-id>"
     export NETOPS_EVIDENCE_HMAC_SECRET="<new-secret>"
     ```
  5. Write a new bundle (any workflow action does this) to re-anchor chain.tip under the new key.
  6. Post-mortem — every chain break deserves an investigation. Likely culprits: a test run that hit the same var/evidence/ dir, a backup-restore that didn't preserve chain.tip, a shared-secret signer that got rotated mid-chain without NETOPS_EVIDENCE_LEGACY_KEY_IDS being set.
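
The interaction between key rotation and legacy key IDs in steps 4 and 6 can be sketched as follows. This shows one plausible reading, not documented behavior: bundles whose `key_id` is listed as legacy are counted as legacy without re-verifying the signature, and everything else must verify under the current secret. The HMAC-SHA256 choice, the canonical JSON form, and the function names are all assumptions.

```python
import hashlib
import hmac
import json


def canonical(payload: dict) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload: dict, secret: bytes) -> str:
    return hmac.new(secret, canonical(payload), hashlib.sha256).hexdigest()


def classify(payload, manifest, current_secret, legacy_key_ids):
    """Return 'legacy', 'current', or 'invalid' for one bundle."""
    if manifest["key_id"] in legacy_key_ids:
        return "legacy"
    expected = sign(payload, current_secret)
    # compare_digest avoids leaking where the two signatures diverge
    return "current" if hmac.compare_digest(expected, manifest["signature"]) else "invalid"


old_secret, new_secret = b"old-secret", b"new-secret"
old_bundle = {"intent_id": "intent-a", "workflow_state": "CLOSED"}
new_bundle = {"intent_id": "intent-b", "workflow_state": "CLOSED"}
old_manifest = {"key_id": "k1", "signature": sign(old_bundle, old_secret)}
new_manifest = {"key_id": "k2", "signature": sign(new_bundle, new_secret)}

print(classify(old_bundle, old_manifest, new_secret, {"k1"}))   # legacy
print(classify(old_bundle, old_manifest, new_secret, set()))    # invalid
print(classify(new_bundle, new_manifest, new_secret, {"k1"}))   # current
```

The second call is the failure mode named in step 6: the secret was rotated but the old key ID was never declared legacy, so every earlier bundle fails verification under the new secret.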

Things to never do

  • Do not rm -rf var/evidence/. Every bundle is audit evidence. If you need to start fresh for a clean test, move the old dir aside (mv var/evidence var/evidence-<date>), don't delete it.
  • Do not edit a bundle JSON file. Any edit invalidates the manifest signature over the canonical payload, so verification fails and you'll have to reset the chain.
  • Do not skip verify-chain after a recovery. The whole point of the chain is that you can prove no unsigned change happened. Run the verify step and make sure it's clean before calling the recovery done.

Escalation

If the chain is broken, a device is in a state you can't reconcile, or the harness is refusing to boot and you need it running now:

  1. Save the current var/evidence/ and var/runbooks/ directories intact (the team will need them for root-cause).
  2. File the incident. Attach the last good bundle ID and the error output from cubby verify-chain.
  3. If the harness itself is broken, a last-resort bypass is to drive the rollback commands manually from the bundle's rollback_commands[] — SSH to each device and apply them in order. The device doesn't care what issued the rollback.