Operations · Source: docs/OPERATOR_GUIDE.md

Cubby operator guide

QUICKSTART.md gets you to a signed change against a lab device. This document is for the operator driving the system on something more than a laptop lab — a dev-net a vendor lent you, a pre-production cell, or a small pilot of your own network.

Read this before you invite other humans to drive the system.

Environment-variable matrix

Every variable is read in exactly one place (packages/common/runtime_config.py), cached as a singleton, and discoverable via cubby config show. The ones that materially change behaviour:

Variable	Default	What changes when you set it
`NETOPS_ENV`	`development`	`production` flips the plugin registry to strict mode — simulated adapters are rejected at register time, so the default `build_demo_harness` will refuse to boot. You must wire real adapters yourself before setting this.
`NETOPS_API_AUTH_MODE`	`dev`	`dev` prints a random token on start-up. `hmac` validates bearer tokens signed with `NETOPS_API_HMAC_SECRET`. `oidc` validates JWTs against the IdP. `NETOPS_ENV=production` combined with `dev` is refused at boot.
`NETOPS_API_HMAC_SECRET`	`(empty)`	HMAC signing secret for API bearer tokens. 32+ bytes. Required when `NETOPS_API_AUTH_MODE=hmac`.
`NETOPS_OIDC_ISSUER` / `NETOPS_OIDC_AUDIENCE` / `NETOPS_OIDC_JWKS_URL`	`(empty)`	OIDC validator config. JWKS URL must be HTTPS — non-HTTPS URLs are refused at refresh time. JWKS is cached 15 minutes by default.
`ANTHROPIC_API_KEY`	`(empty)`	Selects `ClaudeAgentRuntime` over any OpenAI option. Preferred when multiple are set.
`NETOPS_ANTHROPIC_MODEL`	`claude-opus-4-7`	Override the Claude model id.
`OPENAI_API_KEY`	`(empty)`	Selects `OpenAIAgentRuntime` with a static API key. Honours `OPENAI_BASE_URL` for Azure / vLLM / Ollama / etc.
`NETOPS_CODEX_CREDENTIAL_PATH`	`(empty)`	Path to a Codex CLI `auth.json`. Bills against a ChatGPT subscription via OAuth refresh; requires `NETOPS_CODEX_TOKEN_URL` for the refresh endpoint.
`NETOPS_EVIDENCE_HMAC_SECRET`	`(empty)`	Production evidence-signing key. When unset, a deterministic dev key is written under `var/keys/`. Set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` to refuse the dev fallback.
`NETOPS_APPROVAL_HMAC_SECRET`	`(empty)`	Approval-signing key (distinct from evidence). Same dev-key fallback policy as above.
`NETOPS_EVIDENCE_LEGACY_KEY_IDS`	`(empty)`	Comma-separated list of key_ids the verifier tolerates without a cryptographic check. Use only for unrecoverable key-loss scenarios.
`NETOPS_API_MAX_BODY_BYTES`	`65536`	HTTP request body cap in bytes. Rejects both `Content-Length` and chunked requests that exceed the cap.
`NETOPS_WIKI_ROOT`	`<repo>/docs`	Root of the hand-curated knowledge base the agents read.
`NETOPS_CAB_ACKNOWLEDGE_SHARED_SECRET`	`(empty)`	Set to `1` to silence the stderr boot banner in non-production envs. Has no effect in production — `NETOPS_ENV=production` with a multi-member CAB backed by a single HMAC signer fails fast unconditionally. Load per-approver Ed25519 public keys into `SignerKeyring` before going to production.

cubby config show renders this matrix against the current process environment so you can see what's resolved vs what's falling back to defaults.
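
The matrix asks for a value of 32+ bytes for NETOPS_API_HMAC_SECRET. A minimal sketch of generating one with the Python standard library follows. The helper name new_hmac_secret is hypothetical, and hex encoding is an assumption: this guide does not say which encoding Cubby expects, so check before relying on it.

```python
import secrets


def new_hmac_secret(num_bytes: int = 32) -> str:
    """Return a hex-encoded random secret built from num_bytes of OS randomness.

    Hypothetical helper, not part of Cubby. Hex encoding is an assumption.
    """
    if num_bytes < 32:
        raise ValueError("the matrix asks for 32+ bytes")
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Print once, then store the value in your secret store rather than a .env file.
    print(new_hmac_secret())
```

The same approach works for the evidence and approval signing secrets, which should each get their own independent value.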

Demo vs production posture

The platform enforces two checks at boot when NETOPS_ENV=production and refuses to start if either fails:

No simulated adapters. The plugin registry refuses to register any plugin with simulated=True, so build_demo_harness() fails fast with SimulationLeakError on the first simulated device adapter. You must wire real vendor adapters (plugins/device/*/real_adapter.py) and/or custom adapters before the harness will construct.
No dev auth. NETOPS_API_AUTH_MODE=dev is refused — you must set hmac or oidc and supply the matching secret/issuer config.

Both are intentional: it's much safer for the system to refuse to start than to silently boot a prod-tagged deployment on demo adapters or a printed dev token.
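
The environment-level rules above can be checked before you attempt a boot. The following is a rough sketch of such a preflight, not Cubby's own validation code: the function name preflight is hypothetical, and treating all three OIDC variables as mandatory in oidc mode is an assumption. The simulated-adapter check is enforced by the plugin registry at register time and cannot be reproduced from environment variables alone, so it is not covered here.

```python
from typing import Mapping
from urllib.parse import urlparse


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means none of these rules was violated."""
    problems: list[str] = []
    netops_env = env.get("NETOPS_ENV", "development")
    auth_mode = env.get("NETOPS_API_AUTH_MODE", "dev")

    # Production refuses the dev auth mode outright.
    if netops_env == "production" and auth_mode == "dev":
        problems.append("NETOPS_API_AUTH_MODE=dev is refused when NETOPS_ENV=production")

    # hmac mode needs a signing secret of 32+ bytes.
    if auth_mode == "hmac":
        secret = env.get("NETOPS_API_HMAC_SECRET", "")
        if len(secret.encode("utf-8")) < 32:
            problems.append("NETOPS_API_HMAC_SECRET must be 32+ bytes in hmac mode")

    # oidc mode needs issuer config (assumed: all three variables) and an HTTPS JWKS URL.
    if auth_mode == "oidc":
        for name in ("NETOPS_OIDC_ISSUER", "NETOPS_OIDC_AUDIENCE", "NETOPS_OIDC_JWKS_URL"):
            if not env.get(name):
                problems.append(f"{name} is required in oidc mode")
        jwks_url = env.get("NETOPS_OIDC_JWKS_URL", "")
        if jwks_url and urlparse(jwks_url).scheme != "https":
            problems.append("NETOPS_OIDC_JWKS_URL must be HTTPS")

    return problems
```

Calling preflight(os.environ) in a deploy script and aborting on a non-empty result gives an earlier, friendlier failure than waiting for the harness to refuse to start.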

Wiring real device adapters

Real adapters exist today for:

Cisco IOS-XE (plugins/device/cisco_iosxe/real_adapter.py)
Cisco NX-OS (plugins/device/cisco_nxos/real_adapter.py)
Arista EOS (plugins/device/arista_eos/real_adapter.py)
Junos (plugins/device/junos/real_adapter.py)
PAN-OS (plugins/device/panos/real_adapter.py)
Fortinet (plugins/device/fortinet/real_adapter.py)
Nokia SR Linux (plugins/device/nokia_srl/real_adapter.py)

Easy path: `cubby init-pilot`

For a guided end-to-end walkthrough — prerequisites, smoke tests, first device read, dry-run change, evidence verification, failure modes — read docs/PILOT_BETA.md. The summary below is the 30-second version.

The pilot wizard generates a pilot-config.yaml and .env.template so you don't write Python:

$ cd /opt/cubby
$ cubby init-pilot
... interactive prompts: NetBox URL env-var name, vendors in scope, transport, auth mode ...
✓ Wrote pilot-config.yaml
✓ Wrote .env.template

Fill in every CHANGE_ME_* placeholder in .env.template, copy it to .env, and boot:

$ NETOPS_PILOT_CONFIG=pilot-config.yaml cubby serve

The NETOPS_PILOT_CONFIG env var is read by both apps.api.main and apps.cli. With it set, the harness reads the YAML, registers the real adapter for every vendor you declared, and falls back to the demo defaults for anything not in the config. With it unset, the demo / production builders behave as before.

Manual path: explicit Python wrapper

If you need finer control than the wizard supports — custom adapter classes, alternative inventory sources, vendor-specific transports per device — write a thin wrapper around build_demo_harness(..., allow_simulated=False, pilot_config=...) or call build_pilot_harness() programmatically. A reference lives at tests/devicelab/harness.py:build_lab_harness.

If your vendor isn't in the list above, you can either:

Build a plugin that inherits from VendorRealAdapterBase (plugins/device/_common/real_adapter_base.py) and implements _build_change_commands, precheck, execute, and verify;
Or use the generic ssh_exec transport (packages/transport/ssh_exec.py) with a per-vendor command-wrapper and let Cubby drive it as a CLI over SSH.

CAB signing — from shared-secret to per-approver

The default bootstrap pairs a multi-member CAB (alice, bob, carol, …) with a single HMAC approval-signing key. That configuration works, but at boot the system logs a loud warning because anyone holding the HMAC secret can mint approvals under any approver name — quorum separation is nominal, not cryptographic. Production refuses to boot in that shape.

The upgrade path: drop one Ed25519 public-key file per approver into a directory and point NETOPS_APPROVAL_KEYRING_DIR at it. The bootstrap loads every file into the approval SignerKeyring under each approver's key_id; each SignedApproval verification picks the right public key from the embedded signer_key_id.

File format

Two on-disk shapes are accepted (one file per approver, flat directory, no recursion):

JSON (preferred — explicit fields, room for metadata):

{
  "approver_id": "alice",
  "key_id": "alice-2026-q2",
  "algorithm": "Ed25519",
  "public_key_pem": "-----BEGIN PUBLIC KEY-----\nMCowBQYDK2VwAyEA...\n-----END PUBLIC KEY-----\n"
}

.pub (convenience — for ssh-keygen -t ed25519-style workflows): A bare PEM-encoded SubjectPublicKeyInfo. approver_id and key_id both default to the filename stem (e.g. alice.pub → alice / alice).
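
As an illustration of the JSON shape, here is a small sketch that writes one approver's entry into a keyring directory. The helper name write_keyring_entry and the <approver_id>.json file naming are assumptions made for this example; the guide only requires one file per approver in a flat directory.

```python
import json
from pathlib import Path


def write_keyring_entry(
    directory: Path, approver_id: str, key_id: str, public_key_pem: str
) -> Path:
    """Write one approver's public key as <approver_id>.json and return the path.

    Hypothetical helper. Only the four fields shown in the JSON example are written.
    """
    # Cheap sanity check only; it does not parse or validate the key itself.
    if "BEGIN PUBLIC KEY" not in public_key_pem:
        raise ValueError("expected a PEM-encoded SubjectPublicKeyInfo public key")
    entry = {
        "approver_id": approver_id,
        "key_id": key_id,
        "algorithm": "Ed25519",
        "public_key_pem": public_key_pem,
    }
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{approver_id}.json"
    path.write_text(json.dumps(entry, indent=2) + "\n", encoding="utf-8")
    return path
```

Pass the public PEM printed by your key-generation step as public_key_pem; the private key never goes near this directory.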

Generating an approver keypair

The approver keeps the private key (on a YubiKey, HSM, or vaulted secret); only the public half ships to the verifier:

# Generate keypair (Python, no extra tools):
python3 - <<'PY'
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
sk = Ed25519PrivateKey.generate()
sk_pem = sk.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
).decode()
pk_pem = sk.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
).decode()
print("# Private (keep on approver's device, NEVER on the verifier):")
print(sk_pem)
print("# Public (drop into the keyring directory):")
print(pk_pem)
PY

Bootstrap loader contract

The loader rejects each of the following conditions (it raises ApprovalKeyringError and the harness fails closed at boot):

Duplicate approver_id across files (quorum would be ambiguous).
Duplicate key_id across files (manifest lookups would collide).
Unsupported algorithm — only Ed25519 is accepted; HMAC is rejected here on purpose since a multi-party CAB with a shared symmetric secret is the failure mode the loader exists to prevent.
Malformed JSON or PEM.
Empty directory in production (lab pilots get a WARNING and the HMAC fallback).
The cryptography package is not installed (install it with pip install cubby-network[crypto]).

Source: packages/orchestrator/approval_keyring.py.
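
To check a keyring directory before boot, part of the contract above can be approximated with the standard library. This sketch is not the loader in packages/orchestrator/approval_keyring.py: KeyringCheckError is a stand-in for ApprovalKeyringError, only the JSON shape is handled, and PEM parsing, .pub files, and the empty-directory rule are deliberately left out.

```python
import json
from pathlib import Path


class KeyringCheckError(Exception):
    """Hypothetical stand-in for Cubby's ApprovalKeyringError."""


def check_keyring_dir(directory: Path) -> dict[str, dict]:
    """Return {approver_id: entry} for every *.json file, or raise on a violation."""
    by_approver: dict[str, dict] = {}
    seen_key_ids: set[str] = set()
    # Flat directory, no recursion; sorted so errors are reproducible.
    for path in sorted(directory.glob("*.json")):
        try:
            entry = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            raise KeyringCheckError(f"{path.name}: malformed JSON") from exc
        if not isinstance(entry, dict):
            raise KeyringCheckError(f"{path.name}: expected a JSON object")
        if entry.get("algorithm") != "Ed25519":
            raise KeyringCheckError(f"{path.name}: only Ed25519 is accepted")
        approver_id = entry.get("approver_id")
        key_id = entry.get("key_id")
        if not approver_id or not key_id:
            raise KeyringCheckError(f"{path.name}: approver_id and key_id are required")
        if approver_id in by_approver:
            raise KeyringCheckError(f"{path.name}: duplicate approver_id {approver_id!r}")
        if key_id in seen_key_ids:
            raise KeyringCheckError(f"{path.name}: duplicate key_id {key_id!r}")
        by_approver[approver_id] = entry
        seen_key_ids.add(key_id)
    return by_approver
```

Running a check like this in CI against the directory you plan to mount catches duplicate identifiers before they turn into a boot failure.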

Rotation

To rotate an approver's key: generate a new keypair, drop the new public-key file in the directory under a new key_id (the old file can stay for verifying in-flight approvals). When the chain has no remaining bundles signed under the old key_id, remove the old file. Add the retired key_id to NETOPS_EVIDENCE_LEGACY_KEY_IDS if you want the verifier to tolerate historical bundles after the public key is gone.

Until you wire this, assume the CAB is "one person with the secret can do anything" and size your deployment's operator trust accordingly.

Production composition profile

Three plugin categories carry simulated defaults in the bundled tree. The strict-mode production registry refuses simulated adapters; the production composition profile is how you swap each one for a real implementation. Each is independent so you can roll forward one category at a time.

Plugin category	Knob	Values	Notes
Pre-change validator	`NETOPS_VALIDATOR_PROFILE`	`simulated` (default) · `real_batfish`	`real_batfish` requires `NETOPS_BATFISH_HOST`. The bootstrap doesn't yet ping Batfish at construction (the lazy `pybatfish` session opens on first validate call); install via `pip install cubby-network[batfish]`.
Telemetry	`NETOPS_TELEMETRY_PROFILE`	`simulated` (default) · `real_prometheus`	`real_prometheus` requires `NETOPS_PROMETHEUS_BASE_URL`. The bootstrap runs a `/-/healthy` probe at construction and fails closed on a non-2xx response.
Credential lease (device login)	`NETOPS_AUTH_PROFILE`	`simulated` (default) · `vault_dynamic` · `fail_closed`	`vault_dynamic` requires `VAULT_ADDR` + `VAULT_TOKEN` and a Vault SSH-secrets role named per `intent.metadata['device_address']`. `fail_closed` lets the production composition boot before you have a real backend wired; every change-execution workflow then fails closed at credential issuance with an actionable error pointing at this knob. No real ISE/TACACS adapter is shipped; the `simulated` value here is the demo-only `IseTacacsAuthAdapter`.

The four-step path from the bundled demo to a production composition:

Pick a category. Start with the validator or telemetry (read-only impact). Auth touches the change path and should land last.
Wire the dependencies. Stand up the Batfish service, Prometheus server, or Vault SSH role.
Set the env knob and the supporting URL/host knob. Restart the harness.
Verify with cubby smoke and a read-only workflow before pointing real change traffic at it.

Once all three knobs are off simulated (real_batfish, real_prometheus, and vault_dynamic or fail_closed for auth), build_production_harness boots without the simulated-plugin refusal. The replicas:1 ValidatingAdmissionPolicy is still in force: HA requires Postgres-primary state, distributed evidence storage, and a multi-pod-safe approval queue, which is the next workstream after this one.
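
While rolling forward one category at a time, it helps to see which knobs are still simulated and which supporting variables are missing. The sketch below encodes only what the table above states; the function name composition_report and the report layout are hypothetical, and it does not contact Batfish, Prometheus, or Vault.

```python
from typing import Mapping

# Knob -> {profile value -> supporting variables the table lists as required}.
REQUIRED: dict[str, dict[str, tuple[str, ...]]] = {
    "NETOPS_VALIDATOR_PROFILE": {
        "simulated": (),
        "real_batfish": ("NETOPS_BATFISH_HOST",),
    },
    "NETOPS_TELEMETRY_PROFILE": {
        "simulated": (),
        "real_prometheus": ("NETOPS_PROMETHEUS_BASE_URL",),
    },
    "NETOPS_AUTH_PROFILE": {
        "simulated": (),
        "vault_dynamic": ("VAULT_ADDR", "VAULT_TOKEN"),
        "fail_closed": (),
    },
}


def composition_report(env: Mapping[str, str]) -> dict[str, list[str]]:
    """Summarise which knobs are still simulated, unknown, or missing dependencies."""
    report: dict[str, list[str]] = {"simulated": [], "missing": [], "unknown": []}
    for knob, profiles in REQUIRED.items():
        value = env.get(knob, "simulated")  # simulated is the documented default
        if value not in profiles:
            report["unknown"].append(f"{knob}={value}")
            continue
        if value == "simulated":
            report["simulated"].append(knob)
        report["missing"].extend(name for name in profiles[value] if not env.get(name))
    return report
```

A production-ready environment, by the rules in this section, is one where all three lists come back empty.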

API auth — dev → HMAC → OIDC

Three modes, increasing production-readiness:

dev: A single token is generated (or read from NETOPS_API_DEV_TOKEN) and all holders get network-operator + auditor roles. Local work only. Refused when NETOPS_ENV=production.
hmac: Tokens are HMAC-SHA256 over "<subject>|<roles>|<expiry>". Issue with HmacTokenValidator.issue(); the validator checks HMAC + expiry. Subject and roles are whatever you encoded — the system trusts them because the signature proves the issuer authorised them.
oidc: Tokens are JWTs validated against a configured issuer + audience + JWKS URL. Roles come from a configurable claim (NETOPS_OIDC_ROLES_CLAIM, defaults to roles; override for Azure AD's groups or Auth0's namespaced URL claim). JWKS fetch is HTTPS only and cached 15 minutes.
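
To make the hmac mode concrete, here is a minimal sketch of signing and checking a token built from "<subject>|<roles>|<expiry>" with HMAC-SHA256. It is not the implementation of HmacTokenValidator: the function names, the comma-joined roles, the Unix-timestamp expiry, and the hex signature appended after a final "|" are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional


def issue_token(secret: bytes, subject: str, roles: list[str], ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Build '<subject>|<roles>|<expiry>|<hex signature>' (wire format assumed)."""
    if "|" in subject or any("|" in r or "," in r for r in roles):
        raise ValueError("subject and roles must not contain '|' or ','")
    expiry = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{subject}|{','.join(roles)}|{expiry}"
    signature = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def validate_token(secret: bytes, token: str,
                   now: Optional[float] = None) -> tuple[str, list[str]]:
    """Check signature, then expiry; return (subject, roles) or raise ValueError."""
    payload, _, signature = token.rpartition("|")
    expected = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("signature invalid")
    subject, roles, expiry = payload.split("|")
    if int(expiry) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return subject, (roles.split(",") if roles else [])
```

The sketch also shows why the secret is the whole trust boundary in this mode: anyone who holds it can encode any subject and any roles, and the validator will accept them.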

SAML? Cubby doesn't ship native SAML. Every modern enterprise IdP (Keycloak, Okta, Azure AD / Entra, Auth0, Ping) bridges SAML to OIDC out of the box. An operator recipe with three IdP walkthroughs lives at docs/SAML_VIA_OIDC.md. This is the supported production posture for SAML-required deployments.

Role names the routes check today:

network-operator — can call mutating routes (/access-port/change-vlan, /runbooks/evaluate, /events/webhook, …)
auditor — read-only token; sees /knowledge/similar and authenticated /readyz?detail=1 but is refused from mutating routes with 403
Plugin-specific roles (lead:security, cab:carol, …) are CAB member identities, not API role gates

Secrets custody — what's dev-generated and what must be rotated

Everything under var/keys/ is dev-generated and persisted on disk between runs. On a first prod deployment, rotate all of them:

File	Role	Rotation path
`var/keys/dev_evidence_hmac.key`	Signs evidence bundles	Set `NETOPS_EVIDENCE_HMAC_SECRET` (inline) or `NETOPS_EVIDENCE_HMAC_KEY_PATH` (file). Set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` to refuse fallback.
`var/keys/dev_approval_hmac.key`	Signs CAB approvals	Same mechanism as evidence, with `NETOPS_APPROVAL_*` env vars. Ideally replaced with per-approver Ed25519 keys (see above).
`var/evidence/chain.tip` + `var/evidence/.chain.lock`	Prev-hash pointer + writer lock for the evidence chain	Deployment-scoped — never commit, never share between deployments. `chain.tip` points at the SHA of the most recent bundle THIS instance signed. Two deployments writing to the same path produce a fork; an operator who pulls a clone with someone else's `chain.tip` sees `verify-chain` failures because their local bundles don't match. Source-repo `var/evidence/` is `.gitignore`d for this reason. Move both files to durable, deployment-scoped storage (PVC, encrypted volume, S3-backed FUSE mount). Do not delete on a running deployment; use `NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS` for known planned resets.
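
The reason a foreign or stale chain.tip breaks verification is easier to see in a toy prev-hash chain. The sketch below is a generic hash chain, not Cubby's evidence format: the bundle layout, the all-zero genesis value, and the function names are assumptions, and real bundles are also signed, which is omitted here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting tip for an empty chain


def bundle_hash(bundle: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the bundle."""
    return hashlib.sha256(json.dumps(bundle, sort_keys=True).encode("utf-8")).hexdigest()


def append_bundle(chain: list[dict], tip: str, payload: dict) -> str:
    """Append a bundle that points at the current tip; return the new tip."""
    bundle = {"prev_hash": tip, "payload": payload}
    chain.append(bundle)
    return bundle_hash(bundle)


def verify_chain(chain: list[dict], tip: str) -> bool:
    """Every bundle must point at its predecessor, and the tip must match the last one."""
    expected_prev = GENESIS
    for bundle in chain:
        if bundle["prev_hash"] != expected_prev:
            return False
        expected_prev = bundle_hash(bundle)
    return expected_prev == tip
```

In this model, a tip copied from another deployment never equals the hash of your last local bundle, so verification fails even though every local bundle is intact.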

The operator should also rotate:

NETOPS_API_HMAC_SECRET (or OIDC config)
ANTHROPIC_API_KEY / OPENAI_API_KEY — treat as secrets; pass via secret store, not .env files

Test-user readiness checklist

Before letting another human operator drive the system against anything other than a lab they own:

[ ] NETOPS_ENV unset or development, OR you've wired real adapters AND removed every simulated adapter.
[ ] API auth is hmac or oidc. Dev auth is off.
[ ] Evidence + approval HMAC secrets are set via env, not falling back to dev keys.
[ ] CAB signer is per-approver Ed25519 (or you've told the operator "one secret = full approval authority").
[ ] The operator has a bearer token scoped to the role they need — no shared network-operator+auditor token in a chat channel.
[ ] var/evidence/chain.tip is on durable storage (not a container /tmp).
[ ] A monitoring endpoint is polling /livez and /readyz so a broken bootstrap is visible.
[ ] The operator has read QUICKSTART.md end-to-end and run cubby smoke against their own harness.
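
For the monitoring item in the checklist, a probe can be as small as the following sketch. The function names are hypothetical, the base URL is whatever address your deployment serves the API on, and treating any 2xx response as healthy is an assumption about how /livez and /readyz report status.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status of a GET, or 0 when the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def probe(base_url: str, get_status: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each health path to True when it answers with a 2xx status."""
    base = base_url.rstrip("/")
    return {path: 200 <= get_status(base + path) < 300 for path in ("/livez", "/readyz")}
```

A result where /livez is true and /readyz is false is the case worth alerting on: the process is up but the bootstrap did not complete. The get_status parameter exists so the logic can be exercised without a running server.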

Where to go if something's wrong

Something broke on a change Cubby executed — read docs/ROLLBACK.md. Covers self-rollback, stuck workflows, false-success cases, and evidence-chain recovery.
Workflow failures — check var/evidence/ for the bundle of the failing run; every stage is signed and captures the snapshot at that point.
Agent failures — set NETOPS_LOG_LEVEL=DEBUG and inspect SafetyGate verdicts + AgentContext.metadata. Injection hits are logged at WARNING.
CAB failures — the reasons array surfaces plan hash mismatch / signature invalid / failed signer verification (generic — detail is in the server log).
Lab-only issues — see tests/devicelab/README.md; most SR Linux / EOS boot-timing issues are covered there.