Cubby operator guide
QUICKSTART.md gets you to a signed change against a lab device. This document is for the operator driving the system on something more than a laptop lab — a dev-net a vendor lent you, a pre-production cell, or a small pilot of your own network.
Read this before you invite other humans to drive the system.
Environment-variable matrix
Every var is read in exactly one place (packages/common/runtime_config.py), cached as a singleton, and is discoverable via cubby config show. The ones that materially change behaviour:
| Variable | Default | What changes when you set it |
|---|---|---|
| `NETOPS_ENV` | `development` | `production` flips the plugin registry to strict mode: simulated adapters are rejected at register time, so the default `build_demo_harness` will refuse to boot. You must wire real adapters yourself before setting this. |
| `NETOPS_API_AUTH_MODE` | `dev` | `dev` prints a random token on start-up. `hmac` validates bearer tokens signed with `NETOPS_API_HMAC_SECRET`. `oidc` validates JWTs against the IdP. `production` + `dev` is refused at boot. |
| `NETOPS_API_HMAC_SECRET` | (empty) | HMAC signing secret for API bearer tokens. 32+ bytes. Required when `NETOPS_API_AUTH_MODE=hmac`. |
| `NETOPS_OIDC_ISSUER` / `NETOPS_OIDC_AUDIENCE` / `NETOPS_OIDC_JWKS_URL` | (empty) | OIDC validator config. JWKS URL must be HTTPS; non-HTTPS URLs are refused at refresh time. JWKS is cached 15 minutes by default. |
| `ANTHROPIC_API_KEY` | (empty) | Selects `ClaudeAgentRuntime` over any OpenAI option. Preferred when multiple are set. |
| `NETOPS_ANTHROPIC_MODEL` | `claude-opus-4-7` | Override the Claude model id. |
| `OPENAI_API_KEY` | (empty) | Selects `OpenAIAgentRuntime` with a static API key. Honours `OPENAI_BASE_URL` for Azure / vLLM / Ollama / etc. |
| `NETOPS_CODEX_CREDENTIAL_PATH` | (empty) | Path to a Codex CLI `auth.json`. Bills against a ChatGPT subscription via OAuth refresh; requires `NETOPS_CODEX_TOKEN_URL` for the refresh endpoint. |
| `NETOPS_EVIDENCE_HMAC_SECRET` | (empty) | Production evidence-signing key. When unset, a deterministic dev key is written under `var/keys/`. Set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` to refuse the dev fallback. |
| `NETOPS_APPROVAL_HMAC_SECRET` | (empty) | Approval-signing key (distinct from evidence). Same dev-key fallback policy as above. |
| `NETOPS_EVIDENCE_LEGACY_KEY_IDS` | (empty) | Comma-separated list of `key_id`s the verifier tolerates without cryptographic check. Use only for unrecoverable key-loss scenarios. |
| `NETOPS_API_MAX_BODY_BYTES` | `65536` | HTTP request body cap in bytes. Rejects both Content-Length and chunked requests that exceed the cap. |
| `NETOPS_WIKI_ROOT` | `<repo>/docs` | Root of the hand-curated knowledge base the agents read. |
| `NETOPS_CAB_ACKNOWLEDGE_SHARED_SECRET` | (empty) | Set to `1` to silence the stderr boot banner in non-production envs. Has no effect in production: `NETOPS_ENV=production` with a multi-member CAB backed by a single HMAC signer fails fast unconditionally. Load per-approver Ed25519 public keys into `SignerKeyring` before going to production. |
cubby config show renders this matrix against the current process environment so you can see what's resolved vs what's falling back to defaults.
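The 32+ byte requirement on the HMAC secrets in the matrix can be met with Python's standard `secrets` module. The following is a minimal sketch, not part of Cubby: the helper name `new_hmac_secret` is hypothetical, and hex encoding is an assumption, since this guide does not say which encoding the variables expect.

```python
import secrets


def new_hmac_secret(num_bytes: int = 32) -> str:
    """Return a hex-encoded random secret carrying num_bytes of entropy."""
    if num_bytes < 32:
        raise ValueError("use at least 32 bytes for an HMAC secret")
    # token_hex draws from the OS CSPRNG and returns 2 hex chars per byte.
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Print one independent secret per signing purpose; never reuse a secret
    # across API tokens, evidence, and approvals.
    for name in (
        "NETOPS_API_HMAC_SECRET",
        "NETOPS_EVIDENCE_HMAC_SECRET",
        "NETOPS_APPROVAL_HMAC_SECRET",
    ):
        print(f"{name}={new_hmac_secret()}")
```

Pipe the output into your secret store rather than a shell history or a committed `.env` file.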
Demo vs production posture
Two failure modes the platform enforces at boot when NETOPS_ENV=production:
- No simulated adapters. The plugin registry refuses to register any plugin with `simulated=True`, so `build_demo_harness()` fails fast with `SimulationLeakError` on the first simulated device adapter. You must wire real vendor adapters (`plugins/device/*/real_adapter.py`) and/or custom adapters before the harness will construct.
- No dev auth. `NETOPS_API_AUTH_MODE=dev` is refused; you must set `hmac` or `oidc` and supply the matching secret/issuer config.
Both are intentional: it's much safer for the system to refuse to start than to silently boot a prod-tagged deployment on demo adapters or a printed dev token.
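The two refusals can be approximated in a pre-flight script that you run before the first production boot. This is a sketch under stated assumptions: `production_boot_errors` is a hypothetical helper, not Cubby's own boot code, and treating all three OIDC variables as mandatory is a guess based on the variable matrix.

```python
from typing import List, Mapping


def production_boot_errors(env: Mapping[str, str], simulated_plugins: List[str]) -> List[str]:
    """Return the reasons a production boot would be refused (empty means OK)."""
    if env.get("NETOPS_ENV", "development") != "production":
        return []  # both checks only apply when NETOPS_ENV=production

    # Check 1: any plugin flagged simulated blocks the boot.
    errors = [f"simulated adapter registered: {name}" for name in simulated_plugins]

    # Check 2: dev auth is refused; hmac/oidc need their supporting config.
    mode = env.get("NETOPS_API_AUTH_MODE", "dev")
    if mode == "dev":
        errors.append("NETOPS_API_AUTH_MODE=dev is refused in production")
    elif mode == "hmac" and not env.get("NETOPS_API_HMAC_SECRET"):
        errors.append("hmac mode requires NETOPS_API_HMAC_SECRET")
    elif mode == "oidc":
        # Assumption: issuer, audience, and JWKS URL are all required.
        for var in ("NETOPS_OIDC_ISSUER", "NETOPS_OIDC_AUDIENCE", "NETOPS_OIDC_JWKS_URL"):
            if not env.get(var):
                errors.append(f"oidc mode requires {var}")
    return errors


if __name__ == "__main__":
    import os

    # The plugin name below is a made-up placeholder.
    for problem in production_boot_errors(os.environ, ["demo_switch_adapter"]):
        print("REFUSED:", problem)
```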
Wiring real device adapters
Real adapters exist today for:
- Cisco IOS-XE (`plugins/device/cisco_iosxe/real_adapter.py`)
- Cisco NX-OS (`plugins/device/cisco_nxos/real_adapter.py`)
- Arista EOS (`plugins/device/arista_eos/real_adapter.py`)
- Junos (`plugins/device/junos/real_adapter.py`)
- PAN-OS (`plugins/device/panos/real_adapter.py`)
- Fortinet (`plugins/device/fortinet/real_adapter.py`)
- Nokia SR Linux (`plugins/device/nokia_srl/real_adapter.py`)
Easy path: cubby init-pilot
For a guided end-to-end walkthrough — prerequisites, smoke tests, first device read, dry-run change, evidence verification, failure modes — read docs/PILOT_BETA.md. The summary below is the 30-second version.
The pilot wizard generates a pilot-config.yaml and .env.template so you don't write Python:
$ cd /opt/cubby
$ cubby init-pilot
... interactive prompts: NetBox URL env-var name, vendors in scope, transport, auth mode ...
✓ Wrote pilot-config.yaml
✓ Wrote .env.template
Fill in every CHANGE_ME_* placeholder in .env.template, copy it to .env, and boot:
$ NETOPS_PILOT_CONFIG=pilot-config.yaml cubby serve
The NETOPS_PILOT_CONFIG env var is read by both apps.api.main and apps.cli. With it set, the harness reads the YAML, registers the real adapter for every vendor you declared, and falls back to the demo defaults for anything not in the config. With it unset, the demo / production builders behave as before.
Manual path: explicit Python wrapper
If you need finer control than the wizard supports — custom adapter classes, alternative inventory sources, vendor-specific transports per device — write a thin wrapper around build_demo_harness(..., allow_simulated=False, pilot_config=...) or call build_pilot_harness() programmatically. A reference lives at tests/devicelab/harness.py:build_lab_harness.
If your vendor isn't in the list above, you can either:
- Build a plugin that inherits from `VendorRealAdapterBase` (`plugins/device/_common/real_adapter_base.py`) and implement `_build_change_commands`, `precheck`, `execute`, `verify`;
- Or use the generic `ssh_exec` transport (`packages/transport/ssh_exec.py`) with a per-vendor command-wrapper and let Cubby drive it as a CLI over SSH.
CAB signing — from shared-secret to per-approver
The default bootstrap pairs a multi-member CAB (alice, bob, carol, …) with a single HMAC approval-signing key. That configuration works, but at boot the system logs a loud warning because anyone holding the HMAC secret can mint approvals under any approver name — quorum separation is nominal, not cryptographic. Production refuses to boot in that shape.
The upgrade path: drop one Ed25519 public-key file per approver into a directory and point NETOPS_APPROVAL_KEYRING_DIR at it. The bootstrap loads every file into the approval SignerKeyring under each approver's key_id; each SignedApproval verification picks the right public key from the embedded signer_key_id.
File format
Two on-disk shapes are accepted (one file per approver, flat directory, no recursion):
JSON (preferred — explicit fields, room for metadata):
{
"approver_id": "alice",
"key_id": "alice-2026-q2",
"algorithm": "Ed25519",
"public_key_pem": "-----BEGIN PUBLIC KEY-----\nMCowBQYDK2VwAyEA...\n-----END PUBLIC KEY-----\n"
}
.pub (convenience — for ssh-keygen -t ed25519-style workflows): A bare PEM-encoded SubjectPublicKeyInfo. approver_id and key_id both default to the filename stem (e.g. alice.pub → alice / alice).
Generating an approver keypair
The approver keeps the private key (on a YubiKey, HSM, or vaulted secret); only the public half ships to the verifier:
# Generate keypair (Python, no extra tools):
python3 - <<'PY'
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
sk = Ed25519PrivateKey.generate()
sk_pem = sk.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
).decode()
pk_pem = sk.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
).decode()
print("# Private (keep on approver's device, NEVER on the verifier):")
print(sk_pem)
print("# Public (drop into the keyring directory):")
print(pk_pem)
PY
Bootstrap loader contract
The loader refuses (raises ApprovalKeyringError and the harness fails-closed at boot):
- Duplicate `approver_id` across files (quorum would be ambiguous).
- Duplicate `key_id` across files (manifest lookups would collide).
- Unsupported `algorithm`: only `Ed25519` is accepted; HMAC is rejected here on purpose since a multi-party CAB with a shared symmetric secret is the failure mode the loader exists to prevent.
- Malformed JSON or PEM.
- Empty directory in production (lab pilots get a WARNING and the HMAC fallback).
- The `cryptography` package not installed (`pip install cubby-network[crypto]`).
Source: packages/orchestrator/approval_keyring.py.
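To make the contract concrete, here is a standard-library sketch of a loader that applies the duplicate, algorithm, and malformed-file rules. It is not the code in `approval_keyring.py`: `KeyringSketchError` and `load_keyring_sketch` are hypothetical stand-ins, the PEM check is a shallow marker test rather than a real parse, and defaulting `.pub` files to `Ed25519` and skipping unknown file extensions are assumptions.

```python
import json
from pathlib import Path
from typing import Dict

REQUIRED_FIELDS = ("approver_id", "key_id", "algorithm", "public_key_pem")


class KeyringSketchError(Exception):
    """Stand-in for the loader's ApprovalKeyringError."""


def load_keyring_sketch(directory: Path) -> Dict[str, dict]:
    """Return entries keyed by key_id, refusing the conflicts listed above."""
    by_key_id: Dict[str, dict] = {}
    seen_approvers = set()
    for path in sorted(directory.iterdir()):  # flat directory, no recursion
        if path.suffix == ".json":
            try:
                entry = json.loads(path.read_text())
            except json.JSONDecodeError as exc:
                raise KeyringSketchError(f"{path.name}: malformed JSON") from exc
        elif path.suffix == ".pub":
            # Bare PEM: approver_id and key_id both default to the filename stem.
            entry = {"approver_id": path.stem, "key_id": path.stem,
                     "algorithm": "Ed25519", "public_key_pem": path.read_text()}
        else:
            continue  # assumption: other files are ignored
        if not isinstance(entry, dict) or any(f not in entry for f in REQUIRED_FIELDS):
            raise KeyringSketchError(f"{path.name}: missing required fields")
        if entry["algorithm"] != "Ed25519":
            raise KeyringSketchError(f"{path.name}: unsupported algorithm")
        if "BEGIN PUBLIC KEY" not in entry["public_key_pem"]:
            raise KeyringSketchError(f"{path.name}: malformed PEM")
        if entry["approver_id"] in seen_approvers:
            raise KeyringSketchError(f"{path.name}: duplicate approver_id")
        if entry["key_id"] in by_key_id:
            raise KeyringSketchError(f"{path.name}: duplicate key_id")
        seen_approvers.add(entry["approver_id"])
        by_key_id[entry["key_id"]] = entry
    return by_key_id
```

Running a check like this in CI against the keyring directory catches a duplicate or mistyped file before the harness refuses to boot.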
Rotation
To rotate an approver's key: generate a new keypair, drop the new public-key file in the directory under a new key_id (the old file can stay for verifying in-flight approvals). When the chain has no remaining bundles signed under the old key_id, remove the old file. Add the retired key_id to NETOPS_EVIDENCE_LEGACY_KEY_IDS if you want the verifier to tolerate historical bundles after the public key is gone.
Until you wire this, assume the CAB is "one person with the secret can do anything" and size your deployment's operator trust accordingly.
Production composition profile
Three plugin categories carry simulated defaults in the bundled tree. The strict-mode production registry refuses simulated adapters; the production composition profile is how you swap each one for a real implementation. Each is independent so you can roll forward one category at a time.
| Plugin category | Knob | Values | Notes |
|---|---|---|---|
| Pre-change validator | NETOPS_VALIDATOR_PROFILE | simulated (default) · real_batfish | real_batfish requires NETOPS_BATFISH_HOST. The bootstrap doesn't yet ping Batfish at construction (the lazy pybatfish session opens on first validate call); install via pip install cubby-network[batfish]. |
| Telemetry | NETOPS_TELEMETRY_PROFILE | simulated (default) · real_prometheus | real_prometheus requires NETOPS_PROMETHEUS_BASE_URL. The bootstrap runs a /-/healthy probe at construction and fails-closed on a non-2xx response. |
| Credential lease (device login) | NETOPS_AUTH_PROFILE | simulated (default) · vault_dynamic · fail_closed | vault_dynamic requires VAULT_ADDR + VAULT_TOKEN and a Vault SSH-secrets role named per intent.metadata['device_address']. fail_closed lets the production composition boot before you have a real backend wired — every change-execution workflow then fails-closed at credential issuance with an actionable error pointing at this knob. No simulated ISE/TACACS adapter is shipped; the simulated value here is the demo-only IseTacacsAuthAdapter. |
The four-step path from the bundled demo to a production composition:
- Pick a category. Start with the validator or telemetry (read-only impact). Auth touches the change path and should land last.
- Wire the dependencies. Stand up the Batfish service, Prometheus server, or Vault SSH role.
- Set the env knob and the supporting URL/host knob. Restart the harness.
- Verify with `cubby smoke` and a read-only workflow before pointing real change traffic at it.
Once all three knobs are off `simulated` (`real_batfish`, `real_prometheus`, and `vault_dynamic` or `fail_closed` for auth), `build_production_harness` boots without the simulated-plugin refusal. The replicas:1 ValidatingAdmissionPolicy is still in force: HA requires Postgres-primary state, distributed evidence storage, and a multi-pod-safe approval queue, which is the next workstream after this one.
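The knob table translates naturally into a small lookup that reports which categories are still simulated and which supporting variables are missing. The sketch below only mirrors the table; `PROFILE_KNOBS` and `profile_problems` are hypothetical names, and Cubby's own bootstrap may validate these differently.

```python
from typing import Dict, List, Mapping

# knob -> {allowed value: supporting variables that value requires}
PROFILE_KNOBS: Dict[str, Dict[str, List[str]]] = {
    "NETOPS_VALIDATOR_PROFILE": {
        "simulated": [],
        "real_batfish": ["NETOPS_BATFISH_HOST"],
    },
    "NETOPS_TELEMETRY_PROFILE": {
        "simulated": [],
        "real_prometheus": ["NETOPS_PROMETHEUS_BASE_URL"],
    },
    "NETOPS_AUTH_PROFILE": {
        "simulated": [],
        "vault_dynamic": ["VAULT_ADDR", "VAULT_TOKEN"],
        "fail_closed": [],
    },
}


def profile_problems(env: Mapping[str, str]) -> List[str]:
    """List everything that still blocks a production composition."""
    problems: List[str] = []
    for knob, allowed in PROFILE_KNOBS.items():
        value = env.get(knob, "simulated")  # simulated is the default
        if value not in allowed:
            problems.append(f"{knob}: unknown value {value!r}")
            continue
        if value == "simulated":
            problems.append(f"{knob}: still simulated")
        for var in allowed[value]:
            if not env.get(var):
                problems.append(f"{knob}={value} requires {var}")
    return problems


if __name__ == "__main__":
    import os

    for problem in profile_problems(os.environ):
        print(problem)
```

An empty result means the environment satisfies the table; it does not prove the Batfish, Prometheus, or Vault endpoints are reachable, so `cubby smoke` is still needed.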
API auth — dev → HMAC → OIDC
Three modes, increasing production-readiness:
- `dev`: A single token is generated (or read from `NETOPS_API_DEV_TOKEN`) and all holders get `network-operator` + `auditor` roles. Local work only. Refused when `NETOPS_ENV=production`.
- `hmac`: Tokens are HMAC-SHA256 over `"<subject>|<roles>|<expiry>"`. Issue with `HmacTokenValidator.issue()`; the validator checks HMAC + expiry. Subject and roles are whatever you encoded; the system trusts them because the signature proves the issuer authorised them.
- `oidc`: Tokens are JWTs validated against a configured issuer + audience + JWKS URL. Roles come from a configurable claim (`NETOPS_OIDC_ROLES_CLAIM`, defaults to `roles`; override for Azure AD's `groups` or Auth0's namespaced URL claim). JWKS fetch is HTTPS only and cached 15 minutes.
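To show what "HMAC-SHA256 over `<subject>|<roles>|<expiry>`" means in practice, here is a self-contained sketch of issuing and validating such a token. Only the signed payload shape comes from this guide. The wire format (payload plus a hex signature joined by `|`), the comma-joined roles, and the function names are assumptions, so tokens from this sketch should not be expected to interoperate with `HmacTokenValidator`.

```python
import hashlib
import hmac
import time
from typing import List, Optional, Tuple


def _sign(secret: bytes, payload: str) -> str:
    """HMAC-SHA256 of the payload, hex encoded."""
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(secret: bytes, subject: str, roles: List[str],
                ttl_seconds: int, now: Optional[float] = None) -> str:
    """Build '<subject>|<roles>|<expiry>|<signature>' (assumed wire format)."""
    issued_at = time.time() if now is None else now
    payload = f"{subject}|{','.join(roles)}|{int(issued_at + ttl_seconds)}"
    return f"{payload}|{_sign(secret, payload)}"


def validate_token(secret: bytes, token: str,
                   now: Optional[float] = None) -> Tuple[str, List[str]]:
    """Check signature first, then expiry; return (subject, roles)."""
    payload, _, signature = token.rpartition("|")
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(signature.encode(), _sign(secret, payload).encode()):
        raise ValueError("signature mismatch")
    subject, roles, expiry = payload.split("|")
    current = time.time() if now is None else now
    if int(expiry) <= current:
        raise ValueError("token expired")
    return subject, roles.split(",") if roles else []


if __name__ == "__main__":
    key = b"0" * 32  # placeholder only; use a random 32+ byte secret
    token = issue_token(key, "alice", ["auditor"], ttl_seconds=3600)
    print(validate_token(key, token))
```

The sketch illustrates why the secret is the whole trust boundary: whoever holds it can encode any subject and any roles, and the validator will accept them.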
SAML? Cubby doesn't ship native SAML. The major enterprise IdPs (Keycloak, Okta, Azure AD / Entra, Auth0, Ping) can bridge SAML to OIDC out of the box. An operator recipe with three IdP walkthroughs lives at docs/SAML_VIA_OIDC.md. This is the supported production posture for SAML-required deployments.
Role names the routes check today:
- `network-operator`: can call mutating routes (`/access-port/change-vlan`, `/runbooks/evaluate`, `/events/webhook`, …)
- `auditor`: read-only token; sees `/knowledge/similar` and authenticated `/readyz?detail=1` but is refused from mutating routes with 403
- Plugin-specific roles (`lead:security`, `cab:carol`, …) are CAB member identities, not API role gates
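The role split can be pictured as a two-line gate. This sketch is illustrative only: `route_status` is a hypothetical function, and letting `network-operator` alone reach read-only routes is an assumption the guide does not confirm.

```python
from typing import Set

MUTATING_ROLES = {"network-operator"}
READ_ROLES = {"network-operator", "auditor"}  # assumption: operators may also read


def route_status(token_roles: Set[str], mutating: bool) -> int:
    """Return the HTTP status a role-gated route would answer with."""
    required = MUTATING_ROLES if mutating else READ_ROLES
    # CAB identities such as 'cab:carol' never match: they are not API roles.
    return 200 if token_roles & required else 403


if __name__ == "__main__":
    print(route_status({"auditor"}, mutating=True))
    print(route_status({"auditor"}, mutating=False))
```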
Secrets custody — what's dev-generated and what must be rotated
Everything under var/keys/ is dev-generated and committed to state between runs. On a first prod deployment, rotate all of them:
| File | Role | Rotation path |
|---|---|---|
| `var/keys/dev_evidence_hmac.key` | Signs evidence bundles | Set `NETOPS_EVIDENCE_HMAC_SECRET` (inline) or `NETOPS_EVIDENCE_HMAC_KEY_PATH` (file). Set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` to refuse fallback. |
| `var/keys/dev_approval_hmac.key` | Signs CAB approvals | Same mechanism as evidence, with `NETOPS_APPROVAL_*` env vars. Ideally replaced with per-approver Ed25519 keys (see above). |
| `var/evidence/chain.tip` + `var/evidence/.chain.lock` | Prev-hash pointer + writer lock for the evidence chain | Deployment-scoped: never commit, never share between deployments. `chain.tip` points at the SHA of the most recent bundle THIS instance signed. Two deployments writing to the same path produce a fork; an operator who pulls a clone with someone else's `chain.tip` sees verify-chain failures because their local bundles don't match. Source-repo `var/evidence/` is `.gitignore`d for this reason. Move both files to durable, deployment-scoped storage (PVC, encrypted volume, S3-backed FUSE mount). Do not delete on a running deployment; use `NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS` for known planned resets. |
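Why a borrowed `chain.tip` breaks verification is easiest to see with a toy prev-hash chain. The sketch below is not Cubby's bundle format: the two-field bundle, SHA-256 over sorted-key JSON, and the all-zero `GENESIS` value are assumptions chosen only to illustrate the mechanism.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumed starting pointer for an empty chain


def bundle_hash(bundle: dict) -> str:
    """Hash a bundle over a canonical (sorted-key) JSON encoding."""
    return hashlib.sha256(json.dumps(bundle, sort_keys=True).encode()).hexdigest()


def append_bundle(chain: List[dict], tip: str, payload: str) -> str:
    """Append a bundle that points at the current tip; return the new tip."""
    bundle = {"prev_hash": tip, "payload": payload}
    chain.append(bundle)
    return bundle_hash(bundle)


def verify_chain(chain: List[dict], tip: str) -> bool:
    """Walk the chain from GENESIS and confirm it ends at the recorded tip."""
    prev = GENESIS
    for bundle in chain:
        if bundle["prev_hash"] != prev:
            return False  # a link does not point at its predecessor
        prev = bundle_hash(bundle)
    return prev == tip


if __name__ == "__main__":
    chain_a: List[dict] = []
    tip_a = append_bundle(chain_a, GENESIS, "change-1")
    tip_a = append_bundle(chain_a, tip_a, "change-2")

    chain_b: List[dict] = []
    append_bundle(chain_b, GENESIS, "unrelated-change")

    print(verify_chain(chain_a, tip_a))  # deployment A with its own tip
    print(verify_chain(chain_b, tip_a))  # deployment B holding A's tip
```

The second call fails for the same reason a cloned `chain.tip` does: the pointer describes bundles that the local store never signed.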
The operator should also rotate:
- `NETOPS_API_HMAC_SECRET` (or OIDC config)
- `ANTHROPIC_API_KEY` / `OPENAI_API_KEY`: treat as secrets; pass via secret store, not `.env` files
Test-user readiness checklist
Before letting another human operator drive the system against anything other than a lab they own:
- [ ] `NETOPS_ENV` unset or `development`, OR you've wired real adapters AND removed every simulated adapter.
- [ ] API auth is `hmac` or `oidc`. Dev auth is off.
- [ ] Evidence + approval HMAC secrets are set via env, not falling back to dev keys.
- [ ] CAB signer is per-approver Ed25519 (or you've told the operator "one secret = full approval authority").
- [ ] The operator has a bearer token scoped to the role they need; no shared `network-operator` + `auditor` token in a chat channel.
- [ ] `var/evidence/chain.tip` is on durable storage (not a container `/tmp`).
- [ ] A monitoring endpoint is polling `/livez` and `/readyz` so a broken bootstrap is visible.
- [ ] The operator has read `QUICKSTART.md` end-to-end and run `cubby smoke` against their own harness.
Where to go if something's wrong
- Something broke on a change Cubby executed: read `docs/ROLLBACK.md`. Covers self-rollback, stuck workflows, false-success cases, and evidence-chain recovery.
- Workflow failures: check `var/evidence/` for the bundle of the failing run; every stage is signed and captures the snapshot at that point.
- Agent failures: set `NETOPS_LOG_LEVEL=DEBUG` and inspect `SafetyGate` verdicts + `AgentContext.metadata`. Injection hits are logged at WARNING.
- CAB failures: the reasons array surfaces `plan hash mismatch` / `signature invalid` / `failed signer verification` (generic; detail is in the server log).
- Lab-only issues: see `tests/devicelab/README.md`; most SR Linux / EOS boot-timing issues are covered there.