Docs

Run controlled chaos experiments against your services. Inject real network faults, watch a guardrail, abort automatically.

Install

The agent runs on Linux and needs CAP_NET_ADMIN to manipulate the kernel queueing discipline (tc/netem). The control plane is portable Go and runs anywhere.

Linux host

# single-line installer (puts /usr/local/bin/chaos in place)
curl -fsSL https://get.straightchaos.com | sh

# verify
chaos version

Docker

The agent needs network admin capabilities. Use one of:

# minimum permissions
docker run --rm \
  --cap-add=NET_ADMIN --network=host \
  straightchaos/agent:latest \
  agent --control-plane $CP --token $ENROLL_TOKEN

# or fully privileged (simpler, less isolated)
docker run --rm --privileged --network=host straightchaos/agent:latest \
  agent --control-plane $CP --token $ENROLL_TOKEN

--network=host is required. The agent shapes traffic on real network interfaces, not on Docker's bridge.

Kubernetes (DaemonSet)

A reference manifest is in deploy/daemonset.yaml. The pod template needs:

securityContext:
  capabilities:
    add: ["NET_ADMIN"]
hostNetwork: true
env:
  - { name: CHAOS_CONTROL_PLANE, value: https://cp.example.com }
  - { name: CHAOS_TOKEN, valueFrom: { secretKeyRef: { name: chaos-enroll, key: token } } }

From source

git clone https://github.com/straightchaos/agent
cd agent
go build -o /usr/local/bin/chaos ./cmd/chaos
go build -o /usr/local/bin/control-plane ./cmd/control-plane

Go 1.22+ required. No external dependencies.

macOS: real fault injection is Linux-only — macOS doesn't have tc/netem. The control plane and dashboard run natively; for the agent, either run it in a Linux VM/container, or use --simulate for the wiring without kernel changes.

Quickstart

From zero to a first experiment in five steps.

1. Start the control plane

control-plane --addr :8080

The control plane keeps state in memory by default. See Operations for persistence.

2. Sign up via the dashboard

Open /dashboard.html, pick Sign up, choose a workspace name. You're now signed in.

3. Mint an enrollment token

Mint one from the dashboard's + Connect button or under Settings → Enrollment tokens. The dashboard shows the exact command to run on your host.

4. Start the agent

chaos agent --control-plane http://localhost:8080 --token sce_<your-token>

It registers, heartbeats, and shows up in the dashboard within a second or two.

5. Launch your first experiment

In the dashboard, configure a latency spec, pick a target (e.g. 10.0.3.21:5432 for your database), set the HTTP health check guardrail to your service's /healthz, and click Launch. You'll see live samples in the chart. If the health check fails, the agent rolls back automatically.

Tip: Add --simulate to the agent for a dry pass — the agent goes through the whole flow but doesn't touch the kernel. Useful for verifying your wiring before granting CAP_NET_ADMIN.

Concepts

Agent

A Go binary that runs on the host you want to test. Registers with the control plane, polls for work, executes faults, reports events back.

Control plane

HTTP API + workspace/auth model + dashboard. Holds experiments, dispatches them to agents, ingests events, exposes everything via REST.

Experiment

A single run of a fault against a single agent, with a guardrail and a hard duration. Has a status (queued → running → completed | aborted-by-slo | aborted-by-signal | error) and an event timeline.
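
The lifecycle above can be read as a small state machine. The following is a minimal sketch of that reading, not the control plane's actual code: the type and function names are hypothetical, and it assumes the only legal moves are queued to running and running to one of the four terminal statuses.

```go
package main

import "fmt"

// Status mirrors the lifecycle named in the text. The type and helper names
// are hypothetical; only the status strings come from the documentation.
type Status string

const (
	Queued          Status = "queued"
	Running         Status = "running"
	Completed       Status = "completed"
	AbortedBySLO    Status = "aborted-by-slo"
	AbortedBySignal Status = "aborted-by-signal"
	Errored         Status = "error"
)

// next lists the transitions implied by queued → running → terminal.
// Assumption: a queued experiment cannot jump straight to a terminal status.
var next = map[Status][]Status{
	Queued:  {Running},
	Running: {Completed, AbortedBySLO, AbortedBySignal, Errored},
}

// CanTransition reports whether moving from one status to another follows
// the documented order.
func CanTransition(from, to Status) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a status has no outgoing transitions.
func IsTerminal(s Status) bool {
	_, ok := next[s]
	return !ok
}

func main() {
	fmt.Println(CanTransition(Queued, Running))
	fmt.Println(CanTransition(Completed, Running))
	fmt.Println(IsTerminal(AbortedBySLO))
}
```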

Fault

The thing being injected. Today: latency (delay + jitter on egress, scoped or device-wide), loss, and partition via tc/netem, plus CPU burn, memory pressure, and DNS block. Process kill is coming.

Guardrail

What watches your service while the fault is active: an HTTP health probe or a PromQL query. If the guardrail is breached, the fault is removed immediately.

Blast radius

How much traffic the fault affects: scoped (one destination host or CIDR) or whole device (all egress on the interface).

Deadman

A detached process that removes the fault unconditionally after duration + grace. Survives even if the agent crashes. Last line of defense.

Steady state

What "healthy" means for your service. Define it as an endpoint that returns 200, or a PromQL query that stays under a threshold.

Faults

Latency

Adds delay (with optional jitter) to outbound traffic. Implemented with tc queueing disciplines and the netem module.

Spec

Field         Type    Description
dev           string  Network interface, e.g. eth0.
target        string  Destination host, host:port, or CIDR. Empty = whole-device blast radius.
delay_ms      int     Added latency in milliseconds. Capped at 60,000.
jitter_ms     int     Variance around delay_ms (normal distribution). Capped at 10,000.
duration_sec  int     How long the fault is active. Capped at 3,600.
grace_sec     int     Extra time before the deadman fires (default 10s).

Scoped (one dependency)

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms 30ms distribution normal
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
   match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3

Only packets to that host:port are delayed; everything else flows unaffected.

Whole device (all egress)

tc qdisc add dev eth0 root handle 1: netem delay 200ms 30ms distribution normal
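
To make the mapping from Spec fields to tc commands concrete, here is a minimal sketch that validates the documented caps and prints the same scoped or whole-device plan shown above. It is an illustration only: LatencySpec and Plan are hypothetical names, the target is assumed to be an IPv4 address (hostname resolution is left out), and the commands are returned as strings rather than executed.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// LatencySpec holds the subset of Spec fields needed to build the tc plan.
// The type and function names here are hypothetical.
type LatencySpec struct {
	Dev      string
	Target   string // IPv4 host, host:port, or CIDR; empty = whole device
	DelayMs  int
	JitterMs int
}

// Plan returns the tc commands for the spec, or an error when a documented
// cap (60,000 ms delay, 10,000 ms jitter) is exceeded.
func Plan(s LatencySpec) ([]string, error) {
	if s.Dev == "" {
		return nil, errors.New("dev is required")
	}
	if s.DelayMs < 0 || s.DelayMs > 60000 {
		return nil, fmt.Errorf("delay_ms %d outside 0..60000", s.DelayMs)
	}
	if s.JitterMs < 0 || s.JitterMs > 10000 {
		return nil, fmt.Errorf("jitter_ms %d outside 0..10000", s.JitterMs)
	}
	netem := fmt.Sprintf("netem delay %dms", s.DelayMs)
	if s.JitterMs > 0 {
		netem += fmt.Sprintf(" %dms distribution normal", s.JitterMs)
	}
	if s.Target == "" {
		// Whole device: netem becomes the root qdisc.
		return []string{"tc qdisc add dev " + s.Dev + " root handle 1: " + netem}, nil
	}
	dst, port := s.Target, ""
	if !strings.Contains(dst, "/") { // a bare host, not a CIDR
		if h, p, err := net.SplitHostPort(dst); err == nil {
			dst, port = h, p
		}
		dst += "/32"
	}
	match := "match ip dst " + dst
	if port != "" {
		match += " match ip dport " + port + " 0xffff"
	}
	// Scoped: prio root, netem on band 1:3, u32 filter steering matches to it.
	return []string{
		"tc qdisc add dev " + s.Dev + " root handle 1: prio",
		"tc qdisc add dev " + s.Dev + " parent 1:3 handle 30: " + netem,
		"tc filter add dev " + s.Dev + " protocol ip parent 1:0 prio 1 u32 " + match + " flowid 1:3",
	}, nil
}

func main() {
	cmds, err := Plan(LatencySpec{Dev: "eth0", Target: "10.0.3.21:5432", DelayMs: 200, JitterMs: 30})
	if err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

Keeping plan generation separate from execution is one way to support a dry run: the same strings can be printed for review or handed to an executor.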

Loss

Drops a configurable fraction of outbound packets, optionally with correlation so drops cluster into bursts rather than uniformly random. Implemented with tc/netem.

Spec

Field             Type    Description
dev               string  Network interface, e.g. eth0.
target            string  Destination host, host:port, or CIDR. Empty = whole-device blast radius.
loss_percent      float   Percentage of packets to drop, (0, 100]. Above ~50% effectively partitions the path.
loss_correlation  float   0–100. 0 = uniform random loss. Higher values make consecutive drops more likely (burstier).
duration_sec      int     How long the fault is active. Capped at 3,600.
grace_sec         int     Deadman grace beyond duration (default 10s).

Scoped (one dependency)

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 5% 25%
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
   match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3

Whole device (all egress)

tc qdisc add dev eth0 root handle 1: netem loss 5%

From the CLI

chaos run loss --dev eth0 --target 10.0.3.21:5432 \
  --percent 5 --correlation 25 --duration 30s \
  --health-url http://localhost:8080/healthz

What this exposes. Latency tests timeout/retry/circuit-breaker behavior. Loss exposes idempotency: are retries safe? Do you have at-least-once semantics that survive duplicate delivery? Even small loss percentages (1–5%) on a high-RPS path will surface re-entrancy bugs that never show under latency.

Partition

Fully blocks egress traffic to a target (or the whole device) by dropping 100% of matching packets at the kernel queueing layer. Tests how a service behaves when a dependency becomes completely unreachable — the failover, retry, and circuit-breaker story under absence rather than degradation.

Spec

Field         Type    Description
dev           string  Network interface, e.g. eth0.
target        string  Destination host, host:port, or CIDR. Empty = whole-device blast radius (the host is effectively offline from this NIC).
duration_sec  int     How long the partition holds. Capped at 3,600.
grace_sec     int     Deadman grace beyond duration (default 10s).

Scoped (cut off one dependency)

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 100%
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
   match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3

Traffic to that host:port disappears into the void; everything else flows unaffected.

Whole device (full network isolation)

tc qdisc add dev eth0 root handle 1: netem loss 100%

From the CLI

chaos run partition --dev eth0 --target 10.0.3.21:5432 \
  --duration 30s --health-url http://localhost:8080/healthz

What this exposes. Latency tests timeout settings. Loss tests retry safety. Partition tests failover: whether the service has a clear, exercised path to abandon a failed dependency. Services that "work" because they've never lost a backing service often fail spectacularly when one actually goes away. The default behavior of many HTTP clients under TCP-level black-holing is to hang for the full connect/read timeout, which is often minutes; a partition experiment surfaces that hang before production does.

CPU burn

Saturates CPU cores with tight spin loops, testing how a service behaves under compute starvation. Exposes autoscaling response time, GC pressure under contention, and whether request latency degrades gracefully or collapses.

Spec

Field         Type  Description
cores         int   Number of cores to saturate. 0 = all available cores (the default). Capped at 128.
duration_sec  int   How long to burn. Capped at 3,600.

Mechanism

The agent spawns one goroutine per core, each locked to an OS thread (runtime.LockOSThread) in a tight spin loop. On teardown, the child context is cancelled and all goroutines exit. No kernel objects persist — if the agent process dies, the burn stops instantly.

From the CLI

chaos run cpu --cores 4 --duration 60s \
  --health-url http://localhost:8080/healthz

What this exposes. CPU burn reveals whether a service has enough headroom to absorb a noisy-neighbor spike. Services running at 80%+ utilization in steady state often hit cascading failures under a CPU burn because the remaining 20% is consumed by GC, connection handling, and health checks. If the health probe trips, the service doesn't have enough margin.

Memory pressure

Allocates and pins a fixed amount of memory, testing OOM-killer behavior, swap thrash, GC pauses under heap contention, and whether memory-limited containers handle pressure gracefully.

Spec

Field         Type  Description
megabytes     int   MB to allocate and pin. Capped at 32,768 (32 GB).
duration_sec  int   How long to hold the allocation. Capped at 3,600.

Mechanism

The agent allocates a single byte slice of the requested size and touches every page (4 KB stride) to defeat lazy allocation and overcommit. This forces the OS to commit physical memory. On teardown, the slice is nil'd and runtime.GC() is called as a release hint. No kernel objects persist.
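
A minimal sketch of the allocate-and-touch approach described above follows. Pin and Release are hypothetical names, and the sketch assumes a 4 KB page size as stated in the text. It does not call mlock, so the pages are resident but can still be swapped out by the kernel.

```go
package main

import (
	"fmt"
	"runtime"
)

const pageSize = 4096 // the 4 KB stride from the text

// Pin allocates mb megabytes and writes one byte per page so the OS has to
// back every page with physical memory instead of deferring the allocation.
// The name is hypothetical; no mlock is performed in this sketch.
func Pin(mb int) []byte {
	buf := make([]byte, mb<<20)
	for i := 0; i < len(buf); i += pageSize {
		buf[i] = 1
	}
	return buf
}

// Release drops the only reference and runs a collection. As the text notes,
// this is a hint: the runtime decides when memory goes back to the OS.
func Release(buf *[]byte) {
	*buf = nil
	runtime.GC()
}

func main() {
	buf := Pin(64)
	fmt.Println(len(buf)>>20, "MB held")
	Release(&buf)
	fmt.Println("released:", buf == nil)
}
```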

From the CLI

chaos run memory --megabytes 512 --duration 60s \
  --health-url http://localhost:8080/healthz

What this exposes. Memory pressure reveals whether containers are properly limited (resources.limits.memory in K8s), whether the OOM killer targets the right process, and whether the service restarts cleanly or enters a crash loop. It also tests GC behavior: a Go service under heap pressure may see stop-the-world pauses that trip latency-based health checks even though the process isn't OOM'd.

DNS block

Drops outbound DNS queries for specific domain names, simulating a DNS outage for individual dependencies. Tests how a service behaves when it can't resolve a dependency's hostname — failover to cached records, circuit-breaker activation, graceful degradation vs hard crash.

Spec

Field         Type      Description
domains       []string  Domain names to block. Up to 32 entries. E.g. ["api.stripe.com", "db.internal"].
dev           string    Optional: restrict to a specific interface (-o <dev>). Empty = all interfaces.
duration_sec  int       How long to block. Capped at 3,600.

Mechanism

The agent adds iptables rules that match outbound UDP port-53 packets containing the wire-encoded domain name (using -m string --algo bm --hex-string). Matching packets are dropped (-j DROP), causing the application's DNS resolver to time out. On teardown, the rules are removed with iptables -D.

An eBPF TC-BPF upgrade path exists in bpf/dns_block.c: a classifier that does the same matching inside the kernel at line rate, with per-CPU stats. The wire format and Spec interface are identical; swapping implementations doesn't affect the rest of the stack.

From the CLI

chaos run dns --domains api.stripe.com,db.internal \
  --duration 60s --health-url http://localhost:8080/healthz

What this exposes. DNS failures are one of the most common real-world outage triggers, yet they're rarely tested because tc/netem can't selectively target specific domains. A service that works fine when Postgres is slow might crash hard when it can't resolve the Postgres hostname. This fault surfaces stale DNS caches, clients that ignore record TTLs, and services that retry resolution in a tight loop (amplifying the failure). The eBPF version (future) will also expose DNS query volume via BPF map stats, which is useful for understanding resolution patterns under pressure.

Process kill

Coming. Targeted SIGKILL / SIGTERM of a named process, optionally recurring (keep killing if it respawns). Different mechanism from CPU/memory but slots in behind the same Spec interface.

Guardrails

Every running fault has a guardrail watching the system. The instant it trips, the agent removes the fault and reports aborted-by-slo.

HTTP health check (default)

Polls an endpoint on a fixed interval. Aborts on non-2xx or when the response takes longer than the configured maximum.

# in spec form
health_url: "http://localhost:8080/healthz"
health_max_latency_ms: 500
interval_sec: 2

Failed requests (connection refused, timeout) count as a breach — the experiment may be what broke the endpoint, so that's the safe interpretation.

Prometheus

Polls a PromQL query that returns a scalar or instant vector. Aborts when the value crosses the abort_above or abort_below threshold.

prometheus_url: "http://prom:9090"
query: 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))'
abort_above: 0.5
interval_sec: 5

Transient query errors are tolerated (a flaky scrape isn't a breach), but a sustained threshold violation aborts.

Deadman

Always on. When a fault is applied, the agent forks a detached process that sleeps for duration + grace seconds, then removes the fault unconditionally. It survives agent crashes and panics in the agent's own process, anything short of a full reboot, and a reboot wipes the qdisc anyway.

Server-side caps

The control plane rejects experiments that exceed its limits before they reach an agent. The caps for each field are listed in the Spec tables under Faults.

Workspace kill switch

POST /v1/kill aborts every running experiment in the workspace. Use it when something's gone wrong and you don't have time to find the right experiment ID.

Agent CLI

The chaos binary is both the CLI for one-off runs and the long-running daemon.

chaos agent

Run the agent connected to a control plane.

Flag                 Description
--control-plane URL  Control plane URL.
--token TOKEN        Enrollment token (one-time) or agent token (already enrolled).
--heartbeat DUR      Heartbeat interval. Default 5s.
--simulate           Run the full loop without touching the kernel. Safe for any host.

chaos run latency

Run a one-off experiment locally — no control plane required.

Flag                               Description
--dev IFACE                        Interface, default eth0.
--target HOST                      Destination host[:port] or CIDR. Omit for whole-device.
--delay DUR                        Added latency, e.g. 200ms.
--jitter DUR                       Variance, e.g. 30ms.
--duration DUR                     How long to hold the fault.
--grace DUR                        Deadman grace window (default 10s).
--health-url URL                   HTTP guardrail; aborts on non-2xx.
--health-max-latency DUR           Aborts if a health probe takes longer than this.
--prom URL                         Prometheus base URL.
--query Q                          PromQL query.
--abort-above N / --abort-below N  Threshold(s) for the query.
--interval DUR                     Probe interval, default 5s.
--simulate                         Walk through the pipeline without touching the kernel.
--dry-run                          Print the tc plan and exit without applying.
-y / --yes                         Skip confirmation prompt.

Exit codes: 0 success, 7 aborted-by-slo (an SLO violation auto-aborted the experiment; useful as a CI signal that the system isn't resilient to this fault).

chaos abort / chaos status

Inspect and forcibly tear down local experiments started with chaos run:

chaos status          # show local state file
chaos abort           # remove any active qdisc and clear state

API reference

Every endpoint takes a Bearer token in Authorization. Three token types, scoped differently:

Type              Prefix  Issued by                    Use
Session token     scs_    signup/login                 Dashboard and admin API.
Enrollment token  sce_    POST /v1/enrollment-tokens   One-time, agent uses it to register. Revocable.
Agent token       sca_    returned from register       Per-agent credential for heartbeat / poll / events.
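
Because each token type carries a distinct prefix, a client or log filter can tell them apart without calling the API. A minimal sketch, with TokenKind as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// TokenKind classifies a token by the documented prefixes. It says nothing
// about whether the token is valid; that is decided by the control plane.
func TokenKind(tok string) string {
	switch {
	case strings.HasPrefix(tok, "scs_"):
		return "session"
	case strings.HasPrefix(tok, "sce_"):
		return "enrollment"
	case strings.HasPrefix(tok, "sca_"):
		return "agent"
	default:
		return "unknown"
	}
}

func main() {
	for _, tok := range []string{"scs_abc", "sce_abc", "sca_abc", "abc"} {
		fmt.Println(tok, "->", TokenKind(tok))
	}
}
```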

Auth

POST /v1/auth/signup
{ "email": "you@co.com", "password": "...", "workspace": "acme-prod" }
→ { "session_token": "scs_...", "workspace_id": "ws-1", "email": "you@co.com" }

POST /v1/auth/login
{ "email": "you@co.com", "password": "..." }
→ { "session_token": "scs_...", "workspace_id": "ws-1", "email": "you@co.com" }

Enrollment tokens

GET /v1/enrollment-tokens
POST /v1/enrollment-tokens  { "label": "prod-east" }
POST /v1/enrollment-tokens/{token}/revoke

Agents

POST /v1/agents/register  (uses enrollment token)
{ "hostname": "...", "kernel": "...", "instance_id": "...", "device": "eth0", "version": "..." }
→ { "agent_id": "agt-1", "agent_token": "sca_..." }

GET /v1/agents
DELETE /v1/agents/{id}
POST /v1/agents/{id}/heartbeat  (agent token)
GET /v1/agents/{id}/next  long-poll, 25s

Experiments

POST /v1/experiments
{ "agent_id": "agt-1", "spec": { "dev":"eth0", "target":"10.0.3.21:5432",
  "delay_ms":200, "jitter_ms":30, "duration_sec":30, "grace_sec":10,
  "interval_sec":2, "health_url":"http://app:8080/healthz", "health_max_latency_ms":500 } }
→ { "id":"exp-1", "status":"queued", ... }

GET /v1/experiments
GET /v1/experiments/{id}
POST /v1/experiments/{id}/abort
POST /v1/experiments/{id}/events  (agent token)
POST /v1/kill  workspace-wide

Operations

Persistence

The default control-plane store is in-memory — fine for dev and the first alpha, but state is lost on restart. A Postgres backend that satisfies the same Store interface is on the roadmap; the schema is documented in STORAGE.md.

TLS and hosting

The control plane speaks plain HTTP. For production, front it with a TLS-terminating reverse proxy (nginx, Caddy, an ALB). Make sure your agents are configured to use the public URL.

Permissions

The agent calls tc, which requires CAP_NET_ADMIN. On a bare host, run as root or grant the capability:

setcap cap_net_admin+ep /usr/local/bin/chaos

In a container, use --cap-add=NET_ADMIN (or --privileged). In Kubernetes, set the capability in the pod's securityContext.

Troubleshooting

Agent registers but no experiments dispatch

The agent long-polls /v1/agents/{id}/next for 25s at a time. Make sure your reverse proxy / load balancer doesn't terminate long-lived requests sooner than that.

RTNETLINK answers: Operation not supported

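The sch_netem kernel module isn't available. Verify with tc qdisc add dev lo root netem delay 10ms, then clean up with tc qdisc del dev lo root. On a minimal kernel (Ubuntu cloud images, for example), install linux-modules-extra-$(uname -r) or use a different base image. Containers share the host's kernel, so the module has to be present on the host.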

Experiment ends instantly with aborted-by-slo

The guardrail tripped on the first sample. For HTTP health probes, check that the URL is reachable from the agent and that the endpoint actually returns 2xx when the system is healthy. For Prometheus, check that the query already returns a value on the safe side of your threshold (below abort_above, above abort_below) before injecting.

Agent shows as "offline" while still running

The agent token may have been revoked (workspace policy). The agent's register call will succeed on the next retry only if you give it a fresh enrollment token.

FAQ

Does this work on macOS?

The agent's real fault injection is Linux-only — macOS doesn't have tc/netem. The control plane and dashboard run natively on macOS; the agent compiles and runs on macOS only in --simulate mode. To validate real injection from a Mac, run the agent inside a Linux container (Docker Desktop / OrbStack / Lima) with --privileged or --cap-add=NET_ADMIN.

How is this different from Chaos Mesh / Litmus / Gremlin?

Straight Chaos is built primarily for service teams, not platform teams. It is open core: the eBPF kernel-level fault engine (planned) is the moat; the control plane (safety/scoring/audit/scheduling) is the SKU. On day one you get a single binary, no Helm chart, no operator, no Kubernetes prerequisite, and an HTTP health probe as the default guardrail so you don't need Prometheus to start.

What happens if the agent crashes during an experiment?

The deadman process is detached (setsid + own process group) and survives. It sleeps for duration + grace, then runs tc qdisc del regardless of agent state. The fault is removed even if the agent is gone.

Is my data isolated between workspaces?

Yes — every API call is workspace-scoped at the auth layer. An agent token grants access to one workspace. A session token is bound to one workspace. Cross-workspace reads / writes return 404. This is covered by the TestTenancyIsolation test in internal/server.

Can I use this without the control plane?

Yes — chaos run latency is fully usable on its own. It runs the same plan, the same guardrails, the same deadman, locally. The exit code (0 / 7) makes it a useful resilience gate in CI.