Docs
Run controlled chaos experiments against your services. Inject real network faults, watch a guardrail, abort automatically.
Install
The agent runs on Linux and needs CAP_NET_ADMIN to manipulate the kernel queueing discipline (tc/netem). The control plane is portable Go and runs anywhere.
Linux host
# single-line installer (puts /usr/local/bin/chaos in place)
curl -fsSL https://get.straightchaos.com | sh
# verify
chaos version
Docker
The agent needs network admin capabilities. Use one of:
# minimum permissions
docker run --rm \
--cap-add=NET_ADMIN --network=host \
straightchaos/agent:latest \
agent --control-plane $CP --token $ENROLL_TOKEN
# or fully privileged (simpler, less isolated)
docker run --rm --privileged --network=host straightchaos/agent:latest \
agent --control-plane $CP --token $ENROLL_TOKEN
Kubernetes (DaemonSet)
A reference manifest is in deploy/daemonset.yaml. The pod template needs:
securityContext:
  capabilities:
    add: ["NET_ADMIN"]
hostNetwork: true
env:
  - { name: CHAOS_CONTROL_PLANE, value: https://cp.example.com }
  - { name: CHAOS_TOKEN, valueFrom: { secretKeyRef: { name: chaos-enroll, key: token } } }
From source
git clone https://github.com/straightchaos/agent
cd agent
go build -o /usr/local/bin/chaos ./cmd/chaos
go build -o /usr/local/bin/control-plane ./cmd/control-plane
Go 1.22+ required. No external dependencies.
Not on Linux? Real fault injection depends on tc/netem. The control plane and dashboard run natively; for the agent, either run it in a Linux VM or container, or use --simulate to exercise the wiring without kernel changes.
Quickstart
From zero to a first experiment in five steps.
1. Start the control plane
control-plane --addr :8080
The control plane keeps state in memory by default. See Operations for persistence.
2. Sign up via the dashboard
Open /dashboard.html, pick Sign up, choose a workspace name. You're now signed in.
3. Mint an enrollment token
Use the dashboard's + Connect button, or go to Settings → Enrollment tokens. The dashboard shows the exact command to run on your host.
4. Start the agent
chaos agent --control-plane http://localhost:8080 --token sce_<your-token>
It registers, heartbeats, and shows up in the dashboard within a second or two.
5. Launch your first experiment
In the dashboard, configure a latency spec, pick a target (e.g. 10.0.3.21:5432 for your database), set the HTTP health check guardrail to your service's /healthz, and click Launch. You'll see live samples in the chart. If the health check fails, the agent rolls back automatically.
Pass --simulate to the agent for a dry pass: the agent goes through the whole flow but doesn't touch the kernel. Useful for verifying your wiring before granting CAP_NET_ADMIN.
Concepts
Agent
A Go binary that runs on the host you want to test. Registers with the control plane, polls for work, executes faults, reports events back.
Control plane
HTTP API + workspace/auth model + dashboard. Holds experiments, dispatches them to agents, ingests events, exposes everything via REST.
Experiment
A single run of a fault against a single agent, with a guardrail and a hard duration. Has a status (queued → running → completed | aborted-by-slo | aborted-by-signal | error) and an event timeline.
Fault
The thing being injected. Documented below: latency, loss, and partition (via tc/netem), CPU burn, memory pressure, and DNS block (via iptables). Process kill is planned.
Guardrail
What watches your service while the fault is active: an HTTP health probe or a PromQL query. If the guardrail is breached, the fault is removed immediately.
Blast radius
How much traffic the fault affects: scoped (one destination host or CIDR) or whole device (all egress on the interface).
Deadman
A detached process that removes the fault unconditionally after duration + grace. Survives even if the agent crashes. Last line of defense.
Steady state
What "healthy" means for your service. Define it as an endpoint that returns 200, or a PromQL query that stays under a threshold.
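The experiment lifecycle described under Experiment can be modelled as a small state machine. The following is a minimal sketch, assuming the status strings are exactly those listed there. The transition table and the names `TERMINAL`, `TRANSITIONS`, `is_terminal`, and `can_transition` are illustrative and not part of the product's API; in particular, it assumes terminal states are only reached from running, which the text does not state.

```python
# Hypothetical model of the experiment lifecycle. Status strings come from the
# Experiment definition; the allowed transitions are an assumption.
TERMINAL = {"completed", "aborted-by-slo", "aborted-by-signal", "error"}

TRANSITIONS = {
    "queued": {"running"},
    "running": set(TERMINAL),
}


def is_terminal(status: str) -> bool:
    """Return True once an experiment can no longer change status."""
    return status in TERMINAL


def can_transition(current: str, nxt: str) -> bool:
    """Return True if moving from current to nxt is allowed by the table."""
    return nxt in TRANSITIONS.get(current, set())
```

A client polling an experiment could stop as soon as `is_terminal(status)` is true.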
Faults
Latency
Adds delay (with optional jitter) to outbound traffic. Implemented with tc queueing disciplines and the netem module.
Spec
| Field | Type | Description |
|---|---|---|
| dev | string | Network interface, e.g. eth0. |
| target | string | Destination host, host:port, or CIDR. Empty = whole-device blast radius. |
| delay_ms | int | Added latency in milliseconds. Capped at 60,000. |
| jitter_ms | int | Variance around delay_ms (normal distribution). Capped at 10,000. |
| duration_sec | int | How long the fault is active. Capped at 3,600. |
| grace_sec | int | Extra time before the deadman fires (default 10s). |
Scoped (one dependency)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms 30ms distribution normal
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3
Only packets to that host:port are delayed; everything else flows unaffected.
Whole device (all egress)
tc qdisc add dev eth0 root handle 1: netem delay 200ms 30ms distribution normal
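To make the mapping from a latency spec to the tc commands above concrete, here is a minimal sketch in Python. It is not the agent's code (the agent is a Go binary); `parse_target` and `latency_plan` are hypothetical helpers, and the sketch assumes the target is an IPv4 address, address:port, or CIDR rather than a hostname.

```python
import ipaddress


def parse_target(target: str):
    """Split 'addr', 'addr:port', or a CIDR into (cidr, port or None). IPv4 only."""
    port = None
    host = target
    if "/" not in target and ":" in target:
        host, port_text = target.rsplit(":", 1)
        port = int(port_text)
    # A bare address becomes a /32 network, matching the filter shown above.
    network = ipaddress.ip_network(host, strict=False)
    return str(network), port


def latency_plan(dev: str, delay_ms: int, jitter_ms: int = 0, target: str = ""):
    """Return the tc invocations (as argv lists) for a latency fault."""
    netem = ["netem", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        netem += [f"{jitter_ms}ms", "distribution", "normal"]
    if not target:
        # Whole device: netem becomes the root qdisc.
        return [["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:"] + netem]
    cidr, port = parse_target(target)
    match = ["match", "ip", "dst", cidr]
    if port is not None:
        match += ["match", "ip", "dport", str(port), "0xffff"]
    # Scoped: prio root, netem on band 1:3, u32 filter steering matches to it.
    return [
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "prio"],
        ["tc", "qdisc", "add", "dev", dev, "parent", "1:3", "handle", "30:"] + netem,
        ["tc", "filter", "add", "dev", dev, "protocol", "ip", "parent", "1:0",
         "prio", "1", "u32"] + match + ["flowid", "1:3"],
    ]
```

The function only builds the plan; actually running the commands (for example with `subprocess.run`) requires CAP_NET_ADMIN.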
Loss
Drops a configurable fraction of outbound packets, optionally with correlation so drops cluster into bursts rather than uniformly random. Implemented with tc/netem.
Spec
| Field | Type | Description |
|---|---|---|
| dev | string | Network interface, e.g. eth0. |
| target | string | Destination host, host:port, or CIDR. Empty = whole-device blast radius. |
| loss_percent | float | Percentage of packets to drop, (0, 100]. Above ~50% effectively partitions the path. |
| loss_correlation | float | 0–100. 0 = uniform random loss. Higher values make consecutive drops more likely (burstier). |
| duration_sec | int | How long the fault is active. Capped at 3,600. |
| grace_sec | int | Deadman grace beyond duration (default 10s). |
Scoped (one dependency)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 5% 25%
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3
Whole device (all egress)
tc qdisc add dev eth0 root handle 1: netem loss 5%
From the CLI
chaos run loss --dev eth0 --target 10.0.3.21:5432 \
--percent 5 --correlation 25 --duration 30s \
--health-url http://localhost:8080/healthz
Partition
Fully blocks egress traffic to a target (or the whole device) by dropping 100% of matching packets at the kernel queueing layer. Tests how a service behaves when a dependency becomes completely unreachable — the failover, retry, and circuit-breaker story under absence rather than degradation.
Spec
| Field | Type | Description |
|---|---|---|
| dev | string | Network interface, e.g. eth0. |
| target | string | Destination host, host:port, or CIDR. Empty = whole-device blast radius (the host is effectively offline from this NIC). |
| duration_sec | int | How long the partition holds. Capped at 3,600. |
| grace_sec | int | Deadman grace beyond duration (default 10s). |
Scoped (cut off one dependency)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 100%
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3
Traffic to that host:port disappears into the void; everything else flows unaffected.
Whole device (full network isolation)
tc qdisc add dev eth0 root handle 1: netem loss 100%
From the CLI
chaos run partition --dev eth0 --target 10.0.3.21:5432 \
--duration 30s --health-url http://localhost:8080/healthz
CPU burn
Saturates CPU cores with tight spin loops, testing how a service behaves under compute starvation. Exposes autoscaling response time, GC pressure under contention, and whether request latency degrades gracefully or collapses.
Spec
| Field | Type | Description |
|---|---|---|
| cores | int | Number of cores to saturate. 0 = all available cores (the default). Capped at 128. |
| duration_sec | int | How long to burn. Capped at 3,600. |
Mechanism
The agent spawns one goroutine per core, each locked to an OS thread (runtime.LockOSThread) in a tight spin loop. On teardown, the child context is cancelled and all goroutines exit. No kernel objects persist — if the agent process dies, the burn stops instantly.
From the CLI
chaos run cpu --cores 4 --duration 60s \
--health-url http://localhost:8080/healthz
Memory pressure
Allocates and pins a fixed amount of memory, testing OOM-killer behavior, swap thrash, GC pauses under heap contention, and whether memory-limited containers handle pressure gracefully.
Spec
| Field | Type | Description |
|---|---|---|
| megabytes | int | MB to allocate and pin. Capped at 32,768 (32 GB). |
| duration_sec | int | How long to hold the allocation. Capped at 3,600. |
Mechanism
The agent allocates a single byte slice of the requested size and touches every page (4 KB stride) to defeat lazy allocation and overcommit. This forces the OS to commit physical memory. On teardown, the slice is nil'd and runtime.GC() is called as a release hint. No kernel objects persist.
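The page-touching step is what turns a reservation into real memory pressure. Below is a Python analogue of the idea, offered as a sketch rather than the agent's Go implementation; `allocate_and_touch` and `PAGE_SIZE` are hypothetical names, and the 4096-byte page size is an assumption that matches the 4 KB stride described above.

```python
PAGE_SIZE = 4096  # assumed page size, matching the 4 KB stride in the text


def allocate_and_touch(megabytes: int) -> bytearray:
    """Allocate a buffer and write one byte per page so the OS must back it."""
    size = megabytes * 1024 * 1024
    buf = bytearray(size)  # zero-filled; pages may not be resident yet
    for offset in range(0, size, PAGE_SIZE):
        buf[offset] = 1  # a single write per page is enough to commit it
    return buf
```

Holding the returned buffer keeps the memory committed; dropping the last reference lets the runtime release it.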
From the CLI
chaos run memory --megabytes 512 --duration 60s \
--health-url http://localhost:8080/healthz
This exercises container memory limits (resources.limits.memory in K8s), whether the OOM killer targets the right process, and whether the service restarts cleanly or enters a crash loop. It also tests GC behavior: a Go service under heap pressure may see stop-the-world pauses that trip latency-based health checks even though the process isn't OOM'd.
DNS block
Drops outbound DNS queries for specific domain names, simulating a DNS outage for individual dependencies. Tests how a service behaves when it can't resolve a dependency's hostname — failover to cached records, circuit-breaker activation, graceful degradation vs hard crash.
Spec
| Field | Type | Description |
|---|---|---|
| domains | []string | Domain names to block. Up to 32 entries. E.g. ["api.stripe.com", "db.internal"]. |
| dev | string | Optional: restrict to a specific interface (-o <dev>). Empty = all interfaces. |
| duration_sec | int | How long to block. Capped at 3,600. |
Mechanism
The agent adds iptables rules that match outbound UDP port-53 packets containing the wire-encoded domain name (using -m string --algo bm --hex-string). Matching packets are dropped (-j DROP), causing the application's DNS resolver to time out. On teardown, the rules are removed with iptables -D.
An eBPF TC-BPF upgrade path exists in bpf/dns_block.c: a classifier that does the same matching inside the kernel at line rate, with per-CPU stats. The wire format and Spec interface are identical; swapping implementations doesn't affect the rest of the stack.
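The "wire-encoded domain name" in the iptables rule is the DNS label format: each label is prefixed by its length. A minimal sketch of that encoding follows; `encode_qname` and `hex_string_arg` are hypothetical helpers, and including the trailing zero byte (the root label) in the match pattern is an assumption, since the text does not say whether the agent does so.

```python
def encode_qname(domain: str) -> bytes:
    """Encode a domain as length-prefixed DNS labels ending in a zero byte."""
    out = bytearray()
    for label in domain.strip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"invalid label length in {domain!r}")
        out.append(len(raw))
        out += raw
    out.append(0)  # root label terminator (assumed to be part of the pattern)
    return bytes(out)


def hex_string_arg(domain: str) -> str:
    """Format the encoded name for iptables' --hex-string option."""
    return "|" + encode_qname(domain).hex() + "|"
```

For example, `hex_string_arg("db.internal")` yields `|02646208696e7465726e616c00|`, which could be passed as the value of `--hex-string`.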
From the CLI
chaos run dns --domains api.stripe.com,db.internal \
--duration 60s --health-url http://localhost:8080/healthz
Process kill
Coming. Targeted SIGKILL / SIGTERM of a named process, optionally recurring (keep killing if it respawns). Different mechanism from CPU/memory but slots in behind the same Spec interface.
Guardrails
Every running fault has a guardrail watching the system. The instant it trips, the agent removes the fault and reports aborted-by-slo.
HTTP health check (default)
Polls an endpoint on a fixed interval. Aborts on non-2xx or when the response takes longer than the configured maximum.
# in spec form
health_url: "http://localhost:8080/healthz"
health_max_latency_ms: 500
interval_sec: 2
Failed requests (connection refused, timeout) count as a breach — the experiment may be what broke the endpoint, so that's the safe interpretation.
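The breach rules above (non-2xx, too slow, or a failed request) fit in one small decision function. This is a sketch under stated assumptions, not the agent's code: `is_breach` and `probe` are hypothetical names, and the 5-second client timeout is an arbitrary choice.

```python
import time
import urllib.error
import urllib.request


def is_breach(status, latency_ms: float, max_latency_ms: float) -> bool:
    """Decide whether one probe sample trips the guardrail.

    status is the HTTP status code, or None if the request failed outright.
    """
    if status is None:
        return True  # connection refused or timeout counts as a breach
    if not 200 <= status < 300:
        return True
    return latency_ms > max_latency_ms


def probe(url: str, max_latency_ms: float, timeout_sec: float = 5.0) -> bool:
    """Issue one health request and return True if it is a breach."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        status = None  # no usable answer at all
    latency_ms = (time.monotonic() - start) * 1000
    return is_breach(status, latency_ms, max_latency_ms)
```

With the spec shown above, a guardrail loop would call `probe("http://localhost:8080/healthz", 500)` every 2 seconds and abort on the first True.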
Prometheus
Polls a PromQL query that returns a scalar or instant vector. Aborts when the result crosses the abort_above or abort_below threshold.
prometheus_url: "http://prom:9090"
query: 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))'
abort_above: 0.5
interval_sec: 5
Transient query errors are tolerated (a flaky scrape isn't a breach), but a sustained threshold violation aborts.
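As an illustration of the threshold check, the sketch below parses a response in the shape returned by Prometheus's instant-query HTTP API and compares each sample against the thresholds. `extract_values` and `crosses` are hypothetical names. Treating any single sample in a vector as sufficient to trip is an assumption, and the sketch does not model what counts as a "sustained" violation, which the text leaves unspecified.

```python
def extract_values(response: dict) -> list:
    """Pull float samples out of a Prometheus instant-query response body."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    data = response["data"]
    if data["resultType"] == "scalar":
        # Scalar results are a single [timestamp, "value"] pair.
        return [float(data["result"][1])]
    if data["resultType"] == "vector":
        # Vector results carry one [timestamp, "value"] pair per series.
        return [float(sample["value"][1]) for sample in data["result"]]
    raise ValueError(f"unsupported resultType: {data['resultType']}")


def crosses(values, abort_above=None, abort_below=None) -> bool:
    """Return True if any sample is beyond a configured threshold."""
    for value in values:
        if abort_above is not None and value > abort_above:
            return True
        if abort_below is not None and value < abort_below:
            return True
    return False
```

A caller that wants to tolerate transient errors could catch `ValueError` from `extract_values` and skip that sample instead of aborting.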
Deadman
Always on. When a fault is applied, the agent forks a detached process that sleeps for duration + grace seconds, then removes the fault unconditionally. It survives agent crashes and anything short of a full reboot, and a reboot wipes the qdisc anyway.
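The deadman pattern is easy to see in miniature: start a child in its own session so it outlives the parent, have it sleep, then clean up. The sketch below is a Python illustration of that pattern for a tc-based fault, not the agent's implementation; `deadman_argv` and `spawn_deadman` are hypothetical names, and using `sh -c` with `sleep` is an assumption made for brevity.

```python
import shlex
import subprocess


def deadman_argv(dev: str, duration_sec: int, grace_sec: int = 10) -> list:
    """Build the command a deadman would run: sleep, then delete the root qdisc."""
    delay = duration_sec + grace_sec
    return ["sh", "-c", f"sleep {delay}; tc qdisc del dev {shlex.quote(dev)} root"]


def spawn_deadman(dev: str, duration_sec: int, grace_sec: int = 10) -> subprocess.Popen:
    """Start the deadman detached from the caller's session and stdio."""
    return subprocess.Popen(
        deadman_argv(dev, duration_sec, grace_sec),
        start_new_session=True,  # calls setsid(): new session and process group
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Because the child sits in its own session, a signal delivered to the parent's process group does not reach it.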
Server-side caps
The control plane rejects experiments that exceed these limits before they reach an agent:
delay_ms ≤ 60,000
jitter_ms ≤ 10,000
duration_sec ≤ 3,600
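Checking a spec against these three caps on the client side avoids a round trip that is bound to be rejected. A minimal sketch, assuming the field names from the latency spec table; `CAPS` and `cap_violations` are hypothetical names, and the error message format is illustrative.

```python
# Caps taken from the list above; everything else here is illustrative.
CAPS = {"delay_ms": 60_000, "jitter_ms": 10_000, "duration_sec": 3_600}


def cap_violations(spec: dict) -> list:
    """Return a human-readable message for every capped field that is too large."""
    errors = []
    for field, limit in CAPS.items():
        value = spec.get(field, 0)
        if value > limit:
            errors.append(f"{field}={value} exceeds cap {limit}")
    return errors
```

An empty list means the spec is within the caps; the control plane remains the authority and may apply further checks.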
Workspace kill switch
POST /v1/kill aborts every running experiment in the workspace. Use it when something's gone wrong and you don't have time to find the right experiment ID.
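For scripting the kill switch, the request is a plain authenticated POST. The sketch below builds it with the standard library. `kill_request` is a hypothetical helper, using a session token (scs_ prefix) is an assumption based on the token table in the API reference, and the response body is not documented here, so the sketch does not parse it.

```python
import urllib.request


def kill_request(base_url: str, session_token: str) -> urllib.request.Request:
    """Build the POST /v1/kill request for a workspace."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/kill",
        data=b"",  # empty body; the endpoint's payload, if any, is not documented
        method="POST",
        headers={"Authorization": f"Bearer {session_token}"},
    )
```

Sending it is one call: `urllib.request.urlopen(kill_request("http://localhost:8080", token), timeout=10)`.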
Agent CLI
The chaos binary is both the CLI for one-off runs and the long-running daemon.
chaos agent
Run the agent connected to a control plane.
| Flag | Description |
|---|---|
| --control-plane URL | Control plane URL. |
| --token TOKEN | Enrollment token (one-time) or agent token (already enrolled). |
| --heartbeat DUR | Heartbeat interval. Default 5s. |
| --simulate | Run the full loop without touching the kernel. Safe for any host. |
chaos run latency
Run a one-off experiment locally — no control plane required.
| Flag | Description |
|---|---|
| --dev IFACE | Interface, default eth0. |
| --target HOST | Destination host[:port] or CIDR. Omit for whole-device. |
| --delay DUR | Added latency, e.g. 200ms. |
| --jitter DUR | Variance, e.g. 30ms. |
| --duration DUR | How long to hold the fault. |
| --grace DUR | Deadman grace window (default 10s). |
| --health-url URL | HTTP guardrail; aborts on non-2xx. |
| --health-max-latency DUR | Aborts if a health probe takes longer than this. |
| --prom URL | Prometheus base URL. |
| --query Q | PromQL query. |
| --abort-above N / --abort-below N | Threshold(s) for the query. |
| --interval DUR | Probe interval, default 5s. |
| --simulate | Walk through the pipeline without touching the kernel. |
| --dry-run | Print the tc plan and exit without applying. |
| -y / --yes | Skip confirmation prompt. |
Exit codes: 0 means success; 7 means aborted-by-slo (an SLO violation auto-aborted the experiment, which is useful as a CI signal that the system isn't resilient to this fault).
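In CI, the two documented exit codes can be turned into a pass/fail gate. A minimal sketch follows; `interpret_exit` and `run_gate` are hypothetical names, and treating any other exit code as "inconclusive" is an assumption, since only 0 and 7 are documented.

```python
import subprocess

EXIT_OK = 0
EXIT_ABORTED_BY_SLO = 7


def interpret_exit(code: int) -> str:
    """Map a chaos run exit code to a CI verdict."""
    if code == EXIT_OK:
        return "resilient"
    if code == EXIT_ABORTED_BY_SLO:
        return "not-resilient"
    return "inconclusive"  # undocumented code: treat as a tooling problem


def run_gate(argv: list) -> str:
    """Run a command (normally `chaos run ...`) and interpret its exit code."""
    result = subprocess.run(argv)
    return interpret_exit(result.returncode)
```

A pipeline step might call `run_gate(["chaos", "run", "latency", "--dev", "eth0", "--target", "10.0.3.21:5432", "--delay", "200ms", "--duration", "30s", "--health-url", "http://localhost:8080/healthz", "--yes"])` and fail the build on "not-resilient".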
chaos abort / chaos status
Inspect and forcibly tear down local experiments started with chaos run:
chaos status # show local state file
chaos abort # remove any active qdisc and clear state
API reference
Every authenticated endpoint takes a Bearer token in the Authorization header. There are three token types, each scoped differently:
| Type | Prefix | Issued by | Use |
|---|---|---|---|
| Session token | scs_ | signup/login | Dashboard and admin API. |
| Enrollment token | sce_ | POST /v1/enrollment-tokens | One-time, agent uses it to register. Revocable. |
| Agent token | sca_ | returned from register | Per-agent credential for heartbeat / poll / events. |
Auth
{ "email": "you@co.com", "password": "...", "workspace": "acme-prod" }
→ { "session_token": "scs_...", "workspace_id": "ws-1", "email": "you@co.com" }
{ "email": "you@co.com", "password": "..." }
→ { "session_token": "scs_...", "workspace_id": "ws-1", "email": "you@co.com" }
Enrollment tokens
Agents
{ "hostname": "...", "kernel": "...", "instance_id": "...", "device": "eth0", "version": "..." }
→ { "agent_id": "agt-1", "agent_token": "sca_..." }
Experiments
{ "agent_id": "agt-1", "spec": { "dev":"eth0", "target":"10.0.3.21:5432",
"delay_ms":200, "jitter_ms":30, "duration_sec":30, "grace_sec":10,
"interval_sec":2, "health_url":"http://app:8080/healthz", "health_max_latency_ms":500 } }
→ { "id":"exp-1", "status":"queued", ... }
Operations
Persistence
The default control-plane store is in-memory — fine for dev and the first alpha, but state is lost on restart. A Postgres backend that satisfies the same Store interface is on the roadmap; the schema is documented in STORAGE.md.
TLS and hosting
The control plane speaks plain HTTP. For production, front it with a TLS-terminating reverse proxy (nginx, Caddy, an ALB). Make sure your agents are configured to use the public URL.
Permissions
The agent calls tc, which requires CAP_NET_ADMIN. On a bare host, run as root or grant the capability:
setcap cap_net_admin+ep /usr/local/bin/chaos
In a container, use --cap-add=NET_ADMIN (or --privileged). In Kubernetes, set the capability in the pod's securityContext.
Troubleshooting
Agent registers but no experiments dispatch
The agent long-polls /v1/agents/{id}/next for 25s at a time. Make sure your reverse proxy / load balancer doesn't terminate long-lived requests sooner than that.
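The same constraint applies to any client you write against this endpoint: its read timeout has to exceed the 25-second long-poll window. The sketch below shows the idea. `next_work_request` and `client_timeout` are hypothetical names, the 5-second margin is arbitrary, and using GET is an assumption because the text does not name the HTTP method.

```python
import urllib.request

LONG_POLL_SEC = 25  # long-poll window stated above


def next_work_request(base_url: str, agent_id: str, agent_token: str) -> urllib.request.Request:
    """Build the long-poll request an agent would issue (GET is assumed)."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/agents/{agent_id}/next",
        headers={"Authorization": f"Bearer {agent_token}"},
    )


def client_timeout(margin_sec: int = 5) -> int:
    """Pick a read timeout safely above the server's long-poll window."""
    return LONG_POLL_SEC + margin_sec
```

Any proxy between agent and control plane needs an idle or read timeout at least as generous as `client_timeout()`.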
RTNETLINK answers: Operation not supported
The sch_netem kernel module isn't available. Verify with tc qdisc add dev lo root netem delay 10ms. On a slim container kernel, install linux-modules-extra-$(uname -r) or use a different base image.
Experiment ends instantly with aborted-by-slo
The guardrail tripped on the first sample. For HTTP health probes, check that the URL is reachable from the agent and that the endpoint actually returns 2xx when the system is healthy. For Prometheus, check that the query already returns a value below your threshold before injecting.
Agent shows as "offline" while still running
The agent token may have been revoked (workspace policy). The agent's register call will succeed on the next retry only if you give it a fresh enrollment token.
FAQ
Does this work on macOS?
The agent's real fault injection is Linux-only — macOS doesn't have tc/netem. The control plane and dashboard run natively on macOS; the agent compiles and runs on macOS only in --simulate mode. To validate real injection from a Mac, run the agent inside a Linux container (Docker Desktop / OrbStack / Lima) with --privileged or --cap-add=NET_ADMIN.
How is this different from Chaos Mesh / Litmus / Gremlin?
Straight Chaos is built primarily for service teams rather than platform teams. It is open core: the planned eBPF kernel-level fault engine is the differentiator, and the control plane (safety, scoring, audit, scheduling) is the commercial product. From day one you get a single binary, no Helm chart, no operator, no Kubernetes prerequisite, and an HTTP health probe as the default guardrail, so you don't need Prometheus to start.
What happens if the agent crashes during an experiment?
The deadman process is detached (setsid + own process group) and survives. It sleeps for duration + grace, then runs tc qdisc del regardless of agent state. The fault is removed even if the agent is gone.
Is my data isolated between workspaces?
Yes — every API call is workspace-scoped at the auth layer. An agent token grants access to one workspace. A session token is bound to one workspace. Cross-workspace reads / writes return 404. This is covered by the TestTenancyIsolation test in internal/server.
Can I use this without the control plane?
Yes — chaos run latency is fully usable on its own. It runs the same plan, the same guardrails, the same deadman, locally. The exit code (0 / 7) makes it a useful resilience gate in CI.