Docs

Run controlled chaos experiments against your services. Inject real network faults, watch a guardrail, abort automatically.

Install

The agent runs on Linux and needs CAP_NET_ADMIN to manipulate the kernel queueing discipline (tc/netem). The control plane is portable Go and runs anywhere.

Linux host

# single-line installer (puts /usr/local/bin/chaos in place)
curl -fsSL https://get.straightchaos.com | sh

# verify
chaos version

Docker

The agent needs network admin capabilities. Use one of:

# minimum permissions
docker run --rm \
  --cap-add=NET_ADMIN --network=host \
  straightchaos/agent:latest \
  agent --control-plane $CP --token $ENROLL_TOKEN

# or fully privileged (simpler, less isolated)
docker run --rm --privileged --network=host straightchaos/agent:latest \
  agent --control-plane $CP --token $ENROLL_TOKEN

--network=host is required. The agent shapes traffic on real network interfaces, not on Docker's bridge.

Kubernetes (DaemonSet)

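A reference manifest is in deploy/daemonset.yaml. The pod template needs the following (capabilities belong in the container's securityContext; hostNetwork belongs in the pod spec):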

securityContext:
  capabilities:
    add: ["NET_ADMIN"]
hostNetwork: true
env:
  - { name: CHAOS_CONTROL_PLANE, value: https://cp.example.com }
  - { name: CHAOS_TOKEN, valueFrom: { secretKeyRef: { name: chaos-enroll, key: token } } }

From source

git clone https://github.com/straightchaos/agent
cd agent
go build -o /usr/local/bin/chaos ./cmd/chaos
go build -o /usr/local/bin/control-plane ./cmd/control-plane

Go 1.22+ required. No external dependencies.

macOS: real fault injection is Linux-only — macOS doesn't have tc/netem. The control plane and dashboard run natively; for the agent, either run it in a Linux VM/container, or use --simulate for the wiring without kernel changes.

Quickstart

From zero to a first experiment in five steps.

1. Start the control plane

control-plane --addr :8080

The control plane keeps state in memory by default. See Operations for persistence.

2. Sign up via the dashboard

Open /dashboard.html, pick Sign up, choose a workspace name. You're now signed in.

3. Mint an enrollment token

Use the dashboard's + Connect button or Settings → Enrollment tokens. The dashboard shows the exact command to run on your host.

4. Start the agent

chaos agent --control-plane http://localhost:8080 --token sce_<your-token>

It registers, heartbeats, and shows up in the dashboard within a second or two.

5. Launch your first experiment

In the dashboard, configure a latency spec, pick a target (e.g. 10.0.3.21:5432 for your database), set the HTTP health check guardrail to your service's /healthz, and click Launch. You'll see live samples in the chart. If the health check fails, the agent rolls back automatically.

Tip: Add --simulate to the agent for a dry pass — the agent goes through the whole flow but doesn't touch the kernel. Useful for verifying your wiring before granting CAP_NET_ADMIN.

Concepts

Agent

A Go binary that runs on the host you want to test. Registers with the control plane, polls for work, executes faults, reports events back.

Control plane

HTTP API + workspace/auth model + dashboard. Holds experiments, dispatches them to agents, ingests events, exposes everything via REST.

Experiment

A single run of a fault against a single agent, with a guardrail and a hard duration. Has a status (queued → running → completed | aborted-by-slo | aborted-by-signal | error) and an event timeline.

Fault

The thing being injected. Today: latency, loss, and partition via tc/netem (scoped or device-wide); CPU burn and memory pressure inside the agent process; DNS block via iptables. Process kill is planned.

Guardrail

What watches your service while the fault is active: an HTTP health probe or a PromQL query. When the guardrail is breached, the fault is removed immediately.

Blast radius

How much traffic the fault affects: scoped (one destination host or CIDR) or whole device (all egress on the interface).

Deadman

A detached process that removes the fault unconditionally after duration + grace. Survives even if the agent crashes. Last line of defense.

Steady state

What "healthy" means for your service. Define it as an endpoint that returns 200, or a PromQL query that stays under a threshold.

Faults

Latency

Adds delay (with optional jitter) to outbound traffic. Implemented with tc queueing disciplines and the netem module.

Spec

Field	Type	Description
`dev`	string	Network interface, e.g. `eth0`.
`target`	string	Destination host, `host:port`, or CIDR. Empty = whole-device blast radius.
`delay_ms`	int	Added latency in milliseconds. Capped at 60,000.
`jitter_ms`	int	Variance around `delay_ms` (normal distribution). Capped at 10,000.
`duration_sec`	int	How long the fault is active. Capped at 3,600.
`grace_sec`	int	Extra time before the deadman fires (default 10s).

Scoped (one dependency)

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms 30ms distribution normal
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
   match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3

Only packets to that host:port are delayed; everything else flows unaffected.

Whole device (all egress)

tc qdisc add dev eth0 root handle 1: netem delay 200ms 30ms distribution normal

Loss

Drops a configurable fraction of outbound packets, optionally with correlation so drops cluster into bursts rather than uniformly random. Implemented with tc/netem.

Spec

Field	Type	Description
`dev`	string	Network interface, e.g. `eth0`.
`target`	string	Destination host, `host:port`, or CIDR. Empty = whole-device blast radius.
`loss_percent`	float	Percentage of packets to drop, (0, 100]. Above ~50% effectively partitions the path.
`loss_correlation`	float	0–100. `0` = uniform random loss. Higher values make consecutive drops more likely (burstier).
`duration_sec`	int	How long the fault is active. Capped at 3,600.
`grace_sec`	int	Deadman grace beyond duration (default 10s).

Scoped (one dependency)

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 5% 25%
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
   match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3

Whole device (all egress)

tc qdisc add dev eth0 root handle 1: netem loss 5%

From the CLI

chaos run loss --dev eth0 --target 10.0.3.21:5432 \
  --percent 5 --correlation 25 --duration 30s \
  --health-url http://localhost:8080/healthz

What this exposes. Latency tests timeout/retry/circuit-breaker behavior. Loss exposes idempotency: are retries safe? Do you have at-least-once semantics that survive duplicate delivery? Even small loss percentages (1–5%) on a high-RPS path will surface re-entrancy bugs that never show under latency.

Partition

Fully blocks egress traffic to a target (or the whole device) by dropping 100% of matching packets at the kernel queueing layer. Tests how a service behaves when a dependency becomes completely unreachable — the failover, retry, and circuit-breaker story under absence rather than degradation.

Spec

Field	Type	Description
`dev`	string	Network interface, e.g. `eth0`.
`target`	string	Destination host, `host:port`, or CIDR. Empty = whole-device blast radius (the host is effectively offline from this NIC).
`duration_sec`	int	How long the partition holds. Capped at 3,600.
`grace_sec`	int	Deadman grace beyond duration (default 10s).

Scoped (cut off one dependency)

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 100%
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
   match ip dst 10.0.3.21/32 match ip dport 5432 0xffff flowid 1:3

Traffic to that host:port disappears into the void; everything else flows unaffected.

Whole device (full network isolation)

tc qdisc add dev eth0 root handle 1: netem loss 100%

From the CLI

chaos run partition --dev eth0 --target 10.0.3.21:5432 \
  --duration 30s --health-url http://localhost:8080/healthz

What this exposes. Latency tests timeout settings. Loss tests retry safety. Partition tests failover — whether the service has a clear, exercised path to abandon a failed dependency. Services that "work" because they've never lost a backing service often fail spectacularly when one actually goes away. The default behavior of many HTTP clients under TCP-level black-holing is to hang for the full connect/read timeout, which is often minutes — a partition experiment surfaces that hang before production does.

CPU burn

Saturates CPU cores with tight spin loops, testing how a service behaves under compute starvation. Exposes autoscaling response time, GC pressure under contention, and whether request latency degrades gracefully or collapses.

Spec

Field	Type	Description
`cores`	int	Number of cores to saturate. `0` = all available cores (the default). Capped at 128.
`duration_sec`	int	How long to burn. Capped at 3,600.

Mechanism

The agent spawns one goroutine per core, each locked to an OS thread (runtime.LockOSThread) in a tight spin loop. On teardown, the child context is cancelled and all goroutines exit. No kernel objects persist — if the agent process dies, the burn stops instantly.

From the CLI

chaos run cpu --cores 4 --duration 60s \
  --health-url http://localhost:8080/healthz

What this exposes. CPU burn reveals whether a service has enough headroom to absorb a noisy-neighbor spike. Services running at 80%+ utilization in steady state often hit cascading failures under a CPU burn because the remaining 20% is consumed by GC, connection handling, and health checks. If the health probe trips, the service doesn't have enough margin.

Memory pressure

Allocates and pins a fixed amount of memory, testing OOM-killer behavior, swap thrash, GC pauses under heap contention, and whether memory-limited containers handle pressure gracefully.

Spec

Field	Type	Description
`megabytes`	int	MB to allocate and pin. Capped at 32,768 (32 GB).
`duration_sec`	int	How long to hold the allocation. Capped at 3,600.

Mechanism

The agent allocates a single byte slice of the requested size and touches every page (4 KB stride) to defeat lazy allocation and overcommit. This forces the OS to commit physical memory. On teardown, the slice is nil'd and runtime.GC() is called as a release hint. No kernel objects persist.
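
A minimal sketch of the allocate-and-touch step, assuming a hypothetical Pin function. It uses os.Getpagesize for the stride rather than a hard-coded 4 KB; the two are the same on most Linux hosts.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// Pin allocates the requested number of megabytes and writes one byte per page,
// so the kernel has to back every page with physical memory.
func Pin(megabytes int) []byte {
	buf := make([]byte, megabytes<<20)
	page := os.Getpagesize()
	for i := 0; i < len(buf); i += page {
		buf[i] = 1
	}
	return buf
}

func main() {
	buf := Pin(64)
	fmt.Printf("holding %d MB\n", len(buf)>>20)

	// Teardown: drop the only reference, then hint the runtime to collect.
	buf = nil
	runtime.GC()
}
```

The caller has to keep the returned slice reachable for the whole experiment; once the reference is dropped the runtime is free to reclaim it.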

From the CLI

chaos run memory --megabytes 512 --duration 60s \
  --health-url http://localhost:8080/healthz

What this exposes. Memory pressure reveals whether containers are properly limited (resources.limits.memory in K8s), whether the OOM killer targets the right process, and whether the service restarts cleanly or enters a crash loop. It also tests GC behavior — a Go service under heap pressure may see stop-the-world pauses that trip latency-based health checks even though the process isn't OOM'd.

DNS block

Drops outbound DNS queries for specific domain names, simulating a DNS outage for individual dependencies. Tests how a service behaves when it can't resolve a dependency's hostname — failover to cached records, circuit-breaker activation, graceful degradation vs hard crash.

Spec

Field	Type	Description
`domains`	[]string	Domain names to block. Up to 32 entries. E.g. `["api.stripe.com", "db.internal"]`.
`dev`	string	Optional: restrict to a specific interface (`-o <dev>`). Empty = all interfaces.
`duration_sec`	int	How long to block. Capped at 3,600.

Mechanism

The agent adds iptables rules that match outbound UDP port-53 packets containing the wire-encoded domain name (using -m string --algo bm --hex-string). Matching packets are dropped (-j DROP), causing the application's DNS resolver to time out. On teardown, the rules are removed with iptables -D.

An eBPF TC-BPF upgrade path exists in bpf/dns_block.c: a classifier that does the same matching inside the kernel at line rate, with per-CPU stats. The wire format and Spec interface are identical; swapping implementations doesn't affect the rest of the stack.
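
The "wire-encoded domain name" is the DNS label format: each label is prefixed with its length byte and the name ends with a zero byte. The sketch below shows one way to build the --hex-string pattern. The WireHex function is a hypothetical name, and the choices to lowercase the name and to include the terminating zero byte are assumptions, not confirmed agent behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// WireHex encodes a domain as length-prefixed DNS labels and returns it in the
// |hex| form accepted by the iptables string match.
func WireHex(domain string) (string, error) {
	name := strings.ToLower(strings.TrimSuffix(domain, "."))
	var b strings.Builder
	b.WriteString("|")
	for _, label := range strings.Split(name, ".") {
		if len(label) == 0 || len(label) > 63 {
			return "", fmt.Errorf("invalid label %q in %q", label, domain)
		}
		fmt.Fprintf(&b, "%02x", len(label)) // length prefix
		for i := 0; i < len(label); i++ {
			fmt.Fprintf(&b, "%02x", label[i])
		}
	}
	b.WriteString("00|") // root label terminates the name
	return b.String(), nil
}

func main() {
	hex, err := WireHex("api.stripe.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Illustrative rule only; the chain and rule position are assumptions.
	fmt.Printf("iptables -A OUTPUT -p udp --dport 53 -m string --algo bm --hex-string %q -j DROP\n", hex)
}
```

Note that the string match is case-sensitive, so a resolver that randomizes query case would not match a lowercase-only pattern.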

From the CLI

chaos run dns --domains api.stripe.com,db.internal \
  --duration 60s --health-url http://localhost:8080/healthz

What this exposes. DNS failures are one of the most common real-world outage triggers, yet they're rarely tested because tc/netem can't selectively target specific domains. A service that works fine when Postgres is slow might crash hard when it can't resolve the Postgres hostname. This fault surfaces stale DNS caches, missing TTL respect, and services that retry resolution in a tight loop (amplifying the failure). The eBPF version (future) will also expose DNS query volume via BPF map stats — useful for understanding resolution patterns under pressure.

Process kill

Coming. Targeted SIGKILL / SIGTERM of a named process, optionally recurring (keep killing if it respawns). Different mechanism from CPU/memory but slots in behind the same Spec interface.

Guardrails

Every running fault has a guardrail watching the system. The instant it trips, the agent removes the fault and reports aborted-by-slo.

HTTP health check (default)

Polls an endpoint on a fixed interval. Aborts on non-2xx or when the response takes longer than the configured maximum.

# in spec form
health_url: "http://localhost:8080/healthz"
health_max_latency_ms: 500
interval_sec: 2

Failed requests (connection refused, timeout) count as a breach — the experiment may be what broke the endpoint, so that's the safe interpretation.

Prometheus

Polls a PromQL query that returns a scalar or instant vector. Aborts when the value rises above abort_above or falls below abort_below.

prometheus_url: "http://prom:9090"
query: 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))'
abort_above: 0.5
interval_sec: 5

Transient query errors are tolerated (a flaky scrape isn't a breach), but a sustained threshold violation aborts.

Deadman

Always on. When a fault is applied, the agent forks a detached process that sleeps for duration + grace seconds, then removes the fault unconditionally. It survives agent crashes and panics in the agent's own process, anything short of a full reboot, and a reboot wipes the qdisc anyway.
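
A minimal sketch of how such a detached process can be started from Go on Linux, assuming a hypothetical SpawnDeadman function and a shell one-liner as the cleanup. The agent's real deadman may be implemented differently.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// SpawnDeadman starts a shell in its own session that sleeps, then runs the
// cleanup command. It returns the child's PID without waiting for it.
func SpawnDeadman(wait time.Duration, cleanup string) (int, error) {
	script := fmt.Sprintf("sleep %d; %s", int(wait.Seconds()), cleanup)
	cmd := exec.Command("sh", "-c", script)
	// Setsid puts the child in a new session and process group, so it is not
	// tied to the parent's terminal or process group.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	pid := cmd.Process.Pid
	// Release the handle: the parent never waits on the child.
	return pid, cmd.Process.Release()
}

func main() {
	// A real deadman would run something like "tc qdisc del dev eth0 root"
	// after duration + grace. A harmless command is used here instead.
	pid, err := SpawnDeadman(1*time.Second, "true")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("deadman started, pid", pid)
}
```

The child's output is discarded because no stdout or stderr is attached, which also keeps it independent of the parent's file descriptors.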

Server-side caps

The control plane rejects experiments that exceed these limits before they reach an agent:

delay_ms ≤ 60,000
jitter_ms ≤ 10,000
duration_sec ≤ 3,600

Workspace kill switch

POST /v1/kill aborts every running experiment in the workspace. Use it when something's gone wrong and you don't have time to find the right experiment ID.

Agent CLI

The chaos binary is both the CLI for one-off runs and the long-running daemon.

`chaos agent`

Run the agent connected to a control plane.

Flag	Description
`--control-plane URL`	Control plane URL.
`--token TOKEN`	Enrollment token (one-time) or agent token (already enrolled).
`--heartbeat DUR`	Heartbeat interval. Default `5s`.
`--simulate`	Run the full loop without touching the kernel. Safe for any host.

`chaos run latency`

Run a one-off experiment locally — no control plane required.

Flag	Description
`--dev IFACE`	Interface, default `eth0`.
`--target HOST`	Destination `host[:port]` or CIDR. Omit for whole-device.
`--delay DUR`	Added latency, e.g. `200ms`.
`--jitter DUR`	Variance, e.g. `30ms`.
`--duration DUR`	How long to hold the fault.
`--grace DUR`	Deadman grace window (default `10s`).
`--health-url URL`	HTTP guardrail; aborts on non-2xx.
`--health-max-latency DUR`	Aborts if a health probe takes longer than this.
`--prom URL`	Prometheus base URL.
`--query Q`	PromQL query.
`--abort-above N` / `--abort-below N`	Threshold(s) for the query.
`--interval DUR`	Probe interval, default `5s`.
`--simulate`	Walk through the pipeline without touching the kernel.
`--dry-run`	Print the tc plan and exit without applying.
`-y` / `--yes`	Skip confirmation prompt.

Exit codes: 0 success, 7 aborted-by-slo (an SLO violation auto-aborted the experiment; useful as a CI signal that the system isn't resilient to this fault).

`chaos abort` / `chaos status`

Inspect and forcibly tear down local experiments started with chaos run:

chaos status          # show local state file
chaos abort           # remove any active qdisc and clear state

API reference

Every endpoint other than signup and login takes a Bearer token in the Authorization header. Three token types, scoped differently:

Type	Prefix	Issued by	Use
Session token	`scs_`	signup/login	Dashboard and admin API.
Enrollment token	`sce_`	`POST /v1/enrollment-tokens`	One-time, agent uses it to register. Revocable.
Agent token	`sca_`	returned from `register`	Per-agent credential for heartbeat / poll / events.
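
Because each type has a fixed prefix, a client or log filter can tell them apart without calling the API. A small sketch, with TokenKind as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// TokenKind classifies a token by the prefixes listed in the table above.
func TokenKind(token string) string {
	switch {
	case strings.HasPrefix(token, "scs_"):
		return "session"
	case strings.HasPrefix(token, "sce_"):
		return "enrollment"
	case strings.HasPrefix(token, "sca_"):
		return "agent"
	default:
		return "unknown"
	}
}

func main() {
	for _, t := range []string{"scs_abc", "sce_abc", "sca_abc", "abc"} {
		fmt.Printf("%s -> %s\n", t, TokenKind(t))
	}
}
```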

Auth

POST /v1/auth/signup

{ "email": "you@co.com", "password": "...", "workspace": "acme-prod" }
→ { "session_token": "scs_...", "workspace_id": "ws-1", "email": "you@co.com" }

POST /v1/auth/login

{ "email": "you@co.com", "password": "..." }
→ { "session_token": "scs_...", "workspace_id": "ws-1", "email": "you@co.com" }

Enrollment tokens

GET /v1/enrollment-tokens

POST /v1/enrollment-tokens { "label": "prod-east" }

POST /v1/enrollment-tokens/{token}/revoke

Agents

POST /v1/agents/register (uses enrollment token)

{ "hostname": "...", "kernel": "...", "instance_id": "...", "device": "eth0", "version": "..." }
→ { "agent_id": "agt-1", "agent_token": "sca_..." }

GET /v1/agents

DELETE /v1/agents/{id}

POST /v1/agents/{id}/heartbeat (agent token)

GET /v1/agents/{id}/next (long-poll, 25s)

Experiments

POST /v1/experiments

{ "agent_id": "agt-1", "spec": { "dev":"eth0", "target":"10.0.3.21:5432",
  "delay_ms":200, "jitter_ms":30, "duration_sec":30, "grace_sec":10,
  "interval_sec":2, "health_url":"http://app:8080/healthz", "health_max_latency_ms":500 } }
→ { "id":"exp-1", "status":"queued", ... }

GET /v1/experiments

GET /v1/experiments/{id}

POST /v1/experiments/{id}/abort

POST /v1/experiments/{id}/events (agent token)

POST /v1/kill (workspace-wide)

Operations

Persistence

The default control-plane store is in-memory — fine for dev and the first alpha, but state is lost on restart. A Postgres backend that satisfies the same Store interface is on the roadmap; the schema is documented in STORAGE.md.

TLS and hosting

The control plane speaks plain HTTP. For production, front it with a TLS-terminating reverse proxy (nginx, Caddy, an ALB). Make sure your agents are configured to use the public URL.

Permissions

The agent calls tc, which requires CAP_NET_ADMIN. On a bare host, run as root or grant the capability:

setcap cap_net_admin+ep /usr/local/bin/chaos

In a container, use --cap-add=NET_ADMIN (or --privileged). In Kubernetes, add the capability in the agent container's securityContext.

Troubleshooting

Agent registers but no experiments dispatch

The agent long-polls /v1/agents/{id}/next for 25s at a time. Make sure your reverse proxy / load balancer doesn't terminate long-lived requests sooner than that.

`RTNETLINK answers: Operation not supported`

The sch_netem kernel module isn't available. Verify with tc qdisc add dev lo root netem delay 10ms, then remove the test qdisc with tc qdisc del dev lo root. On a slim cloud or container-host kernel, install linux-modules-extra-$(uname -r) (Ubuntu) or use a different base image.

Experiment ends instantly with `aborted-by-slo`

The guardrail tripped on the first sample. For HTTP health probes, check that the URL is reachable from the agent and that the endpoint actually returns 2xx when the system is healthy. For Prometheus, check that the query already returns a value on the safe side of your threshold (below abort_above, above abort_below) before injecting.

Agent shows as "offline" while still running

The agent token may have been revoked (workspace policy). The agent retries registration, but the retry succeeds only if you give it a fresh enrollment token.

FAQ

Does this work on macOS?

The agent's real fault injection is Linux-only — macOS doesn't have tc/netem. The control plane and dashboard run natively on macOS; the agent compiles and runs on macOS only in --simulate mode. To validate real injection from a Mac, run the agent inside a Linux container (Docker Desktop / OrbStack / Lima) with --privileged or --cap-add=NET_ADMIN.

How is this different from Chaos Mesh / Litmus / Gremlin?

Straight Chaos is built for a service-team primary user, not a platform team. Open core: the eBPF kernel-level fault engine (planned) is the moat; the control plane (safety/scoring/audit/scheduling) is the SKU. Day-one: a single binary, no Helm chart, no operator, no Kubernetes prerequisite, and an HTTP health probe as the default guardrail so you don't need Prometheus to start.

What happens if the agent crashes during an experiment?

The deadman process is detached (setsid + own process group) and survives. It sleeps for duration + grace, then runs tc qdisc del regardless of agent state. The fault is removed even if the agent is gone.

Is my data isolated between workspaces?

Yes — every API call is workspace-scoped at the auth layer. An agent token grants access to one workspace. A session token is bound to one workspace. Cross-workspace reads / writes return 404. This is covered by the TestTenancyIsolation test in internal/server.

Can I use this without the control plane?

Yes — chaos run latency is fully usable on its own. It runs the same plan, the same guardrails, the same deadman, locally. The exit code (0 / 7) makes it a useful resilience gate in CI.
