Build a 'Process Roulette' Stress Tester to Learn OS Process Management
Build a Process Roulette stress tester to learn OS signals, process isolation, and resilience with hands-on chaos tests and observability.
Turn Fear of Crashes into a Career-Grade Skill
If you’re a student or developer who’s worried about writing software that falls apart in production, you’re not alone. Employers want engineers who ship resilient systems, but most courses barely scratch the surface of process management, OS signals, or how to debug sudden crashes. This guided dev project — building a Process Roulette stress tester — gives you hands-on experience recreating controlled process-killing scenarios so you learn process isolation, signal handling, and resilience patterns employers actually care about in 2026.
The learning outcome — what you’ll gain
- Understand how UNIX-like OSes handle signals (SIGTERM, SIGKILL, SIGSTOP, etc.) and process isolation.
- Implement robust signal handlers and graceful shutdown logic in an application.
- Build a reproducible test harness that randomly kills processes to test resilience.
- Use modern observability tools (eBPF, Prometheus) to measure effects of faults.
- Write tests that you can add to CI to continuously validate app robustness.
Why this matters in 2026: trends that make this project relevant
- Chaos engineering is mainstream. By 2026 more teams run automated fault-injection in CI pipelines (Chaos Mesh, Gremlin-style workflows) and expect candidate engineers to know basic fault injection concepts.
- Observability has advanced with eBPF. eBPF-driven tooling is now standard and makes it easier to trace process behavior during faults without heavy instrumentation; see observability for edge agents for patterns you can adapt.
- Edge and microservices growth means process-level failures are common; resilient local services are easier to debug and support than opaque remote ones.
- AI-assisted debugging helps write tests and triage logs, but you still need practical scenarios and signal-aware code to validate the AI’s suggestions.
Project overview: What is "Process Roulette"?
Process Roulette is a local test harness that spawns a set of test processes (agents) and randomly sends signals to them according to configurable scenarios. The goal is not to crash your machine but to create repeatable, controllable crash scenarios so you can:
- Observe how your program behaves under abrupt termination.
- Practice writing signal handlers for graceful shutdown, cleanup, and checkpointing.
- Integrate resilience patterns like supervisors and restart policies.
Safety, ethics, and prerequisites
Safety first: Run Process Roulette only in controlled environments — a dedicated VM, container, or a disposable development machine. Do not run it on a production host, shared servers, or machines with irreplaceable data. Use explicit confirmation prompts before triggering destructive scenarios.
Prerequisites:
- Linux or macOS (examples use POSIX signals) — Linux preferred for eBPF integration.
- Python 3.10+ or Node.js for the controller; small agent app in Python or Go.
- Basic Docker experience for isolating experiments.
- Optional: eBPF tooling (bpftrace), Prometheus, Grafana for metrics/observability.
Step 1 — Design the test architecture
Keep it simple and modular. Build two components:
- Controller: Spawns agents, tracks PIDs, chooses targets, and sends signals. Provides scenario configuration and a UI or CLI for toggling modes.
- Agent: Small test apps that perform realistic work (processing a queue, writing to local state, or handling client requests) and implement various shutdown behaviors for testing.
Scenarios you’ll implement:
- Random kills: pick a running PID and send SIGKILL/SIGTERM at random intervals.
- Patterned kills: kill the leader in a group to test leader election or restart.
- Resource pressure: stop and resume processes (SIGSTOP/SIGCONT) to simulate stalls, as sketched after this list.
- Graceful vs abrupt: compare SIGTERM (graceful) and SIGKILL (forceful) responses.
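The resource-pressure scenario needs nothing beyond os.kill: send SIGSTOP to freeze the target, wait, then send SIGCONT. A minimal sketch, where the target PID and stall length are placeholders you would wire into the controller:

# stall_scenario.py (sketch): pause and resume a process to simulate a stall
import os
import signal
import time

def stall(pid: int, seconds: float = 3.0) -> None:
    """Freeze a process, hold it, then let it continue where it left off."""
    os.kill(pid, signal.SIGSTOP)   # process is paused, not terminated
    time.sleep(seconds)            # simulated stall / scheduling gap
    os.kill(pid, signal.SIGCONT)   # process resumes; no state is lost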
Step 2 — Build a minimal agent (Python)
The agent demonstrates realistic behavior: a worker that writes progress to a log, periodically checkpoints state, and handles termination signals.
# agent.py
import os
import signal
import time
import json

state_file = f"agent-{os.getpid()}.json"
progress = {"count": 0}
stop_flag = False

def save_state():
    with open(state_file, "w") as f:
        json.dump(progress, f)

def handle_term(signum, frame):
    global stop_flag
    print(f"Agent {os.getpid()} received signal {signum}, graceful shutdown...", flush=True)
    save_state()
    stop_flag = True

# Register handlers for SIGTERM and SIGINT. Note SIGKILL cannot be caught.
signal.signal(signal.SIGTERM, handle_term)
signal.signal(signal.SIGINT, handle_term)

print(f"Agent started PID={os.getpid()}")

try:
    while not stop_flag:
        progress['count'] += 1
        if progress['count'] % 5 == 0:
            save_state()
            print(f"Agent {os.getpid()} checkpointed: {progress}", flush=True)
        time.sleep(1)
except Exception as e:
    print(f"Agent {os.getpid()} crashed: {e}")
    save_state()
    raise

print(f"Agent {os.getpid()} exiting, final state: {progress}")
Key learning: SIGTERM allows cleanup and checkpointing; SIGKILL (9) cannot be trapped and forces abrupt termination.
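You can verify the second half of that directly: on POSIX systems, CPython refuses to install a handler for SIGKILL (or SIGSTOP) and raises OSError. A quick check:

# sigkill_check.py (sketch): SIGKILL handlers cannot be registered at all
import signal

try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
except OSError as e:
    # Typically "[Errno 22] Invalid argument"; the kernel reserves SIGKILL and SIGSTOP
    print(f"Cannot trap SIGKILL: {e}")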
Step 3 — Build the controller (Python CLI)
Controller responsibilities:
- Start N agents as subprocesses and store their PIDs.
- Implement scenarios: random, leader-only, pattern-based.
- Send signals using os.kill(pid, signum).
# controller.py
import subprocess
import os
import time
import random
import signal

NUM_AGENTS = 4
agents = []

# Spawn agents
for _ in range(NUM_AGENTS):
    p = subprocess.Popen(["python3", "agent.py"])  # keep agent.py next to controller
    agents.append(p)
    time.sleep(0.2)

print("Spawned agents:", [p.pid for p in agents])

try:
    for _ in range(60):  # run a 60-second test
        time.sleep(1)
        if random.random() < 0.2:
            target = random.choice(agents)
            if target.poll() is None:  # still running
                sig = random.choice([signal.SIGTERM, signal.SIGKILL])
                print(f"Controller sending {sig} to {target.pid}")
                os.kill(target.pid, sig)
        # Optional: restart killed agents instantly for supervisor testing
        for i, p in enumerate(agents):
            if p.poll() is not None:
                print(f"Restarting agent slot {i}")
                agents[i] = subprocess.Popen(["python3", "agent.py"])
except KeyboardInterrupt:
    print("Controller exiting, killing all agents")
    for p in agents:
        try:
            os.kill(p.pid, signal.SIGTERM)
        except Exception:
            pass
Run the controller inside a Docker container or a disposable VM to avoid accidental damage. This controller is intentionally minimal so you can iterate and add features: a UI, scenario files, kill schedules, or a REST API.
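If you add scenario files, a small JSON format plus a loader keeps the controller dependency-free. The format below is purely illustrative, not a standard:

# scenario_loader.py (sketch): illustrative scenario-file format and loader
# Example scenario.json:
#   {"duration_s": 60, "kill_probability": 0.2, "signals": ["SIGTERM", "SIGKILL"], "seed": 42}
import json
import signal

def load_scenario(path: str) -> dict:
    with open(path) as f:
        cfg = json.load(f)
    # Resolve signal names to signal numbers once, at load time
    cfg["signals"] = [getattr(signal, name) for name in cfg.get("signals", ["SIGTERM"])]
    return cfg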
Step 4 — Extend agents and scenarios for teaching points
Make agents exhibit different behaviors to learn how various patterns respond:
- Stateless worker: resumes work immediately — tests idempotence and external state stores.
- Stateful worker with checkpointing: saves progress to disk and restores it on restart (the Step 2 agent handles the save; the restore half is sketched after this list).
- Leader election example: a small Raft or bully-style demo where killing the leader tests election logic.
- Resource-leaking agent: simulates leaks to combine process-kill with resource exhaustion scenarios.
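The Step 2 agent writes checkpoints but never reads them back. A restore step on startup could look like the sketch below; note that the agent derives its state-file name from the PID, which changes on restart, so a stable path (CLI argument or environment variable) is assumed here:

# restore.py (sketch): reload the last checkpoint on startup
# Assumes a stable state-file path; the Step 2 agent's PID-based name changes on restart.
import json
import os

def load_progress(state_file: str) -> dict:
    if os.path.exists(state_file):
        with open(state_file) as f:
            return json.load(f)   # resume from the last checkpoint
    return {"count": 0}           # first run (or checkpoint lost): start fresh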
Step 5 — Observe, measure, and debug
Collect these signals and artifacts during tests:
- Stdout/stderr logs from agents and controller (structured JSON preferred).
- Checkpoint files and timestamps to verify integrity after abrupt termination.
- Metrics: restart rate, average recovery time, number of lost operations.
- System traces: use strace, pstack, lsof as needed; for lower-overhead inspection, use eBPF-based traces (bpftrace) to see which syscalls and signals coincide with failures.
Example eBPF-based check (conceptual): use a simple bpftrace script to count exit() syscalls or track signals received over time and plot them in Grafana.
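On the metrics side, exposing a Prometheus endpoint from the agent takes only a few lines with the prometheus_client package (an optional dependency; the metric names here are just examples):

# metrics.py (sketch): optional Prometheus instrumentation for the agent
# Requires the prometheus_client package; metric names are illustrative.
from prometheus_client import Counter, start_http_server

CHECKPOINTS = Counter("agent_checkpoints_total", "Checkpoints written by this agent")
SIGNALS_SEEN = Counter("agent_signals_total", "Termination signals received by this agent")

start_http_server(8000)   # serves /metrics on port 8000 for Prometheus to scrape

# In the agent: call CHECKPOINTS.inc() after save_state() and
# SIGNALS_SEEN.inc() inside handle_term(), then chart both in Grafana.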
Step 6 — Hardening your real app: patterns to implement
After you run tests, implement defenses in your app and re-run Process Roulette. Key resilience patterns:
- Signal handlers and graceful shutdown: on SIGTERM finish in-flight work, persist state, and close connections. Always test that handlers are idempotent and safe under concurrency.
- Supervision and restart policies: use a supervisor (systemd, supervisord) with restart backoff to avoid crash loops; a minimal backoff loop is sketched after this list.
- Idempotence and retry semantics: design operations to be safe to retry after a crash (use unique request IDs and dedup tables).
- Externalize critical state: store durable state in external services (databases, object stores) and use transaction logs if necessary.
- Health checks and circuit breakers: integrate liveness/readiness probes in containers so orchestrators can make smart restart decisions.
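The supervisor does not have to be systemd to teach the pattern. A sketch of restart-with-exponential-backoff you could bolt onto the controller (the initial delay, cap, and restart limit are arbitrary choices):

# supervisor.py (sketch): restart a child with exponential backoff
# The initial delay, cap, and restart limit are arbitrary; tune them to your SLO.
import subprocess
import time

def supervise(cmd: list[str], max_restarts: int = 5) -> None:
    """Run cmd, restarting after each exit with exponentially growing delays."""
    delay = 0.5
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        proc.wait()                        # block until the child exits (cleanly or not)
        if restarts >= max_restarts:
            print("giving up after repeated crashes")
            return
        restarts += 1
        print(f"child exited; restart {restarts}/{max_restarts} in {delay:.1f}s")
        time.sleep(delay)
        delay = min(delay * 2, 30.0)       # double the wait, capped, to avoid tight crash loops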
Step 7 — Integrate Process Roulette into CI
Short, reproducible scenarios are CI-friendly. Create a lightweight test that runs for 30–60s and asserts the application recovers within an SLO (e.g., 10s). Use containers to isolate CI runners and ensure determinism. In 2026, it's common to gate merges with basic chaos tests so bugs are caught early — integrate with cloud-native orchestration and CI pipelines to automate these checks.
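Expressed as an ordinary pytest case, such a gate can start an agent, kill it abruptly, and assert that it comes back within the budget. The 10-second SLO and the stand-in restart below are assumptions from this guide, not fixed conventions:

# test_resilience.py (sketch): minimal CI chaos test with pytest
# Assumes agent.py from Step 2 sits next to the test; the 10s SLO is an example budget.
import signal
import subprocess
import time

def test_agent_recovers_within_slo():
    proc = subprocess.Popen(["python3", "agent.py"])
    time.sleep(2)                             # let it do some work and checkpoint
    proc.send_signal(signal.SIGKILL)          # abrupt termination, no cleanup possible
    proc.wait(timeout=5)

    start = time.monotonic()
    restarted = subprocess.Popen(["python3", "agent.py"])   # stand-in for a supervisor restart
    time.sleep(1)                             # give it a moment to come up
    assert restarted.poll() is None, "restarted agent died immediately"
    assert time.monotonic() - start < 10, "recovery exceeded the 10s SLO"
    restarted.terminate()
    restarted.wait(timeout=5)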
Step 8 — Advanced extensions (for portfolios and interviews)
Turn this into a showcase project by adding:
- Multiple failure injection types: network partitions, CPU throttling, disk fill (simulate with quota), and resource limits using cgroups.
- Visualization: Grafana dashboards showing restarts, recovery times, and error rates during tests.
- Automated reproducibility: record the seed for the RNG used by the controller so you can rerun exact sequences of kills (see the seeding sketch after this list).
- Distributed extension: run agents in several containers or pods and orchestrate targeted failures (kill leader pod) to simulate microservice failures.
- Integrate with Chaos Mesh or Litmus for Kubernetes-based experiments if you already work with k8s.
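Reproducibility mostly comes down to seeding and logging. A sketch for the controller, assuming the seed arrives as the first CLI argument (or is generated and printed when absent):

# Reproducible kill sequences (sketch): seed the RNG and log the seed with every run
import random
import sys
import time

seed = int(sys.argv[1]) if len(sys.argv) > 1 else int(time.time())
print(f"RNG seed for this run: {seed}")   # keep this in the test artifacts to replay the run
rng = random.Random(seed)                 # use rng.random() / rng.choice() throughout the controller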
Debugging tips and tools — what to run when things go wrong
- ps aux, pstree: inspect process hierarchies and find orphans.
- strace -f -p PID: trace syscalls and see what a hung process is doing.
- gdb --pid PID or coredump analysis: inspect memory and stack traces post-mortem.
- journalctl / systemd logs: check supervisor messages and restart events.
- eBPF (bcc, bpftrace): low-overhead tracing to determine which syscalls or kernel events coincide with failures.
- Prometheus/Grafana: chart restart frequency, error rates, and request latency before/during/after faults.
Case study: What students learned after adding Process Roulette to their portfolio
"We added a small 'resilience lab' to our capstone: a microservice with a controller that ran reproducible chaos tests. Employers asked specifically about our restart policy and how we validated recovery — it became a key talking point in interviews." — Teaching assistant, Systems Lab, 2025
Students who complete this project show concrete artifacts: test scripts, dashboards, and commits demonstrating bug fixes. This kind of applied resilience work is increasingly expected in SRE and backend roles in 2026.
Example rubric to evaluate resilience (use in labs or self-assessments)
- Agent gracefully persists state on SIGTERM. (Pass/Fail)
- System restarts killed agents automatically without data loss. (Pass/Fail)
- Recovery time objective (RTO): app returns to normal within defined window after failure. (Numeric)
- CI gate: a short chaos test runs in CI on each PR. (Yes/No)
- Observability: metrics and logs allow you to diagnose the root cause within X minutes. (Numeric)
Common pitfalls and how to avoid them
- Never rely on SIGKILL for cleanup — it’s uncatchable. Design systems to survive abrupt loss.
- Avoid heavy logic inside signal handlers — only set flags, persist minimal state, and return quickly to avoid race conditions.
- Watch for partial restarts: if a restarted process replays messages or re-acquires resources without coordination, you can get duplicate side effects. Use idempotency or transactional semantics; a minimal dedup sketch follows this list.
- Keep experiments reproducible — log the RNG seed, timings, and environment so bugs are debuggable by reviewers.
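For the duplicate-side-effect pitfall, a dedup table can start as a persisted set of processed operation IDs. A sketch, assuming each operation carries a unique op_id (the JSON storage is illustrative only):

# dedup.py (sketch): idempotent processing via a persisted set of operation IDs
# Assumes every operation has a unique op_id; JSON storage is illustrative only.
import json
import os

SEEN_FILE = "processed-ops.json"

def load_seen() -> set[str]:
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def process_once(op_id: str, seen: set[str]) -> bool:
    """Return True if the operation ran, False if it was a duplicate after a restart."""
    if op_id in seen:
        return False               # already applied before the crash; skip the side effect
    # ... perform the side effect here ...
    seen.add(op_id)
    # Note: a crash between the side effect and this write can still allow one replay;
    # true exactly-once semantics need a transaction spanning both.
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)
    return True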
How this skill maps to jobs and interviews
Employers hiring for backend, SRE, and platform roles are asking for developers who know how to reason about failures. The Process Roulette project gives you:
- Concrete code examples for your portfolio (signal handling, checkpointing, restart policies).
- Metrics & dashboards to show measurable improvements after fixes (great for interviews).
- Talking points for system design: how to build for graceful degradation and robust recovery.
Putting it on your resume — quick bullets
- Implemented a Process Roulette fault-injection harness (Python) to validate graceful shutdown and restart behaviors under SIGTERM and SIGKILL.
- Integrated observability (Prometheus + Grafana + eBPF sampling) to reduce mean time to recovery by X% in tests.
- Added chaos tests to CI to ensure restart resilience across releases.
Further reading and tools to try in 2026
- Chaos engineering frameworks: Chaos Mesh, Litmus, Gremlin (commercial) — good for Kubernetes / production scenarios.
- eBPF tools: bpftrace, libbpf, and observability products that leverage eBPF for low-overhead tracing.
- CRIU (Checkpoint/Restore In Userspace): explore checkpointing entire process trees for advanced state migration tests.
- Supervisors: systemd unit options (Restart=, RestartSec=) and container restart policies for production hardening.
Actionable checklist (start this weekend)
- Clone a fresh VM or create a Docker container to isolate experiments.
- Implement the agent.py and controller.py shown above and run a 60-second test.
- Observe agent checkpoint files and update handlers to be idempotent.
- Integrate a minimal Prometheus metrics endpoint into the agent and chart restarts in Grafana.
- Record a deterministic random seed and rerun to reproduce a specific failure scenario, then fix the bug and show improved metrics.
Final takeaways
Process Roulette is a compact, high-impact project you can complete in a weekend that teaches core OS-level resilience principles employers want in 2026: signal handling, process supervision, observability, and reproducible fault injection. You’ll move from theoretical concepts to concrete artifacts — scripts, dashboards, and tests — that demonstrate your ability to design robust systems.
Call to action
Ready to build this for your portfolio? Start a repo, implement the minimal controller and agent from this guide, and add one measurable improvement (e.g., reduced recovery time). Share your repo link with peers or mentors, and tag the project with "Process Roulette" so future employers can find it. If you want a structured learning path, consider building a small capstone that integrates chaos tests into CI and documents the before/after metrics — that’s the kind of hands-on evidence that turns interviews into offers.
Related Reading
- Patch Orchestration Runbook: Avoiding the 'Fail To Shut Down' Scenario at Scale
- Observability for Edge AI Agents in 2026: Queryable Models, Metadata Protection and Compliance-First Patterns
- Why Cloud-Native Workflow Orchestration Is the Strategic Edge in 2026
- Analytics Playbook for Data-Informed Departments
- Performance Anxiety to Pro Player: Vic Michaelis’ Path and How Tabletop Creators Can Monetize Their Growth
- Responsible Meme Travel: Turning the ‘Very Chinese Time’ Trend into Respectful Neighborhood Guides
- Patch Notes Explainer: Nightreign 1.03.2 in 10 Minutes
- Phone plans for frequent flyers: when a UK traveller should choose T-Mobile-style price guarantees or local eSIMs
- Rebuilding Lost Islands: How to Archive and Recreate Deleted Animal Crossing Worlds