Reel — VCR for LLM APIs

Record real calls to OpenAI, Anthropic, and Gemini once, then replay them deterministically in tests, CI, and your local dev loop — for free, forever. No mocks. No SDK changes. No real network in CI. No surprise bills.

Reel is a local HTTP proxy that sits between your code and the LLM provider. On first call it forwards upstream and captures the wire-level request/response. On every call after, it replays from disk in ~3 ms. Cassettes are plain JSONL — you can grep them, jq them, git diff them in PRs. Secrets and PII are scrubbed at capture time.
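
Because it sits at the HTTP layer, any client can talk to it: an SDK, a CLI agent, or a raw curl call. A hedged sketch (port and endpoint match the 30-second demo below; the body is a standard OpenAI chat-completions request):

# First run goes upstream and gets recorded; repeat it and Reel answers from the cassette
curl http://127.0.0.1:7878/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "gpt-5", "messages": [{"role": "user", "content": "Say hi"}]}'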

Install and record your first cassette in 5 minutes · GitHub repo


See it in action — Claude Opus demo

The same claude -p job run three times against three real markdown docs. First run records and pays real Opus tokens. Runs 2 and 3 serve every call from disk in 2-3 ms and pay nothing.

Per-call latency in this exact run, from the proxy log:

Run        | Call 1  | Call 2  | Call 3
1 (record) | 1865 ms | 1708 ms | 2183 ms
2 (replay) | 2 ms    | 2 ms    | 3 ms
3 (replay) | 2 ms    | 2 ms    | 3 ms

Output bytes are identical across runs. Cassette stays at 3 entries — replay never re-records. Reproduce locally with the bundled opus-demo.sh script.


Why this exists

LLM tests are flaky and expensive. A pytest suite that calls OpenAI().chat.completions.create(...) in 40 tests bills real money on every CI run — multiply by every PR push, every retry, every contributor. With Reel: record once locally, commit the cassette, run CI with pytest --reel-mode replay for $0.
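
What that looks like in a test: a minimal sketch, assuming the official OpenAI Python SDK and the OPENAI_BASE_URL / OPENAI_API_KEY exports from the 30-second demo below (the model name and assertion are illustrative):

# test_summary.py: served from the committed cassette when run with pytest --reel-mode replay
from openai import OpenAI

def test_summarize():
    # The client picks up OPENAI_BASE_URL=http://127.0.0.1:7878/v1 from the environment;
    # in replay mode the key never reaches a real provider.
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Summarize: Reel replays LLM calls from disk."}],
    )
    assert resp.choices[0].message.content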

Production bugs in LLM responses are impossible to reproduce. A user reports a weird answer; you have logs but no way to replay the exact call from a different machine. Reel cassettes are portable byte-for-byte recordings of what the upstream actually returned.

Prompt iteration burns tokens on every tweak. A two-hour prompt-engineering session might re-spend the same prompt 100 times. Reel makes each unique prompt cost real money exactly once.

AI coding agents are slow. Aider, opencode, Claude Code, Cursor, Codex CLI — most of them send the same file context, tool definitions, and embeddings to the LLM many times during a session. Reel caches the deterministic parts.

30-second demo

# 1. Install
pip install reel-vcr

# 2. Start Reel in auto mode (records first time, replays after)
reel auto --cassette tests/cassettes/quickstart.jsonl &

# 3. Point your SDK at it
export OPENAI_BASE_URL=http://127.0.0.1:7878/v1
export OPENAI_API_KEY=sk-...   # real key — Reel forwards it on first run only

# 4. Run your code. First run records. Every run after replays.
python my_app.py

That's it. The cassette is plain JSONL:

{"id":"req_01","provider":"openai","endpoint":"/v1/chat/completions",
 "request":{"model":"gpt-5","messages":[...]},
 "response":{"status":200,"body":{...}}}

Diff cassettes in PRs. Grep them. Share them. They're regular files.
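
For example (field names as in the sample entry above; plain jq and grep, nothing Reel-specific):

# Count recorded calls per model
jq -r '.request.model' tests/cassettes/quickstart.jsonl | sort | uniq -c

# Pull out every entry that hit the chat-completions endpoint
grep '"endpoint":"/v1/chat/completions"' tests/cassettes/quickstart.jsonl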

How Reel compares

Tool | Layer | Non-Python clients? | SSE streaming? | Survives SDK transport swaps?
Reel | HTTP proxy | ✅ Yes (any language) | ✅ With timing fidelity | ✅ Yes — transport-agnostic
VCR.py / pytest-recording / pytest-vcr | Monkey-patches urllib3 / requests | ❌ Python only | Partial | ❌ Breaks when SDK changes transport
respx / pytest-httpx | Mocks httpx clients | ❌ Python only | Limited | ❌ Coupled to httpx
llm-test-harness | Wraps the SDK in Python (harness.wrap(...)) — bundles eval scoring | ❌ Python only | Limited | ❌ Coupled to specific SDK clients
agent-vcr | Records JSON-RPC for MCP servers (different layer entirely) | n/a — MCP, not LLM HTTP | n/a | n/a
WireMock / MockServer | HTTP proxy (Java) | ✅ Yes | Manual fixtures | ✅ Generic, not LLM-aware
Hand-rolled mocks | Inside your code | Whatever you write | Whatever you write | ❌ Whenever you forget to update them

The trade-off: VCR.py is easier to drop into a single test in a single file. Reel is easier to use across a whole project and any client that respects the standard OPENAI_BASE_URL / ANTHROPIC_BASE_URL env-var convention — including non-Python clients like Cursor and Aider.

What works today

  • OpenAI, Anthropic, Gemini HTTP APIs with path-based routing on a single proxy port
  • Any OpenAI-compatible upstream: Ollama, NVIDIA NIM, vLLM, LM Studio, Groq, Together, OpenRouter
  • Three modes: record, replay, auto
  • SSE streaming captured chunk-by-chunk with millisecond timing fidelity (--timing realtime | fast | slow=N)
  • Smart matching: exact, normalized, ignore-fields, fuzzy (sentence-transformers cosine similarity)
  • Capture-time redaction of API keys, Bearer tokens, AWS keys, GitHub PATs, emails, US phone numbers
  • First-class pytest plugin: pytest --reel-mode replay for zero-network CI
  • Analytics CLI: reel inspect / cost / diff / stats / doctor (see the sketch after this list)
  • Local web inspector: reel ui for browsing cassettes in a browser (Starlette + HTMX, no JS build step)
  • Pre-commit hook that refuses to commit cassettes still containing detectable secret patterns
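
A hedged sketch of how those commands fit together, using the quickstart cassette path (exact flags, arguments, and output are assumptions; check the project docs for the real interface):

# Replay a recorded cassette, preserving the original SSE chunk timing
reel replay --cassette tests/cassettes/quickstart.jsonl --timing realtime &

# Inspect what a cassette contains and what the recorded calls cost
reel inspect tests/cassettes/quickstart.jsonl
reel cost tests/cassettes/quickstart.jsonl

# Browse cassettes in the local web UI
reel ui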

Get started

Install and record your first cassette in 5 minutes


Frequently asked

Is this just VCR.py with extra steps? No. VCR.py monkey-patches Python HTTP clients. When OpenAI or Anthropic ship a new SDK with a different transport, VCR.py silently breaks. Reel is an HTTP proxy — it sees the actual bytes on the wire, language-agnostic, SDK-agnostic.

How is Reel different from llm-test-harness and agent-vcr? llm-test-harness wraps the SDK at the Python client level and bundles eval scoring — same Python-only / SDK-coupled shape as VCR.py. Reel sits one layer below as a language-agnostic HTTP proxy, and stays out of eval/scoring on purpose. agent-vcr records JSON-RPC for MCP servers (a different layer entirely) — it's complementary to Reel, not competitive: cassette your MCP tool servers with agent-vcr, cassette the LLM calls underneath with Reel.

Will it work with Claude Code, Aider, opencode, Cursor, Codex CLI? Yes. All of them respect the standard OPENAI_API_BASE / ANTHROPIC_BASE_URL env-var convention. Cursor needs one settings-file line. Verified live with Claude Code, opencode, and Aider.
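
For example, a hedged sketch of routing a Claude Code session through Reel (the cassette path is made up, and whether the base URL needs a path suffix depends on the tool):

# Start Reel in auto mode: the first session records, identical calls replay afterwards
reel auto --cassette cassettes/agent-session.jsonl &

# Point an Anthropic-based agent such as Claude Code at the proxy
export ANTHROPIC_BASE_URL=http://127.0.0.1:7878
claude -p "summarize README.md"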

What about API keys in committed cassettes? Reel never captures request headers — that's where keys live. Response bodies are scanned for sk-*, sk-ant-*, AIza*, ghp_*, AKIA*, and Bearer-token patterns; matches are redacted before write. A bundled pre-commit hook refuses to commit cassettes still containing detectable secrets.

Does it work with local models — Ollama, vLLM, LM Studio? Yes. Anything OpenAI-compatible. Even with local models the win is real: replay is ~3 ms while local inference is 200-2000 ms.

Why pip install reel-vcr but import reel? Bare reel on PyPI was taken by an unrelated async-subprocess library. Same convention as Pillow (pip install pillow, import PIL).

Is there a Reel Cloud? No, and no plan to build one until there's clear pull. Runs entirely on 127.0.0.1, zero telemetry, no phone-home, Apache 2.0.

More questions: GitHub Discussions · Open an issue