Phase 22
Doctrinal regression eval
A safety net that measures every change
Three-layer suite (structural · citations · semantic) with 47 golden cases. Turns the risk of doctrinal hallucination into an auditable metric that gates PRs. The piece protecting the other 10.
What shipped
- New packages/jw-eval with Suite + GoldenCase + LayerResult.
- L1 structural: contract regression for 6 agents — no network, no LLM, CI-blocking.
- L2 citations: offline snapshot mode (always-on) + live weekly mode (auto-opens drift issues).
- L3 semantic: sentence-transformers + LLM escalation (Ollama/Claude/OpenAI via JW_EVAL_LLM).
- 47 golden cases (25 L1 + 13 L2 + 9 L3) covering all major agents.
- CLI `jw eval`, MCP tool `run_eval_suite`, 3 new CI jobs (fast/weekly/nightly).
Pending / next PR
- Build the 12 wol.jw.org HTML snapshots via `build_eval_snapshots.py`.
- Bot-comment of markdown report on PRs.