Synth Judge (Phase 44)
Quality filter for synthesized Q&A pairs before they reach data/train.jsonl.
Three pluggable stages, configurable per recipe, transparent scoring.
Pipeline
synthesize_chunk -> validators (cheap) -> judge stage 1 heuristics (always)
-> judge stage 2 LLM pedagogical (opt-in)
-> judge stage 3 NLI entailment (opt-in)
-> kept / rejected verdict
Quick start
# Default LOOSE mode (heuristics only, zero network)
uv run jw-finetune data extract --recipe doctrinal
# STRICT mode (heuristics + harder cutoff)
uv run jw-finetune data extract --recipe doctrinal --judge=strict
# Full pipeline (LLM judge via Anthropic + NLI via DeBERTa)
JW_SYNTH_JUDGE_LLM=anthropic JW_SYNTH_JUDGE_NLI=deberta \
uv run jw-finetune data extract --recipe doctrinal --judge=strict
When the judge is wired in, each kept JSONL row carries its score:
{
  "question": "...",
  "answer": "...",
  "metadata": {
    "pub_code": "w23",
    "judge_score": "{\"cites_jw_publication\": true, \"has_minimum_substance\": true, \"overall\": 7.0, \"kept\": true}"
  }
}
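Because `judge_score` is stored as a JSON-encoded string inside `metadata` in the row above, it needs a second decode before its fields can be read. The following is a minimal sketch, assuming every kept row follows that shape; the helper name `decode_judge_score` is hypothetical and not part of jw_finetune.

```python
import json


def decode_judge_score(line: str) -> dict:
    """Parse one JSONL row and return its judge score as a dict.

    Assumes metadata.judge_score is a JSON-encoded string, as in the
    example row; returns an empty dict when the field is absent.
    """
    row = json.loads(line)
    raw = row.get("metadata", {}).get("judge_score")
    if raw is None:
        return {}
    # Tolerate rows where the score is already an object, not a string.
    return json.loads(raw) if isinstance(raw, str) else raw


# Build a sample row with the same nesting as the example above.
sample = json.dumps({
    "question": "...",
    "answer": "...",
    "metadata": {
        "pub_code": "w23",
        "judge_score": json.dumps({"overall": 7.0, "kept": True}),
    },
})
score = decode_judge_score(sample)
print(score["overall"], score["kept"])
```

To scan a real file, the same helper could be applied to each line of data/train.jsonl.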
Modes and cutoffs
| Mode | Overall cutoff | Default NLI policy |
|---|---|---|
| off | None (passes all) | n/a |
| loose | 5.0 | NLI optional |
| strict | 6.5 | requires entails |
Per-recipe override (YAML):
synth:
  judge:
    mode: strict
    overall_cutoff: 7.0
    require_nli_entails: true
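One way to picture how a recipe override combines with the mode defaults is the sketch below. It is an illustration under assumptions, not the library's code: the `Thresholds` class and `resolve_thresholds` function are hypothetical names, "NLI optional" is read as `require_nli_entails = False`, and a recipe that omits `mode` is assumed to fall back to loose, the default mode.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    overall_cutoff: Optional[float]  # None means every pair passes
    require_nli_entails: bool


# Defaults copied from the modes table (hypothetical representation).
MODE_DEFAULTS = {
    "off": Thresholds(None, False),
    "loose": Thresholds(5.0, False),
    "strict": Thresholds(6.5, True),
}


def resolve_thresholds(judge_cfg: dict) -> Thresholds:
    """Start from the mode defaults, then apply any explicit overrides."""
    base = MODE_DEFAULTS[judge_cfg.get("mode", "loose")]
    overrides = {
        key: judge_cfg[key]
        for key in ("overall_cutoff", "require_nli_entails")
        if key in judge_cfg
    }
    return replace(base, **overrides)


# The dict mirrors the synth.judge mapping from the YAML above.
effective = resolve_thresholds(
    {"mode": "strict", "overall_cutoff": 7.0, "require_nli_entails": True}
)
print(effective)
```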
Scoring formula (transparent)
base 4.0
+ 1.5 if cites_jw_publication (regex on w/g/jt/bh/sjj/jy/rs/it/lff/lr/sjm... or wol.jw.org URL)
+ 1.5 if has_minimum_substance (length >= 40, not generic, not a question echo)
+ 2.0 * nli_score if nli_verdict == "entails"
- 3.0 if nli_verdict == "contradicts"
+ pedagogical_quality (0..3, returned by the LLM judge)
clamp [0, 10]
Hard rules that force kept=False regardless of overall:
- has_minimum_substance == False
- nli_verdict == "contradicts"
- strict mode + nli_verdict == "neutral"
- pedagogical_quality == 0
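The formula and hard rules above can be sketched in a few lines of Python. This is a reading of the documented rules, not the package's implementation. The function names are hypothetical, and two points are assumptions: a disabled stage is modeled as `None` and contributes nothing, and the cutoff comparison is inclusive. The off mode, which passes everything, is not modeled.

```python
from typing import Optional


def overall_score(
    cites_jw_publication: bool,
    has_minimum_substance: bool,
    nli_verdict: Optional[str] = None,  # None when stage 3 is off (assumption)
    nli_score: float = 0.0,
    pedagogical_quality: Optional[float] = None,  # None when stage 2 is off
) -> float:
    score = 4.0  # base
    if cites_jw_publication:
        score += 1.5
    if has_minimum_substance:
        score += 1.5
    if nli_verdict == "entails":
        score += 2.0 * nli_score
    elif nli_verdict == "contradicts":
        score -= 3.0
    if pedagogical_quality is not None:
        score += pedagogical_quality  # 0..3 from the LLM judge
    return max(0.0, min(10.0, score))  # clamp to [0, 10]


def is_kept(
    overall: float,
    cutoff: float,
    strict: bool,
    has_minimum_substance: bool,
    nli_verdict: Optional[str] = None,
    pedagogical_quality: Optional[float] = None,
) -> bool:
    # Hard rules override the overall score.
    if not has_minimum_substance:
        return False
    if nli_verdict == "contradicts":
        return False
    if strict and nli_verdict == "neutral":
        return False
    if pedagogical_quality == 0:
        return False
    return overall >= cutoff  # inclusive comparison is an assumption


# Heuristics only: a substantive answer with and without a citation.
with_citation = overall_score(True, True)
without_citation = overall_score(False, True)
print(with_citation, is_kept(with_citation, 6.5, True, True))
print(without_citation, is_kept(without_citation, 6.5, True, True))
```

With heuristics only, a cited and substantive answer lands at 7.0 and an uncited one at 5.5, which is why the strict cutoff of 6.5 separates them while the loose cutoff of 5.0 does not.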
Programmatic use
from jw_finetune.synth.judge import build_judge, JudgeMode
judge = build_judge(mode=JudgeMode.STRICT)
score = judge.score(
    question="¿Qué enseña la Biblia sobre el reino?",
    answer="Como muestra w23.04 página 12, el reino de Dios...",
    language="es",
)
print(score.kept, score.overall, score.reasons)
Environment
| Variable | Default | Effect |
|---|---|---|
| JW_SYNTH_JUDGE_LLM | off | anthropic / ollama enables stage 2 |
| JW_SYNTH_JUDGE_OLLAMA_MODEL | llama3.1:8b | Ollama model for stage 2 |
| JW_SYNTH_JUDGE_NLI | off | NLI provider name for stage 3 |
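As an illustration of how these variables and their defaults might be read, here is a small sketch. The `stage_settings` function and the keys of the dict it returns are hypothetical; only the variable names and defaults come from the table above.

```python
import os
from typing import Mapping


def stage_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the judge variables, treating "off" as a disabled stage."""
    llm = env.get("JW_SYNTH_JUDGE_LLM", "off")
    nli = env.get("JW_SYNTH_JUDGE_NLI", "off")
    return {
        "llm_provider": None if llm == "off" else llm,
        "ollama_model": env.get("JW_SYNTH_JUDGE_OLLAMA_MODEL", "llama3.1:8b"),
        "nli_provider": None if nli == "off" else nli,
    }


# Passing a plain dict keeps the example independent of the real environment.
full = stage_settings(
    {"JW_SYNTH_JUDGE_LLM": "anthropic", "JW_SYNTH_JUDGE_NLI": "deberta"}
)
print(full)
```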
Precision
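With heuristics only, LOOSE mode reaches 0.86 accuracy on the bundled 50-pair golden fixture (target: 0.85); enabling the LLM and NLI stages pushes it past 0.90. STRICT reaches 1.00 because its 6.5 cutoff rejects every row without a citation regardless of substance: with heuristics only, such a row scores at most 4.0 + 1.5 = 5.5.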
uv run python -c "
from pathlib import Path
from jw_finetune.synth.judge.eval_precision import evaluate_precision
from jw_finetune.synth.judge.thresholds import JudgeMode
r = evaluate_precision(
    Path('packages/jw-finetune/tests/synth/judge/fixtures/golden_50_pairs.jsonl'),
    mode=JudgeMode.LOOSE,
)
print('accuracy:', r.accuracy)
"
Rejected dump (audit)
uv run jw-finetune data extract \
--recipe doctrinal --judge=strict \
--dump-rejected /tmp/rejected.jsonl
# Inspect why pairs were dropped:
jq -c '.judge_score.reasons | map(.code) | unique' /tmp/rejected.jsonl | sort -u
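For a per-code tally rather than the unique combinations, a Python sketch along the same lines is shown below. It assumes the row shape implied by the jq filter above, where `judge_score` is an object whose `reasons` list holds objects with a `code` field. The function name `count_reason_codes` and the sample reason codes are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_reason_codes(lines: Iterable[str]) -> Counter:
    """Count how often each rejection reason code appears."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        reasons = row.get("judge_score", {}).get("reasons", [])
        counts.update(reason["code"] for reason in reasons)
    return counts


# Hypothetical rows; these reason codes are not taken from the real tool.
sample = [
    '{"judge_score": {"reasons": [{"code": "no_citation"}]}}',
    '{"judge_score": {"reasons": [{"code": "no_citation"}, {"code": "too_short"}]}}',
]
tally = count_reason_codes(sample)
print(tally.most_common())
```

Against a real dump, the function could be given an open file handle for /tmp/rejected.jsonl instead of the sample list.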