Independent project. Not affiliated with, sponsored by, or endorsed by the Watch Tower Bible and Tract Society or Jehovah's Witnesses.
jw-agent-toolkit

Release F77-F79 · alignment to the published canon

YAML principles · RLAIF · DPO / ORPO on Qwen3.5

Three phases apply Constitutional AI, RLAIF, and DPO/ORPO to the toolkit's concrete problem: keeping a local fine-tune from drifting away from the canon published by the organization. The source of truth remains the currently published material; these techniques are alignment engineering, not new doctrine. No human annotators sit in the critical path: the judge, guided by its principles, acts as the AI annotator. Example base model: Qwen3.5-0.8B (Apache-2.0).

Phases: 3
New tests: +41
Builtin principles: 5
Total suite: 1,326 passing · 0 cycles

▌ Design decisions that tied the 3 phases together

  • The currently published canon is the source of truth. The toolkit reflects it and does not legislate.
  • Principles in versioned, inspectable YAML — not scattered heuristics.
  • Explicit severities: hard blocks, soft only annotates.
  • Lazy cross-package imports — zero dependency cycles (jw-agents ↛ jw-eval).
  • Single source of truth: score_pair reuses score, doesn't duplicate logic.
  • Permissive base model (Apache-2.0) so anyone can use it downstream.
  • CLI still works without Unsloth — only the trainers require GPU extras.
  • No regressions: 1,326 tests pass, no new cycles, ruff clean.
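
The lazy cross-package import can be illustrated with a minimal sketch. This is not the toolkit's code: the helper names `load_optional` and `principles_available` are hypothetical, and only the module path `jw_eval.principles` comes from these notes.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a module at call time; return None if it is not installed.

    Deferring the import to the call site means the importing package has no
    import-time dependency on the target, so no cycle appears at load time.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def principles_available() -> bool:
    # Hypothetical call site: the module is resolved only when this runs.
    return load_optional("jw_eval.principles") is not None
```

The same pattern would cover the Unsloth case: a command that never calls the trainer never triggers the import.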

Phase 77

fidelity-principles

Principles of fidelity to the published canon, versioned as YAML

✅ Shipped · 🧪 21 tests · T1 Alignment
Technical guide →

Codifies as YAML the doctrinal principles previously scattered through the code (canon-only rule, no-impersonation, epistemic humility, required citations, conscience matters). Each principle has an id, a severity (hard/soft), an applies_to list per agent, a source citing an official publication, and an optional regex tier for cheap detection. The jw-finetune judge and the jw-agents fidelity_wrap decorator both consult the principles at runtime. jw-agents imports them lazily to avoid a dependency cycle, since jw-eval already depends on jw-agents.
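
As a rough illustration of the shape described above, here is a sketch of a principle record and a one-pass detector. The field names follow the text, but the exact schema is an assumption: a stdlib dataclass stands in for the Pydantic model, `forbidden_regex` is assumed to be a list of patterns, and the example phrase and pattern are invented.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Principle:
    # Field names follow the release notes; types and defaults are assumptions.
    id: str
    severity: str  # "hard" or "soft"
    applies_to: List[str] = field(default_factory=list)
    source: str = ""
    forbidden_phrases: List[str] = field(default_factory=list)
    forbidden_regex: List[str] = field(default_factory=list)


def violations_for(text: str, principles: List[Principle]) -> List[Principle]:
    """Return every principle whose phrase or regex tier matches the text."""
    lowered = text.lower()
    hits = []
    for p in principles:
        phrase_hit = any(ph.lower() in lowered for ph in p.forbidden_phrases)
        regex_hit = any(re.search(rx, text, re.IGNORECASE) for rx in p.forbidden_regex)
        if phrase_hit or regex_hit:
            hits.append(p)
    return hits


# Illustrative entry only: the phrase and the pattern are made up for this sketch.
demo = Principle(
    id="PF010-no-impersonation",
    severity="hard",
    applies_to=["*"],
    forbidden_phrases=["i speak for the organization"],
    forbidden_regex=[r"\bas an? (official )?representative\b"],
)
```

Both tiers are checked case-insensitively in a single loop over the principles, which keeps detection cheap before any LLM-based review runs.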

What was delivered

  • 5 builtin principles: PF001-canon-only, PF002-cite-before-paraphrase, PF003-citation-required, PF010-no-impersonation, PF012-respect-conscience.
  • Pydantic loader with id-based override: a user YAML overrides a builtin sharing the same id.
  • violations_for(text, principles) runs forbidden_phrases + forbidden_regex in one pass (case-insensitive).
  • Judge.score_qa_pair() accepts principles=; a hard hit adds RejectionCode.principle_hard_violation.
  • fidelity_wrap(principles=…) filters by agent_name, stamps metadata principle_hard/soft, respects on_fail warn/reject/annotate.
  • Lazy import of jw_eval.principles from jw_agents to avoid the dependency cycle.
  • Hard severity = block; soft = annotate; no hidden policies.
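
The id-based override could look roughly like the sketch below. Plain dicts stand in for parsed YAML entries, `merge_principles` is a hypothetical name, and the severities and the `PX900-local-rule` id are illustrative rather than taken from the builtin file.

```python
from typing import Dict, Iterable, List


def merge_principles(builtin: Iterable[dict], user: Iterable[dict]) -> List[dict]:
    """Merge two principle lists; a user entry replaces a builtin with the same id."""
    merged: Dict[str, dict] = {p["id"]: p for p in builtin}
    for p in user:
        merged[p["id"]] = p  # same id: replace in place; new id: append at the end
    return list(merged.values())


builtin = [
    {"id": "PF001-canon-only", "severity": "hard"},
    {"id": "PF012-respect-conscience", "severity": "soft"},
]
user = [
    {"id": "PF012-respect-conscience", "severity": "hard"},  # overrides the builtin
    {"id": "PX900-local-rule", "severity": "soft"},          # hypothetical new id
]
effective = merge_principles(builtin, user)
```

Because Python dicts preserve insertion order, an overridden principle keeps its original position and new ids land at the end.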

Phase 78

rlaif-pipeline

The judge promoted to a preference annotator + SL-CAI

✅ Shipped · 🧪 8 tests · T1 Alignment
Technical guide →

Turns the jw-finetune Judge into a deterministic preference model: given (question, answer_a, answer_b), it picks a winner by a fixed protocol. Hard fails are checked first (no citation, NLI contradiction, hard principle violation), then the overall score is compared within a tolerance ε, and NLI serves as the tiebreak. The dataset builder generates N candidates per prompt with a temperature sweep, scores them pairwise, and exports JSONL in the trl format. The critique.py module implements the supervised half of Constitutional AI (SL-CAI): for each Q&A pair, the LLM reviews the answer against the principles and rewrites it if a hard violation is found.
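
The decision protocol can be sketched as below. This is a reading of the description, not the shipped `compare_scores`: the `Score` shape, the `nli_entailment` field, the default `tie_epsilon` value, and the handling of the case where both sides hard-fail (fall through to the overall score) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Score:
    # Hypothetical score shape; only "overall" and the three flags are named in the text.
    overall: float
    nli_entailment: float = 0.0
    no_jw_citation: bool = False
    nli_contradicts: bool = False
    principle_hard_violation: bool = False

    @property
    def hard_fail(self) -> bool:
        return self.no_jw_citation or self.nli_contradicts or self.principle_hard_violation


@dataclass
class PreferenceVerdict:
    winner: Optional[str]  # "a", "b", or None for a tie
    margin: float
    score_a: Score
    score_b: Score
    reasons: List[str] = field(default_factory=list)


def compare_scores(a: Score, b: Score, tie_epsilon: float = 0.02) -> PreferenceVerdict:
    diff = a.overall - b.overall
    # 1. Hard fails first: if exactly one side hard-fails, the other side wins.
    if a.hard_fail != b.hard_fail:
        return PreferenceVerdict("b" if a.hard_fail else "a", abs(diff), a, b, ["hard_fail"])
    # 2. Overall score, if the gap is outside the tolerance band.
    if abs(diff) > tie_epsilon:
        return PreferenceVerdict("a" if diff > 0 else "b", abs(diff), a, b, ["overall"])
    # 3. NLI as tiebreak; identical NLI values leave the pair tied.
    nli_diff = a.nli_entailment - b.nli_entailment
    if nli_diff != 0:
        return PreferenceVerdict("a" if nli_diff > 0 else "b", abs(diff), a, b, ["nli_tiebreak"])
    return PreferenceVerdict(None, abs(diff), a, b, ["tie"])
```

Every branch is a pure comparison with no sampling, so the same pair of scores always yields the same verdict.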

What was delivered

  • PreferenceVerdict (winner/margin/reasons/score_a/score_b) + compare_scores with tie_epsilon.
  • Hard-fail asymmetry: the side with nli_contradicts, no_jw_citation, or principle_hard_violation loses against the other.
  • Judge.score_pair(question, answer_a, answer_b, language) reuses score() — single source of truth.
  • build_preference_dataset() with n_candidates, min_margin, deterministic temperature sweep [0.1, 0.5, 0.8, 1.0].
  • PreferenceStats with counters items/candidates/kept/tied/low_margin/provider_errors/by_winner.
  • Output JSONL format {prompt, chosen, rejected, metadata} ready for trl.DPOTrainer/ORPOTrainer.
  • self_critique(pair, principles, llm) — SL-CAI loop with opt-in preserve_original for audit.
  • batch_critique(pairs) → (revised_pairs, num_changed) integrates with the existing orchestrator.
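
To make the output format concrete, the sketch below turns already-scored candidates into `{prompt, chosen, rejected, metadata}` records and serializes them as JSONL. The function names and the default `min_margin` are hypothetical, and a single overall score per candidate stands in for the full judge verdict.

```python
import itertools
import json
from typing import Dict, List, Tuple


def pairs_from_candidates(
    prompt: str,
    scored: List[Tuple[str, float]],  # (answer, overall score)
    min_margin: float = 0.1,
) -> List[Dict]:
    """Build preference records from every candidate pair with a wide enough margin."""
    records = []
    for (ans_a, s_a), (ans_b, s_b) in itertools.combinations(scored, 2):
        margin = abs(s_a - s_b)
        if margin < min_margin:
            continue  # low-margin pair: too close to be a useful preference
        chosen, rejected = (ans_a, ans_b) if s_a > s_b else (ans_b, ans_a)
        records.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "metadata": {"margin": round(margin, 4)},
        })
    return records


def to_jsonl(records: List[Dict]) -> str:
    # ensure_ascii=False keeps accented Spanish text readable in the file.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

With three candidates scored 0.9, 0.85, and 0.3, the two close candidates are never paired against each other, but each is paired against the weak one.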

Phase 79

dpo-orpo-trainers

DPO and ORPO with Unsloth on top of Qwen3.5-0.8B (Apache-2.0)

✅ Shipped · 🧪 12 tests · T2 Training
Technical guide →

Closes the alignment loop: preference dataset → trainer → checkpoint → GGUF/MLX export. Two new trainers: DPO (with beta=0.1 by default, ref_model auto-derived from the base via LoRA) and ORPO (single-stage, no reference model, ideal for MLX/ROCm). Three builtin recipes on Qwen3.5-0.8B Apache-2.0 with chat_template=qwen-3. Extended CLI: jw-finetune train auto-dispatches by recipe.task; new prepare-preference subcommand wires the judge with loaded principles.
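
For readers unfamiliar with what beta controls, here is the standard DPO objective for a single preference pair, written in plain Python. It is the published formula, not code from the toolkit's trainer (which delegates to trl.DPOTrainer); the function name and the scalar log-probability inputs are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference, the loss is log 2 (about 0.693). It falls as the policy raises the chosen answer's likelihood relative to the rejected one, and beta scales how strongly deviations from the reference count. ORPO drops the reference terms entirely, which is why it needs no second model in memory.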

What was delivered

  • train_dpo() with trl.DPOTrainer + Unsloth FastLanguageModel + get_chat_template (correct template alignment).
  • train_orpo() with trl.ORPOTrainer — single phase, no reference model, ideal for small datasets.
  • Recipe.task admits 'dpo' and 'orpo' alongside cpt/sft/grpo; recipe validation is extended to cover them.
  • 3 recipes: doctrinal-qa-es-sft-qwen35 (SFT base), doctrinal-qa-es-dpo-qwen35 (lr 5e-6, 1 epoch), doctrinal-qa-es-orpo-qwen35 (lr 8e-6).
  • CLI dispatch in train: rec.task=='dpo' → train_dpo; rec.task=='orpo' → train_orpo; dataset by convention preference_pairs.jsonl.
  • New command: jw-finetune prepare-preference --judge-mode strict --principles, which produces the trainer-ready JSONL.
  • Compatible with all existing exporters: GGUF (llama.cpp/Ollama), MLX (Apple), SafeTensors merged/adapter.
  • Lazy Unsloth import — the CLI still works without a GPU for non-training commands.
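
The dispatch on recipe.task might be organized as a lookup table, as in this sketch. The trainer functions here are stand-in stubs that only return a label, the `dispatch` helper is hypothetical, and cpt/grpo are omitted for brevity.

```python
from typing import Callable, Dict


# Stand-ins for the real trainers; each one just reports what it would run.
def train_sft(workspace: str) -> str:
    return f"sft:{workspace}"


def train_dpo(workspace: str) -> str:
    return f"dpo:{workspace}"


def train_orpo(workspace: str) -> str:
    return f"orpo:{workspace}"


TRAINERS: Dict[str, Callable[[str], str]] = {
    "sft": train_sft,
    "dpo": train_dpo,
    "orpo": train_orpo,
}


def dispatch(task: str, workspace: str) -> str:
    """Pick the trainer for a recipe task, failing loudly on unknown tasks."""
    try:
        trainer = TRAINERS[task]
    except KeyError:
        raise ValueError(f"unsupported recipe task: {task!r}") from None
    return trainer(workspace)
```

In a real CLI each stub would import its heavy dependencies inside the function body, so the table can be built on a machine without GPU extras.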

▌ End-to-end flow

The 3 phases compose into a coherent pipeline on the Qwen3.5-0.8B Apache-2.0 recipes:

# 1. Generate SFT recipe from preset (Qwen3.5-0.8B, chat_template qwen-3)
uv run jw-finetune init -p doctrinal-qa-es-sft-qwen35 -o ws/recipe.yaml

# 2. Prepare SFT dataset (extract → dedupe → chunk → synth Q&A)
uv run jw-finetune prepare --recipe-file ws/recipe.yaml --source ./pubs/

# 3. Base SFT
uv run jw-finetune train --workspace ws/run-YYYYMMDD-HHMMSS

# 4. Switch the recipe to DPO (same base_model)
uv run jw-finetune init -p doctrinal-qa-es-dpo-qwen35 -o ws-dpo/recipe.yaml

# 5. Build the preference dataset with SL-CAI + RLAIF
#    The judge automatically consults the 5 builtin principles.
uv run jw-finetune prepare-preference \
    --workspace ws-dpo \
    --prompts prompts.jsonl \
    --judge-mode strict \
    --principles

# 6. DPO (auto-dispatches by recipe.task)
uv run jw-finetune train --workspace ws-dpo

# 7. Export to GGUF or MLX
uv run jw-finetune export -c ws-dpo/checkpoints/final -f gguf -q Q4_K_M
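
Between steps 5 and 6 it can be worth sanity-checking the preference file before spending GPU time. The checker below is a hypothetical helper, not part of the CLI; it assumes only the `{prompt, chosen, rejected, metadata}` record format described for the JSONL output.

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}


def check_preference_lines(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks usable."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            problems.append(f"line {n}: missing {sorted(missing)}")
        elif row["chosen"] == row["rejected"]:
            problems.append(f"line {n}: chosen and rejected are identical")
    return problems
```

To use it, open the generated file and pass the file object directly, for example `check_preference_lines(open(path, encoding="utf-8"))`, where `path` points at the workspace's preference_pairs.jsonl (its exact location inside the workspace is an assumption).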