Release F77-F79 · alignment to the published canon

YAML principles · RLAIF · DPO / ORPO on Qwen3.5

Three phases that apply Constitutional AI, RLAIF, and DPO/ORPO to the toolkit's concrete problem: keeping a local fine-tune from drifting away from the canon published by the organization. The source of truth is still the current published material; these techniques are alignment engineering, not new doctrine. Zero human annotators in the critical path — the judge with its principles acts as the AI annotator. Example base model: Qwen3.5-0.8B (Apache-2.0).

Phases: 3

New tests: +41

Builtin principles: 5

Total suite: 1326 passing · 0 cycles

▌ Design decisions that tied the 3 phases together

→ The currently-published canon is the source of truth. The toolkit reflects, doesn't legislate.
→ Principles in versioned, inspectable YAML — not scattered heuristics.
→ Explicit severities: hard blocks, soft only annotates.
→ Lazy cross-package imports — zero dependency cycles (jw-agents ↛ jw-eval).
→ Single source of truth: score_pair reuses score, doesn't duplicate logic.
→ Permissive base model (Apache-2.0) so anyone can use it downstream.
→ CLI still works without Unsloth — only the trainers require GPU extras.
→ No regressions: 1,326 tests pass, no new cycles, ruff clean.
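
The lazy cross-package import listed above can be pictured roughly as follows. This is a minimal sketch, not the toolkit's code: the module name `jw_eval.principles` comes from these notes, while `load_principles_lazily` and `load_builtin_principles` are hypothetical names used only for illustration.

```python
import importlib
from typing import Any


def load_principles_lazily(module_name: str = "jw_eval.principles") -> list[Any]:
    """Import the principles module only when called, never at import time.

    Deferring the import keeps the calling package free of a static
    dependency on jw-eval, so no cycle appears in the import graph.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed: degrade to "no principles".
        return []
    # Hypothetical loader name, for illustration only.
    loader = getattr(module, "load_builtin_principles", None)
    return list(loader()) if callable(loader) else []
```

Because the import happens inside the function body, a static dependency checker sees no edge from the caller to jw-eval, and the caller still works when jw-eval is absent.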

Phase 77

fidelity-principles

Principles of fidelity to the published canon, versioned as YAML

✅ Shipped 🧪 21 tests T1 Alignment

Codifies as YAML the doctrinal principles previously scattered through the code (canon-only rule, no-impersonation, epistemic humility, required citations, conscience matters). Each principle has an id, a severity (hard/soft), an applies_to list of agents, a source citing an official publication, and an optional regex tier for cheap detection. The jw-finetune judge and the jw-agents fidelity_wrap decorator both consult the principles at runtime. jw-agents imports them lazily to avoid a dependency cycle, because jw-eval already depends on jw-agents.
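
The shape of a principle and the id-based override can be sketched as below. This is an illustration under assumptions: the real loader is Pydantic-based, whereas this sketch uses a plain dataclass to stay dependency-free; the field names follow the description above, but the defaults, the `merge_principles` helper, and the `EX…` ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    id: str
    severity: str                       # "hard" or "soft"
    applies_to: tuple[str, ...] = ()    # agent names (empty = all agents, an assumption)
    source: str = ""                    # citation of an official publication
    forbidden_phrases: tuple[str, ...] = ()
    forbidden_regex: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.severity not in ("hard", "soft"):
            raise ValueError(f"{self.id}: severity must be 'hard' or 'soft'")


def merge_principles(builtin: list[Principle], user: list[Principle]) -> list[Principle]:
    """User entries replace builtins with the same id; new ids are appended."""
    merged = {p.id: p for p in builtin}
    merged.update({p.id: p for p in user})
    return list(merged.values())
```

Keying the merge on `id` means a user file can tighten or relax one builtin without copying the others.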

What was delivered

5 builtin principles: PF001-canon-only, PF002-cite-before-paraphrase, PF003-citation-required, PF010-no-impersonation, PF012-respect-conscience.
Pydantic loader with id-based override: a user YAML overrides a builtin sharing the same id.
violations_for(text, principles) runs forbidden_phrases + forbidden_regex in one pass (case-insensitive).
Judge.score_qa_pair() accepts principles=; a hard hit adds RejectionCode.principle_hard_violation.
fidelity_wrap(principles=…) filters by agent_name, stamps metadata principle_hard/soft, respects on_fail warn/reject/annotate.
Lazy import of jw_eval.principles from jw_agents to avoid the dependency cycle.
Hard severity = block; soft = annotate; no hidden policies.
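
The detection and severity rules in the list above can be sketched as follows. The `violations_for(text, principles)` signature is taken from these notes; its return type, the `decide` helper, and the example principles are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    id: str
    severity: str  # "hard" or "soft"
    forbidden_phrases: tuple[str, ...] = ()
    forbidden_regex: tuple[str, ...] = ()


def violations_for(text: str, principles: list[Principle]) -> list[Principle]:
    """Return every principle whose phrase or regex tier matches the text."""
    lowered = text.lower()
    hits = []
    for p in principles:  # single pass over the principles
        phrase_hit = any(ph.lower() in lowered for ph in p.forbidden_phrases)
        regex_hit = any(re.search(rx, text, re.IGNORECASE) for rx in p.forbidden_regex)
        if phrase_hit or regex_hit:
            hits.append(p)
    return hits


def decide(text: str, principles: list[Principle]) -> dict:
    """Hard hits block the answer; soft hits only annotate it."""
    hits = violations_for(text, principles)
    hard = [p.id for p in hits if p.severity == "hard"]
    soft = [p.id for p in hits if p.severity == "soft"]
    return {"blocked": bool(hard), "principle_hard": hard, "principle_soft": soft}
```

The returned keys mirror the principle_hard/soft metadata mentioned above; how the real decorator maps them onto on_fail modes is not shown here.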

Phase 78

rlaif-pipeline

The judge promoted to a preference annotator + SL-CAI

✅ Shipped 🧪 8 tests T1 Alignment

Turns the jw-finetune Judge into a deterministic preference model. Given (question, answer_a, answer_b), it picks a winner by a fixed protocol: hard fails first (no citation, NLI contradiction, hard principle violation), then the overall score with a tolerance ε, and finally NLI as the tiebreak. The dataset builder generates N candidates per prompt with a temperature sweep, scores them pairwise, and exports JSONL in the trl format. The critique.py module implements the supervised half of Constitutional AI (SL-CAI): for each Q&A pair, the LLM reviews the answer against the principles and rewrites it if it finds a hard violation.
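
The three-step protocol can be sketched as below. The names `PreferenceVerdict`, `compare_scores`, `tie_epsilon`, and the three hard-fail codes come from these notes; the `Score` container, the 0.05 default, the reason strings, and the handling of the case where both sides hard-fail (fall through to the overall score) are assumptions.

```python
from dataclasses import dataclass, field

HARD_FAIL_CODES = {"nli_contradicts", "no_jw_citation", "principle_hard_violation"}


@dataclass
class Score:  # hypothetical container for one judged answer
    overall: float
    nli: float
    rejection_codes: set[str] = field(default_factory=set)


@dataclass
class PreferenceVerdict:
    winner: str  # "a", "b", or "tie"
    margin: float
    reasons: list[str]
    score_a: Score
    score_b: Score


def compare_scores(a: Score, b: Score, tie_epsilon: float = 0.05) -> PreferenceVerdict:
    hard_a = bool(a.rejection_codes & HARD_FAIL_CODES)
    hard_b = bool(b.rejection_codes & HARD_FAIL_CODES)
    margin = abs(a.overall - b.overall)
    # 1. Hard fails first: a side with a hard fail loses to a clean side.
    if hard_a != hard_b:
        return PreferenceVerdict("b" if hard_a else "a", margin, ["hard_fail"], a, b)
    # 2. Overall score, ignoring differences within the tolerance.
    if margin > tie_epsilon:
        winner = "a" if a.overall > b.overall else "b"
        return PreferenceVerdict(winner, margin, ["overall"], a, b)
    # 3. NLI as tiebreak; identical NLI is a genuine tie.
    if a.nli != b.nli:
        winner = "a" if a.nli > b.nli else "b"
        return PreferenceVerdict(winner, margin, ["nli_tiebreak"], a, b)
    return PreferenceVerdict("tie", margin, ["tie"], a, b)
```

Checking hard fails before scores means a fluent answer without a citation can never beat a plainer answer that has one.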

What was delivered

PreferenceVerdict (winner/margin/reasons/score_a/score_b) + compare_scores with tie_epsilon.
Hard-fail asymmetry: the side with nli_contradicts, no_jw_citation, or principle_hard_violation loses against the other.
Judge.score_pair(question, answer_a, answer_b, language) reuses score() — single source of truth.
build_preference_dataset() with n_candidates, min_margin, deterministic temperature sweep [0.1, 0.5, 0.8, 1.0].
PreferenceStats with counters items/candidates/kept/tied/low_margin/provider_errors/by_winner.
Output JSONL format {prompt, chosen, rejected, metadata} ready for trl.DPOTrainer/ORPOTrainer.
self_critique(pair, principles, llm) — SL-CAI loop with opt-in preserve_original for audit.
batch_critique(pairs) → (revised_pairs, num_changed) integrates with the existing orchestrator.
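
The dataset-building step can be sketched as follows. The temperature sweep, `n_candidates`, `min_margin`, and the `{prompt, chosen, rejected, metadata}` row shape come from the list above. The `generate` and `score` callables, the default values, and the metadata contents are hypothetical, and the sketch compares plain scores rather than running the full judge protocol.

```python
import itertools
import json
from typing import Callable

TEMPERATURES = [0.1, 0.5, 0.8, 1.0]  # sweep listed in these notes


def build_pairs(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> answer
    score: Callable[[str, str], float],     # hypothetical: (prompt, answer) -> overall score
    n_candidates: int = 4,
    min_margin: float = 0.1,
) -> list[dict]:
    # Cycle through the sweep so candidate i always gets the same temperature.
    temps = list(itertools.islice(itertools.cycle(TEMPERATURES), n_candidates))
    candidates = [(generate(prompt, t), t) for t in temps]
    scored = [(score(prompt, ans), ans, t) for ans, t in candidates]
    rows = []
    for (sa, a, ta), (sb, b, tb) in itertools.combinations(scored, 2):
        margin = abs(sa - sb)
        if margin < min_margin:
            continue  # too close to be a useful training signal
        chosen, rejected = (a, b) if sa > sb else (b, a)
        rows.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "metadata": {"margin": round(margin, 4), "temperatures": [ta, tb]},
        })
    return rows


def to_jsonl(rows: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
```

Fixing the temperature order keeps runs reproducible as long as the generator itself is seeded.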

Phase 79

dpo-orpo-trainers

DPO and ORPO with Unsloth on top of Qwen3.5-0.8B (Apache-2.0)

✅ Shipped 🧪 12 tests T2 Training

Closes the alignment loop: preference dataset → trainer → checkpoint → GGUF/MLX export. Two new trainers: DPO (with beta=0.1 by default, ref_model auto-derived from the base via LoRA) and ORPO (single-stage, no reference model, ideal for MLX/ROCm). Three builtin recipes on Qwen3.5-0.8B Apache-2.0 with chat_template=qwen-3. Extended CLI: jw-finetune train auto-dispatches by recipe.task; new prepare-preference subcommand wires the judge with loaded principles.
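
For readers comparing the two objectives, the per-pair losses can be written out numerically. This sketch follows the standard published formulations of DPO and ORPO and was not read from the toolkit's trainers; the function names and the weight name `lam` are illustrative. It shows why DPO needs reference log-probabilities while ORPO does not.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, from summed sequence log-probabilities."""
    logits = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * logits)


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO loss for one pair, from length-averaged log-probabilities (< 0)."""
    def log_odds(logp: float) -> float:
        # log(p / (1 - p)) with p = exp(logp)
        return logp - math.log1p(-math.exp(logp))

    odds_ratio_term = -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
    nll = -avg_logp_chosen  # SFT term on the chosen answer
    return nll + lam * odds_ratio_term
```

When the policy equals the reference, the DPO loss is log 2 for every pair; it drops only as the policy raises the chosen answer relative to the rejected one.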

What was delivered

train_dpo() with trl.DPOTrainer + Unsloth FastLanguageModel + get_chat_template (correct template alignment).
train_orpo() with trl.ORPOTrainer — single phase, no reference model, ideal for small datasets.
Recipe.task accepts 'dpo' and 'orpo' alongside cpt/sft/grpo; recipe validation covers the new tasks.
3 recipes: doctrinal-qa-es-sft-qwen35 (SFT base), doctrinal-qa-es-dpo-qwen35 (lr 5e-6, 1 epoch), doctrinal-qa-es-orpo-qwen35 (lr 8e-6).
CLI dispatch in train: rec.task=='dpo' → train_dpo; rec.task=='orpo' → train_orpo; by convention the dataset is read from preference_pairs.jsonl.
New command jw-finetune prepare-preference --judge-mode strict --principles to produce the ready JSONL.
Compatible with all existing exporters: GGUF (llama.cpp/Ollama), MLX (Apple), SafeTensors merged/adapter.
Lazy Unsloth import — the CLI still works without a GPU for non-training commands.
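
The task-based dispatch and the lazy trainer import can be sketched together. The `train_dpo` and `train_orpo` names and the 'dpo'/'orpo' task values come from the list above; the bodies here are stand-ins that train nothing, and the `Recipe` alias and `dispatch` helper are hypothetical.

```python
from typing import Callable

Recipe = dict  # stand-in for the toolkit's recipe object (assumption)


def train_dpo(recipe: Recipe) -> str:
    # A real trainer would import unsloth and trl here, inside the function,
    # so that importing this module never requires the GPU extras.
    return f"dpo on {recipe['base_model']}"


def train_orpo(recipe: Recipe) -> str:
    # Same lazy-import pattern; ORPO needs no reference model.
    return f"orpo on {recipe['base_model']}"


TRAINERS: dict[str, Callable[[Recipe], str]] = {"dpo": train_dpo, "orpo": train_orpo}


def dispatch(recipe: Recipe) -> str:
    """Pick the trainer from recipe['task'] and run it."""
    task = recipe["task"]
    if task not in TRAINERS:
        raise ValueError(f"unsupported task for this sketch: {task!r}")
    return TRAINERS[task](recipe)
```

A lookup table keeps the CLI command unchanged when a new task is added: only the table grows.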

▌ End-to-end flow

The 3 phases compose into a coherent pipeline on the Qwen3.5-0.8B Apache-2.0 recipes:

# 1. Generate SFT recipe from preset (Qwen3.5-0.8B, chat_template qwen-3)
uv run jw-finetune init -p doctrinal-qa-es-sft-qwen35 -o ws/recipe.yaml

# 2. Prepare SFT dataset (extract → dedupe → chunk → synth Q&A)
uv run jw-finetune prepare --recipe-file ws/recipe.yaml --source ./pubs/

# 3. Base SFT
uv run jw-finetune train --workspace ws/run-YYYYMMDD-HHMMSS

# 4. Switch the recipe to DPO (same base_model)
uv run jw-finetune init -p doctrinal-qa-es-dpo-qwen35 -o ws-dpo/recipe.yaml

# 5. Build the preference dataset with SL-CAI + RLAIF
#    The judge automatically consults the 5 builtin principles.
uv run jw-finetune prepare-preference \
    --workspace ws-dpo \
    --prompts prompts.jsonl \
    --judge-mode strict \
    --principles

# 6. DPO (auto-dispatches by recipe.task)
uv run jw-finetune train --workspace ws-dpo

# 7. Export to GGUF or MLX
uv run jw-finetune export -c ws-dpo/checkpoints/final -f gguf -q Q4_K_M
