Independent project. Not affiliated with, sponsored by, or endorsed by the Watch Tower Bible and Tract Society or Jehovah's Witnesses.
jw-agent-toolkit

Release F80 · mechanistic interpretability

Probing · Steering · Qwen Scope + Gemma Scope

Closes the F77–F79 alignment block with a sharp question: did the model internalise the principles or did it learn a stylistic shortcut? Tri-model architecture: production Qwen3.5-0.8B left untouched, a Qwen3.5-2B-Base lab with public Qwen-Scope and a Gemma-2-2B-PT lab with Gemma Scope (SOTA JumpReLU). Cross-family validation: a moral feature that emerges in two architecturally-distinct families is much stronger evidence than one alone.

  • Sub-phases: 6
  • New tests: +98
  • New package: jw-interp
  • Total suite: 1411 passing · 0 jw-interp coupling in jw-agents

▌ Design decisions that tie the 6 sub-phases together

  • Production Qwen3.5-0.8B is never touched. SAEs live in the 2B lab models.
  • Tier 4 is observational: a probe miss never vetoes a Finding by itself.
  • Cross-family validation: Qwen-Scope + Gemma Scope with an identical numpy interface.
  • The mock capturer is a mathematical contract: if the probe fails on it, the pipeline is broken.
  • Zero coupling: fidelity_wrap takes a callable, no jw-interp import.
  • torch.load(weights_only=True) always — no pickle exploits from HF.
  • Probes persisted as npz + JSON sidecar — no pickle, forward-compat across sklearn.
  • Opt-in extras: torch for real capture, sae for Gemma Scope.

Phase 80.0

sl-cai-critique-cli

Close the upstream gap: rewrite violations before SFT

✅ Shipped 🧪 14 tests T1 Loop closure

The pipeline already lived in jw_finetune.synth.critique but lacked a CLI, tests and docs. F80.0 closes the loop: build-critique-dataset takes a ShareGPT SFT dataset, filters principles by agent_name, runs the cheap regex tier, calls the LLM only on hard violations, and emits a parallel dataset with revised answers. The original is preserved in metadata.original_answer for audit. The effect is fewer hard violations in the next SFT round's training set.

What shipped

  • New command jw-finetune build-critique-dataset with flags input/output/provider/model/agent/principles/preserve-original.
  • Regex tier short-circuit: no match means no LLM call (zero cost on clean corpora).
  • Filtering by the principle's applies_to vs the agent_name.
  • Programmatic API batch_critique(pairs, principles, llm, agent) → (revised_pairs, num_changed).
  • self_critique(pair, principles, llm_provider, agent, preserve_original) returns CritiqueResult with changed + violated_principle_ids + original_answer.
  • 10 TDD tests for the module + 4 CLI dispatch tests (input missing, output paths, no-principles flag).
  • Graceful fallbacks: LLM error → original untouched, empty response → original untouched, identical text → changed=False.
  • Guide docs/guias/sl-cai.md with quickstart and how it relates to F77 + F78 + F79.
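
The regex short-circuit and the graceful fallbacks listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the jw_finetune implementation: the `Principle` shape, the `critique_answer` helper and the prompt string are hypothetical, and only the result fields changed, violated_principle_ids and original_answer are taken from the list above.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Principle:
    """Hypothetical principle shape: an id, a regex, and the agents it applies to."""
    id: str
    pattern: str
    applies_to: tuple = ()  # assumption: an empty tuple means "every agent"


@dataclass
class CritiqueResult:
    answer: str
    changed: bool
    violated_principle_ids: list = field(default_factory=list)
    original_answer: Optional[str] = None


def critique_answer(answer, principles, llm, agent, preserve_original=True):
    """Regex tier first; the LLM is called only when a pattern matches."""
    relevant = [p for p in principles if not p.applies_to or agent in p.applies_to]
    violated = [p.id for p in relevant if re.search(p.pattern, answer, re.IGNORECASE)]
    if not violated:
        # Short-circuit: a clean answer costs zero LLM calls.
        return CritiqueResult(answer, changed=False)
    try:
        revised = llm(f"Rewrite to fix {violated}: {answer}")
    except Exception:
        revised = ""  # LLM error: fall back to the original answer
    revised = (revised or "").strip()
    if not revised or revised == answer:
        # Empty response or identical text: keep the original, changed=False.
        return CritiqueResult(answer, changed=False, violated_principle_ids=violated)
    return CritiqueResult(
        revised,
        changed=True,
        violated_principle_ids=violated,
        original_answer=answer if preserve_original else None,
    )
```

Passing the LLM as a plain callable keeps the sketch testable with a stub and makes the zero-cost path on clean corpora easy to assert.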

Phase 80.1

linear-probing-per-principle

Do the principles live in the representation, or is it a stylistic shortcut?

✅ Shipped 🧪 27 tests T2 Analysis

Brand-new jw-interp package with the full machinery: a declarative ContrastiveSpec for the 5 builtin principles; a deterministic MockActivationCapturer (offset per principle × layer × hook) that yields linearly separable data; an sklearn LinearProbe with a stratified split that reports AUC and accuracy; and a TorchActivationCapturer that uses HF forward hooks with automatic device selection (cuda > mps > cpu). Interpretation: probe accuracy ≥ 0.80 on at least one layer means the principle is internalised; accuracy < 0.65 on every layer signals a shortcut. The mock is a mathematical contract: if the probe fails to separate it, the pipeline is broken.
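
The two thresholds can be read as a small decision rule over per-layer accuracies. The helper below is a hypothetical sketch, not part of jw-interp; the "inconclusive" label for the band between 0.65 and 0.80 is an assumption, since the text only names the two extremes.

```python
def interpret_probe_accuracies(acc_by_layer, internalised_at=0.80, shortcut_below=0.65):
    """Map {layer: accuracy} to a verdict using the thresholds quoted in the text."""
    if not acc_by_layer:
        raise ValueError("need at least one layer accuracy")
    best_layer = max(acc_by_layer, key=acc_by_layer.get)
    best = acc_by_layer[best_layer]
    if best >= internalised_at:
        verdict = "internalised"   # some layer separates the principle well
    elif best < shortcut_below:
        verdict = "shortcut"       # even the best layer is below 0.65, so all are
    else:
        verdict = "inconclusive"   # assumption: the source does not name this band
    return verdict, best_layer
```

Only the best layer matters for the verdict: "some layer ≥ 0.80" and "all layers < 0.65" are both statements about the maximum.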

What shipped

  • ContrastivePair / ContrastiveSpec / PrincipleContrastiveBuilder / ProbingDataset with shape validation.
  • 5 seed builtin specs for PF001/002/003/010/012 (extendable by the user).
  • Deterministic MockActivationCapturer with direction hashing per (principle_id, layer, hook), configurable signal_strength, low noise_std.
  • TorchActivationCapturer with AutoModelForCausalLM + forward hooks on model.model.layers, last-token / mean pooling, configurable batch_size.
  • LinearProbe sklearn.LogisticRegression with stratified test split, accuracy + AUC, coef + bias persisted for F80.2 steering.
  • train_probe + train_probes_for_principle (one per layer).
  • Lazy torch/transformers imports — the package runs without GPU for tests and mock.
  • End-to-end test: probe on linearly-separable mock activations must hit ≥ 0.95 accuracy.
  • 22 pure-numpy tests + 5 torch tests (importorskip).
  • Guide docs/guias/probing.md with real + synthetic quickstart and result interpretation table.
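
The "mock as mathematical contract" idea can be sketched with numpy alone. Everything here is illustrative: the function names, the hashing scheme and the default values are assumptions, and a diff-of-means classifier scored in-sample stands in for the sklearn logistic probe and its stratified split.

```python
import hashlib

import numpy as np


def mock_direction(principle_id, layer, hook, dim):
    """Deterministic unit direction derived from a hash of (principle_id, layer, hook)."""
    key = f"{principle_id}|{layer}|{hook}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "little")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)


def mock_activations(principle_id, layer, hook, n_per_class=50, dim=64,
                     signal_strength=3.0, noise_std=0.1):
    """Positives sit at +signal along the direction, negatives at -signal, plus small noise."""
    d = mock_direction(principle_id, layer, hook, dim)
    rng = np.random.default_rng(0)  # fixed seed keeps the noise reproducible
    noise = rng.standard_normal((2 * n_per_class, dim)) * noise_std
    labels = np.repeat([1, 0], n_per_class)
    signs = np.where(labels == 1, 1.0, -1.0)[:, None]
    return noise + signs * signal_strength * d, labels


def diff_of_means_accuracy(X, y):
    """Stand-in for the logistic probe: classify by projection on the class-mean difference."""
    mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    midpoint = (mu_pos + mu_neg) / 2
    pred = ((X - midpoint) @ w > 0).astype(int)
    return float((pred == y).mean())
```

With a signal far larger than the noise, any linear classifier should separate the two classes; if it does not, the fault lies in the pipeline rather than in the model under study.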

Phase 80.2

steering-vectors-activation-patching

Causal validation: correlation is not causation

✅ Shipped 🧪 15 tests T2 Analysis

If a probe finds the principle but the corresponding activation does not cause the behaviour, it's a shortcut. This phase ships steering vectors (mean of positives minus mean of negatives), residual application with broadcasting, orthogonal projection (ablation), and a monotonic alpha-sweep evaluation against the probe. Activation patching ships as a pure-numpy core that is ready for the real forward-pass wrap (torch_patching.py), which lands in F80.6+. If +alpha raises the probe score and -alpha lowers it, the direction is causal; if neither direction moves the score, the probe signal is spurious.

What shipped

  • Frozen SteeringVector dataclass with vector, magnitude, n_positive/negative.
  • compute_steering_vector(batch, principle_id, normalize=True) uses diff-of-means, unit-norm by default.
  • compute_steering_vectors_for_principle(batches) — one per layer.
  • apply_steering_to_residual with broadcasting (1D residual or 2D batch), immutable.
  • project_out removes the vector component (ablation) — verified: post-projection dot ~0.
  • evaluate_steering_effect(batch, vector, probe.predict_proba, alpha): monotonicity test under alpha (+alpha raises the probe score on negatives, -alpha lowers it).
  • patching.PatchedActivation + patch_one / patch_batch / evaluate_patching_effect (pure-numpy).
  • Test: patching with self → ~0 effect; patching with flipped labels returns a finite effect.
  • Rejects shape/layer mismatches with clear errors.
  • Honest design: the real forward-pass wrap (torch_patching) is pending — does not block F80.5.

Phase 80.3

qwen-scope-adapter

Load SAE-Res-Qwen3.5-2B-Base-W32K-L0_50 and map principles to features

✅ Shipped 🧪 14 tests T3 SAE / runtime

Adapter for the public Qwen-Scope SAEs (TopK with k=50 over the residual stream, 24 layers of Qwen3.5-2B-Base, W32K features). QwenScopeSAE.encode uses np.argpartition for an O(n·d_sae) TopK, decode reconstructs the residual, and reconstruction_error serves as a fidelity metric. The loader uses torch.load(weights_only=True), so no pickled code is executed. summarize_feature_activations maps principles to candidate features by the differential activation rate between positives and negatives.
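
A TopK encode built on np.argpartition might look like the sketch below. The weight layout (W_enc of shape d_model × d_sae, W_dec of shape d_sae × d_model) and the absence of an extra ReLU or decoder-bias subtraction are assumptions made for illustration; the actual Qwen-Scope checkpoint conventions may differ.

```python
import numpy as np


def topk_encode(x, W_enc, b_enc, k):
    """Keep the k largest pre-activations per row and zero the rest."""
    pre = np.atleast_2d(x) @ W_enc + b_enc              # (n, d_sae)
    # argpartition moves the k largest entries into the last k slots
    # in linear time per row, without sorting the full row.
    top = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    z[rows, top] = pre[rows, top]
    return z


def decode(z, W_dec, b_dec):
    """Linear reconstruction of the residual from the sparse features."""
    return z @ W_dec + b_dec


def reconstruction_error(x, x_hat):
    """Mean squared error between the residual and its reconstruction."""
    return float(np.mean((np.atleast_2d(x) - x_hat) ** 2))
```

The selected indices come back unordered, which is fine here because only membership in the top k matters, not the rank within it.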

What shipped

  • Frozen QwenScopeSAE with W_enc / b_enc / W_dec / b_dec + d_model + d_sae + k.
  • encode TopK is pure numpy with argpartition (O(n·d_sae) without a full sort).
  • decode reconstructs the residual; reconstruction_error as MSE.
  • load_qwen_scope_sae(path, layer, k) with torch.load(weights_only=True) — safe against pickle exploits.
  • Checks file exists BEFORE importing torch — FileNotFoundError fast path without the dep.
  • summarize_feature_activations returns the top_n features ordered by |rate_pos − rate_neg|.
  • FeatureActivationSummary with layer + feature_idx + rate_pos/neg + mean_pos/neg + differential_rate.
  • Test: TopK keeps the features with the largest pre-activations (mathematical verification).
  • 11 pure-numpy tests + 3 torch tests (importorskip): load/save round-trip, missing keys, missing file.
  • No hard dep on sae_lens — the adapter only needs torch to deserialise and numpy for everything else.
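
Ranking features by differential activation rate can be sketched as follows. The `FeatureSummary` dataclass is a trimmed, hypothetical stand-in for FeatureActivationSummary (no layer or mean fields), and treating any nonzero activation as "active" is an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class FeatureSummary:
    feature_idx: int
    rate_pos: float
    rate_neg: float
    differential_rate: float


def summarize_features(z_pos, z_neg, top_n=5):
    """Rank SAE features by |rate_pos - rate_neg|, where rate is the fraction of active rows."""
    rate_pos = (z_pos != 0).mean(axis=0)
    rate_neg = (z_neg != 0).mean(axis=0)
    diff = rate_pos - rate_neg
    # Stable sort keeps the lower feature index first when two features tie.
    order = np.argsort(-np.abs(diff), kind="stable")[:top_n]
    return [
        FeatureSummary(int(i), float(rate_pos[i]), float(rate_neg[i]), float(diff[i]))
        for i in order
    ]
```

The signed differential_rate is kept alongside the absolute ranking so a reader can tell whether a feature fires more on positives or on negatives.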

Phase 80.4

gemma-scope-wrapper

SOTA JumpReLU SAEs + cross-family validation Qwen ⟷ Gemma

✅ Shipped 🧪 7 tests T3 SAE / runtime

Wrapper around sae_lens.SAE for the JumpReLU SAEs of Gemma Scope (gemma-2-2b PT and gemma-2-9b PT, sites resid_post + mlp_out + attn_out, every layer, widths 16k–262k). Numpy interface identical to QwenScopeSAE so F80.3 and F80.4 are interchangeable downstream. Cross-family validation is the point: a moral feature emerging in two architecturally-distinct families is much stronger evidence than one alone. If features do not match, that result is itself informative.
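
To make the "identical numpy interface" point concrete, here is a minimal JumpReLU SAE written directly in numpy. It does not use sae_lens and is not the GemmaScopeSAE wrapper; the class name, weight layout and helper are assumptions chosen for illustration.

```python
import numpy as np


class JumpReLUSketch:
    """Numpy-in / numpy-out SAE with a per-feature threshold (JumpReLU)."""

    def __init__(self, W_enc, b_enc, W_dec, b_dec, threshold):
        self.W_enc, self.b_enc = W_enc, b_enc
        self.W_dec, self.b_dec = W_dec, b_dec
        self.threshold = threshold  # shape (d_sae,), one threshold per feature

    def encode(self, x):
        pre = np.atleast_2d(x) @ self.W_enc + self.b_enc
        # JumpReLU: keep the (non-negative) pre-activation above the threshold, else zero.
        return np.where(pre > self.threshold, np.maximum(pre, 0.0), 0.0)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec


def active_feature_rate(sae, x):
    """Works for any object exposing encode(x) -> (n, d_sae), whatever the SAE family."""
    return (sae.encode(x) != 0).mean(axis=0)
```

Because `active_feature_rate` only relies on `encode`, the same downstream code can consume a TopK SAE or a JumpReLU SAE, which is what makes the cross-family comparison cheap to run.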

What shipped

  • Frozen GemmaScopeSAE with layer + site (resid_post|mlp_out|attn_out) + d_model + d_sae + _inner SAELens object.
  • encode / decode delegate to sae_lens.SAE, but the public API is numpy-in / numpy-out.
  • Declarative _RELEASE_MAP for gemma-2-2b/-9b × resid/mlp/attn → SAELens release id.
  • _resolve_release(model_name, site) with a clear error when the combination is not registered.
  • load_gemma_scope_sae(model_name, site, layer, width, l0, device) with lazy sae_lens import.
  • summarize_gemma_features reuses the QwenScopeSAE summariser (same numpy contract).
  • _FakeSAELensSAE in tests to avoid the sae_lens dep on CI; the tests use importorskip torch.
  • Test verifying the correct error message when sae_lens is not installed.

Phase 80.5

runtime-probe-store-tier4

Persisted probes + observational fidelity_wrap Tier 4

✅ Shipped 🧪 21 tests T1 Loop closure

Closes the F80 loop into runtime. probe_store uses np.savez_compressed plus a JSON sidecar: no pickle, and forward-compatible across sklearn versions. RuntimeProbe.predict_proba implements a numerically stable numpy sigmoid that matches sklearn's predict_proba to 1e-5. ProbeEvaluator has two modes: eager (via TorchActivationCapturer, one forward pass per finding) and cache-only (accepts pre-captured activations). Tier 4 in fidelity_wrap is deliberately observational: it never vetoes a Finding by itself and only annotates probe_scores, probe_misses and probe_coherence (clear|confirms|conflicts|silent). Zero coupling: the ProbeEvaluatorCallable type lives in jw-agents, which does not import jw-interp.
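
The persistence format and the stable sigmoid can be sketched together. The helper names, file naming and metadata keys below are hypothetical; what the sketch takes from the text is the npz-plus-JSON layout, the ban on pickle, and the split sigmoid paths for x ≥ 0 and x < 0.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # x >= 0: exp(-x) <= 1
    ex = np.exp(x[~pos])                        # x < 0: exp(x) < 1
    out[~pos] = ex / (1.0 + ex)
    return out


def save_probe_sketch(stem, coef, bias, meta):
    """Weights go to <stem>.npz, metadata to <stem>.json; nothing is pickled."""
    stem = Path(stem)
    np.savez_compressed(stem.with_suffix(".npz"), coef=coef, bias=np.asarray(bias))
    stem.with_suffix(".json").write_text(json.dumps(meta))


def load_probe_sketch(stem):
    stem = Path(stem)
    # allow_pickle=False makes numpy refuse object arrays outright.
    with np.load(stem.with_suffix(".npz"), allow_pickle=False) as data:
        coef, bias = data["coef"], float(data["bias"])
    meta = json.loads(stem.with_suffix(".json").read_text())
    return coef, bias, meta


def predict_proba(coef, bias, X):
    """P(positive) per row: sigmoid of the linear decision function."""
    return stable_sigmoid(np.atleast_2d(X) @ coef + bias)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        save_probe_sketch(Path(tmp) / "probe", np.array([1.0, -2.0]), 0.5, {"layer": 4})
        coef, bias, meta = load_probe_sketch(Path(tmp) / "probe")
    print(predict_proba(coef, bias, [[0.0, 0.0]]), meta)
```

For a binary logistic regression, sklearn's positive-class probability is the sigmoid of the decision function, so storing only coef and bias is enough to reproduce it without sklearn at runtime.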

What shipped

  • save_probe / load_probe with np.savez_compressed + JSON sidecar (forward-compat, no pickle).
  • save_probe_set / load_probe_set with manifest.json (model_name + hidden_size + n_layers + version).
  • RuntimeProbe.predict_proba with numerically stable sigmoid (separate paths for x ≥ 0 and x < 0).
  • Mathematical test: RuntimeProbe.predict_proba matches sklearn LogisticRegression.predict_proba to 1e-5.
  • ProbeEvaluator with __call__ eager (one forward per finding) and score_cached (pre-captured activations).
  • build_probe_evaluator(probes_dir, capturer, model_name) auto-builds a capturer from the manifest if torch is present.
  • mock_evaluator(returns) for deterministic GPU-free tests.
  • fidelity_wrap accepts probe_evaluator: Callable[[str], dict[str, float]] + probe_min_score (default 0.5).
  • Per Finding: probe_scores (JSON), probe_misses (CSV), probe_min_score, probe_coherence in metadata.
  • Coherence categories: clear (no flags, internalised), confirms (regex + probe agree), conflicts (regex flag, probe high), silent (regex clean, probe miss — suspected shortcut).
  • Tier 4 errors swallowed → probe_error in metadata (a broken evaluator must never take down production).
  • Critical test: even with ALL probes at 0, the Finding is NOT dropped (on_fail='reject' is irrelevant to probes).
  • 14 probe_store + runtime tests + 7 Tier 4 tests in jw-agents (zero regression against the 930 existing).
  • Guide docs/guias/interpretabilidad-runtime.md with eager + cached quickstart + fidelity_wrap integration.
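
The observational Tier 4 contract (annotate, swallow errors, never drop) can be sketched as one pure function. `tier4_annotate` and `classify_coherence` are hypothetical names, the metadata is modelled as a plain dict, and counting a miss as "score strictly below probe_min_score" is an assumption; the four coherence labels and the metadata keys come from the list above.

```python
import json
from typing import Callable, Dict

# Plain callable type: the caller never needs to import the probing package.
ProbeEvaluatorCallable = Callable[[str], Dict[str, float]]


def classify_coherence(regex_flagged: bool, misses: list) -> str:
    """Cross the regex tier with the probe tier into one of four labels."""
    if regex_flagged:
        return "confirms" if misses else "conflicts"
    return "silent" if misses else "clear"


def tier4_annotate(text: str, regex_flagged: bool, metadata: dict,
                   probe_evaluator: ProbeEvaluatorCallable,
                   probe_min_score: float = 0.5) -> dict:
    """Add probe annotations to metadata in place; never raises, never drops anything."""
    try:
        scores = probe_evaluator(text)
        # Assumption: a miss is any principle scoring strictly below the threshold.
        misses = sorted(pid for pid, s in scores.items() if s < probe_min_score)
        metadata.update(
            probe_scores=json.dumps(scores, sort_keys=True),
            probe_misses=",".join(misses),
            probe_min_score=probe_min_score,
            probe_coherence=classify_coherence(regex_flagged, misses),
        )
    except Exception as exc:
        # A broken evaluator is recorded, not propagated.
        metadata["probe_error"] = repr(exc)
    return metadata
```

Note that the function has no return path that removes or rejects anything: even when every probe scores 0, the only effect is extra metadata.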

▌ End-to-end: from training to runtime Tier 4

The 6 sub-phases compose a coherent pipeline on top of Qwen3.5-0.8B + a Qwen 2B lab + a Gemma 2B lab:

# 1. F80.0 — SL-CAI critique over the SFT dataset before fine-tune
uv run jw-finetune build-critique-dataset \
    --workspace ws-sft \
    --synth-provider anthropic \
    --synth-model claude-haiku-4-5-20251001

# 2. F77-F79 — DPO/ORPO training on the revised dataset (unchanged)
uv run jw-finetune train --workspace ws-dpo

# 3. F80.1 — train probes against the fine-tuned model
uv sync --extra torch
python -c "
import os
from jw_interp import (
    PrincipleContrastiveBuilder, TorchActivationCapturer,
    build_default_contrastive_specs, train_probes_for_principle,
    save_probe_set, ProbeStoreManifest,
)
cap = TorchActivationCapturer('Qwen/Qwen3.5-0.8B')
builder = PrincipleContrastiveBuilder(build_default_contrastive_specs())
# Probe every 4th layer; derive the range from the model instead of hardcoding 24.
layers = list(range(0, cap.n_layers, 4))
results = []
for pid in builder.principle_ids:
    batches = cap.capture(builder.build(pid), layers=layers)
    results.extend(train_probes_for_principle(batches, pid))
# Expand '~' explicitly; harmless if save_probe_set already does so.
probes_dir = os.path.expanduser('~/jw-probes/v1')
save_probe_set(results, probes_dir, ProbeStoreManifest(
    model_name='Qwen/Qwen3.5-0.8B',
    hidden_size=cap.hidden_size,
    n_layers=cap.n_layers))
"

# 4. F80.5 — wire Tier 4 into production
python -c "
import os
from jw_interp.runtime import build_probe_evaluator
from jw_agents.fidelity_wrap import fidelity_wrap
from jw_eval.principles import load_principles

# Expand '~' explicitly; harmless if build_probe_evaluator already does so.
evaluator = build_probe_evaluator(probes_dir=os.path.expanduser('~/jw-probes/v1'))

@fidelity_wrap(
    on_fail='warn',
    principles=load_principles(),
    probe_evaluator=evaluator,  # Tier 4 is observational: it annotates, never vetoes
    probe_min_score=0.5,
)
async def apologetics(query: str): ...
"