Independent Project: not affiliated with, sponsored by, or endorsed by the Watch Tower Bible and Tract Society or Jehovah's Witnesses.
jw-agent-toolkit

VLM-OCR (Phase 36)

jw_core.vision.vlm replaces the legacy Tesseract OCR path with a typed, structured Vision-Language-Model pipeline that returns one block per typographic element on the page.

Quick start

from jw_core.vision import extract_bible_reference_from_image_v2

out = extract_bible_reference_from_image_v2(
    "path/to/page.png", language="es"
)
print(out["reference"])         # parsed BibleRef.model_dump() or None
print(out["text"])              # raw text fallback (compat)
for block in out["structured_page"].blocks:
    print(block.kind, block.text)
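
Because the pipeline returns one block per typographic element, a common next step is to group block text by kind. The following is a minimal sketch, assuming only that each block exposes the `kind` and `text` attributes used above; `group_text_by_kind` and the kind names in the stand-in data are hypothetical and not part of jw_core.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_kind(blocks):
    """Map each block kind to the list of block texts, in page order."""
    grouped = defaultdict(list)
    for block in blocks:
        grouped[block.kind].append(block.text)
    return dict(grouped)


# Stand-in blocks; the kind names are made up for illustration only.
sample_blocks = [
    SimpleNamespace(kind="heading", text="Title"),
    SimpleNamespace(kind="paragraph", text="First paragraph."),
    SimpleNamespace(kind="paragraph", text="Second paragraph."),
]

for kind, texts in group_text_by_kind(sample_blocks).items():
    print(kind, len(texts))
```

With a real result, the same helper would be called as `group_text_by_kind(out["structured_page"].blocks)`.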

Choosing a provider

| Hardware | Provider | Install |
|---|---|---|
| Apple Silicon | qwen3vl_local (mlx) | uv pip install jw-core[vlm-mlx] + huggingface-cli download mlx-community/Qwen3-VL-2B-Instruct-4bit |
| NVIDIA GPU | qwen3vl_local (vllm) | uv pip install jw-core[vlm-nvidia] |
| CPU only | qwen3vl_local (gguf) | uv pip install jw-core[vlm-cpu] + download GGUF |
| API only | claude_vision | uv pip install jw-core[vlm-anthropic] + ANTHROPIC_API_KEY |
| API only | openai_vision | uv pip install jw-core[vlm-openai] + OPENAI_API_KEY |
| API only | qwen3vl_api | uv pip install jw-core[vlm-api-qwen] + JW_QWEN3VL_API_KEY + JW_QWEN3VL_API_BASE |
| Last resort | tesseract_fallback | brew install tesseract + uv pip install jw-core[vlm-tesseract] |

The factory picks the first available backend from this chain: qwen3vl_local → qwen3vl_api → claude_vision → openai_vision → tesseract_fallback.

Force a provider:

export JW_VLM_PROVIDER=claude_vision
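
The selection behaviour described above (a forced provider wins, otherwise the first available backend in the chain) can be sketched as follows. This is an illustration under stated assumptions, not the factory's actual code: `pick_provider` and the `is_available` probe are hypothetical names, and the real availability checks may differ.

```python
import os
from typing import Callable, Mapping, Optional

# Fallback order as documented in this guide.
PROVIDER_CHAIN = (
    "qwen3vl_local",
    "qwen3vl_api",
    "claude_vision",
    "openai_vision",
    "tesseract_fallback",
)


def pick_provider(
    is_available: Callable[[str], bool],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the forced provider if set, else the first available one.

    `is_available` is a hypothetical probe (for example "is the SDK
    importable and is the API key present?").
    """
    env = os.environ if env is None else env
    forced = env.get("JW_VLM_PROVIDER")
    if forced:
        return forced
    for name in PROVIDER_CHAIN:
        if is_available(name):
            return name
    raise RuntimeError("no VLM provider available")


# Example: only the Anthropic and Tesseract backends are usable.
usable = {"claude_vision", "tesseract_fallback"}
print(pick_provider(lambda name: name in usable, env={}))
```

Passing `env` explicitly keeps the sketch easy to test without touching the process environment.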

Model overrides:

  • JW_CLAUDE_VISION_MODEL: default claude-haiku-4-5. ClaudeVisionProvider is an adapter over the anthropic SDK; Claude is natively multimodal.
  • JW_OPENAI_VISION_MODEL: default gpt-4o-mini.
  • JW_QWEN3VL_LOCAL_MODEL: model ID or path for the local Qwen3-VL backend.
  • JW_QWEN3VL_LOCAL_TARGET: mlx | nvidia | cpu.
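
To make the override rules concrete, here is a small sketch of how these variables might be resolved against their documented defaults. `resolve_overrides` is a hypothetical helper, and treating an empty value as "unset" is an assumption; the library's own handling may differ.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the list above; variables without a documented
# default are left as None.
DOCUMENTED_DEFAULTS = {
    "JW_CLAUDE_VISION_MODEL": "claude-haiku-4-5",
    "JW_OPENAI_VISION_MODEL": "gpt-4o-mini",
    "JW_QWEN3VL_LOCAL_MODEL": None,
    "JW_QWEN3VL_LOCAL_TARGET": None,
}
VALID_LOCAL_TARGETS = {"mlx", "nvidia", "cpu"}


def resolve_overrides(env: Optional[Mapping[str, str]] = None) -> dict:
    """Merge environment overrides with the documented defaults."""
    env = os.environ if env is None else env
    resolved = {
        name: env.get(name) or default  # assumption: empty string means unset
        for name, default in DOCUMENTED_DEFAULTS.items()
    }
    target = resolved["JW_QWEN3VL_LOCAL_TARGET"]
    if target is not None and target not in VALID_LOCAL_TARGETS:
        raise ValueError(f"unsupported JW_QWEN3VL_LOCAL_TARGET: {target!r}")
    return resolved


settings = resolve_overrides({"JW_QWEN3VL_LOCAL_TARGET": "mlx"})
print(settings["JW_CLAUDE_VISION_MODEL"], settings["JW_QWEN3VL_LOCAL_TARGET"])
```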

CLI

JW_VLM_PROVIDER=fake jw image extract path/to/page.png --language es
JW_VLM_PROVIDER=fake jw image ingest  path/to/page.png --language es \
    --store ~/.jw-toolkit/rag

MCP

The MCP server exposes two new tools:

  • extract_structured_page(image_path, language) → StructuredPage JSON.
  • ingest_image_to_rag(image_path, language) → {"chunks": n}.
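
For clients that talk to the server directly, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below only builds the request payload for the two tools listed above; `tool_call_request` is a hypothetical helper, and the transport (stdio, HTTP) and server startup are left out.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids per session


def tool_call_request(name: str, **arguments) -> dict:
    """Build an MCP `tools/call` request for the given tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tool_call_request(
    "extract_structured_page",
    image_path="path/to/page.png",
    language="es",
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library would usually construct and send this message for you; the payload is shown here only to make the tool signatures concrete.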

Migrating from ocr_image()

ocr_image() still works but emits a DeprecationWarning. Drop-in replacement:

from jw_core.vision import migrate_to_vlm

ocr_image = migrate_to_vlm()   # callable with same (path, language=) signature
text = ocr_image("page.png", language="es")
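
During a migration it can help to detect which call paths still reach the deprecated function. The sketch below uses the standard `warnings` module to do that; `legacy_ocr_image` is a stand-in that mimics the documented warning and is not the real jw_core implementation, and `calls_deprecated_api` is a hypothetical helper.

```python
import warnings


def legacy_ocr_image(path: str, language: str = "en") -> str:
    """Stand-in for the deprecated function (NOT the real jw_core code)."""
    warnings.warn("ocr_image() is deprecated", DeprecationWarning, stacklevel=2)
    return ""


def calls_deprecated_api(fn, *args, **kwargs) -> bool:
    """Run fn and report whether it emitted a DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record even normally hidden warnings
        fn(*args, **kwargs)
    return any(issubclass(w.category, DeprecationWarning) for w in caught)


print(calls_deprecated_api(legacy_ocr_image, "page.png", language="es"))
```

In a test suite, a similar effect can be had by running Python with `-W error::DeprecationWarning`, which turns any remaining deprecated call into a failure.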

Boundaries

  • One image per call. Multi-page PDFs: see Phase 37 (colpali-visual).
  • Local weights are not distributed; the user downloads them with huggingface-cli.
  • No fine-tuning here (see Phase 11 / jw-finetune).
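
Since each call handles a single image, processing several pages means looping on the caller's side. A minimal sketch, assuming the extractor is passed in as a callable with a `(path, language=)` signature; `extract_many` and `fake_extract` are hypothetical names used only for this illustration.

```python
from typing import Callable, Dict, Iterable


def extract_many(
    paths: Iterable[str],
    extract_one: Callable[..., dict],
    language: str = "es",
) -> Dict[str, dict]:
    """Call the single-image extractor once per path, keyed by path."""
    results = {}
    for path in paths:
        results[str(path)] = extract_one(str(path), language=language)
    return results


def fake_extract(path: str, language: str = "es") -> dict:
    """Stand-in extractor so the sketch runs without any VLM backend."""
    return {"reference": None, "text": f"{path}:{language}"}


pages = ["page-1.png", "page-2.png"]
for path, out in extract_many(pages, fake_extract).items():
    print(path, out["text"])
```

With the real library, `extract_bible_reference_from_image_v2` would take the place of `fake_extract`.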
