NEW Browse AI tools across categories — updated daily. See what's new →

Vision Ocr

Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in pr...

Version1.0.0
LicenseMIT
Token count~1,939
UpdatedJun 5, 2026

Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in privacy mode), pytesseract fallback. The engine is selected via `OCR_ENGINE` in `.course-meta` (written by /paideia:init-course) and can be overridden per-call with `/paideia:grade --ocr=<engine>`.

Install

Quick install

via npx skills · works with 57+ agents
npx skills add https://github.com/TaewoooPark/PAIDEIA
Or pick agent:
npx skills add TaewoooPark/PAIDEIA --agent claude-code
npx skills add TaewoooPark/PAIDEIA --agent cursor
npx skills add TaewoooPark/PAIDEIA --agent codex
npx skills add TaewoooPark/PAIDEIA --agent opencode
npx skills add TaewoooPark/PAIDEIA --agent github-copilot
npx skills add TaewoooPark/PAIDEIA --agent windsurf
More install options

Shorthand — useful for multi-skill repos:

npx skills add TaewoooPark/PAIDEIA

Manual — clone the repo and drop the folder into your agent's skills directory:

git clone https://github.com/TaewoooPark/PAIDEIA.git
cp -r PAIDEIA ~/.claude/skills/
How to use: Once installed, ask your agent to "use the Vision Ocr skill" or describe what you want (e.g. "Use whenever a hand-written or scanned answer PDF needs transcription to markdow"). Requires Node.js 18+.

Vision Ocr

Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in privacy mode), pytesseract fallback. The engine is selected via OCR_ENGINE in .course-meta (written by /paideia:init-course) and can be overridden per-call with /paideia:grade --ocr=<engine>.

---
name: vision-ocr
description: Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in privacy mode), pytesseract fallback. The engine is selected via OCR_ENGINE in .course-meta (written by /paideia:init-course) and can be overridden per-call with /paideia:grade --ocr=<engine>.
---

Vision-OCR

When to load

  • /grade needs to convert answers/.pdfanswers/converted/.md
  • Any hand-written / scanned document whose previous tesseract pass was garbled
  • answer-processing skill's step-2 conversion

Engine choice

.course-meta holds a single line OCR_ENGINE: <engine> written by /paideia:init-course. The grade command reads it and dispatches. Users can override per-call with /paideia:grade --ocr=<engine> [path].

| Engine | Default? | How it runs | When to pick it |
|---|---|---|---|
| claude | Yes | pdftoppm → Claude reads each PNG via the Read tool → synthesizes markdown inline. No external model. No subprocess. | The out-of-the-box path. Nothing to install. Highest fidelity on messy handwriting because Claude vision handles mixed-script (English/Korean) prose with LaTeX well. |
| ollama | opt-in | python3 ${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py --engine=ollama <pdf> <md> — local Qwen3-VL 8B, with an automatic tesseract fall-back if ollama is unreachable. Reads INTERFACE_LANG from .course-meta to set the prose-language rule. | You want the PDF to never leave the machine and you don't want to burn Claude tokens on OCR. Requires one-time ollama pull qwen3-vl:8b (~6 GB). |
| tesseract | opt-in | python3 ${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py --engine=tesseract <pdf> <md> — pytesseract (eng for en, eng+kor for ko, derived from .course-meta). | Zero cloud + no GPU/VRAM budget. Lowest fidelity on handwriting; fine for typed scans. |

All three emit answers/converted/<stem>.md with a <!-- SOURCE: ... --> / <!-- TIER: ... --> header comment that lets /grade caveat the confidence.

Tier 0 — Claude native vision (default)

Pipeline (driven by the /grade command, not this script):

answers/<stem>.pdf
  ↓ pdftoppm -r 200 -png <pdf> <tmpdir>/page   # rasterize to PNG per page
  ↓ Claude reads <tmpdir>/page-1.png, page-2.png, ... via the Read tool
  ↓ Claude synthesizes clean MD following the prompt contract below
answers/converted/<stem>.md
   └── header:  <!-- SOURCE: <stem>.pdf, claude-vision (native), N pages -->

The grade command handles the orchestration — rasterize, Read each page, synthesize into one markdown file in a single pass. No standalone driver script is required.

Tier 1 — Ollama Qwen3-VL 8B (opt-in)

answers/<stem>.pdf
  ↓ pdf2image @ 300dpi
  ↓ resize to ≤1200px wide (VLMs dislike huge inputs)
  ↓ base64 JPEG per page
  ↓ [Tier 1a] ollama qwen3-vl:8b
  ↓    (on timeout / ollama down)
  ↓ [Tier 1b] pytesseract (eng or eng+kor, from .course-meta INTERFACE_LANG)  ← auto-fallback inside the same script
answers/converted/<stem>.md
   └── header:  <!-- SOURCE: <stem>.pdf, qwen3-vl:8b @ 300dpi, N pages -->
                <!-- TIER: tesseract fallback -->  (only when 1a bombed)

Entrypoint:

python3 "${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py" --engine=ollama <input.pdf> <output.md>

${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py is the single source of truth. It:


  1. Warms up the VLM with a 1-token /api/generate so the first real page isn't stalled by model load.

  2. Sends each page as a JPEG-encoded base64 image with a language-aware, LaTeX-first prompt (prose-language rule selected by INTERFACE_LANG).

  3. Sets keep_alive: "15m" so the model stays in memory across pages within a session.

  4. On any exception (timeout, connection refused) falls back to pytesseract and marks the file.

Tier 2 — Tesseract explicit

python3 "${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py" --engine=tesseract <input.pdf> <output.md>

Skips ollama entirely. Header: <!-- TIER: tesseract (explicit) -->.

Prompt contract (must preserve across Tier 0 + Tier 1)

Whether synthesized by Claude inline (Tier 0) or by Qwen3-VL through this script (Tier 1), the transcription prompt must:

  • Keep prose in its original language (English, Korean, etc.) — do not translate
  • Emit math as $...$ / $$...$$
  • Preserve problem numbering (P1, (1), (a), ...)
  • NOT grade or interpret — just transcribe
  • Write [?] for ambiguous glyphs instead of guessing
  • Skip crossed-out work
  • Return markdown only, no <think>, no commentary

If you edit the prompt, keep these six clauses — they're what separates useful transcription from hallucination.

Dependencies

All engines need:


  • poppler binaries (pdftoppm, used by pdf2image). brew install poppler / apt-get install poppler-utils.

Tier 0 (claude): nothing beyond Claude Code itself.

Tier 1 (ollama) extras:


  • ollama CLI + model qwen3-vl:8b (~6.1 GB). brew install ollama && ollama serve & && ollama pull qwen3-vl:8b.

  • Python: pdf2image, pytesseract, pillow.

Tier 2 (tesseract) extras:


  • tesseract + tesseract-lang (or tesseract-ocr-kor on Debian). Python: pdf2image, pytesseract, pillow.

Performance notes

  • Tier 0 (claude): depends on Claude's per-image processing; typically a few seconds per page with no model-load stall.
  • Tier 1 (ollama) on Mac M-series: first page ~2–5 min (model load + decode); subsequent pages ~20–60 s.
  • A 2-page hand-written answer typically takes 3–7 min total on Tier 1, a few seconds on Tier 0, and <10 s on Tier 2 (but fidelity falls off a cliff).
  • 300dpi input is downscaled to 1200px before encoding — keeps the base64 payload under ~500 KB on Tier 1.

Failure modes + fixes

| Symptom | Cause | Fix |
|---|---|---|
| Tier 0 produces garbage | Scan too dim / skewed / low-res | Re-scan at 300dpi with the page flat, re-run |
| Tier 1 timed out on page 1 | first-load stall on cold ollama | re-run; warmup + keep_alive should help on 2nd try |
| Tier 1 empty response / <think>... leaks | prompt contract violated | re-check prompt; add "Return ONLY markdown, no <think>" |
| Tier 1 base64 error / 413 | image too large | drop MAX_IMG_WIDTH from 1200 → 1000 |
| Tier 1 ollama 404 | qwen3-vl:8b not pulled | ollama pull qwen3-vl:8b |
| Tier 1 tesseract fallback kept firing | ollama server not running | ollama serve & |

Anti-patterns

  • ❌ Don't pass base64 via curl -d <arg> — ARG_MAX overflow. Use stdlib urllib with POST body.
  • ❌ Don't send PNG to Qwen3-VL — JPEG q=90 is 5–10× smaller with no impact on VLM accuracy.
  • ❌ Don't ask any tier to grade or solve. That's /grade's job; OCR must stay pure transcription.
  • ❌ Don't trust Tier 1b (silent tesseract fallback) without reading the header — the file comment tells /grade to caveat its verdict.

Integration

  • Called by /grade via answer-processing skill step 2
  • Called by /ingest (future) for hand-written lecture notes, if any appear in materials/
  • Writes to answers/converted/ only; does not modify originals in answers/

---

Source: https://github.com/TaewoooPark/PAIDEIA
Author: TaewoooPark
Discovered via: skillsdirectory.com
Genre: development

SKILL.md source

---
name: Vision Ocr
description: Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in pr...
---

# Vision Ocr

Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in privacy mode), pytesseract fallback. The engine is selected via `OCR_ENGINE` in `.course-meta` (written by /paideia:init-course) and can be overridden per-call with `/paideia:grade --ocr=<engine>`.

---
name: vision-ocr
description: Use whenever a hand-written or scanned answer PDF needs transcription to markdown for /grade. Three tiers — Claude native vision (default, no extra install), local Qwen3-VL 8B via ollama (opt-in privacy mode), pytesseract fallback. The engine is selected via `OCR_ENGINE` in `.course-meta` (written by /paideia:init-course) and can be overridden per-call with `/paideia:grade --ocr=<engine>`.
---

# Vision-OCR

## When to load

- `/grade` needs to convert `answers/*.pdf` → `answers/converted/*.md`
- Any hand-written / scanned document whose previous tesseract pass was garbled
- `answer-processing` skill's step-2 conversion

## Engine choice

`.course-meta` holds a single line `OCR_ENGINE: <engine>` written by `/paideia:init-course`. The grade command reads it and dispatches. Users can override per-call with `/paideia:grade --ocr=<engine> [path]`.

| Engine | Default? | How it runs | When to pick it |
|---|---|---|---|
| `claude` | **Yes** | `pdftoppm` → Claude reads each PNG via the Read tool → synthesizes markdown inline. No external model. No subprocess. | The out-of-the-box path. Nothing to install. Highest fidelity on messy handwriting because Claude vision handles mixed-script (English/Korean) prose with LaTeX well. |
| `ollama` | opt-in | `python3 ${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py --engine=ollama <pdf> <md>` — local Qwen3-VL 8B, with an automatic tesseract fall-back if ollama is unreachable. Reads `INTERFACE_LANG` from `.course-meta` to set the prose-language rule. | You want the PDF to never leave the machine *and* you don't want to burn Claude tokens on OCR. Requires one-time `ollama pull qwen3-vl:8b` (~6 GB). |
| `tesseract` | opt-in | `python3 ${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py --engine=tesseract <pdf> <md>` — pytesseract (`eng` for en, `eng+kor` for ko, derived from `.course-meta`). | Zero cloud + no GPU/VRAM budget. Lowest fidelity on handwriting; fine for typed scans. |

All three emit `answers/converted/<stem>.md` with a `<!-- SOURCE: ... -->` / `<!-- TIER: ... -->` header comment that lets `/grade` caveat the confidence.

## Tier 0 — Claude native vision (default)

**Pipeline (driven by the `/grade` command, not this script):**

```
answers/<stem>.pdf
  ↓ pdftoppm -r 200 -png <pdf> <tmpdir>/page   # rasterize to PNG per page
  ↓ Claude reads <tmpdir>/page-1.png, page-2.png, ... via the Read tool
  ↓ Claude synthesizes clean MD following the prompt contract below
answers/converted/<stem>.md
   └── header:  <!-- SOURCE: <stem>.pdf, claude-vision (native), N pages -->
```

The grade command handles the orchestration — rasterize, Read each page, synthesize into one markdown file in a single pass. No standalone driver script is required.

## Tier 1 — Ollama Qwen3-VL 8B (opt-in)

```
answers/<stem>.pdf
  ↓ pdf2image @ 300dpi
  ↓ resize to ≤1200px wide (VLMs dislike huge inputs)
  ↓ base64 JPEG per page
  ↓ [Tier 1a] ollama qwen3-vl:8b
  ↓    (on timeout / ollama down)
  ↓ [Tier 1b] pytesseract (eng or eng+kor, from .course-meta INTERFACE_LANG)  ← auto-fallback inside the same script
answers/converted/<stem>.md
   └── header:  <!-- SOURCE: <stem>.pdf, qwen3-vl:8b @ 300dpi, N pages -->
                <!-- TIER: tesseract fallback -->  (only when 1a bombed)
```

**Entrypoint:**

```bash
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py" --engine=ollama <input.pdf> <output.md>
```

`${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py` is the single source of truth. It:
1. Warms up the VLM with a 1-token `/api/generate` so the first real page isn't stalled by model load.
2. Sends each page as a JPEG-encoded base64 image with a language-aware, LaTeX-first prompt (prose-language rule selected by `INTERFACE_LANG`).
3. Sets `keep_alive: "15m"` so the model stays in memory across pages within a session.
4. On any exception (timeout, connection refused) falls back to pytesseract and marks the file.

## Tier 2 — Tesseract explicit

```bash
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/vision_ocr.py" --engine=tesseract <input.pdf> <output.md>
```

Skips ollama entirely. Header: `<!-- TIER: tesseract (explicit) -->`.

## Prompt contract (must preserve across Tier 0 + Tier 1)

Whether synthesized by Claude inline (Tier 0) or by Qwen3-VL through this script (Tier 1), the transcription prompt must:

- Keep prose in its original language (English, Korean, etc.) — do not translate
- Emit math as `$...$` / `$$...$$`
- Preserve problem numbering (P1, (1), (a), ...)
- NOT grade or interpret — just transcribe
- Write `[?]` for ambiguous glyphs instead of guessing
- Skip crossed-out work
- Return markdown only, no `<think>`, no commentary

If you edit the prompt, keep these six clauses — they're what separates useful transcription from hallucination.

## Dependencies

**All engines need:**
- `poppler` binaries (`pdftoppm`, used by pdf2image). `brew install poppler` / `apt-get install poppler-utils`.

**Tier 0 (claude):** nothing beyond Claude Code itself.

**Tier 1 (ollama) extras:**
- `ollama` CLI + model `qwen3-vl:8b` (~6.1 GB). `brew install ollama && ollama serve & && ollama pull qwen3-vl:8b`.
- Python: `pdf2image`, `pytesseract`, `pillow`.

**Tier 2 (tesseract) extras:**
- `tesseract` + `tesseract-lang` (or `tesseract-ocr-kor` on Debian). Python: `pdf2image`, `pytesseract`, `pillow`.

## Performance notes

- **Tier 0 (claude):** depends on Claude's per-image processing; typically a few seconds per page with no model-load stall.
- **Tier 1 (ollama) on Mac M-series:** first page ~2–5 min (model load + decode); subsequent pages ~20–60 s.
- A 2-page hand-written answer typically takes 3–7 min total on Tier 1, a few seconds on Tier 0, and <10 s on Tier 2 (but fidelity falls off a cliff).
- 300dpi input is downscaled to 1200px before encoding — keeps the base64 payload under ~500 KB on Tier 1.

## Failure modes + fixes

| Symptom | Cause | Fix |
|---|---|---|
| Tier 0 produces garbage | Scan too dim / skewed / low-res | Re-scan at 300dpi with the page flat, re-run |
| Tier 1 `timed out` on page 1 | first-load stall on cold ollama | re-run; warmup + `keep_alive` should help on 2nd try |
| Tier 1 empty response / `<think>...` leaks | prompt contract violated | re-check prompt; add "Return ONLY markdown, no <think>" |
| Tier 1 base64 error / 413 | image too large | drop `MAX_IMG_WIDTH` from 1200 → 1000 |
| Tier 1 ollama 404 | `qwen3-vl:8b` not pulled | `ollama pull qwen3-vl:8b` |
| Tier 1 tesseract fallback kept firing | ollama server not running | `ollama serve &` |

## Anti-patterns

- ❌ Don't pass base64 via `curl -d <arg>` — ARG_MAX overflow. Use stdlib `urllib` with POST body.
- ❌ Don't send PNG to Qwen3-VL — JPEG q=90 is 5–10× smaller with no impact on VLM accuracy.
- ❌ Don't ask any tier to grade or solve. That's `/grade`'s job; OCR must stay pure transcription.
- ❌ Don't trust Tier 1b (silent tesseract fallback) without reading the header — the file comment tells `/grade` to caveat its verdict.

## Integration

- Called by `/grade` via `answer-processing` skill step 2
- Called by `/ingest` (future) for hand-written lecture notes, if any appear in `materials/`
- Writes to `answers/converted/` only; does not modify originals in `answers/`


---

**Source**: https://github.com/TaewoooPark/PAIDEIA
**Author**: TaewoooPark
**Discovered via**: skillsdirectory.com
**Genre**: development

Related skills 6

caveman

★ Featured

Ultra-compressed communication mode. Cuts token usage ~75% by speaking like caveman while keeping full technical accuracy. Supports intensity levels: lite, full (default), ultra, wenyan-lite, wenyan-full, wenyan-ultra. Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman. Also auto-triggers when token efficiency is requested.

juliusbrussee 167k
Development

secure-linux-web-hosting

★ Featured

Use when setting up, hardening, or reviewing a cloud server for self-hosting, including DNS, SSH, firewalls, Nginx, static-site hosting, reverse-proxying an app, HTTPS with Let's Encrypt or ACME clients, safe HTTP-to-HTTPS redirects, or optional post-launch network tuning such as BBR.

xixu-me 155k
Development

readme-i18n

★ Featured

Use when the user wants to translate a repository README, make a repo multilingual, localize docs, add a language switcher, internationalize the README, or update localized README variants in a GitHub-style repository.

xixu-me 155k
Development

lark-shared

★ Featured

Use when first setting up lark-cli, running auth login, switching user/bot identity (--as), handling permission denied or scope errors, needing to update lark-cli, or seeing _notice in JSON output.

larksuite 155k
Development

improve-codebase-architecture

★ Featured

Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable.

mattpocock 151k
Development

paper-context-resolver

★ Featured

Optional RigorPilot helper for README-first deep learning repo reproduction. Use only when the README and repository files leave a narrow reproduction-critical gap and the task is to resolve a specific paper detail such as dataset split, preprocessing, evaluation protocol, checkpoint mapping, or runtime assumption from primary paper sources while recording conflicts. Do not use for general paper summary, repo scanning, environment setup, command execution, title-only paper lookup, or replacin...

lllllllama 127k
Development