opendataloader-project/opendataloader-pdf

CLAUDE.md


§ 01 — Stats

Stars: 21.1k

Forks: 2.0k

Prior: 1389

§ 02 — Use

Drop into your project.

A CLAUDE.md is just a markdown file at the root of your repo. Copy the content below into your own project's CLAUDE.md to give your agent the same context.

One-line install · current directory

$ npx versuz@latest install opendataloader-project-opendataloader-pdf --kind=claude-md

Or curl directly

$ curl -o CLAUDE.md https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/HEAD/CLAUDE.md

Project type: generic

# CLAUDE.md

## Gotchas

After changing CLI options in Java, you **must** run `npm run sync`. It regenerates `options.json` and all Python/Node.js bindings; forgetting this silently breaks the wrappers.

When using `--enrich-formula` or `--enrich-picture-description` on the hybrid server, the client **must** use `--hybrid-mode full`. Otherwise enrichments are silently skipped (they only run on the backend, not in Java).
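
Because the failure is silent, a wrapper or test could check the argument list before launching a run. The following is a minimal sketch, not part of the project: `HybridOptionGuard` and `enrichmentsWouldBeSkipped` are hypothetical names, and it assumes the mode is passed in the space-separated form `--hybrid-mode full` shown above.

```java
import java.util.List;

// Hypothetical helper (not part of opendataloader-pdf): flags argument lists
// where enrichment options are requested without `--hybrid-mode full`.
final class HybridOptionGuard {
    private static final List<String> ENRICH_FLAGS =
            List.of("--enrich-formula", "--enrich-picture-description");

    /** Returns true when the requested enrichments would be silently skipped. */
    static boolean enrichmentsWouldBeSkipped(List<String> args) {
        boolean wantsEnrichment = args.stream().anyMatch(ENRICH_FLAGS::contains);
        // Assumption: the mode value is the token right after `--hybrid-mode`.
        int modeIndex = args.indexOf("--hybrid-mode");
        boolean fullMode = modeIndex >= 0
                && modeIndex + 1 < args.size()
                && "full".equals(args.get(modeIndex + 1));
        return wantsEnrichment && !fullMode;
    }

    public static void main(String[] argv) {
        if (enrichmentsWouldBeSkipped(List.of(argv))) {
            System.err.println("Enrichment flags need --hybrid-mode full; they would be skipped.");
        }
    }
}
```

A check like this only inspects the client-side arguments; it says nothing about whether the hybrid server itself is configured for enrichment.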

Processing uses `ForkJoinPool(availableProcessors)` for per-page parallelism. All `StaticContainers` and `StaticLayoutContainers` ThreadLocal state must be propagated to worker threads via `propagateState.run()` — missing a ThreadLocal causes silent data loss or NPE in parallel mode.
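
The propagation pattern can be illustrated with a small, self-contained sketch. It is not the project's code: `DemoContainers` is a hypothetical stand-in for the real `StaticContainers` classes, and the `propagateState` runnable here only mimics the idea of capturing state on the submitting thread and re-applying it on each worker.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical stand-in for the project's ThreadLocal-backed static containers.
final class DemoContainers {
    private static final ThreadLocal<String> DOCUMENT_NAME = new ThreadLocal<>();

    static void setDocumentName(String name) { DOCUMENT_NAME.set(name); }
    static String getDocumentName() { return DOCUMENT_NAME.get(); }
    static void clear() { DOCUMENT_NAME.remove(); }
}

final class PropagationSketch {
    /** Processes pages in parallel; each result is "<documentName>#<page>". */
    static List<String> processPages(int pageCount, boolean propagate) {
        // Capture the state on the submitting thread, where it is set.
        final String captured = DemoContainers.getDocumentName();
        final Runnable propagateState = () -> DemoContainers.setDocumentName(captured);

        ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
        try {
            Callable<List<String>> task = () -> IntStream.range(0, pageCount)
                    .parallel()
                    .mapToObj(page -> {
                        // Worker threads start with empty ThreadLocals.
                        if (propagate) {
                            propagateState.run();
                        }
                        try {
                            return DemoContainers.getDocumentName() + "#" + page;
                        } finally {
                            DemoContainers.clear(); // avoid leaking state across tasks
                        }
                    })
                    .collect(Collectors.toList());
            return pool.submit(task).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        DemoContainers.setDocumentName("doc.pdf");
        System.out.println(processPages(3, true));
        System.out.println(processPages(3, false));
    }
}
```

With `propagate` set to `false`, the workers read `null` from the ThreadLocal, which is the kind of silent data loss (or NPE, once the value is dereferenced) described above. Any new ThreadLocal field needs to be added to the captured state in the same way.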

Hidden text detection (`--filter-hidden-text`) is **off by default** — it requires per-page PDF rendering via `ContrastRatioConsumer` which cannot be parallelized safely.

## Conventions

Manual docs live in the opendataloader.org repo. Reference docs (CLI options, JSON schema) are auto-generated by CI at release time and pushed to opendataloader.org. No MDX files are tracked in this repo.

## Benchmark

- `./scripts/bench.sh` — Run benchmark (auto-clones opendataloader-bench for PDFs and evaluation logic)
- `./scripts/bench.sh --doc-id <id>` — Debug specific document
- `./scripts/bench.sh --check-regression` — CI mode with threshold check
- Benchmark code lives in [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench)
- Metrics: **NID** (reading order), **TEDS** (table structure), **MHS** (heading structure), **Table Detection F1**, **Speed**
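
For orientation, the standard F1 score behind a metric like Table Detection F1 can be sketched as below. This shows only the textbook formula, F1 = 2TP / (2TP + FP + FN); how opendataloader-bench matches detected tables against ground truth is not described here, and the `DetectionF1` class is a hypothetical name for illustration.

```java
// Hypothetical illustration of the standard F1 formula; the benchmark's own
// matching and counting logic lives in opendataloader-bench.
final class DetectionF1 {
    /** F1 = 2TP / (2TP + FP + FN); defined as 0.0 when all counts are zero. */
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        int denominator = 2 * truePositives + falsePositives + falseNegatives;
        return denominator == 0 ? 0.0 : (2.0 * truePositives) / denominator;
    }

    public static void main(String[] args) {
        // 8 tables found correctly, 2 spurious detections, 2 missed tables.
        System.out.println(f1(8, 2, 2));
    }
}
```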