Document · VivekKarmarkar · Free

find-paper-by-title

Search a directory tree for PDF(s) whose title matches a query string. Uses PDF metadata first (fast), falls back to filename and first-page text if needed. Groups byte-identical duplicates via MD5 so you see which copies are the same file vs different versions. Use when the user asks "do I have a paper called X?" or "where is the paper with title Y in my folder?" or wants to dedupe papers across subfolders.

View on GitHub: github.com/VivekKarmarkar/claude-code-os

§ 02 — Install

Get find-paper-by-title.

Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.

One-line install · Claude Code

$ npx versuz@latest install vivekkarmarkar-claude-code-os-skills-find-paper-by-title

Or clone the repo

$ git clone https://github.com/VivekKarmarkar/claude-code-os.git

Or copy the SKILL.md manually

$ cp claude-code-os/SKILL.MD ~/.claude/skills/vivekkarmarkar-claude-code-os-skills-find-paper-by-title/SKILL.md

---
name: find-paper-by-title
description: Search a directory tree for PDF(s) whose title matches a query string. Uses PDF metadata first (fast), falls back to filename and first-page text if needed. Groups byte-identical duplicates via MD5 so you see which copies are the same file vs different versions. Use when the user asks "do I have a paper called X?" or "where is the paper with title Y in my folder?" or wants to dedupe papers across subfolders.
---

# find-paper-by-title

Fast PDF title search across a directory tree. Returns all matches, grouped by content hash so duplicates are visible.

## Arguments

- `<title>` — paper title (may be truncated or paraphrased)
- `--root <dir>` (optional, default: current directory) — where to search
- `--threshold <0-1>` (optional, default: 0.5) — minimum match score
- `--deep` (optional) — also extract first-page text for PDFs whose metadata is missing/junky (slower)
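
As an illustration of how these arguments fit together, here is a minimal `argparse` sketch. It is not the helper's actual code: `build_parser` and `parse_args` are hypothetical names, and the range check on `--threshold` is an assumption based on the documented `0-1` range. The `--json` flag is included because the Report step mentions it.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="search_paper.py")
    parser.add_argument("title", help="paper title (may be truncated or paraphrased)")
    parser.add_argument("--root", default=os.getcwd(), help="directory to search")
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="minimum match score, between 0 and 1")
    parser.add_argument("--deep", action="store_true",
                        help="also extract first-page text (slower)")
    parser.add_argument("--json", action="store_true",
                        help="print machine-readable output")
    return parser


def parse_args(argv):
    """Parse argv and reject thresholds outside the documented range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0.0 <= args.threshold <= 1.0:
        # Assumption: out-of-range thresholds are treated as a usage error.
        parser.error("--threshold must be between 0 and 1")
    return args
```

Calling `parse_args(["some title", "--deep"])` would leave `--root` at the current directory and `--threshold` at 0.5, matching the defaults listed above.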

## Pipeline

### Step 0 — Determine the search root

If the user named a directory, use it. Otherwise default to `$PWD`. For this project, the common search root is the PAT-Scan repo.

### Step 1 — Run the search

```bash
python3 ~/.claude/skills/find-paper-by-title/helpers/search_paper.py \
  "<title>" \
  --root "<dir>" \
  [--threshold 0.5] \
  [--deep] \
  [--json]
```

The helper:
1. Walks the tree collecting all `*.pdf` files.
2. For each PDF, extracts candidate titles from PDF metadata (`pdfinfo`) and normalized filename.
3. Scores each candidate against the query using token overlap + substring match.
4. Keeps matches above the threshold.
5. Computes MD5 of each match and groups duplicates.
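
The collection and scoring steps above can be sketched as follows. This is one plausible reading of "token overlap + substring match", not the helper's real formula: the function names (`normalize`, `filename_candidate`, `score`, `find_pdfs`) and the choice to treat a contiguous whole-query match as a score of 1.0 are assumptions made for illustration.

```python
import re
from pathlib import Path


def normalize(text: str) -> list[str]:
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filename_candidate(path: Path) -> str:
    """Derive a title candidate from the filename; the file itself is untouched."""
    return " ".join(normalize(path.stem))


def score(query: str, candidate: str) -> float:
    """Return a score in [0, 1] for how well the candidate covers the query."""
    q_tokens, c_tokens = normalize(query), normalize(candidate)
    if not q_tokens or not c_tokens:
        return 0.0
    # Substring match on whole tokens: the padded query must appear contiguously.
    if f" {' '.join(q_tokens)} " in f" {' '.join(c_tokens)} ":
        return 1.0
    # Token overlap: fraction of distinct query tokens found in the candidate.
    q_set = set(q_tokens)
    return len(q_set & set(c_tokens)) / len(q_set)


def find_pdfs(root: Path) -> list[Path]:
    """Walk the tree and collect every file with a .pdf extension."""
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.suffix.lower() == ".pdf")
```

Because the overlap is measured against the query tokens only, a truncated query can still score highly against a longer stored title, which is the behavior the examples below rely on.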

### Step 2 — Report

The helper prints human-readable output by default:
- Number of matches and distinct files
- For each distinct file: title, score, MD5, and all paths where that content lives
- `[DUPLICATE]` tag if the same file appears in multiple locations, `[UNIQUE]` otherwise
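
The duplicate grouping behind this report can be sketched with `hashlib` from the Python standard library. The helper names (`md5_of`, `group_by_content`, `tag`) are hypothetical, and reading in 1 MiB chunks is an assumption chosen to keep memory use flat on large PDFs.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are never loaded whole."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths) -> dict:
    """Map each MD5 digest to every path whose bytes produce it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of(path)].append(path)
    return dict(groups)


def tag(paths) -> str:
    """Label a group the way the report does."""
    return "[DUPLICATE]" if len(paths) > 1 else "[UNIQUE]"
```

MD5 is used here only to detect byte-identical copies, not for security. Two PDFs of the same paper that differ by even one byte (for example a re-download with a different watermark) will land in separate groups.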

If the user wants machine-readable output, pass `--json`.

## Examples

**Exact title, no duplicates:**
```bash
python3 search_paper.py "A novel tactile tomography system based on mechanical principles for internal 3D imaging" --root /path/to/PAT-Scan
```

**Truncated title (works because matching is on normalized tokens, not exact strings):**
```bash
python3 search_paper.py "JAX-SSO differentiable finite element analysis solver" --root /path/to/PAT-Scan
```

**Paper stored under a filename that doesn't mention the title (fall back to `--deep`):**
```bash
python3 search_paper.py "<paraphrased title>" --root /path/to/papers --deep
```

## When to use `--deep`

Default mode relies on PDF metadata + filename. It's fast (~5 seconds for 250 PDFs). Use `--deep` only when:
- A paper is known to exist but the default search can't find it
- The collection includes scanned PDFs without metadata or bare arXiv downloads
- The user explicitly wants exhaustive search
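
To make the default mode concrete, here is a sketch of reading a title from `pdfinfo` output and deciding whether it is usable. The parsing relies on `pdfinfo` printing `Key: value` lines; the `is_usable_title` heuristic is purely an assumption about what "missing or junk" metadata might mean, and all function names are hypothetical.

```python
import re
import subprocess
from typing import Optional


def title_from_pdfinfo(output: str) -> Optional[str]:
    """Pull the value of the 'Title:' line out of pdfinfo's text output."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def is_usable_title(title: Optional[str]) -> bool:
    """Assumed heuristic: at least two words and not just a leftover filename."""
    if not title:
        return False
    words = re.findall(r"[A-Za-z]{2,}", title)
    looks_like_filename = title.lower().endswith(
        (".pdf", ".doc", ".docx", ".tex", ".dvi"))
    return len(words) >= 2 and not looks_like_filename


def metadata_title(pdf_path: str) -> Optional[str]:
    """Run pdfinfo on one file; return None if it is missing or fails."""
    try:
        result = subprocess.run(["pdfinfo", pdf_path], capture_output=True,
                                text=True, errors="replace", check=False)
    except FileNotFoundError:  # pdfinfo is not installed
        return None
    if result.returncode != 0:
        return None
    return title_from_pdfinfo(result.stdout)
```

A PDF for which `is_usable_title(metadata_title(path))` is false is the kind of file that would only be found through its filename or through `--deep`.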

`--deep` adds first-page text extraction, which is slower (~30s for 250 PDFs, versus ~5s by default) and unnecessary for most searches.
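
First-page extraction could look like the sketch below, which assumes Poppler's `pdftotext` is installed (`-f` and `-l` select the first and last page to convert, and `-` sends the text to stdout). Whether the helper uses `pdftotext` or a Python PDF library is not stated here, and `title_guess` is a hypothetical heuristic rather than documented behavior.

```python
import subprocess
from typing import List, Optional


def first_page_command(pdf_path: str) -> List[str]:
    """Build the pdftotext call that converts page 1 only and writes to stdout."""
    return ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"]


def first_page_text(pdf_path: str, timeout: float = 10.0) -> Optional[str]:
    """Return the first page's text, or None if extraction is not possible."""
    try:
        result = subprocess.run(first_page_command(pdf_path), capture_output=True,
                                text=True, errors="replace", timeout=timeout,
                                check=False)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # pdftotext is not installed, or the file took too long to process.
        return None
    return result.stdout if result.returncode == 0 else None


def title_guess(page_text: str, max_lines: int = 3) -> str:
    """Assumed heuristic: join the first few non-empty lines as a title candidate."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    return " ".join(lines[:max_lines])
```

Spawning one process per PDF is the main reason a deep pass costs more than the metadata pass, so it makes sense to run it only on files whose metadata and filename produced no usable candidate.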

## Do-not-touch rules

- Never modify or move any PDF. This skill is read-only.
- Filename normalization is applied in memory for matching only; the actual filename on disk is not touched.