---
name: find-similar-papers
description: Find PDFs in a directory tree that are semantically similar to a query paper. Uses TF-IDF cosine similarity over title (weighted 3x) plus first 2 pages of text. Fast baseline — no model download needed, works locally with scikit-learn. Use when the user asks "are there any papers similar to X in my library?" or "what papers in my lit review are closest to this new paper I just got?"
---
# find-similar-papers
Fast TF-IDF-based similarity search across a PDF library.
## Arguments
- `<title>` OR `<path>` OR `<text>` — the query. Choose one:
  - A paper **title** — the helper treats it as a short-but-weighted query
  - A paper **file path** — the helper extracts the title plus the first 2 pages from the PDF
  - Free **text** — arbitrary keywords or a sentence
- `--root <dir>` (optional, default: CWD) — where to search
- `--top <N>` (optional, default: 15) — how many matches to return
- `--threshold <0-1>` (optional, default: 0) — minimum similarity score
- `--exclude-self` (optional) — when the query path lies inside the search root, drop it from the results
## Pipeline
### Step 0 — Determine the query mode
- If user gave a PDF path, use `--query-path`
- If user gave a title string, use `--query-title`
- If user gave keywords / a sentence, use `--query-text`
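The three modes map to three mutually exclusive flags. For illustration (the library root and paper names below are made up):

```bash
HELPER=~/.claude/skills/find-similar-papers/helpers/find_similar.py

# Title query
python3 "$HELPER" --root ~/papers --query-title "Ergodic exploration for active sensing"

# PDF query; the query paper lives in the library, so exclude it from results
python3 "$HELPER" --root ~/papers --query-path ~/papers/inbox/smith2024.pdf --exclude-self

# Free-text query
python3 "$HELPER" --root ~/papers --query-text "koopman operator control"
```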
### Step 1 — Run the helper
```bash
python3 ~/.claude/skills/find-similar-papers/helpers/find_similar.py \
  --root "<dir>" \
  --query-path "<pdf>" \
  [--top 15] [--threshold 0.05] [--exclude-self]
```
The helper:
1. Walks the tree collecting all `*.pdf` files.
2. Extracts title (metadata, weighted 3x) + first 2 pages of text for each.
3. Vectorizes with TF-IDF (unigrams + bigrams, English stopwords, min_df=2, max_df=0.7).
4. Computes cosine similarity between query vector and every candidate.
5. Prints top-N matches with titles, scores, and paths.
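Steps 3–4 reduce to a few lines of scikit-learn. A simplified sketch of the ranking core (function and document names are illustrative, not the actual helper's; the 3x title weighting is approximated by repeating the title):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(query_text, docs):
    """docs: list of (path, title, first_pages_text) tuples.
    Returns (path, score) pairs sorted by descending similarity."""
    # Repeat the title 3 times to approximate the 3x title weight
    corpus = [" ".join([title] * 3) + " " + body for _, title, body in docs]
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams + bigrams
        stop_words="english",
        min_df=2,              # term must appear in at least 2 docs
        max_df=0.7,            # drop near-ubiquitous terms
    )
    matrix = vectorizer.fit_transform(corpus)
    query_vec = vectorizer.transform([query_text])
    scores = cosine_similarity(query_vec, matrix).ravel()
    return sorted(zip((p for p, _, _ in docs), scores),
                  key=lambda pair: pair[1], reverse=True)
```

Note that `min_df=2` means a library of one or two PDFs can yield an empty vocabulary; the real corpus is assumed to be larger.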
### Step 2 — Report
For each match, show similarity score, title, and path. Interpret scores:
- **≥ 0.25** — strong topical match
- **0.15–0.25** — clearly adjacent work
- **0.05–0.15** — loose thematic overlap
- **< 0.05** — probably unrelated, ignore
Scores rarely exceed 0.4 with TF-IDF on papers because the vocabulary space is large.
## When TF-IDF is enough vs. when to upgrade
TF-IDF catches multi-word phrases well ("ergodic exploration," "inverse problem," "finite element"). It fails when the query and candidates use **different vocabulary for the same concept** — e.g., "autonomous sensing" in robotics vs. "inverse problem" in applied math, both describing the same thing.
If TF-IDF returns thin results, the next step up is a semantic embedding model such as **SPECTER2** via sentence-transformers. That requires installing `sentence-transformers` (which pulls in torch, roughly a 10-minute download) and loading the model (~1 GB). This is not done by default; add it as an optional step if the baseline fails.
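The upgrade is a small swap: same (path, title, text) inputs, dense embeddings instead of sparse TF-IDF vectors. A minimal sketch, assuming `sentence-transformers` is installed; the model id below is the SPECTER checkpoint published for sentence-transformers and is an assumption — substitute your preferred SPECTER2 checkpoint:

```python
from sentence_transformers import SentenceTransformer, util

def embed_rank(query_text, docs):
    """docs: list of (path, title, first_pages_text) tuples.
    Dense-embedding analogue of the TF-IDF ranking; scores are
    cosine similarities in [-1, 1]."""
    model = SentenceTransformer("sentence-transformers/allenai-specter")  # assumed model id
    texts = [f"{title}. {body}" for _, title, body in docs]
    doc_emb = model.encode(texts, convert_to_tensor=True)
    query_emb = model.encode(query_text, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0].tolist()
    return sorted(zip((p for p, _, _ in docs), scores),
                  key=lambda pair: pair[1], reverse=True)
```

First call downloads the model; later calls should load it from the local cache. Unlike TF-IDF, absolute scores here run much higher, so the thresholds in Step 2 do not transfer.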
## Do-not-touch rules
- Never modify any PDF. This skill is read-only.
- Cache nothing by default — each invocation recomputes. If the tree is very large (>1000 PDFs) caching becomes worthwhile, but that's a future upgrade.