---
name: find-similar-papers
description: Find PDFs in a directory tree that are semantically similar to a query paper. Uses TF-IDF cosine similarity over title (weighted 3x) plus first 2 pages of text. Fast baseline — no model download needed, works locally with scikit-learn. Use when the user asks "are there any papers similar to X in my library?" or "what papers in my lit review are closest to this new paper I just got?"
---
# find-similar-papers
Fast TF-IDF-based similarity search across a PDF library.
## Arguments
- `<title>` OR `<path>` OR `<text>` — the query. Choose one:
  - A paper **title** — the helper treats it as a short-but-weighted query
  - A paper **file path** — the helper extracts the title plus the first 2 pages from the PDF
  - Free **text** — arbitrary keywords or a sentence
- `--root <dir>` (optional, default: CWD) — where to search
- `--top <N>` (optional, default: 15) — how many matches to return
- `--threshold <0-1>` (optional, default: 0) — minimum similarity score
- `--exclude-self` (optional) — when the query path lies inside the search root, drop it from the results
## Pipeline
### Step 0 — Determine the query mode
- If user gave a PDF path, use `--query-path`
- If user gave a title string, use `--query-title`
- If user gave keywords / a sentence, use `--query-text`
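The three modes map to three mutually exclusive flags. For illustration (the library root and paper names below are made up):

```bash
HELPER=~/.claude/skills/find-similar-papers/helpers/find_similar.py

# Title query
python3 "$HELPER" --root ~/papers --query-title "Ergodic exploration for active sensing"

# PDF query; the query paper lives in the library, so exclude it from results
python3 "$HELPER" --root ~/papers --query-path ~/papers/inbox/smith2024.pdf --exclude-self

# Free-text query
python3 "$HELPER" --root ~/papers --query-text "koopman operator control"
```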
### Step 1 — Run the helper
```bash
python3 ~/.claude/skills/find-similar-papers/helpers/find_similar.py \
  --root "<dir>" \
  --query-path "<pdf>" \
  [--top 15] [--threshold 0.05] [--exclude-self]
```
The helper:
1. Walks the tree collecting all `*.pdf` files.
2. Extracts title (metadata, weighted 3x) + first 2 pages of text for each.
3. Vectorizes with TF-IDF (unigrams + bigrams, English stopwords, min_df=2, max_df=0.7).
4. Computes cosine similarity between query vector and every candidate.
5. Prints top-N matches with titles, scores, and paths.
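Steps 3–4 reduce to a few lines of scikit-learn. A simplified sketch of the ranking core (function and document names are illustrative, not the actual helper's; the 3x title weighting is approximated by repeating the title):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(query_text, docs):
    """docs: list of (path, title, first_pages_text) tuples.
    Returns (path, score) pairs sorted by descending similarity."""
    # Repeat the title 3 times to approximate the 3x title weight
    corpus = [" ".join([title] * 3) + " " + body for _, title, body in docs]
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams + bigrams
        stop_words="english",
        min_df=2,              # term must appear in at least 2 docs
        max_df=0.7,            # drop near-ubiquitous terms
    )
    matrix = vectorizer.fit_transform(corpus)
    query_vec = vectorizer.transform([query_text])
    scores = cosine_similarity(query_vec, matrix).ravel()
    return sorted(zip((p for p, _, _ in docs), scores),
                  key=lambda pair: pair[1], reverse=True)
```

Note that `min_df=2` means a library of one or two PDFs can yield an empty vocabulary; the real corpus is assumed to be larger.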
### Step 2 — Report
For each match, show similarity score, title, and path. Interpret scores:
- **≥ 0.25** — strong topical match
- **0.15–0.25** — clearly adjacent work
- **0.05–0.15** — loose thematic overlap
- **< 0.05** — probably unrelated, ignore
Scores rarely exceed 0.4 with TF-IDF on papers because the vocabulary space is large.
## When TF-IDF is enough vs. when to upgrade
TF-IDF catches multi-word phrases well ("ergodic exploration," "inverse problem," "finite element"). It fails when the query and candidates use **different vocabulary for the same concept** — e.g., "autonomous sensing" in robotics vs. "inverse problem" in applied math, both describing the same thing.
If TF-IDF returns thin results, the next step up is a semantic embedding model such as **SPECTER2** via sentence-transformers. That requires installing `sentence-transformers` (which pulls in torch, roughly a 10-minute download) and loading the model (~1 GB). This is not done by default; add it as an optional step if the baseline fails.
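The upgrade is a small swap: same (path, title, text) inputs, dense embeddings instead of sparse TF-IDF vectors. A minimal sketch, assuming `sentence-transformers` is installed; the model id below is the SPECTER checkpoint published for sentence-transformers and is an assumption — substitute your preferred SPECTER2 checkpoint:

```python
from sentence_transformers import SentenceTransformer, util

def embed_rank(query_text, docs):
    """docs: list of (path, title, first_pages_text) tuples.
    Dense-embedding analogue of the TF-IDF ranking; scores are
    cosine similarities in [-1, 1]."""
    model = SentenceTransformer("sentence-transformers/allenai-specter")  # assumed model id
    texts = [f"{title}. {body}" for _, title, body in docs]
    doc_emb = model.encode(texts, convert_to_tensor=True)
    query_emb = model.encode(query_text, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0].tolist()
    return sorted(zip((p for p, _, _ in docs), scores),
                  key=lambda pair: pair[1], reverse=True)
```

First call downloads the model; later calls should load it from the local cache. Unlike TF-IDF, absolute scores here run much higher, so the thresholds in Step 2 do not transfer.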
## Do-not-touch rules
- Never modify any PDF. This skill is read-only.
- Cache nothing by default — each invocation recomputes. If the tree is very large (>1000 PDFs) caching becomes worthwhile, but that's a future upgrade.