
dedupe-rank

Use when a broad paper candidate pool needs deterministic deduplication and a stable core set. **Trigger**: dedupe, rank, core set, 去重, 排序, 精选论文, 核心集合. **Use when**: after retrieval, the broad-coverage candidate set needs to be narrowed to a manageable core set (for taxonomy/outline/mapping). **Skip if**: a stable `papers/core_set.csv` has already been curated by hand (no need to churn it again). **Network**: none. **Guardrail**: favor deterministic behavior; output should be reproducible (stable paper_id, normalized fields).




§ 02 — Install

Get dedupe-rank.

Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.

One-line install · Claude Code

$ npx versuz@latest install willoscar-research-units-pipeline-skills-codex-skills-dedupe-rank

Or clone the repo

$ git clone https://github.com/WILLOSCAR/research-units-pipeline-skills.git

Or copy the SKILL.md manually

cp research-units-pipeline-skills/SKILL.MD ~/.claude/skills/willoscar-research-units-pipeline-skills-codex-skills-dedupe-rank/SKILL.md


---
name: dedupe-rank
description: |
  Use when a broad paper candidate pool needs deterministic deduplication and a stable core set.
  **Trigger**: dedupe, rank, core set, 去重, 排序, 精选论文, 核心集合.
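  **Use when**: after retrieval, the broad-coverage candidate set needs to be narrowed to a manageable core set (for taxonomy/outline/mapping).
  **Skip if**: a stable `papers/core_set.csv` has already been curated by hand (no need to churn it again).
  **Network**: none.
  **Guardrail**: favor deterministic behavior; output should be reproducible (stable paper_id, normalized fields).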
---

# Dedupe + Rank

Turns a raw candidate pool into a deduped pool and a stable core set.

## Input

- `papers/papers_raw.jsonl`

## Outputs

- `papers/papers_dedup.jsonl`
- `papers/core_set.csv`

## Script boundary

`scripts/run.py` should own only:
- title/year deduplication
- deterministic ranking
- stable `paper_id` generation

Use shared domain packs or pipeline contract metadata for topic-specific or product-specific behavior.
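
The three responsibilities listed above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `scripts/run.py`: the function names, the `title` and `year` record fields, the `P` + SHA-1 prefix ID format, and the newest-first ranking rule are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def year_of(rec: dict) -> int:
    """Return the record's year as an int, or 0 if missing or malformed."""
    try:
        return int(rec.get("year") or 0)
    except (TypeError, ValueError):
        return 0


def paper_id(title: str, year) -> str:
    """Derive the ID from content only, so reruns give the same value.

    Hypothetical format: 'P' plus the first 10 hex chars of a SHA-1 digest.
    """
    key = f"{normalize_title(title)}|{year or ''}"
    return "P" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:10]


def dedupe(records):
    """Keep the first record seen for each (normalized title, year) pair."""
    seen = {}
    for rec in records:
        pid = paper_id(rec.get("title", ""), rec.get("year"))
        if pid not in seen:
            seen[pid] = {**rec, "paper_id": pid}
    return list(seen.values())


def rank(records):
    """Sort newest first; ties break on paper_id so the order is total."""
    return sorted(records, key=lambda r: (-year_of(r), r["paper_id"]))
```

Because the sort key ends in `paper_id`, two records can never compare equal, so the ranked order does not depend on the order in which records were read.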

## Contract-driven behavior

The script should prefer pipeline contract metadata over profile-name branching.

Current important field:
- `quality_contract.candidate_pool_policy.keep_full_deduped_pool`

If this field is `true`, the script writes the full deduped pool to `papers/core_set.csv` unless the user explicitly overrides the core size.
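
One way this rule could be expressed is shown below. It is a sketch under assumptions: the contract is taken to be an already-parsed dict, and `resolve_core_size`, `default_size`, and `override` are hypothetical names that do not come from the repository.

```python
def resolve_core_size(contract: dict, pool_size: int,
                      default_size: int = 50, override=None) -> int:
    """Decide how many deduped papers go into the core set.

    An explicit user override always wins; otherwise the contract flag
    decides between the full pool and a (hypothetical) default size.
    """
    if override is not None:
        return max(0, min(int(override), pool_size))

    # Tolerate missing or null sections in the contract metadata.
    quality = contract.get("quality_contract") or {}
    policy = quality.get("candidate_pool_policy") or {}

    if policy.get("keep_full_deduped_pool") is True:
        return pool_size
    return min(default_size, pool_size)
```

Reading the flag from contract metadata rather than branching on a profile name means a new pipeline profile only has to set the field, with no change to the script.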

## Acceptance

- deduped JSONL exists
- core-set CSV exists
- reruns are stable for the same inputs
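
These criteria could be checked mechanically by running the step twice and comparing output digests. The sketch below assumes a hypothetical `run` callable that executes the script inside `workspace`; only the two output paths come from this document.

```python
import hashlib
from pathlib import Path


def file_digest(path) -> str:
    """SHA-256 of the file's bytes; any change in content changes the digest."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_acceptance(workspace, run) -> bool:
    """Return True if both outputs exist and are byte-identical across two runs.

    `run` is a hypothetical zero-argument callable that executes the
    dedupe-rank step against the same inputs each time.
    """
    papers = Path(workspace) / "papers"
    outputs = [papers / "papers_dedup.jsonl", papers / "core_set.csv"]

    run()
    if not all(p.is_file() for p in outputs):
        return False
    first = [file_digest(p) for p in outputs]

    run()
    second = [file_digest(p) for p in outputs]
    return first == second
```

Byte-level comparison is deliberately strict: it also catches unstable row order or timestamps written into the files, which a record-count check would miss.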

## Non-goals

- retrieval
- screening
- manual topic authoring inside the script