---
name: semantic-codebase-search
description: "Semantic search over ATLAS plugin codebase via ParadeDB BM25. Use when 'find code that handles X', 'search semantic Y', 'where is logic for Z', or before hand-grepping for unknown function names."
mode: [personal, all]
effort: medium
version: 1.0.0
tier: [admin]
---
# semantic-codebase-search — ParadeDB BM25 over ATLAS Codebase
> **Mission**: Replace blind `rg`/`grep` hunts with relevance-ranked semantic
> hits over the entire ATLAS plugin (skills, scripts, hooks, agents, commands).
> Reuses Synapse's existing **ParadeDB pg_search** infrastructure — zero new
> moving parts (no Meilisearch, no Qdrant, no Elasticsearch).
> Pairs with **W4.2 knowledge-graph** (typed structure) as the unstructured
> discovery layer.
## When to Use This Skill
Trigger phrases (auto-routed by `auto-orchestrator`):
- "find code that handles X"
- "search semantic Y" / "semantic search for Z"
- "where is the logic for X"
- "which skill does X" / "which script handles X"
- "show me everything related to X"
- Pre-emptive: BEFORE typing a multi-flag `rg -nP --type sh ...` for an
  unknown identifier, ask semantic-search first.
## When NOT to Use
- Exact-string lookup (file path, known function name) → use `Read` or `rg`.
- Structured questions ("what depends on module X") → use **W4.2 KG** instead.
- Live runtime queries (logs, metrics) → use ATLAS `infra-health` / `ci-health`.
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│  ATLAS plugin codebase                                      │
│  skills/ scripts/ hooks/ agents/ commands/ references/      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼ scripts/semantic-index-builder.py
┌─────────────────────────────────────────────────────────────┐
│  Indexable extraction (per file)                            │
│  • file path (skills/foo/SKILL.md)                          │
│  • frontmatter `name` + `description` + tier                │
│  • Python def/class signatures + docstrings                 │
│  • Shell function names + leading comments                  │
│  • Markdown H1/H2 headings                                  │
│  • inline comment lines >= 5 words                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  Output (Phase 1: JSONL placeholder)                        │
│  ~/.atlas/semantic-index.jsonl                              │
│  {"id":"skills/foo/SKILL.md#desc","path":...,               │
│   "type":"skill","text":"...","tier":"admin"}               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼ Phase 2: ParadeDB ingestion
┌─────────────────────────────────────────────────────────────┐
│  PostgreSQL (Synapse instance)                              │
│  CREATE TABLE atlas_semantic_index (                        │
│    id       text PRIMARY KEY,                               │
│    path     text NOT NULL,                                  │
│    type     text NOT NULL,  -- skill|script|hook|agent|...  │
│    tier     text,           -- core|dev|admin|null          │
│    text     text NOT NULL,  -- BM25-indexed                 │
│    metadata jsonb                                           │
│  );                                                         │
│  CREATE INDEX atlas_sem_bm25                                │
│    ON atlas_semantic_index                                  │
│    USING bm25 (id, text, path, type)                        │
│    WITH (key_field='id');                                   │
└─────────────────────────────────────────────────────────────┘
```
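The "Python def/class signatures + docstrings" extraction step can be sketched with the standard `ast` module. This is a minimal sketch, not the actual `scripts/semantic-index-builder.py`: the function name `extract_python_records`, the `path#name` id scheme, and the fixed `"script"` type are assumptions modeled on the JSONL example in the diagram.

```python
import ast
from typing import Optional


def extract_python_records(path: str, source: str,
                           tier: Optional[str] = None) -> list:
    """Return one index record per def/class found anywhere in `source`."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "def"
            doc = ast.get_docstring(node) or ""
            records.append({
                "id": f"{path}#{node.name}",  # hypothetical id scheme
                "path": path,
                "type": "script",             # assumes the file lives under scripts/
                "text": f"{kind} {node.name} {doc}".strip(),
                "tier": tier,
            })
    return records
```

Serializing each returned dict with `json.dumps` and writing one per line would yield the JSONL layout shown in the diagram.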
**Why ParadeDB BM25?**
- Already in the Synapse stack (see `backend/app/services/pg_search_service.py`).
- Runs entirely in PostgreSQL — no external service.
- BM25 ranking generally outperforms plain `tsvector` full-text ranking (`ts_rank`) for short text descriptions.
- Same operational infrastructure and backups as the rest of Synapse (zero ops debt).
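For readers unfamiliar with BM25, the ranking idea can be illustrated in pure Python. This is a textbook Okapi BM25 sketch using the commonly cited defaults `k1=1.2` and `b=0.75`. It is not ParadeDB's implementation, which tokenizes and scores inside PostgreSQL, and the whitespace tokenizer here is a simplifying assumption.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list, k1: float = 1.2, b: float = 0.75) -> list:
    """Score each document string against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rare query terms contribute more (higher IDF), and a term hit in a short description counts for more than the same hit in a long one, which is why BM25 suits short `description` fields.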
## Indexed Entities
| Type | Source | Indexed text |
|----------|----------------------------------------|----------------------------------------------------------|
| `skill` | `skills/*/SKILL.md` frontmatter + body | `name`, `description`, H1/H2 headings, first 200 chars |
| `script` | `scripts/**/*.{sh,py}` | leading comment block, `def`/`function` lines, docstring |
| `hook` | `hooks/**/*.{sh,py,json}` | header comment, hook event/matcher (settings.json) |
| `agent` | `agents/*.md` | frontmatter `name` + `description` + role section |
| `cmd` | `commands/*.md` | frontmatter + first-paragraph description |
| `ref` | `references/**/*.md` | H1 + summary paragraph |
**Stop-list** (excluded):
- `dist/`, `*.tar.gz`, `node_modules/`, `.git/`, generated `_metadata.yaml`,
`evals/results/`, anything matching `.gitignore`.
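The entity table and stop-list above could translate into a small path classifier along these lines. This is a sketch under assumptions: `classify_path` is a hypothetical name, `.gitignore` matching is omitted, and directory depth is not enforced as strictly as the globs in the table (for example, `agents/*.md` is treated as any `.md` under `agents/`).

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import Optional

# Directory names and filename patterns taken from the stop-list
EXCLUDED_DIRS = {"dist", "node_modules", ".git"}
EXCLUDED_NAMES = ("*.tar.gz", "_metadata.yaml")

# Top-level directory -> (entity type, allowed suffixes)
TYPE_RULES = {
    "skills": ("skill", {".md"}),
    "scripts": ("script", {".sh", ".py"}),
    "hooks": ("hook", {".sh", ".py", ".json"}),
    "agents": ("agent", {".md"}),
    "commands": ("cmd", {".md"}),
    "references": ("ref", {".md"}),
}


def classify_path(rel_path: str) -> Optional[str]:
    """Return the entity type for a repo-relative path, or None if not indexed."""
    path = PurePosixPath(rel_path)
    if EXCLUDED_DIRS.intersection(path.parts):
        return None
    if "/evals/results/" in f"/{rel_path}":
        return None
    if any(fnmatch(path.name, pattern) for pattern in EXCLUDED_NAMES):
        return None
    rule = TYPE_RULES.get(path.parts[0]) if path.parts else None
    if rule is None:
        return None
    entity_type, suffixes = rule
    if entity_type == "skill" and path.name != "SKILL.md":
        return None  # only SKILL.md files are indexed as skills
    return entity_type if path.suffix in suffixes else None
```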
## CLI
### Daily use
```bash
# Top 10 hits (relevance-ranked)
atlas semantic-search "code that handles WBS hierarchy"
# Type filter — only skills
atlas semantic-search --filter type=skill "audit enforcement"
# Tier filter — only admin tier
atlas semantic-search --filter tier=admin "deploy"
# JSON output (for piping to jq / agents)
atlas semantic-search --format json "BM25 search"
```
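Since the Phase 1 CLI is described as a grep wrapper over the JSONL file, the query and `--filter key=value` flags above could be applied roughly as follows. The names `parse_filter` and `search_records` are hypothetical, and matching each query word as a substring and ordering hits by the number of matched words are assumptions, not documented behavior.

```python
import json
from typing import Iterable


def parse_filter(expr: str) -> tuple:
    """Split a `key=value` filter expression such as `type=skill`."""
    key, sep, value = expr.partition("=")
    if not sep or not key or not value:
        raise ValueError(f"expected key=value, got {expr!r}")
    return key, value


def search_records(lines: Iterable[str], query: str,
                   filters: Iterable[str] = (), limit: int = 10) -> list:
    """Case-insensitive substring search over JSONL lines."""
    wanted = [parse_filter(f) for f in filters]
    words = query.lower().split()
    hits = []
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if any(rec.get(key) != value for key, value in wanted):
            continue
        text = rec.get("text", "").lower()
        matched = sum(1 for word in words if word in text)
        if matched:
            hits.append((matched, rec))
    # Stable sort: more matched words first, file order breaks ties
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in hits[:limit]]
```

Passing an open file handle for `~/.atlas/semantic-index.jsonl` as `lines` would search the Phase 1 index directly.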
### Maintenance
```bash
# Rebuild index from scratch
atlas semantic-search --rebuild-index
# Dry-run: emit JSONL without DB upsert (Phase 1 default)
atlas semantic-search --rebuild-index --dry-run
# Index single file (incremental — used by post-commit hook in Phase 3)
atlas semantic-search --index-file skills/foo/SKILL.md
# Stats (entity count per type, last index time)
atlas semantic-search --stats
```
## Output Format
```
$ atlas semantic-search "WBS hierarchy parsing"
Top 10 hits (BM25):
 #  score  type    path                             snippet
 1  12.84  skill   skills/wbs-parser/SKILL.md       "Parse WBS code into (parent, child) tuple..."
 2   9.31  script  scripts/parse_wbs_code.py        "def parse_wbs_code(code: str) -> tuple..."
 3   8.07  skill   skills/engineering-ops/SKILL.md  "WBS rollup across packages..."
 4   6.55  ref     references/wbs-format.md         "# WBS Format Specification..."
...
```
## Auto-orchestrator Integration
The existing `auto-orchestrator` skill (skill router brain) calls
`semantic-codebase-search` as a **fallback resolver** when:
1. User query has no exact skill-name match.
2. Top regex match score < 0.6 (ambiguous routing).
3. User uses generic verbs: "find / search / show me / where".
Wiring (in `auto-orchestrator` SKILL.md — not modified by this PR, future W4.4):
```python
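# Pseudocode: regex_match, semantic_search, route_to and THRESHOLD are
# placeholders for router internals, not existing APIs.
if regex_match.score < 0.6:
    candidates = semantic_search(user_query, limit=5, filter="type=skill")
    # Guard against an empty result list before reading the top hit
    if candidates and candidates[0].score > THRESHOLD:
        route_to(candidates[0].path)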
```
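A runnable variant of the fallback wiring, with the search backend injected so it can be stubbed in tests, might look like this. Everything here is illustrative: `Hit`, `resolve_route`, and the `THRESHOLD` value of 5.0 are assumptions, since the source does not define the threshold or the router's internal types.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Hit:
    path: str
    score: float


REGEX_CUTOFF = 0.6  # from the fallback conditions listed above
THRESHOLD = 5.0     # placeholder value; not defined by the source


def resolve_route(regex_score: float, regex_path: Optional[str], query: str,
                  semantic_search: Callable[..., Sequence[Hit]]) -> Optional[str]:
    """Return the skill path to route to, or None if nothing is confident."""
    if regex_path is not None and regex_score >= REGEX_CUTOFF:
        return regex_path  # regex routing is confident enough
    candidates = semantic_search(query, limit=5, filter="type=skill")
    if candidates and candidates[0].score > THRESHOLD:
        return candidates[0].path
    return None
```

Injecting `semantic_search` as a callable keeps the router independent of whether the backend is the Phase 1 JSONL grep or the Phase 2 ParadeDB query.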
## Verify
After install + index build:
```bash
# 1. Index builder runs without error
python3 scripts/semantic-index-builder.py --dry-run | head -5
# 2. Check JSONL output is valid
jq -c '.' < ~/.atlas/semantic-index.jsonl | head -3
# 3. Smoke query (Phase 1 = grep over JSONL; Phase 2 = real BM25)
grep -i "WBS" ~/.atlas/semantic-index.jsonl | head -3
```
**Acceptance test** (from mission spec):
> "find code that handles WBS hierarchy" → returns `parse_wbs_code` (or
> equivalent entry point) in top-3 hits.
In Phase 1 (JSONL grep), expect at least one hit referencing the WBS-related
script/skill. In Phase 2 (ParadeDB), expect rank 1-3 with BM25 score > 5.0.
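The acceptance criterion can also be expressed as an automated check. This is a sketch assuming hits arrive as dicts with `path` and `text` keys, as in the JSONL record shape; `passes_acceptance` is a hypothetical helper, not part of the shipped CLI.

```python
def passes_acceptance(hits: list, needle: str = "parse_wbs_code", top_n: int = 3) -> bool:
    """True if `needle` appears in the path or text of any of the first top_n hits."""
    return any(
        needle in hit.get("path", "") or needle in hit.get("text", "")
        for hit in hits[:top_n]
    )
```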
## Forward-Compat (W4.x roadmap)
| Feature        | Relation | Role                                                |
|----------------|-------|--------------------------------------------------------|
| W4.2 KG | sibling | KG = typed structure (depends-on, calls, owns). |
| W4.3 (this) | self | unstructured discovery (BM25 on free-text). |
| W4.4 router++ | future | auto-orchestrator hooks into semantic-search fallback. |
| W4.5 hybrid | future | rerank: KG candidates × BM25 score → final top-5. |
The two layers are complementary: KG answers "WHAT depends on X", semantic
search answers "WHERE is the code about X". W4.5 will fuse them.
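One possible reading of the W4.5 row ("KG candidates × BM25 score") is a simple product fusion over the paths both layers return. W4.5 is not designed yet, so this is purely illustrative: `hybrid_rerank`, the per-path KG weights, and the choice to drop paths the KG did not propose are all assumptions.

```python
def hybrid_rerank(kg_weights: dict, bm25_scores: dict, top_k: int = 5) -> list:
    """Fuse per-path KG weights and BM25 scores by multiplication."""
    fused = {
        path: kg_weights[path] * score
        for path, score in bm25_scores.items()
        if path in kg_weights  # keep only paths the KG also proposed
    }
    # Highest fused score first, truncated to the final top_k
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]
```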
## Phasing
| Phase | Status | Description |
|-------|--------|------------------------------------------------------------|
| P1 | this PR| Index builder emits JSONL placeholder. CLI = grep wrapper. |
| P2 | future | ParadeDB schema + ingestion. Real BM25 query. |
| P3 | future | Incremental index (post-commit hook re-indexes touched). |
| P4 | future | Hybrid re-rank with W4.2 KG. |
P1 ships **today** (this skill) — unblocks downstream W4.4 router design.
## Known Limitations (P1)
- No real BM25 ranking yet — JSONL grep is substring match only.
- No incremental index — `--rebuild-index` is full re-scan (~1-2s for
current ATLAS plugin size, acceptable).
- No fuzzy match: typos in the query won't match. Phase 2 with ParadeDB will
  add the `fuzzy_distance` operator.
## Anti-patterns
❌ **Don't use semantic-search for known-path lookups**:
```
# Bad: you already know the path
atlas semantic-search "skill atlas-doctor SKILL.md"
# Good: just Read it
cat skills/atlas-doctor/SKILL.md
```
❌ **Don't index runtime data**:
- Logs (`logs/*.log`, `~/.claude/projects/*.jsonl`) — use `atlas-trace` /
`atlas-analytics` instead.
- Generated artifacts (`dist/`, `_metadata.yaml`) — they leak version noise.
❌ **Don't replace the W4.2 KG**:
- "What calls function X" → KG, not semantic search.
- "Show me dependency graph" → KG, not semantic search.
## References
- Synapse ParadeDB service: `backend/app/services/pg_search_service.py`
- ParadeDB docs: <https://docs.paradedb.com/>
- W4.2 KG plan: pairs with this skill (typed graph layer)
- auto-orchestrator: `skills/atlas-router/SKILL.md` (future hook point)
- Index builder: `scripts/semantic-index-builder.py` (this PR)
---
*Owner*: Seb Gagnon · *Status*: P1 shipped · *Next*: P2 ParadeDB ingestion (separate PR)