---
name: semantic-codebase-search
description: "Semantic search over ATLAS plugin codebase via ParadeDB BM25. Use when 'find code that handles X', 'search semantic Y', 'where is logic for Z', or before hand-grepping for unknown function names."
mode: [personal, all]
effort: medium
version: 1.0.0
tier: [admin]
---

# semantic-codebase-search — ParadeDB BM25 over ATLAS Codebase

> **Mission**: Replace blind `rg`/`grep` hunts with relevance-ranked semantic
> hits over the entire ATLAS plugin (skills, scripts, hooks, agents, commands).
> Reuses Synapse's existing **ParadeDB pg_search** infrastructure — zero new
> moving parts (no Meilisearch, no Qdrant, no Elasticsearch).
> Pairs with **W4.2 knowledge-graph** (typed structure) as the unstructured
> discovery layer.

## When to Use This Skill

Trigger phrases (auto-routed by `auto-orchestrator`):

- "find code that handles X"
- "search semantic Y" / "semantic search for Z"
- "where is the logic for X"
- "which skill does X" / "which script handles X"
- "show me everything related to X"
- Pre-emptive: BEFORE typing a multi-flag `rg -nP --type sh ...` for an
  unknown identifier — ask semantic-search first.

## When NOT to Use

- Exact-string lookup (file path, known function name) → use `Read` or `rg`.
- Structured questions ("what depends on module X") → use **W4.2 KG** instead.
- Live runtime queries (logs, metrics) → use ATLAS `infra-health` / `ci-health`.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│  ATLAS plugin codebase                                      │
│  skills/  scripts/  hooks/  agents/  commands/  references/ │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼  scripts/semantic-index-builder.py
┌─────────────────────────────────────────────────────────────┐
│  Indexable extraction (per file)                            │
│   • file path (skills/foo/SKILL.md)                         │
│   • frontmatter `name` + `description` + tier               │
│   • Python def/class signatures + docstrings                │
│   • Shell function names + leading comments                 │
│   • Markdown H1/H2 headings                                 │
│   • inline comment lines >= 5 words                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  Output (Phase 1 — JSONL placeholder)                       │
│   ~/.atlas/semantic-index.jsonl                             │
│   {"id":"skills/foo/SKILL.md#desc","path":...,              │
│    "type":"skill","text":"...","tier":"admin"}              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼  Phase 2 — ParadeDB ingestion
┌─────────────────────────────────────────────────────────────┐
│  PostgreSQL (Synapse instance)                              │
│   CREATE TABLE atlas_semantic_index (                       │
│     id text PRIMARY KEY,                                    │
│     path text NOT NULL,                                     │
│     type text NOT NULL,    -- skill|script|hook|agent|...   │
│     tier text,             -- core|dev|admin|null           │
│     text text NOT NULL,    -- BM25-indexed                  │
│     metadata jsonb                                          │
│   );                                                        │
│   CREATE INDEX atlas_sem_bm25                               │
│     ON atlas_semantic_index                                 │
│     USING bm25 (id, text, path, type)                       │
│     WITH (key_field='id');                                  │
└─────────────────────────────────────────────────────────────┘
```
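
The per-file extraction step in the diagram can be sketched roughly as below. This is a minimal illustration, not the actual `scripts/semantic-index-builder.py`: the helper names `extract_frontmatter` and `extract_python_defs` are hypothetical, the `#<name>` id suffix for Python symbols is an assumption (only `#desc` appears in the diagram), and the frontmatter parser handles flat `key: value` lines only.

```python
from __future__ import annotations

import ast
import re
from typing import Optional

# Matches a leading YAML-style frontmatter block delimited by '---' lines.
FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def extract_frontmatter(path: str, text: str) -> Optional[dict]:
    """Build one record from `name` + `description`, or None if absent."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return None
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    if "description" not in fields:
        return None
    return {
        "id": f"{path}#desc",
        "path": path,
        "type": "skill",
        "text": f"{fields.get('name', '')} {fields['description']}".strip(),
        # Frontmatter writes the tier as a one-item list, e.g. [admin].
        "tier": fields.get("tier", "").strip("[]") or None,
    }


def extract_python_defs(path: str, source: str) -> list[dict]:
    """One record per top-level def/class: signature plus docstring."""
    records = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.unparse requires Python 3.9+.
            signature = f"def {node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            signature = f"class {node.name}"
        else:
            continue
        doc = ast.get_docstring(node) or ""
        records.append({
            "id": f"{path}#{node.name}",  # assumed id scheme
            "path": path,
            "type": "script",
            "text": f"{signature} {doc}".strip(),
            "tier": None,
        })
    return records
```

Each returned dict would be written as one line of `~/.atlas/semantic-index.jsonl` via `json.dumps`. A real builder would also need to handle files that fail to parse.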

**Why ParadeDB BM25?**

- Already in the Synapse stack (see `backend/app/services/pg_search_service.py`).
- Runs entirely in PostgreSQL — no external service.
- BM25 ranking typically orders short text descriptions better than plain `tsvector` matching scored with `ts_rank`, which does not apply inverse document frequency.
- Same operator infra/backups as the rest of Synapse (zero ops debt).
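
To make the ranking claim concrete, here is a sketch of the textbook Okapi BM25 formula in plain Python. It is illustrative only: ParadeDB computes scores inside its own search engine, and the tokenizer (whitespace split) and the `k1` / `b` defaults used here are assumptions, not ParadeDB's configuration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents are penalized via `b`.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The two properties that matter for short descriptions are visible in the loop: rare query terms weigh more (IDF), and term frequency saturates instead of growing linearly.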

## Indexed Entities

| Type     | Source                                 | Indexed text                                             |
|----------|----------------------------------------|----------------------------------------------------------|
| `skill`  | `skills/*/SKILL.md` frontmatter + body | `name`, `description`, H1/H2 headings, first 200 chars   |
| `script` | `scripts/**/*.{sh,py}`                 | leading comment block, `def`/`function` lines, docstring |
| `hook`   | `hooks/**/*.{sh,py,json}`              | header comment, hook event/matcher (settings.json)       |
| `agent`  | `agents/*.md`                          | frontmatter `name` + `description` + role section        |
| `cmd`    | `commands/*.md`                        | frontmatter + first-paragraph description                |
| `ref`    | `references/**/*.md`                   | H1 + summary paragraph                                   |

**Stop-list** (excluded):
- `dist/`, `*.tar.gz`, `node_modules/`, `.git/`, generated `_metadata.yaml`,
  `evals/results/`, anything matching `.gitignore`.
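
The type table and stop-list above amount to a path classifier. A minimal sketch follows, assuming repo-relative POSIX paths; the function name `classify` is hypothetical, and `.gitignore` matching is deliberately left out because it needs a real ignore-file parser.

```python
from pathlib import PurePosixPath
from typing import Optional

# Directory names that exclude a file wherever they appear in its path.
STOP_DIRS = {"dist", "node_modules", ".git"}


def classify(path: str) -> Optional[str]:
    """Return the index type for a repo-relative path, or None to skip it."""
    p = PurePosixPath(path)
    parts = p.parts
    if not parts or STOP_DIRS & set(parts):
        return None
    if path.endswith(".tar.gz") or p.name == "_metadata.yaml":
        return None
    if path.startswith("evals/results/"):
        return None

    top = parts[0]
    if top == "skills" and len(parts) == 3 and p.name == "SKILL.md":
        return "skill"  # skills/*/SKILL.md
    if top == "scripts" and p.suffix in {".sh", ".py"}:
        return "script"  # scripts/**/*.{sh,py}
    if top == "hooks" and p.suffix in {".sh", ".py", ".json"}:
        return "hook"  # hooks/**/*.{sh,py,json}
    if top == "agents" and len(parts) == 2 and p.suffix == ".md":
        return "agent"  # agents/*.md
    if top == "commands" and len(parts) == 2 and p.suffix == ".md":
        return "cmd"  # commands/*.md
    if top == "references" and p.suffix == ".md":
        return "ref"  # references/**/*.md
    return None
```

Checking the stop-list before the type rules means an excluded directory always wins, even when the file underneath would otherwise qualify.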

## CLI

### Daily use

```bash
# Top 10 hits (relevance-ranked)
atlas semantic-search "code that handles WBS hierarchy"

# Type filter — only skills
atlas semantic-search --filter type=skill "audit enforcement"

# Tier filter — only admin tier
atlas semantic-search --filter tier=admin "deploy"

# JSON output (for piping to jq / agents)
atlas semantic-search --format json "BM25 search"
```
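
In Phase 1 the CLI is a grep wrapper over the JSONL file, so the commands above can be approximated as follows. This is a sketch under assumptions: `parse_filter` and `search_jsonl` are hypothetical names, and the query is matched as one case-insensitive substring, which may differ from how the shipped CLI splits multi-word queries.

```python
import json


def parse_filter(expr):
    """Split a `--filter key=value` argument into a (key, value) pair."""
    key, sep, value = expr.partition("=")
    if not sep or not key or not value:
        raise ValueError(f"expected key=value, got {expr!r}")
    return key, value


def search_jsonl(lines, query, filters=None, limit=10):
    """Return up to `limit` records whose text contains `query`."""
    needle = query.lower()
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Every filter must match exactly (e.g. type == "skill").
        if any(record.get(k) != v for k, v in (filters or {}).items()):
            continue
        if needle in record.get("text", "").lower():
            hits.append(record)
            if len(hits) == limit:
                break
    return hits
```

Hits come back in file order, not relevance order, which is the main gap Phase 2 is meant to close.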

### Maintenance

```bash
# Rebuild index from scratch
atlas semantic-search --rebuild-index

# Dry-run: emit JSONL without DB upsert (Phase 1 default)
atlas semantic-search --rebuild-index --dry-run

# Index single file (incremental — used by post-commit hook in Phase 3)
atlas semantic-search --index-file skills/foo/SKILL.md

# Stats (entity count per type, last index time)
atlas semantic-search --stats
```
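
The `--index-file` and `--stats` operations reduce to two small transformations over the record list. The sketch below assumes records are dicts shaped like the JSONL lines shown in the Architecture section; `reindex_file` and `index_stats` are hypothetical names, not the CLI's internals.

```python
from collections import Counter


def reindex_file(records, path, fresh):
    """Drop stale records for `path`, then append the newly extracted ones."""
    return [r for r in records if r["path"] != path] + list(fresh)


def index_stats(records):
    """Count indexed entities per type, as `--stats` might report them."""
    return dict(Counter(r["type"] for r in records))
```

Replacing all records for a path at once keeps the index consistent when a file loses a function or heading between commits.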

## Output Format

```
$ atlas semantic-search "WBS hierarchy parsing"

Top 10 hits (BM25):
 #  score  type     path                                          snippet
 1  12.84  skill    skills/wbs-parser/SKILL.md                    "Parse WBS code into (parent, child) tuple..."
 2   9.31  script   scripts/parse_wbs_code.py                     "def parse_wbs_code(code: str) -> tuple..."
 3   8.07  skill    skills/engineering-ops/SKILL.md               "WBS rollup across packages..."
 4   6.55  ref      references/wbs-format.md                      "# WBS Format Specification..."
...
```

## Auto-orchestrator Integration

The existing `auto-orchestrator` skill (skill router brain) calls
`semantic-codebase-search` as a **fallback resolver** when:

1. User query has no exact skill-name match.
2. Top regex match score < 0.6 (ambiguous routing).
3. User uses generic verbs: "find / search / show me / where".

Wiring (in `auto-orchestrator` SKILL.md — not modified by this PR, future W4.4):

```python
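# Pseudocode: regex_match, semantic_search, route_to, and THRESHOLD are
# placeholders supplied by the router, not by this skill.
if regex_match.score < 0.6:
    candidates = semantic_search(user_query, limit=5, filter="type=skill")
    # Guard against an empty result list before reading the top hit.
    if candidates and candidates[0].score > THRESHOLD:
        route_to(candidates[0].path)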
```
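
A self-contained version of the same fallback logic, written so it can be exercised with a stub search function. The `Hit` dataclass, `resolve_route`, and the `SEMANTIC_THRESHOLD` value of 5.0 are assumptions for illustration; the source leaves `THRESHOLD` undefined. Returning the low-confidence regex match when semantic search finds nothing strong is also a design choice made here, not something the router specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Hit:
    path: str
    score: float


REGEX_CONFIDENCE_FLOOR = 0.6  # from the routing rules above
SEMANTIC_THRESHOLD = 5.0      # assumed value; not defined in the source


def resolve_route(
    regex_path: Optional[str],
    regex_score: float,
    query: str,
    semantic_search: Callable[..., Sequence[Hit]],
) -> Optional[str]:
    """Prefer a confident regex match; otherwise try the semantic fallback."""
    if regex_path is not None and regex_score >= REGEX_CONFIDENCE_FLOOR:
        return regex_path
    candidates = semantic_search(query, limit=5, filter="type=skill")
    if candidates and candidates[0].score > SEMANTIC_THRESHOLD:
        return candidates[0].path
    # Nothing convincing from either resolver: keep the weak regex match.
    return regex_path
```

Passing the search function in as an argument keeps the router testable without a live index.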

## Verify

After install + index build:

```bash
# 1. Index builder runs without error
python3 scripts/semantic-index-builder.py --dry-run | head -5

# 2. Check JSONL output is valid
jq -c '.' < ~/.atlas/semantic-index.jsonl | head -3

# 3. Smoke query (Phase 1 = grep over JSONL; Phase 2 = real BM25)
grep -i "WBS" ~/.atlas/semantic-index.jsonl | head -3
```

**Acceptance test** (from mission spec):
> "find code that handles WBS hierarchy" → returns `parse_wbs_code` (or
> equivalent entry point) in top-3 hits.

In Phase 1 (JSONL grep), expect at least one hit referencing the WBS-related
script/skill. In Phase 2 (ParadeDB), expect rank 1-3 with BM25 score > 5.0.
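
For Phase 2, the search could be issued as a parameterized SQL query against the `atlas_semantic_index` table from the Architecture section. The sketch below only builds the statement and its parameters; it assumes the `@@@` match operator and `paradedb.score(<key_field>)` function as described in ParadeDB's documentation, so the exact syntax should be checked against the ParadeDB version Synapse runs. `build_bm25_query` is a hypothetical helper, and the `%s` placeholders follow the psycopg convention.

```python
ALLOWED_FILTERS = {"type", "tier"}  # columns exposed through --filter


def build_bm25_query(query, filters=None, limit=10):
    """Return (sql, params) for a BM25-ranked search; does not execute it."""
    clauses = ["text @@@ %s"]
    params = [query]
    for key, value in (filters or {}).items():
        # Column names cannot be bound as parameters, so whitelist them.
        if key not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter column: {key!r}")
        clauses.append(f"{key} = %s")
        params.append(value)
    sql = (
        "SELECT id, path, type, paradedb.score(id) AS score "
        "FROM atlas_semantic_index "
        f"WHERE {' AND '.join(clauses)} "
        "ORDER BY score DESC LIMIT %s"
    )
    params.append(limit)
    return sql, params
```

The acceptance test then becomes: run the statement for "find code that handles WBS hierarchy" with `limit=3` and check that one returned `path` points at the WBS entry point.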

## Forward-Compat (W4.x roadmap)

| Feature        | Role    | Relationship                                           |
|----------------|---------|--------------------------------------------------------|
| W4.2 KG        | sibling | KG = typed structure (depends-on, calls, owns).        |
| W4.3 (this)    | self    | Unstructured discovery (BM25 on free text).            |
| W4.4 router++  | future  | auto-orchestrator hooks into semantic-search fallback. |
| W4.5 hybrid    | future  | Rerank: KG candidates × BM25 score → final top-5.      |

The two layers are complementary: KG answers "WHAT depends on X", semantic
search answers "WHERE is the code about X". W4.5 will fuse them.

## Phasing

| Phase | Status  | Description                                                    |
|-------|---------|----------------------------------------------------------------|
| P1    | this PR | Index builder emits JSONL placeholder. CLI = grep wrapper.     |
| P2    | future  | ParadeDB schema + ingestion. Real BM25 query.                  |
| P3    | future  | Incremental index (post-commit hook re-indexes touched files). |
| P4    | future  | Hybrid re-rank with W4.2 KG.                                   |

P1 ships **today** (this skill) — unblocks downstream W4.4 router design.

## Known Limitations (P1)

- No real BM25 ranking yet — JSONL grep is substring match only.
- No incremental index — `--rebuild-index` is full re-scan (~1-2s for
  current ATLAS plugin size, acceptable).
- No fuzzy match: typos in the query won't match. Phase 2 is expected to add
  typo tolerance through ParadeDB's edit-distance fuzzy matching.

## Anti-patterns

❌ **Don't use semantic-search for known-path lookups**:
```bash
# Bad: you already know the path
atlas semantic-search "skill atlas-doctor SKILL.md"

# Good: just Read it
cat skills/atlas-doctor/SKILL.md
```

❌ **Don't index runtime data**:
- Logs (`logs/*.log`, `~/.claude/projects/*.jsonl`) — use `atlas-trace` /
  `atlas-analytics` instead.
- Generated artifacts (`dist/`, `_metadata.yaml`) — they leak version noise.

❌ **Don't replace the W4.2 KG**:
- "What calls function X" → KG, not semantic search.
- "Show me dependency graph" → KG, not semantic search.

## References

- Synapse ParadeDB service: `backend/app/services/pg_search_service.py`
- ParadeDB docs: <https://docs.paradedb.com/>
- W4.2 KG plan: pairs with this skill (typed graph layer)
- auto-orchestrator: `skills/atlas-router/SKILL.md` (future hook point)
- Index builder: `scripts/semantic-index-builder.py` (this PR)

---

*Owner*: Seb Gagnon · *Status*: P1 shipped · *Next*: P2 ParadeDB ingestion (separate PR)