pr-polish

Show SKILL.md content (~4.0k tokens)
---
name: pr-polish
description: Alternate /pr-review and /pr-address on a PR until the PR is truly mergeable — no new review findings, zero unresolved inline threads, zero unaddressed top-level reviews or issue comments, all CI checks green, and two consecutive quiet polls after CI settles. Use when the user wants a PR polished to merge-ready without setting a fixed number of rounds.
user-invocable: true
argument-hint: "[PR number or URL] — if omitted, finds PR for current branch."
metadata:
  author: autogpt-team
  version: "1.0.0"
---

# PR Polish

**Goal.** Drive a PR to merge-ready by alternating `/pr-review` and `/pr-address` until **all** of the following hold:

1. The most recent `/pr-review` produces **zero new findings** (no new inline comments, no new top-level reviews with a non-empty body).
2. Every inline review thread reachable via GraphQL reports `isResolved: true`.
3. Every non-bot, non-author top-level review has been acknowledged (replied-to) OR resolved via a thread it spawned.
4. Every non-bot, non-author issue comment has been acknowledged (replied-to).
5. Every CI check is `conclusion: "success"` or `"skipped"` / `"neutral"` — none `"failure"` or still pending.
6. **Two consecutive post-CI polls** (≥60s apart) stay clean — no new threads, no new non-empty reviews, no new issue comments. Bots (coderabbitai, sentry, autogpt-reviewer) frequently post late after CI settles; a single green snapshot is not sufficient.

**Do not stop at a fixed number of rounds.** If round N introduces new comments, round N+1 is required. Cap at `_MAX_ROUNDS = 10` as a safety valve, but expect 2–5 in practice.

## TodoWrite

Before starting, write two todos so the user can see the loop progression:

- `Round {current}: /pr-review + /pr-address on PR #{N}` — current iteration.
- `Final polish polling: 2 consecutive clean polls, CI green, 0 unresolved` — runs after the last non-empty review round.

Update the `current` round counter at the start of each iteration; mark `completed` only when the round's address step finishes (all new threads addressed + resolved).

## Find the PR

```bash
ARG_PR="${ARG:-}"
# Normalize URL → numeric ID if the skill arg is a pull-request URL.
if [[ "$ARG_PR" =~ ^https?://github\.com/[^/]+/[^/]+/pull/([0-9]+) ]]; then
  ARG_PR="${BASH_REMATCH[1]}"
fi
PR="${ARG_PR:-$(gh pr list --head "$(git branch --show-current)" --repo Significant-Gravitas/AutoGPT --json number --jq '.[0].number')}"
if [ -z "$PR" ] || [ "$PR" = "null" ]; then
  echo "No PR found for current branch. Provide a PR number or URL as the skill arg."
  exit 1
fi
echo "Polishing PR #$PR"
```

## The outer loop

```text
round = 0
while round < _MAX_ROUNDS:
    round += 1
    baseline = snapshot_state(PR)   # see "Snapshotting state" below
    invoke_skill("pr-review", PR)   # posts findings as inline comments / top-level review
    findings = diff_state(PR, baseline)
    if findings.total == 0:
        break  # no new findings → go to polish polling
    invoke_skill("pr-address", PR)  # resolves every unresolved thread + CI failure
# Post-loop: polish polling (see below).
polish_polling(PR)
```

### Snapshotting state

Before each `/pr-review`, capture a baseline so the diff after the review reflects **only** what the review just added (not pre-existing threads):

```bash
# Inline threads — total count + latest databaseId per thread
gh api graphql -f query="
{
  repository(owner: \"Significant-Gravitas\", name: \"AutoGPT\") {
    pullRequest(number: ${PR}) {
      reviewThreads(first: 100) {
        totalCount
        nodes {
          id
          isResolved
          comments(last: 1) { nodes { databaseId } }
        }
      }
    }
  }
}" > /tmp/baseline_threads.json

# Top-level reviews — count + latest id per non-empty review
gh api "repos/Significant-Gravitas/AutoGPT/pulls/${PR}/reviews" --paginate \
  --jq '[.[] | select((.body // "") != "") | {id, user: .user.login, state, submitted_at}]' \
  > /tmp/baseline_reviews.json

# Issue comments — count + latest id per non-bot, non-author comment.
# Bots are filtered by User.type == "Bot" (GitHub sets this for app/bot
# accounts like coderabbitai, github-actions, sentry-io). The author is
# filtered by comparing login to the PR author — export it so jq can see it.
AUTHOR=$(gh api "repos/Significant-Gravitas/AutoGPT/pulls/${PR}" --jq '.user.login')
gh api "repos/Significant-Gravitas/AutoGPT/issues/${PR}/comments" --paginate \
  --jq --arg author "$AUTHOR" \
      '[.[] | select(.user.type != "Bot" and .user.login != $author)
            | {id, user: .user.login, created_at}]' \
  > /tmp/baseline_issue_comments.json
```

### Diffing after a review

After `/pr-review` runs, any of these counting as "new findings" means another address round is needed:

- New inline thread `id` not in the baseline.
- An existing thread whose latest comment `databaseId` is higher than the baseline's (new reply on an old thread).
- A new top-level review `id` with a non-empty body.
- A new issue comment `id` from a non-bot, non-author user.

If any of the four buckets is non-empty → not done; invoke `/pr-address` and loop.

## Polish polling

Once `/pr-review` produces zero new findings, do **not** exit yet. Bots (coderabbitai, sentry, autogpt-reviewer) commonly post late reviews after CI settles — 30–90 seconds after the final push. Poll at 60-second intervals:

```text
NON_SUCCESS_TERMINAL = {"failure", "cancelled", "timed_out", "action_required", "startup_failure"}
clean_polls = 0
required_clean = 2
while clean_polls < required_clean:
    # 1. CI gate — any terminal non-success conclusion (not just "failure")
    # must trigger /pr-address. "success", "skipped", "neutral" are clean;
    # anything else (including cancelled, timed_out, action_required) is a
    # blocker that won't self-resolve.
    ci = fetch_check_runs(PR)
    if any ci.conclusion in NON_SUCCESS_TERMINAL:
        invoke_skill("pr-address", PR)  # address failures + any new comments
        baseline = snapshot_state(PR)   # reset — push during address invalidates old baseline
        clean_polls = 0
        continue
    if any ci.conclusion is None (still in_progress):
        sleep 60; continue  # wait without counting this as clean

    # 2. Comment / thread gate
    threads = fetch_unresolved_threads(PR)
    new_issue_comments = diff_against_baseline(issue_comments)
    new_reviews = diff_against_baseline(reviews)
    if threads or new_issue_comments or new_reviews:
        invoke_skill("pr-address", PR)
        baseline = snapshot_state(PR)   # reset — the address loop just dealt with these,
                                        # otherwise they stay "new" relative to the old baseline forever
        clean_polls = 0
        continue

    # 3. Mergeability gate
    mergeable = gh api repos/.../pulls/${PR} --jq '.mergeable'
    if mergeable == false (CONFLICTING):
        resolve_conflicts(PR)  # see pr-address skill
        clean_polls = 0
        continue
    if mergeable is null (UNKNOWN):
        sleep 60; continue

    clean_polls += 1
    sleep 60
```

Only after `clean_polls == 2` do you report `ORCHESTRATOR:DONE`.

### Concrete CI fetch (don't parse `gh pr checks` text columns)

The `fetch_check_runs(PR)` step above must use `--json`, not the default text output. Job names can contain spaces and parentheses (e.g. `test (3.11)`, `Analyze (python)`), so `gh pr checks $PR | awk '{print $2}'` extracts `(3.11)` instead of the status — leading to a clean-poll firing while jobs are still pending.

```bash
# Reliable: use --json so columns are unambiguous.
ci_json=$(gh pr checks $PR --repo Significant-Gravitas/AutoGPT --json name,state,bucket)
pending=$(echo "$ci_json" | jq '[.[] | select(.bucket == "pending")] | length')
failed=$(echo "$ci_json" | jq '[.[] | select(.bucket == "fail" or .bucket == "cancel")] | length')

# Buckets are: pass | fail | pending | cancel | skipping
# (NOTE: gh pr checks does NOT expose `conclusion` as a JSON field —
# only `bucket`. Don't confuse with the GitHub REST API's check_runs
# endpoint, which DOES use conclusion.)
```

Map back to the pseudocode above: `bucket == "pending"` is `ci.conclusion is None (still in_progress)`; `bucket in {"fail", "cancel"}` is `ci.conclusion in NON_SUCCESS_TERMINAL`; `bucket in {"pass", "skipping"}` is clean.

### Why 2 clean polls, not 1

A single green snapshot can be misleading — the final CI check often completes ~30s before a bot posts its delayed review. One quiet cycle does not prove the PR is stable; two consecutive cycles with no new threads, reviews, or issue comments arriving gives high confidence nothing else is incoming.

### Why checking every source each poll

`/pr-address` polling inside a single round already re-checks its own comments, but `/pr-polish` sits a level above and must also catch:

- New top-level reviews (autogpt-reviewer sometimes posts structured feedback only after several CI green cycles).
- Issue comments from human reviewers (not caught by inline thread polling).
- Sentry bug predictions that land on new line numbers post-push.
- Merge conflicts introduced by a race between your push and a merge to `dev`.

## Invocation pattern

Delegate to existing skills with the `Skill` tool; do not re-implement the review or address logic inline. This keeps the polish loop focused on orchestration and lets the child skills evolve independently.

```python
Skill(skill="pr-review",  args=pr_url)
Skill(skill="pr-address", args=pr_url)
```

After each child invocation, re-query GitHub state directly — never trust a summary for the stop condition. The orchestrator's `ORCHESTRATOR:DONE` is verified against actual GraphQL / REST responses per the rules in `pr-address`'s "Verify actual count before outputting ORCHESTRATOR:DONE" section.

### **Auto-continue: do NOT end your response between child skills**

`/pr-polish` is a single orchestration task — one invocation drives the PR all the way to merge-ready. When a child `Skill()` call returns control to you:

- Do NOT summarize and stop.
- Do NOT wait for user confirmation to continue.
- Immediately, in the same response, perform the next loop step: state diff → decide next action → next `Skill()` call or polling sleep.

The child skill returning is a **loop iteration boundary**, not a conversation turn boundary. You are expected to keep going until one of the exit conditions in the opening section is met (2 consecutive clean polls, `_MAX_ROUNDS` hit, or an unrecoverable error).

If the user needs to approve a risky action mid-loop (e.g., a force-push or a destructive git operation), pause there — but not at the routine "round N finished, round N+1 needed" boundary. Those are silent transitions.

### **Run /pr-polish in the foreground — never in a background agent**

Spawning `/pr-polish` inside an `Agent(subagent_type="general-purpose")` background task **does not work**. Background agents don't inherit the parent's slash-command registry, so `Skill(skill="pr-review")` and `Skill(skill="pr-address")` calls aren't available — the agent has to manually replicate the child skills' logic, which is fragile and tends to stall on the first network or rate-limit hiccup. Symptom: the background task reports `stalled: no progress for 600s` mid-review.

Run `/pr-polish` inline in the foreground conversation. If the user asks for "/pr-polish + /pr-test in parallel", split them: foreground `/pr-polish`, and ONLY then can the test step go to a background agent (because `/pr-test` doesn't itself need to invoke skills).

### **You MUST invoke `Skill(pr-review)` every round — even when bot reviews already exist**

A common failure mode: CodeRabbit / autogpt-reviewer / Sentry have already posted findings on the PR, and the orchestrator skips the `Skill(pr-review)` step on the assumption that "review has been done." That's wrong — the outer loop's purpose is to layer **the agent's own review** on top of the bot reviews, catching issues the bots miss (architecture, naming, cross-file invariants, hidden coupling). If the orchestrator only addresses bot findings without ever running its own review, the loop converges to "bot-clean" but not "agent-reviewed-clean," and the user reasonably asks "did /pr-polish even read the diff?"

**Self-check before reporting `ORCHESTRATOR:DONE`:** confirm at least one `Skill(skill="pr-review")` call appears in the current orchestration. If none, the loop is incomplete — go back and run one round.

## GitHub rate limits

This skill issues many GraphQL calls (one review-thread query per outer iteration plus per-poll queries inside polish polling). Expect the GraphQL budget to be tight on large PRs. When `gh api rate_limit --jq .resources.graphql.remaining` drops below ~200, back off:

- Fall back to REST for reads (flat `/pulls/{N}/comments`, `/pulls/{N}/reviews`, `/issues/{N}/comments`) per the `pr-address` skill's GraphQL-fallback section.
- Queue thread resolutions (GraphQL-only) until the budget resets; keep making progress on fixes + REST replies meanwhile.
- `sleep 5` between any batch of ≥20 writes to avoid secondary rate limits.

## Safety valves

- `_MAX_ROUNDS = 10` — if review+address rounds exceed this, stop and escalate to the user with a summary of what's still unresolved. A PR that cannot converge in 10 rounds has systemic issues that need human judgment.
- After each commit, run `poetry run format` / `pnpm format && pnpm lint && pnpm types` per the target codebase's conventions. A failing format check is CI `failure` that will never self-resolve.
- Every `/pr-review` round checks for **duplicate** concerns first (via `pr-review`'s own "Fetch existing review comments" step) so the loop does not re-post the same finding that a prior round already resolved.

## Reporting

When the skill finishes (either via two clean polls or hitting `_MAX_ROUNDS`), produce a compact summary:

```
PR #{N} polish complete ({rounds_completed} rounds):
- {X} inline threads opened and resolved
- {Y} CI failures fixed
- {Z} new commits pushed
Final state: CI green, {total} threads all resolved, mergeable.
```

If exiting via `_MAX_ROUNDS`, flag explicitly:

```
PR #{N} polish stopped at {_MAX_ROUNDS} rounds — NOT merge-ready:
- {N} threads still unresolved: {titles}
- CI status: {summary}
Needs human review.
```

## When to use this skill

Use when the user says any of:
- "polish this PR"
- "keep reviewing and addressing until it's mergeable"
- "loop /pr-review + /pr-address until done"
- "make sure the PR is actually merge-ready"

Do **not** use when:
- User wants just one review pass (→ `/pr-review`).
- User wants to address already-posted comments without further self-review (→ `/pr-address`).
- A fixed round count is explicitly requested (e.g., "do 3 rounds") — honour the count instead of converging.
Get pr-polish.

vz-bench-debug

vz-scrape-runner

Think you can beat it?