arxiv-search

Show SKILL.md content (~10.6k tokens)
---
name: arxiv-search
description: "arXiv paper search skill - search academic papers by keywords, authors, categories. Supports time filtering, category filtering, and paper detail retrieval. Activation: arxiv search, paper search, 论文搜索, search papers, arxiv 论文."
---

# arXiv Search Skill

Academic paper search skill using arXiv API. Search papers by keywords, authors, categories with time filtering and detail retrieval.

## Features

- **Search Capabilities**
  - Keyword search (title, abstract, all fields)
  - Author search
  - Title-specific search
  - Category-based filtering

- **Filtering Options**
  - Time range (last day/week/month/year)
  - Subject categories (cs.AI, cs.CL, cs.LG, etc.)
  - Result count limit
  - Sort by relevance or date

- **Paper Information**
  - Title, authors, abstract
  - arXiv ID and version
  - PDF download link
  - Publication date
  - Primary category

## Activation Keywords

- arxiv search
- arxiv 搜索
- paper search
- 论文搜索
- search papers
- arxiv 论文
- 学术论文
- 搜论文

## Recommended Model

- **sonnet4.5** (Balanced for search and analysis)
- **opus4.5** (For complex research tasks)

## Tools Used

- **exec**: Run arxiv API queries via curl/httpx
- **read**: Load cached results, read paper PDFs
- **write**: Save search results, create paper summaries

## Usage Examples

### Basic Search

```
搜索 arxiv: "large language model"
```

### Author Search

```
搜索作者 "Yann LeCun" 的论文
```

### Category Search

```
搜索 cs.AI 类别最新论文
```

### Time-filtered Search

```
搜索最近一周的 "vision transformer" 论文
```

## ⚠️ Critical Reliability Notes (Updated 2026-05)

The arXiv API and search page have degraded significantly. Follow this reliability hierarchy:

### Method Priority (Most → Least Reliable)

1. **Browser category pages** (`https://arxiv.org/list/{category}/recent`) — MOST RELIABLE
   - Use `browser_navigate` + `browser_snapshot(full=True)` to parse structured text
   - Categories: `q-bio.NC` (neurons/cognition), `cs.NE` (neural/evolutionary computing), `cs.LG` (ML)
   - Pre-structured, chronological listings — no rate limiting

2. **`web_search` for discovery** — GOOD for initial paper discovery
   - Broad results from Google/arXiv indexing
   - Use as first-pass filter before deeper extraction

3. **Terminal + curl with HTTPS** — PARTIAL RELIABILITY
   - `curl "https://export.arxiv.org/api/query?..."` — use HTTPS only (HTTP triggers security scan)
   - May return 0-byte empty responses in some sandboxes (no error, just empty)
   - Add `time.sleep(3)` between requests, use 5s User-Agent delays

4. **arXiv search page** (`/search/?query=...`) — UNRELIABLE
   - Returns 400 Bad Request for URL-encoded queries
   - Browser form submission works but is fragile

5. **`httpx` / `urllib` in execute_code sandbox** — AVOID
   - Returns 0-byte empty responses (no error, just empty)
   - API returns HTTP 429 on most queries

6. **`web_extract` for arXiv URLs** — BROKEN
   - Returns "Blocked: URL targets a private or internal network" for arXiv URLs
   - Returns empty content for `/abs/` URLs

### Browser Category Page Recipe (Recommended)

```python
from hermes_tools import browser_navigate, browser_snapshot

# Navigate to category recent submissions
browser_navigate(url="https://arxiv.org/list/q-bio.NC/recent")

# Get structured paper list from full snapshot
snapshot = browser_snapshot(full=True)
# Contains all paper titles, IDs, authors, and subjects in structured text
# Parse the DescriptionList format: term/definition pairs
```

### Important Notes
- Individual `/abs/` pages may return "Article not found" for very recent papers (same-day submissions not yet indexed)
- Weekend caveat: arXiv doesn't publish on weekends; last working day's papers appear Monday
- Always use `https://` not `http://` for arXiv API (HTTP triggers interactive security approval)

## ⚠️ API Degradation Warning (2026-05)

The arXiv API (`export.arxiv.org/api/query`) now returns **HTTP 429 "Rate exceeded" on virtually ALL queries**, even with generous delays and User-Agent headers. The `httpx` and `urllib` approaches in this skill's code examples are **no longer reliable**.

**Recommended approach:** Use the `arxiv-to-skill-research-workflow` skill which covers browser-based extraction (`browser_navigate` + `browser_snapshot` + `browser_console` JS) as the primary method, with API as last-resort fallback.

**If you must use the API:**
- Always use **HTTPS** (`https://export.arxiv.org/api/query`) — plain HTTP triggers security guardrail approval prompts
- Use `execute_code` with `httpx` + proxy (`http://127.0.0.1:7890`) — curl pipes to python are blocked
- Implement exponential backoff: 15s → 30s → 60s between retries
- CS/ML queries are more heavily rate-limited than quant-ph
- Expect ~30-50% query failure rate even with retries
- `web_search` with `site:arxiv.org <topic> 2026` is a viable fallback for discovery

## ⚠️ Critical Pitfalls (Updated 2026-05-06)

- **Aggressive rate limiting**: arXiv returns "Rate exceeded." on most requests even with 3-4s delays. Use **10s minimum** between requests. `sleep 4` is NOT enough.
- **API URL**: Always use `https://export.arxiv.org/api/query` (not `http://`). httpx with `http://` fails.
- **web_extract blocks arxiv.org**: The `web_extract` tool blocks arxiv URLs as "private/internal network." Use `web_search` for discovery, then `curl` for metadata.
- **Pipe-to-interpreter blocked**: Security guardrail blocks `curl ... | python3`. Always save curl output to file first, then run python on the file.
- **httpx timeouts**: Direct httpx calls to arXiv frequently timeout (300s+). Prefer `web_search` for discovery + `curl` for individual paper fetches.

### Recommended Workflow (When API Is Unreliable)

```
1. web_search → find paper IDs and titles (no rate limits)
2. curl -s -o /tmp/arxiv.xml "https://export.arxiv.org/api/query?id_list=ID1,ID2,ID3"
   → fetch metadata for multiple papers in ONE request
3. python3 parse.py /tmp/arxiv.xml → extract details from saved file
4. sleep 10 between curl requests if fetching individually
```

## API Details

### arXiv API Endpoint

```
https://export.arxiv.org/api/query
```

**IMPORTANT**: arXiv API now requires HTTPS. Using `http://` returns empty responses (HTTP 301 redirect with no body in many HTTP clients). Always use `https://`.

**IMPORTANT**: Always use `https://`, not `http://`. Plain HTTP URLs trigger security scanner blocks in Hermes Agent environments.

**Note**: Always use `https://`. The `http://` endpoint returns a 301 redirect which some HTTP clients won't follow automatically.

**IMPORTANT**: Use `https://` — the `http://` endpoint returns a 301 redirect. Always set `follow_redirects=True` in httpx.

**IMPORTANT:** Always use `https://` — the `http://` endpoint returns a 301 redirect which causes httpx timeouts in sandboxed environments. Also always set a `User-Agent` header; arxiv aggressively rate-limits requests without one.

**Use HTTPS**, not HTTP. Plain HTTP (`http://export.arxiv.org/...`) triggers security scan blocks on many agent environments.

**Note**: Always use `https://` — the `http://` endpoint returns 301 redirects which some HTTP clients don't follow automatically.

> **Important**: Always use `https://` — the `http://` endpoint returns 301 redirects which break httpx without `follow_redirects=True`.

⚠️ Always use HTTPS. HTTP will fail with XML parse errors.

### Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `search_query` | Search query | `ti:machine learning` |
| `start` | Start index | 0 |
| `max_results` | Max results | 10 |
| `sortBy` | Sort method | `relevance`, `submittedDate` |
| `sortOrder` | Sort order | `ascending`, `descending` |

### Query Prefixes

| Prefix | Field |
|--------|-------|
| `ti:` | Title |
| `au:` | Author |
| `ab:` | Abstract |
| `cat:` | Category |
| `all:` | All fields |

### Common Categories

| Category | Description |
|----------|-------------|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language |
| cs.LG | Machine Learning |
| cs.CV | Computer Vision |
| cs.NE | Neural and Evolutionary Computing |
| cs.RO | Robotics |
| stat.ML | Machine Learning (Statistics) |
| math.OC | Optimization and Control |
| physics.** | Physics subcategories |

## Implementation

### ⚠️ WARNING: httpx in execute_code sandbox

`httpx.get()` against the arXiv API from within `execute_code` sandbox can return 0-byte responses (no error, just empty). Observed consistently since 2026-05. The arXiv API also returns HTTP 429 rate limits on most queries.

**Do NOT use the async httpx implementation below in cron jobs or execute_code.** It is provided for reference only. Use browser category pages (`/list/{category}/recent`) instead.
## Implementation

### Search Function (with retry + proxy)

```python
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET
import time

ARXIV_API = "https://export.arxiv.org/api/query"
PROXY = "http://127.0.0.1:7890"  # Use if direct access fails

def search_arxiv(query, field="all", category=None, max_results=10,
                 sort_by="submittedDate", days=None, retries=3):
    """Search arXiv papers with retry logic and proxy fallback."""
    search_query = f"{field}:{query}"
    if category:
        search_query += f" AND cat:{category}"
    
    url = (f"{ARXIV_API}?search_query={urllib.parse.quote(search_query)}"
           f"&max_results={max_results}&sortBy={sort_by}&sortOrder=descending")
    
    for attempt in range(retries):
        try:
            proxy_handler = urllib.request.ProxyHandler(
                {"https": PROXY, "http": PROXY})
            opener = urllib.request.build_opener(proxy_handler)
            req = urllib.request.Request(
                url, headers={"User-Agent": "ResearchAgent/1.0"})
            
            with opener.open(req, timeout=60) as resp:
                xml_text = resp.read().decode("utf-8")
            
            return parse_arxiv_response(xml_text, days)
            
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < retries - 1:
                wait = 5 * (attempt + 1)
                time.sleep(wait)
            else:
                return []  # Give up, caller should try web_search fallback
        except Exception:
            if attempt < retries - 1:
                time.sleep(3)
            else:
                return []
    return []
```

### Fallback: web_search when API is blocked/rate-limited

When direct arxiv API access fails (429, network blocks, no proxy available),
use `web_search` with `site:arxiv.org` queries:

```python
# web_search("site:arxiv.org quantum computing 2025")
# Parse URLs like https://arxiv.org/abs/2511.12379
# Extract paper_id from URL path, construct pdf_url and abs_url
```

This is slower but reliable. Parse the search result titles/URLs/descriptions
to reconstruct paper metadata.

### Quick Search Command

```bash
# Search via curl (save to file first — don't pipe to python)
curl -s "https://export.arxiv.org/api/query?search_query=all:transformer&max_results=5" | xmllint --format -
```

## Pitfalls & Workarounds

### Security Scanner Blocks

The Hermes Agent security scanner blocks these patterns:
- **`curl | python3` pipes**: Direct piping of downloaded content to an interpreter is flagged as `[HIGH] Pipe to interpreter`
- **Plain HTTP URLs**: `http://` URLs trigger `[HIGH] Plain HTTP URL in execution context`
- **Solution**: Use `https://` and write to a Python script file, then execute it separately. Or use Python's `urllib` with `urllib.parse.quote()` for URL encoding.

### URL Encoding Required

Complex queries with quotes and spaces (e.g., `all:"quantum computing" AND all:"machine learning"`) must be URL-encoded. Use `urllib.parse.quote(query)` in Python.

### Rate Limiting (HTTP 429)

arXiv API returns 429 after rapid sequential requests. Add delays (3+ seconds) between calls. **Fallback**: If arXiv API is unavailable or rate-limited, use `web_search` with `site:arxiv.org <keywords>` to find papers.

### Fallback Search Strategy

```python
# When arXiv API fails, fall back to web_search:
web_search(query='site:arxiv.org "quantum computing" "machine learning" recent papers')
```

### httpx with Proxy

When using httpx with a proxy, use `HTTPTransport` — the `proxies=` kwarg on `httpx.Client` is deprecated in newer versions:

```python
proxy_url = "http://127.0.0.1:7890"
transport = httpx.HTTPTransport(proxy=proxy_url, verify=False)
with httpx.Client(transport=transport, timeout=30) as client:
    response = client.get(ARXIV_API, params=params, follow_redirects=True)
```

**⚠️ URL Encoding**: When using Python `urllib.request` directly (not via `params=` dict), spaces in the query string cause `http.client.InvalidURL`. Always encode:
```python
from urllib.parse import quote
query = quote(f'all:"{search_term}"')
url = f"https://export.arxiv.org/api/query?search_query={query}&max_results=5"
```
Using `httpx`/`requests` with a `params=` dict avoids this issue entirely — the library handles encoding automatically.

## Fallback: When curl/httpx Are Blocked

Security scanners may block `curl | python3` pipes or flag HTTP URLs. When exec-based API access fails:

1. **Use `browser_navigate`** to visit arXiv directly:
   - Browse by category: `https://arxiv.org/list/{category}/recent`
   - View specific paper: `https://arxiv.org/abs/{arxiv_id}`
   - Example categories: `quant-ph`, `cs.AI`, `cs.LG`, `cs.CV`, `cs.NE`

2. **Use `web_search`** with `site:arxiv.org` prefix:
   - `site:arxiv.org quantum machine learning 2026`
   - `site:arxiv.org reinforcement learning algorithms`

3. **Use `execute_code` with `hermes_tools.web_search`** for programmatic search (avoids pipe-to-interpreter blocks).

The browser approach yields full paper details (title, authors, abstract, categories) from the rendered page.

### Fetch Paper Details by ID

When you have arXiv IDs and need full details (complete abstracts, author lists, categories), use the `id_list` parameter instead of `search_query`:

```python
def get_paper_details(arxiv_ids):
    \"\"\"Fetch full paper metadata by arXiv ID list.\"\"\"
    ids_str = ",".join(arxiv_ids)
    params = {"id_list": ids_str}
    r = httpx.get(ARXIV_API, params=params, timeout=30)
    r.raise_for_status()
    return r.text  # Parse with same parse_arxiv function
```

**Why this matters:** Search queries may truncate results or miss papers. `id_list` guarantees exact retrieval and is essential for deep paper analysis workflows.

## Workflow for Agents

### Step 1: Understand Search Intent

```markdown
- What is the user looking for?
  - Keywords → keyword search
  - Author name → author search
  - Specific topic → category + keyword
  - Recent papers → time-filtered search
```

### Step 2: Build Query

```python
def build_query(intent):
    """Build arXiv query from user intent."""
    
    if intent["type"] == "keyword":
        return f"all:{intent['query']}"
    elif intent["type"] == "author":
        return f"au:{intent['query']}"
    elif intent["type"] == "title":
        return f"ti:{intent['query']}"
    elif intent["type"] == "category":
        return f"cat:{intent['category']}"
    elif intent["type"] == "combined":
        # e.g., "machine learning in computer vision"
        return f"all:{intent['keywords']} AND cat:{intent['category']}"
```

### Step 3: Execute Search

```python
# Execute search with appropriate parameters
results = await search_arxiv(
    query=built_query,
    field=intent.get("field", "all"),
    category=intent.get("category"),
    max_results=intent.get("max_results", 10),
    sort_by=intent.get("sort_by", "relevance"),
    days=intent.get("days")
)
```

### Step 4: Present Results

```markdown
## arXiv Search Results

Found {count} papers for "{query}":

### 1. {title}
- **Authors:** {authors}
- **Published:** {date}
- **Category:** {category}
- **arXiv:** [{id}]({abs_url})
- **PDF:** [Download]({pdf_url})

**Abstract:** {abstract}

---
```

## Category Reference

### Computer Science

| Category | Name |
|----------|------|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision and Pattern Recognition |
| cs.LG | Machine Learning |
| cs.NE | Neural and Evolutionary Computing |
| cs.RO | Robotics |
| cs.CR | Cryptography and Security |
| cs.DB | Databases |
| cs.DC | Distributed Computing |
| cs.HC | Human-Computer Interaction |
| cs.IR | Information Retrieval |
| cs.MM | Multimedia |
| cs.SE | Software Engineering |

### Mathematics

| Category | Name |
|----------|------|
| math.OC | Optimization and Control |
| math.ST | Statistics Theory |
| math.NA | Numerical Analysis |
| stat.ML | Machine Learning (Statistics) |

### Physics

| Category | Name |
|----------|------|
| physics.comp-ph | Computational Physics |
| physics.data-an | Data Analysis |
| quant-ph | Quantum Physics |

## Rate Limiting (Critical)

arXiv API aggressively rate-limits — even with a proxy. The old "~3 second delay" guidance is **insufficient**.

- **Minimum delay: 5 seconds** between requests
- **On HTTP 429**: Wait **10 seconds** before retry
- **Use `follow_redirects=True`**: The API returns HTTP 301 redirects; httpx must follow them
- **Batch by ID, not by query**: `id_list=2503.07681,2507.10722` is more efficient than separate queries
- **User-Agent header required**: Always include `User-Agent: ResearchBot/1.0` or similar

```python
import httpx
proxy = httpx.Proxy("http://127.0.0.1:7890")
with httpx.Client(proxy=proxy, timeout=30, follow_redirects=True) as client:
    for i, arxiv_id in enumerate(papers):
        if i > 0:
            time.sleep(5)  # minimum delay
        url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
        resp = client.get(url, headers={"User-Agent": "ResearchBot/1.0"})
        if resp.status_code == 429:
            time.sleep(10)  # retry delay
            resp = client.get(url, headers={"User-Agent": "ResearchBot/1.0"})
```

## Rate Limits and Proxy Configuration

### arXiv API Rate Limits
- arXiv enforces strict rate limits (HTTP 429). Space requests **5+ seconds apart**.
- On 429: wait 15 seconds, then retry once. If still 429, abort.
- Maximum ~3 requests per 30 seconds.

### Proxy Configuration (when direct access fails)
When running from environments where arXiv is blocked or rate-limited, use:
```python
import httpx
proxy = httpx.Proxy('http://127.0.0.1:7890')
with httpx.Client(proxy=proxy, timeout=45) as client:
    resp = client.get(url, headers={'User-Agent': 'ResearchBot/1.0'})
```

### Retry Pattern
```python
resp = client.get(url, headers={'User-Agent': 'ResearchBot/1.0'})
if resp.status_code == 429:
    time.sleep(15)
    resp = client.get(url, headers={'User-Agent': 'ResearchBot/1.0'})
if resp.status_code != 200:
    # Abort on second failure
    continue
```

## Best Practices
## Best Practices

1. **Be Specific**: Use specific keywords for better results
2. **Use Categories**: Filter by category to narrow results
3. **Sort Appropriately**: Use `relevance` for overview, `submittedDate` for latest
4. **Limit Results**: Start with 10-20 results, increase if needed
5. **Check Date**: Use time filter for recent developments
6. **Use web_search for discovery**: When arXiv API is rate-limited or unreliable, use `web_search` to find paper IDs, then fetch metadata via `curl`
7. **Batch metadata fetches**: Use `id_list=ID1,ID2,ID3` to fetch multiple papers in one API call
8. **Save curl to file first**: Never pipe curl output to python — security guardrail blocks it
6. **Browser Fallback**: When API (429) AND `web_search` both return empty for niche queries, use `browser_navigate` to `https://arxiv.org/search/?searchtype=all&query=<query>&order=-announced_date_first`. Extract from page snapshot. Confirmed effective for topic combinations like "quantum machine learning medicine" where other methods miss results.
6. **Proxy Required**: Route through proxy (127.0.0.1:7890) — direct API access gets 429'd
7. **Use urllib, not httpx**: httpx in execute_code sandbox returns 0-byte responses for arxiv API
8. **web_search Fallback**: If API fails after retries, use `web_search("site:arxiv.org <keywords>")` and parse URLs/summaries
9. **Never web_extract arxiv**: `web_extract` blocks arxiv URLs as internal/private — use `web_search` results instead
6. **Multi-Topic Scanning**: For comprehensive literature reviews, run parallel searches across multiple related topics and deduplicate by paper ID. Example pattern:
   ```python
   queries = ['all:"spiking neural"', 'all:"brain network"', 'cat:q-bio.NC']
   all_papers = []
   for q in queries:
       papers = search_arxiv(q, max_results=8, days=14)
       for p in papers:
           if not any(x['id'] == p['id'] for x in all_papers):
               all_papers.append(p)
   ```
6. **Handle 301s**: Always follow redirects — the API may redirect queries

## Common Use Cases

### 1. Literature Review

```
搜索 arxiv: "prompt engineering" --category cs.CL --days 30 --max 20
```

### 2. Author Tracking

```
搜索作者 "Andrew Ng" 的最新论文
```

### 3. Topic Monitoring

```
搜索 cs.AI 类别最近一周的论文
```

### 4. Specific Paper

```
搜索标题 "Attention is All You Need"
```

## Output Format

### Summary Format

```markdown
# arXiv Search Results

**Query:** {query}
**Results:** {count} papers
**Time Range:** {time_range}

---

## Papers

### 1. {Title}
**Authors:** {Author 1}, {Author 2}, et al.
**Published:** {YYYY-MM-DD}
**Category:** {category}

**Abstract:** 
{abstract}

**Links:**
- arXiv: [{id}](https://arxiv.org/abs/{id})
- **PDF**: [Download]({pdf_url})

---
```

## Limitations

- **arXiv API is aggressively rate-limited** — returns 429 ("Rate exceeded.") on most requests even with short delays. Use `sleep 10` minimum between requests. `sleep 4` is NOT enough.
- **Use `web_search` as primary discovery** when arXiv API is rate-limited. `web_search("query site:arxiv.org")` works reliably with no rate limits, then use API only for full metadata on specific IDs.
- **`web_extract` blocks arxiv.org URLs** as "private/internal network." Do NOT use web_extract for arxiv content. Use the API directly or `browser_navigate` as fallback.
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed

## Pitfalls

### arXiv API Rate Limiting (429)
The arXiv API returns "Rate exceeded." aggressively. Even with `sleep 4` between requests, most calls will 429. Use `sleep 10` minimum. Better: use `web_search` for discovery (no rate limits), then query API only for specific IDs.

### HTTP vs HTTPS
Always use `https://export.arxiv.org/api/query`, NOT `http://`. HTTP connections timeout through proxy environments.

### Security Guardrail on curl
Never pipe curl output directly to Python — security guardrail blocks pipe-to-interpreter. Save to file first: `curl -o /tmp/arxiv.xml "https://..." && python3 parse.py /tmp/arxiv.xml`.

### httpx Proxy Issue
When using httpx with `trust_env=True`, the proxy may be auto-detected and cause timeouts. Use `httpx.Client(trust_env=False)` to bypass proxy for arXiv (or ensure proxy allows it).
- Preprints are not peer-reviewed
- **curl | python3 is blocked**: Security scanner prevents piping curl output directly to Python. Save to file first, then read with Python.
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed
    params = {"search_query": query, "max_results": max_results, "sortBy": "submittedDate"}
    for attempt in range(max_retries):
        try:
            r = httpx.get(ARXIV_API, params=params, timeout=30)
            if r.status_code == 429:
                wait = 10 * (attempt + 1)  # 10s, 20s, 30s
                time.sleep(wait)
                continue
            r.raise_for_status()
            return parse_response(r.text)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(10 * (attempt + 1))
            else:
                return []
    return []
```

**Never call arXiv API in a loop without delays** — batch queries and wait 3+ seconds between separate searches.

## Rate Limiting & Fallback

**The arXiv API aggressively rate-limits** (HTTP 429). Expect failures on repeated calls.

### Mitigation
- Use a custom User-Agent: `-A "ResearchAgent/1.0 (your@email.com)"`
- Wait 10-20 seconds between requests after a 429
- Limit `max_results` to 10 per call for reliability

### Browser Fallback (when API is rate-limited)
```
1. Navigate to https://arxiv.org/list/quant-ph/recent
2. Browse the listing page to collect paper titles, IDs, authors
3. Visit individual https://arxiv.org/abs/<id> pages for abstracts
4. Extract paper metadata from the HTML structure
```

### Web Search Fallback
```
Search: "arxiv <topic> latest site:arxiv.org"
Extract: IDs, titles, descriptions from SERP snippets
Follow: Visit individual abs pages for full abstracts
```

## Limitations

- arXiv API has aggressive rate limits (HTTP 429); always implement retry with exponential backoff
- Direct API access may require proxy in restricted environments (default 127.0.0.1:7890)
## Limitations

- arXiv API has rate limits (HTTP 429); add 3+ second delays between calls
- If rate-limited or blocked by security scanner, fall back to `web_search` with `site:arxiv.org`
- URL-encode complex queries with `urllib.parse.quote()` when using Python
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed## Limitations

- arXiv API has rate limits (HTTP 429); add 3+ second delays between calls
- If rate-limited or blocked by security scanner, fall back to `web_search` with `site:arxiv.org`
- URL-encode complex queries with `urllib.parse.quote()` when using Python
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed

The arXiv API enforces strict rate limits. Observed behavior:

- **Batch/multi-query searches**: Return **HTTP 429** even with 3–4 second delays between requests
- **Individual paper lookups** (`id_list` parameter): Work reliably with 3+ second delays
- **Single search queries**: Usually succeed, but can 429 under load

### Recommended Retry Strategy

```python
import time
import httpx

def arxiv_request(url, max_retries=3, base_delay=4):
    """Request arXiv API with exponential backoff for 429s."""
    for attempt in range(max_retries):
        resp = httpx.get(url, timeout=60)
        if resp.status_code == 429:
            wait = base_delay * (2 ** attempt)  # 4s, 8s, 16s
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp
    raise Exception(f"arXiv API rate limited after {max_retries} retries")
```

### Fallback: Use id_list for Individual Paper Details

When searching multiple papers fails, look up each one individually:

```python
# Instead of batch query for 10 papers at once (gets 429'd):
# Use id_list for single papers with delays:
for paper_id in paper_ids:
    time.sleep(4)  # Required delay between requests
    resp = httpx.get(f"https://export.arxiv.org/api/query?id_list={paper_id}")
    # Parse single entry...
```

### Practical Guidance

| Scenario | Approach |
|----------|----------|
| Single keyword search | Direct query, usually works |
| 3+ sequential searches | 4+ second delays between each |
| Bulk paper detail fetch | Use `id_list={id}` one at a time with delays |
| Category browsing (latest N) | Single query works; multiple category queries need delays |
| web_search fallback | Use if arXiv API is fully rate-limited |

## Related Skills
- Preprints are not peer-reviewed

## Rate Limiting & Retry Pattern

The arXiv API frequently returns 429 (rate limited) on initial queries. Use this pattern:

```python
import time
import httpx

def arxiv_query_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        r = httpx.get(url, timeout=30, follow_redirects=True)
        if r.status_code == 429:
            wait = 4 * (attempt + 1)  # 4s, 8s, 12s backoff
            time.sleep(wait)
            continue
        if r.status_code != 200:
            return None
        return r.text
    return None

# Between multiple queries, always add a delay:
# time.sleep(3-4) between consecutive queries
```

## Common Use Cases

### URL Encoding (Critical)
arXiv queries with spaces/special chars cause `InvalidURL` errors. Always encode:
```python
import urllib.parse
encoded = urllib.parse.quote(query, safe=':+()')
```

### HTTP 429 Rate Limiting
Broad queries (e.g. `all:neuroscience`) consistently return 429. Narrow queries or fallback to `web_search`. Use ≥3.5s delay between requests.

### Proxy Required
This environment needs `http://127.0.0.1:7890` proxy. Configure via `urllib.request.ProxyHandler`.

### Security Scanner
`curl | python3` pipe patterns are blocked. Write self-contained Python scripts instead.

### Always Use HTTPS
Use `https://export.arxiv.org/api/query` — never `http://`.

## Environment-Specific Notes

### URL Encoding (Critical)
arXiv API URLs must be properly encoded. Spaces and special characters cause `InvalidURL` errors:
```python
import urllib.parse
encoded = urllib.parse.quote(query, safe=':+()')
url = f"https://export.arxiv.org/api/query?search_query={encoded}"
```

### Proxy Configuration
If behind a proxy, configure it in urllib/httpx:
```python
proxy_handler = urllib.request.ProxyHandler({'https': 'http://127.0.0.1:7890', 'http': 'http://127.0.0.1:7890'})
opener = urllib.request.build_opener(proxy_handler)
```

### Security Scanner Warning
`curl | python3` pipe patterns trigger HIGH security warnings. Use self-contained Python scripts instead.

### Rate Limit Fallback
When arXiv returns 429:
1. Wait 5 seconds and retry (max 2 retries)
2. If still failing, use `web_search` with `arxiv <keywords>` as fallback
3. Narrow broad queries: `all:neuroscience` → `all:(neural memory learning)`

### Always Use HTTPS
- Preprints are not peer-reviewed

## Pitfalls & Workarounds

### API Rate Limiting
The arxiv API returns "Rate exceeded." when queried too frequently. When this happens:
1. **Wait 5+ seconds** before retrying
2. **Use proxy**: `export https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890`
3. **Fallback to web_search**: `web_search(query="arxiv quantum computing recent papers 2025", limit=5)` often returns arxiv paper links with titles and abstracts

### web_extract Blocks arxiv URLs
`web_extract` blocks arxiv.org URLs with "Blocked: URL targets a private or internal network address". To read paper content:
1. Use `browser_navigate(url="https://arxiv.org/abs/2508.06011")` — works reliably
2. Use `browser_snapshot` after navigation to capture abstract, authors, metadata
3. For Nature/APS/etc. publisher sites: also use browser navigation (accept cookie dialogs first)

### curl Timeouts Without Proxy
Direct `curl` to arxiv API often times out (30s+) without proxy. Use proxy or web_search fallback.

### Recommended Fallback Chain
```
## Rate Limiting & Retry Pattern

arXiv aggressively rate-limits automated queries. **429 errors are common even with 3-second delays between requests.** Always use exponential backoff retry:

```python
import time, httpx

def search_with_retry(url, params, max_retries=3):
    """Search arXiv with exponential backoff on 429."""
    for attempt in range(max_retries):
        with httpx.Client(timeout=20) as client:
            resp = client.get(url, params=params)
            if resp.status_code == 429:
                wait = (2 ** attempt) * 3  # 3s, 6s, 12s
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp.text
    raise RuntimeError(f"arXiv rate limit exceeded after {max_retries} retries")
```

**Rules for automated/bulk searches:**
- Wait **at least 3.5 seconds** between consecutive queries
- Use `httpx` or `requests` (NOT `curl | python3` — triggers security scans)
- For >5 queries, spread them out with 5+ second gaps
- Cache results locally to avoid re-querying

## Limitations

- arXiv API has strict rate limits — returns **HTTP 429** after ~3 quick requests. Always add 3-5 second delays between calls.
- `web_search` works as a reliable fallback when API returns 429
- `web_extract` **blocks arxiv.org URLs** (flagged as "private/internal network address") — use `web_search` for abstracts instead
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed

## Tool Fallback Strategy (IMPORTANT)

When searching arxiv programmatically, use this priority order:

1. **arxiv API** (primary) — but expect 429 after 2-3 calls. Add `time.sleep(3-5)` between requests.
2. **web_search** (fallback) — query `"arxiv XXXX.XXXXX abstract"` or `"arxiv {keywords} 2025 2026"`
3. Never use **web_extract** on arxiv URLs — it will block them

### Proxy Configuration

If behind a proxy (e.g., `http://127.0.0.1:7890`), set `ProxyHandler` in Python urllib:
```python
proxy_handler = urllib.request.ProxyHandler({"https": PROXY, "http": PROXY})
opener = urllib.request.build_opener(proxy_handler)
- Preprints are not peer-reviewed

## Pitfalls & Workarounds

### web_search with `site:arxiv.org` returns empty
The `web_search` tool consistently returns **zero results** for `site:arxiv.org` queries. Do not rely on it for arxiv paper discovery.

**Workaround:** Use `browser_navigate` to `https://arxiv.org/abs/<id>` for fetching individual paper abstracts, or use the arxiv API directly via Python `httpx`.

### curl to arxiv triggers security scanner
---

## Rate Limiting (CRITICAL)

The arXiv API enforces **aggressive rate limiting** (429 Too Many Requests). The common "~3 second delay" guidance is **insufficient** for programmatic multi-query searches.

### Observed Behavior (May 2026)

| Delay Between Queries | Result |
|----------------------|--------|
| 3-5 seconds | ❌ Consistent 429 errors |
| 6-8 seconds | ❌ Intermittent 429 errors |
| **15 seconds** | ✅ Reliable |

### Rules for Multi-Query Searches

1. **Minimum 15-second delay** between sequential API calls
2. **Never parallelize** arXiv API calls — always run sequentially
3. **Use `id_list` parameter** for fetching specific papers (lower rate limit risk):
   ```
   https://export.arxiv.org/api/query?id_list=2605.03598v1,2605.02509v1
   ```
4. **Batch paper ID lookups** when possible (up to ~10 IDs per request)
5. **On 429 error**: wait 30+ seconds before retry, do not immediately retry

### Recommended Pattern for Cron/Bulk Scans

```python
import time

queries = ["cat:cs.NE", "cat:q-bio.NC", "all:spiking neural"]
all_results = []

for q in queries:
    time.sleep(15)  # CRITICAL: must be >= 15s between queries
    results = search_arxiv(q, max_results=20)
    all_results.extend(results)
```

### Alternative: Single Broad Query

When rate limiting is a concern, prefer **one well-crafted query** over many narrow ones:
```
all:neural dynamics AND (brain OR spiking OR cognitive) AND cat:(cs.NE OR q-bio.NC)
```

## Limitations and Pitfalls

- **Rate limiting (HTTP 429)**: arXiv API aggressively rate-limits. Always add `sleep 5` between consecutive requests. If you get 429, wait 10+ seconds before retry.
- **httpx proxy configuration**: In newer httpx versions, `httpx.Client(proxies=...)` raises `TypeError`. Use environment variables instead: `HTTPS_PROXY=http://127.0.0.1:7890` or `httpx.Client(proxy="http://127.0.0.1:7890")` (note singular `proxy`).
- **Security scanner blocks**: `curl | python3` pipes trigger security scans (HIGH severity for "pipe to interpreter"). Write curl output to a temp file first, then parse with python.
- **web_search/web_extract fallbacks**: `web_extract` cannot access `arxiv.org/list/*` pages (blocked as "private/internal network"). Use `web_search` with `site:arxiv.org` for discovery, then the API for details.
- **httpx redirects**: The arXiv API may redirect. Always use `httpx.Client(follow_redirects=True, timeout=60)`.
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed

## Related Skills

- **news-search**: For general news
- **tech-researcher agent**: For technical research
- **skill-extractor**: Extract patterns from papers

## Dependencies
## Bundled Resources

- **references/multi-topic-scanning.md** — Complete multi-topic scanning workflow with deduplication and existing-skill checking patterns. Use for comprehensive literature reviews across multiple related topics.

## Dependencies

```bash
pip install httpx xmltodict
```

## Notes

- arXiv is free and open access
- Papers are preprints (not peer-reviewed)
- Good for cutting-edge research
- Check citation count on Google Scholar for impact
- Use Semantic Scholar API for additional metadata

## Fallback: RSS Feed (when API is rate-limited)

The arXiv query API aggressively rate-limits (429 errors, read timeouts). When the API fails, use the RSS feed:

```bash
# Single category
curl -s --max-time 15 "https://export.arxiv.org/rss/quant-ph"

# Combined categories (use + to join)
curl -s --max-time 15 "https://export.arxiv.org/rss/quant-ph+cs.LG"
```

RSS has separate, more generous rate limits and is far more reliable for recent papers. Each `<item>` contains `<title>`, `<link>` (abs URL → extract ID), `<description>` (abstract with LaTeX). PDF: `https://arxiv.org/pdf/<id>`.

## Detailed Reference

- **Reliable Fetch Patterns**: See [references/arxiv-reliable-fetch.md](references/arxiv-reliable-fetch.md) for working patterns when the API is rate-limited or unreliable (web_search → curl → parse workflow).

## Pitfalls (from live sessions)

- **http:// → https:// redirect**: The arXiv API endpoint at `http://export.arxiv.org/api/query` returns a 301 redirect. Always use `https://`.
- **httpx proxy kwarg changed**: Newer httpx versions removed the `proxies=` kwarg on `httpx.Client`. Use `httpx.HTTPTransport(proxy=url)` instead:
  ```python
  transport = httpx.HTTPTransport(proxy="http://127.0.0.1:7890", verify=False)
  with httpx.Client(transport=transport, timeout=30) as client:
      response = client.get(url, params=params, follow_redirects=True)
  ```
- **arXiv API 429 rate limits**: The API returns 429 even with delays. Fall back to `web_search` with `site:arxiv.org` queries to bypass rate limits.
- **web_extract blocks arxiv.org**: The web extraction tool blocks arxiv URLs as "private/internal network." Use `web_search` for discovery, `curl` with proxy for raw API metadata.

## Related Skills

1. Read the task description carefully
2. Follow the step-by-step process
3. Use the appropriate tools
4. Verify the results

## Examples

### Example 1: Basic Usage

**User:** <example user request>

**Agent:** <example agent response>

### Example 2: Advanced Usage

**User:** <example user request>

**Agent:** <example agent response>
Get arxiv-search.

vz-bench-debug

vz-scrape-runner

Think you can beat it?