---
name: arxiv-search
description: "arXiv paper search skill - search academic papers by keywords, authors, categories. Supports time filtering, category filtering, and paper detail retrieval. Activation: arxiv search, paper search, 论文搜索, search papers, arxiv 论文."
---
# arXiv Search Skill
Academic paper search skill using arXiv API. Search papers by keywords, authors, categories with time filtering and detail retrieval.
## Features
- **Search Capabilities**
  - Keyword search (title, abstract, all fields)
  - Author search
  - Title-specific search
  - Category-based filtering
- **Filtering Options**
  - Time range (last day/week/month/year)
  - Subject categories (cs.AI, cs.CL, cs.LG, etc.)
  - Result count limit
  - Sort by relevance or date
- **Paper Information**
  - Title, authors, abstract
  - arXiv ID and version
  - PDF download link
  - Publication date
  - Primary category
## Activation Keywords
- arxiv search
- arxiv 搜索
- paper search
- 论文搜索
- search papers
- arxiv 论文
- 学术论文
- 搜论文
## Recommended Model
- **sonnet4.5** (Balanced for search and analysis)
- **opus4.5** (For complex research tasks)
## Tools Used
- **exec**: Run arxiv API queries via curl/httpx
- **read**: Load cached results, read paper PDFs
- **write**: Save search results, create paper summaries
## Usage Examples
### Basic Search
```
Search arxiv: "large language model"
```
### Author Search
```
Search for papers by author "Yann LeCun"
```
### Category Search
```
Search for the latest papers in category cs.AI
```
### Time-filtered Search
```
Search for "vision transformer" papers from the last week
```
## ⚠️ Critical Reliability Notes (Updated 2026-05)
The arXiv API and search page have degraded significantly. Follow this reliability hierarchy:
### Method Priority (Most → Least Reliable)
1. **Browser category pages** (`https://arxiv.org/list/{category}/recent`): MOST RELIABLE
   - Use `browser_navigate` + `browser_snapshot(full=True)` to parse structured text
   - Categories: `q-bio.NC` (neurons/cognition), `cs.NE` (neural/evolutionary computing), `cs.LG` (ML)
   - Pre-structured, chronological listings with no rate limiting
2. **`web_search` for discovery**: GOOD for initial paper discovery
   - Broad results from Google/arXiv indexing
   - Use as first-pass filter before deeper extraction
3. **Terminal + curl with HTTPS**: PARTIAL RELIABILITY
   - `curl "https://export.arxiv.org/api/query?..."`: use HTTPS only (HTTP triggers security scan)
   - May return 0-byte empty responses in some sandboxes (no error, just empty)
   - Wait 3 to 5 seconds between requests and send a `User-Agent` header
4. **arXiv search page** (`/search/?query=...`): UNRELIABLE
   - Returns 400 Bad Request for URL-encoded queries
   - Browser form submission works but is fragile
5. **`httpx` / `urllib` in execute_code sandbox**: AVOID
   - Returns 0-byte empty responses (no error, just empty)
   - API returns HTTP 429 on most queries
6. **`web_extract` for arXiv URLs**: BROKEN
   - Returns "Blocked: URL targets a private or internal network" for arXiv URLs
   - Returns empty content for `/abs/` URLs
### Browser Category Page Recipe (Recommended)
```python
from hermes_tools import browser_navigate, browser_snapshot
# Navigate to category recent submissions
browser_navigate(url="https://arxiv.org/list/q-bio.NC/recent")
# Get structured paper list from full snapshot
snapshot = browser_snapshot(full=True)
# Contains all paper titles, IDs, authors, and subjects in structured text
# Parse the DescriptionList format: term/definition pairs
```
### Important Notes
- Individual `/abs/` pages may return "Article not found" for very recent papers (same-day submissions not yet indexed)
- Weekend caveat: arXiv doesn't publish on weekends; last working day's papers appear Monday
- Always use `https://` not `http://` for arXiv API (HTTP triggers interactive security approval)
## ⚠️ API Degradation Warning (2026-05)
The arXiv API (`export.arxiv.org/api/query`) now returns **HTTP 429 "Rate exceeded" on virtually ALL queries**, even with generous delays and User-Agent headers. The `httpx` and `urllib` approaches in this skill's code examples are **no longer reliable**.
**Recommended approach:** Use the `arxiv-to-skill-research-workflow` skill which covers browser-based extraction (`browser_navigate` + `browser_snapshot` + `browser_console` JS) as the primary method, with API as last-resort fallback.
**If you must use the API:**
- Always use **HTTPS** (`https://export.arxiv.org/api/query`) — plain HTTP triggers security guardrail approval prompts
- Use `execute_code` with `httpx` + proxy (`http://127.0.0.1:7890`) — curl pipes to python are blocked
- Implement exponential backoff: 15s → 30s → 60s between retries
- CS/ML queries are more heavily rate-limited than quant-ph
- Expect ~30-50% query failure rate even with retries
- `web_search` with `site:arxiv.org <topic> 2026` is a viable fallback for discovery
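The 15s → 30s → 60s schedule above can be generated instead of hard-coded. The following is a minimal sketch that assumes a doubling factor; `backoff_delays` and `retry_with_backoff` are hypothetical helper names, not part of this skill, and `fetch` stands for whatever function performs the actual request.

```python
import time


def backoff_delays(base=15.0, retries=3, factor=2.0):
    """Yield the wait (in seconds) before each retry: 15, 30, 60 by default."""
    delay = base
    for _ in range(retries):
        yield delay
        delay *= factor


def retry_with_backoff(fetch, base=15.0, retries=3, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or retries run out.

    `fetch` is any zero-argument callable that returns None on failure
    (for example, on HTTP 429). `sleep` is injectable so tests can skip waiting.
    """
    result = fetch()
    for delay in backoff_delays(base, retries):
        if result is not None:
            break
        sleep(delay)
        result = fetch()
    return result
```

With the defaults this makes at most four attempts in total and returns `None` if all of them fail, at which point the caller can switch to `web_search`.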
## ⚠️ Critical Pitfalls (Updated 2026-05-06)
- **Aggressive rate limiting**: arXiv returns "Rate exceeded." on most requests even with 3-4s delays. Use **10s minimum** between requests. `sleep 4` is NOT enough.
- **API URL**: Always use `https://export.arxiv.org/api/query` (not `http://`). httpx with `http://` fails.
- **web_extract blocks arxiv.org**: The `web_extract` tool blocks arxiv URLs as "private/internal network." Use `web_search` for discovery, then `curl` for metadata.
- **Pipe-to-interpreter blocked**: Security guardrail blocks `curl ... | python3`. Always save curl output to file first, then run python on the file.
- **httpx timeouts**: Direct httpx calls to arXiv frequently timeout (300s+). Prefer `web_search` for discovery + `curl` for individual paper fetches.
### Recommended Workflow (When API Is Unreliable)
```
1. web_search → find paper IDs and titles (no rate limits)
2. curl -s -o /tmp/arxiv.xml "https://export.arxiv.org/api/query?id_list=ID1,ID2,ID3"
→ fetch metadata for multiple papers in ONE request
3. python3 parse.py /tmp/arxiv.xml → extract details from saved file
4. sleep 10 between curl requests if fetching individually
```
## API Details
### arXiv API Endpoint
```
https://export.arxiv.org/api/query
```
**IMPORTANT**: Always use `https://`, never `http://`. The `http://` endpoint returns a 301 redirect that many HTTP clients do not follow automatically, which shows up as empty responses, XML parse errors, or httpx timeouts in sandboxed environments. Set `follow_redirects=True` in httpx as a safeguard. Plain HTTP URLs also trigger security scanner blocks in Hermes Agent and similar agent environments.
**IMPORTANT**: Always set a `User-Agent` header; arXiv aggressively rate-limits requests without one.
### Query Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `search_query` | Search query | `ti:machine learning` |
| `start` | Start index | 0 |
| `max_results` | Max results | 10 |
| `sortBy` | Sort method | `relevance`, `submittedDate` |
| `sortOrder` | Sort order | `ascending`, `descending` |
### Query Prefixes
| Prefix | Field |
|--------|-------|
| `ti:` | Title |
| `au:` | Author |
| `ab:` | Abstract |
| `cat:` | Category |
| `all:` | All fields |
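To show how the parameters and prefixes in the two tables combine, here is a minimal sketch of a URL builder. `build_api_url` is a hypothetical helper name; the choice to join terms with `AND` and to wrap multi-word values in double quotes (so they match as a phrase) is an assumption about the desired query, not a requirement of the API.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_api_url(terms, max_results=10, start=0,
                  sort_by="submittedDate", sort_order="descending"):
    """Build a query URL from (prefix, value) pairs joined with AND."""
    parts = []
    for prefix, value in terms:
        if " " in value:
            value = f'"{value}"'  # quote multi-word values as a phrase
        parts.append(f"{prefix}:{value}")
    params = {
        "search_query": " AND ".join(parts),
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # urlencode percent-encodes quotes and colons and turns spaces into '+'.
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_api_url([("ti", "vision transformer"), ("cat", "cs.CV")], max_results=5)
```

Letting `urlencode` handle the escaping avoids the `InvalidURL` errors that raw spaces cause in `urllib.request`.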
### Common Categories
| Category | Description |
|----------|-------------|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language |
| cs.LG | Machine Learning |
| cs.CV | Computer Vision |
| cs.NE | Neural and Evolutionary Computing |
| cs.RO | Robotics |
| stat.ML | Machine Learning (Statistics) |
| math.OC | Optimization and Control |
| physics.* | Physics subcategories |
## Implementation
### ⚠️ WARNING: httpx in execute_code sandbox
`httpx.get()` against the arXiv API from within `execute_code` sandbox can return 0-byte responses (no error, just empty). Observed consistently since 2026-05. The arXiv API also returns HTTP 429 rate limits on most queries.
**Do NOT use the async httpx implementation below in cron jobs or execute_code.** It is provided for reference only. Use browser category pages (`/list/{category}/recent`) instead.
### Search Function (with retry + proxy)
```python
import time
import urllib.error
import urllib.parse
import urllib.request

ARXIV_API = "https://export.arxiv.org/api/query"
PROXY = "http://127.0.0.1:7890"  # Use if direct access fails


def search_arxiv(query, field="all", category=None, max_results=10,
                 sort_by="submittedDate", days=None, retries=3):
    """Search arXiv papers with retry logic and proxy fallback."""
    # The field prefix is added here, so pass bare search text as `query`.
    search_query = f"{field}:{query}"
    if category:
        search_query += f" AND cat:{category}"
    url = (f"{ARXIV_API}?search_query={urllib.parse.quote(search_query)}"
           f"&max_results={max_results}&sortBy={sort_by}&sortOrder=descending")
    for attempt in range(retries):
        try:
            proxy_handler = urllib.request.ProxyHandler(
                {"https": PROXY, "http": PROXY})
            opener = urllib.request.build_opener(proxy_handler)
            req = urllib.request.Request(
                url, headers={"User-Agent": "ResearchAgent/1.0"})
            with opener.open(req, timeout=60) as resp:
                xml_text = resp.read().decode("utf-8")
            # parse_arxiv_response(xml_text, days) is expected to turn the
            # Atom XML into a list of paper dicts, filtered to the last `days`.
            return parse_arxiv_response(xml_text, days)
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < retries - 1:
                wait = 5 * (attempt + 1)  # 5s, 10s, ...
                time.sleep(wait)
            else:
                return []  # Give up, caller should try web_search fallback
        except Exception:
            if attempt < retries - 1:
                time.sleep(3)
            else:
                return []
    return []
```
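`search_arxiv` calls `parse_arxiv_response`, which this document does not define. The following is one possible sketch of such a parser for the Atom feed that the API returns. The output keys (`id`, `title`, `authors`, `abstract`, `published`, `category`, `abs_url`, `pdf_url`) and the `now` parameter (added so the `days` filter can be tested) are assumptions, not the original author's implementation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

ATOM = {"atom": "http://www.w3.org/2005/Atom",
        "arxiv": "http://arxiv.org/schemas/atom"}


def parse_arxiv_response(xml_text, days=None, now=None):
    """Parse an arXiv Atom feed into a list of paper dicts."""
    if not xml_text.strip():
        return []  # treat 0-byte responses as "no results"
    root = ET.fromstring(xml_text)
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days) if days else None
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        entry_id = entry.findtext("atom:id", "", ATOM).strip()
        paper_id = entry_id.rsplit("/abs/", 1)[-1]  # e.g. 2401.00001v2
        published = entry.findtext("atom:published", "", ATOM).strip()
        if cutoff and published:
            when = datetime.fromisoformat(published.replace("Z", "+00:00"))
            if when < cutoff:
                continue  # older than the requested time window
        primary = entry.find("arxiv:primary_category", ATOM)
        papers.append({
            "id": paper_id,
            # Titles and abstracts are line-wrapped in the feed; collapse whitespace.
            "title": " ".join(entry.findtext("atom:title", "", ATOM).split()),
            "authors": [a.findtext("atom:name", "", ATOM).strip()
                        for a in entry.findall("atom:author", ATOM)],
            "abstract": " ".join(entry.findtext("atom:summary", "", ATOM).split()),
            "published": published[:10],  # YYYY-MM-DD
            "category": primary.get("term") if primary is not None else None,
            "abs_url": f"https://arxiv.org/abs/{paper_id}",
            "pdf_url": f"https://arxiv.org/pdf/{paper_id}",
        })
    return papers
```

The time filter runs client-side on the `published` timestamp, so it only narrows whatever the query already returned; combine it with `sortBy=submittedDate` to get the most recent papers first.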
### Fallback: web_search when API is blocked/rate-limited
When direct arxiv API access fails (429, network blocks, no proxy available),
use `web_search` with `site:arxiv.org` queries:
```python
# web_search("site:arxiv.org quantum computing 2025")
# Parse URLs like https://arxiv.org/abs/2511.12379
# Extract paper_id from URL path, construct pdf_url and abs_url
```
This is slower but reliable. Parse the search result titles/URLs/descriptions
to reconstruct paper metadata.
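The URL-parsing step described above can be sketched as follows. `paper_links_from_urls` is a hypothetical helper name, and the pattern only covers new-style IDs (`YYMM.NNNNN`), which is an assumption that excludes old-style IDs such as `hep-th/9901001`.

```python
import re

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN with an optional version suffix.
ARXIV_URL_RE = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def paper_links_from_urls(urls):
    """Map search-result URLs to unique arXiv IDs with abs/pdf links."""
    seen = {}
    for url in urls:
        match = ARXIV_URL_RE.search(url)
        if not match:
            continue  # not an arXiv paper URL (or an old-style ID)
        paper_id = match.group(1)  # version suffix dropped for deduplication
        seen.setdefault(paper_id, {
            "id": paper_id,
            "abs_url": f"https://arxiv.org/abs/{paper_id}",
            "pdf_url": f"https://arxiv.org/pdf/{paper_id}",
        })
    return list(seen.values())
```

Dropping the version suffix means the constructed links point at the latest version of each paper.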
### Quick Search Command
```bash
# Search via curl (save to file first; don't pipe to python)
curl -s -o /tmp/arxiv.xml "https://export.arxiv.org/api/query?search_query=all:transformer&max_results=5"
xmllint --format /tmp/arxiv.xml
```
## Pitfalls & Workarounds
### Security Scanner Blocks
The Hermes Agent security scanner blocks these patterns:
- **`curl | python3` pipes**: Direct piping of downloaded content to an interpreter is flagged as `[HIGH] Pipe to interpreter`
- **Plain HTTP URLs**: `http://` URLs trigger `[HIGH] Plain HTTP URL in execution context`
- **Solution**: Use `https://` and write to a Python script file, then execute it separately. Or use Python's `urllib` with `urllib.parse.quote()` for URL encoding.
### URL Encoding Required
Complex queries with quotes and spaces (e.g., `all:"quantum computing" AND all:"machine learning"`) must be URL-encoded. Use `urllib.parse.quote(query)` in Python.
### Rate Limiting (HTTP 429)
arXiv API returns 429 after rapid sequential requests. Add delays (3+ seconds) between calls. **Fallback**: If arXiv API is unavailable or rate-limited, use `web_search` with `site:arxiv.org <keywords>` to find papers.
### Fallback Search Strategy
```python
# When arXiv API fails, fall back to web_search:
web_search(query='site:arxiv.org "quantum computing" "machine learning" recent papers')
```
### httpx with Proxy
When using httpx with a proxy, use `HTTPTransport` — the `proxies=` kwarg on `httpx.Client` is deprecated in newer versions:
```python
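import httpx

ARXIV_API = "https://export.arxiv.org/api/query"
proxy_url = "http://127.0.0.1:7890"
params = {"search_query": "all:transformer", "max_results": 5}
# verify=False disables TLS certificate checks; use it only with a trusted local proxy
transport = httpx.HTTPTransport(proxy=proxy_url, verify=False)
with httpx.Client(transport=transport, timeout=30) as client:
    response = client.get(ARXIV_API, params=params, follow_redirects=True)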
```
**⚠️ URL Encoding**: When using Python `urllib.request` directly (not via `params=` dict), spaces in the query string cause `http.client.InvalidURL`. Always encode:
```python
from urllib.parse import quote
query = quote(f'all:"{search_term}"')
url = f"https://export.arxiv.org/api/query?search_query={query}&max_results=5"
```
Using `httpx`/`requests` with a `params=` dict avoids this issue entirely — the library handles encoding automatically.
## Fallback: When curl/httpx Are Blocked
Security scanners may block `curl | python3` pipes or flag HTTP URLs. When exec-based API access fails:
1. **Use `browser_navigate`** to visit arXiv directly:
- Browse by category: `https://arxiv.org/list/{category}/recent`
- View specific paper: `https://arxiv.org/abs/{arxiv_id}`
- Example categories: `quant-ph`, `cs.AI`, `cs.LG`, `cs.CV`, `cs.NE`
2. **Use `web_search`** with `site:arxiv.org` prefix:
- `site:arxiv.org quantum machine learning 2026`
- `site:arxiv.org reinforcement learning algorithms`
3. **Use `execute_code` with `hermes_tools.web_search`** for programmatic search (avoids pipe-to-interpreter blocks).
The browser approach yields full paper details (title, authors, abstract, categories) from the rendered page.
### Fetch Paper Details by ID
When you have arXiv IDs and need full details (complete abstracts, author lists, categories), use the `id_list` parameter instead of `search_query`:
```python
def get_paper_details(arxiv_ids):
    """Fetch full paper metadata by arXiv ID list."""
    ids_str = ",".join(arxiv_ids)
    params = {"id_list": ids_str}
    r = httpx.get(ARXIV_API, params=params, timeout=30)
    r.raise_for_status()
    return r.text  # Atom XML; parse it the same way as search results
```
**Why this matters:** Search queries may truncate results or miss papers. `id_list` guarantees exact retrieval and is essential for deep paper analysis workflows.
## Workflow for Agents
### Step 1: Understand Search Intent
```markdown
- What is the user looking for?
  - Keywords → keyword search
  - Author name → author search
  - Specific topic → category + keyword
  - Recent papers → time-filtered search
```
### Step 2: Build Query
```python
def build_query(intent):
    """Build arXiv query from user intent."""
    if intent["type"] == "keyword":
        return f"all:{intent['query']}"
    elif intent["type"] == "author":
        return f"au:{intent['query']}"
    elif intent["type"] == "title":
        return f"ti:{intent['query']}"
    elif intent["type"] == "category":
        return f"cat:{intent['category']}"
    elif intent["type"] == "combined":
        # e.g., "machine learning in computer vision"
        return f"all:{intent['keywords']} AND cat:{intent['category']}"
    # Fail loudly instead of silently returning None for an unknown type
    raise ValueError(f"Unknown intent type: {intent['type']!r}")
```
### Step 3: Execute Search
```python
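# Execute search with appropriate parameters.
# search_arxiv is synchronous (no await) and adds the field prefix itself,
# so pass the bare search text rather than an already-prefixed query string.
results = search_arxiv(
    query=intent["query"],
    field=intent.get("field", "all"),
    category=intent.get("category"),
    max_results=intent.get("max_results", 10),
    sort_by=intent.get("sort_by", "relevance"),
    days=intent.get("days")
)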
```
### Step 4: Present Results
```markdown
## arXiv Search Results
Found {count} papers for "{query}":
### 1. {title}
- **Authors:** {authors}
- **Published:** {date}
- **Category:** {category}
- **arXiv:** [{id}]({abs_url})
- **PDF:** [Download]({pdf_url})
**Abstract:** {abstract}
---
```
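The template above can be filled programmatically. This is a minimal sketch, assuming each paper is a dict with the keys `id`, `title`, `authors` (a list), `abstract`, `published`, `category`, `abs_url` and `pdf_url`; those key names and the `format_results` helper are illustrative, not defined by this skill.

```python
def format_results(query, papers, abstract_chars=300):
    """Render paper dicts as the Markdown layout shown above."""
    lines = ["## arXiv Search Results",
             f'Found {len(papers)} papers for "{query}":']
    for i, p in enumerate(papers, start=1):
        abstract = p["abstract"]
        if len(abstract) > abstract_chars:
            # Truncate long abstracts so the listing stays scannable
            abstract = abstract[:abstract_chars].rstrip() + "..."
        lines += [
            f"### {i}. {p['title']}",
            f"- **Authors:** {', '.join(p['authors'])}",
            f"- **Published:** {p['published']}",
            f"- **Category:** {p['category']}",
            f"- **arXiv:** [{p['id']}]({p['abs_url']})",
            f"- **PDF:** [Download]({p['pdf_url']})",
            f"**Abstract:** {abstract}",
            "---",
        ]
    return "\n".join(lines)
```

Truncating the abstract is a design choice for chat-sized output; pass a large `abstract_chars` value to keep abstracts whole.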
## Category Reference
### Computer Science
| Category | Name |
|----------|------|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision and Pattern Recognition |
| cs.LG | Machine Learning |
| cs.NE | Neural and Evolutionary Computing |
| cs.RO | Robotics |
| cs.CR | Cryptography and Security |
| cs.DB | Databases |
| cs.DC | Distributed Computing |
| cs.HC | Human-Computer Interaction |
| cs.IR | Information Retrieval |
| cs.MM | Multimedia |
| cs.SE | Software Engineering |
### Mathematics
| Category | Name |
|----------|------|
| math.OC | Optimization and Control |
| math.ST | Statistics Theory |
| math.NA | Numerical Analysis |
| stat.ML | Machine Learning (Statistics) |
### Physics
| Category | Name |
|----------|------|
| physics.comp-ph | Computational Physics |
| physics.data-an | Data Analysis |
| quant-ph | Quantum Physics |
## Rate Limiting (Critical)
arXiv API aggressively rate-limits — even with a proxy. The old "~3 second delay" guidance is **insufficient**.
- **Minimum delay: 5 seconds** between requests
- **On HTTP 429**: Wait **10 seconds** before retry
- **Use `follow_redirects=True`**: The API returns HTTP 301 redirects; httpx must follow them
- **Batch by ID, not by query**: `id_list=2503.07681,2507.10722` is more efficient than separate queries
- **User-Agent header required**: Always include `User-Agent: ResearchBot/1.0` or similar
```python
import time
import httpx

papers = ["2503.07681", "2507.10722"]  # arXiv IDs to fetch
headers = {"User-Agent": "ResearchBot/1.0"}
proxy = httpx.Proxy("http://127.0.0.1:7890")
with httpx.Client(proxy=proxy, timeout=30, follow_redirects=True) as client:
    for i, arxiv_id in enumerate(papers):
        if i > 0:
            time.sleep(5)  # minimum delay
        url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
        resp = client.get(url, headers=headers)
        if resp.status_code == 429:
            time.sleep(10)  # retry delay
            resp = client.get(url, headers=headers)
```
## Rate Limits and Proxy Configuration
### arXiv API Rate Limits
- arXiv enforces strict rate limits (HTTP 429). Space requests **5+ seconds apart**.
- On 429: wait 15 seconds, then retry once. If still 429, abort.
- Maximum ~3 requests per 30 seconds.
### Proxy Configuration (when direct access fails)
When running from environments where arXiv is blocked or rate-limited, use:
```python
import httpx

url = "https://export.arxiv.org/api/query?search_query=all:transformer&max_results=5"
proxy = httpx.Proxy('http://127.0.0.1:7890')
with httpx.Client(proxy=proxy, timeout=45) as client:
    resp = client.get(url, headers={'User-Agent': 'ResearchBot/1.0'})
```
### Retry Pattern
```python
# Runs inside a loop over `urls`, with `client` an open httpx.Client.
for url in urls:
    resp = client.get(url, headers={'User-Agent': 'ResearchBot/1.0'})
    if resp.status_code == 429:
        time.sleep(15)
        resp = client.get(url, headers={'User-Agent': 'ResearchBot/1.0'})
        if resp.status_code != 200:
            # Abort on second failure
            continue
    # ... parse resp.text here ...
```
## Best Practices
1. **Be Specific**: Use specific keywords for better results
2. **Use Categories**: Filter by category to narrow results
3. **Sort Appropriately**: Use `relevance` for overview, `submittedDate` for latest
4. **Limit Results**: Start with 10-20 results, increase if needed
5. **Check Date**: Use time filter for recent developments
6. **Use web_search for discovery**: When arXiv API is rate-limited or unreliable, use `web_search` to find paper IDs, then fetch metadata via `curl`
7. **Batch metadata fetches**: Use `id_list=ID1,ID2,ID3` to fetch multiple papers in one API call
8. **Save curl to file first**: Never pipe curl output to python; the security guardrail blocks it
9. **Browser Fallback**: When API (429) AND `web_search` both return empty for niche queries, use `browser_navigate` to `https://arxiv.org/search/?searchtype=all&query=<query>&order=-announced_date_first`. Extract from page snapshot. Confirmed effective for topic combinations like "quantum machine learning medicine" where other methods miss results.
10. **Proxy Required**: Route through proxy (127.0.0.1:7890); direct API access gets 429'd
11. **Use urllib, not httpx**: httpx in execute_code sandbox returns 0-byte responses for arxiv API
12. **web_search Fallback**: If API fails after retries, use `web_search("site:arxiv.org <keywords>")` and parse URLs/summaries
13. **Never web_extract arxiv**: `web_extract` blocks arxiv URLs as internal/private; use `web_search` results instead
14. **Multi-Topic Scanning**: For comprehensive literature reviews, run searches across multiple related topics and deduplicate by paper ID. Example pattern:
    ```python
    queries = ['all:"spiking neural"', 'all:"brain network"', 'cat:q-bio.NC']
    all_papers = []
    for q in queries:
        papers = search_arxiv(q, max_results=8, days=14)
        for p in papers:
            if not any(x['id'] == p['id'] for x in all_papers):
                all_papers.append(p)
    ```
15. **Handle 301s**: Always follow redirects, since the API may redirect queries
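The multi-topic scanning pattern above checks every new paper against the whole list, which gets slow as results grow. A dictionary keyed by arXiv ID does the same job in one pass. This is a minimal sketch; `merge_unique` is a hypothetical helper, and ignoring the version suffix (so `v1` and `v2` of a paper count as one) is an assumption about what "duplicate" should mean.

```python
import re


def merge_unique(result_lists):
    """Merge several result lists, keeping the first paper seen per arXiv ID.

    Version suffixes are ignored, so 2401.00001v1 and 2401.00001v2 count as
    the same paper.
    """
    merged = {}
    for papers in result_lists:
        for paper in papers:
            key = re.sub(r"v\d+$", "", paper["id"])
            merged.setdefault(key, paper)
    return list(merged.values())
```

Collect the output of each query into a list, wait between queries as the rate-limit guidance requires, and call `merge_unique` once at the end.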
## Common Use Cases
### 1. Literature Review
```
Search arxiv: "prompt engineering" --category cs.CL --days 30 --max 20
```
### 2. Author Tracking
```
Search for the latest papers by author "Andrew Ng"
```
### 3. Topic Monitoring
```
Search for papers in category cs.AI from the last week
```
### 4. Specific Paper
```
Search by title "Attention is All You Need"
```
## Output Format
### Summary Format
```markdown
# arXiv Search Results
**Query:** {query}
**Results:** {count} papers
**Time Range:** {time_range}
---
## Papers
### 1. {Title}
**Authors:** {Author 1}, {Author 2}, et al.
**Published:** {YYYY-MM-DD}
**Category:** {category}
**Abstract:**
{abstract}
**Links:**
- arXiv: [{id}](https://arxiv.org/abs/{id})
- **PDF**: [Download]({pdf_url})
---
```
## Limitations
- **arXiv API is aggressively rate-limited** — returns 429 ("Rate exceeded.") on most requests even with short delays. Use `sleep 10` minimum between requests. `sleep 4` is NOT enough.
- **Use `web_search` as primary discovery** when arXiv API is rate-limited. `web_search("query site:arxiv.org")` works reliably with no rate limits, then use API only for full metadata on specific IDs.
- **`web_extract` blocks arxiv.org URLs** as "private/internal network." Do NOT use web_extract for arxiv content. Use the API directly or `browser_navigate` as fallback.
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed
## Pitfalls
### arXiv API Rate Limiting (429)
The arXiv API returns "Rate exceeded." aggressively. Even with `sleep 4` between requests, most calls will 429. Use `sleep 10` minimum. Better: use `web_search` for discovery (no rate limits), then query API only for specific IDs.
### HTTP vs HTTPS
Always use `https://export.arxiv.org/api/query`, NOT `http://`. HTTP connections timeout through proxy environments.
### Security Guardrail on curl
Never pipe curl output directly to Python — security guardrail blocks pipe-to-interpreter. Save to file first: `curl -o /tmp/arxiv.xml "https://..." && python3 parse.py /tmp/arxiv.xml`.
### httpx Proxy Issue
When using httpx with `trust_env=True`, the proxy may be auto-detected and cause timeouts. Use `httpx.Client(trust_env=False)` to bypass proxy for arXiv (or ensure proxy allows it).
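Retry with linear backoff on HTTP 429:
```python
import time
import httpx

ARXIV_API = "https://export.arxiv.org/api/query"


# The function name and signature are reconstructed from the body below;
# parse_response stands for your own Atom XML parser.
def search_with_backoff(query, max_results=10, max_retries=3):
    params = {"search_query": query, "max_results": max_results, "sortBy": "submittedDate"}
    for attempt in range(max_retries):
        try:
            r = httpx.get(ARXIV_API, params=params, timeout=30)
            if r.status_code == 429:
                wait = 10 * (attempt + 1)  # 10s, 20s, 30s
                time.sleep(wait)
                continue
            r.raise_for_status()
            return parse_response(r.text)
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(10 * (attempt + 1))
            else:
                return []
    return []
```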
**Never call arXiv API in a loop without delays** — batch queries and wait 3+ seconds between separate searches.
## Rate Limiting & Fallback
**The arXiv API aggressively rate-limits** (HTTP 429). Expect failures on repeated calls.
### Mitigation
- Use a custom User-Agent: `-A "ResearchAgent/1.0 (your@email.com)"`
- Wait 10-20 seconds between requests after a 429
- Limit `max_results` to 10 per call for reliability
### Browser Fallback (when API is rate-limited)
```
1. Navigate to https://arxiv.org/list/quant-ph/recent
2. Browse the listing page to collect paper titles, IDs, authors
3. Visit individual https://arxiv.org/abs/<id> pages for abstracts
4. Extract paper metadata from the HTML structure
```
### Web Search Fallback
```
Search: "arxiv <topic> latest site:arxiv.org"
Extract: IDs, titles, descriptions from SERP snippets
Follow: Visit individual abs pages for full abstracts
```
## Limitations
- arXiv API has aggressive rate limits (HTTP 429); always implement retry with exponential backoff
- Direct API access may require proxy in restricted environments (default 127.0.0.1:7890)
- arXiv API has rate limits (HTTP 429); add 3+ second delays between calls
- If rate-limited or blocked by security scanner, fall back to `web_search` with `site:arxiv.org`
- URL-encode complex queries with `urllib.parse.quote()` when using Python
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed
The arXiv API enforces strict rate limits. Observed behavior:
- **Batch/multi-query searches**: Return **HTTP 429** even with 3–4 second delays between requests
- **Individual paper lookups** (`id_list` parameter): Work reliably with 3+ second delays
- **Single search queries**: Usually succeed, but can 429 under load
### Recommended Retry Strategy
```python
import time
import httpx


def arxiv_request(url, max_retries=3, base_delay=4):
    """Request arXiv API with exponential backoff for 429s."""
    for attempt in range(max_retries):
        resp = httpx.get(url, timeout=60)
        if resp.status_code == 429:
            wait = base_delay * (2 ** attempt)  # 4s, 8s, 16s
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp
    raise Exception(f"arXiv API rate limited after {max_retries} retries")
```
### Fallback: Use id_list for Individual Paper Details
When searching multiple papers fails, look up each one individually:
```python
import time
import httpx

# Instead of one search query for 10 papers at once (gets 429'd),
# use id_list for single papers with delays.
# paper_ids is your list of arXiv IDs, e.g. ["2503.07681", "2507.10722"].
for paper_id in paper_ids:
    time.sleep(4)  # Required delay between requests
    resp = httpx.get(f"https://export.arxiv.org/api/query?id_list={paper_id}")
    # Parse single entry...
```
### Practical Guidance
| Scenario | Approach |
|----------|----------|
| Single keyword search | Direct query, usually works |
| 3+ sequential searches | 4+ second delays between each |
| Bulk paper detail fetch | Use `id_list={id}` one at a time with delays |
| Category browsing (latest N) | Single query works; multiple category queries need delays |
| web_search fallback | Use if arXiv API is fully rate-limited |
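## Related Skills
- `arxiv-to-skill-research-workflow`: browser-based extraction (`browser_navigate` + `browser_snapshot` + `browser_console` JS) as the primary method, with the API as a last-resort fallback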
## Rate Limiting & Retry Pattern
The arXiv API frequently returns 429 (rate limited) on initial queries. Use this pattern:
```python
import time
import httpx


def arxiv_query_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        r = httpx.get(url, timeout=30, follow_redirects=True)
        if r.status_code == 429:
            wait = 4 * (attempt + 1)  # 4s, 8s, 12s backoff
            time.sleep(wait)
            continue
        if r.status_code != 200:
            return None
        return r.text
    return None


# Between multiple queries, always add a delay of 3 to 4 seconds,
# for example time.sleep(4) between consecutive queries.
```
### URL Encoding (Critical)
arXiv queries with spaces/special chars cause `InvalidURL` errors. Always encode:
```python
import urllib.parse
encoded = urllib.parse.quote(query, safe=':+()')
```
### HTTP 429 Rate Limiting
Broad queries (e.g. `all:neuroscience`) consistently return 429. Narrow queries or fallback to `web_search`. Use ≥3.5s delay between requests.
### Proxy Required
This environment needs `http://127.0.0.1:7890` proxy. Configure via `urllib.request.ProxyHandler`.
### Security Scanner
`curl | python3` pipe patterns are blocked. Write self-contained Python scripts instead.
### Always Use HTTPS
Use `https://export.arxiv.org/api/query` — never `http://`.
## Environment-Specific Notes
### URL Encoding (Critical)
arXiv API URLs must be properly encoded. Spaces and special characters cause `InvalidURL` errors:
```python
import urllib.parse
encoded = urllib.parse.quote(query, safe=':+()')
url = f"https://export.arxiv.org/api/query?search_query={encoded}"
```
### Proxy Configuration
If behind a proxy, configure it in urllib/httpx:
```python
proxy_handler = urllib.request.ProxyHandler({'https': 'http://127.0.0.1:7890', 'http': 'http://127.0.0.1:7890'})
opener = urllib.request.build_opener(proxy_handler)
```
### Security Scanner Warning
`curl | python3` pipe patterns trigger HIGH security warnings. Use self-contained Python scripts instead.
### Rate Limit Fallback
When arXiv returns 429:
1. Wait 5 seconds and retry (max 2 retries)
2. If still failing, use `web_search` with `arxiv <keywords>` as fallback
3. Narrow broad queries: `all:neuroscience` → `all:(neural memory learning)`
### Always Use HTTPS
Use `https://export.arxiv.org/api/query`, never `http://`.
## Pitfalls & Workarounds
### API Rate Limiting
The arxiv API returns "Rate exceeded." when queried too frequently. When this happens:
1. **Wait 5+ seconds** before retrying
2. **Use proxy**: `export https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890`
3. **Fallback to web_search**: `web_search(query="arxiv quantum computing recent papers 2025", limit=5)` often returns arxiv paper links with titles and abstracts
### web_extract Blocks arxiv URLs
`web_extract` blocks arxiv.org URLs with "Blocked: URL targets a private or internal network address". To read paper content:
1. Use `browser_navigate(url="https://arxiv.org/abs/2508.06011")` — works reliably
2. Use `browser_snapshot` after navigation to capture abstract, authors, metadata
3. For Nature/APS/etc. publisher sites: also use browser navigation (accept cookie dialogs first)
### curl Timeouts Without Proxy
Direct `curl` to arxiv API often times out (30s+) without proxy. Use proxy or web_search fallback.
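### Recommended Fallback Chain
1. arXiv API over HTTPS, waiting 5+ seconds before any retry
2. The same request through the proxy (`http://127.0.0.1:7890`)
3. `web_search` for discovery when the API keeps returning "Rate exceeded."
4. `browser_navigate` + `browser_snapshot` on `https://arxiv.org/abs/<id>` to read individual papers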
## Rate Limiting & Retry Pattern
arXiv aggressively rate-limits automated queries. **429 errors are common even with 3-second delays between requests.** Always use exponential backoff retry:
```python
import time
import httpx


def search_with_retry(url, params, max_retries=3):
    """Search arXiv with exponential backoff on 429."""
    for attempt in range(max_retries):
        with httpx.Client(timeout=20) as client:
            resp = client.get(url, params=params)
        if resp.status_code == 429:
            wait = (2 ** attempt) * 3  # 3s, 6s, 12s
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp.text
    raise RuntimeError(f"arXiv rate limit exceeded after {max_retries} retries")
```
**Rules for automated/bulk searches:**
- Wait **at least 3.5 seconds** between consecutive queries
- Use `httpx` or `requests` (NOT `curl | python3` — triggers security scans)
- For >5 queries, spread them out with 5+ second gaps
- Cache results locally to avoid re-querying
## Limitations
- arXiv API has strict rate limits — returns **HTTP 429** after ~3 quick requests. Always add 3-5 second delays between calls.
- `web_search` works as a reliable fallback when API returns 429
- `web_extract` **blocks arxiv.org URLs** (flagged as "private/internal network address") — use `web_search` for abstracts instead
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed
## Tool Fallback Strategy (IMPORTANT)
When searching arxiv programmatically, use this priority order:
1. **arxiv API** (primary): expect 429 after 2-3 calls. Sleep 3 to 5 seconds between requests (for example `time.sleep(5)`).
2. **web_search** (fallback) — query `"arxiv XXXX.XXXXX abstract"` or `"arxiv {keywords} 2025 2026"`
3. Never use **web_extract** on arxiv URLs — it will block them
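The priority order above can be expressed as a small fallback runner. This is a minimal sketch; `first_successful` is a hypothetical helper, and each strategy is assumed to be a zero-argument callable that returns a list of papers (empty on failure) or raises.

```python
def first_successful(strategies):
    """Try each (name, callable) in order.

    Returns (name, results) for the first strategy that returns a non-empty
    list without raising; raises RuntimeError if every strategy fails.
    """
    errors = []
    for name, run in strategies:
        try:
            results = run()
        except Exception as exc:  # a failed tier should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if results:
            return name, results
        errors.append(f"{name}: empty result")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

A caller would pass something like `[("api", lambda: search_arxiv("transformer")), ("web_search", run_web_search)]`, where `run_web_search` is whatever wrapper the environment provides. Returning the strategy name lets the agent report which source the results came from.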
### Proxy Configuration
If behind a proxy (e.g., `http://127.0.0.1:7890`), set `ProxyHandler` in Python urllib:
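```python
import urllib.request
PROXY = "http://127.0.0.1:7890"  # example address from above; adjust to your proxy
proxy_handler = urllib.request.ProxyHandler({"https": PROXY, "http": PROXY})
opener = urllib.request.build_opener(proxy_handler)  # then use opener.open(url, timeout=30)
```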
## Pitfalls & Workarounds
### web_search with `site:arxiv.org` returns empty
The `web_search` tool consistently returns **zero results** for `site:arxiv.org` queries. Do not rely on it for arxiv paper discovery.
**Workaround:** Use `browser_navigate` to `https://arxiv.org/abs/<id>` for fetching individual paper abstracts, or use the arxiv API directly via Python `httpx`.
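### curl to arxiv triggers security scanner
Piping `curl` output straight into `python3` triggers a security scan. Write the `curl` output to a temp file first and parse it with Python, or use `httpx`.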
---
## Rate Limiting (CRITICAL)
The arXiv API enforces **aggressive rate limiting** (429 Too Many Requests). The common "~3 second delay" guidance is **insufficient** for programmatic multi-query searches.
### Observed Behavior (May 2026)
| Delay Between Queries | Result |
|----------------------|--------|
| 3-5 seconds | ❌ Consistent 429 errors |
| 6-8 seconds | ❌ Intermittent 429 errors |
| **15 seconds** | ✅ Reliable |
### Rules for Multi-Query Searches
1. **Minimum 15-second delay** between sequential API calls
2. **Never parallelize** arXiv API calls — always run sequentially
3. **Use `id_list` parameter** for fetching specific papers (lower rate limit risk):
```
https://export.arxiv.org/api/query?id_list=2605.03598v1,2605.02509v1
```
4. **Batch paper ID lookups** when possible (up to ~10 IDs per request)
5. **On 429 error**: wait 30+ seconds before retry, do not immediately retry
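Rules 3 and 4 can be sketched as a helper that groups paper IDs into `id_list` requests. This is an illustration under stated assumptions: `id_list_urls` is a hypothetical name, and the batch size of 10 is taken from the approximate figure above. `max_results` is set explicitly so that each request returns every ID in its batch.

```python
from typing import Iterable, List
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"


def id_list_urls(paper_ids: Iterable[str], batch_size: int = 10) -> List[str]:
    """Build one API URL per batch of up to `batch_size` arXiv IDs."""
    ids = list(dict.fromkeys(paper_ids))  # drop duplicates, keep first-seen order
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        query = urlencode(
            {"id_list": ",".join(batch), "max_results": len(batch)},
            safe=",",  # keep commas readable instead of %2C
        )
        urls.append(f"{API_URL}?{query}")
    return urls
```

The resulting URLs should still be fetched sequentially, with the delay between requests described above.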
### Recommended Pattern for Cron/Bulk Scans
```python
import time
queries = ["cat:cs.NE", "cat:q-bio.NC", "all:spiking neural"]
all_results = []
for q in queries:
    time.sleep(15)  # CRITICAL: must be >= 15s between queries
    # search_arxiv is assumed to be your own API wrapper, defined elsewhere
    results = search_arxiv(q, max_results=20)
    all_results.extend(results)
```
### Alternative: Single Broad Query
When rate limiting is a concern, prefer **one well-crafted query** over many narrow ones:
```
all:"neural dynamics" AND (all:brain OR all:spiking OR all:cognitive) AND (cat:cs.NE OR cat:q-bio.NC)
```
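A combined query of this kind can also be assembled programmatically. The sketch below is one possible approach: `broad_query` and `query_url` are hypothetical helpers, and every term carries its own field prefix (`all:`, `cat:`), with the phrase in double quotes. `sortBy` and `sortOrder` are standard arXiv API parameters.

```python
from typing import Sequence
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"


def broad_query(phrase: str, any_terms: Sequence[str], categories: Sequence[str]) -> str:
    """Combine a quoted phrase, optional OR-terms, and optional categories."""
    parts = [f'all:"{phrase}"']
    if any_terms:
        parts.append("(" + " OR ".join(f"all:{t}" for t in any_terms) + ")")
    if categories:
        parts.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    return " AND ".join(parts)


def query_url(search_query: str, max_results: int = 20) -> str:
    """URL-encode the query and request the newest submissions first."""
    params = {
        "search_query": search_query,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{API_URL}?{urlencode(params)}"
```

For example, `broad_query("neural dynamics", ["brain", "spiking", "cognitive"], ["cs.NE", "q-bio.NC"])` produces a single query string that replaces several narrow requests.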
## Limitations and Pitfalls
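- **Rate limiting (HTTP 429)**: The arXiv API rate-limits aggressively. Leave at least 5 seconds between consecutive requests, and 15 seconds for multi-query scans (see Rate Limiting (CRITICAL) above). If you get 429, wait before retrying: 10 seconds at the very least, and 30+ seconds per the multi-query rules above.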
- **httpx proxy configuration**: In newer httpx versions, `httpx.Client(proxies=...)` raises `TypeError`. Use environment variables instead: `HTTPS_PROXY=http://127.0.0.1:7890` or `httpx.Client(proxy="http://127.0.0.1:7890")` (note singular `proxy`).
- **Security scanner blocks**: `curl | python3` pipes trigger security scans (HIGH severity for "pipe to interpreter"). Write curl output to a temp file first, then parse with python.
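- **web_search/web_extract fallbacks**: `web_extract` cannot access `arxiv.org/list/*` pages (blocked as "private/internal network"). Use `web_search` for discovery, then the API for details. `site:arxiv.org` queries have returned zero results in some sessions (see Pitfalls & Workarounds), so fall back to plain keyword queries such as `"arxiv {keywords}"` when a `site:` query comes back empty.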
- **httpx redirects**: The arXiv API may redirect. Always use `httpx.Client(follow_redirects=True, timeout=60)`.
- No abstract search in advanced mode (use `all:` prefix)
- Some papers may not have PDF available immediately
- Preprints are not peer-reviewed
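Once a response has been saved to a file, as suggested in the security-scanner item above, it can be parsed with the standard library alone. This is a minimal sketch, assuming the file holds the Atom feed the API returns, with `<entry>` elements in the `http://www.w3.org/2005/Atom` namespace. `parse_atom` and the dictionary keys are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def _text(node: ET.Element, tag: str) -> str:
    """Return the whitespace-normalised text of a child element, or ""."""
    return " ".join(node.findtext(tag, default="", namespaces=ATOM_NS).split())


def parse_atom(xml_text: str) -> List[Dict[str, object]]:
    """Extract basic metadata from each <entry> of an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        abs_url = _text(entry, "atom:id")  # e.g. http://arxiv.org/abs/<id>
        papers.append({
            "id": abs_url.rsplit("/abs/", 1)[-1],
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", ATOM_NS)],
            "published": _text(entry, "atom:published"),
            "abstract": _text(entry, "atom:summary"),
        })
    return papers
```

A saved response can be loaded with `pathlib.Path(path).read_text(encoding="utf-8")` and passed to `parse_atom`, where `path` is whatever temp file the download was written to.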
## Related Skills
- **news-search**: For general news
- **tech-researcher agent**: For technical research
- **skill-extractor**: Extract patterns from papers
## Bundled Resources
- **references/multi-topic-scanning.md** — Complete multi-topic scanning workflow with deduplication and existing-skill checking patterns. Use for comprehensive literature reviews across multiple related topics.
## Dependencies
```bash
pip install httpx xmltodict
```
## Notes
- arXiv is free and open access
- Papers are preprints (not peer-reviewed)
- Good for cutting-edge research
- Check citation count on Google Scholar for impact
- Use Semantic Scholar API for additional metadata
## Fallback: RSS Feed (when API is rate-limited)
The arXiv query API aggressively rate-limits (429 errors, read timeouts). When the API fails, use the RSS feed:
```bash
# Single category
curl -s --max-time 15 "https://export.arxiv.org/rss/quant-ph"
# Combined categories (use + to join)
curl -s --max-time 15 "https://export.arxiv.org/rss/quant-ph+cs.LG"
```
RSS has separate, more generous rate limits and is far more reliable for recent papers. Each `<item>` contains `<title>`, `<link>` (abs URL → extract ID), `<description>` (abstract with LaTeX). PDF: `https://arxiv.org/pdf/<id>`.
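The item structure described above can be parsed with the standard library. This is a sketch under the assumption that the feed is plain RSS 2.0 with un-namespaced `<item>`, `<title>`, `<link>`, and `<description>` elements; `parse_rss` and the dictionary keys are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss(xml_text: str) -> List[Dict[str, str]]:
    """Extract ID, title, abstract, and links from each <item> of an arXiv RSS feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        # The arXiv ID is whatever follows /abs/ in the abstract URL.
        paper_id = link.rsplit("/abs/", 1)[-1] if "/abs/" in link else ""
        items.append({
            "id": paper_id,
            "title": " ".join((item.findtext("title") or "").split()),
            "abstract": (item.findtext("description") or "").strip(),
            "abs_url": link,
            "pdf_url": f"https://arxiv.org/pdf/{paper_id}" if paper_id else "",
        })
    return items
```

Items without a recognisable abstract link get empty `id` and `pdf_url` fields instead of raising, so a malformed entry does not abort the scan.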
## Detailed Reference
- **Reliable Fetch Patterns**: See [references/arxiv-reliable-fetch.md](references/arxiv-reliable-fetch.md) for working patterns when the API is rate-limited or unreliable (web_search → curl → parse workflow).
## Pitfalls (from live sessions)
- **http:// → https:// redirect**: The arXiv API endpoint at `http://export.arxiv.org/api/query` returns a 301 redirect. Always use `https://`.
- **httpx proxy kwarg changed**: Newer httpx versions removed the `proxies=` kwarg on `httpx.Client`. Use `httpx.HTTPTransport(proxy=url)` instead:
```python
import httpx
# verify=False disables TLS certificate checks; keep it only if the proxy requires it
transport = httpx.HTTPTransport(proxy="http://127.0.0.1:7890", verify=False)
with httpx.Client(transport=transport, timeout=30) as client:
    response = client.get(url, params=params, follow_redirects=True)
```
- **arXiv API 429 rate limits**: The API returns 429 even with delays. Fall back to `web_search` when that happens. If `site:arxiv.org` queries return nothing (see Pitfalls & Workarounds), use plain keyword queries instead.
- **web_extract blocks arxiv.org**: The web extraction tool blocks arxiv URLs as "private/internal network." Use `web_search` for discovery, `curl` with proxy for raw API metadata.