Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.
npx versuz@latest install freedomintelligence-openclaw-medical-skills-skills-bio-geo-datagit clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills.gitcp OpenClaw-Medical-Skills/SKILL.MD ~/.claude/skills/freedomintelligence-openclaw-medical-skills-skills-bio-geo-data/SKILL.md<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
---
name: bio-geo-data
description: Query NCBI Gene Expression Omnibus (GEO) for expression datasets using Biopython Bio.Entrez. Use when finding microarray/RNA-seq datasets, downloading expression data, or linking GEO series to SRA runs.
tool_type: python
primary_tool: Bio.Entrez
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
---
# GEO Data
Query and access Gene Expression Omnibus datasets using Biopython's Entrez module.
## Required Setup
```python
from Bio import Entrez
Entrez.email = 'your.email@example.com' # Required by NCBI
Entrez.api_key = 'your_api_key' # Optional
```
## GEO Database Types
| Database | db value | Description |
|----------|----------|-------------|
| GEO DataSets | `gds` | Curated datasets (GDS*) |
| GEO Profiles | `geoprofiles` | Individual gene profiles |
**GEO Record Types:**
| Prefix | Type | Description |
|--------|------|-------------|
| GSE | Series | Complete study/experiment |
| GSM | Sample | Individual sample |
| GPL | Platform | Array/sequencing platform |
| GDS | DataSet | Curated, normalized dataset |
## Searching GEO
### Search GEO DataSets (GDS)
```python
from Bio import Entrez
Entrez.email = 'your.email@example.com'
# Search curated datasets
handle = Entrez.esearch(db='gds', term='breast cancer AND Homo sapiens[orgn]', retmax=10)
record = Entrez.read(handle)
handle.close()
print(f"Found {record['Count']} datasets")
print(f"IDs: {record['IdList']}")
```
### Search GEO Series (GSE)
```python
# Search GEO Series via gds database
# Use entry_type filter
handle = Entrez.esearch(db='gds', term='RNA-seq[title] AND human[orgn] AND gse[entry_type]', retmax=10)
record = Entrez.read(handle)
handle.close()
```
### Common Search Fields
| Field | Description | Example |
|-------|-------------|---------|
| `[orgn]` | Organism | `human[orgn]` |
| `[title]` | Dataset title | `breast cancer[title]` |
| `[description]` | Description text | `stem cell[description]` |
| `[platform]` | Platform GPL | `GPL570[platform]` |
| `[entry_type]` | Record type | `gse[entry_type]`, `gds[entry_type]` |
| `[gdstype]` | Study type | `expression profiling[gdstype]` |
| `[pubmed]` | PubMed ID | `35412348[pubmed]` |
| `[pdat]` | Publication date | `2024[pdat]` |
### GDS Types
```python
# Expression profiling by array
term = 'expression profiling by array[gdstype] AND cancer'
# RNA-seq expression
term = 'expression profiling by high throughput sequencing[gdstype]'
# ChIP-seq
term = 'genome binding/occupancy profiling[gdstype]'
```
## Fetching GEO Information
### Get GEO DataSet Summary
```python
# Fetch summary for GDS records
handle = Entrez.esummary(db='gds', id='200024320')
record = Entrez.read(handle)
handle.close()
summary = record[0]
print(f"Accession: {summary['Accession']}")
print(f"Title: {summary['title']}")
print(f"Summary: {summary['summary'][:200]}...")
print(f"Organism: {summary['taxon']}")
print(f"Platform: {summary['GPL']}")
print(f"Samples: {summary['n_samples']}")
```
### Summary Fields
```python
summary['Accession'] # GSE/GDS accession
summary['title'] # Dataset title
summary['summary'] # Description
summary['taxon'] # Organism
summary['GPL'] # Platform ID
summary['n_samples'] # Number of samples
summary['FTPLink'] # FTP download link
summary['PubMedIds'] # Associated publications
summary['gdsType'] # Dataset type
summary['ptechType'] # Platform technology
```
## Code Patterns
### Search and List GEO Series
```python
from Bio import Entrez
Entrez.email = 'your.email@example.com'
def search_geo(term, entry_type='gse', max_results=20):
full_term = f'{term} AND {entry_type}[entry_type]'
handle = Entrez.esearch(db='gds', term=full_term, retmax=max_results)
search = Entrez.read(handle)
handle.close()
if not search['IdList']:
return []
handle = Entrez.esummary(db='gds', id=','.join(search['IdList']))
summaries = Entrez.read(handle)
handle.close()
results = []
for s in summaries:
results.append({
'accession': s['Accession'],
'title': s['title'],
'organism': s['taxon'],
'samples': s['n_samples'],
'platform': s['GPL']
})
return results
datasets = search_geo('breast cancer RNA-seq AND human[orgn]')
for ds in datasets:
print(f"{ds['accession']}: {ds['title'][:60]}... ({ds['samples']} samples)")
```
### Find RNA-Seq Datasets
```python
def find_rnaseq_datasets(organism, keywords, max_results=20):
term = f'{keywords} AND {organism}[orgn] AND expression profiling by high throughput sequencing[gdstype] AND gse[entry_type]'
handle = Entrez.esearch(db='gds', term=term, retmax=max_results)
search = Entrez.read(handle)
handle.close()
if not search['IdList']:
return []
handle = Entrez.esummary(db='gds', id=','.join(search['IdList']))
summaries = Entrez.read(handle)
handle.close()
return summaries
datasets = find_rnaseq_datasets('Homo sapiens', 'COVID-19')
for ds in datasets:
print(f"{ds['Accession']}: {ds['n_samples']} samples - {ds['title'][:50]}...")
```
### Get GSE Download Link
```python
def get_geo_ftp(gse_accession):
'''Get FTP download link for a GSE'''
handle = Entrez.esearch(db='gds', term=f'{gse_accession}[accn]')
search = Entrez.read(handle)
handle.close()
if not search['IdList']:
return None
handle = Entrez.esummary(db='gds', id=search['IdList'][0])
summary = Entrez.read(handle)[0]
handle.close()
return summary.get('FTPLink')
ftp_link = get_geo_ftp('GSE123456')
print(f"Download from: {ftp_link}")
```
### Link GEO to SRA
Many GEO RNA-seq datasets have associated SRA data.
```python
def geo_to_sra(gse_accession):
'''Find SRA runs associated with a GEO series'''
# Search GEO
handle = Entrez.esearch(db='gds', term=f'{gse_accession}[accn]')
search = Entrez.read(handle)
handle.close()
if not search['IdList']:
return []
# Link to SRA
handle = Entrez.elink(dbfrom='gds', db='sra', id=search['IdList'][0])
links = Entrez.read(handle)
handle.close()
if not links[0]['LinkSetDb']:
return []
sra_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link']]
# Get SRA accessions
handle = Entrez.esummary(db='sra', id=','.join(sra_ids[:50]))
summaries = Entrez.read(handle)
handle.close()
runs = []
for s in summaries:
expxml = s.get('ExpXml', '')
if 'SRR' in str(expxml) or 'SRX' in str(expxml):
runs.append(s)
return runs
sra_data = geo_to_sra('GSE123456')
print(f"Found {len(sra_data)} SRA records")
```
### Search by PubMed ID
```python
def geo_from_pubmed(pmid):
'''Find GEO datasets associated with a publication'''
handle = Entrez.elink(dbfrom='pubmed', db='gds', id=pmid)
links = Entrez.read(handle)
handle.close()
if not links[0]['LinkSetDb']:
return []
gds_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link']]
handle = Entrez.esummary(db='gds', id=','.join(gds_ids))
summaries = Entrez.read(handle)
handle.close()
return summaries
datasets = geo_from_pubmed('35412348')
for ds in datasets:
print(f"{ds['Accession']}: {ds['title']}")
```
### Download GEO Data (GEOparse)
For full data parsing, use the GEOparse library:
```python
# pip install GEOparse
import GEOparse
# Download and parse GSE
gse = GEOparse.get_GEO('GSE123456')
# Access metadata
print(f"Title: {gse.metadata['title'][0]}")
print(f"Samples: {len(gse.gsms)}")
# Get sample metadata
for gsm_name, gsm in gse.gsms.items():
print(f"{gsm_name}: {gsm.metadata['title'][0]}")
# Get expression table
if gse.gpls:
gpl_name = list(gse.gpls.keys())[0]
expression_table = gse.pivot_samples('VALUE')
```
## Download Options
### Direct FTP Download
```bash
# Download entire GSE
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/
# Download specific file types
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/*counts*.txt.gz
```
### Series Matrix Files
```python
import gzip
import urllib.request
def download_series_matrix(gse):
'''Download series matrix file'''
gse_prefix = gse[:len(gse)-3] + 'nnn'
url = f'https://ftp.ncbi.nlm.nih.gov/geo/series/{gse_prefix}/{gse}/matrix/{gse}_series_matrix.txt.gz'
filename = f'{gse}_series_matrix.txt.gz'
urllib.request.urlretrieve(url, filename)
return filename
```
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| Empty results | Wrong entry_type | Add `gse[entry_type]` or `gds[entry_type]` |
| No FTPLink | Superseries or no data | Check if series has supplementary files |
| No SRA link | Microarray data | SRA only for sequencing data |
## Decision Tree
```
Need GEO expression data?
├── Looking for curated datasets?
│ └── Search gds with [entry_type]=gds
├── Looking for any experiment?
│ └── Search gds with [entry_type]=gse
├── Want RNA-seq specifically?
│ └── Add 'expression profiling by high throughput sequencing[gdstype]'
├── Have a publication?
│ └── Link pubmed -> gds
├── Need raw sequencing data?
│ └── Link gds -> sra, then use sra-data skill
├── Need processed expression matrix?
│ └── Download series matrix or use GEOparse
└── Need full metadata?
└── Use GEOparse library
```
## Related Skills
- entrez-search - General database searching
- entrez-link - Link GEO to SRA and other databases
- sra-data - Download raw sequencing data from linked SRA
- batch-downloads - Download multiple GEO records
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->