---
name: fastq-analysis-pipeline
title: FASTQ analysis and RNA-seq alignment with omicverse
description: Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.
---
## Overview
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the `ov.alignment` module. This skill covers:
- **SRA data acquisition**: `prefetch` and `fqdump` (fasterq-dump wrapper)
- **Quality control**: `fastp` for adapter trimming and QC reports
- **RNA-seq alignment**: `STAR` aligner with auto-index building
- **Gene quantification**: `featureCount` (subread featureCounts wrapper)
- **Single-cell path**: `ref` and `count` via kb-python (kallisto/bustools)
- **Parallel SRA download**: `parallel_fastq_dump`
All functions share a common CLI infrastructure (`_cli_utils.py`) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
## Instructions
1. **Environment setup**
- Bioinformatics tools are resolved automatically from PATH or the active conda environment.
- If `auto_install=True` (default), missing tools are installed via mamba/conda on demand.
- Supported tools: `prefetch`, `vdb-validate`, `fasterq-dump`, `fastp`, `STAR`, `samtools`, `featureCounts`, `pigz`, `gzip`.
- For the single-cell path, ensure `kb-python` is installed: `pip install kb-python`.
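Before a long run, it can help to see which of the supported tools are already on PATH. The following is a minimal sketch using only the standard library. `find_tools` is a hypothetical helper, not part of omicverse, and it only checks PATH; it does not reproduce the conda-environment lookup or the auto-install behaviour described above.

```python
import shutil

# Tool names copied from the supported-tools list above.
SUPPORTED_TOOLS = [
    "prefetch", "vdb-validate", "fasterq-dump", "fastp", "STAR",
    "samtools", "featureCounts", "pigz", "gzip",
]

def find_tools(names=SUPPORTED_TOOLS):
    """Map each tool name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool:15s} {path or 'not on PATH'}")
```

Tools reported as missing here would be left to `auto_install=True` to resolve.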
2. **SRA data download** (`ov.alignment.prefetch` + `ov.alignment.fqdump`)
- Use `prefetch` first for reliable downloads with integrity validation (`vdb-validate`).
- Then convert to FASTQ with `fqdump`. It auto-detects single-end vs paired-end.
- `fqdump` can also work directly from SRR accessions without prefetch.
- Both support retry with exponential backoff for network errors.
```python
import omicverse as ov
# Step 1: Prefetch SRA files (optional but recommended)
pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)
# Step 2: Convert to FASTQ
fq = ov.alignment.fqdump(
    ['SRR1234567', 'SRR1234568'],
    output_dir='fastq', sra_dir='prefetch',
    gzip=True, threads=8, jobs=4,
)
```
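For readers unfamiliar with the retry behaviour mentioned above, the general pattern of retrying with exponential backoff can be sketched as follows. This is a generic illustration, not omicverse's implementation: the helper name `retry_with_backoff`, its parameters, and the choice of `OSError` as the retryable error are all assumptions.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, factor=2.0,
                       exceptions=(OSError,), sleep=time.sleep):
    """Call func(); after each failure wait base_delay * factor**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * factor ** attempt)
```

With the defaults, the waits between attempts are 1 s, 2 s, and 4 s. The `sleep` argument is injectable so the schedule can be checked without actually waiting.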
3. **FASTQ quality control** (`ov.alignment.fastp`)
- Runs fastp for adapter trimming, quality filtering, and QC reporting.
- Supports single-end and paired-end reads.
- Produces per-sample JSON and HTML QC reports.
- Sample format: tuple of `(sample_name, fq1_path, fq2_path_or_None)`.
```python
samples = [
    ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
    ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
]
clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
```
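When there are many samples, the tuples can be built from the directory contents instead of typed by hand. The sketch below assumes the `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming seen in the paths above and treats any other `*.fastq.gz` file as single-end. `collect_samples` is a hypothetical helper, not an omicverse function, and it uses the file-name stem as the sample name.

```python
from pathlib import Path

def collect_samples(fastq_dir, suffix=".fastq.gz"):
    """Return (sample_name, fq1, fq2_or_None) tuples for FASTQ files under fastq_dir."""
    root = Path(fastq_dir)
    samples = []
    for fq in sorted(root.rglob(f"*{suffix}")):
        stem = fq.name[: -len(suffix)]
        if stem.endswith("_2"):
            continue  # mate 2 is attached to its mate 1 below
        if stem.endswith("_1"):
            name = stem[:-2]
            mate = fq.with_name(f"{name}_2{suffix}")
            samples.append((name, str(fq), str(mate) if mate.exists() else None))
        else:
            samples.append((stem, str(fq), None))  # single-end read file
    return samples
```

The returned list has the same shape as the hand-written `samples` list and can be passed to `ov.alignment.fastp` in its place.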
4. **STAR alignment** (`ov.alignment.STAR`)
- Aligns FASTQ reads using the STAR aligner.
- **Auto-index building**: set `auto_index=True` (default) with `genome_fasta_files` and `gtf` to build index automatically if missing.
- Produces coordinate-sorted BAM files.
- Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
- With `strict=False` (default), a failure in one sample is reported in that sample's result instead of aborting the whole run.
```python
# Prepare samples from fastp output
star_samples = [
    ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
    ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
]
bams = ov.alignment.STAR(
    star_samples,
    genome_dir='star_index',
    output_dir='star_out',
    gtf='genes.gtf',
    genome_fasta_files=['genome.fa'],
    threads=8,
    memory='50G',
)
```
5. **Gene quantification** (`ov.alignment.featureCount`)
- Counts aligned reads per gene using featureCounts (subread).
- Auto-detects paired-end from BAM headers (via pysam or samtools).
- `auto_fix=True` (default) retries with corrected paired-end flag on error.
- `gene_mapping=True` maps gene_id to gene_name from the GTF.
- `merge_matrix=True` produces a combined count matrix across all samples.
```python
bam_items = [
    ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
    ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
]
counts = ov.alignment.featureCount(
    bam_items,
    gtf='genes.gtf',
    output_dir='counts',
    gene_mapping=True,
    merge_matrix=True,
    threads=8,
)
# counts is a pandas DataFrame (gene_id x samples)
```
6. **Single-cell path** (`ov.alignment.ref` + `ov.alignment.count`)
- Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
- `ref()` builds a kallisto index and transcript-to-gene mapping.
- `count()` quantifies single-cell data with barcode/UMI handling.
- Supports technologies: 10XV2, 10XV3, BULK, and custom.
- Output formats: h5ad, loom, cellranger MTX.
```python
# Build reference index
ref_result = ov.alignment.ref(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    fasta_paths=['genome.fa'],
    gtf_paths=['genes.gtf'],
    threads=8,
)
# Quantify 10x v3 data
count_result = ov.alignment.count(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    technology='10XV3',
    fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
    output_path='kb_out',
    h5ad=True,
    filter_barcodes=True,
    threads=8,
)
```
7. **Wiring fastp output into STAR input**
- fastp output is a list of dicts with keys: `sample`, `clean1`, `clean2`, `json`, `html`.
- Convert to STAR sample tuples:
```python
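# Normalise to a list: a single input tuple yields a single dict.
results = clean if isinstance(clean, list) else [clean]
star_samples = [
    # clean2 is empty or missing for single-end samples; pass None then.
    (r['sample'], r['clean1'], r.get('clean2') or None)
    for r in results
]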
```
8. **Wiring STAR output into featureCount input**
- STAR output is a list of dicts with keys: `sample`, `bam` (or `error`).
- Convert to featureCount items:
```python
bam_items = [
    (r['sample'], r['bam'])
    for r in (bams if isinstance(bams, list) else [bams])
    if 'bam' in r
]
```
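The filter above silently drops failed samples. A small sketch for surfacing them, relying only on the `bam` / `error` keys described in this step, could look like this. `split_results` is a hypothetical helper, not part of omicverse.

```python
def split_results(results):
    """Separate per-sample STAR result dicts into (succeeded, failed)."""
    if isinstance(results, dict):
        results = [results]  # a single input tuple yields a single dict
    ok = [r for r in results if "bam" in r]
    failed = [r for r in results if "bam" not in r]
    return ok, failed

def report_failures(failed):
    """Return one readable line per failed sample."""
    return [f"{r.get('sample', '?')}: {r.get('error', 'unknown error')}" for r in failed]
```

Typical usage would be `ok, failed = split_results(bams)`, printing `report_failures(failed)` before building `bam_items` from `ok`.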
9. **Skipping completed steps**
- All functions check for existing outputs and skip the step when they are present and `overwrite=False` (default).
- Set `overwrite=True` to force re-execution.
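The skip decision can be pictured with the sketch below. The exact criteria omicverse applies are not stated here, so treating an output as "existing" only when it is a non-empty file is an assumption, and `should_run` is a hypothetical name.

```python
from pathlib import Path

def should_run(expected_outputs, overwrite=False):
    """Return True when a step needs to run, False when it can be skipped."""
    if overwrite:
        return True  # forced re-execution
    # Skip only if every expected output is already present and non-empty.
    complete = all(
        Path(p).is_file() and Path(p).stat().st_size > 0
        for p in expected_outputs
    )
    return not complete
```

Requiring a non-empty file guards against treating a zero-byte output left by an interrupted run as complete.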
10. **Troubleshooting**
- If a tool is not found, confirm that `auto_install=True` is set and that conda or mamba is available on PATH.
- For STAR index errors, ensure `genome_fasta_files` points to uncompressed or gzip-compressed FASTA files.
- For featureCounts paired-end detection errors, `auto_fix=True` handles most cases automatically.
- GTF files can be gzip-compressed; they are auto-decompressed as needed.
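When chasing a STAR index error, a quick sanity check on the FASTA input can rule out a wrong or corrupted file. The sketch below is a hypothetical helper, not part of omicverse. It detects gzip compression from the file's magic bytes and checks only that the first character is `>`, which is a necessary but not sufficient condition for valid FASTA.

```python
import gzip

def looks_like_fasta(path):
    """True if path is a plain or gzip-compressed file whose first byte is '>'."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    # gzip streams start with the two bytes 0x1f 0x8b.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as fh:
        return fh.read(1) == b">"
```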
## Critical API Reference
### Sample Format Convention
All alignment functions use a consistent sample tuple format:
- **FASTQ samples**: `(sample_name, fq1_path, fq2_path_or_None)`
- **BAM items**: `(sample_name, bam_path)` or `(sample_name, bam_path, is_paired_bool)`
- Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
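The single-versus-list convention can be illustrated with a short sketch. It shows one way such dispatch could work, not how omicverse implements it. Both function names are hypothetical, and the test "first element is a string" relies on the assumption that sample names are always strings.

```python
def normalize_samples(samples):
    """Return (list_of_tuples, was_single) for a single tuple or a list of tuples."""
    if isinstance(samples, tuple) and samples and isinstance(samples[0], str):
        return [samples], True  # one sample passed as a bare tuple
    return list(samples), False

def unwrap(results, was_single):
    """Mirror the input shape: a single dict for a single tuple, else the list."""
    return results[0] if was_single else results
```

Downstream code that always wants a list can apply the same normalisation to return values, as the wiring snippets in steps 7 and 8 do inline.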
### Auto-installation
```python
# All functions support these parameters:
auto_install=True # Auto-install missing tools via conda/mamba
overwrite=False # Skip if outputs already exist
threads=8 # Per-tool thread count
jobs=None # Concurrent job count (auto-detected from CPU count)
```
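Because `threads` applies per tool invocation and `jobs` controls how many invocations run at once, their product is the total CPU demand. The rule omicverse uses to auto-detect `jobs` is not specified here. The sketch below is one plausible way to choose a value by hand, with `plan_jobs` as a hypothetical helper.

```python
import os

def plan_jobs(n_samples, threads, cpu_count=None):
    """Pick a job count so that jobs * threads does not exceed the CPU count."""
    cpus = cpu_count or os.cpu_count() or 1
    # At least one job, never more jobs than samples.
    return max(1, min(n_samples, cpus // threads))
```

For example, on a 16-core machine with `threads=8`, four samples would run as two concurrent jobs.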
## Examples
- **Bulk RNA-seq from SRA**: `prefetch` -> `fqdump` -> `fastp` -> `STAR` -> `featureCount` -> pandas DataFrame
- **Single-cell 10x v3**: `ref` -> `count` with `technology='10XV3'` -> h5ad AnnData
- **Local FASTQ files**: Skip download steps, start directly with `fastp` -> `STAR` -> `featureCount`
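The local-FASTQ route can be sketched as one function that chains the calls and wiring steps shown in the Instructions. This is a hedged outline, not a reference implementation. `run_bulk` is a hypothetical name, the output directory names are illustrative, and the alignment module is passed in as an argument (normally `ov.alignment`) so the wiring can be exercised with a stand-in object.

```python
def run_bulk(alignment, samples, gtf, fasta, genome_dir="star_index", threads=8):
    """Chain fastp -> STAR -> featureCount; `alignment` is expected to be ov.alignment."""
    # 1. Quality control and trimming.
    clean = alignment.fastp(samples, output_dir="fastp", threads=threads)
    clean = clean if isinstance(clean, list) else [clean]
    star_samples = [(r["sample"], r["clean1"], r.get("clean2") or None) for r in clean]

    # 2. Alignment; the index is built on demand from the FASTA and GTF.
    bams = alignment.STAR(
        star_samples, genome_dir=genome_dir, output_dir="star_out",
        gtf=gtf, genome_fasta_files=[fasta], threads=threads,
    )
    bams = bams if isinstance(bams, list) else [bams]
    bam_items = [(r["sample"], r["bam"]) for r in bams if "bam" in r]
    if not bam_items:
        raise RuntimeError("no sample aligned successfully")

    # 3. Gene-level counts merged across samples.
    return alignment.featureCount(
        bam_items, gtf=gtf, output_dir="counts",
        gene_mapping=True, merge_matrix=True, threads=threads,
    )
```

A call would then look like `counts = run_bulk(ov.alignment, samples, 'genes.gtf', 'genome.fa')` after `import omicverse as ov`.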
## References
- See [reference.md](reference.md) for copy-paste-ready code templates.