---
name: brain-to-speech-synthesis-v2
description: "Brain-to-speech synthesis from intracranial EEG (iEEG) using prosody-aware feature engineering and transformer-based reconstruction. Enables high-fidelity speech reconstruction for neuroprosthetics. Based on arXiv:2604.05751v1 (April 2026). Activation: brain-to-speech, iEEG speech, intracranial EEG, speech neuroprosthetics, prosody brain decoding."
---
# Brain-to-Speech: Prosody Feature Engineering and Transformer-Based Reconstruction
Brain-to-speech synthesis from intracranial EEG using prosody-aware features and transformer architectures.
**Paper**: arXiv:2604.05751v1 (April 7, 2026)
**Authors**: Mohammed Salah Al-Radhi, Géza Németh, Andon Tchechmedjiev, Binbin Xu
## Overview
This methodology synthesizes speech directly from intracranial electroencephalography (iEEG) brain signals. The approach integrates prosody-aware feature engineering with transformer-based models to generate accurate and natural-sounding speech for assistive technologies.
## Key Components
### 1. Prosody Feature Extraction from iEEG
Traditional approaches decode phonetic content alone. This method additionally extracts the prosodic features that make speech sound natural:
- **Intonation**: Pitch contours and melodic patterns
- **Pitch**: Fundamental frequency (F0) trajectories
- **Rhythm**: Timing and stress patterns
```python
def extract_prosody_features(ieeg_signal):
    """
    Extract prosodic features from iEEG signals.

    Args:
        ieeg_signal: Raw intracranial EEG data (channels x time)

    Returns:
        prosody_features: Dict with intonation, pitch, rhythm
    """
    # Extract pitch-related features from auditory cortex activity
    pitch = extract_pitch_from_ieeg(ieeg_signal, region='superior_temporal')

    # Extract intonation contours
    intonation = extract_intonation_pattern(ieeg_signal)

    # Extract rhythmic patterns
    rhythm = extract_timing_patterns(ieeg_signal)

    return {
        'pitch': pitch,
        'intonation': intonation,
        'rhythm': rhythm,
    }
```
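The three helper functions above are placeholders. As one illustration, here is a minimal sketch of `extract_pitch_from_ieeg` under the assumption that the high-gamma band envelope of superior temporal electrodes serves as the pitch-related feature (a common correlate in iEEG speech decoding; the band edges and the region handling are assumptions, not taken from the paper):
```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def extract_pitch_from_ieeg(ieeg_signal, region='superior_temporal',
                            fs=1000, band=(70, 150)):
    """Sketch: high-gamma envelope as a pitch-related feature.

    ieeg_signal: (channels x time) array. Region-based channel selection
    is omitted here; channels are assumed to be pre-selected.
    """
    nyq = fs / 2
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype='band')

    # Band-limit each channel to the high-gamma range
    hg = filtfilt(b, a, ieeg_signal, axis=-1)

    # The analytic-signal envelope tracks band power over time
    envelope = np.abs(hilbert(hg, axis=-1))

    # Average across electrodes to obtain a single time course
    return envelope.mean(axis=0)
```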
### 2. Transformer Encoder Architecture
The method uses a transformer architecture designed specifically for brain-to-speech tasks:
```
┌─────────────────────────────────────────────────────────┐
│ BRAIN-TO-SPEECH TRANSFORMER │
├─────────────────────────────────────────────────────────┤
│ Input: iEEG signal (multi-channel time series) │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ Temporal Convolution Feature Extraction│ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ Multi-Head Self-Attention (Brain) │ │
│ │ - Capture spatial-temporal patterns │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ Prosody Feature Integration │ │
│ │ - Inject intonation, pitch, rhythm │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ Cross-Modal Attention │ │
│ │ - Brain → Speech features │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ Output: Speech spectrogram / acoustic features │
└─────────────────────────────────────────────────────────┘
```
### 3. Speech Reconstruction Pipeline
```python
import torch.nn as nn

class BrainToSpeechTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.temporal_encoder = TemporalConvEncoder()
        self.transformer = TransformerEncoder()
        self.prosody_integrator = ProsodyIntegrationModule()
        self.speech_decoder = SpeechDecoder()

    def forward(self, ieeg_signal):
        # Extract temporal features
        temporal_features = self.temporal_encoder(ieeg_signal)

        # Apply transformer self-attention over time
        attended = self.transformer(temporal_features)

        # Integrate prosody features extracted from the raw signal
        prosody = extract_prosody_features(ieeg_signal)
        enhanced = self.prosody_integrator(attended, prosody)

        # Decode to a speech spectrogram; prosody is also returned
        # so the training loop can apply an auxiliary prosody loss
        speech = self.speech_decoder(enhanced)
        return speech, prosody
```
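The `TemporalConvEncoder`, `TransformerEncoder`, `ProsodyIntegrationModule`, and `SpeechDecoder` submodules are not specified above. As one reading of the diagram's prosody-integration and cross-modal-attention stages, here is a sketch of `ProsodyIntegrationModule` using standard PyTorch cross-attention; the dimensions, and the assumption that the three prosody curves are stacked into a `(batch, time, 3)` tensor, are mine rather than the paper's:
```python
import torch
import torch.nn as nn

class ProsodyIntegrationModule(nn.Module):
    """Sketch: inject prosody features into brain embeddings via cross-attention."""

    def __init__(self, d_model=256, d_prosody=3, n_heads=4):
        super().__init__()
        # Project stacked prosody curves (pitch, intonation, rhythm) to model width
        self.prosody_proj = nn.Linear(d_prosody, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, brain_emb, prosody):
        # brain_emb: (batch, time, d_model); prosody: (batch, time, d_prosody)
        p = self.prosody_proj(prosody)

        # Brain embeddings query the prosody sequence
        attended, _ = self.cross_attn(query=brain_emb, key=p, value=p)

        # Residual connection keeps the original brain representation intact
        return self.norm(brain_emb + attended)
```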
## Implementation
### Step 1: iEEG Preprocessing
```python
import numpy as np

def preprocess_ieeg(raw_ieeg, sampling_rate=1000):
    """
    Preprocess intracranial EEG signals.

    Args:
        raw_ieeg: Raw iEEG data (channels x time)
        sampling_rate: Sampling frequency in Hz

    Returns:
        processed: Cleaned, normalized iEEG signal
    """
    # Bandpass filter (1-200 Hz is typical for speech decoding)
    filtered = bandpass_filter(raw_ieeg, low=1, high=200, fs=sampling_rate)

    # Notch filter for line noise (60 Hz; use 50 Hz outside North America)
    notch_filtered = notch_filter(filtered, freq=60, fs=sampling_rate)

    # Common average reference: subtract the across-channel mean
    car = notch_filtered - np.mean(notch_filtered, axis=0)

    # Z-score normalization per channel
    normalized = (car - np.mean(car, axis=1, keepdims=True)) / np.std(car, axis=1, keepdims=True)

    return normalized
```
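The `bandpass_filter` and `notch_filter` helpers above are left undefined. A minimal sketch using `scipy.signal`, where the filter order and notch quality factor are assumptions:
```python
from scipy.signal import butter, filtfilt, iirnotch

def bandpass_filter(x, low, high, fs, order=4):
    # Zero-phase Butterworth bandpass, applied along the time axis
    nyq = fs / 2
    b, a = butter(order, [low / nyq, high / nyq], btype='band')
    return filtfilt(b, a, x, axis=-1)

def notch_filter(x, freq, fs, quality=30):
    # Narrow IIR notch at the line frequency
    b, a = iirnotch(freq, quality, fs)
    return filtfilt(b, a, x, axis=-1)
```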
### Step 2: Prosody-Aware Training
```python
import torch

def train_with_prosody(model, dataloader, epochs=100):
    """
    Train the brain-to-speech model with prosody supervision.

    Args:
        model: BrainToSpeechTransformer
        dataloader: Training data loader
        epochs: Number of training epochs
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        for batch in dataloader:
            ieeg = batch['ieeg']               # Brain signals
            target_speech = batch['speech']    # Target audio
            target_prosody = batch['prosody']  # Prosody annotations

            # Forward pass (model returns speech plus prosody features)
            predicted_speech, predicted_prosody = model(ieeg)

            # Multi-objective loss: spectral reconstruction + prosody term
            speech_loss = spectral_loss(predicted_speech, target_speech)
            prosody_loss = prosody_mse(predicted_prosody, target_prosody)
            total_loss = speech_loss + 0.5 * prosody_loss

            # Backpropagation
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
```
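Here `spectral_loss` and `prosody_mse` are placeholders. A common choice for the spectral term is an L1 distance between log-mel spectrograms; a minimal sketch with `torchaudio`, where the mel parameters (16 kHz, 80 mels) are assumptions:
```python
import torch
import torch.nn.functional as F
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)

def spectral_loss(pred_spec, target_speech):
    # Compare the predicted mel spectrogram to that of the target audio
    target_spec = torch.log(mel(target_speech) + 1e-5)
    return F.l1_loss(pred_spec, target_spec)

def prosody_mse(pred_prosody, target_prosody):
    # Simple MSE over stacked prosody trajectories (pitch, intonation, rhythm)
    return F.mse_loss(pred_prosody, target_prosody)
```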
### Step 3: Inference and Evaluation
```python
import torch

def synthesize_speech(model, ieeg_signal):
    """
    Synthesize speech from an iEEG signal.

    Args:
        model: Trained BrainToSpeechTransformer
        ieeg_signal: Input iEEG data (channels x time)

    Returns:
        audio: Synthesized speech waveform
    """
    model.eval()
    with torch.no_grad():
        # Preprocess and convert to a batched float tensor
        processed = preprocess_ieeg(ieeg_signal)
        x = torch.from_numpy(processed).float().unsqueeze(0)

        # Model inference (the prosody output is only used during training)
        spectrogram, _ = model(x)

        # Vocoder (e.g., HiFi-GAN) converts the spectrogram to a waveform
        audio = vocoder(spectrogram)

    return audio
```
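A hypothetical end-to-end call, assuming a saved checkpoint `b2s_checkpoint.pt`, a 16 kHz vocoder, and an `ieeg_recording` array are available (all names here are illustrative):
```python
import soundfile as sf
import torch

model = BrainToSpeechTransformer()
model.load_state_dict(torch.load("b2s_checkpoint.pt"))  # hypothetical checkpoint

# ieeg_recording: (channels x time) numpy array from the recording system
audio = synthesize_speech(model, ieeg_recording)
sf.write("reconstructed.wav", audio.squeeze().cpu().numpy(), 16000)
```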
## Evaluation Metrics
### Quantitative Metrics
- **Mel Cepstral Distortion (MCD)**: Spectral similarity (see the sketch after this list)
- **Fundamental Frequency RMSE**: Pitch accuracy
- **Short-Time Objective Intelligibility (STOI)**: Speech intelligibility
- **Perceptual Evaluation of Speech Quality (PESQ)**: Perceptual quality
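As referenced above, a minimal MCD sketch, assuming the reference and synthesized mel-cepstral coefficient (MCEP) frames are already time-aligned (e.g., via DTW) and exclude the 0th energy coefficient:
```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """MCD in dB between two (frames x coefficients) MCEP arrays."""
    # MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref - c_syn)^2), averaged over frames
    diff = mcep_ref - mcep_syn
    frame_dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(frame_dist)
```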
### Perceptual Metrics
- Mean Opinion Score (MOS) for naturalness
- Intelligibility scores from listeners
- Speaker similarity ratings
## Advantages Over Baselines
| Method | Intelligibility | Naturalness | Prosody Accuracy |
|--------|----------------|-------------|------------------|
| Griffin-Lim | Baseline | Poor | N/A |
| CNN-based | Moderate | Moderate | Poor |
| **Transformer + Prosody** | **High** | **High** | **High** |
## Applications
- **Neuroprosthetics**: Restore communication for speech-impaired individuals
- **Brain-Computer Interfaces**: Speech output for locked-in patients
- **Clinical Research**: Understanding speech production in the brain
- **Assistive Technology**: Communication aids for ALS, stroke patients
## Future Directions
1. **Diffusion Models**: Integrate diffusion-based vocoders for higher quality
2. **Real-time Systems**: Reduce latency for interactive use
3. **Non-invasive EEG**: Extend the approach to scalp EEG, at the cost of some reconstruction fidelity
4. **Multilingual Support**: Adapt to different languages
## Trigger Words
- brain-to-speech, iEEG speech, intracranial EEG
- speech neuroprosthetics, prosody brain decoding
- speech brain interface, neural speech synthesis
## Category
neuroscience, brain-computer-interface, speech-synthesis
## Reference
Al-Radhi, M. S., Németh, G., Tchechmedjiev, A., & Xu, B. (2026). Brain-to-Speech: Prosody Feature Engineering and Transformer-Based Reconstruction. arXiv:2604.05751v1.