Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.
npx versuz@latest install kevinzai-commander-skills-ccc-data-machine-learninggit clone https://github.com/KevinZai/commander.gitcp commander/SKILL.MD ~/.claude/skills/kevinzai-commander-skills-ccc-data-machine-learning/SKILL.md---
name: machine-learning
description: "ML model development — data preparation, model selection, training, evaluation, and deployment with scikit-learn, PyTorch, and TensorFlow."
version: 1.0.0
category: data
parent: ccc-data
tags: [ccc-data, ml, machine-learning, ai]
disable-model-invocation: true
---
# Machine Learning
## What This Does
Guides ML model development from problem definition through deployment — covering data preparation, feature engineering, model selection, training, evaluation, and serving. Supports scikit-learn for classical ML, PyTorch for deep learning, and production deployment patterns.
## Instructions
1. **Define the ML problem.** Before writing code:
- What are you predicting? (classification, regression, ranking, clustering)
- What data is available? (features, labels, volume)
- What metric defines success? (accuracy, F1, RMSE, AUC-ROC)
- What's the baseline? (random guess, simple heuristic, current system)
- What are the constraints? (latency, model size, interpretability)
2. **Choose the approach.**
| Problem | Data Size | Start With | Graduate To |
|---------|-----------|-----------|-------------|
| Classification (tabular) | < 100K rows | Logistic Regression, Random Forest | XGBoost, LightGBM |
| Classification (tabular) | > 100K rows | XGBoost, LightGBM | Neural network if needed |
| Regression (tabular) | Any | Linear Regression, Random Forest | XGBoost, LightGBM |
| Text classification | Any | TF-IDF + LogReg | Fine-tuned transformer |
| Image classification | < 10K images | Pre-trained model + fine-tune | Custom CNN |
| Time series | Any | ARIMA, Prophet | LSTM, Temporal Fusion Transformer |
| Recommendation | Any | Collaborative filtering | Matrix factorization, two-tower |
| Clustering | Any | K-Means, DBSCAN | HDBSCAN, Gaussian Mixture |
3. **Prepare the data.**
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load and explore
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes)
print(df.describe())
print(df.isnull().sum())
# Handle missing values
df['age'] = df['age'].fillna(df['age'].median())
df = df.dropna(subset=['target']) # Never impute the target
# Encode categoricals
label_encoders = {}
for col in categorical_columns:
le = LabelEncoder()
df[col] = le.fit_transform(df[col].astype(str))
label_encoders[col] = le
# Split BEFORE any fitting (prevent data leakage)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale AFTER split (fit only on training data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # transform only, no fit
```
4. **Train and evaluate (scikit-learn).**
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
# Cross-validation first
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
# Train on full training set
model.fit(X_train_scaled, y_train)
# Evaluate on test set (only once, at the end)
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Feature importance
importances = pd.Series(
model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.head(10))
```
5. **Train with PyTorch (deep learning).**
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
class SimpleNet(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_dim // 2, output_dim),
)
def forward(self, x):
return self.net(x)
# Training loop
model = SimpleNet(input_dim=X_train.shape[1], hidden_dim=128, output_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(100):
model.train()
for batch_X, batch_y in train_loader:
optimizer.zero_grad()
output = model(batch_X)
loss = criterion(output, batch_y)
loss.backward()
optimizer.step()
```
6. **Deploy the model.**
- **Simple:** Export with `joblib.dump()` / `torch.save()`, serve via FastAPI
- **Scalable:** ONNX export for cross-platform serving
- **Production:** MLflow for experiment tracking + model registry
- **Edge:** Convert to ONNX or TensorFlow Lite for mobile/embedded
## Output Format
```markdown
# ML Model: {Model Name}
## Problem
- Type: {classification/regression/clustering}
- Target: {what we're predicting}
- Metric: {primary evaluation metric}
## Data
- Samples: {train/val/test counts}
- Features: {count and key features}
- Target distribution: {balanced/imbalanced, class ratios}
## Model
- Algorithm: {name}
- Hyperparameters: {key params}
- Training time: {duration}
## Results
| Metric | Train | Validation | Test |
|--------|-------|-----------|------|
| {metric} | {score} | {score} | {score} |
## Feature Importance
| Feature | Importance |
|---------|-----------|
| {feature} | {score} |
## Deployment
{How to serve the model in production}
```
## Tips
- Always establish a baseline (random, majority class, simple heuristic) before building complex models
- Data leakage is the #1 beginner mistake — never fit preprocessors on test data, never use future data
- Start simple (logistic regression, random forest) — complex models are only better if you have enough data
- XGBoost/LightGBM win most tabular data competitions — try them before deep learning on structured data
- Cross-validation is more reliable than a single train/test split for small datasets
- Track experiments systematically (MLflow, Weights & Biases) — you will forget what you tried
- A model is only useful if it's deployed — plan for serving from day one