---
name: families-linear-algebra
description: Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science. Dense linear algebra and tensor operations framework extending BLAS/LAPACK. Use when: linear algebra library design, BLAS implementation, LAPACK algorithms, tensor operations, high-performance computing, dense matrix operations, scientific computing infrastructure, GPU-accelerated linear algebra, or multi-node/multi-GPU dense algebra.
---
# FAMLIES: Advanced Linear Algebra Infrastructure
## Overview
This skill provides a framework for designing and implementing advanced dense linear and tensor algebra libraries. FAMLIES (Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science) addresses limitations of traditional BLAS/LAPACK approaches while enabling better adaptation to new hardware architectures and applications.
**Key Innovation**: A more flexible way to define and implement high-performance dense linear and tensor algebra that adapts to changing applications, software, and hardware.
## Current Landscape & Challenges
### BLAS/LAPACK Legacy
**Success Factors**:
- De facto standard interfaces
- Stringent boundary enforcement between layers
- Wide adoption in scientific computing
- Foundation for ML/data science
**Current Limitations**:
1. **Boundary Enforcement**: Impedes reducing data movement overhead
2. **Layer Separation**: Prevents loop fusion optimization
3. **Hardware Adaptation**: Slow to adapt to new architectures (GPUs)
4. **Algorithm Implementation**: Delayed adoption of new algorithms
5. **Flexibility**: Rigid structure limits innovation
### Modern Challenges
**Data Movement Overhead**:
- Data movement dominates computation cost
- Boundaries prevent data reuse optimization
- Loop fusion opportunities missed
- Memory hierarchy not fully exploited
**New Hardware Architectures**:
- GPU acceleration requirements
- Multi-node/multi-GPU systems
- Heterogeneous computing platforms
- Different memory hierarchies
**Emerging Applications**:
- Machine learning tensor operations
- Data science sparse/dense hybrids
- New matrix factorization algorithms
- Quantum computing interfaces
## FAMLIES Framework
### Core Principles
1. **Flexibility**: Adaptable to applications, software, and hardware
2. **Optimization**: Enable data movement reduction
3. **Standards**: Maintain interface standards where beneficial
4. **Performance**: High-performance implementation across architectures
5. **Extensibility**: Support new algorithms and operations
### Framework Architecture
**Layer Flexibility**:
- Allow cross-layer optimization
- Enable loop fusion
- Support operation composition
- Flexible boundary enforcement
**Hardware Adaptation**:
- GPU-specific optimizations
- Multi-node coordination
- Memory hierarchy awareness
- Architecture-specific implementations
**Algorithm Support**:
- Dense linear algebra (BLAS/LAPACK level)
- Tensor operations (multi-linear algebra)
- New factorization algorithms
- Application-specific operations
## Leveraged Projects
### BLIS (BLAS-like Library Instantiation Software)
**Key Features**:
- Flexible BLAS implementation
- Architecture-specific optimization
- Layer decomposition
- Performance portability
**Benefits**:
- Rapid architecture adaptation
- Optimized implementations
- Maintainable code structure
- Community-driven development
### libflame
**Key Features**:
- LAPACK-level functionality
- Algorithm-by-blocks approach
- Flexible implementation
- Research-oriented design
**Benefits**:
- Advanced algorithm implementations
- Block-based optimization
- Research platform
- Extensible framework
## Dense Linear Algebra Operations
### BLAS-Level Operations
**Level 1**: Vector operations
- DOT: dot product
- AXPY: scaled vector addition
- NRM2: Euclidean (2-)norm
- SCAL: vector scaling
**Level 2**: Matrix-vector operations
- GEMV: general matrix-vector multiply
- GER: general rank-1 update
- SYMV: symmetric matrix-vector multiply
- TRSV: triangular solve
**Level 3**: Matrix-matrix operations
- GEMM: general matrix multiply
- SYMM: symmetric matrix multiply
- TRMM: triangular matrix multiply
- SYRK: symmetric rank-k update
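The sketch below exercises one or two representative routines from each level through SciPy's low-level BLAS wrappers (`scipy.linalg.blas`). It assumes NumPy and SciPy are installed; the routine names are the standard double-precision BLAS entries.
```python
import numpy as np
from scipy.linalg import blas

x = np.random.rand(1000)
y = np.random.rand(1000)
A = np.asfortranarray(np.random.rand(500, 1000))   # column-major avoids internal copies
B = np.asfortranarray(np.random.rand(1000, 300))

# Level 1: vector-vector
d = blas.ddot(x, y)              # dot product
y = blas.daxpy(x, y, a=2.0)      # y := 2.0*x + y
nrm = blas.dnrm2(x)              # Euclidean norm

# Level 2: matrix-vector
v = blas.dgemv(1.0, A, y)        # v = A @ y

# Level 3: matrix-matrix
C = blas.dgemm(1.0, A, B)        # C = A @ B
```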
### LAPACK-Level Operations
**Factorizations**:
- LU: LU decomposition
- Cholesky: Cholesky factorization
- QR: QR decomposition
- SVD: singular value decomposition
**Solvers**:
- Linear system solve
- Eigenvalue problems
- Least squares
- Matrix inversion
**Auxiliary**:
- Condition estimation
- Matrix norms
- Equilibration
- Scaling
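All of the factorizations and solvers above are available through SciPy's LAPACK-backed `scipy.linalg` module; a minimal sketch (assuming NumPy and SciPy):
```python
import numpy as np
from scipy import linalg

A = np.random.rand(200, 200)
b = np.random.rand(200)
S = A @ A.T + 200 * np.eye(200)        # symmetric positive definite

P, L, U = linalg.lu(A)                 # LU decomposition (partial pivoting)
R = linalg.cholesky(S)                 # Cholesky factor of an SPD matrix
Q, Rq = linalg.qr(A)                   # QR decomposition
U2, s, Vt = linalg.svd(A)              # singular value decomposition

x = linalg.solve(A, b)                 # linear system solve
w = linalg.eigvals(A)                  # eigenvalues
z, res, rank, sv = linalg.lstsq(A, b)  # least squares
```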
### Tensor Operations
**Multi-linear Operations**:
- Tensor contraction
- Tensor decomposition (Tucker, CP)
- Tensor unfolding
- Tensor permutation
**Tensor Algebra**:
- Einstein notation operations
- Tensor-matrix products
- Tensor-tensor products
- Tensor factorizations
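A minimal NumPy sketch of the basic manipulations above (contraction in Einstein notation, mode-n unfolding, permutation); the Tucker and CP decompositions themselves are full algorithms and are usually taken from a dedicated library such as TensorLy.
```python
import numpy as np

T = np.random.rand(4, 5, 6)     # third-order tensor
M = np.random.rand(7, 5)        # matrix for a mode-1 product

# Contraction in Einstein notation: mode-1 tensor-matrix product, result (4, 7, 6)
P = np.einsum('ijk,lj->ilk', T, M)

# Mode-1 unfolding: bring mode 1 to the front, flatten the rest -> shape (5, 24)
T1 = np.moveaxis(T, 1, 0).reshape(5, -1)

# Permutation of modes -> shape (6, 4, 5)
Tp = np.transpose(T, (2, 0, 1))
```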
## Implementation Considerations
### Data Movement Optimization
**Loop Fusion**:
```c
// Traditional BLAS: separate operations
GEMM(A, B, C); // C = A*B
SCALE(alpha, C); // C = alpha*C
// FAMLIES: fused operation
GEMM_SCALE_FUSED(A, B, C, alpha); // C = alpha*A*B
// Saves a full pass over C: the unfused GEMM-then-SCALE sequence writes C, reads it back,
// and rewrites it; the fused kernel writes C exactly once
```
**Operation Composition**:
```c
// Compose multiple operations into one schedulable unit
// (illustrative pseudocode, not an existing BLAS interface)
COMPOSE({
    GEMM(A, B, T1),   // T1 = A*B
    GEMM(T1, C, T2),  // T2 = T1*C
    ADD(T2, D, E)     // E  = T2 + D
});
// Single fused execution; the temporaries T1 and T2 never round-trip through main memory
```
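In a NumPy setting the same idea can be approximated with a single `einsum` call, which lets the library choose the contraction order instead of materializing each temporary in memory. This is only a rough analogue of the composition above, not the proposed FAMLIES interface:
```python
import numpy as np

A = np.random.rand(200, 300)
B = np.random.rand(300, 100)
C = np.random.rand(100, 400)
D = np.random.rand(200, 400)

# Chained calls: T1 and T2 are written to memory and read back
T1 = A @ B
T2 = T1 @ C
E_chained = T2 + D

# Composed expression: one call, contraction order planned by the library
E_composed = np.einsum('ik,kl,lm->im', A, B, C, optimize=True) + D

assert np.allclose(E_chained, E_composed)
```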
**Memory Hierarchy**:
- Cache-aware implementations
- Register blocking
- Memory bandwidth optimization
- Data locality maximization
### Hardware-Specific Optimization
**GPU Implementations**:
- CUDA kernel design
- Shared memory utilization
- Warp-level operations
- Multi-stream execution
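As an illustration of these points, the sketch below uses CuPy (a NumPy-like CUDA library; it assumes CuPy and a CUDA-capable GPU are available) to issue two independent GEMMs on separate streams. Kernel- and warp-level design lives inside the library's hand-written kernels and is not visible at this level.
```python
import cupy as cp

A = cp.random.rand(2048, 2048)
B = cp.random.rand(2048, 2048)
C = cp.random.rand(2048, 2048)

s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()

# Independent GEMMs on separate streams can overlap on the device
with s1:
    R1 = A @ B
with s2:
    R2 = A @ C

s1.synchronize()
s2.synchronize()
```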
**Multi-Node/Multi-GPU**:
- Distributed matrix storage
- Communication optimization
- Load balancing
- Scalability considerations
**Heterogeneous Computing**:
- CPU-GPU coordination
- Device selection strategies
- Memory transfer optimization
- Hybrid implementations
## Algorithm-by-Blocks Approach
**Concept**: Decompose algorithms into block operations
**Benefits**:
- Better cache utilization
- Parallelizable blocks
- Flexible implementation
- Data locality
**Example: LU Factorization by Blocks**:
```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_kernel(A):
    """Unblocked, in-place LU of a small square block (no pivoting)."""
    for j in range(A.shape[0] - 1):
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])

def lu_by_blocks(A, block_size):
    """Right-looking blocked LU factorization, in place, without pivoting."""
    n = A.shape[0]
    for k in range(0, n, block_size):
        bs = min(block_size, n - k)
        Akk = A[k:k+bs, k:k+bs]
        # Factorize the diagonal block in place: Akk = L11 * U11
        lu_kernel(Akk)
        if k + bs < n:
            # Row panel (TRSM): U12 = inv(L11) @ A12
            A[k:k+bs, k+bs:] = solve_triangular(
                Akk, A[k:k+bs, k+bs:], lower=True, unit_diagonal=True)
            # Column panel (TRSM): L21 = A21 @ inv(U11)
            A[k+bs:, k:k+bs] = solve_triangular(
                Akk, A[k+bs:, k:k+bs].T, trans='T').T
            # Trailing matrix update (GEMM): A22 -= L21 @ U12
            A[k+bs:, k+bs:] -= A[k+bs:, k:k+bs] @ A[k:k+bs, k+bs:]
```
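A quick sanity check of the sketch above, under its no-pivoting assumption (a diagonally dominant test matrix keeps the factorization stable without pivoting):
```python
import numpy as np

n, bsz = 512, 64
A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant
A_ref = A.copy()

lu_by_blocks(A, bsz)
L = np.tril(A, -1) + np.eye(n)
U = np.triu(A)
print(np.linalg.norm(L @ U - A_ref) / np.linalg.norm(A_ref))   # small relative residual
```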
## Performance Optimization Techniques
### Cache Blocking
**Register Blocking**: Keep data in registers
**L1 Cache Blocking**: Reuse data in L1 cache
**L2 Cache Blocking**: Maximize L2 utilization
**TLB Blocking**: Avoid TLB misses
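The blocking idea is easiest to see in a matrix multiply. The sketch below is illustrative only: it tiles the three loops so each pair of tiles is reused while resident in cache, with block sizes chosen to match the target cache level; production kernels do the same in compiled code with register-level micro-tiles.
```python
import numpy as np

def blocked_gemm(A, B, mb=64, nb=64, kb=64):
    """C = A @ B computed tile by tile, reusing each tile while it is in cache."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                # Each block product reuses the A and B tiles across the inner loop
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C
```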
### Parallelization
**Thread-Level**: Multi-threaded execution
**SIMD-Level**: Vector instructions
**GPU-Level**: GPU parallelism
**Distributed-Level**: Multi-node parallelism
### Memory Optimization
**Prefetching**: Hide memory latency
**In-place Operations**: Avoid extra memory
**Strided Access**: Optimal memory patterns
**Alignment**: Cache line alignment
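A few of these are visible even at the NumPy level: `out=` arguments avoid temporaries, and stride/contiguity inspection shows whether an access pattern is cache-friendly. Alignment and prefetching are handled below the Python layer.
```python
import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# In-place operation: no temporary allocated for the result
np.add(A, B, out=A)

# Strided access: row-major (C-contiguous) arrays favor row-wise traversal
print(A.flags['C_CONTIGUOUS'], A.strides)
col = A[:, 0]          # strided view; large stride between consecutive elements
print(col.strides)
```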
## Framework Components
### Operation Definitions
**Standard Operations**: BLAS/LAPACK compatible
**Extended Operations**: New operation types
**Tensor Operations**: Multi-linear algebra
**Composite Operations**: Operation sequences
### Implementation Layer
**CPU Kernels**: Optimized CPU implementations
**GPU Kernels**: GPU-specific kernels
**Hybrid Kernels**: Mixed CPU-GPU
**Reference Kernels**: Portable reference code
### Interface Layer
**Standard APIs**: BLAS/LAPACK compatibility
**Extended APIs**: New operation interfaces
**Language Bindings**: C, Fortran, Python, etc.
**Framework APIs**: High-level framework calls
## Research Paper Reference
**Paper**: "A Proposed Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science (FAMLIES)"
- **Authors**: Devin A. Matthews, Tze Meng Low, Margaret E. Myers, et al.
- **arXiv ID**: 2604.07311
- **Published**: April 8, 2026
- **Categories**: cs.MS (Mathematical Software)
- **Link**: https://arxiv.org/abs/2604.07311
## Use Cases
### Scientific Computing
**Applications**:
- Numerical simulations
- Finite element methods
- Computational physics
- Climate modeling
**Requirements**:
- High performance
- Large-scale matrices
- Distributed computing
- Specialized operations
### Machine Learning
**Applications**:
- Neural network training
- Tensor operations
- Large-scale optimization
- Model inference
**Requirements**:
- GPU acceleration
- Batch processing
- Tensor operations
- High throughput
### Data Science
**Applications**:
- Large-scale analytics
- Statistical computations
- Data preprocessing
- Dimensionality reduction
**Requirements**:
- Flexible operations
- Mixed sparse/dense
- Memory efficiency
- Ease of use
## Related Skills
- **blas-lapack**: Traditional BLAS/LAPACK usage
- **gpu-computing**: GPU acceleration techniques
- **tensor-operations**: Tensor algebra fundamentals
- **high-performance-computing**: HPC best practices
## Implementation Examples
### Basic GEMM with Loop Fusion
```c
// FAMLIES-style fused kernel: E = alpha*A*B + beta*C + gamma*D in one pass
void gemm_scale_add(int m, int n, int k,
                    double alpha, double beta, double gamma,
                    double *A, double *B, double *C, double *D,
                    double *E) {
    // Row-major, unblocked reference implementation of the fused operation
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++) {
                sum += A[i*k + p] * B[p*n + j];
            }
            // Single write of E; C and D are each read exactly once
            E[i*n + j] = alpha*sum + beta*C[i*n + j] + gamma*D[i*n + j];
        }
    }
}
```
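For checking a C implementation, the same expression in NumPy (used here purely as a reference for the intended semantics):
```python
import numpy as np

m, n, k = 64, 48, 32
alpha, beta, gamma = 1.5, 0.5, -2.0
A, B = np.random.rand(m, k), np.random.rand(k, n)
C, D = np.random.rand(m, n), np.random.rand(m, n)

E_ref = alpha * (A @ B) + beta * C + gamma * D   # reference result for gemm_scale_add
```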
### Tensor Contraction
```python
import numpy as np

def tensor_contract_3d(A, B, contract_dims):
    """
    Contract tensors A and B along the specified dimensions.

    Args:
        A: tensor of shape (a1, a2, a3)
        B: tensor of shape (b1, b2, b3)
        contract_dims: pair of axis tuples (np.tensordot convention),
            e.g. ((-1,), (0,)) contracts the last axis of A with the
            first axis of B
    Returns:
        The contracted tensor.
    """
    # Example: contract the last dimension of A with the first of B
    if contract_dims == ((-1,), (0,)):
        return np.einsum('ijk,klm->ijlm', A, B)
    # General contraction
    return np.tensordot(A, B, axes=contract_dims)
```
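Example usage, checking the special case against an explicit `einsum`:
```python
A = np.random.rand(3, 4, 5)
B = np.random.rand(5, 6, 7)

out = tensor_contract_3d(A, B, ((-1,), (0,)))   # shape (3, 4, 6, 7)
ref = np.einsum('ijk,klm->ijlm', A, B)
assert np.allclose(out, ref)
```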
## Future Directions
**Quantum Computing Integration**:
- Quantum linear algebra
- Hybrid quantum-classical algorithms
- Quantum matrix operations
**Sparse-Dense Hybrids**:
- Mixed sparse/dense operations
- Selective storage strategies
- Format-aware kernels
**Automatic Optimization**:
- Auto-tuning frameworks
- Machine learning for optimization
- Compiler-driven optimization
## See Also
- BLIS project documentation
- libflame documentation
- BLAS/LAPACK standards
- High-performance computing guides