---
name: families-linear-algebra
description: Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science. Dense linear algebra and tensor operations framework extending BLAS/LAPACK. Use when: linear algebra library design, BLAS implementation, LAPACK algorithms, tensor operations, high-performance computing, dense matrix operations, scientific computing infrastructure, GPU-accelerated linear algebra, or multi-node/multi-GPU dense algebra.
---
# FAMLIES: Advanced Linear Algebra Infrastructure
## Overview
This skill provides a framework for designing and implementing advanced dense linear and tensor algebra libraries. FAMLIES (Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science) addresses limitations of traditional BLAS/LAPACK approaches while enabling better adaptation to new hardware architectures and applications.
**Key Innovation**: A more flexible way to define and implement high-performance dense linear and tensor algebra that adapts to changing applications, software, and hardware.
## Current Landscape & Challenges
### BLAS/LAPACK Legacy
**Success Factors**:
- De facto standard interfaces
- Stringent boundary enforcement between layers
- Wide adoption in scientific computing
- Foundation for ML/data science
**Current Limitations**:
1. **Boundary Enforcement**: Impedes reducing data movement overhead
2. **Layer Separation**: Prevents loop fusion optimization
3. **Hardware Adaptation**: Slow to adapt to new architectures (GPUs)
4. **Algorithm Implementation**: Delayed adoption of new algorithms
5. **Flexibility**: Rigid structure limits innovation
### Modern Challenges
**Data Movement Overhead**:
- Data movement dominates computation cost
- Boundaries prevent data reuse optimization
- Loop fusion opportunities missed
- Memory hierarchy not fully exploited
**New Hardware Architectures**:
- GPU acceleration requirements
- Multi-node/multi-GPU systems
- Heterogeneous computing platforms
- Different memory hierarchies
**Emerging Applications**:
- Machine learning tensor operations
- Data science sparse/dense hybrids
- New matrix factorization algorithms
- Quantum computing interfaces
## FAMLIES Framework
### Core Principles
1. **Flexibility**: Adaptable to applications, software, and hardware
2. **Optimization**: Enable data movement reduction
3. **Standards**: Maintain interface standards where beneficial
4. **Performance**: High-performance implementation across architectures
5. **Extensibility**: Support new algorithms and operations
### Framework Architecture
**Layer Flexibility**:
- Allow cross-layer optimization
- Enable loop fusion
- Support operation composition
- Flexible boundary enforcement
**Hardware Adaptation**:
- GPU-specific optimizations
- Multi-node coordination
- Memory hierarchy awareness
- Architecture-specific implementations
**Algorithm Support**:
- Dense linear algebra (BLAS/LAPACK level)
- Tensor operations (multi-linear algebra)
- New factorization algorithms
- Application-specific operations
## Leveraged Projects
### BLIS (BLAS-like Library Instantiation Software)
**Key Features**:
- Flexible BLAS implementation
- Architecture-specific optimization
- Layer decomposition
- Performance portability
**Benefits**:
- Rapid architecture adaptation
- Optimized implementations
- Maintainable code structure
- Community-driven development
### libflame
**Key Features**:
- LAPACK-level functionality
- Algorithm-by-blocks approach
- Flexible implementation
- Research-oriented design
**Benefits**:
- Advanced algorithm implementations
- Block-based optimization
- Research platform
- Extensible framework
## Dense Linear Algebra Operations
### BLAS-Level Operations
**Level 1**: Vector operations
- DOT: dot product
- AXPY: scaled vector addition
- NRM2: Euclidean (2-)norm
- SCAL: vector scaling
**Level 2**: Matrix-vector operations
- GEMV: general matrix-vector multiply
- GER: general rank-1 update
- SYMV: symmetric matrix-vector multiply
- TRSV: triangular solve
**Level 3**: Matrix-matrix operations
- GEMM: general matrix multiply
- SYMM: symmetric matrix multiply
- TRMM: triangular matrix multiply
- SYRK: symmetric rank-k update
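The sketch below exercises one or two representative routines from each level through SciPy's low-level BLAS wrappers (`scipy.linalg.blas`). It assumes NumPy and SciPy are installed; the routine names are the standard double-precision BLAS entries.
```python
import numpy as np
from scipy.linalg import blas

x = np.random.rand(1000)
y = np.random.rand(1000)
A = np.asfortranarray(np.random.rand(500, 1000))   # column-major avoids internal copies
B = np.asfortranarray(np.random.rand(1000, 300))

# Level 1: vector-vector
d = blas.ddot(x, y)              # dot product
y = blas.daxpy(x, y, a=2.0)      # y := 2.0*x + y
nrm = blas.dnrm2(x)              # Euclidean norm

# Level 2: matrix-vector
v = blas.dgemv(1.0, A, y)        # v = A @ y

# Level 3: matrix-matrix
C = blas.dgemm(1.0, A, B)        # C = A @ B
```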
### LAPACK-Level Operations
**Factorizations**:
- LU: LU decomposition
- Cholesky: Cholesky factorization
- QR: QR decomposition
- SVD: singular value decomposition
**Solvers**:
- Linear system solve
- Eigenvalue problems
- Least squares
- Matrix inversion
**Auxiliary**:
- Condition estimation
- Matrix norms
- Equilibration
- Scaling
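All of the factorizations and solvers above are available through SciPy's LAPACK-backed `scipy.linalg` module; a minimal sketch (assuming NumPy and SciPy):
```python
import numpy as np
from scipy import linalg

A = np.random.rand(200, 200)
b = np.random.rand(200)
S = A @ A.T + 200 * np.eye(200)        # symmetric positive definite

P, L, U = linalg.lu(A)                 # LU decomposition (partial pivoting)
R = linalg.cholesky(S)                 # Cholesky factor of an SPD matrix
Q, Rq = linalg.qr(A)                   # QR decomposition
U2, s, Vt = linalg.svd(A)              # singular value decomposition

x = linalg.solve(A, b)                 # linear system solve
w = linalg.eigvals(A)                  # eigenvalues
z, res, rank, sv = linalg.lstsq(A, b)  # least squares
```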
### Tensor Operations
**Multi-linear Operations**:
- Tensor contraction
- Tensor decomposition (Tucker, CP)
- Tensor unfolding
- Tensor permutation
**Tensor Algebra**:
- Einstein notation operations
- Tensor-matrix products
- Tensor-tensor products
- Tensor factorizations
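A minimal NumPy sketch of the basic manipulations above (contraction in Einstein notation, mode-n unfolding, permutation); the Tucker and CP decompositions themselves are full algorithms and are usually taken from a dedicated library such as TensorLy.
```python
import numpy as np

T = np.random.rand(4, 5, 6)     # third-order tensor
M = np.random.rand(7, 5)        # matrix for a mode-1 product

# Contraction in Einstein notation: mode-1 tensor-matrix product, result (4, 7, 6)
P = np.einsum('ijk,lj->ilk', T, M)

# Mode-1 unfolding: bring mode 1 to the front, flatten the rest -> shape (5, 24)
T1 = np.moveaxis(T, 1, 0).reshape(5, -1)

# Permutation of modes -> shape (6, 4, 5)
Tp = np.transpose(T, (2, 0, 1))
```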
## Implementation Considerations
### Data Movement Optimization
**Loop Fusion**:
```c
// Traditional BLAS: separate operations
GEMM(A, B, C); // C = A*B
SCALE(alpha, C); // C = alpha*C
// FAMLIES: fused operation
GEMM_SCALE_FUSED(A, B, C, alpha); // C = alpha*A*B
// Saves a full pass over C: the unfused GEMM-then-SCALE sequence writes C, reads it back,
// and rewrites it; the fused kernel writes C exactly once
```
**Operation Composition**:
```c
// Compose multiple operations into one schedulable unit
// (illustrative pseudocode, not an existing BLAS interface)
COMPOSE({
    GEMM(A, B, T1),   // T1 = A*B
    GEMM(T1, C, T2),  // T2 = T1*C
    ADD(T2, D, E)     // E  = T2 + D
});
// Single fused execution; the temporaries T1 and T2 never round-trip through main memory
```
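In a NumPy setting the same idea can be approximated with a single `einsum` call, which lets the library choose the contraction order instead of materializing each temporary in memory. This is only a rough analogue of the composition above, not the proposed FAMLIES interface:
```python
import numpy as np

A = np.random.rand(200, 300)
B = np.random.rand(300, 100)
C = np.random.rand(100, 400)
D = np.random.rand(200, 400)

# Chained calls: T1 and T2 are written to memory and read back
T1 = A @ B
T2 = T1 @ C
E_chained = T2 + D

# Composed expression: one call, contraction order planned by the library
E_composed = np.einsum('ik,kl,lm->im', A, B, C, optimize=True) + D

assert np.allclose(E_chained, E_composed)
```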
**Memory Hierarchy**:
- Cache-aware implementations
- Register blocking
- Memory bandwidth optimization
- Data locality maximization
### Hardware-Specific Optimization
**GPU Implementations**:
- CUDA kernel design
- Shared memory utilization
- Warp-level operations
- Multi-stream execution
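As an illustration of these points, the sketch below uses CuPy (a NumPy-like CUDA library; it assumes CuPy and a CUDA-capable GPU are available) to issue two independent GEMMs on separate streams. Kernel- and warp-level design lives inside the library's hand-written kernels and is not visible at this level.
```python
import cupy as cp

A = cp.random.rand(2048, 2048)
B = cp.random.rand(2048, 2048)
C = cp.random.rand(2048, 2048)

s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()

# Independent GEMMs on separate streams can overlap on the device
with s1:
    R1 = A @ B
with s2:
    R2 = A @ C

s1.synchronize()
s2.synchronize()
```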
**Multi-Node/Multi-GPU**:
- Distributed matrix storage
- Communication optimization
- Load balancing
- Scalability considerations
**Heterogeneous Computing**:
- CPU-GPU coordination
- Device selection strategies
- Memory transfer optimization
- Hybrid implementations
## Algorithm-by-Blocks Approach
**Concept**: Decompose algorithms into block operations
**Benefits**:
- Better cache utilization
- Parallelizable blocks
- Flexible implementation
- Data locality
**Example: LU Factorization by Blocks**:
```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_kernel(A):
    """Unblocked, in-place LU of a small square block (no pivoting)."""
    for j in range(A.shape[0] - 1):
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])

def lu_by_blocks(A, block_size):
    """Right-looking blocked LU factorization, in place, without pivoting."""
    n = A.shape[0]
    for k in range(0, n, block_size):
        bs = min(block_size, n - k)
        Akk = A[k:k+bs, k:k+bs]
        # Factorize the diagonal block in place: Akk = L11 * U11
        lu_kernel(Akk)
        if k + bs < n:
            # Row panel (TRSM): U12 = inv(L11) @ A12
            A[k:k+bs, k+bs:] = solve_triangular(
                Akk, A[k:k+bs, k+bs:], lower=True, unit_diagonal=True)
            # Column panel (TRSM): L21 = A21 @ inv(U11)
            A[k+bs:, k:k+bs] = solve_triangular(
                Akk, A[k+bs:, k:k+bs].T, trans='T').T
            # Trailing matrix update (GEMM): A22 -= L21 @ U12
            A[k+bs:, k+bs:] -= A[k+bs:, k:k+bs] @ A[k:k+bs, k+bs:]
```
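A quick sanity check of the sketch above, under its no-pivoting assumption (a diagonally dominant test matrix keeps the factorization stable without pivoting):
```python
import numpy as np

n, bsz = 512, 64
A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant
A_ref = A.copy()

lu_by_blocks(A, bsz)
L = np.tril(A, -1) + np.eye(n)
U = np.triu(A)
print(np.linalg.norm(L @ U - A_ref) / np.linalg.norm(A_ref))   # small relative residual
```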
## Performance Optimization Techniques
### Cache Blocking
**Register Blocking**: Keep data in registers
**L1 Cache Blocking**: Reuse data in L1 cache
**L2 Cache Blocking**: Maximize L2 utilization
**TLB Blocking**: Avoid TLB misses
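The blocking idea is easiest to see in a matrix multiply. The sketch below is illustrative only: it tiles the three loops so each pair of tiles is reused while resident in cache, with block sizes chosen to match the target cache level; production kernels do the same in compiled code with register-level micro-tiles.
```python
import numpy as np

def blocked_gemm(A, B, mb=64, nb=64, kb=64):
    """C = A @ B computed tile by tile, reusing each tile while it is in cache."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                # Each block product reuses the A and B tiles across the inner loop
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C
```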
### Parallelization
**Thread-Level**: Multi-threaded execution
**SIMD-Level**: Vector instructions
**GPU-Level**: GPU parallelism
**Distributed-Level**: Multi-node parallelism
### Memory Optimization
**Prefetching**: Hide memory latency
**In-place Operations**: Avoid extra memory
**Strided Access**: Optimal memory patterns
**Alignment**: Cache line alignment
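A few of these are visible even at the NumPy level: `out=` arguments avoid temporaries, and stride/contiguity inspection shows whether an access pattern is cache-friendly. Alignment and prefetching are handled below the Python layer.
```python
import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# In-place operation: no temporary allocated for the result
np.add(A, B, out=A)

# Strided access: row-major (C-contiguous) arrays favor row-wise traversal
print(A.flags['C_CONTIGUOUS'], A.strides)
col = A[:, 0]          # strided view; large stride between consecutive elements
print(col.strides)
```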
## Framework Components
### Operation Definitions
**Standard Operations**: BLAS/LAPACK compatible
**Extended Operations**: New operation types
**Tensor Operations**: Multi-linear algebra
**Composite Operations**: Operation sequences
### Implementation Layer
**CPU Kernels**: Optimized CPU implementations
**GPU Kernels**: GPU-specific kernels
**Hybrid Kernels**: Mixed CPU-GPU
**Reference Kernels**: Portable reference code
### Interface Layer
**Standard APIs**: BLAS/LAPACK compatibility
**Extended APIs**: New operation interfaces
**Language Bindings**: C, Fortran, Python, etc.
**Framework APIs**: High-level framework calls
## Research Paper Reference
**Paper**: "A Proposed Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science (FAMLIES)"
- **Authors**: Devin A. Matthews, Tze Meng Low, Margaret E. Myers, et al.
- **arXiv ID**: 2604.07311
- **Published**: April 8, 2026
- **Categories**: cs.MS (Mathematical Software)
- **Link**: https://arxiv.org/abs/2604.07311
## Use Cases
### Scientific Computing
**Applications**:
- Numerical simulations
- Finite element methods
- Computational physics
- Climate modeling
**Requirements**:
- High performance
- Large-scale matrices
- Distributed computing
- Specialized operations
### Machine Learning
**Applications**:
- Neural network training
- Tensor operations
- Large-scale optimization
- Model inference
**Requirements**:
- GPU acceleration
- Batch processing
- Tensor operations
- High throughput
### Data Science
**Applications**:
- Large-scale analytics
- Statistical computations
- Data preprocessing
- Dimensionality reduction
**Requirements**:
- Flexible operations
- Mixed sparse/dense
- Memory efficiency
- Ease of use
## Related Skills
- **blas-lapack**: Traditional BLAS/LAPACK usage
- **gpu-computing**: GPU acceleration techniques
- **tensor-operations**: Tensor algebra fundamentals
- **high-performance-computing**: HPC best practices
## Implementation Examples
### Basic GEMM with Loop Fusion
```c
// FAMLIES-style fused kernel: E = alpha*A*B + beta*C + gamma*D in one pass
void gemm_scale_add(int m, int n, int k,
                    double alpha, double beta, double gamma,
                    double *A, double *B, double *C, double *D,
                    double *E) {
    // Row-major, unblocked reference implementation of the fused operation
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++) {
                sum += A[i*k + p] * B[p*n + j];
            }
            // Single write of E; C and D are each read exactly once
            E[i*n + j] = alpha*sum + beta*C[i*n + j] + gamma*D[i*n + j];
        }
    }
}
```
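For checking a C implementation, the same expression in NumPy (used here purely as a reference for the intended semantics):
```python
import numpy as np

m, n, k = 64, 48, 32
alpha, beta, gamma = 1.5, 0.5, -2.0
A, B = np.random.rand(m, k), np.random.rand(k, n)
C, D = np.random.rand(m, n), np.random.rand(m, n)

E_ref = alpha * (A @ B) + beta * C + gamma * D   # reference result for gemm_scale_add
```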
### Tensor Contraction
```python
import numpy as np

def tensor_contract_3d(A, B, contract_dims):
    """
    Contract tensors A and B along the specified dimensions.

    Args:
        A: tensor of shape (a1, a2, a3)
        B: tensor of shape (b1, b2, b3)
        contract_dims: pair of axis tuples (np.tensordot convention),
            e.g. ((-1,), (0,)) contracts the last axis of A with the
            first axis of B
    Returns:
        The contracted tensor.
    """
    # Example: contract the last dimension of A with the first of B
    if contract_dims == ((-1,), (0,)):
        return np.einsum('ijk,klm->ijlm', A, B)
    # General contraction
    return np.tensordot(A, B, axes=contract_dims)
```
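Example usage, checking the special case against an explicit `einsum`:
```python
A = np.random.rand(3, 4, 5)
B = np.random.rand(5, 6, 7)

out = tensor_contract_3d(A, B, ((-1,), (0,)))   # shape (3, 4, 6, 7)
ref = np.einsum('ijk,klm->ijlm', A, B)
assert np.allclose(out, ref)
```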
## Future Directions
**Quantum Computing Integration**:
- Quantum linear algebra
- Hybrid quantum-classical algorithms
- Quantum matrix operations
**Sparse-Dense Hybrids**:
- Mixed sparse/dense operations
- Selective storage strategies
- Format-aware kernels
**Automatic Optimization**:
- Auto-tuning frameworks
- Machine learning for optimization
- Compiler-driven optimization
## See Also
- BLIS project documentation
- libflame documentation
- BLAS/LAPACK standards
- High-performance computing guides