Systematic profiling identifies where time and resources are actually spent in AI workloads.
## GPU Profiling
### PyTorch Profiler

```python
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# Profile training steps: skip 1 (wait), warm up for 1, record 5 (active)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=5),
    on_trace_ready=tensorboard_trace_handler('./log/resnet18'),
    record_shapes=True,
    with_stack=True,
    profile_memory=True
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 1 + 1 + 5:  # wait + warmup + active
            break
        model(batch)
        prof.step()

# Key metrics
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=20,
    header="Top 20 CUDA Ops"
))

# Output shows:
# - CPU time vs CUDA time per operation
# - Memory allocated/freed
# - Number of calls
# - Self vs total time (with children)
```
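The trace files written by `tensorboard_trace_handler` use the Chrome trace event format, so they can also be mined directly instead of only viewed in TensorBoard. A minimal sketch, assuming the standard trace layout (a `traceEvents` list of complete events with `ph: "X"` and microsecond `dur` fields; exact field usage can vary between PyTorch versions); the `top_ops_from_trace` helper and the synthetic trace are illustrative, not part of the profiler API:

```python
import json
from collections import defaultdict

def top_ops_from_trace(trace: dict, n: int = 5):
    """Sum duration (microseconds) per op name from a Chrome-format trace."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Complete events ("ph": "X") carry their duration in "dur"
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]

# Synthetic trace standing in for a real profiler export
trace = {"traceEvents": [
    {"ph": "X", "name": "aten::mm", "dur": 900.0},
    {"ph": "X", "name": "aten::mm", "dur": 850.0},
    {"ph": "X", "name": "aten::relu", "dur": 120.0},
]}
print(top_ops_from_trace(trace))
# [('aten::mm', 1750.0), ('aten::relu', 120.0)]
```

For a real run, load the JSON written to `./log/resnet18` with `json.load` and pass it to the same function.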
### Identifying Bottlenecks

```python
# Memory bandwidth test
import time

import torch

def measure_memory_bandwidth():
    size_gb = 4
    num_elements = size_gb * 1024**3 // 4  # float32 = 4 bytes

    tensor = torch.randn(num_elements, device='cuda')
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(100):
        result = tensor * 2.0  # Read + write
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # 2x size because we read AND write every element
    bandwidth_gbps = (2 * size_gb * 100) / elapsed
    print(f"Memory bandwidth: {bandwidth_gbps:.1f} GB/s")
    # Compare to GPU spec (A100: ~2 TB/s, RTX 4090: ~1 TB/s)

# GPU compute utilization snapshot
def measure_compute_utilization():
    import subprocess
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=utilization.gpu,utilization.memory',
         '--format=csv,noheader'],
        capture_output=True, text=True
    )
    print(result.stdout)
```
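The raw `nvidia-smi` CSV output is easier to act on once parsed into numbers. A hypothetical helper (the `parse_utilization` name is mine, and the `"45 %, 30 %"` sample assumes the percent-suffixed shape of `csv,noheader` output):

```python
def parse_utilization(csv_line: str) -> dict:
    """Parse one 'csv,noheader' line like '45 %, 30 %' into percentages."""
    gpu, mem = (field.strip().rstrip('%').strip()
                for field in csv_line.split(','))
    return {"gpu_util": float(gpu), "mem_util": float(mem)}

line = "45 %, 30 %"  # sample line in nvidia-smi csv,noheader format
util = parse_utilization(line)
print(util)  # {'gpu_util': 45.0, 'mem_util': 30.0}
if util["gpu_util"] < 70:
    print("GPU likely starved: check data loading / CPU preprocessing")
```

Feeding `result.stdout` from `measure_compute_utilization` line-by-line into this parser turns the snapshot into thresholds you can alert on.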
## Common Bottleneck Patterns

```
Symptom: GPU utilization < 70%
→ Bottleneck: Data loading or CPU preprocessing
→ Fix: Increase num_workers, add prefetching, pin_memory=True

Symptom: High GPU memory utilization but low compute
→ Bottleneck: Memory bandwidth limited (memory-bound kernels)
→ Fix: Kernel fusion, larger batch sizes, Flash Attention

Symptom: Good GPU utilization but poor scaling
→ Bottleneck: NCCL communication overhead in distributed training
→ Fix: Gradient compression, topology-aware communication

Symptom: Spiky GPU utilization
→ Bottleneck: CPU-GPU synchronization or Python GIL
→ Fix: Async operations, non-blocking transfers
```
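The memory-bound pattern can be predicted with back-of-the-envelope roofline arithmetic: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the GPU's compute-to-bandwidth ratio. A sketch using approximate A100 numbers (~19.5 TFLOP/s FP32, ~1.6 TB/s HBM; treat both as assumptions and substitute your hardware's specs):

```python
def is_memory_bound(flops: float, bytes_moved: float,
                    peak_flops: float = 19.5e12,  # A100 FP32, approximate
                    peak_bw: float = 1.6e12) -> bool:
    """Memory-bound if arithmetic intensity < machine balance (FLOPs/byte)."""
    intensity = flops / bytes_moved
    machine_balance = peak_flops / peak_bw  # ~12 FLOPs/byte here
    return intensity < machine_balance

# Elementwise multiply of n float32s: 1 FLOP each, read 4 B + write 4 B
n = 1024**2
print(is_memory_bound(flops=n, bytes_moved=8 * n))
# True  -> fusion / FlashAttention territory

# Square matmul of size m: 2*m^3 FLOPs over ~3 float32 matrices
m = 4096
print(is_memory_bound(flops=2 * m**3, bytes_moved=3 * 4 * m**2))
# False -> compute-bound
```

This is why the fix for the second symptom above is kernel fusion: fusing elementwise ops removes intermediate reads and writes, raising arithmetic intensity toward the compute roof.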