How to Scale Your Model

What You Learned

LLM scaling fundamentals
TPU and GPU architecture basics
Roofline analysis
Compute vs memory vs communication bottlenecks
Strong scaling concepts
Transformer architecture internals
Transformer FLOPs and parameter calculation
Multi-GPU/TPU training
Model parallelism techniques
Data parallelism
Tensor parallelism
Pipeline parallelism
Expert parallelism (MoE)
Model sharding and FSDP
ZeRO optimization
Gradient accumulation
Rematerialization (activation checkpointing)
LLM training cost estimation
LLM inference optimization
KV cache management
Latency vs throughput trade-offs
LLaMA 3 training and serving
TPU networking and communication
JAX for large-scale AI
TPU/GPU profiling and debugging
Hardware-aware AI system design

You learned how large language models are trained, scaled, parallelized, optimized, and served efficiently across thousands of GPUs and TPUs.
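
As a rough illustration of the training cost estimation topic listed above, the sketch below applies the widely used rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The function names, the chip count, the peak FLOP/s figure, and the utilization value are hypothetical placeholders chosen for the example, not numbers taken from this material.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer.

    Uses the 6 * N * D rule of thumb (forward plus backward pass),
    which ignores attention FLOPs and any rematerialization overhead.
    """
    return 6.0 * num_params * num_tokens


def training_days(
    num_params: float,
    num_tokens: float,
    num_chips: int,
    peak_flops_per_chip: float,
    utilization: float,
) -> float:
    """Estimate wall-clock training time in days.

    `utilization` is the assumed fraction of peak FLOP/s actually achieved.
    """
    total = training_flops(num_params, num_tokens)
    effective_flops_per_sec = num_chips * peak_flops_per_chip * utilization
    return total / effective_flops_per_sec / 86_400  # seconds per day


# Hypothetical setup: 8B parameters, 1T tokens, 1024 chips at 1e15 FLOP/s peak,
# running at an assumed 40% utilization.
flops = training_flops(8e9, 1e12)
days = training_days(
    8e9, 1e12, num_chips=1024, peak_flops_per_chip=1e15, utilization=0.4
)
print(f"{flops:.2e} FLOPs, roughly {days:.2f} days")
```

Changing the utilization argument is a quick way to see how sensitive the estimate is to how well the hardware is actually used.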