What I Learned

  • What is model quantization, and how does it reduce LLM size while preserving performance?

  • How can a 744B-parameter model be run on consumer hardware?

  • What are the trade-offs between 1-bit, 2-bit, 4-bit, and 8-bit quantization?

  • Why is memory (RAM/VRAM) often more important than raw GPU compute for large-model inference?

  • How does a Mixture of Experts (MoE) architecture enable massive models with fewer active parameters per token?

  • What is KV-cache quantization, and how does it extend context length?

  • How can 1M-token context windows change AI applications?

  • How does local AI deployment work using GGUF and llama.cpp?

  • What are the benefits of running AI models locally instead of in the cloud?

  • How close are open-source models to frontier closed-source models?

  • What do benchmark scores actually measure, and what are their limitations?

  • How can tool calling, web search, and code execution turn an LLM into an AI agent?

  • What optimizations are required to make frontier models practical for real-world use?

  • Why is inference engineering becoming as important as model training?

  • What does the future of AI look like when models become smaller, faster, and more agentic?
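
As a rough illustration of the quantization and memory questions above, the weight footprint of a model can be estimated as parameters × bits per weight ÷ 8. The sketch below applies that to the 744B-parameter figure from the list. The function name weight_memory_gb is hypothetical, and the estimate is a simplifying assumption: it counts weights only, ignoring the KV cache, activations, and the per-block scale data that real quantized formats store, so actual files are usually somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every weight uses exactly `bits_per_weight` bits; real
    quantized formats add overhead for scales and other metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 744e9  # the 744B-parameter model mentioned in the list

# Compare 16-bit weights with the quantization levels named in the list.
for bits in (16, 8, 4, 2, 1):
    print(f"{bits:>2}-bit: {weight_memory_gb(PARAMS, bits):,.0f} GB")
```

Under these assumptions, halving the bits per weight halves the memory needed for the weights, which is why lower-bit quantization is what brings a model of this size closer to the RAM/VRAM available on consumer machines.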