What I Learned
What is model quantization, and how does it reduce LLM size while preserving performance?
How can a 744B-parameter model be run on consumer hardware?
What are the trade-offs between 1-bit, 2-bit, 4-bit, and 8-bit quantization?
Why is memory (RAM/VRAM) often more important than raw GPU compute for large-model inference?
How does a Mixture of Experts (MoE) architecture let a massive model activate only a fraction of its parameters per token?
What is KV-cache quantization, and how does it let longer contexts fit in the same memory?
How can 1M-token context windows change AI applications?
How does local AI deployment work using GGUF and llama.cpp?
What are the benefits of running AI models locally instead of in the cloud?
How close are open-source models to frontier closed-source models?
What do benchmark scores actually measure, and what are their limitations?
How can tool calling, web search, and code execution turn an LLM into an AI agent?
What optimizations are required to make frontier models practical for real-world use?
Why is inference engineering becoming as important as model training?
What does the future of AI look like when models become smaller, faster, and more agentic?
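
As a rough illustration of why bit width and memory dominate several of the questions above, the sketch below estimates the memory needed for the weights alone of a 744B-parameter model at different precisions. It is back-of-the-envelope arithmetic, not a measurement. The function name weight_memory_gb and the 16-bit baseline are assumptions made for illustration. Real quantized files (for example GGUF) typically mix bit widths and store per-block scale factors, so actual sizes run somewhat higher than these nominal figures, and the KV cache and runtime overhead come on top.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes).

    Ignores quantization scale factors, the KV cache, and runtime overhead.
    """
    return num_params * bits_per_weight / 8 / 1e9


# The 744B-parameter figure comes from the list above; the bit widths are
# nominal values chosen for illustration (16-bit is an assumed baseline).
PARAMS = 744e9

for bits in (16, 8, 4, 2, 1):
    print(f"{bits:>2}-bit: {weight_memory_gb(PARAMS, bits):,.0f} GB")
```

Under these assumptions, halving the bit width halves the weight footprint, which is why the step from 16-bit down to 4-bit or lower decides whether a model of this size fits in the combined RAM and VRAM of a single machine at all.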
