Dylan Goldblatt, Ph.D.
Kennesaw State University
Slides:
inference-slides.vercel.app
Extras:
inference-extras.vercel.app



| LLM Family | Model Sizes | Multilingual | Coding Strength | Context Window | MMLU (approx.) |
|---|---|---|---|---|---|
| Google Gemma (3) | 27B (12B, 4B, 1B) | Yes (140+ langs) | Moderate (basic coding ability) | 128k tokens | ~67.5% |
| Microsoft Phi (Phi-4) | 5.6B, 3.8B (plus 14B) | Yes (broadly multilingual) | Excels for its size | up to 128k tokens | ~69–78% |
| Mistral (Large 2) | 123B (also 24B, 7B) | Yes (dozens of langs) | Strong; trained heavily on code | 128k tokens | ~84.0% |
| Alibaba Qwen (2.5) | 72B (32B, 14B, 7B) | Yes (29+ languages) | Strong, esp. 7B Coder variant | 128k tokens | ~86% |
| Meta LLaMA (3.1) | 405B, 70B, 8B | Partial (8 languages) | State-of-the-art (matches GPT-4 level) | 128k tokens | ~87.3% |
| DeepSeek (R1) | 671B (MoE) | Yes (EN/ZH high proficiency) | Top-tier reasoning & code | 128k tokens | ~90.8% |
Quantization compresses LLMs by storing weights in fewer bits (for example, 8-bit or 4-bit integers instead of 16-bit floats), trading a small loss in accuracy for a much smaller memory footprint.
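A back-of-the-envelope sketch in plain Python of how bit width translates into weight memory for a few of the model sizes above (figures are approximate and ignore the KV cache and runtime overhead):

```python
# Approximate weight-memory footprint at different quantization levels.
# These figures cover the weights only, not activations or the KV cache.

BITS_PER_WEIGHT = {"fp16": 16, "int8 (Q8)": 8, "int4 (Q4)": 4}

def weight_memory_gb(num_params: float, bits: int) -> float:
    """Bytes needed for the weights alone, converted to gigabytes."""
    return num_params * bits / 8 / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    sizes = ", ".join(
        f"{fmt}: {weight_memory_gb(params, bits):.1f} GB"
        for fmt, bits in BITS_PER_WEIGHT.items()
    )
    print(f"{name:>4} -> {sizes}")
```

At 4-bit, a 7B model needs roughly 3.5 GB for its weights instead of about 14 GB in FP16, which is what makes laptop-scale inference practical.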

LM Studio (MAC)
Jan (PC)
Apple Silicon Macs have a software limit on GPU memory:
Most users with 7B-13B models won't need this
sudo sysctl iogpu.wired_limit_mb=<mb>

ChatRTX (PC)

MacWhisper (MAC)
Free and paid Pro versions
https://goodsnooze.gumroad.com/l/macwhisper
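Once a model is loaded, LM Studio and Jan can both expose an OpenAI-compatible local server. A minimal sketch of calling it from Python follows; the port (LM Studio's usual default, 1234) and the placeholder model name are assumptions, so check your app's local-server settings for the actual values:

```python
# Minimal sketch: chat with a locally served model over an
# OpenAI-compatible endpoint (LM Studio and Jan can both expose one).
import json
import urllib.request

URL = "http://localhost:1234/v1/chat/completions"  # assumed port; adjust to your setup
payload = {
    "model": "local-model",  # placeholder; use the identifier shown in your app
    "messages": [
        {"role": "user", "content": "Summarize what quantization does in one sentence."}
    ],
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["choices"][0]["message"]["content"])
```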
By managing your inference at the edge, you keep your data on your own device, avoid per-token API costs, and can keep working offline.


Thank you for attending the AI Fair!
