Inference at the Edge

How to Work Securely and Offline with Language Models

Dylan Goldblatt, Ph.D.

Kennesaw State University

Slides:

inference-slides.vercel.app

Extras:

inference-extras.vercel.app

Today's Journey

  1. Why local inference matters
  2. Selecting the right foundation model
  3. Running local LLMs with RAG
  4. Transcribing audio and video files locally
  5. Securing remote inference

1. Why Local Inference Matters

Key Benefits

  • Privacy & data security
  • Independence from connectivity
  • Low latency, high speed
  • Control over updates and usage
  • Personalization without sharing data

Privacy & Security

  • Your input data never leaves your device
  • No one else can access, store, or analyze it
  • No server logs of your queries exist
  • No training on your proprietary information

Availability & Cost

  • Works without internet access
  • Works in secure environments
  • Unaffected by provider downtime and outages
  • Works without a dedicated budget/subscription

Speed & Control

  • Fast responses
  • Full control over model versions
  • No surprise interface or functionality changes
  • You decide when to update—or not

2. Selecting Foundation Models

Choosing the Right Model

  • Find a model that suits your hardware
  • Match a model to your needs
  • Use quantization to adjust for both

LLaMA (Meta)

  • Generous context windows
  • Excellent size-quality balance

Mistral

  • Efficient small models (7B, 24B) suited to limited hardware
  • Strong coding ability for its size

Gemma (Google)

  • Small sizes (1B-27B) run well on limited hardware
  • Broad multilingual coverage (140+ languages)

Phi (Microsoft)

  • High speed and small size
  • Can run on edge hardware including phones

Qwen (Alibaba)

  • Competes with larger, proprietary models
  • Less computationally demanding than DeepSeek

DeepSeek

  • Matches proprietary models in quality
  • Disallowed on U.S. government (USG) work computers

LLM Family            | Model Sizes           | Multilingual                 | Coding Strength                            | Context           | MMLU
Google Gemma (3)      | 27B (12B, 4B, 1B)     | Yes (140+ langs)             | Moderate (basic coding ability)            | 128k tokens       | ~67.5%
Microsoft Phi (Phi-4) | 5.6B, 3.8B (plus 14B) | Yes (broadly multilingual)   | Yes – excels for its size                  | up to 128k tokens | ~69–78%
Mistral (Large 2)     | 123B (also 24B, 7B)   | Yes (dozens of langs)        | Yes – strong; trained heavily on code      | 128k tokens       | ~84.0%
Alibaba Qwen (2.5)    | 72B (32B, 14B, 7B)    | Yes (29+ languages)          | Yes – esp. 7B Coder variant                | 128k tokens       | ~86%
Meta LLaMA (3.1)      | 405B, 70B, 8B         | Partial (8 languages)        | Yes – state-of-the-art (matches GPT-4-level) | 128k tokens     | ~87.3%
DeepSeek (R1)         | 671B (MoE)            | Yes (EN/ZH high proficiency) | Yes – top-tier reasoning & code            | 128k tokens       | ~90.8%

Matching LLMs to GPUs

  1. Macs are the best "default" work option
  2. Most work PCs lack dedicated VRAM
  3. Most work PCs can run small models
  4. Parameter count in billions (B) roughly equals the GB of VRAM needed (at 8-bit)
  5. Estimate which LLMs a machine can run before purchasing (see the sketch below)
  6. NVIDIA-based computers are the best choice for dedicated AI work
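
A minimal sketch of that estimate (assumes PyTorch and psutil are installed; the ~0.5 GB per billion parameters at 4-bit and the 20% overhead factor are rough assumptions, not measurements):

    # Rough check of which model sizes this machine can hold at 4-bit (a sketch, not a benchmark).
    import psutil
    import torch

    if torch.cuda.is_available():
        budget_gb = torch.cuda.get_device_properties(0).total_memory / 1e9   # dedicated VRAM
    elif torch.backends.mps.is_available():
        budget_gb = 0.75 * psutil.virtual_memory().total / 1e9               # default GPU share of unified RAM
    else:
        budget_gb = psutil.virtual_memory().total / 1e9 / 2                  # assume half of RAM for CPU-only use

    for params_b in (1, 4, 8, 14, 27, 70):
        need_gb = params_b * 0.5 * 1.2        # ~0.5 GB per billion params at 4-bit, plus ~20% overhead
        verdict = "fits" if need_gb < budget_gb else "too big"
        print(f"{params_b}B at 4-bit needs ~{need_gb:.1f} GB -> {verdict}")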

Quantization

Quantization compresses LLMs by storing weights in fewer bits:

  • Full precision (FP16): ~14GB for a 7B model
    Highest quality, highest memory
  • 8-bit (INT8): ~7GB
    Excellent quality, half the memory
  • 4-bit (INT4): ~3.5GB
    Slight quality tradeoff, significant memory savings
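
A back-of-the-envelope check of the figures above (weights only; the KV cache and runtime overhead add more):

    # Memory footprint of a 7B-parameter model at different precisions (weights only).
    def model_gb(params_billion: float, bits: int) -> float:
        bytes_per_weight = bits / 8
        return params_billion * bytes_per_weight   # 1e9 params times bytes, divided by 1e9 bytes/GB, cancels out

    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: ~{model_gb(7, bits):.1f} GB")   # prints ~14.0, ~7.0, ~3.5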

Common Quantization Formats

  • GGUF (llama.cpp's model format, successor to GGML)
    Optimized for CPUs and Apple Silicon; portable and widely supported (see the sketch below)
  • GPTQ
    Popular for GPU acceleration; fast inference
  • AWQ (Activation-aware Weight Quantization)
    Advanced technique balancing accuracy and speed
  • MLX (Apple's machine-learning framework)
    Optimized specifically for Apple hardware; offers ~20-30% performance improvement
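
For example, a 4-bit GGUF file can be run from Python with the llama-cpp-python bindings (a minimal sketch; the file name is a placeholder for whatever model you have downloaded, and n_gpu_layers=-1 assumes Metal or CUDA offload is available):

    # Minimal sketch: run a 4-bit GGUF model locally with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mistral-7b-instruct-q4_k_m.gguf",  # placeholder path to a downloaded GGUF file
        n_ctx=4096,        # context window to allocate
        n_gpu_layers=-1,   # offload all layers to GPU/Metal if available; 0 = CPU only
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize why local inference matters in one sentence."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])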

3. Running Local LLMs with RAG

User-Friendly Local Apps

  • LM Studio: Closed-source; polished, first-class experience
  • Jan.ai: Open-source, "calmer" user experience
  • Ollama: Highly efficient, but command-line only (no built-in GUI)
  • AnythingLLM: Multifunctional, with a RAG focus

LM Studio

  • Cross-platform: macOS, Windows, Linux
  • Built-in model library with easy downloads
  • Completely offline operation after setup
  • ChatGPT-like interface running locally
  • Basic RAG capabilities included
  • Powered by llama.cpp under the hood

Live Demo

LM Studio (Mac)

Jan.ai

  • Fully open-source alternative (AGPL licensed)
  • Similar experience to LM Studio
  • Multiple backends (llama.cpp, TensorRT-LLM)
  • Evolving rapidly with new features
  • Community-driven development
  • Local API server capabilities (OpenAI-compatible; see the sketch below)
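
Because Jan's local server speaks an OpenAI-style API, existing client code can simply point at it (a sketch; the port and model name are assumptions, use whatever the app reports):

    # Sketch: call Jan's local OpenAI-compatible server instead of a cloud endpoint.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1337/v1",  # assumed default port; check Jan's server settings
        api_key="not-needed-locally",         # the local server does not verify keys
    )

    resp = client.chat.completions.create(
        model="llama3.1-8b-instruct",  # placeholder; use a model you have downloaded in Jan
        messages=[{"role": "user", "content": "Hello from an offline laptop!"}],
    )
    print(resp.choices[0].message.content)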

Demo

Jan (PC)

Mac Specific "Unlock"

Apple Silicon Macs have a software limit on GPU memory:

  • Default: ~75% of physical RAM is available to the GPU
  • Advanced users can raise this limit by:
    • Disabling System Integrity Protection (SIP)
    • Modifying system graphics driver settings
  • Only recommended for dedicated AI machines
  • Most users running 7B-13B models won't need this

    sudo sysctl iogpu.wired_limit_mb=<mb>
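    sudo sysctl iogpu.wired_limit_mb=57344   # hypothetical example: ~56 GB wired limit on a 64 GB Mac; resets on reboot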

Notes on RAG

  • Feed your notes, PDFs, research papers
  • Query in natural language
  • Get responses based on your personal data
  • All while keeping information private
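
Under the hood, basic RAG is "embed, retrieve, then prompt." A minimal local sketch with sentence-transformers (the embedding model name and the toy documents are placeholders; a real setup would chunk PDFs and use a vector store):

    # Minimal RAG loop: embed local notes, retrieve the closest ones, stuff them into the prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = [
        "Meeting notes: the pilot study ships in March.",
        "Grant draft: the budget covers two graduate assistants.",
        "Syllabus: final projects are due the last week of class.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model that runs locally
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(question: str, k: int = 2) -> list[str]:
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                      # cosine similarity (vectors are normalized)
        return [docs[i] for i in np.argsort(-scores)[:k]]

    question = "When is the pilot study due?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    print(prompt)  # pass this prompt to your local LLM (LM Studio, Jan, llama.cpp, ...)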

NVIDIA ChatRTX (PC)

  • Specialized for NVIDIA RTX 30/40/50 series GPUs
  • Blazingly fast RAG optimized for tensor cores
  • Document and image search
  • Voice input
  • Remarkably smooth performance
  • Requires Windows 11 and a modern NVIDIA GPU

Demo

ChatRTX (PC)

RAG to Riches (R2R)

  • Production-grade RAG framework
  • Knowledge graphs
  • Multi-step reasoning
  • Hybrid search (keywords + vectors)
  • Tool-using agents
  • More complex setup but extremely powerful
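
R2R's own APIs aren't shown here, but the idea behind hybrid search is easy to sketch: blend a keyword-overlap score with a vector-similarity score and rank by the mix (the 50/50 alpha weighting below is an arbitrary assumption):

    # Generic hybrid retrieval: blend a keyword-overlap score with a vector-similarity score.
    import numpy as np

    def keyword_score(query: str, doc: str) -> float:
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)          # fraction of query terms present in the doc

    def hybrid_rank(query, docs, doc_vecs, query_vec, alpha=0.5):
        vec_scores = doc_vecs @ query_vec           # cosine similarity, assuming normalized vectors
        kw_scores = np.array([keyword_score(query, d) for d in docs])
        combined = alpha * vec_scores + (1 - alpha) * kw_scores
        return [docs[i] for i in np.argsort(-combined)]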

4. Transcribing Audio and Video Files Locally

Local Audio Transcription: Whisper

  • State-of-the-art speech recognition
  • Open-source and runs locally
  • Multiple sizes (tiny → large)
  • Supports 99 languages
  • Transcribe without sharing audio
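
The open-source whisper package makes this a few lines of Python (a sketch; the file name is a placeholder, and ffmpeg must be installed for decoding):

    # Sketch: transcribe a local recording with Whisper, entirely offline after the model downloads.
    import whisper

    model = whisper.load_model("base")          # tiny/base/small/medium/large trade speed for accuracy
    result = model.transcribe("lecture.mp3")    # placeholder file; video files also work via ffmpeg
    print(result["text"])
    print(result["language"])                   # detected language code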

MacWhisper

  • Native macOS app for Whisper
  • Metal acceleration for Apple Silicon
  • Excellent transcription capabilities
  • Speaker diarization (who spoke when)
  • Punctuation and formatting
  • Free and paid Pro versions

    https://goodsnooze.gumroad.com/l/macwhisper

Live Demo

MacWhisper (Mac)

5. Securing Remote Inference

Remote LLMs via API

  • Best quality & performance
  • Very large context windows (100K+)
  • No local device hardware constraints
  • Incorporate inference into software
  • Economical pay-per-use pricing
  • Aggregators (OpenRouter) vs. individual LLM providers (see the sketch below)
  • Better data handling than consumer chat apps (API inputs are typically not used for training)
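
Remote inference is typically one OpenAI-style call, for example through OpenRouter (a sketch; the model ID is a placeholder and an OPENROUTER_API_KEY environment variable is assumed):

    # Sketch: pay-per-use remote inference through OpenRouter's OpenAI-compatible API.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],   # assumed to be set in your environment
    )

    resp = client.chat.completions.create(
        model="meta-llama/llama-3.1-70b-instruct",  # placeholder model ID; pick any listed model
        messages=[{"role": "user", "content": "Draft a two-sentence abstract about edge inference."}],
    )
    print(resp.choices[0].message.content)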

Securing Remote LLMs

  • Strip sensitive information (e.g., names) before sending (redaction sketch below)
  • Use providers with strong data guarantees
  • Segment your usage: local for sensitive tasks, remote for public-domain questions
  • Hugging Face Inference Endpoints (private)
    • Your own server running open models
    • Remote but still under your control
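
Stripping obvious identifiers before a prompt leaves your machine can start as simply as a few regex passes (a sketch only; the patterns are illustrative, and robust PII scrubbing needs more, e.g. a named-entity model run locally):

    # Sketch: redact obvious identifiers (emails, phone numbers, known names) before remote calls.
    import re

    REDACTIONS = [
        (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
        (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    ]

    def redact(text: str, names: list[str]) -> str:
        for pattern, token in REDACTIONS:
            text = pattern.sub(token, text)
        for name in names:                      # caller supplies known names to mask
            text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
        return text

    print(redact("Email Dana Smith at dana@example.edu or 555-867-5309.", names=["Dana Smith"]))
    # -> Email [NAME] at [EMAIL] or [PHONE].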

Conclusion

By managing your inference at the edge, you:

  • Reclaim control over privacy and security
  • Use cloud services selectively
  • Work effectively even when disconnected
  • Control your AI experience completely

Book an AI Consultation

KSU AI Community

Discussion

Thank you for attending the AI Fair!