
My Local LLM Stack: Running AI Without the Cloud

I run language models, speech-to-text, embeddings, vector search, and RAG entirely on hardware I own. Here's the exact setup, what it costs, and what it can do.

Why Local?

I work with sensitive documents — contracts, financial records, personal notes. Sending those to OpenAI's API or Google's Gemini means they touch servers I don't control, under terms that can change quarterly. For personal use and for clients in regulated industries, that's not acceptable.

But I also refuse to give up the capabilities. I want chat with my documents, I want transcription, I want semantic search. So I built a local stack that does all of it.

The Hardware

The entire AI inference stack runs on a single Windows PC with an NVIDIA RTX 4060 (8GB VRAM). That's a $300 GPU. Not a datacenter card, not a cluster — a consumer gaming GPU.

The vector database and orchestration run on a separate TrueNAS box (Intel N95, 16GB RAM). Two machines, under $1,000 combined.

AI Processing: Windows PC

  • GPU: NVIDIA RTX 4060 (8GB VRAM)
  • RAM: 32GB DDR5
  • Role: LLM inference, embeddings, speech-to-text
  • Services: LM Studio (port 1234), Whisper (port 9000)

Storage & Orchestration: TrueNAS

  • CPU: Intel N95
  • RAM: 16GB
  • Role: Vector database, RAG, document processing, automation
  • Services: Qdrant (port 6333), AnythingLLM (port 3002), n8n (port 5678)

The Stack

LM Studio — Local LLM Inference

LM Studio runs on the Windows PC and serves models through an OpenAI-compatible API at http://192.168.86.77:1234/v1. Any tool that speaks the OpenAI API format can connect — n8n, AnythingLLM, custom scripts, even other AI tools.
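Any OpenAI-style client can talk to it. A minimal sketch using only Python's standard library — the IP and port match the setup above, but the model identifier is whatever name LM Studio reports for your loaded model, so treat `"qwen3-8b"` as a placeholder:

```python
import json
import urllib.request

LM_STUDIO_URL = "http://192.168.86.77:1234/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "qwen3-8b") -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """POST the prompt to LM Studio and return the reply text."""
    req = urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage: print(chat("Summarize this contract clause in one sentence."))
```

Because the endpoint speaks the OpenAI wire format, swapping a cloud provider for LM Studio is usually just a base-URL change in whatever client you already use.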

I primarily run Qwen3-8B for general tasks. At 4-bit quantization it fits in 8GB VRAM with room left for context (at int8, the weights alone would roughly fill the card) and responds in 1-3 seconds for typical queries. For embeddings, I use nomic-embed-text v1.5, a 768-dimensional model that produces high-quality vectors for semantic search.

The key insight: you don't need a 70B parameter model for most tasks. An 8B model handles summarization, classification, question-answering, and structured extraction well enough for daily use. Save the big models for when accuracy on nuanced tasks actually matters.

Whisper — Speech-to-Text

OpenAI's Whisper runs locally via a Docker container with CUDA acceleration. Upload any audio file, get accurate transcription with timestamps. I built SolScribe on top of this — a full transcription management platform with speaker diarization, AI chat, and webhook automation.

Performance: A 30-minute recording transcribes in about 2 minutes on the RTX 4060 using the "base" model. The "large" model is slower but catches more nuance in accented speech and technical jargon.
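Sending audio to the service is a single HTTP POST. A hedged sketch, assuming a container like the common whisper-asr-webservice image with an `/asr` endpoint — your container's route, query parameters, and form field name may differ:

```python
import urllib.request
import uuid

# Hypothetical endpoint; check your Whisper container's docs for the real route.
WHISPER_URL = "http://192.168.86.77:9000/asr?task=transcribe&output=json"

def build_multipart(filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode an audio file as a multipart/form-data body (stdlib only)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio_file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path: str) -> str:
    """Upload one audio file and return the service's raw JSON response."""
    with open(path, "rb") as f:
        body, ctype = build_multipart(path, f.read())
    req = urllib.request.Request(WHISPER_URL, data=body,
                                 headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```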

Qdrant — Vector Database

Qdrant stores embeddings for semantic search. Unlike keyword search, vector search finds results by meaning — searching for "Docker networking issues" will find a note titled "Container can't reach host network" even though the words don't match.
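Under the hood, "search by meaning" is just vector comparison: each text becomes an embedding, and closeness is measured with cosine similarity. A toy illustration with 3-dimensional stand-ins (real vectors are the 768-dimensional outputs of the embedding model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": texts about the same topic point in similar directions
# even when they share no keywords.
query  = [0.9, 0.1, 0.0]   # "Docker networking issues"
note_a = [0.8, 0.2, 0.1]   # "Container can't reach host network"
note_b = [0.0, 0.1, 0.9]   # an unrelated note

assert cosine(query, note_a) > cosine(query, note_b)
```

Qdrant does exactly this comparison, just over millions of stored vectors with an index instead of a linear scan.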

I maintain three collections: documents (Paperless-ngx content), pkm-knowledge (Obsidian notes), and bookstack_embeddings (wiki pages). New content gets embedded automatically through n8n workflows.
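The embed-and-upsert step those workflows perform can be sketched against the two HTTP APIs directly. The LM Studio endpoint is OpenAI-compatible; the Qdrant hostname and the embedding model identifier below are placeholders for your own values:

```python
import json
import urllib.request

EMBED_URL = "http://192.168.86.77:1234/v1/embeddings"   # LM Studio, OpenAI-compatible
QDRANT_URL = "http://truenas.local:6333"                # placeholder host for the TrueNAS box

def embed(text: str, model: str = "nomic-embed-text-v1.5") -> list[float]:
    """Request a 768-d embedding from LM Studio's /v1/embeddings endpoint."""
    req = urllib.request.Request(
        EMBED_URL,
        data=json.dumps({"model": model, "input": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]

def build_point(point_id: int, vector: list[float], payload: dict) -> dict:
    """Shape one point the way Qdrant's REST API expects it."""
    return {"id": point_id, "vector": vector, "payload": payload}

def upsert(collection: str, point_id: int, vector: list[float], payload: dict) -> None:
    """PUT one point into a Qdrant collection."""
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/{collection}/points",
        data=json.dumps({"points": [build_point(point_id, vector, payload)]}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req).read()

# Usage: upsert("pkm-knowledge", 42, embed("note text"), {"source": "obsidian"})
```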

AnythingLLM — RAG Chat

AnythingLLM connects to LM Studio for inference and Qdrant for retrieval. Upload documents into workspaces, then chat with them. Ask questions, get answers with citations pointing to the exact source.

I use it for: searching across hundreds of technical documents, finding specific clauses in contracts, and querying my personal knowledge base conversationally.

n8n — The Glue

n8n orchestrates everything. Three shared sub-workflows form the foundation:

  1. LM Studio Call — A reusable webhook that formats and sends requests to LM Studio, with retry logic and timeout handling.
  2. OpenRouter Fallback — Tries a cloud LLM first (for tasks that benefit from larger models), falls back to local LM Studio if the cloud is unavailable.
  3. Qdrant Embedder — Takes text, generates an embedding via nomic-embed-text, and upserts it into Qdrant with metadata.

Every other AI workflow in my stack calls one of these three. New note in Obsidian? The embedder runs. Document uploaded to Paperless? The embedder runs. Voice memo transcribed? Summarized with LM Studio, then embedded.
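The fallback logic in sub-workflow 2 is simple try-cloud-then-local. n8n implements it with nodes rather than code, but the shape is easy to show in Python with stub callables standing in for the two backends:

```python
from typing import Callable

def with_fallback(cloud: Callable[[str], str],
                  local: Callable[[str], str],
                  prompt: str) -> str:
    """Try the cloud model first; on any failure, fall back to the local one."""
    try:
        return cloud(prompt)
    except Exception:
        return local(prompt)

# Stubs for illustration: the cloud call fails, so the local model answers.
def cloud_stub(prompt: str) -> str:
    raise ConnectionError("OpenRouter unreachable")

def local_stub(prompt: str) -> str:
    return f"[local] {prompt}"

assert with_fallback(cloud_stub, local_stub, "hello") == "[local] hello"
```

The same pattern generalizes: timeouts and retries wrap the `cloud` callable, and the local path is always the last resort, so the pipeline keeps working when the internet doesn't.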

What It Can Do

Document Q&A

Upload a 50-page PDF. Ask "What are the payment terms?" and get an answer in 3 seconds with a citation to page 12, paragraph 3. All local. No API calls. No data leaving the network.

Voice-to-Knowledge Pipeline

Record a voice memo on my phone. SolScribe transcribes it. n8n summarizes the transcript with LM Studio, creates an Obsidian note, and embeds it in Qdrant. The entire pipeline runs in under a minute with zero manual intervention.

Semantic Search Across Everything

One search query hits Qdrant and returns relevant results from my documents, notes, and wiki — ranked by meaning, not keyword frequency. Finding a half-remembered concept from three months ago takes seconds.

Automated Document Processing

Drop a file in the Paperless-ngx consumption folder. It gets OCR'd, tagged by an LLM that reads the content and suggests categories, embedded for semantic search, and a notification hits my phone — all automatically.

The Cost

Total Cost of Ownership

  • GPU: ~$300 (RTX 4060, one-time)
  • Monthly electricity: ~$8-12 (GPU PC runs when needed)
  • Cloud API costs: $0
  • Subscription fees: $0
  • Equivalent cloud services: $100-200/month

Break-even: ~2 months.

Honest Limitations

This isn't a cloud replacement for everyone. Some real constraints:

  • Model quality: An 8B parameter model is not GPT-4. For complex reasoning, creative writing, or nuanced analysis, cloud models are still better. I use OpenRouter as a fallback for tasks that need it.
  • Context window: Local models typically handle 4K-8K context well. Larger contexts need more VRAM or quantization tricks.
  • Concurrent users: This is a single-user setup. Serving a team would need a beefier GPU or multiple inference servers.
  • Maintenance: Models need updating. Docker containers need monitoring. It's not zero-maintenance — budget 2-3 hours per week.

But for personal use, sensitive work, and daily AI-assisted operations? Local inference is not just viable — it's preferable.

Getting Started

If you want to build a similar stack, start small:

  1. Install LM Studio on any machine with an NVIDIA GPU. Download a small model (Phi-3 Mini or Qwen 2.5 3B). Play with the chat interface.
  2. Set up Qdrant with a single Docker command. Create a collection. Try embedding and searching a few documents.
  3. Add n8n and build your first automation: webhook → LM Studio → notification. Once that works, everything else is iteration.

The infrastructure to run AI privately exists today, on consumer hardware, at consumer prices. The question isn't whether it's possible. It's whether you're comfortable with someone else holding the keys.