How to Deploy a Private LLM with Agent-Ready Open-Weight Models

Deploy a self-hosted, agent-ready LLM in 2026: model selection, quantization, vLLM vs SGLang serving, MCP integration, RAG, and production sizing guide.

Raj Patel, No-Code & API Developer
14 min read
Private LLM deployment stack diagram showing open-weight model, vLLM inference server, MCP tools layer, and LangGraph agent framework connected inside a self-hosted infrastructure

A private LLM deployment that would have taken a dedicated ML platform team six weeks to build in 2024 now takes an afternoon. The stack has matured that fast.

Open-weight models in 2026 are competitive with commercial APIs on the tasks most enterprises actually run: code assistance, document analysis, contract review, RAG over private knowledge bases, and multi-step agentic workflows. Inference serving engines have consolidated around two well-documented, production-hardened options. The Model Context Protocol (MCP) means you connect your model to internal systems once, not once per tool.

This guide walks through every layer of the private LLM stack from scratch: picking the right model, quantizing it to fit your hardware, standing up a serving engine, wiring in an agent framework with MCP and RAG, and sizing hardware for your workload. By the end, you will have a working agent-ready private LLM and a clear decision framework for production deployment.

TL;DR: For most private deployments, start with Mistral Small 4 or Qwen3.6-35B (Apache 2.0 licensed, fit on a single GPU, strong function calling) served via vLLM with AWQ quantization. Wire agents through LangGraph with MCP servers for your internal tools. Upgrade to Llama 3.3 70B or DeepSeek V4 when workloads justify the hardware. Do not self-host unless you are processing more than roughly 1 billion tokens per month or have a hard compliance requirement. The math rarely works otherwise.

What we are building

By the end of this tutorial you will have:

  • A quantized open-weight model running on your own GPU infrastructure
  • A vLLM inference server exposing an OpenAI-compatible API on port 8000
  • An MCP server connecting the model to a local file system and database
  • A LangGraph agent that uses the model to complete multi-step tasks with tool calls
  • A RAG pipeline that answers questions grounded in your private documents

The full stack looks like this:

Private LLM agent stack
User / Application
      │
Agent Framework (LangGraph)
      │
┌─────┴──────┐
MCP Servers  RAG Pipeline
(tools/data) (vector DB)
      │
Inference Engine (vLLM)
OpenAI-compatible :8000/v1
      │
Quantized Open-Weight Model
(AWQ on GPU / GGUF on CPU)
      │
Compute (GPU server / workstation)
Private LLM agent stack diagram showing user application, LangGraph agent framework, MCP servers, RAG pipeline, vLLM inference engine, and open-weight model layers
The full private LLM agent stack. Each layer is independently replaceable — you can swap the model, serving engine, or agent framework without rebuilding the others.

Prerequisites

Before starting, confirm you have the following:

  • A Linux server or workstation with at least one NVIDIA GPU (24 GB VRAM minimum for 7B to 13B models; 40+ GB for 70B models at INT4)
  • Python 3.10+ and pip installed
  • Docker (optional but recommended for production serving)
  • CUDA 12.1+ and the appropriate NVIDIA drivers installed
  • At least 50 GB of free disk space for model weights
  • Basic familiarity with the command line and Python virtual environments

No GPU? Ollama with GGUF models runs on CPU (including Apple Silicon M-series) for development and light workloads. It is not suitable for production serving beyond 10 to 20 concurrent users, but it is the fastest way to validate a model choice before committing to GPU hardware.

Step 1: Pick the right open-weight model

Model selection is the decision with the largest downstream consequences, and it is where teams most commonly spend time on the wrong variables. Benchmark scores matter less than three practical criteria: license, hardware fit, and function-calling reliability.

The license question first

This is non-negotiable for commercial deployments and needs to be resolved before you spend time on anything else.

| License | Key models | Commercial use | Fine-tune and redistribute |
|---|---|---|---|
| Apache 2.0 | Qwen3.6-35B, Mistral Small 4, Qwen 2.5 72B | Yes | Unrestricted |
| MIT | DeepSeek R1, Phi-4-Reasoning, GLM-5.1 | Yes | Unrestricted |
| Meta Community License | Llama 3.3 70B, Llama 4 | Yes (under 700M MAU) | Restricted |
| Gemma Terms of Use | Gemma 4 31B | Yes (with conditions) | Restricted |

If your legal team requires a fully OSI-approved license, your shortlist is Apache 2.0 and MIT models. The Qwen and Mistral families have clean commercial licensing and are strong performers.

Match the model to your hardware

At BF16 (16-bit, unquantized weights), GPU memory requirements scale at roughly 2 GB per billion parameters. Quantization brings this down significantly:

| Model | BF16 VRAM | INT4 (AWQ) VRAM | Fits on |
|---|---|---|---|
| 7B | ~14 GB | ~5 GB | RTX 4090, M2 Max |
| 13B | ~26 GB | ~9 GB | RTX 4090, RTX 3090 |
| 35B (MoE, 3B active) | ~70 GB | ~20 GB | RTX 4090 |
| 70B | ~140 GB | ~40 GB | 2x RTX 3090 / A100 80GB |
| 671B MoE (37B active) | ~1.3 TB | ~370 GB | 8x H100 |
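
The 2 GB-per-billion rule is parameter count multiplied by bytes per weight. A minimal sketch of that arithmetic follows. The function name and the 1.2x overhead factor are assumptions made for illustration; the table's INT4 figures include runtime overhead that a weights-only estimate does not capture.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.0) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_weight * overhead


for size in (7, 13, 70):
    bf16 = weight_memory_gb(size, 16)
    # overhead=1.2 is an assumed allowance for scales, zero points, and buffers
    int4 = weight_memory_gb(size, 4, overhead=1.2)
    print(f"{size}B: BF16 ~{bf16:.0f} GB, INT4 ~{int4:.1f} GB")
```

For Mixture-of-Experts models, pass the total parameter count: every expert has to be resident in memory even though only a few are active per token.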

The Qwen3.6-35B-A3B is worth special attention here. It is a Mixture-of-Experts model with 35B total parameters but only 3B active parameters per token. In practice, it runs on a single RTX 4090 (24 GB) with INT4 quantization while delivering performance well above its active parameter count. It is one of the best value-per-GPU options available with an Apache 2.0 license.

Choose for your primary use case

  • Agentic coding and tool use: Kimi K2.6, DeepSeek V4 Pro, GLM-5.1 (all for enterprise GPU budgets); Qwen3-Coder or Mistral Small 4 for single-GPU deployments
  • General RAG and chat: Llama 3.3 70B (best benchmark results on general tasks), Qwen 2.5 72B (Apache 2.0 alternative)
  • Edge and consumer GPU: Phi-4-Reasoning, Gemma 4 31B, Mistral Small 4
  • Reasoning tasks: DeepSeek R1, Phi-4-Reasoning

For this tutorial, we will use Mistral Small 4 (Apache 2.0, strong native function calling, single GPU, fast iteration) for the development walkthrough, with notes on swapping to larger models for production.

Step 2: Download and quantize the model

Download from Hugging Face

Download model weights (bash)
# Install the Hugging Face CLI
pip install huggingface_hub

# Log in (required for gated models like Llama)
huggingface-cli login

# Download Mistral Small 4.
# Check the model card first: confirm the repo ID below matches the release you
# intend to run, and verify its license and whether the repo is gated.
huggingface-cli download mistralai/Mistral-Small-Instruct-2409 --local-dir ./models/mistral-small-4

Quantize with AWQ for GPU serving

AWQ (Activation-Aware Weight Quantization) is the recommended format for production GPU inference. It compresses weights to INT4 while preserving the most activation-sensitive parameters, giving near-full-precision quality at roughly 30% of the original VRAM cost.

Install AWQ (bash)
pip install autoawq

Quantize to AWQ INT4 (python)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./models/mistral-small-4"
quant_path = "./models/mistral-small-4-awq"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    safetensors=True,
)

# Quantize: w_bit=4 is INT4; zero_point=True is standard AWQ
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Quantized model saved to {quant_path}")
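
To make the w_bit, q_group_size, and zero_point settings concrete, here is a toy sketch of group-wise INT4 quantization with a zero point, written in plain Python. It illustrates only the storage scheme. It is not AWQ itself: the activation-aware scaling that protects sensitive weights is left out, and the function names are hypothetical.

```python
def quantize_group(weights: list[float], w_bit: int = 4):
    """Map one group of weights to unsigned integers in [0, 2**w_bit - 1]."""
    qmax = (1 << w_bit) - 1                      # 15 for INT4
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # width of one integer step
    zero_point = round(-lo / scale)               # integer that represents 0.0
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q: list[int], scale: float, zero_point: int) -> list[float]:
    """Recover approximate weights from the stored integers."""
    return [(v - zero_point) * scale for v in q]


group = [-1.0, -0.5, 0.0, 0.25, 1.0]   # a real group would hold q_group_size values
q, scale, zp = quantize_group(group)
restored = dequantize_group(q, scale, zp)
print(q, [round(w, 3) for w in restored])
```

Each group stores its own scale and zero point, which is why smaller groups preserve quality better at the cost of a little extra memory.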

Skip quantization if a pre-quantized model exists. Check Hugging Face for {model-name}-AWQ or {model-name}-GPTQ variants in the TheBloke or bartowski namespaces before running quantization yourself. Pre-quantized models from trusted namespaces save 30 to 60 minutes of compute time.

For GGUF format (Ollama, LM Studio, CPU inference):

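Convert to GGUF with llama.cpp (bash)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cmake -B build && cmake --build build --config Release

# Step 1: convert the Hugging Face checkpoint to a 16-bit GGUF file.
# The converter does not produce K-quants such as Q4_K_M directly.
python convert_hf_to_gguf.py ../models/mistral-small-4 --outfile ../models/mistral-small-4-f16.gguf --outtype f16

# Step 2: quantize the 16-bit GGUF file to Q4_K_M
./build/bin/llama-quantize ../models/mistral-small-4-f16.gguf ../models/mistral-small-4-q4_k_m.gguf Q4_K_M

Q4_K_M is the recommended GGUF quantization level for balancing quality and size. Q8_0 gives near-lossless quality at roughly double the memory; Q2_K fits more on smaller hardware but shows noticeable quality degradation.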

Step 3: Serve with vLLM

vLLM is the production-grade inference engine for GPU deployments. It uses PagedAttention for efficient KV cache management, exposes an OpenAI-compatible API, and has mature Kubernetes support via Helm charts.

Install and start vLLM

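Install vLLM (bash)
# Requires CUDA 12.1+
pip install vllm

Start the vLLM server (bash)
# --served-model-name sets the model ID that clients pass in requests;
# without it, vLLM serves the model under the --model path.
# --enable-auto-tool-choice and --tool-call-parser turn on the tool calling
# that the agent in Step 6 relies on.
python -m vllm.entrypoints.openai.api_server \
  --model ./models/mistral-small-4-awq \
  --served-model-name mistral-small-4 \
  --quantization awq \
  --dtype auto \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral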

The server exposes an OpenAI-compatible API at http://localhost:8000/v1. Any library or tool that works with the OpenAI SDK will work against this endpoint by changing the base_url.

Verify the server is running

Test inference endpoint (python)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # vLLM does not enforce keys by default
)

response = client.chat.completions.create(
    # Must match the name vLLM serves: the --served-model-name value if set,
    # otherwise the path passed to --model.
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Summarize what a RAG pipeline does in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

vLLM server terminal output showing successful model load, PagedAttention KV cache initialization, and first inference request completion
A healthy vLLM startup shows the KV cache block allocation and confirms the model is loaded before accepting requests.

vLLM vs SGLang: which one to use

Both are production-ready. The choice depends on your workload:

|  | vLLM | SGLang |
|---|---|---|
| Best for | High-throughput general serving | Agent and multi-turn workloads |
| KV cache strategy | PagedAttention | RadixAttention (shared prefix caching) |
| Throughput advantage | Baseline | 29% higher on smaller models; stronger for repeated system prompts |
| Hardware support | NVIDIA, AMD, AWS Trainium | NVIDIA, AMD |
| Production maturity | High (battle-tested Helm charts) | High |
| Setup | pip install vllm | pip install sglang |

Use SGLang when your agent workload involves many requests that share a long system prompt or prefix. RadixAttention caches and reuses those shared prefixes rather than recomputing them on every request, which is the dominant bottleneck in agent serving. For all other production cases, vLLM's broader hardware support and ecosystem integrations make it the safer default.
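
The saving from prefix reuse is easy to see in a toy model. The sketch below counts how many tokens still need to be processed when a new request shares a prefix with an earlier one. It is an illustration only: it scans a plain list rather than a radix tree, the class name is hypothetical, and it says nothing about how SGLang implements RadixAttention internally.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers past token sequences and reports how much of a new one is reusable."""

    def __init__(self) -> None:
        self._seen: list[list[int]] = []

    def tokens_to_compute(self, tokens: list[int]) -> int:
        reused = max((shared_prefix_len(tokens, s) for s in self._seen), default=0)
        self._seen.append(list(tokens))
        return len(tokens) - reused


system_prompt = list(range(1000))            # stand-in for a 1,000-token system prompt
cache = ToyPrefixCache()
print(cache.tokens_to_compute(system_prompt + [5001, 5002]))  # first request: full cost
print(cache.tokens_to_compute(system_prompt + [6001]))        # later request: suffix only
```

With a long shared system prompt, almost all prefill work after the first request comes from the short per-request suffix.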

Do not use Ollama for production multi-user serving. Ollama is excellent for development, local experimentation, and Apple Silicon. It is designed for a small number of concurrent users (10 to 20 maximum). For serving more than a handful of developers simultaneously, deploy vLLM or SGLang instead.

Step 4: Add MCP tool integration

The Model Context Protocol provides a standardized interface for your LLM to interact with external systems. Instead of writing custom tool-calling code for every integration, you implement MCP once and gain access to a growing ecosystem of pre-built MCP servers.

Install the MCP Python SDK

Install MCP (bash)
pip install mcp

Create a simple MCP server

This example exposes a file system reader and a basic database query tool to the model:

mcp_server.py (python)
import asyncio
import pathlib
import sqlite3

from mcp import types
from mcp.server import Server
from mcp.server.stdio import stdio_server

app = Server("private-tools")


@app.list_tools()
async def list_tools() -> list[types.Tool]:
    # Advertise each tool and the JSON Schema for its arguments
    return [
        types.Tool(
            name="read_file",
            description="Read a text file from the local file system",
            inputSchema={
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute path to the file"}
                },
                "required": ["path"],
            },
        ),
        types.Tool(
            name="query_db",
            description="Run a read-only SQL query on the internal SQLite database",
            inputSchema={
                "type": "object",
                "properties": {
                    "sql": {"type": "string", "description": "SELECT statement to execute"}
                },
                "required": ["sql"],
            },
        ),
    ]


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "read_file":
        path = pathlib.Path(arguments["path"])
        if not path.exists():
            return [types.TextContent(type="text", text=f"Error: file not found at {path}")]
        return [types.TextContent(type="text", text=path.read_text())]

    if name == "query_db":
        conn = sqlite3.connect("./data/internal.db")
        try:
            cursor = conn.execute(arguments["sql"])
            rows = cursor.fetchall()
            cols = [d[0] for d in cursor.description]
        finally:
            conn.close()  # release the database handle even if the query fails
        result = [dict(zip(cols, row)) for row in rows]
        return [types.TextContent(type="text", text=str(result))]

    raise ValueError(f"Unknown tool: {name}")


async def main():
    # Serve over stdin/stdout so a client can launch this script as a subprocess
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    asyncio.run(main())

Restrict the SQL tool to SELECT statements only. The example above does not enforce this at the code level. In production, add a check that rejects any SQL not beginning with SELECT, or connect to a read-only database replica. Giving an LLM write access to a production database via an MCP tool is a serious security risk.
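
One way to enforce that restriction, sketched under stated assumptions: combine a statement check with SQLite's read-only open mode, so that a write is blocked by the database even if the text check is bypassed. The function name is hypothetical, and the prefix check is deliberately strict (it rejects CTEs and any statement containing a semicolon, including one inside a string literal).

```python
import sqlite3


def run_readonly_query(db_path: str, sql: str) -> list[dict]:
    """Execute a single SELECT against a database opened in read-only mode."""
    statement = sql.strip().rstrip(";").strip()
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only single SELECT statements are allowed")

    # mode=ro makes SQLite itself refuse writes on this connection
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        cursor = conn.execute(statement)
        cols = [d[0] for d in cursor.description]
        return [dict(zip(cols, row)) for row in cursor.fetchall()]
    finally:
        conn.close()
```

Inside the query_db branch of call_tool, the call would then be run_readonly_query("./data/internal.db", arguments["sql"]), with the ValueError turned into an error message for the model.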

The MCP ecosystem includes pre-built servers for many common systems: PostgreSQL, MySQL, file systems, web search, GitHub, Slack, Jira, and more. Check the MCP server registry before writing a custom server for a common integration.

Step 5: Build a RAG pipeline for private knowledge

RAG (Retrieval-Augmented Generation) lets the model answer questions about your private documents without retraining. The pipeline: chunk documents, embed them into a vector database, retrieve relevant chunks at query time, inject them into the model's context.

Install RAG dependencies (bash)
pip install langchain-community langchain-text-splitters chromadb sentence-transformers

rag_pipeline.py (python)
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk your private documents ("**/*.txt" matches subdirectories too)
loader = DirectoryLoader("./docs/", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

# Embed with a local sentence-transformer model (no data leaves your server)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Store in a local ChromaDB vector database.
# With chromadb 0.4 and later, data is written to persist_directory automatically;
# only older versions need an explicit vectorstore.persist() call.
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db",
)
print(f"Indexed {len(chunks)} chunks from {len(docs)} documents.")

Use a local embedding model. BAAI/bge-small-en-v1.5 runs fast on CPU and keeps all data local. Sending documents to an external embedding API (OpenAI, Cohere) defeats the purpose of a private deployment. For larger document sets or multilingual content, BAAI/bge-m3 is a strong multilingual alternative.
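
The query-time half of the pipeline (retrieve, then inject into the prompt) can be sketched without any dependencies. This toy version uses character-based chunking and word-overlap scoring as a stand-in for embedding similarity, so it shows the data flow rather than retrieval quality. All function names and the prompt wording are hypothetical.

```python
import re


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                      # the last window already reaches the end
        start += step
    return chunks


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Rank chunks by shared words with the query (stand-in for vector search)."""
    q = _words(query)
    return sorted(chunks, key=lambda c: len(q & _words(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


corpus = ["Shipping takes five days.", "Refund policy: 30 days.", "Office hours are 9 to 5."]
print(build_prompt("What is the refund policy?", retrieve("refund policy", corpus, k=1)))
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.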

Step 6: Wire a LangGraph agent

LangGraph connects the model, MCP tools, and RAG retriever into a stateful agent that can plan, call tools, observe results, and reason across multiple steps.

Install LangGraph (bash)
pip install langgraph langchain-openai

agent.py (python)
import asyncio

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.tools import create_retriever_tool, tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Point to your local vLLM server
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",
    model="mistral-small-4",  # must match the model name the server exposes
    temperature=0,
)

# RAG retriever tool
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

rag_tool = create_retriever_tool(
    retriever,
    name="search_internal_docs",
    description="Search the company knowledge base. Use for questions about internal policies, procedures, or documentation.",
)


# MCP tool wrapper: launches mcp_server.py as a subprocess and uses the MCP
# client session, which performs the required initialize handshake before the
# tool call. Writing a bare tools/call message to stdin skips that handshake.
async def _call_mcp_tool(name: str, arguments: dict) -> str:
    params = StdioServerParameters(command="python", args=["mcp_server.py"])
    async with stdio_client(params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            result = await session.call_tool(name, arguments)
            return "\n".join(getattr(item, "text", "") for item in result.content)


@tool
def read_file(path: str) -> str:
    """Read a file from the local file system."""
    return asyncio.run(_call_mcp_tool("read_file", {"path": path}))


# Build the agent
tools = [rag_tool, read_file]
agent = create_react_agent(llm, tools)

# Run a query
result = agent.invoke({
    "messages": [{"role": "user", "content": "What does our internal refund policy say, and is there a policy document at /docs/refund-policy.txt?"}]
})
print(result["messages"][-1].content)

This creates a ReAct (Reasoning + Acting) agent that will plan which tools to call, call them in sequence, observe the results, and compose a final answer. LangGraph's graph-based architecture makes it straightforward to add human-in-the-loop review steps, retry logic, and branching flows as your agent complexity grows.
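
The control flow behind that description is a short loop. The sketch below shows it with a scripted stand-in for the model, so it runs without a server. The message shape, the tool_call field, and every name are simplifications invented for illustration; they are not LangGraph's internals or the OpenAI wire format.

```python
from typing import Callable


def run_react_loop(model: Callable[[list[dict]], dict],
                   tools: dict[str, Callable[..., str]],
                   user_message: str, max_steps: int = 5) -> str:
    """Alternate between model turns and tool calls until the model answers."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model(messages)                       # reason: model picks the next step
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]                   # no tool requested: final answer
        observation = tools[call["name"]](**call["arguments"])   # act
        messages.append({"role": "tool", "name": call["name"],
                         "content": observation})     # observe
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(messages: list[dict]) -> dict:
    """Stand-in model: search first, then answer from the observation."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_call": {"name": "search_docs",
                              "arguments": {"query": "refund policy"}}}
    return {"role": "assistant", "content": f"Per the docs: {messages[-1]['content']}"}


tools = {"search_docs": lambda query: "Refunds are accepted within 30 days."}
print(run_react_loop(scripted_model, tools, "What is the refund policy?"))
```

The max_steps cap is the important design choice: without it, a model that keeps requesting tools never returns.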

LangGraph agent execution trace showing model reasoning step, RAG retrieval tool call, file read tool call, and final answer composition
A LangGraph agent trace showing the model's planning steps, tool calls, and result synthesis. Full observability is one of the core advantages of a self-hosted stack.

Hardware sizing for production

Choosing hardware before you know your throughput requirements is one of the most common and expensive mistakes in private LLM deployment. Measure first.

Establish your throughput requirement

Before sizing hardware, answer three questions:

  1. How many requests per minute at peak load?
  2. What is the average input token count (system prompt + user message + RAG context)?
  3. What is the expected output token count per response?

A practical starting point: a single A100 80GB running Llama 3.3 70B at AWQ INT4 with vLLM handles roughly 1,000 to 2,000 tokens per second throughput, which translates to 15 to 30 concurrent users with typical enterprise query patterns.
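
The three answers convert directly into a tokens-per-second target. A minimal sketch of the arithmetic, with a hypothetical function name and made-up example inputs:

```python
def required_throughput(requests_per_minute: float,
                        avg_input_tokens: float,
                        avg_output_tokens: float) -> dict[str, float]:
    """Translate peak request rate and token counts into tokens per second."""
    requests_per_second = requests_per_minute / 60
    return {
        # prompt tokens the server must ingest each second
        "prefill_tokens_per_s": requests_per_second * avg_input_tokens,
        # generated tokens the server must emit each second
        "decode_tokens_per_s": requests_per_second * avg_output_tokens,
    }


# Example inputs are assumptions: 120 requests/min, 2,000 tokens in, 400 tokens out
print(required_throughput(120, 2000, 400))
```

Compare the result against throughput you measure on your own hardware with your own prompts, since published figures rarely state whether they count input tokens, output tokens, or both.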

Hardware tiers

| Tier | Hardware | Models | Use case |
|---|---|---|---|
| Development | RTX 4090 (24 GB) | Up to 13B quantized (7B at BF16), 35B MoE INT4 | Individual developer, prototyping |
| Small team | 2x RTX 3090 (48 GB) | 70B INT4 | 5 to 20 concurrent users |
| Mid production | A100 80GB | 70B AWQ (BF16 needs about 140 GB, so two or more GPUs) | 20 to 100 concurrent users |
| Production | H100 80GB | 70B FP8 (BF16 needs about 140 GB, so two or more GPUs) | 100+ concurrent users |
| Enterprise | 2-8x H100 (160-640 GB) | Large MoE (DeepSeek V4, Llama 4) BF16 | High-volume agentic workloads |

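FP8 achieves near-FP16 quality at roughly half the VRAM cost on GPUs with native FP8 support, such as the H100, RTX 5090, and RTX PRO 6000 Blackwell. The A100 is an Ampere part without native FP8 tensor cores, so check vLLM's support for your GPU before relying on it there. In vLLM, FP8 is selected with --quantization fp8 rather than --dtype, which only accepts 16-bit and 32-bit floating-point types. Where the hardware supports it, FP8 is the recommended precision tier for production serving.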

The cost crossover reality

Self-hosting only becomes economically favorable compared to commercial APIs at significant token volumes. A rough benchmark: the crossover point sits around 1 billion tokens per month for a mid-tier GPU server, accounting for hardware amortization, electricity, and engineering maintenance overhead.

Below that volume, you are paying more to self-host than to use a commercial API. The self-hosting decision should be driven by compliance requirements, data sovereignty, or fine-tuning needs — not cost alone — unless your volumes are genuinely high.
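
The crossover is a one-line calculation once you have your own numbers. In the sketch below both inputs are hypothetical, chosen only so that the result lands on the 1-billion-token mark discussed above; substitute your real monthly cost and your provider's blended price.

```python
def break_even_tokens_per_month(monthly_self_host_cost_usd: float,
                                api_price_per_million_tokens_usd: float) -> float:
    """Monthly token volume at which self-hosting costs the same as the API."""
    return monthly_self_host_cost_usd / api_price_per_million_tokens_usd * 1_000_000


# Assumed figures: $3,000/month all-in (amortized hardware, power, engineering time)
# against a blended API price of $3 per million tokens.
tokens = break_even_tokens_per_month(3000, 3.0)
print(f"Break-even at {tokens:,.0f} tokens per month")
```

The result is very sensitive to the engineering-time component of the monthly cost, which is also the hardest input to estimate honestly.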

Security hardening before you go live

Private deployments come with no built-in safety features. Every item below needs to be addressed before your inference endpoint is accessible to more than your own development machine.

Network access: Bind vLLM to 0.0.0.0 only if you have a reverse proxy (nginx, Caddy) in front with TLS termination. Never expose the inference port directly to the internet. Use VPC or VLAN isolation for the inference server.

Authentication: vLLM supports --api-key for basic bearer token authentication. For production, put a proper API gateway (Kong, AWS API Gateway, nginx with auth) in front that handles token rotation and per-user rate limiting.

Input validation: Implement prompt injection detection before production. Dedicated libraries like rebuff and llm-guard provide heuristic and model-based detection. This is especially important for agent deployments where tool calls execute real actions.

PII redaction: Some commercial platforms offer PII detection as a managed feature; a self-hosted deployment has none unless you add it. Integrate a redaction layer (Microsoft Presidio is a well-maintained open-source option) in the request pipeline before sensitive data reaches the model.

Audit logging: Log every request and response with user ID, timestamp, and token counts to your SIEM. This is a hard requirement for HIPAA, SOC 2, and financial regulations, and it enables incident forensics if something goes wrong.

Troubleshooting common errors

CUDA out of memory during model load

Reduce --gpu-memory-utilization from 0.90 to 0.80 or lower. If the model still does not fit, switch to a more aggressive quantization level (INT4 instead of INT8, or GGUF Q4_K_M instead of Q8_0). Confirm no other process is holding VRAM with nvidia-smi.
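
Weights are only part of the footprint; the KV cache grows with context length, which is why lowering --max-model-len is another lever when memory is tight. The sketch below applies the standard per-token formula (keys and values, for every layer and KV head). The model shape used in the example is an assumption resembling a 70B-class model with grouped-query attention; read the real values from your model's config.json.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 2**30


# Assumed shape: 80 layers, 8 KV heads, head dimension 128, 16-bit cache values
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)
for context in (8192, 32768):
    print(f"{context} tokens -> {kv_cache_gib(context, **shape):.1f} GiB per sequence")
```

That figure is per concurrent sequence, so the total cache budget scales with both context length and the number of requests in flight.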

Slow first-token latency

vLLM compiles CUDA kernels on the first request, which causes 10 to 30 second latency. This is a one-time cost per server restart. Warm the server by sending a short request immediately after startup in your deployment script.
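
A warm-up step for a deployment script might look like the sketch below: it retries a one-token chat completion until the server answers or a deadline passes. It uses only the standard library. The function names, the default model name, and the timing values are assumptions; the model name must match whatever your server exposes.

```python
import json
import time
import urllib.error
import urllib.request


def build_warmup_request(base_url: str, model: str) -> urllib.request.Request:
    """A minimal chat completion that generates a single token."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )


def warm_up(base_url: str = "http://localhost:8000/v1", model: str = "mistral-small-4",
            timeout_s: float = 300, opener=urllib.request.urlopen, sleep=time.sleep) -> bool:
    """Return True once a warm-up request succeeds, False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(build_warmup_request(base_url, model), timeout=60) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass                      # server not accepting requests yet; retry
        sleep(2)
    return False
```

Calling warm_up() right after starting the server, and only then marking the instance ready in your load balancer, keeps the slow first request away from users.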

Function calls produce malformed JSON

Most open-weight models generate reliable function calls when the tool schema is precise and examples are in the system prompt. If you see malformed output: (1) verify the model supports native function calling (Mistral Small 4 and Qwen3 do natively; others may need chat template adjustments); (2) reduce temperature to 0 for tool-call steps; (3) add JSON validation and retry logic in your agent.

vLLM not recognizing AWQ quantization

Confirm that config.json in the model directory contains a quantization_config section specifying "quant_method": "awq" (older AutoAWQ releases wrote a separate quant_config.json file instead). Run with --quantization awq explicitly even if the config is present.

What to build next

With this stack running, a few natural extensions are worth tackling in order:

A monitoring layer is the highest-priority next step for any production deployment. vLLM exposes Prometheus metrics at /metrics. Wire these into Grafana for latency percentiles, token throughput, GPU utilization, and queue depth. You cannot operate a production LLM deployment without visibility into these.
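
Before Grafana is wired up, a few lines of Python are enough to eyeball the endpoint. The sketch below parses the Prometheus text format into a dictionary. The sample text and the metric names in it are illustrative assumptions; check your own server's /metrics output for the names it actually exports. The parser ignores optional trailing timestamps, which this sample does not use.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {name_with_labels: value}."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):     # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        try:
            metrics[name_and_labels] = float(value)
        except ValueError:
            continue                              # not a sample line; ignore it
    return metrics


# Illustrative sample only; real output is fetched from http://localhost:8000/metrics
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistral-small-4"} 3.0
vllm:num_requests_waiting{model_name="mistral-small-4"} 12.0
"""

for name, value in parse_metrics(SAMPLE).items():
    print(f"{name} = {value}")
```

A waiting-queue gauge that stays above zero is usually the first sign that the server needs more capacity.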

Fine-tuning on your domain data turns a general-purpose model into a specialist. For a 7B model, QLoRA fine-tuning on a single RTX 4090 overnight is viable using Unsloth. A fine-tuned Mistral 7B on your company's support tickets will outperform a general Llama 70B on your specific task at a fraction of the compute cost.

Multi-model routing becomes relevant as workload diversity grows. A lightweight 7B model handles simple FAQ retrieval cheaply; a 70B model handles complex reasoning. A routing layer that classifies request complexity and routes accordingly cuts inference costs significantly while maintaining quality on demanding tasks.
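
A routing layer can start as a simple heuristic and be replaced by a trained classifier later. The sketch below routes on prompt length and a handful of keywords. The model names, the keyword list, and the 40-word threshold are all placeholders invented for illustration.

```python
# Phrases that suggest multi-step reasoning (assumed list; tune on your own traffic)
REASONING_HINTS = ("compare", "explain why", "step by step", "analyze", "prove")


def route(prompt: str, small: str = "small-7b", large: str = "large-70b",
          max_small_words: int = 40) -> str:
    """Pick a model name for a prompt using length and keyword heuristics."""
    text = prompt.lower()
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    if needs_reasoning or len(text.split()) > max_small_words:
        return large
    return small


for prompt in ("What are your office hours?",
               "Compare these two contracts and explain the liability differences step by step."):
    print(f"{route(prompt)} <- {prompt}")
```

Because both models sit behind the same OpenAI-compatible API, the router only has to change the model field of the outgoing request.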

For context on how this private LLM stack compares to commercial coding assistants, see the open-source AI coding tools comparison — tools that can connect to a self-hosted LLM backend using the same OpenAI-compatible API you just set up. For Kubernetes infrastructure to run inference at scale, read Kubernetes for AI workloads. The guardian agents in CI/CD guide shows how to extend the agent pattern into automated code quality pipelines. Browse more developer tools on Bytewaves.

Frequently asked questions

Do I need a GPU to run a private LLM?

For development and light personal use, no. Ollama with GGUF-quantized models runs on CPU (including Apple Silicon M-series chips, which are particularly capable). For production serving beyond 10 to 20 concurrent users, a dedicated NVIDIA GPU is necessary. The minimum practical GPU for a production single-model deployment is an RTX 4090 (24 GB VRAM) for models up to 13B, or an A100/H100 80 GB for 70B models.

Is self-hosting cheaper than using a commercial API?

Not automatically, and for many teams the math never crosses over. GPU hardware, electricity, engineering maintenance, and operational overhead add up to a real total cost of ownership that exceeds commercial API pricing for workloads below roughly 1 billion tokens per month. The case for self-hosting should be built on compliance requirements, data sovereignty, or fine-tuning needs first. Cost savings are a secondary benefit that only materializes at genuine scale.

Which open-weight models are best for agents and function calling?

For enterprise GPU budgets, Kimi K2.6 and DeepSeek V4 Pro lead the 2026 Berkeley Function Calling Leaderboard. For single-GPU deployments, Mistral Small 4 has native function calling without special prompting, and Qwen3.6-35B-A3B (Apache 2.0, runs on one RTX 4090) is highly competitive. The most important factor after model selection is your agent scaffolding: Princeton's Holistic Agent Leaderboard shows that orchestration quality can shift benchmark scores by 30 absolute points on the same model.

What is MCP, and why does it matter for private deployments?

MCP (Model Context Protocol) is an open standard originally developed by Anthropic that defines how LLMs communicate with external tools, databases, and data sources. Before MCP, every AI application needed custom code to connect to each tool. With MCP, you implement the protocol once and gain access to a growing ecosystem of pre-built MCP servers covering databases, file systems, GitHub, Slack, web search, and hundreds of other systems. For private deployments, this dramatically reduces the integration work required to build a genuinely useful agent.

What does a compliant private LLM deployment require?

The key requirements are: no data egress (the inference endpoint and all data must stay within your controlled infrastructure), full audit logging of all model inputs and outputs, PII detection and redaction before data reaches the model, access controls on the inference endpoint, and documented data processing policies. Self-hosted open-weight models satisfy the data residency requirement by design. You still need to implement audit logging, PII redaction (Microsoft Presidio is a good open-source option), and access authentication. None of these come built into the inference server by default.
