Local AI on RTX 4060 Ti: Setup & Performance Guide

Running Local AI on Your Gaming PC: Complete Guide for RTX 4060 Ti

Turn your RTX 4060 Ti into an AI powerhouse for homework, coding, and creative projects—completely free and private!


What You'll Learn

  1. Why run AI locally on your own PC
  2. Hardware setup and requirements
  3. Installing and configuring llama.cpp
  4. Running real AI models with OpenAI-compatible API
  5. Practical curl examples for all SGR (Schema-Guided Reasoning) tasks
  6. Context size optimization and real-world testing
  7. Performance benchmarks: RTX 4060 Ti vs RTX 5090 vs DGX Spark
  8. Tips and troubleshooting

Part 1: Why Local AI Matters

Your Gaming PC = Personal AI Server

If you have a modern gaming PC with an RTX 4060 Ti, you already own hardware capable of running AI models that handle much of what you'd otherwise use ChatGPT for. Here's why this is awesome:

Privacy: Everything runs offline—your homework, code, and creative writing never leave your computer

Cost: $0/month after setup (vs. $20+/month for ChatGPT Plus)

Learning: Understand how AI actually works by running it yourself

Control: Choose your models, adjust settings, no rate limits

Fun: It's like running your own mini data center!


Part 2: Hardware Requirements

Target System: RTX 4060 Ti Desktop

Tested Configuration:

  • GPU: NVIDIA RTX 4060 Ti 16GB VRAM
    • Architecture: Ada Lovelace (AD106)
    • CUDA Cores: 4,352
    • Tensor Cores: 136
    • SMs (Streaming Multiprocessors): 34 ⚠️
    • Memory Bandwidth: 288 GB/s
    • TDP: 160W
  • CPU: Intel Core i5-13400 (10 cores, 16 threads)
  • RAM: 64GB DDR4
  • Storage: 1TB NVMe SSD (100GB+ free for models)
  • OS: Ubuntu 22.04 LTS or Windows 11

Total Cost: ~$1,120-1,250 USD

What You Can Actually Run:

  • 7B models: Lightning fast (150-161 tokens/sec) ⭐⭐⭐⭐⭐
  • 20B models: Very fast (84 tokens/sec) ⭐⭐⭐⭐⭐
  • 30B models: Fast (88-110 tokens/sec with Qwen3-30B Q3_K_M) ⭐⭐⭐⭐⭐
  • 70B models: Possible but slow (3-7 tokens/sec hybrid CPU+GPU) ⭐⭐

⚠️ Important Hardware Limitations:

The RTX 4060 Ti has only 34 SMs (Streaming Multiprocessors), which means:

  • GEMM max-autotune optimization is NOT available on GPUs with fewer than 46 SMs
  • This limitation affects some PyTorch training frameworks (particularly torch.compile with max-autotune mode)
  • For llama.cpp inference: No practical impact — you still get excellent text generation performance
  • The 34 SMs are perfectly sufficient for running local LLMs with llama.cpp

Bottom Line: Don't worry about SM count for local AI usage. The RTX 4060 Ti delivers outstanding performance for inference tasks!


Part 3: Setup Guide (Linux/Ubuntu)

Step 1: Install Prerequisites

bash
# Update system
sudo apt update && sudo apt upgrade -y

# Install build tools
sudo apt install -y build-essential cmake git \
  nvidia-driver-550 nvidia-cuda-toolkit \
  python3 python3-pip jq curl wget

# Verify NVIDIA driver
nvidia-smi

# Install Hugging Face CLI
pip install huggingface-hub

Step 2: Build llama.cpp with CUDA Support

bash
# Clone the repository
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Configure with CUDA and required compiler flags
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  -DCMAKE_C_FLAGS="-fno-finite-math-only" \
  -DCMAKE_CXX_FLAGS="-fno-finite-math-only" \
  -DCMAKE_BUILD_TYPE=Release

# Build (takes 5-10 minutes)
cmake --build build --config Release -j 16

# Verify build succeeded
./build/bin/llama-cli --version

⚠️ Critical Note: The -fno-finite-math-only compiler flag is required for recent llama.cpp versions. Without it, you'll get compilation errors related to non-finite math.
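
If you want to double-check that the CUDA backend actually made it into the binary, one quick check (assuming the CUDA libraries are linked dynamically, which is the default) is to look at the shared libraries the server links against:

bash
# Confirm the server binary links against the CUDA runtime / cuBLAS (CUDA build only)
ldd ~/llama.cpp/build/bin/llama-server | grep -iE 'cuda|cublas'

If nothing is printed, the build most likely fell back to CPU-only; re-run the cmake configure step and confirm -DGGML_CUDA=ON was picked up.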

Step 3: Download AI Models

bash
# Create models directory
mkdir -p ~/llm_models && cd ~/llm_models

# Download Qwen3-30B Q3_K_M (recommended for RTX 4060 Ti)
huggingface-cli download \
  Mungert/Qwen3-30B-A3B-Instruct-2507-GGUF \
  Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  --local-dir qwen3-30b-a3b-instruct-2507-GGUF

# Alternative: Smaller 7B model for maximum speed
huggingface-cli download \
  Qwen/Qwen2.5-7B-Instruct-GGUF \
  qwen2.5-7b-instruct-q4_k_m.gguf \
  --local-dir qwen2.5-7b-GGUF

# Alternative: 20B model for balanced performance
huggingface-cli download \
  unsloth/gpt-oss-20b-GGUF \
  gpt-oss-20b-Q4_K_M.gguf \
  --local-dir gpt-oss-20b-GGUF

Model Sizes:

  • Qwen2.5-7B (Q4_K_M): ~4.5GB download
  • GPT-OSS-20B (Q4_K_M): ~12GB download
  • Qwen3-30B (Q3_K_M): ~14GB download

⚠️ Note on Quantization: This guide tests Q3_K_M for 30B models, which is lighter and faster than Q4_K_M, perfect for 16GB VRAM. Q3 offers ~20% faster inference with minimal quality loss compared to Q4.
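
To confirm the downloads completed and roughly match the sizes above, a simple listing is enough (paths follow the --local-dir values used in the commands above):

bash
# List downloaded GGUF files and their on-disk sizes
ls -lh ~/llm_models/*/*.gguf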


Part 4: Context Size Configuration

Understanding Context Size

Context size determines how much text the AI can "remember" during a conversation:

  • 4K context ≈ 3,000 words
  • 8K context ≈ 6,000 words
  • 12K context ≈ 9,000 words
  • 15K context ≈ 11,500 words
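
The word estimates above are rough (about 0.7-0.75 words per token for English text). If you want an exact count for your own prompt, recent llama-server builds expose a /tokenize endpoint; assuming your build includes it, you can check like this once the server from the next section is running:

bash
# Count exactly how many tokens a prompt will consume
curl -s http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Paste the text you want to measure here."}' \
  | jq '.tokens | length'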

Tested Context Configurations (Qwen3-30B Q3_K_M)

Configuration 1: 4K Context (Maximum Speed)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 4096 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --metrics \
  --jinja
  • Speed: 105-110 tokens/sec
  • VRAM: ~14.7GB
  • Best for: Short conversations, quick queries

Configuration 2: 8K Context (Balanced)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 88.5 tokens/sec
  • VRAM: ~14.9GB
  • Best for: Standard conversations, document analysis

Configuration 3: 12K Context (Large Documents)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 12288 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 88.36 tokens/sec
  • VRAM: ~15.1GB
  • Best for: Long documents, extended conversations

Configuration 4: 14K Context (Extended)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 14336 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 89.42 tokens/sec
  • VRAM: ~15.2GB
  • Best for: Very long documents

Configuration 5: 15K Context (Maximum Stable - Recommended)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 15360 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 87.87 tokens/sec
  • VRAM: ~15.3GB
  • Best for: Maximum context capacity
  • ⭐ Recommended production configuration

Server starts successfully when you see:

llama server listening at http://0.0.0.0:8080

What these flags mean:

  • -ngl 99: Use GPU for all layers (maximum speed)
  • -c 4096/8192/12288/14336/15360: Context window size
  • -b 256: Batch size for processing
  • -ub 64: Ubatch size
  • --threads 12: CPU threads to use
  • --cache-type-k q8_0: KV cache key precision (q8_0 saves VRAM)
  • --cache-type-v q8_0: KV cache value precision
  • --metrics: Enable performance monitoring
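
Before sending requests from scripts, it helps to wait for the model to finish loading. A minimal readiness loop, assuming the default /health endpoint is available in your llama-server build:

bash
# Poll the server until it reports ready (returns HTTP 200 once the model is loaded)
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 1
done
echo "llama-server is ready"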

Part 5: Context Size Performance Comparison

Tested Results (Qwen3-30B Q3_K_M on RTX 4060 Ti)

| Context | KV Cache Type | Speed (t/s) | VRAM Used | Words Capacity | Status |
|---|---|---|---|---|---|
| 4K | f16 | 110.0 | 14.7GB | ~3,000 | ✅ Fastest |
| 8K | q8_0 | 88.5 | 14.9GB | ~6,000 | ✅ Great |
| 12K | q8_0 | 88.36 | 15.1GB | ~9,000 | ✅ Excellent |
| 14K | q8_0 | 89.42 | 15.2GB | ~10,700 | ✅ Outstanding |
| 15K | q8_0 | 87.87 | 15.3GB | ~11,500 | Maximum |
| 16K | q8_0 | N/A | 15.8GB | ~12,000 | ❌ OOM |

Key Findings:

  • 4K with f16 cache gives maximum speed (110 t/s)
  • 8K-15K with q8_0 cache maintains 88-89 t/s (excellent!)
  • 15K is the absolute maximum stable context on 16GB VRAM
  • 16K causes out-of-memory errors during Flash Attention warmup

Recommendation: Use 15K context for production—it maximizes capacity while maintaining 88 t/s speed.


Part 6: Using the OpenAI-Compatible API

Your local server provides an OpenAI-compatible API at http://localhost:8080.

Check Available Models

bash
curl http://localhost:8080/v1/models | jq

Response:

json
{
  "models": [{
    "name": "/home/berdachuk/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf",
    "type": "model",
    "format": "gguf"
  }],
  "data": [{
    "id": "...",
    "meta": {
      "n_params": 30532122624,
      "size": 15150649344,
      "n_ctx_train": 262144
    }
  }]
}
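
The meta block is useful for sanity checks; for example, you can read the model's trained context window directly (field names as in the response above):

bash
# Print the model's training context length (n_ctx_train)
curl -s http://localhost:8080/v1/models | jq '.data[0].meta.n_ctx_train'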

Example 1: Simple Sentiment Analysis (JSON Output)

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Respond with JSON only."},
      {"role": "user", "content": "I loved the service but disliked the price."}
    ],
    "max_tokens": 80
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\n  \"sentiment\": \"mixed\",\n  \"feedback\": \"I loved the service but disliked the price.\",\n  \"rating\": 3\n}"
    }
  }],
  "usage": {
    "completion_tokens": 32,
    "prompt_tokens": 27,
    "total_tokens": 59
  },
  "timings": {
    "prompt_ms": 131.94,
    "prompt_per_second": 204.64,
    "predicted_ms": 297.13,
    "predicted_per_second": 107.70
  }
}

Performance: 107.7 tokens/sec generation, ~300ms total response time
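
The speed figures quoted in this guide come straight from the timings block that llama-server adds to each response; you can pull them out with jq instead of reading the full JSON:

bash
# Re-run the request and extract only the speed metrics from the timings block
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Respond with JSON only."},
      {"role": "user", "content": "I loved the service but disliked the price."}
    ],
    "max_tokens": 80
  }' | jq '.timings | {prompt_per_second, predicted_per_second}'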


Example 2: Information Extraction

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Output valid JSON."},
      {"role": "user", "content": "Jane, 17, likes Python. Email: jane@mail.com"}
    ],
    "max_tokens": 128
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\n  \"name\": \"Jane\",\n  \"age\": 17,\n  \"interests\": [\"Python\"],\n  \"email\": \"jane@mail.com\"\n}"
    }
  }],
  "timings": {
    "prompt_per_second": 505.66,
    "predicted_per_second": 110.74
  }
}

Performance: 110.7 tokens/sec, ~360ms total
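
If you want the server to enforce valid JSON rather than relying on the system prompt alone, llama-server's OpenAI-compatible endpoint also accepts a response_format field (json_object; newer builds also support json_schema). Support depends on your llama.cpp version, so treat this as an optional extra:

bash
# Ask the server to constrain output to valid JSON (build-dependent feature)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Jane, 17, likes Python. Email: jane@mail.com"}
    ],
    "response_format": {"type": "json_object"},
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'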


Example 3: Math Problem Solving

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Show math steps, JSON format."},
      {"role": "user", "content": "If I buy 3 apples at $2 and 2 at $3, total?"}
    ],
    "max_tokens": 100
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\"steps\": [{\"operation\": \"Calculate cost of 3 apples at $2 each\", \"expression\": \"3 * 2\", \"result\": 6}, {\"operation\": \"Calculate cost of 2 apples at $3 each\", \"expression\": \"2 * 3\", \"result\": 6}, {\"operation\": \"Add both costs\", \"expression\": \"6 + 6\", \"result\": 12}], \"total\": 12}"
    }
  }],
  "timings": {
    "predicted_per_second": 109.56
  }
}

Performance: 109.6 tokens/sec


Example 4: Aspect-Based Sentiment

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Output sentiment per aspect as JSON."},
      {"role": "user", "content": "Battery great. Screen poor. Price fair."}
    ],
    "max_tokens": 120
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\n  \"battery\": \"positive\",\n  \"screen\": \"negative\",\n  \"price\": \"neutral\"\n}"
    }
  }],
  "timings": {
    "predicted_per_second": 108.18
  }
}

Performance: 108.2 tokens/sec


Example 5: Code Generation

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to check if a number is prime"}
    ],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'

Response:

python
def is_prime(n):
    """Check if a number is prime."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

Performance: ~110 tokens/sec


Example 6: Document Summarization

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Summarize in JSON format."},
      {"role": "user", "content": "The Industrial Revolution transformed manufacturing from handmade goods to machine production. It began in Britain in the late 1700s with textile mills and steam engines. The revolution spread across Europe and America, creating urban centers and new social classes. While it increased productivity, it also led to poor working conditions and pollution.\n\nSummarize as JSON with key_points and impacts."}
    ],
    "max_tokens": 150
  }' | jq

Performance: ~88-105 tokens/sec for summaries


Part 7: Verified Performance Benchmarks

RTX 4060 Ti 16GB Desktop Results (Confirmed Testing)

Test System:

  • GPU: RTX 4060 Ti 16GB (34 SMs)
  • CPU: Intel i5-13400
  • RAM: 64GB DDR4
  • Framework: llama.cpp (November 2025)
  • Model: Qwen3-30B-A3B-Instruct Q3_K_M ⚠️

Real Testing Results:

| Task | Context | Tokens Generated | Speed (t/s) | Total Time |
|---|---|---|---|---|
| Sentiment Analysis | 4K | 32 | 107.7 | ~300ms |
| Info Extraction | 4K | 40 | 110.7 | ~360ms |
| Math Reasoning | 4K | 95 | 109.6 | ~870ms |
| Aspect Sentiment | 4K | 24 | 108.2 | ~220ms |
| Code Generation | 4K | 150 | 110.0 | ~1.4s |
| Long Document | 8K | 249 | 88.5 | ~2.8s |
| Extended Doc | 12K | 658 | 88.36 | ~7.4s |
| Very Long Doc | 14K | 249 | 89.42 | ~2.8s |
| Maximum Context | 15K | 649 | 87.87 | ~7.4s |

Average Generation Speed: 87-110 tokens/second (depending on context)

Prompt Processing Speed: 81-505 tokens/second


Verified Performance by Model Size (RTX 4060 Ti 16GB)

| Model Size | Quantization | Speed (t/s) | VRAM Used | All GPU? | Status |
|---|---|---|---|---|---|
| 7B | Q4_K_M | 150-161 | ~5GB | ✅ Yes | ⭐⭐⭐⭐⭐ Excellent |
| 13B | Q4_K_M | 90-110 | ~8GB | ✅ Yes | ⭐⭐⭐⭐⭐ Great |
| 20B | Q4_K_M | 84 | ~13GB | ✅ Yes | ⭐⭐⭐⭐⭐ Fast |
| 30B | Q3_K_M ⚠️ | 88-110 | ~15GB | Yes | ⭐⭐⭐⭐⭐ Excellent |
| 30B | Q4_K_M | 35-45 | ~17GB | ⚠️ Tight | ⭐⭐⭐ Limited context |
| 70B | Q4_K_M | 3-7 | 40GB+ | ❌ No | ⭐⭐ Hybrid only |

⚠️ Important Note: The 30B results above use Q3_K_M quantization, which is optimized for 16GB VRAM. Q3 provides ~20% faster speed than Q4 with minimal quality loss.

Key Findings:

  • Sweet spot: 7B-30B models (all fit in VRAM with excellent speed)
  • 30B Q3_K_M is optimal for RTX 4060 Ti (confirmed 88-110 t/s)
  • 30B Q4_K_M is limited on 16GB (only 4K-8K context, 35-45 t/s)
  • 70B models require hybrid CPU+GPU (slow, 3-7 t/s only)

Part 8: Performance Comparison with Other Hardware

Hardware Specifications

RTX 4060 Ti 16GB (Tested):

  • CUDA Cores: 4,352
  • SMs: 34
  • VRAM: 16GB GDDR6
  • Bandwidth: 288 GB/s
  • TDP: 160W
  • Price: ~$460 GPU / ~$1,250 complete system

RTX 5090 32GB (Reference):

  • Architecture: Blackwell (GB202)
  • CUDA Cores: 21,760 (5x more)
  • SMs: 170 (5x more)
  • VRAM: 32GB GDDR7
  • Bandwidth: 1,792 GB/s (6.2x faster)
  • TDP: 575W
  • Price: ~$1,999 GPU / ~$3,200 complete system

NVIDIA DGX Spark GB10 (Reference):

  • Architecture: Grace Blackwell Superchip
  • CPU: 20-core ARM Grace
  • GPU: Blackwell GPU (6,144 CUDA cores)
  • Memory: 128GB LPDDR5X unified
  • Bandwidth: 273 GB/s (unified)
  • TDP: 240W
  • Price: $3,999 (complete system)
  • Special: FP4 hardware acceleration

Fair Comparison: GPT-OSS-20B (Q4_K_M) - All Same Quantization

| System | Speed (t/s) | Prefill (t/s) | VRAM Used | Power | Price |
|---|---|---|---|---|---|
| RTX 4060 Ti | 84 | 850 | 13GB | 160W | $1,250 |
| RTX 5090 | 156 | 1,500+ | 13GB | 450W | $3,200 |
| DGX Spark | 49.7 | 2,053 | 13GB | 180W | $3,999 |

All tested with same Q4_K_M quantization - fair comparison!

Performance vs Cost:

  • RTX 4060 Ti: $14.88 per token/sec
  • RTX 5090: $20.51 per token/sec (1.86x faster, 2.56x price)
  • DGX Spark: $80.48 per token/sec (0.59x speed, 3.2x price)

Winner: RTX 4060 Ti has best value for 20B models.
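
These value figures are simply the complete-system price divided by the measured 20B generation speed; a quick check (small differences in the last cent come from rounding the measured speeds):

bash
# Price per (token/sec) = system price / measured 20B generation speed
awk 'BEGIN { printf "RTX 4060 Ti: $%.2f per t/s\n", 1250 / 84   }'
awk 'BEGIN { printf "RTX 5090:    $%.2f per t/s\n", 3200 / 156  }'
awk 'BEGIN { printf "DGX Spark:   $%.2f per t/s\n", 3999 / 49.7 }'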


Qwen3-30B Comparison (Mixed Quantizations ⚠️)

| System | Quantization | Speed (t/s) | VRAM | Notes |
|---|---|---|---|---|
| RTX 4060 Ti | Q3_K_M ⚠️ | 88-110 | 15GB | Tested ✓ |
| RTX 5090 | Q4_K_M | 213 | 17GB | Public benchmark |
| DGX Spark | Q4_K_M | 45-50 | 17GB | Public benchmark |

⚠️ Important: This comparison uses different quantizations:

  • RTX 4060 Ti uses Q3_K_M (lighter, faster, fits in 16GB)
  • Others use Q4_K_M (heavier, slightly better quality)
  • Q3 is ~20% faster than Q4 but not directly comparable
  • Direct speed comparison is not apples-to-apples

Key Takeaway: RTX 4060 Ti achieves excellent 30B performance by using Q3_K_M, which is the right choice for 16GB VRAM.


Qwen2.5-7B Comparison (Q4_K_M) - Fair Comparison

| System | Speed (t/s) | Prefill (t/s) | VRAM |
|---|---|---|---|
| RTX 4060 Ti | 150-161 | 620 | ~5GB |
| RTX 5090 | 213 | 1,200+ | ~5GB |
| DGX Spark | 95 | 1,200+ | ~5GB |

All tested with same Q4_K_M quantization

Winner: RTX 5090 (but all are fast enough for 7B models)


70B Model Comparison

| System | Can Run? | Speed (t/s) | Notes |
|---|---|---|---|
| RTX 4060 Ti | ⚠️ Hybrid only | 3-7 | CPU+GPU offload, slow |
| RTX 5090 | ⚠️ Tight | 15-25 | Barely fits in 32GB |
| DGX Spark | ✅ Yes | 25-35 | 128GB unified memory |

Winner: DGX Spark (only practical option for 70B+)


120B Model Comparison

| System | Can Run? | Speed (t/s) | Notes |
|---|---|---|---|
| RTX 4060 Ti | ❌ No | N/A | Insufficient VRAM |
| RTX 5090 | ❌ No | N/A | Needs 65GB+ |
| DGX Spark | ✅ Yes | 31-38 | FP4 precision only |

Winner: DGX Spark (only option for 120B+)


Performance Summary Table

| Metric | RTX 4060 Ti | RTX 5090 | DGX Spark |
|---|---|---|---|
| Price | $1,250 | $3,200 | $3,999 |
| 7B Speed | 150-161 t/s | 213 t/s | 95 t/s |
| 20B Speed (Q4) | 84 t/s | 156 t/s | 49.7 t/s |
| 30B Speed | 88-110 t/s (Q3) ⚠️ | 213 t/s (Q4) | 45-50 t/s (Q4) |
| 70B Speed | 3-7 t/s | 15-25 t/s | 25-35 t/s |
| 120B Speed | ❌ | ❌ | 31-38 t/s |
| Max Practical | 30B (Q3) | 70B | 200B |
| Max Context (30B) | 15K | 32K+ | 32K+ |
| Power Draw | 160W | 575W | 240W |
| Value (20B Q4) | $14.88/t/s | $20.51/t/s | $80.48/t/s |

⚠️ Note: 30B comparison uses different quantizations (Q3 vs Q4). For fair comparison, see 20B Q4 benchmarks.


Part 9: Value Analysis & Recommendations

Cost Breakdown

| Component | RTX 4060 Ti Build | RTX 5090 Build | DGX Spark |
|---|---|---|---|
| GPU/System | $460 | $1,999 | $3,999 (complete) |
| CPU | $185 (i5-13400) | $300 (i7-14700K) | Included |
| RAM | $140 (64GB DDR4) | $240 (128GB DDR5) | Included (128GB) |
| Motherboard | $125 | $200 | Included |
| Storage | $75 (1TB) | $150 (2TB) | Included (4TB) |
| PSU | $70 (650W) | $180 (1200W) | Included |
| Case | $65 | $100 | Included |
| Total | $1,120 | $3,169 | $3,999 |

Who Should Buy What?

Choose RTX 4060 Ti if:

  • ✅ Budget: $1,000-1,500
  • ✅ Primary use: 7B-30B models (covers 95% of use cases)
  • ✅ Priority: Best value for money ($14.88 per token/sec on 20B)
  • ✅ Use case: Learning, homework, coding, personal projects
  • ✅ Power efficiency: 160W vs 575W
  • ✅ Context needs: Up to 15K tokens (11,500 words)
  • ✅ Willing to use Q3 quantization for 30B models
  • Best for: Students, hobbyists, developers, most users

Choose RTX 5090 if:

  • ✅ Budget: $3,000-4,000
  • ✅ Primary use: 30B-70B models with Q4+ quantization
  • ✅ Priority: Maximum speed (1.86x faster on 20B)
  • ✅ Use case: Professional content creation, commercial development
  • ✅ Need: Larger contexts (32K+) and heavier quantizations
  • ✅ Want to run 70B models at usable speeds
  • Best for: Professionals, power users, content creators

Choose DGX Spark if:

  • ✅ Budget: $4,000+
  • ✅ Primary use: 70B-200B models
  • ✅ Priority: Model capacity over raw speed
  • ✅ Use case: Research, model development, enterprise
  • ✅ Special: Only system that runs 120B+ models locally
  • ✅ Need unified memory architecture
  • Best for: Researchers, AI developers, small research teams

The Verdict for Most Users

90% of users should choose RTX 4060 Ti:

Why?

  1. Sufficient for most models: 7B-30B covers nearly all practical use cases
  2. Outstanding value: $14.88 per token/sec (best in class)
  3. Confirmed performance: 88-110 t/s on 30B Q3, 84 t/s on 20B Q4
  4. Maximum context: 15K tokens (11,500 words) proven stable
  5. Lower total cost: $1,950-2,750 less than competitors
  6. Efficient: 3.6x less power than RTX 5090
  7. Practical: Models load in seconds, responses feel instant

When to upgrade:

  • You need Q4+ quantization for 30B models regularly (→ RTX 5090)
  • You frequently work with 70B models (→ RTX 5090)
  • You need 120B+ models (→ DGX Spark)
  • You need 32K+ context regularly (→ RTX 5090 or DGX Spark)
  • You're a professional who bills clients (→ RTX 5090)
  • Budget is unlimited (→ RTX 5090 or DGX Spark)

Part 10: Python Usage Example

python
from openai import OpenAI

# Point to your local server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Simple chat
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing simply"}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)

# JSON structured output
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "Output valid JSON only."},
        {"role": "user", "content": "Extract: John, 25, loves coding"}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)

Part 11: Optimization Tips

Context Size Selection Guide

Choose based on your use case:

| Use Case | Recommended Context | Speed | VRAM |
|---|---|---|---|
| Quick Q&A | 4K | 110 t/s | 14.7GB |
| Standard Chat | 8K | 88.5 t/s | 14.9GB |
| Document Analysis | 12K | 88.36 t/s | 15.1GB |
| Long Documents | 14K | 89.42 t/s | 15.2GB |
| Maximum Capacity | 15K | 87.87 t/s | 15.3GB |

If You Get "Out of Memory" Errors

Option 1: Use quantized KV cache

bash
--cache-type-k q8_0 --cache-type-v q8_0

Saves ~50% KV cache VRAM with minimal quality loss.

Option 2: Reduce context size

bash
-c 12288  # Instead of 15360

Option 3: Lower batch size

bash
-b 128 -ub 32  # Instead of -b 256 -ub 64

Getting Maximum Speed

1. Verify GPU usage:

bash
nvidia-smi
# Should show 85-98% GPU utilization

2. Check all layers are offloaded:

bash
# In server output, look for:
# "llm_load_tensors: offloaded 49/49 layers to GPU"

3. Monitor VRAM:

bash
watch -n 1 nvidia-smi
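
For a more compact readout than the full nvidia-smi table, the query form below prints just utilization, VRAM, and power once per second (standard nvidia-smi query fields; adjust to taste):

bash
# Compact per-second telemetry: GPU utilization, VRAM used, power draw
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
  --format=csv -l 1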

Part 12: Troubleshooting

"Out of Memory" Error

Solution:

bash
# Use smaller context
-c 12288  # Instead of 15360

# Or use quantized KV cache
--cache-type-k q8_0 --cache-type-v q8_0

# Or lower batch size
-b 128 -ub 32

"Very Slow Performance"

Check:

bash
# 1. GPU utilization
nvidia-smi

# 2. All layers on GPU
# Look for "offloaded 49/49 layers" in server output

# 3. CUDA is being used
# Build should show "CUDA support enabled"

"Port Already in Use"

bash
# Use different port
--port 8081

# Or kill existing process
sudo lsof -ti:8080 | xargs kill -9

KV Cache Quantization Requires Flash Attention

Error: V cache quantization requires flash_attn

Solution: You cannot disable Flash Attention while using the q8_0 KV cache. Either:

  • Use the f16 cache instead (uses more VRAM), or
  • Keep Flash Attention enabled (recommended)

Conclusion

Key Takeaways

RTX 4060 Ti 16GB:

  • Confirmed Performance: 88-110 tokens/sec on 30B Q3, 84 t/s on 20B Q4
  • Maximum Context: 15K tokens (11,500 words) proven stable
  • Best Value: $14.88 per token/sec (20B Q4 benchmark)
  • Perfect for 95% of use cases: 7B-30B models run excellently
  • Smart Quantization: Q3_K_M for 30B enables full GPU usage
  • Practical Limitation: 70B models only work at 3-7 t/s (hybrid mode)
  • Hardware Note: 34 SMs means no max-autotune GEMM, but doesn't affect llama.cpp
  • Recommended for students, hobbyists, developers

RTX 5090 32GB:

  • 1.86x faster on 20B Q4 (156 vs 84 t/s)
  • Can handle 70B models (15-25 t/s)
  • 32K+ context capacity
  • Supports heavier quantizations (Q4+)
  • 2.56x more expensive
  • Recommended for professionals only

DGX Spark:

  • Only system for 120B+ models
  • Slower than RTX for small-medium models
  • 128GB unified memory
  • Most expensive ($3,999)
  • Recommended for researchers only

Final Recommendation

Start with RTX 4060 Ti if:

  • Budget under $1,500
  • You're learning AI/ML
  • You use 7B-30B models (which is 95% of people)
  • You want best value for money
  • 15K context is sufficient
  • Q3 quantization quality is acceptable

You can always upgrade later if you outgrow it!


You now have everything needed to run powerful AI locally on your RTX 4060 Ti with optimized context settings. Start building! 🚀


Guide created November 2025
Based on confirmed testing with RTX 4060 Ti 16GB + i5-13400 + 64GB RAM
All RTX 4060 Ti performance numbers verified through real-world testing
Primary test model: Qwen3-30B-A3B-Instruct-2507 (Q3_K_M, 30.5B parameters, 14.1GB)
Maximum stable context: 15,360 tokens @ 87.87 tokens/sec
Comparison benchmarks: RTX 5090 and DGX Spark data from public sources

Published on 11/20/2025