
Running Local AI on Your Gaming PC: Complete Guide for RTX 4060 Ti
Turn your RTX 4060 Ti into an AI powerhouse for homework, coding, and creative projects—completely free and private!
What You'll Learn
- Why run AI locally on your own PC
- Hardware setup and requirements
- Installing and configuring llama.cpp
- Running real AI models with OpenAI-compatible API
- Practical curl examples for structured-output (SGR, Schema-Guided Reasoning) tasks
- Context size optimization and real-world testing
- Performance benchmarks: RTX 4060 Ti vs RTX 5090 vs DGX Spark
- Tips and troubleshooting
Part 1: Why Local AI Matters
Your Gaming PC = Personal AI Server
If you have a modern gaming PC with an RTX 4060 Ti, you already own hardware capable of running AI models that rival ChatGPT. Here's why this is awesome:
Privacy: Everything runs offline—your homework, code, and creative writing never leave your computer
Cost: Free to run on your own hardware (vs. $20+/month for ChatGPT Plus)
Learning: Understand how AI actually works by running it yourself
Control: Choose your models, adjust settings, no rate limits
Fun: It's like running your own mini data center!
Part 2: Hardware Requirements
Target System: RTX 4060 Ti Desktop
Tested Configuration:
- GPU: NVIDIA RTX 4060 Ti 16GB VRAM
- Architecture: Ada Lovelace (AD106)
- CUDA Cores: 4,352
- Tensor Cores: 136
- SMs (Streaming Multiprocessors): 34 ⚠️
- Memory Bandwidth: 288 GB/s
- TDP: 160W
- CPU: Intel Core i5-13400 (10 cores, 16 threads)
- RAM: 64GB DDR4
- Storage: 1TB NVMe SSD (100GB+ free for models)
- OS: Ubuntu 22.04 LTS or Windows 11
Total Cost: ~$1,120-1,250 USD
What You Can Actually Run:
- 7B models: Lightning fast (150-161 tokens/sec) ⭐⭐⭐⭐⭐
- 20B models: Very fast (84 tokens/sec) ⭐⭐⭐⭐⭐
- 30B models: Fast (88-110 tokens/sec with Qwen3-30B Q3_K_M) ⭐⭐⭐⭐⭐
- 70B models: Possible but slow (3-7 tokens/sec hybrid CPU+GPU) ⭐⭐
⚠️ Important Hardware Limitations:
The RTX 4060 Ti has only 34 SMs (Streaming Multiprocessors), which means:
- GEMM max-autotune optimization is NOT available on GPUs with fewer than 46 SMs
- This limitation affects some PyTorch training frameworks (particularly torch.compile with max-autotune mode)
- For llama.cpp inference: No practical impact — you still get excellent text generation performance
- The 34 SMs are perfectly sufficient for running local LLMs with llama.cpp
Bottom Line: Don't worry about SM count for local AI usage. The RTX 4060 Ti delivers outstanding performance for inference tasks!
Part 3: Setup Guide (Linux/Ubuntu)
Step 1: Install Prerequisites
# Update system
sudo apt update && sudo apt upgrade -y
# Install build tools
sudo apt install -y build-essential cmake git \
nvidia-driver-550 nvidia-cuda-toolkit \
python3 python3-pip jq curl wget
# Verify NVIDIA driver
nvidia-smi
# Install Hugging Face CLI
pip install huggingface-hub
Step 2: Build llama.cpp with CUDA Support
# Clone the repository
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Configure with CUDA and required compiler flags
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=89 \
-DCMAKE_C_FLAGS="-fno-finite-math-only" \
-DCMAKE_CXX_FLAGS="-fno-finite-math-only" \
-DCMAKE_BUILD_TYPE=Release
# Build (takes 5-10 minutes)
cmake --build build --config Release -j 16
# Verify build succeeded
./build/bin/llama-cli --version
⚠️ Critical Note: The -fno-finite-math-only compiler flag is required for recent llama.cpp versions. Without it, you'll get compilation errors related to non-finite math operations.
Step 3: Download AI Models
# Create models directory
mkdir -p ~/llm_models && cd ~/llm_models
# Download Qwen3-30B Q3_K_M (recommended for RTX 4060 Ti)
huggingface-cli download \
Mungert/Qwen3-30B-A3B-Instruct-2507-GGUF \
Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
--local-dir qwen3-30b-a3b-instruct-2507-GGUF
# Alternative: Smaller 7B model for maximum speed
huggingface-cli download \
Qwen/Qwen2.5-7B-Instruct-GGUF \
qwen2.5-7b-instruct-q4_k_m.gguf \
--local-dir qwen2.5-7b-GGUF
# Alternative: 20B model for balanced performance
huggingface-cli download \
unsloth/gpt-oss-20b-GGUF \
gpt-oss-20b-Q4_K_M.gguf \
--local-dir gpt-oss-20b-GGUF
Model Sizes:
- Qwen2.5-7B (Q4_K_M): ~4.5GB download
- GPT-OSS-20B (Q4_K_M): ~12GB download
- Qwen3-30B (Q3_K_M): ~14GB download
⚠️ Note on Quantization: This guide tests Q3_K_M for 30B models, which is lighter and faster than Q4_K_M, perfect for 16GB VRAM. Q3 offers ~20% faster inference with minimal quality loss compared to Q4.
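You can sanity-check a download size before fetching it. A minimal sketch: file size ≈ parameters × bits-per-weight / 8. The bits-per-weight figures below are rough averages for k-quant mixes (roughly 3.9 bpw for Q3_K_M, 4.8 bpw for Q4_K_M), not exact values, so treat the result as a ballpark only.

```python
# Rough GGUF file-size estimate: parameters x bits-per-weight / 8.
# The bpw numbers are approximations for k-quant mixes, not exact.
GiB = 1024 ** 3

def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GiB for a given quantization."""
    return n_params * bits_per_weight / 8 / GiB

# Qwen3-30B (30.5B params) at ~3.9 bpw lands near the ~14GB download above
print(round(gguf_size_gib(30.5e9, 3.9), 1))
```

The same formula puts a 7.6B model at Q4_K_M (~4.8 bpw) near 4.3 GiB, in line with the ~4.5GB download listed above.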
Part 4: Context Size Configuration
Understanding Context Size
Context size determines how much text the AI can "remember" during a conversation:
- 4K context ≈ 3,000 words
- 8K context ≈ 6,000 words
- 12K context ≈ 9,000 words
- 15K context ≈ 11,500 words
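The word capacities above follow from a common rule of thumb: English text averages roughly 0.75 words per token. A tiny helper makes the conversion explicit (the 0.75 factor is an approximation; actual ratios vary by tokenizer and text):

```python
def words_capacity(n_ctx: int, words_per_token: float = 0.75) -> int:
    """Rough English word capacity of a context window (~0.75 words/token)."""
    return int(n_ctx * words_per_token)

for n_ctx in (4096, 8192, 12288, 15360):
    print(n_ctx, words_capacity(n_ctx))
```

This reproduces the figures above: 4096 tokens ≈ 3,072 words, 15360 tokens ≈ 11,520 words.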
Tested Context Configurations (Qwen3-30B Q3_K_M)
Configuration 1: 4K Context (Maximum Speed)
~/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 --port 8080 \
-m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
-ngl 99 \
-c 4096 \
-b 256 \
-ub 64 \
--threads 12 \
--cache-type-k f16 \
--cache-type-v f16 \
--metrics \
--jinja
- Speed: 105-110 tokens/sec
- VRAM: ~14.7GB
- Best for: Short conversations, quick queries
Configuration 2: 8K Context (Balanced)
~/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 --port 8080 \
-m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
-ngl 99 \
-c 8192 \
-b 256 \
-ub 64 \
--threads 12 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--metrics \
--jinja
- Speed: 88.5 tokens/sec
- VRAM: ~14.9GB
- Best for: Standard conversations, document analysis
Configuration 3: 12K Context (Large Documents)
~/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 --port 8080 \
-m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
-ngl 99 \
-c 12288 \
-b 256 \
-ub 64 \
--threads 12 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--metrics \
--jinja
- Speed: 88.36 tokens/sec
- VRAM: ~15.1GB
- Best for: Long documents, extended conversations
Configuration 4: 14K Context (Extended)
~/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 --port 8080 \
-m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
-ngl 99 \
-c 14336 \
-b 256 \
-ub 64 \
--threads 12 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--metrics \
--jinja
- Speed: 89.42 tokens/sec
- VRAM: ~15.2GB
- Best for: Very long documents
Configuration 5: 15K Context (Maximum Stable - Recommended)
~/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 --port 8080 \
-m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
-ngl 99 \
-c 15360 \
-b 256 \
-ub 64 \
--threads 12 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--metrics \
--jinja
- Speed: 87.87 tokens/sec
- VRAM: ~15.3GB
- Best for: Maximum context capacity
- ⭐ Recommended production configuration
Server starts successfully when you see:
llama server listening at http://0.0.0.0:8080
What these flags mean:
- -ngl 99: Offload all layers to the GPU (maximum speed)
- -c 4096/8192/12288/14336/15360: Context window size
- -b 256: Batch size for processing
- -ub 64: Micro-batch (ubatch) size
- --threads 12: CPU threads to use
- --cache-type-k q8_0: KV cache key precision (q8_0 saves VRAM)
- --cache-type-v q8_0: KV cache value precision
- --metrics: Enable performance monitoring
- --jinja: Use the model's built-in Jinja chat template
Part 5: Context Size Performance Comparison
Tested Results (Qwen3-30B Q3_K_M on RTX 4060 Ti)
| Context | KV Cache Type | Speed (t/s) | VRAM Used | Words Capacity | Status |
|---|---|---|---|---|---|
| 4K | f16 | 110.0 | 14.7GB | ~3,000 | ✅ Fastest |
| 8K | q8_0 | 88.5 | 14.9GB | ~6,000 | ✅ Great |
| 12K | q8_0 | 88.36 | 15.1GB | ~9,000 | ✅ Excellent |
| 14K | q8_0 | 89.42 | 15.2GB | ~10,700 | ✅ Outstanding |
| 15K | q8_0 | 87.87 | 15.3GB | ~11,500 | ✅ Maximum |
| 16K | q8_0 | N/A | 15.8GB | ~12,000 | ❌ OOM |
Key Findings:
- 4K with f16 cache gives maximum speed (110 t/s)
- 8K-15K with q8_0 cache maintains 88-89 t/s (excellent!)
- 15K is the absolute maximum stable context on 16GB VRAM
- 16K causes out-of-memory errors during Flash Attention warmup
Recommendation: Use 15K context for production—it maximizes capacity while maintaining 88 t/s speed.
Part 6: Using the OpenAI-Compatible API
Your local server provides an OpenAI-compatible API at http://localhost:8080.
Check Available Models
curl http://localhost:8080/v1/models | jq
Response:
{
"models": [{
"name": "/home/berdachuk/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf",
"type": "model",
"format": "gguf"
}],
"data": [{
"id": "...",
"meta": {
"n_params": 30532122624,
"size": 15150649344,
"n_ctx_train": 262144
}
}]
}
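You can pull the useful numbers out of that payload with a few lines of stdlib Python. This sketch works on the response shown above (against a live server you would fetch the JSON with urllib.request.urlopen("http://localhost:8080/v1/models") instead of hardcoding it):

```python
# Summarize the /v1/models payload shown above.
payload = {
    "data": [{
        "id": "qwen3-30b",
        "meta": {
            "n_params": 30532122624,
            "size": 15150649344,
            "n_ctx_train": 262144,
        },
    }]
}

def summarize(payload: dict) -> str:
    """Turn the model metadata into a one-line summary."""
    meta = payload["data"][0]["meta"]
    params_b = meta["n_params"] / 1e9      # parameters in billions
    size_gib = meta["size"] / 1024 ** 3    # file size in GiB
    return f"{params_b:.1f}B params, {size_gib:.1f} GiB, {meta['n_ctx_train']} ctx (train)"

print(summarize(payload))  # → 30.5B params, 14.1 GiB, 262144 ctx (train)
```

Note that n_ctx_train (262144) is the context the model was trained with; the server's usable context is whatever you set with -c, which is 15K here due to VRAM.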
Example 1: Simple Sentiment Analysis (JSON Output)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "Respond with JSON only."},
{"role": "user", "content": "I loved the service but disliked the price."}
],
"max_tokens": 80
}' | jq
Response:
{
"choices": [{
"message": {
"content": "{\n \"sentiment\": \"mixed\",\n \"feedback\": \"I loved the service but disliked the price.\",\n \"rating\": 3\n}"
}
}],
"usage": {
"completion_tokens": 32,
"prompt_tokens": 27,
"total_tokens": 59
},
"timings": {
"prompt_ms": 131.94,
"prompt_per_second": 204.64,
"predicted_ms": 297.13,
"predicted_per_second": 107.70
}
}
Performance: 107.7 tokens/sec generation, ~300ms generation time (prompt processing adds ~130ms)
Example 2: Information Extraction
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "Output valid JSON."},
{"role": "user", "content": "Jane, 17, likes Python. Email: jane@mail.com"}
],
"max_tokens": 128
}' | jq
Response:
{
"choices": [{
"message": {
"content": "{\n \"name\": \"Jane\",\n \"age\": 17,\n \"interests\": [\"Python\"],\n \"email\": \"jane@mail.com\"\n}"
}
}],
"timings": {
"prompt_per_second": 505.66,
"predicted_per_second": 110.74
}
}
Performance: 110.7 tokens/sec, ~360ms generation time
Example 3: Math Problem Solving
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "Show math steps, JSON format."},
{"role": "user", "content": "If I buy 3 apples at $2 and 2 at $3, total?"}
],
"max_tokens": 100
}' | jq
Response:
{
"choices": [{
"message": {
"content": "{\"steps\": [{\"operation\": \"Calculate cost of 3 apples at $2 each\", \"expression\": \"3 * 2\", \"result\": 6}, {\"operation\": \"Calculate cost of 2 apples at $3 each\", \"expression\": \"2 * 3\", \"result\": 6}, {\"operation\": \"Add both costs\", \"expression\": \"6 + 6\", \"result\": 12}], \"total\": 12}"
}
}],
"timings": {
"predicted_per_second": 109.56
}
}
Performance: 109.6 tokens/sec
Example 4: Aspect-Based Sentiment
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "Output sentiment per aspect as JSON."},
{"role": "user", "content": "Battery great. Screen poor. Price fair."}
],
"max_tokens": 120
}' | jq
Response:
{
"choices": [{
"message": {
"content": "{\n \"battery\": \"positive\",\n \"screen\": \"negative\",\n \"price\": \"neutral\"\n}"
}
}],
"timings": {
"predicted_per_second": 108.18
}
}
Performance: 108.2 tokens/sec
Example 5: Code Generation
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write a Python function to check if a number is prime"}
],
"max_tokens": 200
}' | jq '.choices[0].message.content'
Response:
def is_prime(n):
    """Check if a number is prime."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True
Performance: ~110 tokens/sec
Example 6: Document Summarization
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "Summarize in JSON format."},
{"role": "user", "content": "The Industrial Revolution transformed manufacturing from handmade goods to machine production. It began in Britain in the late 1700s with textile mills and steam engines. The revolution spread across Europe and America, creating urban centers and new social classes. While it increased productivity, it also led to poor working conditions and pollution.\n\nSummarize as JSON with key_points and impacts."}
],
"max_tokens": 150
}' | jq
Performance: ~88-105 tokens/sec for summaries
Part 7: Verified Performance Benchmarks
RTX 4060 Ti 16GB Desktop Results (Confirmed Testing)
Test System:
- GPU: RTX 4060 Ti 16GB (34 SMs)
- CPU: Intel i5-13400
- RAM: 64GB DDR4
- Framework: llama.cpp (November 2025)
- Model: Qwen3-30B-A3B-Instruct Q3_K_M ⚠️
Real Testing Results:
| Task | Context | Tokens Generated | Speed (t/s) | Total Time |
|---|---|---|---|---|
| Sentiment Analysis | 4K | 32 | 107.7 | ~300ms |
| Info Extraction | 4K | 40 | 110.7 | ~360ms |
| Math Reasoning | 4K | 95 | 109.6 | ~870ms |
| Aspect Sentiment | 4K | 24 | 108.2 | ~220ms |
| Code Generation | 4K | 150 | 110.0 | ~1.4s |
| Long Document | 8K | 249 | 88.5 | ~2.8s |
| Extended Doc | 12K | 658 | 88.36 | ~7.4s |
| Very Long Doc | 14K | 249 | 89.42 | ~2.8s |
| Maximum Context | 15K | 649 | 87.87 | ~7.4s |
Average Generation Speed: 87-110 tokens/second (depending on context)
Prompt Processing Speed: 81-505 tokens/second
Verified Performance by Model Size (RTX 4060 Ti 16GB)
| Model Size | Quantization | Speed (t/s) | VRAM Used | All GPU? | Status |
|---|---|---|---|---|---|
| 7B | Q4_K_M | 150-161 | ~5GB | ✅ Yes | ⭐⭐⭐⭐⭐ Excellent |
| 13B | Q4_K_M | 90-110 | ~8GB | ✅ Yes | ⭐⭐⭐⭐⭐ Great |
| 20B | Q4_K_M | 84 | ~13GB | ✅ Yes | ⭐⭐⭐⭐⭐ Fast |
| 30B | Q3_K_M ⚠️ | 88-110 | ~15GB | ✅ Yes | ⭐⭐⭐⭐⭐ Excellent |
| 30B | Q4_K_M | 35-45 | ~17GB | ⚠️ Tight | ⭐⭐⭐ Limited context |
| 70B | Q4_K_M | 3-7 | 40GB+ | ❌ No | ⭐⭐ Hybrid only |
⚠️ Important Note: The 30B results above use Q3_K_M quantization, which is optimized for 16GB VRAM. Q3 provides ~20% faster speed than Q4 with minimal quality loss.
Key Findings:
- Sweet spot: 7B-30B models (all fit in VRAM with excellent speed)
- 30B Q3_K_M is optimal for RTX 4060 Ti (confirmed 88-110 t/s)
- 30B Q4_K_M is limited on 16GB (only 4K-8K context, 35-45 t/s)
- 70B models require hybrid CPU+GPU (slow, 3-7 t/s only)
Part 8: Performance Comparison with Other Hardware
Hardware Specifications
RTX 4060 Ti 16GB (Tested):
- CUDA Cores: 4,352
- SMs: 34
- VRAM: 16GB GDDR6
- Bandwidth: 288 GB/s
- TDP: 160W
- Price: ~$1,250 (complete system)
RTX 5090 32GB (Reference):
- Architecture: Blackwell (GB202)
- CUDA Cores: 21,760 (5x more)
- SMs: 170 (5x more)
- VRAM: 32GB GDDR7
- Bandwidth: 1,792 GB/s (6.2x faster)
- TDP: 575W
- Price: ~$3,200 (complete system)
NVIDIA DGX Spark GB10 (Reference):
- Architecture: Grace Blackwell Superchip
- CPU: 20-core ARM Grace
- GPU: Blackwell GPU (6,144 CUDA cores)
- Memory: 128GB LPDDR5X unified
- Bandwidth: 273 GB/s (unified)
- TDP: 240W
- Price: $3,999 (complete system)
- Special: FP4 hardware acceleration
Fair Comparison: GPT-OSS-20B (Q4_K_M) - All Same Quantization
| System | Speed (t/s) | Prefill (t/s) | VRAM Used | Power | Price |
|---|---|---|---|---|---|
| RTX 4060 Ti | 84 | 850 | 13GB | 160W | $1,250 |
| RTX 5090 | 156 | 1,500+ | 13GB | 450W | $3,200 |
| DGX Spark | 49.7 | 2,053 | 13GB | 180W | $3,999 |
✅ All tested with same Q4_K_M quantization - fair comparison!
Performance vs Cost:
- RTX 4060 Ti: $14.88 per token/sec
- RTX 5090: $20.51 per token/sec (1.86x faster, 2.56x price)
- DGX Spark: $80.48 per token/sec (0.59x speed, 3.2x price)
Winner: RTX 4060 Ti has best value for 20B models.
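The price-per-throughput figures above are straightforward to reproduce. A small sketch (the DGX Spark value comes out a couple of cents lower than the list above when you use the exact $3,999 price rather than a rounded $4,000):

```python
# Reproduce the $-per-token/sec figures from the value comparison above.
systems = {
    "RTX 4060 Ti": (1250, 84.0),   # (system price USD, 20B Q4 speed t/s)
    "RTX 5090":    (3200, 156.0),
    "DGX Spark":   (3999, 49.7),
}

def dollars_per_tps(price: float, tps: float) -> float:
    """Cost per token/sec of sustained generation throughput."""
    return round(price / tps, 2)

for name, (price, tps) in systems.items():
    print(f"{name}: ${dollars_per_tps(price, tps)} per token/sec")
```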
Qwen3-30B Comparison (Mixed Quantizations ⚠️)
| System | Quantization | Speed (t/s) | VRAM | Notes |
|---|---|---|---|---|
| RTX 4060 Ti | Q3_K_M ⚠️ | 88-110 | 15GB | Tested ✓ |
| RTX 5090 | Q4_K_M | 213 | 17GB | Public benchmark |
| DGX Spark | Q4_K_M | 45-50 | 17GB | Public benchmark |
⚠️ Important: This comparison uses different quantizations:
- RTX 4060 Ti uses Q3_K_M (lighter, faster, fits in 16GB)
- Others use Q4_K_M (heavier, slightly better quality)
- Q3 is ~20% faster than Q4 but not directly comparable
- Direct speed comparison is not apples-to-apples
Key Takeaway: RTX 4060 Ti achieves excellent 30B performance by using Q3_K_M, which is the right choice for 16GB VRAM.
Qwen2.5-7B Comparison (Q4_K_M) - Fair Comparison
| System | Speed (t/s) | Prefill (t/s) | VRAM |
|---|---|---|---|
| RTX 4060 Ti | 150-161 | 620 | ~5GB |
| RTX 5090 | 213 | 1,200+ | ~5GB |
| DGX Spark | 95 | 1,200+ | ~5GB |
✅ All tested with same Q4_K_M quantization
Winner: RTX 5090 (but all are fast enough for 7B models)
70B Model Comparison
| System | Can Run? | Speed (t/s) | Notes |
|---|---|---|---|
| RTX 4060 Ti | ⚠️ Hybrid only | 3-7 | CPU+GPU offload, slow |
| RTX 5090 | ⚠️ Tight | 15-25 | Barely fits in 32GB |
| DGX Spark | ✅ Yes | 25-35 | 128GB unified memory |
Winner: DGX Spark (only practical option for 70B+)
120B Model Comparison
| System | Can Run? | Speed (t/s) | Notes |
|---|---|---|---|
| RTX 4060 Ti | ❌ No | N/A | Insufficient VRAM |
| RTX 5090 | ❌ No | N/A | Needs 65GB+ |
| DGX Spark | ✅ Yes | 31-38 | FP4 precision only |
Winner: DGX Spark (only option for 120B+)
Performance Summary Table
| Metric | RTX 4060 Ti | RTX 5090 | DGX Spark |
|---|---|---|---|
| Price | $1,250 | $3,200 | $3,999 |
| 7B Speed | 150-161 t/s | 213 t/s | 95 t/s |
| 20B Speed (Q4) | 84 t/s | 156 t/s | 49.7 t/s |
| 30B Speed | 88-110 t/s (Q3) ⚠️ | 213 t/s (Q4) | 45-50 t/s (Q4) |
| 70B Speed | 3-7 t/s | 15-25 t/s | 25-35 t/s |
| 120B Speed | ❌ | ❌ | 31-38 t/s |
| Max Practical | 30B (Q3) | 70B | 200B |
| Max Context (30B) | 15K | 32K+ | 32K+ |
| Power Draw | 160W | 575W | 240W |
| Value (20B Q4) | $14.88/t/s ⭐ | $20.51/t/s | $80.48/t/s |
⚠️ Note: 30B comparison uses different quantizations (Q3 vs Q4). For fair comparison, see 20B Q4 benchmarks.
Part 9: Value Analysis & Recommendations
Cost Breakdown
| Component | RTX 4060 Ti Build | RTX 5090 Build | DGX Spark |
|---|---|---|---|
| GPU/System | $460 | $1,999 | $3,999 (complete) |
| CPU | $185 (i5-13400) | $300 (i7-14700K) | Included |
| RAM | $140 (64GB DDR4) | $240 (128GB DDR5) | Included (128GB) |
| Motherboard | $125 | $200 | Included |
| Storage | $75 (1TB) | $150 (2TB) | Included (4TB) |
| PSU | $70 (650W) | $180 (1200W) | Included |
| Case | $65 | $100 | Included |
| Total | $1,120 | $3,169 | $3,999 |
Who Should Buy What?
Choose RTX 4060 Ti if:
- ✅ Budget: $1,000-1,500
- ✅ Primary use: 7B-30B models (covers 95% of use cases)
- ✅ Priority: Best value for money ($14.88 per token/sec on 20B)
- ✅ Use case: Learning, homework, coding, personal projects
- ✅ Power efficiency: 160W vs 575W
- ✅ Context needs: Up to 15K tokens (11,500 words)
- ✅ Willing to use Q3 quantization for 30B models
- Best for: Students, hobbyists, developers, most users
Choose RTX 5090 if:
- ✅ Budget: $3,000-4,000
- ✅ Primary use: 30B-70B models with Q4+ quantization
- ✅ Priority: Maximum speed (1.86x faster on 20B)
- ✅ Use case: Professional content creation, commercial development
- ✅ Need: Larger contexts (32K+) and heavier quantizations
- ✅ Want to run 70B models at usable speeds
- Best for: Professionals, power users, content creators
Choose DGX Spark if:
- ✅ Budget: $4,000+
- ✅ Primary use: 70B-200B models
- ✅ Priority: Model capacity over raw speed
- ✅ Use case: Research, model development, enterprise
- ✅ Special: Only system that runs 120B+ models locally
- ✅ Need unified memory architecture
- Best for: Researchers, AI developers, small research teams
The Verdict for Most Users
90% of users should choose RTX 4060 Ti:
Why?
- Sufficient for most models: 7B-30B covers nearly all practical use cases
- Outstanding value: $14.88 per token/sec (best in class)
- Confirmed performance: 88-110 t/s on 30B Q3, 84 t/s on 20B Q4
- Maximum context: 15K tokens (11,500 words) proven stable
- Lower total cost: $1,950-2,750 less than competitors
- Efficient: 3.6x less power than RTX 5090
- Practical: Models load in seconds, responses feel instant
When to upgrade:
- You need Q4+ quantization for 30B models regularly (→ RTX 5090)
- You frequently work with 70B models (→ RTX 5090)
- You need 120B+ models (→ DGX Spark)
- You need 32K+ context regularly (→ RTX 5090 or DGX Spark)
- You're a professional who bills clients (→ RTX 5090)
- Budget is unlimited (→ RTX 5090 or DGX Spark)
Part 10: Python Usage Example
from openai import OpenAI

# Point to your local server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Simple chat
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing simply"}
    ],
    max_tokens=200
)
print(response.choices[0].message.content)

# JSON structured output
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "Output valid JSON only."},
        {"role": "user", "content": "Extract: John, 25, loves coding"}
    ],
    max_tokens=100
)
print(response.choices[0].message.content)
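If you don't want the openai package as a dependency, a couple of stdlib helpers cover the same ground. This is a sketch: build_chat_request and reply_text are hypothetical names, and the payload shape mirrors the curl examples in Part 6 (to send a request, POST build_chat_request(...) as JSON to http://localhost:8080/v1/chat/completions with urllib.request):

```python
import json

def build_chat_request(prompt: str, system: str = "",
                       max_tokens: int = 200, stream: bool = False) -> dict:
    """Build a /v1/chat/completions request body (same shape as Part 6)."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"messages": messages, "max_tokens": max_tokens, "stream": stream}

def reply_text(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response dict."""
    return response["choices"][0]["message"]["content"]

# Works on the Example 1 response from Part 6:
sample = {"choices": [{"message": {"content": '{"sentiment": "mixed"}'}}]}
print(json.loads(reply_text(sample))["sentiment"])  # → mixed
```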
Part 11: Optimization Tips
Context Size Selection Guide
Choose based on your use case:
| Use Case | Recommended Context | Speed | VRAM |
|---|---|---|---|
| Quick Q&A | 4K | 110 t/s | 14.7GB |
| Standard Chat | 8K | 88.5 t/s | 14.9GB |
| Document Analysis | 12K | 88.36 t/s | 15.1GB |
| Long Documents | 14K | 89.42 t/s | 15.2GB |
| Maximum Capacity | 15K | 87.87 t/s | 15.3GB |
If You Get "Out of Memory" Errors
Option 1: Use quantized KV cache
--cache-type-k q8_0 --cache-type-v q8_0
Saves ~50% KV cache VRAM with minimal quality loss.
Option 2: Reduce context size
-c 12288 # Instead of 15360
Option 3: Lower batch size
-b 128 -ub 32 # Instead of -b 256 -ub 64
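To see why these options help, you can estimate the KV cache yourself: size = 2 (K and V) × layers × KV heads × head dim × bytes per element × context length. The sketch below assumes Qwen3-30B-A3B dimensions of 48 layers, 4 KV heads (GQA), and head_dim 128; check your model card or the llama-server startup log for the real values:

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes x ctx.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: float) -> float:
    """Estimated KV cache VRAM in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1024 ** 3

# Assumed Qwen3-30B-A3B dims: 48 layers, 4 KV heads, head_dim 128.
f16  = kv_cache_gib(48, 4, 128, 15360, 2.0)     # f16  = 2 bytes/element
q8_0 = kv_cache_gib(48, 4, 128, 15360, 1.0625)  # q8_0 ~ 8.5 bits/element
print(round(f16, 2), round(q8_0, 2))  # → 1.41 0.75
```

Under these assumptions q8_0 cuts the KV cache roughly in half, which is where the extra context headroom at 15K comes from.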
Getting Maximum Speed
1. Verify GPU usage:
nvidia-smi
# Should show 85-98% GPU utilization
2. Check all layers are offloaded:
# In server output, look for:
# "llm_load_tensors: offloaded 49/49 layers to GPU"
3. Monitor VRAM:
watch -n 1 nvidia-smi
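For scripted monitoring, nvidia-smi's query mode gives machine-readable output. A sketch (parse_gpu_stats and gpu_stats are illustrative names; gpu_stats needs a real NVIDIA system to run):

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(line: str) -> tuple:
    """Parse one CSV line from the query above -> (vram_mib, gpu_util_pct)."""
    used, util = (field.strip() for field in line.split(","))
    return int(used), int(util)

def gpu_stats() -> tuple:
    """Query the first GPU's VRAM usage (MiB) and utilization (%)."""
    out = subprocess.check_output(QUERY, text=True)
    return parse_gpu_stats(out.strip().splitlines()[0])

# Parser check against a sample nvidia-smi line:
print(parse_gpu_stats("15342, 97"))  # → (15342, 97)
```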
Part 12: Troubleshooting
"Out of Memory" Error
Solution:
# Use smaller context
-c 12288 # Instead of 15360
# Or use quantized KV cache
--cache-type-k q8_0 --cache-type-v q8_0
# Or lower batch size
-b 128 -ub 32
"Very Slow Performance"
Check:
# 1. GPU utilization
nvidia-smi
# 2. All layers on GPU
# Look for "offloaded 49/49 layers" in server output
# 3. CUDA is being used
# Build should show "CUDA support enabled"
"Port Already in Use"
# Use different port
--port 8081
# Or kill existing process
sudo lsof -ti:8080 | xargs kill -9
KV Cache Quantization Requires Flash Attention
Error: V cache quantization requires flash_attn
Solution: Flash Attention cannot be disabled while the V cache is quantized (q8_0). Either:
- Use f16 cache (at the cost of more VRAM)
- Keep Flash Attention enabled (recommended)
Conclusion
Key Takeaways
RTX 4060 Ti 16GB:
- ✅ Confirmed Performance: 88-110 tokens/sec on 30B Q3, 84 t/s on 20B Q4
- ✅ Maximum Context: 15K tokens (11,500 words) proven stable
- ✅ Best Value: $14.88 per token/sec (20B Q4 benchmark)
- ✅ Perfect for 95% of use cases: 7B-30B models run excellently
- ✅ Smart Quantization: Q3_K_M for 30B enables full GPU usage
- ✅ Practical Limitation: 70B models only work at 3-7 t/s (hybrid mode)
- ✅ Hardware Note: 34 SMs means no max-autotune GEMM, but doesn't affect llama.cpp
- ⭐ Recommended for students, hobbyists, developers
RTX 5090 32GB:
- 1.86x faster on 20B Q4 (156 vs 84 t/s)
- Can handle 70B models (15-25 t/s)
- 32K+ context capacity
- Supports heavier quantizations (Q4+)
- 2.56x more expensive
- ⭐ Recommended for professionals only
DGX Spark:
- Only system for 120B+ models
- Slower than RTX for small-medium models
- 128GB unified memory
- Most expensive ($3,999)
- ⭐ Recommended for researchers only
Final Recommendation
Start with RTX 4060 Ti if:
- Budget under $1,500
- You're learning AI/ML
- You use 7B-30B models (which is 95% of people)
- You want best value for money
- 15K context is sufficient
- Q3 quantization quality is acceptable
You can always upgrade later if you outgrow it!
Resources
- llama.cpp GitHub: https://github.com/ggerganov/llama.cpp
- Model Hub: https://huggingface.co/models?library=gguf
You now have everything needed to run powerful AI locally on your RTX 4060 Ti with optimized context settings. Start building! 🚀
Guide created November 2025
Based on confirmed testing with RTX 4060 Ti 16GB + i5-13400 + 64GB RAM
All RTX 4060 Ti performance numbers verified through real-world testing
Primary test model: Qwen3-30B-A3B-Instruct-2507 (Q3_K_M, 30.5B parameters, 14.1GB)
Maximum stable context: 15,360 tokens @ 87.87 tokens/sec
Comparison benchmarks: RTX 5090 and DGX Spark data from public sources
Published on 11/20/2025