Local AI on RTX 4060 Ti: Setup & Performance Guide

Running Local AI on Your Gaming PC: Complete Guide for RTX 4060 Ti

Turn your RTX 4060 Ti into an AI powerhouse for homework, coding, and creative projects—completely free and private!


What You'll Learn

  1. Why run AI locally on your own PC
  2. Hardware setup and requirements
  3. Installing and configuring llama.cpp
  4. Running real AI models with OpenAI-compatible API
  5. Practical curl examples for all SGR (Schema-Guided Reasoning) tasks
  6. Context size optimization and real-world testing
  7. Performance benchmarks: RTX 4060 Ti vs RTX 5090 vs DGX Spark
  8. Tips and troubleshooting

Part 1: Why Local AI Matters

Your Gaming PC = Personal AI Server

If you have a modern gaming PC with an RTX 4060 Ti, you already own hardware capable of running AI models that handle much of what you'd otherwise use ChatGPT for. Here's why this is awesome:

Privacy: Everything runs offline—your homework, code, and creative writing never leave your computer

Cost: $0/month after setup (vs. $20+/month for ChatGPT Plus)

Learning: Understand how AI actually works by running it yourself

Control: Choose your models, adjust settings, no rate limits

Fun: It's like running your own mini data center!


Part 2: Hardware Requirements

Target System: RTX 4060 Ti Desktop

Tested Configuration:

  • GPU: NVIDIA RTX 4060 Ti 16GB VRAM
    • Architecture: Ada Lovelace (AD106)
    • CUDA Cores: 4,352
    • Tensor Cores: 136
    • SMs (Streaming Multiprocessors): 34 ⚠️
    • Memory Bandwidth: 288 GB/s
    • TDP: 160W
  • CPU: Intel Core i5-13400 (10 cores, 16 threads)
  • RAM: 64GB DDR4
  • Storage: 1TB NVMe SSD (100GB+ free for models)
  • OS: Ubuntu 22.04 LTS or Windows 11

Total Cost: ~$1,120-1,250 USD

What You Can Actually Run:

  • 7B models: Lightning fast (150-161 tokens/sec) ⭐⭐⭐⭐⭐
  • 20B models: Very fast (84 tokens/sec) ⭐⭐⭐⭐⭐
  • 30B models: Fast (88-110 tokens/sec with Qwen3-30B Q3_K_M) ⭐⭐⭐⭐⭐
  • 70B models: Possible but slow (3-7 tokens/sec hybrid CPU+GPU) ⭐⭐

⚠️ Important Hardware Limitations:

The RTX 4060 Ti has only 34 SMs (Streaming Multiprocessors), which means:

  • GEMM max-autotune optimization is NOT available on GPUs with fewer than 46 SMs
  • This limitation affects some PyTorch training frameworks (particularly torch.compile with max-autotune mode)
  • For llama.cpp inference: No practical impact — you still get excellent text generation performance
  • The 34 SMs are perfectly sufficient for running local LLMs with llama.cpp

Bottom Line: Don't worry about SM count for local AI usage. The RTX 4060 Ti delivers outstanding performance for inference tasks!


Part 3: Setup Guide (Linux/Ubuntu)

Step 1: Install Prerequisites

bash
# Update system
sudo apt update && sudo apt upgrade -y

# Install build tools
sudo apt install -y build-essential cmake git \
  nvidia-driver-550 nvidia-cuda-toolkit \
  python3 python3-pip jq curl wget

# Verify NVIDIA driver
nvidia-smi

# Install Hugging Face CLI
pip install huggingface-hub

Step 2: Build llama.cpp with CUDA Support

bash
# Clone the repository
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Configure with CUDA and required compiler flags
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  -DCMAKE_C_FLAGS="-fno-finite-math-only" \
  -DCMAKE_CXX_FLAGS="-fno-finite-math-only" \
  -DCMAKE_BUILD_TYPE=Release

# Build (takes 5-10 minutes)
cmake --build build --config Release -j 16

# Verify build succeeded
./build/bin/llama-cli --version

⚠️ Critical Note: The -fno-finite-math-only compiler flag is required for recent llama.cpp versions. Without it, you'll get compilation errors related to non-finite math.
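
If you want to double-check that the CUDA backend actually made it into the binary, one quick check (assuming the CUDA libraries are linked dynamically, which is the default) is to look at the shared libraries the server links against:

bash
# Confirm the server binary links against the CUDA runtime / cuBLAS (CUDA build only)
ldd ~/llama.cpp/build/bin/llama-server | grep -iE 'cuda|cublas'

If nothing is printed, the build most likely fell back to CPU-only; re-run the cmake configure step and confirm -DGGML_CUDA=ON was picked up.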

Step 3: Download AI Models

bash
# Create models directory
mkdir -p ~/llm_models && cd ~/llm_models

# Download Qwen3-30B Q3_K_M (recommended for RTX 4060 Ti)
huggingface-cli download \
  Mungert/Qwen3-30B-A3B-Instruct-2507-GGUF \
  Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  --local-dir qwen3-30b-a3b-instruct-2507-GGUF

# Alternative: Smaller 7B model for maximum speed
huggingface-cli download \
  Qwen/Qwen2.5-7B-Instruct-GGUF \
  qwen2.5-7b-instruct-q4_k_m.gguf \
  --local-dir qwen2.5-7b-GGUF

# Alternative: 20B model for balanced performance
huggingface-cli download \
  unsloth/gpt-oss-20b-GGUF \
  gpt-oss-20b-Q4_K_M.gguf \
  --local-dir gpt-oss-20b-GGUF

Model Sizes:

  • Qwen2.5-7B (Q4_K_M): ~4.5GB download
  • GPT-OSS-20B (Q4_K_M): ~12GB download
  • Qwen3-30B (Q3_K_M): ~14GB download

⚠️ Note on Quantization: This guide tests Q3_K_M for 30B models, which is lighter and faster than Q4_K_M, perfect for 16GB VRAM. Q3 offers ~20% faster inference with minimal quality loss compared to Q4.
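
To confirm the downloads completed and roughly match the sizes above, a simple listing is enough (paths follow the --local-dir values used in the commands above):

bash
# List downloaded GGUF files and their on-disk sizes
ls -lh ~/llm_models/*/*.gguf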


Part 4: Context Size Configuration

Understanding Context Size

Context size determines how much text the AI can "remember" during a conversation:

  • 4K context ≈ 3,000 words
  • 8K context ≈ 6,000 words
  • 12K context ≈ 9,000 words
  • 15K context ≈ 11,500 words
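
The word estimates above are rough (about 0.7-0.75 words per token for English text). If you want an exact count for your own prompt, recent llama-server builds expose a /tokenize endpoint; assuming your build includes it, you can check like this once the server from the next section is running:

bash
# Count exactly how many tokens a prompt will consume
curl -s http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Paste the text you want to measure here."}' \
  | jq '.tokens | length'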

Tested Context Configurations (Qwen3-30B Q3_K_M)

Configuration 1: 4K Context (Maximum Speed)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 4096 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --metrics \
  --jinja
  • Speed: 105-110 tokens/sec
  • VRAM: ~14.7GB
  • Best for: Short conversations, quick queries

Configuration 2: 8K Context (Balanced)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 88.5 tokens/sec
  • VRAM: ~14.9GB
  • Best for: Standard conversations, document analysis

Configuration 3: 12K Context (Large Documents)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 12288 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 88.36 tokens/sec
  • VRAM: ~15.1GB
  • Best for: Long documents, extended conversations

Configuration 4: 14K Context (Extended)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 14336 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 89.42 tokens/sec
  • VRAM: ~15.2GB
  • Best for: Very long documents

Configuration 5: 15K Context (Maximum Stable - Recommended)

bash
~/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  -m ~/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf \
  -ngl 99 \
  -c 15360 \
  -b 256 \
  -ub 64 \
  --threads 12 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics \
  --jinja
  • Speed: 87.87 tokens/sec
  • VRAM: ~15.3GB
  • Best for: Maximum context capacity
  • ⭐ Recommended production configuration

Server starts successfully when you see:

llama server listening at http://0.0.0.0:8080

What these flags mean:

  • -ngl 99: Use GPU for all layers (maximum speed)
  • -c 4096/8192/12288/14336/15360: Context window size
  • -b 256: Batch size for processing
  • -ub 64: Ubatch size
  • --threads 12: CPU threads to use
  • --cache-type-k q8_0: KV cache key precision (q8_0 saves VRAM)
  • --cache-type-v q8_0: KV cache value precision
  • --metrics: Enable performance monitoring
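
Before sending requests from scripts, it helps to wait for the model to finish loading. A minimal readiness loop, assuming the default /health endpoint is available in your llama-server build:

bash
# Poll the server until it reports ready (returns HTTP 200 once the model is loaded)
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 1
done
echo "llama-server is ready"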

Part 5: Context Size Performance Comparison

Tested Results (Qwen3-30B Q3_K_M on RTX 4060 Ti)

| Context | KV Cache Type | Speed (t/s) | VRAM Used | Words Capacity | Status |
|---|---|---|---|---|---|
| 4K | f16 | 110.0 | 14.7GB | ~3,000 | ✅ Fastest |
| 8K | q8_0 | 88.5 | 14.9GB | ~6,000 | ✅ Great |
| 12K | q8_0 | 88.36 | 15.1GB | ~9,000 | ✅ Excellent |
| 14K | q8_0 | 89.42 | 15.2GB | ~10,700 | ✅ Outstanding |
| 15K | q8_0 | 87.87 | 15.3GB | ~11,500 | Maximum |
| 16K | q8_0 | N/A | 15.8GB | ~12,000 | ❌ OOM |

Key Findings:

  • 4K with f16 cache gives maximum speed (110 t/s)
  • 8K-15K with q8_0 cache maintains 88-89 t/s (excellent!)
  • 15K is the absolute maximum stable context on 16GB VRAM
  • 16K causes out-of-memory errors during Flash Attention warmup

Recommendation: Use 15K context for production—it maximizes capacity while maintaining 88 t/s speed.


Part 6: Using the OpenAI-Compatible API

Your local server provides an OpenAI-compatible API at http://localhost:8080.

Check Available Models

bash
curl http://localhost:8080/v1/models | jq

Response:

json
{
  "models": [{
    "name": "/home/berdachuk/llm_models/qwen3-30b-a3b-instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-q3_k_m.gguf",
    "type": "model",
    "format": "gguf"
  }],
  "data": [{
    "id": "...",
    "meta": {
      "n_params": 30532122624,
      "size": 15150649344,
      "n_ctx_train": 262144
    }
  }]
}
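
The meta block is useful for sanity checks; for example, you can read the model's trained context window directly (field names as in the response above):

bash
# Print the model's training context length (n_ctx_train)
curl -s http://localhost:8080/v1/models | jq '.data[0].meta.n_ctx_train'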

Example 1: Simple Sentiment Analysis (JSON Output)

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Respond with JSON only."},
      {"role": "user", "content": "I loved the service but disliked the price."}
    ],
    "max_tokens": 80
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\n  \"sentiment\": \"mixed\",\n  \"feedback\": \"I loved the service but disliked the price.\",\n  \"rating\": 3\n}"
    }
  }],
  "usage": {
    "completion_tokens": 32,
    "prompt_tokens": 27,
    "total_tokens": 59
  },
  "timings": {
    "prompt_ms": 131.94,
    "prompt_per_second": 204.64,
    "predicted_ms": 297.13,
    "predicted_per_second": 107.70
  }
}

Performance: 107.7 tokens/sec generation, ~300ms total response time
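
The speed figures quoted in this guide come straight from the timings block that llama-server adds to each response; you can pull them out with jq instead of reading the full JSON:

bash
# Re-run the request and extract only the speed metrics from the timings block
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Respond with JSON only."},
      {"role": "user", "content": "I loved the service but disliked the price."}
    ],
    "max_tokens": 80
  }' | jq '.timings | {prompt_per_second, predicted_per_second}'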


Example 2: Information Extraction

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Output valid JSON."},
      {"role": "user", "content": "Jane, 17, likes Python. Email: jane@mail.com"}
    ],
    "max_tokens": 128
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\n  \"name\": \"Jane\",\n  \"age\": 17,\n  \"interests\": [\"Python\"],\n  \"email\": \"jane@mail.com\"\n}"
    }
  }],
  "timings": {
    "prompt_per_second": 505.66,
    "predicted_per_second": 110.74
  }
}

Performance: 110.7 tokens/sec, ~360ms total
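
If you want the server to enforce valid JSON rather than relying on the system prompt alone, llama-server's OpenAI-compatible endpoint also accepts a response_format field (json_object; newer builds also support json_schema). Support depends on your llama.cpp version, so treat this as an optional extra:

bash
# Ask the server to constrain output to valid JSON (build-dependent feature)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Jane, 17, likes Python. Email: jane@mail.com"}
    ],
    "response_format": {"type": "json_object"},
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'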


Example 3: Math Problem Solving

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Show math steps, JSON format."},
      {"role": "user", "content": "If I buy 3 apples at $2 and 2 at $3, total?"}
    ],
    "max_tokens": 100
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\"steps\": [{\"operation\": \"Calculate cost of 3 apples at $2 each\", \"expression\": \"3 * 2\", \"result\": 6}, {\"operation\": \"Calculate cost of 2 apples at $3 each\", \"expression\": \"2 * 3\", \"result\": 6}, {\"operation\": \"Add both costs\", \"expression\": \"6 + 6\", \"result\": 12}], \"total\": 12}"
    }
  }],
  "timings": {
    "predicted_per_second": 109.56
  }
}

Performance: 109.6 tokens/sec


Example 4: Aspect-Based Sentiment

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Output sentiment per aspect as JSON."},
      {"role": "user", "content": "Battery great. Screen poor. Price fair."}
    ],
    "max_tokens": 120
  }' | jq

Response:

json
{
  "choices": [{
    "message": {
      "content": "{\n  \"battery\": \"positive\",\n  \"screen\": \"negative\",\n  \"price\": \"neutral\"\n}"
    }
  }],
  "timings": {
    "predicted_per_second": 108.18
  }
}

Performance: 108.2 tokens/sec


Example 5: Code Generation

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to check if a number is prime"}
    ],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'

Response:

python
def is_prime(n):
    """Check if a number is prime."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

Performance: ~110 tokens/sec


Example 6: Document Summarization

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Summarize in JSON format."},
      {"role": "user", "content": "The Industrial Revolution transformed manufacturing from handmade goods to machine production. It began in Britain in the late 1700s with textile mills and steam engines. The revolution spread across Europe and America, creating urban centers and new social classes. While it increased productivity, it also led to poor working conditions and pollution.\n\nSummarize as JSON with key_points and impacts."}
    ],
    "max_tokens": 150
  }' | jq

Performance: ~88-105 tokens/sec for summaries


Part 7: Verified Performance Benchmarks

RTX 4060 Ti 16GB Desktop Results (Confirmed Testing)

Test System:

  • GPU: RTX 4060 Ti 16GB (34 SMs)
  • CPU: Intel i5-13400
  • RAM: 64GB DDR4
  • Framework: llama.cpp (November 2025)
  • Model: Qwen3-30B-A3B-Instruct Q3_K_M ⚠️

Real Testing Results:

| Task | Context | Tokens Generated | Speed (t/s) | Total Time |
|---|---|---|---|---|
| Sentiment Analysis | 4K | 32 | 107.7 | ~300ms |
| Info Extraction | 4K | 40 | 110.7 | ~360ms |
| Math Reasoning | 4K | 95 | 109.6 | ~870ms |
| Aspect Sentiment | 4K | 24 | 108.2 | ~220ms |
| Code Generation | 4K | 150 | 110.0 | ~1.4s |
| Long Document | 8K | 249 | 88.5 | ~2.8s |
| Extended Doc | 12K | 658 | 88.36 | ~7.4s |
| Very Long Doc | 14K | 249 | 89.42 | ~2.8s |
| Maximum Context | 15K | 649 | 87.87 | ~7.4s |

Average Generation Speed: 87-110 tokens/second (depending on context)

Prompt Processing Speed: 81-505 tokens/second


Verified Performance by Model Size (RTX 4060 Ti 16GB)

| Model Size | Quantization | Speed (t/s) | VRAM Used | All GPU? | Status |
|---|---|---|---|---|---|
| 7B | Q4_K_M | 150-161 | ~5GB | ✅ Yes | ⭐⭐⭐⭐⭐ Excellent |
| 13B | Q4_K_M | 90-110 | ~8GB | ✅ Yes | ⭐⭐⭐⭐⭐ Great |
| 20B | Q4_K_M | 84 | ~13GB | ✅ Yes | ⭐⭐⭐⭐⭐ Fast |
| 30B | Q3_K_M ⚠️ | 88-110 | ~15GB | Yes | ⭐⭐⭐⭐⭐ Excellent |
| 30B | Q4_K_M | 35-45 | ~17GB | ⚠️ Tight | ⭐⭐⭐ Limited context |
| 70B | Q4_K_M | 3-7 | 40GB+ | ❌ No | ⭐⭐ Hybrid only |

⚠️ Important Note: The 30B results above use Q3_K_M quantization, which is optimized for 16GB VRAM. Q3 provides ~20% faster speed than Q4 with minimal quality loss.

Key Findings:

  • Sweet spot: 7B-30B models (all fit in VRAM with excellent speed)
  • 30B Q3_K_M is optimal for RTX 4060 Ti (confirmed 88-110 t/s)
  • 30B Q4_K_M is limited on 16GB (only 4K-8K context, 35-45 t/s)
  • 70B models require hybrid CPU+GPU (slow, 3-7 t/s only)

Part 8: Performance Comparison with Other Hardware

Hardware Specifications

RTX 4060 Ti 16GB (Tested):

  • CUDA Cores: 4,352
  • SMs: 34
  • VRAM: 16GB GDDR6
  • Bandwidth: 288 GB/s
  • TDP: 160W
  • Price: ~$460 GPU / ~$1,250 complete system

RTX 5090 32GB (Reference):

  • Architecture: Blackwell (GB202)
  • CUDA Cores: 21,760 (5x more)
  • SMs: 170 (5x more)
  • VRAM: 32GB GDDR7
  • Bandwidth: 1,792 GB/s (6.2x faster)
  • TDP: 575W
  • Price: ~$1,999 GPU / ~$3,200 complete system

NVIDIA DGX Spark GB10 (Reference):

  • Architecture: Grace Blackwell Superchip
  • CPU: 20-core ARM Grace
  • GPU: Blackwell GPU (6,144 CUDA cores)
  • Memory: 128GB LPDDR5X unified
  • Bandwidth: 273 GB/s (unified)
  • TDP: 240W
  • Price: $3,999 (complete system)
  • Special: FP4 hardware acceleration

Fair Comparison: GPT-OSS-20B (Q4_K_M) - All Same Quantization

| System | Speed (t/s) | Prefill (t/s) | VRAM Used | Power | Price |
|---|---|---|---|---|---|
| RTX 4060 Ti | 84 | 850 | 13GB | 160W | $1,250 |
| RTX 5090 | 156 | 1,500+ | 13GB | 450W | $3,200 |
| DGX Spark | 49.7 | 2,053 | 13GB | 180W | $3,999 |

All tested with same Q4_K_M quantization - fair comparison!

Performance vs Cost:

  • RTX 4060 Ti: $14.88 per token/sec
  • RTX 5090: $20.51 per token/sec (1.86x faster, 2.56x price)
  • DGX Spark: $80.48 per token/sec (0.59x speed, 3.2x price)

Winner: RTX 4060 Ti has best value for 20B models.
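
These value figures are simply the complete-system price divided by the measured 20B generation speed; a quick check (small differences in the last cent come from rounding the measured speeds):

bash
# Price per (token/sec) = system price / measured 20B generation speed
awk 'BEGIN { printf "RTX 4060 Ti: $%.2f per t/s\n", 1250 / 84   }'
awk 'BEGIN { printf "RTX 5090:    $%.2f per t/s\n", 3200 / 156  }'
awk 'BEGIN { printf "DGX Spark:   $%.2f per t/s\n", 3999 / 49.7 }'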


Qwen3-30B Comparison (Mixed Quantizations ⚠️)

| System | Quantization | Speed (t/s) | VRAM | Notes |
|---|---|---|---|---|
| RTX 4060 Ti | Q3_K_M ⚠️ | 88-110 | 15GB | Tested ✓ |
| RTX 5090 | Q4_K_M | 213 | 17GB | Public benchmark |
| DGX Spark | Q4_K_M | 45-50 | 17GB | Public benchmark |

⚠️ Important: This comparison uses different quantizations:

  • RTX 4060 Ti uses Q3_K_M (lighter, faster, fits in 16GB)
  • Others use Q4_K_M (heavier, slightly better quality)
  • Q3 is ~20% faster than Q4 but not directly comparable
  • Direct speed comparison is not apples-to-apples

Key Takeaway: RTX 4060 Ti achieves excellent 30B performance by using Q3_K_M, which is the right choice for 16GB VRAM.


Qwen2.5-7B Comparison (Q4_K_M) - Fair Comparison

| System | Speed (t/s) | Prefill (t/s) | VRAM |
|---|---|---|---|
| RTX 4060 Ti | 150-161 | 620 | ~5GB |
| RTX 5090 | 213 | 1,200+ | ~5GB |
| DGX Spark | 95 | 1,200+ | ~5GB |

All tested with same Q4_K_M quantization

Winner: RTX 5090 (but all are fast enough for 7B models)


70B Model Comparison

| System | Can Run? | Speed (t/s) | Notes |
|---|---|---|---|
| RTX 4060 Ti | ⚠️ Hybrid only | 3-7 | CPU+GPU offload, slow |
| RTX 5090 | ⚠️ Tight | 15-25 | Barely fits in 32GB |
| DGX Spark | ✅ Yes | 25-35 | 128GB unified memory |

Winner: DGX Spark (only practical option for 70B+)


120B Model Comparison

| System | Can Run? | Speed (t/s) | Notes |
|---|---|---|---|
| RTX 4060 Ti | ❌ No | N/A | Insufficient VRAM |
| RTX 5090 | ❌ No | N/A | Needs 65GB+ |
| DGX Spark | ✅ Yes | 31-38 | FP4 precision only |

Winner: DGX Spark (only option for 120B+)


Performance Summary Table

| Metric | RTX 4060 Ti | RTX 5090 | DGX Spark |
|---|---|---|---|
| Price | $1,250 | $3,200 | $3,999 |
| 7B Speed | 150-161 t/s | 213 t/s | 95 t/s |
| 20B Speed (Q4) | 84 t/s | 156 t/s | 49.7 t/s |
| 30B Speed | 88-110 t/s (Q3) ⚠️ | 213 t/s (Q4) | 45-50 t/s (Q4) |
| 70B Speed | 3-7 t/s | 15-25 t/s | 25-35 t/s |
| 120B Speed | ❌ | ❌ | 31-38 t/s |
| Max Practical | 30B (Q3) | 70B | 200B |
| Max Context (30B) | 15K | 32K+ | 32K+ |
| Power Draw | 160W | 575W | 240W |
| Value (20B Q4) | $14.88/t/s | $20.51/t/s | $80.48/t/s |

⚠️ Note: 30B comparison uses different quantizations (Q3 vs Q4). For fair comparison, see 20B Q4 benchmarks.


Part 9: Value Analysis & Recommendations

Cost Breakdown

| Component | RTX 4060 Ti Build | RTX 5090 Build | DGX Spark |
|---|---|---|---|
| GPU/System | $460 | $1,999 | $3,999 (complete) |
| CPU | $185 (i5-13400) | $300 (i7-14700K) | Included |
| RAM | $140 (64GB DDR4) | $240 (128GB DDR5) | Included (128GB) |
| Motherboard | $125 | $200 | Included |
| Storage | $75 (1TB) | $150 (2TB) | Included (4TB) |
| PSU | $70 (650W) | $180 (1200W) | Included |
| Case | $65 | $100 | Included |
| Total | $1,120 | $3,169 | $3,999 |

Who Should Buy What?

Choose RTX 4060 Ti if:

  • ✅ Budget: $1,000-1,500
  • ✅ Primary use: 7B-30B models (covers 95% of use cases)
  • ✅ Priority: Best value for money ($14.88 per token/sec on 20B)
  • ✅ Use case: Learning, homework, coding, personal projects
  • ✅ Power efficiency: 160W vs 575W
  • ✅ Context needs: Up to 15K tokens (11,500 words)
  • ✅ Willing to use Q3 quantization for 30B models
  • Best for: Students, hobbyists, developers, most users

Choose RTX 5090 if:

  • ✅ Budget: $3,000-4,000
  • ✅ Primary use: 30B-70B models with Q4+ quantization
  • ✅ Priority: Maximum speed (1.86x faster on 20B)
  • ✅ Use case: Professional content creation, commercial development
  • ✅ Need: Larger contexts (32K+) and heavier quantizations
  • ✅ Want to run 70B models at usable speeds
  • Best for: Professionals, power users, content creators

Choose DGX Spark if:

  • ✅ Budget: $4,000+
  • ✅ Primary use: 70B-200B models
  • ✅ Priority: Model capacity over raw speed
  • ✅ Use case: Research, model development, enterprise
  • ✅ Special: Only system that runs 120B+ models locally
  • ✅ Need unified memory architecture
  • Best for: Researchers, AI developers, small research teams

The Verdict for Most Users

90% of users should choose RTX 4060 Ti:

Why?

  1. Sufficient for most models: 7B-30B covers nearly all practical use cases
  2. Outstanding value: $14.88 per token/sec (best in class)
  3. Confirmed performance: 88-110 t/s on 30B Q3, 84 t/s on 20B Q4
  4. Maximum context: 15K tokens (11,500 words) proven stable
  5. Lower total cost: $1,950-2,750 less than competitors
  6. Efficient: 3.6x less power than RTX 5090
  7. Practical: Models load in seconds, responses feel instant

When to upgrade:

  • You need Q4+ quantization for 30B models regularly (→ RTX 5090)
  • You frequently work with 70B models (→ RTX 5090)
  • You need 120B+ models (→ DGX Spark)
  • You need 32K+ context regularly (→ RTX 5090 or DGX Spark)
  • You're a professional who bills clients (→ RTX 5090)
  • Budget is unlimited (→ RTX 5090 or DGX Spark)

Part 10: Python Usage Example

python
from openai import OpenAI

# Point to your local server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Simple chat
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing simply"}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)

# JSON structured output
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "Output valid JSON only."},
        {"role": "user", "content": "Extract: John, 25, loves coding"}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)

Part 11: Optimization Tips

Context Size Selection Guide

Choose based on your use case:

| Use Case | Recommended Context | Speed | VRAM |
|---|---|---|---|
| Quick Q&A | 4K | 110 t/s | 14.7GB |
| Standard Chat | 8K | 88.5 t/s | 14.9GB |
| Document Analysis | 12K | 88.36 t/s | 15.1GB |
| Long Documents | 14K | 89.42 t/s | 15.2GB |
| Maximum Capacity | 15K | 87.87 t/s | 15.3GB |

If You Get "Out of Memory" Errors

Option 1: Use quantized KV cache

bash
--cache-type-k q8_0 --cache-type-v q8_0

Saves ~50% KV cache VRAM with minimal quality loss.

Option 2: Reduce context size

bash
-c 12288  # Instead of 15360

Option 3: Lower batch size

bash
-b 128 -ub 32  # Instead of -b 256 -ub 64

Getting Maximum Speed

1. Verify GPU usage:

bash
nvidia-smi
# Should show 85-98% GPU utilization

2. Check all layers are offloaded:

bash
# In server output, look for:
# "llm_load_tensors: offloaded 49/49 layers to GPU"

3. Monitor VRAM:

bash
watch -n 1 nvidia-smi
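
For a more compact readout than the full nvidia-smi table, the query form below prints just utilization, VRAM, and power once per second (standard nvidia-smi query fields; adjust to taste):

bash
# Compact per-second telemetry: GPU utilization, VRAM used, power draw
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
  --format=csv -l 1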

Part 12: Troubleshooting

"Out of Memory" Error

Solution:

bash
# Use smaller context
-c 12288  # Instead of 15360

# Or use quantized KV cache
--cache-type-k q8_0 --cache-type-v q8_0

# Or lower batch size
-b 128 -ub 32

"Very Slow Performance"

Check:

bash
# 1. GPU utilization
nvidia-smi

# 2. All layers on GPU
# Look for "offloaded 49/49 layers" in server output

# 3. CUDA is being used
# Build should show "CUDA support enabled"

"Port Already in Use"

bash
# Use different port
--port 8081

# Or kill existing process
sudo lsof -ti:8080 | xargs kill -9

KV Cache Quantization Requires Flash Attention

Error: V cache quantization requires flash_attn

Solution: You cannot disable Flash Attention while using the q8_0 KV cache. Either:

  • Use the f16 cache instead (uses more VRAM), or
  • Keep Flash Attention enabled (recommended)

Conclusion

Key Takeaways

RTX 4060 Ti 16GB:

  • Confirmed Performance: 88-110 tokens/sec on 30B Q3, 84 t/s on 20B Q4
  • Maximum Context: 15K tokens (11,500 words) proven stable
  • Best Value: $14.88 per token/sec (20B Q4 benchmark)
  • Perfect for 95% of use cases: 7B-30B models run excellently
  • Smart Quantization: Q3_K_M for 30B enables full GPU usage
  • Practical Limitation: 70B models only work at 3-7 t/s (hybrid mode)
  • Hardware Note: 34 SMs means no max-autotune GEMM, but doesn't affect llama.cpp
  • Recommended for students, hobbyists, developers

RTX 5090 32GB:

  • 1.86x faster on 20B Q4 (156 vs 84 t/s)
  • Can handle 70B models (15-25 t/s)
  • 32K+ context capacity
  • Supports heavier quantizations (Q4+)
  • 2.56x more expensive
  • Recommended for professionals only

DGX Spark:

  • Only system for 120B+ models
  • Slower than RTX for small-medium models
  • 128GB unified memory
  • Most expensive ($3,999)
  • Recommended for researchers only

Final Recommendation

Start with RTX 4060 Ti if:

  • Budget under $1,500
  • You're learning AI/ML
  • You use 7B-30B models (which is 95% of people)
  • You want best value for money
  • 15K context is sufficient
  • Q3 quantization quality is acceptable

You can always upgrade later if you outgrow it!


You now have everything needed to run powerful AI locally on your RTX 4060 Ti with optimized context settings. Start building! 🚀


Guide created November 2025
Based on confirmed testing with RTX 4060 Ti 16GB + i5-13400 + 64GB RAM
All RTX 4060 Ti performance numbers verified through real-world testing
Primary test model: Qwen3-30B-A3B-Instruct-2507 (Q3_K_M, 30.5B parameters, 14.1GB)
Maximum stable context: 15,360 tokens @ 87.87 tokens/sec
Comparison benchmarks: RTX 5090 and DGX Spark data from public sources

Published on 11/20/2025