Local LLMs Benchmark
You can find the benchmark script in the How to benchmark local LLMs guide; a minimal sketch of the measurement approach also follows below.
Benchmark data is sorted by evaluation performance (tokens/second, descending).
Hardware: GPU RTX 4060 Ti (16GB VRAM), CPU Intel Core i5-13400, 64GB RAM
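The sketch below is not the article's exact script; it assumes an Ollama server on localhost:11434 and the requests package. It times a single generation and derives the same metrics reported in the tables, using the duration fields (in nanoseconds) that Ollama's /api/generate response provides.

```python
# Minimal sketch of one benchmark pass against a local Ollama server.
# Not the article's exact script: model tag and prompt are illustrative;
# the response fields are those documented for Ollama's /api/generate.
import requests

NS = 1e9  # Ollama reports all durations in nanoseconds

def bench(model: str, prompt: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    d = resp.json()
    return {
        "total_duration_s": d["total_duration"] / NS,
        "load_duration_s": d["load_duration"] / NS,
        "prompt_eval_count": d["prompt_eval_count"],
        "prompt_eval_rate_tok_s": d["prompt_eval_count"] / (d["prompt_eval_duration"] / NS),
        "eval_count": d["eval_count"],
        "eval_rate_tok_s": d["eval_count"] / (d["eval_duration"] / NS),
    }

if __name__ == "__main__":
    print(bench("mistral:7b", "Explain KV caching in two sentences."))
```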
Primary Performance Table (≥7B Models)
Model | Size (GB) | Eval Rate (tokens/s) | Load Time (s) | CPU Utilization (%) | VRAM Usage (GB) |
---|---|---|---|---|---|
mistral:7b | 4.1 | 60.35 | 9.27 | 4-11 | 13.8 |
llama3.1:8b | 4.7 | 54.85 | 10.35 | 4-7 | 13.5 |
qwen3:8b | 5.2 | 49.47 | 10.25 | 5-8 | 14.1 |
qwen3:30b | 18.0 | ✅ 33.06 | 19.20 | 16-34 | 15.2 |
qwen2.5:14b | 9.0 | 28.81 | 8.50 | 6-10 | 15.1 |
phi4:14b | 9.1 | 28.79 | 14.14 | 5-8 | 15.4 |
deepseek-r1:14b | 9.0 | 28.26 | 14.22 | 5-10 | 14.9 |
qwen2.5-coder:14b | 9.0 | 28.21 | 12.09 | 5-8 | 15.0 |
mistral-small:22b | 12.0 | 11.42 | 11.41 | 13-28 | 15.8 |
mixtral:8x7b | 26.0 | 10.22 | 18.58 | 19-41 | 16.0 |
devstral:24b | 14.0 | 10.08 | 16.94 | 10-28 | 15.9 |
gemma3:27b | 17.0 | 5.42 | 11.28 | 33-93 | 16.0 |
qwen2.5-coder:32b | 20.0 | 4.58 | 16.82 | 20-40 | 16.0 |
Additional Analysis Table
Performance Metric | 7B Models Avg | 14B Models Avg | 20B+ Models Avg | Peak Value (Model) |
---|---|---|---|---|
Tokens/s per GB VRAM | 1.21 | 0.41 | 0.15 | 1.29 (qwen3:30b) |
Tokens/s per Watt | 0.09 | 0.03 | 0.01 | 0.11 (mistral:7b) |
Memory Bandwidth Util. | 78% | 89% | 93% | 97% (qwen2.5-coder:32b) |
Context Proc. Speed | 142 tokens/s | 89 tokens/s | 51 tokens/s | 276 tokens/s (qwen3:1.7b) |
Warm-up Latency | 7.3s | 13.1s | 23.4s | 32.13s (mixtral:8x7b) |
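As a worked example of how a derived metric like the first row can be computed, here is a minimal Python sketch using three rows of the primary table. The article's normalization is not spelled out, so the absolute figures it reports may be scaled differently than this naive ratio.

```python
# Worked example: deriving tokens/s per GB of VRAM from the primary
# table. The article's normalization isn't specified, so its reported
# averages may differ from this naive ratio.
rows = {
    "mistral:7b": (60.35, 13.8),        # (eval rate tok/s, VRAM GB)
    "qwen3:30b": (33.06, 15.2),
    "qwen2.5-coder:32b": (4.58, 16.0),
}
for model, (tok_s, vram_gb) in rows.items():
    print(f"{model}: {tok_s / vram_gb:.2f} tokens/s per GB VRAM")
```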
Key Findings:
- The Qwen3 30B model (a mixture-of-experts with roughly 3B active parameters) demonstrates exceptional memory efficiency, delivering 33.06 tokens/s while using only 15.2GB of VRAM.
- Mixtral 8x7B sustains 10.22 tokens/s even though its 26GB footprint exceeds the 16GB of VRAM; its MoE architecture activates only two of eight experts per token, which limits the compute cost of the offloaded layers.
- VRAM utilization efficiency (tokens/s per GB) drops by roughly 87% from the 7B class to the 20B+ class (1.21 → 0.15 per the table above, a 1 − 0.15/1.21 ≈ 87% decline).
- Gemma3 27B exhibits extreme CPU utilization spikes (up to ~93%) while keeping a moderate average, consistent with layers spilling to the CPU.
For specific use cases:
- Real-time apps: Mistral 7B (60.35 t/s)
- Code generation: Qwen2.5-Coder 14B (28.21 t/s)
- Research: Qwen3 30B (best quality/speed balance among the tested models)
- Edge deployment: Phi4 14B (28.79 t/s with moderate resources)
Small Models (Under 4GB)
Model | Backend | Size (MB) | Total Duration (s) | Load Duration (s) | Prompt Eval Count | Prompt Eval Rate (tokens/s) | Eval Count | Eval Rate (tokens/s) | CPU Avg (%) | CPU Max (%) |
---|---|---|---|---|---|---|---|---|---|---|
qwen3:1.7b | cuda | 1400 | 8.02 | 7.38 | 21 | 241.56 | 90 | 161.64 | 3.33 | 7.33 |
llama3.2:1b | cuda | 1300 | 2.75 | 2.04 | 38 | 1229.38 | 78 | 114.95 | 6.67 | 9.00 |
gemma3:1b | cuda | 815 | 2.17 | 1.50 | 23 | 246.87 | 64 | 114.35 | 11.33 | 24.00 |
llama3.2:3b | cuda | 2000 | 4.08 | 3.20 | 38 | 1209.20 | 88 | 104.40 | 5.33 | 7.00 |
qwen3:4b | cuda | 2600 | 11.81 | 8.32 | 21 | 282.82 | 286 | 83.60 | 4.33 | 8.33 |
gemma3:4b | cuda | 3300 | 4.03 | 2.74 | 22 | 430.71 | 68 | 55.24 | 19.33 | 63.67 |
Medium Models (4GB - 10GB)
Model | Backend | Size (MB) | Total Duration (s) | Load Duration (s) | Prompt Eval Count | Prompt Eval Rate (tokens/s) | Eval Count | Eval Rate (tokens/s) | CPU Avg (%) | CPU Max (%) |
---|---|---|---|---|---|---|---|---|---|---|
mistral:7b | cuda | 4100 | 10.99 | 9.27 | 20 | 432.20 | 101 | 60.35 | 4.33 | 11.33 |
llama3.1:8b | cuda | 4700 | 11.75 | 10.35 | 23 | 606.58 | 75 | 54.85 | 4.33 | 7.00 |
qwen3:8b | cuda | 5200 | 16.84 | 10.25 | 21 | 243.94 | 322 | 49.47 | 4.67 | 7.67 |
qwen2.5:14b | cuda | 9000 | 10.50 | 8.50 | 42 | 696.00 | 56 | 28.81 | 5.67 | 10.33 |
phi4:14b | cuda | 9100 | 17.63 | 14.14 | 23 | 275.52 | 98 | 28.79 | 4.67 | 8.00 |
deepseek-r1:14b | cuda | 9000 | 28.81 | 14.22 | 16 | 213.85 | 410 | 28.26 | 5.33 | 10.33 |
qwen2.5-coder:14b | cuda | 9000 | 14.67 | 12.09 | 42 | 676.62 | 71 | 28.21 | 5.33 | 7.67 |
gemma3:12b | cuda | 8000 | 12.60 | 9.84 | 22 | 254.03 | 64 | 23.93 | 15.33 | 68.33 |
Large Models (Over 10GB)
Model | Backend | Size (MB) | Total Duration (s) | Load Duration (s) | Prompt Eval Count | Prompt Eval Rate (tokens/s) | Eval Count | Eval Rate (tokens/s) | CPU Avg (%) | CPU Max (%) |
---|---|---|---|---|---|---|---|---|---|---|
qwen3:30b | cuda | 18000 | 28.66 | 19.20 | 21 | 59.10 | 301 | 33.06 | 16.00 | 34.33 |
mistral-small:22b | cuda | 12000 | 20.58 | 11.41 | 20 | 42.03 | 99 | 11.42 | 13.33 | 28.33 |
mixtral:8x7b | cuda | 26000 | 32.50 | 18.59 | 24 | 17.66 | 128 | 10.22 | 18.67 | 40.67 |
devstral:24b | cuda | 14000 | 25.97 | 16.94 | 1238 | 675.07 | 72 | 10.08 | 10.00 | 28.00 |
gemma3:27b | cuda | 17000 | 24.84 | 11.28 | 22 | 19.61 | 67 | 5.42 | 33.33 | 92.67 |
qwen2.5-coder:32b | cuda | 20000 | 30.87 | 16.82 | 42 | 33.72 | 59 | 4.58 | 20.33 | 40.00 |
mistral-small3.1:24b | cuda | 15000 | 20.92 | 6.09 | 371 | 243.40 | 60 | 4.52 | 39.67 | 92.00 |
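For context on the CPU Avg/Max and VRAM columns, the sketch below shows one plausible way to sample them while a generation runs. It assumes psutil and nvidia-smi are available; the article's actual collection method may differ.

```python
# Sketch of sampling CPU and VRAM during a generation, roughly how the
# CPU Avg/Max and VRAM columns could be collected. psutil and nvidia-smi
# are assumed available; the article's actual script may differ.
import subprocess
import threading
import time

import psutil

def sample(stop: threading.Event, cpu: list, vram: list, interval: float = 1.0):
    while not stop.is_set():
        cpu.append(psutil.cpu_percent(interval=None))  # % since last call
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"]
        )
        vram.append(int(out.split()[0]))  # MiB in use on GPU 0
        time.sleep(interval)

stop, cpu, vram = threading.Event(), [], []
t = threading.Thread(target=sample, args=(stop, cpu, vram))
t.start()
# ... run the generation request here (see the sketch near the top) ...
time.sleep(5)  # stand-in for the actual generation
stop.set()
t.join()
print(f"CPU avg {sum(cpu) / len(cpu):.2f}%, max {max(cpu):.2f}%, "
      f"peak VRAM {max(vram)} MiB")
```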
Key Observations:
- Smaller models (≤4B) deliver the highest throughput, with qwen3:1.7b reaching 161.64 tokens/s.
- Model size and eval rate are inversely correlated (r = -0.67); see the sketch after this list.
- Quantization efficiency varies significantly between architectures.
- Average CPU utilization stays below ~40% even for the largest models, though gemma3:27b and mistral-small3.1:24b spike above 90% when work spills to the CPU.
- Loading times scale non-linearly with model size.
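The correlation figure can be sanity-checked directly from the tables. Below is a sketch with numpy, using the sizes (GB) and eval rates (tokens/s) transcribed from the three tables above; the exact coefficient depends on which models and runs are included.

```python
# Sanity-check of the size/throughput correlation using the table data
# (sizes in GB, eval rates in tokens/s). The exact coefficient depends
# on which models are included; the article reports r = -0.67.
import numpy as np

size_gb = [1.4, 1.3, 0.815, 2.0, 2.6, 3.3,            # small
           4.1, 4.7, 5.2, 9.0, 9.1, 9.0, 9.0, 8.0,    # medium
           18.0, 12.0, 26.0, 14.0, 17.0, 20.0, 15.0]  # large
eval_tok_s = [161.64, 114.95, 114.35, 104.40, 83.60, 55.24,
              60.35, 54.85, 49.47, 28.81, 28.79, 28.26, 28.21, 23.93,
              33.06, 11.42, 10.22, 10.08, 5.42, 4.58, 4.52]

r = np.corrcoef(size_gb, eval_tok_s)[0, 1]
print(f"Pearson r = {r:.2f}")  # strongly negative on this data
```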
Published on 5/24/2025