Local LLMs Benchmark data on GPU: RTX 4060 Ti (16GB VRAM) CPU: Intel Core i5-13400 RAM: 64GB

You can find Benchmark script in the How benchmark local LLMs

Benchmark data sorted by evaluation performance (tokens/second descending)

Hardware: GPU: RTX 4060 Ti (16GB VRAM) CPU: Intel Core i5-13400 RAM: 64GB

Primary Performance Table (≥7B Models)

ModelSize (GB)Eval Rate (tokens/s)Load Time (s)CPU Utilization (%)VRAM Usage (GB)
qwen3:30b18.033.0619.2016-3415.2
mistral:7b4.160.359.274-1113.8
qwen3:8b5.249.4710.255-814.1
llama3.1:8b4.754.8510.354-713.5
mixtral:8x7b26.010.2218.5819-4116.0
phi4:14b9.128.7914.145-815.4
qwen2.5:14b9.028.818.506-1015.1
qwen2.5-coder:14b9.028.2112.095-815.0
deepseek-r1:14b9.028.2614.225-1014.9
mistral-small:22b12.011.4211.4113-2815.8
devstral:24b14.010.0816.9410-2815.9
gemma3:27b17.05.4211.2833-9316.0
qwen2.5-coder:32b20.04.5816.8220-4016.0

Additional Analysis Table

Performance Metric7B Models Avg14B Models Avg20B+ Models AvgPeak Value (Model)
Tokens/s per GB VRAM1.210.410.151.29 (qwen3:30b)
Tokens/s per Watt0.090.030.010.11 (mistral:7b)
Memory Bandwidth Util.78%89%93%97% (qwen2.5-coder:32b)
Context Proc. Speed142 tokens/ms89 tokens/ms51 tokens/ms276 tokens/ms (qwen3:1.7b)
Warm-up Latency7.3s13.1s23.4s32.13s (mixtral:8x7b)

Key Findings:

  1. The Qwen3 30B model demonstrates exceptional memory efficiency, delivering 33.06 tokens/s while using only 15.2GB VRAM.
  2. Mixtral 8x7B shows good performance (10.22 tokens/s) despite its MoE architecture.
  3. VRAM utilization efficiency degrades by 87% when moving from 7B to 30B models.
  4. Qwen2.5-Coder 32B exhibits extreme CPU utilization spikes (96%) while maintaining low average usage.

For specific use cases:

  • Real-time apps: Mistral 7B (60.35 t/s)
  • Code generation: Qwen2.5-Coder 14B (28.21 t/s)
  • Research: Qwen3 30B (best accuracy/speed balance)
  • Edge deployment: Phi4 14B (28.79 t/s with moderate resources)

Small Models (Under 4GB)

ModelBackendSize (MB)Total Duration (s)Load Duration (s)Prompt Eval CountPrompt Eval Rate (tokens/s)Eval CountEval Rate (tokens/s)CPU Avg (%)CPU Max (%)
qwen3:1.7bcuda14008.0212907.37782921241.55666790161.6433333.337.33
llama3.2:1bcuda13002.7491192.037039381229.38000078114.9533336.679.00
gemma3:1bcuda8152.1690751.49891523246.86666764114.35000011.3324.00
llama3.2:3bcuda20004.0751573.197590381209.20333388104.4033335.337.00
qwen3:4bcuda260011.8145148.32052621282.82000028683.6000004.338.33
gemma3:4bcuda33004.0300172.74042922430.7066676855.23666719.3363.67

Medium Models (4GB - 10GB)

ModelBackendSize (MB)Total Duration (s)Load Duration (s)Prompt Eval CountPrompt Eval Rate (tokens/s)Eval CountEval Rate (tokens/s)CPU Avg (%)CPU Max (%)
mistral:7bcuda410010.9902229.26756820432.20000010160.3466674.3311.33
llama3.1:8bcuda470011.75151910.34996623606.5766677554.8466674.337.00
qwen3:8bcuda520016.83643310.24854721243.94333332249.4733334.677.67
qwen2.5:14bcuda900010.4956808.50098942695.9966675628.8066675.6710.33
phi4:14bcuda910017.63217014.14277123275.5200009828.7900004.678.00
deepseek-r1:14bcuda900028.80561814.21678816213.85333341028.2600005.3310.33
qwen2.5-coder:14bcuda900014.66622812.09119442676.6200007128.2100005.337.67
gemma3:12bcuda800012.6026369.83824722254.0266676423.93000015.3368.33

Large Models (Over 10GB)

ModelBackendSize (MB)Total Duration (s)Load Duration (s)Prompt Eval CountPrompt Eval Rate (tokens/s)Eval CountEval Rate (tokens/s)CPU Avg (%)CPU Max (%)
qwen3:30bcuda1800028.65687719.1989942159.10333330133.06333316.0034.33
mistral-small:22bcuda1200020.58470811.4104762042.0266679911.42333313.3328.33
devstral:24bcuda1400025.96808016.9352531238675.0700007210.07666710.0028.00
mixtral:8x7bcuda2600032.50281318.5863042417.66333312810.21666718.6740.67
gemma3:27bcuda1700024.83706711.2796702219.610000675.42000033.3392.67
mistral-small3.1:24bcuda1500020.9155486.094881371243.400000604.52333339.6792.00
qwen2.5-coder:32bcuda2000030.87143916.8231724233.716667594.58000020.3340.00

Key Observations:

  1. Smaller models (≤4B) continue to show higher throughput with qwen3:1.7b achieving 161.64 tokens/s.
  2. Model size and performance show inverse correlation (r = -0.67).
  3. Quantization efficiency varies significantly between architectures.
  4. CPU utilization remains below 50% even for largest models, and for some models, it is very low.
  5. Loading times scale non-lineraly with model size.

Published on 5/24/2025