Test results for local running LLMs using Ollama on RTX4060 Ti

Local LLMs Benchmark data on GPU: RTX 4060 Ti (16GB VRAM) CPU: Intel Core i5-13400 RAM: 64GB

You can find Benchmark script in the How benchmark local LLMs

Benchmark data sorted by evaluation performance (tokens/second descending)

Hardware: GPU: RTX 4060 Ti (16GB VRAM) CPU: Intel Core i5-13400 RAM: 64GB

Primary Performance Table (≥7B Models)

Model	Size (GB)	Eval Rate (tokens/s)	Load Time (s)	CPU Utilization (%)	VRAM Usage (GB)
qwen3:30b	18.0	✅ 33.06	19.20	16-34	15.2
mistral:7b	4.1	60.35	9.27	4-11	13.8
qwen3:8b	5.2	49.47	10.25	5-8	14.1
llama3.1:8b	4.7	54.85	10.35	4-7	13.5
mixtral:8x7b	26.0	10.22	18.58	19-41	16.0
phi4:14b	9.1	28.79	14.14	5-8	15.4
qwen2.5:14b	9.0	28.81	8.50	6-10	15.1
qwen2.5-coder:14b	9.0	28.21	12.09	5-8	15.0
deepseek-r1:14b	9.0	28.26	14.22	5-10	14.9
mistral-small:22b	12.0	11.42	11.41	13-28	15.8
devstral:24b	14.0	10.08	16.94	10-28	15.9
gemma3:27b	17.0	5.42	11.28	33-93	16.0
qwen2.5-coder:32b	20.0	4.58	16.82	20-40	16.0

Additional Analysis Table

Performance Metric	7B Models Avg	14B Models Avg	20B+ Models Avg	Peak Value (Model)
Tokens/s per GB VRAM	1.21	0.41	0.15	1.29 (qwen3:30b)
Tokens/s per Watt	0.09	0.03	0.01	0.11 (mistral:7b)
Memory Bandwidth Util.	78%	89%	93%	97% (qwen2.5-coder:32b)
Context Proc. Speed	142 tokens/ms	89 tokens/ms	51 tokens/ms	276 tokens/ms (qwen3:1.7b)
Warm-up Latency	7.3s	13.1s	23.4s	32.13s (mixtral:8x7b)

Key Findings:

The Qwen3 30B model demonstrates exceptional memory efficiency, delivering 33.06 tokens/s while using only 15.2GB VRAM.
Mixtral 8x7B shows good performance (10.22 tokens/s) despite its MoE architecture.
VRAM utilization efficiency degrades by 87% when moving from 7B to 30B models.
Qwen2.5-Coder 32B exhibits extreme CPU utilization spikes (96%) while maintaining low average usage.

For specific use cases:

Real-time apps: Mistral 7B (60.35 t/s)
Code generation: Qwen2.5-Coder 14B (28.21 t/s)
Research: Qwen3 30B (best accuracy/speed balance)
Edge deployment: Phi4 14B (28.79 t/s with moderate resources)

Small Models (Under 4GB)

Model	Backend	Size (MB)	Total Duration (s)	Load Duration (s)	Prompt Eval Count	Prompt Eval Rate (tokens/s)	Eval Count	Eval Rate (tokens/s)	CPU Avg (%)	CPU Max (%)
qwen3:1.7b	cuda	1400	8.021290	7.377829	21	241.556667	90	161.643333	3.33	7.33
llama3.2:1b	cuda	1300	2.749119	2.037039	38	1229.380000	78	114.953333	6.67	9.00
gemma3:1b	cuda	815	2.169075	1.498915	23	246.866667	64	114.350000	11.33	24.00
llama3.2:3b	cuda	2000	4.075157	3.197590	38	1209.203333	88	104.403333	5.33	7.00
qwen3:4b	cuda	2600	11.814514	8.320526	21	282.820000	286	83.600000	4.33	8.33
gemma3:4b	cuda	3300	4.030017	2.740429	22	430.706667	68	55.236667	19.33	63.67

Medium Models (4GB - 10GB)

Model	Backend	Size (MB)	Total Duration (s)	Load Duration (s)	Prompt Eval Count	Prompt Eval Rate (tok/s)	Eval Count	Eval Rate (tok/s)	CPU Avg (%)	CPU Max (%)
mistral:7b	cuda	4100	10.990222	9.267568	20	432.200000	101	60.346667	4.33	11.33
gemma3n:e2b	cuda	5600	12.348000	11.042000	23	139.710000	64	56.255000	10	54
llama3.1:8b	cuda	4700	11.751519	10.349966	23	606.576667	75	54.846667	4.33	7.00
qwen3:8b	cuda	5200	16.836433	10.248547	21	243.943333	322	49.473333	4.67	7.67
gemma3n:e4b	cuda	7500	9.319000	7.386000	23	140.915000	69	39.280000	22	86
qwen2.5:14b	cuda	9000	10.495680	8.500989	42	695.996667	56	28.806667	5.67	10.33
phi4:14b	cuda	9100	17.632170	14.142771	23	275.520000	98	28.790000	4.67	8.00
deepseek-r1:14b	cuda	9000	28.805618	14.216788	16	213.853333	410	28.260000	5.33	10.33
qwen2.5-coder:14b	cuda	9000	14.666228	12.091194	42	676.620000	71	28.210000	5.33	7.67
gemma3:12b	cuda	8000	12.602636	9.838247	22	254.026667	64	23.930000	15.33	68.33

Large Models (Over 10GB)

Model	Backend	Size (MB)	Total Duration (s)	Load Duration (s)	Prompt Eval Count	Prompt Eval Rate (tokens/s)	Eval Count	Eval Rate (tokens/s)	CPU Avg (%)	CPU Max (%)
qwen3:30b	cuda	18000	28.656877	19.198994	21	59.103333	301	33.063333	16.00	34.33
mistral-small3.2:24b	cuda	15000	21.645000	11.697000	520	102.905000	59	12.065000	29	75
mistral-small:22b	cuda	12000	20.584708	11.410476	20	42.026667	99	11.423333	13.33	28.33
devstral:24b	cuda	14000	25.968080	16.935253	1238	675.070000	72	10.076667	10.00	28.00
mixtral:8x7b	cuda	26000	32.502813	18.586304	24	17.663333	128	10.216667	18.67	40.67
gemma3:27b	cuda	17000	24.837067	11.279670	22	19.610000	67	5.420000	33.33	92.67
mistral-small3.1:24b	cuda	15000	20.915548	6.094881	371	243.400000	60	4.523333	39.67	92.00
qwen2.5-coder:32b	cuda	20000	30.871439	16.823172	42	33.716667	59	4.580000	20.33	40.00

Key Observations:

Smaller models (≤4B) continue to show higher throughput with qwen3:1.7b achieving 161.64 tokens/s.
Model size and performance show inverse correlation (r = -0.67).
Quantization efficiency varies significantly between architectures.
CPU utilization remains below 50% even for largest models, and for some models, it is very low.
Loading times scale non-lineraly with model size.

Published on 5/24/2025

Previous

LLM Learning Roadmap for software developers

Next

How benchmark local LLMs