LLM Benchmark: 3D Rotating Cube Challenge
Updated December 2025: Added 9 new models including NVIDIA Nemotron, ByteDance Seed, Allen AI OLMo, and Mistral Devstral. Now testing 28 models total.
Overview
This benchmark tests a range of LLMs running locally through LMStudio by asking each one to implement a rotating 3D cube as a single HTML file using only HTML, CSS, and JavaScript. The benchmark measures performance metrics including tokens per second, total response time, and time to first token.
The Challenge
Please implement an html + js + css only project stored on a single html file that implements a 3d cube rotating. If you want you can add external dependencies as long as they are only included from cdns.
Benchmark Script
The benchmark was implemented in Elixir using HTTPoison to interact with the LMStudio API. The script:
- Fetches available models from LMStudio
- Warms up each model with a simple prompt
- Streams the actual benchmark prompt and measures performance metrics
- Saves individual responses and aggregated metrics
Key features:
- Streaming support for accurate time-to-first-token measurements
- Automatic token counting with fallback estimation
- CSV export of metrics for analysis
- Individual response files for quality comparison
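For reference, here is a minimal sketch of how one streamed benchmark request could look with HTTPoison against LMStudio's OpenAI-compatible endpoint. This is illustrative only, not the original script: the module name, endpoint URL, use of Jason for JSON, and the result fields are assumptions.

```elixir
# Minimal sketch of one streamed benchmark request (illustrative assumptions,
# not the author's actual script).
defmodule CubeBenchSketch do
  @endpoint "http://localhost:1234/v1/chat/completions"
  @prompt "Please implement an html + js + css only project ..."  # full challenge text from above

  def run(model) do
    body =
      Jason.encode!(%{
        model: model,
        stream: true,
        messages: [%{role: "user", content: @prompt}]
      })

    started = System.monotonic_time(:millisecond)

    # stream_to: self() makes HTTPoison deliver AsyncChunk messages to this
    # process, so the first chunk can be timestamped separately from stream end.
    HTTPoison.post!(@endpoint, body, [{"content-type", "application/json"}],
      stream_to: self(),
      recv_timeout: 600_000
    )

    collect(started, nil, [])
  end

  defp collect(started, first_at, chunks) do
    receive do
      %HTTPoison.AsyncChunk{chunk: chunk} ->
        collect(started, first_at || System.monotonic_time(:millisecond), [chunk | chunks])

      %HTTPoison.AsyncEnd{} ->
        finished = System.monotonic_time(:millisecond)
        # A real script would parse each SSE "data:" line and concatenate the
        # choices[0].delta.content fields; raw chunks are kept here for brevity.
        text = chunks |> Enum.reverse() |> IO.iodata_to_binary()
        tokens = max(div(String.length(text), 4), 1)  # fallback: ~4 chars per token

        %{
          ttft_s: ((first_at || finished) - started) / 1000,
          total_s: (finished - started) / 1000,
          tokens: tokens,
          tokens_per_s: tokens * 1000 / max(finished - started, 1)
        }

      _ignored ->
        # AsyncStatus / AsyncHeaders messages are not needed for timing.
        collect(started, first_at, chunks)
    end
  end
end
```

Time to first token is taken at the first streamed chunk, total time at stream end, and tokens/sec is derived from the two; when the API does not report usage, the ~4-characters-per-token fallback described under Methodology is applied.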
Results
Performance Comparison
| Model | Size (GB) | Tokens/sec | Efficiency (tok/sec/GB) | Total Tokens | Total Time (s) | Time to First Token (s) |
|---|---|---|---|---|---|---|
| nvidia/nemotron-3-nano-4b | 18 | 152.33 | 8.46 | 2,275 | 14.60 | 0.30 |
| openai/gpt-oss-20b | 11 | 95.39 | 8.67 | 1,009 | 10.04 | 0.69 |
| nvidia/nemotron-3-nano-8b | 34 | 94.28 | 2.77 | 7,366 | 77.59 | 0.22 |
| deepseek-coder-v2-lite-instruct | 17 | 90.89 | 5.35 | 1,089 | 11.42 | 0.14 |
| qwen/qwen3-next-80b | 79 | 84.09 | 1.06 | 1,451 | 16.65 | 0.24 |
| qwen/qwen3-coder-30b | 30 | 80.89 | 2.70 | 2,804 | 34.03 | 0.26 |
| qwen/qwen3-30b-a3b-2507 | 30 | 71.00 | 2.37 | 1,391 | 18.87 | 0.24 |
| openai/gpt-oss-120b | 63 | 64.92 | 1.03 | 1,327 | 19.66 | 1.01 |
| openai-gpt-oss-120b-mlx-6 | 88 | 60.56 | 0.69 | 1,320 | 20.96 | 3.03 |
| qwen/qwen3-vl-30b | 34 | 57.90 | 1.70 | 885 | 14.40 | 1.08 |
| deepseek/deepseek-r1-0528-qwen3-8b | 8 | 55.69 | 6.96 | 3,938 | 69.79 | 0.19 |
| glm-4.5-air | 47 | 52.09 | 1.11 | 5,351 | 101.75 | 0.46 |
| minimax-m2 | 100 | 48.00 | 0.48 | 4,005 | 82.38 | 0.55 |
| glm-4.5-air-mlx | 56 | 45.10 | 0.81 | 4,919 | 107.93 | 0.41 |
| zai-org/glm-4.6v-flash | 12 | 36.02 | 3.00 | 2,324 | 63.10 | 2.16 |
| mistralai/devstral-small-2-2512 | 14 | 34.39 | 2.46 | 962 | 26.49 | 0.45 |
| qwen/qwen2.5-coder-14b | 16 | 27.50 | 1.72 | 781 | 26.54 | 0.43 |
| qwen/qwen2.5-coder-32b-4b | 19 | 23.78 | 1.25 | 908 | 36.04 | 0.65 |
| allenai/olmo-3-32b-think-4b | 18 | 23.46 | 1.30 | 7,948 | 336.58 | 0.66 |
| mistralai/magistral-small-2509 | 26 | 19.52 | 0.75 | 1,301 | 64.04 | 0.48 |
| bytedance/seed-oss-36b-4b | 20 | 18.98 | 0.95 | 5,748 | 300.17 | 0.54 |
| allenai/olmo-3-32b-think | 34 | 14.62 | 0.43 | 11,713 | 797.84 | 0.89 |
| qwen/qwq-32b | 35 | 12.70 | 0.36 | 11,099 | 869.78 | 0.52 |
| qwen/qwen2.5-coder-32b | 35 | 11.74 | 0.34 | 841 | 67.28 | 0.89 |
| nousresearch/hermes-4-70b | 40 | 10.83 | 0.27 | 639 | 54.28 | 1.26 |
| bytedance/seed-oss-36b | 38 | 10.72 | 0.28 | 5,386 | 497.60 | 0.61 |
| kimi-dev-72b-dwq | 38 | 6.88 | 0.18 | 4,868 | 700.39 | 1.41 |
| kimi-dev-72b | 72 | 5.96 | 0.08 | 8,199 | 1,366.31 | 1.93 |
Key Observations
Performance Leaders
- Absolute fastest: nvidia/nemotron-3-nano-4b (152.33 tokens/sec, 18GB) - New speed champion!
- Second fastest: openai/gpt-oss-20b (95.39 tokens/sec, 11GB) - Clean working code
- Third fastest: nvidia/nemotron-3-nano-8b (94.28 tokens/sec, 34GB) - Very verbose output
- Most efficient (size-adjusted): openai/gpt-oss-20b (8.67 tokens/sec/GB)
- Quickest to respond: deepseek-coder-v2-lite-instruct (0.14s to first token)
- Fastest large model: qwen/qwen3-next-80b (84.09 tok/sec @ 79GB) - Excellent with working controls
Important: nvidia/nemotron-3-nano-4b (152.33 tok/sec) and nvidia/nemotron-3-nano-8b (94.28 tok/sec) are extremely fast but produce average/poor quality output. deepseek/deepseek-r1-0528-qwen3-8b (55.69 tok/sec, 8GB) has broken mouse controls and should be avoided.
Size vs Performance Analysis
- Small models (< 20GB) - Best efficiency tier:
- ✅ nvidia/nemotron-3-nano-4b (18G): 152.33 tok/sec, 8.46 efficiency - Fastest overall, but average quality
- ✅ openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency - Best efficiency, clean working code
- ✅ deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency - Fastest first token (0.14s)
- ✅ zai-org/glm-4.6v-flash (12G): 36.02 tok/sec, 3.00 efficiency - Good working code
- ✅ mistralai/devstral-small-2-2512 (14G): 34.39 tok/sec, 2.46 efficiency - Clean working code
- ✅ qwen/qwen2.5-coder-14b (16G): 27.50 tok/sec, 1.72 efficiency - Simple, clean CSS
- ✅ qwen/qwen2.5-coder-32b-4b (19G): 23.78 tok/sec, 1.25 efficiency - Clean CSS with dark background
- ⚠️ allenai/olmo-3-32b-think-4b (18G): 23.46 tok/sec, 1.30 efficiency - Extremely verbose, poor quality
- ❌ deepseek/deepseek-r1-0528-qwen3-8b (8G): 55.69 tok/sec BUT broken mouse controls
- Medium models (20-40GB) - Balanced tier:
- ⚠️ nvidia/nemotron-3-nano-8b (34G): 94.28 tok/sec, 2.77 efficiency - Fast but poor quality, very verbose
- ✅ qwen/qwen3-coder-30b (30G): 80.89 tok/sec, 2.70 efficiency - Fast with comprehensive output
- ✅ qwen/qwen3-30b-a3b-2507 (30G): 71.00 tok/sec, 2.37 efficiency - Good speed
- ✅ qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency - Clean labeled faces
- ✅ bytedance/seed-oss-36b-4b (20G): 18.98 tok/sec, 0.95 efficiency - Working code
- ⚠️ mistralai/magistral-small-2509 (26G): 19.52 tok/sec, 0.75 efficiency - Low efficiency
- ⚠️ allenai/olmo-3-32b-think (34G): 14.62 tok/sec, 0.43 efficiency - Extremely verbose (11k tokens), poor quality
- ❌ qwen/qwen2.5-coder-32b (35G): 11.74 tok/sec, 0.34 efficiency - Slow full precision model
- ❌ nousresearch/hermes-4-70b (40G): 10.83 tok/sec, 0.27 efficiency - Slow despite excellent code quality
- ❌ bytedance/seed-oss-36b (38G): 10.72 tok/sec, 0.28 efficiency - Slow, very verbose
- ❌ qwen/qwq-32b (35G): 12.70 tok/sec, 0.36 efficiency - Extremely verbose (11k tokens)
- Large models (> 40GB) - Mixed results, mostly inefficient:
- ✅ qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency - Best large model with working controls
- ✅ openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency - Working pause/resume controls
- ✅ openai-gpt-oss-120b-mlx-6 (88G): 60.56 tok/sec, 0.69 efficiency - Excellent mouse drag controls
- ⚠️ glm-4.5-air (47G): 52.09 tok/sec, 1.11 efficiency - Over-engineered dual implementation
- ⚠️ glm-4.5-air-mlx (56G): 45.10 tok/sec, 0.81 efficiency - Over-engineered 5 cubes
- ❌ minimax-m2 (100G!): 48.00 tok/sec, 0.48 efficiency - Huge model, basic Three.js output
- ❌ kimi-dev-72b (72G): 5.96 tok/sec, 0.08 efficiency - Extremely slow, verbose
- ❌ kimi-dev-72b-dwq (38G): 6.88 tok/sec, 0.18 efficiency - Very slow
Efficiency Winners (tokens/sec/GB)
Efficiency, tokens per second divided by model size in GB, is the most telling metric because it accounts for both speed and resource requirements:
- openai/gpt-oss-20b: 8.67 tok/sec/GB (exceptional efficiency + working code)
- nvidia/nemotron-3-nano-4b: 8.46 tok/sec/GB (fastest overall, but average quality)
- deepseek/deepseek-r1-0528-qwen3-8b: 6.96 tok/sec/GB (high efficiency BUT broken code!)
- deepseek-coder-v2-lite-instruct: 5.35 tok/sec/GB (excellent efficiency + fastest first token)
- zai-org/glm-4.6v-flash: 3.00 tok/sec/GB (good efficiency + working code)
- nvidia/nemotron-3-nano-8b: 2.77 tok/sec/GB (fast but poor quality)
- qwen/qwen3-coder-30b: 2.70 tok/sec/GB (good balance)
- mistralai/devstral-small-2-2512: 2.46 tok/sec/GB (good efficiency + clean code)
- qwen/qwen3-30b-a3b-2507: 2.37 tok/sec/GB (good performance)
- qwen/qwen2.5-coder-14b: 1.72 tok/sec/GB (moderate efficiency)
Critical finding: Models > 40GB all fall below 1.1 tok/sec/GB except glm-4.5-air (1.11). The 100GB minimax-m2 achieves only 0.48 efficiency - worse than models 1/10th its size! New NVIDIA Nemotron models are blazing fast but sacrifice code quality.
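For concreteness, the Efficiency column is simply tokens per second divided by the model's size in GB; a quick sanity check against the table (Elixir, for illustration only):

```elixir
# Efficiency = generation speed divided by model size on disk (GB).
efficiency = fn tok_per_sec, size_gb -> Float.round(tok_per_sec / size_gb, 2) end

efficiency.(95.39, 11)   # => 8.67  openai/gpt-oss-20b
efficiency.(152.33, 18)  # => 8.46  nvidia/nemotron-3-nano-4b
efficiency.(48.00, 100)  # => 0.48  minimax-m2
```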
Quality Assessment
Important finding: 24 out of 28 models (86%) generated working implementations. However, quality varies significantly based on code cleanliness, working features, and verbosity.
Quality Ratings
| Model | Rating | Implementation | Notes |
|---|---|---|---|
| openai-gpt-oss-120b-mlx-6 | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Working mouse drag, labeled faces, external CSS reset, clean code |
| qwen/qwen3-next-80b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/reset buttons, labeled faces, info text, clean |
| nousresearch/hermes-4-70b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working speed/rotation controls, pause/reset, gradient background |
| openai/gpt-oss-120b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/resume buttons, clean code, Google Fonts |
| qwen/qwen3-vl-30b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Clean code, labeled faces, nice colors, dark background |
| openai/gpt-oss-20b | ⭐⭐⭐⭐ Good | CSS 3D + JS | Working mouse drag, simple and clean, Google Fonts |
| deepseek-coder-v2-lite-instruct | ⭐⭐⭐⭐ Good | Canvas 2D | Clean wireframe with manual 3D math projection |
| qwen/qwen2.5-coder-14b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, simple white faces, works well |
| qwen/qwen2.5-coder-32b-4b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, white faces, dark background |
| qwen/qwen2.5-coder-32b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, full precision model, clean code |
| qwen/qwq-32b | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, ambient + directional lighting, clean |
| minimax-m2 | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, basic material, resize handler |
| mistralai/magistral-small-2509 | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS animation, no interactivity, clean and simple |
| mistralai/devstral-small-2-2512 | ⭐⭐⭐⭐ Good | CSS 3D | Clean CSS implementation, working animation |
| zai-org/glm-4.6v-flash | ⭐⭐⭐⭐ Good | CSS 3D | Working animation, clean code |
| bytedance/seed-oss-36b | ⭐⭐⭐⭐ Good | CSS 3D | Working implementation, verbose output |
| bytedance/seed-oss-36b-4b | ⭐⭐⭐⭐ Good | CSS 3D | Working implementation, quantized version |
| kimi-dev-72b | ⭐⭐⭐⭐ Good | CSS 3D | Basic auto-rotation, clean code |
| kimi-dev-72b-dwq | ⭐⭐⭐⭐ Good | CSS 3D | CSS-only Y-axis rotation, clean |
| qwen/qwen3-30b-a3b-2507 | ⭐⭐⭐ Average | CSS 3D + JS | Working buttons but uses JS animation instead of CSS, verbose |
| nvidia/nemotron-3-nano-4b | ⭐⭐⭐ Average | CSS 3D | Fast but average quality output |
| glm-4.5-air-mlx | ⭐⭐⭐ Average | CSS 3D + JS | 5 cubes with mouse tracking/controls, over-engineered, verbose |
| glm-4.5-air | ⭐⭐⭐ Average | CSS 3D + Three.js | Dual implementation with tabs, over-engineered, excessive features |
| qwen/qwen3-coder-30b | ⭐⭐⭐ Average | CSS 3D | Works but added non-functional controls, misleading UI |
| nvidia/nemotron-3-nano-8b | ⭐⭐ Poor | CSS 3D | Extremely fast but poor quality, very verbose (7k tokens) |
| allenai/olmo-3-32b-think | ⭐⭐ Poor | CSS 3D | Extremely verbose (11k tokens), poor quality output |
| allenai/olmo-3-32b-think-4b | ⭐⭐ Poor | CSS 3D | Very verbose (8k tokens), poor quality output |
| deepseek/deepseek-r1-0528-qwen3-8b | ⭐⭐ Poor | CSS 3D + JS | BROKEN: Mouse tracking doesn’t work, double-click zoom broken |
Quality vs Performance Analysis
Critical insight: Output quality and raw generation speed are inversely correlated in many cases:
Top Quality + Top Performance (Best Overall)
- ✅ openai/gpt-oss-20b (11G): 95.39 tok/sec, ⭐⭐⭐⭐ - Champion for speed + quality
- ✅ deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, ⭐⭐⭐⭐ - Runner-up, fastest first token
- ✅ qwen/qwen3-next-80b (79G): 84.09 tok/sec, ⭐⭐⭐⭐⭐ - Best large model with excellent quality
Fast But Low Quality (Speed Traps)
- ⚠️ nvidia/nemotron-3-nano-4b (18G): 152.33 tok/sec, ⭐⭐⭐ - Fastest overall but average quality
- ⚠️ nvidia/nemotron-3-nano-8b (34G): 94.28 tok/sec, ⭐⭐ - Very fast but poor quality, extremely verbose
High Quality Despite Poor Performance
- ⚠️ nousresearch/hermes-4-70b (40G): Only 10.83 tok/sec BUT ⭐⭐⭐⭐⭐ excellent code quality
- ⚠️ qwen/qwen3-vl-30b (34G): 57.90 tok/sec, ⭐⭐⭐⭐⭐ excellent clean code
Poor Performance Destroys Value
- ❌ deepseek/deepseek-r1-0528-qwen3-8b (8G): Good speed (55.69 tok/sec) BUT broken code (⭐⭐)
- ❌ allenai/olmo-3-32b-think (34G): Slow (14.62 tok/sec), extremely verbose (11k tokens), poor quality (⭐⭐)
- ❌ allenai/olmo-3-32b-think-4b (18G): Moderate speed (23.46 tok/sec), very verbose (8k tokens), poor quality (⭐⭐)
- ❌ minimax-m2 (100G!): Moderate speed (48.00 tok/sec) but basic Three.js output (⭐⭐⭐⭐)
- ❌ glm-4.5-air (47G): 52.09 tok/sec but over-engineered (⭐⭐⭐)
- ❌ glm-4.5-air-mlx (56G): 45.10 tok/sec but over-engineered (⭐⭐⭐)
Size Analysis by Quality Tier
Excellent (⭐⭐⭐⭐⭐) - 5 models:
- openai-gpt-oss-120b-mlx-6 (88G) - Large, 60.56 tok/sec
- qwen/qwen3-next-80b (79G) - Large, 84.09 tok/sec ✨
- openai/gpt-oss-120b (63G) - Large, 64.92 tok/sec
- nousresearch/hermes-4-70b (40G) - Medium, 10.83 tok/sec
- qwen/qwen3-vl-30b (34G) - Medium, 57.90 tok/sec
Good (⭐⭐⭐⭐) - 14 models:
- Includes best performers: openai/gpt-oss-20b (95.39 tok/sec), deepseek-coder-v2-lite-instruct (90.89 tok/sec)
- New additions: mistralai/devstral-small-2-2512, zai-org/glm-4.6v-flash, bytedance/seed-oss models
- Range: 8-72GB, all working implementations
Average (⭐⭐⭐) - 5 models:
- nvidia/nemotron-3-nano-4b - Fast but average quality
- The remainder are over-engineered, verbose, or have misleading UIs
- Includes both glm-4.5-air models (over-engineered)
Poor (⭐⭐) - 4 models:
- nvidia/nemotron-3-nano-8b - Fast but poor quality, very verbose
- allenai/olmo-3-32b-think - Extremely verbose, poor quality
- allenai/olmo-3-32b-think-4b - Very verbose, poor quality
- deepseek/deepseek-r1-0528-qwen3-8b - Broken code despite good metrics
Key insights:
- Speed ≠ Quality: nvidia/nemotron-3-nano-4b (152.33 tok/sec, ⭐⭐⭐) is fastest but average quality
- Working code > broken fancy features: deepseek-r1’s broken code invalidates its speed advantage
- “Think” models produce verbose, poor output: allenai/olmo-3-32b-think generates 11k tokens of poor quality
- Over-engineering reduces quality: glm-4.5-air models score only ⭐⭐⭐ despite elaborate UIs
- Size doesn’t predict quality: 5 excellent models range from 34GB to 88GB, with huge gaps in between
- Minimax-m2 disappoints: 100GB model produces basic Three.js output at only 48 tok/sec
Model Outputs
Click on the links below to view each model’s response:
Top Performers - Best Speed + Quality
- nvidia/nemotron-3-nano-4b (18G) ⭐⭐⭐: Raw Response | HTML Demo
- 152.33 tok/sec | 8.46 efficiency | Fastest overall but average quality | 2,275 tokens
- openai/gpt-oss-20b (11G) ⭐⭐⭐⭐: Raw Response | HTML Demo
- 95.39 tok/sec | 8.67 efficiency | Best efficiency + quality combo | Working mouse drag | 958 tokens
- nvidia/nemotron-3-nano-8b (34G) ⭐⭐: Raw Response | HTML Demo
- 94.28 tok/sec | 2.77 efficiency | Very fast but poor quality | 7,366 tokens
- deepseek-coder-v2-lite-instruct (17G) ⭐⭐⭐⭐: Raw Response | HTML Demo
- 90.89 tok/sec | 5.35 efficiency | Fastest first token (0.14s) | Canvas wireframe
- qwen/qwen3-next-80b (79G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
- 84.09 tok/sec | 1.06 efficiency | Best large model | Working pause/reset buttons
Excellent Quality (⭐⭐⭐⭐⭐) - Best Code
- openai-gpt-oss-120b-mlx-6 (88G): Raw Response | HTML Demo
- 60.56 tok/sec | 0.69 efficiency | Working mouse drag | Labeled faces | External CSS reset
- openai/gpt-oss-120b (63G): Raw Response | HTML Demo
- 64.92 tok/sec | 1.03 efficiency | Working pause/resume | Clean code
- nousresearch/hermes-4-70b (40G): Raw Response | HTML Demo
- 10.83 tok/sec | 0.27 efficiency | Working speed/rotation controls | Minimal 639 tokens
- qwen/qwen3-vl-30b (34G): Raw Response | HTML Demo
- 57.90 tok/sec | 1.70 efficiency | Clean labeled faces | Nice colors
Good Quality (⭐⭐⭐⭐) - Solid Implementations
- zai-org/glm-4.6v-flash (12G): Raw Response | HTML Demo
- 36.02 tok/sec | 3.00 efficiency | Working CSS animation | Clean code
- mistralai/devstral-small-2-2512 (14G): Raw Response | HTML Demo
- 34.39 tok/sec | 2.46 efficiency | Clean CSS | Working animation
- qwen/qwen2.5-coder-14b (16G): Raw Response | HTML Demo
- 27.50 tok/sec | 1.72 efficiency | Pure CSS | Simple clean white faces
- qwen/qwen2.5-coder-32b-4b (19G): Raw Response | HTML Demo
- 23.78 tok/sec | 1.25 efficiency | Pure CSS | Dark background
- qwen/qwen2.5-coder-32b (35G): Raw Response | HTML Demo
- 11.74 tok/sec | 0.34 efficiency | Pure CSS | Full precision model
- bytedance/seed-oss-36b-4b (20G): Raw Response | HTML Demo
- 18.98 tok/sec | 0.95 efficiency | Working CSS | Quantized version
- bytedance/seed-oss-36b (38G): Raw Response | HTML Demo
- 10.72 tok/sec | 0.28 efficiency | Working CSS | Verbose output
- qwen/qwq-32b (35G): Raw Response | HTML Demo
- 12.70 tok/sec | 0.36 efficiency | Three.js | Extremely verbose 11,099 tokens
- minimax-m2 (100G): Raw Response | HTML Demo
- 48.00 tok/sec | 0.48 efficiency | Simple Three.js | Huge 100GB model!
- mistralai/magistral-small-2509 (26G): Raw Response | HTML Demo
- 19.52 tok/sec | 0.75 efficiency | Pure CSS | No interactivity
- kimi-dev-72b (72G): Raw Response | HTML Demo
- 5.96 tok/sec | 0.08 efficiency | Basic CSS | Extremely verbose 8,148 tokens
- kimi-dev-72b-dwq (38G): Raw Response | HTML Demo
- 6.88 tok/sec | 0.18 efficiency | CSS Y-axis only | Clean code
Average Quality (⭐⭐⭐) - Over-engineered or Misleading
- qwen/qwen3-30b-a3b-2507 (30G): Raw Response | HTML Demo
- 71.00 tok/sec | 2.37 efficiency | Working buttons but uses JS instead of CSS | Verbose
- qwen/qwen3-coder-30b (30G): Raw Response | HTML Demo
- 80.89 tok/sec | 2.70 efficiency | Non-functional controls | Misleading UI
- glm-4.5-air-mlx (56G): Raw Response | HTML Demo
- 45.10 tok/sec | 0.81 efficiency | 5 cubes with mouse tracking | Over-engineered
- glm-4.5-air (47G): Raw Response | HTML Demo
- 52.09 tok/sec | 1.11 efficiency | Dual implementation with tabs | Over-engineered
Poor Quality (⭐⭐) - Broken or Low Quality Code
- allenai/olmo-3-32b-think (34G): Raw Response | HTML Demo
- 14.62 tok/sec | 0.43 efficiency | Extremely verbose (11k tokens) | Poor quality output
- allenai/olmo-3-32b-think-4b (18G): Raw Response | HTML Demo
- 23.46 tok/sec | 1.30 efficiency | Very verbose (8k tokens) | Poor quality output
- deepseek/deepseek-r1-0528-qwen3-8b (8G): Raw Response | HTML Demo
- 55.69 tok/sec | 6.96 efficiency | BROKEN: Mouse tracking fails | Double-click zoom fails | AVOID
Technical Details
- Hardware: Mac Studio (M4, 128GB Unified Memory)
- Initial Benchmark: October 2025 (9 models)
- First Update: November 3, 2025 (17 models total, 7 new)
- Second Update: November 6, 2025 (20 models total, 3 additional)
- Third Update: December 20, 2025 (28 models total, 9 new including NVIDIA Nemotron, ByteDance Seed, Allen AI OLMo, Mistral Devstral)
- Benchmark Tool: Custom Elixir script with streaming support
- API: LMStudio-compatible OpenAI API endpoint
Methodology
- Model Warmup: Each model receives a simple “hello” prompt before the benchmark to ensure it’s loaded in memory
- Streaming: Responses are streamed to accurately measure time-to-first-token
- Token Counting: Uses actual token counts from the API when available, falls back to estimation (1 token ≈ 4 characters)
- Timeout: 10-minute timeout per model to handle slower responses
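As a sketch of the warmup and token-counting steps above (assuming Jason for JSON; the module, function names, and endpoint are illustrative, not from the original script):

```elixir
defmodule CubeBenchSketch.Prep do
  @endpoint "http://localhost:1234/v1/chat/completions"

  # Simple non-streamed "hello" request so the model is resident in memory
  # before the timed run.
  def warmup(model) do
    body = Jason.encode!(%{model: model, messages: [%{role: "user", content: "hello"}]})
    HTTPoison.post!(@endpoint, body, [{"content-type", "application/json"}], recv_timeout: 120_000)
  end

  # Prefer the token count reported by the API; otherwise estimate
  # roughly 4 characters per token from the generated text.
  def completion_tokens(%{"usage" => %{"completion_tokens" => n}}, _text) when is_integer(n), do: n
  def completion_tokens(_response, text), do: max(div(String.length(text), 4), 1)
end
```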
Conclusion
This comprehensive benchmark of 28 models reveals critical insights that challenge conventional assumptions about model size, speed, and quality.
Key Takeaways
- Speed ≠ Quality: nvidia/nemotron-3-nano-4b (152.33 tok/sec) is fastest but only ⭐⭐⭐ average quality
- Working code matters more than speed: deepseek-r1 (55.69 tok/sec) and OLMo-think models have poor/broken code despite good metrics
- Size ≠ Speed ≠ Quality: The 11GB openai/gpt-oss-20b (95.39 tok/sec, ⭐⭐⭐⭐) outperforms the 100GB minimax-m2 (48.00 tok/sec, ⭐⭐⭐⭐) in both metrics
- Efficiency gap is massive: Small models achieve 5-18x better tokens/sec/GB than large models (8.67 vs 0.48)
- Only 5 models achieved excellent (⭐⭐⭐⭐⭐) quality: 3 large models (63-88GB), 2 medium models (34-40GB), and ZERO small models
- “Think” models disappoint: Allen AI OLMo-think models produce extremely verbose (8-11k tokens) poor quality output
- Over-engineering hurts: glm-4.5-air models (47-56GB) scored only ⭐⭐⭐ despite elaborate dual implementations
- New efficiency champions: zai-org/glm-4.6v-flash (3.00 eff) and mistralai/devstral (2.46 eff) deliver good quality at great efficiency
- Diminishing returns after 40GB: Every model > 40GB sits at roughly 1.1 tok/sec/GB or lower (glm-4.5-air's 1.11 is the high-water mark)
Recommendations by Use Case
- 🏆 Best overall (speed + quality)?
- → openai/gpt-oss-20b (11G, 95.39 tok/sec, ⭐⭐⭐⭐) - Best efficiency + quality combo!
- → deepseek-coder-v2-lite-instruct (17G, 90.89 tok/sec, ⭐⭐⭐⭐) - Runner-up, fastest first token (0.14s)
- → zai-org/glm-4.6v-flash (12G, 36.02 tok/sec, ⭐⭐⭐⭐) - New efficient option with good quality
- Want absolute best quality code (⭐⭐⭐⭐⭐)?
- → qwen/qwen3-next-80b (79G, 84.09 tok/sec) - Best balance: excellent quality + fast speed
- → openai/gpt-oss-120b (63G, 64.92 tok/sec) - Working pause/resume controls
- → openai-gpt-oss-120b-mlx-6 (88G, 60.56 tok/sec) - Working mouse drag, best code quality
- → qwen/qwen3-vl-30b (34G, 57.90 tok/sec) - Great quality at moderate size
- → nousresearch/hermes-4-70b (40G, 10.83 tok/sec) - Excellent code but slow
- Maximum efficiency (best tok/sec/GB)?
- → openai/gpt-oss-20b (11G, 8.67 efficiency) - 18x better than minimax-m2!
- → deepseek-coder-v2-lite-instruct (17G, 5.35 efficiency) - Excellent value
- → zai-org/glm-4.6v-flash (12G, 3.00 efficiency) - New efficient option
- → mistralai/devstral-small-2-2512 (14G, 2.46 efficiency) - New coder-focused option
- → qwen/qwen3-coder-30b (30G, 2.70 efficiency) - Best medium-sized option
- Budget-conscious (< 20GB)?
- → openai/gpt-oss-20b (11G) or deepseek-coder-v2-lite-instruct (17G) - Both excellent
- → zai-org/glm-4.6v-flash (12G, 36.02 tok/sec, ⭐⭐⭐⭐) - New compact option
- → mistralai/devstral-small-2-2512 (14G, 34.39 tok/sec, ⭐⭐⭐⭐) - New coder model
- → qwen/qwen2.5-coder-14b (16G, 27.50 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
- → qwen/qwen2.5-coder-32b-4b (19G, 23.78 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
- ⚠️ AVOID - Poor value:
- ❌ nvidia/nemotron-3-nano models (18-34G) - Fastest but poor/average quality output
- ❌ allenai/olmo-3-32b-think models (18-34G) - Extremely verbose, poor quality
- ❌ deepseek/deepseek-r1-0528-qwen3-8b (8G) - BROKEN CODE despite good speed
- ❌ minimax-m2 (100G!) - Worst efficiency (0.48), basic Three.js output
- ❌ kimi models (38-72G) - Extremely slow (< 7 tok/sec), terrible efficiency
- ❌ bytedance/seed-oss-36b (38G) - Slow (10.72 tok/sec), verbose
- ❌ qwen/qwq-32b (35G) - Slow (12.70 tok/sec), extremely verbose (11k tokens)
- ❌ glm-4.5-air models (47-56G) - Over-engineered (⭐⭐⭐ only)
- ❌ qwen/qwen3-coder-30b (30G) - Misleading non-functional controls
The Quality vs Efficiency Tradeoff
This benchmark reveals a fundamental tradeoff between speed/efficiency and code quality:
Speed Champions (⭐⭐⭐⭐):
- openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency
- deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency
- zai-org/glm-4.6v-flash (12G): 36.02 tok/sec, 3.00 efficiency - New!
- mistralai/devstral-small-2-2512 (14G): 34.39 tok/sec, 2.46 efficiency - New!
- Advantage: 2-18x faster per GB than large models
- Limitation: Good quality but not excellent (missing interactive controls)
Speed Traps (Fast but Low Quality):
- nvidia/nemotron-3-nano-4b (18G): 152.33 tok/sec, 8.46 efficiency - ⭐⭐⭐ only
- nvidia/nemotron-3-nano-8b (34G): 94.28 tok/sec, 2.77 efficiency - ⭐⭐ poor quality
- Warning: Fastest models don’t produce best code!
Quality Champions (⭐⭐⭐⭐⭐):
- qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency
- openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency
- qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency
- Advantage: Working interactive controls, professional-grade code
- Limitation: 5-8x worse efficiency than small models
The Sweet Spot:
- qwen/qwen3-next-80b (79G) offers the best balance: excellent quality (⭐⭐⭐⭐⭐) + high speed (84.09 tok/sec)
- For maximum efficiency with good quality: openai/gpt-oss-20b (11G) remains unbeatable
- New efficient options: zai-org/glm-4.6v-flash (12G) and mistralai/devstral (14G)
Critical Failures:
- nvidia/nemotron models - Fastest but poor/average quality - speed isn’t everything
- allenai/olmo-3-32b-think models - Extremely verbose (8-11k tokens), poor quality output
- minimax-m2 (100G) is the worst value: 0.48 efficiency, basic output, 100GB wasted
- deepseek-r1 (8G) has broken code despite good metrics - reliability matters more than speed
- Over-engineering penalty: glm-4.5-air models score only ⭐⭐⭐ despite elaborate features
For Mac Studio (M4, 128GB):
- Best choice: openai/gpt-oss-20b (11G) - Maximize speed + efficiency
- Best quality: qwen/qwen3-next-80b (79G) - Excellent code with strong performance
- New options: zai-org/glm-4.6v-flash (12G) or mistralai/devstral (14G) for good quality + efficiency
- Avoid: nvidia/nemotron (fast but low quality), olmo-think (verbose), minimax-m2, kimi, deepseek-r1, glm-4.5-air
The data is clear: bigger models CAN produce better code, but at a 5-18x efficiency cost. Speed alone doesn't guarantee quality - the nvidia/nemotron models are the fastest yet produce only average-to-poor output. For most use cases, the 11-17GB models offer the best pragmatic balance. Only choose large models (> 30GB) when code quality justifies the massive efficiency penalty.
Benchmark conducted with LMStudio on local hardware. Performance will vary based on hardware specifications, model quantization, and system load.