LLM Benchmark: 3D Rotating Cube Challenge
Updated November 2025: Added 10 new models to the benchmark, including coder-specialized models and the latest releases from Qwen, Minimax, and Mistral AI.
Overview
This benchmark tests a range of LLMs running locally through LMStudio by asking each one to implement a single HTML file containing a 3D rotating cube using HTML, CSS, and JavaScript. The benchmark measures tokens per second, total response time, and time to first token.
The Challenge
> Please implement an html + js + css only project stored on a single html file that implements a 3d cube rotating. If you want you can add external dependencies as long as they are only included from cdns.
Benchmark Script
The benchmark was implemented in Elixir using HTTPoison to interact with the LMStudio API. The script:
- Fetches available models from LMStudio
- Warms up each model with a simple prompt
- Streams the actual benchmark prompt and measures performance metrics
- Saves individual responses and aggregated metrics
Key features:
- Streaming support for accurate time-to-first-token measurements
- Automatic token counting with fallback estimation
- CSV export of metrics for analysis
- Individual response files for quality comparison
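The core of the script is the streaming request loop. Below is a minimal sketch of that loop, not the actual benchmark code: it assumes LMStudio's default OpenAI-compatible endpoint on localhost:1234 and the Jason library for JSON encoding, and the module and function names are illustrative. HTTPoison's stream_to: option delivers response chunks as process messages, which is what makes the time-to-first-token measurement possible.

```elixir
# Illustrative sketch only - the real script differs. Assumes LMStudio's
# OpenAI-compatible server on localhost:1234 and the Jason library for JSON.
defmodule CubeBench do
  @url "http://localhost:1234/v1/chat/completions"

  def measure(model, prompt) do
    body =
      Jason.encode!(%{
        model: model,
        stream: true,
        messages: [%{role: "user", content: prompt}]
      })

    started = System.monotonic_time(:millisecond)

    # stream_to: self() makes HTTPoison send async chunks to this process,
    # so the arrival of the first chunk marks time-to-first-token.
    {:ok, _ref} =
      HTTPoison.post(@url, body, [{"content-type", "application/json"}],
        stream_to: self(),
        # 10-minute timeout, matching the methodology below
        recv_timeout: 600_000
      )

    collect(started, nil, [])
  end

  defp collect(started, ttft_ms, chunks) do
    receive do
      %HTTPoison.AsyncChunk{chunk: chunk} ->
        ttft_ms = ttft_ms || System.monotonic_time(:millisecond) - started
        collect(started, ttft_ms, [chunk | chunks])

      %HTTPoison.AsyncEnd{} ->
        %{
          ttft_ms: ttft_ms,
          total_ms: System.monotonic_time(:millisecond) - started,
          raw_chunks: Enum.reverse(chunks)
        }

      _status_or_headers ->
        # AsyncStatus and AsyncHeaders arrive before the chunks; ignore them.
        collect(started, ttft_ms, chunks)
    end
  end
end
```

A plain blocking request would only yield total latency; the arrival of the first AsyncChunk is what populates the time-to-first-token column in the results.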
Results
Performance Comparison
| Model | Size (GB) | Tokens/sec | Efficiency (tok/sec/GB) | Total Tokens | Total Time (s) | Time to First Token (s) |
|---|---|---|---|---|---|---|
| openai/gpt-oss-20b | 11 | 95.39 | 8.67 | 1,009 | 10.04 | 0.69 |
| deepseek-coder-v2-lite-instruct | 17 | 90.89 | 5.35 | 1,089 | 11.42 | 0.14 |
| qwen/qwen3-next-80b | 79 | 84.09 | 1.06 | 1,451 | 16.65 | 0.24 |
| qwen/qwen3-coder-30b | 30 | 80.89 | 2.70 | 2,804 | 34.03 | 0.26 |
| qwen/qwen3-30b-a3b-2507 | 30 | 71.00 | 2.37 | 1,391 | 18.87 | 0.24 |
| openai/gpt-oss-120b | 63 | 64.92 | 1.03 | 1,327 | 19.66 | 1.01 |
| openai-gpt-oss-120b-mlx-6 | 88 | 60.56 | 0.69 | 1,320 | 20.96 | 3.03 |
| qwen/qwen3-vl-30b | 34 | 57.90 | 1.70 | 885 | 14.40 | 1.08 |
| deepseek/deepseek-r1-0528-qwen3-8b | 8 | 55.69 | 6.88 | 3,938 | 69.79 | 0.19 |
| glm-4.5-air | 47 | 52.09 | 1.11 | 5,351 | 101.75 | 0.46 |
| minimax-m2 | 100 | 48.00 | 0.48 | 4,005 | 82.38 | 0.55 |
| glm-4.5-air-mlx | 56 | 45.10 | 0.81 | 4,919 | 107.93 | 0.41 |
| qwen/qwen2.5-coder-14b | 16 | 27.50 | 1.72 | 781 | 26.54 | 0.43 |
| qwen/qwen2.5-coder-32b | 19 | 23.78 | 1.25 | 908 | 36.04 | 0.65 |
| mistralai/magistral-small-2509 | 26 | 19.52 | 0.75 | 1,301 | 64.04 | 0.48 |
| qwen/qwq-32b | 35 | 12.70 | 0.36 | 11,099 | 869.78 | 0.52 |
| nousresearch/hermes-4-70b | 40 | 10.83 | 0.27 | 639 | 54.28 | 1.26 |
| kimi-dev-72b-dwq | 38 | 6.88 | 0.18 | 4,868 | 700.39 | 1.41 |
| kimi-dev-72b | 72 | 5.96 | 0.08 | 8,199 | 1,366.31 | 1.93 |
Key Observations
Performance Leaders
- Absolute fastest: openai/gpt-oss-20b (95.39 tokens/sec, 11GB) - Clean working code
- Second fastest: deepseek-coder-v2-lite-instruct (90.89 tokens/sec, 17GB) - Canvas wireframe
- Most efficient (size-adjusted): openai/gpt-oss-20b (8.67 tokens/sec/GB)
- Quickest to respond: deepseek-coder-v2-lite-instruct (0.14s to first token)
- Fastest large model: qwen/qwen3-next-80b (84.09 tok/sec @ 79GB) - Excellent with working controls
Important: deepseek/deepseek-r1-0528-qwen3-8b (55.69 tok/sec, 8.1GB) has broken mouse controls and should be avoided despite good speed metrics.
Size vs Performance Analysis
- Small models (< 20GB) - Best efficiency tier:
- ✅ openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency - Clean, working code
- ✅ deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency - Fastest first token (0.14s)
- ✅ qwen/qwen2.5-coder-14b (16G): 27.50 tok/sec, 1.72 efficiency - Simple, clean CSS
- ✅ qwen/qwen2.5-coder-32b (19G): 23.78 tok/sec, 1.25 efficiency - Clean CSS with dark background
- ❌ deepseek/deepseek-r1-0528-qwen3-8b (8.1G): 55.69 tok/sec BUT broken mouse controls
- Medium models (20-40GB) - Balanced tier:
- ✅ qwen/qwen3-coder-30b (30G): 80.89 tok/sec, 2.70 efficiency - Fast with comprehensive output
- ✅ qwen/qwen3-30b-a3b-2507 (30G): 71.00 tok/sec, 2.37 efficiency - Good speed
- ✅ qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency - Clean labeled faces
- ⚠️ mistralai/magistral-small-2509 (26G): 19.52 tok/sec, 0.75 efficiency - Low efficiency
- ❌ nousresearch/hermes-4-70b (40G): 10.83 tok/sec, 0.27 efficiency - Slow despite good code quality
- ❌ qwen/qwq-32b (35G): 12.70 tok/sec, 0.36 efficiency - Extremely verbose (11k tokens)
- Large models (> 40GB) - Mixed results, mostly inefficient:
- ✅ qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency - Best large model with working controls
- ✅ openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency - Working pause/resume controls
- ✅ openai-gpt-oss-120b-mlx-6 (88G): 60.56 tok/sec, 0.69 efficiency - Excellent mouse drag controls
- ⚠️ glm-4.5-air (47G): 52.09 tok/sec, 1.11 efficiency - Over-engineered dual implementation
- ⚠️ glm-4.5-air-mlx (56G): 45.10 tok/sec, 0.81 efficiency - Over-engineered 5 cubes
- ❌ minimax-m2 (100G!): 48.00 tok/sec, 0.48 efficiency - Huge model, basic Three.js output
- ❌ kimi-dev-72b (72G): 5.96 tok/sec, 0.08 efficiency - Extremely slow, verbose
- ❌ kimi-dev-72b-dwq (38G): 6.88 tok/sec, 0.18 efficiency - Very slow
Efficiency Winners (tokens/sec/GB)
Tokens/sec divided by model size captures the true value considering both speed and resource requirements (e.g. openai/gpt-oss-20b: 95.39 tok/sec ÷ 11GB = 8.67):
- openai/gpt-oss-20b: 8.67 tok/sec/GB (exceptional efficiency + working code)
- deepseek-coder-v2-lite-instruct: 5.35 tok/sec/GB (excellent efficiency + fastest first token)
- qwen/qwen3-coder-30b: 2.70 tok/sec/GB (good balance)
- qwen/qwen3-30b-a3b-2507: 2.37 tok/sec/GB (good performance)
- qwen/qwen2.5-coder-14b: 1.72 tok/sec/GB (moderate efficiency)
- qwen/qwen3-vl-30b: 1.70 tok/sec/GB (good for vision model)
- qwen/qwen2.5-coder-32b: 1.25 tok/sec/GB (moderate efficiency)
Critical finding: Models > 40GB all fall below 1.1 tok/sec/GB except glm-4.5-air (1.11). The 100GB minimax-m2 achieves only 0.48 efficiency - worse than models 1/10th its size!
Quality Assessment
Important finding: 18 out of 19 models (95%) generated working implementations. However, quality varies significantly based on code cleanliness, working features, and verbosity.
Quality Ratings
| Model | Rating | Implementation | Notes |
|---|---|---|---|
| openai-gpt-oss-120b-mlx-6 | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Working mouse drag, labeled faces, external CSS reset, clean code |
| qwen/qwen3-next-80b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/reset buttons, labeled faces, info text, clean |
| nousresearch/hermes-4-70b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working speed/rotation controls, pause/reset, gradient background |
| openai/gpt-oss-120b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/resume buttons, clean code, Google Fonts |
| qwen/qwen3-vl-30b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Clean code, labeled faces, nice colors, dark background |
| openai/gpt-oss-20b | ⭐⭐⭐⭐ Good | CSS 3D + JS | Working mouse drag, simple and clean, Google Fonts |
| deepseek-coder-v2-lite-instruct | ⭐⭐⭐⭐ Good | Canvas 2D | Clean wireframe with manual 3D math projection (see the sketch after this table) |
| qwen/qwen2.5-coder-14b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, simple white faces, works well |
| qwen/qwen2.5-coder-32b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, white faces, dark background |
| qwen/qwq-32b | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, ambient + directional lighting, clean |
| minimax-m2 | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, basic material, resize handler |
| mistralai/magistral-small-2509 | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS animation, no interactivity, clean and simple |
| kimi-dev-72b | ⭐⭐⭐⭐ Good | CSS 3D | Basic auto-rotation, clean code |
| kimi-dev-72b-dwq | ⭐⭐⭐⭐ Good | CSS 3D | CSS-only Y-axis rotation, clean |
| qwen/qwen3-30b-a3b-2507 | ⭐⭐⭐ Average | CSS 3D + JS | Working buttons but uses JS animation instead of CSS, verbose |
| glm-4.5-air-mlx | ⭐⭐⭐ Average | CSS 3D + JS | 5 cubes with mouse tracking/controls, over-engineered, verbose |
| glm-4.5-air | ⭐⭐⭐ Average | CSS 3D + Three.js | Dual implementation with tabs, over-engineered, excessive features |
| qwen/qwen3-coder-30b | ⭐⭐⭐ Average | CSS 3D | Works but added non-functional controls, misleading UI |
| deepseek/deepseek-r1-0528-qwen3-8b | ⭐⭐ Poor | CSS 3D + JS | BROKEN: Mouse tracking doesn’t work, double-click zoom broken |
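One implementation type above deserves a closer look: deepseek-coder-v2-lite-instruct's Canvas 2D wireframe skips CSS transforms and Three.js entirely, rotating each cube vertex by hand and perspective-projecting it onto a 2D canvas. A minimal sketch of that math follows, written in Elixir to match the benchmark tooling; the actual model output does this in JavaScript, and the names and focal length here are illustrative.

```elixir
# What "manual 3D math projection" means: rotate each vertex of the cube
# around the Y and X axes, then divide by depth to project onto 2D.
defmodule Project do
  @f 300.0  # viewer distance / focal length (arbitrary)

  def rotate_y({x, y, z}, a),
    do: {x * :math.cos(a) + z * :math.sin(a), y, z * :math.cos(a) - x * :math.sin(a)}

  def rotate_x({x, y, z}, a),
    do: {x, y * :math.cos(a) - z * :math.sin(a), y * :math.sin(a) + z * :math.cos(a)}

  # Perspective divide: vertices farther from the viewer shrink toward center.
  def project({x, y, z}), do: {@f * x / (@f + z), @f * y / (@f + z)}
end
```

Drawing the 12 edges between the projected vertex pairs on each animation frame yields the rotating wireframe.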
Quality vs Performance Analysis
Critical insight: Output quality and raw throughput often diverge, so neither size nor speed alone predicts a usable result:
Top Quality + Top Performance (Best Overall)
- ✅ openai/gpt-oss-20b (11G): 95.39 tok/sec, ⭐⭐⭐⭐ - Champion for speed + quality
- ✅ deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, ⭐⭐⭐⭐ - Runner-up, fastest first token
- ✅ qwen/qwen3-next-80b (79G): 84.09 tok/sec, ⭐⭐⭐⭐⭐ - Best large model with excellent quality
High Quality Despite Poor Performance
- ⚠️ nousresearch/hermes-4-70b (40G): Only 10.83 tok/sec BUT ⭐⭐⭐⭐⭐ excellent code quality
- ⚠️ qwen/qwen3-vl-30b (34G): 57.90 tok/sec, ⭐⭐⭐⭐⭐ excellent clean code
Poor Output Destroys Value
- ❌ deepseek/deepseek-r1-0528-qwen3-8b (8.1G): Good speed (55.69 tok/sec) BUT broken code (⭐⭐)
- ❌ minimax-m2 (100G!): Moderate speed (48.00 tok/sec) but basic Three.js output (⭐⭐⭐⭐)
- ❌ glm-4.5-air (47G): 52.09 tok/sec but over-engineered (⭐⭐⭐)
- ❌ glm-4.5-air-mlx (56G): 45.10 tok/sec but over-engineered (⭐⭐⭐)
Size Analysis by Quality Tier
Excellent (⭐⭐⭐⭐⭐) - Only 5 models:
- openai-gpt-oss-120b-mlx-6 (88G) - Large, 60.56 tok/sec
- qwen/qwen3-next-80b (79G) - Large, 84.09 tok/sec ✨
- openai/gpt-oss-120b (63G) - Large, 64.92 tok/sec
- nousresearch/hermes-4-70b (40G) - Medium, 10.83 tok/sec
- qwen/qwen3-vl-30b (34G) - Medium, 57.90 tok/sec
Good (⭐⭐⭐⭐) - 9 models:
- Includes best performers: openai/gpt-oss-20b (95.39 tok/sec), deepseek-coder-v2-lite-instruct (90.89 tok/sec)
- Range: 11-100GB, all working implementations
Average (⭐⭐⭐) - 4 models:
- All over-engineered or misleading UIs
- Includes both glm-4.5-air models (over-engineered)
Poor (⭐⭐) - 1 model:
- deepseek/deepseek-r1-0528-qwen3-8b - Broken code despite good metrics
Key insights:
- Speed matters more than elaborate features: openai/gpt-oss-20b (95.39 tok/sec, ⭐⭐⭐⭐) beats minimax-m2 (48.00 tok/sec, ⭐⭐⭐⭐)
- Working code > broken fancy features: deepseek-r1’s broken code invalidates its speed advantage
- Over-engineering reduces quality: glm-4.5-air models score only ⭐⭐⭐ despite elaborate UIs
- Size doesn’t predict quality: 5 excellent models range from 34GB to 88GB, with huge gaps in between
- Minimax-m2 disappoints: 100GB model produces basic Three.js output at only 48 tok/sec
Model Outputs
Click on the links below to view each model’s response:
Top Performers - Best Speed + Quality
- openai/gpt-oss-20b (11G) ⭐⭐⭐⭐: Raw Response | HTML Demo
- 95.39 tok/sec | 8.67 efficiency | Fastest overall | Working mouse drag | 958 tokens
- deepseek-coder-v2-lite-instruct (17G) ⭐⭐⭐⭐: Raw Response | HTML Demo
- 90.89 tok/sec | 5.35 efficiency | Fastest first token (0.14s) | Canvas wireframe
- qwen/qwen3-next-80b (79G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
- 84.09 tok/sec | 1.06 efficiency | Best large model | Working pause/reset buttons
Excellent Quality (⭐⭐⭐⭐⭐) - Best Code
- openai-gpt-oss-120b-mlx-6 (88G): Raw Response | HTML Demo
- 60.56 tok/sec | 0.69 efficiency | Working mouse drag | Labeled faces | External CSS reset
- openai/gpt-oss-120b (63G): Raw Response | HTML Demo
- 64.92 tok/sec | 1.03 efficiency | Working pause/resume | Clean code
- nousresearch/hermes-4-70b (40G): Raw Response | HTML Demo
- 10.83 tok/sec | 0.27 efficiency | Working speed/rotation controls | Minimal 639 tokens
- qwen/qwen3-vl-30b (34G): Raw Response | HTML Demo
- 57.90 tok/sec | 1.70 efficiency | Clean labeled faces | Nice colors
Good Quality (⭐⭐⭐⭐) - Solid Implementations
- qwen/qwen2.5-coder-14b (16G): Raw Response | HTML Demo
- 27.50 tok/sec | 1.72 efficiency | Pure CSS | Simple clean white faces
- qwen/qwen2.5-coder-32b (19G): Raw Response | HTML Demo
- 23.78 tok/sec | 1.25 efficiency | Pure CSS | Dark background
- qwen/qwq-32b (35G): Raw Response | HTML Demo
- 12.70 tok/sec | 0.36 efficiency | Three.js | Extremely verbose 11,099 tokens
- minimax-m2 (100G): Raw Response | HTML Demo
- 48.00 tok/sec | 0.48 efficiency | Simple Three.js | Huge 100GB model!
- mistralai/magistral-small-2509 (26G): Raw Response | HTML Demo
- 19.52 tok/sec | 0.75 efficiency | Pure CSS | No interactivity
- kimi-dev-72b (72G): Raw Response | HTML Demo
- 5.96 tok/sec | 0.08 efficiency | Basic CSS | Extremely verbose 8,148 tokens
- kimi-dev-72b-dwq (38G): Raw Response | HTML Demo
- 6.88 tok/sec | 0.18 efficiency | CSS Y-axis only | Clean code
Average Quality (⭐⭐⭐) - Over-engineered or Misleading
- qwen/qwen3-30b-a3b-2507 (30G): Raw Response | HTML Demo
- 71.00 tok/sec | 2.37 efficiency | Working buttons but uses JS instead of CSS | Verbose
- qwen/qwen3-coder-30b (30G): Raw Response | HTML Demo
- 80.89 tok/sec | 2.70 efficiency | Non-functional controls | Misleading UI
- glm-4.5-air-mlx (56G): Raw Response | HTML Demo
- 45.10 tok/sec | 0.81 efficiency | 5 cubes with mouse tracking | Over-engineered
- glm-4.5-air (47G): Raw Response | HTML Demo
- 52.09 tok/sec | 1.11 efficiency | Dual implementation with tabs | Over-engineered
Poor Quality (⭐⭐) - Broken Code
- deepseek/deepseek-r1-0528-qwen3-8b (8.1G): Raw Response | HTML Demo
- 55.69 tok/sec | 6.88 efficiency | BROKEN: Mouse tracking fails | Double-click zoom fails | AVOID
Technical Details
- Hardware: Mac Studio (M4, 128GB Unified Memory)
- Initial Benchmark: October 2025 (9 models)
- First Update: November 3, 2025 (17 models total, 7 new)
- Second Update: November 6, 2025 (20 models total, 3 additional)
- Benchmark Tool: Custom Elixir script with streaming support
- API: LMStudio-compatible OpenAI API endpoint
Methodology
- Model Warmup: Each model receives a simple “hello” prompt before the benchmark to ensure it’s loaded in memory
- Streaming: Responses are streamed to accurately measure time-to-first-token
- Token Counting: Uses actual token counts from the API when available, falls back to estimation (1 token ≈ 4 characters)
- Timeout: 10-minute timeout per model to handle slower responses
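To make the derivation of the table columns concrete, here is a small sketch of the fallback token estimate and the two throughput metrics. The function names are illustrative rather than taken from the actual script.

```elixir
# Sketch of the derived metrics (function names are illustrative).
defmodule Metrics do
  # Fallback when the API response carries no usage/token counts:
  # 1 token is roughly 4 characters of generated text.
  def estimate_tokens(text), do: max(div(String.length(text), 4), 1)

  # Tokens/sec column: total tokens over wall-clock generation time.
  def tokens_per_sec(tokens, total_ms), do: tokens / (total_ms / 1000)

  # Efficiency column: throughput normalized by model size,
  # e.g. openai/gpt-oss-20b: 95.39 / 11 = 8.67 tok/sec/GB.
  def efficiency(tok_per_sec, size_gb), do: tok_per_sec / size_gb
end
```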
Conclusion
This comprehensive benchmark of 19 models reveals critical insights that challenge conventional assumptions about model size, speed, and quality.
Key Takeaways
- Working code matters more than speed: deepseek-r1 (55.69 tok/sec) has broken code (⭐⭐), making it worse than slower models with working implementations
- Size ≠ Speed ≠ Quality: The 11GB openai/gpt-oss-20b (95.39 tok/sec, ⭐⭐⭐⭐) outperforms the 100GB minimax-m2 (48.00 tok/sec, ⭐⭐⭐⭐) in both metrics
- Efficiency gap is massive: Small models achieve 5-18x better tokens/sec/GB than large models (8.67 vs 0.48)
- Only 5 models achieved excellent (⭐⭐⭐⭐⭐) quality: 3 large models (63-88GB), 2 medium models (34-40GB), and ZERO small models
- Large models dominate quality: All 5 excellent-rated models are 34GB or larger, and four of the five ship working interactive controls
- Over-engineering hurts: glm-4.5-air models (47-56GB) scored only ⭐⭐⭐ despite elaborate dual implementations
- Minimax-m2 is a trap: 100GB model delivers basic Three.js output at 0.48 efficiency - worst value
- Diminishing returns after 40GB: No model larger than 40GB exceeds 1.11 tok/sec/GB efficiency
Recommendations by Use Case
- 🏆 Best overall (speed + quality)?
- → openai/gpt-oss-20b (11G, 95.39 tok/sec, ⭐⭐⭐⭐) - Fastest with working code!
- → deepseek-coder-v2-lite-instruct (17G, 90.89 tok/sec, ⭐⭐⭐⭐) - Runner-up, fastest first token (0.14s)
- Want absolute best quality code (⭐⭐⭐⭐⭐)?
- → qwen/qwen3-next-80b (79G, 84.09 tok/sec) - Best balance: excellent quality + fast speed
- → openai/gpt-oss-120b (63G, 64.92 tok/sec) - Working pause/resume controls
- → openai-gpt-oss-120b-mlx-6 (88G, 60.56 tok/sec) - Working mouse drag, best code quality
- → qwen/qwen3-vl-30b (34G, 57.90 tok/sec) - Great quality at moderate size
- → nousresearch/hermes-4-70b (40G, 10.83 tok/sec) - Excellent code but slow
- Maximum efficiency (best tok/sec/GB)?
- → openai/gpt-oss-20b (11G, 8.67 efficiency) - 18x better than minimax-m2!
- → deepseek-coder-v2-lite-instruct (17G, 5.35 efficiency) - Excellent value
- → qwen/qwen3-coder-30b (30G, 2.70 efficiency) - Best medium-sized option
- Budget-conscious (< 20GB)?
- → openai/gpt-oss-20b (11G) or deepseek-coder-v2-lite-instruct (17G) - Both excellent
- → qwen/qwen2.5-coder-14b (16G, 27.50 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
- → qwen/qwen2.5-coder-32b (19G, 23.78 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
- ⚠️ AVOID - Poor value:
- ❌ deepseek/deepseek-r1-0528-qwen3-8b (8.1G) - BROKEN CODE despite good speed
- ❌ minimax-m2 (100G!) - Worst efficiency (0.48), basic Three.js output
- ❌ kimi models (38-72G) - Extremely slow (< 7 tok/sec), terrible efficiency
- ❌ qwen/qwq-32b (35G) - Slow (12.70 tok/sec), extremely verbose (11k tokens)
- ❌ glm-4.5-air models (47-56G) - Over-engineered (⭐⭐⭐ only)
- ❌ qwen/qwen3-coder-30b (30G) - Misleading non-functional controls
The Quality vs Efficiency Tradeoff
This benchmark reveals a fundamental tradeoff between speed/efficiency and code quality:
Speed Champions (⭐⭐⭐⭐):
- openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency
- deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency
- Advantage: 2-18x faster per GB than large models
- Limitation: Good quality but not excellent (missing interactive controls)
Quality Champions (⭐⭐⭐⭐⭐):
- qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency
- openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency
- qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency
- Advantage: Working interactive controls, professional-grade code
- Limitation: 5-8x worse efficiency than small models
The Sweet Spot:
- qwen/qwen3-next-80b (79G) offers the best balance: excellent quality (⭐⭐⭐⭐⭐) + high speed (84.09 tok/sec)
- For maximum efficiency with good quality: openai/gpt-oss-20b (11G) remains unbeatable
Critical Failures:
- minimax-m2 (100G) is the largest model tested yet produces only basic Three.js output at 0.48 efficiency - a poor return on 100GB of memory
- deepseek-r1 (8.1G) has broken code despite good metrics - reliability matters more than speed
- Over-engineering penalty: glm-4.5-air models score only ⭐⭐⭐ despite elaborate features
For Mac Studio (M4, 128GB):
- Best choice: openai/gpt-oss-20b (11G) - Maximize speed + efficiency
- Best quality: qwen/qwen3-next-80b (79G) - Excellent code with strong performance
- Avoid: minimax-m2, kimi models, deepseek-r1, glm-4.5-air - All poor value
The data is clear: bigger models CAN produce better code, but at a 5-18x efficiency cost. For most use cases, the 11-17GB models offer the best pragmatic balance. Only choose large models (> 30GB) when code quality justifies the massive efficiency penalty.
Benchmark conducted with LMStudio on local hardware. Performance will vary based on hardware specifications, model quantization, and system load.