LLM Benchmark: 3D Rotating Cube Challenge
Overview
This benchmark tests LLMs running locally in LMStudio by asking each model to produce a 3D rotating cube as a single HTML file using HTML, CSS, and JavaScript. It measures performance metrics including tokens per second, total response time, and time to first token.
The Challenge
Please implement an html + js + css only project stored on a single html file that implements a 3d cube rotating. If you want you can add external dependencies as long as they are only included from cdns.
Benchmark Script
The benchmark was implemented in Elixir, using HTTPoison to interact with the LMStudio API; a minimal sketch of this flow appears after the feature list below. The script:
- Fetches available models from LMStudio
- Warms up each model with a simple prompt
- Streams the actual benchmark prompt and measures performance metrics
- Saves individual responses and aggregated metrics
Key features:
- Streaming support for accurate time-to-first-token measurements
- Automatic token counting with fallback estimation
- CSV export of metrics for analysis
- Individual response files for quality comparison
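The script itself is not reproduced here, but a minimal sketch of the streaming flow might look like the following. It assumes LMStudio's default OpenAI-compatible endpoint on localhost:1234, the Jason library for JSON, and HTTPoison's asynchronous `stream_to:` delivery; the module and function names are illustrative, not the author's.

```elixir
# Minimal sketch, not the author's actual benchmark script.
# Assumptions: LMStudio listening on localhost:1234, HTTPoison and Jason available.
defmodule CubeBench.Sketch do
  @base_url "http://localhost:1234/v1"
  @headers [{"content-type", "application/json"}]

  # List the models LMStudio currently exposes via its OpenAI-style /models route.
  def list_models do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} =
      HTTPoison.get("#{@base_url}/models")

    body |> Jason.decode!() |> Map.get("data", []) |> Enum.map(& &1["id"])
  end

  # Stream one prompt and record time to first token and total time.
  def stream_prompt(model, prompt) do
    body =
      Jason.encode!(%{
        model: model,
        stream: true,
        messages: [%{role: "user", content: prompt}]
      })

    started = System.monotonic_time(:millisecond)

    # With `stream_to:`, HTTPoison sends AsyncStatus/AsyncHeaders/AsyncChunk/AsyncEnd
    # messages to this process as the response arrives.
    {:ok, _async} =
      HTTPoison.post("#{@base_url}/chat/completions", body, @headers,
        stream_to: self(),
        recv_timeout: 600_000
      )

    collect(started, nil, [])
  end

  defp collect(started, ttft, chunks) do
    receive do
      %HTTPoison.AsyncChunk{chunk: chunk} ->
        ttft = ttft || (System.monotonic_time(:millisecond) - started) / 1000
        collect(started, ttft, [chunk | chunks])

      %HTTPoison.AsyncEnd{} ->
        total = (System.monotonic_time(:millisecond) - started) / 1000
        raw = chunks |> Enum.reverse() |> IO.iodata_to_binary()
        %{time_to_first_token_s: ttft, total_time_s: total, raw_sse: raw}

      _other ->
        collect(started, ttft, chunks)
    after
      600_000 -> {:error, :timeout}
    end
  end
end
```

The real script additionally parses the SSE `data:` chunks for token deltas and writes per-model rows to the CSV; both steps are omitted from this sketch for brevity.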
Results
Performance Comparison
Model | Size | Tokens/sec | Efficiency (tok/sec/GB) | Total Tokens | Total Time (s) | Time to First Token (s) |
---|---|---|---|---|---|---|
openai/gpt-oss-20b | 11G | 95.39 | 8.67 | 1,009 | 10.04 | 0.69 |
qwen/qwen3-next-80b | 79G | 84.09 | 1.06 | 1,451 | 16.65 | 0.24 |
qwen/qwen3-coder-30b | 30G | 80.89 | 2.70 | 2,804 | 34.03 | 0.26 |
qwen/qwen3-30b-a3b-2507 | 30G | 71.00 | 2.37 | 1,391 | 18.87 | 0.24 |
openai-gpt-oss-120b-mlx-6 | 88G | 60.56 | 0.69 | 1,320 | 20.96 | 3.03 |
deepseek/deepseek-r1-0528-qwen3-8b | 8.1G | 55.69 | 6.88 | 3,938 | 69.79 | 0.19 |
glm-4.5-air-mlx | 56G | 45.10 | 0.81 | 4,919 | 107.93 | 0.41 |
kimi-dev-72b-dwq | 38G | 6.88 | 0.18 | 4,868 | 700.39 | 1.41 |
kimi-dev-72b | 72G | 5.96 | 0.08 | 8,199 | 1,366.31 | 1.93 |
Key Observations
Performance Leaders
- Absolute fastest: openai/gpt-oss-20b (95.39 tokens/sec)
- Most efficient (size-adjusted): openai/gpt-oss-20b (8.67 tokens/sec/GB)
- Quickest to respond: deepseek/deepseek-r1-0528-qwen3-8b (0.19s to first token)
- Best small model: deepseek/deepseek-r1-0528-qwen3-8b (55.69 tok/sec @ 8.1G)
Size vs Performance Analysis
- Small models (< 15GB):
  - openai/gpt-oss-20b (11G) delivers exceptional performance with minimal resource usage
  - deepseek/deepseek-r1-0528-qwen3-8b (8.1G) offers impressive efficiency and fast response times
- Medium models (30GB):
  - qwen/qwen3-coder-30b balances speed (80.89 tok/sec) with comprehensive output
  - qwen/qwen3-30b-a3b-2507 provides good performance at similar size
- Large models (> 50GB):
  - qwen/qwen3-next-80b (79G) is the fastest large model (84.09 tok/sec)
  - openai-gpt-oss-120b-mlx-6 (88G) and glm-4.5-air-mlx (56G) show diminishing returns
  - kimi models (72G, 38G) significantly underperform relative to their size
Efficiency Winners
The tokens-per-second-per-GB metric shows which models deliver the most throughput for the memory they consume (computed as in the sketch after this list):
- openai/gpt-oss-20b: 8.67 tok/sec/GB (exceptional efficiency)
- deepseek/deepseek-r1-0528-qwen3-8b: 6.88 tok/sec/GB (excellent for its size)
- qwen/qwen3-coder-30b: 2.70 tok/sec/GB (good balance)
- Large models (> 50GB) all fall below 1.1 tok/sec/GB
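For reference, the efficiency column is simply measured throughput divided by on-disk model size; checking it against the table's own numbers:

```elixir
# Efficiency = tokens per second / model size in GB (values taken from the table above).
efficiency = fn tokens_per_sec, size_gb -> Float.round(tokens_per_sec / size_gb, 2) end

efficiency.(95.39, 11)    # openai/gpt-oss-20b    => 8.67
efficiency.(80.89, 30)    # qwen/qwen3-coder-30b  => 2.70
```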
Quality Assessment
All models successfully generated working 3D rotating cubes. However, there were notable differences in implementation complexity and feature completeness:
Quality Ratings
Model | Quality | Notes |
---|---|---|
glm-4.5-air-mlx | ⭐⭐⭐⭐⭐ Excellent | Created 5 cubes with working controls! |
openai/gpt-oss-20b | ⭐⭐⭐⭐⭐ Perfect | Clean implementation, works flawlessly |
deepseek/deepseek-r1-0528-qwen3-8b | ⭐⭐⭐⭐⭐ Perfect | Clean implementation, works flawlessly |
openai-gpt-oss-120b-mlx-6 | ⭐⭐⭐⭐⭐ Perfect | Clean implementation, works flawlessly |
qwen/qwen3-30b-a3b-2507 | ⭐⭐⭐⭐ Good | Works perfectly, added non-functional controls |
qwen/qwen3-next-80b | ⭐⭐⭐⭐ Good | Works fine, includes non-functional controls |
qwen/qwen3-coder-30b | ⭐⭐⭐⭐ Good | Works fine, added styling + non-functional controls |
kimi-dev-72b | ⭐⭐⭐⭐ Good | Works fine, basic implementation |
kimi-dev-72b-dwq | ⭐⭐⭐⭐ Good | Works fine, basic implementation |
Quality vs Performance Analysis
Interestingly, output quality does not correlate with model size:
- Small models (8-11GB) produced perfect, clean implementations
- Some medium/large models over-engineered with non-functional features
- The largest model tested (openai-gpt-oss-120b-mlx-6, 88G on disk) produced a perfect, simple solution
- glm-4.5-air-mlx went above and beyond with multiple cubes and functional controls
Key insight: Simpler, smaller models tend to generate cleaner, more focused code. Larger models sometimes add unnecessary complexity (non-working controls) that doesn’t enhance the final product.
Model Outputs
Click on the links below to view each model’s response:
Small Models (< 15GB) - Best Efficiency
- openai/gpt-oss-20b (11G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
  - 95.39 tok/sec | 8.67 tok/sec/GB | Most concise (958 tokens) | Perfect implementation
- deepseek/deepseek-r1-0528-qwen3-8b (8.1G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
  - 55.69 tok/sec | 6.88 tok/sec/GB | Fastest first token (0.19s) | Perfect implementation
Medium Models (30GB) - Balanced Performance
- qwen/qwen3-coder-30b (30G) ⭐⭐⭐⭐: Raw Response | HTML Demo
  - 80.89 tok/sec | 2.70 tok/sec/GB | Added styling + non-functional controls
- qwen/qwen3-30b-a3b-2507 (30G) ⭐⭐⭐⭐: Raw Response | HTML Demo
  - 71.00 tok/sec | 2.37 tok/sec/GB | Non-functional controls
Large Models (> 50GB) - Diminishing Returns
- qwen/qwen3-next-80b (79G) ⭐⭐⭐⭐: Raw Response | HTML Demo
  - 84.09 tok/sec | 1.06 tok/sec/GB | Fastest large model | Non-functional controls
- openai-gpt-oss-120b-mlx-6 (88G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
  - 60.56 tok/sec | 0.69 tok/sec/GB | Slowest first token (3.03s) | Perfect implementation
- glm-4.5-air-mlx (56G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
  - 45.10 tok/sec | 0.81 tok/sec/GB | Most creative: 5 cubes with working controls!
Underperforming Models
- kimi-dev-72b-dwq (38G) ⭐⭐⭐⭐: Raw Response | HTML Demo
  - 6.88 tok/sec | 0.18 tok/sec/GB | Works fine
- kimi-dev-72b (72G) ⭐⭐⭐⭐: Raw Response | HTML Demo
  - 5.96 tok/sec | 0.08 tok/sec/GB | Extremely verbose (8,148 tokens)
Technical Details
Hardware: Mac Studio (M4, 128GB unified memory)
Date: October 2025
Benchmark tool: Custom Elixir script with streaming support
API: LMStudio's OpenAI-compatible API endpoint
Methodology
- Model Warmup: Each model receives a simple “hello” prompt before the benchmark to ensure it’s loaded in memory
- Streaming: Responses are streamed to accurately measure time-to-first-token
- Token Counting: Uses actual token counts from the API when available, falling back to an estimate of 1 token ≈ 4 characters (see the sketch after this list)
- Timeout: 10-minute timeout per model to handle slower responses
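A small sketch of the warmup call and the token-count fallback described above; the endpoint, port, and function names are assumptions for illustration, not the author's exact code.

```elixir
# Illustrative sketch of two methodology details above; not the author's exact code.
defmodule CubeBench.Methodology do
  @chars_per_token 4
  @endpoint "http://localhost:1234/v1/chat/completions"  # assumed LMStudio default

  # Warm a model with a trivial prompt so load time doesn't distort the measured run.
  def warmup(model) do
    body = Jason.encode!(%{model: model, messages: [%{role: "user", content: "hello"}]})

    HTTPoison.post(@endpoint, body, [{"content-type", "application/json"}],
      recv_timeout: 600_000)
  end

  # Prefer the completion token count the API reports in its `usage` field...
  def completion_tokens(%{"usage" => %{"completion_tokens" => n}}, _text) when is_integer(n),
    do: n

  # ...otherwise estimate from the generated text at roughly 4 characters per token.
  def completion_tokens(_response, generated_text),
    do: div(String.length(generated_text), @chars_per_token)
end
```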
Conclusion
This benchmark reveals a clear efficiency advantage for smaller models without sacrificing quality. The data shows that larger models do not provide proportional performance gains relative to their resource requirements, nor do they produce better code.
Key Takeaways
- Size doesn’t guarantee speed: The 11GB openai/gpt-oss-20b outperforms models 7x its size
- Size doesn’t guarantee quality: Small models (8-11GB) produced perfect implementations
- Efficiency matters: Small models deliver 6-8x better tokens/sec/GB than large models
- Diminishing returns: Models > 50GB show significantly worse efficiency (< 1.1 tok/sec/GB)
- Simplicity wins: Smaller models generate cleaner code; larger models over-engineer with non-functional features
- Sweet spot: 8-30GB models offer the best balance of performance and resource usage
Recommendations by Use Case
- Best overall choice (speed + quality + efficiency)?
  - → openai/gpt-oss-20b (11G, 95.39 tok/sec, 8.67 tok/sec/GB, ⭐⭐⭐⭐⭐)
  - → deepseek/deepseek-r1-0528-qwen3-8b (8.1G, 55.69 tok/sec, 6.88 tok/sec/GB, ⭐⭐⭐⭐⭐)
- Want creative implementations?
  - → glm-4.5-air-mlx (56G) - created 5 cubes with working controls (⭐⭐⭐⭐⭐)
  - Note: Slower speed (45.10 tok/sec) but impressive output
- Need comprehensive code with good speed?
  - → qwen/qwen3-coder-30b (30G, 80.89 tok/sec, 2.70 tok/sec/GB)
  - Caveat: Added non-functional controls (over-engineering)
- Have resources for large models?
  - → openai-gpt-oss-120b-mlx-6 (88G) - perfect quality but slow (60.56 tok/sec)
  - → qwen/qwen3-next-80b (79G) - faster (84.09 tok/sec) but adds non-functional features
  - Note: Neither justifies the 7x resource usage vs small models
- Avoid: kimi models show poor performance relative to size (< 0.2 tok/sec/GB)
The Efficiency Paradox
Counterintuitively, smaller models are both faster AND produce cleaner code. The overhead of pushing larger parameter counts through memory appears to outweigh whatever quality advantage the extra capacity might offer, and the larger models tended to over-engineer their solutions with non-functional features that don't add value.
For code generation on local hardware with the M4 chip’s unified memory architecture, bigger is definitely not better. The sweet spot is 8-11GB models that maximize both performance and code quality.
Benchmark conducted with LMStudio on local hardware. Performance will vary based on hardware specifications, model quantization, and system load.