LLM Benchmark: 3D Rotating Cube Challenge

Updated November 2025: Added 10 new models to the benchmark, including coder-specialized models and the latest releases from Qwen, Minimax, and Mistral AI.

Overview

This benchmark tests a range of LLMs running locally through LMStudio by asking each one to implement a 3D rotating cube as a single HTML file using HTML, CSS, and JavaScript. The benchmark measures performance metrics including tokens per second, total response time, and time to first token.

The Challenge

Please implement an html + js + css only project stored on a single html file that implements a 3d cube rotating. If you want you can add external dependencies as long as they are only included from cdns.
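Models that skip CSS 3D transforms (like deepseek-coder's canvas wireframe below) reduce the problem to two operations: rotating each cube vertex and projecting it onto the 2D canvas. A minimal sketch of that math (function names are illustrative, not taken from any model's output):

```javascript
// Rotate a 3D point around the Y axis by `theta` radians.
function rotateY([x, y, z], theta) {
  const c = Math.cos(theta), s = Math.sin(theta);
  return [c * x + s * z, y, -s * x + c * z];
}

// Project a 3D point onto 2D with a simple perspective divide;
// `d` is the viewer's distance from the projection plane.
function project([x, y, z], d = 4) {
  const scale = d / (d + z);
  return [x * scale, y * scale];
}

// A cube vertex rotated a quarter turn: (1,0,0) -> (0,0,-1).
const [rx, ry, rz] = rotateY([1, 0, 0], Math.PI / 2);
```

An animation loop then just increments `theta` each frame, re-rotates the eight cube vertices, and redraws the projected edges.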

Benchmark Script

The benchmark was implemented in Elixir using HTTPoison to interact with the LMStudio API. The script:

  1. Fetches available models from LMStudio
  2. Warms up each model with a simple prompt
  3. Streams the actual benchmark prompt and measures performance metrics
  4. Saves individual responses and aggregated metrics

Key features:

  • Streaming support for accurate time-to-first-token measurements
  • Automatic token counting with fallback estimation
  • CSV export of metrics for analysis
  • Individual response files for quality comparison
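The metrics themselves reduce to simple arithmetic over three timestamps. A hypothetical JavaScript sketch (the actual script is Elixir; whether throughput is measured over total time or over post-first-token generation time is a design choice, and this sketch uses total time):

```javascript
// Compute benchmark metrics from raw timing data.
// All timestamps are in milliseconds (e.g. from Date.now()).
function computeMetrics({ requestStart, firstTokenAt, responseEnd, totalTokens }) {
  const ttft = (firstTokenAt - requestStart) / 1000;      // time to first token (s)
  const totalTime = (responseEnd - requestStart) / 1000;  // total time (s)
  return { ttft, totalTime, tokensPerSec: totalTokens / totalTime };
}

// Example with made-up timings, not a row from the results table.
const m = computeMetrics({
  requestStart: 0,
  firstTokenAt: 500,
  responseEnd: 10_500,
  totalTokens: 1000,
});
```

Streaming matters here: without it, `firstTokenAt` is unobservable and TTFT cannot be separated from total time.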

View full script source

Results

Performance Comparison

| Model | Size (GB) | Tokens/sec | Efficiency (tok/sec/GB) | Total Tokens | Total Time (s) | Time to First Token (s) |
| --- | --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-20b | 11 | 95.39 | 8.67 | 1,009 | 10.04 | 0.69 |
| deepseek-coder-v2-lite-instruct | 17 | 90.89 | 5.35 | 1,089 | 11.42 | 0.14 |
| qwen/qwen3-next-80b | 79 | 84.09 | 1.06 | 1,451 | 16.65 | 0.24 |
| qwen/qwen3-coder-30b | 30 | 80.89 | 2.70 | 2,804 | 34.03 | 0.26 |
| qwen/qwen3-30b-a3b-2507 | 30 | 71.00 | 2.37 | 1,391 | 18.87 | 0.24 |
| openai/gpt-oss-120b | 63 | 64.92 | 1.03 | 1,327 | 19.66 | 1.01 |
| openai-gpt-oss-120b-mlx-6 | 88 | 60.56 | 0.69 | 1,320 | 20.96 | 3.03 |
| qwen/qwen3-vl-30b | 34 | 57.90 | 1.70 | 885 | 14.40 | 1.08 |
| deepseek/deepseek-r1-0528-qwen3-8b | 8 | 55.69 | 6.88 | 3,938 | 69.79 | 0.19 |
| glm-4.5-air | 47 | 52.09 | 1.11 | 5,351 | 101.75 | 0.46 |
| minimax-m2 | 100 | 48.00 | 0.48 | 4,005 | 82.38 | 0.55 |
| glm-4.5-air-mlx | 56 | 45.10 | 0.81 | 4,919 | 107.93 | 0.41 |
| qwen/qwen2.5-coder-14b | 16 | 27.50 | 1.72 | 781 | 26.54 | 0.43 |
| qwen/qwen2.5-coder-32b | 19 | 23.78 | 1.25 | 908 | 36.04 | 0.65 |
| mistralai/magistral-small-2509 | 26 | 19.52 | 0.75 | 1,301 | 64.04 | 0.48 |
| qwen/qwq-32b | 35 | 12.70 | 0.36 | 11,099 | 869.78 | 0.52 |
| nousresearch/hermes-4-70b | 40 | 10.83 | 0.27 | 639 | 54.28 | 1.26 |
| kimi-dev-72b-dwq | 38 | 6.88 | 0.18 | 4,868 | 700.39 | 1.41 |
| kimi-dev-72b | 72 | 5.96 | 0.08 | 8,199 | 1,366.31 | 1.93 |

Key Observations

Performance Leaders

  • Absolute fastest: openai/gpt-oss-20b (95.39 tokens/sec, 11GB) - Clean working code
  • Second fastest: deepseek-coder-v2-lite-instruct (90.89 tokens/sec, 17GB) - Canvas wireframe
  • Most efficient (size-adjusted): openai/gpt-oss-20b (8.67 tokens/sec/GB)
  • Quickest to respond: deepseek-coder-v2-lite-instruct (0.14s to first token)
  • Fastest large model: qwen/qwen3-next-80b (84.09 tok/sec @ 79GB) - Excellent with working controls

Important: deepseek/deepseek-r1-0528-qwen3-8b (55.69 tok/sec, 8.1GB) has broken mouse controls and should be avoided despite good speed metrics.

Size vs Performance Analysis

  • Small models (< 20GB) - Best efficiency tier:

    • openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency - Clean, working code
    • deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency - Fastest first token (0.14s)
    • qwen/qwen2.5-coder-14b (16G): 27.50 tok/sec, 1.72 efficiency - Simple, clean CSS
    • qwen/qwen2.5-coder-32b (19G): 23.78 tok/sec, 1.25 efficiency - Clean CSS with dark background
    • deepseek/deepseek-r1-0528-qwen3-8b (8.1G): 55.69 tok/sec BUT broken mouse controls
  • Medium models (20-40GB) - Balanced tier:

    • qwen/qwen3-coder-30b (30G): 80.89 tok/sec, 2.70 efficiency - Fast with comprehensive output
    • qwen/qwen3-30b-a3b-2507 (30G): 71.00 tok/sec, 2.37 efficiency - Good speed
    • qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency - Clean labeled faces
    • ⚠️ mistralai/magistral-small-2509 (26G): 19.52 tok/sec, 0.75 efficiency - Low efficiency
    • nousresearch/hermes-4-70b (40G): 10.83 tok/sec, 0.27 efficiency - Slow despite good code quality
    • qwen/qwq-32b (35G): 12.70 tok/sec, 0.36 efficiency - Extremely verbose (11k tokens)
  • Large models (> 40GB) - Mixed results, mostly inefficient:

    • qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency - Best large model with working controls
    • openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency - Working pause/resume controls
    • openai-gpt-oss-120b-mlx-6 (88G): 60.56 tok/sec, 0.69 efficiency - Excellent mouse drag controls
    • ⚠️ glm-4.5-air (47G): 52.09 tok/sec, 1.11 efficiency - Over-engineered dual implementation
    • ⚠️ glm-4.5-air-mlx (56G): 45.10 tok/sec, 0.81 efficiency - Over-engineered 5 cubes
    • minimax-m2 (100G!): 48.00 tok/sec, 0.48 efficiency - Huge model, basic Three.js output
    • kimi-dev-72b (72G): 5.96 tok/sec, 0.08 efficiency - Extremely slow, verbose
    • kimi-dev-72b-dwq (38G): 6.88 tok/sec, 0.18 efficiency - Very slow

Efficiency Winners (tokens/sec/GB)

Ranking by tokens/sec/GB captures value per gigabyte of memory, accounting for both speed and resource requirements:

  1. openai/gpt-oss-20b: 8.67 tok/sec/GB (exceptional efficiency + working code)
  2. deepseek-coder-v2-lite-instruct: 5.35 tok/sec/GB (excellent efficiency + fastest first token)
  3. qwen/qwen3-coder-30b: 2.70 tok/sec/GB (good balance)
  4. qwen/qwen3-30b-a3b-2507: 2.37 tok/sec/GB (good performance)
  5. qwen/qwen2.5-coder-14b: 1.72 tok/sec/GB (moderate efficiency)
  6. qwen/qwen3-vl-30b: 1.70 tok/sec/GB (good for vision model)
  7. qwen/qwen2.5-coder-32b: 1.25 tok/sec/GB (moderate efficiency)

Critical finding: every model over 40GB sits at or below 1.11 tok/sec/GB, with glm-4.5-air (1.11) the best of them. The 100GB minimax-m2 achieves only 0.48 efficiency - worse than models a tenth of its size!
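The efficiency column is simply tokens/sec divided by model size in GB; sorting by it reproduces the ranking above. A quick sketch, with the data abbreviated to three rows from the results table:

```javascript
// Efficiency = throughput per GB of model weights.
const models = [
  { name: "openai/gpt-oss-20b", sizeGb: 11, tokPerSec: 95.39 },
  { name: "minimax-m2", sizeGb: 100, tokPerSec: 48.0 },
  { name: "qwen/qwen3-coder-30b", sizeGb: 30, tokPerSec: 80.89 },
];

const ranked = models
  .map((m) => ({ ...m, efficiency: m.tokPerSec / m.sizeGb }))
  .sort((a, b) => b.efficiency - a.efficiency);

// gpt-oss-20b (~8.67) ranks first; minimax-m2 (0.48) ranks last.
```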

Quality Assessment

Important finding: 18 out of 19 models (95%) generated working implementations. However, quality varies significantly based on code cleanliness, working features, and verbosity.

Quality Ratings

| Model | Rating | Implementation | Notes |
| --- | --- | --- | --- |
| openai-gpt-oss-120b-mlx-6 | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Working mouse drag, labeled faces, external CSS reset, clean code |
| qwen/qwen3-next-80b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/reset buttons, labeled faces, info text, clean |
| nousresearch/hermes-4-70b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working speed/rotation controls, pause/reset, gradient background |
| openai/gpt-oss-120b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/resume buttons, clean code, Google Fonts |
| qwen/qwen3-vl-30b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Clean code, labeled faces, nice colors, dark background |
| openai/gpt-oss-20b | ⭐⭐⭐⭐ Good | CSS 3D + JS | Working mouse drag, simple and clean, Google Fonts |
| deepseek-coder-v2-lite-instruct | ⭐⭐⭐⭐ Good | Canvas 2D | Clean wireframe with manual 3D math projection |
| qwen/qwen2.5-coder-14b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, simple white faces, works well |
| qwen/qwen2.5-coder-32b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, white faces, dark background |
| qwen/qwq-32b | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, ambient + directional lighting, clean |
| minimax-m2 | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, basic material, resize handler |
| mistralai/magistral-small-2509 | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS animation, no interactivity, clean and simple |
| kimi-dev-72b | ⭐⭐⭐⭐ Good | CSS 3D | Basic auto-rotation, clean code |
| kimi-dev-72b-dwq | ⭐⭐⭐⭐ Good | CSS 3D | CSS-only Y-axis rotation, clean |
| qwen/qwen3-30b-a3b-2507 | ⭐⭐⭐ Average | CSS 3D + JS | Working buttons but uses JS animation instead of CSS, verbose |
| glm-4.5-air-mlx | ⭐⭐⭐ Average | CSS 3D + JS | 5 cubes with mouse tracking/controls, over-engineered, verbose |
| glm-4.5-air | ⭐⭐⭐ Average | CSS 3D + Three.js | Dual implementation with tabs, over-engineered, excessive features |
| qwen/qwen3-coder-30b | ⭐⭐⭐ Average | CSS 3D | Works but added non-functional controls, misleading UI |
| deepseek/deepseek-r1-0528-qwen3-8b | ⭐⭐ Poor | CSS 3D + JS | BROKEN: mouse tracking doesn’t work, double-click zoom broken |

Quality vs Performance Analysis

Critical insight: model size is a weak predictor of overall value - speed, efficiency, and quality often diverge sharply:

Top Quality + Top Performance (Best Overall)

  • openai/gpt-oss-20b (11G): 95.39 tok/sec, ⭐⭐⭐⭐ - Champion for speed + quality
  • deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, ⭐⭐⭐⭐ - Runner-up, fastest first token
  • qwen/qwen3-next-80b (79G): 84.09 tok/sec, ⭐⭐⭐⭐⭐ - Best large model with excellent quality

High Quality Despite Poor Performance

  • ⚠️ nousresearch/hermes-4-70b (40G): Only 10.83 tok/sec BUT ⭐⭐⭐⭐⭐ excellent code quality
  • ⚠️ qwen/qwen3-vl-30b (34G): 57.90 tok/sec, ⭐⭐⭐⭐⭐ excellent clean code

Poor Performance Destroys Value

  • deepseek/deepseek-r1-0528-qwen3-8b (8.1G): Good speed (55.69 tok/sec) BUT broken code (⭐⭐)
  • minimax-m2 (100G!): Moderate speed (48.00 tok/sec) but basic Three.js output (⭐⭐⭐⭐)
  • glm-4.5-air (47G): 52.09 tok/sec but over-engineered (⭐⭐⭐)
  • glm-4.5-air-mlx (56G): 45.10 tok/sec but over-engineered (⭐⭐⭐)

Size Analysis by Quality Tier

Excellent (⭐⭐⭐⭐⭐) - Only 5 models:

  • openai-gpt-oss-120b-mlx-6 (88G) - Large, 60.56 tok/sec
  • qwen/qwen3-next-80b (79G) - Large, 84.09 tok/sec ✨
  • openai/gpt-oss-120b (63G) - Large, 64.92 tok/sec
  • nousresearch/hermes-4-70b (40G) - Medium, 10.83 tok/sec
  • qwen/qwen3-vl-30b (34G) - Medium, 57.90 tok/sec

Good (⭐⭐⭐⭐) - 9 models:

  • Includes best performers: openai/gpt-oss-20b (95.39 tok/sec), deepseek-coder-v2-lite-instruct (90.89 tok/sec)
  • Range: 8-72GB, all working implementations

Average (⭐⭐⭐) - 4 models:

  • All over-engineered or misleading UIs
  • Includes both glm-4.5-air models (over-engineered)

Poor (⭐⭐) - 1 model:

  • deepseek/deepseek-r1-0528-qwen3-8b - Broken code despite good metrics

Key insights:

  1. Speed matters more than elaborate features: openai/gpt-oss-20b (95.39 tok/sec, ⭐⭐⭐⭐) beats minimax-m2 (48.00 tok/sec, ⭐⭐⭐⭐)
  2. Working code > broken fancy features: deepseek-r1’s broken code invalidates its speed advantage
  3. Over-engineering reduces quality: glm-4.5-air models score only ⭐⭐⭐ despite elaborate UIs
  4. Size alone doesn’t guarantee quality: the 5 excellent models span 34GB to 88GB, while several equally large models (kimi-dev-72b at 72GB, minimax-m2 at 100GB) rate lower
  5. Minimax-m2 disappoints: 100GB model produces basic Three.js output at only 48 tok/sec

Model Outputs

Click on the links below to view each model’s response:

Top Performers - Best Speed + Quality

  • openai/gpt-oss-20b (11G) ⭐⭐⭐⭐: Raw Response | HTML Demo
    • 95.39 tok/sec | 8.67 efficiency | Fastest overall | Working mouse drag | 958 tokens
  • deepseek-coder-v2-lite-instruct (17G) ⭐⭐⭐⭐: Raw Response | HTML Demo
    • 90.89 tok/sec | 5.35 efficiency | Fastest first token (0.14s) | Canvas wireframe
  • qwen/qwen3-next-80b (79G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
    • 84.09 tok/sec | 1.06 efficiency | Best large model | Working pause/reset buttons

Excellent Quality (⭐⭐⭐⭐⭐) - Best Code

  • openai-gpt-oss-120b-mlx-6 (88G): Raw Response | HTML Demo
    • 60.56 tok/sec | 0.69 efficiency | Working mouse drag | Labeled faces | External CSS reset
  • openai/gpt-oss-120b (63G): Raw Response | HTML Demo
    • 64.92 tok/sec | 1.03 efficiency | Working pause/resume | Clean code
  • nousresearch/hermes-4-70b (40G): Raw Response | HTML Demo
    • 10.83 tok/sec | 0.27 efficiency | Working speed/rotation controls | Minimal 639 tokens
  • qwen/qwen3-vl-30b (34G): Raw Response | HTML Demo
    • 57.90 tok/sec | 1.70 efficiency | Clean labeled faces | Nice colors

Good Quality (⭐⭐⭐⭐) - Solid Implementations

  • qwen/qwen2.5-coder-14b (16G): Raw Response | HTML Demo
    • 27.50 tok/sec | 1.72 efficiency | Pure CSS | Simple clean white faces
  • qwen/qwen2.5-coder-32b (19G): Raw Response | HTML Demo
    • 23.78 tok/sec | 1.25 efficiency | Pure CSS | Dark background
  • qwen/qwq-32b (35G): Raw Response | HTML Demo
    • 12.70 tok/sec | 0.36 efficiency | Three.js | Extremely verbose 11,099 tokens
  • minimax-m2 (100G): Raw Response | HTML Demo
    • 48.00 tok/sec | 0.48 efficiency | Simple Three.js | Huge 100GB model!
  • mistralai/magistral-small-2509 (26G): Raw Response | HTML Demo
    • 19.52 tok/sec | 0.75 efficiency | Pure CSS | No interactivity
  • kimi-dev-72b (72G): Raw Response | HTML Demo
    • 5.96 tok/sec | 0.08 efficiency | Basic CSS | Extremely verbose 8,148 tokens
  • kimi-dev-72b-dwq (38G): Raw Response | HTML Demo
    • 6.88 tok/sec | 0.18 efficiency | CSS Y-axis only | Clean code

Average Quality (⭐⭐⭐) - Over-engineered or Misleading

  • qwen/qwen3-30b-a3b-2507 (30G): Raw Response | HTML Demo
    • 71.00 tok/sec | 2.37 efficiency | Working buttons but uses JS instead of CSS | Verbose
  • qwen/qwen3-coder-30b (30G): Raw Response | HTML Demo
    • 80.89 tok/sec | 2.70 efficiency | Non-functional controls | Misleading UI
  • glm-4.5-air-mlx (56G): Raw Response | HTML Demo
    • 45.10 tok/sec | 0.81 efficiency | 5 cubes with mouse tracking | Over-engineered
  • glm-4.5-air (47G): Raw Response | HTML Demo
    • 52.09 tok/sec | 1.11 efficiency | Dual implementation with tabs | Over-engineered

Poor Quality (⭐⭐) - Broken Code

  • deepseek/deepseek-r1-0528-qwen3-8b (8.1G): Raw Response | HTML Demo
    • 55.69 tok/sec | 6.88 efficiency | BROKEN: Mouse tracking fails | Double-click zoom fails | AVOID

Technical Details

  • Hardware: Mac Studio (M4, 128GB Unified Memory)
  • Initial Benchmark: October 2025 (9 models)
  • First Update: November 3, 2025 (17 models total, 7 new)
  • Second Update: November 6, 2025 (20 models total, 3 additional)
  • Benchmark Tool: Custom Elixir script with streaming support
  • API: LMStudio-compatible OpenAI API endpoint

Methodology

  1. Model Warmup: Each model receives a simple “hello” prompt before the benchmark to ensure it’s loaded in memory
  2. Streaming: Responses are streamed to accurately measure time-to-first-token
  3. Token Counting: Uses actual token counts from the API when available, falls back to estimation (1 token ≈ 4 characters)
  4. Timeout: 10-minute timeout per model to handle slower responses
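The fallback estimate in step 3 is straightforward; a JavaScript sketch (illustrative only - the actual benchmark script is Elixir):

```javascript
// Prefer the API's own token count; otherwise estimate at
// roughly 1 token per 4 characters of text.
function estimateTokens(text, apiCount = null) {
  if (apiCount !== null) return apiCount;
  return Math.ceil(text.length / 4);
}
```

The 4-characters-per-token heuristic is a rough average for English text, so estimated counts for models where the API omitted usage data should be read as approximate.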

Conclusion

This comprehensive benchmark of 19 models reveals critical insights that challenge conventional assumptions about model size, speed, and quality.

Key Takeaways

  1. Working code matters more than speed: deepseek-r1 (55.69 tok/sec) has broken code (⭐⭐), making it worse than slower models with working implementations
  2. Size ≠ Speed ≠ Quality: The 11GB openai/gpt-oss-20b (95.39 tok/sec, ⭐⭐⭐⭐) outperforms the 100GB minimax-m2 (48.00 tok/sec, ⭐⭐⭐⭐) in both metrics
  3. Efficiency gap is massive: Small models achieve 5-18x better tokens/sec/GB than large models (8.67 vs 0.48)
  4. Only 5 models achieved excellent (⭐⭐⭐⭐⭐) quality: 3 large models (63-88GB), 2 medium models (34-40GB), and ZERO small models
  5. Large models dominate quality: All 5 excellent-rated models are > 34GB, with working interactive controls
  6. Over-engineering hurts: glm-4.5-air models (47-56GB) scored only ⭐⭐⭐ despite elaborate dual implementations
  7. Minimax-m2 is a trap: 100GB model delivers basic Three.js output at 0.48 efficiency - worst value
  8. Diminishing returns after 40GB: All models > 40GB achieve < 1.1 tok/sec/GB efficiency

Recommendations by Use Case

  • 🏆 Best overall (speed + quality)?

    • openai/gpt-oss-20b (11G, 95.39 tok/sec, ⭐⭐⭐⭐) - Fastest with working code!
    • deepseek-coder-v2-lite-instruct (17G, 90.89 tok/sec, ⭐⭐⭐⭐) - Runner-up, fastest first token (0.14s)
  • Want absolute best quality code (⭐⭐⭐⭐⭐)?

    • qwen/qwen3-next-80b (79G, 84.09 tok/sec) - Best balance: excellent quality + fast speed
    • openai/gpt-oss-120b (63G, 64.92 tok/sec) - Working pause/resume controls
    • openai-gpt-oss-120b-mlx-6 (88G, 60.56 tok/sec) - Working mouse drag, best code quality
    • qwen/qwen3-vl-30b (34G, 57.90 tok/sec) - Great quality at moderate size
    • nousresearch/hermes-4-70b (40G, 10.83 tok/sec) - Excellent code but slow
  • Maximum efficiency (best tok/sec/GB)?

    • openai/gpt-oss-20b (11G, 8.67 efficiency) - 18x better than minimax-m2!
    • deepseek-coder-v2-lite-instruct (17G, 5.35 efficiency) - Excellent value
    • qwen/qwen3-coder-30b (30G, 2.70 efficiency) - Best medium-sized option
  • Budget-conscious (< 20GB)?

    • openai/gpt-oss-20b (11G) or deepseek-coder-v2-lite-instruct (17G) - Both excellent
    • qwen/qwen2.5-coder-14b (16G, 27.50 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
    • qwen/qwen2.5-coder-32b (19G, 23.78 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
  • ⚠️ AVOID - Poor value:

    • deepseek/deepseek-r1-0528-qwen3-8b (8.1G) - BROKEN CODE despite good speed
    • minimax-m2 (100G!) - Worst efficiency (0.48), basic Three.js output
    • kimi models (38-72G) - Extremely slow (< 7 tok/sec), terrible efficiency
    • qwen/qwq-32b (35G) - Slow (12.70 tok/sec), extremely verbose (11k tokens)
    • glm-4.5-air models (47-56G) - Over-engineered (⭐⭐⭐ only)
    • qwen/qwen3-coder-30b (30G) - Misleading non-functional controls

The Quality vs Efficiency Tradeoff

This benchmark reveals a fundamental tradeoff between speed/efficiency and code quality:

Speed Champions (⭐⭐⭐⭐):

  • openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency
  • deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency
  • Advantage: 2-18x faster per GB than large models
  • Limitation: Good quality but not excellent (missing interactive controls)

Quality Champions (⭐⭐⭐⭐⭐):

  • qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency
  • openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency
  • qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency
  • Advantage: Working interactive controls, professional-grade code
  • Limitation: 5-8x worse efficiency than small models

The Sweet Spot:

  • qwen/qwen3-next-80b (79G) offers the best balance: excellent quality (⭐⭐⭐⭐⭐) + high speed (84.09 tok/sec)
  • For maximum efficiency with good quality: openai/gpt-oss-20b (11G) remains unbeatable

Critical Failures:

  • minimax-m2 (100G) is the worst model tested: 0.48 efficiency, basic output, 100GB wasted
  • deepseek-r1 (8.1G) has broken code despite good metrics - reliability matters more than speed
  • Over-engineering penalty: glm-4.5-air models score only ⭐⭐⭐ despite elaborate features

For Mac Studio (M4, 128GB):

  • Best choice: openai/gpt-oss-20b (11G) - Maximize speed + efficiency
  • Best quality: qwen/qwen3-next-80b (79G) - Excellent code with strong performance
  • Avoid: minimax-m2, kimi models, deepseek-r1, glm-4.5-air - All poor value

The data is clear: bigger models CAN produce better code, but at 5-18x efficiency cost. For most use cases, the 11-17GB models offer the best pragmatic balance. Only choose large models (> 30GB) when code quality justifies the massive efficiency penalty.


Benchmark conducted with LMStudio on local hardware. Performance will vary based on hardware specifications, model quantization, and system load.