LLM Benchmark: 3D Rotating Cube Challenge

Updated December 2025: Added 9 new models including NVIDIA Nemotron, ByteDance Seed, Allen AI OLMo, and Mistral Devstral. Now testing 28 models total.

Overview

This benchmark tests a range of LLMs running locally through LMStudio by asking each one to implement a single HTML file containing a 3D rotating cube using HTML, CSS, and JavaScript. The benchmark measures performance metrics including tokens per second, total response time, and time to first token.

The Challenge

Please implement an html + js + css only project stored on a single html file that implements a 3d cube rotating. If you want you can add external dependencies as long as they are only included from cdns.

Benchmark Script

The benchmark was implemented in Elixir using HTTPoison to interact with the LMStudio API. The script:

  1. Fetches available models from LMStudio
  2. Warms up each model with a simple prompt
  3. Streams the actual benchmark prompt and measures performance metrics
  4. Saves individual responses and aggregated metrics

Key features:

  • Streaming support for accurate time-to-first-token measurements (see the sketch after this list)
  • Automatic token counting with fallback estimation
  • CSV export of metrics for analysis
  • Individual response files for quality comparison
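To make the measurement concrete, here is a minimal sketch of the streaming call, assuming LMStudio's default OpenAI-compatible endpoint on localhost:1234. The module name, endpoint URL, and the 4-characters-per-token fallback are illustrative assumptions based on the description above, not code from the actual benchmark script.

```elixir
# Minimal sketch (not the benchmark script): stream one prompt and record
# time-to-first-token plus a rough tokens/sec figure.
defmodule CubeBenchSketch do
  @endpoint "http://localhost:1234/v1/chat/completions"

  def run(model, prompt) do
    body =
      Jason.encode!(%{
        model: model,
        stream: true,
        messages: [%{role: "user", content: prompt}]
      })

    started = System.monotonic_time(:millisecond)

    # stream_to: delivers HTTPoison.Async* messages to this process as chunks arrive
    HTTPoison.post!(@endpoint, body, [{"content-type", "application/json"}],
      stream_to: self(),
      recv_timeout: 600_000
    )

    collect(started, nil, "")
  end

  # Accumulate async chunks and remember when the first one arrived.
  defp collect(started, first_chunk_at, acc) do
    receive do
      %HTTPoison.AsyncChunk{chunk: chunk} ->
        first_chunk_at = first_chunk_at || System.monotonic_time(:millisecond)
        collect(started, first_chunk_at, acc <> chunk)

      %HTTPoison.AsyncEnd{} ->
        finished = System.monotonic_time(:millisecond)
        total_s = (finished - started) / 1000
        # Crude estimate over the raw SSE payload; the full script parses the
        # streamed deltas and prefers the API's own token counts when present.
        tokens = div(byte_size(acc), 4)

        %{ttft_s: (first_chunk_at - started) / 1000,
          total_s: total_s,
          tokens_per_sec: tokens / total_s}

      _status_or_headers ->
        collect(started, first_chunk_at, acc)
    end
  end
end
```

A full run would wrap this in a loop over the models returned by /v1/models, send the warmup prompt first, and write the resulting maps out to CSV and per-model response files, as described above.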

View full script source

Results

Performance Comparison

| Model | Size (GB) | Tokens/sec | Efficiency (tok/sec/GB) | Total Tokens | Total Time (s) | Time to First Token (s) |
| --- | --- | --- | --- | --- | --- | --- |
| nvidia/nemotron-3-nano-4b | 18 | 152.33 | 8.46 | 2,275 | 14.60 | 0.30 |
| openai/gpt-oss-20b | 11 | 95.39 | 8.67 | 1,009 | 10.04 | 0.69 |
| nvidia/nemotron-3-nano-8b | 34 | 94.28 | 2.77 | 7,366 | 77.59 | 0.22 |
| deepseek-coder-v2-lite-instruct | 17 | 90.89 | 5.35 | 1,089 | 11.42 | 0.14 |
| qwen/qwen3-next-80b | 79 | 84.09 | 1.06 | 1,451 | 16.65 | 0.24 |
| qwen/qwen3-coder-30b | 30 | 80.89 | 2.70 | 2,804 | 34.03 | 0.26 |
| qwen/qwen3-30b-a3b-2507 | 30 | 71.00 | 2.37 | 1,391 | 18.87 | 0.24 |
| openai/gpt-oss-120b | 63 | 64.92 | 1.03 | 1,327 | 19.66 | 1.01 |
| openai-gpt-oss-120b-mlx-6 | 88 | 60.56 | 0.69 | 1,320 | 20.96 | 3.03 |
| qwen/qwen3-vl-30b | 34 | 57.90 | 1.70 | 885 | 14.40 | 1.08 |
| deepseek/deepseek-r1-0528-qwen3-8b | 8 | 55.69 | 6.96 | 3,938 | 69.79 | 0.19 |
| glm-4.5-air | 47 | 52.09 | 1.11 | 5,351 | 101.75 | 0.46 |
| minimax-m2 | 100 | 48.00 | 0.48 | 4,005 | 82.38 | 0.55 |
| glm-4.5-air-mlx | 56 | 45.10 | 0.81 | 4,919 | 107.93 | 0.41 |
| zai-org/glm-4.6v-flash | 12 | 36.02 | 3.00 | 2,324 | 63.10 | 2.16 |
| mistralai/devstral-small-2-2512 | 14 | 34.39 | 2.46 | 962 | 26.49 | 0.45 |
| qwen/qwen2.5-coder-14b | 16 | 27.50 | 1.72 | 781 | 26.54 | 0.43 |
| qwen/qwen2.5-coder-32b-4b | 19 | 23.78 | 1.25 | 908 | 36.04 | 0.65 |
| allenai/olmo-3-32b-think-4b | 18 | 23.46 | 1.30 | 7,948 | 336.58 | 0.66 |
| mistralai/magistral-small-2509 | 26 | 19.52 | 0.75 | 1,301 | 64.04 | 0.48 |
| bytedance/seed-oss-36b-4b | 20 | 18.98 | 0.95 | 5,748 | 300.17 | 0.54 |
| allenai/olmo-3-32b-think | 34 | 14.62 | 0.43 | 11,713 | 797.84 | 0.89 |
| qwen/qwq-32b | 35 | 12.70 | 0.36 | 11,099 | 869.78 | 0.52 |
| qwen/qwen2.5-coder-32b | 35 | 11.74 | 0.34 | 841 | 67.28 | 0.89 |
| nousresearch/hermes-4-70b | 40 | 10.83 | 0.27 | 639 | 54.28 | 1.26 |
| bytedance/seed-oss-36b | 38 | 10.72 | 0.28 | 5,386 | 497.60 | 0.61 |
| kimi-dev-72b-dwq | 38 | 6.88 | 0.18 | 4,868 | 700.39 | 1.41 |
| kimi-dev-72b | 72 | 5.96 | 0.08 | 8,199 | 1,366.31 | 1.93 |

Key Observations

Performance Leaders

  • Absolute fastest: nvidia/nemotron-3-nano-4b (152.33 tokens/sec, 18GB) - New speed champion!
  • Second fastest: openai/gpt-oss-20b (95.39 tokens/sec, 11GB) - Clean working code
  • Third fastest: nvidia/nemotron-3-nano-8b (94.28 tokens/sec, 34GB) - Very verbose output
  • Most efficient (size-adjusted): openai/gpt-oss-20b (8.67 tokens/sec/GB)
  • Quickest to respond: deepseek-coder-v2-lite-instruct (0.14s to first token)
  • Fastest large model: qwen/qwen3-next-80b (84.09 tok/sec @ 79GB) - Excellent with working controls

Important: nvidia/nemotron-3-nano-4b (152.33 tok/sec) and nvidia/nemotron-3-nano-8b (94.28 tok/sec) are extremely fast but produce average/poor quality output. deepseek/deepseek-r1-0528-qwen3-8b (55.69 tok/sec, 8GB) has broken mouse controls and should be avoided.

Size vs Performance Analysis

  • Small models (< 20GB) - Best efficiency tier:

    • nvidia/nemotron-3-nano-4b (18G): 152.33 tok/sec, 8.46 efficiency - Fastest overall, but average quality
    • openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency - Best efficiency, clean working code
    • deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency - Fastest first token (0.14s)
    • zai-org/glm-4.6v-flash (12G): 36.02 tok/sec, 3.00 efficiency - Good working code
    • mistralai/devstral-small-2-2512 (14G): 34.39 tok/sec, 2.46 efficiency - Clean working code
    • qwen/qwen2.5-coder-14b (16G): 27.50 tok/sec, 1.72 efficiency - Simple, clean CSS
    • qwen/qwen2.5-coder-32b-4b (19G): 23.78 tok/sec, 1.25 efficiency - Clean CSS with dark background
    • ⚠️ allenai/olmo-3-32b-think-4b (18G): 23.46 tok/sec, 1.30 efficiency - Extremely verbose, poor quality
    • deepseek/deepseek-r1-0528-qwen3-8b (8G): 55.69 tok/sec BUT broken mouse controls
  • Medium models (20-40GB) - Balanced tier:

    • nvidia/nemotron-3-nano-8b (34G): 94.28 tok/sec, 2.77 efficiency - Fast but poor quality, very verbose
    • qwen/qwen3-coder-30b (30G): 80.89 tok/sec, 2.70 efficiency - Fast with comprehensive output
    • qwen/qwen3-30b-a3b-2507 (30G): 71.00 tok/sec, 2.37 efficiency - Good speed
    • qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency - Clean labeled faces
    • bytedance/seed-oss-36b-4b (20G): 18.98 tok/sec, 0.95 efficiency - Working code
    • ⚠️ mistralai/magistral-small-2509 (26G): 19.52 tok/sec, 0.75 efficiency - Low efficiency
    • ⚠️ allenai/olmo-3-32b-think (34G): 14.62 tok/sec, 0.43 efficiency - Extremely verbose (11k tokens), poor quality
    • qwen/qwen2.5-coder-32b (35G): 11.74 tok/sec, 0.34 efficiency - Slow full precision model
    • nousresearch/hermes-4-70b (40G): 10.83 tok/sec, 0.27 efficiency - Slow despite excellent code quality
    • bytedance/seed-oss-36b (38G): 10.72 tok/sec, 0.28 efficiency - Slow, very verbose
    • qwen/qwq-32b (35G): 12.70 tok/sec, 0.36 efficiency - Extremely verbose (11k tokens)
  • Large models (> 40GB) - Mixed results, mostly inefficient:

    • qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency - Best large model with working controls
    • openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency - Working pause/resume controls
    • openai-gpt-oss-120b-mlx-6 (88G): 60.56 tok/sec, 0.69 efficiency - Excellent mouse drag controls
    • ⚠️ glm-4.5-air (47G): 52.09 tok/sec, 1.11 efficiency - Over-engineered dual implementation
    • ⚠️ glm-4.5-air-mlx (56G): 45.10 tok/sec, 0.81 efficiency - Over-engineered 5 cubes
    • minimax-m2 (100G!): 48.00 tok/sec, 0.48 efficiency - Huge model, basic Three.js output
    • kimi-dev-72b (72G): 5.96 tok/sec, 0.08 efficiency - Extremely slow, verbose
    • kimi-dev-72b-dwq (38G; grouped here with its 72B sibling): 6.88 tok/sec, 0.18 efficiency - Very slow

Efficiency Winners (tokens/sec/GB)

Efficiency normalizes throughput by model size on disk (tokens/sec ÷ GB), so it reflects value per gigabyte of memory as well as raw speed:

  1. openai/gpt-oss-20b: 8.67 tok/sec/GB (exceptional efficiency + working code)
  2. nvidia/nemotron-3-nano-4b: 8.46 tok/sec/GB (fastest overall, but average quality)
  3. deepseek/deepseek-r1-0528-qwen3-8b: 6.96 tok/sec/GB (high efficiency BUT broken code!)
  4. deepseek-coder-v2-lite-instruct: 5.35 tok/sec/GB (excellent efficiency + fastest first token)
  5. zai-org/glm-4.6v-flash: 3.00 tok/sec/GB (good efficiency + working code)
  6. nvidia/nemotron-3-nano-8b: 2.77 tok/sec/GB (fast but poor quality)
  7. qwen/qwen3-coder-30b: 2.70 tok/sec/GB (good balance)
  8. mistralai/devstral-small-2-2512: 2.46 tok/sec/GB (good efficiency + clean code)
  9. qwen/qwen3-30b-a3b-2507: 2.37 tok/sec/GB (good performance)
  10. qwen/qwen2.5-coder-14b: 1.72 tok/sec/GB (moderate efficiency)

Critical finding: Models > 40GB all fall below 1.1 tok/sec/GB except glm-4.5-air (1.11). The 100GB minimax-m2 achieves only 0.48 efficiency - worse than models 1/10th its size! New NVIDIA Nemotron models are blazing fast but sacrifice code quality.

Quality Assessment

Important finding: 24 out of 28 models (86%) generated working implementations. However, quality varies significantly based on code cleanliness, working features, and verbosity.

Quality Ratings

| Model | Rating | Implementation | Notes |
| --- | --- | --- | --- |
| openai-gpt-oss-120b-mlx-6 | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Working mouse drag, labeled faces, external CSS reset, clean code |
| qwen/qwen3-next-80b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/reset buttons, labeled faces, info text, clean |
| nousresearch/hermes-4-70b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working speed/rotation controls, pause/reset, gradient background |
| openai/gpt-oss-120b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D + JS | Working pause/resume buttons, clean code, Google Fonts |
| qwen/qwen3-vl-30b | ⭐⭐⭐⭐⭐ Excellent | CSS 3D | Clean code, labeled faces, nice colors, dark background |
| openai/gpt-oss-20b | ⭐⭐⭐⭐ Good | CSS 3D + JS | Working mouse drag, simple and clean, Google Fonts |
| deepseek-coder-v2-lite-instruct | ⭐⭐⭐⭐ Good | Canvas 2D | Clean wireframe with manual 3D math projection |
| qwen/qwen2.5-coder-14b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, simple white faces, works well |
| qwen/qwen2.5-coder-32b-4b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, no controls, white faces, dark background |
| qwen/qwen2.5-coder-32b | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS, full precision model, clean code |
| qwen/qwq-32b | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, ambient + directional lighting, clean |
| minimax-m2 | ⭐⭐⭐⭐ Good | Three.js | Simple Three.js, basic material, resize handler |
| mistralai/magistral-small-2509 | ⭐⭐⭐⭐ Good | CSS 3D | Pure CSS animation, no interactivity, clean and simple |
| mistralai/devstral-small-2-2512 | ⭐⭐⭐⭐ Good | CSS 3D | Clean CSS implementation, working animation |
| zai-org/glm-4.6v-flash | ⭐⭐⭐⭐ Good | CSS 3D | Working animation, clean code |
| bytedance/seed-oss-36b | ⭐⭐⭐⭐ Good | CSS 3D | Working implementation, verbose output |
| bytedance/seed-oss-36b-4b | ⭐⭐⭐⭐ Good | CSS 3D | Working implementation, quantized version |
| kimi-dev-72b | ⭐⭐⭐⭐ Good | CSS 3D | Basic auto-rotation, clean code |
| kimi-dev-72b-dwq | ⭐⭐⭐⭐ Good | CSS 3D | CSS-only Y-axis rotation, clean |
| qwen/qwen3-30b-a3b-2507 | ⭐⭐⭐ Average | CSS 3D + JS | Working buttons but uses JS animation instead of CSS, verbose |
| nvidia/nemotron-3-nano-4b | ⭐⭐⭐ Average | CSS 3D | Fast but average quality output |
| glm-4.5-air-mlx | ⭐⭐⭐ Average | CSS 3D + JS | 5 cubes with mouse tracking/controls, over-engineered, verbose |
| glm-4.5-air | ⭐⭐⭐ Average | CSS 3D + Three.js | Dual implementation with tabs, over-engineered, excessive features |
| qwen/qwen3-coder-30b | ⭐⭐⭐ Average | CSS 3D | Works but added non-functional controls, misleading UI |
| nvidia/nemotron-3-nano-8b | ⭐⭐ Poor | CSS 3D | Extremely fast but poor quality, very verbose (7k tokens) |
| allenai/olmo-3-32b-think | ⭐⭐ Poor | CSS 3D | Extremely verbose (11k tokens), poor quality output |
| allenai/olmo-3-32b-think-4b | ⭐⭐ Poor | CSS 3D | Very verbose (8k tokens), poor quality output |
| deepseek/deepseek-r1-0528-qwen3-8b | ⭐⭐ Poor | CSS 3D + JS | BROKEN: Mouse tracking doesn’t work, double-click zoom broken |

Quality vs Performance Analysis

Critical insight: raw generation speed and output quality frequently diverge; the fastest models are not the ones producing the best code:

Top Quality + Top Performance (Best Overall)

  • openai/gpt-oss-20b (11G): 95.39 tok/sec, ⭐⭐⭐⭐ - Champion for speed + quality
  • deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, ⭐⭐⭐⭐ - Runner-up, fastest first token
  • qwen/qwen3-next-80b (79G): 84.09 tok/sec, ⭐⭐⭐⭐⭐ - Best large model with excellent quality

Fast But Low Quality (Speed Traps)

  • ⚠️ nvidia/nemotron-3-nano-4b (18G): 152.33 tok/sec, ⭐⭐⭐ - Fastest overall but average quality
  • ⚠️ nvidia/nemotron-3-nano-8b (34G): 94.28 tok/sec, ⭐⭐ - Very fast but poor quality, extremely verbose

High Quality Despite Poor Performance

  • ⚠️ nousresearch/hermes-4-70b (40G): Only 10.83 tok/sec BUT ⭐⭐⭐⭐⭐ excellent code quality
  • ⚠️ qwen/qwen3-vl-30b (34G): 57.90 tok/sec, ⭐⭐⭐⭐⭐ excellent clean code

Poor Performance Destroys Value

  • deepseek/deepseek-r1-0528-qwen3-8b (8G): Good speed (55.69 tok/sec) BUT broken code (⭐⭐)
  • allenai/olmo-3-32b-think (34G): Slow (14.62 tok/sec), extremely verbose (11k tokens), poor quality (⭐⭐)
  • allenai/olmo-3-32b-think-4b (18G): Moderate speed (23.46 tok/sec), very verbose (8k tokens), poor quality (⭐⭐)
  • minimax-m2 (100G!): Moderate speed (48.00 tok/sec) but basic Three.js output (⭐⭐⭐⭐)
  • glm-4.5-air (47G): 52.09 tok/sec but over-engineered (⭐⭐⭐)
  • glm-4.5-air-mlx (56G): 45.10 tok/sec but over-engineered (⭐⭐⭐)

Size Analysis by Quality Tier

Excellent (⭐⭐⭐⭐⭐) - 5 models:

  • openai-gpt-oss-120b-mlx-6 (88G) - Large, 60.56 tok/sec
  • qwen/qwen3-next-80b (79G) - Large, 84.09 tok/sec ✨
  • openai/gpt-oss-120b (63G) - Large, 64.92 tok/sec
  • nousresearch/hermes-4-70b (40G) - Medium, 10.83 tok/sec
  • qwen/qwen3-vl-30b (34G) - Medium, 57.90 tok/sec

Good (⭐⭐⭐⭐) - 14 models:

  • Includes best performers: openai/gpt-oss-20b (95.39 tok/sec), deepseek-coder-v2-lite-instruct (90.89 tok/sec)
  • New additions: mistralai/devstral-small-2-2512, zai-org/glm-4.6v-flash, bytedance/seed-oss models
  • Range: 11-72GB, all working implementations

Average (⭐⭐⭐) - 5 models:

  • nvidia/nemotron-3-nano-4b - Fast but average quality
  • All over-engineered or misleading UIs
  • Includes both glm-4.5-air models (over-engineered)

Poor (⭐⭐) - 4 models:

  • nvidia/nemotron-3-nano-8b - Fast but poor quality, very verbose
  • allenai/olmo-3-32b-think - Extremely verbose, poor quality
  • allenai/olmo-3-32b-think-4b - Very verbose, poor quality
  • deepseek/deepseek-r1-0528-qwen3-8b - Broken code despite good metrics

Key insights:

  1. Speed ≠ Quality: nvidia/nemotron-3-nano-4b (152.33 tok/sec, ⭐⭐⭐) is fastest but average quality
  2. Working code > broken fancy features: deepseek-r1’s broken code invalidates its speed advantage
  3. “Think” models produce verbose, poor output: allenai/olmo-3-32b-think generates 11k tokens of poor quality
  4. Over-engineering reduces quality: glm-4.5-air models score only ⭐⭐⭐ despite elaborate UIs
  5. Size doesn’t predict quality: 5 excellent models range from 34GB to 88GB, with huge gaps in between
  6. Minimax-m2 disappoints: 100GB model produces basic Three.js output at only 48 tok/sec

Model Outputs

Click on the links below to view each model’s response:

Top Performers - Best Speed + Quality

  • nvidia/nemotron-3-nano-4b (18G) ⭐⭐⭐: Raw Response | HTML Demo
    • 152.33 tok/sec | 8.46 efficiency | Fastest overall but average quality | 2,275 tokens
  • openai/gpt-oss-20b (11G) ⭐⭐⭐⭐: Raw Response | HTML Demo
    • 95.39 tok/sec | 8.67 efficiency | Best efficiency + quality combo | Working mouse drag | 958 tokens
  • nvidia/nemotron-3-nano-8b (34G) ⭐⭐: Raw Response | HTML Demo
    • 94.28 tok/sec | 2.77 efficiency | Very fast but poor quality | 7,366 tokens
  • deepseek-coder-v2-lite-instruct (17G) ⭐⭐⭐⭐: Raw Response | HTML Demo
    • 90.89 tok/sec | 5.35 efficiency | Fastest first token (0.14s) | Canvas wireframe
  • qwen/qwen3-next-80b (79G) ⭐⭐⭐⭐⭐: Raw Response | HTML Demo
    • 84.09 tok/sec | 1.06 efficiency | Best large model | Working pause/reset buttons

Excellent Quality (⭐⭐⭐⭐⭐) - Best Code

  • openai-gpt-oss-120b-mlx-6 (88G): Raw Response | HTML Demo
    • 60.56 tok/sec | 0.69 efficiency | Working mouse drag | Labeled faces | External CSS reset
  • openai/gpt-oss-120b (63G): Raw Response | HTML Demo
    • 64.92 tok/sec | 1.03 efficiency | Working pause/resume | Clean code
  • nousresearch/hermes-4-70b (40G): Raw Response | HTML Demo
    • 10.83 tok/sec | 0.27 efficiency | Working speed/rotation controls | Minimal 639 tokens
  • qwen/qwen3-vl-30b (34G): Raw Response | HTML Demo
    • 57.90 tok/sec | 1.70 efficiency | Clean labeled faces | Nice colors

Good Quality (⭐⭐⭐⭐) - Solid Implementations

  • zai-org/glm-4.6v-flash (12G): Raw Response | HTML Demo
    • 36.02 tok/sec | 3.00 efficiency | Working CSS animation | Clean code
  • mistralai/devstral-small-2-2512 (14G): Raw Response | HTML Demo
    • 34.39 tok/sec | 2.46 efficiency | Clean CSS | Working animation
  • qwen/qwen2.5-coder-14b (16G): Raw Response | HTML Demo
    • 27.50 tok/sec | 1.72 efficiency | Pure CSS | Simple clean white faces
  • qwen/qwen2.5-coder-32b-4b (19G): Raw Response | HTML Demo
    • 23.78 tok/sec | 1.25 efficiency | Pure CSS | Dark background
  • qwen/qwen2.5-coder-32b (35G): Raw Response | HTML Demo
    • 11.74 tok/sec | 0.34 efficiency | Pure CSS | Full precision model
  • bytedance/seed-oss-36b-4b (20G): Raw Response | HTML Demo
    • 18.98 tok/sec | 0.95 efficiency | Working CSS | Quantized version
  • bytedance/seed-oss-36b (38G): Raw Response | HTML Demo
    • 10.72 tok/sec | 0.28 efficiency | Working CSS | Verbose output
  • qwen/qwq-32b (35G): Raw Response | HTML Demo
    • 12.70 tok/sec | 0.36 efficiency | Three.js | Extremely verbose 11,099 tokens
  • minimax-m2 (100G): Raw Response | HTML Demo
    • 48.00 tok/sec | 0.48 efficiency | Simple Three.js | Huge 100GB model!
  • mistralai/magistral-small-2509 (26G): Raw Response | HTML Demo
    • 19.52 tok/sec | 0.75 efficiency | Pure CSS | No interactivity
  • kimi-dev-72b (72G): Raw Response | HTML Demo
    • 5.96 tok/sec | 0.08 efficiency | Basic CSS | Extremely verbose 8,148 tokens
  • kimi-dev-72b-dwq (38G): Raw Response | HTML Demo
    • 6.88 tok/sec | 0.18 efficiency | CSS Y-axis only | Clean code

Average Quality (⭐⭐⭐) - Over-engineered or Misleading

  • qwen/qwen3-30b-a3b-2507 (30G): Raw Response | HTML Demo
    • 71.00 tok/sec | 2.37 efficiency | Working buttons but uses JS instead of CSS | Verbose
  • qwen/qwen3-coder-30b (30G): Raw Response | HTML Demo
    • 80.89 tok/sec | 2.70 efficiency | Non-functional controls | Misleading UI
  • glm-4.5-air-mlx (56G): Raw Response | HTML Demo
    • 45.10 tok/sec | 0.81 efficiency | 5 cubes with mouse tracking | Over-engineered
  • glm-4.5-air (47G): Raw Response | HTML Demo
    • 52.09 tok/sec | 1.11 efficiency | Dual implementation with tabs | Over-engineered

Poor Quality (⭐⭐) - Broken or Low Quality Code

  • allenai/olmo-3-32b-think (34G): Raw Response | HTML Demo
    • 14.62 tok/sec | 0.43 efficiency | Extremely verbose (11k tokens) | Poor quality output
  • allenai/olmo-3-32b-think-4b (18G): Raw Response | HTML Demo
    • 23.46 tok/sec | 1.30 efficiency | Very verbose (8k tokens) | Poor quality output
  • deepseek/deepseek-r1-0528-qwen3-8b (8G): Raw Response | HTML Demo
    • 55.69 tok/sec | 6.96 efficiency | BROKEN: Mouse tracking fails | Double-click zoom fails | AVOID

Technical Details

  • Hardware: Mac Studio (M4, 128GB Unified Memory)
  • Initial Benchmark: October 2025 (9 models)
  • First Update: November 3, 2025 (17 models total, 7 new)
  • Second Update: November 6, 2025 (20 models total, 3 additional)
  • Third Update: December 20, 2025 (28 models total, 9 new including NVIDIA Nemotron, ByteDance Seed, Allen AI OLMo, Mistral Devstral)
  • Benchmark Tool: Custom Elixir script with streaming support
  • API: LMStudio's OpenAI-compatible API endpoint

Methodology

  1. Model Warmup: Each model receives a simple “hello” prompt before the benchmark to ensure it’s loaded in memory
  2. Streaming: Responses are streamed to accurately measure time-to-first-token
  3. Token Counting: Uses actual token counts from the API when available, falls back to estimation (1 token ≈ 4 characters); see the sketch below
  4. Timeout: 10-minute timeout per model to handle slower responses
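As a rough illustration of the token-counting fallback and the efficiency figure used throughout the results, here is a small sketch; the module and function names are illustrative assumptions, not taken from the benchmark script.

```elixir
# Sketch of the derived metrics described above (illustrative names).
defmodule BenchMetricsSketch do
  # Prefer the completion token count reported by the API when present,
  # otherwise fall back to the ~4 characters per token estimate.
  def token_count(%{"usage" => %{"completion_tokens" => n}}, _text), do: n
  def token_count(_response, text), do: div(String.length(text), 4)

  # Efficiency as used in the results table: throughput per GB of model size.
  def efficiency(tokens_per_sec, size_gb), do: Float.round(tokens_per_sec / size_gb, 2)
end

# Example: BenchMetricsSketch.efficiency(95.39, 11) #=> 8.67
```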

Conclusion

This comprehensive benchmark of 28 models reveals critical insights that challenge conventional assumptions about model size, speed, and quality.

Key Takeaways

  1. Speed ≠ Quality: nvidia/nemotron-3-nano-4b (152.33 tok/sec) is fastest but only ⭐⭐⭐ average quality
  2. Working code matters more than speed: deepseek-r1 (55.69 tok/sec) and OLMo-think models have poor/broken code despite good metrics
  3. Size ≠ Speed ≠ Quality: The 11GB openai/gpt-oss-20b (95.39 tok/sec, ⭐⭐⭐⭐) outperforms the 100GB minimax-m2 (48.00 tok/sec, ⭐⭐⭐⭐) in both metrics
  4. Efficiency gap is massive: Small models achieve 5-18x better tokens/sec/GB than large models (8.67 vs 0.48)
  5. Only 5 models achieved excellent (⭐⭐⭐⭐⭐) quality: 3 large models (63-88GB), 2 medium models (34-40GB), and ZERO small models
  6. “Think” models disappoint: Allen AI OLMo-think models produce extremely verbose (8-11k tokens) poor quality output
  7. Over-engineering hurts: glm-4.5-air models (47-56GB) scored only ⭐⭐⭐ despite elaborate dual implementations
  8. New efficiency champions: zai-org/glm-4.6v-flash (3.00 eff) and mistralai/devstral (2.46 eff) deliver good quality at great efficiency
  9. Diminishing returns after 40GB: Every model above 40GB sits at roughly 1.1 tok/sec/GB or below, with glm-4.5-air (1.11) the best of them

Recommendations by Use Case

  • 🏆 Best overall (speed + quality)?

    • openai/gpt-oss-20b (11G, 95.39 tok/sec, ⭐⭐⭐⭐) - Best efficiency + quality combo!
    • deepseek-coder-v2-lite-instruct (17G, 90.89 tok/sec, ⭐⭐⭐⭐) - Runner-up, fastest first token (0.14s)
    • zai-org/glm-4.6v-flash (12G, 36.02 tok/sec, ⭐⭐⭐⭐) - New efficient option with good quality
  • Want absolute best quality code (⭐⭐⭐⭐⭐)?

    • qwen/qwen3-next-80b (79G, 84.09 tok/sec) - Best balance: excellent quality + fast speed
    • openai/gpt-oss-120b (63G, 64.92 tok/sec) - Working pause/resume controls
    • openai-gpt-oss-120b-mlx-6 (88G, 60.56 tok/sec) - Working mouse drag, best code quality
    • qwen/qwen3-vl-30b (34G, 57.90 tok/sec) - Great quality at moderate size
    • nousresearch/hermes-4-70b (40G, 10.83 tok/sec) - Excellent code but slow
  • Maximum efficiency (best tok/sec/GB)?

    • openai/gpt-oss-20b (11G, 8.67 efficiency) - 18x better than minimax-m2!
    • deepseek-coder-v2-lite-instruct (17G, 5.35 efficiency) - Excellent value
    • zai-org/glm-4.6v-flash (12G, 3.00 efficiency) - New efficient option
    • mistralai/devstral-small-2-2512 (14G, 2.46 efficiency) - New coder-focused option
    • qwen/qwen3-coder-30b (30G, 2.70 efficiency) - Best medium-sized option
  • Budget-conscious (< 20GB)?

    • openai/gpt-oss-20b (11G) or deepseek-coder-v2-lite-instruct (17G) - Both excellent
    • zai-org/glm-4.6v-flash (12G, 36.02 tok/sec, ⭐⭐⭐⭐) - New compact option
    • mistralai/devstral-small-2-2512 (14G, 34.39 tok/sec, ⭐⭐⭐⭐) - New coder model
    • qwen/qwen2.5-coder-14b (16G, 27.50 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
    • qwen/qwen2.5-coder-32b-4b (19G, 23.78 tok/sec, ⭐⭐⭐⭐) - Good clean CSS
  • ⚠️ AVOID - Poor value:

    • nvidia/nemotron-3-nano models (18-34G) - Fastest but poor/average quality output
    • allenai/olmo-3-32b-think models (18-34G) - Extremely verbose, poor quality
    • deepseek/deepseek-r1-0528-qwen3-8b (8G) - BROKEN CODE despite good speed
    • minimax-m2 (100G!) - Worst efficiency (0.48), basic Three.js output
    • kimi models (38-72G) - Extremely slow (< 7 tok/sec), terrible efficiency
    • bytedance/seed-oss-36b (38G) - Slow (10.72 tok/sec), verbose
    • qwen/qwq-32b (35G) - Slow (12.70 tok/sec), extremely verbose (11k tokens)
    • glm-4.5-air models (47-56G) - Over-engineered (⭐⭐⭐ only)
    • qwen/qwen3-coder-30b (30G) - Misleading non-functional controls

The Quality vs Efficiency Tradeoff

This benchmark reveals a fundamental tradeoff between speed/efficiency and code quality:

Speed Champions (⭐⭐⭐⭐):

  • openai/gpt-oss-20b (11G): 95.39 tok/sec, 8.67 efficiency
  • deepseek-coder-v2-lite-instruct (17G): 90.89 tok/sec, 5.35 efficiency
  • zai-org/glm-4.6v-flash (12G): 36.02 tok/sec, 3.00 efficiency - New!
  • mistralai/devstral-small-2-2512 (14G): 34.39 tok/sec, 2.46 efficiency - New!
  • Advantage: 2-18x faster per GB than large models
  • Limitation: Good quality but not excellent (missing interactive controls)

Speed Traps (Fast but Low Quality):

  • nvidia/nemotron-3-nano-4b (18G): 152.33 tok/sec, 8.46 efficiency - ⭐⭐⭐ only
  • nvidia/nemotron-3-nano-8b (34G): 94.28 tok/sec, 2.77 efficiency - ⭐⭐ poor quality
  • Warning: Fastest models don’t produce best code!

Quality Champions (⭐⭐⭐⭐⭐):

  • qwen/qwen3-next-80b (79G): 84.09 tok/sec, 1.06 efficiency
  • openai/gpt-oss-120b (63G): 64.92 tok/sec, 1.03 efficiency
  • qwen/qwen3-vl-30b (34G): 57.90 tok/sec, 1.70 efficiency
  • Advantage: Working interactive controls, professional-grade code
  • Limitation: 5-8x worse efficiency than small models

The Sweet Spot:

  • qwen/qwen3-next-80b (79G) offers the best balance: excellent quality (⭐⭐⭐⭐⭐) + high speed (84.09 tok/sec)
  • For maximum efficiency with good quality: openai/gpt-oss-20b (11G) remains unbeatable
  • New efficient options: zai-org/glm-4.6v-flash (12G) and mistralai/devstral (14G)

Critical Failures:

  • nvidia/nemotron models - Fastest but poor/average quality - speed isn’t everything
  • allenai/olmo-3-32b-think models - Extremely verbose (8-11k tokens), poor quality output
  • minimax-m2 (100G) is the worst value: 0.48 efficiency, basic output, 100GB wasted
  • deepseek-r1 (8G) has broken code despite good metrics - reliability matters more than speed
  • Over-engineering penalty: glm-4.5-air models score only ⭐⭐⭐ despite elaborate features

For Mac Studio (M4, 128GB):

  • Best choice: openai/gpt-oss-20b (11G) - Maximize speed + efficiency
  • Best quality: qwen/qwen3-next-80b (79G) - Excellent code with strong performance
  • New options: zai-org/glm-4.6v-flash (12G) or mistralai/devstral (14G) for good quality + efficiency
  • Avoid: nvidia/nemotron (fast but low quality), olmo-think (verbose), minimax-m2, kimi, deepseek-r1, glm-4.5-air

The data is clear: bigger models CAN produce better code, but at 5-18x efficiency cost. Speed alone doesn’t guarantee quality - nvidia/nemotron models are fastest yet produce only average-to-poor output. For most use cases, the 11-17GB models offer the best pragmatic balance. Only choose large models (> 30GB) when code quality justifies the massive efficiency penalty.


Benchmark conducted with LMStudio on local hardware. Performance will vary based on hardware specifications, model quantization, and system load.