DeepSeek R1 Aider Benchmark

DeepSeek recently released its R1 model, a state-of-the-art reasoning LLM that rivals the best reasoning models on the market. The accompanying paper includes a comprehensive comparison across 21 benchmarks in four categories: English, Code, Math, and Chinese.

R1 benchmark results

As a software engineer, I was particularly curious about the Code category and decided to explore the datasets and evaluation criteria. While many benchmarks in this category were either poorly documented or required extensive dataset downloads, aider-polyglot stood out for its clear documentation, ease of use, and ready-made benchmark script.

What is Aider?

Aider is an AI pair-programming tool that runs in the terminal, and aider-polyglot is its coding benchmark. The benchmark is based on programming problems from exercism.io and covers six popular languages: Python, Java, JavaScript, C++, Rust, and Go. The README provides step-by-step instructions for running the benchmark, making it accessible even to those new to AI benchmarking.
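
Getting the benchmark set up locally is, roughly, a matter of cloning Aider itself plus the exercise repository (repository names per the Aider GitHub organization; the benchmark README also covers details such as running inside Docker, which I am omitting here):

$ git clone https://github.com/Aider-AI/aider.git
$ git clone https://github.com/Aider-AI/polyglot-benchmark.git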

Running the benchmark
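
The benchmark talks to the DeepSeek API, so the key needs to be available in the environment. In a bash-like shell (the key value is a placeholder):

$ export DEEPSEEK_API_KEY="<your-api-key>"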

I used the hosted version of DeepSeek, so I set the DEEPSEEK_API_KEY environment variable as shown above before running the benchmark command. Here’s the command I executed for the Python benchmarks:

$ ./benchmark/benchmark.py test-deepseek-r1-run --model r1 --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --verbose --new --languages python

Key CLI Parameters (from the command above):

- --model r1: model alias, which resolves to deepseek/deepseek-reasoner as reported in the output below
- --edit-format whole: ask the model to return whole files rather than diffs
- --threads 10: run up to 10 exercises in parallel
- --exercises-dir polyglot-benchmark: directory containing the exercism exercises
- --new: start a fresh run rather than resuming an earlier one
- --languages python: restrict the run to Python exercises

Output:

- dirname: 2025-01-25-19-03-46--test-deepseek-r1-run
  test_cases: 34
  model: deepseek/deepseek-reasoner
  edit_format: whole
  commit_hash: b276d48
  pass_rate_1: 35.3
  pass_rate_2: 64.7
  pass_num_1: 12
  pass_num_2: 22
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 0
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 225
  command: aider --model deepseek/deepseek-reasoner
  date: 2025-01-25
  versions: 0.72.3.dev
  seconds_per_case: 226.0
  total_cost: 0.9313

costs: $0.0274/test-case, $0.93 total, $6.16 projected
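
The per-case and projected figures appear to be straightforward extrapolations: $0.9313 total / 34 test cases ≈ $0.0274 per case, and $0.0274 × 225 (the full suite, per total_tests) ≈ $6.16.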

Most fields are self-explanatory, but two key metrics stand out: pass_rate_1 and pass_rate_2, which indicate the percentage of problems solved on the first and second attempts, respectively. The R1 model achieved a 64.7% pass rate across the 34 Python exercises (22 of 34 solved within two attempts). For comparison, the official leaderboard reports a 56.9% pass rate across all languages; this is not a like-for-like comparison, but it is useful for illustration. Notably, the official website does not distinguish between pass rates for the first and second attempts.

Aider polyglot benchmark leaderboard

Conclusion

During the benchmark, I encountered a temporary issue where the DeepSeek API returned a 503 error. While Aider employs exponential backoff to retry failed exercises, recovery can be time-consuming.
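
For context, exponential backoff generally looks something like the sketch below. This is only an illustration of the pattern, not Aider's actual retry code, and ./run_case.sh is a hypothetical stand-in for a single exercise:

for attempt in 1 2 3 4 5; do
    ./run_case.sh && break      # hypothetical stand-in for one benchmark case
    sleep $((2 ** attempt))     # wait 2, 4, 8, 16, 32 seconds before retrying
done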

Below are the results from the other language benchmarks, excluding Java.

C++

$ ./benchmark/benchmark.py test-deepseek-r1-run-cpp --model r1 --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --verbose --new --languages cpp
- dirname: 2025-01-25-19-26-20--test-deepseek-r1-run-cpp
  test_cases: 26
  model: deepseek/deepseek-reasoner
  edit_format: whole
  commit_hash: b276d48
  pass_rate_1: 19.2
  pass_rate_2: 69.2
  pass_num_1: 5
  pass_num_2: 18
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 0
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  total_tests: 225
  command: aider --model deepseek/deepseek-reasoner
  date: 2025-01-25
  versions: 0.72.3.dev
  seconds_per_case: 410.2
  total_cost: 0.4168

costs: $0.0160/test-case, $0.42 total, $3.61 projected

Go

$ ./benchmark/benchmark.py test-deepseek-r1-run-go --model r1 --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --verbose --new --languages go
- dirname: 2025-01-26-07-44-16--test-deepseek-r1-run-go
  test_cases: 39
  model: deepseek/deepseek-reasoner
  edit_format: whole
  commit_hash: b276d48
  pass_rate_1: 41.0
  pass_rate_2: 66.7
  pass_num_1: 16
  pass_num_2: 26
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 3
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 225
  command: aider --model deepseek/deepseek-reasoner
  date: 2025-01-26
  versions: 0.72.3.dev
  seconds_per_case: 204.4
  total_cost: 0.8196

costs: $0.0210/test-case, $0.82 total, $4.73 projected

JavaScript

$ ./benchmark/benchmark.py test-deepseek-r1-run-javascript --model r1 --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --verbose --languages javascript --new
- dirname: 2025-01-26-14-52-31--test-deepseek-r1-run-javascript
  test_cases: 49
  model: deepseek/deepseek-reasoner
  edit_format: whole
  commit_hash: b276d48
  pass_rate_1: 22.4
  pass_rate_2: 57.1
  pass_num_1: 11
  pass_num_2: 28
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 2
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 225
  command: aider --model deepseek/deepseek-reasoner
  date: 2025-01-26
  versions: 0.72.3.dev
  seconds_per_case: 236.6
  total_cost: 1.2589

costs: $0.0257/test-case, $1.26 total, $5.78 projected

Rust

$ ./benchmark/benchmark.py test-deepseek-r1-run-rust --model r1 --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --verbose --languages rust --new

- dirname: 2025-01-26-15-18-05--test-deepseek-r1-run-rust
  test_cases: 30
  model: deepseek/deepseek-reasoner
  edit_format: whole
  commit_hash: b276d48
  pass_rate_1: 50.0
  pass_rate_2: 63.3
  pass_num_1: 15
  pass_num_2: 19
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 3
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  total_tests: 225
  command: aider --model deepseek/deepseek-reasoner
  date: 2025-01-26
  versions: 0.72.3.dev
  seconds_per_case: 174.1
  total_cost: 0.7162

costs: $0.0239/test-case, $0.72 total, $5.37 projected