128 commits
e29270e
qwen2.5-max support
gnguralnick Feb 4, 2025
f047f3a
add new geminis and fix olympiad parsing
gnguralnick Feb 6, 2025
f3501c7
use max_tokens for non-openai api models; properly remove <think> for IF
gnguralnick Feb 13, 2025
a485c34
Update README.md
crwhite14 Feb 24, 2025
9d0147c
claude 3.7 sonnet support
gnguralnick Feb 25, 2025
7521e88
convert bash run scripts to unified python script
gnguralnick Feb 25, 2025
19ef418
add more script options
gnguralnick Feb 25, 2025
a3b6943
only insert ``` when parsing csv if necessary
gnguralnick Feb 26, 2025
065f2d3
refactor to extract params to object
gnguralnick Feb 27, 2025
c56700e
gemini-2.0-flash-lite support
gnguralnick Feb 27, 2025
573445f
pass through debug options
gnguralnick Feb 27, 2025
596810c
update readme to explain parallelization better
gnguralnick Feb 27, 2025
baff165
gpt 4.5 support
gnguralnick Feb 27, 2025
b85d9a6
misc
gnguralnick Mar 5, 2025
3708b03
misc improvements
gnguralnick Mar 14, 2025
7a68ce4
default use venv
gnguralnick Mar 17, 2025
b54c562
gemma-3-27b support
gnguralnick Mar 17, 2025
f01d4e5
mistral small and gemma support
gnguralnick Mar 18, 2025
fb45525
Add perplexity sonar models (#169)
arvindsun Mar 21, 2025
a27511a
improve scripts
gnguralnick Mar 21, 2025
7e90e6b
Fix names
arvindsun Mar 22, 2025
a8f61ff
o1 pro support
gnguralnick Mar 31, 2025
7173eee
Llama4
arvindsun Apr 5, 2025
a41783c
April update (#179)
gnguralnick Apr 7, 2025
bb4283b
Model configs (#186)
gnguralnick Apr 8, 2025
ec9c074
fix reaadme link
gnguralnick Apr 9, 2025
7429e7d
grok 3 support
gnguralnick Apr 11, 2025
5a1081b
gpt 4.1 models
gnguralnick Apr 14, 2025
8c34c03
rename gemini-2.5-pro-exp to preview and reimplement google inference
gnguralnick Apr 15, 2025
35203f5
add llama 4 config
gnguralnick Apr 15, 2025
bc46272
o3 and o4-mini support
gnguralnick Apr 16, 2025
515595e
o3 medium, o4-mini medium
gnguralnick Apr 16, 2025
7a34982
Fix date in changelog.md (#190)
evansemet Apr 17, 2025
e2a8e99
add release option instructions to readme
gnguralnick Apr 17, 2025
e647d72
gemini 2.5 flash preview support
gnguralnick Apr 18, 2025
ee60174
qwen 3 support
gnguralnick Apr 29, 2025
84c4667
End of april update (#212)
gnguralnick Apr 29, 2025
0fb2ef5
implement deepinfra with resubmits
gnguralnick May 5, 2025
b1650bb
use livebench release for calculating stats
gnguralnick May 5, 2025
24fb934
use run_livebench when rerunning failed qs
gnguralnick May 5, 2025
57e5a8c
use combined coding category
gnguralnick May 5, 2025
89668ba
phi-4-reasoning-plus support
gnguralnick May 5, 2025
8326acf
new gemini 2.5 preview and phi 4 reasoning plus
gnguralnick May 6, 2025
ebf812f
mistral medium 3
gnguralnick May 8, 2025
c4f0867
support for gemini 2.5 flash preview 05-20
gnguralnick May 22, 2025
e52a2a2
add claude 4 and gemini 2.5 flash non-thinking
gnguralnick May 23, 2025
f775f9f
deepseek r1 0528 config
gnguralnick May 29, 2025
ba9d7d7
Agentic coding update (#230)
gnguralnick May 30, 2025
778da7e
misc stuff
gnguralnick May 30, 2025
621dbf4
delay importing docker_util
gnguralnick Jun 2, 2025
c786340
properly pass parameters
gnguralnick Jun 2, 2025
c21bcee
remove unneeded variable
gnguralnick Jun 2, 2025
eed5058
fix openai responses implementation
gnguralnick Jun 2, 2025
1b23094
display prompt testing results better
gnguralnick Jun 2, 2025
3a30024
add new gemini pros
gnguralnick Jun 5, 2025
063ff64
fix new googles
gnguralnick Jun 5, 2025
0b7b252
add agent configs for new geminis
gnguralnick Jun 5, 2025
0107332
add supports function calling lol
gnguralnick Jun 5, 2025
80e2e12
support for new geminis
gnguralnick Jun 6, 2025
4e32547
add o3-pro
gnguralnick Jun 10, 2025
60411ac
combine all agentic coding tasks together when running
gnguralnick Jun 10, 2025
93e777d
don't stream openai responses
gnguralnick Jun 26, 2025
8af83c4
add gemma 3n and gemini 2.5 flash lite
gnguralnick Jun 26, 2025
677cbd7
grok 4 support
gnguralnick Jul 10, 2025
31faac9
kimi k2 instruct support
gnguralnick Jul 14, 2025
541fb15
update kimi to use native platform
gnguralnick Jul 14, 2025
dcb4cda
add qwen3 235b
gnguralnick Jul 22, 2025
c456304
add qwen3-coder-480b-a35b-instruct
gnguralnick Jul 23, 2025
68948cc
add qwen 3 coder plus
gnguralnick Jul 23, 2025
bf15dd6
add qwen3-235b-a22b-thinking-2507
gnguralnick Jul 28, 2025
ee06be8
glm-4.5 and glm-4.5-air
gnguralnick Jul 29, 2025
0bb20e3
fixes for deepinfra
gnguralnick Jul 31, 2025
9d77033
use correct glm-4.5 endpoint
gnguralnick Jul 31, 2025
cd5ef94
update zai configs
gnguralnick Jul 31, 2025
1e5f342
fix
gnguralnick Jul 31, 2025
dfc75c1
fix
gnguralnick Jul 31, 2025
7b4f85f
update readme
gnguralnick Jul 31, 2025
480cdd8
fix provider name for agentic coding
gnguralnick Aug 1, 2025
21a7a1d
New models
arvindsun Aug 5, 2025
5cd4901
Various fixes for mac + new python
arvindsun Aug 5, 2025
d353d0d
Correct settings for 120b
arvindsun Aug 5, 2025
47bad3e
Add LLM judge for latex math
arvindsun Aug 6, 2025
b802f8b
Max output tokens for gpt-oss
arvindsun Aug 6, 2025
a6811b6
Move to fireworks
arvindsun Aug 7, 2025
7eae295
Fix mac runner
arvindsun Aug 7, 2025
2820c4c
New model
arvindsun Aug 7, 2025
adf1acb
More models
arvindsun Aug 7, 2025
41f64e8
Cleaup bad files
arvindsun Aug 7, 2025
dfa63a0
Fixes to high config
arvindsun Aug 7, 2025
e37dd72
Fix config for responses
arvindsun Aug 8, 2025
d2dfc70
More models
arvindsun Aug 10, 2025
cfc04d9
Tweaks
arvindsun Aug 10, 2025
8097258
MOre variations
arvindsun Aug 10, 2025
0a2be93
Add gpt5 chat as well
arvindsun Aug 10, 2025
81fd7e9
More config fixes
arvindsun Aug 10, 2025
0dba6e7
Minimal as well
arvindsun Aug 12, 2025
59f1adb
Fix chat config
arvindsun Aug 12, 2025
f18e4ea
Fix minimal
arvindsun Aug 13, 2025
0351576
Minimal as well
arvindsun Aug 17, 2025
f56979a
deepseek-v3.1 config
gnguralnick Aug 20, 2025
4d37ac8
add supports function calling
gnguralnick Aug 20, 2025
d0e8e2f
deepseek-v3.1-thinking config
gnguralnick Aug 21, 2025
4b54b1f
add --only-incorrect option to only regenerate judgments for question…
gnguralnick Aug 26, 2025
662bb50
add grok-code-fast-1
gnguralnick Aug 26, 2025
4cc08d4
fix manual api key specification
gnguralnick Aug 28, 2025
6c7a1f7
another manual api key override fix
gnguralnick Aug 28, 2025
4651699
fix part 3
gnguralnick Aug 28, 2025
7341509
add qwen 3 next
gnguralnick Sep 12, 2025
7bf6143
add gemini-2.5-flash-lite
gnguralnick Sep 12, 2025
bcac363
swap default provider for qwen3 next thinking
gnguralnick Sep 16, 2025
6902227
grok-4-fast configs
gnguralnick Sep 22, 2025
7e880fb
misc
gnguralnick Sep 22, 2025
c31591e
use function calling
gnguralnick Sep 22, 2025
af07aab
deepseek-v3.1-terminus configs
gnguralnick Sep 22, 2025
717e0b5
add gpt-5-codex config
gnguralnick Sep 23, 2025
5e9824d
qwen3 max config
gnguralnick Sep 26, 2025
1c4c655
claude sonnet 4.5 configs
gnguralnick Sep 29, 2025
d506015
Switch to minisweagent and update step limit (#286)
gnguralnick Oct 6, 2025
728fdd0
misc fixes
gnguralnick Oct 6, 2025
563f435
add gpt-5-pro and glm-4.6
gnguralnick Oct 6, 2025
1091b6e
update glm-4.6 to use openrouter
gnguralnick Oct 6, 2025
7907c29
use deepinfra for glm-4.6
gnguralnick Oct 6, 2025
5af0e22
add exclude question id param
gnguralnick Oct 8, 2025
1553a62
Add giga dependency
Pupy101 Feb 4, 2025
59cc0aa
Add giga
Pupy101 Feb 4, 2025
e428f70
Fix GigaChat integration after rebase
cursoragent Oct 10, 2025
5087b28
Fix GigaChat model adapter support
cursoragent Oct 10, 2025
3a308b9
Complete GigaChat integration
cursoragent Oct 10, 2025
15 changes: 14 additions & 1 deletion .gitignore
@@ -29,8 +29,21 @@ output
*.pkl
*.csv

*~

# Build
build


data
data

core

.env

livebench/answer_inspections
livebench/question_edit

trajectories

prompts.txt
191 changes: 85 additions & 106 deletions README.md

Large diffs are not rendered by default.

18 changes: 17 additions & 1 deletion changelog.md
@@ -1,13 +1,29 @@
## Changelog

### 2025-05-30
This update introduces a new agentic coding category, where models must operate in a multi-turn, realistic development environment to resolve issues from real GitHub repositories. It contains Python, JavaScript, and TypeScript tasks, and we plan to add more tasks and languages in the future. Inference is performed using the [SWE-Agent](https://swe-agent.com) framework, and evaluation uses the [Multi-SWE-Bench](https://multi-swe-bench.github.io/#/) harness. This task provides the most realistic possible evaluation of LLM coding capabilities and a prediction of which models will be most useful for developers.

Note: SWE-Agent was run with a 50-step limit for all models during this evaluation. In some cases, it's likely that scores would have improved were models given more time to complete the tasks.

### 2025-04-25
- Completely new coding questions focused on evaluating usage of real-world libraries in realistic scenarios. Questions are no longer sourced from LiveCodeBench. The tasks themselves are the same; we still have a full code generation task and a code completion task.
- Refreshed data analysis tasks, specifically for the tablejoin and tablereformat tasks. The cta task has been retired.

### 2025-04-02
- Refreshed coding questions (coding_completion and LCB_generation) with *much* newer questions from LiveCodeBench. The previous questions were likely heavily contaminated for newer models. LiveCodeBench also increases question difficulty over time.
- Refreshed typos and plot_unscrambling questions with newer ArXiv papers and movie plots from Wikipedia, respectively. Issues in the typos question generation script were also fixed, so that all questions can be fairly evaluated.
- Replaced 2023 AMC questions with 2024 AMC questions in math_comp
- Updated web_of_lies with a mix of harder questions of the previous web_of_lies_v2 format and the new format from [BIG-Bench Extra Hard](https://github.com/google-deepmind/bbeh)
- All new questions ask for answers in the `<solution></solution>` format.
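
As a rough illustration of what the `<solution></solution>` answer format implies for scoring, the sketch below pulls the final tagged answer out of a model response. The helper name `extract_solution` and the choice to take the last tag pair are assumptions made for this example; it is not the parser LiveBench itself uses.

```python
import re
from typing import Optional

# Hypothetical helper for illustration only; not LiveBench's actual parser.
# DOTALL lets an answer span multiple lines inside the tags.
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def extract_solution(response: str) -> Optional[str]:
    """Return the text inside the last <solution>...</solution> pair, or None."""
    matches = SOLUTION_RE.findall(response)
    if not matches:
        return None
    # Assumption: if a model emits several tagged answers, the last one is final.
    return matches[-1].strip()


if __name__ == "__main__":
    print(extract_solution("Some reasoning... <solution>42</solution>"))
```

Taking the last match is one possible design choice: it tolerates models that restate a draft answer before settling on a final one.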

### 2024-11-25
This update focused on refreshing questions to check for contamination and increasing the difficulty of tasks for which o1 (and other reasoning models) achieved very high scores.
- Refreshed the instruction following tasks with new articles from The Guardian
- Updated IF question generation to include 3 instructions per task on average (previously 2)
- Refreshed Connections task with new puzzles from NYT
- Updated Connections generation to more frequently ask for more groups
- Regenerated Zebra Puzzles, skewing towards larger board sizes and more complex constraints
- Updated Connections and Zebra Puzzles questions to require answers to be in `<solution><\solution>` tags rather than bolded
- Updated Connections and Zebra Puzzles questions to require answers to be in `<solution></solution>` tags rather than bolded

### 2024-08-31

13 changes: 13 additions & 0 deletions livebench/.env.example
@@ -0,0 +1,13 @@
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=
DEEPSEEK_API_KEY=
COHERE_API_KEY=
TOGETHER_API_KEY=
FIREWORKS_API_KEY=
MISTRAL_API_KEY=
PERPLEXITY_API_KEY=
ALIBABA_API_KEY=
XAI_API_KEY=
STEP_API_KEY=
DEEPINFRA_API_KEY=
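
A minimal sketch of how one might check which of these keys are still unset before starting a run. It assumes the variables have already been exported into the process environment; how LiveBench itself loads this file is not shown in this diff, and the helper `missing_keys` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping

# Key names copied from livebench/.env.example above.
EXPECTED_KEYS = [
    "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY",
    "DEEPSEEK_API_KEY", "COHERE_API_KEY", "TOGETHER_API_KEY",
    "FIREWORKS_API_KEY", "MISTRAL_API_KEY", "PERPLEXITY_API_KEY",
    "ALIBABA_API_KEY", "XAI_API_KEY", "STEP_API_KEY", "DEEPINFRA_API_KEY",
]


def missing_keys(env: Mapping[str, str],
                 expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected key names that are absent or blank in env.

    Hypothetical helper for illustration; not part of LiveBench.
    """
    return [name for name in expected if not env.get(name, "").strip()]


if __name__ == "__main__":
    for name in missing_keys(os.environ):
        print(f"not set: {name}")
```

Only the keys for the providers actually being evaluated are likely to matter, so a report like this is informational rather than a hard failure.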
201 changes: 201 additions & 0 deletions livebench/agentic_code_runner/eval/LICENSE
@@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
9 changes: 9 additions & 0 deletions livebench/agentic_code_runner/eval/README.md
@@ -0,0 +1,9 @@
# LiveBench Agentic Coding Evaluation

This directory contains the evaluation code for the LiveBench agentic coding task, which consists of testing models on their ability to resolve [Multi-SWE-Bench](https://multi-swe-bench.github.io/#/)-style tasks under the [SWE-Agent](https://swe-agent.com) agent framework.

The code here is adapted from the Multi-SWE-Bench [GitHub repository](https://github.com/multi-swe-bench/multi-swe-bench) with minimal modifications to ensure it can be triggered from the broader LiveBench evaluation harness.

The main modification that has been made is to update the Docker image build process to ensure all instance images have Python, pip, and pipx installed and a virtual environment activated, as these are necessary in order for SWE-Rex (the backend of SWE-Agent) to function in the Docker container.

The LICENSE file from the Multi-SWE-Bench GitHub repository is included here. (original source https://github.com/multi-swe-bench/multi-swe-bench/blob/main/LICENSE)
Empty file.
1 change: 1 addition & 0 deletions livebench/agentic_code_runner/eval/harness/__init__.py
@@ -0,0 +1 @@
from livebench.agentic_code_runner.eval.harness.repos import *