- Tool Environments for executing tool code in a sandbox.
- Caching to reduce the number of model API calls made.
- The `multiple_choice()` solver now has support for questions with multiple correct answers.
- More fine-grained handling of Claude `BadRequestError` (400) errors (which were formerly all treated as content moderation errors).
- Filter out empty TextBlockParam when playing messages back to Claude.
- Automatically combine Claude user messages that include tool content.
- Revert to "auto" rather than "none" after forced tool call.
- Provide `TaskState.tools` getter/setter (where the setter automatically syncs the system messages to the specified set of tools).
- The `use_tools()` function now uses the `TaskState.tools` setter, so it replaces the current set of tools entirely rather than appending to them (see sketch below).
- Set `state.completed = False` when `max_messages` is reached.
- Allow tools to be declared with no parameters.
- Allow for null `bytes` field in `Logprobs` and `TopLogprobs`.
- Support all Llama series models on Bedrock.
- Added `truthfulqa` benchmark.
- Added `intercode-ctf` example.
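A minimal, hedged sketch of the replace-rather-than-append behaviour of `use_tools()` noted above. The `add` tool and `addition_problem` task are illustrative only, and the import paths and parameter names (e.g. `inspect_ai.tool`, `plan`) are assumptions based on a recent version of the package rather than this specific release.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool


@tool
def add():
    async def execute(x: int, y: int):
        """Add two numbers.

        Args:
            x: First number to add.
            y: Second number to add.
        """
        return x + y

    return execute


@task
def addition_problem():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target="2")],
        # use_tools() assigns via the TaskState.tools setter, so this call
        # replaces any previously registered tools rather than appending to them
        plan=[use_tools(add()), generate()],
        scorer=match(numeric=True),
    )
```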
- Stream samples to the evaluation log as they are completed (subject to the new `--log-buffer` option). Always write completed samples in the case of an error or cancelled task.
- New `"cancelled"` status in eval log for tasks interrupted with SIGINT (e.g. Ctrl-C). Logs are now written for cancellations (previously they were not).
- Default `--max-samples` (maximum concurrent samples) to `--max-connections`, which will result in samples being more frequently completed and written to the log file.
- For `eval_retry()`, copy previously completed samples in the log file being retried so that work is not unnecessarily repeated (see sketch below).
- New `inspect eval-retry` command to retry a log file from a task that ended in error or cancellation.
- New `retryable_eval_logs()` function and `--retryable` option for `inspect list logs` to query for tasks not yet completed within a log directory.
- Add `shuffled` property to datasets to determine if they were shuffled.
- Remove unused `extensions` argument from `list_eval_logs()`.
- Bugfix: Inspect view was not reliably updating when new evaluation logs were written.
- Bugfix: `results` was not defined when no scorer was provided, resulting in an error being thrown. Fixed by setting `results = EvalResults()` when no scorer is provided.
- Bugfix: The viewer was not properly handling samples without scores.
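A hedged sketch of the retry workflow described above, using the Python API rather than the `inspect eval-retry` CLI command. The log file path is a placeholder, and the exact `eval_retry()` signature may differ in this release.

```python
from inspect_ai import eval_retry

# completed samples already present in the log are copied over rather than
# being re-run; only unfinished samples are evaluated again
eval_retry("./logs/2024-06-01T12-00-00_my-task.json")
```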
- Update to non-beta version of Anthropic tool use (remove legacy xml tools implementation).
- BREAKING: The `pattern` scorer has been modified to match against any (or all) regex match groups. This replaces the previous behaviour when there was more than one group, which would only match the second group.
- Improved performance for Inspect View on very large datasets (virtualized sample list).
- `ToolChoice` `any` option to indicate the model should use at least one tool (supported by Anthropic and Mistral, mapped to `auto` for OpenAI).
- Tool calls can now return a simple scalar or `list[ContentText | ContentImage]` (see sketch below).
- Support for updated Anthropic tools beta (tool_choice and image tool results).
- Report `tool_error` back to the model if it provides invalid JSON for tool call arguments (formerly this halted the entire eval with an error).
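A hedged sketch of a tool returning `list[ContentText | ContentImage]` as described above. The `screenshot` tool, the image path, and the import locations are assumptions based on a recent version of the package.

```python
from inspect_ai.model import ContentImage, ContentText
from inspect_ai.tool import tool


@tool
def screenshot():
    async def execute(url: str):
        """Capture a screenshot of a web page.

        Args:
            url: Address of the page to capture.
        """
        # a tool result may now be a simple scalar or a list of content parts
        return [
            ContentText(text=f"Screenshot of {url}:"),
            ContentImage(image="screenshot.png"),
        ]

    return execute
```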
- New `max_samples` option to control how many samples are run in parallel (still defaults to running all samples in parallel); see sketch below.
- Add `boolq.py` benchmark.
- Add `piqa.py` benchmark.
- View: Improved markdown rendering (properly escape reference links).
- Improved typing for the `example_dataset` function.
- Setuptools entry point for loading custom model extensions.
- Break optional `tuple` return out of `ToolResult` type.
- Bugfix: always read original sample message(s) for `TaskState.input_text`.
- Bugfix: remove write counter from log (could have resulted in incomplete/invalid logs propagating to the viewer).
- Bugfix: handle task names that include spaces in log viewer.
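A hedged sketch of the new `max_samples` option. The task file and model name are placeholders, and passing `max_samples` directly to the Python `eval()` API is an assumption based on how other options are exposed.

```python
from inspect_ai import eval

# run at most 10 samples concurrently (the default remains all samples in parallel)
eval("mathematics.py", model="openai/gpt-4", max_samples=10)
```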
- Add `ollama` local model provider.
- Add `multi_scorer()` and `majority_vote()` functions for combining multiple scorers into a single score.
- Add support for multiple model graders in `model_graded_qa()` (see sketch below).
- Raise `TypeError` for solvers and scorers not declared as `async`.
- Fall back to standard parsing if `NaN` or `Inf` is encountered while reading the log file header.
- Remove deprecated support for matching partial model names (e.g. "gpt" or "claude").
- Exclude null config values from listings in log viewer.
- Add support for logprobs to HF provider, and create uniform API for other providers that support logprobs (Together and OpenAI).
- Provide an option to merge assistant messages and use it for Anthropic models (as they don't allow consecutive assistant messages).
- Supporting infrastructure in Inspect CLI for VS Code extension (additional list and info commands).
- Show first log file immediately (don't wait for fetching metadata for other logs).
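A hedged sketch of using multiple model graders with `model_graded_qa()`, per the entry above. The grader model names are placeholders, and passing them as a list to the `model` parameter is an assumption about how the feature is exposed.

```python
from inspect_ai.scorer import model_graded_qa

# grade each sample with two grader models (names are placeholders);
# pass the result as a Task scorer in the usual way
grader = model_graded_qa(
    model=["openai/gpt-4", "anthropic/claude-3-opus-20240229"],
)
```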
- Add `--version` CLI arg and `inspect info version` command for interrogating version and runtime source path.
- Fix: exclude `null` config values in output from `inspect info log-file`.
- Fix issue with logs from S3 buckets in inspect view.
- Add `sort()` method to `Dataset` (defaults to sorting by sample input length).
- Improve tokenization for HF provider (left padding, attention mask, and allow for custom chat template).
- Improve batching for HF provider (generate as soon as queue fills, thread safety for `future.set_result`).
- Various improvements to documentation.
- `write_eval_log()` now ignores unserializable objects in metadata fields.
- `read_eval_log()` now takes a `str` or `FileInfo` (for compatibility w/ list returned from `list_eval_logs()`); see sketch below.
- Registry name lookups are now case sensitive (fixes issue w/ loading tasks w/ mixed case names).
- Resiliency to Python syntax errors that occur when enumerating tasks in a directory.
- Do not throw error if unable to parse or load `.ipynb` file due to lack of dependencies (e.g. `nbformat`).
- Various additions to log viewer display (log file name, dataset/scorer in listing, filter by complex score types).
- Improvements to markdown rendering in log viewer (don't render intraword underscores, escape html tags).
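A hedged sketch of the compatibility noted above between `read_eval_log()` and the list returned from `list_eval_logs()`; the log directory is a placeholder.

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# items returned by list_eval_logs() can now be passed directly to read_eval_log()
logs = list_eval_logs("./logs")
if logs:
    log = read_eval_log(logs[0])
    print(log.status, log.eval.task)
```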
- `inspect view` command for viewing eval log files.
- `Score` now has an optional `answer` field, which denotes the answer text extracted from model output.
- Accuracy metrics now take an optional `ValueToFloat` function for customising how textual values are mapped to float.
- Made `model_graded_qa` more flexible, with a separate `instruction` template and `grade_pattern`, as well as providing `partial_credit` as an option.
- Modify the default templates for `chain_of_thought()` and `self_critique()` to instruct the model to reply with `ANSWER: $ANSWER` at the end on its own line (see sketch below).
- Improved numeric extraction for `match(numeric=True)` (better currency and decimal handling).
- Improve `answer()` patterns so that they detect letter and word answers both within and at the end of model output.
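A hedged sketch tying together the template and extraction entries above: the default `chain_of_thought()` template asks the model to end with `ANSWER: $ANSWER` on its own line, which `match(numeric=True)` then extracts. The task and sample are illustrative, and the `plan` parameter and import paths are assumptions based on a recent version of the package.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import chain_of_thought, generate


@task
def arithmetic():
    return Task(
        dataset=[Sample(input="Compute 12 * 12.", target="144")],
        # chain_of_thought() prompts for a final "ANSWER: ..." line, which
        # match(numeric=True) extracts (with improved currency/decimal handling)
        plan=[chain_of_thought(), generate()],
        scorer=match(numeric=True),
    )
```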
- `Plan` now has an optional `cleanup` function which can be used to free per-sample resources (e.g. Docker containers) even in the case of an evaluation error.
- Add `Dataset.filter` method for filtering samples using a predicate (see sketch below).
- `Dataset` slices (e.g. `dataset[0:100]`) now return a `Dataset` rather than `list[Sample]`.
- Relative path to `INSPECT_LOG_DIR` in `.env` file is now correctly resolved for execution within subdirectories.
- `inspect list tasks` and `list_tasks()` now only parse source files (rather than loading them), ensuring that it is fast even for task files that have non-trivial global initialisation.
- `inspect list logs` and `list_eval_logs()` now enumerate log files recursively by default, and only enumerate json files that match log file naming conventions.
- Provide `header_only` option for `read_eval_log()` and `inspect info log-file` for bypassing the potentially expensive reading of samples.
- Provide `filter` option for `list_eval_logs()` to filter based on log file header info (i.e. anything but samples).
- Added `__main__.py` entry point for invocation via `python3 -m inspect_ai`.
- Removed prompt and callable from model `ToolDef` (renamed to `ToolInfo`).
- Fix issue with accesses of `completion` property on `ModelOutput` with no choices.
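A hedged sketch of the `Dataset.filter` and slicing behaviour described above; the example dataset name is a placeholder drawn from the package's built-in examples, and the predicate assumes sample inputs are plain strings.

```python
from inspect_ai.dataset import example_dataset

dataset = example_dataset("theory_of_mind")

# filter() takes a predicate and returns a Dataset; slices now also return a
# Dataset rather than list[Sample]
short_inputs = dataset.filter(lambda sample: len(sample.input) < 500)
first_ten = dataset[0:10]
```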
- Initial release.