Unitxt 1.7.6
What's Changed
The most significant change in this release is the addition of the \N notation (backslash followed by a capital N) to formats. Placing \N in a format marks a spot where you want exactly one new line, absorbing any new lines that immediately precede it.
A more detailed explanation, for those who want to go deeper:
The Capital New Line Notation (\N) manages newline behavior in a string.
It consolidates multiple newline characters (\n) into a single newline under
specific conditions, with handling that depends on whether there is preceding text. The function
that applies the notation distinguishes between two scenarios:
1. If there's text (referred to as a prefix) followed by any number of \n characters and then one or
more \N, the entire sequence is replaced with a single \n. This effectively simplifies multiple
newlines and notation characters into a single newline when there's preceding text.
2. If the string starts with \n characters followed by \N without any text before this sequence, or if
\N is at the very beginning of the string, the sequence is completely removed. This case is
applicable when the notation should not introduce any newlines due to the absence of preceding text.
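The two cases above can be sketched with two regular-expression substitutions. This is a minimal illustration, not necessarily how the library implements it; the function name collapse_capital_newlines is hypothetical, and the sketch assumes \N appears in the string as a literal backslash followed by N.

```python
import re


def collapse_capital_newlines(text: str) -> str:
    """Hypothetical helper illustrating the \\N rules described above."""
    # Case 2: a run of "\n" and/or "\N" at the very start of the string
    # that ends in "\N" has no preceding text, so it is removed entirely.
    text = re.sub(r"^(?:\n|\\N)*\\N", "", text)
    # Case 1: anywhere else, a run of "\n" and/or "\N" that ends in "\N"
    # follows some text, so it collapses into a single newline.
    return re.sub(r"(?:\n|\\N)*\\N", "\n", text)


print(repr(collapse_capital_newlines("prefix\n\n\\Nrest")))  # 'prefix\nrest'
print(repr(collapse_capital_newlines("\n\\Nrest")))          # 'rest'
print(repr(collapse_capital_newlines("\\Nrest")))            # 'rest'
```

Plain \n characters that are not followed by \N are left untouched, so a format can still request fixed new lines where it needs them.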
This allows two things:
First, to define system formats that do not contain unnecessary new lines when the instruction or the system prompt is missing.
Second, to ignore any new lines created by the template, ensuring that the number of new lines is set by the format only.
For example, suppose we defined the system format in the following way:
from unitxt.formats import SystemFormat
format = SystemFormat(model_input_format="{system_prompt}\n{instruction}\n|user|\n{source}\n|assistant|\n{target_prefix}")
We would face two issues:
- If the system prompt or the instruction is empty, we are left with unnecessary new lines (two, if both are empty).
- If the source ends with a new line (mostly due to the template structure), we get an unnecessary empty line before "|assistant|".
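Both issues can be reproduced with plain Python string formatting, without unitxt itself. The sketch below fills the old format string directly; the sample source text is made up for illustration.

```python
# The old format string, using only fixed "\n" separators.
old_format = "{system_prompt}\n{instruction}\n|user|\n{source}\n|assistant|\n{target_prefix}"

rendered_old = old_format.format(
    system_prompt="",          # empty system prompt
    instruction="",            # empty instruction
    source="What is 2+2?\n",   # source ending with a new line (illustrative)
    target_prefix="",
)

# Two stray new lines at the start, and an empty line after the source.
print(repr(rendered_old))  # '\n\n|user|\nWhat is 2+2?\n\n|assistant|\n'
```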
Both problems are solved with \N notation:
from unitxt.formats import SystemFormat
format = SystemFormat(model_input_format="{system_prompt}\\N{instruction}\\N|user|\n{source}\\N|assistant|\n{target_prefix}")
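To see the effect of the new format string, the following sketch fills the placeholders with plain str.format and then resolves \N with a small stand-in helper. The helper collapse_capital_newlines is hypothetical and only mirrors the rules described in this release note; it is also an assumption here that placeholders are filled before \N is resolved.

```python
import re


def collapse_capital_newlines(text: str) -> str:
    """Hypothetical stand-in for the \\N handling described in this note."""
    # A leading run ending in "\N" has no text before it: drop it.
    text = re.sub(r"^(?:\n|\\N)*\\N", "", text)
    # Any other run ending in "\N" becomes exactly one newline.
    return re.sub(r"(?:\n|\\N)*\\N", "\n", text)


new_format = "{system_prompt}\\N{instruction}\\N|user|\n{source}\\N|assistant|\n{target_prefix}"

# Both optional fields empty, and a source that ends with a new line.
filled = new_format.format(
    system_prompt="", instruction="", source="What is 2+2?\n", target_prefix=""
)
print(repr(collapse_capital_newlines(filled)))
# '|user|\nWhat is 2+2?\n|assistant|\n'

# With a system prompt present, it is followed by exactly one new line.
filled = new_format.format(
    system_prompt="Be brief.", instruction="", source="What is 2+2?", target_prefix=""
)
print(repr(collapse_capital_newlines(filled)))
# 'Be brief.\n|user|\nWhat is 2+2?\n|assistant|\n'
```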
Breaking changes
- Fix typo in MultipleChoiceTemplate field choices_seperator -> choices_separator
- Deprecation of the use_query option in all operators. For now it only raises a warning, but it will be removed in the next major release. The new default behavior is equivalent to use_query=True.
All Changes
Bug Fixes:
- Fix error in unitxt versions conflict and improve message by @elronbandel in #730
- Fix wrong handling of list in dict_get by @yoavkatz in #733
- Fix classification datasets with wrong schema by @elronbandel in #735
- Fix codespell by @elronbandel in #742
- Fix UI errors caused by grammar tasks by @elronbandel in #750
- Fix src layout and enforce its rules with pre-commit hooks by @elronbandel in #753
Assets Fixes:
New Features:
- Add notion of \N to formats, to fix format new line clashes by @elronbandel in #751
- Ability to dynamically change InstanceMetric inputs + grammar metrics by @arielge in #736
- Add DeprecatedField for a more informative procedure for deprecating fields of artifacts by @dafnapension in #741
New Assets:
- Add rerank recall metric to unitxt by @jlqibm in #662
- Add many selection and human preference tasks and datasets by @elronbandel in #746
- Adding Detector metric for running any classifier from huggingface as a metric by @mnagired in #745
- Add operators: RegexSplit, TokensSplit, Chunk by @elronbandel in #749
- Add bert score large and base versions by @assaftibm in #748
Enhancements:
- Remove use_dpath parameter from dict_get and dict_set by @dafnapension in #727
- Add mock judge test to cohere for ai by @perlitz in #720
New Contributors
Full Changelog: 1.7.4...1.7.6