Releases · waikato-llm/llm-dataset-converter
Release v0.2.4
Release v0.2.3
- requiring seppl>=0.2.4 now
Release v0.2.2
- requiring seppl>=0.2.3 now
Release v0.2.1
- filters `split` and `tee` now support `ClassificationData` as well
- added `metadata-from-name` filter to extract meta-data from the current input file name
- added `inspect` filter that allows inspecting data interactively as it passes through the pipeline
- added `empty_str_if_none` helper method to `ldc.text_utils` to ensure no None/null values are output with writers
- upgraded seppl to 0.2.2 and switched to using `seppl.ClassListerRegistry`
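The contract of a helper like `empty_str_if_none` can be sketched in a few lines of Python. This is an illustrative guess at the behaviour described above, not the actual `ldc.text_utils` implementation:

```python
def empty_str_if_none(value):
    """Return "" when value is None, otherwise the value unchanged.

    Wrapping each field with a helper like this before serialization
    ensures a writer never emits a literal "None"/null in its output.
    (Sketch only; the real helper may differ.)
    """
    return "" if value is None else value
```

A writer would then pass every string field through this helper before writing it out.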
Release v0.2.0
- added support for the XTuner conversation JSON format: `from-xtuner` and `to-xtuner`
- added filter `update-pair-data` to allow tweaking or rearranging of the data
- introduced `ldc.api` module to separate out abstract superclasses and avoid circular imports
- readers now set the 'file' meta-data value
- added `file-filter` filter for explicitly allowing/discarding records that stem from certain files (entry in meta-data: 'file')
- added `record-files` filter for recording the files that the records are based on (entry in meta-data: 'file')
- filter `pretrain-sentences-to-pairs` can now omit filling the `instruction` when using 0 as prompt step
- requiring seppl>=0.1.2 now
- added global option `-U, --unescape_unicode` to the `llm-convert` tool to allow conversion of escaped Unicode characters
- the `llm-append` tool now supports appending for JSON, JSON Lines and CSV files apart from plain-text files (default)
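The kind of conversion an option like `-U, --unescape_unicode` performs can be sketched in Python. The function name and the restriction to `\uXXXX` escapes are assumptions for illustration; the tool itself may handle further escape forms:

```python
import re


def unescape_unicode(text):
    """Replace literal \\uXXXX escape sequences with the characters
    they encode, e.g. "caf\\u00e9" -> "café".

    Minimal sketch of the conversion described in the release notes;
    not the tool's actual implementation.
    """
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
```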
Release v0.1.1
- added `classification` domain
- added `from-jsonlines-cl` reader and `to-jsonlines-cl` writer for classification data in JSON Lines format
- added filter `pretrain-sentences-to-classification` to turn pretrain data into classification data (with a predefined label)
- added filter `classification-label-map` that can generate a label string/int map
- the `to-llama2-format` filter now has the `--skip_tokens` option to leave out the [INST] [/INST] tokens
- added `from-parquet-cl` reader and `to-parquet-cl` writer for classification data in Parquet database format
- added `from-csv-cl`/`from-tsv-cl` readers and `to-csv-cl`/`to-tsv-cl` writers for classification data in CSV/TSV file format
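A label string/int map of the kind `classification-label-map` generates can be sketched as follows. Assigning indices in order of first appearance is an assumption for illustration; the actual filter may, for example, sort the labels first:

```python
def classification_label_map(labels):
    """Build a label-string -> int map from an iterable of labels.

    Each distinct label gets the next integer index, in order of
    first appearance. (Sketch of the idea, not the filter's code.)
    """
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping
```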
Release v0.1.0
- fixed output format of `to-llama2-format` filter
- the `llama2-to-pairs` filter now has more robust parsing
- upgraded seppl to 0.1.0
- switched to seppl classes: `Splitter`, `MetaDataHandler`, `Reader`, `Writer`, `StreamWriter`, `BatchWriter`
Release v0.0.5
- added flag `-b/--force_batch` to the `llm-convert` tool, which forces all data to be read from the reader before filtering it and then passing it to the writer; useful for batch filters
- added the `randomize-records` batch filter
- added the `--encoding ENC` option to file readers
- auto-determined encoding is now being logged (`INFO` level)
- the `LDC_ENCODING_MAX_CHECK_LENGTH` environment variable allows overriding the default number of bytes used for determining the file encoding in auto-detect mode
- default max number of bytes inspected for determining the file encoding is now 10kB
- method `locate_files` in `base_io` no longer includes directories when expanding globs
- added tool `llm-file-encoding` for determining the file encodings of text files
- added method `replace_extension` to the `base_io` module for changing a file's extension (removes any supported compression suffix first)
- stream writers (.jsonl/.txt) now work with `--force_batch` mode; the output file name gets automatically generated from the input file name when just using a directory for the output
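The behaviour described for `replace_extension` (strip a compression suffix, then swap the extension) can be sketched in Python. The list of compression suffixes here is an assumption; the set actually supported by `base_io` may differ:

```python
import os

# hypothetical set of compression suffixes for illustration
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def replace_extension(path, ext):
    """Change a file's extension, removing any known compression
    suffix first, e.g. "data/train.jsonl.gz" -> "data/train.txt".

    Sketch of the behaviour in the release notes, not the actual
    base_io implementation.
    """
    for suffix in COMPRESSION_SUFFIXES:
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    root, _ = os.path.splitext(path)
    return root + ext
```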
Release v0.0.4
- `pairs-to-llama2` filter now has an optional `--prefix` parameter to use with the instruction
- added the `pretrain-sentences-to-pairs` filter for generating artificial prompt/response datasets from pretrain data
- requires seppl>=0.0.11 now
- the `LDC_MODULES_EXCL` environment variable is now used for specifying modules to be excluded from the registration process (e.g., used when generating help screens for derived libraries that shouldn't output the base plugins as well)
- `llm-registry` and `llm-help` now allow specifying excluded modules via the `-e/--excluded_modules` option
- the `to-alpaca` writer now has the `-a/--ensure_ascii` flag to enforce ASCII compatibility in the output
- added global option `-u/--update_interval` to the `convert` tool to customize how often progress of # records processed is being output in the console (default: 1000)
- the `text-length` filter now handles None values, i.e., ignores them
- locations (i.e., input/instructions/output/etc.) can now be specified multiple times
- the `llm-help` tool can generate index files for all the plugins now; in case of markdown it will link to the other markdown files
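The idea behind `pretrain-sentences-to-pairs` (turning plain pretrain sentences into artificial prompt/response pairs) can be sketched as below. The pairing scheme and the `prompt_step` semantics are guesses for illustration only; the real filter may work differently:

```python
def sentences_to_pairs(sentences, prompt_step=1):
    """Build artificial (instruction, response) pairs from a list of
    sentences by pairing each selected sentence with its successor.

    With prompt_step=0 the instruction is left empty, mirroring the
    behaviour added in v0.2.0. (Sketch of the concept, not the
    filter's actual semantics.)
    """
    if prompt_step == 0:
        # no prompt: every sentence becomes a response on its own
        return [("", s) for s in sentences]
    return [
        (sentences[i], sentences[i + 1])
        for i in range(0, len(sentences) - 1, prompt_step)
    ]
```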
Release v0.0.3
- added the `record-window` filter
- added the `llm-registry` tool for querying the registry from the command-line
- added the `replace_patterns` method to the `ldc.text_utils` module
- added the `replace-patterns` filter
- added `-p/--pretty-print` flag to the `to-alpaca` writer
- added `pairs-to-llama2` and `llama2-to-pairs` filters (since llama2 has the instruction as part of the string, it is treated as pretrain data)
- added `to-llama2-format` filter for pretrain records (no [INST]...[/INST] block)
- now requiring seppl>=0.0.8 in order to raise exceptions when encountering unknown arguments
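A `replace_patterns`-style helper can be sketched as follows. The signature (a list of regexp/replacement pairs applied in order) is an assumption for illustration; the actual `ldc.text_utils` method may accept its patterns differently:

```python
import re


def replace_patterns(text, patterns):
    """Apply a sequence of (regexp, replacement) pairs to the text,
    in order, returning the rewritten string.

    Sketch of a pattern-replacement helper like the one described
    above; not the library's actual implementation.
    """
    for regexp, replacement in patterns:
        text = re.sub(regexp, replacement, text)
    return text
```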