Releases · waikato-llm/llm-dataset-converter
Release v0.2.4
Release v0.2.3
- requiring seppl>=0.2.4 now
Release v0.2.2
- requiring seppl>=0.2.3 now
Release v0.2.1
- filters `split` and `tee` now support `ClassificationData` as well
- added `metadata-from-name` filter to extract meta-data from the current input file name
- added `inspect` filter that allows inspecting data interactively as it passes through the pipeline
- added `empty_str_if_none` helper method to `ldc.text_utils` to ensure no None/null values are output with writers
- upgraded seppl to 0.2.2 and switched to using `seppl.ClassListerRegistry`
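The contract of a helper like `empty_str_if_none` can be sketched in a few lines of Python. This is an illustrative guess at the behaviour described above, not the actual `ldc.text_utils` implementation:

```python
def empty_str_if_none(value):
    """Return "" when value is None, otherwise the value unchanged.

    Wrapping each field with a helper like this before serialization
    ensures a writer never emits a literal "None"/null in its output.
    (Sketch only; the real helper may differ.)
    """
    return "" if value is None else value
```

A writer would then pass every string field through this helper before writing it out.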
Release v0.2.0
- added support for the XTuner conversation JSON format: `from-xtuner` and `to-xtuner`
- added filter `update-pair-data` to allow tweaking or rearranging of the data
- introduced `ldc.api` module to separate out abstract superclasses and avoid circular imports
- readers now set the 'file' meta-data value
- added `file-filter` filter for explicitly allowing/discarding records that stem from certain files (entry in meta-data: 'file')
- added `record-files` filter for recording the files that the records are based on (entry in meta-data: 'file')
- filter `pretrain-sentences-to-pairs` can now omit filling the `instruction` when using 0 as prompt step
- requiring seppl>=0.1.2 now
- added global option `-U, --unescape_unicode` to the `llm-convert` tool to allow conversion of escaped Unicode characters
- the `llm-append` tool now supports appending for JSON, JSON Lines and CSV files apart from plain-text files (default)
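The kind of conversion an option like `-U, --unescape_unicode` performs can be sketched in Python. The function name and the restriction to `\uXXXX` escapes are assumptions for illustration; the tool itself may handle further escape forms:

```python
import re


def unescape_unicode(text):
    """Replace literal \\uXXXX escape sequences with the characters
    they encode, e.g. "caf\\u00e9" -> "café".

    Minimal sketch of the conversion described in the release notes;
    not the tool's actual implementation.
    """
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
```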
Release v0.1.1
- added `classification` domain
- added `from-jsonlines-cl` reader and `to-jsonlines-cl` writer for classification data in JSON Lines format
- added filter `pretrain-sentences-to-classification` to turn pretrain data into classification data (with a predefined label)
- added filter `classification-label-map` that can generate a label string/int map
- the `to-llama2-format` filter now has the `--skip_tokens` option to leave out the [INST] [/INST] tokens
- added `from-parquet-cl` reader and `to-parquet-cl` writer for classification data in Parquet database format
- added `from-csv-cl`/`from-tsv-cl` readers and `to-csv-cl`/`to-tsv-cl` writers for classification data in CSV/TSV file format
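A label string/int map of the kind `classification-label-map` generates can be sketched as follows. Assigning indices in order of first appearance is an assumption for illustration; the actual filter may, for example, sort the labels first:

```python
def classification_label_map(labels):
    """Build a label-string -> int map from an iterable of labels.

    Each distinct label gets the next integer index, in order of
    first appearance. (Sketch of the idea, not the filter's code.)
    """
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping
```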
Release v0.1.0
- fixed output format of `to-llama2-format` filter
- the `llama2-to-pairs` filter now has more robust parsing
- upgraded seppl to 0.1.0
- switched to seppl classes: `Splitter`, `MetaDataHandler`, `Reader`, `Writer`, `StreamWriter`, `BatchWriter`
Release v0.0.5
- added flag `-b/--force_batch` to the `llm-convert` tool, which forces all data to be read from the reader before filtering it and then passing it to the writer; useful for batch filters
- added the `randomize-records` batch filter
- added the `--encoding ENC` option to file readers
- auto-determined encoding is now being logged (`INFO` level)
- the `LDC_ENCODING_MAX_CHECK_LENGTH` environment variable allows overriding the default number of bytes used for determining the file encoding in auto-detect mode
- default max number of bytes inspected for determining the file encoding is now 10kB
- method `locate_files` in `base_io` no longer includes directories when expanding globs
- added tool `llm-file-encoding` for determining the file encodings of text files
- added method `replace_extension` to the `base_io` module for changing a file's extension (removes any supported compression suffix first)
- stream writers (.jsonl/.txt) now work with `--force_batch` mode; the output file name gets automatically generated from the input file name when just using a directory for the output
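The behaviour described for `replace_extension` (strip a compression suffix, then swap the extension) can be sketched in Python. The list of compression suffixes here is an assumption; the set actually supported by `base_io` may differ:

```python
import os

# hypothetical set of compression suffixes for illustration
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def replace_extension(path, ext):
    """Change a file's extension, removing any known compression
    suffix first, e.g. "data/train.jsonl.gz" -> "data/train.txt".

    Sketch of the behaviour in the release notes, not the actual
    base_io implementation.
    """
    for suffix in COMPRESSION_SUFFIXES:
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    root, _ = os.path.splitext(path)
    return root + ext
```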
Release v0.0.4
- `pairs-to-llama2` filter now has an optional `--prefix` parameter to use with the instruction
- added the `pretrain-sentences-to-pairs` filter for generating artificial prompt/response datasets from pretrain data
- requires seppl>=0.0.11 now
- the `LDC_MODULES_EXCL` environment variable is now used for specifying modules to be excluded from the registration process (e.g., used when generating help screens for derived libraries that shouldn't output the base plugins as well)
- `llm-registry` and `llm-help` now allow specifying excluded modules via the `-e/--excluded_modules` option
- the `to-alpaca` writer now has the `-a/--ensure_ascii` flag to enforce ASCII compatibility in the output
- added global option `-u/--update_interval` to the `convert` tool to customize how often progress of # records processed is being output in the console (default: 1000)
- the `text-length` filter now handles None values, i.e., ignores them
- locations (i.e., input/instructions/output/etc.) can now be specified multiple times
- the `llm-help` tool can generate index files for all the plugins now; in case of markdown it will link to the other markdown files
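The idea behind `pretrain-sentences-to-pairs` (turning plain pretrain sentences into artificial prompt/response pairs) can be sketched as below. The pairing scheme and the `prompt_step` semantics are guesses for illustration only; the real filter may work differently:

```python
def sentences_to_pairs(sentences, prompt_step=1):
    """Build artificial (instruction, response) pairs from a list of
    sentences by pairing each selected sentence with its successor.

    With prompt_step=0 the instruction is left empty, mirroring the
    behaviour added in v0.2.0. (Sketch of the concept, not the
    filter's actual semantics.)
    """
    if prompt_step == 0:
        # no prompt: every sentence becomes a response on its own
        return [("", s) for s in sentences]
    return [
        (sentences[i], sentences[i + 1])
        for i in range(0, len(sentences) - 1, prompt_step)
    ]
```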
Release v0.0.3
- added the `record-window` filter
- added the `llm-registry` tool for querying the registry from the command-line
- added the `replace_patterns` method to the `ldc.text_utils` module
- added the `replace-patterns` filter
- added `-p/--pretty-print` flag to the `to-alpaca` writer
- added `pairs-to-llama2` and `llama2-to-pairs` filters (since llama2 has the instruction as part of the string, it is treated as pretrain data)
- added `to-llama2-format` filter for pretrain records (no [INST]...[/INST] block)
- now requiring seppl>=0.0.8 in order to raise exceptions when encountering unknown arguments
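A `replace_patterns`-style helper can be sketched as follows. The signature (a list of regexp/replacement pairs applied in order) is an assumption for illustration; the actual `ldc.text_utils` method may accept its patterns differently:

```python
import re


def replace_patterns(text, patterns):
    """Apply a sequence of (regexp, replacement) pairs to the text,
    in order, returning the rewritten string.

    Sketch of a pattern-replacement helper like the one described
    above; not the library's actual implementation.
    """
    for regexp, replacement in patterns:
        text = re.sub(regexp, replacement, text)
    return text
```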