moved examples into llm-dataset-converter-examples repo (mkdocs site)
fracpete committed Jun 6, 2024
1 parent 1ea3a90 commit c1f6138
Showing 1 changed file with 3 additions and 210 deletions.
213 changes: 3 additions & 210 deletions README.md
@@ -373,218 +373,11 @@ optional arguments:
See [here](plugins/README.md) for an overview of all plugins.


## Command-line examples
## Examples

Use the [alpaca_data_cleaned.json](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json)
dataset for the following examples.
You can find examples for using the library (command-line and code) here:

### Conversion

```bash
llm-convert \
from-alpaca \
--input ./alpaca_data_cleaned.json \
to-csv-pr \
--output alpaca_data_cleaned.csv
```

If you want some logging output, e.g., on progress and what files are being
processed/generated:

```bash
llm-convert \
-l INFO \
from-alpaca \
--input ./alpaca_data_cleaned.json \
-l INFO \
to-csv-pr \
--output alpaca_data_cleaned.csv \
-l INFO
```

### Compression

The output gets automatically compressed (when the format supports that), based
on the extension that you use for the output.

The following uses Gzip to compress the CSV file:

```bash
llm-convert \
from-alpaca \
--input ./alpaca_data_cleaned.json \
to-csv-pr \
--output alpaca_data_cleaned.csv.gz
```

The input gets automatically decompressed based on its extension, provided
the format supports that.
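
For example, the following reads a gzip-compressed copy of the dataset (the
`alpaca_data_cleaned.json.gz` file name is just assumed for illustration):

```bash
llm-convert \
from-alpaca \
--input ./alpaca_data_cleaned.json.gz \
to-csv-pr \
--output alpaca_data_cleaned.csv
```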

### Processing multiple files

Provided that the reader supports it, you can also process multiple files, one
after the other. For that you either specify them explicitly (multiple
arguments to the `--input` option) or use glob syntax (e.g., `--input "*.json"`).
For the latter, you should surround the argument with double quotes to keep
the shell from expanding the names automatically.

If you have a lot of files, it is more efficient to store them in text
files (one file name per line) and pass these to the reader using the
`--input_list` option (assuming that the reader supports this). Such file
lists can be generated with the `llm-find` tool; see `Locating files` below.

As for specifying the output, you simply specify the output directory. An output
file name gets automatically generated from the name of the current input file
that is being processed.

If you want to compress the output files, you need to specify your preferred
compression format via the global `-c/--compression` option of the `llm-convert`
tool. By default, no compression is used.

Please note that when using a *stream writer* (e.g., for text or jsonlines
output) in conjunction with an output directory, each record will be stored
in a separate file. In order to write all the records into a single file,
you have to explicitly specify that file as the output.
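
Putting this together, converting several Alpaca JSON files in one go could
look like the following sketch (the glob pattern and the `./converted` output
directory are assumptions for illustration):

```bash
llm-convert \
-l INFO \
from-alpaca \
--input "*.json" \
to-csv-pr \
--output ./converted
```

Each input file then produces a correspondingly named CSV file in `./converted`;
adding the global `-c/--compression` option would compress these outputs as well.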

### Filtering

Instead of just reading and writing the data records, you can also inject
filters in between. E.g., the following command-line loads the
Alpaca JSON dataset and only keeps records that have the keyword `function`
in either the `instruction`, `input` or `output` field of the record:

```bash
llm-convert \
-l INFO \
from-alpaca \
-l INFO \
--input alpaca_data_cleaned.json \
keyword \
-l INFO \
--keyword function \
--location any \
--action keep \
to-alpaca \
-l INFO \
--output alpaca_data_cleaned-filtered.json
```

**NB:** When chaining filters, the tool checks that the output generated by
each plugin is compatible with the input accepted by the next one (this
includes the reader and writer).

### Download

The following command downloads the file `vocab.json` from the Hugging Face
project [lysandre/arxiv-nlp](https://huggingface.co/lysandre/arxiv-nlp):

```bash
llm-download \
huggingface \
-l INFO \
-i lysandre/arxiv-nlp \
-f vocab.json \
-o .
```

The next command gets the file `part_1_200000.parquet` from the dataset
[nampdn-ai/tiny-codes](https://huggingface.co/datasets/nampdn-ai/tiny-codes)
(if you don't specify a filename, the complete dataset will get downloaded):

```bash
llm-download \
huggingface \
-l INFO \
-i nampdn-ai/tiny-codes \
-t dataset \
-f part_1_200000.parquet \
-o .
```

**NB:** Hugging Face will cache files locally in your home directory before
copying them to the location that you specified.


### Locating files

The following command scans the `/some/dir` directory recursively for `.txt`
files that do not have `raw` in the file path:

```bash
llm-find \
-l INFO \
-i /some/dir/ \
-r \
-m ".*\.txt" \
-n ".*\/raw\/.*" \
-o ./files.txt
```

The same command, but splitting the files into training, validation and test
lists, using a ratio of 70/15/15:

```bash
llm-find \
-l INFO \
-i /some/dir/ \
-r \
-m ".*\.txt" \
-n ".*\/raw\/.*" \
--split_ratios 70 15 15 \
--split_names train val test \
-o ./files.txt
```

This results in the following three files: `files-train.txt`, `files-val.txt`
and `files-test.txt`.
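
Such lists can then be fed back into `llm-convert` via the `--input_list`
option described under *Processing multiple files*. A sketch (it assumes the
listed files are in a format that the chosen reader understands, and `./train`
is just an illustrative output directory):

```bash
llm-convert \
from-alpaca \
--input_list ./files-train.txt \
to-csv-pr \
--output ./train
```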


## Code example

Of course, you can also use the library directly from Python.

The following code sets up a pipeline that reads in a prompt/response
dataset in Alpaca format, filters out records that do not contain the
keyword `function` anywhere in the record, converts it to *pretrain* data
and then outputs it in zstandard-compressed jsonlines format:

```python
from wai.logging import LOGGING_INFO, init_logging
from seppl.io import execute
from ldc.api.supervised.pairs import PAIRDATA_FIELDS
from ldc.core import Session, ENV_LLM_LOGLEVEL
from ldc.api import COMPRESSION_ZSTD
from ldc.registry import register_plugins
from ldc.supervised.pairs import AlpacaReader
from ldc.pretrain import JsonLinesPretrainWriter
from ldc.filter import PairsToPretrain, Keyword

init_logging(env_var=ENV_LLM_LOGLEVEL)
register_plugins()

execute(
    # reader: loads the prompt/response pairs from the Alpaca JSON file
    AlpacaReader(
        source="./alpaca_data_cleaned.json",
        logging_level=LOGGING_INFO
    ),
    # filters: keep only records containing the keyword, then turn the pairs into pretrain data
    [
        Keyword(
            keywords=["function"],
            logging_level=LOGGING_INFO
        ),
        PairsToPretrain(
            data_fields=PAIRDATA_FIELDS
        ),
    ],
    # writer: stores the records as jsonlines, using "text" as the attribute for the content
    JsonLinesPretrainWriter(
        target="./output",
        att_content="text",
        logging_level=LOGGING_INFO
    ),
    # session: global settings, such as logging level and output compression
    Session()
        .set_logging_level(LOGGING_INFO)
        .set_compression(COMPRESSION_ZSTD),
)
```
https://waikato-llm.github.io/llm-dataset-converter-examples/


## Additional libraries