moved examples into llm-dataset-converter-examples repo (mkdocs site)
fracpete committed Jun 6, 2024
1 parent 1ea3a90 commit c1f6138
Showing 1 changed file with 3 additions and 210 deletions.
213 changes: 3 additions & 210 deletions README.md
@@ -373,218 +373,11 @@ optional arguments:
See [here](plugins/README.md) for an overview of all plugins.


## Command-line examples
## Examples

Use the [alpaca_data_cleaned.json](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json)
dataset for the following examples.
You can find examples for using the library (command-line and code) here:

### Conversion

```bash
llm-convert \
from-alpaca \
--input ./alpaca_data_cleaned.json \
to-csv-pr \
--output alpaca_data_cleaned.csv
```

If you want some logging output, e.g., on progress and what files are being
processed/generated:

```bash
llm-convert \
-l INFO \
from-alpaca \
--input ./alpaca_data_cleaned.json \
-l INFO \
to-csv-pr \
--output alpaca_data_cleaned.csv \
-l INFO
```

### Compression

The output gets automatically compressed (when the format supports that), based
on the extension that you use for the output.

The following uses Gzip to compress the CSV file:

```bash
llm-convert \
from-alpaca \
--input ./alpaca_data_cleaned.json \
to-csv-pr \
--output alpaca_data_cleaned.csv.gz
```

The input gets automatically decompressed based on its extension, provided
the format supports that.
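
For example, the following reads a gzip-compressed copy of the dataset (the
`alpaca_data_cleaned.json.gz` file name is just assumed for illustration):

```bash
llm-convert \
from-alpaca \
--input ./alpaca_data_cleaned.json.gz \
to-csv-pr \
--output alpaca_data_cleaned.csv
```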

### Processing multiple files

Provided that the reader supports it, you can also process multiple files, one
after the other. For that you either specify them explicitly (multiple
arguments to the `--input` option) or use glob syntax (e.g., `--input "*.json"`).
For the latter, you should surround the argument with double quotes to keep
the shell from expanding the names automatically.

If you have a lot of files, it is more efficient to store them in text
files (one file name per line) and pass these to the reader using the
`--input_list` option (assuming that the reader supports this). Such file
lists can be generated with the `llm-find` tool; see `Locating files` below.

As for specifying the output, you simply specify the output directory. An output
file name gets automatically generated from the name of the current input file
that is being processed.

If you want to compress the output files, you need to specify your preferred
compression format via the global `-c/--compression` option of the `llm-convert`
tool. By default, no compression is used.

Please note that when using a *stream writer* (e.g., for text or jsonlines
output) in conjunction with an output directory, each record will be stored
in a separate file. In order to write all the records into a single file,
you have to explicitly specify that file as the output.
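
Putting this together, converting several Alpaca JSON files in one go could
look like the following sketch (the glob pattern and the `./converted` output
directory are assumptions for illustration):

```bash
llm-convert \
-l INFO \
from-alpaca \
--input "*.json" \
to-csv-pr \
--output ./converted
```

Each input file then produces a correspondingly named CSV file in `./converted`;
adding the global `-c/--compression` option would compress these outputs as well.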

### Filtering

Instead of just reading and writing the data records, you can also inject
filters in between. E.g., the following command-line loads the
Alpaca JSON dataset and only keeps records that have the keyword `function`
in either the `instruction`, `input` or `output` field of the record:

```bash
llm-convert \
-l INFO \
from-alpaca \
-l INFO \
--input alpaca_data_cleaned.json \
keyword \
-l INFO \
--keyword function \
--location any \
--action keep \
to-alpaca \
-l INFO \
--output alpaca_data_cleaned-filtered.json
```

**NB:** When chaining filters, the tool checks that the output generated by
each plugin is compatible with the input accepted by the next one (this
includes the reader and writer).

### Download

The following command downloads the file `vocab.json` from the Hugging Face
project [lysandre/arxiv-nlp](https://huggingface.co/lysandre/arxiv-nlp):

```bash
llm-download \
huggingface \
-l INFO \
-i lysandre/arxiv-nlp \
-f vocab.json \
-o .
```

The next command gets the file `part_1_200000.parquet` from the dataset
[nampdn-ai/tiny-codes](https://huggingface.co/datasets/nampdn-ai/tiny-codes)
(if you don't specify a filename, the complete dataset will get downloaded):

```bash
llm-download \
huggingface \
-l INFO \
-i nampdn-ai/tiny-codes \
-t dataset \
-f part_1_200000.parquet \
-o .
```

**NB:** Hugging Face will cache files locally in your home directory before
copying them to the location that you specified.


### Locating files

The following command scans the `/some/dir` directory recursively for `.txt`
files that do not have `raw` in the file path:

```bash
llm-find \
-l INFO \
-i /some/dir/ \
-r \
-m ".*\.txt" \
-n ".*\/raw\/.*" \
-o ./files.txt
```

The same command, but splitting the files into training, validation and test
lists, using a ratio of 70/15/15:

```bash
llm-find \
-l INFO \
-i /some/dir/ \
-r \
-m ".*\.txt" \
-n ".*\/raw\/.*" \
--split_ratios 70 15 15 \
--split_names train val test \
-o ./files.txt
```

This results in the following three files: `files-train.txt`, `files-val.txt`
and `files-test.txt`.
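
Such lists can then be fed back into `llm-convert` via the `--input_list`
option described under *Processing multiple files*. A sketch (it assumes the
listed files are in a format that the chosen reader understands, and `./train`
is just an illustrative output directory):

```bash
llm-convert \
from-alpaca \
--input_list ./files-train.txt \
to-csv-pr \
--output ./train
```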


## Code example

Of course, you can also use the library directly from Python.

The following code sets up a pipeline that reads in a prompt/response
dataset in Alpaca format, filters out records that do not contain the
keyword `function` anywhere in the record, converts it to *pretrain* data
and then outputs it in zstandard-compressed jsonlines format:

```python
from wai.logging import LOGGING_INFO, init_logging
from seppl.io import execute
from ldc.api.supervised.pairs import PAIRDATA_FIELDS
from ldc.core import Session, ENV_LLM_LOGLEVEL
from ldc.api import COMPRESSION_ZSTD
from ldc.registry import register_plugins
from ldc.supervised.pairs import AlpacaReader
from ldc.pretrain import JsonLinesPretrainWriter
from ldc.filter import PairsToPretrain, Keyword

init_logging(env_var=ENV_LLM_LOGLEVEL)
register_plugins()

execute(
    # reader: loads the prompt/response pairs from the Alpaca JSON file
    AlpacaReader(
        source="./alpaca_data_cleaned.json",
        logging_level=LOGGING_INFO
    ),
    # filters: keep only records containing the keyword, then turn the pairs into pretrain data
    [
        Keyword(
            keywords=["function"],
            logging_level=LOGGING_INFO
        ),
        PairsToPretrain(
            data_fields=PAIRDATA_FIELDS
        ),
    ],
    # writer: stores the records as jsonlines, using "text" as the attribute for the content
    JsonLinesPretrainWriter(
        target="./output",
        att_content="text",
        logging_level=LOGGING_INFO
    ),
    # session: global settings, such as logging level and output compression
    Session()
        .set_logging_level(LOGGING_INFO)
        .set_compression(COMPRESSION_ZSTD),
)
```
https://waikato-llm.github.io/llm-dataset-converter-examples/


## Additional libraries