diff --git a/CLAUDE.md b/CLAUDE.md
index 30f160e..569111a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -96,9 +96,8 @@ The tokenizer project is organized into modular packages with clear separation o
    - `tokens/` - Special token handling
 
 4. **llama3/cmd/llama3/** - Llama3-specific CLI commands
-   - `encode.go` - Text encoding command
+   - `encode.go` - Text encoding command (with memory-efficient streaming for stdin)
    - `decode.go` - Token decoding command
-   - `stream.go` - Streaming tokenization command
    - `info.go` - Tokenizer information command
 
 ### Key Architectural Decisions
diff --git a/cmd/tokenizer/README.md b/cmd/tokenizer/README.md
index c5c78d0..2855331 100644
--- a/cmd/tokenizer/README.md
+++ b/cmd/tokenizer/README.md
@@ -78,27 +78,21 @@ echo "128000 9906 11 1917 0 128001" | tokenizer llama3 decode
 # Round-trip encoding and decoding
 tokenizer llama3 "test" | tokenizer llama3 decode
 
-# Stream large files (automatic)
+# Process large files efficiently (automatic memory-efficient streaming)
 cat large_file.txt | tokenizer llama3
-
-# Stream large files (explicit)
-cat large_file.txt | tokenizer llama3 stream
 ```
 
-### Streaming Mode
+### Processing Large Files
 
-For processing large files or real-time input:
+The tokenizer automatically uses memory-efficient streaming when processing piped input:
 
 ```bash
-# Automatic streaming (detects piped input)
+# Process large files with O(1) memory usage
 tokenizer llama3 < input.txt
 cat large_file.txt | tokenizer llama3
 
-# Explicit streaming with options
-tokenizer llama3 stream --buffer-size=8192 --max-buffer=2097152 < large_file.txt
-
-# Stream without special tokens
-tokenizer llama3 stream --bos=false --eos=false < input.txt
+# Process without special tokens
+tokenizer llama3 --bos=false --eos=false < input.txt
 ```
 
 ## Available Tokenizers
@@ -108,9 +102,8 @@ tokenizer llama3 stream --bos=false --eos=false < input.txt
 Meta's Llama 3 tokenizer with 128,256 tokens (128,000 regular + 256 special tokens).
 
 **Commands:**
-- `encode` - Convert text to token IDs
+- `encode` - Convert text to token IDs (memory-efficient for stdin)
 - `decode` - Convert token IDs to text
-- `stream` - Process text in streaming mode
 - `info` - Display tokenizer information
 
 ## Examples
diff --git a/llama3/IMPLEMENTATION.md b/llama3/IMPLEMENTATION.md
index 5a39803..ac19abc 100644
--- a/llama3/IMPLEMENTATION.md
+++ b/llama3/IMPLEMENTATION.md
@@ -204,7 +204,7 @@ type Scanner interface {
 }
 ```
 
-Create a scanner with `tokenizer.NewScanner(reader)` or `NewScannerOptions` for custom configuration.
+Create a scanner with `tokenizer.NewScanner(reader, opts...)`, passing options for custom configuration.
 
 ### Pipeline Interfaces
 
@@ -285,7 +285,7 @@ if err := scanner.Err(); err != nil {
 Custom buffer configuration:
 
 ```go
-scanner := tokenizer.NewScannerOptions(reader,
+scanner := tokenizer.NewScanner(reader,
 	llama3.WithBufferSize(8192),
 	llama3.WithMaxBuffer(1024*1024),
 	llama3.WithEncodeOptions(&llama3.EncodeOptions{
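
Reviewer note: the updated README claims streaming engages automatically for piped input. For anyone verifying that claim, the conventional Go technique is to `Stat` stdin and check whether it is a character device; the sketch below shows that generic pattern, not necessarily this repository's exact implementation:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// If stdin is not a character device, it is a pipe or a redirected
	// file, so a CLI can switch to buffered, memory-efficient streaming
	// instead of reading the whole input up front.
	fi, err := os.Stdin.Stat()
	if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
		fmt.Println("piped input detected: streaming mode")
	} else {
		fmt.Println("interactive terminal: argument mode")
	}
}
```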
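With `stream.go` and `NewScannerOptions` removed, library callers configure the scanner through variadic options on `NewScanner`. Below is a usage sketch under stated assumptions: `NewScanner`, `Err`, and the `With*` options appear in the diff above, while the `Scan`/`Token` loop and the import paths are illustrative guesses, not confirmed by this change:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	// Hypothetical import paths; the docs use both the tokenizer and
	// llama3 package prefixes but this diff does not show where they live.
	tokenizer "example.com/tokenizer"
	"example.com/tokenizer/llama3"
)

func main() {
	reader := strings.NewReader("hello, world")

	// One constructor now covers defaults and custom configuration,
	// replacing the removed NewScannerOptions.
	scanner := tokenizer.NewScanner(reader,
		llama3.WithBufferSize(8192),
		llama3.WithMaxBuffer(1024*1024),
	)

	for scanner.Scan() { // Scan/Token are assumed method names
		fmt.Println(scanner.Token())
	}
	if err := scanner.Err(); err != nil { // Err appears in the diff above
		log.Fatal(err)
	}
}
```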