Merged
**CLAUDE.md** (3 changes: 1 addition & 2 deletions)

````diff
@@ -96,9 +96,8 @@ The tokenizer project is organized into modular packages with clear separation o
 - `tokens/` - Special token handling
 
 4. **llama3/cmd/llama3/** - Llama3-specific CLI commands
-- `encode.go` - Text encoding command
+- `encode.go` - Text encoding command (with memory-efficient streaming for stdin)
 - `decode.go` - Token decoding command
-- `stream.go` - Streaming tokenization command
 - `info.go` - Tokenizer information command
 
 ### Key Architectural Decisions
````
**cmd/tokenizer/README.md** (21 changes: 7 additions & 14 deletions)

````diff
@@ -78,27 +78,21 @@ echo "128000 9906 11 1917 0 128001" | tokenizer llama3 decode
 # Round-trip encoding and decoding
 tokenizer llama3 "test" | tokenizer llama3 decode
 
-# Stream large files (automatic)
+# Process large files efficiently (automatic memory-efficient streaming)
 cat large_file.txt | tokenizer llama3
-
-# Stream large files (explicit)
-cat large_file.txt | tokenizer llama3 stream
 ```
 
-### Streaming Mode
+### Processing Large Files
 
-For processing large files or real-time input:
+The tokenizer automatically uses memory-efficient streaming when processing piped input:
 
 ```bash
-# Automatic streaming (detects piped input)
+# Process large files with O(1) memory usage
 tokenizer llama3 < input.txt
 cat large_file.txt | tokenizer llama3
 
-# Explicit streaming with options
-tokenizer llama3 stream --buffer-size=8192 --max-buffer=2097152 < large_file.txt
-
-# Stream without special tokens
-tokenizer llama3 stream --bos=false --eos=false < input.txt
+# Process without special tokens
+tokenizer llama3 --bos=false --eos=false < input.txt
 ```
 
 ## Available Tokenizers
@@ -108,9 +102,8 @@
 Meta's Llama 3 tokenizer with 128,256 tokens (128,000 regular + 256 special tokens).
 
 **Commands:**
-- `encode` - Convert text to token IDs
+- `encode` - Convert text to token IDs (memory-efficient for stdin)
 - `decode` - Convert token IDs to text
-- `stream` - Process text in streaming mode
 - `info` - Display tokenizer information
 
 ## Examples
````
**llama3/IMPLEMENTATION.md** (4 changes: 2 additions & 2 deletions)

````diff
@@ -204,7 +204,7 @@ type Scanner interface {
 }
 ```
 
-Create a scanner with `tokenizer.NewScanner(reader)` or `NewScannerOptions` for custom configuration.
+Create a scanner with `tokenizer.NewScanner(reader, opts...)`, passing options for custom configuration.
 
 ### Pipeline Interfaces
 
@@ -285,7 +285,7 @@ if err := scanner.Err(); err != nil {
 
 Custom buffer configuration:
 ```go
-scanner := tokenizer.NewScannerOptions(reader,
+scanner := tokenizer.NewScanner(reader,
 	llama3.WithBufferSize(8192),
 	llama3.WithMaxBuffer(1024*1024),
 	llama3.WithEncodeOptions(&llama3.EncodeOptions{
````