This repository was archived by the owner on Sep 10, 2025. It is now read-only.
  
  
  - 
                Notifications
    You must be signed in to change notification settings 
- Fork 248
    This repository was archived by the owner on Sep 10, 2025. It is now read-only.
  
  
Improve Tokenizer New Type Onboarding #1536
Copy link
Copy link
Open
Labels
actionableItems in the backlog waiting for an appropriate impl/fixItems in the backlog waiting for an appropriate impl/fixgood first issueGood for newcomersGood for newcomerstriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
🚀 The feature, motivation and pitch
As a sequel to #1518 where we added an enum for tokenizer types to simplify TokenizerArgs __post_init__, we need to further improve it to simplify new tokenizer type onboarding:
Tasks
- Move TokenizerType to a centralized place
- We now have two of them: Lines 67 to 69 in 0299a37 class TokenizerType(Enum): Tiktoken = auto() SentencePiece = auto() torchchat/torchchat/cli/builder.py Lines 241 to 245 in 0299a37 class TokenizerType(Enum): NONE = 0 TIKTOKEN = 1 SENTENCEPIECE = 2 HF_TOKENIZER = 3 
 
- We now have two of them: 
- Check all getters of tokenizer types
- It may be able to be simplified as inline torchchat/torchchat/generate.py Line 368 in 0299a37 self.is_llama3_model = self.tokenizer_args.is_tiktoken() 
 
- It may be able to be simplified as inline 
- Add documentation for future tokenizer onboard.
- We may need to point people to update the model validation logic: torchchat/torchchat/cli/builder.py Lines 290 to 322 in 0299a37 def validate_model( self, model: Optional[Model], model_description: str = "model", ) -> None: if model is None: return if self.tokenizer_type == TokenizerType.NONE: raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}") is_tiktoken = self.is_tiktoken() is_sentencepiece = self.is_sentencepiece() is_hf_tokenizer = self.is_hf_tokenizer() use_tiktoken = model.config.use_tiktoken use_hf_tokenizer = model.config.use_hf_tokenizer use_sentencepiece = not (use_tiktoken or use_hf_tokenizer) if ( (is_tiktoken and not use_tiktoken) or (is_hf_tokenizer and not use_hf_tokenizer) or (is_sentencepiece and not use_sentencepiece) ): raise RuntimeError( "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format( tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer), tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer), model_description, ) ) return 
 
- We may need to point people to update the model validation logic: 
To test, run a model with each tokenizer type:
- python torchchat.py generate llama2
- python torchchat.py generate llama3
- python torchchat.py generate granite-code
cc @Jack-Khuu @byjlw
Metadata
Metadata
Assignees
Labels
actionableItems in the backlog waiting for an appropriate impl/fixItems in the backlog waiting for an appropriate impl/fixgood first issueGood for newcomersGood for newcomerstriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Type
Projects
Status
No status