refactor: replace tiktoken with character-based token estimation #64

cellwebb · 2026-01-01T22:08:36Z

🎯 Summary

Removes the tiktoken dependency and replaces it with a simple character-based token estimation system, significantly reducing the dependency footprint while maintaining reliable functionality across all 26 AI providers.

📋 Changes Made

Removed tiktoken dependency from pyproject.toml
Simplified count_tokens() to use round(len(text) / 3.4) calculation
Removed get_encoding() function entirely (no longer needed)
Updated all tests to work with character-based approach
Ensured consistent behavior across all AI providers
Eliminated external dependencies and complex caching logic

✅ Testing

110 tests passing with the new implementation
All provider tests show identical behavior (no provider-specific logic)
Unicode, emoji, and special characters handled correctly
Empty content returns 0 tokens as expected
Character-based estimation provides reasonable approximations

🚀 Benefits

Lightweight: No external dependencies required
Fast: No network calls or complex calculations
Reliable: No cache issues or encoding failures
Consistent: All 26 providers work identically
Simple: Easy to understand and maintain code

📊 Token Estimation

The character-based approach (1 token ≈ 3.4 characters) provides reasonable estimates:

Hello world → 3 tokens
Sample test message → 6 tokens
This is a longer sentence → 14 tokens

This maintains the same approximate accuracy as tiktoken while eliminating complexity.

🔧 Files Changed

pyproject.toml - Remove tiktoken dependency
src/gac/ai_utils.py - Simplify token counting logic
src/gac/config.py - Remove tiktoken-related config
src/gac/constants/defaults.py - Clean up constants
All test files - Update to verify character-based counting

Summary by CodeRabbit

Refactor
- Simplified token counting with a consistent character-based approach across all models.
Chores
- Removed an external dependency to reduce installation footprint and streamline the project.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

- Remove tiktoken dependency from pyproject.toml - Replace count_tokens() with simple len/3.4 calculation - Remove get_encoding() function entirely - Update all tests to work with character-based approach - Ensure consistent behavior across all 26 AI providers - Eliminate external dependencies and complex caching logic All 110 tests passing with the simplified implementation.

coderabbitai · 2026-01-01T22:08:45Z

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

The pull request removes the tiktoken dependency from the project and replaces token counting with a simple character-based estimation formula (approximately 1 token per 3.4 characters). This removes external dependency requirements while maintaining token estimation functionality across all token counting operations.

Changes

Cohort / File(s)	Summary
Dependency Management `pyproject.toml`	Removed tiktoken>=0.12.0 from dependencies
Configuration & Constants `src/gac/config.py`, `src/gac/constants/defaults.py`	Removed `no_tiktoken` config field and related environment default (`NO_TIKTOKEN`); removed `DEFAULT_ENCODING` constant
Token Counting Implementation `src/gac/ai_utils.py`	Replaced tiktoken-based token counting with character-based estimation (len(text) / 3.4); removed `_should_skip_tiktoken_counting`, `get_encoding` helpers and tiktoken imports; `count_tokens` now unconditionally applies simple formula
Core Tests `tests/test_ai.py`, `tests/test_ai_utils.py`	Replaced tiktoken mocks and provider-specific encoding tests with character-based counting assertions; added new tests validating consistency across providers and edge cases (empty strings, Unicode, various lengths)
Extended Tests `tests/test_ai_utils_additional.py`, `tests/test_ai_utils_extended.py`	Removed tests exercising tiktoken fallback paths and environment-based skipping; added tests verifying deterministic, environment-independent character-based counting across multiple providers and text types
Configuration & Constants Tests `tests/test_config.py`, `tests/test_constants.py`	Removed assertion for `GAC_NO_TIKTOKEN` environment variable handling; replaced encoding constants test with token-limit constant check

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Away with tokens counted byte by byte,
We'll use a simpler spell—3.4's just right!
No external chains, no encoding dance,
Just characters and math, a swift advance.
Hopping faster now, our code runs lean,
The simplest way is often best, I've seen! ✨

✨ Finishing touches

📝 Generate docstrings

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebb31e7 and f6f7eaa.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

pyproject.toml
src/gac/ai_utils.py
src/gac/config.py
src/gac/constants/defaults.py
tests/test_ai.py
tests/test_ai_utils.py
tests/test_ai_utils_additional.py
tests/test_ai_utils_extended.py
tests/test_config.py
tests/test_constants.py

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-01-01T22:09:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Removed assertion for 'no_tiktoken' config key since tiktoken dependency was completely removed in the previous commit. The configuration option no longer exists after simplifying to character-based token counting.

fix: remove no_tiktoken config test assertion

f6f7eaa

Removed assertion for 'no_tiktoken' config key since tiktoken dependency was completely removed in the previous commit. The configuration option no longer exists after simplifying to character-based token counting.

cellwebb merged commit f83cedb into main Jan 1, 2026
7 of 8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: replace tiktoken with character-based token estimation #64

refactor: replace tiktoken with character-based token estimation #64

Uh oh!

cellwebb commented Jan 1, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 1, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

codecov bot commented Jan 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

refactor: replace tiktoken with character-based token estimation #64

refactor: replace tiktoken with character-based token estimation #64

Uh oh!

Conversation

cellwebb commented Jan 1, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🎯 Summary

📋 Changes Made

✅ Testing

🚀 Benefits

📊 Token Estimation

🔧 Files Changed

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

codecov bot commented Jan 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cellwebb commented Jan 1, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 1, 2026 •

edited

Loading

codecov bot commented Jan 1, 2026 •

edited

Loading