Fix the decoding issues series: BPE Tokenizer #1854
base: master
Conversation
Results: this PR (047ae5b) vs. master. Note: the file hashes are different because the Python code that generates the reference file uses …
Well done! Given that this produces correct results for train.csv, I feel quite confident that the implementation is correct. Do you have an explanation for the failing test in llama.cpp after applying the changes from here? Before merging, let's try to figure it out.
whisper.cpp (Outdated)
// This is not perfect
// Occasionally, it produces different results compared to OpenAI's tiktoken
Can you give an example where the results don't match?
When I directly used the implementation of bpe_gpt2_preprocess in llama.cpp, there was a 0.06% error, which disappeared after modification. I should delete this comment.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I've just gone over regex in detail again, and I think I can develop a function that replicates the exact behavior of Python's regex.
Based on the regex standard, I have rewritten almost everything. The following function generates the codepoint ranges used in the implementation:
import unicodedata

def find_category_ranges():
    # Collect contiguous codepoint ranges for every Unicode general category
    category_ranges = {}
    for codepoint in range(0x110000):  # Unicode range from 0 to 0x10FFFF
        try:
            char = chr(codepoint)
            category = unicodedata.category(char)
            if category not in category_ranges:
                category_ranges[category] = [(codepoint, codepoint)]
            else:
                last_range_start, last_range_end = category_ranges[category][-1]
                if codepoint == last_range_end + 1:
                    # Extend the current range
                    category_ranges[category][-1] = (last_range_start, codepoint)
                else:
                    # Start a new range for this category
                    category_ranges[category].append((codepoint, codepoint))
        except ValueError:
            # Skip invalid code points
            continue

    # Simplify and print ranges
    for category, ranges in category_ranges.items():
        print(f"{category}:")
        for start, end in ranges:
            print(f" {hex(start)}-{hex(end)}")
        print()

    with open("unicode_range.txt", "w") as file:
        # Merge the fine-grained categories (Lu, Ll, ...) into the broad
        # classes used by the tokenizer: letter, mark, number, ...
        general_category_ranges = dict()

        def add_range(category, ranges):
            if category not in general_category_ranges:
                general_category_ranges[category] = set()
            for start_end in ranges:
                general_category_ranges[category].add(start_end)

        for category, ranges in category_ranges.items():
            if category.startswith("L"):
                add_range("letter", ranges)
            elif category.startswith("M"):
                add_range("mark", ranges)
            elif category.startswith("N"):
                add_range("number", ranges)
            elif category.startswith("P"):
                add_range("punctuation", ranges)
            elif category.startswith("S"):
                add_range("symbol", ranges)
            elif category.startswith("Z"):
                add_range("separator", ranges)
            elif category.startswith("C"):
                add_range("other", ranges)

        # Format a list of (start, end) tuples as brace-delimited pairs
        def tuple_to_braces(values):
            string = "["
            for x in range(len(values) - 1):
                string += "{" + hex(values[x][0]) + "," + hex(values[x][1]) + "},"
            string += "{" + hex(values[-1][0]) + "," + hex(values[-1][1]) + "}]"
            return string

        def dict_to_custom_string(d):
            output_lines = ""
            for key, values in d.items():
                formatted_values = tuple_to_braces(values)
                line = f'"{key}": {formatted_values},\n'
                output_lines += line
            # Drop the trailing ",\n" and wrap everything in braces
            return "{" + output_lines[:-2] + "}"

        for x in general_category_ranges.keys():
            general_category_ranges[x] = sorted(list(general_category_ranges[x]))
            # Merge adjacent ranges again after combining categories
            buffer = []
            for range_start, range_end in general_category_ranges[x]:
                if buffer == []:
                    buffer.append((range_start, range_end))
                else:
                    last_range_start, last_range_end = buffer[-1]
                    if range_start == last_range_end + 1:
                        # Extend the current range
                        buffer[-1] = (last_range_start, range_end)
                    else:
                        # Start a new range
                        buffer.append((range_start, range_end))
            general_category_ranges[x] = buffer

        file.write(dict_to_custom_string(general_category_ranges))

if __name__ == '__main__':
    find_category_ranges()
import regex

# For every codepoint, check whether the GPT-2 pre-tokenizer regex merges a
# leading space with that character into a single match
regex_c = regex.compile(r''''s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+''')
for codepoint in range(0x110000):
    try:
        char = chr(codepoint)
        s = " " + char
        if len(regex_c.findall(s)) == 1:
            print(hex(codepoint))
    except ValueError:
        # Skip invalid code points
        continue
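For context, the ranges the first script writes out are intended to be pasted into static C++ lookup tables. The sketch below is illustrative only, with hypothetical names and example values (it is not the code from this PR); it shows how such a sorted range table could be stored and queried with a binary search:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One sorted, non-overlapping table per broad category ("letter", "number", ...),
// generated offline by a script like the one above.
// Example entries shown here: A-Z, a-z, U+00AA.
static const std::vector<std::pair<uint32_t, uint32_t>> letter_ranges = {
    {0x41, 0x5a}, {0x61, 0x7a}, {0xaa, 0xaa}, // ... rest of the generated ranges ...
};

// Return true if codepoint cp falls inside one of the sorted ranges
static bool is_in_ranges(const std::vector<std::pair<uint32_t, uint32_t>> & ranges, uint32_t cp) {
    // find the first range that starts after cp, then step back one range
    auto it = std::upper_bound(ranges.begin(), ranges.end(), cp,
        [](uint32_t v, const std::pair<uint32_t, uint32_t> & r) { return v < r.first; });
    if (it == ranges.begin()) {
        return false;
    }
    --it;
    return cp <= it->second;
}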
I think we should apply this new tokenization implementation in llama.cpp first and make sure that the tokenizer tests are all passing.
static int codepoint_type(uint32_t cp) {
    static bool codepoint_type_initialized = codepoint_type_init();
    return codepoint_type_binary_search(cp);
The old version, which does an unordered_map lookup, seems better. There is no need to implement binary search when the data structure is already available.
Absolutely, my recent tests confirm that an unordered_map is significantly faster, roughly 10X or so, but it also requires a lot more RAM, especially for full UTF-8 support. Consequently, I've opted for a hybrid approach: for the high-frequency UTF-8 ranges I use an unordered_map, while for the rest I fall back to binary search. This strategy strikes a nice balance between speed and memory use.
Relevant pull request: ggml-org/llama.cpp#5613
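As an illustration of that hybrid idea (hypothetical names and values, not the actual code from this PR or from ggml-org/llama.cpp#5613), the dispatch could look roughly like this, reusing the binary-search fallback named in the diff above:

#include <cstdint>
#include <unordered_map>

// Hypothetical codepoint classes, mirroring the broad categories above
enum codepoint_class { CP_UNDEF = 0, CP_LETTER, CP_NUMBER, CP_PUNCT /* ... */ };

// Binary-search fallback over the full sorted range tables
// (assumed to be provided, as in the codepoint_type_binary_search from the diff above)
static int codepoint_type_binary_search(uint32_t cp);

// Exact hash-map lookup, pre-filled only for the frequent ("hot") ranges;
// fast, but memory-hungry if extended to the whole Unicode space
static const std::unordered_map<uint32_t, int> hot_codepoint_types = {
    {0x41, CP_LETTER}, {0x30, CP_NUMBER}, // ... filled for the high-frequency ranges ...
};

static int codepoint_type_hybrid(uint32_t cp) {
    const auto it = hot_codepoint_types.find(cp);
    if (it != hot_codepoint_types.end()) {
        return it->second;                   // fast path: unordered_map
    }
    return codepoint_type_binary_search(cp); // slow path: binary search over ranges
}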
I feel that reviewing everything in one PR is too difficult, and some controversies have arisen. I've decided to extract the parts that are less likely to cause controversy. This PR uses some of the code from llama.cpp and fixes the tokenizer issue in whisper.cpp. We can use the following code to check the implementation for correctness.
The original implementation of bpe_gpt2_preprocess in llama.cpp and the Python regex implementation have some differences, leading to a discrepancy of about 0.06%. After some modifications, the discrepancy has been essentially eliminated. Although I tested a relatively large dataset, I still cannot guarantee that all possibilities have been covered. The test uses the train.csv dataset from https://huggingface.co/datasets/papluca/language-identification
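The check code itself is not reproduced in this thread. As a rough illustration of the idea, assuming the whisper_tokenize() API from whisper.h and treating each line of a plain-text input file as one sample, one could dump the token IDs and then diff or hash the output against the reference file produced by the Python tokenizer:

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

#include "whisper.h"

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <text-file>\n", argv[0]);
        return 1;
    }

    struct whisper_context * ctx = whisper_init_from_file_with_params(argv[1], whisper_context_default_params());
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    std::ifstream fin(argv[2]);
    std::string line;
    std::vector<whisper_token> tokens(1024);

    // tokenize every line and print the token ids; the output can be
    // hashed or diffed against the reference produced by the Python tokenizer
    while (std::getline(fin, line)) {
        const int n = whisper_tokenize(ctx, line.c_str(), tokens.data(), (int) tokens.size());
        if (n < 0) {
            fprintf(stderr, "tokenization failed (buffer too small?)\n");
            continue;
        }
        for (int i = 0; i < n; ++i) {
            printf("%d ", tokens[i]);
        }
        printf("\n");
    }

    whisper_free(ctx);
    return 0;
}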