Fix the decoding issues series: BPE Tokenizer #1854
base: master
Conversation
Results: this PR (047ae5b) vs. master. Note: the file hashes are different because the Python code that generates the reference file uses …
Well done! Given that this produces correct results for train.csv, I feel quite confident that the implementation is correct. Do you have an explanation for the failing test in llama.cpp after applying the changes from here? Before merging, let's try to figure it out.
whisper.cpp (Outdated)
// This is not perfect
// Occasionally, it produces different results compared to OpenAI's tiktoken
Can you give an example where the results don't match?
When I directly used the implementation of bpe_gpt2_preprocess in llama.cpp, there was a 0.06% error, which disappeared after modification. I should delete this comment.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I've just gone over regex in detail again, and I think I can develop a function that replicates the exact behavior of Python's regex.
Based on the regex standard, I have rewritten almost everything. The following function generates the codepoint ranges used in the implementation:
import unicodedata

def find_category_ranges():
    # Collect contiguous codepoint ranges for every Unicode general category
    category_ranges = {}
    for codepoint in range(0x110000):  # Unicode range from 0 to 0x10FFFF
        try:
            char = chr(codepoint)
            category = unicodedata.category(char)
            if category not in category_ranges:
                category_ranges[category] = [(codepoint, codepoint)]
            else:
                last_range_start, last_range_end = category_ranges[category][-1]
                if codepoint == last_range_end + 1:
                    # Extend the current range
                    category_ranges[category][-1] = (last_range_start, codepoint)
                else:
                    # Start a new range for this category
                    category_ranges[category].append((codepoint, codepoint))
        except ValueError:
            # Skip invalid code points
            continue

    # Simplify and print ranges
    for category, ranges in category_ranges.items():
        print(f"{category}:")
        for start, end in ranges:
            print(f" {hex(start)}-{hex(end)}")
        print()

    with open("unicode_range.txt", "w") as file:
        # Merge the fine-grained categories (Lu, Ll, ...) into the broad
        # classes used by the tokenizer: letter, mark, number, ...
        general_category_ranges = dict()

        def add_range(category, ranges):
            if category not in general_category_ranges:
                general_category_ranges[category] = set()
            for start_end in ranges:
                general_category_ranges[category].add(start_end)

        for category, ranges in category_ranges.items():
            if category.startswith("L"):
                add_range("letter", ranges)
            elif category.startswith("M"):
                add_range("mark", ranges)
            elif category.startswith("N"):
                add_range("number", ranges)
            elif category.startswith("P"):
                add_range("punctuation", ranges)
            elif category.startswith("S"):
                add_range("symbol", ranges)
            elif category.startswith("Z"):
                add_range("separator", ranges)
            elif category.startswith("C"):
                add_range("other", ranges)

        # Format a list of (start, end) tuples as brace-delimited pairs
        def tuple_to_braces(values):
            string = "["
            for x in range(len(values) - 1):
                string += "{" + hex(values[x][0]) + "," + hex(values[x][1]) + "},"
            string += "{" + hex(values[-1][0]) + "," + hex(values[-1][1]) + "}]"
            return string

        def dict_to_custom_string(d):
            output_lines = ""
            for key, values in d.items():
                formatted_values = tuple_to_braces(values)
                line = f'"{key}": {formatted_values},\n'
                output_lines += line
            # Drop the trailing ",\n" and wrap everything in braces
            return "{" + output_lines[:-2] + "}"

        for x in general_category_ranges.keys():
            general_category_ranges[x] = sorted(list(general_category_ranges[x]))
            # Merge adjacent ranges again after combining categories
            buffer = []
            for range_start, range_end in general_category_ranges[x]:
                if buffer == []:
                    buffer.append((range_start, range_end))
                else:
                    last_range_start, last_range_end = buffer[-1]
                    if range_start == last_range_end + 1:
                        # Extend the current range
                        buffer[-1] = (last_range_start, range_end)
                    else:
                        # Start a new range
                        buffer.append((range_start, range_end))
            general_category_ranges[x] = buffer

        file.write(dict_to_custom_string(general_category_ranges))

if __name__ == '__main__':
    find_category_ranges()
import regex

# For every codepoint, check whether the GPT-2 pre-tokenizer regex merges a
# leading space with that character into a single match
regex_c = regex.compile(r''''s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+''')
for codepoint in range(0x110000):
    try:
        char = chr(codepoint)
        s = " " + char
        if len(regex_c.findall(s)) == 1:
            print(hex(codepoint))
    except ValueError:
        # Skip invalid code points
        continue
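For context, the ranges the first script writes out are intended to be pasted into static C++ lookup tables. The sketch below is illustrative only, with hypothetical names and example values (it is not the code from this PR); it shows how such a sorted range table could be stored and queried with a binary search:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One sorted, non-overlapping table per broad category ("letter", "number", ...),
// generated offline by a script like the one above.
// Example entries shown here: A-Z, a-z, U+00AA.
static const std::vector<std::pair<uint32_t, uint32_t>> letter_ranges = {
    {0x41, 0x5a}, {0x61, 0x7a}, {0xaa, 0xaa}, // ... rest of the generated ranges ...
};

// Return true if codepoint cp falls inside one of the sorted ranges
static bool is_in_ranges(const std::vector<std::pair<uint32_t, uint32_t>> & ranges, uint32_t cp) {
    // find the first range that starts after cp, then step back one range
    auto it = std::upper_bound(ranges.begin(), ranges.end(), cp,
        [](uint32_t v, const std::pair<uint32_t, uint32_t> & r) { return v < r.first; });
    if (it == ranges.begin()) {
        return false;
    }
    --it;
    return cp <= it->second;
}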
I think we should apply this new tokenization implementation in llama.cpp first and make sure that the tokenizer tests are all passing.
static int codepoint_type(uint32_t cp) {
    static bool codepoint_type_initialized = codepoint_type_init();
    return codepoint_type_binary_search(cp);
The old version, which does an unordered_map lookup, seems better. There is no need to implement binary search when the data structure is already available.
Absolutely, my recent tests confirm that an unordered_map is significantly faster, roughly 10X or so, but it also requires a lot more RAM, especially for full UTF-8 support. Consequently, I've opted for a hybrid approach: for the high-frequency UTF-8 ranges I use an unordered_map, while for the rest I fall back to binary search. This strategy strikes a nice balance between speed and memory use.
Relevant pull request: ggml-org/llama.cpp#5613
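As an illustration of that hybrid idea (hypothetical names and values, not the actual code from this PR or from ggml-org/llama.cpp#5613), the dispatch could look roughly like this, reusing the binary-search fallback named in the diff above:

#include <cstdint>
#include <unordered_map>

// Hypothetical codepoint classes, mirroring the broad categories above
enum codepoint_class { CP_UNDEF = 0, CP_LETTER, CP_NUMBER, CP_PUNCT /* ... */ };

// Binary-search fallback over the full sorted range tables
// (assumed to be provided, as in the codepoint_type_binary_search from the diff above)
static int codepoint_type_binary_search(uint32_t cp);

// Exact hash-map lookup, pre-filled only for the frequent ("hot") ranges;
// fast, but memory-hungry if extended to the whole Unicode space
static const std::unordered_map<uint32_t, int> hot_codepoint_types = {
    {0x41, CP_LETTER}, {0x30, CP_NUMBER}, // ... filled for the high-frequency ranges ...
};

static int codepoint_type_hybrid(uint32_t cp) {
    const auto it = hot_codepoint_types.find(cp);
    if (it != hot_codepoint_types.end()) {
        return it->second;                   // fast path: unordered_map
    }
    return codepoint_type_binary_search(cp); // slow path: binary search over ranges
}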
I feel that reviewing everything in one PR is too difficult, and some controversies have arisen. I've decided to extract the parts that are less likely to cause controversy. This PR uses some of the code from llama.cpp and fixes the tokenizer issue in whisper.cpp. We can use the following code to check the implementation for correctness.
The original implementation of bpe_gpt2_preprocess in llama.cpp and the Python regex implementation have some differences, leading to a discrepancy of about 0.06%. After some modifications, the discrepancy has been essentially eliminated. Although I tested a relatively large dataset, I still cannot guarantee that all possibilities have been covered. The test uses the train.csv dataset from https://huggingface.co/datasets/papluca/language-identification
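The check code itself is not reproduced in this thread. As a rough illustration of the idea, assuming the whisper_tokenize() API from whisper.h and treating each line of a plain-text input file as one sample, one could dump the token IDs and then diff or hash the output against the reference file produced by the Python tokenizer:

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

#include "whisper.h"

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <text-file>\n", argv[0]);
        return 1;
    }

    struct whisper_context * ctx = whisper_init_from_file_with_params(argv[1], whisper_context_default_params());
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    std::ifstream fin(argv[2]);
    std::string line;
    std::vector<whisper_token> tokens(1024);

    // tokenize every line and print the token ids; the output can be
    // hashed or diffed against the reference produced by the Python tokenizer
    while (std::getline(fin, line)) {
        const int n = whisper_tokenize(ctx, line.c_str(), tokens.data(), (int) tokens.size());
        if (n < 0) {
            fprintf(stderr, "tokenization failed (buffer too small?)\n");
            continue;
        }
        for (int i = 0; i < n; ++i) {
            printf("%d ", tokens[i]);
        }
        printf("\n");
    }

    whisper_free(ctx);
    return 0;
}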