
Fix the decoding issues series: BPE Tokenizer #1854

Open
wants to merge 15 commits into master
Conversation

@bobqianic (Collaborator) commented Feb 10, 2024

Reviewing everything in one PR proved too difficult, and some controversies arose, so I've decided to extract the parts that are less likely to cause controversy. This PR uses some code from llama.cpp and fixes the tokenizer issues in whisper.cpp. We can use the following code to check the implementation for correctness.

The original implementation of bpe_gpt2_preprocess in llama.cpp and the Python regex implementation differ slightly, leading to about a 0.06% discrepancy. After some modifications, the discrepancy has been essentially eliminated. Although I tested a relatively large dataset, I still cannot guarantee that every case is covered.

The test uses the train.csv dataset from https://huggingface.co/datasets/papluca/language-identification
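The checking script itself is not included in this excerpt. As a rough, hypothetical sketch (placeholder data, not the PR's actual code), the kind of per-line comparison behind the "error rate" figures reported below might look like:

```python
# Hypothetical sketch (not the PR's actual script): compare two sets of
# token-ID lines produced by two tokenizer implementations and report
# the percentage of lines that differ.

def error_rate(ref_lines, test_lines):
    """Percentage of corresponding lines whose token lists differ."""
    assert len(ref_lines) == len(test_lines)
    mismatches = sum(a.strip() != b.strip() for a, b in zip(ref_lines, test_lines))
    return 100.0 * mismatches / len(ref_lines)

# Identical outputs give a 0.0% error rate.
print(error_rate(["[503, 329]", "[43454]"], ["[503, 329]", "[43454]"]))
```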

@bobqianic (Collaborator, Author) commented Feb 10, 2024

Results

This PR 047ae5b

Note: The file hashes differ because the Python code that generates the reference file uses \n as the newline character, while the C++ code uses \r\n.

The files are different.
294f77dd7b09dbf17be8a5bdca541c897d9440d86151448eabb76114d37da7d8
0bb45e15215145d0c573a4a569fe9353fcf36590e1f99c68796afe1fcd85f1b7
--------------------

error rate: 0.0%
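As a quick standalone illustration of the newline point (toy data, not the actual test files): hashing the raw bytes gives different digests for LF and CRLF output, but normalizing CRLF to LF before hashing makes the two comparable.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

unix = b"line1\nline2\n"        # Python-style \n newlines
windows = b"line1\r\nline2\r\n"  # C++-side \r\n newlines

# The raw bytes hash differently purely because of the newline convention...
assert sha256_hex(unix) != sha256_hex(windows)

# ...but normalizing CRLF to LF before hashing makes the digests match.
assert sha256_hex(windows.replace(b"\r\n", b"\n")) == sha256_hex(unix)
```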

Master

The files are different.
294f77dd7b09dbf17be8a5bdca541c897d9440d86151448eabb76114d37da7d8
1d4dca77f62435e95b458dd5f12820b52610c6776f35a632d2128f9818b53489
--------------------

OpenAI whisper b'[503, 329, 10530, 279, 368, 1060, 13708, 1120, 871, 1801, 654, 11, 718, 1801, 654, 11, 287, 6380, 9667, 654, 11, 6775, 1601, 1641, 11, 309, 11447, 654, 11, 785, 6040, 1641, 308, 785, 28257, 842, 358, 654, 1256, 6470, 1076, 277, 49392, 1690, 337, 716, 1776, 24811, 308, 24323, 8824, 1690, 277, 24607, 889]'
whisper.cpp b'[503, 329, 10530, 279, 368, 7486, 5790, 1120, 871, 812, 12679, 11, 718, 812, 12679, 11, 7997, 84, 3201, 12679, 11, 6775, 1601, 1641, 11, 309, 842, 14218, 11, 17488, 5728, 64, 308, 785, 28257, 842, 18170, 64, 1256, 6470, 1046, 78, 277, 49392, 1690, 337, 716, 1776, 24811, 308, 24323, 8824, 1690, 277, 24607, 889]'
...
...
...
...
...
OpenAI whisper b'[43454, 42041, 255, 5142, 18066, 14121, 239, 10825, 7588, 4108, 2129, 236, 7732, 235, 44561, 4895, 23072, 31001, 6102, 33950, 25785, 4767, 1543, 16324, 15686, 2849, 16206, 1764, 3193, 13806, 2972, 4767, 5142]'
whisper.cpp b'[43454, 42041, 255, 5142, 18066, 14121, 239, 10825, 7588, 4108, 2129, 236, 7732, 235, 44561, 4895, 23072, 31001, 6102, 33950, 25785, 4767, 1543, 16324, 15686, 2849, 16206, 1764, 3193, 13806, 23072, 2659, 5142]'
--------------------
OpenAI whisper b'[503, 138, 251, 5074, 3182, 33179, 5554, 28684, 26117, 15878, 15974, 40822, 4826, 11656, 10541, 1781, 29756, 3182, 44716, 26865, 15974, 27841, 25090, 47861, 26751, 1781, 5733, 23814, 26117, 17579, 3721, 32043, 15878, 4903, 9903, 19389, 6452, 18679, 7068, 10013, 12622, 5554, 17688, 40822, 4826, 11656, 25090, 47861, 18231, 5337, 4826, 32927, 20702, 5532, 4339, 3721, 13580, 18679, 7068, 10013, 5532, 13580, 6744, 4043, 7068, 8839, 21807, 5733, 16900, 4654, 5074, 8385, 26855, 9083, 17579, 19851, 3849, 16008, 20867, 17321, 17309, 26017, 3786, 1800, 7597, 7068, 23449, 3596, 44352, 3182, 11383, 14220, 9137, 14832, 9213, 20511, 48298, 10605, 5733, 25962, 2411, 18793, 15009, 4826, 15974, 5532, 17928, 1]'
whisper.cpp b'[503, 138, 251, 5074, 3182, 33179, 40810, 5733, 26117, 15878, 15974, 40822, 4826, 11656, 10541, 1781, 29756, 3182, 44716, 26865, 15974, 27841, 25090, 47861, 26751, 1781, 5733, 23814, 26117, 17579, 3721, 32043, 15878, 4903, 9903, 19389, 6452, 18679, 7068, 10013, 12622, 5554, 17688, 40822, 4826, 11656, 25090, 47861, 18231, 5337, 4826, 32927, 20702, 5532, 4339, 3721, 13580, 18679, 7068, 10013, 5532, 13580, 6744, 4043, 7068, 8839, 21807, 5733, 16900, 23380, 1835, 8385, 26855, 9083, 17579, 19851, 3849, 16008, 20867, 17321, 17309, 26017, 3786, 1800, 7597, 7068, 23449, 3596, 44352, 3182, 11383, 14220, 9137, 14832, 9213, 20511, 48298, 10605, 5733, 25962, 2411, 18793, 15009, 4826, 15974, 5532, 17928, 1]'
--------------------
OpenAI whisper b'[503, 10637, 385, 7438, 79, 7138, 8526, 419, 4899, 19457, 1103, 1032, 70, 5409, 13, 3511, 4580, 408, 7011, 64, 1690, 806, 46916, 631, 7066, 11, 572, 450, 23334, 7304, 13, 4493, 9262, 64, 329, 416, 2439, 35240, 631, 37108, 282, 1690, 635, 3980, 2560, 1515, 21838, 889]'
whisper.cpp b'[503, 10637, 385, 7438, 22630, 812, 8526, 419, 43823, 752, 1103, 41964, 2595, 81, 13, 3511, 4580, 408, 7011, 64, 1690, 806, 46916, 631, 7066, 11, 572, 450, 23334, 7304, 13, 4493, 9262, 25548, 82, 416, 2439, 35240, 631, 22111, 77, 1690, 635, 46784, 812, 77, 1515, 21838, 889]'
--------------------
OpenAI whisper b'[31970, 33926, 45938, 33926, 3941, 113, 33926, 31970, 3941, 98, 17937, 31970, 17937, 8485, 113, 33279, 3941, 113, 25411, 3941, 96, 8485, 250, 33279, 45938, 48521, 21981, 31945, 49316, 220, 27099, 8485, 98, 17937, 35082, 31881, 48268, 8485, 108, 17937, 3941, 115, 220, 27099, 8485, 253, 220, 27099, 8485, 108, 3941, 113, 17937, 3941, 99, 31881, 816, 929, 8485, 242, 25411, 6220, 388, 3981, 530, 8485, 101, 21981, 8485, 237, 41858, 8485, 103, 25411, 33279, 3941, 113, 17937, 25411, 31970, 21981, 31970, 48521, 49316, 21981, 31970, 48521, 5080, 49316, 3941, 99, 45938, 220, 27099, 8485, 107, 33926, 31945, 31970, 31881, 37139, 36158, 220, 27099, 8485, 107, 17937, 31970, 31881, 8485, 98, 31881, 2411]'
whisper.cpp b'[31970, 33926, 45938, 33926, 3941, 113, 33926, 31970, 3941, 98, 17937, 31970, 17937, 8485, 113, 33279, 3941, 113, 25411, 3941, 96, 8485, 250, 33279, 45938, 48521, 21981, 31945, 49316, 220, 27099, 8485, 98, 17937, 35082, 31881, 48268, 8485, 108, 17937, 3941, 115, 220, 27099, 8485, 253, 220, 27099, 8485, 108, 3941, 113, 17937, 3941, 99, 31881, 816, 929, 8485, 242, 25411, 6220, 34050, 4889, 8485, 101, 21981, 8485, 237, 41858, 8485, 103, 25411, 33279, 3941, 113, 17937, 25411, 31970, 21981, 31970, 48521, 49316, 21981, 31970, 48521, 5080, 49316, 3941, 99, 45938, 220, 27099, 8485, 107, 33926, 31945, 31970, 31881, 37139, 36158, 220, 27099, 8485, 107, 17937, 31970, 31881, 8485, 98, 31881, 2411]'
--------------------
error rate: 78.81774806434102%

@ggerganov (Owner) left a comment:

Well done! Given that this produces correct results for train.csv I feel quite confident that the implementation is correct.

Do you have an explanation for the failing test in llama.cpp after applying the changes from here? Let's try to figure it out before merging.

whisper.cpp
Comment on lines 2978 to 2979
// This is not perfect
// Occasionally, it produces different results compared to OpenAI's tiktoken
ggerganov (Owner):
Can you give an example where the results don't match?

bobqianic (Collaborator, Author):
When I directly used the bpe_gpt2_preprocess implementation from llama.cpp, there was a 0.06% error, which disappeared after my modifications. I should delete this comment.

@bobqianic bobqianic marked this pull request as draft February 12, 2024 17:44
@bobqianic (Collaborator, Author):
I've just gone over the regex in detail again, and I think I can write a function that replicates the exact behavior of Python's regex.

@bobqianic bobqianic marked this pull request as ready for review February 14, 2024 00:33
@bobqianic (Collaborator, Author):
Based on the regex standard, I have rewritten almost everything, and bpe_gpt2_preprocess now behaves exactly like the Python regex. After testing several GiB of synthetic data, as well as some real-world data, I can confirm that the algorithm no longer has issues. @ggerganov
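The pre-tokenization pattern being replicated is the GPT-2-style split quoted later in this thread. Its trickiest piece is the whitespace lookahead, which can be demonstrated with the stdlib re module (using [^\W\d_] and \d as rough stand-ins for the \p{L} and \p{N} classes of the third-party regex module):

```python
import re

# Sketch of the GPT-2-style pre-tokenization split, approximated with
# stdlib re. [^\W\d_] stands in for \p{L}, \d for \p{N}.
pattern = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"
)

# \s+(?!\S) stops one character short of the next non-space token, so
# that token keeps its leading space; trailing whitespace is consumed
# whole.
print(pattern.findall("Hello   world!!"))  # ['Hello', '  ', ' world', '!!']
```

Getting this backtracking behavior exactly right in hand-written C++ is what the rewrite is about.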

@bobqianic (Collaborator, Author):
The following functions generate the codepoint ranges used in unicode.h.


General

import unicodedata


def find_category_ranges():
    category_ranges = {}
    for codepoint in range(0x110000):  # Unicode range from 0 to 0x10FFFF
        try:
            char = chr(codepoint)
            category = unicodedata.category(char)
            if category not in category_ranges:
                category_ranges[category] = [(codepoint, codepoint)]
            else:
                last_range_start, last_range_end = category_ranges[category][-1]
                if codepoint == last_range_end + 1:
                    # Extend the current range
                    category_ranges[category][-1] = (last_range_start, codepoint)
                else:
                    # Start a new range for this category
                    category_ranges[category].append((codepoint, codepoint))
        except ValueError:
            # Skip invalid code points
            continue

    # Simplify and print ranges
    for category, ranges in category_ranges.items():
        print(f"{category}:")
        for start, end in ranges:
            print(f"    {hex(start)}-{hex(end)}")
        print()

    with open("unicode_range.txt", "w") as file:
        general_category_ranges = dict()

        def add_range(category, ranges):
            if category not in general_category_ranges:
                general_category_ranges[category] = set()
            for start_end in ranges:
                general_category_ranges[category].add(start_end)

        for category, ranges in category_ranges.items():
            if category.startswith("L"):
                add_range("letter", ranges)
            elif category.startswith("M"):
                add_range("mark", ranges)
            elif category.startswith("N"):
                add_range("number", ranges)
            elif category.startswith("P"):
                add_range("punctuation", ranges)
            elif category.startswith("S"):
                add_range("symbol", ranges)
            elif category.startswith("Z"):
                add_range("separator", ranges)
            elif category.startswith("C"):
                add_range("other", ranges)

        # Format a list of (start, end) tuples as [{0x..,0x..}, ...]
        def tuple_to_braces(values):
            parts = ("{" + hex(start) + "," + hex(end) + "}" for start, end in values)
            return "[" + ",".join(parts) + "]"

        def dict_to_custom_string(d):
            output_lines = ""
            for key, values in d.items():
                formatted_values = tuple_to_braces(values)
                line = f'"{key}": {formatted_values},\n'
                output_lines += line
            # Join all lines to form the final string
            return "{" + output_lines[:-2] + "}"

        for x in general_category_ranges.keys():
            general_category_ranges[x] = sorted(list(general_category_ranges[x]))
            buffer = []
            for range_start, range_end in general_category_ranges[x]:
                if buffer == []:
                    buffer.append((range_start, range_end))
                else:
                    last_range_start, last_range_end = buffer[-1]
                    if range_start == last_range_end + 1:
                        # Extend the current range
                        buffer[-1] = (last_range_start, range_end)
                    else:
                        # Start a new range for this category
                        buffer.append((range_start, range_end))
            general_category_ranges[x] = buffer
        file.write(dict_to_custom_string(general_category_ranges))


if __name__ == '__main__':
    find_category_ranges()

Whitespace

import regex

# A codepoint counts as whitespace for the pattern when "  " + char
# collapses into a single match; otherwise findall returns two or more
# tokens. All codepoints in range(0x110000) are valid for chr().
regex_c = regex.compile(r''''s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+''')
for codepoint in range(0x110000):
    s = "  " + chr(codepoint)
    if len(regex_c.findall(s)) == 1:
        print(hex(codepoint))

@ggerganov (Owner) left a comment:

I think we should apply this new tokenization implementation in llama.cpp first and make sure that the tokenizer tests are all passing


static int codepoint_type(uint32_t cp) {
    static bool codepoint_type_initialized = codepoint_type_init();
    return codepoint_type_binary_search(cp);
}
ggerganov (Owner):
The old version of doing an unordered_map lookup seems better. No need to implement binary search when the data structure is already available.

bobqianic (Collaborator, Author):

Absolutely. My recent tests confirm that unordered_map is significantly faster, about 10x, but it also requires much more RAM, especially for full UTF-8 support. Consequently, I've opted for a hybrid approach: for high-frequency UTF-8 ranges I use an unordered_map, while for the rest I fall back to binary search. This strikes a nice balance between speed and memory use.
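The hybrid scheme described above can be sketched as follows (toy ranges and cutoff for illustration, not the actual unicode.h tables): a hash map serves the hot low-codepoint range, and binary search over sorted (start, end, type) ranges handles everything else.

```python
import bisect

# Toy category ranges, sorted by start (stand-ins for unicode.h data).
RANGES = [(0x30, 0x39, "number"), (0x41, 0x5A, "letter"),
          (0x61, 0x7A, "letter"), (0x4E00, 0x9FFF, "letter")]
_STARTS = [r[0] for r in RANGES]
FAST_LIMIT = 0x80  # hypothetical cutoff: cache ASCII in a dict

def _search(cp):
    # Binary search: find the last range starting at or before cp.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "other"

# Precomputed fast path for the high-frequency range.
_FAST = {cp: _search(cp) for cp in range(FAST_LIMIT)}

def codepoint_type(cp):
    return _FAST[cp] if cp < FAST_LIMIT else _search(cp)
```

The dict costs memory proportional to the cutoff, while the binary search stays O(log n) over the range table for the long tail of codepoints.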

@sindresorhus (Contributor):

> I think we should apply this new tokenization implementation in llama.cpp first and make sure that the tokenizer tests are all passing

Relevant pull request: ggerganov/llama.cpp#5613

@jwijffels mentioned this pull request Apr 5, 2024