llama : refactor unicode stuff #5992

ggerganov · 2024-03-11T09:25:35Z

ref #5981

- add unicode.cpp and move the codepoint tables there
- rename unicode-related functions to use the prefix unicode_
- make a thin API in unicode.h
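As an illustration of what a thin, prefixed API of this kind can look like, here is a minimal sketch. The function names and signatures below are assumptions made for the example and are not copied from the PR's unicode.h; surrogate code points are not rejected, to keep the sketch short.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (illustrative name): encode one code point as UTF-8.
// Code points above U+10FFFF are rejected; surrogates are NOT checked here.
std::string unicode_cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        out.push_back(static_cast<char>(cpt));
    } else if (cpt <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (cpt >> 6)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0xFFFF) {
        out.push_back(static_cast<char>(0xE0 | (cpt >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cpt >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else {
        throw std::invalid_argument("invalid code point");
    }
    return out;
}

// Hypothetical helper (illustrative name): encode a sequence of code points.
// The vector is taken by const reference so the caller's data is not copied.
std::string unicode_cpts_to_utf8(const std::vector<uint32_t> & cpts) {
    std::string out;
    for (uint32_t cpt : cpts) {
        out += unicode_cpt_to_utf8(cpt);
    }
    return out;
}
```

Keeping only declarations like these in the header and the large tables in the .cpp file is what lets other translation units avoid recompiling the tables when unrelated code changes.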

Build time after touching the llama.cpp source file is now improved by ~25%

ggml-ci

ggerganov · 2024-03-11T11:57:35Z

Whenever I save the unicode.h (and now unicode.cpp) file with my text editor (Neovim), the Windows build stops working:

https://github.com/ggerganov/llama.cpp/actions/runs/8230676605/job/22504415172?pr=5992#step:3:91

Does anyone have an idea how to make vim save it in the proper format?
It should prefix the file with some special bytes. I looked into this some time ago and forgot the specifics. Any ideas on how to resolve this would be appreciated, because otherwise I cannot work on this file without breaking the Windows builds.

slaren · 2024-03-11T12:01:26Z

I don't know about Vim, but the prefix that you mention is called a BOM (byte order mark). Vim should have an option to save with a BOM.

ggerganov · 2024-03-11T12:05:09Z

Thanks - the command is :set bomb

ggml-ci
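For reference, the UTF-8 BOM is the three-byte sequence EF BB BF at the start of the file. A likely explanation for the Windows failure is that MSVC, without a BOM or the /utf-8 flag, decodes source files using the active code page rather than UTF-8. The following is a minimal sketch of how one could check a file for the BOM before pushing; the helper names are hypothetical and not part of this PR.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM (EF BB BF).
bool has_utf8_bom(const std::string & data) {
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}

// Hypothetical helper: read the first three bytes of a file in binary mode
// and check them for the BOM. Returns false if the file cannot be read.
bool file_has_utf8_bom(const std::string & path) {
    std::ifstream in(path, std::ios::binary);
    char buf[3] = {0, 0, 0};
    in.read(buf, 3);
    return in.gcount() == 3 && has_utf8_bom(std::string(buf, 3));
}
```

Calling file_has_utf8_bom("unicode.cpp") after saving with :set bomb should return true, assuming the file is saved as UTF-8.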

unicode.cpp

* llama : refactor unicode stuff

ggml-ci

* unicode : names
* make : fix c++ compiler
* unicode : names
* unicode : straighten tables
* zig : fix build
* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build
* unicode : add BOM
* unicode : add <cstdint>

ggml-ci

* unicode : pass as cpts as const ref

llama : refactor unicode stuff

0458996

ggml-ci

ggerganov marked this pull request as draft March 11, 2024 09:25

ggerganov added 7 commits March 11, 2024 11:41

unicode : names

9654d62

make : fix c++ compiler

9f3f7d8

unicode : names

de0929a

unicode : straighten tables

e607540

zig : fix build

be12d8b

unicode : put nfd normalization behind API

6568c62

ggml-ci

swift : fix build

4600538

unicode : add BOM

58d5491

ggerganov marked this pull request as ready for review March 11, 2024 13:46

unicode : add <cstdint>

af0621e

ggml-ci

slaren reviewed Mar 11, 2024

unicode : pass as cpts as const ref

3680bc2

ggerganov force-pushed the gg/refactor-unicode branch from 6ab22d2 to 3680bc2 Compare March 11, 2024 14:37

ggerganov merged commit 83796e6 into master Mar 11, 2024
45 of 63 checks passed

ggerganov deleted the gg/refactor-unicode branch March 11, 2024 15:47
