llama : refactor unicode stuff #5992

Merged
ggerganov merged 11 commits into master from gg/refactor-unicode on Mar 11, 2024

ggerganov (Owner) commented Mar 11, 2024

ref #5981

  • add unicode.cpp and move codepoint tables there
  • rename unicode-related functions to have prefix unicode_
  • make a thin API in unicode.h

Rebuild time after touching the llama.cpp source file is now ~25% shorter, since the codepoint tables are no longer recompiled along with it.
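To illustrate the split described above, here is a minimal sketch of the pattern: a small declaration that would sit in a header, and a lookup table plus its implementation that would sit in a separate source file. The function name `unicode_example_is_digit` and the three ranges in the table are hypothetical and chosen only for illustration; they are not taken from the actual unicode.h or unicode.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// ---- would live in the header: one thin declaration, no tables ----
// Hypothetical name, following the unicode_ prefix convention.
bool unicode_example_is_digit(uint32_t cpt);

// ---- would live in the .cpp file: tables and implementation ----
// Illustrative subset only. Inclusive [first, last] ranges, sorted by first.
static const std::vector<std::pair<uint32_t, uint32_t>> unicode_example_digit_ranges = {
    {0x0030, 0x0039}, // ASCII digits
    {0x0660, 0x0669}, // Arabic-Indic digits
    {0xFF10, 0xFF19}, // fullwidth digits
};

bool unicode_example_is_digit(uint32_t cpt) {
    const auto & ranges = unicode_example_digit_ranges;
    // Find the first range that starts after cpt, then step back one range.
    auto it = std::upper_bound(ranges.begin(), ranges.end(), cpt,
        [](uint32_t value, const std::pair<uint32_t, uint32_t> & range) {
            return value < range.first;
        });
    if (it == ranges.begin()) {
        return false; // cpt is below the first range
    }
    --it;
    assert(it->first <= cpt); // guaranteed by upper_bound on a sorted table
    return cpt <= it->second;
}
```

With this layout, a file that only includes the header sees a single declaration, so editing that file does not force the compiler to re-parse the tables.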

@ggerganov ggerganov marked this pull request as draft March 11, 2024 09:25
ggerganov (Owner, Author) commented

Whenever I save the unicode.h (and now unicode.cpp) file with my text editor (Neovim), the Windows build breaks:

https://github.com/ggerganov/llama.cpp/actions/runs/8230676605/job/22504415172?pr=5992#step:3:91

Does anyone have an idea how to make vim save it in the proper format?
It should prefix the file with some special bytes. I looked into this some time ago and forgot the specifics. Any ideas on how to resolve this would be appreciated, because otherwise I cannot work on this file without breaking the Windows builds.

slaren (Collaborator) commented Mar 11, 2024

I don't know about Vim, but the prefix that you mention is called a BOM (byte order mark). Vim should have an option to save the file with a BOM.

ggerganov (Owner, Author) commented

Thanks - the command is :set bomb
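For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of the file. Below is a minimal sketch of how one could check a file's contents for it, or prepend it when it is missing. The helper names `has_utf8_bom` and `with_utf8_bom` are made up for this example and are not part of llama.cpp.

```cpp
#include <cassert>
#include <string>

// The UTF-8 byte order mark: bytes EF BB BF.
static const std::string UTF8_BOM = "\xEF\xBB\xBF";

// Returns true if `data` begins with the UTF-8 BOM.
// (Hypothetical helper, for illustration only.)
bool has_utf8_bom(const std::string & data) {
    return data.compare(0, UTF8_BOM.size(), UTF8_BOM) == 0;
}

// Returns `data` unchanged if it already starts with a BOM,
// otherwise returns a copy with the BOM prepended.
std::string with_utf8_bom(const std::string & data) {
    std::string out = has_utf8_bom(data) ? data : UTF8_BOM + data;
    assert(has_utf8_bom(out)); // postcondition: result always starts with a BOM
    return out;
}
```

A check like this could run in CI over the affected source files, so a save from an editor that drops the BOM is caught before the Windows job fails. Whether that is worth adding here is a judgment call.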

@ggerganov ggerganov marked this pull request as ready for review March 11, 2024 13:46
@ggerganov ggerganov merged commit 83796e6 into master Mar 11, 2024
45 of 63 checks passed
@ggerganov ggerganov deleted the gg/refactor-unicode branch March 11, 2024 15:47
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass as cpts as const ref
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024