Upgrade to 2.3.0 #225

Draft: yuanwu2017 wants to merge 293 commits into habana-main

Conversation

yuanwu2017 (Author) commented:

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

McPatate and others added 30 commits September 24, 2024 03:42
* feat: support response_format in chat

* fix: adjust typos

* fix: add trufflehog lint
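
A quick way to exercise the response_format support added in the commit above is through the OpenAI-compatible chat endpoint. The payload shape below (a JSON-schema constrained response) and the local URL are illustrative assumptions, not necessarily the exact contract of this version:

    import requests

    # Illustrative sketch: ask the chat endpoint for schema-constrained JSON output.
    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": "Name one animal and its color."}],
        "response_format": {
            "type": "json_object",
            "value": {  # assumed schema shape
                "type": "object",
                "properties": {"animal": {"type": "string"}, "color": {"type": "string"}},
                "required": ["animal", "color"],
            },
        },
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
    print(resp.json()["choices"][0]["message"]["content"])
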
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
* Use minijinja's pycompat mode for python methods

* fix: cargo fmt lint for pre commit

---------

Co-authored-by: Armin Ronacher <[email protected]>
* feat: add kserve feature and basic routes

* feat: implement infer endpoint wrapper around generate

* fix: refactor and improve types

* fix: improve infer and simplify

* fix: cleanup and improve api docs

* fix: refactor and encapsulate kserve feat in file

* fix: remove typos after rebase
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
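
As a rough illustration of the configurations listed above, a compatibility check along these lines (a hypothetical helper, not the actual repacking code) captures when the GPTQ Marlin path applies:

    # Hypothetical helper mirroring the supported GPTQ Marlin configurations above.
    def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
        supported_bits = {4, 8}
        supported_groupsizes = {-1, 32, 64, 128}
        # desc_act is supported either way; it only changes activation ordering.
        return bits in supported_bits and groupsize in supported_groupsizes

    # Example: a typical 4-bit, groupsize-128 GPTQ checkpoint qualifies.
    assert can_use_gptq_marlin(bits=4, groupsize=128, desc_act=True)
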
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <[email protected]>
* doc: adding architecture document

* doc: add architecture to toctree

* fix: avoid cargo lock changes

* fix: avoid cargo lock tweak

---------

Co-authored-by: drbh <[email protected]>
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. huggingface#2056). Images were processed separately and
then concatenated, which can fail when the images have different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
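
The fix described above amounts to handing the whole image list to the processor in a single call rather than preprocessing images one by one and concatenating. A minimal sketch (the checkpoint name and file paths are illustrative assumptions):

    from PIL import Image
    from transformers import AutoImageProcessor

    # Illustrative: one call over the whole batch, so the image processor can
    # return tensors with compatible shapes.
    image_processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # assumed checkpoint
    images = [Image.open(p).convert("RGB") for p in ["small.jpg", "large.png"]]  # different sizes

    batch = image_processor(images, return_tensors="pt")
    print(batch["pixel_values"].shape)  # one stacked tensor for the whole batch
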
* Contributing guide & Code of Conduct

* Redirect to GitHub's tutorial on PRs
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
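
For reference, the 2 GiB gRPC receive limit mentioned a few lines above is typically configured through channel/server options like the sketch below; the values and placement here are illustrative, not the exact lines in this PR:

    import grpc

    # Illustrative: raise the receive limit from the 4 MiB default to ~2 GiB so
    # large multi-modal payloads fit in a single message.
    MAX_MESSAGE = 2 * 1024 * 1024 * 1024 - 1  # 2 GiB - 1, the int32 ceiling for gRPC

    server = grpc.aio.server(
        options=[
            ("grpc.max_receive_message_length", MAX_MESSAGE),
            ("grpc.max_send_message_length", MAX_MESSAGE),
        ]
    )
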
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
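
Conceptually, sharding a packed QKV tensor means splitting the packed dimension into its q/k/v blocks, taking the current rank's slice of each block, and re-packing. A simplified sketch with equal-sized blocks (shapes and names are illustrative, not the actual Weights.get_packed_sharded implementation):

    import torch

    def shard_packed_qkv(packed: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        # Illustrative: packed has shape [3 * hidden]; split into q/k/v,
        # slice each block for this rank, then re-pack.
        hidden = packed.shape[0] // 3
        assert hidden % world_size == 0
        shard = hidden // world_size
        q, k, v = packed[:hidden], packed[hidden:2 * hidden], packed[2 * hidden:]
        return torch.cat([t[rank * shard:(rank + 1) * shard] for t in (q, k, v)], dim=0)

    # Example: rank 1 of 2 gets the second half of each of q, k and v.
    bias = torch.arange(12.0)  # pretend hidden = 4, so the packed size is 12
    print(shard_packed_qkv(bias, rank=1, world_size=2))  # tensor([2., 3., 6., 7., 10., 11.])
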
The subcommand did not work due to some broken imports.
* New runner. Manual squash.

* Network host.

* Put back trufflehog with proper extension.

* No network host ?

* Moving buildx install after tailscale ?

* 1.79
* Fix cargo-chef prepare

In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef will first generate a new one, which
might vary a lot and invalidate Docker build caches.

* Fix Dockerfile_amd and Dockerfile_intel
* Support HF_TOKEN environment variable

* Load test.

---------

Co-authored-by: Nicolas Patry <[email protected]>
* Adding Service Name Environment variable for huggingface#2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
* corrected Pydantic warning.

* Update clients/python/text_generation/types.py

Co-authored-by: Daniël de Kok <[email protected]>

---------

Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: Daniël de Kok <[email protected]>
* use xpu-smi to dump used memory
XPU uses "ZE_AFFINITY_MASK" to select the card; usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <[email protected]>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <[email protected]>

---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Daniël de Kok <[email protected]>
* add CPU tgi support

Signed-off-by: Wang, Yi A <[email protected]>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <[email protected]>

---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Funtowicz Morgan <[email protected]>
* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most of the code is exactly the same
except for a very few spots, the biggest being the kv-cache layout and the
flash_xxx.py files.
Since those files should be removed and factored away soon, we should not
need the separate paths.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
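
The marker/flag pairing described above is usually wired up in conftest.py roughly as follows; this is a sketch of the standard pytest pattern, not necessarily the exact code in integration-tests:

    import pytest

    def pytest_addoption(parser):
        # Adds `--release` so release-marked tests can be opted into explicitly.
        parser.addoption("--release", action="store_true", default=False,
                         help="run tests marked as release")

    def pytest_configure(config):
        config.addinivalue_line("markers", "release: long-running release tests")

    def pytest_collection_modifyitems(config, items):
        if config.getoption("--release"):
            return  # --release given: run everything, including release tests
        skip_release = pytest.mark.skip(reason="needs --release to run")
        for item in items:
            if "release" in item.keywords:
                item.add_marker(skip_release)
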
Narsil and others added 20 commits September 25, 2024 06:15
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.

Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
fix: pass missing revision arg for lora adapter when loading multiple adapters (huggingface#2510)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <[email protected]>
* use ratatui not archived tui

* bump ratatui all the way with options
Disable by default because CI runners do not have enough GPUs.
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
Runs the tests in a Nix build sandbox.
* Move to moe-kernels package and switch to common MoE layer (huggingface#2511)

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
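
To make the "common MoE layer" idea concrete: sparse MoE routing picks the top-k experts per token and mixes their outputs by the router weights. The sketch below is a plain PyTorch illustration of that idea, not the moe-kernels SparseMoELayer API:

    import torch, torch.nn as nn, torch.nn.functional as F

    class TinySparseMoE(nn.Module):
        # Conceptual sketch of top-k expert routing (not the moe-kernels API).
        def __init__(self, dim=64, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts, bias=False)
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
            self.top_k = top_k

        def forward(self, x):  # x: [tokens, dim]
            weights, idx = torch.topk(self.router(x), self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    print(TinySparseMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
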
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
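
The "only send the usage when asked for" behavior corresponds to the OpenAI-style stream_options field on the chat endpoint. An illustrative streaming request; the field names follow the OpenAI spec and are assumed (not verified) to match this server's implementation:

    import json, requests

    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": "Say hi"}],
        "stream": True,
        # Without this, streaming chunks should not carry a usage block.
        "stream_options": {"include_usage": True},
    }
    with requests.post("http://localhost:8080/v1/chat/completions",
                       json=payload, stream=True, timeout=60) as resp:
        for line in resp.iter_lines():
            if line and line.startswith(b"data:") and b"[DONE]" not in line:
                chunk = json.loads(line[len(b"data:"):])
                if chunk.get("usage"):
                    print("usage:", chunk["usage"])
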
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
* fix: wrap python basic logs in debug assertion in launcher (huggingface#2539)

* use level filters instead
* Preparing for release.

* Upgrade version in docs.
yuanwu2017 marked this pull request as draft on September 26, 2024
mandy-li (Collaborator) commented on Oct 1, 2024:

@yuanwu2017, for testing, please add the torch.compile model to your test cases.

# CHUNK_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
# LAZY_MODE = int(os.environ.get('PT_HPU_LAZY_MODE', 1))
MAX_TOTAL_TOKENS = int(os.environ.get('MAX_TOTAL_TOKENS', 8192))
MAX_BATCH_TOTAL_TOKENS = int(os.environ.get('MAX_BATCH_TOTAL_TOKENS', 65536))
tthakkal (Collaborator) commented on Oct 3, 2024:

@yuanwu2017 Looks like you are not reading the --max-batch-total-tokens param coming from the command line and are instead always setting it to MAX_BATCH_TOTAL_TOKENS=65536; this causes out of memory at "Decode warmup with bigger batch_size".
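
Something along these lines would honor the CLI value before falling back to the environment default; this is a hypothetical sketch, with names and plumbing that are illustrative rather than the fork's actual code:

    import os

    def resolve_max_batch_total_tokens(cli_value=None):
        # Hypothetical helper: prefer the value passed via --max-batch-total-tokens,
        # fall back to the MAX_BATCH_TOTAL_TOKENS env var, then to 65536.
        if cli_value is not None:
            return int(cli_value)
        return int(os.environ.get("MAX_BATCH_TOTAL_TOKENS", 65536))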

yuanwu2017 (Author) replied:

It is only a draft. I am running the performance benchmark and fixing some issues I found.

batch = self.generate_warmup_batch(request, PREFILL_WARMUP_SEQLEN_LIST[0], DECODE_WARMUP_BATCH_SIZE_LIST[-1], is_warmup)
_, prefill_batch, _ = self.generate_token([batch], is_warmup)
batches.append(prefill_batch)
while batch_size <= max_decode_batch_size:
A collaborator commented:

Warming up multiple decode batches is causing out of memory. For example, the command below crashes in this loop with out of memory; it used to work in the last release.

docker run -it --rm -p 8056:80 \
    --runtime=habana \
    -v /data2/models:/data \
    -v /data2/models:/data/hub \
    -v /data2/models:/root/.cache/huggingface/hub \
    -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
    -e HABANA_VISIBLE_DEVICES=1 \
    -e MAX_TOTAL_TOKENS=2048 \
    -e PREFILL_BATCH_BUCKET_SIZE=2 \
    -e BATCH_BUCKET_SIZE=32 \
    -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e ENABLE_HPU_GRAPH=true \
    -e LIMIT_HPU_GRAPH=true \
    -e USE_FLASH_ATTENTION=true \
    -e FLASH_ATTENTION_RECOMPUTE=true \
    --entrypoint bash \
    --cap-add=sys_nice --ipc=host <PR_225_image> \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-batch-prefill-tokens 2048 \
    --max-batch-total-tokens 65536 \
    --max-waiting-tokens 7 \
    --waiting-served-ratio 1.2 \
    --max-concurrent-requests 64

yuanwu2017 (Author) replied:

Ok. I will fix it.

yuanwu2017 (Author), replying to the earlier request to add the torch.compile model to the test cases:

Ok.
