
server: bench: continuous performance testing #6233

Open · 11 of 16 tasks
phymbert opened this issue Mar 22, 2024 · 18 comments

Assignees: phymbert
Labels: enhancement (New feature or request), need feedback (Testing and feedback with results are needed), performance (Speed related topics), server/webui

@phymbert (Collaborator) commented Mar 22, 2024

Motivation

llama.cpp is under active development: new LLM papers are implemented quickly (which is a good thing), and backend/device optimizations are continuously added.

All these factors have an impact on server performance, especially on the following metrics:

  1. request latency: pp (prompt processing) + tg (token generation) per request
  2. server throughput: total pp+tg per second across all requests with continuous batching
  3. concurrency: how many concurrent requests/users the server can handle in parallel
  4. VRAM usage
  5. RAM usage
  6. GPU usage
  7. CPU usage

It is important to monitor and control the impact of codebase evolution on these metrics, for example:

[Graph: prompt_tokens_seconds over time]

Since #5941, we have a server bench framework; we can now trigger it on different events:

  1. scheduled on master branch
  2. on PR pushes

The approach should be reproducible: use the same hardware architecture, the same model sizes and the same quants.

It would be nice to follow performance changes on a time-series graph, as is done in Apache Lucene.

Proposed approach

The bench will run on a T4 GPU node in Azure Cloud:

  • Standard_NC4as_T4_v3
  • Ubuntu 20.04.1
  • 4 vCPU
  • 28 GB RAM
  • 1 NVIDIA Tesla T4
  • 16 GB VRAM
  • /dev/sdb, 256 GB standard SSD, mounted at /
  • /dev/sda, 1 TB premium SSD, mounted at /mnt

It will be registered as a GitHub self-hosted runner with Prometheus installed.

A GitHub workflow will (see the orchestration sketch after this list):

  1. build the server target using the cmake Release build type and LLAMA_CUDA with the native CUDA architecture
  2. then, for each set of bench parameters:
  3. start the server
  4. configure Prometheus scraping on the server instance
  5. wait for the server to be ready
  6. build the relevant dataset for the test
  7. start the performance test scenario using the right dataset
  8. export the results to JSON
  9. download the Prometheus metrics graph
  10. plot the results into time-series images
  11. add a comment in the PR with the metrics result images
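
For illustration, a minimal sketch of how one bench iteration could be orchestrated is shown below. The server binary path, port, server flags and the k6 scenario file name are assumptions, not the actual CI scripts:

```python
#!/usr/bin/env python3
# Minimal orchestration sketch for one bench run (illustrative only).
# Paths, ports and the k6 scenario file name are assumptions.
import json
import subprocess
import time

import requests  # pip install requests

SERVER_BIN = "./build/bin/server"  # assumed cmake Release output path
BASE_URL = "http://localhost:8080"


def wait_for_server(timeout_s: int = 120) -> None:
    """Poll the /health endpoint until the server reports it is ready."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(2)
    raise TimeoutError("server did not become ready in time")


def run_bench(model: str, ctx_size: int, parallel: int) -> dict:
    server = subprocess.Popen([
        SERVER_BIN, "--model", model, "--metrics", "--cont-batching",
        "--ctx-size", str(ctx_size), "--parallel", str(parallel),
        "--n-gpu-layers", "33", "--threads", "1", "--threads-batch", "1",
    ])
    try:
        wait_for_server()
        # Run the k6 scenario and export the summary as JSON (file name assumed).
        subprocess.run(["k6", "run", "script.js",
                        "--summary-export", "k6-results.json"], check=True)
        with open("k6-results.json") as f:
            return json.load(f)
    finally:
        server.terminate()
        server.wait()


if __name__ == "__main__":
    results = run_bench("ggml-model.gguf", ctx_size=16384, parallel=8)
    print(json.dumps(results, indent=2))
```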

Technical considerations

One important aspect of this configuration is to make it easy to add more nodes in the future. If we see that it works and is useful, we can find ways to add more hardware in order to collect metrics for different cases.
All the code used must be stored in the examples/server/bench folder.

GitHub self-hosted runner security

From the GitHub documentation on self-hosted runner security:

Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

By design, we will be using just-in-time runners (a host-loop sketch follows this list):

  1. with ggml-ci in a docker container, loop looking for a new workflow job waiting for the host GPU series type label
  2. create a configuration for a just-in-time runner with this label
  3. start a rootless docker container with the nvidia docker runtime and the JIT configuration token
  4. start the GitHub runner within the container
  5. wait for the container to exit
  6. restart the loop
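
For illustration, the host-side loop could look roughly like the sketch below. It assumes the GitHub REST endpoint for generating a JIT runner configuration and the runner's --jitconfig flag; the label, runner image and token source are placeholders:

```python
#!/usr/bin/env python3
# Rough sketch of the host loop provisioning just-in-time runners (illustrative only).
# The label, runner image and repo are placeholders; the token comes from the environment.
import os
import subprocess
import time

import requests

REPO = "ggerganov/llama.cpp"
LABEL = "Standard_NC4as_T4_v3"  # host GPU series type label (assumed)
API = f"https://api.github.com/repos/{REPO}/actions/runners/generate-jitconfig"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_PAT']}",
    "Accept": "application/vnd.github+json",
}

while True:
    # 1-2. ask GitHub for a just-in-time runner configuration with our label
    resp = requests.post(API, headers=HEADERS, json={
        "name": f"t4-jit-{int(time.time())}",
        "runner_group_id": 1,
        "labels": ["self-hosted", LABEL],
    })
    resp.raise_for_status()
    jit_config = resp.json()["encoded_jit_config"]

    # 3-5. start a rootless container with the NVIDIA runtime, run the runner
    #      with the JIT config, and wait for it to exit (one job per runner).
    subprocess.run([
        "docker", "run", "--rm", "--gpus", "all",
        "ghcr.io/actions/actions-runner:latest",  # image name assumed
        "./run.sh", "--jitconfig", jit_config,
    ], check=True)

    # 6. loop again for the next job
    time.sleep(5)
```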

As GitHub checks can only be triggered by collaborators and the job runs in a non-root docker container, I think we are safe.

Server scenario parameters matrix

| scenario | duration | users | hf-repo | hf-file | model-alias | model-size | model-type | ngl | parallel | ctx-size | batch-size | ubatch-size | n-predict | grp-attn-n | grp-attn-w | embeddings | CUDA_VISIBLE_DEVICES | SERVER_BENCH_N_PROMPTS | SERVER_BENCH_MAX_PROMPT_TOKENS | SERVER_BENCH_MAX_CONTEXT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| completions | 10m | 8 | TODO | TODO | phi2 | 3B | F16 | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| completions | 10m | 8 | ggml-org/models | phi-2/ggml-model-q4_0.gguf | phi2 | 3B | MOSTLY_Q4_K_M | 33 | 8 | 16384 | 2048 | 256 | 2048 | 1 | 512 | false | 0 | 1000 | 1024 | 1024 |
| embeddings | 5m | 8 | ggml-org/models | bert-bge-large/ggml-model-f16.gguf | bert-bge-large | ? | F16 | TODO | 8 | 16384 | 4096 | 4096 | NA | NA | NA | true | 0 | 1000 | 4096 | NA |

In addition, the following parameters will be used:

  • --log-disable: no need for a log file
  • --metrics: to allow Prometheus metrics scraping
  • --cont-batching: probably needs to be enabled by default (server: enable --cont-batching by default #6229)
  • --threads 1: we will test only with all layers offloaded to the GPU
  • --threads-batch 1: we will test only with all layers offloaded to the GPU
  • --model ggml-model.gguf: as we can now download anything from HF
  • --defrag-thold 0.1

Only the OAI Chat completions endpoint with streaming enabled will be tested for completions.
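
For illustration, the sketch below shows what a single streamed request with timing looks like against the server's OAI-compatible /v1/chat/completions endpoint; the real benchmark drives this through k6 virtual users, and the prompt and the per-chunk token count are simplifications:

```python
#!/usr/bin/env python3
# Minimal sketch of one streamed chat-completion request with timing
# (illustrative; the real benchmark uses k6 virtual users, not this script).
import json
import time

import requests

URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "ggml-model.gguf",
    "stream": True,
    "max_tokens": 2048,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain continuous batching in one paragraph."},
    ],
}

start = time.time()
first_token_at = None
n_tokens = 0

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0]["delta"].get("content"):
            n_tokens += 1  # rough count: one token per SSE chunk
            if first_token_at is None:
                first_token_at = time.time()

end = time.time()
print(f"prompt processing (time to first token): {first_token_at - start:.2f}s")
print(f"token generation: {n_tokens / (end - first_token_at):.2f} tokens/s")
```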

Dataset considerations

  1. the dataset must contain system, assistant and user prompts (in order to test chat-template overhead, if any)
  2. prompts must not be selected randomly; running the test twice must output almost the same metrics
  3. it must be possible to select prompts so that they fit in the KV cache (or not), using the parameters listed
    in bench/README.md (a filtering sketch follows this list):
    • SERVER_BENCH_N_PROMPTS: total prompts to select for the benchmark
    • SERVER_BENCH_MAX_PROMPT_TOKENS: maximum prompt tokens used to filter the dataset
    • SERVER_BENCH_MAX_CONTEXT: maximum context size of the completion request used to filter the dataset: prompt + predicted tokens
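
For illustration, deterministic prompt selection driven by these environment variables could look like the sketch below; the ShareGPT file name and JSON layout, the whitespace-based token approximation and the per-request prediction budget are assumptions:

```python
#!/usr/bin/env python3
# Sketch of deterministic prompt selection from ShareGPT (illustrative only).
# The JSON layout and the whitespace token approximation are assumptions;
# the real bench script may use the model tokenizer instead.
import json
import os

N_PROMPTS = int(os.environ.get("SERVER_BENCH_N_PROMPTS", 1000))
MAX_PROMPT_TOKENS = int(os.environ.get("SERVER_BENCH_MAX_PROMPT_TOKENS", 1024))
MAX_CONTEXT = int(os.environ.get("SERVER_BENCH_MAX_CONTEXT", 1024))
N_PREDICT = 512  # illustrative per-request prediction budget


def approx_tokens(text: str) -> int:
    # Very rough approximation: ~1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3)


with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:  # file name assumed
    conversations = json.load(f)

selected = []
for conv in conversations:                     # iterate in file order: no randomness,
    turns = conv.get("conversations", [])      # so two runs select the same prompts
    if not turns or turns[0].get("from") != "human":
        continue
    prompt = turns[0]["value"]
    n_prompt = approx_tokens(prompt)
    if n_prompt > MAX_PROMPT_TOKENS:
        continue                               # prompt too long: filter it out
    if n_prompt + N_PREDICT > MAX_CONTEXT:
        continue                               # request exceeds the context budget
    selected.append(prompt)
    if len(selected) == N_PROMPTS:
        break

print(f"selected {len(selected)} prompts")
```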

Selected dataset:

| scenario | dataset | comment |
|---|---|---|
| completions | ShareGPT_Vicuna_unfiltered | taken from vLLM to have a baseline |
| embeddings | IMDB Data | suggested by @ngxson, looks good for embeddings |

Tasks

phymbert added the enhancement (New feature or request), performance (Speed related topics) and server/webui labels on Mar 22, 2024
phymbert self-assigned this on Mar 22, 2024
@phymbert (Collaborator, Author) commented:

@ggerganov @ngxson @slaren I would appreciate your early feedback on the approach before I implement too much.

@Azeirah (Contributor) commented Mar 22, 2024

This is honestly so cool. I think it'd be a very worthwhile investment to track performance changes for a small set of selected hardware over time. I think we'll see that some small changes affect performance in unexpected ways (both positive and negative).

Only one thing I am wondering right now: do these servers run on some kind of shared hardware? It's incredibly important that everything on the system is in the exact same clean slate whenever a test is run.

For example, if it's on shared hardware, it's possible certain caches are suboptimal, whereas in the opposite case, if the same benchmark is run 5x in a row on the same hardware, will the second run be a lot faster due to all sorts of arcane kernel, filesystem, SSD and driver caches?

I believe I saw a presentation by a C++ benchmarking expert who had developed a script that can reset all this arcane and hidden shared state/caches affecting benchmarking in one go. I'll go look and see if I can find it.
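
For reference, a minimal sketch of the kind of reset such a script might do on Linux is shown below; it only covers the kernel page/dentry/inode caches, assumes root, and does not touch GPU or driver state:

```python
#!/usr/bin/env python3
# Minimal sketch of a pre-benchmark cache reset on Linux (requires root).
# This only drops kernel caches; GPU/driver state is not touched.
import subprocess

# flush pending writes, then ask the kernel to drop its caches
subprocess.run(["sync"], check=True)
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = free pagecache + dentries + inodes
print("kernel caches dropped")
```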

@slaren (Collaborator) commented Mar 22, 2024

Looks good, it would be nice to have other parameters in the matrix such as different values of -ngl, but that's not important right now.

@ggerganov (Owner) commented:

Only one thing I am wondering right now, do these servers run on some kind of shared hardware?

All tests will be running on dedicated Azure nodes (thanks @aigrant) that will do just this benchmark. We are starting with a single T4 node, and if this works out, we will add more.

@ngxson (Collaborator) commented Mar 22, 2024

Cool idea. It will be very useful to keep track of llama.cpp's performance compared to "pure" GPU alternatives like TensorRT or exllama.

A GitHub workflow will:

One thing I think we need to consider though: the proposal here seems to be based on the idea of having a "manager" machine and a "runner" machine; this will not be the case when using a self-hosted runner. You can imagine that GitHub simply sends out SSH commands to the self-hosted runner, so there will be only one machine involved.

Because of that, Prometheus may not really fit the usage (because everything runs on the same machine). Also, I think we can first start with something more basic than Prometheus, for example just a simple Python script that collects metrics every X seconds. My idea here is to avoid having an external dependency from the beginning; we should add it only when we feel it is absolutely needed.
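
For illustration, such a collector could be as simple as the sketch below, polling the server's /metrics endpoint every few seconds and appending selected values to a CSV file; the metric names are assumptions and should be checked against the actual /metrics output:

```python
#!/usr/bin/env python3
# Simple metrics collector sketch: poll the server's Prometheus-format /metrics
# endpoint every INTERVAL seconds and append selected values to a CSV file.
import csv
import time

import requests

METRICS_URL = "http://localhost:8080/metrics"
INTERVAL = 5  # seconds
WANTED = {"llamacpp:prompt_tokens_seconds", "llamacpp:predicted_tokens_seconds"}  # names assumed


def scrape() -> dict:
    values = {}
    for line in requests.get(METRICS_URL, timeout=2).text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.partition(" ")
        if name in WANTED:
            values[name] = float(value)
    return values


with open("metrics.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        row = scrape()
        writer.writerow([time.time()] + [row.get(m, "") for m in sorted(WANTED)])
        f.flush()
        time.sleep(INTERVAL)
```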

Personally, at my company I often have to work with self-hosted GitLab and self-hosted runners, so I think I can help with setup scripts if needed.

Figuring out how to properly set up the self-hosted runner is also a big task, I think; let's focus on that for now.

@Azeirah (Contributor) commented Mar 22, 2024

Only one thing I am wondering right now, do these servers run on some kind of shared hardware?

All tests will be running on dedicated Azure nodes (thanks @aigrant) that will do just this benchmark. We are starting with a single T4 node and if this works out, we will add more

Yes I understand, but is it a bare metal server that is completely isolated? Or is it sharing resources on one huge server?

Either way, it doesn't matter much what exactly it's running on. My point is that any hidden arcane state needs to be reset before running any benchmark script.

@ngxson (Collaborator) commented Mar 22, 2024

Yes I understand, but is it a bare metal server that is completely isolated?

Servers with a T4 GPU are usually "shared CPU but dedicated GPU". I believe that's also the case with other GPUs like the A100 or A10G, but I'm not sure whether it's the same with the H100.

@ngxson (Collaborator) commented Mar 22, 2024

My point is that any hidden arcane state needs to be reset before running any benchmark script.

At my company we have GitLab runners plugged into docker on each machine, so each CI run is isolated (multiple CI runs can even run in parallel). Even when the CI fails for some reason, the resources are automatically cleaned up by docker. I believe the GitHub runner agent has the same capability.

Edit: but yeah, sometimes it's better to simply reset the machine (maybe via a snapshot), especially when benchmarking. We can look into this in the future.

@phymbert (Collaborator, Author) commented:

Servers with a T4 GPU are usually "shared CPU but dedicated GPU". I believe that's also the case with other GPUs like the A100 or A10G, but I'm not sure whether it's the same with the H100.

Yes, AFAIK NVIDIA GPU virtualization does not exist on Azure (yet?); it is only possible to fraction GPUs, but this is not our case. There is a solution from the vendor, and I have also had good feedback on run.ai's fractional GPU sharing for Kubernetes.

@Azeirah In this proposal, all layers will be offloaded to the GPU and only one test will run at a time per runner, so I believe we will not suffer too much from hypervisor throttling.

@phymbert (Collaborator, Author) commented Mar 22, 2024

@ggerganov We need to keep this in mind:

Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

see Self-hosted runner security

So by design we will be using just-in-time runners and ideally the workflow should be started only by Collaborators.

Rest assured, I will test all this on my private fork first.

EDIT: solution proposed in the summary

@phymbert (Collaborator, Author) commented Mar 24, 2024

@ggerganov what about the defragmentation threshold for the baseline? Without it, I see a lot of: update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1024

With --defrag-thold 0.8, it does not look better.

@ggerganov (Owner) commented Mar 25, 2024

The thold should be 5-10% (e.g. --defrag-thold 0.1)

If you are getting that error, it means your --context is too small.
It should be equal to (num slots)*(max prompt + max predict) in order to fit the worst-case scenario
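
For example, with 8 slots and a worst case of 1024 prompt tokens plus 1024 predicted tokens per request, the context should be 8 × (1024 + 1024) = 16384, which is the ctx-size used in the completions scenarios of the matrix above.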

@phymbert (Collaborator, Author) commented:

The first workflow is ready to receive feedback:

Based on this, we can modify the duration, all parameters, the comment template, the frequency, etc.

If you agree with the approach, I can later continue to add models or embeddings.

@phymbert (Collaborator, Author) commented Apr 1, 2024

Hello everyone,

The workflow has been deployed for one week now, and some concerns have been identified:

| metric | #5021 | #6367 | #6387 | #6403 | #6408 | #6412 | #6413 | #6414 |
|---|---|---|---|---|---|---|---|---|
| iterations | 264 | 481 | 498 | 534 | 504 | 518 | 504 | 516 |
| req duration | 18372.67 | 9814.5 | 9403.71 | 8767.58 | 9274.74 | 9059.66 | 9304.35 | 9070.17 |
| total pp | 105.34 | 190.52 | 198.39 | 205.85 | 200.65 | 201.57 | 199.63 | 201.14 |
| total tg | 219.35 | 128.99 | 128.22 | 130.33 | 129.75 | 128.18 | 129.67 | 128.61 |
| /metrics pp | 302.09 | 713.28 | 704.14 | 727.66 | 721.63 | 661.12 | 645.38 | 638.66 |
| /metrics tg | 0.24 | 17.81 | 17.65 | 18.05 | 17.92 | 17.74 | 17.68 | 18.11 |

@Azeirah @ngxson Any idea what could cause the discrepancies? Maybe the virtualization has an impact on performance after all, at least on the k6 client side.

@ggerganov In which direction do you want to go further? Add an A100 test :)? Add embeddings? Other models, MoE-like?

Thanks for your feedback

phymbert added the need feedback (Testing and feedback with results are needed) label on Apr 1, 2024
@ggerganov (Owner) commented:

Regarding the PR comment with benchmark information: I find it a little bit distracting, since it pops up in all PRs, even those unrelated to speed. I think it would be better to implement the long-term plot that you suggested at some point, where we would be able to see the performance metrics as a function of time.

Variations: are we using the same random seed for all runs? AFAICT from bench.py this is not the case and it might improve the reproducibility of the metrics.
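
For illustration, fixing the seed before the prompt selection in bench.py could look roughly like this (the sampling call shown is hypothetical):

```python
import random

SEED = 42  # fixed seed so every run samples the same prompts (value is arbitrary)
random.seed(SEED)

# hypothetical selection step inside bench.py:
# prompts = random.sample(filtered_prompts, k=n_prompts)
```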

We should add F16 and Q8_0 benchmarks for Phi-2

@ngxson (Collaborator) commented Apr 1, 2024

Seems interesting. I'm currently limited to working from a mobile phone, so I can't have a look right now. I'll try when I can.

@phymbert (Collaborator, Author) commented:

@ggerganov the node seems to be down. Maybe we should configure the runner as a service?
Also, note that I did not forget to revert xk6-sse; I will do it in a couple of days.

@ggerganov (Owner) commented:

Hm, not sure why it was down - restarted it again. A service could be useful
