Some benchmarks of the encoding performance of a sentence-transformer model using Python and Elixir. I ran these tests on my MacBook Pro with an M1 Max (CPU only, so an M1 Pro should perform the same).
For the initial results I had forgotten to set the compiler option for Nx.Serving. Setting it improves the performance of the Nx code significantly; the updated results are below, and the old results can still be found in the git history.
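For reference, here is a minimal sketch of where that option goes, assuming the serving is built with Bumblebee (the model name, batch size, and sequence length are placeholders, not necessarily what this repo uses):

# Minimal sketch (Bumblebee assumed; model and shapes are placeholders).
Mix.install([{:bumblebee, "~> 0.5"}, {:exla, "~> 0.7"}])

repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    # Compile ahead of time for fixed shapes with the EXLA compiler;
    # without the compiler option the serving runs on the default,
    # much slower Nx evaluator.
    compile: [batch_size: 32, sequence_length: 64],
    defn_options: [compiler: EXLA]
  )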
To run the Python server, create a virtual environment and install the dependencies:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip3 install sentence-transformers flask
Then start the server with:
$ FLASK_APP=simple flask run
The multi_*.sh scripts also expect the Caddy web server to be installed; on macOS, a simple brew install caddy will do.
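For reference, the reverse proxy those scripts set up boils down to a Caddyfile along these lines (a sketch only; the listen port matches the benchmarks below, while the upstream ports are placeholders for wherever the two backend instances actually listen):

:6000 {
	# Load-balance requests across the two backend instances.
	reverse_proxy 127.0.0.1:8001 127.0.0.1:8002 {
		lb_policy round_robin
	}
}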
I used wrk to run a simple HTTP benchmark.
Let's start with the Python server:
$ FLASK_APP=simple flask run
$ wrk http://127.0.0.1:5000 -t 8 -c 32 -d 60
Running 1m test @ http://127.0.0.1:5001
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   275.36ms   26.36ms 420.76ms   70.38%
    Req/Sec    14.70      6.70    30.00     87.30%
  6951 requests in 1.00m, 1.15MB read
Requests/sec:    115.69
Transfer/sec:     19.66KB
So roughly 115 requests per second with an average latency of ~275 ms. Activity Monitor showed that there was still some headroom, with only ~350% CPU usage. Let's start two Python servers behind a Caddy reverse proxy:
$ ./multi_py.sh
$ wrk http://127.0.0.1:6000 -t 8 -c 32 -d 60
Running 1m test @ http://127.0.0.1:6000
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   173.08ms  111.59ms 628.10ms   55.31%
    Req/Sec    24.14     12.96   100.00     56.53%
  11200 requests in 1.00m, 1.82MB read
Requests/sec:    186.35
Transfer/sec:     30.94KB
Now we get close to 200 requests per second, with an even better average latency of ~170 ms.
Now for the Elixir side. The simplest example can be found in the nx_serving.exs file (a rough sketch of what such a script looks like follows the results below):
$ elixir nx_serving.exs
$ wrk http://127.0.0.1:5001 -t 8 -c 32 -d 60
Running 1m test @ http://127.0.0.1:5001
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.98ms    3.23ms 117.13ms   90.78%
    Req/Sec   143.28     19.25   161.00     64.48%
  68572 requests in 1.00m, 8.44MB read
Requests/sec:   1141.16
Transfer/sec:    143.76KB
So we achieve nearly 10x the throughput of the single Python server, with an impressively low average latency of ~28 ms. Activity Monitor shows only ~450% CPU usage.
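For context, here is a rough, hypothetical sketch of what such a script can look like; the actual nx_serving.exs may be structured differently, and the HTTP server, the hardcoded input text, the JSON response, and the batch/sequence sizes are all assumptions. The main idea is that every HTTP request funnels into a single named Nx.Serving process, which batches concurrent requests before running the compiled model:

# Hypothetical sketch, not the actual nx_serving.exs.
Mix.install([
  {:bumblebee, "~> 0.5"},
  {:exla, "~> 0.7"},
  {:bandit, "~> 1.0"},
  {:jason, "~> 1.4"}
])

defmodule EmbedRouter do
  use Plug.Router

  plug :match
  plug :dispatch

  get "/" do
    # The text to embed is hardcoded because wrk issues plain GET requests.
    # batched_run/2 queues the request into the shared serving; Nx.Serving
    # groups concurrent callers into one batch before running the model.
    %{embedding: embedding} = Nx.Serving.batched_run(EmbeddingServing, "hello world")
    send_resp(conn, 200, Jason.encode!(Nx.to_flat_list(embedding)))
  end
end

repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

# Same serving as in the snippet at the top, including the compiler option.
serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    compile: [batch_size: 32, sequence_length: 64],
    defn_options: [compiler: EXLA]
  )

children = [
  {Nx.Serving, serving: serving, name: EmbeddingServing, batch_timeout: 10},
  {Bandit, plug: EmbedRouter, port: 5001}
]

{:ok, _} = Supervisor.start_link(children, strategy: :one_for_one)
Process.sleep(:infinity)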
Let's try to run multiple BEAM instances instead:
$ ./multi_beam.sh
$ wrk http://127.0.0.1:6000 -t 8 -c 32 -d 60
Running 1m test @ http://127.0.0.1:6000
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   136.22ms    8.20ms 274.61ms   91.79%
    Req/Sec    32.07     10.02    40.00     65.19%
  14086 requests in 1.00m, 2.19MB read
Requests/sec:    234.37
Transfer/sec:     37.33KB
Interestingly, even though the CPU is not fully loaded with a single BEAM instance, starting a second one and load-balancing the requests across them does not improve performance; it actually drops throughput quite significantly, from ~1,140 to ~230 requests per second.