
Add performance reports & comparisons to other major maven hosts #1137

Open
solonovamax opened this issue Feb 25, 2022 · 5 comments
Labels
documentation (Issues related to docs) · performance (Issues related to performance aspects of Reposilite)

Comments

@solonovamax
Contributor

solonovamax commented Feb 25, 2022

Request Details

I found Reposilite and am very interested in this project.
And although I don't doubt the claims about Reposilite's performance, I'd be interested to see how it compares to the other major players.

I believe the following measurements would be beneficial to record:

  • requests/sec served for repeated requests (see the sketch after this list)
    • This measurement would basically be "serve as many requests as possible for this one specific artifact"
    • The purpose of this measurement is to measure the access speed for cached artifacts
    • For this test, the server will have a certain number of requests performed before the test, to allow it to warm up and cache the artifact.
  • requests/sec served for unique requests
    • This measurement would basically be "server as many unique artifacts as possible", aka each requests is looking for a new artifact.
    • The purpose of this measurement is for measuring the performance when the server has not cached the artifact
  • requests/sec for mixed requests
    • This measurement will be a mix of the previous two tests.
    • This test will be performed with 3 different levels:
      • 25% unique requests, 75% repeated request
      • 50% unique requests, 50% repeated requests
      • 75% unique requests, 25% repeated requests
    • The purpose of this test is to simulate real world conditions, rather than artificial benchmarks.
    • For this test, the artifacts which have been selected to be used as repeated requests shall be requested a certain number of times to allow the server to warm up and cache them.
  • Requests to a remote repository which are not cached
    • This measurement will be a request to the server which does not have the artifact, but an upstream server does.
    • For this test, the upstream server should either be running locally (so it has a 0ms response time) or the response time should be subtracted.
    • The purpose of this test is to measure retrieval time when the artifact only exists upstream.
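
A minimal sketch of what the repeated-request measurement could look like, assuming a Reposilite instance reachable at http://localhost:8080 and an artifact hosted at the illustrative path below; the URL, path, thread count, and duration are placeholder values, not part of any existing benchmark tooling:

```kotlin
// Sketch of the "repeated request" throughput measurement: hammer one artifact
// URL from a fixed-size thread pool for a fixed duration and report requests/sec.
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicLong

fun main() {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(
        URI.create("http://localhost:8080/releases/com/example/library/1.0.0/library-1.0.0.jar")
    ).GET().build()

    val threads = 16
    val durationMillis = 30_000L
    val deadline = System.currentTimeMillis() + durationMillis
    val completed = AtomicLong()

    val executor = Executors.newFixedThreadPool(threads)
    repeat(threads) {
        executor.submit {
            // Each worker requests the same artifact in a tight loop until the deadline.
            while (System.currentTimeMillis() < deadline) {
                client.send(request, HttpResponse.BodyHandlers.discarding())
                completed.incrementAndGet()
            }
        }
    }
    executor.shutdown()
    executor.awaitTermination(durationMillis + 5_000, TimeUnit.MILLISECONDS)

    println("requests/sec: ${completed.get() / (durationMillis / 1000)}")
}
```

The unique-request and mixed-request variants would only differ in how the target URL is chosen per iteration (a fresh path each time vs. a weighted mix of repeated and fresh paths).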

The following statistics should also be recorded for each test execution, so that the bottlenecks of each server can be observed (a small sampling sketch follows the list):

  • Memory usage
  • CPU usage
  • Disk access
    • Both the quantity of data accessed, as well as the rate at which it was accessed.
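
A rough sketch of how the CPU and memory statistics above could be sampled from a JVM process using the standard JDK management beans; disk I/O is not exposed there, so in practice it would come from an OS-level tool, and for the server process this would have to run inside (or attached to) that JVM:

```kotlin
import java.lang.management.ManagementFactory

fun main() {
    // Cast to the com.sun.management variant to get process-level CPU load.
    val osBean = ManagementFactory.getOperatingSystemMXBean()
            as com.sun.management.OperatingSystemMXBean
    val memoryBean = ManagementFactory.getMemoryMXBean()

    repeat(10) {
        val cpu = osBean.processCpuLoad * 100                        // CPU usage of this JVM, in %
        val heapUsed = memoryBean.heapMemoryUsage.used / (1024 * 1024) // used heap, in MB
        println("cpu=%.1f%% heapUsed=%d MB".format(cpu, heapUsed))
        Thread.sleep(1_000)
    }
}
```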

The tests should also be run in the following environments:

  • A "garbage tier" vps
    • aka, a generic 5$ vps with 1 core & 512M of ram
  • a "low tier" vps
    • aka, a vps with roughly 2 cores & 2gb of ram (smth that costs like 10-15$/mo)
  • a "mid tier" vps
    • aka, a vps that costs something in the range of 30-50$/mo
  • a "high tier" vps
    • aka, a vps that costs something in the range of 100$+/mo
  • a "high memory" vps
    • aka, a vps with something like 96gb of mem

(If you can't/don't want to run it in all those different environments because of money, it can always just be run with JVM args to limit memory usage, to simulate lower-spec systems.)
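
For example, heap and CPU can be capped with standard JVM flags when launching the server; the JAR file name here is only a placeholder:

```
java -Xmx512m -Xms512m -XX:ActiveProcessorCount=1 -jar reposilite.jar
```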

The tests would also be run with different quantities of artifacts:

  • 1000 artifacts
  • 10000 artifacts
  • 100000 artifacts
  • 1000000 artifacts

To get such a large quantity of artifacts, a random selection of artifacts can be downloaded from Maven Central and rehosted for the purpose of testing.

I would be more than happy to help with writing the programs to perform these tests, or to contribute in any other way I can.

@dzikoysk added the documentation and performance labels Feb 25, 2022
@dzikoysk
Owner

dzikoysk commented Feb 26, 2022

I think that once we have a stable 3.x we could invest some time in writing benchmarks. A few notes on the proposed scenarios:

  • Reposilite does not use a cache, so there is no point in taking it into account tbh
  • The environments are inadequate for our goals
    • Imo there is no point in using anything above mid tier. Reposilite 3.x is designed to run as a microservice, so it's more about scaling through small instances in independent environments. Why? Such scaling is more effective and makes it possible to avoid bottlenecks caused by the limitations of the current hardware (Reposilite will more likely block on read IO when the disk can't serve more data in a given period of time)
    • Tests should start from ~32 MB of RAM and go up to probably something like 8 GB (I don't quite see a scenario where we could use more, because it'll block on disk anyway - ofc it depends on the hardware used)
  • There is no point in downloading real artifacts; a file is a file, so it could even be a random bulk file with a fixed size (see the sketch below).
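
A minimal sketch of generating such fixed-size random fixture files to populate a test repository; the directory layout, file count, and size are illustrative assumptions, not part of any existing tooling:

```kotlin
import java.nio.file.Files
import java.nio.file.Paths
import kotlin.random.Random

fun main() {
    val repositoryRoot = Paths.get("benchmark-repository")
    val artifactCount = 10_000
    val artifactSizeBytes = 256 * 1024 // 256 KB per fake artifact

    Files.createDirectories(repositoryRoot)
    repeat(artifactCount) { index ->
        // One fake group/artifact/version directory per generated file.
        val versionDir = repositoryRoot.resolve("com/example/artifact$index/1.0.0")
        Files.createDirectories(versionDir)
        Files.write(versionDir.resolve("artifact$index-1.0.0.jar"), Random.nextBytes(artifactSizeBytes))
    }
}
```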

Also, you need 2 machines to perform reliable tests - for client & server.

@solonovamax
Contributor Author

solonovamax commented Feb 26, 2022

Reposilite does not use a cache, so there is no point in taking it into account tbh

Still, that is a reason to take it into account. Performance benchmarks should be used to show users whether or not this tool is appropriate for them. So it should be shown, transparently, if the other repository servers perform better with large quantities of RAM and a cache.
(Not saying it isn't already, but this would give users an idea of the point at which they should choose another repo server.)

Imo there is no point in using anything above mid tier. Reposilite 3.x is designed to run as a microservice, so it's more about scaling through small instances in independent environments. Why? Such scaling is more effective and makes it possible to avoid bottlenecks caused by the limitations of the current hardware (Reposilite will more likely block on read IO when the disk can't serve more data in a given period of time)

Yeah, ofc. I'm assuming it'd be done on a VPS with decent I/O speeds.

There is no point in downloading real artifacts; a file is a file, so it could even be a random bulk file with a fixed size.

True, I just thought it would be good to use something that accurately models the real world.

Also, you need 2 machines to perform reliable tests - for client & server.

It could also be 2 virtual machines, virtualized using something like KVM.

Also, the point of the different quantities of artifacts on the server is to see how other maven repo servers compare when there are more/fewer artifacts.

@dzikoysk
Owner

Performance benchmarks should be used to show users whether or not this tool is appropriate for them

I mean, the results might be different, because there is e.g. the disk cache, but that'd be unrelated to Reposilite internals. Also, I added it more as a note, because I assume most people don't know how Reposilite works under the hood, so it's good to mention it anyway.

Speaking of preparing such benchmark, it should be:

  1. Transparent - represented by a Git project on GitHub, probably here: https://github.com/reposilite-playground
  2. Simple - easy to clone & launch by any user
  3. Maintainable - clean, so it's relatively easy to develop

I'd keep it simple, so we may start with only one mainstream manager:

  • Nexus (most popular, so probably the best one for now)
  • Archiva
  • Artifactory

And later we can extend it :)

@solonovamax
Contributor Author

solonovamax commented Feb 27, 2022

Speaking of preparing such benchmark, it should be:

  1. Transparent - represented by a Git project on GitHub, probably here: https://github.com/reposilite-playground
  2. Simple - easy to clone & launch by any user
  3. Maintainable - clean, so it's relatively easy to develop

Of course.
That was entirely implied in all of this, and such a benchmark would not mean much if it wasn't transparent, simple, and reproducible.

@dzikoysk
Owner

Results could be summarized in the guide:
