Some APIs result in 500 error #665

Open
Madhu1029 opened this issue Jul 14, 2022 · 7 comments


Madhu1029 commented Jul 14, 2022

We are trying to execute QuantumLeap APIs via a JMeter script, but some of the APIs result in a 500 error. Below is the configuration for the API execution:
API: GET http://host/v2/entities/tid1/attrs/pressure?type=typ&limit=100&offset=1534&fromDate=2022-06-02T00:00:00Z&toDate=2022-07-04T23:59:59Z
Throughput: 20 req/s
WORKERS: 40
QuantumLeap Version: 0.8.1
We have tried increasing the value of the WORKERS variable, which decreases the probability of a 500 error, but 500 errors still occur.
Is increasing the WORKERS value the correct solution for this issue? If yes, how can I calculate the correct value for WORKERS?

c0c0n3 (Member) commented Jul 14, 2022

hi @MadhuNEC :-)

We have tried increasing the value of WORKER variable

The variable name is actually WORKERS, notice the S at the end, setting WORKER=40 has no effect :-)
See:

500 error is still occurring.

Can you post more details about the error you're getting? Any trace of that in the logs? What's the backend you're using? Crate DB or Timescale? What's the load on the DB? We've had similar issues in the past which boiled down to not allocating enough resources to the database backend. So when QL tried to run a query, the DB would just refuse to run it b/c it was overloaded. It could be you're experiencing something similar, but I can't be sure. Like I said, we'd need to know more about your test environment, errors you get in the logs, DB load, etc.

Madhu1029 (Author) commented:

Yes, I am using the WORKERS variable. Sorry, I wrote it incorrectly in the comment; I have updated it.

I am using the CrateDB backend.
QuantumLeap is running behind nginx. I can only see a 500 response code for the API in the nginx logs. There are no logs on the QuantumLeap side for the API calls that result in a 500 error. It seems that QuantumLeap is unable to accept the request.

As for DB load, there are 10 entities and each entity has approximately 5000 data points. For example, the 10 entities are tid1, tid2, tid3, etc., and each entity (tid1, tid2, ...) contains approximately 5000 values of pressure.

c0c0n3 (Member) commented Jul 15, 2022

I am using WORKERS variable

cool, just wanted to double-check w/ you to rule out possible config issues.

There are no logs at QuantumLeap side for the API which results into 500 error

Then yes, I agree w/ you that QL could be the bottleneck. At a 20 req/sec throughput rate and 40 workers, it looks like each worker should be busy for up to 2 secs. That means the producer (JMeter) is faster than the consumer (QL) and eventually QL will be in a situation where all 40 workers are busy but new requests are still coming in. Keep in mind workers do the work sequentially (pun intended :-), so 40 workers means at most 40 concurrent queries. In this scenario Gunicorn will have no worker process to assign incoming requests to and so will just return a 500.
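
As a rough sizing rule (this is just Little's law applied to the numbers above, not anything QL-specific), the number of workers you need is about the request rate times the average time a single request takes, plus some headroom for bursts. A minimal sketch:

import math

def workers_needed(request_rate_per_sec, avg_request_secs, headroom=1.5):
    # Little's law: average concurrent requests = arrival rate * average service time.
    # The headroom factor leaves spare capacity for bursts.
    return math.ceil(request_rate_per_sec * avg_request_secs * headroom)

print(workers_needed(20, 2.0))   # 20 req/s, 2 s per request => 60 workers

So at 20 req/s, 40 workers only keep up if the average request stays well under 2 secs.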

On the other hand, it could also well be that each request takes up to 2 secs on average not b/c QL is slow but rather Crate DB can't keep up w/ the query rate. I've seen this in the past and the solution was to give Crate DB enough RAM to perform decently---Crate is a fine piece of software but you can't expect it to match your workload if you don't give it enough resources, have a look at the manual for the details. Then you could also up the number of QL workers.
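
For instance, if you run Crate DB in Docker, the JVM heap is set through the CRATE_HEAP_SIZE environment variable; the values and version tag below are only an illustration, size them according to your data volume and the Crate DB documentation:

$ docker run -d --name cratedb \
    -e CRATE_HEAP_SIZE=4g \
    --memory 8g \
    -p 4200:4200 -p 4300:4300 \
    crate:4.6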

On a side note, we never really worked on query optimisation, but we did identify some potential performance hot spots. For a complete analysis you can read

In particular, the issues/performance section could be applicable to your scenario. But again, keep in mind that model is just an abstract model, we've never validated it w/ real measurements. Speaking of which, one way to get to the bottom of this would be to use QL's built-in telemetry

to figure out how the QL load varies as a function of the input requests and how much of the processing time is spent waiting for Crate DB to return query result sets.

Hope this helps!

Madhu1029 (Author) commented Oct 10, 2022

Hi @c0c0n3,

I have checked the number of busy workers during the script execution: only 4-5 workers are busy at any point in time.
But 40 workers are configured in gconfig.py. I have added the code below to the src/server/gconfig.py file to print logs when a worker is assigned a request and when it releases it:

import statsd
import datetime

# statsd client (not actually used by the hooks below)
sc = statsd.StatsClient('localhost', 8668)

# Gunicorn server hooks: pre_request runs just before a worker starts
# processing a request, post_request right after it has finished.
def pre_request(worker, req):
    print("increment ", worker, datetime.datetime.now())

def post_request(worker, req):
    print("decrement ", worker, datetime.datetime.now())

I have executed the QuantumLeap APIs from JMeter. Below are the details of the script execution:
WORKERS: 40
QuantumLeap: 0.8.1
Throughput: 2 req/s
And the error below is found in the JMeter log:
Non HTTP response code: org.apache.http.NoHttpResponseException
It seems that JMeter has not received any response from the QuantumLeap side, so it results in a NoHttpResponseException.
Note: there is no log on the QuantumLeap end for the failed request.
Could you please help with this issue?

c0c0n3 (Member) commented Nov 9, 2022

Hi @MadhuNEC :-)

So I've finally found the time to look at this issue. What I did was follow the steps in

https://github.com/orchestracities/ngsi-timeseries-api/wiki/Gauging-Performance

to do some load testing. Then as explained in the wiki article I used Pandas to do some basic data analysis. It turns out Gunicorn actually distributed the work quite evenly among my 10 workers---I don't have enough horsepower to test w/ 40. So basically I got pretty much the same results as in the wiki article.

I have checked that only 4-5 workers are busy at any point of time.

How did you do that? Looking at the code in your gconfig.py I don't see any easy way to analyse the data you output in the pre/post request hooks. Can you please try using our telemetry framework

and do statistical analysis with Pandas as explained in the article and let us know if you get different results?
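
Just to illustrate the kind of analysis meant here (the file and column names below are hypothetical, adapt them to whatever your telemetry run actually produced):

import pandas as pd

# Hypothetical file/column names -- adapt to your telemetry output.
df = pd.read_csv('duration.csv')
print(df['duration'].describe())               # count, mean, std, quartiles, max
print(df.groupby('pid')['duration'].count())   # how requests spread across worker processes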

Thanks a lot!!

NEC-Vishal (Contributor) commented:

Hi @c0c0n3
Thanks for guiding us.
But when we run the command: ./baseline-load-test.sh
we get stuck at this step; we have tried a lot but unfortunately we cannot get past it.
[screenshot attached]

Can you please suggest a way to resolve this?

c0c0n3 (Member) commented Jan 5, 2023

Hi @NEC-Vishal,

did you run these commands before calling baseline-load-test.sh?

$ cd /path/to/ngsi-timeseries-api
$ source setup_dev_env.sh
$ pipenv install --dev
$ cd src/tests/benchmark
