API is down #264

Closed
Abdirahiim opened this issue Apr 7, 2020 · 27 comments
Labels
bug Something isn't working down Related to API availability performance Issue related to performance and optimizations

Comments

@Abdirahiim
Contributor

I was running unit tests for my project when they started failing. I checked the API to see if everything was working, and it seems to be down.

@Abdirahiim Abdirahiim changed the title API down API is down Apr 7, 2020
@Bost
Contributor

Bost commented Apr 7, 2020

This works flawlessly https://github.com/ExpDev07/coronavirus-tracker-api#running--development

And seriously, this API service has been reaching its capacity, so using your localhost endpoint for unit testing is more than just being nice to the others.

@Abdirahiim
Contributor Author

I was testing my API wrapper, and since it calls the API, using the API in the tests is inevitable.

@AafaaqAli

yes, it's down... but why so?

@paolotamag

down for me as well during my webinar XD
luckily i had previously cached data in my dashboard!

@azeezgaa

azeezgaa commented Apr 7, 2020

Does anyone have any idea when this is going to be solved?

I have a demo of this for my manager scheduled this Friday.

@GabrielDS
Contributor

Does anyone have any idea when this is going to be solved?

I have a demo of this for my manager scheduled this Friday.

I think the problem is with the app's dynos (the Heroku instances). It ran locally for me without problems.
@Kilo59

@ibhuiyan17
Contributor

Screen Shot 2020-04-07 at 1 19 16 PM

https://www.heroku.com/pricing

Unfortunately scaling up on Heroku isn't cheap but here are some of the options that can be considered. I think the best option for now is running it locally

@toxyl

toxyl commented Apr 7, 2020

Maybe crowdfunding?

@Kilo59
Collaborator

Kilo59 commented Apr 7, 2020

I have another version of it hosted in the US that doesn't get nearly as much traffic.

https://covid-tracker-us.herokuapp.com/
#248 (comment)

@ibhuiyan17
Contributor

Could some sort of load-balancing scheme be implemented that directs traffic to multiple free instances?

@toxyl

toxyl commented Apr 7, 2020

I'm still trying to get a node working on an Ubuntu server, because then I can spin up some droplets and put a load balancer in front.

@Kilo59 Kilo59 pinned this issue Apr 7, 2020
@Kilo59 Kilo59 added the bug Something isn't working label Apr 7, 2020
@azeezgaa

azeezgaa commented Apr 7, 2020

Does anyone have any idea when this is going to be solved?
I have a demo of this for my manager scheduled this Friday.

I think the problem is with the app's dynos (the Heroku instances). It ran locally for me without problems.
@Kilo59

What do you mean by running it locally? Can you please help with this? Currently I am just calling the Heroku API from a REST message in JavaScript and getting the response in JSON format.

@itsamirrezah

@azeezgaa actually it's pretty simple, just follow the instructions in the README file.

  • Clone the repository
  • Install Python 3.8
  • Install pipenv
  • Run the pipenv shell command.

Check out the installation section: https://github.com/ExpDev07/coronavirus-tracker-api#installation

@azeezgaa

azeezgaa commented Apr 7, 2020

@azeezgaa actually it's pretty simple, just follow the instructions in the README file.

  • Clone the repository
  • Install Python 3.8
  • Install pipenv
  • Run the pipenv shell command.

Check out the installation section: https://github.com/ExpDev07/coronavirus-tracker-api#installation

But this will only work on my PC and not on other PCs, right? And it won't work on mobile devices, am I right?

@Kilo59 Kilo59 added the performance Issue related to performance and optimizations label Apr 7, 2020
@itsamirrezah

@azeezgaa Yes, you're right, but for development and testing purposes, running locally is the more reliable way.
Also, I think you can deploy the repository on your own server (if it's allowed).
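
One way to make the switch between a local copy and a hosted one painless is to read the base URL from the environment. The following is only a sketch under assumptions: the variable name COVID_API_BASE_URL is made up for illustration, and the default of http://localhost:8000 assumes that is where your local instance listens, so adjust it to your setup.

```python
import os

# Assumption: a locally running copy of the API listens here; change as needed.
DEFAULT_BASE_URL = "http://localhost:8000"


def api_base_url(env=None) -> str:
    """Return the base URL, preferring the (hypothetical) COVID_API_BASE_URL variable."""
    env = os.environ if env is None else env
    return env.get("COVID_API_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def endpoint(path: str, env=None) -> str:
    """Join the base URL and a path without doubling or dropping slashes."""
    return api_base_url(env) + "/" + path.lstrip("/")
```

Unit tests would then hit the local instance by default, while a deployed client could set the variable to whichever hosted endpoint it is allowed to use.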

@toxyl

toxyl commented Apr 7, 2020

It didn't work quite so well for me the first time around when I tried with Debian 10 and Ubuntu 19.10; Ubuntu 18.04.3 did work, though. I should have looked at the files earlier, then I would have seen the docker-compose.yml xD Now making an Ansible playbook to deploy those.

@gribok
Contributor

gribok commented Apr 7, 2020

Many platforms support COVID-19 projects with free enterprise accounts during the pandemic.
See https://github.blog/2020-03-23-open-collaboration-on-covid-19/

Maybe this could help the project get onto the landing page.

@Kilo59
Collaborator

Kilo59 commented Apr 7, 2020

We don't have a good error tracking system in place so it's hard to determine what the underlying cause was.
We don't actually get THAT much traffic: it seems to be anywhere between 200 and 800 requests per minute depending on the time of day, usually closer to 300.

Here are some metrics from around the time of the outage and then you can see when the app recovered. It could have just been a problem with Heroku's infrastructure. 🤷‍♂

Outage.

The outage seemed to last about an hour.
On the Status codes graph. Blue is 2xx, Red 4xx, Purple 5xx.

image

Recent

As you can see we are actually handling a lot more requests now than at the time of the outage.
image

@toxyl

toxyl commented Apr 7, 2020

I've set up two small droplets running the API behind a load balancer, feel free to use them as long as they can handle the load: https://cvtapi.nl
I've also written an Ansible playbook (available here: https://github.com/Toxyl/coronavirus-tracker-api-ansible.git) that I can use to add new nodes when needed.
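
For clients that cannot rely on a server-side load balancer, a rough client-side alternative is to try several deployments in order and fall back when one is down. This is a sketch, not anything the project ships: the helper names are hypothetical, the MIRRORS list simply reuses the two alternative endpoints mentioned in this thread, and the /v2/latest path in the usage note is an assumption about the API's routes.

```python
from typing import Callable, Sequence
from urllib.request import urlopen

# The alternative deployments mentioned in this thread; extend or replace as needed.
MIRRORS = [
    "https://covid-tracker-us.herokuapp.com",
    "https://cvtapi.nl",
]


def http_get(url: str, timeout: float = 5.0) -> bytes:
    """Fetch a URL with the standard library and return the raw body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_failover(
    path: str,
    mirrors: Sequence[str] = MIRRORS,
    get: Callable[[str], bytes] = http_get,
) -> bytes:
    """Try each mirror in order and return the first successful response."""
    last_error = None
    for base in mirrors:
        try:
            return get(base.rstrip("/") + path)
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            last_error = error
    raise RuntimeError(f"all {len(mirrors)} mirrors failed") from last_error
```

Calling fetch_with_failover("/v2/latest") would then only fail if every listed deployment is unreachable. Passing the get function in as a parameter keeps the helper testable without network access.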

@Kilo59
Collaborator

Kilo59 commented Apr 7, 2020

The API is back up.
It seems to have ONLY 😬 lasted about an hour.
Going to leave the issue open for a bit in case it goes down again.

@toxyl

toxyl commented Apr 9, 2020

avg_req_duration

That is from my load balancer, which is experiencing hardly any load. Ignore the range 12PM-6PM in the first half, as I was still doing some setup on the nodes. Makes me wonder if the API has a memory leak or a similar issue, which could explain the outage. Can you check the logs on Heroku? Maybe the API crashed after a while of timing out (i.e. the average request duration growing too high).
The intervals between increments in the graph are 1 hour and 1.5 hours respectively.

Just had a quick look at the logs of the Docker container on one node, not much to see there but something that might give a hint:

WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!

In the beginning I saw those every now and then, and I see

INFO:     source provided: JhuLocationService

at regular intervals, but after a while these happened twice per interval instead of once. And the warnings started changing, with gradually more warnings printed at the same time. By now the warnings look like this:

WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!

I'll keep an eye on this to see if the trend continues.

@Kilo59
Collaborator

Kilo59 commented Apr 9, 2020

@toxyl no, I can't go back and look at the logs that far back.
I have added the free tier of Timber IO #266, so in the future, I should be able to look back.

I've also seen those log messages, but I don't know enough about that section to know if that's concerning. If it's not actually a problem I do think we should change the log level for that message because it does imply that there's a problem.

Regarding a possible memory leak being the cause: I'm not saying that's not possible, but if you look at the screenshot from the time of the outage, you'll also notice that it recovers on its own, not because of a redeployment.

I need to spend more time looking over the code, but the only clue I have is that we cache our source data results for an hour and the outage lasted for about an hour.
But typically things like Python's LRU cache do not cache exceptions.
That isn't what we are using, but it's possible that our solution does cache exceptions, or that the cached results caused downstream exceptions.
An hour later the cache was refreshed and the application recovered 🤷 .
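
To illustrate the difference described above, here is a small sketch. It is not the project's actual cache: NaiveTTLCache and make_flaky_loader are hypothetical names, and the cache is written deliberately so that it stores failures, to show how a one-hour TTL could turn a single failed refresh into an hour of errors. functools.lru_cache, by contrast, stores nothing when the wrapped function raises, so the next call retries.

```python
import time
from functools import lru_cache


def make_flaky_loader():
    """Return a loader that fails on its first call and succeeds afterwards."""
    state = {"calls": 0}

    def load():
        state["calls"] += 1
        if state["calls"] == 1:
            raise ConnectionError("upstream unavailable")
        return "fresh data"

    return load


class NaiveTTLCache:
    """Hypothetical cache that remembers the last outcome, including failures."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader
        self.ttl = ttl_seconds
        self.clock = clock
        self.expires_at = None
        self.outcome = None

    def get(self):
        now = self.clock()
        if self.expires_at is None or now >= self.expires_at:
            try:
                self.outcome = ("ok", self.loader())
            except Exception as error:
                self.outcome = ("error", error)  # the failure gets cached too
            self.expires_at = now + self.ttl
        kind, value = self.outcome
        if kind == "error":
            raise value
        return value


# lru_cache does not store exceptions: the second call runs the loader again.
cached_loader = lru_cache(maxsize=None)(make_flaky_loader())
```

With a one-hour TTL, a NaiveTTLCache wrapped around the same flaky loader keeps raising until the entry expires, then recovers on the next refresh without any redeployment.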

Going to close this issue since the API has been back up for days at this point, but I will open another issue to continue this investigation.

#270

Edit:

That purple dot on the graphs indicates a deployment. It didn't resolve the issue either, so IMO that means it probably wasn't the cache 🙄 .

@Kilo59 Kilo59 closed this as completed Apr 9, 2020
@toxyl

toxyl commented Apr 9, 2020

@toxyl no, I can't go back and look at the logs that far back.
I have added the free tier of Timber IO #266, so in the future, I should be able to look back.

I've also seen those log messages, but I don't know enough about that section to know if that's concerning. If it's not actually a problem I do think we should change the log level for that message because it does imply that there's a problem.

Regarding a possible memory leak being the cause: I'm not saying that's not possible, but if you look at the screenshot from the time of the outage, you'll also notice that it recovers on its own, not because of a redeployment.

I need to spend more time looking over the code, but the only clue I have is that we cache our source data results for an hour and the outage lasted for about an hour.
But typically things like Python's LRU cache do not cache exceptions.
That isn't what we are using, but it's possible that our solution does cache exceptions, or that the cached results caused downstream exceptions.
An hour later the cache was refreshed and the application recovered 🤷 .

Going to close this issue since the API has been back up for days at this point, but I will open another issue to continue this investigation.

#270

I don't know how you've set up the server. The way I would set it up, it would restart the API when it crashes, which would happen after a while of timeouts, so it would appear unreachable for a while before coming back. If it was a memory leak, the API would crash after all memory has been consumed, which does not imply that it's reachable until then. But yeah, I might be way off with that, it's just a hunch.

@Bost
Contributor

Bost commented Apr 9, 2020

with gradually more warnings printed at the same time. By now the warnings look like this:

WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!

I'll keep an eye on this to see if the trend continues.

The API service was looking for a country code in the COUNTRY_NAME__COUNTRY_CODE dictionary using MS Zaandam and/or Diamond Princess as a country name. And it returned XX with a warning since no such countries exist. And such warnings appear only once per hour since the responses are cached for exactly one hour. No magic here, nothing to worry about.

Among other reasons, the warnings appear if the JHU CSSE misspells or abbreviates a country name or uses some non-ISO-standard country name in their CSV files. In this situation a new alias must be added to the dictionary, programmatically correcting such a mistake.

The case of MS Zaandam and Diamond Princess is a bit tricky - these are proper names of real ships. I've been thinking about suppressing such warnings in these two particular cases, but(!) I decided not to do so, since ships are "tricky". They can move from port to port, i.e. from country to country or disappear, i.e. move to international waters, or the passengers can leave the ship, thus decreasing(!) the number of confirmed cases, deaths etc. So I prefer to keep these warnings around as a reminder of this trickiness.
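
The lookup described above can be sketched roughly as follows. This is an illustration, not the project's code: only the COUNTRY_NAME__COUNTRY_CODE name, the 'XX' default and the warning text come from this thread, while the ALIASES table, the country_code function and the tiny dictionary contents are assumptions made for the example.

```python
import logging

LOGGER = logging.getLogger("country_codes")
DEFAULT_COUNTRY_CODE = "XX"

# Tiny illustrative subset; the real dictionary is assumed to be much larger.
COUNTRY_NAME__COUNTRY_CODE = {
    "Netherlands": "NL",
    "United States": "US",
    "Korea, South": "KR",
}

# Hypothetical alias table mapping upstream spellings to the canonical names above.
ALIASES = {
    "US": "United States",
    "South Korea": "Korea, South",
}


def country_code(name: str) -> str:
    """Resolve a country name to a code, falling back to 'XX' with a warning."""
    canonical = ALIASES.get(name, name)
    code = COUNTRY_NAME__COUNTRY_CODE.get(canonical)
    if code is None:
        LOGGER.warning(
            "No country code found for '%s'. Using '%s'!", name, DEFAULT_COUNTRY_CODE
        )
        return DEFAULT_COUNTRY_CODE
    return code
```

Ship names such as 'MS Zaandam' match neither table, so they fall through to 'XX' and log the warning each time the source data is re-parsed.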

@toxyl

toxyl commented Apr 9, 2020

And such warnings appear only once per hour since the responses are cached for exactly one hour. No magic here, nothing to worry about.

That wasn't the point. The point is that they appear multiple times in one update, indicating that more than one update is happening.

@Kilo59
Collaborator

Kilo59 commented Apr 9, 2020

@toxyl @Bost
The log message appearing multiple times isn't necessarily an issue either. You could be seeing the logs from multiple processes.
The application is deployed using gunicorn, which uses multiple workers.
http://docs.gunicorn.org/en/latest/settings.html#worker-processes

I believe the number of workers (if not set) is determined by the number of cores on the host machine.
These are independent processes that will each have their own cache.
Gunicorn orchestrates them.
https://github.com/ExpDev07/coronavirus-tracker-api/blob/master/Procfile
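
A toy simulation of that explanation, under stated assumptions: the Worker class below is a plain-Python stand-in for a gunicorn worker process (no real processes are started), each with its own one-hour cache, and every cache refresh records one warning. As far as the gunicorn settings documentation goes, the worker count comes from the workers setting or the WEB_CONCURRENCY environment variable and otherwise defaults to 1, so the counts used here are purely illustrative.

```python
class Worker:
    """Stand-in for one worker process that keeps its own in-memory cache."""

    def __init__(self, name, ttl_seconds=3600):
        self.name = name
        self.ttl = ttl_seconds
        self.expires_at = None
        self.cache = None

    def handle_request(self, now, log):
        # Refresh this worker's private cache when it is empty or expired.
        if self.expires_at is None or now >= self.expires_at:
            log.append((self.name, "No country code found for 'MS Zaandam'. Using 'XX'!"))
            self.cache = "locations"
            self.expires_at = now + self.ttl
        return self.cache


def simulate(worker_count, request_times):
    """Pretend every worker serves one request at each time; return the warnings."""
    workers = [Worker(f"worker-{i}") for i in range(worker_count)]
    log = []
    for now in request_times:
        for worker in workers:
            worker.handle_request(now, log)
    return log
```

With one worker, requests at 0, 10 and 3600 seconds produce two warnings (one per refresh); with four workers the same requests produce eight, which would look like repeated warnings within a single update interval.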

Let's continue this here #270

Repository owner locked and limited conversation to collaborators Apr 9, 2020
@Kilo59 Kilo59 added the down Related to API availability label Apr 18, 2020
@Kilo59
Collaborator

Kilo59 commented Apr 18, 2020

@toxyl @Bost
Added more logging in #290
