API is down #264

Abdirahiim · 2020-04-07T12:45:35Z

I was running unit tests for my project when they started failing. I checked the API to see whether everything was working, and it seems to be down.

Bost · 2020-04-07T13:04:05Z

This works flawlessly https://github.com/ExpDev07/coronavirus-tracker-api#running--development

And seriously, this API service has been reaching its capacity, so using your localhost endpoint for unit testing is more than just being nice to the others.
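One way to make tests point at a local instance is to read the base URL from the environment. This is a minimal sketch, not part of the project: the variable name `TRACKER_API_BASE_URL` and both helper functions are hypothetical, and `http://localhost:8000` in the usage note is an assumption about where a local instance listens. The `/v2/locations` path is the one that appears in the linked COVID19Py error report.

```python
import os

# Public instance discussed in this issue; used when no override is set.
PUBLIC_BASE_URL = "https://coronavirus-tracker-api.herokuapp.com"


def api_base_url(environ=os.environ):
    """Return the base URL, preferring the (hypothetical) TRACKER_API_BASE_URL override."""
    return environ.get("TRACKER_API_BASE_URL", PUBLIC_BASE_URL).rstrip("/")


def locations_url(country_code, source="jhu", environ=os.environ):
    """Build a /v2/locations URL against whichever instance is configured."""
    base = api_base_url(environ)
    return f"{base}/v2/locations?country_code={country_code}&source={source}"
```

With something like `TRACKER_API_BASE_URL=http://localhost:8000` exported before the test run, the same wrapper code would hit the local instance instead of the shared one.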

Abdirahiim · 2020-04-07T14:34:15Z

I was testing my API wrapper, and since it uses the API, using it for tests is inevitable.

AafaaqAli · 2020-04-07T14:40:51Z

Yes, it's down... but why?

paolotamag · 2020-04-07T16:12:48Z

Down for me as well, during my webinar XD
Luckily I had previously cached data in my dashboard!

azeezgaa · 2020-04-07T16:30:50Z

Does anyone have any idea whether this is going to be solved?

I have a demo for my manager on this scheduled for this Friday.

GabrielDS · 2020-04-07T16:52:33Z

I think the problem is the app on the dyno (Heroku instances). I ran it without problems locally.
@Kilo59

ibhuiyan17 · 2020-04-07T17:24:29Z

https://www.heroku.com/pricing

Unfortunately, scaling up on Heroku isn't cheap, but here are some of the options that can be considered. I think the best option for now is running it locally.

toxyl · 2020-04-07T17:27:00Z

Maybe crowdfunding?

Kilo59 · 2020-04-07T17:29:56Z

I have another version of it hosted in the US that doesn't get nearly as much traffic.

https://covid-tracker-us.herokuapp.com/
#248 (comment)

ibhuiyan17 · 2020-04-07T17:31:58Z

Could some sort of load-balancing scheme be implemented that directs traffic to multiple free instances?
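Short of a real load balancer, a client can at least fail over between instances itself. The sketch below is client-side failover, not server-side load balancing, and it is only an illustration: `fetch_with_failover` and `http_get` are hypothetical helpers, the base URLs are the ones mentioned in this thread, and nothing here guarantees those instances are reachable.

```python
import urllib.request

# Instances mentioned in this thread; availability is not guaranteed.
BASE_URLS = [
    "https://coronavirus-tracker-api.herokuapp.com",
    "https://covid-tracker-us.herokuapp.com",
]


def http_get(url, timeout=10.0):
    """Fetch a URL and return the raw body; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_failover(path, base_urls=BASE_URLS, get=http_get):
    """Try each instance in order and return the first successful response body."""
    last_error = None
    for base in base_urls:
        try:
            return get(base.rstrip("/") + path)
        except OSError as error:
            # urllib's URLError/HTTPError and socket timeouts all derive from OSError.
            last_error = error
    raise RuntimeError(f"all instances failed for {path}") from last_error
```

A call such as `fetch_with_failover("/v2/locations?country_code=BD&source=jhu")` would only reach the second instance if the first one errors out. The `get` parameter exists so the function can be tested without network access.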

toxyl · 2020-04-07T17:32:37Z

I'm still trying to get a node working on an Ubuntu server, 'cause then I can spin up some droplets and put a load balancer in front.

azeezgaa · 2020-04-07T18:02:17Z

> I think the problem is the app on the dyno (Heroku instances). I ran it without problems locally.
> @Kilo59

What do you mean by running it locally? Can you please help with this? Currently I am just using the Heroku API and getting the response in JSON format, which I do via a REST message in JavaScript.

itsamirrezah · 2020-04-07T19:08:32Z

@azeezgaa Actually it's pretty simple: just follow the instructions in the README file.

1. Clone the repository.
2. Install Python 3.8.
3. Install pipenv.
4. Run the pipenv shell command.

Check out the installation section: https://github.com/ExpDev07/coronavirus-tracker-api#installation

azeezgaa · 2020-04-07T19:13:05Z

But this will work only on my PC, not on other PCs, right? And it won't work on mobile devices, am I right?

itsamirrezah · 2020-04-07T19:27:42Z

@azeezgaa Yes, you're right, but for development and testing purposes, running locally is the more reliable way.
Also, I think you can deploy the repository on your own server (if that's allowed).

toxyl · 2020-04-07T19:31:19Z

It didn't work so well for me the first time around when I tried with Debian 10 and Ubuntu 19.10; Ubuntu 18.04.3 did, though. I should have looked at the files earlier, then I would have seen the docker-compose.yml xD Now I'm making an Ansible playbook to deploy those.

gribok · 2020-04-07T21:39:54Z

Many platforms support COVID-19 projects with free enterprise accounts during the pandemic.
See https://github.blog/2020-03-23-open-collaboration-on-covid-19/

Maybe this could help the project get onto the landing page.

Kilo59 · 2020-04-07T21:40:17Z

We don't have a good error tracking system in place, so it's hard to determine what the underlying cause was.
We don't actually get THAT much traffic: anywhere between 200 and 800 requests per minute depending on the time of day, usually closer to 300.

Here are some metrics from around the time of the outage, and you can also see when the app recovered. It could have just been a problem with Heroku's infrastructure. 🤷‍♂

[Image: Outage]

The outage seemed to last about an hour.
On the status codes graph, blue is 2xx, red is 4xx, and purple is 5xx.

[Image: Recent]

As you can see we are actually handling a lot more requests now than at the time of the outage.

toxyl · 2020-04-07T21:55:07Z

I've set up two small droplets running the API behind a load balancer. Feel free to use them as long as they can handle the load: https://cvtapi.nl
I've also written an Ansible playbook (available here: https://github.com/Toxyl/coronavirus-tracker-api-ansible.git) that I can use to add new nodes when needed.

Kilo59 · 2020-04-07T22:18:13Z

The API is back up.
It seems to have ONLY 😬 lasted about an hour.
Going to leave the issue open for a bit in case it goes down again.

toxyl · 2020-04-09T05:44:04Z

That is from my load balancer, which is experiencing hardly any load. Ignore the range 12PM-6PM in the first half, as I was still doing some setup on the nodes. It makes me wonder whether the API has a memory leak or a similar issue that could explain the outage. Can you check the logs on Heroku? Maybe the API crashed after a while of timing out (i.e. the average request duration growing too high).
The intervals between increments in the graph are 1 hour and 1.5 hours respectively.

Just had a quick look at the logs of the Docker container on one node, not much to see there but something that might give a hint:

WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!

In the beginning I saw those every now and then, and I saw

INFO:     source provided: JhuLocationService

at regular intervals, but after a while these appeared twice per interval instead of once. The warnings also started changing, with gradually more warnings printed at the same time. By now the warnings look like this:

WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!

I'll keep an eye on this to see if the trend continues.

Kilo59 · 2020-04-09T10:53:24Z

@toxyl No, I can't go back and look at logs from that far back.
I have added the free tier of Timber IO #266, so in the future, I should be able to look back.

I've also seen those log messages, but I don't know enough about that section to know if that's concerning. If it's not actually a problem I do think we should change the log level for that message because it does imply that there's a problem.

Regarding a possible memory leak being the cause: I'm not saying that's not possible, but if you look at the screenshot at the time of the outage, you'll also notice that it recovers on its own, not because of a redeployment.

I need to spend more time looking over the code, but the only clue I have is that we cache our source data results for an hour and the outage lasted for about an hour.
But typically things like Python's LRU cache do not cache exceptions.
That isn't what we are using, but it's possible that our solution does cache exceptions, or that the cached results caused downstream exceptions.
An hour later the cache was refreshed and the application recovered 🤷 .
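To illustrate the difference, here is a small sketch. The first half uses the standard library's `functools.lru_cache`, which does not store a raised exception, so the next call runs the function again. The second half is a deliberately naive, hypothetical cache (not the project's implementation) that stores failures too and keeps re-raising them until the entry is evicted. All function names are made up for the example.

```python
from functools import lru_cache

attempts = {"lru": 0, "naive": 0}


@lru_cache(maxsize=None)
def load_with_lru(source):
    """Fails on the first call only; lru_cache does not remember the failure."""
    attempts["lru"] += 1
    if attempts["lru"] == 1:
        raise ConnectionError("upstream unavailable")
    return f"data from {source}"


def load_flaky(source):
    """Same behaviour without caching: fails once, then succeeds."""
    attempts["naive"] += 1
    if attempts["naive"] == 1:
        raise ConnectionError("upstream unavailable")
    return f"data from {source}"


_naive_cache = {}


def load_with_naive_cache(source):
    """Hypothetical cache that stores the outcome, whether success or failure."""
    if source not in _naive_cache:
        try:
            _naive_cache[source] = ("ok", load_flaky(source))
        except ConnectionError as error:
            _naive_cache[source] = ("error", error)
    status, value = _naive_cache[source]
    if status == "error":
        raise value  # the cached failure is replayed on every later call
    return value


def outcome(func, source):
    """Return the function's result, or the string 'failed' on ConnectionError."""
    try:
        return func(source)
    except ConnectionError:
        return "failed"
```

Calling `outcome(load_with_lru, "jhu")` twice gives a failure and then a success, while `outcome(load_with_naive_cache, "jhu")` fails both times even though the upstream has recovered. Whether anything like the second pattern exists in the project is exactly the open question.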

Going to close this issue since the API has been back up for days at this point, but I will open another issue to continue this investigation.

#270

Edit:

That purple dot on the graphs indicates a deployment. It didn't resolve the issue either, so IMO that means it probably wasn't the cache 🙄 .

toxyl · 2020-04-09T11:16:51Z

I don't know how you've set up the server. The way I would set it up, it would restart the API when it crashes, which would happen after a while of timeouts, so it would appear unreachable for a while before coming back. If it was a memory leak, the API would crash after all memory has been consumed, which does not imply that it's reachable until then. But yeah, I might be way off with that, it's just a hunch.

Bost · 2020-04-09T14:44:18Z

> with gradually more warnings printed at the same time. By now the warnings look like this:
>
> WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
> WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
> WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
> WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
> WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
> WARNING:  No country code found for 'Diamond Princess'. Using 'XX'!
> WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
> WARNING:  No country code found for 'MS Zaandam'. Using 'XX'!
>
> I'll keep an eye on this to see if the trend continues.

The API service was looking for a country code in the COUNTRY_NAME__COUNTRY_CODE dictionary using MS Zaandam and/or Diamond Princess as a country name. It returned XX with a warning since no such countries exist. Such warnings appear only once per hour since the responses are cached for exactly one hour. No magic here, nothing to worry about.

Among other reasons, the warnings appear if JHU CSSE misspells or abbreviates a country name, or uses a non-ISO-standard country name in their CSV files. In this situation a new alias must be added to the dictionary, programmatically correcting the mistake.
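The lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the dictionary name and the `XX` default come from this thread, while the function name `country_code`, the separate `ALIASES` mapping, and the few sample entries are made up for the example.

```python
import logging

LOGGER = logging.getLogger("country_codes_sketch")

DEFAULT_COUNTRY_CODE = "XX"

# Tiny illustrative subset; the real dictionary is much larger.
COUNTRY_NAME__COUNTRY_CODE = {
    "Germany": "DE",
    "Italy": "IT",
    "South Korea": "KR",
}

# Hypothetical alias table mapping non-standard spellings to canonical names.
ALIASES = {
    "Korea, South": "South Korea",
}


def country_code(name):
    """Return the ISO code for a country name, or 'XX' with a warning if unknown."""
    canonical = ALIASES.get(name, name)
    code = COUNTRY_NAME__COUNTRY_CODE.get(canonical)
    if code is None:
        LOGGER.warning(
            "No country code found for '%s'. Using '%s'!", name, DEFAULT_COUNTRY_CODE
        )
        return DEFAULT_COUNTRY_CODE
    return code
```

Here `country_code("Korea, South")` resolves through the alias, while a ship name such as `"MS Zaandam"` falls through to `XX` and logs the warning.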

The case of MS Zaandam and Diamond Princess is a bit tricky - these are proper names of real ships. I've been thinking about suppressing such warnings in these two particular cases, but(!) I decided not to do so, since ships are "tricky". They can move from port to port, i.e. from country to country or disappear, i.e. move to international waters, or the passengers can leave the ship, thus decreasing(!) the number of confirmed cases, deaths etc. So I prefer to keep these warnings around as a reminder of this trickiness.

toxyl · 2020-04-09T15:04:31Z

> And such warnings appear only once per hour since the responses are cached for exactly one hour. No magic here, nothing to worry about.

That wasn't the point. The point is that they appear multiple times in one update, indicating that more than one update is happening.

Kilo59 · 2020-04-09T17:02:33Z

@toxyl @Bost
The log message appearing multiple times isn't necessarily an issue either. You could be seeing the logs from multiple processes.
The application is deployed using gunicorn, which uses multiple workers.
http://docs.gunicorn.org/en/latest/settings.html#worker-processes

I believe the number of workers (if not set) is determined by the number of cores on the host machine.
These are independent processes that will each have their own cache.
Gunicorn orchestrates them.
https://github.com/ExpDev07/coronavirus-tracker-api/blob/master/Procfile
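As a rough illustration of why independent caches repeat the same log line, the sketch below models each worker as a plain object with its own dictionary. Real gunicorn workers are separate OS processes, so this is only a simulation; the `Worker` class and `simulate` function are hypothetical, and the message text is the one quoted earlier in this thread.

```python
class Worker:
    """Stands in for one worker process with its own in-memory cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def get_locations(self, log):
        # On a cache miss the worker reloads the source data and emits the warning.
        if "locations" not in self.cache:
            log.append((self.name, "No country code found for 'MS Zaandam'. Using 'XX'!"))
            self.cache["locations"] = ["placeholder data"]
        return self.cache["locations"]


def simulate(worker_count, requests_per_worker):
    """Send requests to every worker and return the collected log entries."""
    log = []
    workers = [Worker(f"worker-{index}") for index in range(worker_count)]
    for _ in range(requests_per_worker):
        for worker in workers:
            worker.get_locations(log)
    return log
```

`simulate(3, 5)` yields three log entries, one per worker, no matter how many requests each worker serves before its cache expires.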

Let's continue this here #270

Kilo59 · 2020-04-18T16:56:04Z

@toxyl @Bost
Added more logging in #290

Abdirahiim changed the title ~~API down~~ API is down Apr 7, 2020

rudykonio mentioned this issue Apr 7, 2020

API is not working for me #265

Closed

Kilo59 pinned this issue Apr 7, 2020

Kilo59 added the bug Something isn't working label Apr 7, 2020

Kilo59 added the performance Issue related to performance and optimizations label Apr 7, 2020

Kamaropoulos mentioned this issue Apr 8, 2020

HTTPError: 503 Server Error: Service Unavailable for url: https://coronavirus-tracker-api.herokuapp.com/v2/locations?country_code=BD&source=jhu Kamaropoulos/COVID19Py#13

Open

Kilo59 mentioned this issue Apr 9, 2020

Investigate cause of API outage on April 7 2020 #270

Open

Kilo59 closed this as completed Apr 9, 2020

Repository owner locked and limited conversation to collaborators Apr 9, 2020

Kilo59 added the down Related to API availability label Apr 18, 2020
