OOM killed environment should reload function after restart #218
Labels
env-go
Go environment related issues/PR
env-jvm
Java/JVM environment related issues/PR
env-nodejs
NodeJS environment related issues/PR
env-python
Python environment related issues/PR
Describe the bug
Parent bug: fission/fission#1874
If a pod is OOMKilled and restarted, requests routed to that pod typically fail with HTTP 502 errors. This was observed in a customer's environment with a custom Python env based on a Debian image. It is not yet clear whether this is a fetcher, env, or executor issue; it needs to be investigated and fixed.
To Reproduce
fission/fission#1874 (comment)
Reproducing the issue with the pool manager:
From Fission's perspective, the function runtime server (the Python container) is assumed to be specialized.
But once the function runtime server has restarted, it is no longer specialized, so a request made to the env server returns an error.
I am planning to try adding a signal handler in the Python server to check whether we can detect exit signals before the OOM kill.
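A minimal sketch of such a signal handler, using only the Python standard library. The helper name `install_exit_signal_handlers` and the callback are hypothetical and not part of the Fission Python environment. One caveat to keep in mind: the kernel OOM killer terminates a process with SIGKILL, which cannot be caught, so a handler like this would only observe catchable signals such as SIGTERM or SIGINT.

```python
import signal


def install_exit_signal_handlers(on_signal, signums=(signal.SIGTERM, signal.SIGINT)):
    """Register on_signal(signum) for each catchable signal.

    Hypothetical helper for illustration. Returns a dict mapping each
    signal number to its previous handler so the caller can restore it.
    SIGKILL cannot be handled, so it is deliberately not in the default list.
    Must be called from the main thread (a requirement of signal.signal).
    """
    previous = {}
    for signum in signums:
        def handler(received, frame, _callback=on_signal):
            # Forward only the signal number; the stack frame is ignored.
            _callback(received)

        previous[signum] = signal.signal(signum, handler)
    return previous
```

In the runtime server, the callback could, for example, log the received signal before the process exits, which would show whether any catchable signal arrives ahead of the kill.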
Additional context
fission/fission#1874 (comment)
What happens when a container within a pod gets killed due to OOM?
Why do we get a 502 after the function runtime container restarts?
Possible approaches
Approach a. seems to require more effort on the user side, whereas b. would require a lot of additional continuous scanning on the Fission side just to identify the OOM scenario.
Both a. and b. try to build the solution at the Fission level, whereas c. could be implemented at the environment level.
c. seems the easiest to get working quickly in the environment.
I did a quick POC for this, and after a restart the function runtime was respecialized.
Please check the PR for more details: #213
The same fix needs to be implemented across environments.
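As a rough illustration of an environment-level fix, the sketch below persists the specialization parameters and reloads the function when the runtime server starts again. This is not the code from the PR. The class `RuntimeServer`, the state file location, and the method names are all hypothetical, and the sketch assumes the state file and the user code live on a volume that survives a container restart within the pod.

```python
import importlib.util
import json
import os
import tempfile

# Hypothetical default location; assumed to survive a container restart.
DEFAULT_STATE_FILE = os.path.join(tempfile.gettempdir(), "specialize-state.json")


class RuntimeServer:
    """Toy model of a function runtime that respecializes itself on startup."""

    def __init__(self, state_file=DEFAULT_STATE_FILE):
        self.state_file = state_file
        self.userfunc = None
        self._restore()  # reload the function if a previous run was specialized

    def specialize(self, filepath, function_name):
        """Load the user function and record how it was loaded."""
        self.userfunc = self._load(filepath, function_name)
        tmp = self.state_file + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"filepath": filepath, "function_name": function_name}, f)
        os.replace(tmp, self.state_file)  # atomic rename, avoids partial state

    def _restore(self):
        if not os.path.exists(self.state_file):
            return  # never specialized, stay generic
        with open(self.state_file) as f:
            state = json.load(f)
        self.userfunc = self._load(state["filepath"], state["function_name"])

    @staticmethod
    def _load(filepath, function_name):
        # Assumes filepath ends in ".py" so a source loader can be inferred.
        spec = importlib.util.spec_from_file_location("userfunc", filepath)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return getattr(module, function_name)

    def handle(self, *args, **kwargs):
        if self.userfunc is None:
            raise RuntimeError("runtime is not specialized")
        return self.userfunc(*args, **kwargs)
```

Constructing a second `RuntimeServer` with the same state file stands in for the container restart: the new instance is specialized again without any call from Fission.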