OOM killed environment should reload function after restart #218

Closed
4 tasks
sanketsudake opened this issue Jan 18, 2022 · 1 comment
Labels
env-go Go environment related issues/PR env-jvm Java/JVM environment related issues/PR env-nodejs NodeJS environment related issues/PR env-python Python environment related issues/PR

Comments

@sanketsudake
Member

Describe the bug

Parent Bug fission/fission#1874

If a pod is OOMKilled and its container restarts, requests routed to that pod typically fail with HTTP 502. This was observed in a customer's environment running a custom Python env built on a Debian image. It is not yet clear whether this is a fetcher, env, or executor issue; it needs to be investigated and fixed.

To Reproduce

fission/fission#1874 (comment)

Reproducing issue with pool manager:

  1. Create a kind k8s cluster & deploy fission
  2. Create env
fission env create --name python --image fission/python-env --mincpu 40 --maxcpu 80 --minmemory 64 --maxmemory 96 --poolsize 1 --version 3
  3. Create code file
$ cat code.py
def main():
    a = []
    while True:
        a.append(' ' * 10**6)  # This line triggers an OOM within seconds
    return "Done"
  4. Create fn
$ fission fn create --name oom-fn --env python --code code.py
$ fission env pods --name oom-fn
  5. On making a request, we can see that the function runtime container is restarted, whereas the fetcher container is not.
fetcher/0.log
python/0.log
python/1.log

From Fission's perspective, the fn runtime server (python container) is still assumed to be specialized.
But once the fn runtime server has restarted, it is no longer specialized, so a request made to the env server returns an error.

I am planning to try adding a signal handler in the python server to check whether we can detect exit signals before the OOM kill.

Additional context

fission/fission#1874 (comment)

What happens when a container within a pod gets killed due to OOM?

  • The container is sent SIGKILL and is forced to exit.
  • The container is stopped immediately and restarted, which frees the memory it had consumed.
  • A process can typically catch signals like SIGTERM or SIGINT and act on them for cleanup, but SIGKILL cannot be caught.
  • Unless we use proper monitoring or something like preoomkiller-controller, we can't identify an OOM situation in advance.
  • A Fission function pod typically has two containers: the function runtime container and the fetcher container. It is usually the function runtime container that hits OOM, so that is the one that gets restarted.
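To make the SIGTERM vs. SIGKILL point concrete, here is a minimal Python sketch, assuming a POSIX system. The helper name `can_install_handler` is made up for illustration and is not part of any Fission environment. It only checks whether the OS lets a process register a handler for a given signal.

```python
import signal


def can_install_handler(signum):
    """Return True if this process is allowed to register a handler for signum.

    Hypothetical helper for illustration only. Must be called from the
    main thread, because signal.signal() only works there.
    """
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # The OS refuses handlers for uncatchable signals such as SIGKILL.
        return False
    if previous is not None:
        # Put the earlier handler back so the check has no side effects.
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
    print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

Because the handler for SIGKILL can never be installed, a signal handler in the runtime server cannot observe the OOM kill itself; it could only react to earlier, catchable signals.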

Why do we get 502 after the function runtime container restarts?

  • Before the container gets killed due to OOM, it is generally a specialized container, i.e., the function is loaded into the container and it is serving requests.
  • Fission currently doesn't have a mechanism to continuously scan pods and identify whether one got restarted due to OOM.
  • The restart window is very small, so Fission doesn't even come to know that the pod was restarted.
  • Once a pod is specialized, Fission assumes the pod can serve the function until it gets cleaned up/deleted.
  • After the container restart, Fission still assumes the pod is specialized and keeps sending incoming requests to it. Since the function is not loaded after the container restart, it is a generic container and the requests fail.
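The mismatch described above can be modelled in a few lines. This is a toy sketch, not Fission code: `RuntimeContainer`, `Pod`, and the `marked_specialized` flag are hypothetical names, and the 502 returned here is only a stand-in for the failure seen in the issue.

```python
class RuntimeContainer:
    """Toy stand-in for the function runtime server (hypothetical)."""

    def __init__(self):
        self.userfunc = None  # a fresh container is generic

    def handle(self):
        if self.userfunc is None:
            # Stand-in for the error the client ends up seeing.
            return 502, "generic container, no function loaded"
        return 200, self.userfunc()


class Pod:
    """Toy pod: one runtime container plus the platform's belief about it."""

    def __init__(self):
        self.container = RuntimeContainer()
        self.marked_specialized = False  # what the platform remembers

    def specialize(self, func):
        self.container.userfunc = func
        self.marked_specialized = True

    def oom_restart(self):
        # The restart replaces the process, so in-memory state is lost,
        # but nothing updates the platform's record of the pod.
        self.container = RuntimeContainer()

    def handle(self):
        return self.container.handle()


if __name__ == "__main__":
    pod = Pod()
    pod.specialize(lambda: "hello")
    print(pod.handle())
    pod.oom_restart()
    print(pod.marked_specialized, pod.handle())
```

After `oom_restart()` the flag still says the pod is specialized while the container has gone back to generic, which is the state in which requests start failing.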

Possible approaches

  • a. Identify the OOM situation in advance and fix it by reducing the memory usage of the pod, or use something like the vertical pod autoscaler to increase limits for that particular pod.
  • b. After the restart, identify that the pod restarted due to OOM and take it out of service.
  • c. After the restart, restore state in the function pod by specializing it again, so it is ready to serve the function.

a. means more effort on the user's side, whereas b. would require a lot of additional continuous scanning on the Fission side just to identify the OOM scenario.
Both a. and b. try to build a solution at the Fission level, whereas c. is something that could be implemented at the environment level.
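For reference, the check that approach b. would have to run repeatedly can be sketched as follows. It assumes the pod is available as a dict shaped like the output of `kubectl get pod <name> -o json`; the function name `oom_killed_containers` and the sample data are made up for illustration.

```python
def oom_killed_containers(pod):
    """Return names of containers whose previous run ended with OOMKilled.

    `pod` is a dict shaped like a Kubernetes Pod object in JSON form.
    Hypothetical helper, not part of Fission.
    """
    names = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Hand-written sample resembling the pod from the reproduction steps.
SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {"name": "fetcher", "restartCount": 0, "lastState": {}},
            {
                "name": "python",
                "restartCount": 1,
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
        ]
    }
}

if __name__ == "__main__":
    print(oom_killed_containers(SAMPLE_POD))
```

The check itself is cheap; the cost mentioned above comes from having to run it continuously for every specialized pod.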

c. seems the easiest to get working quickly, since it lives in the environment.
I did a quick POC for this, and after the restart the function runtime was respecialized.
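One way approach c. could look is sketched below. This is not necessarily what the PR does. It assumes the specialization request can be saved to a file on a pod-level volume (for example an emptyDir, which survives container restarts), and `RuntimeServer`, `demo_loader`, and the `functionName` key are all hypothetical.

```python
import json
import os
import tempfile


def demo_loader(request):
    """Hypothetical loader: turn a specialization request into a callable."""
    name = request["functionName"]
    return lambda: "hello from " + name


class RuntimeServer:
    """Toy runtime server that replays its last specialization on startup."""

    def __init__(self, state_path, loader):
        self.state_path = state_path
        self.loader = loader
        self.userfunc = None
        # After a restart, a saved request means we were specialized before.
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.userfunc = loader(json.load(f))

    def specialize(self, request):
        self.userfunc = self.loader(request)
        # Write to a temp file and rename, so a kill mid-write cannot
        # leave a truncated state file behind.
        tmp_path = self.state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(request, f)
        os.replace(tmp_path, self.state_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as volume:
        path = os.path.join(volume, "specialization.json")
        server = RuntimeServer(path, demo_loader)
        server.specialize({"functionName": "oom-fn"})
        restarted = RuntimeServer(path, demo_loader)  # simulated restart
        print(restarted.userfunc())
```

Note that this only restores the ability to serve requests; a function that exhausts memory on every call, like the one in the reproduction steps, would still be killed again.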

Please check PR for more details.
#213

The same fix needs to be implemented across environments.

@sanketsudake sanketsudake added env-jvm Java/JVM environment related issues/PR env-go Go environment related issues/PR env-python Python environment related issues/PR env-nodejs NodeJS environment related issues/PR labels Sep 1, 2022
@soharab-ic
Contributor

Please try the latest environments, released with updated language versions.
