OOM killed environment should reload function after restart #218

sanketsudake · 2022-01-18T05:29:38Z

Describe the bug

If a pod is OOMKilled and restarted, requests routed to that pod typically fail with 502 errors. This was observed in a customer's environment running a custom Python env built on a Debian image. It is not clear whether this is a fetcher, env, or executor issue, but it needs to be investigated and fixed.

To Reproduce

fission/fission#1874 (comment)

Reproducing issue with pool manager:

Create kind k8s cluster & deploy fission
Create env

fission env create --name python --image fission/python-env --mincpu 40 --maxcpu 80 --minmemory 64 --maxmemory 96 --poolsize 1 --version 3

Create code file

$ cat code.py
def main():
    a = []
    while True:
        a.append(' ' * 10**6)  # allocates ~1 MB per iteration; triggers an OOM within seconds
    return "Done"  # never reached

Create fn

$ fission fn create --name oom-fn --env python --code code.py 
$ fission env pods --name oom-fn

On making a request, we can see that the function container is restarted, whereas the fetcher container is not.

fetcher/0.log
python/0.log
python/1.log

From Fission's perspective, the function runtime server (the python container) is assumed to be specialized.
But once the runtime server has restarted, it is no longer specialized, so a request made to the env server returns an error.

I am planning to try adding a signal handler in the Python server to check whether we can detect exit signals before the OOM kill.

Additional context

fission/fission#1874 (comment)

What happens when a container within a pod gets killed due to OOM?

The container is sent SIGKILL and is forced to exit.
The container stops immediately and is restarted, which frees the memory it had consumed.
A process can typically catch signals such as SIGTERM or SIGINT and act on them for cleanup, but SIGKILL cannot be caught.
Unless we use proper monitoring or something like preoomkiller-controller, we can't identify an OOM situation in advance.
A Fission function pod typically has two containers: the function runtime container and the fetcher container. It is usually the function runtime container that hits OOM, so that is the one that gets restarted.
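
The point about SIGKILL can be checked directly from Python. The snippet below is a minimal sketch, assuming a Linux container; `can_install_handler` is a hypothetical helper written for this illustration and is not part of any Fission environment.

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install a Python handler for signum."""
    def handler(received_signum, frame):
        # A real runtime would flush logs or persist state here.
        pass

    try:
        previous = signal.signal(signum, handler)
    except (OSError, ValueError):
        # OSError: the kernel refuses handlers for this signal (e.g. SIGKILL).
        # ValueError: signal.signal() was called outside the main thread.
        return False
    # Put the previous handler back. None means the old handler was not
    # installed from Python, so it cannot be restored this way.
    if previous is not None:
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGKILL):
        print(sig.name, "can be handled:", can_install_handler(sig))
```

Because no handler can be registered for SIGKILL, a signal handler alone cannot give the runtime a chance to save state at the moment of an OOM kill.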

Why do we get 502 after the function runtime container restarts?

Before the container gets killed due to OOM, it is generally a specialized container, i.e., the function is loaded into the container and it is serving requests.
Fission currently doesn't have a mechanism to continuously scan and identify whether a pod got restarted due to OOM.
The restart window is very small, so Fission doesn't even come to know that the pod was restarted.
Once a pod is specialized, Fission assumes the pod is able to serve the function until it gets cleaned up/deleted.
After the container restart, Fission still assumes the pod is specialized and keeps sending incoming requests to it. Since the function is not loaded after the container restart, it is a generic container and the requests fail.
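
The mismatch can be illustrated with a toy model. This is a sketch only: `RuntimeContainer` and `Caller` are hypothetical stand-ins invented for this example, not Fission classes, and the mapping of the failure to a 502 is simplified.

```python
class RuntimeContainer:
    """Toy stand-in for the function runtime container (not Fission code)."""

    def __init__(self):
        self.userfunc = None  # nothing loaded: a generic container

    def specialize(self, func):
        self.userfunc = func

    def restart(self):
        # An OOM kill wipes process memory, so the loaded function is gone.
        self.userfunc = None

    def handle(self):
        if self.userfunc is None:
            raise RuntimeError("generic container: no function loaded")
        return self.userfunc()


class Caller:
    """Toy stand-in for the Fission side, which remembers specialization."""

    def __init__(self, container):
        self.container = container
        self.believes_specialized = False

    def request(self, func):
        # Specialization happens once; afterwards the cached belief is trusted.
        if not self.believes_specialized:
            self.container.specialize(func)
            self.believes_specialized = True
        try:
            return 200, self.container.handle()
        except RuntimeError as exc:
            return 502, str(exc)


if __name__ == "__main__":
    container = RuntimeContainer()
    caller = Caller(container)
    print(caller.request(lambda: "hello"))  # specialized on first use
    container.restart()                     # OOM kill followed by restart
    print(caller.request(lambda: "hello"))  # stale belief, request fails
```

The second request fails because the only record of specialization lives on the caller's side, while the state it describes was lost with the container's memory.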

Possible approaches

a. Identify the OOM situation in advance and fix it by reducing the memory usage of the pod, or use something like the Vertical Pod Autoscaler to increase limits for the particular pod.
b. After the restart, identify that the pod restarted due to OOM and take it out of service.
c. After the restart, restore state in the function pod through specialization and make it ready to serve the function.

a. seems to be more effort on the user side, whereas b. would require a lot of additional scanning on the Fission side on a continuous basis just to identify the OOM scenario.
Both a. and b. try to build the solution at the Fission level, whereas c. is something that could be implemented at the environment level.
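
For approach b., Kubernetes records the reason for a container's last termination in the pod status, so the check itself is small; the cost lies in running it continuously for every pod. A minimal sketch, assuming a pod object shaped like the output of `kubectl get pod <name> -o json`. The helper name and the sample data are made up for illustration.

```python
def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled.

    `pod` is a dict shaped like the JSON of a Kubernetes Pod object.
    """
    names = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        terminated = (status.get("lastState") or {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Illustrative sample only; container names follow the log files listed above.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "python",
                "restartCount": 1,
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
            {"name": "fetcher", "restartCount": 0, "lastState": {}},
        ]
    }
}

if __name__ == "__main__":
    print(oom_killed_containers(sample_pod))
```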

c. seems easier to get working quickly in the environment.
I did a quick POC for this, and after a restart the function runtime was respecialized.
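
A rough sketch of the idea behind approach c., not the code from the PR: the runtime saves the specialize request when it is first specialized and replays it on startup if a saved copy exists. The function names, the state file, and the payload fields are assumptions for illustration. The state file is assumed to live on storage that outlives the container, such as a pod-level volume, since the container's own writable layer is lost on restart.

```python
import json
import os
import tempfile


def save_specialize_request(state_path, payload):
    """Persist the specialize payload so it survives a container restart."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, state_path)  # atomic rename: never a half-written file


def restore_if_needed(state_path, specialize):
    """On startup, replay the saved payload through `specialize` if one exists."""
    if not os.path.exists(state_path):
        return False  # fresh generic container, nothing to restore
    with open(state_path) as f:
        payload = json.load(f)
    specialize(payload)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as volume:  # stands in for a shared volume
        state_path = os.path.join(volume, "specialize.json")
        loaded = []

        # First start: nothing saved yet, the container stays generic.
        print(restore_if_needed(state_path, loaded.append))

        # Specialization arrives; the (hypothetical) payload is saved.
        save_specialize_request(state_path, {"functionName": "main"})

        # After a restart: the saved payload is replayed.
        print(restore_if_needed(state_path, loaded.append), loaded)
```

With this in place the pod becomes able to serve the function again without Fission having to notice the restart.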

Please check PR for more details.
#213

The same fix needs to be implemented across environments.

python | Restore python environment state after container restart due to OOM #213
go |
nodejs |
java |


soharab-ic · 2024-09-06T10:30:27Z

Please try the latest environments, released with updated language versions.

sanketsudake mentioned this issue Jan 18, 2022

OOMKilled pod starts dropping request fission/fission#1874

Closed

sanketsudake added env-jvm Java/JVM environment related issues/PR env-go Go environment related issues/PR env-python Python environment related issues/PR env-nodejs NodeJS environment related issues/PR labels Sep 1, 2022

soharab-ic closed this as completed Sep 6, 2024
