Is dataflow job submission still broken? #220
Members of the pangeo-forge Heroku team can get a shell on the running production container with:
I've just done this and confirmed that there is an importable installation of
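For anyone repeating this check, here is a minimal sketch of how one might confirm from a Python session that a package is importable and see which version is installed. The helper name `describe_install` and the module name passed to it are placeholders for illustration, not something taken from the production container.

```python
import importlib.util
from importlib import metadata


def describe_install(module_name, dist_name=None):
    """Report whether a top-level module is importable and, if known, its version.

    `dist_name` is the distribution name to look up when it differs from the
    import name; both arguments are illustrative placeholders.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return {"importable": False, "version": None, "location": None}
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        # Standard-library modules, for example, carry no distribution metadata.
        version = None
    return {"importable": True, "version": version, "location": spec.origin}


if __name__ == "__main__":
    # "json" is used only because it is always present; substitute the
    # package under investigation.
    print(describe_install("json"))
```

The `location` field can help spot the case where the import resolves to an unexpected installation path.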
This type of question, "Why did a particular job submission fail?", comes up often in our backend service. As is the case here, with our current logging infrastructure it is typically not immediately clear whether the failure stems from a general problem in the backend service or is specific to this particular recipe. Typically I have been re-running these
I've just done this for this job and I was able to submit the job from my local machine, using the same command as was run in production. So this would appear to be a problem with the production environment after all.
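When a command works locally but not in production, one way to narrow things down is to compare the pinned package versions in the two environments. The sketch below assumes you have captured `pip freeze` output from each environment as text; the function names and the sample package names are made up for illustration and are not real data from this deployment.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style text into a {name: version} dict.

    Only simple `name==version` lines are handled; comments, blank lines,
    and other requirement forms (e.g. editable installs) are skipped.
    """
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(local, production):
    """Return {name: (local_version, production_version)} where they differ."""
    diffs = {}
    for name in sorted(set(local) | set(production)):
        a, b = local.get(name), production.get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real environment.
    local_text = "example-pkg==1.0.0\nother-pkg==2.1.0\n"
    prod_text = "example-pkg==1.0.0\nother-pkg==2.3.0\nextra-pkg==0.4\n"
    for name, (a, b) in diff_pins(parse_freeze(local_text), parse_freeze(prod_text)).items():
        print(f"{name}: local={a} production={b}")
```

A `None` on either side means the package is pinned in only one of the two environments.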
Based on these logs following the test run triggered by pangeo-forge/staged-recipes#247 (comment)
I believe this particular recipe is causing the worker to run out of memory (OOM) before it can finish submitting the job to Dataflow. I'm going to hypothesize, then, that this is not some more general bug in the production deployment, but rather something specific to this recipe. Will do a bit more digging before closing this issue.
When I left off with this last week, I'd thought that the one feedstock where this behavior was observed was an outlier due to OOM issues. Perhaps that is the case, but following the failed test run deployment from pangeo-forge/staged-recipes#245 (comment), it seems clear that there is some more general issue at play. Here's the server trace following that slash command:
I do not yet have a working theory of what this issue is, and I believe we've hit the point at which manually fixing these issues, without true integration tests that can reproduce them, is yielding diminishing returns for the project. I am going to open a separate issue on that, and then proceed in that direction.
Today I re-deployed the backend service via #218, which should have reverted the changes made last week that caused Dataflow job submission problems.
While certain issues may have been resolved, some issue seems to be persisting, as pangeo-forge/staged-recipes#247 (comment) did not succeed in deploying a job. Here are the backend logs following that comment:
Looks like something having to do with the version of beam?