-
Notifications
You must be signed in to change notification settings - Fork 242
New AWS jobstore. #5123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
New AWS jobstore. #5123
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this probably works as written, but the documentation for how the S3-backed storage works (metadata objects named by file IDs referencing ETag hashes for content-addressed data objects many-to-one) doesn't match what's actually implemented (metadata objects and data objects both named by file ID).
It might be good to actually implement the described content-addressing design, but there are are some big unsolved questions:
- ETags on S3 are all MD5-based and not collision-resistant, so it's easy to get two actual files with the same MD5 and put them in the job store to break it.
- Streaming uploads don't know their hash until they are uploaded, so where should they be uploaded to?
- Without transactions, how do you maintain the consistency of many-to-one relationships between metadata objects and data objects, while still allowing data objects to be deleted when the last referencing metadata object goes away?
It might make sense to not solve this deduplication problem now, and to just admit to using one data copy per file ID as is actually implemented. If we cram executability into the ID somehow, and use the ETag provided by S3 in the response header for integrity checking, it might be possible to not have a metadata file at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is way better!
I did notice that the "existing" key getter function was used in a couple places where the key won't actually necessarily exist, which I don't think will work.
And I don't think we want to comment out the contents of docstrings.
But those are probably the only things I think really need to be changed.
@@ -110,7 +110,7 @@ class NoSuchFileException(Exception): | |||
"""Indicates that the specified file does not exist.""" | |||
|
|||
def __init__( | |||
self, jobStoreFileID: FileID, customName: Optional[str] = None, *extra: Any | |||
self, jobStoreFileID: Union[FileID, str], customName: Optional[str] = None, *extra: Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason to say this can be a string now? Is it too hard to drag the typed file ID object through in the new implementation for some good reason?
When this is a string, is this meant to be the string-packed version of the file ID? Or just the ID part without e.g. the size packed in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess some of the user-facing file ID API still accepts file IDs typed as strings, so maybe we do still need to be able to take them here.
@@ -1502,6 +1502,7 @@ def read_file_stream( | |||
) -> ContextManager[IO[str]]: ... | |||
|
|||
@abstractmethod | |||
@contextmanager # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we really only need/want @contextmanager
on the implementation? It doesn't really help on the unimplemented stub, and it apparently means we have to break out of typing.
# 4. Shared Files: These are a small set of special files. Most are needed by all jobs: | ||
# * environment.pickle (environment variables) | ||
# * config.pickle (user options) | ||
# * pid.log (process ID of the workflow; when it finishes, the workflow either succeeded/failed) | ||
# * userScript (hot deployment; this is the job module) | ||
# * rootJobReturnValue (workflow succeeded or not) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite right; Python workflows have access to user-defined file names in here, which we just hope don't collide with any that Toil uses internally.
# 1. AWS s3 has strong consistency. | ||
# 2. s3's filter/query speed is pretty good. | ||
# However, there may be reasons in the future to provide users with a database: | ||
# * s3 throttling has limits (3,500/5,000 requests; something like dynamodb supports 100,000+ requests). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these per second?
# WARNING: Etag values differ for the same file when the part size changes, so part size should always | ||
# be Set In Stone, unless we hit s3's 10,000 part limit, and we need to account for that. | ||
# | ||
# - This class inherits self.config only when initialized/restarted and is None upon class instantiation. These |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "inherits" is the right word; that makes it sound like it is coming from a base class but only sometimes.
# content_type = response['ContentType'] # e.g. "binary/octet-stream" | ||
# etag = response['ETag'].strip('\"') # e.g. "\"586af4cbd7416e6aefd35ccef9cbd7c8\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably these can just be cut.
# content_type = response['ContentType'] # e.g. "binary/octet-stream" | |
# etag = response['ETag'].strip('\"') # e.g. "\"586af4cbd7416e6aefd35ccef9cbd7c8\"" |
) | ||
# TODO: verify etag after copying here? | ||
|
||
# cannot determine exec bit from foreign s3 so default to False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment needs to go up where the 0
is now.
RuntimeError: Hello, world! | ||
>>> y = os.dup(0); os.close(y); x == y | ||
True | ||
# An object-oriented wrapper for os.pipe. Clients should subclass it, implement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense to try to have a big comment inside a docstring like this.
os.close(self.readable_fh) | ||
except OSError as e: | ||
# OSError: [Errno 9] Bad file descriptor implies this file handle is already closed | ||
if not e.errno == 9: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be errno.EBADF
and not a magic 9.
if not e.errno == 9: | |
if not e.errno == errno.EBADF: |
@@ -482,7 +482,7 @@ def needs_aws_batch(test_item: MT) -> MT: | |||
test_item | |||
) | |||
test_item = needs_env_var( | |||
"TOIL_AWS_BATCH_JOB_ROLE_ARN", "an IAM role ARN that grants S3 and SDB access" | |||
"TOIL_AWS_BATCH_JOB_ROLE_ARN", "an IAM role ARN that grants S3 access" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also revise the role permissions in the provisioner? Or make a new issue for that?
Rewritten without SDB.
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
issues/XXXX-fix-the-thing
in the Toil repo, or from an external repo.camelCase
that want to be insnake_case
.docs/running/{cliOptions,cwl,wdl}.rst
Merger Checklist