
New AWS jobstore. #5123


Open · wants to merge 17 commits into master

Conversation

@DailyDreaming (Member) commented Oct 14, 2024

Rewritten without SDB.

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@DailyDreaming DailyDreaming self-assigned this Oct 14, 2024
@adamnovak (Member) left a comment

It looks like this probably works as written, but the documentation for how the S3-backed storage works (metadata objects named by file IDs referencing ETag hashes for content-addressed data objects many-to-one) doesn't match what's actually implemented (metadata objects and data objects both named by file ID).

It might be good to actually implement the described content-addressing design, but there are some big unsolved questions:

  • ETags on S3 are all MD5-based and not collision-resistant, so it's easy to get two actual files with the same MD5 and put them in the job store to break it.
  • Streaming uploads don't know their hash until they are uploaded, so where should they be uploaded to?
  • Without transactions, how do you maintain the consistency of many-to-one relationships between metadata objects and data objects, while still allowing data objects to be deleted when the last referencing metadata object goes away?

It might make sense to not solve this deduplication problem now, and to just admit to using one data copy per file ID as is actually implemented. If we cram executability into the ID somehow, and use the ETag provided by S3 in the response header for integrity checking, it might be possible to not have a metadata file at all.
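To illustrate the last idea: for non-multipart uploads, S3's ETag is the plain hex MD5 of the object body, so a download could be integrity-checked against the ETag from the response header with no separate metadata file. A minimal sketch of that check (pure hashlib, no AWS calls; the function names are hypothetical, not part of this PR):

```python
import hashlib

def etag_for_single_part(data: bytes) -> str:
    """For non-multipart uploads, S3's ETag is the hex MD5 of the body."""
    return hashlib.md5(data).hexdigest()

def verify_download(data: bytes, response_etag: str) -> None:
    """Compare the (quoted) ETag S3 returned to a locally computed MD5.

    Only valid for single-part uploads; multipart ETags carry a '-N'
    suffix and are not a plain MD5 of the content.
    """
    expected = response_etag.strip('"')
    if "-" in expected:
        raise ValueError("multipart ETag; cannot verify with a plain MD5")
    actual = etag_for_single_part(data)
    if actual != expected:
        raise IOError(f"ETag mismatch: {actual} != {expected}")
```

As the first bullet notes, MD5 is not collision-resistant, so this detects corruption but not deliberately colliding inputs.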

@DailyDreaming DailyDreaming requested a review from adamnovak March 31, 2025 19:23
@DailyDreaming DailyDreaming marked this pull request as ready for review April 1, 2025 17:18
@adamnovak (Member) left a comment

I think this is way better!

I did notice that the "existing" key getter function was used in a couple places where the key won't actually necessarily exist, which I don't think will work.

And I don't think we want to comment out the contents of docstrings.

But those are probably the only things I think really need to be changed.

@@ -110,7 +110,7 @@ class NoSuchFileException(Exception):
     """Indicates that the specified file does not exist."""

     def __init__(
-        self, jobStoreFileID: FileID, customName: Optional[str] = None, *extra: Any
+        self, jobStoreFileID: Union[FileID, str], customName: Optional[str] = None, *extra: Any
Member left a comment:

Is there a good reason to say this can be a string now? Is it too hard to drag the typed file ID object through in the new implementation for some good reason?

When this is a string, is this meant to be the string-packed version of the file ID? Or just the ID part without e.g. the size packed in?

Member left a comment:

I guess some of the user-facing file ID API still accepts file IDs typed as strings, so maybe we do still need to be able to take them here.
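For context on the distinction the question draws, here is a hypothetical sketch (not Toil's actual `FileID` class, which has more fields) of a typed file ID versus its string-packed form, where the packed form carries extra metadata such as the size:

```python
class FileID(str):
    """A file ID that is a str subclass carrying an extra size attribute.

    Hypothetical illustration only: the bare string is the ID part, while
    pack()/unpack() round-trip the size through a string representation.
    """

    def __new__(cls, file_store_id: str, size: int) -> "FileID":
        return super().__new__(cls, file_store_id)

    def __init__(self, file_store_id: str, size: int) -> None:
        self.size = size

    def pack(self) -> str:
        """String-packed form: the ID with the size packed in."""
        return f"{self.size}:{str(self)}"

    @classmethod
    def unpack(cls, packed: str) -> "FileID":
        """Recover a typed FileID from its string-packed form."""
        size, _, file_store_id = packed.partition(":")
        return cls(file_store_id, int(size))
```

An API that accepts `Union[FileID, str]` has to decide which of these two string forms a bare `str` argument is meant to be, which is exactly the ambiguity raised above.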

@@ -1502,6 +1502,7 @@ def read_file_stream(
     ) -> ContextManager[IO[str]]: ...

     @abstractmethod
+    @contextmanager  # type: ignore
Member left a comment:

Don't we really only need/want @contextmanager on the implementation? It doesn't really help on the unimplemented stub, and it apparently means we have to break out of typing.
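A minimal sketch of that suggestion (hypothetical class and method names): leave the abstract stub undecorated, typed as returning a context manager, and apply @contextmanager only on the concrete generator implementation, so no `# type: ignore` is needed:

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import IO, ContextManager, Iterator

class AbstractStore(ABC):
    @abstractmethod
    def read_file_stream(self, path: str) -> ContextManager[IO[bytes]]:
        """Return a context manager yielding a readable stream for the file."""
        ...

class LocalStore(AbstractStore):
    @contextmanager  # only the implementation needs the decorator
    def read_file_stream(self, path: str) -> Iterator[IO[bytes]]:
        with open(path, "rb") as stream:
            yield stream
```

The decorated implementation still satisfies the `ContextManager[...]` return type of the stub, since `@contextmanager` turns the generator into one.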

Comment on lines +108 to +113
# 4. Shared Files: These are a small set of special files. Most are needed by all jobs:
# * environment.pickle (environment variables)
# * config.pickle (user options)
# * pid.log (process ID of the workflow; when it finishes, the workflow either succeeded/failed)
# * userScript (hot deployment; this is the job module)
# * rootJobReturnValue (workflow succeeded or not)
Member left a comment:

This isn't quite right; Python workflows have access to user-defined file names in here, which we just hope don't collide with any that Toil uses internally.

# 1. AWS s3 has strong consistency.
# 2. s3's filter/query speed is pretty good.
# However, there may be reasons in the future to provide users with a database:
# * s3 throttling has limits (3,500/5,000 requests; something like dynamodb supports 100,000+ requests).
Member left a comment:

Are these per second?

# WARNING: Etag values differ for the same file when the part size changes, so part size should always
# be Set In Stone, unless we hit s3's 10,000 part limit, and we need to account for that.
#
# - This class inherits self.config only when initialized/restarted and is None upon class instantiation. These
Member left a comment:

I don't think "inherits" is the right word; that makes it sound like it is coming from a base class but only sometimes.
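The part-size warning quoted above can be made concrete: for multipart uploads, S3's ETag is the MD5 of the concatenated binary part MD5s, suffixed with `-<part count>`, so the same bytes produce different ETags at different part sizes. A sketch of that computation (pure hashlib, no AWS calls):

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute the ETag S3 assigns to a multipart upload of `data`.

    The ETag is md5(concat(md5(part_i) for each part)), hex-encoded,
    followed by "-" and the number of parts.
    """
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Because the part boundaries feed into the hash, ETag comparisons are only meaningful when the part size stays fixed, which is the reason it should be "Set In Stone".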

Comment on lines +553 to +554
# content_type = response['ContentType'] # e.g. "binary/octet-stream"
# etag = response['ETag'].strip('\"') # e.g. "\"586af4cbd7416e6aefd35ccef9cbd7c8\""
Member left a comment:

Probably these can just be cut.

Suggested change
-# content_type = response['ContentType'] # e.g. "binary/octet-stream"
-# etag = response['ETag'].strip('\"') # e.g. "\"586af4cbd7416e6aefd35ccef9cbd7c8\""

)
# TODO: verify etag after copying here?

# cannot determine exec bit from foreign s3 so default to False
Member left a comment:

This comment needs to go up where the 0 is now.

RuntimeError: Hello, world!
>>> y = os.dup(0); os.close(y); x == y
True
# An object-oriented wrapper for os.pipe. Clients should subclass it, implement
Member left a comment:

I don't think it makes sense to try to have a big comment inside a docstring like this.

os.close(self.readable_fh)
except OSError as e:
# OSError: [Errno 9] Bad file descriptor implies this file handle is already closed
if not e.errno == 9:
Member left a comment:

This should be errno.EBADF and not a magic 9.

Suggested change
-if not e.errno == 9:
+if not e.errno == errno.EBADF:
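The suggested pattern in full, as a sketch (assuming, as in the quoted code, that a second close of an already-closed descriptor should be silently ignored):

```python
import errno
import os

def close_quietly(fd: int) -> None:
    """Close a file descriptor, tolerating one that is already closed."""
    try:
        os.close(fd)
    except OSError as e:
        # EBADF means the descriptor was already closed;
        # any other OSError is a real problem and should propagate.
        if e.errno != errno.EBADF:
            raise
```

Using `errno.EBADF` instead of a magic `9` keeps the intent readable and portable.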

@@ -482,7 +482,7 @@ def needs_aws_batch(test_item: MT) -> MT:
         test_item
     )
     test_item = needs_env_var(
-        "TOIL_AWS_BATCH_JOB_ROLE_ARN", "an IAM role ARN that grants S3 and SDB access"
+        "TOIL_AWS_BATCH_JOB_ROLE_ARN", "an IAM role ARN that grants S3 access"
Member left a comment:

Should we also revise the role permissions in the provisioner? Or make a new issue for that?

This was referenced Apr 18, 2025