feat: cas scaling #1189

Draft · wants to merge 13 commits into develop

Conversation

@smrz2001 (Contributor) commented Mar 26, 2024

This PR makes two main changes:

  • Puts IPFS store operations behind a new feature flag (CAS_USE_IPFS_STORAGE).
  • Detects whether a batch queue message was received without request IDs and, if so, pulls the batch from S3. This prevents a large batch from running into the 256KB SQS message size limit (see the sketch below).

These changes are backwards compatible. Once ready, they should be merged before the PR that updates the Scheduler to generate/store larger batches.
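A minimal sketch of the S3 fallback, assuming AWS SDK v3 for JavaScript; the BatchMessage shape, the bucket/key layout, and the loadBatchRequestIds helper are illustrative assumptions, not the actual CAS types:

```ts
// Minimal sketch (assumed shapes/names): if the SQS payload arrives without
// request IDs, treat it as a pointer and fetch the full batch body from S3.
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

interface BatchMessage {
  bid: string      // batch ID
  rids?: string[]  // request IDs; omitted when the batch would exceed 256KB
}

const s3 = new S3Client({})

async function loadBatchRequestIds(message: BatchMessage, bucket: string): Promise<string[]> {
  if (message.rids && message.rids.length > 0) {
    // Small batch: the request IDs fit inside the SQS message itself
    return message.rids
  }
  // Large batch: the queue message carries no request IDs, so pull the batch from S3
  const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: message.bid }))
  const body = await res.Body!.transformToString()
  return (JSON.parse(body) as BatchMessage).rids ?? []
}
```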

@smrz2001 self-assigned this Mar 26, 2024

@stbrody (Contributor) commented Mar 27, 2024

Can we split CAS_USE_IPFS into two env vars? One for publishing to pubsub and another for storing the commits themselves? I think it would be good to turn those off one at a time to help us detect if there are issues along the way

@smrz2001 (Contributor, Author)

> Can we split CAS_USE_IPFS into two env vars? One for publishing to pubsub and another for storing the commits themselves? I think it would be good to turn those off one at a time to help us detect if there are issues along the way

I didn't think we'd want to apply the load of either publishing or storing commits to IPFS. I feel like that's just asking for (now gigantic) failed batches due to IPFS issues. Do we really care to spend time figuring out how IPFS does under either type of load? IMO, cutting IPFS out of the picture will be essential for scaling CAS.

@stbrody (Contributor) commented Mar 27, 2024

I definitely agree that the goal is to stop publishing to ipfs entirely. I'm just saying let's wind down our use of ipfs in two phases. First, let's stop publishing notifications of the anchor commits to pubsub, while continuing to create them in ipfs so that they are available to bitswap. Then, if we see no problems with that, we can stop creating the anchor commits against the CAS IPFS entirely.
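Below is a minimal sketch of that two-phase wind-down, assuming two independent env vars. CAS_USE_IPFS_STORAGE is the flag this PR introduces; CAS_USE_PUBSUB_PUBLISHING and both helper functions are hypothetical names for illustration, not the actual CAS code.

```ts
// Minimal sketch, not the actual CAS code: CAS_USE_PUBSUB_PUBLISHING and both
// helper functions are hypothetical; only CAS_USE_IPFS_STORAGE exists in this PR.
const useIpfsStorage = process.env.CAS_USE_IPFS_STORAGE === 'true'
const usePubsubPublishing = process.env.CAS_USE_PUBSUB_PUBLISHING === 'true'

// Placeholder implementations so the sketch is self-contained
async function storeCommitInIpfs(commit: unknown): Promise<void> {}
async function publishCommitToPubsub(commit: unknown): Promise<void> {}

async function handleAnchorCommit(commit: unknown): Promise<void> {
  if (useIpfsStorage) {
    // Phase 2 turns this off: stop writing anchor commits to the CAS IPFS node
    await storeCommitInIpfs(commit)
  }
  if (useIpfsStorage && usePubsubPublishing) {
    // Phase 1 turns this off first: stop announcing anchor commits over pubsub,
    // while commits are still written to IPFS and available via bitswap
    await publishCommitToPubsub(commit)
  }
}
```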

@smrz2001 (Contributor, Author)

> I definitely agree that the goal is to stop publishing to ipfs entirely. I'm just saying let's wind down our use of ipfs in two phases. First, let's stop publishing notifications of the anchor commits to pubsub, while continuing to create them in ipfs so that they are available to bitswap. Then, if we see no problems with that, we can stop creating the anchor commits against the CAS IPFS entirely.

I'll add the second flag, though I'm quite certain we'll be turning it off right about the time we start benchmarking CAS with 1M-sized batches 😉

I'd also posit a different way of looking at this. Our decision to keep IPFS a part of the CAS architecture shouldn't be based on whether or not IPFS has issues during testing, but on whether it's more or less likely to have issues with larger batches. I'm certain that it is much more likely to run into issues with larger batches that we'll then have to mitigate (e.g. writing intermediate Merkle-tree nodes might be too much too fast and need additional pacing, etc.).

In other words, even if it sails through all of our benchmarking, I would still recommend turning IPFS off completely for production. The fact that it works now under high load (I'm skeptical) doesn't mean that it will remain stable in production, even under the same type of load.

@stbrody (Contributor) commented Mar 28, 2024

Sorry, when I talked about making decisions based on whether or not there are issues, I wasn't saying that we shouldn't remove IPFS unless it presents performance issues. I want to remove IPFS ASAP. What I was saying is that I'm concerned that once we remove it, we'll see issues in the field where users aren't getting anchor commits. I'm concerned about any lingering bugs in the anchor polling system that delivers anchor commits via CAR files, because up until now we've always had pubsub and bitswap as a backup, so if that isn't totally working we might never have noticed. I'm also worried about users in the wild running old versions of js-ceramic that don't even have the new polling and CAR file features, and who might not upgrade or do anything until their writes suddenly start failing with CACAO errors due to missing anchor commits once we turn off publishing of anchors to IPFS.

That is the reason I want to do this in two phases: it gives users in the wild time to report issues to us if they stop getting anchor commits, and it gives us a clearer picture of where things are breaking down if that happens.

@smrz2001 (Contributor, Author) commented Mar 28, 2024

Yeah, that makes sense, thanks for clarifying. If we have to choose, we can also always set up separate (dedicated) CAS clusters for V' partners that don't use IPFS at all, and leave the current CAS with IPFS.

@@ -55,6 +55,7 @@ export class AnchorRepository implements IAnchorRepository {
   async findByRequestId(requestId: string): Promise<StoredAnchor | null> {
     const row = await this.table.where({ requestId: requestId }).first()
     if (!row) return null
+    console.log("IM HERE with ", row, requestId)
@smrz2001 (Contributor, Author)
Should we take this log out, @JulissaDantes? Might flood the logs at high throughput.

@JulissaDantes
You are completely right. Thank you for pointing this out!
