[hma] Implement Bank & content-level disable #1682

Open
14 tasks
Dcallies opened this issue Nov 6, 2024 · 2 comments · May be fixed by #1732

Comments

@Dcallies
Contributor

Dcallies commented Nov 6, 2024

enabled_ratio: Mapped[float] = mapped_column(default=1.0)

disable_until_ts: Mapped[int] = mapped_column(default=BankContentConfig.ENABLED)

Both Bank and BankContent have database fields that allow them to be disabled. However, those fields may be neither settable nor read today. Banks can be ramped up fractionally, and BankContent can be disabled for a period of time.

This is a multi-stage feature issue. Here are roughly the stages:

  • Confirm that an API exists (under curator role) that allows setting Bank and BankContent disable states (implement if not)
    • Bank enabled_ratio
    • BankContent disable_until_ts
  • Implement fractional matching for Bank
    • If the bank is 0% enabled, it should not contribute its hashes to the index (skipped during indexing)
      • Implement skip in index_build
    • If the bank is partially enabled (0 < enabled_ratio < 1, i.e. between 0% and 100%), then during resolution to the bank, a coinflip should be made to determine whether this lookup counts as a match. This coinflip should be stable (i.e. you get the same answer each time). To make it stable, you can digest the signal string to a value between 0 and 1, and compare that to the enabled ratio.
      • Additionally, we should add an optional content_id string field to the request, which if provided, should be the source of the coinflip seed instead.
  • Implement time disabled for BankContent
    • Add a constant which represents "permanently disabled", which should not contribute its hashes to the index (skipped during indexing)
      • Implement skip in index_build
    • If (0 < disable_until_ts < PERMANENTLY_DISABLED), then the request time should be compared against the disable timestamp, and the content should not be considered a match if the request time is before the disable timestamp
    • Raw lookup should not do this check (it's meant to be ~direct access to the index)
  • Unittest everything
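
For the fractional matching stage, here is a minimal sketch of what a stable coinflip could look like. The function names `stable_fraction` and `bank_is_enabled_for` are hypothetical and not part of the existing codebase, and SHA-256 is just one reasonable choice of digest. It assumes `enabled_ratio` is a float between 0.0 and 1.0, as in the column definition above.

```python
import hashlib
from typing import Optional


def stable_fraction(seed: str) -> float:
    """Deterministically map a string to a float in [0, 1).

    Hypothetical helper: takes the top 53 bits of a SHA-256 digest so the
    result is exactly representable as a float and never reaches 1.0.
    """
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    top_53_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_53_bits / float(1 << 53)


def bank_is_enabled_for(
    enabled_ratio: float, signal: str, content_id: Optional[str] = None
) -> bool:
    """Decide whether a lookup should count as a match for this bank.

    The seed is content_id when provided, otherwise the signal string,
    so the same request always gets the same answer.
    """
    if enabled_ratio <= 0.0:
        return False  # fully disabled; ideally already skipped at index time
    if enabled_ratio >= 1.0:
        return True
    seed = content_id if content_id is not None else signal
    return stable_fraction(seed) < enabled_ratio
```

One property of this approach is that raising `enabled_ratio` only ever adds matches: any seed that matched at a lower ratio still matches at a higher one, which makes ramp-ups predictable.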
@aryzle
Collaborator

aryzle commented Dec 29, 2024

Hey @Dcallies I wanna make sure I'm understanding the concept of Bank, BankContent, and ContentSignal... looking at the db diagram it doesn't seem like ContentSignal was included yet. I get Bank and BankContent, so is ContentSignal different signal type/value pairs for a specific piece of content (url, file)?

I see BankContent (disable_until_ts) and Bank (enabled_ratio) have a way to be disabled already, so is the goal here to do the same on ContentSignal?

@Dcallies
Contributor Author

Dcallies commented Dec 30, 2024

Hey @aryzle , good question, and I note that we didn't add documentation to any of these classes to help answer it in the code itself, which is where I'd prefer the answer to live!

  • Bank: Conceptually, a collection of content that has been labeled with similar labels. Matches to the contents of this bank should be classified with those labels. Basically a folder.
  • BankContent: A single piece of content that has been labeled. Due to data retention limits for harmful content, and hash sharing, this may no longer point to any original content, but it still represents the idea of a single piece of content.
  • ContentSignal: The signals for a single piece of labeled content. We could have also called this BankContentSignal.

Matching only takes place on signals - during the lookup operation, we find matching signals and return ids corresponding to the BankContent, which further resolve to the banks themselves, which in turn essentially return the classification labels.
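
As a rough illustration of that resolution chain, here is a toy in-memory sketch. All names and data here are hypothetical, and an exact-match dictionary stands in for the real similarity index:

```python
from typing import Dict, List, Set

# Hypothetical toy data: signal value -> ids of the BankContent it belongs to
signal_index: Dict[str, List[int]] = {"hash-a": [1], "hash-b": [1, 2]}
# BankContent id -> name of the owning Bank
content_to_bank: Dict[int, str] = {1: "BANK_X", 2: "BANK_Y"}
# Bank name -> classification labels
bank_labels: Dict[str, Set[str]] = {"BANK_X": {"label_x"}, "BANK_Y": {"label_y"}}


def lookup_labels(signal: str) -> Set[str]:
    """Resolve signal -> BankContent ids -> Banks -> labels."""
    labels: Set[str] = set()
    for content_id in signal_index.get(signal, []):
        bank_name = content_to_bank[content_id]
        labels |= bank_labels[bank_name]
    return labels
```

The disable checks discussed in this issue would slot into the middle of that loop: one at the BankContent step and one at the Bank step.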

> I see BankContent (disable_until_ts) and Bank (enabled_ratio) have a way to be disabled already, so is the goal here to do the same on ContentSignal?

Nope, we only need the ability to disable BankContent - but the functionality is unimplemented! We need:

  1. An API that allows setting disable
  2. The disable state to be read during matching, to ignore it during lookup
  3. The disable state to be read during indexing, to not add it to the index
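
To make points 2 and 3 concrete, here is a minimal sketch of the two checks for BankContent. The function names are hypothetical, the `PERMANENTLY_DISABLED` value is a placeholder for the constant the issue proposes adding, and `ENABLED = 0` is an assumption inferred from the `0 < disable_until_ts` range in the issue rather than taken from the code:

```python
import time
from typing import Optional

ENABLED = 0  # assumption: stands in for BankContentConfig.ENABLED
PERMANENTLY_DISABLED = 2**63 - 1  # hypothetical sentinel value


def include_in_index(disable_until_ts: int) -> bool:
    """Indexing-time check: only permanently disabled content is skipped."""
    return disable_until_ts != PERMANENTLY_DISABLED


def is_match_allowed(disable_until_ts: int, now_ts: Optional[int] = None) -> bool:
    """Match-time check: not a match if the request time is before the timestamp."""
    if disable_until_ts == PERMANENTLY_DISABLED:
        return False
    if now_ts is None:
        now_ts = int(time.time())
    return now_ts >= disable_until_ts
```

Temporarily disabled content stays in the index in this sketch, so it becomes matchable again as soon as the timestamp passes without needing a rebuild. Raw lookup would bypass `is_match_allowed` entirely.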

@Dcallies Dcallies changed the title [hma] Implement content-level disable [hma] Implement Bank & content-level disable Dec 30, 2024
@aryzle aryzle linked a pull request Jan 11, 2025 that will close this issue
1 task