CandidateExtractor doesn't scale for larger relations #546

robbieculkin opened this issue Sep 2, 2021 · 1 comment

Hello, thanks for providing this framework. My group has run into a bit of a snag:

For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

We've since moved on to a 3-ary relation that is very similar to the binary relation. This 3-ary relation shares some mentions with the binary relation and uses a very similar candidate extractor. We profiled the 3-ary throttler function and found its per-call runtime to be very close to the binary throttler's. Yet candidate extraction now takes ~4 hours per document. This immense slowdown happens because the number of temporary candidates is the Cartesian product of the per-type mention counts, so it grows multiplicatively with every entity added to the relation.

Here are some numbers from our use case:

  • Mention A: 800 mentions found
  • Mention B: 140 mentions found
  • Mention C: 150 mentions found

If our relation only includes (A,B), we have a total of 800*140 = 112,000 temporary candidates to evaluate with our throttler. Should we add mention C to form the relation (A,B,C), our total now grows to 800*140*150 = 16.8 million temporary candidates. We're unable to narrow our mention matchers further without excluding true positives.
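
As a quick sanity check of that arithmetic (just the counts above, nothing else assumed):

# number of temporary candidates = product of per-type mention counts
n_a, n_b, n_c = 800, 140, 150
print(n_a * n_b)        # 112000 for the binary relation (A, B)
print(n_a * n_b * n_c)  # 16800000 for the 3-ary relation (A, B, C)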

This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?

@robbieculkin
Author

The first workaround that comes to mind is to override the CandidateExtractorUDF.apply method, where temporary candidates are formed as the product of all of a document's mentions:

cands = product(
    *[
        enumerate(
            # a list of mentions for each mention subclass within a doc
            getattr(doc, mention.__tablename__ + "s")
            + ([None] if nullable else [])
        )
        for mention, nullable in zip(
            candidate_class.mentions, candidate_class.nullables
        )
    ]
)

Grouping mentions by their contexts and forming products only within those groups would reduce the number of temporary candidates passed to the throttler. This restriction could be applied at any level of Fonduer's data model hierarchy (Paragraph, Sentence, Table, ...).

For example, we can

  1. select all mentions within a given page,
  2. take their Cartesian product within that page,
  3. then combine the results across all pages.

This way, we eliminate all temporary candidates that span different pages (when that is what the programmer indicates). It is preferable to filtering with a same_page throttler because candidates that would fail the check are never formed in the first place, so far fewer temporary candidates reach the throttler at all. Ultimately, the programmer could pass a shared_context parameter to the .apply method to indicate the right level of the hierarchy.
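
A rough, untested sketch of what such an overridden apply could do instead of the full per-document product. Here context_key is a hypothetical callable mapping a mention to a hashable group key (e.g. the page its span falls on; how to derive that depends on the data model), and nullable mention slots are omitted for brevity:

from collections import defaultdict
from itertools import product

def grouped_products(doc, candidate_class, context_key):
    # Bucket each mention subclass's mentions in this doc by their context key
    # (e.g. page number), instead of collecting them into one flat list.
    grouped = []
    for mention in candidate_class.mentions:
        buckets = defaultdict(list)
        for m in getattr(doc, mention.__tablename__ + "s"):
            buckets[context_key(m)].append(m)
        grouped.append(buckets)

    # Take Cartesian products only within a shared key, so temporary
    # candidates that span different pages are never formed at all.
    for key in set.intersection(*(set(b) for b in grouped)):
        yield from product(*(buckets[key] for buckets in grouped))

To illustrate with the counts above: if the 800 / 140 / 150 mentions were spread evenly across, say, 20 pages, the 16.8 million-tuple product would shrink to roughly 20 × (40 × 7 × 7.5) ≈ 42,000 temporary candidates.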
