CandidateExtractor doesn't scale for larger relations #546

robbieculkin opened this issue Sep 2, 2021 · 1 comment

Hello, thanks for providing this framework. My group has run into a bit of a snag:

For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

We've since moved on to a 3-ary relation that is very similar to the binary relation. This 3-ary relation shares some mentions with the binary relation and uses a very similar candidate extractor. We profiled the 3-ary throttler function and found its per-call runtime to be very close to the binary throttler's. Yet candidate extraction now takes ~4 hours per document. This immense slowdown happens because the number of temporary candidates is the Cartesian product of the per-type mention counts, so it grows multiplicatively with every entity added to the relation.

Here are some numbers from our use case:

  • Mention A: 800 mentions found
  • Mention B: 140 mentions found
  • Mention C: 150 mentions found

If our relation only includes (A,B), we have a total of 800*140 = 112,000 temporary candidates to evaluate with our throttler. Should we add mention C to form the relation (A,B,C), our total now grows to 800*140*150 = 16.8 million temporary candidates. We're unable to narrow our mention matchers further without excluding true positives.
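
As a quick sanity check of that arithmetic (just the counts above, nothing else assumed):

# number of temporary candidates = product of per-type mention counts
n_a, n_b, n_c = 800, 140, 150
print(n_a * n_b)        # 112000 for the binary relation (A, B)
print(n_a * n_b * n_c)  # 16800000 for the 3-ary relation (A, B, C)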

This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?

@robbieculkin
Author

The first workaround that comes to mind is to override the CandidateExtractorUDF.apply method, where temporary candidates are formed as the product of all of a document's mentions:

cands = product(
    *[
        enumerate(
            # a list of mentions for each mention subclass within a doc
            getattr(doc, mention.__tablename__ + "s")
            + ([None] if nullable else [])
        )
        for mention, nullable in zip(
            candidate_class.mentions, candidate_class.nullables
        )
    ]
)

Grouping mentions by their contexts and forming products only within those groups would reduce the number of temporary candidates passed to the throttler. This restriction could be applied at any level of Fonduer's data model hierarchy (Paragraph, Sentence, Table, ...).

For example, we can

  1. select all mentions within a given page,
  2. take their Cartesian product within that page,
  3. then combine the results across all pages.

This way, we eliminate all temporary candidates that span different pages (when that is what the programmer indicates). It is preferable to filtering with a same_page throttler because candidates that would fail the check are never formed in the first place, so far fewer temporary candidates reach the throttler at all. Ultimately, the programmer could pass a shared_context parameter to the .apply method to indicate the right level of the hierarchy.
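
A rough, untested sketch of what such an overridden apply could do instead of the full per-document product. Here context_key is a hypothetical callable mapping a mention to a hashable group key (e.g. the page its span falls on; how to derive that depends on the data model), and nullable mention slots are omitted for brevity:

from collections import defaultdict
from itertools import product

def grouped_products(doc, candidate_class, context_key):
    # Bucket each mention subclass's mentions in this doc by their context key
    # (e.g. page number), instead of collecting them into one flat list.
    grouped = []
    for mention in candidate_class.mentions:
        buckets = defaultdict(list)
        for m in getattr(doc, mention.__tablename__ + "s"):
            buckets[context_key(m)].append(m)
        grouped.append(buckets)

    # Take Cartesian products only within a shared key, so temporary
    # candidates that span different pages are never formed at all.
    for key in set.intersection(*(set(b) for b in grouped)):
        yield from product(*(buckets[key] for buckets in grouped))

To illustrate with the counts above: if the 800 / 140 / 150 mentions were spread evenly across, say, 20 pages, the 16.8 million-tuple product would shrink to roughly 20 × (40 × 7 × 7.5) ≈ 42,000 temporary candidates.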
