Array comparisons vs Single value comparisons #2550
Unanswered
p4pratikjain
asked this question in
Q&A
Replies: 1 comment
-
It's because the array comparison is a filter condition, whereas the l.email = r.email comparison is an equi-join condition. You can read more about this here: In your situation I'd recommend sorting your arrays and using something like |
-
We are trying to match two big datasets (300M at full size, sampled down to 1.5M for the POC). I can either run comparisons on the exploded data, which is 25M, or run array comparisons on the 1.5M dataset. For some reason, when I run linker.training.estimate_probability_two_random_records_match on the 25M exploded data, it finishes in a few minutes, whereas with array comparisons it takes about 6 hours.
I have converted my deterministic rules to array-based comparisons for this. For example, l.email = r.email becomes array_length(array_intersect(l.email, r.email)) >= 1.
Am I missing something, or is there a better way to run this on the grouped dataset?
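To make the filter versus equi-join distinction from the reply concrete, here is a minimal pure-Python sketch. It is not Splink code and not the snippet the reply referred to. The toy records, `filter_join`, and `equi_join` are hypothetical names made up for illustration. It only shows that an array-overlap filter has to test every left/right pair, while an equality join on exploded values can use a hash lookup and still find the same pairs.

```python
from collections import defaultdict

# Hypothetical toy records: (record_id, list_of_emails). Not from the thread.
left = [(1, ["a@x.com", "b@x.com"]), (2, ["c@x.com"]), (3, ["d@x.com"])]
right = [(10, ["b@x.com"]), (11, ["c@x.com", "e@x.com"]), (12, ["f@x.com"])]


def filter_join(left, right):
    """Array-overlap filter: every left/right pair is tested (nested loop)."""
    pairs, checks = set(), 0
    for l_id, l_emails in left:
        for r_id, r_emails in right:
            checks += 1
            if set(l_emails) & set(r_emails):  # same idea as array_intersect
                pairs.add((l_id, r_id))
    return pairs, checks


def equi_join(left, right):
    """Equality join on exploded values: build a hash index, then probe it."""
    index = defaultdict(set)
    for r_id, r_emails in right:
        for email in r_emails:  # "explode" the right-hand arrays
            index[email].add(r_id)
    pairs, probes = set(), 0
    for l_id, l_emails in left:
        for email in l_emails:  # "explode" the left-hand arrays
            probes += 1
            for r_id in index.get(email, ()):
                pairs.add((l_id, r_id))  # the set removes duplicate pairs
    return pairs, probes


filter_pairs, filter_checks = filter_join(left, right)
equi_pairs, equi_probes = equi_join(left, right)
print(filter_pairs == equi_pairs, filter_checks, equi_probes)
```

In the filter version the number of checks grows with the product of the two table sizes. In the exploded version the work grows roughly with the number of exploded rows plus the number of matches, which fits the exploded 25M run finishing much faster than the array-based one. Note that the exploded join can produce the same record pair more than once when two records share several values, so the pairs need deduplicating, as the set does here.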