Summary
RewriteBucketedSemiJoinToJoin rewrites SemiJoin to LEFT JOIN + Project for bucketed tables, but the rewrite loses three-valued NULL semantics, potentially producing incorrect results for NOT IN queries when the subquery contains NULLs.
Root Cause
The rewrite computes semiJoinOutput as:
IF(l.key IS NULL, NULL, NOT(r.key IS NULL))
The original HashSemiJoinOperator returns NULL when the probe key does not match any build value AND the build set contains NULLs (standard SQL IN semantics: 5 IN (NULL, 1, 2, 3) = UNKNOWN).
The LEFT JOIN rewrite produces FALSE instead — equi-join never matches NULL, so r.key is always null-filled on non-match, yielding NOT(NULL IS NULL) = NOT(TRUE) = FALSE.
Impact
For NOT IN queries where semiJoinOutput is consumed by a bare NOT(S) filter:
5 NOT IN (NULL, 1, 2, 3) |
Original SemiJoin |
Rewritten LeftJoin |
semiJoinOutput |
NULL |
FALSE |
NOT(semiJoinOutput) |
NULL (falsy, row excluded) |
TRUE (row incorrectly included) |
The NOT EXISTS pattern is safe because it uses NOT(COALESCE(S, false)) which treats both NULL and FALSE the same way.
Relevant Files
RewriteBucketedSemiJoinToJoin.java — lines 174-180 (the rewrite expression)
HashSemiJoinOperator.java — lines 160-184 (correct NULL semantics)
TestLogicalPlanner.java — lines 744-752 (NOT IN plan pattern showing bare NOT S)
Suggested Fix
Add a guard to RewriteBucketedSemiJoinToJoin that checks the semiJoinOutput is not consumed by a bare NOT() filter. Only fire when the output is used directly as a boolean filter (positive IN) or wrapped in COALESCE (NOT EXISTS).
Alternatively, handle build-side NULLs correctly by adding a NOT NULL constraint check on the filtering source join key column.
Summary
RewriteBucketedSemiJoinToJoinrewritesSemiJointoLEFT JOIN + Projectfor bucketed tables, but the rewrite loses three-valued NULL semantics, potentially producing incorrect results forNOT INqueries when the subquery contains NULLs.Root Cause
The rewrite computes
semiJoinOutputas:The original
HashSemiJoinOperatorreturns NULL when the probe key does not match any build value AND the build set contains NULLs (standard SQLINsemantics:5 IN (NULL, 1, 2, 3)= UNKNOWN).The LEFT JOIN rewrite produces FALSE instead — equi-join never matches NULL, so
r.keyis always null-filled on non-match, yieldingNOT(NULL IS NULL)=NOT(TRUE)=FALSE.Impact
For
NOT INqueries wheresemiJoinOutputis consumed by a bareNOT(S)filter:5 NOT IN (NULL, 1, 2, 3)semiJoinOutputNOT(semiJoinOutput)The
NOT EXISTSpattern is safe because it usesNOT(COALESCE(S, false))which treats both NULL and FALSE the same way.Relevant Files
RewriteBucketedSemiJoinToJoin.java— lines 174-180 (the rewrite expression)HashSemiJoinOperator.java— lines 160-184 (correct NULL semantics)TestLogicalPlanner.java— lines 744-752 (NOT IN plan pattern showing bareNOT S)Suggested Fix
Add a guard to
RewriteBucketedSemiJoinToJointhat checks thesemiJoinOutputis not consumed by a bareNOT()filter. Only fire when the output is used directly as a boolean filter (positiveIN) or wrapped inCOALESCE(NOT EXISTS).Alternatively, handle build-side NULLs correctly by adding a NOT NULL constraint check on the filtering source join key column.