Presto NaN behavior

We are moving to a consistent semantics with regards to NaN across all functions and operators. This will bring internal consistency to Presto, as well as align us with Velox, as we move towards CPP workers in Presto. According to the new definition:

NaN = NaN is true for all equality and distinctness purposes
NaN is the largest value for all comparison and sorting purposes. You can find more information below about where this definition differs from the previous behavior, and how you can handle it for your queries.

What is nan?

NaN, short for "not a number", is a value in floating point types (DOUBLE and REAL) that indicates an invalid result to mathematical operations. For example 0.0/0.0 returns NaN, as does sqrt(-1). NaN is different from NULL, which indicates an absent value and is significantly less common.

What is changing and how will this affect queries

The new NaN semanitcs may cause your queries to produce different results by returning different values than previously or returning a different number of rows. Below are places where the new Nan semantics differ from the old.

Comparison operators

Comparison operators are by far the most common cause of results differences between the old and new semantics. Especially comparisons such as double_column > value.

> and < operators - Filters such as x > 0.5 previously would not include NaNs. Now, since NaNs are considered larger than all other numbers, it will include NaN values. This is the most common cause of results differences we see with the new NaN changes. Sometimes this is from nan values stored in tables. Other times it comes from comparison after division operations where you end up with 0 / 0. e.g. WHERE (col1 / col2) > 0.5, but col1 and col2 are both 0.

SELECT x > 0.5 FROM (VALUES nan()) t(x)

Previous results: 
 _col0                
-------
 false 
(1 row)

New results:
 _col0                
-------
 true 
(1 row)

SELECT x/y > 0.5 FROM (VALUES (0.0, 0.0))t(x, y)

Previous results: 
 _col0                
-------
 false 
(1 row)

New results:
 _col0                
-------
 true 
(1 row)

= and IN operators - Filters such as x = y will now return true if both values are NaN, whereas previously they returned false
<> and NOT IN operators - x <> y now returns false if both values are NaN. Previously it returned true.

Functions

min, max, min_by, max_by - Will now consider NaNs largest. Previously they returned NaN if it was the first value encountered and otherwise they returned the min or max non-NaN value.
array_min - previously this returned NaN if there was any NaN in the input. Now it will consider NaN as the largest value.
greatest, least - previously these functions threw an error if there was NaN in the input. Now they will consider NaN largest.

presto:di> SELECT greatest (1, nan());
Previous results:
Query 20240627_211145_97754_m2atc failed: Invalid argument to greatest(): NaN

New results:
 _col0                
-------
   NaN 
(1 row)

set_agg, set_union, map_agg, multimap_agg, map_union and other map and array functions that are supposed to return a distinct set of elements in an array or map - previously these did not deduplicate NaN values and maps could also contain multiple NaN keys. Now they will deduplicate NaN values and maps can only contain one NaN key


SELECT set_agg(x) from (values nan(), nan()) t(x);

Previous results:
   _col0    
------------
 [NaN, NaN] 
(1 row)
New results:
   _col0    
------------
 [NaN]

contains, element_at, arrays_overlap and other functions that find or match an element in a map or array - previously these functions would not match NaN values in the array or map (e.g. contains(ARRAY[nan()], nan()) would return false). Now these functions will match nan to itself (e.g. contains(ARRAY[nan()], nan()) would be true). map_top_n - previously returned wrong results in the presence of NaNs. Now it will consider NaN largest.

Joins

NaN join keys will now match. Previously they would not. Example:

with nan_input as (SELECT * FROM (values nan())t(x)) SELECT t1.x FROM nan_input t1 JOIN nan_input t2 on t1.x = t2.x;

Previous results:
 x 
---
(0 rows)

New results:
 x 
---
NaN
(1 row)

Aggregations

GROUP BY/DISTINCT for REAL types Previously we did not deduplicate NaNs when performing grouping operations on REAL types. Now we will deduplicate NaNs and consider them all part of the same gropu for a "GROUP BY". There is no change for DOUBLE types which already had this behavior.

SELECT col, count(*) FROM (VALUES cast(nan() AS REAL), CAST(nan() AS REAL))t(col) GROUP BY col;

Previous results:
 col | _col1 
-----+-------
 NaN |     1 
 NaN |     1 
(2 rows)

New results:
 col | _col1 
-----+-------
 NaN |     2 
(1 row)

In addition to the above, a bug was fixed regarding casting nan or infinity to bigint/int/tinyint/smallint. Previously these casts would incorrectly return 0. nan and infinity cannot be cast to integer types, and such casts will now throw an error.

What can I protect against NaNs in my data?

In most cases where NaNs show up, people do not realize that they may be producing or processing NaNs.

If you would like to filter out NaNs from your data in a way that is compatible with bot the old and new semanitcs, you can use the is_nan() function. For example: SELECT * FROM my_table WHERE NOT is_nan(my_double_column). You can also use is_finite() to filter out both NaNs and infinities. For example: SELECT * FROM my_table WHERE is_finite(my_double_column).

In order to prevent yourself from creating NaNs, you can filter them out after performing an operation on doubles or reals, e.g. SELECT * FROM (SELECT col1/col2 as ratio FROM my_table) WHERE NOT is_nan(ratio) or you can filter your inputs so as to prevent producing NaNs , e.g. SELECT col1/col2 as ratio FROM my_table WHERE col2 <> 0.