Specifying a set of possible values for categorical variables #534
Comments
Hi @mhauru, The way that SmartNoise SQL is currently designed, rare dimension combinations get censored, because of the possibility that they will "fingerprint" people. This means that the total number of dimension combinations in the output of a The most direct way to do this in a SQL-friendly manner is to It's conceivable that this could be done automatically under the covers, but SmartNoise would need to know the cardinalities of the dimensions to avoid creating an explosion. I can think of a few ways to design this:
I don't love #3, because it's not very SQL-friendly and could lead to situations where the metadata gets out of date with the SQL dimensions. And we would need to shuffle the rows before returning, or prohibit results without an Even with #2, we could skip the Would be interested in your feedback
Thanks for the extensive response @joshua-oss. I hadn't thought about the option of using Regarding using
Is there a way in smartnoise-sql to specify the set of possible values a categorical variable could take, that would then affect count queries, so that all valid values would have a chance of getting a non-zero count reported for them, even if some may not appear in the data?
For a use case, imagine a people table with a gender column. In a small dataset the only values that feature might be `'male'` and `'female'` (as strings or enums, for instance), but I might want to specify that other values like `'N/A'` would also be valid, and have queries like `SELECT COUNT(*) AS num, gender FROM people GROUP BY gender` have a chance of reporting non-zero values for them.
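To make the request concrete, here is a minimal Python sketch of the behaviour being asked for. It is not SmartNoise code and does not use the SmartNoise API: the function names `laplace_noise` and `noisy_counts_over_domain`, the `domain` argument, and the choice of Laplace noise with scale `1 / epsilon` (sensitivity 1 for adding or removing one row) are all assumptions made for illustration. The point is only that every value in a declared domain gets a noisy count, including values that never appear in the data.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts_over_domain(values, domain, epsilon, rng=None):
    """Return a noisy count for every category in `domain`.

    Hypothetical helper, not part of smartnoise-sql. Assumes each person
    contributes one row, so the histogram has sensitivity 1.
    Values outside `domain` are dropped (an illustrative choice).
    """
    if rng is None:
        rng = random.Random()
    allowed = set(domain)
    true_counts = Counter(v for v in values if v in allowed)
    scale = 1.0 / epsilon
    # Categories absent from the data start from a true count of 0,
    # so they can still be reported with a non-zero noisy count.
    return {
        cat: true_counts.get(cat, 0) + laplace_noise(scale, rng)
        for cat in domain
    }


if __name__ == "__main__":
    genders = ["male", "female", "female"]
    print(noisy_counts_over_domain(genders, ["male", "female", "N/A"], epsilon=1.0))
```

Because the set of output keys is fixed by `domain` rather than by the data, the presence or absence of a row for `'N/A'` reveals nothing on its own, which is why this kind of release would not need rare values to be censored. Noisy counts can come out negative or non-integer; whether to clamp or round them is a separate post-processing decision.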