Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error assigning denominator_column_id in acs2017_1yr #19

Open
JoeGermuska opened this issue Sep 21, 2018 · 0 comments
Open

Error assigning denominator_column_id in acs2017_1yr #19

JoeGermuska opened this issue Sep 21, 2018 · 0 comments
Assignees

Comments

@JoeGermuska
Copy link
Member

Something about the load process led to the acs2017_1yr.census_table_metadata being off in a few cases.

Specifically, the value for denominator_column_id was incorrectly set. This column is used to indicate when it's sensible to represent values from the table as percentages of a whole -- its value should be null when values in a table represent medians, aggregates, means, or other values which don't work that way.

Its value should also be null for tables with only one column, such as B01003: Total Population among others.

I fixed this directly in the PostgreSQL database for the running system, but we should review, possibly implement some of the sanity checks below, and possibly create a new PSQL dump (lower priority)

Specifically, separate from the aggregate/mean/etc case, this query should, logically, always return zero rows:

SELECT tmd.* 
    FROM acs2017_1yr.census_table_metadata tmd, 
    (SELECT table_id, count(*) 
        FROM acs2017_1yr.census_column_metadata 
        GROUP BY table_id 
        HAVING count(*) = 1) as cmd 
    WHERE tmd.table_id = cmd.table_id 
      AND tmd.denominator_column_id IS NOT NULL;

This command will ensure that the above command returns zero rows. It could probably be included as part of the data process after census_table_metadata is populated:

UPDATE acs2017_1yr.census_table_metadata tmd
SET denominator_column_id = null 
FROM (SELECT table_id, count(*) 
        FROM acs2017_1yr.census_column_metadata 
        GROUP BY table_id 
        HAVING count(*) = 1) as cmd
    WHERE tmd.table_id = cmd.table_id
    AND denominator_column_id IS NOT NULL;

This command will show which tables don't have a denominator column, but have more than one row, and don't match the most obvious table naming conventions which are a sign that the data shouldn't be "ratio'd". It is likely to return a few rows, and for now all I can think is that a person can scan them and think about whether there's any likely issue...

SELECT tmd.* 
    FROM acs2017_1yr.census_table_metadata tmd, 
    (SELECT table_id, count(*) 
        FROM acs2017_1yr.census_column_metadata 
        GROUP BY table_id 
        HAVING count(*) > 1) as cmd 
    WHERE tmd.table_id =cmd.table_id 
      AND tmd.denominator_column_id IS NULL
      AND LOWER(table_title) NOT LIKE '%aggregate%'
      AND LOWER(table_title) NOT LIKE '%median%'
      AND LOWER(table_title) NOT LIKE '%mean%';
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants