More validations #75

Open
nicolas-raoul opened this issue Apr 9, 2018 · 0 comments

nicolas-raoul commented Apr 9, 2018

I am experimenting with data.world, and it gave me some interesting ideas for validation:

16 duplicate rows detected
    Row 1630 is a duplicate of the one above it.
    Row 29643 is a duplicate of the one above it.
    Row 65945 is a duplicate of the one above it.
    Row 77711 is a duplicate of the one above it.

+ 12 similar issues
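As a rough illustration of the duplicate-row check, here is a minimal Python sketch. It assumes the listings are already loaded as a list of rows (each a list of cell strings); `find_adjacent_duplicates` is a hypothetical name, not something from data.world or from this project. It only flags a row that is identical to the one directly above it, which is what the messages above describe.

```python
def find_adjacent_duplicates(rows):
    """Return 1-based numbers of rows identical to the row directly above."""
    duplicates = []
    previous = None
    for number, row in enumerate(rows, start=1):
        current = tuple(row)  # tuples compare cell by cell
        if previous is not None and current == previous:
            duplicates.append(number)
        previous = current
    return duplicates


rows = [
    ["Tokyo Tower", "35.6586", "139.7454"],
    ["Tokyo Tower", "35.6586", "139.7454"],  # same as the row above
    ["Senso-ji", "35.7148", "139.7967"],
]
print(find_adjacent_duplicates(rows))
```

Catching duplicates that are not adjacent would need a set of already-seen rows instead of a single `previous` value.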
158 blank cells detected

    Cell at row 1345, column title appears blank.
    Cell at row 1457, column title appears blank.
    Cell at row 1949, column title appears blank.
    Cell at row 2214, column title appears blank.

+ 154 similar issues
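The blank-cell check could look something like the sketch below. It assumes each row is a dict keyed by column name (as `csv.DictReader` would produce); `find_blank_cells` is a hypothetical helper. Whitespace-only values are treated as blank here, which is an assumption about what "appears blank" means.

```python
def find_blank_cells(rows, column):
    """Return 1-based row numbers where `column` is missing, empty or whitespace."""
    blank = []
    for number, row in enumerate(rows, start=1):
        value = row.get(column) or ""
        if not value.strip():
            blank.append(number)
    return blank


rows = [{"title": "Tokyo Tower"}, {"title": "   "}, {"title": ""}]
print(find_blank_cells(rows, "title"))
```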
Numeric(114)
114 numeric values outside standard deviation detected

    Value 1.1102015E7 at row 66099, column checkin is more than 4 standard deviations of 646327.73 from the mean of 45586.41.
    Value 1.4102015E7 at row 66099, column checkout is more than 4 standard deviations of 778967.24 from the mean of 50554.01.
    Value -1800.0 at row 109847, column hours is more than 4 standard deviations of 248.63 from the mean of 8.14.
    Value 1130.0 at row 171909, column hours is more than 4 standard deviations of 248.63 from the mean of 8.14.

+ 110 similar issues
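The numeric check reported above flags values more than 4 standard deviations from the column mean. A minimal sketch of that rule, assuming the column has already been parsed into floats; `find_numeric_outliers` is a hypothetical name, and the use of the sample standard deviation is my assumption, since the report does not say which variant data.world uses.

```python
import statistics


def find_numeric_outliers(values, threshold=4.0):
    """Return (row_number, value) pairs farther than `threshold` stdevs from the mean."""
    if len(values) < 2:
        return []  # stdev is undefined for fewer than two values
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [
        (number, value)
        for number, value in enumerate(values, start=1)
        if abs(value - mean) > threshold * stdev
    ]


hours = [8.0] * 99 + [1000.0]
print(find_numeric_outliers(hours))
```

One caveat: a few extreme values inflate the standard deviation themselves (note the stdev of 248.63 against a mean of 8.14 for `hours`), so a check based on the median might be more robust.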
Noise(146)
2 non-numeric characters in number field detected
Dismiss

    Value at row 110516, column latitude does not appear to be numeric, but column is numeric.
    Value at row 110516, column longitude does not appear to be numeric, but column is numeric.
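For columns that should be numeric, such as latitude and longitude, a simple check is to try parsing each non-blank value. The sketch below assumes rows are dicts keyed by column name; `find_non_numeric` is a hypothetical helper. Blank cells are skipped because they are covered by the blank-cell check.

```python
def find_non_numeric(rows, column):
    """Return 1-based row numbers whose non-blank `column` value is not a number."""
    bad = []
    for number, row in enumerate(rows, start=1):
        text = (row.get(column) or "").strip()
        if not text:
            continue  # blank cells are a separate validation
        try:
            float(text)
        except ValueError:
            bad.append(number)
    return bad


rows = [{"latitude": "35.68"}, {"latitude": "35.68N"}, {"latitude": ""}]
print(find_non_numeric(rows, "latitude"))
```

Note that Python's `float()` also accepts strings such as `nan` and `inf`, so a stricter check may be wanted for coordinates.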

1 possible placeholder number detected

    Value 5555 at row 185359, column title appears to be a placeholder.

143 possible placeholder text values detected

    Value VVV at row 5521, column alt appears to be a placeholder.
    Value *** at row 12105, column alt appears to be a placeholder.
    Value CCC at row 39923, column alt appears to be a placeholder.
    Value *** at row 104112, column alt appears to be a placeholder.

+ 139 similar issues
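I do not know the rule data.world uses for placeholders, but every example above (`5555`, `VVV`, `***`, `CCC`) is a single character repeated three or more times. A sketch under that assumption, with the hypothetical name `looks_like_placeholder`:

```python
import re

# One character followed by at least two more copies of itself, nothing else.
_REPEATED_CHAR = re.compile(r"(.)\1{2,}")


def looks_like_placeholder(value):
    """Return True if `value` is one character repeated three or more times."""
    return _REPEATED_CHAR.fullmatch(value.strip()) is not None


for sample in ["VVV", "***", "5555", "Tokyo"]:
    print(sample, looks_like_placeholder(sample))
```

This will also flag some legitimate values (for example a name that really is `AAA`), so the result is best treated as a warning rather than an error.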
Text(1,945)
1,945 text values outside standard deviation detected

    Text value 95 at row 508, column address has length more than 4 standard deviations away from the mean.
    Text value 88 at row 591, column address has length more than 4 standard deviations away from the mean.
    Text value 162 at row 599, column address has length more than 4 standard deviations away from the mean.
    Text value 144 at row 742, column address has length more than 4 standard deviations away from the mean.

+ 1,941 similar issues

Date(3)
3 dates detected far in the future

    Date is far in the future at column lastedit, row 77420.
    Date is far in the future at column lastedit, row 130787.
    Date is far in the future at column lastedit, row 205038.
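A possible sketch of the future-date check for `lastedit`. It assumes the values are ISO dates (`YYYY-MM-DD`) and that "far in the future" means more than `margin_days` after today; both the one-year default and the name `find_future_dates` are my assumptions, not something the report specifies. Values that do not parse are skipped here and left to a separate format check.

```python
from datetime import date, timedelta


def find_future_dates(rows, column, today=None, margin_days=365):
    """Return 1-based row numbers whose date is more than `margin_days` after `today`."""
    today = today or date.today()
    limit = today + timedelta(days=margin_days)
    flagged = []
    for number, row in enumerate(rows, start=1):
        text = (row.get(column) or "").strip()
        if not text:
            continue
        try:
            value = date.fromisoformat(text)
        except ValueError:
            continue  # malformed dates belong to a different validation
        if value > limit:
            flagged.append(number)
    return flagged


rows = [{"lastedit": "2018-04-09"}, {"lastedit": "2081-04-09"}, {"lastedit": ""}]
print(find_future_dates(rows, "lastedit", today=date(2018, 4, 9)))
```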

Feel free to split this issue into separate issues, and implement only the ones you want.

For your information, here are the current validations: http://wvpoi.batalex.ru/download/listings/wikivoyage-listings-en-latest.validation-report.html
