How to deal with messy data in corpora? #1417

BeritJanssen · 2018-08-30T09:36:08Z

BeritJanssen
Aug 30, 2018
Maintainer

Some keyword fields contain terms that differ only slightly, which raises the question of whether these are inconsistencies introduced when the metadata was produced. For example, the Times corpus returns the following results for an aggregate search on the 'illustration' field:

"aggregations": {
        "illustrations": {
            "doc_count_error_upper_bound": 1,
            "sum_other_doc_count": 60,
            "buckets": [
                {
                    "key": "Photograph",
                    "doc_count": 1066798
                },
                {
                    "key": "Drawing",
                    "doc_count": 501804
                },
                {
                    "key": "Table",
                    "doc_count": 357874
                },
                {
                    "key": "Drawing-Painting",
                    "doc_count": 213248
                },
                {
                    "key": "Map",
                    "doc_count": 60567
                },
                {
                    "key": "Cartoons",
                    "doc_count": 53097
                },
                {
                    "key": "Graph",
                    "doc_count": 43758
                },
                {
                    "key": "Cartoon",
                    "doc_count": 8576
                },
                {
                    "key": "A commemorative stone set on what will be the university campus",
                    "doc_count": 1
                },
                {
                    "key": "A new industry helps a new city. Aeriel survey photographs of the Craigavon site are linked together for interpretation",
                    "doc_count": 1
                }
            ]
        }
    }

'Cartoons' and 'Cartoon' might have to be consolidated. This could be done in one of two ways:

- during indexing: the logic for that is already in place, cf. transforming the 'bank' field with DUTCHBANK_MAP in dutchbanking.py. Disadvantage: if a user at some point decides the difference between the terms is meaningful (cf. also the newspaper titles in the KB corpus), we need to reindex.
- during search: we would need to implement similar logic which gives the user one choice but internally searches for both variants of the term (this logic would have to be implemented in both the frontend and the backend).
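A minimal sketch of both options, assuming a simple variant-to-canonical mapping in the spirit of DUTCHBANK_MAP. The names `ILLUSTRATION_MAP`, `normalize_illustration`, `expand_filter_terms` and `build_terms_filter` are hypothetical and do not exist in the codebase:

```python
# Hypothetical mapping from variant spellings to one canonical term.
ILLUSTRATION_MAP = {
    "Cartoons": "Cartoon",
}


def normalize_illustration(value):
    """Index-time option: store only the canonical term."""
    return ILLUSTRATION_MAP.get(value, value)


def expand_filter_terms(selected):
    """Search-time option: expand the user's single choice to all stored variants."""
    variants = [
        raw for raw, canonical in ILLUSTRATION_MAP.items() if canonical == selected
    ]
    return [selected] + sorted(variants)


def build_terms_filter(field, selected):
    """Build an Elasticsearch `terms` filter clause that matches every variant."""
    return {"terms": {field: expand_filter_terms(selected)}}
```

With the index-time option the original spelling is lost in the index, which is why a reindex would be needed to restore it. The search-time option leaves the indexed data untouched, but the frontend would also have to merge the two buckets into one choice.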

BeritJanssen · 2018-08-30T12:07:39Z

BeritJanssen
Aug 30, 2018
Maintainer Author

Noticed something similar in the category field of the Times index:

{
    "key": "Classified Advertisements",
    "doc_count": 1
},
{
    "key": "Diplay Advertising",
    "doc_count": 1
},
{
    "key": "Letters and Correspondece",
    "doc_count": 1
}

The first term usually appears as 'Classified Advertising'. Of course, we hardly lose any data through these misspellings, but this is still something to be aware of.
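One way to get an overview of such near-duplicates would be to compare the bucket keys of an aggregation pairwise. The sketch below uses the standard library's `difflib`. The function name `near_duplicates` and the threshold of 0.85 are assumptions for illustration, not tuned values:

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(keys, threshold=0.85):
    """Return pairs of keys whose case-insensitive similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(keys, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Flags the pair that differs by a single letter and ignores the unrelated key.
print(near_duplicates(["Cartoon", "Cartoons", "Map"]))
```

The output would still need a manual check: a pair such as 'Drawing' and 'Drawing-Painting' may be a meaningful distinction rather than an inconsistency.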

BeritJanssen · 2018-09-05T10:59:56Z

BeritJanssen
Sep 5, 2018
Maintainer Author

This may be related to #222.

BeritJanssen · 2024-02-08T09:32:39Z

BeritJanssen
Feb 8, 2024
Maintainer Author

What is our stance here? Should we clean up the data, or present it as-is? We could write some transform functions which take care of the inconsistencies at index time, but this may be beyond the scope of what we want to invest in presenting corpora.

BeritJanssen · 2024-02-08T10:25:54Z

BeritJanssen
Feb 8, 2024
Maintainer Author

See also #1077: minor inconsistencies, such as differences between lowercase and uppercase, may be taken care of by Elasticsearch during ingest.
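A sketch of what that could look like, assuming an Elasticsearch ingest pipeline with the built-in `trim` and `lowercase` processors. Whether #1077 has this approach in mind is an assumption, and the helper name `build_normalize_pipeline` is hypothetical:

```python
def build_normalize_pipeline(fields):
    """Build an ingest pipeline body that trims and lowercases the given fields."""
    processors = []
    for field in fields:
        # ignore_missing avoids failures on documents that lack the field.
        processors.append({"trim": {"field": field, "ignore_missing": True}})
        processors.append({"lowercase": {"field": field, "ignore_missing": True}})
    return {
        "description": "Normalize whitespace and case of keyword fields",
        "processors": processors,
    }


pipeline_body = build_normalize_pipeline(["category"])
```

Such a body could be registered under `_ingest/pipeline/<name>` and referenced when indexing. Note that lowercasing changes the values users see in filters, and it would not fix misspellings such as 'Diplay Advertising'.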
