How to deal with messy data in corpora? #1417
Replies: 4 comments
-
Noticed something similar in the category field of the Times index:
{
"key": "Classified Advertisements",
"doc_count": 1
},
{
"key": "Diplay Advertising",
"doc_count": 1
},
{
"key": "Letters and Correspondece",
"doc_count": 1
}
The first term is usually called 'Classified Advertising'. With a doc_count of 1 each, we hardly lose any data to these misspellings, but this is still something to be aware of.
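One way to surface such near-duplicates without reading every bucket by hand is to compare the aggregation keys pairwise. The following is only a sketch using the standard library's `difflib`; the function name `near_duplicate_terms`, the threshold of 0.8, the spelling 'Display Advertising' and the larger doc counts are assumptions made for illustration, not values from the Times index.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_terms(buckets, threshold=0.8):
    """Return pairs of bucket keys whose similarity ratio meets the threshold.

    `buckets` is a list of dicts shaped like the buckets of a terms
    aggregation: {"key": ..., "doc_count": ...}.
    """
    keys = [bucket["key"] for bucket in buckets]
    pairs = []
    for first, second in combinations(keys, 2):
        # Compare case-insensitively so 'cartoon' and 'Cartoon' also match.
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            pairs.append((first, second))
    return pairs


# Illustrative buckets; the counts above 1 are made up.
example_buckets = [
    {"key": "Classified Advertising", "doc_count": 1000},
    {"key": "Classified Advertisements", "doc_count": 1},
    {"key": "Display Advertising", "doc_count": 500},
    {"key": "Diplay Advertising", "doc_count": 1},
]

for first, second in near_duplicate_terms(example_buckets):
    print(f"possible duplicate: {first!r} / {second!r}")
```

The threshold needs tuning per field: set too low, it will pair up terms that are genuinely different categories.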
-
This may be related to #222.
-
I wonder what our stance is here? Should we clean up the data, or present it as-is? We could write some transform functions which take care of the inconsistencies at index time, but this may be beyond the scope of what we want to invest in presenting corpora.
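To make the transform-function option concrete, here is a minimal sketch of what such a function could look like. Everything in it is hypothetical: the names `CATEGORY_CORRECTIONS` and `normalize_category` do not exist in the codebase, and apart from 'Classified Advertising' the target spellings are my assumption of what the correct terms are.

```python
# Hypothetical correction table: keys are variants seen in the source data,
# values are the preferred term. Only 'Classified Advertising' is confirmed
# in this thread; the other two targets are assumed spellings.
CATEGORY_CORRECTIONS = {
    "Classified Advertisements": "Classified Advertising",
    "Diplay Advertising": "Display Advertising",
    "Letters and Correspondece": "Letters and Correspondence",
}


def normalize_category(value):
    """Map a known variant to its preferred term; pass other values through."""
    if value is None:
        return None
    cleaned = value.strip()
    # Unknown values are returned unchanged, so nothing is silently dropped.
    return CATEGORY_CORRECTIONS.get(cleaned, cleaned)


print(normalize_category("Diplay Advertising"))
```

An explicit table like this only fixes variants someone has already spotted, which keeps the cleanup auditable but means the table has to be maintained per corpus.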
-
See also #1077: minor inconsistencies, such as lower / uppercase, may be taken care of by Elasticsearch during ingest.
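For the case differences, one Elasticsearch mechanism that could do this is a normalizer on the keyword field. The sketch below only builds the index body as a Python dict; the helper `keyword_index_body`, the normalizer name `lowercase_keyword` and the field name `category` are placeholders, and whether #1077 has this mechanism or an ingest pipeline in mind is not settled here.

```python
def keyword_index_body(field_name, normalizer_name="lowercase_keyword"):
    """Build an index body with a lowercasing normalizer on one keyword field.

    The field and normalizer names are placeholders for illustration.
    """
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    # Custom normalizer that lowercases keyword values.
                    normalizer_name: {
                        "type": "custom",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field_name: {
                    "type": "keyword",
                    "normalizer": normalizer_name,
                }
            }
        },
    }


body = keyword_index_body("category")
print(body["mappings"]["properties"]["category"])
```

Note that with a normalizer the terms returned by aggregations are the lowercased ones, while `_source` keeps the original value, so the display of filter options would change as well.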
-
Some keyword fields have terms which only differ slightly, making you wonder whether this is an inconsistency when producing the metadata. E.g., the Times corpus has the following results for an aggregate search on the 'illustration' field:
'Cartoons' and 'Cartoon' might have to be consolidated. This could be done with a mapping like DUTCHBANK_MAP in dutchbanking.py. Disadvantage: if a user at some point decides the difference in fields is meaningful (cf. also the newspaper titles in the KB corpus), we would need to reindex.