Implementing simple parsing arguments #34

andrewtavis · 2021-04-05T11:18:30Z

This issue is to discuss and implement keys for wikirec.data_utils.input_conversion_dict to make it easier for people to find valid arguments to parse Wikipedia articles using wikirec.data_utils.parse_to_ndjson. Rather than needing to search for the given Infobox topic, a user could instead simply query the keys of input_conversion_dict for the desired language and see what would be valid values to pass to the topics argument. Suggestions and pull requests are welcome for any language :)

Thanks for your interest in contributing!

victle · 2021-04-07T03:10:30Z

Sorry if I'm not understanding the issue here; are you looking to scrape different languages on Wikipedia to find valid Infobox topics? Would adding keys manually for different languages also be sufficient for smaller pull requests?

andrewtavis · 2021-04-07T07:54:05Z

Hi and thanks for the message :) Adding keys manually would certainly be enough for smaller pull requests. I guess I'm not 100% sure how I want this to function yet, so your input would be welcome. As of now I figure that it would be helpful for people to be able to use data_utils.input_conversion_dict as a way to check for the most likely arguments and have them standardized (e.g., both films and movies pointing to the same target to avoid confusion). If we scrape, we'll get every valid Infobox topic, but many of those won't be particularly useful.

I think that just adding more language keys and then conversions within would be sufficient (but let me know), and then we could add something to the readme that details how to query common options via:

data_utils.input_conversion_dict()["en"].keys()

Let me know, and I appreciate your interest in helping out!
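As a rough sketch of the idea described above (per-language keys, with aliases such as films and movies pointing to the same target), something like the following could work. This is not wikirec's actual implementation: `sketch_conversion_dict` and `valid_topics` are hypothetical names, and the "de" entry is a placeholder rather than a verified template name.

```python
# Hypothetical stand-in for data_utils.input_conversion_dict.
# All names and values here are assumptions made for illustration only.
def sketch_conversion_dict():
    film_target = "Infobox film"
    return {
        "en": {
            "books": "Infobox book",
            "movies": film_target,
            "films": film_target,  # alias: same target as "movies"
        },
        "de": {
            # Placeholder value; the real German Infobox name would go here.
            "buecher": "<German book Infobox name>",
        },
    }


def valid_topics(language):
    """Return the sorted topic keys a user could pass for a given language."""
    return sorted(sketch_conversion_dict()[language].keys())


if __name__ == "__main__":
    print(valid_topics("en"))
```

Querying the keys for a language then gives the short, standardized list of arguments, and either alias resolves to the same Infobox target.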

victle · 2021-04-07T14:38:26Z

Adding new language keys in a similar format to the "en" key should be simple enough. As for the conversions, what do you think about having multiple keys (e.g., "movies: ...", "films: ...") that point to the same value? Or perhaps a single "movies" key whose value is a list of all the related Infoboxes? That way, if people use data_utils.input_conversion_dict, they get a minimal list of relevant possible arguments.

In fact, maybe another option is to simply have a key that is consistent across all languages. Something like, "common", that will spit out the most likely arguments. Though, this might be a bit naive as we might miss related Infoboxes?
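The two alternatives raised here (one key per topic whose value lists all related Infoboxes, plus a "common" key shared across languages) could look roughly like the sketch below. Everything in it is an assumption for illustration: `SKETCH_TOPICS`, `infoboxes_for`, and `common_topics` are hypothetical names, and the second Infobox in the "movies" list is a placeholder rather than a real template.

```python
# Hypothetical layout: one key per topic, each mapping to a list of related Infoboxes.
# The "common" key is the same for every language and names the most likely topics.
SKETCH_TOPICS = {
    "common": ["books", "movies"],
    "en": {
        "books": ["Infobox book"],
        # Second entry is a placeholder for a related Infobox, not a verified name.
        "movies": ["Infobox film", "<related film Infobox>"],
    },
}


def infoboxes_for(language, topic):
    """Return every Infobox associated with a topic in the given language."""
    return list(SKETCH_TOPICS[language][topic])


def common_topics(language):
    """Return the common topics that are actually defined for the language."""
    return [t for t in SKETCH_TOPICS["common"] if t in SKETCH_TOPICS[language]]


if __name__ == "__main__":
    print(common_topics("en"))
    print(infoboxes_for("en", "movies"))
```

With this layout the key list stays minimal (one key per topic), while related Infoboxes are still reachable through the list value, which partly addresses the concern about missing them.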

andrewtavis · 2021-04-07T16:19:08Z

In thinking about it, I agree that a minimal list is better than an expansive one that has multiple arguments per target :) Honestly I just added the films key as I got annoyed that I forgot the key was movies 😄 The goal is that people use data_utils.input_conversion_dict to explore the options, so let's fully implement it in the workflow.

I think that language-based arguments would be best, as I could see some people being confused if arguments in languages other than English are displayed. The sheer size and depth of the English Wikipedia means that it will be the go-to choice for most NLP tasks, wikirec included, with other languages picking up the slack to provide cultural insights or be used in areas where the English wiki is lacking. With that being said, I like the idea of a common argument - but it might be best if it's kept in mind for later :)

andrewtavis changed the title ~~Compiling simple parsing arguments~~ Implementing simple parsing arguments Apr 5, 2021

andrewtavis added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Apr 5, 2021

andrewtavis added the help wanted (Extra attention is needed) label Apr 7, 2021
