Outfactor source extraction functions #1428

lukavdplas · 2023-10-16T12:34:05Z

lukavdplas
Oct 16, 2023
Maintainer

Amongst other things, I-analyzer contains the functionality to convert XML, HTML or CSV files to a flat collection of key/value pairs. To extract these, you essentially pass in a few parameters and a list of Extractor objects. This functionality could be factored out into an independent Python package.

The package would essentially contain:

- XMLReader / HTMLReader / CSVReader classes, which implement the source2dicts function we now have in CorpusDefinition
- The Extractor classes

Individual corpus definitions in the I-analyzer backend would look pretty much the same, but the CorpusDefinition child classes would be more lightweight.
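A minimal sketch of how such a package could be shaped, using only the standard library. Everything here is an assumption for illustration: the `CSVExtractor` and `Field` classes, the `apply` method, and the `LettersReader` example are hypothetical names, not the existing I-analyzer code. Only the `source2dicts` name and the idea of passing a few parameters plus a list of extractors come from the description above.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Optional, Sequence


@dataclass
class CSVExtractor:
    """Hypothetical extractor: reads one column from a CSV row."""
    column: str
    transform: Optional[Callable[[str], object]] = None

    def apply(self, row: Dict[str, str]) -> object:
        value = row.get(self.column)
        if value is not None and self.transform is not None:
            return self.transform(value)
        return value


@dataclass
class Field:
    """Hypothetical pairing of an output key with the extractor that fills it."""
    name: str
    extractor: CSVExtractor


class CSVReader:
    """Hypothetical reader: subclasses supply parameters and fields."""
    delimiter = ','
    fields: Sequence[Field] = ()

    def source2dicts(self, source) -> Iterator[Dict[str, object]]:
        # Turn each CSV row into a flat dict of key/value pairs.
        reader = csv.DictReader(source, delimiter=self.delimiter)
        for row in reader:
            yield {f.name: f.extractor.apply(row) for f in self.fields}


# A corpus definition would then only declare parameters and extractors.
class LettersReader(CSVReader):
    delimiter = ';'
    fields = [
        Field('author', CSVExtractor('Author')),
        Field('year', CSVExtractor('Year', transform=int)),
    ]


SAMPLE = "Author;Year\nAda;1843\nGrace;1952\n"
docs = list(LettersReader().source2dicts(io.StringIO(SAMPLE)))
```

In this shape the reader base class carries the parsing logic, so the corpus-specific subclass stays small, which is the "more lightweight" effect described above.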

BeritJanssen · 2023-10-18T12:13:45Z

BeritJanssen
Oct 18, 2023
Maintainer

Yes, that's a good plan! The new application might also be combined with harvesting steps, so that harvesting new data and ingesting it happen in one place. That would be very nice for periodic updates: we could do everything from one application and wouldn't have to switch to I-Analyzer to actually ingest the data. The only potential disadvantage is that information about corpora would then be spread over two places (one holding which metadata to expect, the other which data format to expect). But then, it's also not unusual that we get the same corpus in two different formats, which would be very difficult to accommodate in the current setup, whereas the CorpusDefinition could just be completely agnostic as to where the data comes from and which format the source data is in.

lukavdplas · 2023-10-18T12:58:51Z

lukavdplas
Oct 18, 2023
Maintainer Author

Hm, this idea of a separate harvesting + extraction application is something we've discussed, and I think you summarise the pros and cons well here. It's not quite what I meant, though. The package I envision would only define, for instance, the XML extractor class, but the corpus definition would still create the instances of that class used for a specific corpus. Think of it more like a data-extraction-utils package.

That is not mutually exclusive with making a separate "source extraction" application - you can do both. But I think there is still something to be said for a lightweight module for parsing source files. An issue that I would like to address is the following:

When other researchers or projects work with the same source data (or very similarly structured data), it would be nice if I could point to a relatively lightweight Python package and say "here is the Extractor we used to get the chapter titles from the XML files in this dataset". That would be a lot easier if there were a standalone Python package for reading source files that isn't integrated in a web application.
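As a rough illustration of what such a shareable extractor could look like, here is a sketch that uses only the standard library. The `XMLExtractor` class, its `apply` method, and the sample XML layout are all hypothetical, chosen for the example rather than taken from I-analyzer.

```python
import xml.etree.ElementTree as ET
from typing import List


class XMLExtractor:
    """Hypothetical extractor: collects the text of all elements matching a path."""

    def __init__(self, path: str):
        self.path = path

    def apply(self, root: ET.Element) -> List[str]:
        # ElementTree supports a limited XPath subset in findall().
        return [(el.text or '').strip() for el in root.findall(self.path)]


# The one object you would point other researchers to.
chapter_titles = XMLExtractor('./chapter/title')

# Made-up sample document standing in for a real source file.
SAMPLE = """<book>
  <chapter><title>Origins</title><p>Some text.</p></chapter>
  <chapter><title>Methods</title><p>More text.</p></chapter>
</book>"""

titles = chapter_titles.apply(ET.fromstring(SAMPLE))
```

Because nothing in it depends on a web framework or a database, someone with the same dataset could run this without installing or configuring the full application.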

Another reason is that we ourselves may work on projects that also involve reading XML/HTML/CSV data but otherwise have nothing to do with I-analyzer.

BeritJanssen · 2023-10-18T15:01:47Z

BeritJanssen
Oct 18, 2023
Maintainer

Do you think we could factor this out as in: release a PyPI package that we could import in I-Analyzer?

lukavdplas · 2023-10-18T16:13:21Z

lukavdplas
Oct 18, 2023
Maintainer Author

I think so! You wouldn't necessarily need to upload to PyPI if it's very niche, but that would be nice in the interest of FAIRness.

lukavdplas · 2023-12-05T15:42:43Z

lukavdplas
Dec 5, 2023
Maintainer Author

I've put a preliminary implementation in https://github.com/UUDigitalHumanitieslab/ianalyzer-readers - it needs some more polishing to be properly reusable.

BeritJanssen · 2023-12-06T10:11:55Z

BeritJanssen
Dec 6, 2023
Maintainer

Nice! I think it's missing requirements, for one thing?

lukavdplas · 2023-12-06T11:02:54Z

lukavdplas
Dec 6, 2023
Maintainer Author

Can you make that an issue in the repository?
