Outfactor source extraction functions #1428
Replies: 7 comments
-
Yes, that's a good plan! The new application could also be combined with harvesting steps, so that harvesting new data and ingesting it happen in one place. That would be very nice for periodic updates: we could do everything from one application and wouldn't have to switch to I-Analyzer to actually ingest the data. The only potential disadvantage is that information about corpora would then be spread over two places (one holding which metadata to expect, the other which data format to expect). But then, it's not unusual for us to get the same corpus in two different formats, which would be very difficult to accommodate in the current setup, whereas the CorpusDefinition could be completely agnostic as to where the data comes from and which format the source data is in.
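To make the "agnostic CorpusDefinition" point concrete, here is a minimal sketch. It is not the current I-Analyzer code: `read_csv`, `read_json_lines`, and this stripped-down `CorpusDefinition` are hypothetical, as are the `title` and `year` fields. The definition only knows which metadata to expect, and each reader only knows one source format.

```python
import csv
import io
import json
from typing import Any, Dict, Iterable, Iterator


def read_csv(source: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical reader for a CSV delivery of the corpus."""
    for row in csv.DictReader(source):
        yield {"title": row["title"], "year": int(row["year"])}


def read_json_lines(source: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical reader for a JSON-lines delivery of the same corpus."""
    for line in source:
        if line.strip():
            record = json.loads(line)
            yield {"title": record["title"], "year": int(record["year"])}


class CorpusDefinition:
    """Hypothetical definition: knows the expected fields, not the source format."""

    expected_fields = ("title", "year")

    def documents(self, records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        for record in records:
            missing = [f for f in self.expected_fields if f not in record]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            # Keep only the fields the corpus definition declares.
            yield {field: record[field] for field in self.expected_fields}


corpus = CorpusDefinition()
csv_source = io.StringIO("title,year\nFirst report,1999\n")
json_source = io.StringIO('{"title": "First report", "year": "1999"}\n')

from_csv = list(corpus.documents(read_csv(csv_source)))
from_json = list(corpus.documents(read_json_lines(json_source)))
```

Both deliveries end up as the same documents, so supporting a second format would only mean adding a reader, not touching the corpus definition.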
-
Hm, this idea of a separate harvesting + extraction application is something we've discussed, and I think you summarise the pros and cons well here. It's not quite what I meant, though. The package I envision would only define, for instance, the [...]

That is not mutually exclusive with making a separate "source extraction" application: you can do both. But I think there is still something to be said for a lightweight module for parsing source files.

An issue that I would like to address is the following: when other researchers or projects work with the same source data (or very similarly structured data), it would be nice if I could point to a relatively lightweight Python package and say "here is the [...]".

Another reason is that we ourselves may work on projects that also involve reading XML/HTML/CSV data but otherwise have nothing to do with I-analyzer.
-
Do you think we could factor this out as in: release a PyPI package that we could import in I-Analyzer?
-
I think so! You wouldn't necessarily need to upload to PyPI if it's very niche, but that would be nice in the interest of FAIRness.
-
I've put a preliminary implementation in https://github.com/UUDigitalHumanitieslab/ianalyzer-readers. It needs some more polishing to be properly reusable.
-
Nice! I think it's missing its requirements, for one thing?
-
Can you make that an issue in the repository?
-
Amongst other things, I-analyzer contains the functionality to convert XML, HTML or CSV files to a flat collection of key/value pairs. To extract these, you essentially pass on a few parameters and a list of Extractor objects. This functionality could be outfactored into an independent Python package.

The package would essentially contain:

* XMLReader/HTMLReader/CSVReader classes, which implement the source2dicts function we now have in CorpusDefinition
* Extractor classes

Individual corpus definitions in the I-analyzer backend would look pretty much the same, but the CorpusDefinition child classes would be more lightweight.
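As a rough illustration of the proposed split, here is a minimal, standard-library-only sketch of a reader with a source2dicts method driven by Extractor objects. It is not the actual I-analyzer code or the API of any released package: the constructor arguments, the `fields` attribute, the `apply` method, and the `SpeechReader` example with its column names are all assumptions made for illustration.

```python
import csv
import io
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


class Extractor:
    """Hypothetical extractor: reads one CSV column and optionally transforms it."""

    def __init__(self, column: str, transform: Optional[Callable[[str], Any]] = None):
        self.column = column
        self.transform = transform

    def apply(self, row: Dict[str, str]) -> Any:
        value = row.get(self.column)
        if value is not None and self.transform is not None:
            return self.transform(value)
        return value


class CSVReader:
    """Hypothetical reader: maps output field names to extractors."""

    # Subclasses fill this in as {field name: Extractor}.
    fields: Dict[str, Extractor] = {}
    delimiter = ","

    def source2dicts(self, source: Iterable[str]) -> Iterator[Dict[str, Any]]:
        """Yield one flat dict of key/value pairs per CSV row."""
        for row in csv.DictReader(source, delimiter=self.delimiter):
            yield {name: extractor.apply(row) for name, extractor in self.fields.items()}


class SpeechReader(CSVReader):
    """Example subclass: only declares which fields to extract and how."""

    fields = {
        "speaker": Extractor("name", transform=str.strip),
        "year": Extractor("date", transform=lambda d: int(d[:4])),
    }


sample = io.StringIO("name,date,text\n Ada ,1843-10-01,Notes\nGrace,1952-05-01,Compiler\n")
documents = list(SpeechReader().source2dicts(sample))
```

In this sketch a CorpusDefinition child class would only need to inherit from (or hold) such a reader, which is what would make it more lightweight.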