Not clear why malformed entries are malformed #82

NickleDave · 2022-09-20T02:09:55Z

Research Software Encyclopedia version: 0.0.46
Python version: 3.10
Operating System: Ubuntu

Description

Hi @vsoch thank you again for adding the ability to import .csv files directly.
I'm working on a script to clean up a .csv but I'm not actually able to figure out why the scraper reports some entries are malformed, instead of turning them into a "custom" repo.
I think it has something to do with the format of the url?

Below I'll put output from running.
Seems like anything that's not "https://" will fail.
Is that so? Is it in the docs somewhere that I need to have all urls be this format and I'm missing it? I don't find anything about the custom parser.
Sorry if I'm misunderstanding what the main loop in CSVImporter.create is doing.

What I Did

Here's the report of malformed entries.
Seems like anything that's not "https://" fails.
E.g., "www.", "readthedocs.", etc.

Yes I haven't set up a GitHub token yet

... all the INFO logs here of "Found software record" ...
Found 70 results
WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/tree/v0.6.5.
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shyamblast/Koogu/tree/v0.6.5
WARNING:rse.main.import.csv:Skipping malformed entry librosa.org/librosa/
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/rflachlanhub.io/Luscinia.
WARNING:rse.main.import.csv:Skipping malformed entry rflachlan.github.io/Luscinia/
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/monitoR/index.html
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/ohun/index.html.
WARNING:rse.main.import.csv:Skipping malformed entry marce10.github.io/ohun/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.pamguard.org/
WARNING:rse.main.import.csv:Skipping malformed entry www.fon.hum.uva.nl/praat/
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/shivChitinous/prinia-project: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shivChitinous/prinia-project
WARNING:rse.main.import.csv:Skipping malformed entry ravensoundsoftware.com/software/raven-lite/
WARNING:rse.main.import.csv:Skipping malformed entry ravensoundsoftware.com/software/raven-pro
WARNING:rse.main.import.csv:Skipping malformed entry www.reaper.fm/
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/scikit-maad/scikit-maad: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/scikit-maad/scikit-maad
WARNING:rse.main.import.csv:Skipping malformed entry docs.scipy.org/doc/scipy/reference/signal.html
WARNING:rse.main.import.csv:Skipping malformed entry dx.doi.org/10.6084/m9.figshare.3792780
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/seewave/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.sonicvisualiser.org/
WARNING:rse.main.import.csv:Skipping malformed entry sonobat.com/
WARNING:rse.main.import.csv:Skipping malformed entry doi.org/10.1080/09524622.2013.827588
WARNING:rse.main.import.csv:Skipping malformed entry soundata.readthedocs.io/en/latest/
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/soundecology/vignettes/intro.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/macster110/aipam: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/macster110/aipam
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/rhine3/specky: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/rhine3/specky
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/YvesBas/Tadarida-C: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/YvesBas/Tadarida-C
WARNING:rse.main.import.csv:Skipping malformed entry www.cetus.ucsd.edu/technologies_triton.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/yardencsGitHub/tweetynet: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/yardencsGitHub/tweetynet
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/vocalpy/vak: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/vocalpy/vak
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/HaroldMills/Vesper: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/HaroldMills/Vesper
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/warbleR/index.html


vsoch · 2022-09-20T02:14:04Z

Most of those are malformed because they are not GitHub URLs in the format https://github.com/<user>/<repo>. A handful look OK, but we can't know for sure whether they would work because the GitHub API returned "403, rate limit exceeded".
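
One way to separate the two groups described here is to look at the ERROR line, if any, that directly precedes each "Skipping malformed entry" warning. The following is a minimal sketch that works only on the log text as pasted above; the function name `triage` and the group labels are made up for illustration and are not part of rse.

```python
import re

# A few lines copied verbatim from the log above.
LOG = """\
WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/vocalpy/vak: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/vocalpy/vak
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/tree/v0.6.5.
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shyamblast/Koogu/tree/v0.6.5
"""

SKIP = re.compile(r"Skipping malformed entry (\S+)")


def triage(log_text):
    """Group skipped entries by the ERROR line (if any) directly before them."""
    groups = {"rate_limited": [], "endpoint_not_found": [], "no_api_error": []}
    previous = ""
    for line in log_text.splitlines():
        match = SKIP.search(line)
        if match:
            if "rate limit exceeded" in previous:
                key = "rate_limited"
            elif "Cannot find endpoint" in previous:
                key = "endpoint_not_found"
            else:
                key = "no_api_error"
            groups[key].append(match.group(1))
        previous = line
    return groups


groups = triage(LOG)
```

Entries that land in `rate_limited` are the ones that might import fine once a GitHub token is configured; the other two groups failed for a different reason.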

vsoch · 2022-09-20T02:14:39Z

And that's correct, it's expecting a fully formed GitHub URL, not a link to an index.html, Read the Docs, or anywhere else.
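
For a script that cleans up the .csv, a check along these lines could flag which rows have the https://github.com/<user>/<repo> shape before running the import. This is only a sketch using the standard library; `is_github_repo_url` is a hypothetical helper and is not how rse itself matches URLs.

```python
from urllib.parse import urlparse


def is_github_repo_url(url):
    """Return True if url looks like https://github.com/<user>/<repo> and nothing more."""
    # urlparse only fills in netloc when a scheme is present,
    # so add one for entries such as "github.com/vocalpy/vak".
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    if parsed.netloc.lower() != "github.com":
        return False
    # Exactly two non-empty path segments: <user> and <repo>.
    parts = [p for p in parsed.path.split("/") if p]
    return len(parts) == 2
```

With entries taken from the log above, `github.com/vocalpy/vak` passes, while `github.com/shyamblast/Koogu/tree/v0.6.5` (extra path segments) and `rflachlan.github.io/Luscinia/` (a github.io page, not a repository) do not.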

NickleDave · 2022-09-20T02:31:13Z

None of these are GitHub--the first ~30 that are malformed.
When I inspect with a breakpoint I see that the re used to match parser <-> uri returns a CustomParser, at least for the first one (www.adobe.com)

Are they supposed to parse as custom automagically or am I misunderstanding how rse works?

How come sometimes the cran ones work and other times they don't?

Not trying to be a grumpy pain, I'm just trying to understand the intended functionality.

WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/

vsoch · 2022-09-20T02:34:00Z

Can you give me your sheets URL and the command here so I can reproduce?

NickleDave · 2022-09-20T02:36:38Z

Sorry, was working on it. I should've thought to include it in the first place.

.csv attached. I'm doing

rse import --type csv copy-Bioacoustics-software.csv

copy-Bioacoustics-software.csv

vsoch · 2022-09-20T02:46:54Z

I think I'm seeing the ones I'd expect to be imported, imported

database/
└── github
    ├── BirdVox
    │   ├── birdvoxclassify
    │   │   └── metadata.json
    │   └── birdvoxdetect
    │       └── metadata.json
    ├── Cdevenish
    │   └── hardRain
    │       └── metadata.json
    ├── ChristianBergler
    │   └── ANIMAL-SPOT
    │       └── metadata.json
    ├── DanWoodrich
    │   └── INSTINCT
    │       └── metadata.json
    ├── DenaJGibbon
    │   └── gibbonR-package
    │       └── metadata.json
    ├── DrCoffey
    │   └── DeepSqueak
    │       └── metadata.json
    ├── EricArcher
    │   └── banter
    │       └── metadata.json
    ├── kitzeslab
    │   └── opensoundscape
    │       └── metadata.json
    ├── macaodha
    │   └── batdetect
    │       └── metadata.json
    ├── macster110
    │   └── aipam
    │       └── metadata.json
    ├── MarineBioAcousticsRC
    │   └── DetEdit
    │       └── metadata.json
    ├── nilomr
    │   └── fieldtools
    │       └── metadata.json
    ├── nwolek
    │   └── audiomoth-scripts
    │       └── metadata.json
    ├── OpenWild
    │   └── caracal
    │       └── metadata.json
    ├── patriceguyot
    │   └── Acoustic_Indices
    │       └── metadata.json
    ├── rhine3
    │   └── specky
    │       └── metadata.json
    ├── sarabsethi
    │   └── audioset_soundscape_feats_sethi2019
    │       └── metadata.json
    ├── scikit-maad
    │   └── scikit-maad
    │       └── metadata.json
    ├── shivChitinous
    │   └── prinia-project
    │       └── metadata.json
    ├── TaikiSan21
    │   └── PAMr
    │       └── metadata.json
    ├── timsainb
    │   └── AVGN
    │       └── metadata.json
    ├── vocalpy
    │   ├── crowsetta
    │   │   └── metadata.json
    │   ├── hybrid-vocal-classifier
    │   │   └── metadata.json
    │   └── vak
    │       └── metadata.json
    ├── YannickJadoul
    │   └── Parselmouth
    │       └── metadata.json
    ├── yardencsGitHub
    │   └── tweetynet
    │       └── metadata.json
    └── YvesBas
        └── Tadarida-C
            └── metadata.json

54 directories, 28 files

I don't have spreadsheet software handy so I can't open the csv to see if you have all the required fields for each, but I'd check that.

NickleDave · 2022-09-22T03:49:11Z

Thank you for taking the time to check.

I am confused about why I can no longer generate records for entries that worked previously.

For example, see this custom record for arbimon that was generated with the initial run:
https://github.com/NickleDave/bioacoustics-software/blob/add-rse/database/custom/arbimon-rfcx-org/metadata.json
From the time stamp I can see it was generated on 2022-08-18:
https://github.com/NickleDave/bioacoustics-software/blob/bdebcc7669c05f7a92e275af34c3ce6acf6a75c5/database/custom/arbimon-rfcx-org/metadata.json#L26

        "timestamp": "2022-08-18 14:26:06.351778"

This is when I first ran import I think.

But now this entry is considered malformed.

In fact, if I'm in my feature branch and I do rm -rf database and then rerun rse import --type csv copy-Bioacoustics-software.csv, then none of the custom entries get re-generated. I only have database/github.

$ git st
On branch add-rse
Your branch is up to date with 'origin/add-rse'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        deleted:    database/custom/arbimon-rfcx-org/metadata.json
        deleted:    database/custom/autoencoded-vocal-analysis-readthedocs-io/en/latest/index-html/metadata.json
        deleted:    database/custom/bioacoustics-us/ishmael-html/metadata.json
        deleted:    database/custom/birdnet-cornell-edu/metadata.json
        ...
        deleted:    database/custom/www-wildlifeacoustics-com/products/kaleidoscope-pro/metadata.json
        deleted:    database/custom/www-wsl-ch/en/services-and-products/software-websites-and-apps/batscope-4-html/metadata.json
        modified:   database/github/BirdVox/birdvoxclassify/metadata.json
        modified:   database/github/BirdVox/birdvoxdetect/metadata.json
        ...

If I roll back to rse==0.0.44 and I import directly from --google-sheet then I am able to generate those custom records again.

Is there something specific about the csv importer that makes those not work?

NickleDave · 2022-09-22T03:49:54Z

This is what I ran with rse==0.0.44 to generate a database that does include the custom records, if it helps:

rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

vsoch · 2022-09-22T03:51:36Z

Perhaps the place to start is to compare the google sheet to your csv? They use the same underlying logic. If you want to walk through in detail check out IPython. As a sanity check I’d also try Google Sheet without “rolling back” as it still exists in the new release.
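
A quick way to do that comparison without spreadsheet software is to diff the rows with the standard library `csv` module. The sketch below compares two CSV texts held in strings; the column names `name` and `url` and the sample rows are invented for illustration, and in practice the two texts would be read from the downloaded file and the published sheet.

```python
import csv
import io


def diff_rows(text_a, text_b):
    """Return (rows only in text_a, rows only in text_b) for two CSV texts."""
    rows_a = {tuple(row) for row in csv.reader(io.StringIO(text_a))}
    rows_b = {tuple(row) for row in csv.reader(io.StringIO(text_b))}
    return sorted(rows_a - rows_b), sorted(rows_b - rows_a)


# Hypothetical sample data: same header, one row differs in its URL.
sheet = "name,url\nvak,https://github.com/vocalpy/vak\n"
local = "name,url\nvak,github.com/vocalpy/vak\n"
only_sheet, only_local = diff_rows(sheet, local)
```

If both returned lists are empty, the two inputs contain the same rows and the difference has to come from somewhere else.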

NickleDave · 2022-09-22T04:07:41Z

Ah, sorry I wasn't clear.
It's literally the same sheet, I haven't done anything to it, I just downloaded it.

I will test using the --google-sheet import with the current version

NickleDave · 2022-09-22T11:01:13Z

I tested with the current version and I no longer get the "custom" entries when I do
rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

It seems like all the custom entries are considered malformed.

This is true for rse==0.0.45 also: I don't get the custom entries.

I do for rse==0.0.44 though.

NickleDave · 2022-09-22T11:04:00Z

I think it's because the continue statement in the block that issues the warning about malformed entries prevents the inner loop from reaching the logic that creates custom entries?

i.e., if this line

rse/rse/main/scrapers/csv.py

Line 117 in 5982054

continue

runs because repo is a CustomParser instance that returns an empty dict from get_metadata, then we never get to

rse/rse/main/scrapers/csv.py

Line 121 in 5982054

except NotImplementedError:
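
To make the hypothesis concrete, here is a toy reproduction of the control flow I think is happening. It is not the code in rse: `ToyCustomParser` and `import_uris` are hypothetical stand-ins, and the only thing it shows is that a `continue` inside the `try` block skips everything after it for that entry, including any fallback in the `except` clause.

```python
class ToyCustomParser:
    """Hypothetical stand-in for a parser with no remote metadata to fetch."""

    def __init__(self, uri):
        self.uri = uri

    def get_metadata(self):
        return {}


def import_uris(uris):
    imported, skipped = [], []
    for uri in uris:
        try:
            repo = ToyCustomParser(uri)
            data = repo.get_metadata()
            if not data:
                skipped.append(uri)
                continue  # leaves this iteration; nothing below runs for this uri
            imported.append(uri)
        except NotImplementedError:
            # Only reached if something in the try block raises,
            # which never happens for the empty-dict case above.
            imported.append("custom:" + uri)
    return imported, skipped


imported, skipped = import_uris(["arbimon.rfcx.org/"])
```

In this toy version the entry ends up in `skipped` and the custom fallback never runs, which matches what I see in the warnings.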

vsoch · 2022-09-22T14:02:53Z

Nice you have a hypothesis! It looks like a bug I think - does it work when you comment it out?

vsoch · 2022-09-23T00:34:24Z

Please see #84; it should fix the issues here! It was a bit more than the continue: a change in design that I didn't properly implement.

vsoch mentioned this issue Sep 23, 2022

allowing csv export to use custom parser #84

Merged

vsoch closed this as completed in #84 Sep 23, 2022
