Not clear why malformed entries are malformed #82

Closed
NickleDave opened this issue Sep 20, 2022 · 14 comments · Fixed by #84


@NickleDave

  • Research Software Encyclopedia version: 0.0.46
  • Python version: 3.10
  • Operating System: Ubuntu

Description

Hi @vsoch thank you again for adding the ability to import .csv files directly.
I'm working on a script to clean up a .csv but I'm not actually able to figure out why the scraper reports some entries are malformed, instead of turning them into a "custom" repo.
I think it has something to do with the format of the url?

Below I'll put output from running.
Seems like anything that's not "https://" will fail.
Is that so? Is it in the docs somewhere that I need to have all urls be this format and I'm missing it? I don't find anything about the custom parser.
Sorry if I'm misunderstanding what the main loop in CSVImporter.create is doing.

What I Did

Here's the report of malformed entries.
Seems like anything that's not "https://" fails.
E.g., "www.", "readthedocs.", etc.

Yes I haven't set up a GitHub token yet

... all the INFO logs here of "Found software record" ...
Found 70 results
WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/tree/v0.6.5.
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shyamblast/Koogu/tree/v0.6.5
WARNING:rse.main.import.csv:Skipping malformed entry librosa.org/librosa/
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/rflachlanhub.io/Luscinia.
WARNING:rse.main.import.csv:Skipping malformed entry rflachlan.github.io/Luscinia/
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/monitoR/index.html
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/ohun/index.html.
WARNING:rse.main.import.csv:Skipping malformed entry marce10.github.io/ohun/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.pamguard.org/
WARNING:rse.main.import.csv:Skipping malformed entry www.fon.hum.uva.nl/praat/
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/shivChitinous/prinia-project: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shivChitinous/prinia-project
WARNING:rse.main.import.csv:Skipping malformed entry ravensoundsoftware.com/software/raven-lite/
WARNING:rse.main.import.csv:Skipping malformed entry ravensoundsoftware.com/software/raven-pro
WARNING:rse.main.import.csv:Skipping malformed entry www.reaper.fm/
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/scikit-maad/scikit-maad: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/scikit-maad/scikit-maad
WARNING:rse.main.import.csv:Skipping malformed entry docs.scipy.org/doc/scipy/reference/signal.html
WARNING:rse.main.import.csv:Skipping malformed entry dx.doi.org/10.6084/m9.figshare.3792780
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/seewave/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.sonicvisualiser.org/
WARNING:rse.main.import.csv:Skipping malformed entry sonobat.com/
WARNING:rse.main.import.csv:Skipping malformed entry doi.org/10.1080/09524622.2013.827588
WARNING:rse.main.import.csv:Skipping malformed entry soundata.readthedocs.io/en/latest/
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/soundecology/vignettes/intro.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/macster110/aipam: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/macster110/aipam
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/rhine3/specky: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/rhine3/specky
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/YvesBas/Tadarida-C: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/YvesBas/Tadarida-C
WARNING:rse.main.import.csv:Skipping malformed entry www.cetus.ucsd.edu/technologies_triton.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/yardencsGitHub/tweetynet: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/yardencsGitHub/tweetynet
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/vocalpy/vak: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/vocalpy/vak
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/HaroldMills/Vesper: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/HaroldMills/Vesper
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/warbleR/index.html
@vsoch
Contributor

vsoch commented Sep 20, 2022

Most of those are malformed because they are not GitHub URLs in the format https://github.com/<user>/<repo>. A handful look OK, but we can't know for sure whether they would work because of the "403, rate limit exceeded" errors.
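
To make the expected shape concrete, here is a minimal sketch of a check for the https://github.com/<user>/<repo> form. It is not rse's own parser: the helper name `github_user_repo` is hypothetical, and it relies only on the standard library.

```python
from urllib.parse import urlparse


def github_user_repo(url):
    """Return (user, repo) if url looks like https://github.com/<user>/<repo>, else None.

    Hypothetical helper for illustration only; not part of rse.
    """
    # Entries in the report above carry no scheme, so add one before parsing.
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    # Only the main GitHub host counts; <user>.github.io pages do not.
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    parts = [part for part in parsed.path.split("/") if part]
    # Exactly two path components; anything longer (e.g. /tree/v0.6.5) is not a bare repo URL.
    if len(parts) != 2:
        return None
    return parts[0], parts[1]
```

Under this sketch, github.com/vocalpy/vak passes, while github.com/shyamblast/Koogu/tree/v0.6.5 and rflachlan.github.io/Luscinia/ from the log are rejected.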

@vsoch
Contributor

vsoch commented Sep 20, 2022

And that's correct: it's expecting a fully formed GitHub URL, not a link to an index.html, Read the Docs, or anywhere else.

@NickleDave
Author

NickleDave commented Sep 20, 2022

None of the first ~30 malformed entries are GitHub URLs.
When I inspect with a breakpoint, I see that the regex used to match a parser to a uri returns a CustomParser, at least for the first one (www.adobe.com).

Are they supposed to parse as custom automagically or am I misunderstanding how rse works?

How come sometimes the cran ones work and other times they don't?

Not trying to be a grumpy pain, I'm just trying to understand the intended functionality.

WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/

@vsoch
Contributor

vsoch commented Sep 20, 2022

Can you give me your sheets URL and the command here so I can reproduce?

@NickleDave
Author

Sorry, was working on it. I should've thought to include it in the first place.

.csv attached. I'm doing

rse import --type csv copy-Bioacoustics-software.csv

copy-Bioacoustics-software.csv

@vsoch
Contributor

vsoch commented Sep 20, 2022

I think I'm seeing the ones I'd expect to be imported, imported

database/
└── github
    ├── BirdVox
    │   ├── birdvoxclassify
    │   │   └── metadata.json
    │   └── birdvoxdetect
    │       └── metadata.json
    ├── Cdevenish
    │   └── hardRain
    │       └── metadata.json
    ├── ChristianBergler
    │   └── ANIMAL-SPOT
    │       └── metadata.json
    ├── DanWoodrich
    │   └── INSTINCT
    │       └── metadata.json
    ├── DenaJGibbon
    │   └── gibbonR-package
    │       └── metadata.json
    ├── DrCoffey
    │   └── DeepSqueak
    │       └── metadata.json
    ├── EricArcher
    │   └── banter
    │       └── metadata.json
    ├── kitzeslab
    │   └── opensoundscape
    │       └── metadata.json
    ├── macaodha
    │   └── batdetect
    │       └── metadata.json
    ├── macster110
    │   └── aipam
    │       └── metadata.json
    ├── MarineBioAcousticsRC
    │   └── DetEdit
    │       └── metadata.json
    ├── nilomr
    │   └── fieldtools
    │       └── metadata.json
    ├── nwolek
    │   └── audiomoth-scripts
    │       └── metadata.json
    ├── OpenWild
    │   └── caracal
    │       └── metadata.json
    ├── patriceguyot
    │   └── Acoustic_Indices
    │       └── metadata.json
    ├── rhine3
    │   └── specky
    │       └── metadata.json
    ├── sarabsethi
    │   └── audioset_soundscape_feats_sethi2019
    │       └── metadata.json
    ├── scikit-maad
    │   └── scikit-maad
    │       └── metadata.json
    ├── shivChitinous
    │   └── prinia-project
    │       └── metadata.json
    ├── TaikiSan21
    │   └── PAMr
    │       └── metadata.json
    ├── timsainb
    │   └── AVGN
    │       └── metadata.json
    ├── vocalpy
    │   ├── crowsetta
    │   │   └── metadata.json
    │   ├── hybrid-vocal-classifier
    │   │   └── metadata.json
    │   └── vak
    │       └── metadata.json
    ├── YannickJadoul
    │   └── Parselmouth
    │       └── metadata.json
    ├── yardencsGitHub
    │   └── tweetynet
    │       └── metadata.json
    └── YvesBas
        └── Tadarida-C
            └── metadata.json

54 directories, 28 files

I don't have spreadsheet software handy so I can't open the csv to see if you have all the required fields for each, but I'd check that.

@NickleDave
Author

NickleDave commented Sep 22, 2022

Thank you for taking the time to check.

I am confused about why I can no longer generate records for entries that worked previously.

For example, see this custom record for arbimon that was generated with the initial run:
https://github.com/NickleDave/bioacoustics-software/blob/add-rse/database/custom/arbimon-rfcx-org/metadata.json
From the time stamp I can see it was generated on 2022-08-18:
https://github.com/NickleDave/bioacoustics-software/blob/bdebcc7669c05f7a92e275af34c3ce6acf6a75c5/database/custom/arbimon-rfcx-org/metadata.json#L26

        "timestamp": "2022-08-18 14:26:06.351778"

This is when I first ran import I think.

But now this entry is considered malformed.

In fact, if I'm in my feature branch and I do rm -rf database and then rerun rse import --type csv copy-Bioacoustics-software.csv, then none of the custom entries get re-generated. I only have database/github.

$ git st
On branch add-rse
Your branch is up to date with 'origin/add-rse'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        deleted:    database/custom/arbimon-rfcx-org/metadata.json
        deleted:    database/custom/autoencoded-vocal-analysis-readthedocs-io/en/latest/index-html/metadata.json
        deleted:    database/custom/bioacoustics-us/ishmael-html/metadata.json
        deleted:    database/custom/birdnet-cornell-edu/metadata.json
        ...
        deleted:    database/custom/www-wildlifeacoustics-com/products/kaleidoscope-pro/metadata.json
        deleted:    database/custom/www-wsl-ch/en/services-and-products/software-websites-and-apps/batscope-4-html/metadata.json
        modified:   database/github/BirdVox/birdvoxclassify/metadata.json
        modified:   database/github/BirdVox/birdvoxdetect/metadata.json
        ...

If I roll back to rse==0.0.44 and I import directly from --google-sheet then I am able to generate those custom records again.

Is there something specific about the csv importer that makes those not work?

@NickleDave
Author

This is what I ran with rse==0.0.44 to generate a database that does include the custom records, if it helps:

rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

@vsoch
Contributor

vsoch commented Sep 22, 2022

Perhaps the place to start is to compare the google sheet to your csv? They use the same underlying logic. If you want to walk through in detail check out IPython. As a sanity check I’d also try Google Sheet without “rolling back” as it still exists in the new release.
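
One way to do that comparison, sketched with the standard library only: export the sheet to a second .csv and list the rows that appear in just one of the two files. The helper name `diff_rows` is hypothetical, and comparing rows as tuples of stripped cells is an assumption about what counts as "the same".

```python
import csv


def diff_rows(path_a, path_b):
    """Return (rows only in path_a, rows only in path_b) for two CSV files.

    Hypothetical helper; each row is compared as a tuple of stripped cell values.
    """
    def load(path):
        with open(path, newline="", encoding="utf-8") as handle:
            return {tuple(cell.strip() for cell in row) for row in csv.reader(handle)}

    rows_a, rows_b = load(path_a), load(path_b)
    return sorted(rows_a - rows_b), sorted(rows_b - rows_a)
```

If both returned lists are empty, the two files hold the same rows (ignoring order, duplicates, and surrounding whitespace), so any difference in import behavior would have to come from somewhere else.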

@NickleDave
Author

Ah, sorry I wasn't clear.
It's literally the same sheet, I haven't done anything to it, I just downloaded it.

I will test using the --google-sheet import with the current version

@NickleDave
Author

I tested with the current version and I no longer get the "custom" entries when I do
rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

It seems like all the custom entries are considered malformed.

This is true for rse==0.0.45 also: I don't get the custom entries.

I do for rse==0.0.44 though.

@NickleDave
Author

I think it's because the continue statement in the block that issues the warning about malformed entries prevents the inner loop from reaching the logic that creates custom entries?

i.e., if this line

continue

runs because repo is a CustomParser instance that returns an empty dict from get_metadata, then we never get to
except NotImplementedError:
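
As a rough illustration of this hypothesis (not rse's actual code), the sketch below reduces the loop to its control flow. The `CustomParser` class here is a hypothetical stand-in, `import_uris` and its `skip_empty` flag are made up for the example, and the fallback is simplified to a plain append rather than an `except NotImplementedError:` branch.

```python
class CustomParser:
    """Hypothetical stand-in for a parser with no remote metadata to fetch."""

    def __init__(self, uri):
        self.uri = uri

    def get_metadata(self):
        # Nothing to query for a non-GitHub URL, so the result is empty.
        return {}


def import_uris(uris, skip_empty):
    """Return the uris that end up with a record under the suspected control flow."""
    created = []
    for uri in uris:
        repo = CustomParser(uri)
        if skip_empty and not repo.get_metadata():
            # The "malformed" branch: continue jumps to the next uri,
            # so the fallback below is never reached for this entry.
            continue
        # Simplified fallback that would create the custom record.
        created.append(uri)
    return created
```

With `skip_empty=True` every custom entry is dropped before the fallback runs; with `skip_empty=False` the same entries get a record.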

@vsoch
Contributor

vsoch commented Sep 22, 2022

Nice you have a hypothesis! It looks like a bug I think - does it work when you comment it out?

@vsoch
Contributor

vsoch commented Sep 23, 2022

Please see #84; it should fix the issues here! It was a bit more than the continue: a change in design that I didn't properly implement.

@vsoch vsoch closed this as completed in #84 Sep 23, 2022