Not clear why malformed entries are malformed #82
Comments
Most of those are malformed because they are not GitHub URLs in the expected format. |
And that's correct, it's expecting a fully formed GitHub URL, not a link to an index.html, Read the Docs, or anywhere else. |
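A minimal sketch (not rse's actual implementation, and the function name is my own) of the kind of check that would treat anything other than a fully formed GitHub URL as non-GitHub:

# Sketch: only fully formed https://github.com/<owner>/<repo> URLs pass.
from urllib.parse import urlparse

def looks_like_github_repo(url):
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "github.com":
        return False
    # a repository URL has exactly an owner and a repo name in its path
    parts = [p for p in parsed.path.strip("/").split("/") if p]
    return len(parts) == 2

print(looks_like_github_repo("https://github.com/BirdVox/birdvoxdetect"))  # True
print(looks_like_github_repo("https://arbimon.rfcx.org"))                   # False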
None of these are GitHub URLs: the first ~30 that are malformed. Are they supposed to parse as custom entries automagically, or am I misunderstanding how the custom parser works? How come sometimes the scraper generates a custom record and sometimes it reports the entry as malformed? Not trying to be a grumpy pain, I'm just trying to understand the intended functionality.
|
Can you give me your sheets URL and the command here so I can reproduce? |
Sorry, I was working on it. I should've thought to include it in the first place. .csv attached. Here's what I'm doing:
|
I think I'm seeing the ones I'd expect to be imported, imported.
I don't have spreadsheet software handy so I can't open the csv to see if you have all the required fields for each, but I'd check that. |
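A quick way to run that field check without spreadsheet software, using only the standard library; the required column names below are an assumption, not rse's actual schema:

# Sanity-check the csv for missing columns or empty required fields.
import csv

REQUIRED = {"url", "name", "description"}  # assumed column names, not rse's real schema

with open("software.csv", newline="") as fh:
    reader = csv.DictReader(fh)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        print(f"csv is missing columns entirely: {missing}")
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        empty = [k for k in REQUIRED & set(row) if not (row[k] or "").strip()]
        if empty:
            print(f"line {lineno}: empty required fields: {empty}")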
Thank you for taking the time to check. I am confused about why I can no longer generate records for entries that worked previously. For example, see this custom record for arbimon that was generated with the initial run: "timestamp": "2022-08-18 14:26:06.351778". This is from when I first ran the import, I think. But now this entry is considered malformed. In fact, if I'm in my feature branch and I do
$ git st
On branch add-rse
Your branch is up to date with 'origin/add-rse'.
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: database/custom/arbimon-rfcx-org/metadata.json
deleted: database/custom/autoencoded-vocal-analysis-readthedocs-io/en/latest/index-html/metadata.json
deleted: database/custom/bioacoustics-us/ishmael-html/metadata.json
deleted: database/custom/birdnet-cornell-edu/metadata.json
...
deleted: database/custom/www-wildlifeacoustics-com/products/kaleidoscope-pro/metadata.json
deleted: database/custom/www-wsl-ch/en/services-and-products/software-websites-and-apps/batscope-4-html/metadata.json
modified: database/github/BirdVox/birdvoxclassify/metadata.json
modified: database/github/BirdVox/birdvoxdetect/metadata.json
...
If I roll back to the earlier release, those entries are generated again. Is there something specific about the csv importer that makes those not work? |
This is what I ran with:
|
Perhaps the place to start is to compare the google sheet to your csv? They use the same underlying logic. If you want to walk through it in detail, check it out in IPython. As a sanity check I'd also try the Google Sheet import without "rolling back", as it still exists in the new release. |
Ah, sorry I wasn't clear. I will test using the --google-sheet import with the current version. |
I tested with the current version and I no longer get the "custom" entries when I run the import. It seems like all the custom entries are considered malformed. This is true for the --google-sheet import just as it is for the csv. |
I think it's because of the continue. I.e., if this line (Line 117 in 5982054)
runs because repo is a CustomParser instance that returns an empty dict from get_metadata, then we never get to Line 121 in 5982054.
|
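A schematic reconstruction of that hypothesis (stand-in names, not the actual rse source):

# If get_metadata returns an empty dict for a CustomParser, the continue
# fires and the save step is never reached, so no custom record is written.
for repo in repos:                 # repos: parsed entries, stand-in name
    metadata = repo.get_metadata()
    if not metadata:               # empty dict for custom entries...
        continue                   # ...the hypothesized "Line 117"
    save_metadata(repo, metadata)  # ...so we never reach "Line 121"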
Nice, you have a hypothesis! It looks like a bug, I think. Does it work when you comment it out? |
Please see #84; it should fix the issues here! It was a bit more than the continue: a change in design that I didn't properly implement.
Description
Hi @vsoch, thank you again for adding the ability to import .csv files directly.
I'm working on a script to clean up a .csv, but I'm not actually able to figure out why the scraper reports some entries are malformed instead of turning them into a "custom" repo.
I think it has something to do with the format of the url?
Below I'll put output from running the import.
Seems like anything that's not "https://" will fail.
Is that so? Is it in the docs somewhere that I need to have all URLs be in this format and I'm missing it? I can't find anything about the custom parser.
Sorry if I'm misunderstanding what the main loop in CSVImporter.create is doing.
What I Did
Here's the report of malformed entries.
Seems like anything that's not "https://" fails.
E.g., "www.", "readthedocs.", etc.
Yes, I haven't set up a GitHub token yet.
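Since the description mentions a cleanup script, here is a sketch of one possible cleanup step based on the observation above: prefix a scheme onto bare "www." / "readthedocs." style entries so every URL is at least well formed. The file and column names are assumptions:

# Normalize the url column of a csv so every entry starts with a scheme.
import csv

with open("software.csv", newline="") as src, \
     open("software-clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        url = (row.get("url") or "").strip()  # "url" column is an assumption
        if url and not url.startswith(("http://", "https://")):
            row["url"] = "https://" + url
        writer.writerow(row)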