-
-
Notifications
You must be signed in to change notification settings - Fork 19
Conversation
feat: added progress bar for paragraphs feat: consume wikipedia summary from keyword feat: consume directly from pdb feat: consume directly from webpage, with several fallback methods and ability to change what matters to something else than "p" feat: consume directly from user inpu style: renamed a lot of function to make it more clear fix: instead of using csv library, use pandas to output as csv feat: output as json feat: output directly the pandas dataframe feat: all functions include docstrings feat: all code is PEP compliant fix: set an environnement variable for the tokenizer otherwise you get a warning written at the top of the output file fix: prints a message when autocard is loading as it can be long on slow devicecs refactor: self.qg is called differently, it is now wrapped so as to be part of a dataframe feat: txt containing wikipedia style reference like [5] is sanitized fix: when outputting as csv, all included commas are escaped feat: consume_web uses several ways to find the title of the webpage fix: renamed self.clear to self.clear_qa for clarity feat: added self.string_output() to allow the user to get directly the output instead of only being able to print it feat: self.print and self.pprint allow to print and pretty print the output new: added a commented section that allows to pause the script using ctrl+c then resume it later. Very useful when running on very large text but could potentially cause issues when uploading it to PyPi so I commented it
mentions downloading punkt for first time users
_sanitize_text was missing a docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job! I think this is a great first effort. I listed a few comments for you to look into, but overall the amount of added functionality is impressive. I'll tweak some of the non-Python stuff a bit (comments, README) after we eventually merge a more polished version of the code.
self.qa_pairs = [] | ||
global n, cur_n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When working with a class, variables which have to be shared across multiple functions should be fields of that class, through something like self.qa_pair_count = len(self.qa_pairs)
. Use of global variables is discouraged in Python (and mostly everywhere), that's why you have to explicitly state that you want those to be global.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I decided to do it like that because I was not sure of when the line "n= len(self.qa_pairs)" woud be executed otherwise. For example it has to be reset when clear_qa
is run.
Looking back it's actually a way more terrible idea that I thought because I can imagine most users loading autocards already having set a "n" variable somewhere...
Unfortunately I don't feel comfortable correcting this myself as I'm unfamiliar with classes. It's actually my first ever class... Would you be interested in explaining me the best route or doing it yourself? I will surely have to ankify that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be more clear: I don't feel up to solving when and how to set n and cur_n. I can take care of the issue of organizing qa_pairs differently later on myself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, so I'll tackle this myself after the PR we decided.
autocards.py
Outdated
+ "<br>{{c1::"\ | ||
+ self.qa_pairs[-i]['answer']\ | ||
+ "}}" | ||
self.qa_pairs[-i] = {**self.qa_pairs[-i], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This processing of self.qa_pairs
could help some readability. You can use range(start, end)
to start from a certain value and avoid the weird negative index. self.qa_pairs
could solve, well, the question/answer pairs, while another variable could be used to store this more complex structure with metadata and all maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so I took a look and I can do this only after resolving the issue above IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems I made a mistake, the last commit name is wrong and I don't really know if I can change it. Sorry :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies, as you can see I'm still learning git by trial and error and apparently screwed the naming of this commit and merged it with another.
I had to relocate the re.sub that were previously used in the pdf section because otherwise it messes up the "per paragraph" function.
Sorry for not making more commits and ending up with one giant commit, I changed a lot of things so many times I decided no to commit during brainstorming :/ but please don't hesitate to tell me if I can make it better :) Or if you have any question.
In theory it's all been tested on a few examples but I think it would be nice of you tried it yourself to proofcheck it.
I'm also making a PR for the psionica website right this instant there.