This repository has been archived by the owner on May 5, 2023. It is now read-only.
forked from patil-suraj/question_generation
new revision #15
Merged
Changes from 7 commits
Commits
31 commits
c107544
docs: added example files
2eaa499
new revision needs new libraries
96af48d
added example script by users
fd4f5f0
new revision readme
9338b1c
modified: autocards.py
db01b7e
Update README.md
thiswillbeyourgithub 03ca81c
Update autocards.py
thiswillbeyourgithub 1204b35
better _sanitize_text function
4cc55c5
style: methods called after pandas
0142744
remove function defined but used once
d1a6abf
removed commented section launching pdb
9d5d8c3
better docstring for sanitize text
4355add
removed unnecessary import from the commented area
f18762c
style: added better default titles
3e57726
PEP8: line was too long
e139367
modified: autocards.py
823e4ff
minor: wrong fstring and extra newline at EOF
c1d2b25
fix: better title for text file
ea70298
docs: csv_export is now to_csv, same for json
6fcb6db
docs: clearer text
213c121
missing ebooklib + remove extra newline
90c69ae
docs: minor phrasing
de023d2
remove useless notebook
dea3d44
style: renamed qa_pairs to qa_dict
6e94f6b
phrasing
bb82eb9
more robust epub extraction
088bfe6
adds basic cloze functionality and notetype
0f40dd3
adds basic cloze functionality and notetype
92eca6a
added docstring for main class
3576c7f
feat: added a store_content and watermark flag
978e9ff
minor style
README.md
@@ -1,2 +1,41 @@
# Autocards
Learn more by reading [the official write-up](https://psionica.org/docs/lab/autocards/).
* Automatically create flashcards from user input, PDF files, Wikipedia summaries, webpages, and more!
* For a real-world example, the complete output for [this article](https://www.biography.com/political-figure/philip-ii-of-macedon) can be found [in this folder](./output_example/). Nothing has been manually altered; it is the direct output.
* The code is PEP 8 compliant and all docstrings are written, so contributions and PRs are extremely appreciated.
* Learn more by reading [the official write-up](https://psionica.org/docs/lab/autocards/).

## Install guide:
* `git clone https://github.com/Psionica/Autocards`
* `cd Autocards`
* `pip install -r ./requirements.txt`
* run a python console: `ipython3`
* install punkt by running `!python -m nltk.downloader punkt`

### Autocards usage
```
# loading
from autocards import Autocards
a = Autocards()

# feed the input text using one of the following ways:
a.consume_var(my_text, per_paragraph=True)
a.consume_user_input(title="")
a.consume_wiki_summary(keyword, lang="en")
a.consume_textfile(filename, per_paragraph=True)
a.consume_pdf(pdf_path, per_paragraph=True)
a.consume_web(source, mode="url", element="p")
# => * element is the html element, e.g. p for paragraph
#    * mode can be "url" or "local"

# several ways to get the results back:
out = a.string_output(prefix='', jeopardy=False)
# => * prefix is a text that will be prepended to each qa pair
#    * jeopardy switches the question and the answer
a.print(prefix='', jeopardy=False)
a.pprint(prefix='', jeopardy=False)  # pretty printing
df = a.pandas_output(prefix='')
a.csv_export("output.csv", prefix="", jeopardy=False)
a.json_export("output.json", prefix="", jeopardy=False)

# Also note that a user provided their own ad hoc scripts that you can take
# inspiration from; they are located in the folder `examples_script`
```
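For reference, a minimal end-to-end run following the usage above might look like the sketch below. It only combines calls shown in the README; the Wikipedia article title and output filenames are illustrative examples, not part of the PR.

```
# Sketch of a full run, assuming Autocards was installed as described above.
from autocards import Autocards

a = Autocards()                                  # loads the question-generation models
a.consume_wiki_summary("Philip II of Macedon")   # build qa pairs from the article summary
a.pprint()                                       # inspect the generated pairs
a.csv_export("philip.csv")                       # export to csv, e.g. for Anki import
a.json_export("philip.json")                     # or keep the full metadata as json
```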
autocards.py
@@ -1,63 +1,285 @@
from pipelines import qg_pipeline
from transformers import pipeline

from tqdm import tqdm
from pathlib import Path
import pandas as pd
import time
import signal
import pdb
import re
import os
from contextlib import suppress

import requests
import PyPDF2
import wikipedia
from wikipedia.exceptions import PageError
from bs4 import BeautifulSoup
import csv
from pprint import pprint

# otherwise csv and json outputs contain a warning string
os.environ["TOKENIZERS_PARALLELISM"] = "true"


class Autocards:
    def __init__(self):
        self.qg = qg_pipeline('question-generation', model='valhalla/t5-base-qg-hl', ans_model='valhalla/t5-small-qa-qg-hl')
        print("Loading backend...")
        self.qg = qg_pipeline('question-generation',
                              model='valhalla/t5-base-qg-hl',
                              ans_model='valhalla/t5-small-qa-qg-hl')
        self.qa_pairs = []
        global n, cur_n
        n = len(self.qa_pairs)
        cur_n = n
    def _call_qg(self, text, title):
        """
        Call question generation module, then turn the answer into a
        dictionary containing metadata (cloze formatting, creation time,
        title, source text)
        """
        try:
            self.qa_pairs += self.qg(text)
        except IndexError:
            print(f"\nSkipping section because no cards \
                    could be made from it:{text}\n")
            self.qa_pairs.append({"question": "skipped",
                                  "answer": "skipped"})

        global n, cur_n
        cur_n = len(self.qa_pairs)
        diff = cur_n - n
        n = len(self.qa_pairs)

        cur_time = time.asctime()
        for i in range(diff):
            i += 1
            cloze = self.qa_pairs[-i]['question']\
                + "<br>{{c1::"\
                + self.qa_pairs[-i]['answer']\
                + "}}"
            self.qa_pairs[-i] = {**self.qa_pairs[-i],
                                 "clozed_text": cloze,
                                 "creation_time": cur_time,
                                 "title": title,
                                 "source": text
                                 }
        tqdm.write(f"Added {diff} qa pair (total = {cur_n})")

    def _sanitize_text(self, text):
        "remove wikipedia style citations"
        return re.sub(r"\[\d*\]", "", text)
    def consume_text(self, text, per_paragraph=False):
    def consume_var(self, text, title="", per_paragraph=False):
        "Take text as input and create qa pairs"
        text = text.replace('\xad ', '')

        if per_paragraph:
            for paragraph in text.split('\n\n'):
                self.qa_pairs += self.qg(paragraph)
            for paragraph in tqdm(text.split('\n\n'),
                                  desc="Processing by paragraph",
                                  unit="paragraph"):
                self._call_qg(paragraph, title)
        else:
            text = text.replace('\n\n', '. ').replace('..', '.')
            self.qa_pairs += self.qg(text)
            text = self._sanitize_text(text)
            self._call_qg(text, title)

    def consume_user_input(self, title=""):
        "Take user input and create qa pairs"
        user_input = input("Enter your text below then press Enter (press\
                            enter twice to validate input):\n>")

        print("\nFeeding your text to Autocards.")
        user_input = self._sanitize_text(user_input)
        self.consume_var(user_input, title, per_paragraph=False)
        print("Done feeding text.")
    def consume_wiki_summary(self, keyword, lang="en"):
        "Take a wikipedia keyword and create qa pairs from its summary"
        if "http" in keyword:
            print("To consume a wikipedia summary, you have to input \
                   the title of the article and not the url")
            return None
        wikipedia.set_lang(lang)
        try:
            wiki = wikipedia.page(keyword)
        except PageError as e:
            print(f"Page not found, error code:\n{e}")
            return None
        summary = wiki.summary
        title = wiki.title
        print(f"Article title: {title}")

    def consume_text_file(self, filename):
        self.consume_text(open(filename).read())
        summary = self._sanitize_text(summary)
        self.consume_var(summary, title, True)
    def consume_paper(self, filename):
        soup = BeautifulSoup(open(filename), 'xml')
        paragraphs = []
    def consume_pdf(self, pdf_path, per_paragraph=True):
        if not Path(pdf_path).exists():
            print(f"PDF file not found at {pdf_path}!")
            return None
        pdf = PyPDF2.PdfFileReader(open(pdf_path, 'rb'))
        try:
            title = pdf.documentInfo['/Title']
            print(f"PDF title : {title}")
        except KeyError:
            title = pdf_path.split("/")[-1]
            print(f"PDF title : {title}")

        for paragraph in soup.article.body.find_all('p'):
            paragraph = ' '.join(paragraph.get_text().split())
            if len(paragraph) > 40:
                paragraphs += [paragraph]
        full_text = []
        for page in pdf.pages:
            full_text.append(page.extractText())
        text = " ".join(full_text)
        text = text.replace(" ", "")

        text = self._sanitize_text(text)

        qa_pairs = []
        for paragraph in paragraphs:
            qa_pairs += self.qg(paragraph)
        self.consume_var(text, title, per_paragraph)

        self.qa_pairs += qa_pairs
    def consume_textfile(self, filepath, per_paragraph=False):
        "Take text file as input and create qa pairs"
        if not Path(filepath).exists():
            print(f"File not found at {filepath}")
        text = open(filepath).read()
        text = self._sanitize_text(text)
        self.consume_var(text,
                         filepath,
                         per_paragraph=per_paragraph)
    def clear(self):
    def consume_web(self, source, mode="url", element="p"):
        "Take html file (local or via url) and create qa pairs"
        if mode not in ["local", "url"]:
            return "invalid arguments"
        if mode == "local":
            soup = BeautifulSoup(open(source), 'xml')
        elif mode == "url":
            res = requests.get(source, timeout=15)
            html = res.content
            soup = BeautifulSoup(html, 'xml')

        try:
            el = soup.article.body.find_all(element)
        except AttributeError:
            print("Using fallback method to extract page content")
            el = soup.find_all(element)

        title = ""
        with suppress(Exception):
            title = soup.find_all('h1')[0].text
        with suppress(Exception):
            title = soup.find_all('h1').text
        with suppress(Exception):
            title = soup.find_all('title').text
        title.strip()
        if title == "":
            print(f"Couldn't find title of the page")
            title = source

        valid_sections = []  # remove text sections that are too short:
        for section in el:
            section = ' '.join(section.get_text().split())
            if len(section) > 40:
                valid_sections += [section]
            else:
                print(f"Ignored string because too short: {section}")

        if not valid_sections:
            print("No valid sections found, change the 'element' argument\
                   to look for other html sections than 'p'")
            return None

        for section in tqdm(valid_sections,
                            desc="Processing by section",
                            unit="section"):
            section = self._sanitize_text(section)
            self._call_qg(section, title)
    def clear_qa(self):
        "Delete currently stored qa pairs"
        self.qa_pairs = []
        global n, cur_n
        n = 0
        cur_n = n

    def print(self, prefix='', jeopardy=False):
        if prefix != '':
    def string_output(self, prefix='', jeopardy=False):
        "Return qa pairs to the user"
        if prefix != "" and prefix[-1] != ' ':
            prefix += ' '

        if len(self.qa_pairs) == 0:
            print("No qa generated yet!")
            return None

        res = []
        for qa_pair in self.qa_pairs:
            if jeopardy:
                print('\"' + prefix + qa_pair['answer'] + '\",\"' + qa_pair['question'] + '\"')
                string = f"\"{prefix}{qa_pair['answer']}\",\"\
                           {qa_pair['question']}\""
            else:
                print('\"' + prefix + qa_pair['question'] + '\",\"' + qa_pair['answer'] + '\"')
                string = f"\"{prefix}{qa_pair['question']}\",\"\
                           {qa_pair['answer']}\""
            res.append(string)
        return res

    def export(self, filename, prefix='', jeopardy=False):
        if prefix != '':
    def print(self, *args, **kwargs):
        "Print qa pairs to the user"
        print(self.string_output(*args, **kwargs))

    def pprint(self, *args, **kwargs):
        "Prettyprint qa pairs to the user"
        pprint(self.string_output(*args, **kwargs))
    def pandas_output(self, prefix=''):
        if len(self.qa_pairs) == 0:
            print("No qa generated yet!")
            return None
        "Output a Pandas DataFrame containing qa pairs and metadata"
        df = pd.DataFrame(columns=list(self.qa_pairs[0].keys()))
        for qa in self.qa_pairs:
            df = df.append(qa, ignore_index=True)
        for i in df.index:
            for c in df.columns:
                if pd.isna(df.loc[i, c]):
                    # otherwise export functions break:
                    df.loc[i, c] = "Error"
        return df
    def csv_export(self, filename, prefix='', jeopardy=False):
        "Export qa pairs as csv file"
        if len(self.qa_pairs) == 0:
            print("No qa generated yet!")
            return None
        if prefix != "" and prefix[-1] != ' ':
            prefix += ' '

        with open(filename, 'w', newline='') as file:
            writer = csv.writer(file)
            for qa_pair in self.qa_pairs:
                if jeopardy:
                    writer.writerow([prefix + qa_pair['answer'], qa_pair['question']])
                else:
                    writer.writerow([prefix + qa_pair['question'], qa_pair['answer']])
        df = self.pandas_output(prefix)

        def _remove_commas(string):
            return string.replace(",", r"\,")

        for i in df.index:
            for c in df.columns:
                df.loc[i, c] = _remove_commas(df.loc[i, c])

        df.to_csv(filename)
        print(f"Done writing qa pairs to {filename}")
    def json_export(self, filename, prefix='', jeopardy=False):
        "Export qa pairs as json file"
        if len(self.qa_pairs) == 0:
            print("No qa generated yet!")
            return None
        if prefix != "" and prefix[-1] != ' ':
            prefix += ' '

        self.pandas_output(prefix).to_json(filename)
        print(f"Done writing qa pairs to {filename}")


#def _debug_signal_handler(signal, frame):
#    """
#    According to stackoverflow, this allows to make the script interruptible
#    and resume it at will (ctrl+C / c)
#    https://stackoverflow.com/questions/10239760/interrupt-pause-running-python-program-in-pdb/39478157#39478157
#    """
#    pdb.set_trace()
#
#
#signal.signal(signal.SIGINT, _debug_signal_handler)
@@ -0,0 +1 @@
The folder `examples_script` is a collection of ad hoc scripts by users. They probably cannot be reused as-is by anyone else, but they provide some examples of how to use Autocards in the field for specific tasks.
Review thread:

When working with a class, variables that have to be shared across multiple functions should be fields of that class, through something like `self.qa_pair_count = len(self.qa_pairs)`. Use of global variables is discouraged in Python (and mostly everywhere else); that's why you have to explicitly state that you want those to be global.

Yeah, I decided to do it like that because I was not sure when the line `n = len(self.qa_pairs)` would be executed otherwise. For example, it has to be reset when `clear_qa` is run. Looking back, it's actually a much worse idea than I thought, because I can imagine most users loading Autocards while already having set an `n` variable somewhere... Unfortunately, I don't feel comfortable correcting this myself as I'm unfamiliar with classes. It's actually my first class ever... Would you be interested in explaining the best route to me, or doing it yourself? I will surely have to ankify that...

To be more clear: I don't feel up to deciding when and how to set `n` and `cur_n`. I can take care of the issue of organizing `qa_pairs` differently later on myself.

No worries, so I'll tackle this myself after the PR, as we decided.
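For reference, the change being discussed could look roughly like the sketch below: the module-level `n` / `cur_n` counters become instance attributes that `_call_qg` updates and `clear_qa` resets. This is only an illustration of the reviewer's suggestion, not the code that was merged; the attribute name `qa_pair_count` is taken from the comment above, and the helper `_register_new_pairs` is hypothetical.

```
# Sketch only: track the number of generated qa pairs with instance
# attributes instead of module-level globals (n / cur_n in the PR).
class Autocards:
    def __init__(self):
        self.qa_pairs = []
        self.qa_pair_count = 0          # replaces the global n / cur_n

    def _register_new_pairs(self, new_pairs):
        # would be called from _call_qg after self.qg(text) returns
        self.qa_pairs += new_pairs
        diff = len(self.qa_pairs) - self.qa_pair_count
        self.qa_pair_count = len(self.qa_pairs)
        print(f"Added {diff} qa pairs (total = {self.qa_pair_count})")

    def clear_qa(self):
        "Delete currently stored qa pairs"
        self.qa_pairs = []
        self.qa_pair_count = 0          # no global state left to reset
```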