This repository has been archived by the owner on May 5, 2023. It is now read-only.

new revision #15

Merged 31 commits into from Jul 13, 2021

Changes from 7 commits
c107544
docs: added example files
Jul 7, 2021
2eaa499
new revision needs new libraries
Jul 7, 2021
96af48d
added example script by users
Jul 7, 2021
fd4f5f0
new revision readme
Jul 7, 2021
9338b1c
modified: autocards.py
Jul 7, 2021
db01b7e
Update README.md
thiswillbeyourgithub Jul 7, 2021
03ca81c
Update autocards.py
thiswillbeyourgithub Jul 7, 2021
1204b35
better _sanitize_text function
Jul 9, 2021
4cc55c5
style: methods called after pandas
Jul 9, 2021
0142744
remove function defined but used once
Jul 9, 2021
d1a6abf
removed commented section launching pdb
Jul 9, 2021
9d5d8c3
better docstring for sanitize text
Jul 9, 2021
4355add
removed unecessary import from the commented area
Jul 9, 2021
f18762c
style: added better default titles
Jul 9, 2021
3e57726
PEP8: line was too long
Jul 9, 2021
e139367
modified: autocards.py
Jul 9, 2021
823e4ff
minor: wrong fstring and extra newline at EOF
Jul 9, 2021
c1d2b25
fix: better title for text file
Jul 9, 2021
ea70298
docs: csv_export is now to_csv, same for json
Jul 10, 2021
6fcb6db
docs: clearer text
Jul 10, 2021
213c121
missing ebooklib + remove extra newline
Jul 10, 2021
90c69ae
docs: minor phrasing
Jul 10, 2021
de023d2
remove useless notebook
Jul 10, 2021
dea3d44
style: renamed qa_pairs to qa_dict
Jul 10, 2021
6e94f6b
phrasing
Jul 11, 2021
bb82eb9
more robust epub extraction
Jul 11, 2021
088bfe6
adds basic cloze functionnality and notetype
Jul 12, 2021
0f40dd3
adds basic cloze functionnality and notetype
Jul 12, 2021
92eca6a
added docstring for main class
Jul 12, 2021
3576c7f
feat: added a store_content and watermark flag
Jul 12, 2021
978e9ff
minor style
Jul 12, 2021
41 changes: 40 additions & 1 deletion README.md
@@ -1,2 +1,41 @@
# Autocards
* Automatically create flashcards from user input, PDF files, wikipedia summary, webpages, and more!
* For a real-world example, the complete output generated from [this article](https://www.biography.com/political-figure/philip-ii-of-macedon) can be found [in this folder](./output_example/). Nothing has been manually altered; it is the direct output.
* The code is PEP 8 compliant and every function has a docstring, so contributions and PRs are very welcome.
* Learn more by reading [the official write-up](https://psionica.org/docs/lab/autocards/).

## Install guide:
* `git clone https://github.com/Psionica/Autocards`
* `cd Autocards`
* `pip install -r ./requirements.txt`
* run a python console: `ipython3`
* install punkt by running `!python -m nltk.downloader punkt`

### Autocards usage
```
# loading
from autocards import Autocards
a = Autocards()

# feed the input text in one of the following ways:
a.consume_var(my_text, per_paragraph=True)
a.consume_user_input(title="")
a.consume_wiki_summary(keyword, lang="en")
a.consume_textfile(filename, per_paragraph=True)
a.consume_pdf(pdf_path, per_paragraph=True)
a.consume_web(source, mode="url", element="p")
# => * element is the html element, like p for paragraph
# * mode can be "url" or "local"

# several ways to get the results back:
out = a.string_output(prefix='', jeopardy=False)
# => * prefix is text that will be prepended to each qa pair
#    * jeopardy=True switches question and answer
a.print(prefix='', jeopardy=False)
a.pprint(prefix='', jeopardy=False) # pretty printing
df = a.pandas_output(prefix='')
a.csv_export("output.csv", prefix="", jeopardy=False)
a.json_export("output.json", prefix="", jeopardy=False)

# Note: users have contributed their own quick-and-dirty scripts that you can
# draw inspiration from; they are located in the folder `examples_script`
```
292 changes: 257 additions & 35 deletions autocards.py
@@ -1,63 +1,285 @@
from pipelines import qg_pipeline
from transformers import pipeline

from tqdm import tqdm
from pathlib import Path
import pandas as pd
import time
import signal
import pdb
import re
import os
from contextlib import suppress

import requests
import PyPDF2
import wikipedia
from wikipedia.exceptions import PageError
from bs4 import BeautifulSoup
import csv
from pprint import pprint

# otherwise csv and json outputs contain a warning string
os.environ["TOKENIZERS_PARALLELISM"] = "true"


class Autocards:
def __init__(self):
print("Loading backend...")
self.qg = qg_pipeline('question-generation',
model='valhalla/t5-base-qg-hl',
ans_model='valhalla/t5-small-qa-qg-hl')
self.qa_pairs = []
global n, cur_n
Owner: When working with a class, variables which have to be shared across multiple functions should be fields of that class, through something like self.qa_pair_count = len(self.qa_pairs). Use of global variables is discouraged in Python (and almost everywhere else); that's why you have to explicitly state that you want those to be global.

Collaborator Author: Yeah, I decided to do it like that because I was not sure when the line n = len(self.qa_pairs) would be executed otherwise. For example, it has to be reset when clear_qa is run.

Looking back, it's actually a far worse idea than I thought, because I can imagine most users loading Autocards already having set an n variable somewhere...

Unfortunately I don't feel comfortable correcting this myself, as I'm unfamiliar with classes. It's actually my first ever class... Would you be interested in explaining the best route to me, or doing it yourself? I will surely have to ankify that...

Collaborator Author: To be clearer: I don't feel up to deciding when and how to set n and cur_n. I can take care of organizing qa_pairs differently later on myself.

Owner: No worries, I'll tackle this myself after the PR, as we decided.

n = len(self.qa_pairs)
cur_n = n
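The field-based approach the owner suggests in the review above could look like the following sketch. The names (CardStore, add_pairs, qa_pair_count) are illustrative, not the final Autocards API:

```python
# Hypothetical sketch of the reviewer's suggestion: the pair counter
# lives on the instance instead of module-level globals, so clear_qa()
# can reset everything in one place.
class CardStore:
    def __init__(self):
        self.qa_pairs = []
        self.qa_pair_count = 0

    def add_pairs(self, new_pairs):
        """Store new qa pairs and return how many were added."""
        self.qa_pairs.extend(new_pairs)
        added = len(self.qa_pairs) - self.qa_pair_count
        self.qa_pair_count = len(self.qa_pairs)
        return added

    def clear_qa(self):
        """Reset stored pairs and the counter together."""
        self.qa_pairs = []
        self.qa_pair_count = 0
```

Because the counter is an attribute, importing or instantiating the class cannot collide with a user's own n variable, which is exactly the hazard the author worries about below.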

def _call_qg(self, text, title):
"""
Call the question generation module, then turn the answer into a
dictionary containing metadata (cloze formatting, creation time,
title, source text)
"""
try:
self.qa_pairs += self.qg(text)
except IndexError:
print(f"\nSkipping section because no cards \
could be made from it:{text}\n")
self.qa_pairs.append({"question": "skipped",
"answer": "skipped"})

global n, cur_n
cur_n = len(self.qa_pairs)
diff = cur_n - n
n = len(self.qa_pairs)

cur_time = time.asctime()
for i in range(diff):
i += 1
cloze = self.qa_pairs[-i]['question']\
+ "<br>{{c1::"\
+ self.qa_pairs[-i]['answer']\
+ "}}"
self.qa_pairs[-i] = {**self.qa_pairs[-i],
"clozed_text": cloze,
"creation_time": cur_time,
"title": title,
"source": text
}

Owner: This processing of self.qa_pairs could use some readability work. You can use range(start, end) to start from a certain value and avoid the weird negative index. self.qa_pairs could hold, well, the question/answer pairs, while another variable could be used to store this more complex structure with metadata and all, maybe?

Collaborator Author: Okay, so I took a look, and I can do this only after resolving the issue above, IMO.
tqdm.write(f"Added {diff} qa pair (total = {cur_n})")

def _sanitize_text(self, text):
"remove wikipedia style citation"
return re.sub(r"\[\d*\]", "", text)
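The sanitizer boils down to one regex. A standalone sketch of the same idea:

```python
import re

def sanitize_text(text):
    # strip Wikipedia-style citation markers such as [3] or [12];
    # note that \[\d*\] also matches empty brackets []
    return re.sub(r"\[\d*\]", "", text)
```

One subtlety worth knowing: `\d*` allows zero digits, so empty brackets `[]` are removed as well.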

def consume_var(self, text, title="", per_paragraph=False):
"Take text as input and create qa pairs"
text = text.replace('\xad ', '')

if per_paragraph:
for paragraph in tqdm(text.split('\n\n'),
desc="Processing by paragraph",
unit="paragraph"):
self._call_qg(paragraph, title)
else:
text = text.replace('\n\n', '. ').replace('..', '.')
text = self._sanitize_text(text)
self._call_qg(text, title)

def consume_user_input(self, title=""):
"Take user input and create qa pairs"
user_input = input("Enter your text below then press Enter (press\
enter twice to validate input):\n>")

print("\nFeeding your text to Autocards.")
user_input = self._sanitize_text(user_input)
self.consume_var(user_input, title, per_paragraph=False)
print("Done feeding text.")

def consume_wiki_summary(self, keyword, lang="en"):
"Take a wikipedia keyword and creates qa pairs from its summary"
if "http" in keyword:
print("To consume a wikipedia summary, you have to input \
the title of the article and not the url")
return None
wikipedia.set_lang(lang)
try:
wiki = wikipedia.page(keyword)
except PageError as e:
print(f"Page not found, error code:\n{e}")
return None
summary = wiki.summary
title = wiki.title
print(f"Article title: {title}")

summary = self._sanitize_text(summary)
self.consume_var(summary, title, True)

def consume_pdf(self, pdf_path, per_paragraph=True):
if not Path(pdf_path).exists():
print(f"PDF file not found at {pdf_path}!")
return None
pdf = PyPDF2.PdfFileReader(open(pdf_path, 'rb'))
try:
title = pdf.documentInfo['/Title']
print(f"PDF title : {title}")
except KeyError:
title = pdf_path.split("/")[-1]
print(f"PDF title : {title}")

full_text = []
for page in pdf.pages:
full_text.append(page.extractText())
text = " ".join(full_text)
text = text.replace(" ", "")
text = self._sanitize_text(text)

self.consume_var(text, title, per_paragraph)

def consume_textfile(self, filepath, per_paragraph=False):
"Take text file as input and create qa pairs"
if not Path(filepath).exists():
print(f"File not found at {filepath}")
return None
text = open(filepath).read()
text = self._sanitize_text(text)
self.consume_var(text,
filepath,
per_paragraph=per_paragraph)

def consume_web(self, source, mode="url", element="p"):
"Take html file (local or via url) and create qa pairs"
if mode not in ["local", "url"]:
return "invalid arguments"
if mode == "local":
soup = BeautifulSoup(open(source), 'xml')
elif mode == "url":
res = requests.get(source, timeout=15)
html = res.content
soup = BeautifulSoup(html, 'xml')

try:
el = soup.article.body.find_all(element)
except AttributeError:
print("Using fallback method to extract page content")
el = soup.find_all(element)

title = ""
with suppress(Exception):
title = soup.find("h1").text
if title.strip() == "":
with suppress(Exception):
title = soup.find("title").text
title = title.strip()
if title == "":
print("Couldn't find title of the page")
title = source

valid_sections = [] # remove text sections that are too short:
for section in el:
section = ' '.join(section.get_text().split())
if len(section) > 40:
valid_sections += [section]
else:
print(f"Ignored string because too short: {section}")

if not valid_sections:
print("No valid sections found, change the 'element' argument\
to look for other html sections than 'p'")
return None

for section in tqdm(valid_sections,
desc="Processing by section",
unit="section"):
section = self._sanitize_text(section)
self._call_qg(section, title)
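The title-fallback chain in consume_web can be expressed as a small helper. This is a hypothetical sketch (extract_title is not part of Autocards), but it shows the intended priority: first `<h1>`, then `<title>`, then a caller-supplied fallback such as the URL:

```python
from bs4 import BeautifulSoup

def extract_title(html, fallback=""):
    # try the first <h1>, then <title>; otherwise use the fallback
    soup = BeautifulSoup(html, "html.parser")
    for tag_name in ("h1", "title"):
        tag = soup.find(tag_name)
        if tag is not None and tag.get_text().strip():
            return tag.get_text().strip()
    return fallback
```

Using soup.find (which returns a single Tag or None) instead of soup.find_all (which returns a ResultSet without a .text attribute) avoids the silent failures the suppress blocks would otherwise hide.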

def clear_qa(self):
"Delete currently stored qa pairs"
self.qa_pairs = []
global n, cur_n
n = 0
cur_n = n

def string_output(self, prefix='', jeopardy=False):
"Return qa pairs to the user"
if prefix != "" and prefix[-1] != ' ':
prefix += ' '

if len(self.qa_pairs) == 0:
print("No qa generated yet!")
return None

res = []
for qa_pair in self.qa_pairs:
if jeopardy:
string = f"\"{prefix}{qa_pair['answer']}\",\"\
{qa_pair['question']}\""
else:
string = f"\"{prefix}{qa_pair['question']}\",\"\
{qa_pair['answer']}\""
res.append(string)
return res

def print(self, *args, **kwargs):
"Print qa pairs to the user"
print(self.string_output(*args, **kwargs))

def pprint(self, *args, **kwargs):
"Prettyprint qa pairs to the user"
pprint(self.string_output(*args, **kwargs))

def pandas_output(self, prefix=''):
"Output a Pandas DataFrame containing qa pairs and metadata"
if len(self.qa_pairs) == 0:
print("No qa generated yet!")
return None
df = pd.DataFrame(columns=list(self.qa_pairs[0].keys()))
for qa in self.qa_pairs:
df = df.append(qa, ignore_index=True)
for i in df.index:
for c in df.columns:
if pd.isna(df.loc[i, c]):
# otherwise export functions break:
df.loc[i, c] = "Error"
return df
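As an aside, appending rows one by one with DataFrame.append is quadratic and the method was removed in pandas 2.0. The same frame can be built in one pass; pairs_to_dataframe is a hypothetical helper sketching that, with fillna standing in for the per-cell "Error" loop above:

```python
import pandas as pd

def pairs_to_dataframe(qa_pairs, fill="Error"):
    # one-shot construction from a list of dicts; fillna replaces
    # missing metadata cells the way the per-cell loop does
    return pd.DataFrame(qa_pairs).fillna(fill)
```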

def csv_export(self, filename, prefix='', jeopardy=False):
"Export qa pairs as csv file"
if len(self.qa_pairs) == 0:
print("No qa generated yet!")
return None
if prefix != "" and prefix[-1] != ' ':
prefix += ' '

df = self.pandas_output(prefix)

def _remove_commas(string):
return string.replace(",", r"\,")

for i in df.index:
for c in df.columns:
df.loc[i, c] = _remove_commas(df.loc[i, c])

df.to_csv(filename)
print(f"Done writing qa pairs to {filename}")
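It may be worth noting that the backslash-escaping in _remove_commas above is only needed if the downstream importer demands it: both the csv module and pandas.to_csv use QUOTE_MINIMAL by default, which quotes any field containing the delimiter. A minimal demonstration:

```python
import csv
import io

# fields containing commas are quoted automatically (QUOTE_MINIMAL)
rows = [["What city, if any?", "Paris, France"]]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
```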

def json_export(self, filename, prefix='', jeopardy=False):
"Export qa pairs as json file"
if len(self.qa_pairs) == 0:
print("No qa generated yet!")
return None
if prefix != "" and prefix[-1] != ' ':
prefix += ' '

self.pandas_output(prefix).to_json(filename)
print(f"Done writing qa pairs to {filename}")


#def _debug_signal_handler(signal, frame):
# """
# According to stackoverflow, this allows to make the script interruptible
# and resume it at will (ctrl+C / c)
# https://stackoverflow.com/questions/10239760/interrupt-pause-running-python-program-in-pdb/39478157#39478157
# """
# pdb.set_trace()
#
#
#signal.signal(signal.SIGINT, _debug_signal_handler)
1 change: 1 addition & 0 deletions examples_script/README.md
@@ -0,0 +1 @@
The folder `examples_script` is a collection of ad hoc scripts by users. They probably cannot be reused as-is by anyone else, but they provide some examples of how to use Autocards in the field for specific tasks.