Bird migration survey protocol - Government of Alberta, Canada
Bring your PDF documents to life with wordclouds that represent the keywords and important topics in the text. Wordclouds are sometimes overlooked, yet they are a useful way to summarize information quickly. This repository also makes it easy to add a mask (shape) to the wordcloud.
To use or develop the repo locally, optionally fork this repository to your GitHub account, then clone it to your computer. To clone the master branch, navigate to the target directory from the console and run:
> git clone -b master https://github.com/gstaxy/pdf2wordcloud.git
This app was built in a specific working environment, and reproducing that environment keeps all functionalities working. To get familiar with virtual environments, please read the tutorial I wrote on the subject for Windows and Linux users. From the console (here with PowerShell), run the following lines:
# If not installed already
> pip install virtualenv
> virtualenv --version
virtualenv 20.0.18
# Create the virtual environment
> virtualenv venv --python=python3.7.6
# Activate the virtual environment
> venv/Scripts/activate.ps1
# Install the library requirements
(venv)> pip install -r requirements.txt
Now, the environment is ready to generate wordclouds!
- Drop the PDF document in the pdf_files/ folder.
- In config.py, replace the pdf_filename in FILENAME with the name of the document to use.
- Customize the wordcloud look and content from config.py. More details are in the Customization section.
- From the root directory, run this line in the console to generate the wordcloud.
(venv)> py main.py
- All the processing steps are described in the console, and the image appears in a separate window once it's ready. Simultaneously, the wordcloud is saved in the saved_wc/ folder.
Most of the wordcloud configurations are located in config.py and are loaded directly from there when running main().
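As a rough orientation, the settings named in this README could be laid out as below. This is only a sketch: the variable names come from this document, but every value is a placeholder chosen for illustration, and the actual config.py in the repository may be organized differently.

```python
# config.py (illustrative sketch; all values below are placeholders, not the repo defaults)

# Name of the PDF dropped in pdf_files/ (including the extension here is an assumption)
FILENAME = "my_document.pdf"

# Output image size in pixels (1584 x 396 is a commonly cited LinkedIn banner size)
FIG_WIDTH = 1584
FIG_HEIGHT = 396

# Background color and name of the matplotlib colormap applied to the words
BG_COLOR = "white"
WORDS_COLORMAP = "viridis"

# Maximum number of words drawn in the wordcloud
NUM_OF_WORDS = 200

# URL of a .png image used as the mask (hypothetical address)
IMAGE_LINK = "https://example.com/shape.png"
```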
The most common stopwords are already filtered with the nltk library in the text cleaning step. To add custom stopwords, copy the examples/stopwords.txt starter file to the root directory and customize it.
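The idea of combining a base stopword list with a custom file can be sketched as follows. This is not the code from lib/cleaning.py: the helper names are hypothetical, the one-word-per-line file format is an assumption, and a tiny hard-coded set stands in for the nltk English list so the example runs without any download.

```python
from pathlib import Path


def load_custom_stopwords(path):
    """Read stopwords from a text file, one word per line (assumed format)."""
    file = Path(path)
    if not file.exists():
        # No custom file present: fall back to an empty set
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, base_stopwords, custom_stopwords):
    """Drop every token found in either stopword set (case-insensitive)."""
    blocked = set(base_stopwords) | set(custom_stopwords)
    return [token for token in tokens if token.lower() not in blocked]


# Stand-in for the nltk English stopword list (assumption, for illustration only)
BASE_STOPWORDS = {"the", "and", "of"}

tokens = ["The", "migration", "of", "birds", "and", "Alberta"]
print(remove_stopwords(tokens, BASE_STOPWORDS, {"alberta"}))
# prints ['migration', 'birds']
```

Keeping the custom words in a separate set means the base list stays untouched, which makes it simple to swap in a different language later.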
The current size configurations are specific to LinkedIn profile banners. To customize the image size, change the pixel values of FIG_HEIGHT and FIG_WIDTH in config.py. Some common image sizes used on social media can be found on this website.
The background color (BG_COLOR) and the text colormap (WORDS_COLORMAP) can both be changed in config.py. Available matplotlib colormaps can be found here.
The number of words can be changed under NUM_OF_WORDS in config.py.
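To give an intuition for what a word limit does, the sketch below keeps only the most frequent tokens of a small list. It assumes that the limit selects words by frequency, which is how wordcloud generators typically behave; the function name is hypothetical and the repository may rely on the wordcloud library to do this internally.

```python
from collections import Counter

NUM_OF_WORDS = 3  # deliberately small for illustration


def top_words(tokens, limit):
    """Return the `limit` most frequent tokens with their counts."""
    return Counter(tokens).most_common(limit)


tokens = ["bird", "survey", "bird", "migration", "survey", "bird", "protocol"]
print(top_words(tokens, NUM_OF_WORDS))
# prints [('bird', 3), ('survey', 2), ('migration', 1)]
```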
An image outline can be added to the wordcloud to give it a specific shape. To do so, find a .png image and copy its URL into IMAGE_LINK in the config.py file. Make sure the copied URL ends with .png. The black outline can be modified or removed by changing arguments in lib/cloud.py.
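A quick way to check a link before pasting it into IMAGE_LINK is sketched below. The helper is hypothetical and not part of the repository; it only tests that the string is an http(s) URL whose very last characters are .png, as required above, and does not download or validate the image itself.

```python
from urllib.parse import urlparse


def is_png_url(url):
    """Return True when `url` is an http(s) URL that ends with .png."""
    cleaned = url.strip()
    parsed = urlparse(cleaned)
    has_web_scheme = parsed.scheme in ("http", "https")
    has_host = bool(parsed.netloc)
    return has_web_scheme and has_host and cleaned.lower().endswith(".png")


print(is_png_url("https://example.com/shapes/bird.png"))             # True
print(is_png_url("https://example.com/shapes/bird.png?size=large"))  # False: trailing query string
print(is_png_url("https://example.com/shapes/bird.jpg"))             # False: not a .png
```

If a copied link carries a trailing query string, trimming everything after .png is often enough, provided the shortened link still opens the image in a browser.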
The default language used to filter the text is English. To change it, modify line 28 of lib/cleaning.py to the desired language. The custom stopwords will also need to be changed accordingly.
Here are some reproducible examples. Source images are located in the examples/ folder.
Click on any wordcloud image to open the link to its source PDF.
Oil sand annual monitoring report - Government of Alberta, Canada
Pride and Prejudice - by Jane Austen
Robinson Crusoe - by Daniel Defoe
Resume samples - Bellevue University
- Add argparse to the main() function so its arguments can be modified directly from the command line.
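One possible shape for this item is sketched below with Python's standard argparse module. Nothing here exists in the repository yet: the flag names are hypothetical, and each one is assumed to override the matching value in config.py when provided.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical command-line overrides for values stored in config.py."""
    parser = argparse.ArgumentParser(
        description="Generate a wordcloud from a PDF document."
    )
    parser.add_argument("--filename", help="PDF in pdf_files/ (would override FILENAME)")
    parser.add_argument("--num-words", type=int, help="would override NUM_OF_WORDS")
    parser.add_argument("--bg-color", help="would override BG_COLOR")
    return parser.parse_args(argv)


# Simulate: py main.py --filename report.pdf --num-words 150
args = parse_args(["--filename", "report.pdf", "--num-words", "150"])
print(args.filename, args.num_words, args.bg_color)
# prints report.pdf 150 None
```

Options left unset come back as None, so main() could fall back to the config.py value whenever an argument is missing.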
Any contribution to the project is welcome and encouraged. To propose an addition or improvement, please open an Issue or make a Pull Request.