Commit dfe9ae2

Write minimal instructions for reproduction
1 parent bf523c1 commit dfe9ae2

1 file changed, 16 additions and 4 deletions

README.md

Lines changed: 16 additions & 4 deletions
@@ -1,6 +1,18 @@
-Implement latetent dirichlet allocation.
+# Latent Dirichlet Allocation
+
+Implement variational inference algorithm for latent dirichlet allocation.
 Train model on a small subset of wikipedia.
+Evaluate and visualize with pyLDAvis
+
+To reproduce check the following scripts:
+- scripts/setup_anaconda_env.bash to build suitable anaconda-environment.
+- scripts/00_setup.bash to download the wikipedia dataset.
+- scripts/extractSmallSubset.bash to extract a subset of the dataset.
+- scripts/01_preprocess.bash to process xml files and save the dictionary and wordcounts for each document.
+- scripts/02_training.bash to estimate the distribution parameters and save the
+- to visualize run the jupyter-notebook with the same name and point it to the location of your trained model (by setting the path in the second cell). A Small model is in
 
-Questions:
-- is english ok?
-- gsm.parsing.preprocessing.preprocess_string does all at once, ok?
+There are three relevant Python classes in the package **lda**.
+- Dataset in lda/dataset.py for all corpus preprocessing operations as well as loading and saving datasets in the native Python serialization format pickle.
+- LDA in lda/inference.py to perform the inference algorithm on a dataset
+- GenMod in lda/generativeModel.py to sample from a LDA model given the hyperparameters
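
The README text in this diff mentions sampling from an LDA model given the hyperparameters. As a rough illustration only, here is a minimal standard-library sketch of the textbook LDA generative process. It is not the repository's GenMod class: the function names, parameters, and default values (`generate_corpus`, `alpha`, `eta`, `seed`) are hypothetical and chosen for this example.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(rng, probs):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, vocab_size, num_topics,
                    alpha=0.5, eta=0.1, seed=0):
    """Sample documents from the standard LDA generative process.

    Hypothetical helper: symmetric Dirichlet priors and a fixed
    document length are simplifying assumptions for this sketch.
    """
    rng = random.Random(seed)
    # One word distribution per topic: beta_k ~ Dirichlet(eta).
    topics = [sample_dirichlet(rng, [eta] * vocab_size)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Per-document topic mixture: theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet(rng, [alpha] * num_topics)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(rng, theta)  # topic assignment
            words.append(sample_categorical(rng, topics[z]))  # word id
        docs.append(words)
    return topics, docs
```

Calling `generate_corpus(num_docs=3, doc_len=5, vocab_size=8, num_topics=2)` returns the sampled topic-word distributions and a list of documents, each a list of integer word ids. A synthetic corpus like this is one way to check whether an inference routine recovers known topics.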
