An implementation of latent Dirichlet allocation.

chrlen/lda

Latest commit: e446785 · Feb 23, 2019

Latent Dirichlet Allocation

This project implements a variational inference algorithm for latent Dirichlet allocation (LDA), trains a model on a small subset of Wikipedia, and evaluates and visualizes the result with pyLDAvis.
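As a rough illustration of what variational inference for LDA involves, the sketch below runs the standard per-document coordinate-ascent updates for the variational parameters (the word-topic responsibilities `phi` and the Dirichlet parameter `gamma`). It is not taken from this repository: the names `e_step` and `digamma`, the argument layout, and the convergence settings are all assumptions made for the example, and the small `digamma` helper merely stands in for `scipy.special.digamma` to keep the block dependent on NumPy only.

```python
import numpy as np


def digamma(x):
    """Approximate the digamma function (stand-in for scipy.special.digamma)."""
    x = np.atleast_1d(np.asarray(x, dtype=float)).copy()
    result = np.zeros_like(x)
    # Use psi(x) = psi(x + 1) - 1/x to shift small arguments up to x >= 6.
    while np.any(x < 6):
        small = x < 6
        result[small] -= 1.0 / x[small]
        x[small] += 1.0
    # Asymptotic expansion, accurate for x >= 6.
    inv = 1.0 / x
    inv2 = inv * inv
    result += np.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result


def e_step(word_ids, alpha, beta, n_iter=50, tol=1e-6):
    """Hypothetical per-document variational update.

    word_ids: list of vocabulary indices, one per token in the document
    alpha:    Dirichlet hyperparameter (scalar or vector of length K)
    beta:     (K, V) topic-word probabilities, all entries > 0
    """
    n_topics = beta.shape[0]
    n_words = len(word_ids)
    gamma = alpha + np.full(n_topics, n_words / n_topics)
    phi = np.full((n_words, n_topics), 1.0 / n_topics)
    for _ in range(n_iter):
        # phi[n, k] is proportional to beta[k, w_n] * exp(digamma(gamma[k])).
        log_phi = np.log(beta[:, word_ids].T) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma[k] = alpha[k] + sum over tokens of phi[n, k].
        new_gamma = alpha + phi.sum(axis=0)
        converged = np.abs(new_gamma - gamma).max() < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, phi
```

In a full training loop, these per-document updates would alternate with a re-estimation of `beta` (and optionally `alpha`) from the accumulated `phi` values.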

To reproduce the results, run the following scripts:

  • scripts/setup_anaconda_env.bash to build a suitable Anaconda environment.
  • scripts/00_setup.bash to download the Wikipedia dataset.
  • scripts/extractSmallSubset.bash to extract a subset of the dataset.
  • scripts/01_preprocess.bash to process the XML files and save the dictionary and word counts for each document.
  • scripts/02_training.bash to estimate the distribution parameters and save the
  • To visualize, run the Jupyter notebook with the same name and point it to the location of your trained model (by setting the path in the second cell). A small model is in
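The preprocessing step listed above produces a dictionary and per-document word counts. A minimal sketch of that idea, using only the standard library, might look like the following. The function names `tokenize` and `build_dataset`, the lowercase-letters-only tokenizer, and the output layout are assumptions for illustration and do not describe what `scripts/01_preprocess.bash` actually does.

```python
import pickle
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase the text and keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def build_dataset(documents):
    """Map each word to an integer id and count word ids per document."""
    dictionary = {}  # word -> integer id, assigned in order of first appearance
    counts = []      # one {word_id: count} mapping per document
    for text in documents:
        doc_counts = Counter()
        for token in tokenize(text):
            word_id = dictionary.setdefault(token, len(dictionary))
            doc_counts[word_id] += 1
        counts.append(dict(doc_counts))
    return dictionary, counts


def save_dataset(path, dictionary, counts):
    """Store the dictionary and counts with pickle (assumed file layout)."""
    with open(path, "wb") as handle:
        pickle.dump({"dictionary": dictionary, "counts": counts}, handle)
```

Storing counts rather than raw token sequences is sufficient for LDA because the model treats each document as a bag of words.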

There are three relevant Python classes in the package lda:

  • Dataset in lda/dataset.py handles all corpus preprocessing operations, as well as loading and saving datasets in pickle, the native Python serialization format.
  • LDA in lda/inference.py runs the inference algorithm on a dataset.
  • GenMod in lda/generativeModel.py samples from an LDA model given the hyperparameters.
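To make the idea of sampling from an LDA model concrete, here is a short NumPy sketch of the generative process: draw a topic mixture per document from a Dirichlet, then draw a topic and a word for each token. It does not reproduce the GenMod class; the function name `sample_corpus` and its arguments are hypothetical, and document lengths are passed in directly rather than drawn from a distribution.

```python
import numpy as np


def sample_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative process (illustrative only).

    alpha:       Dirichlet hyperparameter, vector of length K
    beta:        (K, V) topic-word probabilities, each row summing to 1
    doc_lengths: number of tokens to generate for each document
    """
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = beta.shape
    docs = []
    for length in doc_lengths:
        # Per-document topic proportions.
        theta = rng.dirichlet(alpha)
        # One topic assignment per token.
        topics = rng.choice(n_topics, size=length, p=theta)
        # One word per token, drawn from the assigned topic's distribution.
        words = [int(rng.choice(vocab_size, p=beta[k])) for k in topics]
        docs.append(words)
    return docs
```

Fixing the seed makes the sampled corpus reproducible, which is convenient when checking whether inference recovers the topics used for generation.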