Corpora, models, and tools for the study of complex language.
See this notebook for a more interactive quickstart (run the code here on Binder).
Open a terminal, Jupyter, or Colab notebook and type:
pip install -qU lltk-dh
# or, for the very latest version:
#pip install -qU git+https://github.com/quadrismegistus/lltk
Show available corpora:
lltk show
Or, within Python, show as markdown:
import lltk
lltk.show()
See below for available corpora.
# Load/install a corpus
import lltk
corpus = lltk.load('ECCO_TCP') # load the corpus by name or ID
# Metadata
meta = corpus.meta # metadata as data frame
smpl = meta.query('1770<year<1830') # easy query access
# Data
mfw = corpus.mfw() # get the 10K most frequent words as a list
dtm = corpus.dtm() # get a document-term matrix as a pandas dataframe
dtm = corpus.dtm(tfidf=True) # get DTM as tf-idf
mdw = corpus.mdw('gender') # get most distinctive words for a metadata group
# accessing text objs
texts = corpus.texts() # get a list of corpus's text objects
texts_smpl = corpus.texts(smpl) # text objects from df/list of ids
texts_rad = corpus.au.Radcliffe # hit "tab" after typing e.g. "Rad" to autocomplete
text = corpus.t # get a random text object from corpus
# metadata access
text_meta = text.meta # get text metadata as dictionary
author = text.author # get common metadata as attributes
title = text.title
year = text.year
dec = text.decade # a few inferred attributes too
dec_str = text.decade_str # '1890-1900' rather than 1890
# data access
txt = text.txt # get plain text as string
xml = text.xml # get xml as string
# simple nlp
words = text.words # get list of words (excl punct)
sents = text.sents # get list of sentences
counts = text.counts # get word counts as dictionary (from JSON if saved)
# other nlp
tnltk = text.nltk # get nltk Text object
tblob = text.blob # get TextBlob object
tstanza = text.stanza # get list of stanza objects (one per para)
tspacy = text.spacy # get list of spacy objects (one per para)
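To make the "simple nlp" attributes above more concrete, here is a minimal standalone sketch of what a word list (excluding punctuation) and a word-count dictionary look like. It does not use LLTK: the helper names `tokenize` and `count_words` and the regex are assumptions for illustration, and LLTK's own tokenizer may split text differently.

```python
import re
from collections import Counter

def tokenize(txt):
    """Lowercase the text and return its words, dropping punctuation."""
    # \w+ matches runs of letters, digits, and underscores; the optional
    # group keeps simple contractions such as "don't" together.
    return re.findall(r"\w+(?:'\w+)?", txt.lower())

def count_words(txt):
    """Return a plain dict mapping each word to its count."""
    return dict(Counter(tokenize(txt)))

sample = "The castle was dark; the night, darker still."
words = tokenize(sample)      # list of words, no punctuation
counts = count_words(sample)  # word -> count
print(words)
print(counts["the"])
```

A plain dictionary of counts serializes directly to JSON, which fits the note above that `text.counts` can be read from JSON if saved.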
LLTK has built-in functionality for the following corpora. Some (π) are freely downloadable from the links below or through the LLTK interface. Others (β) require first obtaining the raw data through your institutional or other subscription. Some corpora are mixed: part of the data is open under fair research use (e.g. metadata, freqs) and part is closed (e.g. txt, xml, raw).
Incomplete for now. See this sample notebook for some examples.
Import a corpus into LLTK:
# -path_txt: a folder of txt files (use -path_xml for xml)
# -path_metadata: a metadata csv/tsv/xls about those txt files
# -col_fn: .txt/.xml filename col in metadata (use -col_id if no ext)
lltk import \
-path_txt mycorpus/txts \
-path_metadata mycorpus/meta.xls \
-col_fn filename
Or create a new one:
lltk create
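The import command above expects a metadata file with a column naming each text file. If you only have a folder of `.txt` files, a metadata CSV can be generated along these lines. This is a sketch using only the standard library: `write_metadata` is a hypothetical helper, and the `title` column is a placeholder filled from the file stem. Only the `filename` column corresponds to the `-col_fn filename` option shown above.

```python
import csv
import tempfile
from pathlib import Path

def write_metadata(path_txt, path_metadata):
    """Write a CSV with one row per .txt file found in `path_txt`."""
    rows = [
        {"filename": p.name, "title": p.stem}  # "title" is a placeholder column
        for p in sorted(Path(path_txt).glob("*.txt"))
    ]
    with open(path_metadata, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["filename", "title"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Demo inside a temporary directory, with two made-up text files
with tempfile.TemporaryDirectory() as tmp:
    txt_dir = Path(tmp) / "txts"
    txt_dir.mkdir()
    (txt_dir / "udolpho.txt").write_text("...", encoding="utf-8")
    (txt_dir / "italian.txt").write_text("...", encoding="utf-8")
    demo_rows = write_metadata(txt_dir, Path(tmp) / "meta.csv")
    demo_csv = (Path(tmp) / "meta.csv").read_text(encoding="utf-8")

print(demo_csv)
```

In real use you would extend each row with whatever metadata you have (author, year, and so on) before importing.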
corpus.mfw_df(
n=None, # Number of top words overall
by_ntext=False, # Count number of documents not number of words
by_fpm=False, # Count by within-text relative sums
min_count=None, # Minimum count of word
yearbin=None, # Average relative counts across `yearbin` periods
col_group='period', # Which column to periodize on
n_by_period=None, # Number of top words per period
keep_periods=True, # Keep periods in output dataframe
n_agg='median', # How to aggregate across periods
min_periods=None, # minimum number of periods a word must appear in
excl_stopwords=False, # Exclude stopwords (at `PATH_TO_ENGLISH_STOPWORDS`)
excl_top=0, # Exclude words ranked 1 through `excl_top`
valtype='fpm', # valtype to compute top words by
**attrs
)
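Several of the options above choose between different ways of counting. The sketch below illustrates three of them on toy data: summing raw word counts, counting the number of documents a word appears in (the idea behind `by_ntext`), and rescaling counts to a relative value. It assumes `fpm` stands for frequency per million tokens, which the signature above does not spell out. The function names are hypothetical and this is not LLTK's implementation.

```python
from collections import Counter

def total_counts(docs):
    """Sum raw word counts over all documents."""
    total = Counter()
    for words in docs.values():
        total.update(words)
    return total

def doc_counts(docs):
    """Count the number of documents each word appears in."""
    total = Counter()
    for words in docs.values():
        total.update(set(words))  # each word counts once per document
    return total

def fpm(counter):
    """Rescale counts to frequency per million tokens (assumed meaning of fpm)."""
    n = sum(counter.values())
    return {w: c / n * 1_000_000 for w, c in counter.items()}

docs = {
    "t1": "the abbey the forest the abbey".split(),
    "t2": "the forest".split(),
}
by_count = total_counts(docs)
by_ntext = doc_counts(docs)
print(by_count["abbey"], by_ntext["abbey"])
print(round(fpm(by_count)["the"]))
```

Here "abbey" and "forest" tie on raw counts, but "forest" appears in more documents, which is the kind of difference the document-count option is meant to surface.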
corpus.dtm(
words=[], # words to use in DTM
n=25000, # if not `words`, how many mfw?
texts=None, # set texts to use explicitly, otherwise use all
tf=False, # return term frequencies, not counts
tfidf=False, # return tfidf, not counts
meta=False, # include metadata (e.g. ["gender","nation"])
**mfw_attrs, # all other attributes passed to self.mfw()
)
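For readers new to document-term matrices, the following sketch shows the three kinds of cell values the `tf` and `tfidf` flags above distinguish: raw counts, term frequencies, and tf-idf. It is a plain-Python illustration under stated assumptions, not LLTK's code: the function names are hypothetical, term frequency is taken relative to the chosen vocabulary, and the tf-idf weighting is `tf * log(N / df)`, one common variant among several.

```python
import math
from collections import Counter

def build_dtm(docs, vocab):
    """Rows are documents, columns follow `vocab`; cells are raw counts."""
    dtm = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        dtm[doc_id] = [counts[w] for w in vocab]
    return dtm

def to_tf(dtm):
    """Divide each row by its total over the vocabulary (relative frequency)."""
    out = {}
    for doc_id, row in dtm.items():
        total = sum(row) or 1  # avoid dividing by zero for empty rows
        out[doc_id] = [c / total for c in row]
    return out

def to_tfidf(dtm):
    """Weight tf by log(N / df), where df is the number of rows containing the word."""
    n_docs = len(dtm)
    n_cols = len(next(iter(dtm.values())))
    df = [sum(1 for row in dtm.values() if row[j] > 0) for j in range(n_cols)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return {doc_id: [t * i for t, i in zip(row, idf)]
            for doc_id, row in to_tf(dtm).items()}

docs = {
    "t1": "the abbey the forest".split(),
    "t2": "the forest the night".split(),
}
vocab = ["the", "abbey", "forest", "night"]
dtm = build_dtm(docs, vocab)
tf = to_tf(dtm)
tfidf = to_tfidf(dtm)
print(dtm["t1"])
print(tf["t1"])
print(tfidf["t1"])
```

Words that occur in every document ("the", "forest") get a tf-idf of zero under this variant, while words confined to one document keep a positive weight.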
corpus.mdw(
groupby, # metadata categorical variable to group by
words=[], # explicitly set words to use
texts=None, # explicitly set texts to use
tfidf=True, # use tfidf as mdw calculation
keep_null_cols=False, # remove texts which do not have `groupby` set
remove_zeros=True, # remove rows summing to zero
agg='median', # aggregate by `agg`
)
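The signature above does not say which statistic LLTK uses to rank distinctive words, so the sketch below is only one plausible reading of the `groupby` and `agg='median'` options: aggregate each word's score by group with the median, then rank words by how far a group's median exceeds the median over all other labeled texts. The function `distinctive_words`, the toy scores, and the group labels are all assumptions for illustration.

```python
from statistics import median

def distinctive_words(rows, groups, vocab, top_n=3):
    """Rank words per group by (median inside group) - (median outside group)."""
    # Drop texts with no group label
    labeled = {d: g for d, g in groups.items() if g is not None and d in rows}
    result = {}
    for g in sorted(set(labeled.values())):
        inside = [rows[d] for d, label in labeled.items() if label == g]
        outside = [rows[d] for d, label in labeled.items() if label != g]
        if not outside:
            continue  # nothing to compare this group against
        scored = []
        for j, w in enumerate(vocab):
            gap = median(r[j] for r in inside) - median(r[j] for r in outside)
            if gap > 0:  # keep only words scoring higher inside the group
                scored.append((gap, w))
        scored.sort(reverse=True)
        result[g] = [w for _, w in scored[:top_n]]
    return result

vocab = ["castle", "ledger", "the"]
rows = {  # made-up per-text scores (e.g. tf-idf), one column per vocab word
    "t1": [0.20, 0.00, 0.50],
    "t2": [0.30, 0.01, 0.50],
    "t3": [0.00, 0.25, 0.50],
    "t4": [0.02, 0.35, 0.50],
    "t5": [0.90, 0.90, 0.50],  # no group label, so it is ignored
}
groups = {"t1": "gothic", "t2": "gothic", "t3": "realist", "t4": "realist", "t5": None}
mdw = distinctive_words(rows, groups, vocab)
print(mdw)
```

Using the median rather than the mean keeps a single unusually long or repetitive text from dominating a group's score.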