| Felix Holz | David Wildauer | Leopold Magurano |
| --- | --- | --- |
| Data (pre-)processing | ML model | Visualization and evaluation |
Language evolves over time. We assume that, given a large enough dataset, a machine learning model can pick up on these gradual changes.
In this project, we employ various document representation techniques, such as word frequency analysis or doc2vec (an extension of word2vec), to create embeddings of the documents. These embeddings are used to train a machine learning model that predicts the time period (year) in which a text snippet was published.
The raw, unprocessed dataset can be downloaded here.
For our training, test, and evaluation sets we use text snippets from a subset of the publicly available Project Gutenberg eBooks. The snippets and their corresponding publication dates are parsed from the eBooks extracted from the .zim file[1] and written to a simple database, giving everyone on our team easy access to them. We then implement and test different embeddings of these snippets, which serve as input to the machine learning model that predicts the time period, or more precisely the year, in which a snippet was published.