I built Linear Regression models to predict the readability score of a given text. The dataset comes from the CommonLit Readability Prize competition on Kaggle. This competition took place in July, and while I could not submit by the deadline, I still wish to share my work here and showcase my use of the data science methodology.
The scoring criterion used for the competition was the root mean squared error, or RMSE.
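For reference, a quick sketch of how RMSE is computed (it is also available as sklearn's mean_squared_error with squared=False):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the competition's scoring metric."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```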
As someone interested in NLP, I treated this as a passion project where I got to build a model from text features, and I really enjoyed the process.
Python Version: 3.8.10
Packages: pandas, numpy, matplotlib, seaborn, nltk, string, re, scipy, sklearn, readability
As we're predicting the reading ease of texts from literature, the data consists of excerpts from several time periods with a wide range of reading ease scores. The test set includes a slightly larger proportion of modern texts, the type of text we seek to generalize to, than the training set.
Link: https://www.kaggle.com/c/commonlitreadabilityprize/data
Columns (from Kaggle Data Description):
- id - unique ID for excerpt
- url_legal - URL of source - this is blank in the test set.
- license - license of source material - this is blank in the test set.
- excerpt - text to predict reading ease of
- target - reading ease
- standard_error - measure of spread of scores among multiple raters for each excerpt. Not included for test data.
Before any data processing, I checked the data shape (number of rows and columns), the data types, the presence of nulls, and descriptive statistics of the target variable. There were no nulls, and the target variable has a median (50th percentile) of -0.912.
Initially, the excerpts have an average character length of about 972 with a standard deviation of 117.24, which shows that the lengths of the excerpts in our dataset vary considerably.
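A minimal sketch of these checks, assuming the Kaggle training file has been downloaded locally as train.csv:

```python
import pandas as pd

train = pd.read_csv("train.csv")   # assumed local copy of the Kaggle training data

print(train.shape)                 # number of rows and columns
print(train.dtypes)                # data types per column
print(train.isnull().sum())        # null counts per column
print(train["target"].describe())  # descriptive statistics of the target

# Character length of each excerpt before any processing
char_len = train["excerpt"].str.len()
print(char_len.mean(), char_len.std())
```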
A word frequency distribution of the excerpts before NLP processing is also shown to give a better sense of the construction of the excerpts in terms of the 20 most common words:
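One way to produce such a distribution is NLTK's FreqDist, here assuming the train dataframe from the sketch above (and that the punkt tokenizer data has been downloaded):

```python
from nltk import word_tokenize, FreqDist

# Word frequencies across all raw (unprocessed) excerpts
all_words = [w for text in train["excerpt"] for w in word_tokenize(text)]
freq = FreqDist(all_words)

print(freq.most_common(20))  # the 20 most common words
freq.plot(20)                # matplotlib chart of the distribution
```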
Through NLTK, I performed the following NLP steps to prepare the data for model fitting (a rough sketch of the pipeline follows the list):
- convert to lowercase
- word tokenization
- punctuation removal
- stopword removal
- stemming and lemmatization
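A rough sketch of that pipeline, assuming the relevant NLTK corpora (punkt, stopwords, wordnet) are downloaded; my exact order of operations may have differed slightly:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, drop punctuation and stopwords, then stem and lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return " ".join(tokens)

train["processed"] = train["excerpt"].apply(preprocess)
```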
After processing, the most common words are contextual words rather than the stopwords that dominated the previous word frequency distribution chart:
My approach to feature engineering was to generate all the information I could from the excerpts, such as the number of nouns, verbs, and other parts of speech, and, after some research, to use the readability package to score each excerpt with different readability scoring systems and extract useful sentence information. I accomplished this with the NLTK part-of-speech (POS) tagger and the readability package, which provides scores and sentence info such as syllables, word types, long words, complex words, and similar details.
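A hedged sketch of both pieces, assuming the readability package's getmeasures interface (the package expects roughly one sentence per line, so the splitting below is only a rough stand-in) and that NLTK's averaged_perceptron_tagger data is available:

```python
import nltk
import readability  # the `readability` package from PyPI

def pos_counts(text):
    """Count coarse part-of-speech categories using the NLTK POS tagger."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return {
        "nouns": sum(t.startswith("NN") for t in tags),
        "verbs": sum(t.startswith("VB") for t in tags),
        "adjectives": sum(t.startswith("JJ") for t in tags),
        "adverbs": sum(t.startswith("RB") for t in tags),
    }

def readability_features(text):
    """Readability grades plus sentence info (syllables, word types, long/complex words)."""
    measures = readability.getmeasures(text.replace(". ", ".\n"), lang="en")
    feats = dict(measures["readability grades"])
    feats.update(measures["sentence info"])
    return feats
```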
Below is a screenshot of the final dataframe after these features were generated from the pre- and post-processed excerpts:
From these features, I visualized a Spearman correlation heatmap to get a sense of the features that correlate with the target variable the most, as well as where multicollinearity occurs among our features.
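A sketch of the heatmap, assuming the engineered features and target live in a dataframe called features_df (a hypothetical name):

```python
import seaborn as sns
import matplotlib.pyplot as plt

spearman_corr = features_df.corr(method="spearman")  # pairwise Spearman correlations

plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, cmap="coolwarm", center=0)
plt.title("Spearman correlation of engineered features and target")
plt.show()
```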
For the sake of simplicity, I decided to use features that had a Spearman correlation of at least 0.35 in absolute value with the target variable, regardless of multicollinearity. This resulted in the following final variables for model building:
The final feature I wished to use was the text itself, in the form of a TF-IDF (Term Frequency-Inverse Document Frequency) vector. Using sklearn's TfidfVectorizer, I created a vectorizer object to fit and transform the processed excerpt data accordingly, and used scipy's horizontal stacking function to generate my final training and testing datasets.
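A sketch of that step, assuming the processed text columns from earlier, a test dataframe holding the processed test excerpts, and a selected_features list (hypothetical name) with the numeric features that passed the correlation cut:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix

tfidf = TfidfVectorizer()
X_text_train = tfidf.fit_transform(train["processed"])
X_text_test = tfidf.transform(test["processed"])

# Stack the sparse TF-IDF matrix side by side with the numeric engineered features
X_train = hstack([X_text_train, csr_matrix(train[selected_features].values)])
X_test = hstack([X_text_test, csr_matrix(test[selected_features].values)])
```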
My plan was to model the data with Ridge Regression, since Ridge Regression is especially good for data with multicollinearity. Afterward, I would experiment with different models to see how they fared.
My best Ridge Regression model, after optimizing via RepeatedKFold cross-validation, achieved a root mean squared error of 0.709 and an R² of 0.535. Other models I tried include Lasso Regression, Random Forest Regressor, and K Neighbors Regressor, none of which achieved results as good as the initial Ridge Regression.
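A sketch of the tuning setup, assuming X_train from above; the alpha grid and repeat count are illustrative, not the exact values I used:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, GridSearchCV

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X_train, train["target"])
print(search.best_params_, -search.best_score_)  # best alpha and its CV RMSE
```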
After some more research, I decided to try my hand at a stacking model. For those who don't know, stacking is an ensemble technique that uses predictions from base learners as features for a meta learner. The meta learner is then used to make the final predictions on the test data.
I defined a stacking model with Bayesian Ridge Regression, Support Vector Machine, and Random Forest Regressor models as base learners and a Ridge Regression model as the meta learner, with the whole stacking ensemble using k=5 cross-validation.
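A sketch of this ensemble with sklearn's StackingRegressor; X_dense and y are placeholder names for the training matrix and target (BayesianRidge does not accept sparse input, so a sparse TF-IDF matrix would need .toarray() first):

```python
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.svm import SVR

stack = StackingRegressor(
    estimators=[
        ("bayesian_ridge", BayesianRidge()),
        ("svr", SVR()),
        ("random_forest", RandomForestRegressor(random_state=42)),
    ],
    final_estimator=Ridge(),
    cv=5,  # k=5 folds generate the out-of-fold predictions fed to the meta learner
)

stack.fit(X_dense, y)                       # placeholder training data
predictions = stack.predict(X_test_dense)   # placeholder test data
```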
With 5-fold cross-validation, the average scores across the folds were a root mean squared error of 0.704 and a validation R² of 0.541, a slightly better result than our baseline Ridge Regression model. Given that stacking ensembles add model complexity, this stacking solution may not be worth pursuing for such a slight improvement in scoring.
To further improve the model, there are plenty of ideas I could try. The following are three that I could pursue without procuring more data. Fun fact: I tried scaling and normalizing the data before model training to see if it affected the result. It did not :(
Instead of processing the data into tokenized text with no stopwords or punctuation, I could redo the process but keep the punctuation, since punctuation marks add context to a piece of text and thus affect its readability. I could also see how leaving the stopwords in affects the end result, but I doubt they would add much context.
Forward Selection, Backward Elimination, and Stepwise Selection are all feature selection methods that would help select variables that are not only significant predictors of the dependent variable but also not highly correlated with one another, i.e., with less multicollinearity.
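For example, sklearn's SequentialFeatureSelector covers the forward and backward variants (stepwise selection would need a custom loop); candidate_features is a hypothetical list of engineered feature names:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge

# Forward selection: greedily add the feature that most improves cross-validated RMSE
selector = SequentialFeatureSelector(
    Ridge(),
    n_features_to_select=10,                # illustrative target size
    direction="forward",                    # "backward" gives backward elimination
    scoring="neg_root_mean_squared_error",
    cv=5,
)
selector.fit(train[candidate_features], train["target"])
print(selector.get_support())               # boolean mask of the selected features
```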
Incorporating a deep learning model could potentially improve scores, since deep learning models apply nonlinear transformations to the data and can therefore build a richer representation of what determines a text's readability score than traditional machine learning models. Deep learning models are data hungry, but the idea may still be worth exploring.