Releases: Mascerade/supervised-product-matching
Cleaner Repository and a Package!
Release Notes
- With this release there were no changes to the model itself at all.
- All the changes that were made were to make it easier for people to actually use the model.
- There is now a new directory called `supervised_product_matching`, which is described in the README.
- This directory is a package that can be installed using the command provided.
- It gives people much easier access to the model architectures used for training as well as the preprocessing that was used before titles get sent to the model.
- The repository also now makes use of my CharacterBERT repository, which essentially just updates the code of the original repository to work with the latest version of HuggingFace Transformers and exposes the architecture as a package for better portability.
- There are now command-line arguments for `torch_train_model.py`
- You can find the NLP Dashboard repository here and the NLP Dashboard Server here if you want to make use of them for training.
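The package exposes the preprocessing applied to titles before they are sent to the model. As a rough illustration of what that kind of normalization step can look like (a hypothetical sketch, not the package's actual implementation):

```python
import re

def preprocess_title(title: str) -> str:
    """Hypothetical title normalization, illustrating the kind of
    preprocessing a product title might go through before reaching
    the model (NOT the package's actual code)."""
    title = title.lower()
    # separate numbers from units so "256gb" tokenizes as "256 gb"
    title = re.sub(r"(\d+)([a-z]+)", r"\1 \2", title)
    # drop punctuation that carries no matching signal
    title = re.sub(r"[^\w\s.]", " ", title)
    # collapse repeated whitespace
    return " ".join(title.split())

print(preprocess_title('Apple MacBook Pro 13" (256GB SSD, 8GB RAM)'))
# → apple macbook pro 13 256 gb ssd 8 gb ram
```

The package's real preprocessing lives in the `supervised_product_matching` directory; check the README for the exact functions it exports.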
CharacterBERT and A LOT of Change!
Release Notes
- Completely revamped the data. The architecture of the project can be found in the README
- The gist of it is that we now have more realistic laptop data and we use the WDC Product Corpus's electronics data.
- Using CharacterBERT as opposed to regular BERT
- CharacterBERT is much more robust to numeric data, which helps with discerning between the numerical attributes of products.
- ScaleTransformerEncoder can be added on top of CharacterBERT (check the README for more info)
- New method of batching the training data (to consume less memory).
- There is a test script to validate/manually test data.
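One common way to batch training data so it consumes less memory is to yield batches lazily from a generator instead of materializing the whole set at once. A minimal sketch of that idea (illustrative only, not the repository's exact implementation):

```python
def batch_pairs(pairs, batch_size):
    """Yield fixed-size batches lazily so the full training set never
    has to sit in memory at once (a sketch of the batching idea, not
    the repository's actual code)."""
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

batches = list(batch_pairs(range(10), 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```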
Implementation Notes
- Need to download pre-trained CharacterBERT and BERT models
- Instructions in README
- Extract `train.zip` into `data/train`
- Extract `test.zip` into `data/test`
- Extract `CharacterBERT-Models.zip` into `models`
Results
- The trained models perform much better in general, and especially on laptop data.
- The models are also better regularized, so they don't overfit the data.
- The models, in my evaluation, should be good enough to be used in production.
Using BERT!
Release Notes
- This model was a complete revamp of previous models
- We now use a pre-trained BERT model with an added classification head and fine-tune it
- The laptop data used now is much simpler (doesn't have the added fluff-words to replicate actual title data)
- The idea behind this choice is that the model should first learn how to properly recognize the different attributes of a title without having to worry about these added tokens
- The rest of the data remains the same
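For context on the revamp above: a fine-tuned BERT cross-encoder sees both titles packed into a single sequence, which the classification head then scores. A schematic of the input packing (purely illustrative; in practice the HuggingFace tokenizer builds this for you):

```python
def make_bert_input(title_a: str, title_b: str) -> str:
    """Schematic of how a title pair is packed into one sequence for a
    BERT-style cross-encoder: both titles in a single input, separated
    by [SEP], so the classification head scores the pair jointly.
    (An illustration, not the repository's tokenizer code.)"""
    return f"[CLS] {title_a} [SEP] {title_b} [SEP]"

print(make_bert_input("lenovo thinkpad t480 i5 8gb",
                      "thinkpad t480 intel i5 8 gb ram"))
# → [CLS] lenovo thinkpad t480 i5 8gb [SEP] thinkpad t480 intel i5 8 gb ram [SEP]
```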
Results
- This model's results are better both on paper and in manual testing
- It is able to better understand laptop data's attributes and when tested on real data that we procured, it did quite well (about 70% accuracy)
- The problem with BERT, though, is that it overfits very easily
- At just 4 epochs, the model had overfitted on certain data
- BERT is promising, though, because it is more flexible with the structure of data and the semantic meanings
Issues/Future Improvements
- There are major problems with the laptop data being used
- Specifically, as noted in one of the commits, the model learns very easily the frequency at which some of the product names appear in positive and negative pairs
- For example "apple macbook", when in both titles, is always negative
- The model discovered this pattern and now anytime there is a laptop with "apple macbook" in both titles, it is always a negative pairing no matter how similar the titles actually are
- If this issue can be solved, we believe it will open up the model to much more learning because it will not be able to take these shortcuts
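The shortcut described above can be measured directly: for each token that appears in both titles of a pair, tally how its pairs split between positive and negative labels. Tokens whose counts land entirely on one side (like "macbook" here) are exploitable shortcuts. A small diagnostic sketch (hypothetical helper, not code from the repository):

```python
from collections import Counter, defaultdict

def label_skew(pairs):
    """For each token shared by BOTH titles of a pair, count how often
    the pair is labeled positive (1) vs negative (0). One-sided counts
    reveal shortcuts the model can exploit.
    (A diagnostic sketch, not code from the repository.)"""
    counts = defaultdict(Counter)
    for title_a, title_b, label in pairs:
        shared = set(title_a.split()) & set(title_b.split())
        for token in shared:
            counts[token][label] += 1
    return counts

# toy pairs mimicking the problem: "macbook" in both titles is always negative
pairs = [
    ("apple macbook pro 13", "apple macbook air 11", 0),
    ("apple macbook pro 15", "apple macbook pro 15 i7", 0),
    ("dell xps 13 i5", "dell xps 13 i5 8gb", 1),
]
skew = label_skew(pairs)
print(skew["macbook"])  # all of its mass is on the negative label
```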
Notes
- The `csv` files go into `data/train` and `0.2.0_BERT_epoch_3` goes into `models`
Expanded Amount of Models
Release Notes
We have now created more model architectures in order to explore different approaches to this problem. We have:
- The distance between the final layers in the siamese network sent to a sigmoid classifier (DistanceSigmoid)
- The exponential difference (i.e. e^(-|difference|)) between the final layers in the siamese network fed to a sigmoid classifier (ExpDistanceSigmoid)
- The exponential difference between the final layers in the siamese network fed to a softmax classifier (ExpDistanceSoftmax)
- The Manhattan distance between the final layers in the siamese network
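The distance heads above are all simple functions of the two siamese encodings; the underlying math can be sketched in plain NumPy (this is just the arithmetic, not the repository's actual model layers):

```python
import numpy as np

def exp_distance(u, v):
    """Element-wise e^(-|u - v|): close to 1 where the two encodings
    agree, decaying toward 0 where they differ (the ExpDistance idea,
    whose output a sigmoid/softmax classifier would consume)."""
    return np.exp(-np.abs(u - v))

def manhattan_similarity(u, v):
    """Scalar e^(-||u - v||_1), the classic Manhattan siamese score."""
    return float(np.exp(-np.abs(u - v).sum()))

u = np.array([0.5, 1.0, -0.2])
v = np.array([0.5, 0.0, -0.2])
print(exp_distance(u, v))          # [1, e^-1, 1]
print(manhattan_similarity(u, v))  # ≈ 0.3679
```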
Results
The results are not very good across the board, but what gives hope is the fact that numbers do not seem to work well with the fastText embeddings I am using. Most numbers are treated as the same, so tokens like `128` and `256` are largely viewed as identical. This is most likely because both numbers are found in almost exactly the same contexts. The same goes for tokens like `SSD` and `HDD`. Because of this, I would like to explore different ways of creating embeddings.
Future Improvements/Research
There are many research papers about NLP to explore. I need to research different embeddings (like ConceptNet) and other types of layers, perhaps Transformers. I would also like to look into different regularization methods in order to get better validation and test results. In addition, it would help to analyze our data further to understand why adjectives play such a heavy role in how the models perform.
How to Use
Trying to Improve Accuracy on Laptops
Release Notes
We now generate laptop data by keeping a spec list of laptop parts and combining them into a single laptop. We shuffle all the tokens in the laptop data so that the LSTM network does not overfit to the positions of certain tokens (like the CPU always being first, then the RAM, then the storage, etc.). In addition, there is added data for hard drives, CPUs, and RAM. The network also now uses a dropout chance of 0.6 in the last two layers to help with overfitting.
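The token-shuffling idea above can be sketched like this (an illustration of the generation scheme, not the repository's actual generator):

```python
import random

def make_laptop_title(specs, rng):
    """Assemble a synthetic laptop title from a spec list, then shuffle
    the tokens so the network cannot key on attribute position, e.g.
    the CPU always coming first (an illustration of the idea, not the
    repository's generator)."""
    tokens = " ".join(specs).split()
    rng.shuffle(tokens)
    return " ".join(tokens)

specs = ["intel i7-8550u", "16gb ram", "512gb ssd", "dell xps 13"]
print(make_laptop_title(specs, random.Random(0)))
```

Every generated title contains the same tokens, just in a random order, so two titles built from the same spec list still form a positive pair.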
Results
- Achieved 87% accuracy on the test set with 128 batch size and 80 epochs
- In manual testing, the model does not do well on laptops at all. The only things that govern whether it judges two laptops the same are the CPU and the brand
Future Improvements
- Need more laptop data
- We cannot overfit to brand name and only one or two specs
Topics to Test/Explore
- Explore different distance layers
- Explore pre-trained LSTMs
- Perhaps train the fastText embedding on our data
- Maybe a separate model for laptops would be better
How To Use
- Unzip `train.zip` into the `train` folder
- If you want to use the model itself, put it into the `model` directory
Proof of Concept
Release Notes
This is the first release of this algorithm. I wanted to just see if it was possible to train a model that can classify two titles as the same or not. The LSTM network was trained for 50 epochs with a batch size of 64 using the included training data. I attached the model itself as well, which achieved 87% accuracy on the test set and 91% accuracy on the training set. To train a model yourself, all you have to do is:
- Read the readme and download the fastText embedding model
- Put the `computers_train_bal_shuffle.csv` and `computers_train_xlarge_norm_simple.csv` into the `computers_train` folder inside the `data` folder
- Go into the `train_model.py` code and change the output model name to what you want it to be
- Run `train_model.py`
- If you just want to test the model, put the `.h5` file into the `models` folder and run `test_model.py`
  - Change the titles in the code if you want
Future Improvements
- Use the cameras dataset from WDC Product Data Corpus
- Make the model differentiate between titles that have different attributes, like a laptop with 500gb of HDD vs a laptop with 750gb
- This includes manually getting data for this and switching out attributes
- Test with the contrastive loss function
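The attribute-switching idea in the improvements above could be implemented as a simple augmentation: swap one attribute value (such as the storage size) in an otherwise identical title to create a hard negative, forcing the model to attend to that attribute. A hypothetical sketch:

```python
def attribute_swap_negative(title, old, new):
    """Create a hard negative pair by swapping one attribute value in
    an otherwise identical title, so the model must compare attributes
    rather than surface similarity.
    (A sketch of the proposed augmentation, not existing repo code.)"""
    assert old in title, f"attribute {old!r} not found in title"
    return title.replace(old, new, 1)

anchor = "hp pavilion 15 i5 8gb ram 500gb hdd"
negative = attribute_swap_negative(anchor, "500gb", "750gb")
print(negative)  # → hp pavilion 15 i5 8gb ram 750gb hdd
```

The (anchor, negative) pair would then be labeled as not matching, even though the two titles differ by a single token.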