Releases: Mascerade/supervised-product-matching
Cleaner Repository and a Package!
Release Notes
- With this release there were no changes to the model itself at all.
- All the changes that were made were to make it easier for people to actually use the model.
- There is now a new directory called `supervised_product_matching`, which is described in the README.
- This directory is a package that can be installed using the command provided.
- It gives people much easier access to the model architectures used for training as well as the preprocessing that was used before titles get sent to the model.
- The repository also now makes use of my CharacterBERT repository, which essentially just updates the code of the original repository to work with the latest version of HuggingFace Transformers and exposes the architecture as a package for better portability.
- There are now command-line arguments for `torch_train_model.py`
- You can find the NLP Dashboard repository here and the NLP Dashboard Server here if you want to make use of them for training.
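The package exposes the preprocessing applied to titles before they are sent to the model. As a rough illustration of what that kind of normalization step can look like (a hypothetical sketch, not the package's actual implementation):

```python
import re

def preprocess_title(title: str) -> str:
    """Hypothetical title normalization, illustrating the kind of
    preprocessing a product title might go through before reaching
    the model (NOT the package's actual code)."""
    title = title.lower()
    # separate numbers from units so "256gb" tokenizes as "256 gb"
    title = re.sub(r"(\d+)([a-z]+)", r"\1 \2", title)
    # drop punctuation that carries no matching signal
    title = re.sub(r"[^\w\s.]", " ", title)
    # collapse repeated whitespace
    return " ".join(title.split())

print(preprocess_title('Apple MacBook Pro 13" (256GB SSD, 8GB RAM)'))
# → apple macbook pro 13 256 gb ssd 8 gb ram
```

The package's real preprocessing lives in the `supervised_product_matching` directory; check the README for the exact functions it exports.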
CharacterBERT and A LOT of Change!
Release Notes
- Completely revamped the data. The architecture of the project can be found in the README
- The gist of it is that we now have more realistic laptop data and we use the WDC Product Corpus's electronics data.
- Using CharacterBERT as opposed to regular BERT
- CharacterBERT is much more robust to numeric data, which helps with discerning between the numerical attributes of products.
- ScaleTransformerEncoder can be added on top of CharacterBERT (check the README for more info)
- New method of batching the training data (to consume less memory).
- There is a test script to validate/manually test data.
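One common way to batch training data so it consumes less memory is to yield batches lazily from a generator instead of materializing the whole set at once. A minimal sketch of that idea (illustrative only, not the repository's exact implementation):

```python
def batch_pairs(pairs, batch_size):
    """Yield fixed-size batches lazily so the full training set never
    has to sit in memory at once (a sketch of the batching idea, not
    the repository's actual code)."""
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

batches = list(batch_pairs(range(10), 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```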
Implementation Notes
- Need to download pre-trained CharacterBERT and BERT models
- Instructions in README
- Extract `train.zip` into `data/train`
- Extract `test.zip` into `data/test`
- Extract `CharacterBERT-Models.zip` into `models`
Results
- The trained models perform much better in general, and especially on laptop data.
- The models are also better regularized, so they don't overfit the data.
- The models, in my evaluation, should be good enough to be used in production.
Using BERT!
Release Notes
- This model was a complete revamp of previous models
- We now use a pre-trained BERT model with an added classification head and fine-tune it
- The laptop data used now is much simpler (doesn't have the added fluff-words to replicate actual title data)
- The idea behind this choice is that the model should first learn how to properly recognize the different attributes of a title without having to worry about these added tokens
- The rest of the data remains the same
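For context on the revamp above: a fine-tuned BERT cross-encoder sees both titles packed into a single sequence, which the classification head then scores. A schematic of the input packing (purely illustrative; in practice the HuggingFace tokenizer builds this for you):

```python
def make_bert_input(title_a: str, title_b: str) -> str:
    """Schematic of how a title pair is packed into one sequence for a
    BERT-style cross-encoder: both titles in a single input, separated
    by [SEP], so the classification head scores the pair jointly.
    (An illustration, not the repository's tokenizer code.)"""
    return f"[CLS] {title_a} [SEP] {title_b} [SEP]"

print(make_bert_input("lenovo thinkpad t480 i5 8gb",
                      "thinkpad t480 intel i5 8 gb ram"))
# → [CLS] lenovo thinkpad t480 i5 8gb [SEP] thinkpad t480 intel i5 8 gb ram [SEP]
```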
Results
- This model's results are better both on paper and in manual testing
- It is able to better understand laptop data's attributes and when tested on real data that we procured, it did quite well (about 70% accuracy)
- The problem with BERT, though, is that it overfits very easily
- At just 4 epochs, the model had overfitted on certain data
- BERT is promising, though, because it is more flexible with the structure of data and the semantic meanings
Issues/Future Improvements
- There are major problems with the laptop data being used
- Specifically, as noted in one of the commits, the model learns very easily the frequency at which some of the product names appear in positive and negative pairs
- For example "apple macbook", when in both titles, is always negative
- The model discovered this pattern and now anytime there is a laptop with "apple macbook" in both titles, it is always a negative pairing no matter how similar the titles actually are
- If this issue can be solved, we believe it will open up the model to much more learning because it will not be able to take these shortcuts
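The shortcut described above can be measured directly: for each token that appears in both titles of a pair, tally how its pairs split between positive and negative labels. Tokens whose counts land entirely on one side (like "macbook" here) are exploitable shortcuts. A small diagnostic sketch (hypothetical helper, not code from the repository):

```python
from collections import Counter, defaultdict

def label_skew(pairs):
    """For each token shared by BOTH titles of a pair, count how often
    the pair is labeled positive (1) vs negative (0). One-sided counts
    reveal shortcuts the model can exploit.
    (A diagnostic sketch, not code from the repository.)"""
    counts = defaultdict(Counter)
    for title_a, title_b, label in pairs:
        shared = set(title_a.split()) & set(title_b.split())
        for token in shared:
            counts[token][label] += 1
    return counts

# toy pairs mimicking the problem: "macbook" in both titles is always negative
pairs = [
    ("apple macbook pro 13", "apple macbook air 11", 0),
    ("apple macbook pro 15", "apple macbook pro 15 i7", 0),
    ("dell xps 13 i5", "dell xps 13 i5 8gb", 1),
]
skew = label_skew(pairs)
print(skew["macbook"])  # all of its mass is on the negative label
```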
Notes
- The `csv` files go into `data/train` and `0.2.0_BERT_epoch_3` goes into `models`
Expanded Amount of Models
Release Notes
We have now created more model architectures in order to explore different approaches to this problem. We have:
- The distance between the final layers in the siamese network sent to a sigmoid classifier (DistanceSigmoid)
- The exponential difference (i.e. e^(-|difference|)) between the final layers in the siamese network fed to a sigmoid classifier (ExpDistanceSigmoid)
- The exponential difference between the final layers in the siamese network fed to a softmax classifier (ExpDistanceSoftmax)
- The Manhattan distance between the final layers in the siamese network
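The distance heads above are all simple functions of the two siamese encodings; the underlying math can be sketched in plain NumPy (this is just the arithmetic, not the repository's actual model layers):

```python
import numpy as np

def exp_distance(u, v):
    """Element-wise e^(-|u - v|): close to 1 where the two encodings
    agree, decaying toward 0 where they differ (the ExpDistance idea,
    whose output a sigmoid/softmax classifier would consume)."""
    return np.exp(-np.abs(u - v))

def manhattan_similarity(u, v):
    """Scalar e^(-||u - v||_1), the classic Manhattan siamese score."""
    return float(np.exp(-np.abs(u - v).sum()))

u = np.array([0.5, 1.0, -0.2])
v = np.array([0.5, 0.0, -0.2])
print(exp_distance(u, v))          # [1, e^-1, 1]
print(manhattan_similarity(u, v))  # ≈ 0.3679
```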
Results
The results are not very good across the board, but what gives hope is the fact that numbers do not seem to work well with the fastText embeddings I am using. Most numbers are treated as the same, so tokens like `128` and `256` are largely viewed as identical. This is most likely because both numbers are found in almost exactly the same contexts. The same goes for tokens like `SSD` and `HDD`. Because of this, I would like to explore different ways of creating embeddings.
Future Improvements/Research
There are many research papers about NLP to explore. I need to research different embeddings (like ConceptNet) and other types of layers, perhaps Transformers. I would also like to look into different regularization methods in order to get better validation and test results. In addition, it would help to analyze our data further to understand why adjectives play such a heavy role in how the models perform.
How to Use
Trying to Improve Accuracy on Laptops
Release Notes
We now generate laptop data by keeping a spec list of laptop parts and combining them into a single laptop. We shuffle all the tokens in the laptop data so that the LSTM network does not overfit to the positions of certain tokens (like the CPU always being first, then the RAM, then the storage, etc.). In addition, there is added data for hard drives, CPUs, and RAM. The network also now uses a dropout chance of 0.6 in the last two layers to help with overfitting.
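The token-shuffling idea above can be sketched like this (an illustration of the generation scheme, not the repository's actual generator):

```python
import random

def make_laptop_title(specs, rng):
    """Assemble a synthetic laptop title from a spec list, then shuffle
    the tokens so the network cannot key on attribute position, e.g.
    the CPU always coming first (an illustration of the idea, not the
    repository's generator)."""
    tokens = " ".join(specs).split()
    rng.shuffle(tokens)
    return " ".join(tokens)

specs = ["intel i7-8550u", "16gb ram", "512gb ssd", "dell xps 13"]
print(make_laptop_title(specs, random.Random(0)))
```

Every generated title contains the same tokens, just in a random order, so two titles built from the same spec list still form a positive pair.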
Results
- Achieved 87% accuracy on the test set with 128 batch size and 80 epochs
- In manual testing, the model does not do well on laptops at all. The only things that govern whether it judges two laptops the same are the CPU and the brand
Future Improvements
- Need more laptop data
- We cannot overfit to brand name and only one or two specs
Topics to Test/Explore
- Explore different distance layers
- Explore pre-trained LSTMs
- Perhaps train the fastText embedding on our data
- Maybe a separate model for laptops would be better
How To Use
- Unzip `train.zip` into the `train` folder
- If you want to use the model itself, put it into the `model` directory
Proof of Concept
Release Notes
This is the first release of this algorithm. I wanted to just see if it was possible to train a model that can classify two titles as the same or not. The LSTM network was trained for 50 epochs with a batch size of 64 using the included training data. I attached the model itself as well, which achieved 87% accuracy on the test set and 91% accuracy on the training set. To train a model yourself, all you have to do is:
- Read the readme and download the fastText embedding model
- Put the `computers_train_bal_shuffle.csv` and `computers_train_xlarge_norm_simple.csv` into the `computers_train` folder inside the `data` folder
- Go into the `train_model.py` code and change the output model name to what you want it to be
- Run `train_model.py`
- If you just want to test the model, put the `.h5` file into the `models` folder and run `test_model.py`
  - Change the titles in the code if you want
Future Improvements
- Use the cameras dataset from WDC Product Data Corpus
- Make the model differentiate between titles that have different attributes, like a laptop with 500gb of HDD vs a laptop with 750gb
- This includes manually getting data for this and switching out attributes
- Test with the contrastive loss function
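The attribute-switching idea in the improvements above could be implemented as a simple augmentation: swap one attribute value (such as the storage size) in an otherwise identical title to create a hard negative, forcing the model to attend to that attribute. A hypothetical sketch:

```python
def attribute_swap_negative(title, old, new):
    """Create a hard negative pair by swapping one attribute value in
    an otherwise identical title, so the model must compare attributes
    rather than surface similarity.
    (A sketch of the proposed augmentation, not existing repo code.)"""
    assert old in title, f"attribute {old!r} not found in title"
    return title.replace(old, new, 1)

anchor = "hp pavilion 15 i5 8gb ram 500gb hdd"
negative = attribute_swap_negative(anchor, "500gb", "750gb")
print(negative)  # → hp pavilion 15 i5 8gb ram 750gb hdd
```

The (anchor, negative) pair would then be labeled as not matching, even though the two titles differ by a single token.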