
Add wikilinks NN method for generating embeddings #47

Merged: 6 commits into andrewtavis:main on May 18, 2021

Conversation

@victle (Contributor) commented Apr 29, 2021

I'm a bit concerned this might be adding too much at once (and there will be stylistic/usability changes here and there), but I think most of the framework is here. I fully anticipate that changes will need to be made to clean things up. For example, the method only works with the books topic, and cleans data based on that assumption. Below is a list of what I tried to add:

  • New method argument under model.gen_embeddings(), called wikilinks
  • If wikilinks it will try to load a pre-trained model (I uploaded this under /examples/wikilinks_embedding_model.h5). Ideally, this would be part of the repo, and would be sufficient for playing around with the method (although it's on the larger side, > 50 MB).
  • If there is no pre-trained model available, it will call _wikilinks_nn(...), which loads the enwiki_books.ndjson and builds/trains a neural network model on the wikilinks (this was adapted from the blogpost).
  • Full disclosure: I haven't tested the method without already having /examples/wikilinks_embedding_model.h5, so I would want to run _wikilinks_nn(...) all the way through. But the code was taken from my working Google Colab, so I assumed it'd be ok. Let me know if I should run this all the way through before we can merge the PR.
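
The load-or-train flow described in the bullets above could be sketched generically as follows. This is a minimal sketch: `load_or_train`, `load_fn`, and `train_fn` are hypothetical names for illustration, not wikirec's actual API.

```python
from pathlib import Path


def load_or_train(model_path, load_fn, train_fn):
    """Return a model loaded from disk if one exists; otherwise train a new one.

    load_fn would be something like keras.models.load_model, and train_fn a
    wrapper around the network-building/training step (e.g. _wikilinks_nn).
    Both are placeholders here.
    """
    model_path = Path(model_path)
    if model_path.exists():
        return load_fn(model_path)
    return train_fn()
```

In `gen_embeddings` this kind of helper would be pointed at `/examples/wikilinks_embedding_model.h5`, falling back to training on `enwiki_books.ndjson` when the file is absent.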

For verification, I did a lot of the testing in a personal Google Colab notebook; here were the recommendations for War and Peace!
[image: recommendation results for War and Peace]

@codecov bot commented Apr 29, 2021

Codecov Report

Merging #47 (21e914a) into main (dc8bea8) will increase coverage by 1.06%.
The diff coverage is 91.39%.


@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   81.04%   82.11%   +1.06%     
==========================================
  Files           4        4              
  Lines         575      654      +79     
==========================================
+ Hits          466      537      +71     
- Misses        109      117       +8     
Impacted Files Coverage Δ
src/wikirec/data_utils.py 76.06% <ø> (ø)
src/wikirec/languages.py 100.00% <ø> (ø)
src/wikirec/model.py 87.50% <91.20%> (+1.78%) ⬆️
src/wikirec/utils.py 95.45% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46862c5...21e914a.

@andrewtavis andrewtavis self-requested a review April 29, 2021 19:24
@andrewtavis (Owner) commented:

Results look great, and I'm really happy we have so much progress so quickly! Thanks a lot for all of this :) I'll give this all a read-through in the next few days. I might change the method name to wikilinks_nn so that it's a bit more explicit about what's actually going on, but let me know if you'd prefer to keep it as is :)

Great idea with loading the pre-trained model! That's perfect 😊 I'll try to figure out a way to make it work with non-book inputs, with us potentially just needing to add an optional argument for when this method is run.

I'll likely just work on this branch for a bit and also add in the needed readme changes and documentation before it's merged. That and potentially the tests, although I'll need to look first to see how involved they'll be.

Again though, nice results! Honestly stoked for this :)

@victle (Contributor, Author) commented Apr 29, 2021

Sounds good! I know it's a bit messy as it is right now, so I appreciate you looking it over and making any necessary changes. And feel free to change the method's name to wikilinks_nn as well!

My one thought about working with non-book inputs: I guess you'd have to call methods from the other utility functions that download the wiki dump and parse a new JSON. It'd be quite a huge function call!

Let me know if you'd like thoughts on the tests or something!

@andrewtavis andrewtavis self-assigned this May 1, 2021
@andrewtavis (Owner) left a comment:

@victle All looks fine to me :) I'll work on adding to this more tomorrow, and will send along a commit that hopefully will get it working for articles outside of books. I'll then test it and run it through examples/rec_books, and with that we can add results and put it to bed :)

If you have an idea of how best to test this, then let me know. I honestly have some pretty baseline tests for this package that are just making sure that returns are the appropriate type and dimensions and that the files can be downloaded.
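
Baseline tests of the kind described might look like the following. This is only a sketch: the helper name is hypothetical, and a placeholder array stands in for what `model.gen_embeddings(...)` would actually return.

```python
import numpy as np


def check_embeddings(embeddings, n_articles, embedding_size):
    # In a real test, `embeddings` would come from model.gen_embeddings(...);
    # here we only illustrate the type/dimension assertions described above.
    assert isinstance(embeddings, np.ndarray)
    assert embeddings.shape == (n_articles, embedding_size)
    assert np.isfinite(embeddings).all()
    return True
```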


n_positive = 1024
gen = _generate_batch(pairs, n_positive, negative_ratio = 2)
h = model.fit_generator(gen, epochs = 15, steps_per_epoch = len(pairs) // n_positive)
@andrewtavis (Owner) commented May 2, 2021

@victle Just checking here, h is not used as it's just a placeholder for the generator being fit? I'll add a # pylint: disable=unused-variable if that's the case :)

@victle (Contributor, Author) commented:

@andrewtavis Yep! h is just a placeholder.
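
For reference, the `_generate_batch` pattern quoted above (adapted from the blogpost) amounts to mixing true (book, link) index pairs with randomly sampled negatives. A minimal sketch, assuming index-based pairs, a −1 label for negatives, and a pair set that is sparse relative to all possible combinations; argument defaults here are illustrative:

```python
import random

import numpy as np


def generate_batch(pairs, n_positive=512, negative_ratio=1, n_books=1000, n_links=1000):
    """Yield endless batches of positive (book, link) pairs plus random negatives.

    Labels are 1 for observed pairs and -1 for sampled negatives, matching
    the blogpost's non-classification setup.
    """
    pairs_set = set(pairs)
    batch_size = n_positive * (1 + negative_ratio)
    batch = np.zeros((batch_size, 3))
    while True:
        # Fill the first rows with randomly chosen true pairs.
        for i, (book, link) in enumerate(random.sample(pairs, n_positive)):
            batch[i] = (book, link, 1)
        # Fill the rest with random (book, link) pairs not observed together.
        i = n_positive
        while i < batch_size:
            book = random.randrange(n_books)
            link = random.randrange(n_links)
            if (book, link) not in pairs_set:
                batch[i] = (book, link, -1)
                i += 1
        np.random.shuffle(batch)
        yield {"book": batch[:, 0], "link": batch[:, 1]}, batch[:, 2]
```

A generator like this can then be handed to the model's fit call with `steps_per_epoch = len(pairs) // n_positive`, as in the snippet above.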

@andrewtavis (Owner) commented May 16, 2021

@victle, first and foremost, my apologies that this has taken so long. The last few weeks can adequately be described as shit 😄 I've been working on getting this all set up for running on any kind of input including books. With that being said, I'm currently getting the following when running gen_embeddings:

The list of inputs passed to the model is redundant. All inputs should only appear once. Found: ['Darkness and the Light', .....

Darkness and the Light is the last entry in my books json. Have you seen this before, and if so, do you have an idea of how to fix it? I might have swapped a variable name somewhere as I was switching from "books" to "articles", but I was mostly doing search-and-replace on it.

Just wanted to check on this for now. I'll take another look at it if you're not sure. The basics of the tests are written (I'll just do a simple one to check output lengths, as with the others, so the process is run through). Hopefully we can get #36 done in the next few days :)

Again, my apologies for not getting to this, and further for not communicating that it'd be a bit.

Hope you're well!

@andrewtavis (Owner) commented May 16, 2021

The code that I'm running that's producing the above error is the following:

wikilink_embeddings = model.gen_embeddings(
    method="WikilinkNN",
    path_to_json="./enwiki_books.ndjson",
    path_to_embedding_model="books_embedding_model",
    embedding_size=50,
)

Basically I set it up to accept arguments that point to a given JSON and a desired embedding model. There are other changes to wikirec.model in here, but most are formatting changes that came from my VS Code autoformat-on-save trigger 😄

@victle (Contributor, Author) commented May 17, 2021

No worries on the time! I've honestly been pretty busy too and haven't been able to dedicate enough time 😢

Regarding that error, I don't think I've ever seen it before, nor do I know what might be causing it exactly.

# Reshape to be a single number (shape will be (None, 1))
merged = Reshape(target_shape = [1])(merged)

model = Model(inputs = [book, link], outputs = merged)
@andrewtavis (Owner) left a comment:

@victle, figured out that these are supposed to be book_input and link_input, based on comparing it all to the original blogpost and your changes :)

Am making good progress on this and will try to have it done by tonight/early tomorrow. Thanks again for all this!

@andrewtavis (Owner) commented:

@victle, fully up and running 😊 I added a dot product for WikilinkNN into gen_sim_matrix so that the process will be the same as the others.

I'm gonna check out combining WikilinkNN with the other methods now (getting rid of the sim_matrix rows via selected_idxs that come from the cleaning process), and then should be able to finalize all this :)
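
The dot product mentioned here amounts to cosine-style similarity over the learned embeddings. A minimal sketch of that idea, not wikirec's actual `gen_sim_matrix` implementation:

```python
import numpy as np


def embeddings_to_sim_matrix(embeddings):
    """Cosine-style similarity: L2-normalize each row, then take the dot product.

    embeddings is an (n_titles, embedding_size) array; the result is an
    (n_titles, n_titles) similarity matrix.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T
```

Rows for titles removed during cleaning could then be dropped from this matrix (e.g. via the selected_idxs mentioned above) before ranking recommendations.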

@andrewtavis andrewtavis linked an issue May 17, 2021 that may be closed by this pull request
@andrewtavis andrewtavis merged commit e630724 into andrewtavis:main May 18, 2021
@andrewtavis andrewtavis linked an issue May 18, 2021 that may be closed by this pull request
Successfully merging this pull request may close these issues: New recommendation models; Add neural network model.