
Add wikilinks NN method for generating embeddings #47

Merged: 6 commits into andrewtavis:main on May 18, 2021

Conversation

@victle (Contributor) commented Apr 29, 2021

I'm a bit concerned this might be adding too much at once (and there will be stylistic/usability changes here and there), but I think most of the framework is here. I fully anticipate that changes will need to be made to clean things up. For example, the method only works with the books topic, and cleans data based on that assumption. Below is a list of what I tried to add:

  • New method argument under model.gen_embeddings(), called wikilinks
  • If wikilinks it will try to load a pre-trained model (I uploaded this under /examples/wikilinks_embedding_model.h5). Ideally, this would be part of the repo, and would be sufficient for playing around with the method (although it's on the larger side, > 50 MB).
  • If there is no pre-trained model available, it will call _wikilinks_nn(...), which loads the enwiki_books.ndjson and builds/trains a neural network model on the wikilinks (this was adapted from the blogpost).
  • Full disclosure: I haven't tested the method without already having /examples/wikilinks_embedding_model.h5, so I would want to run _wikilinks_nn(...) all the way through. But the code was taken from my working Google Colab, so I assumed it'd be ok. Let me know if I should run this all the way through before we can merge the PR.
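
The load-or-train flow described in the bullets above could be sketched generically as follows. This is a minimal sketch: `load_or_train`, `load_fn`, and `train_fn` are hypothetical names for illustration, not wikirec's actual API.

```python
from pathlib import Path


def load_or_train(model_path, load_fn, train_fn):
    """Return a model loaded from disk if one exists; otherwise train a new one.

    load_fn would be something like keras.models.load_model, and train_fn a
    wrapper around the network-building/training step (e.g. _wikilinks_nn).
    Both are placeholders here.
    """
    model_path = Path(model_path)
    if model_path.exists():
        return load_fn(model_path)
    return train_fn()
```

In `gen_embeddings` this kind of helper would be pointed at `/examples/wikilinks_embedding_model.h5`, falling back to training on `enwiki_books.ndjson` when the file is absent.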

For verification, I did a lot of the testing in a personal Google Colab notebook; here were the recommendations for War and Peace!
[image: recommendation results for War and Peace]

@codecov bot commented Apr 29, 2021

Codecov Report

Merging #47 (21e914a) into main (dc8bea8) will increase coverage by 1.06%.
The diff coverage is 91.39%.


@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   81.04%   82.11%   +1.06%     
==========================================
  Files           4        4              
  Lines         575      654      +79     
==========================================
+ Hits          466      537      +71     
- Misses        109      117       +8     
Impacted Files Coverage Δ
src/wikirec/data_utils.py 76.06% <ø> (ø)
src/wikirec/languages.py 100.00% <ø> (ø)
src/wikirec/model.py 87.50% <91.20%> (+1.78%) ⬆️
src/wikirec/utils.py 95.45% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46862c5...21e914a.

@andrewtavis andrewtavis self-requested a review April 29, 2021 19:24
@andrewtavis (Owner) commented:

Results look great, and I'm really happy we have so much progress so quickly! Thanks a lot for all of this :) I'll give this all a read-through in the next few days. I might change the method name to wikilinks_nn so that it's a bit more explicit about what's actually going on, but let me know if you'd prefer to keep it as is :)

Great idea with loading the pre-trained model! That's perfect 😊 I'll try to figure out a way to make it work with non-book inputs, with us potentially just needing to add an optional argument for when this method is run.

I'll likely just work on this branch for a bit and also add in the needed readme changes and documentation before it's merged. That and potentially the tests, although I'll need to look first to see how involved they'll be.

Again though, nice results! Honestly stoked for this :)

@victle (Contributor, Author) commented Apr 29, 2021

Sounds good! I know it's a bit messy as it is right now, so I appreciate you looking it over and making any necessary changes. And feel free to change the method's name to wikilinks_nn as well!

My one thought about working with non-book inputs: I guess you'd have to call methods from the other utility functions that download the wiki dump and parse a new JSON. It'd be quite a huge function call!

Let me know if you'd like thoughts on the tests or something!

@andrewtavis andrewtavis self-assigned this May 1, 2021
@andrewtavis (Owner) left a comment:

@victle All looks fine to me :) I'll work on adding to this more tomorrow, and will send along a commit that hopefully will get it working for articles outside of books. I'll then test it and run it through examples/rec_books, and with that we can add results and put it to bed :)

If you have an idea of how best to test this, then let me know. I honestly have some pretty baseline tests for this package that are just making sure that returns are the appropriate type and dimensions and that the files can be downloaded.
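
Baseline tests of the kind described might look like the following. This is only a sketch: the helper name is hypothetical, and a placeholder array stands in for what `model.gen_embeddings(...)` would actually return.

```python
import numpy as np


def check_embeddings(embeddings, n_articles, embedding_size):
    # In a real test, `embeddings` would come from model.gen_embeddings(...);
    # here we only illustrate the type/dimension assertions described above.
    assert isinstance(embeddings, np.ndarray)
    assert embeddings.shape == (n_articles, embedding_size)
    assert np.isfinite(embeddings).all()
    return True
```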


n_positive = 1024
gen = _generate_batch(pairs, n_positive, negative_ratio = 2)
h = model.fit_generator(gen, epochs = 15, steps_per_epoch = len(pairs) // n_positive)
@andrewtavis (Owner) commented May 2, 2021

@victle Just checking here, h is not used as it's just a placeholder for the generator being fit? I'll add a # pylint: disable=unused-variable if that's the case :)

@victle (Contributor, Author) commented:

@andrewtavis Yep! h is just a placeholder.
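
For reference, the `_generate_batch` pattern quoted above (adapted from the blogpost) amounts to mixing true (book, link) index pairs with randomly sampled negatives. A minimal sketch, assuming index-based pairs, a −1 label for negatives, and a pair set that is sparse relative to all possible combinations; argument defaults here are illustrative:

```python
import random

import numpy as np


def generate_batch(pairs, n_positive=512, negative_ratio=1, n_books=1000, n_links=1000):
    """Yield endless batches of positive (book, link) pairs plus random negatives.

    Labels are 1 for observed pairs and -1 for sampled negatives, matching
    the blogpost's non-classification setup.
    """
    pairs_set = set(pairs)
    batch_size = n_positive * (1 + negative_ratio)
    batch = np.zeros((batch_size, 3))
    while True:
        # Fill the first rows with randomly chosen true pairs.
        for i, (book, link) in enumerate(random.sample(pairs, n_positive)):
            batch[i] = (book, link, 1)
        # Fill the rest with random (book, link) pairs not observed together.
        i = n_positive
        while i < batch_size:
            book = random.randrange(n_books)
            link = random.randrange(n_links)
            if (book, link) not in pairs_set:
                batch[i] = (book, link, -1)
                i += 1
        np.random.shuffle(batch)
        yield {"book": batch[:, 0], "link": batch[:, 1]}, batch[:, 2]
```

A generator like this can then be handed to the model's fit call with `steps_per_epoch = len(pairs) // n_positive`, as in the snippet above.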

@andrewtavis (Owner) commented May 16, 2021

@victle, first and foremost, my apologies that this has taken so long. The last few weeks can adequately be described as shit 😄 I've been working on getting this all set up for running on any kind of input including books. With that being said, I'm currently getting the following when running gen_embeddings:

The list of inputs passed to the model is redundant. All inputs should only appear once. Found: ['Darkness and the Light', .....

Darkness and the Light is the last entry in my books json. Have you seen this before, and if so, do you have an idea of how to fix it? I might have swapped a variable name somewhere as I was switching from "books" to "articles", but I was mostly doing search-and-replace on it.

Just wanted to check on this for now. I'll take another look at it if you're not sure. The basics of the tests are written (I'll just do a simple one to check output lengths, as with the others, so the process is run through). Hopefully we can get #36 done in the next few days :)

Again, my apologies for not getting to this, and further for not communicating that it'd be a bit.

Hope you're well!

@andrewtavis (Owner) commented May 16, 2021

The code that I'm running that's producing the above error is the following:

wikilink_embeddings = model.gen_embeddings(
    method="WikilinkNN",
    path_to_json="./enwiki_books.ndjson",
    path_to_embedding_model="books_embedding_model",
    embedding_size=50,
)

Basically I set it up to accept arguments that point to a given JSON and a desired embedding model. There are other changes to wikirec.model in here, but most are formatting changes that came from my VS Code autoformat-on-save trigger 😄

@victle (Contributor, Author) commented May 17, 2021

No worries on the time! I've honestly been pretty busy too and haven't been able to dedicate enough time 😢

Regarding that error, I don't think I've ever seen it before, nor do I know what might be causing it exactly.

# Reshape to be a single number (shape will be (None, 1))
merged = Reshape(target_shape = [1])(merged)

model = Model(inputs = [book, link], outputs = merged)
@andrewtavis (Owner) left a comment:

@victle, figured out that these are supposed to be book_input and link_input, based on comparing it all to the original blogpost and your changes :)

Am making good progress on this and will try to have it done by tonight/early tomorrow. Thanks again for all this!

@andrewtavis (Owner) commented:

@victle, fully up and running 😊 I added a dot product for WikilinkNN into gen_sim_matrix so that the process will be the same as the others.

I'm gonna check out combining WikilinkNN with the other methods now (getting rid of the sim_matrix rows via selected_idxs that come from the cleaning process), and then should be able to finalize all this :)
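
The dot product mentioned here amounts to cosine-style similarity over the learned embeddings. A minimal sketch of that idea, not wikirec's actual `gen_sim_matrix` implementation:

```python
import numpy as np


def embeddings_to_sim_matrix(embeddings):
    """Cosine-style similarity: L2-normalize each row, then take the dot product.

    embeddings is an (n_titles, embedding_size) array; the result is an
    (n_titles, n_titles) similarity matrix.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T
```

Rows for titles removed during cleaning could then be dropped from this matrix (e.g. via the selected_idxs mentioned above) before ranking recommendations.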

@andrewtavis andrewtavis linked an issue May 17, 2021 that may be closed by this pull request
@andrewtavis andrewtavis merged commit e630724 into andrewtavis:main May 18, 2021
@andrewtavis andrewtavis linked an issue May 18, 2021 that may be closed by this pull request
Successfully merging this pull request may close these issues: New recommendation models; Add neural network model.