Add neural network model #36
Comments
I'm personally trying to learn more about neural networks, so I'd love to work on and contribute to this in pieces. I took a quick skim through the linked resources, and t-SNE is also used there to visualize the books. I know that you're looking to put together t-SNE for wikirec as well in #35, so that could definitely be something I work on in the future too. Let me know what you think is a good first step!
For the neural network model, the question would be whether we implement the method from the blogpost directly, where it looks into the links to other Wikipedia pages. That would then require the cleaning process to have an option to prepare the data in this way. Website URLs are being removed as of now, which maybe they shouldn't be? We could of course devise another method though :) We already have pretrained NNs covered with BERT, so a NN approach that tries to create embeddings from scratch might be wasted effort, as that's a lot of computing to try to beat something that's explicitly trained on Wikipedia in the first place. Another model that's popped up in the last few years is XLNet, which I guess would be the other natural implementation to look into. Let me know what you think on this :)
I think dealing with the links themselves might be a good first approach instead of training a whole new NN. I'll look into that blogpost and see if I can find how links are explicitly addressed, unless you had any thoughts on that? I'll have to look into XLNet as it does look interesting! Though I am definitely lacking in terms of raw computing power.
The original blogpost author finds the wikilinks in the data preparation steps, which are shown in this notebook. He's using [...]. Implementing this his way would honestly be a huge change to the way the cleaning works, so maybe the best way to go about this is to give an option in wikirec.data_utils.clean where the websites would not be removed (might be best anyway), and we could use string methods or regular expressions to extract the links from the texts themselves. For this it'd basically be finding instances of [...]. Once we have those, it's basically following the original blogpost :)
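To make the regex idea concrete, here's a minimal sketch, assuming the texts still contained raw wiki markup of the form `[[Target|display text]]` (as it turns out later in this thread, the parsed texts don't, so treat this purely as an illustration of the string-extraction idea, not wikirec code):

```python
import re

# If the texts still contained raw wiki markup such as
# "[[Aleksey Konstantinovich Tolstoy|Tolstoy]]", internal links could be
# pulled out with a regular expression like this.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]")

def extract_wikilinks(text):
    """Return the internal link targets found in a string of raw wikitext."""
    return [target.strip() for target in WIKILINK_RE.findall(text)]

sample = "is a historical novel by [[Aleksey Konstantinovich Tolstoy|A. K. Tolstoy]]"
print(extract_wikilinks(sample))  # ['Aleksey Konstantinovich Tolstoy']
```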
For XLNet it looks like we'd be able to use [...]. References for this are the XLNet documentation from [...].
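For reference, pulling document embeddings out of XLNet could look roughly like the following. This assumes the Hugging Face transformers package and mean-pooling of the last hidden states; it's a sketch of the general idea, not settled wikirec code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
model = AutoModel.from_pretrained("xlnet-base-cased")

def embed(texts):
    """Mean-pool the last hidden states to get one vector per text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # shape: (n_texts, hidden_size)

vectors = embed(["First book plot summary ...", "Second book plot summary ..."])
```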
@andrewtavis Hey! Just wanted to give an update: I've been pretty busy, but I'm still wanting to work on this issue. I wanted some clarification on what we can do to incorporate these different cleaning methods. As it is now, we're just grabbing the text of the article, which includes the text displayed for the internal wikilinks (I don't think they're getting cleaned out, are they?). How would grabbing the wikilinks themselves substantially improve the "performance" of the recommendation models, since we already have the text from the links' names as part of the inputs to the models?
@victle, hey :) No worries on a bit of silence :) Thing is that the URLs are being cleaned as of now. As seen at this line, the websites are being removed, but not the texts that they're the links for. I'm thinking now that this is a random step that actually doesn't need to be included in the cleaning process. We could simply remove it, and then you could extract the internal Wikipedia links from the parsed text corpuses.

Grabbing the links themselves would basically just make a new modeling approach. The assumption would shift from "I believe recommendations can be made based on which articles have similar texts" to "... which articles are linked to similar things." The second assumption is the one from the blog post, and he also got strong results, so we could implement that approach here as well :)

It kind of adds another layer of depth to a combined model as well. Right now we can combine BERT and TFIDF and get something that accounts for semantic similarity (BERT) and explicit word inclusions (TFIDF), both of which as of now are working well and even better when combined. This could give us a third strong-performing model that could add a degree of direct relatedness to other articles. To combine it with the others, the embeddings would need to be changed a bit though, as his approach embeds all target articles and all that they're linked to. Could be a situation where we could simply remove rows and columns of the embeddings matrix based on indexing though (a rough sketch of that indexing follows after this comment).

I checked our results against the ones he has in the blogpost. The direct way that this could help is books that are historical in context. So far we've been picking fantasy novels for examples, which ultimately seem to be performing well, as they would have unique words and places that lead to books by the same author. An example is his results for [...].
Our results for a combined TFIDF-BERT approach are: [...]

They're all classic Russian books, but his results are "better" in my opinion. We get similar results for [...]. Sorry for the wall of text 😱 Let me know what your thoughts are on the above!
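As a quick illustration of that indexing idea (variable names here are hypothetical, and this is a sketch rather than an implementation):

```python
import numpy as np

def restrict_to_books(entity_embeddings, entities, titles):
    """Keep only the embedding rows for the target books.

    `entity_embeddings` is assumed to have one row per entity (the books plus
    every page they link to, as in the blog post), `entities` the entity names
    in row order, and `titles` the book titles -- all hypothetical names.
    """
    book_idx = np.array([entities.index(t) for t in titles])
    return entity_embeddings[book_idx, :]

# Toy example: three entities, of which only two are target books.
embeddings = np.random.rand(3, 4)
entities = ["Anna Karenina", "Leo Tolstoy", "War and Peace"]
titles = ["Anna Karenina", "War and Peace"]
book_embeddings = restrict_to_books(embeddings, entities, titles)  # shape (2, 4)
```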
I want to try and summarize my understanding below, using the beginning text of Prince Serebrenni as an example 😄 Before cleaning the raw text, you'll get something like "Prince Serebrenni (Russian: Князь Серебряный) is a historical novel by [[https://en.wikipedia.org/wiki/Aleksey_Konstantinovich_Tolstoy]](Aleksey Konstantinovich Tolstoy)...". And through that line you referenced in [...].

So then, if we implement this third approach/model that looks at the internal links, we could combine it with TFIDF and BERT to make an overall stronger model (hopefully!). Although, I'm not knowledgeable enough to know how you would combine embeddings, so that might need more explanation 😅 Either way, it does sound interesting and challenging! I think I'll begin with removing that cleaning step and replicating the approach from the blog post to extract the links. Does that sound like a reasonable starting step?
Hey there :) Yes, your understanding is correct. From [...], I think that there's a better way to do this that can still maintain removing the URLs (I think that they're ultimately a lot of filler, and further will be nonsense once the punctuation's removed, so we'll have a lot of [...]). This is the simplest way I can think of to go about this, as there are all kinds of cleaning steps like removing punctuation and such that follow, which would also need to be accounted for. If you just put a [...]. Lemme know what you think!
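My guess at the shape of the option being floated here — the argument name and toy body are mine, not wikirec.data_utils.clean's actual signature — would be something like:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def clean(texts, remove_urls=True):
    """Toy version of a cleaning function with the URL removal made optional."""
    cleaned = []
    for text in texts:
        if remove_urls:
            text = URL_RE.sub("", text)  # the URL-stripping step, now gated by a flag
        # ... the remaining cleaning steps (punctuation, stopwords, etc.) would follow ...
        cleaned.append(text)
    return cleaned

print(clean(["see https://en.wikipedia.org/wiki/Novel for more"], remove_urls=False))
```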
I was messing around with the [...]. Other than that, I'm a fan of the [...].
Very very interesting, and sorry for putting you on the wrong track. Honestly I last really referenced the parsing codes years ago when I originally wrote the LDA version of this (it was a project for my master's), and didn't think about the fact that Wikipedia doesn't actually use URLs for internal links. Referencing the source of Prince Serebrenni, "Aleksey Konstantinovich Tolstoy" is [...].

Again, sorry for the false lead. We'll need to get the links in the parsing phase, which in the long run makes this easier :) The main difference is that a third element will be added to each entry of the parsed output, which would then be read like:

```python
import json

with open("./enwiki_books.ndjson", "r") as f:
    books = [json.loads(l) for l in f]

titles = [b[0] for b in books]
texts = [b[1] for b in books]
wikilinks = [b[2] for b in books]  # <- access it if you need it
```

So basically we just get an optional element that isn't even used if we're applying the current techniques. More to the point, we don't need to screw with the cleaning process as of now. It is something that should be looked at again in the future, as I think that BERT could potentially benefit from even raw texts with very little processing, but let's check this later :) I can reference this with some friends as well.

I will do an update tomorrow with the changes to wikirec.data_utils that will include edits to [...].
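For illustration only (the title, text, and links below are made up to match the running example, not actual parser output), a single entry of the updated ndjson would then carry three elements:

```python
import json

# One hypothetical line of enwiki_books.ndjson after the parsing change:
# [title, text, wikilinks]
entry = [
    "Prince Serebrenni",
    "Prince Serebrenni is a historical novel by Aleksey Konstantinovich Tolstoy ...",
    ["Aleksey Konstantinovich Tolstoy", "Historical novel", "Ivan the Terrible"],
]
print(json.dumps(entry, ensure_ascii=False))
```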
I'll also do a parse and get us a copy of [...].
Cool! I'm glad we cleared that up, and that it's an easy fix. Let me know if there's something I can look into as well. I can keep reviewing the blogpost, as I imagine a lot of the methods and insights for the NN model will derive from that.
Hey there :) Thanks for your offer to help on the parsing! It was literally just the line for [...]. To keep track of this, the recent steps and those left are (with my estimates of time/difficulty): [...]
Let me know what all from the above you'd have interest in, and I'll take the rest. Also of course let me know if you need support on anything. Would be happy to help 😊
I'd love to talk more about breaking down the 3rd task. And again, correct me if I'm not understanding 😅 In following the blogpost, we'd have to train a neural network (treating it as a supervised task) to generate an embedding between books and the internal wikilinks. Then we can generate a similarity matrix based on this embedding for each book. However, how do we combine the recommendations based off this NN with those of TFIDF and BERT?
Let's definitely break it down a bit more :) Just wanted to put out everything so there's a general roadmap, and I'd of course do the data uploads and testing (not sure if you have experience/interest in unit tests). Answering your question (as well as I can right now 😄), you're right in that we'll be combining similarities into a [...]. Let me know what your thoughts are on the above!

As far as breaking the task down, if you want to add something that's similar to the blogpost into model.gen_embeddings, then I could potentially work on getting [...].
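To give the combination question a concrete shape — purely a sketch with assumed inputs (each model producing an n_books × n_books similarity matrix in the same title order), not wikirec's actual internals:

```python
import numpy as np

def combine_similarities(sim_matrices, weights=None):
    """Weighted average of per-model similarity matrices.

    Each matrix is assumed to be n_books x n_books with rows and columns in
    the same title order (an assumed setup, not wikirec's exact internals).
    """
    if weights is None:
        weights = [1.0] * len(sim_matrices)
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    return sum(w * s for w, s in zip(weights, sim_matrices))

# e.g. an equal weighting of TFIDF, BERT, and the wikilink-embedding model:
# combined_sim = combine_similarities([tfidf_sim, bert_sim, wikilink_sim])
```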
I'm familiar with unit tests, but not well-versed I would say! Either way, what you've outlined makes sense. In terms of what I can do, I can start by building the architecture for the NN that will eventually generate the embeddings between titles and links. I'm interested in training the model myself, but we'll see if I have the computing power to do so in a reasonable manner 😅. To keep [...]
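For a concrete starting point: as I read it, the blog post's architecture boils down to parallel embedding layers for books and wikilinks joined by a dot product and trained to classify whether a (book, link) pair really occurs. A minimal Keras sketch along those lines (layer names and sizes are placeholders, not settled wikirec code):

```python
from tensorflow.keras.layers import Dense, Dot, Embedding, Input, Reshape
from tensorflow.keras.models import Model

def build_embedding_model(n_books, n_links, embedding_size=50):
    """Learn book and wikilink embeddings by classifying (book, link) pairs."""
    book_in = Input(name="book", shape=[1])
    link_in = Input(name="link", shape=[1])

    # Each book and each wikilink gets its own trainable embedding vector.
    book_emb = Embedding(name="book_embedding", input_dim=n_books,
                         output_dim=embedding_size)(book_in)
    link_emb = Embedding(name="link_embedding", input_dim=n_links,
                         output_dim=embedding_size)(link_in)

    # The normalized dot product scores how related a book and a link are.
    dot = Dot(name="dot_product", normalize=True, axes=2)([book_emb, link_emb])
    dot = Reshape(target_shape=[1])(dot)

    out = Dense(1, activation="sigmoid")(dot)

    model = Model(inputs=[book_in, link_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage: train on positive (book, link) pairs plus negative samples.
# model = build_embedding_model(n_books=len(titles), n_links=n_unique_wikilinks)
```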
A private function on the side would be totally fine for this! All sounds great, and looking forward to it :) In terms of computing power, have you ever used Google Colab? That might be a solution for this, as I don't remember the training being mentioned as too long in the blog post. Plus it's from 2016, when GPUs weren't as readily available as today (ML growth is nuts 😮). The big thing for that is that you do need to activate the GPUs in the notebook, as they're not on by default. As stated in the examples for this, you'd go to Edit > Notebook settings > Hardware accelerator and select GPU. I used Colab for some university projects, and it's built with Keras in mind. You'd have 24 hours or so of GPU time before the kernel restarts, which hopefully would be enough. If it's not, just lower the parameters and send along something that works, and I'm happy to make my computer wheeze a bit for the full run 😄
@victle, what do you think about combining the functions for generating the embeddings and the similarity matrix into one?
@andrewtavis I actually like the modularity of two separate functions for computing the embeddings and then the similarity matrix. Someone might just be interested in the embeddings, or would want more customization with how the similarity matrices are computed. Though this might be a rare case! But, I do see the benefit of making the recommendation process simpler. Plus, generating the similarity matrix is pretty simple after computing the embeddings, so it's like... why not? 😆
@victle, if you like the modularity we can keep it as is :) I was kind of on the fence for it and wanted to check, but it makes sense that someone might just want the embeddings. Plus, if we keep it as is it's less work 😄 Thanks for your input!
This issue is for adding an embeddings neural network implementation to wikirec. This package was originally based on the linked blog post, but the original model implementation has not been included so far. That original work and the provided codes could serve as the basis for adding such a model to wikirec, which ideally would also be documented and tested. That model was based on analyzing the links between pages, which could serve as a basis for the wikirec version with modifications to wikirec.data_utils, or the model could focus on the article texts instead. Partial implementations are more than welcome though :)
Please first indicate your interest in working on this, as it is a feature implementation :)
Thanks for your interest in contributing!