Test out running a few dictionaries through this tool #8
I've got an instance of this running on an EC2 Micro, ready to step through a corpus of documents.
I started ingesting the Florida and Virginia dictionaries, but at 1,009 of 155,097 (155,097 what, I don't know), it died with this error:
My guess is that there's a data error, but without knowing what the 155,097 represents, it's tough to know where to check. The file in question is 7,359 lines, 2,106,973 bytes, 2,070,718 characters, and 305,621 words. Looking through …
I think it's a word count. My guess is that it's using a different method of counting words than …
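A minimal sketch of why two "word" counts can disagree on the same file (this is an illustration, not the tool's actual counter): splitting on whitespace, as `wc -w` does, gives a different total than a tokenizer that also splits words out of punctuation.

```ruby
# Two hypothetical ways of counting "words" in the same text.
text = 'The term "means" the definition; e.g., multi-word phrases.'

whitespace_tokens = text.split            # split on runs of whitespace: 8 tokens
punct_tokens      = text.scan(/[\w'-]+/)  # split out of punctuation: 9 ("e.g." -> "e", "g")

puts whitespace_tokens.size
puts punct_tokens.size
```

A 305,621-word file by one method could easily be 155,097 "words" (or sentences, or unique terms) by another, which is why the progress denominator is hard to match up.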
I tried uncommenting …
I switched from TSV to CSV, in case there was some kind of encoding problem, but it still crashed. So I'll have to dive into the code. (Keeping in mind that I don't know Ruby.)
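If the crash really were an encoding problem, one defensive approach in Ruby would be to scrub invalid byte sequences before parsing, so a single stray byte can't kill the whole ingest. This is a hedged sketch: the two-column term/definition layout and the sample row are assumptions, not the repo's actual format.

```ruby
require "csv"

# Hypothetical dictionary data; real files would be read from disk.
raw = <<~DATA
  term,definition
  estoppel,"A bar preventing a party from asserting a contrary claim."
DATA

# String#scrub replaces invalid UTF-8 byte sequences instead of raising.
rows = CSV.parse(raw.scrub, headers: true)
rows.each do |row|
  puts "#{row['term']}: #{row['definition']}"
end
```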
OK, so it looks like …
I've observed the bug in action, although not the cause of it.
I enabled the display of the actual words, plus the array values:
I'm not sure what's going on here, but somehow …
I dialed up the verbosity a bit, and got this:
I've taken the coward's way out: if the array size is less than 3, it skips that set of terms.
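The workaround described above might look something like this in Ruby. The variable names and the shape of the data are hypothetical, since the actual code isn't shown here; the point is just guarding the size before indexing.

```ruby
# Hypothetical sets of related terms; the second is too short and would
# previously have caused an out-of-bounds access downstream.
term_sets = [
  ["abandon", "abandoned", "abandonment"],
  ["abate"],
  ["accrue", "accrual", "accrued"]
]

# The coward's way out: drop any set with fewer than three elements.
usable = term_sets.select { |set| set.size >= 3 }
puts usable.size
```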
This seems to be working much better. It's also going to take a very long time: about 4 hours just to complete the first of three steps. I may need to step up from an EC2 Micro to something beefier.
So, while this is running, it's not really doing it right. We have term → definition pairs with a known relationship between the two, but we're ignoring that. This system is set up for parsing unstructured text, but our text has structure: we know that a given phrase has a given definition. Instead, we're munging together every term and every definition, eliminating that structure. The solution is to retool how this works: instead of a system meant for unstructured text, we need one meant for structured text, something that can break down definitions and relate them to terms. Also, I imagine we'll want to pre-process dictionaries to eliminate redundancies, e.g.:
We can eliminate the first use of the defined terms, when present within quotes, and the word "means," leaving us with:
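The pre-processing described above, stripping the quoted restatement of the defined term and the word "means" from the start of its definition, could be sketched like this. Real dictionaries vary in quote characters and boilerplate phrasing, so the regex here is an assumption that handles one common form, not a general solution.

```ruby
# Hypothetical helper: remove a leading quoted repeat of the term plus
# the word "means", leaving only the defining text. Handles straight
# and curly double quotes.
def strip_redundancy(term, definition)
  definition.sub(/\A["\u201C]#{Regexp.escape(term)}["\u201D]\s+means\s+/i, "")
end

puts strip_redundancy("Abandon", '"Abandon" means to forsake entirely.')
# prints: to forsake entirely.
```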
Aaand it ran out of memory and crashed when it was 18% finished:
So an EC2 Micro probably won't work. Also, this was going to take a lot longer than 4 hours. Maybe 8 hours? A bigger server is called for.
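A bigger instance is one fix; another common one, if the tool happens to slurp each whole file into memory, is streaming the corpus line by line so memory stays flat regardless of file size. A sketch of that pattern (the tab-separated layout and the processing step are hypothetical):

```ruby
require "tempfile"

# Stand-in for a dictionary file on disk.
file = Tempfile.new("corpus")
file.write("alpha\tfirst letter\nbeta\tsecond letter\n")
file.rewind

count = 0
# File.foreach reads one line at a time instead of loading the whole file.
File.foreach(file.path) do |line|
  term, definition = line.chomp.split("\t", 2)
  count += 1 if term && definition
end
puts count
```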
Running it again, this time with a beefier instance, so it won't run out of memory. I don't think it's running any faster, though. :(
I was wrong: this is faster. Maybe 50% faster?
Welcome to Casey Explores NLP Without Knowing What He's Doing Land. :) (Hopefully there's some value in that work for you, Waldo.)
Of course! Now it's Waldo Explores NLP Without Knowing What He's Doing Land. :) I'm working with @seamuskraft to create a …
:) |
It is worth remembering that Columbus was trying to get hisself to India and China... |
Just see what happens.