Test out running a few dictionaries through this tool #8
I've got an instance of this running on an EC2 Micro, ready to step through a corpus of documents.
I started ingesting the Florida and Virginia dictionaries, but at 1,009 of 155,097 (155,097 what, I don't know), it died with this error:
My guess is that there's a data error, but without knowing what the 155,097 represents, it's tough to know where to check. The file in question is 7,359 lines, 2,106,973 bytes, 2,070,718 characters, and 305,621 words. Looking through …
I think it's a word count. My guess is that it's using a different method of counting words than …
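A minimal sketch of why two "word" counts can disagree on the same file (this is an illustration, not the tool's actual counter): splitting on whitespace, as `wc -w` does, gives a different total than a tokenizer that also splits words out of punctuation.

```ruby
# Two hypothetical ways of counting "words" in the same text.
text = 'The term "means" the definition; e.g., multi-word phrases.'

whitespace_tokens = text.split            # split on runs of whitespace: 8 tokens
punct_tokens      = text.scan(/[\w'-]+/)  # split out of punctuation: 9 ("e.g." -> "e", "g")

puts whitespace_tokens.size
puts punct_tokens.size
```

A 305,621-word file by one method could easily be 155,097 "words" (or sentences, or unique terms) by another, which is why the progress denominator is hard to match up.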
I tried uncommenting …
I switched from TSV to CSV, in case there was some kind of encoding problem, but it still crashed. So I'll have to dive into the code. (Keeping in mind that I don't know Ruby.)
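If the crash really were an encoding problem, one defensive approach in Ruby would be to scrub invalid byte sequences before parsing, so a single stray byte can't kill the whole ingest. This is a hedged sketch: the two-column term/definition layout and the sample row are assumptions, not the repo's actual format.

```ruby
require "csv"

# Hypothetical dictionary data; real files would be read from disk.
raw = <<~DATA
  term,definition
  estoppel,"A bar preventing a party from asserting a contrary claim."
DATA

# String#scrub replaces invalid UTF-8 byte sequences instead of raising.
rows = CSV.parse(raw.scrub, headers: true)
rows.each do |row|
  puts "#{row['term']}: #{row['definition']}"
end
```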
OK, so it looks like …
I've observed the bug in action, although not the cause of it.
I enabled the display of the actual words, plus the array values:
I'm not sure what's going on here, but somehow …
I dialed up the verbosity a bit, and got this:
I've taken the coward's way out: if the array size is less than 3, it skips that set of terms.
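The workaround described above might look something like this in Ruby. The variable names and the shape of the data are hypothetical, since the actual code isn't shown here; the point is just guarding the size before indexing.

```ruby
# Hypothetical sets of related terms; the second is too short and would
# previously have caused an out-of-bounds access downstream.
term_sets = [
  ["abandon", "abandoned", "abandonment"],
  ["abate"],
  ["accrue", "accrual", "accrued"]
]

# The coward's way out: drop any set with fewer than three elements.
usable = term_sets.select { |set| set.size >= 3 }
puts usable.size
```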
This seems to be working much better. It's also going to take a very long time: about 4 hours just to complete the first of three steps. I may need to step up from an EC2 Micro to something beefier.
So, while this is running, it's not really doing it right. We have term → definition pairs with a known relationship between the two, but we're ignoring that. This system is set up for parsing unstructured text, but our text has structure: we know that a given phrase has a given definition. Instead, we're munging together every term and every definition, eliminating that structure. The solution is to retool how this works: instead of a system meant for unstructured text, we need one meant for structured text, something that can break down definitions and relate them to terms. Also, I imagine we'll want to pre-process dictionaries to eliminate redundancies, e.g.:
We can eliminate the first use of the defined terms, when present within quotes, and the word "means," leaving us with:
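The pre-processing described above, stripping the quoted restatement of the defined term and the word "means" from the start of its definition, could be sketched like this. Real dictionaries vary in quote characters and boilerplate phrasing, so the regex here is an assumption that handles one common form, not a general solution.

```ruby
# Hypothetical helper: remove a leading quoted repeat of the term plus
# the word "means", leaving only the defining text. Handles straight
# and curly double quotes.
def strip_redundancy(term, definition)
  definition.sub(/\A["\u201C]#{Regexp.escape(term)}["\u201D]\s+means\s+/i, "")
end

puts strip_redundancy("Abandon", '"Abandon" means to forsake entirely.')
# prints: to forsake entirely.
```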
Aaand it ran out of memory and crashed when it was 18% finished:
So an EC2 Micro probably won't work. Also, this was going to take a lot longer than 4 hours. Maybe 8 hours? A bigger server is called for.
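A bigger instance is one fix; another common one, if the tool happens to slurp each whole file into memory, is streaming the corpus line by line so memory stays flat regardless of file size. A sketch of that pattern (the tab-separated layout and the processing step are hypothetical):

```ruby
require "tempfile"

# Stand-in for a dictionary file on disk.
file = Tempfile.new("corpus")
file.write("alpha\tfirst letter\nbeta\tsecond letter\n")
file.rewind

count = 0
# File.foreach reads one line at a time instead of loading the whole file.
File.foreach(file.path) do |line|
  term, definition = line.chomp.split("\t", 2)
  count += 1 if term && definition
end
puts count
```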
Running it again, this time with a beefier instance, so it won't run out of memory. I don't think it's running any faster, though. :(
I was wrong: this is faster. Maybe 50% faster?
Welcome to Casey Explores NLP Without Knowing What He's Doing Land. :) (Hopefully there's some value in that work for you, Waldo.)
Of course! Now it's Waldo Explores NLP Without Knowing What He's Doing Land. :) I'm working with @seamuskraft to create a …
:) |
It is worth remembering that Columbus was trying to get hisself to India and China... |
Just see what happens.