
Optimization opportunities #10

Open
danieldk opened this issue Mar 14, 2019 · 1 comment

Comments

danieldk commented Mar 14, 2019

The vast majority of training time is spent in the dot products and scaled additions. We have been doing unaligned loads so far. I made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code looks fine, and the compiler even performs some loop unrolling.

Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably caused by those functions being constrained by memory bandwidth.
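
For reference, here is roughly what the aligned-load variant of the dot product looks like. This is a minimal sketch rather than the actual finalfrontier kernel; it assumes both slices are 16-byte aligned and that their length is a multiple of four (no tail handling):

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn dot_aligned(u: &[f32], v: &[f32]) -> f32 {
    use std::arch::x86_64::*;

    debug_assert_eq!(u.len(), v.len());
    debug_assert_eq!(u.len() % 4, 0);
    debug_assert_eq!(u.as_ptr() as usize % 16, 0);
    debug_assert_eq!(v.as_ptr() as usize % 16, 0);

    let mut sums = _mm_setzero_ps();
    for i in (0..u.len()).step_by(4) {
        // _mm_load_ps requires a 16-byte aligned address, in contrast to
        // _mm_loadu_ps, which tolerates unaligned addresses.
        let a = _mm_load_ps(u.as_ptr().add(i));
        let b = _mm_load_ps(v.as_ptr().add(i));
        sums = _mm_add_ps(sums, _mm_mul_ps(a, b));
    }

    // Horizontal sum of the four partial sums.
    let mut parts = [0f32; 4];
    _mm_storeu_ps(parts.as_mut_ptr(), sums);
    parts.iter().sum()
}
```

Even with aligned loads, every lookup still streams a full embedding row through the cache hierarchy, which fits the observation that these kernels are memory-bound.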

I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.

  1. Some papers that replace the core word2vec computations with specialized kernels sample a set of negatives per sentence, rather than per token. In the best case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings. (A sketch follows this list.)

  2. The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something could be gained by ordering the embeddings in the subword matrix by hash code frequency. However, in the most obvious implementation this would add an indirection (hash code -> index). (A second sketch follows this list.)
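
To make the first point concrete, here is a rough sketch of per-sentence negative sampling. The names (`train_sentence`, `sample_negative`, `update`) are hypothetical and not part of finalfrontier; windowing, subsampling, and the actual gradient updates are abstracted away:

```rust
fn train_sentence<F, U>(
    sentence: &[usize],
    n_negatives: usize,
    mut sample_negative: F,
    mut update: U,
) where
    F: FnMut() -> usize,
    U: FnMut(usize, usize, bool),
{
    // Draw one shared set of negatives for the whole sentence instead of
    // redrawing it for every token, so the corresponding output rows stay
    // hot across all updates for this sentence.
    let negatives: Vec<usize> = (0..n_negatives).map(|_| sample_negative()).collect();

    // Window clipping and subsampling are omitted for brevity.
    for (i, &focus) in sentence.iter().enumerate() {
        for (j, &context) in sentence.iter().enumerate() {
            if i == j {
                continue;
            }

            // Positive update for the observed (focus, context) pair ...
            update(focus, context, true);

            // ... and negative updates against the shared negatives.
            for &negative in &negatives {
                update(focus, negative, false);
            }
        }
    }
}
```

Since the same few output rows are reused for every pair in the sentence, they should stay in cache across all updates, which is where the reduction in cache misses would come from.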

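And a hypothetical sketch of the indirection from the second point: count how often each subword bucket is hit on the training data, then remap buckets so that frequently hit buckets occupy adjacent low rows of the subword matrix. `BucketRemap` is made up for illustration:

```rust
struct BucketRemap {
    // remap[bucket] = row in the reordered subword embedding matrix.
    remap: Vec<usize>,
}

impl BucketRemap {
    fn new(bucket_counts: &[u64]) -> Self {
        // Sort bucket indices by descending hit count.
        let mut order: Vec<usize> = (0..bucket_counts.len()).collect();
        order.sort_by_key(|&bucket| std::cmp::Reverse(bucket_counts[bucket]));

        // Invert the ordering: the most frequent bucket gets row 0, etc.
        let mut remap = vec![0usize; bucket_counts.len()];
        for (row, &bucket) in order.iter().enumerate() {
            remap[bucket] = row;
        }

        BucketRemap { remap }
    }

    // The extra indirection on every subword lookup.
    fn row(&self, bucket: usize) -> usize {
        self.remap[bucket]
    }
}
```

Every subword lookup then pays for one extra table access (bucket -> row), so whether the improved locality outweighs the indirection would have to be measured.
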
sebpuetz (Member) commented:

> The vast majority of training time is spent in the dot products and scaled additions. We have been doing unaligned loads so far. I made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code looks fine, and the compiler even performs some loop unrolling.

That's also what I found. Time spent on scaled additions explodes even more when using complex context schemes.

> Some papers that replace the core word2vec computations with specialized kernels sample a set of negatives per sentence, rather than per token. In the best case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.

It would still be interesting to see what changes through this, although I'm not sure what to expect. With rather short sentences and a large window, we might get cases where it's not even possible to draw any negatives without using something from the actual context window. Also, we would use tokens from the same sentence as negative examples, which could actually be related to the meaning of the focus word.

> The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something could be gained by ordering the embeddings in the subword matrix by hash code frequency. However, in the most obvious implementation this would add an indirection (hash code -> index).

Can't add much to this, but this sounds like something to test empirically?
