GiR issue with new version of IPTM #1

bomin8319 · 2018-02-19T16:50:34Z

I have spent very long time figuring out the possible bugs, but failed.
I was able to re-derive exactly the same sampling equation as cluster LDA, so I do not see any mathematical error.
IP assignments and topic assignments from backward sampling always has larger variance than those from forward sampling, and the difference gets larger as we run more outer iterations.

To test this in the simplest setting of cluster LDA (no other variables other than cd and z), I ran 'clusterLDA.R' code in GiR2 folder and I get the same odd results. Is there anything I am totally missing?

bdesmarais · 2018-02-19T17:29:35Z

Could this be caused by label-switching in the backwards sampling? In the forward sampling, we are generating topics from topic distribution k for documents in cluster k, and there is no ambiguity regarding which cluster is topic k in the forward sampling, right? However, in inference, and thus backwards sampling, it seems like label-switching may represent an additional form of variation. Is it possible to artificially introduce label switching into the forward sampling to see if this could produce additional variance in these GiR measures?

aschein · 2018-02-19T17:38:45Z

Perhaps try calculating the statistics based on the topic-type counts (e.g., N_kv) not the assignments (e.g., z_i). The statistics (e.g., variance) of the counts N_kv should be immune from label-switching.

bomin8319 · 2018-02-19T17:56:54Z

topic-type counts (N_kv) looks fine in terms of passing GiR. However, current inference (and thus backward samples) converges to one interaction pattern and one (or two) topics across the entire corpus in the long run. For example, if the topic distribution for K=4 is (0.25, 0.25, 0.25, 0.25) in forward sampling, the inferred topic distribution in backward is (0.8, 0.2, 0, 0). I think this is still problematic if it ends up with very few topics remaining in the real data analysis.

I don't quite understand how backward sampling introduces label-switching issue "when we start the inference from the initial values set as true topic distribution".

hannawallach · 2018-02-19T18:36:54Z

It doesn't sound like label switching is the issue. Bomin, can you point me to the generative process and the inference equations?

…

On Feb 19, 2018 5:56 PM, "Bomin Kim" ***@***.***> wrote: topic-type counts (N_kv) looks fine in terms of passing GiR. However, current inference (and thus backward samples) converges to one interaction pattern and one (or two) topics across the entire corpus in the long run. For example, if the topic distribution for K=4 is (0.25, 0.25, 0.25, 0.25) in forward sampling, the inferred topic distribution in backward is (0.8, 0.2, 0, 0). I think this is still problematic if it ends up with very few topics remaining in the real data analysis. I don't quite understand how backward sampling introduces label-switching issue "when we start the inference from the initial values set as true topic distribution". — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7jUDuOdx0Xg3lWGo9tfQcxZKdQW-ks5tWbXmgaJpZM4SK117> .

bomin8319 · 2018-02-19T18:42:39Z

You can look at paper/icml2018_style/IPTM_ICML2.pdf.
Generating process is in Section 2.2, and the inference equation is Equation (15) in page 4.
Although the draft is currently written for minimal path assumption, I am currently failing for both minimal and maximum path assumptions. (Since GiR use same number of words across all documents, maximal should be fine and it should pass)

hannawallach · 2018-02-20T12:08:50Z

I will have to look tomorrow as I only have phone access today so I can't pull from the repo. Basically I'm wondering if you are doing the same integrations/approximations in the generative process as in inference. I'd try that.

…

On Feb 19, 2018 6:42 PM, "Bomin Kim" ***@***.***> wrote: You can look at paper/icml2018_style/IPTM_ICML2.pdf. Generating process is in Section 2.2, and the inference equation is Equation (15) in page 4. Although the draft is currently written for minimal path assumption, I am currently failing for both minimal and maximum path assumptions. (Since GiR use same number of words across all documents, maximal should be fine and it should pass) — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7tdw6_7CkcqR1yK2YLDYOumNq6wDks5tWcCggaJpZM4SK117> .

bomin8319 · 2018-02-23T17:02:35Z

Generating from collapsed LDA equations definitely helped, but there exists remaining issue.

Now having two hierarchy with uniform base for interaction pattern-specific topic distribution (i.e., m_c ~ Dir(\alpha1, u) and \theta_d ~ Dir(\alpha, m_{c_d})) passes GiR for both maximal and minimal path assumption---which did not pass with non-collapsed generating process.
However, when directly follow cluster LDA and assign three hierarchy with additional layer representing corpus-wide topic distribution (i.e., m ~ Dir(\alpha0, u), m_c ~ Dir(\alpha1, m) and \theta_d ~ Dir(\alpha, m_{c_d})), it still fails GiR---Backward samplers concentrated on few dominant topics. I used Equation (15) in /paper/icml2018_style/IPTM_ICML2.pdf for both generative process and inference, and nothing (both equation and code) seems to be wrong. Maybe I should not directly generate from the entirely-collapsed equation when I have this additional hierarchy?

Another related question is, isn't the overall corpus-wide topic distribution (m in #2 above) controlled by the distribution of interaction pattern assignments (or clusters) across the documents? If that is the case, it may not be necessary to use three hierarchy instead of two---in other words, m will be the weighted average of m_c across c=1,...,C so assuming uniform base for m_c would be fine...?

hannawallach · 2018-02-23T17:04:57Z

I don't understand your related question.

…

On Feb 23, 2018 12:02 PM, "Bomin Kim" ***@***.***> wrote: Generating from collapsed LDA equations definitely helped, but there exists remaining issue. 1. Now having two hierarchy with uniform base for interaction pattern-specific topic distribution (i.e., m_c ~ Dir(\alpha1, u) and \theta_d ~ Dir(\alpha, m_{c_d})) passes GiR for both maximal and minimal path assumption---which did not pass with non-collapsed generating process. 2. However, when directly follow cluster LDA and assign three hierarchy with additional layer representing corpus-wide topic distribution (i.e., m ~ Dir(\alpha0, u), m_c ~ Dir(\alpha1, m) and \theta_d ~ Dir(\alpha, m_{c_d})), it still fails GiR---Backward samplers concentrated on few dominant topics. I used Equation (15) in /paper/icml2018_style/IPTM_ICML2.pdf for both generative process and inference, and nothing (both equation and code) seems to be wrong. Maybe I should not directly generate from the entirely-collapsed equation when I have this additional hierarchy? Another related question is, isn't the overall corpus-wide topic distribution (m in #2 above) controlled by the distribution of interaction pattern assignments (or clusters) across the documents? If that is the case, it may not be necessary to use #2 instead of #1 <#1>---in other words, m will be the weighted average of m_c across c=1,...,C so assuming uniform base for m_c would be fine...? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7mWc-uYDjPQrPZA5UJV3psZWQ5Liks5tXu8sgaJpZM4SK117> .

hannawallach · 2018-02-23T17:08:38Z

Can you attach the PDF? I'm on my phone today and can't pull from the repo.

…

On Feb 23, 2018 12:02 PM, "Bomin Kim" ***@***.***> wrote: Generating from collapsed LDA equations definitely helped, but there exists remaining issue. 1. Now having two hierarchy with uniform base for interaction pattern-specific topic distribution (i.e., m_c ~ Dir(\alpha1, u) and \theta_d ~ Dir(\alpha, m_{c_d})) passes GiR for both maximal and minimal path assumption---which did not pass with non-collapsed generating process. 2. However, when directly follow cluster LDA and assign three hierarchy with additional layer representing corpus-wide topic distribution (i.e., m ~ Dir(\alpha0, u), m_c ~ Dir(\alpha1, m) and \theta_d ~ Dir(\alpha, m_{c_d})), it still fails GiR---Backward samplers concentrated on few dominant topics. I used Equation (15) in /paper/icml2018_style/IPTM_ICML2.pdf for both generative process and inference, and nothing (both equation and code) seems to be wrong. Maybe I should not directly generate from the entirely-collapsed equation when I have this additional hierarchy? Another related question is, isn't the overall corpus-wide topic distribution (m in #2 above) controlled by the distribution of interaction pattern assignments (or clusters) across the documents? If that is the case, it may not be necessary to use #2 instead of #1 <#1>---in other words, m will be the weighted average of m_c across c=1,...,C so assuming uniform base for m_c would be fine...? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7mWc-uYDjPQrPZA5UJV3psZWQ5Liks5tXu8sgaJpZM4SK117> .

hannawallach · 2018-02-23T17:09:06Z

Lastly, to me this sounds like there's a subtle bug somewhere....

…

On Feb 23, 2018 12:02 PM, "Bomin Kim" ***@***.***> wrote: Generating from collapsed LDA equations definitely helped, but there exists remaining issue. 1. Now having two hierarchy with uniform base for interaction pattern-specific topic distribution (i.e., m_c ~ Dir(\alpha1, u) and \theta_d ~ Dir(\alpha, m_{c_d})) passes GiR for both maximal and minimal path assumption---which did not pass with non-collapsed generating process. 2. However, when directly follow cluster LDA and assign three hierarchy with additional layer representing corpus-wide topic distribution (i.e., m ~ Dir(\alpha0, u), m_c ~ Dir(\alpha1, m) and \theta_d ~ Dir(\alpha, m_{c_d})), it still fails GiR---Backward samplers concentrated on few dominant topics. I used Equation (15) in /paper/icml2018_style/IPTM_ICML2.pdf for both generative process and inference, and nothing (both equation and code) seems to be wrong. Maybe I should not directly generate from the entirely-collapsed equation when I have this additional hierarchy? Another related question is, isn't the overall corpus-wide topic distribution (m in #2 above) controlled by the distribution of interaction pattern assignments (or clusters) across the documents? If that is the case, it may not be necessary to use #2 instead of #1 <#1>---in other words, m will be the weighted average of m_c across c=1,...,C so assuming uniform base for m_c would be fine...? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7mWc-uYDjPQrPZA5UJV3psZWQ5Liks5tXu8sgaJpZM4SK117> .

bomin8319 · 2018-02-23T17:11:39Z

IPTM_ICML2.pdf

Just to be clear, here is how I generate z's.

initialize N_dk = 0, N_kc = 0, and N_k = 0
generate topics as below
for (d in 1:D){
for (n in 1:N_d) {
z_dn ~ Equation (15)
N_{d z_dn} +=1
N_{z_dn c_d} +=1
N_{z_dn} +=1
}
}

hannawallach · 2018-02-23T17:13:45Z

Just for maximal, right? This incrementing procedure isn't right for minimal.

…

On Feb 23, 2018 12:11 PM, "Bomin Kim" ***@***.***> wrote: IPTM_ICML2.pdf <https://github.com/desmarais-lab/IPTM/files/1752470/IPTM_ICML2.pdf> Just to be clear, here is how I generate z's. - initialize N_dk = 0, N_kc = 0, and N_k = 0 for (d in 1:D){ for (n in 1:N_d) { z_dn ~ Equation (15) N_{d z_dn} +=1 N_{z_dn c_d} +=1 N_{z_dn} +=1 } } — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7meavz5wRCjs0st4CzGjsNXlVnnMks5tXvFMgaJpZM4SK117> .

bomin8319 · 2018-02-23T17:14:45Z

Yes. Just for the maximal!

hannawallach · 2018-02-23T18:45:57Z

And even maximal doesn't pass?

…

On Feb 23, 2018 12:14 PM, "Bomin Kim" ***@***.***> wrote: Yes. Just for the maximal! — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7vZf8QhXm5c6pL2R3uv1cW-465M6ks5tXvIGgaJpZM4SK117> .

bomin8319 · 2018-02-23T19:00:05Z

Yes, both maximal and minimal fail.
After generating z's as above, I infer them as below:

for (iter in 1:Niter) {
for (d in 1:D) {
for (n in 1:N_d) {
N_{d z_dn} -=1
N_{z_dn c_d} -=1
N_{z_dn} -=1
z_dn ~ Equation (15) #new topic assignment
N_{d z_dn} +=1
N_{z_dn c_d} +=1
N_{z_dn} +=1
}
}
}

and compare N_k from forward and backward using GiR plots which looks like below.
(document-IP and token-word distribution plots become completely "pass" when I shut down inference for Z's).
GiRplot.pdf

hannawallach · 2018-02-23T19:03:25Z

I just don't get how this can work with the two level hierarchy but not 3. I think there must be a bug. When you do the two level hierarchy ate you using different code to when you're using the 3 level hierarchy? On Feb 23, 2018 2:00 PM, "Bomin Kim" <[email protected]> wrote: Yes, both maximal and minimal fail. After generating z's as above, I infer them as below: for (iter in 1:Niter) { for (d in 1:D) { for (n in 1:N_d) { N_{d z_dn} -=1 N_{z_dn c_d} -=1 N_{z_dn} -=1 z_dn ~ Equation (15) #new topic assignment N_{d z_dn} +=1 N_{z_dn c_d} +=1 N_{z_dn} +=1 } } } and compare N_k from forward and backward using GiR plots which looks like below. (document-IP and token-word distribution plots become completely "pass" when I shut down inference for Z's). GiRplot.pdf <https://github.com/desmarais-lab/IPTM/files/1752802/GiRplot.pdf> — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7l9_-HiIHxTaOK7nsLsHYB0job1Cks5tXwq1gaJpZM4SK117> .

hannawallach · 2018-02-23T19:06:16Z

One idea: set alpha0 in your 3 level code so big that the model effectively bypasses the corpus-level counts. Does it pass?

…

On Feb 23, 2018 2:03 PM, "Hanna Wallach" ***@***.***> wrote: I just don't get how this can work with the two level hierarchy but not 3. I think there must be a bug. When you do the two level hierarchy ate you using different code to when you're using the 3 level hierarchy? On Feb 23, 2018 2:00 PM, "Bomin Kim" ***@***.***> wrote: Yes, both maximal and minimal fail. After generating z's as above, I infer them as below: for (iter in 1:Niter) { for (d in 1:D) { for (n in 1:N_d) { N_{d z_dn} -=1 N_{z_dn c_d} -=1 N_{z_dn} -=1 z_dn ~ Equation (15) #new topic assignment N_{d z_dn} +=1 N_{z_dn c_d} +=1 N_{z_dn} +=1 } } } and compare N_k from forward and backward using GiR plots which looks like below. (document-IP and token-word distribution plots become completely "pass" when I shut down inference for Z's). GiRplot.pdf <https://github.com/desmarais-lab/IPTM/files/1752802/GiRplot.pdf> — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7l9_-HiIHxTaOK7nsLsHYB0job1Cks5tXwq1gaJpZM4SK117> .

bomin8319 · 2018-02-23T19:17:16Z

Yes, I use the same code for the level 2 and level 3 hierarchy.
Only difference is that I replace the fraction (N_k + alpha0/K)/(N+alpha0) by 1/K.

I just tried to set alpha0 bigger and apparently it gets closer to "pass".
With alpha0 = 1000, it passed.

hannawallach · 2018-02-23T19:18:03Z

Okay. Are you sampling the alphas? It sounds like you are not?

…

On Feb 23, 2018 2:17 PM, "Bomin Kim" ***@***.***> wrote: Yes, I use the same code for the level 2 and level 3 hierarchy. Only difference is that I replace the fraction (N_k + alpha0/K)/(N+alpha0) by 1/K. I just tried to set alpha0 bigger and apparently it gets closer to "pass". With alpha0 = 1000, it passed. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7m966RdCMXppbda1bnV3f-0bW0Fkks5tXw68gaJpZM4SK117> .

bomin8319 · 2018-02-23T19:19:29Z

No. All alpha's are treated as hyperparameters.
Should we embed sampling steps for alphas then?

hannawallach · 2018-02-23T19:20:13Z

Here's what it sounds like to me: the counts at the top level are very big. (Print them to get a sense of the magnitude.) If you have small alphas at that level, you're putting a massive amount of weight on the top-level counts and so I imagine you're getting stuck in a shitty rich get richer scenario and not getting out of it. On Feb 23, 2018 2:17 PM, "Hanna Wallach" <[email protected]> wrote: Okay. Are you sampling the alphas? It sounds like you are not?

…

On Feb 23, 2018 2:17 PM, "Bomin Kim" ***@***.***> wrote: Yes, I use the same code for the level 2 and level 3 hierarchy. Only difference is that I replace the fraction (N_k + alpha0/K)/(N+alpha0) by 1/K. I just tried to set alpha0 bigger and apparently it gets closer to "pass". With alpha0 = 1000, it passed. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7m966RdCMXppbda1bnV3f-0bW0Fkks5tXw68gaJpZM4SK117> .

hannawallach · 2018-02-23T19:21:08Z

Probably. (What values are you setting them to?)

…

On Feb 23, 2018 2:19 PM, "Bomin Kim" ***@***.***> wrote: No. All alpha's are treated as hyperparameters. Should we embed sampling steps for alphas then? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7os-74d3LQ9LIjoCjTD225r2Isrwks5tXw9CgaJpZM4SK117> .

bomin8319 · 2018-02-23T19:29:01Z

Totally agree with you on "getting stuck in a shitty rich get richer scenario".
The reason 2 level worked fine is it always lowers richer probability and increases poorer ones.
Similarly when we use huge alpha0, it gets closer to 2 level so it passed.

I so far varied alphas from 5 to 50 (I thought this should pass no matter what alphas are) with different combinations, but now I realize that was not big enough.

hannawallach · 2018-02-23T19:29:52Z

I'd sample the alphas.

…

On Feb 23, 2018 2:29 PM, "Bomin Kim" ***@***.***> wrote: Totally agree with you on "getting stuck in a shitty rich get richer scenario". The reason 2 level worked fine is it always lowers richer probability and increases poorer ones. Similarly when we use huge alpha0, it gets closer to 2 level so it passed. I so far varied alphas from 5 to 50 (I thought this should pass no matter what alphas are) with different combinations, but now I realize that was not big enough. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7qFTtd0Mt1brqCfQ0j_dBVQ2h4Oqks5tXxF9gaJpZM4SK117> .

hannawallach · 2018-02-23T19:30:44Z

I'm also still not convinced that there isn't a bug somewhere.

…

On Feb 23, 2018 2:29 PM, "Bomin Kim" ***@***.***> wrote: Totally agree with you on "getting stuck in a shitty rich get richer scenario". The reason 2 level worked fine is it always lowers richer probability and increases poorer ones. Similarly when we use huge alpha0, it gets closer to 2 level so it passed. I so far varied alphas from 5 to 50 (I thought this should pass no matter what alphas are) with different combinations, but now I realize that was not big enough. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7qFTtd0Mt1brqCfQ0j_dBVQ2h4Oqks5tXxF9gaJpZM4SK117> .

bomin8319 · 2018-02-23T19:51:41Z

I seems like alpha1 (contribution of the counts at the middle level) also needed to be bigger.
Both maximal and minimal passes with (alpha, alpha1, alpha0) = (5, 50, 100), so it may need sampling of alphas to estimate how much each level should be accounted for.
Before I work on that, I will ask Bruce to double check my R code in hope of finding any bugs.

hannawallach · 2018-02-23T20:48:03Z

(I started this before your most recent email.) Replying quickly from my computer. 1. I'm not convinced that there's not a bug. 2. You can think of each alpha as a pseudocount. If the count that you're adding an alpha to is N_d -- i.e., a document length -- then this tells you something about the value of the alpha that you want. In other words, you likely want something that is not substantially larger or substantially smaller than N_d. Ditto if the count that you're adding an alpha to is N_c -- here, you want something that's roughly comparable to the number of tokens associated with a cluster. And ditto for N_. at the top level. Since N_. is the total number of tokens in the corpus, it will need to be muuuuuuuch larger than the kind of alpha values that are suitable for the N_d level (which is the level we're usually working at with LDA).

…

On Fri, Feb 23, 2018 at 2:30 PM, Hanna Wallach ***@***.***> wrote: I'm also still not convinced that there isn't a bug somewhere. On Feb 23, 2018 2:29 PM, "Bomin Kim" ***@***.***> wrote: > Totally agree with you on "getting stuck in a shitty rich get richer > scenario". > The reason 2 level worked fine is it always lowers richer probability and > increases poorer ones. > Similarly when we use huge alpha0, it gets closer to 2 level so it passed. > > I so far varied alphas from 5 to 50 (I thought this should pass no matter > what alphas are) with different combinations, but now I realize that was > not big enough. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#1 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AA1T7qFTtd0Mt1brqCfQ0j_dBVQ2h4Oqks5tXxF9gaJpZM4SK117> > . >

-- hanna wallach http://dirichlet.net/

hannawallach · 2018-02-23T20:48:33Z

3. I'd definitely sample the alphas. It's even less easy to figure out good alpha values for the minimal path assumption.

…

On Fri, Feb 23, 2018 at 3:47 PM, Hanna Wallach ***@***.***> wrote: (I started this before your most recent email.) Replying quickly from my computer. 1. I'm not convinced that there's not a bug. 2. You can think of each alpha as a pseudocount. If the count that you're adding an alpha to is N_d -- i.e., a document length -- then this tells you something about the value of the alpha that you want. In other words, you likely want something that is not substantially larger or substantially smaller than N_d. Ditto if the count that you're adding an alpha to is N_c -- here, you want something that's roughly comparable to the number of tokens associated with a cluster. And ditto for N_. at the top level. Since N_. is the total number of tokens in the corpus, it will need to be muuuuuuuch larger than the kind of alpha values that are suitable for the N_d level (which is the level we're usually working at with LDA). On Fri, Feb 23, 2018 at 2:30 PM, Hanna Wallach ***@***.***> wrote: > I'm also still not convinced that there isn't a bug somewhere. > > On Feb 23, 2018 2:29 PM, "Bomin Kim" ***@***.***> wrote: > >> Totally agree with you on "getting stuck in a shitty rich get richer >> scenario". >> The reason 2 level worked fine is it always lowers richer probability >> and increases poorer ones. >> Similarly when we use huge alpha0, it gets closer to 2 level so it >> passed. >> >> I so far varied alphas from 5 to 50 (I thought this should pass no >> matter what alphas are) with different combinations, but now I realize that >> was not big enough. >> >> — >> You are receiving this because you commented. >> Reply to this email directly, view it on GitHub >> <#1 (comment)>, >> or mute the thread >> <https://github.com/notifications/unsubscribe-auth/AA1T7qFTtd0Mt1brqCfQ0j_dBVQ2h4Oqks5tXxF9gaJpZM4SK117> >> . >> > -- hanna wallach http://dirichlet.net/

-- hanna wallach http://dirichlet.net/

hannawallach · 2018-02-23T20:49:00Z

4. If there isn't a bug then the issue is mixing, caused by having shitty alpha values that mean it's taking waaaay too long to mix.

…

On Fri, Feb 23, 2018 at 3:47 PM, Hanna Wallach ***@***.***> wrote: 3. I'd definitely sample the alphas. It's even less easy to figure out good alpha values for the minimal path assumption. On Fri, Feb 23, 2018 at 3:47 PM, Hanna Wallach ***@***.***> wrote: > (I started this before your most recent email.) > > Replying quickly from my computer. > > 1. I'm not convinced that there's not a bug. > > 2. You can think of each alpha as a pseudocount. If the count that you're > adding an alpha to is N_d -- i.e., a document length -- then this tells you > something about the value of the alpha that you want. In other words, you > likely want something that is not substantially larger or substantially > smaller than N_d. Ditto if the count that you're adding an alpha to is N_c > -- here, you want something that's roughly comparable to the number of > tokens associated with a cluster. And ditto for N_. at the top level. Since > N_. is the total number of tokens in the corpus, it will need to be > muuuuuuuch larger than the kind of alpha values that are suitable for the > N_d level (which is the level we're usually working at with LDA). > > On Fri, Feb 23, 2018 at 2:30 PM, Hanna Wallach ***@***.***> > wrote: > >> I'm also still not convinced that there isn't a bug somewhere. >> >> On Feb 23, 2018 2:29 PM, "Bomin Kim" ***@***.***> wrote: >> >>> Totally agree with you on "getting stuck in a shitty rich get richer >>> scenario". >>> The reason 2 level worked fine is it always lowers richer probability >>> and increases poorer ones. >>> Similarly when we use huge alpha0, it gets closer to 2 level so it >>> passed. >>> >>> I so far varied alphas from 5 to 50 (I thought this should pass no >>> matter what alphas are) with different combinations, but now I realize that >>> was not big enough. >>> >>> — >>> You are receiving this because you commented. >>> Reply to this email directly, view it on GitHub >>> <#1 (comment)>, >>> or mute the thread >>> <https://github.com/notifications/unsubscribe-auth/AA1T7qFTtd0Mt1brqCfQ0j_dBVQ2h4Oqks5tXxF9gaJpZM4SK117> >>> . >>> >> > > > -- > hanna wallach > http://dirichlet.net/ > -- hanna wallach http://dirichlet.net/

-- hanna wallach http://dirichlet.net/

bomin8319 · 2018-02-23T20:58:06Z

Thanks for suggestions! #2 definitely explains why it worked out with (alpha, alpha1, alpha0) = (5, 50, 100). I will check out the bugs first and (no matter there is a bug or not) work on sampling of the alpha since we are anyway going to use the minimal path assumption.

aschein · 2018-02-25T20:15:48Z

Have you implemented Schein testing? This test will fail if there is a software bug but *not* if a correctly implemented sampler is failing to mix. It also takes many fewer samples to detect a bug.

…

Sent from my iPhone

On Feb 23, 2018, at 3:58 PM, Bomin Kim ***@***.***> wrote: Thanks for suggestions! #2 definitely explains why it worked out with (alpha, alpha1, alpha0) = (5, 50, 100). I will check out the bugs first and (no matter there is a bug or not) work on sampling of the alpha since we are anyway going to use the minimal path assumption. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

hannawallach · 2018-02-25T20:26:19Z

+1

…

On Feb 25, 2018 3:15 PM, "Aaron Schein" ***@***.***> wrote: Have you implemented Schein testing? This test will fail if there is a software bug but *not* if a correctly implemented sampler is failing to mix. It also takes many fewer samples to detect a bug. Sent from my iPhone > On Feb 23, 2018, at 3:58 PM, Bomin Kim ***@***.***> wrote: > > Thanks for suggestions! #2 definitely explains why it worked out with (alpha, alpha1, alpha0) = (5, 50, 100). I will check out the bugs first and (no matter there is a bug or not) work on sampling of the alpha since we are anyway going to use the minimal path assumption. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub, or mute the thread. > — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7qBqNA6FUAGI2iZh0W7_XS0a49iSks5tYb90gaJpZM4SK117> .

bomin8319 · 2018-02-25T20:31:57Z

Yes, what I have been working on so far is actually Schein testing (to prevent from mixing issue). Schein test passes with small number of outer iterations (thus only few steps away from true values), but it fails as I increase the size of outer iterations.

hannawallach · 2018-02-25T20:33:02Z

Okay. Sounds like there is a bug somewhere. When you clamp various parts of the model, which bita pass/fail?

…

On Feb 25, 2018 3:31 PM, "Bomin Kim" ***@***.***> wrote: Yes, what I have been working on so far is actually Schein testing (to prevent from mixing issue). Schein test passes with small number of outer iterations (thus only few steps away from true values), but it fails as I increase the size of outer iterations. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA1T7lThwxOBEnqH6QsOwcBsm4WXkWz0ks5tYcM9gaJpZM4SK117> .

bomin8319 · 2018-02-26T17:44:00Z

It was only topic distribution {N_k}_{k=1}^K that failed Schein test when I clamped the rest of parts of the IPTM.
So, I wrote a separate code for cluster LDA (without other IPTM variables) which iterates the generative process and inference for Z's only, assuming the cluster assignments are known. What I have illustrated so far (richer-get-richer issue) was based on this only-cluster LDA version of Schein test. Now I am trying to find a bug in this simpler version which may fix the same issue in the IPTM.

bomin8319 · 2018-03-06T17:28:24Z

Sorry for being late, but I attached the derivation of cluster LDA.
Section 2 results in the IPTM's current sampling equation of z_dn (I followed Hanna's paper ---Rethinking LDA: Why Priors Matter), but I put NOTE for one part that I am not sure about.
Section 3 is the alternative approach to directly integrate out the base measures which ended up something similar to hierarchical Dirichlet process, although I failed to finish the derivation...
I will try to work on this once more before our meeting, and we can go over this together tomorrow.
clusterLDA.pdf

GiR issue with new version of IPTM #1

GiR issue with new version of IPTM #1

Comments

bomin8319 commented Feb 19, 2018

bdesmarais commented Feb 19, 2018

aschein commented Feb 19, 2018

bomin8319 commented Feb 19, 2018

hannawallach commented Feb 19, 2018 via email

bomin8319 commented Feb 19, 2018

hannawallach commented Feb 20, 2018 via email

bomin8319 commented Feb 23, 2018 • edited Loading

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018 • edited Loading

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

hannawallach commented Feb 23, 2018 via email

bomin8319 commented Feb 23, 2018

aschein commented Feb 25, 2018 via email

hannawallach commented Feb 25, 2018 via email

bomin8319 commented Feb 25, 2018

hannawallach commented Feb 25, 2018 via email

bomin8319 commented Feb 26, 2018

bomin8319 commented Mar 6, 2018

bomin8319 commented Feb 23, 2018 •

edited

Loading

bomin8319 commented Feb 23, 2018 •

edited

Loading