
GiR issue with new version of IPTM #1

Open
bomin8319 opened this issue Feb 19, 2018 · 36 comments

@bomin8319
Collaborator

I have spent a very long time trying to track down possible bugs, but have failed.
I was able to re-derive exactly the same sampling equation as cluster LDA, so I do not see any mathematical error.
The IP assignments and topic assignments from backward sampling always have larger variance than those from forward sampling, and the difference grows as we run more outer iterations.

To test this in the simplest setting of cluster LDA (no variables other than c_d and z), I ran the 'clusterLDA.R' code in the GiR2 folder and got the same odd results. Is there anything I am totally missing?
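
For concreteness, here is roughly the kind of comparison I am running, as a minimal sketch; forward_sample(), backward_sample(), and sum_stat() are placeholders, not the actual functions in the GiR2 code.

    S <- 5000
    stat_fwd <- replicate(S, sum_stat(forward_sample()))    # forward (marginal) draws
    stat_bwd <- replicate(S, sum_stat(backward_sample()))   # backward (successive-conditional) draws
    c(mean(stat_fwd), mean(stat_bwd))   # should agree if the sampler is correct
    c(var(stat_fwd), var(stat_bwd))     # this is where the backward variance comes out larger
    qqplot(stat_fwd, stat_bwd)          # points should fall on the diagonal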

@bdesmarais
Collaborator

Could this be caused by label-switching in the backwards sampling? In the forward sampling, we are generating topics from topic distribution k for documents in cluster k, and there is no ambiguity regarding which cluster is topic k in the forward sampling, right? However, in inference, and thus backwards sampling, it seems like label-switching may represent an additional form of variation. Is it possible to artificially introduce label switching into the forward sampling to see if this could produce additional variance in these GiR measures?
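
Something like the following (hypothetical names, just a sketch) might be enough to inject label switching into the forward samples: relabel the topics in each forward draw with a random permutation before computing the GiR statistics.

    # z is a vector of topic assignments in 1..K for one forward draw
    permute_labels <- function(z, K) {
      perm <- sample(K)   # random permutation of the labels 1..K
      perm[z]             # apply the relabelling to every token
    }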

@aschein
Collaborator

aschein commented Feb 19, 2018

Perhaps try calculating the statistics based on the topic-type counts (e.g., N_kv) rather than the assignments (e.g., z_i). The statistics (e.g., variance) of the counts N_kv should be immune to label-switching.
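
As a sketch (with z and w as hypothetical token-level vectors of topic assignments in 1..K and word types in 1..V), the count matrix can be built directly from the assignments, and any statistic of N_kv is then unaffected by relabelling the topics.

    N_kv <- matrix(0L, nrow = K, ncol = V)
    for (i in seq_along(z)) {
      N_kv[z[i], w[i]] <- N_kv[z[i], w[i]] + 1L
    }
    # equivalently: table(factor(z, levels = 1:K), factor(w, levels = 1:V))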

@bomin8319
Collaborator Author

The topic-type counts (N_kv) look fine in terms of passing GiR. However, the current inference (and thus the backward samples) converges to one interaction pattern and one (or two) topics across the entire corpus in the long run. For example, if the topic distribution for K = 4 is (0.25, 0.25, 0.25, 0.25) in forward sampling, the inferred topic distribution in backward sampling is (0.8, 0.2, 0, 0). I think this is still problematic if we end up with very few topics remaining in the real data analysis.

I also don't quite understand how backward sampling introduces a label-switching issue when we start the inference from initial values set to the true topic distribution.

@hannawallach
Collaborator

hannawallach commented Feb 19, 2018 via email

@bomin8319
Collaborator Author

You can look at paper/icml2018_style/IPTM_ICML2.pdf.
The generative process is in Section 2.2, and the inference equation is Equation (15) on page 4.
Although the draft is currently written for the minimal path assumption, I am failing for both the minimal and maximal path assumptions. (Since GiR uses the same number of words across all documents, the maximal path assumption should be fine and should pass.)

@hannawallach
Collaborator

hannawallach commented Feb 20, 2018 via email

@bomin8319
Collaborator Author

bomin8319 commented Feb 23, 2018

Generating from the collapsed LDA equations definitely helped, but there is a remaining issue.

  1. A two-level hierarchy with a uniform base for the interaction pattern-specific topic distributions (i.e., m_c ~ Dir(\alpha1, u) and \theta_d ~ Dir(\alpha, m_{c_d})) now passes GiR for both the maximal and minimal path assumptions, which it did not with the non-collapsed generating process.

  2. However, when I directly follow cluster LDA and use a three-level hierarchy with an additional layer representing the corpus-wide topic distribution (i.e., m ~ Dir(\alpha0, u), m_c ~ Dir(\alpha1, m), and \theta_d ~ Dir(\alpha, m_{c_d})), it still fails GiR: the backward samplers concentrate on a few dominant topics. I used Equation (15) in /paper/icml2018_style/IPTM_ICML2.pdf for both the generative process and the inference, and nothing (neither the equation nor the code) seems to be wrong. Maybe I should not generate directly from the fully collapsed equation when I have this additional hierarchy?

Another related question: isn't the overall corpus-wide topic distribution (m in item 2 above) already controlled by the distribution of interaction pattern assignments (clusters) across the documents? If that is the case, it may not be necessary to use a three-level hierarchy instead of two. In other words, m would be the weighted average of m_c across c = 1, ..., C, so assuming a uniform base for each m_c should be fine...?
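
To be concrete about what I mean by the two- and three-level versions, here is the prior-side factor of the conditional for z_dn as I understand it, written as R-style pseudocode under the maximal path assumption. This is a paraphrase rather than a verbatim copy of Equation (15); the word-likelihood factor is omitted and the count arrays (N_dk, N_kc, N_c, N_k, N) are assumed to exclude the current token.

    # Two-level: uniform base u for the IP-specific distribution m_c.
    prior_two_level <- function(k, d, c) {
      N_dk[d, k] + alpha * (N_kc[k, c] + alpha1 / K) / (N_c[c] + alpha1)
    }

    # Three-level: m_c has base m ~ Dir(\alpha0, u); collapsing m contributes
    # the extra fraction (N_k[k] + alpha0/K) / (N + alpha0) in place of 1/K.
    prior_three_level <- function(k, d, c) {
      base_k <- (N_k[k] + alpha0 / K) / (N + alpha0)
      N_dk[d, k] + alpha * (N_kc[k, c] + alpha1 * base_k) / (N_c[c] + alpha1)
    }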

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

bomin8319 commented Feb 23, 2018

IPTM_ICML2.pdf

Just to be clear, here is how I generate z's.

  • initialize N_dk = 0, N_kc = 0, and N_k = 0
  • generate topics as below
    for (d in 1:D) {
      for (n in 1:N_d) {
        z_dn ~ Equation (15)
        N_{d z_dn}   += 1
        N_{z_dn c_d} += 1
        N_{z_dn}     += 1
      }
    }

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

Yes. Just for the maximal!

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

Yes, both maximal and minimal fail.
After generating z's as above, I infer them as below:

for (iter in 1:Niter) {
  for (d in 1:D) {
    for (n in 1:N_d) {
      N_{d z_dn}   -= 1
      N_{z_dn c_d} -= 1
      N_{z_dn}     -= 1
      z_dn ~ Equation (15)  # new topic assignment
      N_{d z_dn}   += 1
      N_{z_dn c_d} += 1
      N_{z_dn}     += 1
    }
  }
}

and then compare N_k from the forward and backward samples using GiR plots, which look like the attached file below.
(The document-IP and token-word distribution plots completely pass when I turn off the inference for the z's.)
GiRplot.pdf

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

Yes, I use the same code for the two-level and three-level hierarchies.
The only difference is that, for the two-level version, the fraction (N_k + alpha0/K) / (N + alpha0) is replaced by 1/K.

I just tried setting alpha0 larger, and the test apparently gets closer to passing.
With alpha0 = 1000, it passed.
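
Just to illustrate why that helps (toy counts, not output from the sampler): the top-level fraction flattens toward the uniform base 1/K as alpha0 grows, so a huge alpha0 makes the three-level sampler behave almost like the two-level one.

    K <- 4
    N_k <- c(120, 40, 30, 10)
    N <- sum(N_k)
    top_frac <- function(alpha0) (N_k + alpha0 / K) / (N + alpha0)
    round(top_frac(5), 3)      # close to the empirical proportions N_k / N
    round(top_frac(1000), 3)   # close to 1/K = 0.25 for every topic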

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

No, all of the alphas are treated as fixed hyperparameters.
Should we add sampling steps for the alphas, then?

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

Totally agree with you about "getting stuck in a shitty rich-get-richer scenario".
The reason the two-level version worked fine is that it always lowers the probability of the richer topics and raises the probability of the poorer ones.
Similarly, when we use a huge alpha0, the three-level version gets close to the two-level one, so it passed.

So far I have varied the alphas from 5 to 50 in different combinations (I thought GiR should pass no matter what the alphas are), but now I realize that was not big enough.

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

It seems like alpha1 (the contribution of the counts at the middle level) also needs to be bigger.
Both the maximal and minimal path assumptions pass with (alpha, alpha1, alpha0) = (5, 50, 100), so we may need to sample the alphas to estimate how much weight each level should get.
Before I work on that, I will ask Bruce to double-check my R code in the hope of finding any bugs.
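
If we do add alpha sampling, one simple option would be a random-walk Metropolis step on the log scale; this is only a sketch, assuming a hypothetical log_joint(alpha0) that returns the collapsed log-probability of the current counts plus the log-prior on alpha0 (slice sampling would work just as well). Note that any alpha we sample would also need a prior on the forward side so that GiR compares the same joint distribution.

    sample_alpha0 <- function(alpha0, log_joint, step = 0.1) {
      prop <- exp(log(alpha0) + rnorm(1, 0, step))       # propose on the log scale
      log_ratio <- log_joint(prop) - log_joint(alpha0) +
        log(prop) - log(alpha0)                          # Jacobian of the log transform
      if (log(runif(1)) < log_ratio) prop else alpha0
    }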

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 23, 2018 via email

@bomin8319
Collaborator Author

Thanks for the suggestions! #2 definitely explains why it worked out with (alpha, alpha1, alpha0) = (5, 50, 100). I will check for bugs first and then (whether or not there is a bug) work on sampling the alphas, since we are going to use the minimal path assumption anyway.

@aschein
Collaborator

aschein commented Feb 25, 2018 via email

@hannawallach
Collaborator

hannawallach commented Feb 25, 2018 via email

@bomin8319
Collaborator Author

Yes, what I have been working on so far is actually the Schein test (to avoid mixing issues). The Schein test passes with a small number of outer iterations (and thus only a few steps away from the true values), but it fails as I increase the number of outer iterations.
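
To be clear about the loop structure (function names are placeholders, not the actual code): each replicate starts from an independent forward draw and then alternates re-generating the data with Gibbs sweeps over the z's for n_outer outer iterations before the statistics are recorded.

    run_replicate <- function(n_outer) {
      state <- forward_sample()             # true z's, c's, and words
      for (t in 1:n_outer) {
        state$w <- generate_words(state)    # re-draw the words given the latents
        state$z <- gibbs_sweep_z(state)     # one sweep of the z updates
      }
      summary_stats(state)                  # e.g., N_k
    }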

@hannawallach
Collaborator

hannawallach commented Feb 25, 2018 via email

@bomin8319
Collaborator Author

It was only the topic distribution {N_k}_{k=1}^K that failed the Schein test when I clamped the rest of the IPTM.
So I wrote separate code for cluster LDA (without the other IPTM variables) that iterates the generative process and the inference for the z's only, assuming the cluster assignments are known. What I have illustrated so far (the rich-get-richer issue) was based on this cluster-LDA-only version of the Schein test. Now I am trying to find a bug in this simpler version, which may also fix the same issue in the IPTM.

@bomin8319
Collaborator Author

Sorry for being late, but I have attached the derivation for cluster LDA.
Section 2 results in the IPTM's current sampling equation for z_dn (I followed Hanna's paper, "Rethinking LDA: Why Priors Matter"), but I put a NOTE on one part that I am not sure about.
Section 3 is an alternative approach that directly integrates out the base measures, which ended up looking like a hierarchical Dirichlet process, although I failed to finish the derivation...
I will try to work on this once more before our meeting, and we can go over it together tomorrow.
clusterLDA.pdf
