Lda v1 #311

Open: wants to merge 6 commits into base: lda_output_fix_final

fmcquillan99

User doc updates for LDA and term frequency.

At the same time, I made minor updates to the user docs for PageRank, train-test split, and matrix ops, and combined everything into one PR.

iyerr3 and others added 6 commits February 2, 2018 16:46
This work is based on the original work by
Xiaocheng Tang <[email protected]> in madlib#75.

This PR adds two main features:

- A Minibatch solver that takes as input a batch of data
- SVM code that takes advantage of the minibatch

Closes madlib#229

Co-authored-by: Nikhil Kak <[email protected]>
Co-authored-by: Xiaocheng Tang <[email protected]>
JIRA: MADLIB-1201
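
For readers unfamiliar with the idea, the minibatch approach named in the commit above can be sketched in a few lines. This is only an illustration under stated assumptions (a linear SVM with an L2-regularized hinge loss, plain SGD, in-memory rows); it is not MADlib's solver, and the names `hinge_minibatch_step` and `train_svm` are hypothetical.

```python
import random


def hinge_minibatch_step(w, batch, lr, lam):
    """One SGD step on the L2-regularized hinge loss, averaged over a batch.

    Assumption: each row is (features, label) with label in {+1, -1}.
    """
    grad = [lam * wi for wi in w]  # gradient of (lam / 2) * ||w||^2
    for x, y in batch:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1:  # only margin violators contribute a subgradient
            for j, xj in enumerate(x):
                grad[j] -= y * xj / len(batch)
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def train_svm(rows, batch_size=2, epochs=50, lr=0.1, lam=0.01, seed=0):
    """Shuffle the rows each epoch and update the weights one batch at a time."""
    rng = random.Random(seed)
    rows = list(rows)
    w = [0.0] * len(rows[0][0])
    for _ in range(epochs):
        rng.shuffle(rows)
        for i in range(0, len(rows), batch_size):
            w = hinge_minibatch_step(w, rows[i:i + batch_size], lr, lam)
    return w
```

Each update reads only `batch_size` rows, which is the property that lets a solver take a batch of data as its input instead of one row at a time.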

Fixed a mismatch between the output of lda_train and
lda_get_word_topic_count.

Also added an install check that validates that the output of lda_train and
lda_get_word_topic_count are consistent with each other.
See JIRA for more details and an example.
JIRA: MADLIB-1160
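
The kind of consistency check this commit describes can be sketched as follows. The data shapes are assumptions made for illustration: per-word topic assignments are modeled as `(wordid, topicid)` pairs and the reported word-topic counts as a dict of per-topic count lists. The actual install check is written against the MADlib tables, and `word_topic_counts` and `find_mismatches` are hypothetical names.

```python
def word_topic_counts(assignments, num_topics):
    """Aggregate (wordid, topicid) assignments into per-word topic counts."""
    counts = {}
    for wordid, topicid in assignments:
        counts.setdefault(wordid, [0] * num_topics)[topicid] += 1
    return counts


def find_mismatches(assignments, reported, num_topics):
    """Return the wordids whose recomputed counts differ from the reported ones."""
    expected = word_topic_counts(assignments, num_topics)
    all_words = set(expected) | set(reported)
    return sorted(w for w in all_words if expected.get(w) != reported.get(w))
```

An empty result means the two outputs agree; any wordid in the result points at a row worth inspecting.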

This commit adds a helper function that maps each wordid to the
corresponding topicid assigned to it in the output table. Duplicate rows
are removed from the final result.

Also adds a workaround for GPDB4.3 svec

In GPDB4.3, we cannot call madlib.svec directly on a text
format. Instead, we have to call madlib.svec_from_string to convert the
text. This commit fixes the issue so that the new helper function
madlib.lda_get_word_topic_mapping works on both GPDB5 and GPDB4.
JIRA: MADLIB-1160
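
As a rough picture of what the mapping helper in this commit produces, here is a minimal sketch that assumes the assignments are available as `(wordid, topicid)` pairs. `word_topic_mapping` is a hypothetical stand-in, not the SQL behind madlib.lda_get_word_topic_mapping.

```python
def word_topic_mapping(assignments):
    """Return the unique (wordid, topicid) pairs, sorted for stable output."""
    return sorted(set(assignments))
```

A word assigned to the same topic many times appears once in the result, which matches the "duplicates removed" behavior described in the commit message.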

This commit fixes the topicid inconsistency between madlib.lda_train
and madlib.lda_get_topic_desc, where the former used a 0-based index
and the latter a 1-based index. Both now start at 0.
JIRA: MADLIB-1160

Previously, madlib.lda_get_topic_desc returned only the top k - 1 words in
the result table. This commit fixes it to return the top k.
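
To make the off-by-one concrete, here is a small sketch. The commit message does not say how the bug arose, so the strict comparison on a 1-based rank in `top_k_words_buggy` is only one plausible cause, shown as an assumption. Both function names are hypothetical.

```python
def _ranked(word_probs):
    """Sort words by descending probability, breaking ties by word."""
    return sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))


def top_k_words(word_probs, k):
    """Keep ranks 1..k inclusive (the intended behavior)."""
    ranked = enumerate(_ranked(word_probs), start=1)
    return [word for rank, (word, _) in ranked if rank <= k]


def top_k_words_buggy(word_probs, k):
    """Strict comparison on a 1-based rank drops the k-th word."""
    ranked = enumerate(_ranked(word_probs), start=1)
    return [word for rank, (word, _) in ranked if rank < k]
```

With three words and `k = 2`, the first function returns two words while the second returns only one.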

2 participants