Assessing UMAP embedding quality and sweeping across the n_neighbors parameter #1

Open
ohickl opened this issue Mar 18, 2021 · 3 comments

@ohickl

ohickl commented Mar 18, 2021

Hi,
I am very excited to try to assess the quality of my embeddings using EMBEDR. I am unsure, though, how to set the perplexity value while doing the n_neighbors parameter sweep for UMAP. Should I set the EMBEDR perplexity to always equal n_neighbors?

Best
Oskar

@ejohnson643
Owner

Hi Oskar!

If you want to run the code using UMAP, it will ignore the perplexity parameter. I will make this clear in the documentation! Thanks for the question!

@ohickl
Author

ohickl commented Mar 23, 2021

Thanks! I was a bit confused by this:

perplexity: float
        Similar to the perplexity parameter from van der Maaten (2008); sets 
        the scale of the affinity kernel used to measure embedding quality.  
        NOTE: In the EMBEDR algorithm, this parameter is used EVEN WHEN NOT 
        USING t-SNE!  Default is 30

in the EMBEDR class.

@ejohnson643
Owner

Oh, of course! The perplexity parameter does double duty: it is involved both in how the embedding quality is assessed and in running t-SNE. That is, currently, the quality of an embedding is calculated as the similarity of two data-affinity matrices, one from the original data space and one from the embedded space. The high-dimensional affinity matrix depends on a perplexity parameter, perp_aff, which needs to be set somehow.
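
To make that concrete, here is a rough sketch of the idea (my own illustration, not EMBEDR's actual scoring code): each point is scored by comparing its row of the high-dimensional affinity matrix, built at perp_aff, against its row of an affinity matrix built in the embedded space.

import numpy as np

def pointwise_quality(P_high, P_low, eps=1e-12):
    """Illustrative per-point quality score: the KL divergence between a
    point's high-dimensional affinities (P_high, built at perplexity
    perp_aff) and its embedding-space affinities (P_low).  Lower values
    mean the neighborhood selected by perp_aff is better preserved.
    This is only a sketch of the idea above, not the EMBEDR code."""
    P_high = P_high / P_high.sum(axis=1, keepdims=True)
    P_low = P_low / P_low.sum(axis=1, keepdims=True)
    return np.sum(P_high * np.log((P_high + eps) / (P_low + eps)), axis=1)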

If you use the same value for perp_aff throughout a sweep of the UMAP n_neighbors parameter, you are examining the quality with which neighborhoods of a size set by perp_aff are embedded by UMAP as UMAP is allowed to use more or fewer neighbors to actually carry out the embedding. This is akin to fixing the resolution of your "quality ruler" and then examining the different conditions created by UMAP. I don't think there will be anything wrong with this.
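
In code, that fixed-"ruler" sweep could look roughly like this (the n_neighbors values are placeholders; the constructor arguments mirror the full example further down):

from embedr import EMBEDR
import matplotlib.pyplot as plt
import numpy as np

X = np.loadtxt("./Data/mnist2500_X.txt")

perp_aff = 30  ## Fixed "quality ruler" used for every embedding.
sweep_values = [5, 15, 50, 100]  ## Placeholder n_neighbors values.

fig, axes = plt.subplots(1, len(sweep_values), figsize=(5 * len(sweep_values), 5))

for ax, n_neighbors in zip(axes, sweep_values):
    embed_obj = EMBEDR(perplexity=perp_aff,
                       dimred_alg="UMAP",
                       dimred_params={'n_neighbors': n_neighbors},
                       project_name=f"umap_sweep_nn{n_neighbors}")
    embed_obj.fit(X)
    embed_obj.plot(ax=ax, show_cbar=False)
    ax.set_title(f"n_neighbors = {n_neighbors}")

fig.tight_layout()
plt.show()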

Alternately, you can change perp_aff to correspond to the neighborhood size that t-SNE/UMAP is operating at. This is easy to do with t-SNE because perp_aff can be set to be the same as the canonical perplexity. However, to do this with UMAP, we need to map perp_aff to some sort of k_effective number of nearest neighbors. I am currently working on implementing this.
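
Until that k_effective mapping is implemented, a crude stand-in (my assumption, not something EMBEDR does) is the rule of thumb used by several t-SNE implementations of roughly three nearest neighbors per unit of perplexity:

def k_to_perp(n_neighbors):
    ## Heuristic only (not part of EMBEDR): several t-SNE implementations
    ## build perplexity-based affinities from ~3 * perplexity nearest
    ## neighbors, so dividing by 3 gives a rough perp_aff to pair with a
    ## given UMAP n_neighbors value during a sweep.
    return max(2, int(round(n_neighbors / 3)))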

However, if you're concerned after you've run your sweep that you've chosen the wrong perp_aff for some reason, you don't have to re-run everything, but you will have to hack the methods a bit. What you can do is something like the following:

from embedr import EMBEDR
import matplotlib.pyplot as plt
import numpy as np
from openTSNE.affinity import PerplexityBasedNN
import utility as utl

X = np.loadtxt("./Data/mnist2500_X.txt")

old_perp = 30
new_perp = 100

## Example value: set this to the n_neighbors used for the UMAP embedding
## you want to re-assess.
n_neighbors = 15

n_jobs = -1
seed = 1
verbose = 5

n_data_embed = 1
n_null_embed = 2

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))

## Initialize and fit the data like normal
UMAP_embed = EMBEDR(perplexity=old_perp,
                    dimred_params={'n_neighbors': n_neighbors},
                    # cache_results=False,  ## Turn off file caching.
                    dimred_alg="UMAP",
                    n_jobs=n_jobs,
                    random_state=seed,
                    verbose=verbose,
                    n_data_embed=n_data_embed,
                    n_null_embed=n_null_embed,
                    project_name='changing_perplexity_test')
UMAP_embed.fit(X)

## Let's see the results!
UMAP_embed.plot(ax=ax1, show_cbar=False)

## Calculate a new affinity matrix at the new perplexity
new_aff_mat = PerplexityBasedNN(X,
                                perplexity=new_perp,
                                n_jobs=n_jobs,
                                random_state=seed,
                                verbose=verbose)

## Calculate null affinity matrices at the new perplexity
new_null_mat = {}
for nNo in range(n_null_embed):

    null_X = utl.generate_nulls(X, seed=seed + nNo).squeeze()
    nP = PerplexityBasedNN(null_X,
                           perplexity=new_perp,
                           n_jobs=n_jobs,
                           random_state=seed,
                           verbose=verbose)

    new_null_mat[nNo] = nP

## Reset the affinity matrices in the method
UMAP_embed._affmat = new_aff_mat
UMAP_embed._null_affmat = new_null_mat

## Recalculate the p-Values and quality scores.
UMAP_embed.do_cache = False  ## Need to turn off file caching to force the
                             ## method to recalculate.
UMAP_embed._calc_EES()

## Let's see the results!
UMAP_embed.plot(ax=ax2)

ax1.set_title(f"Affinity Perplexity = {old_perp}")
ax2.set_title(f"Affinity Perplexity = {new_perp}")
ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax2.set_xticklabels([])
ax2.set_yticklabels([])

fig.tight_layout()

plt.show()

I'm going to leave this whole thing open as something to prioritize in the next version, because this should be easier! Also, this really underscores how these parameters should be separated semantically in the code. In my reply above I invented the name perp_aff, but I'll make it an explicit, clearly named parameter in the code!

TL;DR: You can probably leave perplexity fixed, but future versions of the method will automatically update it depending on the dimensionality reduction algorithm (DRA).

ejohnson643 reopened this Mar 23, 2021
ejohnson643 self-assigned this Mar 23, 2021
ejohnson643 added the documentation and enhancement labels Mar 23, 2021