Assessing UMAP embedding quality and sweeping across the n_neighbors parameter #1

Open
ohickl opened this issue Mar 18, 2021 · 3 comments

@ohickl

ohickl commented Mar 18, 2021

Hi,
I am very excited to try to assess the quality of my embeddings using EMBEDR. I am unsure, though, how to set the perplexity value while doing the n_neighbors parameter sweep for UMAP. Should I set the EMBEDR perplexity to always equal n_neighbors?

Best
Oskar

@ejohnson643
Owner

Hi Oskar!

If you want to run the code using UMAP, it will ignore the perplexity parameter. I will make this clear in the documentation! Thanks for the question!

@ohickl
Author

ohickl commented Mar 23, 2021

Thanks! I was a bit confused by this:

perplexity: float
        Similar to the perplexity parameter from van der Maaten (2008); sets 
        the scale of the affinity kernel used to measure embedding quality.  
        NOTE: In the EMBEDR algorithm, this parameter is used EVEN WHEN NOT 
        USING t-SNE!  Default is 30

in the EMBEDR class.

@ejohnson643
Owner

Oh, of course! The perplexity parameter does double duty: it is involved both in how the embedding quality is assessed and in running t-SNE. That is, currently, the quality of an embedding is calculated as the similarity of two data-affinity matrices, one from the original data space and one from the embedded space. The high-dimensional affinity matrix depends on a perplexity parameter, perp_aff, which needs to be set somehow.
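
To make that concrete, here is a rough sketch of the idea (my own illustration, not EMBEDR's actual scoring code): each point is scored by comparing its row of the high-dimensional affinity matrix, built at perp_aff, against its row of an affinity matrix built in the embedded space.

import numpy as np

def pointwise_quality(P_high, P_low, eps=1e-12):
    """Illustrative per-point quality score: the KL divergence between a
    point's high-dimensional affinities (P_high, built at perplexity
    perp_aff) and its embedding-space affinities (P_low).  Lower values
    mean the neighborhood selected by perp_aff is better preserved.
    This is only a sketch of the idea above, not the EMBEDR code."""
    P_high = P_high / P_high.sum(axis=1, keepdims=True)
    P_low = P_low / P_low.sum(axis=1, keepdims=True)
    return np.sum(P_high * np.log((P_high + eps) / (P_low + eps)), axis=1)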

If you use the same value for perp_aff throughout a sweep of the UMAP n_neighbors parameter, you are examining the quality with which neighborhoods of a size set by perp_aff are embedded by UMAP as UMAP is allowed to use more or fewer neighbors to actually carry out the embedding. This is akin to fixing the resolution of your "quality ruler" and then examining the different conditions created by UMAP. I don't think there will be anything wrong with this.
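
In code, that fixed-"ruler" sweep could look roughly like this (the n_neighbors values are placeholders; the constructor arguments mirror the full example further down):

from embedr import EMBEDR
import matplotlib.pyplot as plt
import numpy as np

X = np.loadtxt("./Data/mnist2500_X.txt")

perp_aff = 30  ## Fixed "quality ruler" used for every embedding.
sweep_values = [5, 15, 50, 100]  ## Placeholder n_neighbors values.

fig, axes = plt.subplots(1, len(sweep_values), figsize=(5 * len(sweep_values), 5))

for ax, n_neighbors in zip(axes, sweep_values):
    embed_obj = EMBEDR(perplexity=perp_aff,
                       dimred_alg="UMAP",
                       dimred_params={'n_neighbors': n_neighbors},
                       project_name=f"umap_sweep_nn{n_neighbors}")
    embed_obj.fit(X)
    embed_obj.plot(ax=ax, show_cbar=False)
    ax.set_title(f"n_neighbors = {n_neighbors}")

fig.tight_layout()
plt.show()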

Alternately, you can change perp_aff to correspond to the neighborhood size that t-SNE/UMAP is operating at. This is easy to do with t-SNE because perp_aff can be set to be the same as the canonical perplexity. However, to do this with UMAP, we need to map perp_aff to some sort of k_effective number of nearest neighbors. I am currently working on implementing this.
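
Until that k_effective mapping is implemented, a crude stand-in (my assumption, not something EMBEDR does) is the rule of thumb used by several t-SNE implementations of roughly three nearest neighbors per unit of perplexity:

def k_to_perp(n_neighbors):
    ## Heuristic only (not part of EMBEDR): several t-SNE implementations
    ## build perplexity-based affinities from ~3 * perplexity nearest
    ## neighbors, so dividing by 3 gives a rough perp_aff to pair with a
    ## given UMAP n_neighbors value during a sweep.
    return max(2, int(round(n_neighbors / 3)))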

However, if you're concerned after you've run your sweep that you've chosen the wrong perp_aff for some reason, you don't have to re-run everything, but you will have to hack the methods a bit. What you can do is something like the following:

from embedr import EMBEDR
import matplotlib.pyplot as plt
import numpy as np
from openTSNE.affinity import PerplexityBasedNN
import utility as utl

X = np.loadtxt("./Data/mnist2500_X.txt")

old_perp = 30
new_perp = 100

## Example value: set this to the n_neighbors used for the UMAP embedding
## you want to re-assess.
n_neighbors = 15

n_jobs = -1
seed = 1
verbose = 5

n_data_embed = 1
n_null_embed = 2

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))

## Initialize and fit the data like normal
UMAP_embed = EMBEDR(perplexity=old_perp,
                    dimred_params={'n_neighbors': n_neighbors},
                    # cache_results=False,  ## Turn off file caching.
                    dimred_alg="UMAP",
                    n_jobs=n_jobs,
                    random_state=seed,
                    verbose=verbose,
                    n_data_embed=n_data_embed,
                    n_null_embed=n_null_embed,
                    project_name='changing_perplexity_test')
UMAP_embed.fit(X)

## Let's see the results!
UMAP_embed.plot(ax=ax1, show_cbar=False)

## Calculate a new affinity matrix at the new perplexity
new_aff_mat = PerplexityBasedNN(X,
                                perplexity=new_perp,
                                n_jobs=n_jobs,
                                random_state=seed,
                                verbose=verbose)

## Calculate null affinity matrices at the new perplexity
new_null_mat = {}
for nNo in range(n_null_embed):

    null_X = utl.generate_nulls(X, seed=seed + nNo).squeeze()
    nP = PerplexityBasedNN(null_X,
                           perplexity=new_perp,
                           n_jobs=n_jobs,
                           random_state=seed,
                           verbose=verbose)

    new_null_mat[nNo] = nP

## Reset the affinity matrices in the method
UMAP_embed._affmat = new_aff_mat
UMAP_embed._null_affmat = new_null_mat

## Recalculate the p-Values and quality scores.
UMAP_embed.do_cache = False  ## Need to turn off file caching to force the
                             ## method to recalculate.
UMAP_embed._calc_EES()

## Let's see the results!
UMAP_embed.plot(ax=ax2)

ax1.set_title(f"Affinity Perplexity = {old_perp}")
ax2.set_title(f"Affinity Perplexity = {new_perp}")
ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax2.set_xticklabels([])
ax2.set_yticklabels([])

fig.tight_layout()

plt.show()

I'm going to leave this whole thing open as something to prioritize in the next version, because this should be easier! Also, this really underscores how these parameters should be separated semantically in the code. In my reply above I invented the name perp_aff, but I'll make it an explicit, clearly named parameter in the code!

TL;DR: You can probably leave perplexity fixed, but future versions of the method will automatically update it depending on the dimensionality reduction algorithm (DRA).

ejohnson643 reopened this Mar 23, 2021
ejohnson643 self-assigned this Mar 23, 2021
ejohnson643 added the documentation and enhancement labels Mar 23, 2021