
NaNs when following Re-training Parametric UMAP with landmarks tutorial #1180

Open · EHenryPega opened this issue Jan 24, 2025 · 4 comments

EHenryPega commented Jan 24, 2025

Hey umap team!

Firstly, a big thanks for all the work on this library, it is incredibly useful! The ability to retrain a ParametricUMAP whilst preserving the mapping for embeddings that have already been processed would be incredible.

I tried this out for my own use case, using the example here on umap-learn as a reference. However, when it comes to the retraining phase, the reported loss for every epoch is NaN.

I assumed this was an issue with my own setup, so I copied the example verbatim. Unfortunately I get the exact same outcome. The model does not retrain successfully.

p_embedder.fit(x2_lmk, landmark_positions=landmarks)
Epoch 1/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 21s 5ms/step - loss: nan
Epoch 2/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - loss: nan
Epoch 3/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - loss: nan
Epoch 4/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - loss: nan
Epoch 5/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - loss: nan
Epoch 6/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - loss: nan
Epoch 7/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - loss: nan
Epoch 8/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - loss: nan
Epoch 9/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - loss: nan
Epoch 10/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - loss: nan
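
For reference, the setup behind that fit call follows the tutorial's pattern, roughly as sketched below. This is my reconstruction from the docs, with x1 and x2 standing in for the two data batches; the exact helper code in the current docs/notebook may differ.

import numpy as np
from umap.parametric_umap import ParametricUMAP

# Rough sketch of the tutorial's retraining pattern (x1/x2 are placeholder
# batches; the exact construction in the current docs/notebook may differ).
p_embedder = ParametricUMAP()
p_embedder.fit(x1)                                # initial fit on batch 1
prev_embedding = p_embedder.transform(x1)         # positions we want to keep

x2_lmk = np.concatenate([x1, x2])                 # combined data for retraining
landmarks = np.full((len(x2_lmk), 2), np.nan, dtype=np.float32)
landmarks[: len(x1)] = prev_embedding             # NaN rows are free to move

p_embedder.fit(x2_lmk, landmark_positions=landmarks)   # the call shown in the log above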

I suspect there has either been some kind of regression, or the library has received updates that are not reflected in the example.

Any help or suggestions would be greatly appreciated. Cheers!

timsainb (Collaborator) commented Feb 1, 2025

This is related to #1153. Maybe @jacobgolding knows what is going on; I don't have much experience with landmarks yet.

jacobgolding (Contributor) commented

Hello!
I think this might be related to the changes in #1156; it looks like the documentation hasn't been updated to reflect the new helper functions for adding landmarks.
I've set aside some time in the next couple of days to make sure this is the issue, and to remedy it. In the meantime, give the notebook a try instead of the code in the docs:
https://github.com/lmcinnes/umap/blob/a012b9d8751d98b94935ca21f278a54b3c3e1b7f/notebooks/MNIST_Landmarks.ipynb

EHenryPega (Author) commented

Thanks for the reply. I did notice that there were some nice new helper functions in that notebook which make life a lot simpler!

Unfortunately, I still ran into the same issues when using these.

I have been able to run the notebooks successfully on a remote machine. As far as I can tell, the issue is related to my laptop using an M3 chip. I have tried many different TensorFlow builds, from vanilla to those suggested here: https://github.com/ChaitanyaK77/Initializing-TensorFlow-Environment-on-M3-M3-Pro-and-M3-Max-Macbook-Pros.

Unfortunately, I always end up with NaN losses and a broken model when fitting with landmarks.
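
In case it helps others reproduce this, one diagnostic would be to force CPU execution and enable TensorFlow's numeric checks before fitting, to see whether the NaNs come from the Metal backend or from the loss itself. This is only a sketch, not something from the tutorial:

import tensorflow as tf

# Diagnostic sketch (not from the tutorial): hide the Metal GPU so fitting
# runs on CPU, and make TensorFlow raise as soon as a NaN/Inf tensor appears.
tf.config.set_visible_devices([], "GPU")
tf.debugging.enable_check_numerics()

# ...then run p_embedder.fit(x2_lmk, landmark_positions=landmarks) as before.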

jacobgolding (Contributor) commented

After some testing today I mostly just confused myself. I found a couple of things:

  • When I first ran the notebook as-is on the most recent version from my fork, I encountered the same issue as you (on an M2 chip).
  • scikit-learn has updated its check_array function, renaming force_all_finite to ensure_all_finite. This will be a breaking change for UMAP as a whole in 1.8, so there's work to be done to prepare for that (@lmcinnes); a rough compatibility sketch follows this list.
  • Upgrading to scikit-learn 1.6 (the most recent version at the moment) temporarily fixed the NaNs on re-training, but not consistently. I can re-run the same code and get either something that works or something that doesn't.
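
A shim along the following lines is what I mean by preparing for the rename; check_array_compat is my own sketch, not existing UMAP code:

import sklearn
from packaging.version import Version
from sklearn.utils import check_array

# Sketch only: pick whichever keyword the installed scikit-learn understands.
# "allow-nan" matters here because landmark_positions deliberately holds NaNs.
def check_array_compat(X, **kwargs):
    finite_kw = ("ensure_all_finite"
                 if Version(sklearn.__version__) >= Version("1.6")
                 else "force_all_finite")
    kwargs.setdefault(finite_kw, "allow-nan")
    return check_array(X, **kwargs)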

Unfortunately, I won't have much more of a chance to debug this in the near future. The next thing I would try is investigating the default landmark loss function to see what's going on there, perhaps using ops.subtract.
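
For whoever picks this up, below is the sort of NaN-safe masked loss I would test against the default. It assumes the landmark targets use NaN rows to mark non-landmark points; this is a sketch using keras.ops, not the library's actual loss:

from keras import ops

def masked_landmark_loss(landmark_positions, embedding):
    # Sketch of a NaN-safe landmark loss (not the library's actual code):
    # only rows with a defined landmark (no NaNs) contribute to the mean.
    has_landmark = ops.logical_not(ops.any(ops.isnan(landmark_positions), axis=-1))
    # Zero out NaN targets before subtracting so no NaN enters the graph.
    safe_targets = ops.where(ops.isnan(landmark_positions),
                             ops.zeros_like(landmark_positions),
                             landmark_positions)
    sq_err = ops.sum(ops.square(ops.subtract(safe_targets, embedding)), axis=-1)
    sq_err = ops.where(has_landmark, sq_err, ops.zeros_like(sq_err))
    n_landmarks = ops.maximum(ops.sum(ops.cast(has_landmark, sq_err.dtype)), 1.0)
    return ops.sum(sq_err) / n_landmarks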
