
Added Unsupervised Learning Methods (Clustering + Dimensionality Reduction) #240

Summary

This pull request introduces several unsupervised learning methods to the repository, including clustering (Gaussian Mixture Models) and dimensionality reduction (Principal Component Analysis, Autoencoder). The new modules are designed to work in tandem with existing features such as DataMaster.DataProcessor, with the goal of improving the existing CNN workflow in the areas of data exploration and scalability.

New Features

  • UnsupervisedLearning.py
    • Added PCAModel class, with methods to build the model and to encode & decode input data.
    • Added Autoencoder class, with methods to build & train the model and to encode & decode input data. Users can also save & load model weights files.
    • Added GMM class, with methods to build & train the model, calculate expected Y values, and make estimations on new input. Users can also save & load model files.
  • demo.py
    • Created a Streamlit web demo to demonstrate these 3 models (see the run command after this list)
    • Includes functionality to train, export & import new models, along with descriptions & dynamic visuals demonstrating each model's purpose
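
Assuming demo.py sits at the repository root and Streamlit is installed, the demo launches with the standard Streamlit command:

streamlit run demo.py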

Example Usage

PCA

# REQUIRED CONFIG KEYS
config = {
	'n_components': int
}
# 0 < n_components < 1: keeps enough PCs to explain at least that fraction of the variance
# n_components >= 1: keeps exactly n_components PCs (dimensions)

# uses the model's own pca.X_train dataset when the string 'train' is passed in
pca: PCAModel
X_train_encoded = pca.encode_input_data('train')

# an unrelated dataset can also be passed in
X_separate_encoded = pca.encode_input_data(X_separate)

# the decoder only accepts an np array of encoded data
X_train_reconstructed = pca.decode_output_data(X_train_encoded)
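
For completeness, here is a minimal sketch of constructing and building a PCAModel before encoding. The constructor arguments mirror the Autoencoder example in the Initialize & Fit Model section below; whether PCAModel takes exactly these parameters is an assumption on my part:

from TelescopeML.UnsupervisedLearning import PCAModel

# hypothetical: same 6 constructor parameters as the Autoencoder example below
pca = PCAModel(
	X_train = data_processor.X_train_standardized_rowwise,
	X_val   = data_processor.X_val_standardized_rowwise,
	X_test  = data_processor.X_test_standardized_rowwise,
	y_train = data_processor.y_train_standardized_columnwise,
	y_val   = data_processor.y_val_standardized_columnwise,
	y_test  = data_processor.y_test_standardized_columnwise,
)

# keep enough PCs to explain at least 95% of the variance
pca.build_model({'n_components': 0.95})
X_train_encoded = pca.encode_input_data('train')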

Autoencoder

# REQUIRED CONFIG KEYS
# for now, input_dim is hard-coded to 104
config = {
	'output_dim': int,
	'encoder_layers': list[int],
	'decoder_layers': list[int]
}
# encoder_layers & decoder_layers list the number of nodes
# in each in-between layer (exclude the input & output layer sizes)

# uses the model's own autoencoder.X_train dataset when the string 'train' is passed in
autoencoder: Autoencoder
X_train_encoded = autoencoder.encode_input_data('train')

# an unrelated dataset can also be passed in
X_separate_encoded = autoencoder.encode_input_data(X_separate)

# the decoder only accepts an np array of encoded data
X_train_reconstructed = autoencoder.decode_output_data(X_train_encoded)
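
Since the encode/decode round trip is lossy, a quick sanity check on reconstruction quality can be done with plain NumPy (nothing here is specific to this PR):

import numpy as np

# mean squared reconstruction error over the training set
mse = np.mean((autoencoder.X_train - X_train_reconstructed) ** 2)
print(f'reconstruction MSE: {mse:.6f}')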

GMM

# REQUIRED CONFIG KEYS
config = {
	'n_components': int,
	'covariance_type': Literal['full', 'tied', 'spherical', 'diag'],
	'n_init': int,
	'max_iter': int,
	'verbose': int
}
# These serve the same purpose as the corresponding parameters of sklearn.mixture.GaussianMixture

# gmm.train_model already calls calculate_expected_values;
# the results can be accessed via gmm.expected_values (avg. Y value per cluster)
gmm: GMM

# predict requires encoded input, e.g. the output of PCA or the Autoencoder
y_pred = gmm.predict(X_train_encoded)
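
Putting the pieces together, here is a sketch of the full build → train → predict flow. The method names come from the feature list above, but the constructor arguments and the exact train_model signature are assumptions:

# hypothetical constructor arguments; encoded X data from PCA/Autoencoder
gmm = GMM(
	X_train = X_train_encoded,
	X_val   = X_val_encoded,
	X_test  = X_test_encoded,
	y_train = y_train,
	y_val   = y_val,
	y_test  = y_test,
)
gmm.build_model({
	'n_components': 10,
	'covariance_type': 'full',
	'n_init': 3,
	'max_iter': 200,
	'verbose': 0,
})
gmm.train_model()           # also populates gmm.expected_values
y_pred = gmm.predict(X_train_encoded)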

Initialize & Fit Model

To initialize any of the models, you pass the same 6 parameters to its constructor. You then call the class's build_model method with the necessary config dict, followed by train_model. The Autoencoder is shown below; the process is very similar for the other 2 classes.

from TelescopeML.UnsupervisedLearning import *
from TelescopeML.DataMaster import *

import pandas as pd
import numpy as np
import os

...

data_processor: DataProcessor
autoencoder = Autoencoder(
	X_train = data_processor.X_train_standardized_rowwise,
	X_val   = data_processor.X_val_standardized_rowwise,
	X_test  = data_processor.X_test_standardized_rowwise,
	y_train = data_processor.y_train_standardized_columnwise,
	y_val   = data_processor.y_val_standardized_columnwise,
	y_test  = data_processor.y_test_standardized_columnwise,
)
# Passed in when building the model; determines
# the output size & the layers in between
config = {
	'output_dim': 20,
	'encoder_layers': [60, 35],
	'decoder_layers': [35, 60],
	# Exclude final input & output layer for each list
}
autoencoder.build_model(config)

# Train model with given parameters
model_history, model = autoencoder.train_model(epochs=100,
											   batch_size=20,
											   verbose=1)
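
If train_model returns a standard Keras History object (an assumption, though consistent with the (history, model) return pair above), convergence is easy to inspect:

import matplotlib.pyplot as plt

# assumes model_history.history carries the usual 'loss'/'val_loss' keys
plt.plot(model_history.history['loss'], label='train loss')
plt.plot(model_history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('reconstruction loss')
plt.legend()
plt.show()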

Saving Models

The autoencoder is used as the example here; the process works the same with GMM (a sketch of the GMM equivalent follows below).

# model_indicator is the unique identifier for this specific autoencoder
autoencoder.save_from_indicator(model_indicator='95acc-100epoch')

...

# new_autoencoder & autoencoder must have the same architecture
# if so, loads the trained weights from old autoencoder
new_autoencoder = Autoencoder(...)
new_autoencoder.build_model(config)
new_autoencoder.load_from_indicator(model_indicator='95acc-100epoch')
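
The equivalent GMM round trip, assuming the method names really are identical as stated above (the indicator string here is just a placeholder):

# hypothetical mirror of the autoencoder example
gmm.save_from_indicator(model_indicator='gmm-10comp')

...

new_gmm = GMM(...)
new_gmm.build_model(config)
new_gmm.load_from_indicator(model_indicator='gmm-10comp')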

Notes

I would appreciate any feedback on the new clustering & dimensionality reduction methods. In the future I plan to expand upon these features, both by exploring different methods and by fine-tuning the existing ones.
