
Added Unsupervised Learning Methods (Clustering + Dimensionality Reduction) #240

Summary

This pull request introduces several unsupervised learning methods to the repository, including clustering (Gaussian Mixture Models) and dimensionality reduction (Principal Component Analysis, Autoencoder). The new modules are designed to work in tandem with existing features such as DataMaster.DataProcessor, with the goal of improving the existing CNN workflow in the areas of data exploration and scalability.

New Features

  • UnsupervisedLearning.py
    • Added PCAModel class, with methods to build the model and to encode & decode input data.
    • Added Autoencoder class, with methods to build & train the model and to encode & decode input data. Users can also save & load model weights files.
    • Added GMM class, with methods to build & train the model, calculate expected Y values, and make estimations on new input. Users can also save & load model files.
  • demo.py
    • Created a Streamlit web demo to demonstrate these 3 models (see the run command after this list)
    • Includes functionality to train, export & import new models, along with descriptions & dynamic visuals demonstrating each model's purpose
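
Assuming demo.py sits at the repository root and Streamlit is installed, the demo launches with the standard Streamlit command:

streamlit run demo.py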

Example Usage

PCA

# REQUIRED CONFIG KEYS
config = {
	'n_components': int
}
# 0 < n_components < 1: keeps enough PCs to explain at least that fraction of the variance
# n_components >= 1: keeps exactly n_components PCs (dimensions)

# uses the model's own pca.X_train dataset when the string 'train' is passed in
pca: PCAModel
X_train_encoded = pca.encode_input_data('train')

# an unrelated dataset can also be passed in
X_separate_encoded = pca.encode_input_data(X_separate)

# the decoder only accepts an np array of encoded data
X_train_reconstructed = pca.decode_output_data(X_train_encoded)
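
For completeness, here is a minimal sketch of constructing and building a PCAModel before encoding. The constructor arguments mirror the Autoencoder example in the Initialize & Fit Model section below; whether PCAModel takes exactly these parameters is an assumption on my part:

from TelescopeML.UnsupervisedLearning import PCAModel

# hypothetical: same 6 constructor parameters as the Autoencoder example below
pca = PCAModel(
	X_train = data_processor.X_train_standardized_rowwise,
	X_val   = data_processor.X_val_standardized_rowwise,
	X_test  = data_processor.X_test_standardized_rowwise,
	y_train = data_processor.y_train_standardized_columnwise,
	y_val   = data_processor.y_val_standardized_columnwise,
	y_test  = data_processor.y_test_standardized_columnwise,
)

# keep enough PCs to explain at least 95% of the variance
pca.build_model({'n_components': 0.95})
X_train_encoded = pca.encode_input_data('train')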

Autoencoder

# REQUIRED CONFIG KEYS
# for now, input_dim is hard-coded to 104
config = {
	'output_dim': int,
	'encoder_layers': list[int],
	'decoder_layers': list[int]
}
# encoder_layers & decoder_layers list the number of nodes
# in each in-between layer (exclude the input & output layer sizes)

# uses the model's own autoencoder.X_train dataset when the string 'train' is passed in
autoencoder: Autoencoder
X_train_encoded = autoencoder.encode_input_data('train')

# an unrelated dataset can also be passed in
X_separate_encoded = autoencoder.encode_input_data(X_separate)

# the decoder only accepts an np array of encoded data
X_train_reconstructed = autoencoder.decode_output_data(X_train_encoded)
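
Since the encode/decode round trip is lossy, a quick sanity check on reconstruction quality can be done with plain NumPy (nothing here is specific to this PR):

import numpy as np

# mean squared reconstruction error over the training set
mse = np.mean((autoencoder.X_train - X_train_reconstructed) ** 2)
print(f'reconstruction MSE: {mse:.6f}')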

GMM

# REQUIRED CONFIG KEYS
config = {
	'n_components': int,
	'covariance_type': Literal['full', 'tied', 'spherical', 'diag'],
	'n_init': int,
	'max_iter': int,
	'verbose': int
}
# These serve the same purpose as the corresponding parameters of sklearn.mixture.GaussianMixture

# gmm.train_model already calls calculate_expected_values;
# the results can be accessed via gmm.expected_values (avg. Y value per cluster)
gmm: GMM

# predict requires encoded input, e.g. the output of PCA or the Autoencoder
y_pred = gmm.predict(X_train_encoded)
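
Putting the pieces together, here is a sketch of the full build → train → predict flow. The method names come from the feature list above, but the constructor arguments and the exact train_model signature are assumptions:

# hypothetical constructor arguments; encoded X data from PCA/Autoencoder
gmm = GMM(
	X_train = X_train_encoded,
	X_val   = X_val_encoded,
	X_test  = X_test_encoded,
	y_train = y_train,
	y_val   = y_val,
	y_test  = y_test,
)
gmm.build_model({
	'n_components': 10,
	'covariance_type': 'full',
	'n_init': 3,
	'max_iter': 200,
	'verbose': 0,
})
gmm.train_model()           # also populates gmm.expected_values
y_pred = gmm.predict(X_train_encoded)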

Initialize & Fit Model

To initialize any of the models, you pass the same 6 parameters to its constructor. You then call the class's build_model method with the necessary config dict, followed by train_model. The Autoencoder is shown below; the process is very similar for the other 2 classes.

from TelescopeML.UnsupervisedLearning import *
from TelescopeML.DataMaster import *

import pandas as pd
import numpy as np
import os

...

data_processor: DataProcessor
autoencoder = Autoencoder(
	X_train = data_processor.X_train_standardized_rowwise,
	X_val   = data_processor.X_val_standardized_rowwise,
	X_test  = data_processor.X_test_standardized_rowwise,
	y_train = data_processor.y_train_standardized_columnwise,
	y_val   = data_processor.y_val_standardized_columnwise,
	y_test  = data_processor.y_test_standardized_columnwise,
)
# Passed in when building the model; determines
# the output size & the layers in between
config = {
	'output_dim': 20,
	'encoder_layers': [60, 35],
	'decoder_layers': [35, 60],
	# Exclude final input & output layer for each list
}
autoencoder.build_model(config)

# Train model with given parameters
model_history, model = autoencoder.train_model(epochs=100,
											   batch_size=20,
											   verbose=1)
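
If train_model returns a standard Keras History object (an assumption, though consistent with the (history, model) return pair above), convergence is easy to inspect:

import matplotlib.pyplot as plt

# assumes model_history.history carries the usual 'loss'/'val_loss' keys
plt.plot(model_history.history['loss'], label='train loss')
plt.plot(model_history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('reconstruction loss')
plt.legend()
plt.show()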

Saving Models

The autoencoder is used as the example here; the process works the same with GMM (a sketch of the GMM equivalent follows below).

# model_indicator is the unique identifier for this specific autoencoder
autoencoder.save_from_indicator(model_indicator='95acc-100epoch')

...

# new_autoencoder & autoencoder must have the same architecture
# if so, loads the trained weights from old autoencoder
new_autoencoder = Autoencoder(...)
new_autoencoder.build_model(config)
new_autoencoder.load_from_indicator(model_indicator='95acc-100epoch')
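
The equivalent GMM round trip, assuming the method names really are identical as stated above (the indicator string here is just a placeholder):

# hypothetical mirror of the autoencoder example
gmm.save_from_indicator(model_indicator='gmm-10comp')

...

new_gmm = GMM(...)
new_gmm.build_model(config)
new_gmm.load_from_indicator(model_indicator='gmm-10comp')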

Notes

I would appreciate any feedback on the new clustering & dimensionality reduction methods. In the future I plan to expand upon these features, both by exploring different methods and by fine-tuning the existing ones.
