
To try out the dimensionality reduction methods mentioned here, a pre-configured Docker image is available on Docker Hub.

Docker

  1. Install Docker for your operating system from the official Docker website.
  2. Start Docker.
  3. Get the Docker image from Docker Hub via docker pull bioinfbo/dimensionality_reduction:latest
  4. Run the container via nvidia-docker run -it bioinfbo/dimensionality_reduction:latest

Dimensionality Reduction Scripts

Run python setup.py to check that all scripts work and execute without errors. In a Python shell, all functions become available via from setup import *

Principal component analysis

PCA_tf2.py offers functions to perform PCA as well as to load a saved PCA model and evaluate test data.

To perform a dimensionality reduction with PCA, the following input arguments are required:

encoded_data = perform_pca(arrx2_float_data,
                           int_comp: int = 16,
                           str_saving_path: str = "",
                           bool_standardize: bool = True)

Your input data should be a two-dimensional numpy array (e.g. (200, spectra), np.ndarray, np.float32), the number of components int_comp should be an integer, and a saving path is required as a string. Optionally, your data can be standardized so that every spectrum has a mean of 0 and a standard deviation of 1. The PCA model and the reduced data will be saved, and the function returns your input data with shape (200, int_comp). Output files that will be saved are the encoded input data PCA_encoded_[int_comp]_comp.npy, the PCA model pca.pkl and, if your data was standardized, the scaler scaler_pca.pkl. Additionally, a PCA plot with the explained variance will be saved.
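As a minimal usage sketch (assuming PCA_tf2.py can be imported as a module and the saving path exists; the data and path below are placeholders):

import numpy as np
from PCA_tf2 import perform_pca

# 200 example spectra with 1000 data points each (replace with your own data)
arrx2_float_data = np.random.rand(200, 1000).astype(np.float32)

# reduce to 16 principal components; the model, scaler, and encoded data
# are written to str_saving_path
encoded_data = perform_pca(arrx2_float_data,
                           int_comp=16,
                           str_saving_path="results/",
                           bool_standardize=True)
print(encoded_data.shape)  # (200, 16)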

To transform other data with your calculated PCA model, use the following function:

encoded_data = load_pca_model(arrx2_float_data,
                              str_data_name: str = "test",
                              int_comp: int = 16,
                              str_saving_path: str = "",
                              str_path_pca_model = None,
                              bool_standardize: bool = True,
                              str_path_scaler_model = None)

If str_path_pca_model and str_path_scaler_model are None, the function will try to load both models from str_saving_path under the default names pca.pkl and scaler_pca.pkl.
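For example (a hedged sketch; the test data are a placeholder, and the models are assumed to have been saved by perform_pca() as described above):

import numpy as np
from PCA_tf2 import load_pca_model

# new data with the same number of spectral points as the training data
arrx2_float_test_data = np.random.rand(50, 1000).astype(np.float32)

# with both model paths left as None, pca.pkl and scaler_pca.pkl
# are loaded from str_saving_path
encoded_test = load_pca_model(arrx2_float_test_data,
                              str_data_name="test",
                              int_comp=16,
                              str_saving_path="results/",
                              str_path_pca_model=None,
                              bool_standardize=True,
                              str_path_scaler_model=None)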

Uniform manifold approximation and projection

UMAP_tf2.py offers functions to perform UMAP as well as to load a saved UMAP model and evaluate test data.

To perform a dimensionality reduction with UMAP, the following input arguments are required:

encoded_data = perform_umap(arrx2_float_data,
                            str_saving_path: str = "",
                            int_comp: int = 16,
                            int_neigh: int = 15,
                            float_min_dist: float = 0.1,
                            str_metric: str = "correlation",
                            bool_standardize: bool = True)

The function perform_umap() covers the basic UMAP parameters (see https://umap-learn.readthedocs.io/en/latest/):

  • int_neigh: Number of neighbors that UMAP uses to learn the manifold structure. A smaller value emphasizes local structure, while a larger value concentrates on global structure. The default value is 15.
  • float_min_dist: Controls how tightly the points are packed in the lower-dimensional representation. The default value is 0.1, and the parameter ranges from 0.0 to 0.99.
  • int_comp: The dimensionality of the reduced data. The default value is set to 16.
  • str_metric: The metric used to compute the distance between points. The default is set to correlation; a wide range of metrics is supported.

Output files that will be saved are the encoded input data umap_neigh15_dist0.1_metriccorrelation_comp16_train.npy, the UMAP model umap_neigh15_dist0.1_metriccorrelation_comp16_model.pkl and, if your data was standardized, the scaler umap_neigh15_dist0.1_metriccorrelation_comp16_scaler.pkl.
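A minimal usage sketch (assuming UMAP_tf2.py can be imported as a module; the data and path below are placeholders):

import numpy as np
from UMAP_tf2 import perform_umap

# 200 example spectra (replace with your own data)
arrx2_float_data = np.random.rand(200, 1000).astype(np.float32)

# reduce to 16 dimensions with the default UMAP parameters
encoded_data = perform_umap(arrx2_float_data,
                            str_saving_path="results/",
                            int_comp=16,
                            int_neigh=15,
                            float_min_dist=0.1,
                            str_metric="correlation",
                            bool_standardize=True)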

To encode other data with your embedding model, use the following function:

encoded_data = load_umap_model(arrx2_float_data,
			       str_saving_path: str = "",
			       str_data_name: str = "test", 
			       str_path_umap_model = "umap_neigh15_dist0.1_metriccorrelation_comp2_model.pkl",
			       bool_standardize: bool = True,
			       str_path_scaler_model = "umap_neigh15_dist0.1_metriccorrelation_comp2_scaler.pkl")

Fully connected contractive autoencoder

FCCAE_tf2.py offers the function train_fccae() to train a fully connected autoencoder with a mean squared error loss and an additional contractive term. It returns a trained encoder and autoencoder.

encoder, autoencoder = train_fccae(arrx2_float_data_train = None,
                                   arrx2_float_data_val = None,
                                   arrx2_float_data_test = None,
                                   bool_model_avail = False,
                                   str_path_to_weights = "",
                                   int_epochs: int = 5,
                                   int_epoch_start: int = 0,
                                   early_stopping_epochs: int = 200,
                                   batch_size: int = 50,
                                   learning_rate: float = 0.003,
                                   bool_l2_normalize_data: bool = False,
                                   int_norm_axis: int = 1,
                                   list_hidden_layers: list = [ {'n_nodes': 256}, {'n_nodes': 128}, {'n_nodes': 64},
                                                                {'n_nodes': 32}, {'n_nodes': 16} ],
                                   str_saving_path: str = "",
                                   bool_show_summary: bool = False)

Your training and validation data should be two-dimensional numpy arrays (e.g. (200, spectra), np.ndarray, np.float32). Providing test data is optional; it can also be evaluated after training the model. If the training of the model stops, it can be resumed by setting bool_model_avail = True, providing the path to a model file via str_path_to_weights = "..h5", and setting int_epoch_start to the epoch that was loaded. The output files that will be saved are both the autoencoder and the encoder model, the history plot of the training, and the encoded and decoded test data if arrx2_float_data_test was given. TensorBoard files will be saved in the newly created log folder, as well as the weights in the corresponding weight folder.
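A minimal training sketch (the data below are placeholders; the returned models are assumed to be Keras models, so encoder.predict() should be available):

import numpy as np
from FCCAE_tf2 import train_fccae

# training/validation split of example spectra (replace with your own data)
arr_train = np.random.rand(160, 1000).astype(np.float32)
arr_val = np.random.rand(40, 1000).astype(np.float32)

encoder, autoencoder = train_fccae(arrx2_float_data_train=arr_train,
                                   arrx2_float_data_val=arr_val,
                                   int_epochs=5,
                                   str_saving_path="results/")

# the encoder alone yields the low-dimensional representation
encoded_train = encoder.predict(arr_train)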

Stacked contractive autoencoder

SCAE_tf2.py offers the function train_scae(), which trains a series of stacked contractive autoencoders, each with one hidden layer, using a mean squared error loss and an additional contractive term. Afterwards, the encoders and decoders are connected to form a deep autoencoder. It returns a trained encoder and autoencoder.

encoder, autoencoder = train_scae(arrx2_float_data_train = None,
                                  arrx2_float_data_val = None,
                                  arrx2_float_data_test = None,
                                  int_epochs: int = 5,
                                  early_stopping_epochs: int = 200,
                                  batch_size: int = 50,
                                  learning_rate: float = 0.003,
                                  bool_l2_normalize_data: bool = False,
                                  int_norm_axis: int = 1,
                                  list_hidden_layers: list = [ {'n_nodes': 256}, {'n_nodes': 128}, {'n_nodes': 64},
                                                               {'n_nodes': 32}, {'n_nodes': 16} ],
                                  str_saving_path: str = "",
                                  bool_show_summary: bool = False)

Your training and validation data should be two-dimensional numpy arrays (e.g. (200, spectra), np.ndarray, np.float32). Providing test data is optional; it can also be evaluated after training the model. The output files that will be saved are both the autoencoder and the encoder model, the history plots of all trainings, and the encoded and decoded test data if arrx2_float_data_test was given. TensorBoard files will be saved in the newly created log folder, as well as the weights in the corresponding weight folder.
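A minimal training sketch (the data below are placeholders; custom layer sizes can be passed via list_hidden_layers):

import numpy as np
from SCAE_tf2 import train_scae

arr_train = np.random.rand(160, 1000).astype(np.float32)
arr_val = np.random.rand(40, 1000).astype(np.float32)

# train the stacked autoencoders and connect them to a deep autoencoder;
# the last entry of list_hidden_layers sets the dimensionality of the encoding
encoder, autoencoder = train_scae(arrx2_float_data_train=arr_train,
                                  arrx2_float_data_val=arr_val,
                                  int_epochs=5,
                                  list_hidden_layers=[{'n_nodes': 256}, {'n_nodes': 128},
                                                      {'n_nodes': 64}, {'n_nodes': 32},
                                                      {'n_nodes': 16}],
                                  str_saving_path="results/")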