machine_learning_tool module

This script trains different machine learning models, performs grid search, and evaluates the models.

machine_learning_tool.feature_reduction(X_train, X_test, featureNames)[source]

Function to filter out zero-variance genes and scale the data

Parameters
  • X_train – numpy.array of training data

  • X_test – numpy.array of testing data

  • featureNames – array with gene feature names

Returns
  • filtered and scaled X_train, numpy array

  • filtered and scaled X_test, numpy array

  • filtered feature_names, list

  • filtered gene_ids, list
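The filtering and scaling described above can be sketched as follows. This is only an illustration, not the module's implementation: it assumes scikit-learn's VarianceThreshold and StandardScaler, and the name feature_reduction_sketch is hypothetical. It returns three values instead of the four documented, because the source does not say how gene_ids are derived.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler


def feature_reduction_sketch(X_train, X_test, feature_names):
    """Drop zero-variance columns, then standardize (hypothetical stand-in)."""
    # Fit the filter on the training data only, so the test set does not leak in.
    selector = VarianceThreshold(threshold=0.0)
    X_train_f = selector.fit_transform(X_train)
    X_test_f = selector.transform(X_test)

    # Keep only the names of the columns that survived the filter.
    kept = selector.get_support()
    names_f = [name for name, keep in zip(feature_names, kept) if keep]

    # Scale with statistics estimated from the training data only.
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train_f)
    X_test_s = scaler.transform(X_test_f)
    return X_train_s, X_test_s, names_f


# Toy data: the second column is constant and should be removed.
X_train = np.array([[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 2.0]])
X_test = np.array([[2.0, 5.0, 4.0]])
Xtr, Xte, names = feature_reduction_sketch(X_train, X_test, ["geneA", "geneB", "geneC"])
print(names)  # ['geneA', 'geneC']
```

Fitting both the filter and the scaler on X_train alone keeps the test data out of preprocessing, which is the usual reason for passing the two arrays separately.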

Function to perform grid search and print evaluation performance

Parameters
  • model – LinearSVC, SVC, RandomForest, MultiLayerPerceptron

  • model_dict – dictionary which instantiates models

  • X_train – scaled training data, numpy array

  • y_train – list of target values

  • X_test – scaled test data, numpy array

  • y_test – list of target test values

  • gene_ids – list of gene ids

  • feature_names – list of features

  • splits – number of splits for cross-validation

  • n_jobs – number of CPU cores to use

  • refitting – whether to refit the best estimator after the search; default is True

  • outpath – output path for plots and tables

  • dataset_title – e.g. tissue_dataset

Returns

None
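A grid search with cross-validation followed by evaluation on held-out data might look like the sketch below. It assumes scikit-learn's GridSearchCV with a LinearSVC; the synthetic data, the parameter grid, and the variable names are illustrative and not taken from the module.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a scaled expression matrix and its labels.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# "splits" corresponds to the number of cross-validation folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=LinearSVC(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1.0]},  # illustrative grid
    cv=cv,
    n_jobs=1,    # number of CPU cores
    refit=True,  # refit the best model on the whole training set
)
search.fit(X_train, y_train)

# With refit=True, score() uses the refitted best estimator.
test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("test accuracy:", round(test_accuracy, 3))
```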

machine_learning_tool.max_features_arr(feature_names, max_f_arr)[source]

Function for obtaining max_features as float array for RandomForest

Parameters
  • feature_names – array of gene feature names

  • max_f_arr – array of max_features to be converted into float array for grid search

Returns

float array of max_features
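The source does not state how the conversion is done. One plausible reading, shown here purely as an assumption, is that absolute feature counts are turned into fractions of the total number of features, which scikit-learn's random forest accepts as a float max_features. The name max_features_arr_sketch is hypothetical.

```python
def max_features_arr_sketch(feature_names, max_f_arr):
    """Convert feature counts to fractions of the feature total (assumed behaviour)."""
    n_features = len(feature_names)
    # Cap at 1.0 so a count larger than the feature total still yields a valid fraction.
    return [min(count / n_features, 1.0) for count in max_f_arr]


fractions = max_features_arr_sketch(["g1", "g2", "g3", "g4"], [1, 2, 8])
print(fractions)  # [0.25, 0.5, 1.0]
```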

machine_learning_tool.plot_bar(df, outpath)[source]

Plot 10 most important features

Parameters
  • df – pandas DataFrame of feature importances

  • outpath – output path for the plot

Returns

None
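A bar plot of the ten most important features could be produced roughly as follows. The column names feature and importance, the output file name, and the function name plot_bar_sketch are assumptions; the layout of the real DataFrame is not documented here.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_bar_sketch(df, outpath, top_n=10):
    """Save a horizontal bar plot of the top_n rows by importance (hypothetical)."""
    top = df.nlargest(top_n, "importance")
    fig, ax = plt.subplots(figsize=(6, 4))
    # Reverse the order so the most important feature is drawn at the top.
    ax.barh(top["feature"].tolist()[::-1], top["importance"].tolist()[::-1])
    ax.set_xlabel("importance")
    ax.set_title(f"Top {len(top)} features")
    fig.tight_layout()
    fig.savefig(outpath)
    plt.close(fig)
    return top


# Toy importances for twelve made-up genes.
df = pd.DataFrame(
    {
        "feature": [f"gene{i}" for i in range(12)],
        "importance": [i / 100 for i in range(12)],
    }
)
out_file = os.path.join(tempfile.mkdtemp(), "feature_importances.png")
top = plot_bar_sketch(df, out_file)
```

Unlike the documented function, which returns None, the sketch returns the plotted rows so the selection can be inspected.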

machine_learning_tool.print_mean_std_scores(scoring_dict)[source]

Print the mean and standard deviation of cross-validation results

Parameters

scoring_dict – from grid_search, cross_val_score, cross_validate

Returns

None
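As a rough illustration, the summary could be computed as below. The sketch assumes a dictionary that maps score names to per-fold values, as returned by scikit-learn's cross_validate; other result layouts would need their own handling. The name print_mean_std_scores_sketch is hypothetical.

```python
import numpy as np


def print_mean_std_scores_sketch(scoring_dict):
    """Print mean and standard deviation per score (hypothetical stand-in)."""
    summary = {}
    for name, values in scoring_dict.items():
        values = np.asarray(values, dtype=float)
        summary[name] = (values.mean(), values.std())
        print(f"{name}: {values.mean():.3f} +/- {values.std():.3f}")
    return summary


# Per-fold values as they might come back from a 3-fold cross_validate call.
scores = {"test_accuracy": [0.8, 0.9, 1.0], "fit_time": [0.1, 0.1, 0.1]}
summary = print_mean_std_scores_sketch(scores)
```

The sketch returns the computed values for inspection, whereas the documented function only prints and returns None.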

machine_learning_tool.read_dataset(path, ensembl_filter, id2name)[source]

Function to parse a dataset

Parameters
  • path – to input gene-expression table (GeneID, GeneName, Sample1, Sample2, …)

  • ensembl_filter – input list of ensembl_gene_ids

  • id2name – input list of gene_names

Returns
  • pandas.DataFrame

  • sample_names, list

  • gene_ids = features, list

  • gene_names = feature_names, list
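Parsing a table with the column layout named above might look like this sketch. It assumes a tab-separated file and uses made-up gene IDs. The id2name argument is left out because its exact role is not described, and the name read_dataset_sketch is hypothetical.

```python
import io

import pandas as pd


def read_dataset_sketch(path, ensembl_filter):
    """Read an expression table and keep only the requested gene IDs (hypothetical)."""
    df = pd.read_csv(path, sep="\t")  # tab separation is an assumption
    df = df[df["GeneID"].isin(ensembl_filter)].reset_index(drop=True)

    # Every column other than the two annotation columns is treated as a sample.
    sample_names = [c for c in df.columns if c not in ("GeneID", "GeneName")]
    gene_ids = df["GeneID"].tolist()
    gene_names = df["GeneName"].tolist()
    return df, sample_names, gene_ids, gene_names


# In-memory stand-in for a file on disk; the IDs and names are invented.
table = io.StringIO(
    "GeneID\tGeneName\tSample1\tSample2\n"
    "ID0001\tGENE_A\t1.0\t2.0\n"
    "ID0002\tGENE_B\t3.0\t4.0\n"
    "ID0003\tGENE_C\t5.0\t6.0\n"
)
df, sample_names, gene_ids, gene_names = read_dataset_sketch(table, ["ID0001", "ID0003"])
print(sample_names, gene_ids, gene_names)
```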

machine_learning_tool.search_space(model)[source]

Function to define the search space for GridSearch or RandomGridSearch

Parameters

model – one of LinearSVC, SVC, RandomForest, MultiLayerPerceptron

Returns

search space
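A search space is commonly expressed as a dictionary of parameter names and candidate values, one per model. The sketch below follows that convention. The parameter names are real scikit-learn estimator arguments, but the candidate values and the name search_space_sketch are illustrative and not the module's actual grids.

```python
def search_space_sketch(model):
    """Return an illustrative parameter grid for the named model (hypothetical)."""
    spaces = {
        "LinearSVC": {"C": [0.01, 0.1, 1, 10]},
        "SVC": {
            "C": [0.1, 1, 10],
            "kernel": ["linear", "rbf"],
            "gamma": ["scale", "auto"],
        },
        "RandomForest": {
            "n_estimators": [100, 500],
            "max_features": [0.1, 0.5, 1.0],
        },
        "MultiLayerPerceptron": {
            "hidden_layer_sizes": [(50,), (100,)],
            "alpha": [1e-4, 1e-2],
        },
    }
    if model not in spaces:
        raise ValueError(f"unknown model: {model!r}")
    return spaces[model]


grid = search_space_sketch("SVC")
print(sorted(grid))  # ['C', 'gamma', 'kernel']
```

A dictionary of this shape can be passed as param_grid to GridSearchCV or as param_distributions to RandomizedSearchCV.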