machine_learning_tool module
This script trains different machine learning models, performs grid search, and evaluates the resulting models.
-
machine_learning_tool.feature_reduction(X_train, X_test, featureNames)
Function to filter out zero-variance genes and scale the remaining features
- Parameters
X_train – numpy.array of training data
X_test – numpy.array of testing data
featureNames – array with gene feature names
- Returns
filtered and scaled X_train, numpy array
- Returns
filtered and scaled X_test, numpy array
- Returns
filtered feature_names, list
- Returns
filtered gene_ids, list
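As an illustration of the filtering and scaling described above, here is a minimal sketch using only NumPy. The name `feature_reduction_sketch` is hypothetical, and two points are assumptions rather than documented behaviour: that variance is judged on the training data only, and that scaling means standardising with training-set mean and standard deviation.

```python
import numpy as np


def feature_reduction_sketch(X_train, X_test, feature_names):
    """Hypothetical stand-in for feature_reduction, not the module's code."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    # Assumption: a gene is dropped when its variance on the training set is 0.
    keep = X_train.var(axis=0) > 0
    X_train, X_test = X_train[:, keep], X_test[:, keep]
    kept_names = [name for name, k in zip(feature_names, keep) if k]

    # Assumption: "scaled" means standardised with training-set statistics,
    # so no information from the test set leaks into the scaling.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std, kept_names


# Toy data: the middle gene is constant in the training set.
X_train = np.array([[1.0, 5.0, 0.0], [3.0, 5.0, 2.0], [5.0, 5.0, 4.0]])
X_test = np.array([[3.0, 5.0, 2.0]])
Xtr, Xte, names = feature_reduction_sketch(
    X_train, X_test, ["geneA", "geneB", "geneC"]
)
print(names, Xtr.shape, Xte.shape)
```

Dividing by the standard deviation is safe here because the zero-variance columns have already been removed.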
-
machine_learning_tool.grid_search(model, model_dict, X_train, y_train, X_test, y_test, gene_ids, feature_names, splits, n_jobs, refitting, outpath, dataset_title)
Function to perform grid search and print evaluation performance
- Parameters
model – model to tune, one of LinearSVC, SVC, RandomForest, MultiLayerPerceptron
model_dict – dictionary which instantiates models
X_train – scaled training data, numpy array
y_train – list of target values
X_test – scaled test data, numpy array
y_test – list of target test values
gene_ids – list of gene ids
feature_names – list of features
splits – number of splits for cross-validation
n_jobs – number of CPU cores to use
refitting – whether to refit the model after the search; default is True
outpath – output path for plots and tables
dataset_title – title of the dataset, e.g. tissue_dataset
- Returns
None
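The module's own grid-search code is not shown on this page. The sketch below only illustrates the general pattern the parameters suggest (`splits`, `n_jobs`, `refitting`), using scikit-learn's `GridSearchCV`. The toy data, the parameter grid, and the choice of `SVC` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for scaled expression data: 40 samples x 5 features, 2 classes.
X = rng.normal(size=(40, 5))
y = np.array([0, 1] * 20)
X[y == 1, 0] += 3.0  # make the first feature informative

X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

# Hypothetical grid; the grids used by the module are not documented here.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# splits -> n_splits, n_jobs -> n_jobs, refitting -> refit (assumed mapping).
cv = StratifiedKFold(n_splits=3)
search = GridSearchCV(SVC(), param_grid, cv=cv, n_jobs=1, refit=True)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV accuracy:", search.best_score_)

# With refit=True the best estimator is retrained on all training data,
# so the search object can score the held-out test set directly.
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```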
-
machine_learning_tool.max_features_arr(feature_names, max_f_arr)[source]
Function for obtaining max_features as float array for RandomForest
- Parameters
feature_names – array of gene feature names
max_f_arr – array of max_features to be converted into float array for grid search
- Returns
float array of max_features
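The exact conversion is not specified above. One plausible reading is sketched here under that assumption: absolute feature counts are turned into fractions of the total number of features, since scikit-learn's random forests interpret a float `max_features` as a fraction of the features considered at each split. The helper name `max_features_fractions` is hypothetical.

```python
import numpy as np


def max_features_fractions(feature_names, max_f_arr):
    """Hypothetical helper: convert feature counts to fractions in (0, 1]."""
    n_features = len(feature_names)
    counts = np.asarray(max_f_arr, dtype=float)
    # Clip so that a count larger than n_features still maps to at most 1.0.
    return np.clip(counts, 1, n_features) / n_features


fractions = max_features_fractions(["g1", "g2", "g3", "g4"], [1, 2, 4, 8])
print(fractions)
```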
-
machine_learning_tool.plot_bar(df, outpath)
Plot the 10 most important features
- Parameters
df – pandas DataFrame of feature importances
outpath – output path for the plot
- Returns
None
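A bar plot of the ten largest importances could look roughly like the sketch below. The column names `feature` and `importance`, the output file name, and the function name `plot_top_features` are assumptions for illustration. The layout of the DataFrame that `plot_bar` expects is not documented here.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_features(df, outpath, n=10):
    """Hypothetical sketch: save a bar plot of the n largest importances."""
    top = df.nlargest(n, "importance")
    fig, ax = plt.subplots(figsize=(6, 4))
    # Reverse the order so the most important feature is drawn at the top.
    ax.barh(top["feature"].tolist()[::-1], top["importance"].tolist()[::-1])
    ax.set_xlabel("importance")
    fig.tight_layout()
    out_file = os.path.join(outpath, "top_features.png")
    fig.savefig(out_file)
    plt.close(fig)
    return out_file


# Made-up importances for 12 features; only the top 10 are plotted.
df = pd.DataFrame({
    "feature": [f"gene{i}" for i in range(12)],
    "importance": [i / 100 for i in range(12)],
})
out_file = plot_top_features(df, tempfile.mkdtemp())
print(out_file)
```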
-
machine_learning_tool.print_mean_std_scores(scoring_dict)
Print the mean and standard deviation of the cross-validation scores
- Parameters
scoring_dict – from grid_search, cross_val_score, cross_validate
- Returns
None
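As a rough illustration of summarising cross-validation scores, the sketch below assumes `scoring_dict` maps a metric name to an array of per-fold scores, as in the output of scikit-learn's `cross_validate`. The function name `format_mean_std` and the output format are hypothetical.

```python
import numpy as np


def format_mean_std(scoring_dict):
    """Hypothetical helper: one 'name: mean +/- std' line per metric."""
    lines = []
    for name, scores in scoring_dict.items():
        scores = np.asarray(scores, dtype=float)
        lines.append(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
    return lines


# Made-up per-fold scores for three folds.
scores = {"test_accuracy": [0.8, 0.9, 1.0], "test_f1": [0.5, 0.5, 0.5]}
lines = format_mean_std(scores)
for line in lines:
    print(line)
```

Note that `numpy.std` returns the population standard deviation by default. Pass `ddof=1` for the sample standard deviation.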
-
machine_learning_tool.read_dataset(path, ensembl_filter, id2name)
Function to parse a dataset
- Parameters
path – path to the input gene-expression table (columns: GeneID, GeneName, Sample1, Sample2, …)
ensembl_filter – input list of ensembl_gene_ids
id2name – input list of gene_names
- Returns
pandas.DataFrame
- Returns
sample_names, list
- Returns
gene_ids = features, list
- Returns
gene_names = feature_names, list
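The parsing step can be sketched with pandas as follows. The tab separator, the exact column headers, the samples-as-rows orientation of the returned frame, and the name `read_dataset_sketch` are all assumptions. The `id2name` argument is left out because its role is not described above, and the gene IDs in the toy table are made up.

```python
import io

import pandas as pd


def read_dataset_sketch(path_or_buffer, ensembl_filter):
    """Hypothetical stand-in for read_dataset, not the module's code."""
    # Assumption: a tab-separated table with GeneID and GeneName columns.
    table = pd.read_csv(path_or_buffer, sep="\t")

    # Keep only the genes whose ID is in the filter list.
    table = table[table["GeneID"].isin(ensembl_filter)]

    sample_names = [c for c in table.columns if c not in ("GeneID", "GeneName")]
    gene_ids = table["GeneID"].tolist()
    gene_names = table["GeneName"].tolist()

    # Assumption: samples become rows and genes become columns (features).
    df = table.set_index("GeneID")[sample_names].T
    return df, sample_names, gene_ids, gene_names


toy_table = (
    "GeneID\tGeneName\tSample1\tSample2\n"
    "ENSG01\tABC\t1.0\t2.0\n"
    "ENSG02\tDEF\t3.0\t4.0\n"
    "ENSG03\tGHI\t5.0\t6.0\n"
)
df, sample_names, gene_ids, gene_names = read_dataset_sketch(
    io.StringIO(toy_table), ["ENSG01", "ENSG03"]
)
print(df)
```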