Modeling with promor
This tutorial shows you how you can use promor to build predictive models using top differentially expressed proteins.
We recommend that you first go through the simple working example provided in Introduction to promor to get acquainted with promor’s functionality.
A tutorial for proteomics data without technical replicates is provided here: promor: No technical replicates
A tutorial for proteomics data with technical replicates is provided here: promor: Technical replicates
For this tutorial we will be using a previously published data set from Suvarna et al. (2021). In this study, authors used differentially expressed proteins between 33 severe and 18 non-severe COVID patients to build models to predict COVID severity. Here, we are using a subset of that data set (18 severe and 18 non-severe COVID patients).
To build the fit_df and the norm_df objects provided with the pacakge and used in this tutorial, the following analytical steps were conducted:
#Create a raw_df object with default settings covid_raw <- create_df( prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg3.txt", exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed3.txt", ) #Filter out proteins with fewer than 66% valid values in either group covid_filt <- filterbygroup_na(covid_raw) #Impute missing data with the minProb method covid_imp_df <- impute_na(covid_filt) #Normalize the data with the quantile method covid_norm_df <- normalize_data(covid_imp_df) #find DE proteins and make a fit_df object covid_fit_df <- find_dep(covid_norm_df)
Figure 1. A schematic diagram showing the suggested promor workflow for building predictive models with protein candidates
You can access the help pages for functions shown above and
To build predictive models with promor, you need:
A fit_df object produced by running
find_depfor differential expression analysis. For this example, we will be using
covid_fit_dfobject provided with the package.
An norm_df object which is a data frame of normalized protein intensities used as input for
find_dep. We will be using
covid_norm_dfobject provided with the package.
Let’s first create a model_df object with the example
norm_df objects provided with the
pre_processfunction uses information from the two objects to create a data frame of protein intensities in a suitable format for modeling.
- If you run this step with default options, highly correlated proteins (features) will be identified and removed from the data frame.
- Alternatively, you can set
find_highcorr = FALSEand
rem_highcorr = FALSE, but it is not recommended. You can also tweak the
corr_cutoffto change the threshold for identifying highly correlated proteins.
- Note: if no or few differentially expressed proteins were identified at the given adjusted p-value cutoff, you can choose to relax these criteria to use more proteins in modeling.
?pre_processfor more information.
# Load promor library(promor) # Create a model_df object with the top differentially expressed proteins. covid_model_df <- pre_process( fit_df = covid_fit_df, norm_df = covid_norm_df ) # Let's take a look at the first few rows of the model_df object head(covid_model_df)
At this stage, we can visualize the feature variation among classes using box plots or density plots.
- Your best proteins (features) for modeling are those that show distinct patterns of variation among conditions.
- For example, these proteins may show mostly non-overlapping density distributions.
# Box plots (default) to visualize feature variation feature_plot( model_df = covid_model_df, n_row = 4, n_col = 2 )
# Alternatively, make density plots feature_plot( model_df = covid_model_df, type = "density", n_row = 4, n_col = 2 )
- At this stage, if you decide to remove a protein or some proteins
model_dfobject, you can do so with
Next, we will split the
model_df object into training
and test data sets while maintaining class or condition
proportions in each set.
- By default, 80% of the data will be added to the training set while 20% will be added to the test set.
- You can change these proportions by setting
train_sizeto a higher or a lower value.
# Create a split_df object by splitting data into training and test data set. Don't forget to fix the random seed for reproducibility. covid_split_df <- split_data(model_df = covid_model_df, seed = 8314)
Access items in the
# You can access the items in the training data set as follows, covid_split_df$training # Similarly, access the test data set covid_split_df$test
We will be using some functions from the
to train models on our training data set.
- A list of all the available machine learning algorithms can be found here: http://topepo.github.io/caret/train-models-by-tag.html.
- Note that some algorithms may require additional arguments. See
trainControlfor more details.
- The default list of classification-based algorithms provided include:
"svmRadial"- A support vector machine algorithm
"glm"- A generalized linear model algorithm
"rf"- A random forest algorithm
"xgbLinear"- An extreme gradient boosting algorithm
"naive_bayes"- A naive bayes algorithm
For this data-set we chose four algorithms. They were chosen based on their suitability for building models using few features (eight proteins) and samples (35 samples).
# Create a model_list object by training models on the training data set using a custom list of algorithms.Don't forget to fix the random seed for reproducibility. covid_model_list <- train_models(split_df = covid_split_df, algorithm_list = c("svmLinear", "rf", "naive_bayes", "knn"), seed = 351)
We can now check how models built with each algorithm performed on the resamples of the training data set.
# Box plots (default) to visualize model performance performance_plot(model_list = covid_model_list)
# Make dot plots performance_plot( model_list = covid_model_list, type = "dot" )
*It looks like both “knn” and “naive_bayes” methods performed fairly well.
We can check which proteins out of the 8 are most important in the models.
# Make lollipop plots (default) varimp_plot( model_list = covid_model_list, text_size = 7, n_row = 2, n_col = 2 )
# Make bar plots varimp_plot( model_list = covid_model_list, type = "bar", text_size = 7, n_row = 2, n_col = 2 )
- At this stage you can again use
rem_featurefunction to remove proteins with lower importance in the models and repeat steps from Step 3.
We are now ready to test our models on never-before-seen data or our test data set.
# First we run the function with type = "prob" to get a probability list covid_prob_list <- test_models( model_list = covid_model_list, split_df = covid_split_df, type = "prob" )
Alternatively, you can output a list of predictions and output a confusion matrix for further analysis.
# We run the function with type = "raw" to get a prediction list and output a confusion matrix. You can also provide a file path to save the text file in in your preferred directory. In this example, we are saving our text file in the working directory. covid_pred_list <- test_models( model_list = covid_model_list, split_df = covid_split_df, type = "raw", save_confusionmatrix = TRUE, file_path = "." )
Finally, we can build ROC plots to assess the predictive power of our models.
roc_plotfunction requires a
- Make sure to set
type = "prob"when running
test_modelsto output a
# Make roc curves roc_plot( probability_list = covid_prob_list, split_df = covid_split_df )
Save a copy of the plot in the working directory.
# Make roc curves roc_plot( probability_list = covid_prob_list, split_df = covid_split_df, save = TRUE, file_path = "." )