A computational protocol for sample selection in biological-derived infrared spectroscopy datasets using Morais-Lima-Martin (MLM) algorithm

Infrared (IR) spectroscopy is a powerful analytical technique that can be applied to investigate a wide range of biological materials (e.g., biofluids, cells, tissues), yielding a specific biochemical signature that represents the 'fingerprint' signal of the sample being analysed. This chemical information can be used as input data for classification models in order to distinguish or predict sample groups based on computational algorithms. One fundamental step towards building such computational models is sample selection, where a fraction of the samples measured during an experiment is used for building the classifier, whereas the remaining ones are used for evaluating the classification performance of the model. This protocol shows how sample selection can be performed in a computational environment (MATLAB) by using a combination of Euclidean-distance calculation and random selection, named the Morais-Lima-Martin (MLM) algorithm, as a preliminary step before building classification models for biologically derived IR datasets.


Introduction
Infrared (IR) spectroscopy is a vibrational spectroscopy technique that generates a unique chemical signature representing most of the molecules present in a material. It is widely used to analyse biological materials (1), since it allows protocols to be built for analysing tissues, cells and biofluids in a non-destructive, fast and low-cost fashion (1,2). Computational methods are used to reduce processing time and extract relevant information. Chemometric methods are often applied to build predictive models in which the complex spectral data are transformed into chemically relevant and easy-to-interpret information by means of multivariate analysis techniques. In classification applications, samples are assigned to groups based on their IR spectrochemical signature. This includes, for example, differentiation of brain tumour types (3), identification of neurodegenerative diseases (4), cervical cancer screening (5), endometrial and ovarian cancer identification (6), identification of prostate cancer tissue samples (7), differentiation of endometrial tissue regions (8), toxicology screening (9,10), and microbiological studies involving fungi and virus identification (11-13).
However, before model construction, a fundamental step is to split the spectral dataset into at least two subsets: training and test. The training set is used for model construction and the test set for final model evaluation. Model optimization is often performed using cross-validation, where samples from the training set are used in an iterative process of model validation. Figure 1a contains a flowchart illustrating the fundamental steps for model construction. Usually, sample splitting is performed by random selection or by Euclidean distance using the Kennard-Stone (KS) algorithm (14). This protocol provides a computational methodology for sample splitting that combines the Euclidean-distance methodology of KS with a random-mutation factor to optimize sample selection and maximize classification rates. This algorithm, named Morais-Lima-Martin (MLM), is illustrated in Figure 1b.

• MLM algorithm, available for download at https://doi.org/10.6084/m9.figshare.7393517.v1;
• A classed spectroscopy dataset (a sample dataset is provided together with the algorithm).
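
To make the splitting idea from the introduction concrete, the following is a minimal Python sketch of a Kennard-Stone selection followed by a random mutation. It is not the MATLAB routine distributed with this protocol. The function names (`kennard_stone`, `random_mutation`) are hypothetical, and the mutation rule used here (swapping an equal number of samples between the two sets, sized as a fraction of the smaller set) is an assumption made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two spectra given as equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(spectra, n_train):
    """Pick n_train row indices by the Kennard-Stone rule.

    Starts with the two most distant spectra, then repeatedly adds the
    spectrum whose nearest already-selected neighbour is farthest away.
    Returns (train_indices, test_indices).
    """
    n = len(spectra)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of spectra")
    dist = [[euclidean(spectra[i], spectra[j]) for j in range(n)] for i in range(n)]
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]
    while len(selected) < n_train:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


def random_mutation(train, test, factor=0.10, seed=None):
    """Swap a random fraction of samples between the two sets.

    ASSUMPTION: the number of swaps is factor * size of the smaller set,
    rounded down, so both set sizes are preserved.
    """
    rng = random.Random(seed)
    n_swap = int(factor * min(len(train), len(test)))
    to_test = set(rng.sample(train, n_swap))
    to_train = set(rng.sample(test, n_swap))
    new_train = [i for i in train if i not in to_test] + sorted(to_train)
    new_test = [i for i in test if i not in to_train] + sorted(to_test)
    return sorted(new_train), sorted(new_test)
```

In this sketch the split would be run once per class, for example `train, test = kennard_stone(class1_spectra, 98)` followed by `random_mutation(train, test, factor=0.10)`, where `class1_spectra` is a list of rows (one spectrum per row).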

Preparing data files
The MLM algorithm only works within the MATLAB environment. Data should be loaded and saved in .mat format. Spectral data must be organized into matrices, where each spectrum corresponds to a row and the spectral variables are distributed among the columns. Figure 2a illustrates an example dataset with two classes within the MATLAB environment.
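
The row-per-spectrum layout described above can be checked before any splitting is attempted. The Python sketch below is illustrative only and is not part of the MLM package: `check_layout` is a hypothetical helper, and the class matrices are assumed to be plain lists of rows rather than MATLAB variables.

```python
def check_layout(class_matrices):
    """Verify that every class matrix has one spectrum per row and that
    all spectra share the same number of spectral variables (columns).

    class_matrices: dict mapping a class label to a list of rows.
    Returns (spectra_per_class, n_variables).
    """
    n_variables = None
    counts = {}
    for label, matrix in class_matrices.items():
        if not matrix:
            raise ValueError(f"class {label!r} contains no spectra")
        for row in matrix:
            if n_variables is None:
                n_variables = len(row)
            elif len(row) != n_variables:
                raise ValueError(
                    f"class {label!r} has a spectrum with {len(row)} variables, "
                    f"expected {n_variables}"
                )
        counts[label] = len(matrix)
    return counts, n_variables
```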
CAUTION. IR spectra must be pre-processed before sample selection. Pre-processing methodologies for IR spectral data of biological materials can be found elsewhere (1).

Algorithm installation
(1) Download and extract the "MLM.zip" file to a folder of choice; (2) start MATLAB; (3) navigate within MATLAB to the folder where the "MLM.zip" file was extracted; (4) within MATLAB, right-click on the folder "MLM" and select "Add to Path > Selected Folders and Subfolders".

Selecting the dataset
To load the example dataset, go to the folder "MLM > DATASET" within MATLAB and double-click on the file 'DATASET.mat'. To run the algorithm with another dataset, navigate within MATLAB to the "work" folder (i.e., the folder containing the dataset of interest) and double-click on the dataset file. For more than two classes, the procedure is the same, with the sample splitting performed for each class separately. The random-mutation factor is set to 10% by default.

Using MLM algorithm
CAUTION. The number of training and test samples for each class must be an integer value. If the 70% and 30% fractions yield numbers with decimal places, these must be rounded to the closest integer value (e.g., 25.7 to 26; 14.2 to 14; 70.9 to 71).
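
The rounding rule in the caution above can be worked through per class. The Python sketch below is a hedged illustration, not part of the MLM package: `split_sizes` is a hypothetical helper that rounds the training size to the nearest integer and takes the test size as the remainder, so that the two sizes always add up to the class total.

```python
import math


def split_sizes(n_samples, train_percent=70):
    """Return integer (n_train, n_test) for one class.

    The training size is rounded to the nearest integer (halves round up);
    the test size is the remainder, so n_train + n_test == n_samples.
    """
    if n_samples < 1 or not 0 < train_percent < 100:
        raise ValueError("need at least one sample and a percentage between 0 and 100")
    exact = n_samples * train_percent / 100
    n_train = math.floor(exact + 0.5)
    return n_train, n_samples - n_train


# Class sizes of the sample dataset described in this protocol.
for label, n in (("class 1", 140), ("class 2", 100)):
    n_train, n_test = split_sizes(n)
    print(label, n_train, n_test)
```

For the sample dataset this gives 98 training and 42 test spectra for class 1, and 70 training and 30 test spectra for class 2. For a class with 37 spectra, 70% is 25.9, which rounds to 26 training spectra and leaves 11 for the test set.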

Timing
Run time depends on the computer setup, the number of spectra, and the number of variables (wavenumbers) in the dataset. The analysis of each dataset was practically instantaneous (<1 second) using the following computational settings: Intel® Core™ i7 (2.80 GHz) processor with 16.0 GB of RAM.

Troubleshooting
If the MLM algorithm does not work: verify that the MLM folder containing the MATLAB routines was added to the MATLAB path. Also, verify that the input numbers of samples add up correctly (i.e., the number of training samples plus the number of test samples equals the total number of samples).
If you cannot load the sample dataset: verify that your current working directory within MATLAB is the folder containing the dataset (folder named 'DATASET').

Anticipated Results
The sample dataset used in this protocol is composed of 140 spectra representing control brain tissue samples (class 1) and 100 spectra representing cancer (glioblastoma) brain tissue samples (class 2) (Figure 3a). Further details about this dataset can be found in Gajjar et al. (15). Samples were divided into training (70%) and test (30%) sets as depicted in Figure 2b. Two classification algorithms were applied: principal component analysis-linear discriminant analysis (PCA-LDA) (16) and partial least squares discriminant analysis (PLS-DA) (17). PCA-LDA was applied using 9 principal components (99% cumulative explained variance) with venetian blinds cross-validation (10 data splits). Similarly, PLS-DA was performed using 9 latent variables (98% cumulative explained variance) with venetian blinds cross-validation (10 data splits). Models were built using the Classification Toolbox for MATLAB (http://www.michem.unimib.it/) (18) and the PLS Toolbox version 7.9.3 (Eigenvector Research, Inc., US). Data were mean-centred before analysis. The classification performance of these algorithms in the training and test sets is shown in Table 1. In both PCA-LDA and PLS-DA, the accuracy values of the training and test sets are similar, indicating absence of overfitting. Also, the MLM algorithm provided well-balanced sensitivities and specificities, indicating that the classification methods have similar predictive performance in both classes (control and cancer). The PLS-DA model achieved the best classification performance, with an accuracy of 94% in the test set.
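
For readers who want to recompute the figures of merit discussed above from their own predictions, the Python sketch below shows the standard definitions of accuracy, sensitivity and specificity for a two-class problem. It is an illustration only and is independent of the toolboxes used in this protocol. The function name `classification_metrics` is hypothetical, the labels in the example are invented toy values rather than results from this dataset, and treating class 2 (cancer) as the positive class is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=2):
    """Accuracy, sensitivity and specificity for a two-class problem.

    ASSUMPTION: `positive` is the label of the cancer class (class 2);
    every other label is treated as negative (control).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # fraction of cancer samples correctly found
        "specificity": tn / (tn + fp),  # fraction of control samples correctly found
    }


# Toy labels (not data from this protocol): 6 controls and 4 cancer samples.
toy_true = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
toy_pred = [1, 1, 1, 1, 1, 2, 2, 2, 2, 1]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

Comparing these values between the training and test sets is one simple way to look for the overfitting mentioned above: a large drop from training to test suggests the model does not generalise.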
Figure 3b shows the discriminant function (DF) graph of PCA-LDA, where some superposition between control and cancer samples is observed. On the other hand, the DF graph for PLS-DA (Figure 3c) shows a clear separation between the two groups of samples, with only a few cancer samples misclassified as control. The receiver operating characteristic (ROC) curve for PLS-DA shows the strong performance of this algorithm in differentiating control and cancer brain tissue, with an area under the curve (AUC) value of 0.971 (Figure 3d). Glioblastoma is the type of brain cancer with the poorest survival rate and prognosis, and its clinical diagnosis depends heavily on subjective and time-consuming analysis (15). New clinical methodologies for tumour detection are needed in order to overcome these limitations, and IR spectroscopy, due to its non-destructive nature, fast data acquisition and processing, and relatively low cost, might aid this type of diagnosis in the future. This protocol demonstrates the use of sample selection, by means of the MLM algorithm, for building classification models with good predictive performance for IR spectral datasets in biologically derived applications.
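
The AUC reported above summarises how well the discriminant scores rank cancer samples above control samples. As a hedged illustration, the Python sketch below computes the area under the empirical ROC curve through its rank-based (Mann-Whitney) equivalent. The function name `roc_auc` is hypothetical, the scores are invented toy values rather than outputs of the PLS-DA model, and the convention that higher scores indicate the cancer class is an assumption.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the empirical ROC curve.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a pair.
    ASSUMPTION: higher scores indicate the positive (cancer) class.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy discriminant scores (not data from this protocol).
toy_cancer = [0.9, 0.8, 0.4]
toy_control = [0.7, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_cancer, toy_control)  # 11 of 12 pairs ranked correctly
```

An AUC of 1 corresponds to perfect separation of the two groups, and 0.5 to scores that carry no ranking information.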

Figures

Figure 1. A computational methodology for sample splitting based on a combination of the Euclidean-distance methodology of KS with a random-mutation factor to optimize sample selection. (a) Flowchart for IR data processing in classification applications; (b) illustration of sample selection using the MLM algorithm.

Figure 2. Using the MLM algorithm. (a) Example dataset within MATLAB, containing 140 spectra for class 1 and 100 spectra for class 2; (b) commands for running the MLM algorithm.