ECG Biometrics using RNN and CNN

Background: Biometric Systems (BS) are based on a pattern recognition problem where the individual traits of a person are coded and compared. The Electrocardiogram (ECG) has emerged as a biometric trait, as it fulfills the requirements of a BS. Methods: Inspired by the high performance shown by Deep Neural Networks (DNN), this work proposes two architectures to improve current results in both identification and authentication: a Temporal Convolutional Neural Network (TCNN) and a Recurrent Neural Network (RNN). The outputs of both networks are submitted to a simple classifier, which exploits the prediction error of the RNN and the scores given by the TCNN. Results: The robustness and applicability of these architectures were tested on the Fantasia, MIT-BIH and CYBHi databases. The TCNN outperforms the RNN, achieving accuracies of 100%, 96% and 90%, respectively, for identification, and equal error rates of 0.0%, 0.1% and 2.2% for authentication. When compared to previous work, both architectures reached results beyond the state-of-the-art.


Introduction
Some traditional methods of identification and authentication, such as key possession or username-password, may prove ineffective against terrorist and criminal acts [1]. Key areas of society demand reliable solutions for border control, criminal identification and electronic transactions. Consequently, the use of individual traits, i.e. biometrics, is trending in both the private and public sectors. A report from Acuity Market Intelligence in 2015 expected the biometrics market to grow not only in the USA and Europe but also in Asia and Africa, with projections increasing from USD 10.74 Billion in 2015 to USD 32.73 Billion by 2022, reflecting the need to further develop these systems [2,3].

Biometrics
Biometric authentication (proving that a person is who they claim to be) and identification (finding the registered person from a sample) constitute a pattern recognition problem in which physical or behavioral traits are submitted to a feature extraction module during the enrollment stage and compared to the biometric database of a population during the verification phase [1,4]. The evaluation of a Biometric System (BS) is made along the dimensions of universality, uniqueness, permanence, measurability, performance, acceptability and circumvention [5]. Given its success in fulfilling all these premises, the Electrocardiogram (ECG) morphological signature has emerged in the last decade, with the added benefit of being difficult to reproduce.

ECG Biometrics
The ECG, representing the electrical activity of the heart, has a cyclic structure. Each cycle, corresponding to a heartbeat, presents a characteristic morphology composed of: (1) P wave - contraction of the atria; (2) PQ interval - time between the contraction of the atria and the activation of the ventricles; (3) QRS complex - the combination of the Q, R and S waves, associated with the contraction of the ventricles; (4) ST interval - relaxation after the ventricular contraction; (5) T wave - repolarization of the ventricles in preparation for the next cycle of contractions [6]. The main challenges of non-invasive ECG measurements are the intra-subject variability due to signal corruption (artifacts and noise) [6,7] and alterations in physical state, affective status and drug effects [6][7][8].
There are two main types of ECG biometric systems: fiducial-based, in which heartbeat templates are sampled from the ECG signal and have their relevant points and segments computed; and non-fiducial-based, which considers the ECG signal as a whole and computes features on that premise. Some studies combine both methods, in so-called partially-fiducial approaches, to achieve better results [6,9,10].

Objective
In light of the current trend in Deep Neural Networks (DNN), and as a way to improve results in both identification and authentication without extracting human-crafted features, two Neural Network (NN) architectures are proposed (Fig. 1). The first is a direct application of previous work [11], where we proposed a non-fiducial system that uses a Recurrent Neural Network (RNN) capable of synthesizing ECG signals. The biometric solution relies on the hypothesis that, if a model is trained with the signal from person A, its prediction error can be used as a score, which will be higher when the model is fed signals from persons B or C.
The second system is partially-fiducial and exploits the automatic feature learning ability of a Convolutional Neural Network (CNN). This model outputs a score based on the probability function given by the network. Both scores are submitted to the same simple classifier, the Relative Score Threshold Classifier (RSTC), which compares the scores and finds the most probable source for the given sample or batch of samples.
Both methods will be tested on three well-known public databases: Fantasia, MIT-BIH and CYBHi [12][13][14][15]. This document is organized as follows: Section 2 addresses the state-of-the-art in ECG biometrics for each dataset; Section 3 describes the proposed techniques and system architecture; results and conclusions are discussed in Sections 4 and 5, respectively.

Figure 1: Proposed architectures. One of the networks has N blocks while the other has N − 2; both end in a fully connected network, and the resulting vectors are summed to give the output for classification.

Related Work
Recently, research in ECG biometrics has been using DNNs due to their popularity. Most architectures comprise CNN modules that both process the signal, by automatically learning features, and classify it. The inputs to these networks may vary: Labati et al. (2018) [16] use the raw signal in a fiducial system, while Zhang et al. (2017) [17] describe a non-fiducial system that feeds the network with time-frequency spectrograms obtained by discrete wavelet transform.
For benchmarking purposes, this section will address related work with respect to reference databases: the Fantasia and MIT-BIH datasets from Physionet, which are extensively used for biometrics research, and the CYBHi database, an off-person ECG database. This last database has been considered by Merone et al. (2017) [12] as one of the two best databases in terms of acquisition protocol and hardware for biometric studies, in comparison with fifteen databases.

Fantasia
Fantasia is a database from PhysioNet which was created under the auspices of the National Center for Research Resources of the National Institutes of Health [18]. This dataset contains 20 subjects aged between 21 and 34 years old and 20 aged between 68 and 81, each recorded during 120 min of continuous supine resting ECG while watching the Disney movie Fantasia [13].
This database was used by [19], where a model uses a reduced version of a set of fiducial features obtained by applying Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), information-gain ratio and rough sets (RS), achieving an accuracy of 90 ± 8% with PCA, a False Rejection Rate (FRR) of 5% with LDA and a False Negative Rate (FNR) of 4% with RS [19]. The same team published a method that extracts and decomposes R-R intervals using the Discrete Wavelet Transform (DWT), obtaining an accuracy of 95.89%, an FRR of 0% and an FNR of 5% [20]. The work of Gargulio et al. (2015) [9] analyses the influence of HRV on QT intervals and suggests their correction to improve performance, leading to identification rates with MultiLayer Perceptron (MLP) and Support Vector Machine (SVM) classifiers between 97% and 99% [9]. The study by Zhang et al. (2017) [17] described above achieved an accuracy of 97.2% on the Fantasia dataset.

MIT-BIH
The MIT-BIH Database has been available since 1999 on PhysioNet. The basic subset of this database, MIT-BIH Arrhythmia, contains ECG records from 47 subjects, sampled at 360 Hz with 11-bit resolution at Boston's Beth Israel Hospital. More subjects, with no significant arrhythmic episodes, were later added to this dataset: 18 subjects in MIT-BIH Normal Sinus and 7 individuals in MIT-BIH Long-Term [14].
This database has been used not only for validating arrhythmia detectors and other cardiac dynamics research, but also in the biometrics field. The MIT-BIH Normal Sinus Rhythm dataset was used by [21], which proposes a Piecewise Linear Representation for feature extraction and classification using similarity measures from the Dynamic Time Warping (DTW) method. The identification result was 100%, with a minimum half total error rate (HTER) of 0.2% [24].
Rahbi (2013) [22] reached an accuracy of 99.07%, using a combination of ten heartbeat features and sixty coefficients of the Hermite Polynomial Expansion as input to a Hidden Markov Model (HMM) [22].
Several classification methods applied to characteristics of the QRS complex were compared by Sidek et al. (2014) [23], achieving accuracies of 98.3% for the Bayes Network and 99.07% for Naive Bayes, the Multilayer Perceptron and k-Nearest Neighbors on the same database [23].
A semi-fiducial approach was suggested by Ye & Coimbra (2012) [24], in which a Wavelet Transform was applied together with Independent Component Analysis (ICA) to detect each heartbeat; this information, together with RR intervals, was fed to an SVM classifier. For the selected 23 records with normal sinus rhythm from the Arrhythmia dataset, they reached 86.4% accuracy for subject evaluation, even though the focus was on classifying heartbeat classes. Zhang et al. (2017) obtained an accuracy of 90.3% for the 47 individuals of MIT-BIH [17].

CYBHi
The Check Your Biosignals Here initiative (CYBHi) dataset was acquired without the use of conventional skin electrodes. Silva et al. (2013) [15] state that they "devised a data acquisition framework and experimental setup, for large scale data collection from a large group of subjects through an easily repeatable and efficient procedure". Both sensors had hand-shaped supports synchronized via the syncPLUX synchronization kit, with the electrodes placed on the fingers while the subjects were seated for 2 min in a resting position. The used data consists of long-term sessions, where two acquisitions were made 3 months apart, including 63 participants, 14 males and 49 females (18-24 years old) [15]. For the purpose of this paper, the long-term sessions from the first acquisition will be named M1 and those from the second acquisition M2.
The outcome for authentication in this study was an Equal Error Rate (EER) of 9.1%, using an SVM classifier over the correlation between ECG templates [15]. In another study that used this database, Lourenço

Methods
As stated before, two architectures were compared and explored using the same classification method, the Relative Score Threshold Classifier (RSTC). The setup comprises a TCNN and an RNN, both trained to create an internal representation of the waves. The difference between the two architectures is that the TCNN score is given by the output of its last layer, whereas in the RNN approach the score is given by the prediction error.
In order to train and test the models, heavily corrupted windows are removed using two different methods, both assuming that most of the signal is clean: (1) the first keeps only the windows whose mean and standard deviation lie within intervals around the medians of the per-window mean and standard deviation values, with tolerances depending on the database noise factor; (2) the second uses the algorithm developed by Rodrigues et al. (2017) [7], which relies on the standard-deviation feature. The second method was used for the CYBHi database, where the first was inefficient due to the low signal-to-noise ratio.
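The first rejection method can be sketched as follows. This is a minimal illustration, assuming windows stored as rows of a NumPy array; the exact definition of the tolerance intervals around the medians is an assumption, as are the default tolerance values:

```python
import numpy as np

def reject_corrupted(windows, mean_tol=0.2, std_tol=0.05):
    """Keep windows whose mean and standard deviation fall inside
    tolerance bands around the per-window medians (rejection method 1).
    The band widths are scaled by the median standard deviation."""
    means = windows.mean(axis=1)
    stds = windows.std(axis=1)
    med_mean = np.median(means)
    med_std = np.median(stds)
    keep = (np.abs(stds - med_std) <= std_tol * med_std) & \
           (np.abs(means - med_mean) <= mean_tol * med_std)
    return windows[keep]
```

A window dominated by a flat-line artifact or a large-amplitude spike shifts its mean or standard deviation away from the medians and is dropped, while typical windows survive.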

Recurrent Neural Network Approach
In the RNN approach, data is first pre-processed (Fig. 2) and then transformed by an embedding matrix (E), three sequential Gated Recurrent Units (GRU) (G), a dense layer with linear activation and a softmax node [11].

Pre-processing
The pre-processing step starts by subtracting a moving-average window from the signal and normalizing it by a moving absolute-maximum window, followed by a convolution with a Hanning window. The benefit of this approach is to mitigate unwanted frequencies from the signal acquisition. Due to the nature of the GRU networks, a quantization of the signal must be performed, which consists of reducing the continuous set of amplitude values into a limited set of integer values of dimension S_D. In order to reduce the influence of artifacts and extract only the significant information, the signal amplitude was first clipped according to a value of confidence: the edges of the amplitude histogram outside a parameterized confidence value were removed. This parameter was typically 0.5%, but could change depending on the corruption of the signal. Finally, the resulting signal (x) is transformed using the following equation:

x̂_n = ⌊ (x_n − min(x)) / (max(x) − min(x)) · (S_D − 1) ⌋,

where x_n is the n-th sample of the input vector and x̂_n takes a value k ∈ {0, 1, ..., S_D − 1}, the set of possible integer values. The higher the signal dimension (S_D), the more detail is retained from the signal. The last step is the segmentation of the signal into sliding windows of size W with overlap, depending on the signal length and the number of time-windows (TWs).
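The clipping-and-quantization step can be sketched as below. The min-max mapping to S_D integer levels is our reconstruction from the surrounding definitions, and the default values of S_D and the confidence parameter are illustrative assumptions:

```python
import numpy as np

def quantize(x, s_d=64, confidence=0.005):
    """Clip the amplitude tails outside the confidence band (0.5% by
    default), then map the continuous amplitudes to integers
    k in {0, ..., s_d - 1} via a min-max scaling."""
    lo, hi = np.quantile(x, [confidence, 1.0 - confidence])
    x = np.clip(x, lo, hi)
    return np.floor((x - x.min()) / (x.max() - x.min()) * (s_d - 1)).astype(int)
```

The clipping keeps a single large artifact from compressing the useful amplitude range into a few quantization levels.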

Architecture
The architecture, depicted in Fig. 1a, comprises an embedding matrix, commonly used for text processing, which functions as a translation mechanism between the input and the three GRU layers (G). Each input integer selects the x_n-th column of the matrix E. The matrix starts with random values but is optimized as training progresses, adjusting itself to the input along with the other parameters.
The GRU is a type of RNN characterized by learning sequential patterns, where the current state depends on the previous ones, while capturing short-term and long-term dependencies through memory management [26,27].
In our previous work [11], this architecture was able to synthesize biosignals, supported by the fact that the prediction error of a model is lower when the input signal comes from the source it was trained on than when it comes from other sources. In the case of biometry, this notion is extended to the individual source of the signal, where the network may be able to learn the individual intricacies of each ECG. Following this reasoning, the produced dissimilarity score S(p, i, w) is given by:

S(p, i, w) = e(p, i, w) / max(e(:, i, w)),

where e(p, i, w) is the prediction error made by predictor p on the time-window w of the signal belonging to individual i, and max(e(:, i, w)) is the maximum error over all predictors for the same window w.
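Given a precomputed prediction-error tensor e, indexed as (predictor, individual, window), the dissimilarity score amounts to a normalization over the predictor axis; a minimal sketch:

```python
import numpy as np

def rnn_scores(e):
    """S(p, i, w) = e(p, i, w) / max over predictors of e(:, i, w).
    A low score means predictor p explains the window well, i.e. the
    window likely belongs to the individual that trained p."""
    return e / e.max(axis=0, keepdims=True)
```

The predictor with the largest error on a given window always receives a score of 1, so scores are directly comparable across windows of very different absolute error magnitude.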

Temporal Convolutional Neural Network
The second proposed method is a two-stream TCNN [28], which uses 1-dimensional convolutional layers with dilated convolutions to learn temporal patterns. It combines predictions from two different inputs: non-fiducial ECG window segments and a sub-segment of the same window containing a single full cycle centered on the R peak of the QRS complex.
This architecture is depicted in Fig. 1b, where each individual network is composed of several convolutional layers with 24 kernels of size 4, each followed by batch normalization and a ReLU activation function. Each network ends with two fully connected layers, the first with 256 units and the second with as many units as there are individuals, producing one logit vector per network. The scale differences between the inputs used by each network and the progressive dimensionality reduction of the TCNN lead to different numbers of layers. For the Fantasia and MIT-BIH databases, the number of convolutional layers is 6 for the non-fiducial network and 4 for the individual-cycle network. Due to the higher sampling frequency of the CYBHi database, the number of layers is adjusted to 8 and 6, respectively.
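The dilated 1-D convolution at the core of the TCNN can be illustrated with a minimal single-channel NumPy implementation. The 'valid' output size is an assumption (the paper does not specify padding), and, as usual in deep learning, the "convolution" is implemented as a correlation:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1-D correlation with `dilation - 1` gaps between kernel
    taps. Stacking layers with dilations 1, 2, 4, ... makes the
    receptive field grow exponentially with depth, which is what lets
    the TCNN capture long temporal patterns with few layers."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive span of this layer
    out = np.empty(len(x) - span + 1)
    for t in range(len(out)):
        out[t] = sum(kernel[j] * x[t + j * dilation] for j in range(k))
    return out
```

With a kernel of size 4 and dilation d, one layer spans 3d + 1 samples; doubling d at each layer covers a 2-second ECG window in a handful of layers.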
These networks are trained independently using the cross-entropy loss function and the Adam optimizer. After training, the logit vectors from both networks are fused with the sum rule. The score in this case is given by subtracting the normalized output vector (ô) from one:

S(p, i, w) = 1 − ô(p, i, w),

where ô is the fused logit vector normalized to [0, 1].
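The sum-rule fusion and score computation can be sketched as follows; the min-max normalization of the fused logits to [0, 1] is our assumption for the normalized output vector:

```python
import numpy as np

def tcnn_score(logits_a, logits_b):
    """Fuse the two streams' logit vectors with the sum rule, min-max
    normalize to [0, 1], and return 1 minus the normalized output,
    so the most likely class receives the lowest (best) score."""
    o = logits_a + logits_b
    o_n = (o - o.min()) / (o.max() - o.min())
    return 1.0 - o_n
```

Inverting the normalized output makes the TCNN score a dissimilarity, on the same "lower is better" scale as the RNN prediction-error score, so both feed the same RSTC classifier.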

Relative Score Threshold Classification
The RSTC is a simple method that classifies by choosing the lowest normalized similarity score S̄(p, i, b) for each batch of windows b; consequently, S is a 3-dimensional tensor whose dimensions are the number of predictors, the number of individuals and the number of batches, where each batch contains B windows. This measurement is obtained by normalizing, for each predictor, the minimum over a set of windows. The minimum is the result of the following function:

S(p, i, b) = min{ S(p, i, B·b), S(p, i, B·b+1), ..., S(p, i, B·b+B−1) },   (4)

and the normalization for each b in relation to p is made with the following rule:

S̄(p, i, b) = (S(p, i, b) − min_p S(p, i, b)) / (max_p S(p, i, b) − min_p S(p, i, b)),

where the minimum and maximum are taken over all predictors for each batch of windows. Considering that S̄(p, i, b) ∈ [0, 1], the value 0 encodes the most probable class, and values closer to 1 encode lower probability. The class of the individual is given by:

C^I(i, b) = argmin_p S̄(p, i, b).
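The RSTC can be sketched end-to-end over a window-score tensor of shape (predictors, individuals, windows): the per-batch minimum of Eq. (4), the min-max normalization over predictors, and the argmin class decision. This is a minimal sketch; handling of a trailing partial batch is an assumption (it is simply dropped here):

```python
import numpy as np

def rstc(scores, batch_size):
    """scores: (P, I, W) tensor of window scores, lower = more similar.
    Returns the predicted predictor index for each (individual, batch)."""
    p, i, w = scores.shape
    n_batches = w // batch_size                      # drop a partial tail batch
    s = scores[:, :, :n_batches * batch_size]
    s = s.reshape(p, i, n_batches, batch_size).min(axis=3)   # Eq. (4)
    s_min = s.min(axis=0, keepdims=True)             # over predictors
    s_max = s.max(axis=0, keepdims=True)
    s_bar = (s - s_min) / (s_max - s_min)            # normalized score in [0, 1]
    return s_bar.argmin(axis=0)                      # class decision C^I
```

Taking the minimum over a batch before normalizing means a single well-predicted window is enough for the correct predictor to win, which is why accuracy improves with batch size.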

Evaluation
The identification evaluation of the multimodal classifier (C^I) is given by accuracy, specificity and sensitivity. The authentication evaluation uses the EER, obtained when the FNR equals the False Positive Rate (FPR) for each binary classifier (C^A). The authentication mode is evaluated with this measurement, as these systems prioritize minimizing both the number of imposters that can access the system and the number of rightful individuals that cannot log in. Each binary classifier is built by comparing the scores of all predictors p, for each individual i and all windows w, against a varying threshold (ϕ). This system is fine-tuned for each individual, and the output indicates whether a specific batch b of windows belongs to the claimed individual:

C^A(p, i, b) = 1 if S̄(p, i, b) ≤ ϕ, and 0 otherwise.

The Receiver Operating Characteristic (ROC) can be calculated over an array of such thresholds, from which the EER is obtained. Since each individual has its own fine-tuned ROC, the mean and standard deviation are calculated and used for validation. An example of how these thresholds are obtained is depicted in Fig. 3.
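The EER of one binary classifier can be obtained by sweeping the threshold ϕ and finding where the FNR crosses the FPR. A minimal sketch over arrays of genuine and imposter scores, assuming scores normalized to [0, 1] with lower meaning more genuine, as in the text; the grid-search resolution is an arbitrary choice:

```python
import numpy as np

def eer(genuine, imposter, n_thresholds=1000):
    """Sweep thresholds over [0, 1]: a sample is accepted when its
    score <= threshold. Return the error rate at the point where the
    false negative and false positive rates are closest."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    fnr = np.array([(genuine > t).mean() for t in thresholds])   # rejected genuines
    fpr = np.array([(imposter <= t).mean() for t in thresholds]) # accepted imposters
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2.0
```

When genuine and imposter score distributions do not overlap, some threshold yields FNR = FPR = 0 and the EER is exactly zero, as observed for the TCNN on Fantasia.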

Results
This section presents the results for each database, ordered by increasing average intra-subject variability. Apart from the validation of the identification and authentication modalities, the Fantasia dataset serves as the benchmark to test the methods used and the robustness of the proposed algorithms.

Fantasia
The first hour of the Fantasia dataset was split at the 20-minute mark for cross-validation, with the first part used for testing. All data was segmented into windows of 512 samples (approximately 2 s) with an overlap of 67%. On average, 1464 of 3787 windows per subject were selected. The rejection algorithm had mean and standard-deviation tolerances of 0.2 and 0.05, respectively. The score distribution per subject when the test set is fed into the RNN model trained with ECG 8 is depicted in Fig. 3; the scores on the right are given for all batches of windows with B = 1 and those on the left with B = 20.
When analyzing this example, it can be observed that increasing the number of windows in each batch reduces both the mean and the variability. The three thresholds presented in Fig. 3 produce different binary classifications for the predicted ECG 8 when submitted to different values of B. In case (a), the δ1 threshold gives a positive value for most of ECG 23, which is false, while both δ2 and δ3 also include the ECG 8 batches, but the number of imposters increases with the threshold value.
When B increases to 20, the variance and mean value of ECG 8 decrease significantly, and δ1 now classifies correctly most, or even all, of the ECG 8 batches while reducing the number of wrongly classified ECG 23 batches, ensuring lower values for the FNR and FPR. Both δ2 and δ3 produce more imposters without increasing the number of correct classifications. The ROC curve is generated for each predictor and each batch size, producing different EER values that can be interpreted for authentication systems.
The identification rate, i.e. the accuracy of the identification modality, for both algorithms is depicted in Fig. 4a. Close inspection reveals that both algorithms increase in accuracy with time per batch, but the TCNN algorithm starts with higher accuracy, being outperformed by the RNN after approximately 1 minute of signal per batch. While the RNN algorithm keeps increasing until reaching 100% at approximately 112 s, the TCNN algorithm reaches 99.1% after approximately 90 s, but its curve displays a more stable behavior. The best results are presented in the confusion matrices of Figs. 4b and 4c for the RNN and TCNN approaches, respectively. The results for the authentication mode are depicted in Figs. 4d and 4e, which show the evolution of the EER with the time per batch. Both RNN and TCNN achieve values very close to 0% at 80 s, but the TCNN reaches those values, with a lower standard deviation, faster than its RNN counterpart.

MIT-BIH
All the MIT-BIH signals were resampled to 250 Hz to ensure that the same sampling frequency is maintained across all datasets for a rigorous comparison. The rejection rate was 42%, with tolerance parameters of 0.9 for the mean and 0.5 for the standard deviation.
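Resampling to a common 250 Hz can be done, for instance, with linear interpolation; this is a simple stand-in, since the exact decimation method used is not specified in the text:

```python
import numpy as np

def resample(signal, fs_in, fs_out):
    """Linearly interpolate `signal` from fs_in Hz to fs_out Hz,
    preserving the total duration of the recording."""
    duration = len(signal) / fs_in
    n_out = int(round(duration * fs_out))
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(n_out) / fs_out
    return np.interp(t_out, t_in, signal)
```

In practice, a polyphase resampler with an anti-aliasing filter (e.g. scipy.signal.resample_poly) would be preferable when downsampling, which relates to the loss-of-information concern raised below for the RNN.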
The identification test performed on the MIT-BIH database, shown in Fig. 5a, reveals that the accuracy curves behave similarly to those of the Fantasia dataset. For the sake of comparison with the bibliography, the Normal Sinus Rhythm dataset was also analyzed in detail (Fig. 5a). Even though the RNN approach does not go above 80% for the full MIT-BIH, the confusion matrix (Fig. 5b) shows the expected diagonal. Unfortunately, some models over-generalize the ECG signals and confuse the correct class, probably due to the loss of information during resampling, to the sensitivity to noise, or to the symptomatic behavior of the arrhythmia events. The last hypothesis is supported by the fact that the Normal Sinus Rhythm subset reached 100% accuracy.
As for the TCNN, even though the best value of 96.4% is only reached after more than 2 min for the full MIT-BIH database, values close to 96% are reached after 10 s, reaffirming its robustness to noise and its stability. The confusion matrix in Fig. 5c shows that only two predictors display faulty behavior, probably due to the presence of arrhythmia events. For the individuals of the Normal Sinus Rhythm subset, the accuracy reached 100% after approximately 10 s (Fig. 5a).
The authentication evaluation for all the MIT-BIH individuals is displayed in Figs. 5d and 5e. These figures show that the TCNN approach reaches EER values close to 0% with 72 s per batch, against 1.7% for the RNN. Even though the best result for the TCNN comes after 1 min, its values are quite low from the start. For the Normal Sinus Rhythm database (Fig. 5a), the TCNN reached an EER of 0%, while the RNN reached 0.6%.

CYBHi
As mentioned before, the CYBHi database comprises two moments with a time distance of two months. Therefore, both algorithms are evaluated with all the combinations of both moments (M1 and M2) in terms of cross-validation, simulating the enrolment and verification environments of a biometric system. This means that each model was trained using the first element and tested with the second of the following sets: "M1 vs M1", "M2 vs M2", "M1 vs M2" and "M2 vs M1". When the training and testing moments were the same, the data was split in half and the segmentation was made with an 89% overlap to increase the number of available time-windows.
The clustering method rejected 8.93±22.2% of the total signal length on the first session and 9.97±24.2% on the second.
Figure 5: Results for the MIT-BIH dataset. 5a shows the identification results with increasing batch sizes; 5b and 5c are the confusion matrices for the best RNN and TCNN results, respectively, with the corresponding metrics of accuracy (acc), specificity (sp) and sensitivity (st); 5d and 5e show the evolution of the EER values over time for both approaches.

Fig. 6a shows that the decrease in accuracy reflects the signal corruption this dataset faces. Since the dataset displays high within-subject variability due to the low Signal-to-Noise Ratio (SNR), the robustness of the method is imperative for good results. The TCNN algorithm achieves accuracies above 90% for "M1 vs M1" and 100% for "M2 vs M2", even though the amount of information was significantly lower than in the previous datasets. The decrease when crossing both moments is as significant as expected, with accuracies above 75% for both "M1 vs M2" and "M2 vs M1". The low accuracy presented by the RNN approach reflects its sensitivity to noise. Fig. 6b shows that the EER reaches 4.3% and 6.1% for "M1 vs M1" and "M2 vs M2", respectively, using the RNN models, while the TCNN displays results close to 0%, probably due to the temporal proximity of the training and testing sets. As for the evaluation across different moments, the EER for the RNN approach is 18.3% and 18.8% for "M1 vs M2" and "M2 vs M1", respectively, against 2.2% and 4.1% for the TCNN in the same order. Even though these results are based on a larger quantity of information, they outperform the current state-of-the-art for off-person measurements. Table 1 summarizes the results per database for the consulted bibliography. This table shows that the TCNN approach outperforms most of the other studies in both identification and authentication paradigms. The robustness of this architecture stems mainly from bringing together both fiducial and non-fiducial machine-learned characteristics of the signal through the fusion layer.

Comparative Study
Unfortunately, the computational time of the TCNN increases significantly with the number of samples per window and the number of individuals, which imposes some limitations on providing a biometric system in a real scenario:
• when a new person is added to the model, all the training must be repeated;
• the approach is viable for a suitable number of individuals, but at the scale of millions the training and testing process would be costly or even impossible with state-of-the-art computational power and memory;
• another disadvantage is the need to extract the peak information, which sometimes is not feasible, especially when the slow waves (such as T or U) reach higher values than the R wave.
Even though the RNN achieves state-of-the-art results for Fantasia and the Normal Sinus Rhythm subset of MIT-BIH, its sensitivity to changes in the signal is higher than that of the aforementioned architecture. The most likely causes of this sensitivity are:
• noise may have a high impact on the quantization process, especially random artifacts that shift the maximum and minimum values of the signal;
• the loss of information in the pre-processing stage caused by the decimation method, which could be important in establishing the difference between individuals;
• the need for larger quantities of training data, which could cover more angles of the ECG mechanism.
Another limitation of this system is the required computational power: due to its recursive nature, it is impossible to parallelize across sequential samples, leading to an average training time of 35.5 hours for 512 windows of 512 samples on an Nvidia GTX 1080 Ti. In terms of testing, the score calculation for each window is on the scale of milliseconds.
Table 1: Summary of identification accuracy and authentication EER per database ("-" denotes a value not reported).

Fantasia:      [19] 90.0% / -;  [20] 95.9% / -;  [9] 99.0% / -;  RNN 100% / 0.02%;  TCNN 99.1% / 0.02%
MIT-BIH Sinus: [29] 98.1% / -;  [30] 99.6% / -;  [29] 94.5% / -;  [22] 99.0% / -;  [23] 99

The main advantages of the RNN architecture are:
• it is unnecessary to retrain all the models when adding a new person to the biometric system, as the individual traits of one person are preserved in a single model that is independent of the others;
• the non-fiducial approach removes the need to extract the R peak from the signal;
• the approach achieves excellent results on signals with good SNR, so the system is viable for good-quality acquisitions;
• the possibility of generating the signal using the same method as in [11], to observe the differences with respect to the individual and track models that did not learn correctly.

Conclusions
This study has shown the usefulness of DNNs in enhancing current biometric systems. These technologies, allied with the massive collection and storage of biological data, will provide powerful and robust systems, increasing the level of security and reducing concerns about counterfeiting. Even though the RNN and TCNN systems provided good results, their limitations could be surpassed if both architectures were combined. Other state-of-the-art downsampling methods could also be applied to mitigate the loss of information while increasing the training speed. Noise contamination is also a major concern; by employing other filtering methods, the system could get a better grasp of the intricate characteristics only present in a clean signal.
The next steps towards a real-life scenario would be the introduction of transfer learning in the training process, the acquisition of more time-windows per person, the introduction of more individuals, and the increase of the number of acquisition moments through time. Studying the impact of noise and the scalability of these systems would also add value.
One factor that could improve generalization and detection capabilities would be the inclusion of different electrode configurations for clinical-grade ECG, so that the DNN model learns the different angles of the signal, gaining robustness over different placements. The analysis of different emotional and mental states could also be very important, as the ECG signal can change under different conditions.