
Detecting Anomalies in the MIT-BIH Arrhythmia Dataset

by Veronica Scerra

Biological time-series anomaly detection using ECG beat morphology and timing with autoencoder models.

For my second stab at an anomaly detection project, I wanted to try something that felt a little more intuitive than the NSL-KDD dataset, so I opted for biological data. This project explores anomaly detection techniques in biological time-series data, with a focus on whether beat-level morphology or temporal beat sequences provide the more effective signal for detecting abnormal cardiac events. Put more plainly: is it the beat itself that is abnormal, or is it the timing of the beat that makes it abnormal?

Dataset and Preprocessing

I started by selecting a subset of the MIT-BIH dataset, working with records 100-104. Each record contains multi-channel ECG data annotated with per-beat heartbeat classifications (labeled data). I cleaned each signal with a 0.5-40 Hz bandpass filter and a 60 Hz notch filter, z-score normalized it to a common scale, and extracted individual beats based on the R-peak annotations.
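Roughly, the cleaning step looks like the sketch below, a minimal version using wfdb and scipy. The filter orders, notch Q factor, and the local "mitdb" directory are assumptions for illustration, not the exact pipeline.

```python
import wfdb
from scipy.signal import butter, filtfilt, iirnotch

FS = 360  # MIT-BIH sampling rate (Hz)

def preprocess_record(record_name, data_dir="mitdb"):
    """Load one record, clean the first ECG channel, and return it with its annotations."""
    record = wfdb.rdrecord(f"{data_dir}/{record_name}")
    ann = wfdb.rdann(f"{data_dir}/{record_name}", "atr")
    signal = record.p_signal[:, 0]

    # 0.5-40 Hz bandpass to remove baseline wander and high-frequency noise
    b, a = butter(3, [0.5, 40.0], btype="band", fs=FS)
    cleaned = filtfilt(b, a, signal)

    # 60 Hz notch for powerline interference
    b_n, a_n = iirnotch(60.0, Q=30.0, fs=FS)
    cleaned = filtfilt(b_n, a_n, cleaned)

    # Z-score normalization
    cleaned = (cleaned - cleaned.mean()) / cleaned.std()
    return cleaned, ann
```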

Each beat was stored as a fixed-length window centered around the R-peak. To preserve temporal order for later sequence modeling, I stored sample indices and R-R intervals (the time between consecutive R-peaks).
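Beat extraction then amounts to slicing windows around the annotated R-peaks, continuing from the sketch above. The 128-sample half-width is an illustrative choice, not necessarily the one used here.

```python
import numpy as np

def extract_beats(signal, ann, half_width=128):
    """Slice a fixed-length window around each annotated R-peak and record its RR interval."""
    beats, labels, rr_intervals, positions = [], [], [], []
    for i in range(1, len(ann.sample)):
        r = ann.sample[i]                       # R-peak sample index
        if r - half_width < 0 or r + half_width > len(signal):
            continue                            # skip beats too close to the record edges
        beats.append(signal[r - half_width:r + half_width])
        labels.append(ann.symbol[i])            # beat annotation symbol ('N', 'V', '/', ...)
        rr_intervals.append((ann.sample[i] - ann.sample[i - 1]) / FS)  # seconds since previous beat
        positions.append(r)
    return (np.array(beats), np.array(labels),
            np.array(rr_intervals), np.array(positions))
```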

Exploratory Analysis

Label Frequency Distribution. The MIT-BIH dataset is labeled, not simply as “normal” and “abnormal”, but on a very informative spectrum from “normal” through localized abnormalities (“Left bundle branch block”, “Right bundle branch block”), beats classified by anatomical subregion (“Atrial premature beat”, “Premature ventricular beat”, “Ventricular escape beat”, “Fusion of ventricular and normal beat”), and even beats of extra-biological origin (“Paced beat”). So my first task was deciding what counted as “abnormal”. Plotting the distribution of beat labels revealed that the dataset was largely composed of “normal” beats, with “paced” beats taking second place and a handful of other non-normal beats trailing far behind. This makes sense, as we’d expect hearts to mostly beat normally. It is also a classic anomaly detection scenario: sifting through a mountain of normal instances to find the rare abnormal occurrences.
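For reference, counting the annotation symbols across the selected records looks roughly like this, using the hypothetical helpers sketched above ('N' is the normal beat symbol and '/' the paced beat symbol in MIT-BIH):

```python
from collections import Counter

label_counts = Counter()
for rec in ["100", "101", "102", "103", "104"]:
    sig, ann = preprocess_record(rec)
    _, labels, _, _ = extract_beats(sig, ann)
    label_counts.update(labels)

# 'N' (normal) dominates, with '/' (paced) a distant second and other symbols trailing
print(label_counts.most_common())
```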

PCA Visualization of Beat Morphologies. To understand the morphological differences among the various beat types in the dataset, I performed a Principal Component Analysis (PCA). I had hoped that the beat categories might be visually separable, indicating clear structural differences. Taken all together, the projection showed varying densities of beats rather than clean clustering along labeled lines: normal and abnormal beats formed broad, overlapping clusters, suggesting some, but not complete, morphological separability.
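The projection itself is only a few lines with scikit-learn; a minimal sketch, with plotting details that are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(beats)   # beats: (n_beats, window_length)

is_normal = labels == "N"
plt.scatter(coords[is_normal, 0], coords[is_normal, 1], s=4, alpha=0.3, label="normal")
plt.scatter(coords[~is_normal, 0], coords[~is_normal, 1], s=4, alpha=0.3, label="non-normal")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```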

Exploration of RR-Intervals by Label Type. To build on the foothold that PCA gave in terms of morphological separability, I overlaid RR-intervals as a color code, hoping that the small differences in density could be enhanced with added temporal information, but alas, no such luck. The inter-beat intervals showed no clear correlation with the morphological differences between beats. Some part of this arose, no doubt, from the preponderance of “paced” beats in the sample: paced beats are designed to resemble normal beats with stable rhythm timing. The clinical context is that paced beats are only “abnormal” in that they’re artificial; otherwise, they’re intentionally regular and usually indistinguishable from normal sinus beats.
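The RR-interval overlay reuses the same 2-D coordinates from the PCA sketch, swapping the label colors for a continuous colormap:

```python
# Same PCA coordinates as above, colored by each beat's preceding RR interval
sc = plt.scatter(coords[:, 0], coords[:, 1], c=rr_intervals, s=4, cmap="viridis")
plt.colorbar(sc, label="RR interval (s)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```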

Given the above results, I decided to exclude the paced beats from modeling: at best they mimicked normal beat morphology and timing, and at worst they presented unique challenges in interpretation. From this point on, paced beats were excluded, and for labelling purposes I relabeled all non-“normal” beats as “abnormal”.
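Concretely, the relabeling amounts to dropping the '/' (paced) annotations and collapsing everything else to a binary label; variable names here are illustrative:

```python
import numpy as np

keep = labels != "/"                        # '/' is the MIT-BIH symbol for paced beats
beats_kept = beats[keep]
rr_kept = rr_intervals[keep]
binary_labels = np.where(labels[keep] == "N", "normal", "abnormal")
```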

Modeling Approaches

Dense Autoencoder

I started by training a feedforward autoencoder on beat morphology, using only normal beats. The model learned to reconstruct these normal beats with high fidelity. I used the per-beat reconstruction error (MSE) as the anomaly score and swept threshold levels to find the one that maximized F1 score against the labeled abnormal beats. After optimizing the threshold, the dense autoencoder produced clear separation between the reconstruction-error distributions of normal and abnormal beats, making it highly effective at identifying anomalies.

ROC AUC   PR AUC   F1 Score
0.995     0.992    0.962
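A minimal Keras sketch of this kind of dense autoencoder follows. The layer sizes, epochs, and threshold grid are illustrative assumptions, not the exact configuration behind the numbers above.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score

window = beats_kept.shape[1]
normal_beats = beats_kept[binary_labels == "normal"]

# Feedforward autoencoder trained to reconstruct normal beats only
dense_ae = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),    # bottleneck
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(window, activation="linear"),
])
dense_ae.compile(optimizer="adam", loss="mse")
dense_ae.fit(normal_beats, normal_beats, epochs=20, batch_size=128,
             validation_split=0.1, verbose=0)

# Per-beat reconstruction error (MSE) is the anomaly score
recon = dense_ae.predict(beats_kept, verbose=0)
errors = np.mean((beats_kept - recon) ** 2, axis=1)

# Sweep candidate thresholds and keep the one that maximizes F1 on the labeled beats
y_true = (binary_labels == "abnormal").astype(int)
candidates = np.percentile(errors, np.arange(50, 100, 0.5))
best_threshold = max(candidates, key=lambda t: f1_score(y_true, (errors > t).astype(int)))
```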

LSTM Autoencoder

To investigate the value of temporal sequencing for anomaly detection, I constructed sequences of five consecutive beats per record and trained a sequence-to-sequence Long Short-Term Memory (LSTM) autoencoder. Each beat was flattened to a single vector before being stacked into sequences. The LSTM was trained on sequences of normal beats, and the same reconstruction-error logic was used to flag anomalies. Despite reconstructing sequences of normal beats well, the LSTM failed to sufficiently distinguish abnormal ones. This confirmed the earlier insight that the timing of beats (RR intervals or sequential structure) does not carry as much anomaly signal as the beat shape itself. Essentially, anomalous beats occur at the normal times; it is the execution of the beat that makes them different.

ROC AUC   PR AUC   F1 Score
0.604     0.142    0.208
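For comparison, a sketch of the sequence-to-sequence LSTM autoencoder, continuing from the dense sketch above. Hyperparameters are again assumptions, and the per-record grouping of sequences is omitted for brevity.

```python
seq_len = 5

def make_sequences(x):
    """Stack overlapping windows of seq_len consecutive beat vectors."""
    return np.stack([x[i:i + seq_len] for i in range(len(x) - seq_len + 1)])

normal_seqs = make_sequences(normal_beats)           # shape: (n_seq, 5, window)

lstm_ae = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, window)),
    tf.keras.layers.LSTM(64),                        # encode the 5-beat sequence to one vector
    tf.keras.layers.RepeatVector(seq_len),           # repeat it as input to the decoder
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(window)),
])
lstm_ae.compile(optimizer="adam", loss="mse")
lstm_ae.fit(normal_seqs, normal_seqs, epochs=20, batch_size=64, verbose=0)

# Anomaly score: mean reconstruction error over the whole 5-beat sequence
all_seqs = make_sequences(beats_kept)
seq_errors = np.mean((all_seqs - lstm_ae.predict(all_seqs, verbose=0)) ** 2, axis=(1, 2))
```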

Key Insights

This analysis demonstrated that, for this dataset, beat morphology is a more effective indicator of cardiac anomalies than beat-sequence timing. The dense autoencoder (F1 = 0.962) dramatically outperformed the LSTM-based approach (F1 = 0.208), suggesting that anomaly detection in ECG data should prioritize shape-based modeling as a first approach.

Methodological Decisions and Reasoning

This project validated autoencoder-based anomaly detection for biomedical signals and highlighted the importance of selecting modeling paradigms based on where the anomaly signal is strongest. With well-defined beat shapes and careful signal cleaning, dense models can detect subtle anomalies without requiring complex sequential context.

View Source Code