HybridEdge-COVID: Fair Compression Benchmarking, Calibration-Aware Uncertainty Quantification, and Radiologist-Validated Explainability for Trustworthy Edge-Deployed COVID-19 Chest X-Ray Screening
Abstract
Background: COVID-19 diagnostic capacity remains severely limited in low- and middle-income countries (LMICs). Prior edge-AI studies introduce a systematic optimism bias by compressing the proposed model more aggressively than the baseline models, a methodological flaw that the COVID-19 chest X-ray (CXR) literature has not clearly addressed. Previous COVID-19 CXR edge-AI benchmarking studies have also lacked three features critical to trustworthy medical AI: calibrated uncertainty, clinically relevant use of that uncertainty, and explanations of predictions that clinicians can inspect, such as saliency maps.

Methods: The key methodological contribution is the uniform application of a three-step Edge-Aware Optimisation Pipeline (dynamic-range quantisation, INT8 quantisation-aware training, and structured L1-norm channel pruning) to all seven architectures, to remove this source of systematic benchmarking bias. As a secondary contribution, we propose a lightweight hybrid CNN (1.91M parameters) that combines SqueezeNet Fire modules, MobileNetV2 inverted residual bottlenecks, and Squeeze-and-Excitation channel-wise attention; it serves as an example of how the evaluation framework can be applied to any hybrid CNN. Statistical validation comprises equivalence testing (TOST with margin Δ = ±1.0 pp), pairwise AUC comparison (DeLong's test), and Bonferroni-corrected McNemar's tests. Trustworthiness is assessed through calibration analysis (ECE, MCE, and Brier score), Monte Carlo Dropout uncertainty quantification (50 stochastic forward passes), and risk-coverage deferral analysis. Performance is evaluated with 5-fold stratified cross-validation on COVID-Xray-5k (n = 5,000; 95% bootstrap CIs) and external testing on COVIDx CXR-3 (n = 13,870; 16,352 unique patients). Grad-CAM++ explainability maps are validated in a double-blinded manner by two board-certified radiologists, with inter-rater agreement measured by Cohen's kappa.
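To make the calibration metrics named in the Methods concrete, the following is a minimal sketch of ECE and the Brier score for a binary classifier. It is not the study's implementation: the function names, the 10 equal-width confidence bins, and the 0.5 decision threshold are all assumptions made for this illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier.

    probs  : predicted probability of the positive class, one per case
    labels : true labels (0 or 1)
    n_bins : number of equal-width confidence bins (assumed value)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0            # assumed decision threshold
        conf = p if pred == 1 else 1.0 - p     # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        # Weight each bin's |accuracy - confidence| gap by its share of cases.
        ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```

A perfectly calibrated set of predictions (for example, ten cases at 0.9 confidence of which nine are correct) gives an ECE close to zero, while confidently wrong predictions push it toward one.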
Results: Under uniform compression benchmarking, HybridEdge-COVID achieves 97.84 ± 0.31% CV accuracy (95% CI: 97.21–98.47%), AUC 0.981 ± 0.009, and MCC 0.957 ± 0.013. ResNet18 (98.12%) and ResNet50 (98.23%) achieve higher point-estimate accuracy, which we state explicitly. DeLong analysis shows no significant difference in AUC relative to ResNet18 or EfficientNet-Lite0 after Bonferroni correction, and TOST confirms equivalence in AUC within the ±1.0 pp margin for four of six comparisons. Calibration analysis gives an ECE of 0.022, indicating that multi-stage compression does not degrade the reliability of the predicted probabilities. Monte Carlo Dropout uncertainty estimates are 4.25× higher for misclassified cases, supporting a deferral workflow with a 10% referral rate and a retained accuracy of about 98.5%. External validation yields 91.30% accuracy (95% CI: 90.73–91.87%) and AUC 0.943. On a Raspberry Pi 4 (< USD 55), the model requires 8.93 s per 100 images and 47.2 MB peak RAM with a 4.8 MB model size, making it Pareto-optimal among all seven evaluated architectures. Dual-radiologist Grad-CAM++ validation yields κ = 0.71 (95% CI: 0.61–0.81; substantial agreement) and 76.9% clinical feature consistency.

Conclusions: This study presents a fair compression benchmarking framework, calibration-aware uncertainty quantification, rigorous statistical validation, and preliminary radiologist-validated explainability for trustworthy edge-deployed COVID-19 CXR screening. Beyond the proposed HybridEdge-COVID architecture, the most important contribution is the reproducible evaluation methodology, which enables scientifically fair performance comparisons of edge medical AI systems.

Limitations: binary classification only; no prospective clinical validation; preliminary XAI assessment by only two radiologists; transformer baselines not evaluated under the compression pipeline; Grad-CAM++ not run on-device. Multi-centre prospective validation is required before any clinical deployment is considered.
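The deferral workflow reported in the Results can be illustrated with a short sketch: average the stochastic forward passes for each case, score uncertainty, refer the most uncertain fraction to a clinician, and measure accuracy on the cases that remain. This is an assumed formulation rather than the study's code; the use of predictive entropy as the uncertainty score, the function names, and the toy inputs are hypothetical.

```python
import math


def predictive_entropy(mc_probs):
    """Binary entropy of the mean positive-class probability over MC passes."""
    p = sum(mc_probs) / len(mc_probs)
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def accuracy_after_deferral(mc_probs_per_case, labels, defer_fraction=0.10):
    """Accuracy on retained cases after deferring the most uncertain ones.

    mc_probs_per_case : list of per-case lists of positive-class probabilities,
                        one entry per stochastic forward pass
    labels            : true labels (0 or 1)
    defer_fraction    : share of cases referred to a clinician (assumed 10%)
    """
    scored = []
    for passes, y in zip(mc_probs_per_case, labels):
        mean_p = sum(passes) / len(passes)
        pred = 1 if mean_p >= 0.5 else 0
        scored.append((predictive_entropy(passes), pred == y))

    scored.sort(key=lambda item: item[0])          # most certain cases first
    n_deferred = int(round(defer_fraction * len(scored)))
    kept = scored[: len(scored) - n_deferred]
    if not kept:
        raise ValueError("defer_fraction leaves no cases to evaluate")
    return sum(1 for _, correct in kept if correct) / len(kept)


# Toy data: nine confident correct cases and one uncertain misclassified case.
toy_cases = [[0.95, 0.97]] * 9 + [[0.55, 0.60]]
toy_labels = [1] * 9 + [0]
accuracy_all = accuracy_after_deferral(toy_cases, toy_labels, defer_fraction=0.0)
accuracy_kept = accuracy_after_deferral(toy_cases, toy_labels, defer_fraction=0.10)
```

In the toy data, the single misclassified case is also the most uncertain one, so deferring 10% of cases removes it and the retained accuracy rises. Sweeping `defer_fraction` over a range of values traces out a risk-coverage curve.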