A Dual-Branch Vision Transformer with Cross-Attention Spectral Fusion for Early Stress Detection in Tropical Sugarcane and Cotton
Main Article Content
Abstract
Early detection of biotic and abiotic stress in tropical sugarcane and cotton is essential for protecting crop yields that sustain hundreds of millions of smallholder farmers globally [1]. While convolutional neural networks have demonstrated strong performance on single-modality leaf images, they lack the architectural capacity to simultaneously model fine-grained local texture features (lesion morphology, chlorosis patterns) and global structural context (canopy-level discoloration gradients) critical for distinguishing visually similar early-stage stress categories. This paper introduces, a novel Dual-Branch Vision Transformer architecture that processes visual image features and low-cost smartphone-derived spectral index features through two parallel Swin Transformer branches, fused via a dedicated Cross-Attention Spectral Fusion (CASF) module. DualViT-Crop further incorporates a Squeeze-and-Excitation channel recalibration block and a multi-scale Feature Pyramid Neck to capture stress signatures at both cellular and canopy levels. Experiments on a unified benchmark combining PlantVillage [2], Mendeley Sugarcane Leaf Disease [3], and Kaggle Cotton Disease [4] datasets demonstrate that DualViT-Crop achieves a mean F1-score of 95.6%, top 1 accuracy of 96.2%, and a Grad-CAM Localization Fidelity (GLF) score of 0.74, outperforming seven baseline methods including ResNet-50, EfficientNet-B4, and standard ViT-B/16 by an average of 5.8 percentage points in F1.