Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images
Abstract
Remote sensing image change captioning (RSICC) aims to generate human-like language describing the semantic changes between bitemporal remote sensing image (RSI) pairs, providing valuable insight into environmental dynamics and land management. Unlike conventional change captioning (CC), RSICC requires not only retrieving relevant information across modalities and generating fluent captions, but also mitigating the impact of pixel-level differences on terrain change localization, since pixel-level discrepancies accumulated over long time spans degrade caption accuracy. To address these problems, we propose a probabilistic diffusion-based model that leverages the strong generative capability of diffusion processes to produce flexible captions. In the training phase, we construct a condition denoiser that efficiently maps the real caption distribution to a standard Gaussian distribution. The denoiser incorporates cross-mode fusion (CMF) and stacking self-attention (SSA) modules to enhance cross-modal alignment and reduce pixel interference, thereby improving caption accuracy. In the testing phase, the condition denoiser provides a new strategy for mean value estimation and generates captions step by step. Extensive experiments on the LEVIR-CC and DUBAI-CC datasets demonstrate the effectiveness of Diffusion-RSCC and each of its components. Quantitative results show superior performance over existing methods on both traditional and newly introduced metrics. The code is available at .
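To make the training/testing pipeline in the abstract concrete, below is a minimal sketch of the standard DDPM recipe it alludes to: a condition denoiser learns to predict the noise added to caption embeddings (mapping the caption distribution toward a standard Gaussian), and at test time the learned noise predictor drives step-by-step mean estimation in the reverse process. All names (ConditionDenoiser), the conditioning interface, and the hyperparameters are illustrative assumptions, not the authors' implementation; the real CMF and SSA modules are stood in for by a simple fusion network.

```python
# Hypothetical DDPM-style sketch; not the authors' code.
import torch
import torch.nn as nn

T = 1000                                   # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product \bar{alpha}_t

class ConditionDenoiser(nn.Module):
    """Predicts the noise added to caption embeddings, conditioned on
    fused bitemporal image features (a stand-in for the CMF/SSA modules)."""
    def __init__(self, dim=256):
        super().__init__()
        self.time_embed = nn.Embedding(T, dim)
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x_t, t, cond):
        # x_t: noisy caption embeddings (B, L, dim); cond: image features (B, dim)
        h = x_t + self.time_embed(t)[:, None, :]
        h = torch.cat([h, cond[:, None, :].expand_as(h)], dim=-1)
        return self.net(h)                 # predicted noise eps_theta

def training_loss(model, x0, cond):
    """Training: noise captions toward N(0, I), learn to predict the noise."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(b, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # forward process q(x_t | x_0)
    return ((eps - model(x_t, t, cond)) ** 2).mean()

@torch.no_grad()
def sample(model, cond, shape):
    """Testing: estimate the reverse-process mean step by step from pure noise."""
    x = torch.randn(shape)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = model(x, t, cond)
        # DDPM posterior mean: (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        x = mean + betas[i].sqrt() * torch.randn_like(x) if i > 0 else mean
    return x  # denoised caption embeddings, to be decoded into tokens
```

The predicted embeddings returned by sample would still need to be mapped back to discrete tokens (e.g., by nearest-neighbor lookup in the embedding table), a step the sketch omits.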
Type: Publication
Publication: IEEE Transactions on Geoscience and Remote Sensing