Toward Efficient and Accurate Remote Sensing Image-Text Retrieval with a Coarse-to-Fine Approach
Abstract
Existing remote sensing (RS) image-text retrieval methods generally fall into two categories: dual-stream approaches and single-stream approaches. Dual-stream models are efficient but often lack sufficient interaction between visual and textual modalities, while single-stream models offer high accuracy but suffer from prolonged inference time. To strike a trade-off between efficiency and accuracy, we propose a novel coarse-to-fine image-text retrieval (CFITR) framework that integrates both dual-stream and single-stream architectures into a two-stage retrieval process. Our method begins with a dual-stream hashing module (DSHM) that performs coarse retrieval by leveraging precomputed hash codes for efficiency. In the subsequent fine retrieval stage, a single-stream module (SSM) refines these results using a joint transformer to improve accuracy through enhanced cross-modal interactions. We further introduce a convolution-based local feature enhancement module (LFEM) to capture detailed local features, and a post-processing similarity reranking (PPSR) algorithm that optimizes retrieval results without additional training. Extensive experiments on the RSICD and RSITMD datasets demonstrate that our CFITR framework significantly improves retrieval accuracy while supporting real-time performance. Our code is publicly available at .
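The two-stage pipeline described above can be illustrated with a minimal sketch: a coarse stage ranks the database by Hamming distance over precomputed binary hash codes, and a fine stage rescores only the surviving candidates with a denser similarity. This is an illustrative simplification, not the authors' implementation; the function and parameter names (`coarse_to_fine_retrieve`, `k`, `m`) are hypothetical, and cosine similarity over float embeddings stands in for the paper's joint-transformer scoring.

```python
import numpy as np

def coarse_to_fine_retrieve(query_hash, query_emb, db_hashes, db_embs, k=10, m=3):
    """Hypothetical two-stage retrieval sketch (not the paper's code).

    Coarse stage: rank all database items by Hamming distance between
    precomputed binary hash codes, keeping the top-k candidates.
    Fine stage: rescore only those k candidates with cosine similarity
    over dense embeddings (a stand-in for cross-modal rescoring).
    """
    # Coarse stage: Hamming distance is just the count of differing bits.
    hamming = np.count_nonzero(db_hashes != query_hash, axis=1)
    candidates = np.argsort(hamming)[:k]

    # Fine stage: cosine similarity between the query embedding and
    # the candidate embeddings only, so the expensive scoring is O(k).
    cand = db_embs[candidates]
    sims = cand @ query_emb / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_emb)
    )
    return candidates[np.argsort(-sims)[:m]]
```

The design point this sketch captures is that the cheap, binary coarse stage prunes the search space so the accurate but slower fine stage runs on only `k` items rather than the full database.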
Type: Publication
Publication: IEEE Geoscience and Remote Sensing Letters