Toward Efficient and Accurate Remote Sensing Image-Text Retrieval with a Coarse-to-Fine Approach
Abstract
Existing remote sensing (RS) image-text retrieval methods generally fall into two categories: dual-stream approaches and single-stream approaches. Dual-stream models are efficient but often lack sufficient interaction between visual and textual modalities, while single-stream models offer high accuracy but suffer from prolonged inference time. To strike a trade-off between efficiency and accuracy, we propose a novel coarse-to-fine image-text retrieval (CFITR) framework that integrates both dual-stream and single-stream architectures into a two-stage retrieval process. Our method begins with a dual-stream hashing module (DSHM) that performs coarse retrieval by leveraging precomputed hash codes for efficiency. In the subsequent fine retrieval stage, a single-stream module (SSM) refines these results using a joint transformer to improve accuracy through enhanced cross-modal interactions. We further introduce a convolution-based local feature enhancement module (LFEM) to capture detailed local features, and a post-processing similarity reranking (PPSR) algorithm that optimizes retrieval results without additional training. Extensive experiments on the RSICD and RSITMD datasets demonstrate that our CFITR framework significantly improves retrieval accuracy while supporting real-time performance. Our code is publicly available at .
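The two-stage pipeline described above can be illustrated with a minimal sketch: a coarse stage ranks the database by Hamming distance over precomputed binary hash codes, and a fine stage rescores only the surviving candidates with a denser similarity. This is an illustrative simplification, not the authors' implementation; the function and parameter names (`coarse_to_fine_retrieve`, `k`, `m`) are hypothetical, and cosine similarity over float embeddings stands in for the paper's joint-transformer scoring.

```python
import numpy as np

def coarse_to_fine_retrieve(query_hash, query_emb, db_hashes, db_embs, k=10, m=3):
    """Hypothetical two-stage retrieval sketch (not the paper's code).

    Coarse stage: rank all database items by Hamming distance between
    precomputed binary hash codes, keeping the top-k candidates.
    Fine stage: rescore only those k candidates with cosine similarity
    over dense embeddings (a stand-in for cross-modal rescoring).
    """
    # Coarse stage: Hamming distance is just the count of differing bits.
    hamming = np.count_nonzero(db_hashes != query_hash, axis=1)
    candidates = np.argsort(hamming)[:k]

    # Fine stage: cosine similarity between the query embedding and
    # the candidate embeddings only, so the expensive scoring is O(k).
    cand = db_embs[candidates]
    sims = cand @ query_emb / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_emb)
    )
    return candidates[np.argsort(-sims)[:m]]
```

The design point this sketch captures is that the cheap, binary coarse stage prunes the search space so the accurate but slower fine stage runs on only `k` items rather than the full database.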
Type: Publication
Publication: IEEE Geoscience and Remote Sensing Letters