Towards Small Object Editing: A Benchmark Dataset and A Training-Free Approach

[Paper]      [Code]    

Anonymous Authors
Anonymous Institutions

SOEBench is a standardized benchmark for quantitatively evaluating text-based small object editing (SOE). It is collected from MSCOCO and OpenImage and contains 4k images in total. We also introduce a training-free cross-attention guidance method to address this problem.

Abstract

A plethora of text-guided image editing methods has recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models, especially Stable Diffusion. Despite the success of diffusion models in producing high-quality images, their application to small object generation has been limited due to difficulties in aligning cross-modal attention maps between text and these objects. Our approach offers a training-free method that significantly mitigates this alignment issue with local and global attention guidance, enhancing the model’s ability to accurately render small objects in accordance with textual descriptions. We detail our methodology, emphasizing its divergence from traditional generation techniques and highlighting its advantages. Importantly, we also provide SOEBench (Small Object Editing), a standardized benchmark for quantitatively evaluating text-based small object generation, collected from MSCOCO and OpenImage. Preliminary results demonstrate the effectiveness of our method, showing marked improvements in the fidelity and accuracy of small object generation compared to existing models. This advancement not only contributes to the field of AI and computer vision but also opens up new possibilities for applications in various industries where precise image generation is critical.

Method

Building upon the constructed SOEBench, we further provide a strong baseline method for small object editing. As discussed above, the quality of the cross-attention map is crucial in small object editing. To this end, we propose a new joint attention guidance method to enhance the accuracy of cross-attention map alignment from both local and global perspectives. In particular, we first develop a local attention guidance strategy to enhance the foreground cross-attention map alignment, and then introduce a global attention guidance strategy to enhance the background cross-attention map alignment. Our proposed baseline method is training-free yet highly effective in addressing the SOE problem.
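The joint local/global guidance described above can be sketched as a loss on the cross-attention map of the edited object's text token. This is a minimal, hypothetical illustration, not the paper's exact formulation: the function name, the box-mask construction, and the choice of penalty terms (attention mass inside the box for the local term, peak background attention for the global term) are all assumptions made for clarity.

```python
import numpy as np

def attention_guidance_loss(attn_map, box, lam=1.0):
    """Hypothetical joint attention-guidance loss for small object editing.

    attn_map: (H, W) cross-attention map of the edited object's text token.
    box: (y0, y1, x0, x1) region of the small object to be edited.
    lam: weight of the global (background) term; an assumed hyperparameter.
    """
    y0, y1, x0, x1 = box
    mask = np.zeros(attn_map.shape, dtype=bool)
    mask[y0:y1, x0:x1] = True

    total = attn_map.sum() + 1e-8
    # Local term: encourage the token's attention mass to concentrate
    # inside the object box (foreground alignment).
    local = 1.0 - attn_map[mask].sum() / total
    # Global term: suppress attention peaks leaking onto the background
    # (background alignment).
    glob = attn_map[~mask].max()
    return local + lam * glob
```

In a training-free setting, the gradient of such a loss with respect to the noisy latent would typically be used to update the latent between denoising steps, steering the diffusion model without any fine-tuning.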



Method Comparison

User Preference

BibTex


@misc{Anonymous2024Anonymous,
  title={Towards Small Object Editing: A Benchmark Dataset and A Training-Free Approach},
  author={Anonymous Authors},
  year={2024},
  eprint={XXXX.XXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}