arXiv daily

Computer Vision and Pattern Recognition (cs.CV)

Wed, 24 May 2023

1.Remote Sensing Image Change Detection Towards Continuous Bitemporal Resolution Differences

Authors:Hao Chen, Haotian Zhang, Keyan Chen, Chenyao Zhou, Song Chen, Zhengxia Zhou, Zhenwei Shi

Abstract: Most contemporary supervised Remote Sensing (RS) image Change Detection (CD) approaches are customized for equal-resolution bitemporal images. Real-world applications raise the need for cross-resolution change detection, i.e., CD based on bitemporal images with different spatial resolutions. Current cross-resolution methods that are trained with samples of a fixed resolution difference (the resolution ratio between the high-resolution (HR) image and the low-resolution (LR) one) may fit a certain ratio but lack adaptation to other resolution differences. Toward continuous cross-resolution CD, we propose scale-invariant learning to enforce that the model consistently predicts HR results given synthesized samples of varying bitemporal resolution differences. Concretely, we synthesize blurred versions of the HR image by random downsampled reconstructions to reduce the gap between HR and LR images. We introduce coordinate-based representations to decode per-pixel predictions by feeding the coordinate query and corresponding multi-level embedding features into an MLP that implicitly learns the shape of land cover changes, which benefits the recognition of blurred objects in the LR image. Moreover, considering that spatial resolution mainly affects the local textures, we apply local-window self-attention to align bitemporal features during the early stages of the encoder. Extensive experiments on two synthesized and one real-world different-resolution CD datasets verify the effectiveness of the proposed method. Our method significantly outperforms several vanilla CD methods and two cross-resolution CD methods on the three datasets in both in-distribution and out-of-distribution settings. The empirical results suggest that our method could yield relatively consistent HR change predictions regardless of varying resolution difference ratios. Our code will be public.
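
The "random downsampled reconstruction" step can be illustrated with a small sketch. This is not the authors' code: the function names, the use of average pooling for downsampling, nearest-neighbour upsampling, and the integer ratio set are all assumptions made for illustration.

```python
import random

def downsample_reconstruct(image, ratio):
    """Average-pool by `ratio`, then upsample back with nearest neighbour.

    `image` is a 2D list of numbers. Pooling and upsampling choices are
    assumptions of this sketch, not details taken from the abstract.
    """
    h, w = len(image), len(image[0])
    assert h % ratio == 0 and w % ratio == 0, "sketch assumes divisible sizes"
    # Low-resolution image: mean of each ratio x ratio block.
    low = [
        [
            sum(image[i * ratio + di][j * ratio + dj]
                for di in range(ratio) for dj in range(ratio)) / (ratio * ratio)
            for j in range(w // ratio)
        ]
        for i in range(h // ratio)
    ]
    # Reconstruction: repeat each low-res pixel to restore the original size.
    return [[low[i // ratio][j // ratio] for j in range(w)] for i in range(h)]

def random_blur(image, ratios=(1, 2, 4), rng=random):
    """Pick a resolution ratio at random and return (blurred image, ratio)."""
    ratio = rng.choice(ratios)
    return downsample_reconstruct(image, ratio), ratio

# Hypothetical 4 x 4 "HR" image used only to exercise the functions.
hr_image = [[0, 2, 4, 6], [2, 4, 6, 8], [0, 0, 8, 8], [0, 0, 8, 8]]
blurred_x2 = downsample_reconstruct(hr_image, 2)
```

The output keeps the HR grid size while carrying only LR detail, which is the property the abstract relies on to narrow the gap between HR and LR inputs.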

2.BinaryViT: Towards Efficient and Accurate Binary Vision Transformers

Authors:Junrui Xiao, Zhikai Li, Lianwei Yang, Qingyi Gu

Abstract: Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields, but their considerable memory and computation costs hinder their application on resource-limited devices. As one of the most powerful compression methods, binarization reduces the computation of the neural network by quantizing the weights and activation values as $\pm$1. Although existing binarization methods have demonstrated excellent performance on Convolutional Neural Networks (CNNs), the full binarization of ViTs is still under-studied and suffers a significant performance drop. In this paper, we first argue empirically that the severe performance degradation is mainly caused by the weight oscillation in the binarization training and the information distortion in the activation of ViTs. Based on these analyses, we propose BinaryViT, an accurate full binarization scheme for ViTs, which pushes the quantization of ViTs to the limit. Specifically, we propose a novel gradient regularization scheme (GRS) for driving a bimodal distribution of the weights to reduce oscillation in binarization training. Moreover, we design an activation shift module (ASM) to adaptively tune the activation distribution to reduce the information distortion caused by binarization. Extensive experiments on the ImageNet dataset show that our BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of 16.2$\times$ and 17.7$\times$ in model size and OPs compared to the full-precision DeiT-S. The code and models will be released on GitHub.
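
To make the cost argument concrete, here is a minimal sketch of sign binarization and of a dot product computed with XOR and a bit count. It does not reproduce GRS or ASM. The helper names are hypothetical, and sending zero to +1 is an assumption of the sketch.

```python
def binarize(values):
    """Map each real value to +1 or -1 by its sign (0 is sent to +1)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(signs):
    """Pack a +/-1 vector into an int: bit k is 1 when signs[k] == +1."""
    bits = 0
    for k, s in enumerate(signs):
        if s == 1:
            bits |= 1 << k
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors of length n.

    Positions where the bits differ contribute -1, the rest contribute +1,
    so the result is n - 2 * (number of differing bits).
    """
    disagreements = bin(a_bits ^ b_bits).count("1")
    return n - 2 * disagreements

# Hypothetical weights and activations used only for illustration.
weights = [0.3, -1.2, 0.0, -0.1]
activations = [-0.5, -2.0, 1.0, 0.7]
bw, bx = binarize(weights), binarize(activations)
dot_reference = sum(w * x for w, x in zip(bw, bx))
dot_bitwise = binary_dot(pack_bits(bw), pack_bits(bx), len(bw))
```

Both routes give the same value, but the bitwise one replaces multiply-accumulate operations with an XOR and a population count.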

3.AutoDepthNet: High Frame Rate Depth Map Reconstruction using Commodity Depth and RGB Cameras

Authors:Peyman Gholami, Robert Xiao

Abstract: Depth cameras have found applications in diverse fields, such as computer vision, artificial intelligence, and video gaming. However, the high latency and low frame rate of existing commodity depth cameras impose limitations on their applications. We propose a fast and accurate depth map reconstruction technique to reduce latency and increase the frame rate in depth cameras. Our approach uses only a commodity depth camera and color camera in a hybrid camera setup; our prototype is implemented using a Kinect Azure depth camera at 30 fps and a high-speed RGB iPhone 11 Pro camera captured at 240 fps. The proposed network, AutoDepthNet, is an encoder-decoder model that captures frames from the high-speed RGB camera and combines them with previous depth frames to reconstruct a stream of high frame rate depth maps. On GPU, with a 480 x 270 output resolution, our system achieves an inference time of 8 ms, enabling real-time use at up to 200 fps with parallel processing. AutoDepthNet can estimate depth values with an average RMS error of 0.076, a 44.5% improvement compared to an optical flow-based comparison method. Our method can also improve depth map quality by estimating depth values for missing and invalidated pixels. The proposed method can be easily applied to existing depth cameras and facilitates the use of depth cameras in applications that require high-speed depth estimation. We also showcase the effectiveness of the framework in upsampling different sparse datasets e.g. video object segmentation. As a demonstration of our method, we integrated our framework into existing body tracking systems and demonstrated the robustness of the proposed method in such applications.

4.ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

Authors:Dongxu Yue, Qin Guo, Munan Ning, Jiaxi Cui, Yuesheng Zhu, Li Yuan

Abstract: Editing real facial images is a crucial task in computer vision with significant demand in various real-world applications. While GAN-based methods have shown potential in manipulating images, especially when combined with CLIP, these methods are limited in their ability to reconstruct real images because GAN inversion remains challenging. Despite the successful image reconstruction achieved by diffusion-based methods, there are still challenges in effectively manipulating fine-grained facial attributes with textual instructions. To address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conducts text-driven image editing in the semantic latent space of a diffusion model. By aligning the temporal feature of the diffusion model with the semantic condition in the generative process, we introduce a stable manipulation strategy, which performs precise zero-shot manipulation effectively. Furthermore, we develop an interactive system named ChatFace, which combines the zero-shot reasoning ability of large language models to perform efficient manipulations in the diffusion semantic latent space. This system enables users to perform complex multi-attribute manipulations through dialogue, opening up new possibilities for interactive image editing. Extensive experiments confirm that our approach outperforms previous methods and enables precise editing of real facial images, making it a promising candidate for real-world applications. Project page:

5.SUVR: A Search-based Approach to Unsupervised Visual Representation Learning

Authors:Yi-Zhan Xu, Chih-Yao Chen, Cheng-Te Li

Abstract: Unsupervised learning has grown in popularity because of the difficulty of collecting annotated data and the development of modern frameworks that allow us to learn from unlabeled data. Existing studies, however, either disregard variations at different levels of similarity or only consider negative samples from one batch. We argue that image pairs should have varying degrees of similarity, and the negative samples should be allowed to be drawn from the entire dataset. In this work, we propose Search-based Unsupervised Visual Representation Learning (SUVR) to learn better image representations in an unsupervised manner. We first construct a graph from the image dataset by the similarity between images, and adopt the concept of graph traversal to explore positive samples. In the meantime, we make sure that negative samples can be drawn from the full dataset. Quantitative experiments on five benchmark image classification datasets demonstrate that SUVR can significantly outperform strong competing methods on unsupervised embedding learning. Qualitative experiments also show that SUVR can produce better representations in which similar images are clustered closer together than unrelated images in the latent space.

6.MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition

Authors:Tianlun Zheng, Zhineng Chen, BingChen Huang, Wei Zhang, Yu-Gang Jiang

Abstract: Traditional Multilingual Text Recognition (MLTR) usually targets a fixed set of languages and thus struggles to handle newly added languages or adapt to ever-changing class distributions. In this paper, we introduce the Incremental Multilingual Text Recognition (IMLTR) task in the incremental learning setting, where new language data comes in batches. Compared to generic incremental learning, IMLTR is even more challenging as it suffers from rehearsal-imbalance (uneven distribution of sample characters in the rehearsal set). To address this issue, we propose a Multiplexed Routing Network (MRN), where a series of recognizers is trained for each language. Subsequently, a language predictor is adopted to weigh the recognizers for voting. Since the recognizers are derived from the original model, MRN effectively reduces the reliance on older data and is better suited for rehearsal-imbalance. We extensively evaluate MRN on MLT17 and MLT19 datasets, outperforming existing state-of-the-art methods by a large margin, i.e., accuracy improvement ranging from 10.3% to 27.4% under different settings.
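
The voting step can be pictured as a weighted sum of per-recognizer scores. The sketch below is an assumption-laden illustration, not the MRN implementation: the function name, the dictionary-based score format, and the example numbers are all hypothetical.

```python
def weighted_vote(recognizer_scores, language_weights):
    """Combine per-recognizer character scores using language-predictor weights.

    recognizer_scores: one dict per recognizer, mapping character -> score.
    language_weights:  one weight per recognizer (hypothetical predictor output).
    Returns the winning character and the combined score table.
    """
    combined = {}
    for scores, weight in zip(recognizer_scores, language_weights):
        for char, score in scores.items():
            combined[char] = combined.get(char, 0.0) + weight * score
    return max(combined, key=combined.get), combined

# Two hypothetical recognizers that disagree on one character position.
scores = [{"A": 0.9, "Z": 0.1}, {"A": 0.2, "Z": 0.8}]
weights = [0.2, 0.8]  # the predictor favours the second language
winner, table = weighted_vote(scores, weights)
```

With these numbers the second recognizer dominates, so its preferred character wins even though the first recognizer is more confident.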

7.Dual Path Transformer with Partition Attention

Authors:Zhengkai Jiang, Liang Liu, Jiangning Zhang, Yabiao Wang, Mingang Chen, Chengjie Wang

Abstract: This paper introduces a novel attention mechanism, called dual attention, which is both efficient and effective. The dual attention mechanism consists of two parallel components: local attention generated by Convolutional Neural Networks (CNNs) and long-range attention generated by Vision Transformers (ViTs). To address the high computational complexity and memory footprint of vanilla Multi-Head Self-Attention (MHSA), we introduce a novel Multi-Head Partition-wise Attention (MHPA) mechanism. The partition-wise attention approach models both intra-partition and inter-partition attention simultaneously. Building on the dual attention block and partition-wise attention mechanism, we present a hierarchical vision backbone called DualFormer. We evaluate the effectiveness of our model on several computer vision tasks, including image classification on ImageNet, object detection on COCO, and semantic segmentation on Cityscapes. Specifically, the proposed DualFormer-XS achieves 81.5% top-1 accuracy on ImageNet, outperforming the recent state-of-the-art MPViT-XS by 0.6% top-1 accuracy with much higher throughput.

8.Generative Modeling through the Semi-dual Formulation of Unbalanced Optimal Transport

Authors:Jaemoo Choi, Jaewoong Choi, Myungjoo Kang

Abstract: Optimal Transport (OT) problem investigates a transport map that bridges two distributions while minimizing a given cost function. In this regard, OT between tractable prior distribution and data has been utilized for generative modeling tasks. However, OT-based methods are susceptible to outliers and face optimization challenges during training. In this paper, we propose a novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution matching. This approach provides better robustness against outliers, stability during training, and faster convergence. We validate these properties empirically through experiments. Moreover, we study the theoretical upper-bound of divergence between distributions in UOT. Our model outperforms existing OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 5.80 on CelebA-HQ-256.

9.Text Conditional Alt-Text Generation for Twitter Images

Authors:Nikita Srivatsan, Sofia Samaniego, Omar Florez, Taylor Berg-Kirkpatrick

Abstract: In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. This task is more than just a special case of image captioning, as alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that, despite not necessarily describing the image, may provide useful context that can be informative if properly leveraged -- e.g. the tweet may name an uncommon object in the image that the model has not previously seen. We address this with a CLIP prefix model that extracts an embedding of the image and passes it to a mapping network that outputs a short sequence in word embedding space, or a "prefix", to which we also concatenate the text from the tweet itself. This lets the model condition on both visual and textual information from the post. The combined multimodal prefix is then fed as a prompt to a pretrained language model which autoregressively completes the sequence to generate the alt-text. While prior work has used similar methods for captioning, ours is the first to our knowledge that incorporates textual information from the associated social media post into the prefix as well, and we further demonstrate through ablations that the utility of these two information sources stacks. We put forward a new dataset scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation, and show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work.

10.Polarimetric Imaging for Perception

Authors:Michael Baltaxe, Tomer Pe'er, Dan Levi

Abstract: Autonomous driving and advanced driver-assistance systems rely on a set of sensors and algorithms to perform the appropriate actions and provide alerts as a function of the driving scene. Typically, the sensors include color cameras, radar, lidar and ultrasonic sensors. Strikingly however, although light polarization is a fundamental property of light, it is seldom harnessed for perception tasks. In this work we analyze the potential for improvement in perception tasks when using an RGB-polarimetric camera, as compared to an RGB camera. We examine monocular depth estimation and free space detection during the middle of the day, when polarization is independent of subject heading, and show that a quantifiable improvement can be achieved for both of them using state-of-the-art deep neural networks, with a minimum of architectural changes. We also present a new dataset composed of RGB-polarimetric images, lidar scans, GNSS / IMU readings and free space segmentations that further supports developing perception algorithms that take advantage of light polarization.

11.Exploring Diverse In-Context Configurations for Image Captioning

Authors:Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen

Abstract: After discovering that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains have also developed their own few-shot learners, but they only use the simplest approach, i.e., random sampling, to configure in-context image-text pairs. In order to explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Here image captioning is used as the case study since it can be seen as a visually-conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case.

12.Semi-Supervised and Long-Tailed Object Detection with CascadeMatch

Authors:Yuhang Zang, Kaiyang Zhou, Chen Huang, Chen Change Loy

Abstract: This paper focuses on long-tailed object detection in the semi-supervised learning setting, which poses realistic challenges, but has rarely been studied in the literature. We propose a novel pseudo-labeling-based detector called CascadeMatch. Our detector features a cascade network architecture, which has multi-stage detection heads with progressive confidence thresholds. To avoid manually tuning the thresholds, we design a new adaptive pseudo-label mining mechanism to automatically identify suitable values from data. To mitigate confirmation bias, where a model is negatively reinforced by incorrect pseudo-labels produced by itself, each detection head is trained by the ensemble pseudo-labels of all detection heads. Experiments on two long-tailed datasets, i.e., LVIS and COCO-LT, demonstrate that CascadeMatch surpasses existing state-of-the-art semi-supervised approaches -- across a wide range of detection architectures -- in handling long-tailed object detection. For instance, CascadeMatch outperforms Unbiased Teacher by 1.9 AP Fix on LVIS when using a ResNet50-based Cascade R-CNN structure, and by 1.7 AP Fix when using Sparse R-CNN with a Transformer encoder. We also show that CascadeMatch can even handle the challenging sparsely annotated object detection problem.

13.On Correlated Knowledge Distillation for Monitoring Human Pose with Radios

Authors:Shiva Raj Pokhrel, Jonathan Kua, Deol Satish, Phil Williams, Arkady Zaslavsky, Seng W. Loke, Jinho Choi

Abstract: In this work, we propose and develop a simple experimental testbed to study the feasibility of a novel idea by coupling radio frequency (RF) sensing technology with Correlated Knowledge Distillation (CKD) theory towards designing lightweight, near real-time and precise human pose monitoring systems. The proposed CKD framework transfers and fuses pose knowledge from a robust "Teacher" model to a parameterized "Student" model, which can be a promising technique for obtaining accurate yet lightweight pose estimates. To assure its efficacy, we implemented CKD for distilling logits in our integrated Software Defined Radio (SDR)-based experimental setup and investigated the RF-visual signal correlation. Our CKD-RF sensing technique is characterized by two modes -- a camera-fed Teacher Class Network (e.g., images, videos) with an SDR-fed Student Class Network (e.g., RF signals). Specifically, our CKD model trains a dual multi-branch teacher and student network by distilling and fusing knowledge bases. The resulting CKD models are then subsequently used to identify the multimodal correlation and teach the student branch in reverse. Instead of simply aggregating their learnings, CKD training comprised multiple parallel transformations with the two domains, i.e., visual images and RF signals. Once trained, our CKD model can efficiently preserve privacy and utilize the multimodal correlated logits from the two different neural networks for estimating poses without using visual signals/video frames (by using only the RF signals).

14.OD-NeRF: Efficient Training of On-the-Fly Dynamic Neural Radiance Fields

Authors:Zhiwen Yan, Chen Li, Gim Hee Lee

Abstract: Dynamic neural radiance fields (dynamic NeRFs) have demonstrated impressive results in novel view synthesis on 3D dynamic scenes. However, they often require complete video sequences for training followed by novel view synthesis, which is similar to playing back the recording of a dynamic 3D scene. In contrast, we propose OD-NeRF to efficiently train and render dynamic NeRFs on-the-fly which instead is capable of streaming the dynamic scene. When training on-the-fly, the training frames become available sequentially and the model is trained and rendered frame-by-frame. The key challenge of efficient on-the-fly training is how to utilize the radiance field estimated from the previous frames effectively. To tackle this challenge, we propose: 1) a NeRF model conditioned on the multi-view projected colors to implicitly track correspondence between the current and previous frames, and 2) a transition and update algorithm that leverages the occupancy grid from the last frame to sample efficiently at the current frame. Our algorithm can achieve an interactive speed of 6FPS training and rendering on synthetic dynamic scenes on-the-fly, and a significant speed-up compared to the state-of-the-art on real-world dynamic scenes.

15.NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Authors:Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang

Abstract: We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics show that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at

16.Predicting Token Impact Towards Efficient Vision Transformer

Authors:Hong Wang, Su Yang, Xiaoke Huang, Weishan Zhang

Abstract: Token filtering to reduce irrelevant tokens prior to self-attention is a straightforward way to enable efficient vision Transformers. This is the first work to view token filtering from a feature selection perspective, where we weigh the importance of a token according to how much it can change the loss once masked. If the loss changes greatly after masking a token of interest, it means that such a token has a significant impact on the final decision and is thus relevant. Otherwise, the token is less important for the final decision, so it can be filtered out. After applying the token filtering module generalized from the whole training data, the number of tokens fed to the self-attention module can be substantially reduced in the inference phase, leading to far fewer computations in all the subsequent self-attention layers. The token filter can be realized using a very simple network, for which we use a multi-layer perceptron. Apart from performing token filtering only once, at the very beginning prior to self-attention, the other core feature that distinguishes our method from other token filters is the predictability of token impact from a feature selection point of view. The experiments show that the proposed method provides an efficient way to obtain a lightweight model by fine-tuning together with a backbone, which is easier to deploy than existing methods based on training from scratch.
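
The masking criterion can be sketched on a toy problem. The code below only illustrates how a token's impact could be measured as the change in loss after masking. It does not include the MLP that the abstract says predicts this impact at inference time. The linear toy loss, the names, and the threshold are assumptions of the sketch.

```python
# Hypothetical fixed "model": squared error of a weighted sum against target 0.
WEIGHTS = [2.0, 0.0, -1.0]

def toy_loss(tokens):
    """Toy stand-in for a network loss over a sequence of scalar tokens."""
    prediction = sum(w * t for w, t in zip(WEIGHTS, tokens))
    return prediction ** 2

def token_impacts(tokens, loss_fn, mask_value=0.0):
    """Absolute change in loss when each token is masked in turn."""
    base = loss_fn(tokens)
    impacts = []
    for i in range(len(tokens)):
        masked = list(tokens)
        masked[i] = mask_value  # mask exactly one token
        impacts.append(abs(loss_fn(masked) - base))
    return impacts

def filter_tokens(tokens, impacts, threshold):
    """Keep only tokens whose masking changes the loss by more than threshold."""
    return [t for t, d in zip(tokens, impacts) if d > threshold]

tokens = [1.0, 5.0, 3.0]
impacts = token_impacts(tokens, toy_loss)
kept = filter_tokens(tokens, impacts, threshold=0.5)
```

The middle token has zero weight in the toy model, so masking it leaves the loss unchanged and it is filtered out.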

17.Introducing Competition to Boost the Transferability of Targeted Adversarial Examples through Clean Feature Mixup

Authors:Junyoung Byun, Myung-Joon Kwon, Seungju Cho, Yoonji Kim, Changick Kim

Abstract: Deep neural networks are widely known to be susceptible to adversarial examples, which can cause incorrect predictions through subtle input modifications. These adversarial examples tend to be transferable between models, but targeted attacks still have lower attack success rates due to significant variations in decision boundaries. To enhance the transferability of targeted adversarial examples, we propose introducing competition into the optimization process. Our idea is to craft adversarial perturbations in the presence of two new types of competitor noises: adversarial perturbations towards different target classes and friendly perturbations towards the correct class. With these competitors, even if an adversarial example deceives a network into extracting specific features leading to the target class, this disturbance can be suppressed by other competitors. Therefore, within this competition, adversarial examples should adopt different attack strategies by leveraging more diverse features to overwhelm their interference, thereby improving their transferability to different models. Considering the computational complexity, we efficiently simulate various interference from these two types of competitors in feature space by randomly mixing up stored clean features during model inference, and name this method Clean Feature Mixup (CFM). Our extensive experimental results on the ImageNet-Compatible and CIFAR-10 datasets show that the proposed method outperforms the existing baselines by a clear margin. Our code is available at

18.DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion

Authors:Taesun Yeom, Minhyeok Lee

Abstract: Class-conditional image generation using generative adversarial networks (GANs) has been investigated through various techniques; however, it continues to face challenges such as mode collapse, training instability, and low-quality output in cases of datasets with high intra-class variation. Furthermore, most GANs often converge in larger iterations, resulting in poor iteration efficacy in training procedures. While Diffusion-GAN has shown potential in generating realistic samples, it has a critical limitation in generating class-conditional samples. To overcome these limitations, we propose a novel approach for class-conditional image generation using GANs called DuDGAN, which incorporates a dual diffusion-based noise injection process. Our method consists of three unique networks: a discriminator, a generator, and a classifier. During the training process, Gaussian-mixture noises are injected into the two noise-aware networks, the discriminator and the classifier, in distinct ways. This noisy data helps to prevent overfitting by gradually introducing more challenging tasks, leading to improved model performance. As a result, our method outperforms state-of-the-art conditional GAN models for image generation in terms of performance. We evaluated our method using the AFHQ, Food-101, and CIFAR-10 datasets and observed superior results across metrics such as FID, KID, Precision, and Recall score compared with comparison models, highlighting the effectiveness of our approach.

19.Optimization-Based Improvement of Face Image Quality Assessment Techniques

Authors:Žiga Babnik, Naser Damer, Vitomir Štruc

Abstract: Contemporary face recognition (FR) models achieve near-ideal recognition performance in constrained settings, yet do not fully translate the performance to unconstrained (realworld) scenarios. To help improve the performance and stability of FR systems in such unconstrained settings, face image quality assessment (FIQA) techniques try to infer sample-quality information from the input face images that can aid with the recognition process. While existing FIQA techniques are able to efficiently capture the differences between high and low quality images, they typically cannot fully distinguish between images of similar quality, leading to lower performance in many scenarios. To address this issue, we present in this paper a supervised quality-label optimization approach, aimed at improving the performance of existing FIQA techniques. The developed optimization procedure infuses additional information (computed with a selected FR model) into the initial quality scores generated with a given FIQA technique to produce better estimates of the "actual" image quality. We evaluate the proposed approach in comprehensive experiments with six state-of-the-art FIQA approaches (CR-FIQA, FaceQAN, SER-FIQ, PCNet, MagFace, SDD-FIQA) on five commonly used benchmarks (LFW, CFPFP, CPLFW, CALFW, XQLFW) using three targeted FR models (ArcFace, ElasticFace, CurricularFace) with highly encouraging results.

20.Multiresolution Feature Guidance Based Transformer for Anomaly Detection

Authors:Shuting Yan, Pingping Chen, Honghui Chen, Huan Mao, Feng Chen, Zhijian Lin

Abstract: Anomaly detection is commonly formulated as an unsupervised learning task that identifies images deviating from normal images. In general, there are two main challenges in anomaly detection tasks, i.e., the class imbalance and the unexpectedness of anomalies. In this paper, we propose a multiresolution feature guidance method based on Transformer, named GTrans, for unsupervised anomaly detection and localization. In GTrans, an Anomaly Guided Network (AGN) pre-trained on ImageNet is developed to provide surrogate labels for features and tokens. Under the tacit knowledge guidance of the AGN, the anomaly detection network, named Trans, utilizes Transformer to effectively establish a relationship between multiresolution features, enhancing the ability of Trans to fit the normal data manifold. Due to the strong generalization ability of AGN, GTrans locates anomalies by comparing the differences in spatial distance and direction of multi-scale features extracted from the AGN and the Trans. Our experiments demonstrate that the proposed GTrans achieves state-of-the-art performance in both detection and localization on the MVTec AD dataset, with image-level and pixel-level anomaly detection AUROC scores of 99.0% and 97.9%, respectively.

21.Towards View-invariant and Accurate Loop Detection Based on Scene Graph

Authors:Chuhao Liu, Shaojie Shen

Abstract: Loop detection plays a key role in visual Simultaneous Localization and Mapping (SLAM) by correcting the accumulated pose drift. In indoor scenarios, the richly distributed semantic landmarks are viewpoint-invariant and hold strong descriptive power in loop detection. Current semantic-aided loop detection methods embed the topology between semantic instances to search for a loop. However, these methods face challenges in dealing with ambiguous semantic instances and drastic viewpoint differences, which are not fully addressed in the literature. This paper introduces a novel loop detection method based on an incrementally created scene graph, targeting visual SLAM in indoor scenes. It jointly considers the macro-view topology, micro-view topology, and occupancy of semantic instances to find correct correspondences. Experiments using handheld RGB-D sequences show that our method is able to accurately detect loops under drastically changed viewpoints. It maintains high precision when observing objects with similar topology and appearance. Our method is also shown to be robust in changed indoor scenes.

22.HARD: Hard Augmentations for Robust Distillation

Authors:Arne F. Nix, Max F. Burg, Fabian H. Sinz

Abstract: Knowledge distillation (KD) is a simple and successful method to transfer knowledge from a teacher to a student model solely based on functional activity. However, current KD has a few shortcomings: it has recently been shown that this method is unsuitable for transferring simple inductive biases like shift equivariance, struggles to transfer out-of-domain generalization, and requires optimization time that is orders of magnitude longer than default non-KD model training. To improve these aspects of KD, we propose Hard Augmentations for Robust Distillation (HARD), a generally applicable data augmentation framework that generates synthetic data points for which the teacher and the student disagree. We show in a simple toy example that our augmentation framework solves the problem of transferring simple equivariances with KD. We then apply our framework in real-world tasks for a variety of augmentation models, ranging from simple spatial transformations to unconstrained image manipulations with a pretrained variational autoencoder. We find that our learned augmentations significantly improve KD performance on in-domain and out-of-domain evaluation. Moreover, our method outperforms even state-of-the-art data augmentations, and since the augmented training inputs can be visualized, they offer a qualitative insight into the properties that are transferred from the teacher to the student. Thus HARD represents a generally applicable, dynamically optimized data augmentation technique tailored to improve the generalization and convergence speed of models trained with KD.
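
The idea of generating inputs on which teacher and student disagree can be sketched on a toy problem. In the sketch below, a brute-force search over a handful of circular shifts stands in for the learned augmentation model described in the abstract; the toy teacher and student, and all function names, are hypothetical.

```python
def shift(signal, k):
    # Circular shift of a list by k positions to the right.
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def disagreement(teacher, student, x):
    # Squared difference between teacher and student outputs on input x.
    return sum((t - s) ** 2 for t, s in zip(teacher(x), student(x)))

def hardest_shift(teacher, student, x, shifts):
    # Pick the augmentation parameter that maximizes the disagreement.
    return max(shifts, key=lambda k: disagreement(teacher, student, shift(x, k)))

# Toy models (assumptions): the teacher is shift-invariant, the student
# only looks at the first position and is therefore not.
def toy_teacher(x):
    return [max(x)]

def toy_student(x):
    return [x[0]]

x = [1, 1, 0, 1]
k_hard = hardest_shift(toy_teacher, toy_student, x, shifts=range(len(x)))
hard_example = shift(x, k_hard)
```

Training the student on `hard_example` with the teacher's output as the target would then focus the distillation on exactly the inputs where the student has not yet picked up the teacher's invariance.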

23.GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data

Authors:Zhitong Xiong, Sining Chen, Yi Wang, Lichao Mou, Xiao Xiang Zhu

Abstract: Geometric information in the normalized digital surface models (nDSM) is highly correlated with the semantic class of the land cover. Exploiting two modalities (RGB and nDSM (height)) jointly has great potential to improve the segmentation performance. However, it is still an under-explored field in remote sensing due to the following challenges. First, the scales of existing datasets are relatively small and the diversity of existing datasets is limited, which restricts the ability of validation. Second, there is a lack of unified benchmarks for performance assessment, which leads to difficulties in comparing the effectiveness of different models. Last, sophisticated multi-modal semantic segmentation methods have not been deeply explored for remote sensing data. To cope with these challenges, in this paper, we introduce a new remote-sensing benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data. Towards a fair and comprehensive analysis of existing methods, the proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data. Furthermore, we propose a novel and effective Transformer-based intermediary multi-modal fusion (TIMF) module to improve the semantic segmentation performance through adaptive token-level multi-modal fusion. The designed benchmark can foster future research on developing new methods for multi-modal learning on remote sensing data. Extensive analyses of those methods are conducted and valuable insights are provided through the experimental results. Code for the benchmark and baselines is available.

24.Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion

Authors:Xingxing Zuo, Nan Yang, Nathaniel Merrill, Binbin Xu, Stefan Leutenegger

Abstract: Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high of a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods, and is favorable in large-scale outdoor scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then for refining the recovered 3D geometry, deep features are attentively aggregated from multiview images at potential surface locations, and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate a very competitive real-time reconstruction result for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.
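
For readers unfamiliar with TSDF volumes, the following sketch shows the classical weighted-average TSDF update on a sparse voxel dictionary. It is not the learned, feature-based prediction proposed in the letter; it only illustrates what a TSDF value is and why a sparse structure saves memory when most voxels are empty. The function name, the dictionary layout, and the rule of skipping voxels outside the truncation band are assumptions.

```python
def integrate(volume, voxel, sdf, trunc, weight=1.0):
    """Fuse one signed-distance observation into a sparse TSDF volume.

    volume: dict mapping voxel index (x, y, z) -> (tsdf, accumulated weight)
    sdf:    signed distance from the voxel to the observed surface (metres)
    trunc:  truncation distance (metres)
    """
    # Only voxels near the surface are stored, which keeps the volume sparse.
    if abs(sdf) > trunc:
        return
    new_tsdf = sdf / trunc  # normalized to [-1, 1]
    old_tsdf, old_weight = volume.get(voxel, (0.0, 0.0))
    total = old_weight + weight
    volume[voxel] = ((old_tsdf * old_weight + new_tsdf * weight) / total, total)

volume = {}
integrate(volume, (0, 0, 0), sdf=0.02, trunc=0.04)  # first observation
integrate(volume, (0, 0, 0), sdf=0.00, trunc=0.04)  # second observation
integrate(volume, (5, 5, 5), sdf=1.00, trunc=0.04)  # far from surface: skipped
```

After these calls the volume holds a single voxel whose TSDF is the running average of the two nearby observations.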

25.Dual-Side Feature Fusion 3D Pose Transfer

Authors:Jue Liu, Feipeng Da

Abstract: 3D pose transfer removes the need for the additional inputs and correspondences required by traditional deformation transfer: only the source and target meshes need to be input, and the pose of the source mesh can be transferred to the target mesh. Some lightweight methods proposed in recent years consume less memory but cause spikes and distortions for some unseen poses, while others are costly in training due to the inclusion of large matrix multiplication and adversarial networks. In addition, meshes with different numbers of vertices also increase the difficulty of pose transfer. In this work, we propose a Dual-Side Feature Fusion Pose Transfer Network to improve the pose transfer accuracy of the lightweight method. Our method takes the pose features as one of the side inputs to the decoding network and fuses them into the target mesh layer by layer at multiple scales. Our proposed Feature Fusion Adaptive Instance Normalization has two side input channels that fuse pose features and identity features as denormalization parameters, thus enhancing the pose transfer capability of the network. Extensive experimental results show that our proposed method has stronger pose transfer capability than state-of-the-art methods while maintaining a lightweight network structure, and can converge faster.

26.DC-Net: Divide-and-Conquer for Salient Object Detection

Authors:Jiayi Zhu, Xuebin Qin, Abdulmotaleb Elsaddik

Abstract: In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task to enable the model to learn prior knowledge that is useful for predicting the saliency map. We design a novel network, the Divide-and-Conquer Network (DC-Net), which uses two encoders to solve different subtasks that are conducive to predicting the final saliency map, namely predicting edge maps with width 4 and location maps of salient objects; the feature maps with different semantic information are then aggregated in the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP$^{2}$) modules, which can capture a large number of different scale features with a small number of convolution operations, maintain high resolution throughout, and obtain a large and compact effective receptive field (ERF). Based on the advantage of Divide-and-Conquer's parallel computing, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets at high efficiency (60 FPS and 55 FPS). Code and results are available.

27.ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

Authors:Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, Peter Staar

Abstract: Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions to document layout understanding. In this report, we present the results of our ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. To raise the bar over previous competitions, we engineered a hard competition dataset and proposed the recent DocLayNet dataset for training. We recorded 45 team registrations and received official submissions from 21 teams. In the presented solutions, we recognize interesting combinations of recent computer vision models, data augmentation strategies and ensemble methods to achieve remarkable accuracy in the task we posed. A clear trend towards adoption of vision-transformer based methods is evident. The results demonstrate substantial progress towards achieving robust and highly generalizing methods for document layout understanding.

28.MMNet: Multi-Mask Network for Referring Image Segmentation

Authors:Yichen Yan, Xingjian He, Wenxuan Wan, Jing Liu

Abstract: Referring image segmentation aims to segment an object referred to by a natural language expression from an image. However, this task is challenging due to the distinct data properties between text and image, and the randomness introduced by diverse objects and unrestricted language expression. Most previous work focuses on improving cross-modal feature fusion while not fully addressing the inherent uncertainty caused by diverse objects and unrestricted language. To tackle these problems, we propose an end-to-end Multi-Mask Network for referring image segmentation (MMNet). We first combine the image and the language expression and then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression. We then utilize these queries to produce a series of corresponding segmentation masks, assigning a score to each mask that reflects its importance. The final result is obtained through the weighted sum of all masks, which greatly reduces the randomness of the language expression. Our proposed framework demonstrates superior performance compared to state-of-the-art approaches on three commonly used datasets, RefCOCO, RefCOCO+ and G-Ref, without the need for any post-processing. This further validates the efficacy of our proposed framework.
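
The final fusion step, a weighted sum of candidate masks, can be sketched as follows. The softmax normalization of the scores and the function names are assumptions; the abstract only states that each mask receives an importance score and that the result is a weighted sum.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_masks(masks, scores):
    # masks: list of H x W grids with values in [0, 1]; scores: one per mask.
    weights = softmax(scores)
    height, width = len(masks[0]), len(masks[0][0])
    return [[sum(w * m[i][j] for w, m in zip(weights, masks))
             for j in range(width)]
            for i in range(height)]

# Two hypothetical 1x2 masks with equal scores: the result is their average.
fused = fuse_masks([[[1.0, 0.0]], [[0.0, 0.0]]], scores=[0.0, 0.0])
```

A mask with a much higher score dominates the sum, so a single query that captured an unhelpful aspect of the expression has limited influence on the final prediction.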

29.Sampling-based Uncertainty Estimation for an Instance Segmentation Network

Authors:Florian Heidecker, Ahmad El-Khateeb, Bernhard Sick

Abstract: The examination of uncertainty in the predictions of machine learning (ML) models is receiving increasing attention. One uncertainty modeling technique used for this purpose is Monte-Carlo (MC)-Dropout, where repeated predictions are generated for a single input. Therefore, clustering is required to describe the resulting uncertainty, but only through efficient clustering is it possible to describe the uncertainty from the model attached to each object. This article uses Bayesian Gaussian Mixture (BGM) to solve this problem. In addition, we investigate different values for the dropout rate and other techniques, such as focal loss and calibration, which we integrate into the Mask-RCNN model to obtain the most accurate uncertainty approximation of each instance and showcase it graphically.

30.Scale Matters: Attribution Meets the Wavelet Domain to Explain Model Sensitivity to Image Corruptions

Authors:Gabriel Kasmi, Laurent Dubus, Yves-Marie Saint Drenan, Philippe Blanc

Abstract: Neural networks have shown remarkable performance in computer vision, but their deployment in real-world scenarios is challenging due to their sensitivity to image corruptions. Existing attribution methods are uninformative for explaining the sensitivity to image corruptions, while the literature on robustness only provides model-based explanations. However, the ability to scrutinize models' behavior under image corruptions is crucial to increase the user's trust. Towards this end, we introduce the Wavelet sCale Attribution Method (WCAM), a generalization of attribution from the pixel domain to the space-scale domain. Attribution in the space-scale domain reveals where and on what scales the model focuses. We show that the WCAM explains models' failures under image corruptions, identifies sufficient information for prediction, and explains how zoom-in increases accuracy.

31.IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Authors:Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

Abstract: The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at
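
The iterative loop of questioner, answerer, and reasoner described above can be sketched with plain callables. This is a minimal sketch, not the IdealGPT code: the function signatures, the round limit, and the toy stand-ins for the LLM and VLM calls are all hypothetical.

```python
def iterative_reasoning(question, ask, answer, reason, max_rounds=4):
    """Repeat ask -> answer -> reason until the reasoner is confident.

    ask(question, evidence)    -> list of sub-questions (an LLM in the paper)
    answer(sub_question)       -> sub-answer            (a VLM in the paper)
    reason(question, evidence) -> (final_answer, confident)
    """
    evidence = []
    final = None
    for round_idx in range(1, max_rounds + 1):
        for sub_q in ask(question, evidence):
            evidence.append((sub_q, answer(sub_q)))
        final, confident = reason(question, evidence)
        if confident:
            return final, round_idx
    # Round limit reached: return the last (unconfident) answer.
    return final, max_rounds

# Toy stand-ins (assumptions) so that the loop can be executed.
FACTS = {
    "Is there a person?": "yes",
    "Is the person holding an umbrella?": "yes",
}

def toy_ask(question, evidence):
    asked = {q for q, _ in evidence}
    return [q for q in FACTS if q not in asked][:1]

def toy_answer(sub_question):
    return FACTS[sub_question]

def toy_reason(question, evidence):
    confident = len(evidence) >= 2
    return ("yes" if confident else "unsure"), confident

result = iterative_reasoning("Is it likely raining?", toy_ask, toy_answer, toy_reason)
```

Here the toy reasoner becomes confident once two sub-answers are available, so the loop stops in the second round instead of forcing an answer after the first.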

32.Measuring Faithful and Plausible Visual Grounding in VQA

Authors:Daniel Reich, Felix Putze, Tanja Schultz

Abstract: Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although inference capabilities of VQA models are often illustrated with a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluations and analysis. To this end, we propose a new VG metric that captures if a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., if its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.

33.Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Authors:Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji

Abstract: Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, as they not only need to optimize an excessive number of parameters but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaptation of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. We apply MMA to a recent LLM called LLaMA and term the resulting large vision-language instructed model LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and the superior training efficiency of LaVIN over existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our project has been released.

34.Transferring Visual Attributes from Natural Language to Verified Image Generation

Authors:Rodrigo Valerio, Joao Bordalo, Michal Yarom, Yonattan Bitton, Idan Szpektor, Joao Magalhaes

Abstract: Text to image generation methods (T2I) are widely popular in generating art and other creative artifacts. While visual hallucinations can be a positive factor in scenarios where creativity is appreciated, such artifacts are poorly suited for cases where the generated image needs to be grounded in complex natural language without explicit visual elements. In this paper, we propose to strengthen the consistency property of T2I methods in the presence of natural complex language, which often breaks the limits of T2I methods by including non-visual information, and textual elements that require knowledge for accurate generation. To address these phenomena, we propose a Natural Language to Verified Image generation approach (NL2VI) that converts a natural prompt into a visual prompt, which is more suitable for image generation. A T2I model then generates an image for the visual prompt, which is then verified with VQA algorithms. Experimentally, aligning natural prompts with image generation can improve the consistency of the generated images by up to 11% over the state of the art. Moreover, improvements can generalize to challenging domains like cooking and DIY tasks, where the correctness of the generated image is crucial to illustrate actions.

35.Jointly Optimizing Image Compression with Low-light Image Enhancement

Authors:Shilv Cai, Xu Zou, Liqun Chen, Luxin Yan, Sheng Zhong

Abstract: Learning-based image compression methods have made great progress. Most of them are designed for generic natural images. In fact, low-light images frequently occur due to unavoidable environmental influences or technical limitations, such as insufficient lighting or limited exposure time. Once low-light images are compressed by existing general image compression approaches, useful information (e.g., texture details) would be lost, resulting in a dramatic performance decrease in low-light image enhancement. To simultaneously achieve a higher compression rate and better enhancement performance for low-light images, we propose a novel image compression framework with joint optimization of low-light image enhancement. We design an end-to-end trainable two-branch architecture with lower computational cost, which includes the main enhancement branch and the signal-to-noise ratio (SNR) aware branch. Experimental results show that our proposed joint optimization framework achieves a significant improvement over existing "Compress before Enhance" or "Enhance before Compress" sequential solutions for low-light images. Source codes are included in the supplementary material.

36.PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology

Authors:Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Zhongyi Shui, Xiaoxuan Yu, Yizhi Zhao, Honglin Li, Yunlong Zhang, Ruojia Zhao, Xinheng Lyu, Lin Yang

Abstract: As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, with significant applications in natural image interpretation. However, the field of pathology has largely remained untapped in this regard, despite the growing need for accurate, timely, and personalized diagnostics. To bridge the gap in pathology MLLMs, we present the PathAsst in this study, which is a generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. To develop PathAsst, we collect over 142K high-quality pathology image-text pairs from a variety of reliable sources, including PubMed, comprehensive pathology textbooks, reputable pathology websites, and private data annotated by pathologists. Leveraging the advanced capabilities of ChatGPT/GPT-4, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data, specifically tailored for the invocation of the pathology-specific models, allowing the PathAsst to effectively interact with these models based on the input image and user intent, consequently enhancing the model's diagnostic capabilities. Subsequently, our PathAsst is trained based on Vicuna-13B language model in coordination with the CLIP vision encoder. The results of PathAsst show the potential of harnessing the AI-powered generative foundation model to improve pathology diagnosis and treatment processes. We are committed to open-sourcing our meticulously curated dataset, as well as a comprehensive toolkit designed to aid researchers in the extensive collection and preprocessing of their own datasets. Resources can be obtained at

37.Learning INR for Event-guided Rolling Shutter Frame Correction, Deblur, and Interpolation

Authors:Yunfan Lu, Guoqiang Liang, Lin Wang

Abstract: Images captured by rolling shutter (RS) cameras under fast camera motion often contain obvious image distortions and blur, which can be modeled as a row-wise combination of a sequence of global shutter (GS) frames within the exposure time. Naturally, recovering high-frame-rate sharp GS frames from an RS blur image requires simultaneously considering RS correction, deblurring, and frame interpolation. This task is nontrivial, and to our knowledge, no feasible solutions exist so far. A naive way is to decompose the complete process into separate tasks and simply cascade existing methods; however, this results in cumulative errors and noticeable artifacts. Event cameras enjoy many advantages, e.g., high temporal resolution, making them promising for our problem. To this end, we make the first attempt to recover high-frame-rate sharp GS frames from an RS blur image and paired event data. Our key idea is to learn an implicit neural representation (INR) to directly map the position and time coordinates to RGB values to address the interlocking degradations in the image restoration process. Specifically, we introduce spatial-temporal implicit encoding (STE) to convert an RS blur image and events into a spatial-temporal representation (STR). To query a specific sharp frame (GS or RS), we embed the exposure time into STR and decode the embedded features to recover a sharp frame. Moreover, we propose an RS blur image-guided integral loss to better train the network. Our method is relatively lightweight as it contains only 0.379M parameters and demonstrates high efficiency as the STE is called only once for any number of interpolation frames. Extensive experiments show that our method significantly outperforms prior methods addressing only one or two of the tasks.
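
The row-wise image formation model mentioned at the start of the abstract can be sketched as a forward simulation: each row of the RS image averages that row over the GS frames falling inside its own exposure window. The discretization below (one frame of readout delay per row, exposure measured in frames) and the function name are assumptions, not the paper's formulation.

```python
def rolling_shutter_blur(gs_frames, exposure):
    """Synthesize an RS blurred image from a sequence of sharp GS frames.

    gs_frames: list of T frames, each an H x W grid of intensities
    exposure:  number of consecutive GS frames each row integrates over
    Row r is assumed to start its exposure at frame index r.
    """
    height, width = len(gs_frames[0]), len(gs_frames[0][0])
    if len(gs_frames) < height + exposure - 1:
        raise ValueError("not enough GS frames to cover every row's exposure")
    rs_image = []
    for r in range(height):
        window = gs_frames[r:r + exposure]  # frames seen by row r
        rs_image.append([sum(f[r][c] for f in window) / exposure
                         for c in range(width)])
    return rs_image

# Three hypothetical 2x1 GS frames of a scene that brightens over time.
frames = [[[0.0], [0.0]], [[2.0], [10.0]], [[4.0], [20.0]]]
rs = rolling_shutter_blur(frames, exposure=2)
```

The task described in the abstract is the inverse of this simulation: recovering the sharp GS frames from `rs` and the accompanying events.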

38.Audio-Visual Dataset and Method for Anomaly Detection in Traffic Videos

Authors:Błażej Leporowski, Arian Bakhtiarnia, Nicole Bonnici, Adrian Muscat, Luca Zanella, Yiming Wang, Alexandros Iosifidis

Abstract: We introduce the first audio-visual dataset for traffic anomaly detection taken from real-world scenes, called MAVAD, with a diverse range of weather and illumination conditions. In addition, we propose a novel method named AVACA that combines visual and audio features extracted from video sequences by means of cross-attention to detect anomalies. We demonstrate that the addition of audio improves the performance of AVACA by up to 5.2%. We also evaluate the impact of image anonymization, showing only a minor decrease in performance averaging at 1.7%.
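
Cross-attention between the two modalities can be sketched with plain scaled dot-product attention, here with visual features as queries and audio features as keys and values. The choice of which modality queries the other, the absence of learned projections, and the function names are assumptions; the abstract does not specify AVACA's layer design.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # queries: visual feature vectors; keys/values: audio feature vectors.
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical features: a zero query attends uniformly to both audio tokens.
attended = cross_attention(queries=[[0.0, 0.0]],
                           keys=[[1.0, 0.0], [0.0, 1.0]],
                           values=[[2.0], [4.0]])
```

Each output row is an audio summary tailored to one visual feature, which a downstream anomaly classifier could consume alongside the visual feature itself.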

39.Unpaired Image-to-Image Translation via Neural Schrödinger Bridge

Authors:Beomsu Kim, Gihyun Kwon, Kwanyoung Kim, Jong Chul Ye

Abstract: Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. Although diffusion models have achieved remarkable progress in recent years, they have limitations in unpaired image-to-image translation tasks due to the Gaussian prior assumption. The Schrödinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, has emerged as an attractive solution to this problem. However, no SB model so far has been successful at unpaired translation between high-resolution images. In this work, we propose the Unpaired Neural Schrödinger Bridge (UNSB), which combines SB with adversarial training and regularization to learn an SB between unpaired data. We demonstrate that UNSB is scalable, and that it successfully solves various unpaired image-to-image translation tasks. Code is available.

40.Modeling Complex Object Changes in Satellite Image Time-Series: Approach based on CSP and Spatiotemporal Graph

Authors:Zouhayra Ayadi, Wadii Boulila, Imed Riadh Farah

Abstract: This paper proposes a method for automatically monitoring and analyzing the evolution of complex geographic objects. The objects are modeled as a spatiotemporal graph, which separates filiation relations, spatial relations, and spatiotemporal relations, and is analyzed by detecting frequent sub-graphs using constraint satisfaction problems (CSP). The process is divided into four steps: first, the identification of complex objects in each satellite image; second, the construction of a spatiotemporal graph to model the spatiotemporal changes of the complex objects; third, the creation of sub-graphs to be detected in the base spatiotemporal graph; and fourth, the analysis of the spatiotemporal graph by detecting the sub-graphs and solving a constraint network to determine relevant sub-graphs. The final step is further broken down into two sub-steps: (i) the modeling of the constraint network with defined variables and constraints, and (ii) the solving of the constraint network to find relevant sub-graphs in the spatiotemporal graph. Experiments were conducted using real-world satellite images representing several cities in Saudi Arabia, and the results demonstrate the effectiveness of the proposed approach.

41.InpaintNeRF360: Text-Guided 3D Inpainting on Unbounded Neural Radiance Fields

Authors:Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine Süsstrunk

Abstract: Neural Radiance Fields (NeRF) can generate highly realistic novel views. However, editing 3D scenes represented by NeRF across 360-degree views, particularly removing objects while preserving geometric and photometric consistency, remains a challenging problem due to NeRF's implicit scene representation. In this paper, we propose InpaintNeRF360, a unified framework that utilizes natural language instructions as guidance for inpainting NeRF-based 3D scenes. Our approach employs a promptable segmentation model by generating multi-modal prompts from the encoded text for multiview segmentation. We apply depth-space warping to enforce viewing consistency in the segmentations, and further refine the inpainted NeRF model using perceptual priors to ensure visual plausibility. InpaintNeRF360 is capable of simultaneously removing multiple objects or modifying object appearance based on text instructions while synthesizing 3D viewing-consistent and photo-realistic inpainting. Through extensive experiments on both unbounded and frontal-facing scenes trained through NeRF, we demonstrate the effectiveness of our approach and showcase its potential to enhance the editability of implicit radiance fields.

42.Computer Vision for Construction Progress Monitoring: A Real-Time Object Detection Approach

Authors:Jiesheng Yang, Andreas Wilde, Karsten Menzel, Md Zubair Sheikh, Boris Kuznetsov

Abstract: Construction progress monitoring (CPM) is essential for effective project management, ensuring on-time and on-budget delivery. Traditional CPM methods often rely on manual inspection and reporting, which are time-consuming and prone to errors. This paper proposes a novel approach for automated CPM using state-of-the-art object detection algorithms. The proposed method leverages the real-time capabilities and high accuracy of detectors such as YOLOv8 to identify and track construction elements within site images and videos. A dataset was created, consisting of various building elements and annotated with relevant objects for training and validation. The performance of the proposed approach was evaluated using standard metrics, such as precision, recall, and F1-score, demonstrating significant improvement over existing methods. The integration of Computer Vision into CPM provides stakeholders with reliable, efficient, and cost-effective means to monitor project progress, facilitating timely decision-making and ultimately contributing to the successful completion of construction projects.
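
The evaluation metrics named above follow directly from detection counts. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not results from the paper.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    # Precision: fraction of detections that are correct.
    detected = true_pos + false_pos
    precision = true_pos / detected if detected else 0.0
    # Recall: fraction of ground-truth objects that were detected.
    actual = true_pos + false_neg
    recall = true_pos / actual if actual else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```

For object detection, whether a detection counts as a true positive is usually decided by an overlap threshold against the annotated boxes, which the abstract does not specify.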

43.Thinking Twice: Clinical-Inspired Thyroid Ultrasound Lesion Detection Based on Feature Feedback

Authors:Lingtao Wang, Jianrui Ding, Fenghe Tang, Chunping Ning

Abstract: Accurate detection of thyroid lesions is a critical aspect of computer-aided diagnosis. However, most existing detection methods perform only one feature extraction process and then fuse multi-scale features, which can be affected by noise and blurred features in ultrasound images. In this study, we propose a novel detection network based on a feature feedback mechanism inspired by clinical diagnosis. The mechanism involves first roughly observing the overall picture and then focusing on the details of interest. It comprises two parts: a feedback feature selection module and a feature feedback pyramid. The feedback feature selection module efficiently selects the features extracted in the first phase in both space and channel dimensions to generate high semantic prior knowledge, which is similar to coarse observation. The feature feedback pyramid then uses this high semantic prior knowledge to enhance feature extraction in the second phase and adaptively fuses the two features, similar to fine observation. Additionally, since radiologists often focus on the shape and size of lesions for diagnosis, we propose an adaptive detection head strategy to aggregate multi-scale features. Our proposed method achieves an AP of 70.3% and AP50 of 99.0% on the thyroid ultrasound dataset and meets the real-time requirement. The code is available at

44.Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

Authors:Jinjin Gu, Xianzheng Ma, Xiangtao Kong, Yu Qiao, Chao Dong

Abstract: Deep deraining networks, while successful in laboratory benchmarks, consistently encounter substantial generalization issues when deployed in real-world applications. A prevailing perspective in deep learning encourages the use of highly complex training data, with the expectation that richer image content knowledge will facilitate overcoming the generalization problem. However, through comprehensive and systematic experimentation, we discovered that this strategy does not enhance the generalization capability of these networks. On the contrary, it exacerbates the tendency of networks to overfit to specific degradations. Our experiments reveal that better generalization in a deraining network can be achieved by simplifying the complexity of the training data. This is because the networks are slacking off during training, that is, learning the least complex elements in the image content and degradation to minimize training loss. When the complexity of the background image is less than that of the rain streaks, the network will prioritize the reconstruction of the background, thereby avoiding overfitting to the rain patterns and resulting in improved generalization performance. Our research not only offers a valuable perspective and methodology for better understanding the generalization problem in low-level vision tasks, but also displays promising practical potential.

45.Reliability Scores from Saliency Map Clusters for Improved Image-based Harvest-Readiness Prediction in Cauliflower

Authors:Jana Kierdorf, Ribana Roscher

Abstract: Cauliflower is a hand-harvested crop that must fulfill high-quality standards in sales, making the timing of harvest important. However, accurately determining harvest-readiness can be challenging due to the cauliflower head being covered by its canopy. While deep learning enables automated harvest-readiness estimation, errors can occur due to field-variability and limited training data. In this paper, we analyze the reliability of a harvest-readiness classifier with interpretable machine learning. By identifying clusters of saliency maps, we derive reliability scores for each classification result using knowledge about the domain and the image properties. For unseen data, the reliability can be used to (i) inform farmers to improve their decision-making and (ii) increase the model prediction accuracy. Using RGB images of single cauliflower plants at different developmental stages from the GrowliFlower dataset, we investigate various saliency mapping approaches and find that they result in different quality of reliability scores. With the most suitable interpretation tool, we adjust the classification result and achieve a 15.72% improvement of the overall accuracy to 88.14% and a 15.44% improvement of the average class accuracy to 88.52% for the GrowliFlower dataset.

46.Clinically Labeled Contrastive Learning for OCT Biomarker Classification

Authors:Kiran Kokilepersaud, Stephanie Trejo Corona, Mohit Prabhushankar, Ghassan AlRegib, Charles Wykoff

Abstract: This paper presents a novel positive and negative set selection strategy for contrastive learning of medical images based on labels that can be extracted from clinical data. In the medical field, there exists a variety of labels for data that serve different purposes at different stages of a diagnostic and treatment process. Clinical labels and biomarker labels are two examples. In general, clinical labels are easier to obtain in larger quantities because they are regularly collected during routine clinical care, while biomarker labels require expert analysis and interpretation to obtain. Within the field of ophthalmology, previous work has shown that clinical values exhibit correlations with biomarker structures that manifest within optical coherence tomography (OCT) scans. We exploit this relationship by using the clinical data as pseudo-labels for our data without biomarker labels in order to choose positive and negative instances for training a backbone network with a supervised contrastive loss. In this way, a backbone network learns a representation space that aligns with the clinical data distribution available. Afterwards, we fine-tune the network trained in this manner with the smaller amount of biomarker labeled data with a cross-entropy loss in order to classify these key indicators of disease directly from OCT scans. We also expand on this concept by proposing a method that uses a linear combination of clinical contrastive losses. We benchmark our methods against state-of-the-art self-supervised methods in a novel setting with biomarkers of varying granularity. We show performance improvements by as much as 5% in total biomarker detection AUROC.

47.Deceptive-NeRF: Enhancing NeRF Reconstruction using Pseudo-Observations from Diffusion Models

Authors:Xinhang Liu, Shiu-hong Kao, Jiaben Chen, Yu-Wing Tai, Chi-Keung Tang

Abstract: This paper introduces Deceptive-NeRF, a new method for enhancing the quality of reconstructed NeRF models using synthetically generated pseudo-observations, capable of handling sparse input and removing floater artifacts. Our proposed method involves three key steps: 1) reconstruct a coarse NeRF model from sparse inputs; 2) generate pseudo-observations based on the coarse model; 3) refine the NeRF model using pseudo-observations to produce a high-quality reconstruction. To generate photo-realistic pseudo-observations that faithfully preserve the identity of the reconstructed scene while remaining consistent with the sparse inputs, we develop a rectification latent diffusion model that generates images conditional on a coarse RGB image and depth map, which are derived from the coarse NeRF and latent text embedding from input images. Extensive experiments show that our method is effective and can generate perceptually high-quality NeRF even with very sparse inputs.

48.DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

Authors:Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn

Abstract: The recent progress in diffusion-based text-to-image generation models has significantly expanded generative capabilities via conditioning the text descriptions. However, since relying solely on text prompts is still restrictive for fine-grained customization, we aim to extend the boundaries of conditional generation to incorporate diverse types of modalities, e.g., sketch, box, and style embedding, simultaneously. We thus design a multimodal text-to-image diffusion model, coined as DiffBlender, that achieves the aforementioned goal in a single model by training only a few small hypernetworks. DiffBlender facilitates a convenient scaling of input modalities, without altering the parameters of an existing large-scale generative model to retain its well-established knowledge. Furthermore, our study sets new standards for multimodal generation by conducting quantitative and qualitative comparisons with existing approaches. By diversifying the channels of conditioning modalities, DiffBlender faithfully reflects the provided information or, in its absence, creates imaginative generation.

49.Promoting Generalization in Cross-Dataset Remote Photoplethysmography

Authors:Nathan Vance, Jeremy Speth, Benjamin Sporrer, Patrick Flynn

Abstract: Remote Photoplethysmography (rPPG), or the remote monitoring of a subject's heart rate using a camera, has seen a shift from handcrafted techniques to deep learning models. While current solutions offer substantial performance gains, we show that these models tend to learn a bias to pulse wave features inherent to the training dataset. We develop augmentations to mitigate this learned bias by expanding both the range and variability of heart rates that the model sees while training, resulting in improved model convergence when training and cross-dataset generalization at test time. Through a 3-way cross dataset analysis we demonstrate a reduction in mean absolute error from over 13 beats per minute to below 3 beats per minute. We compare our method with other recent rPPG systems, finding similar performance under a variety of evaluation parameters.

50.GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation

Authors:Wei Zhou, Qian Wang, Weiwei Jin, Xinzhe Shi, Dekui Wang, Xingxing Hao, Yongxiang Yu

Abstract: Recently, graph-based and Transformer-based deep learning networks have demonstrated excellent performances on various point cloud tasks. Most of the existing graph methods are based on static graphs, which take a fixed input to establish graph relations. Moreover, many graph methods apply maximization and averaging to aggregate neighboring features, so that either only a single neighboring point affects the feature of the centroid or different neighboring points have the same influence on the centroid's feature, which ignores the correlation and difference between points. Most Transformer-based methods extract point cloud features based on global attention and lack feature learning on local neighbors. To solve the problems of these two types of models, we propose a new feature extraction block named Graph Transformer and construct a 3D point cloud learning network called GTNet to learn features of point clouds on local and global patterns. Graph Transformer integrates the advantages of graph-based and Transformer-based methods, and consists of Local Transformer and Global Transformer modules. Local Transformer uses a dynamic graph to calculate all neighboring point weights by intra-domain cross-attention with dynamically updated graph relations, so that every neighboring point could affect the features of the centroid with different weights; Global Transformer enlarges the receptive field of Local Transformer by a global self-attention. In addition, to avoid the disappearance of the gradient caused by the increasing depth of the network, we conduct residual connection for centroid features in GTNet; we also adopt the features of centroid and neighbors to generate the local geometric descriptors in Local Transformer to strengthen the local information learning capability of the model. Finally, we use GTNet for shape classification, part segmentation and semantic segmentation tasks in this paper.
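
The contrast this abstract draws between max aggregation and attention-weighted aggregation of neighboring features can be sketched as follows. This is not GTNet's implementation: plain scaled dot-product attention stands in for the paper's intra-domain cross-attention, and the function names and toy 2-D features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_aggregate(neighbors):
    """Channel-wise max: each output channel is taken from a single neighbor."""
    return [max(channel) for channel in zip(*neighbors)]


def attention_aggregate(centroid, neighbors):
    """Weight every neighbor by the softmax of its scaled dot product with the centroid."""
    scale = math.sqrt(len(centroid))
    scores = [sum(c * n for c, n in zip(centroid, nb)) / scale for nb in neighbors]
    weights = softmax(scores)
    fused = [sum(w * nb[d] for w, nb in zip(weights, neighbors))
             for d in range(len(centroid))]
    return fused, weights


# Toy features for illustration only (assumed, not from the paper).
centroid = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled = max_aggregate(neighbors)
fused, weights = attention_aggregate(centroid, neighbors)
print(pooled, fused, weights)
```

With max aggregation only the largest value per channel survives, whereas the attention variant lets every neighbor contribute with its own weight, which is the property the abstract attributes to the Local Transformer.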

51.L-CAD: Language-based Colorization with Any-level Descriptions

Authors:Zheng Chang, Shuchen Weng, Peixuan Zhang, Yu Li, Si Li, Boxin Shi

Abstract: Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions. Previous methods implicitly assume that users provide comprehensive color descriptions for most of the objects in the image, which leads to suboptimal performance. In this paper, we propose a unified model to perform language-based colorization with any-level descriptions. We leverage the pretrained cross-modality generative model for its robust language understanding and rich color priors to handle the inherent ambiguity of any-level descriptions. We further design modules to align with input conditions to preserve local spatial structures and prevent the ghosting effect. With the proposed novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios. Extensive experimental results demonstrate our advantages of effectively handling any-level descriptions and outperforming both language-based and automatic colorization methods. The code and pretrained models are available at:

52.DynStatF: An Efficient Feature Fusion Strategy for LiDAR 3D Object Detection

Authors:Yao Rong, Xiangyu Wei, Tianwei Lin, Yueyu Wang, Enkelejda Kasneci

Abstract: Augmenting LiDAR input with multiple previous frames provides richer semantic information and thus boosts performance in 3D object detection. However, crowded point clouds in multi-frames can hurt the precise position information due to the motion blur and inaccurate point projection. In this work, we propose a novel feature fusion strategy, DynStaF (Dynamic-Static Fusion), which enhances the rich semantic information provided by the multi-frame (dynamic branch) with the accurate location information from the current single-frame (static branch). To effectively extract and aggregate complementary features, DynStaF contains two modules, Neighborhood Cross Attention (NCA) and Dynamic-Static Interaction (DSI), operating through a dual pathway architecture. NCA takes the features in the static branch as queries and the features in the dynamic branch as keys (values). When computing the attention, we address the sparsity of point clouds and take only neighborhood positions into consideration. NCA fuses two features at different feature map scales, followed by DSI providing the comprehensive interaction. To analyze our proposed strategy DynStaF, we conduct extensive experiments on the nuScenes dataset. On the test set, DynStaF increases the performance of PointPillars in NDS by a large margin from 57.7% to 61.6%. When combined with CenterPoint, our framework achieves 61.0% mAP and 67.7% NDS, leading to state-of-the-art performance without bells and whistles.

53.Real time dense anomaly detection by learning on synthetic negative data

Authors:Anja Delić, Matej Grcić, Siniša Šegvić

Abstract: Most approaches to dense anomaly detection rely on generative modeling or on discriminative methods that train with negative data. We consider a recent hybrid method that optimizes the same shared representation according to cross-entropy of the discriminative predictions, and negative log likelihood of the predicted energy-based density. We extend that work with a jointly trained generative flow that samples synthetic negatives at the border of the inlier distribution. The proposed extension provides potential to learn the hybrid method without real negative data. Our experiments analyze the impact of training with synthetic negative data and validate contribution of the energy-based density during training and evaluation.

54.Robust Classification via a Single Diffusion Model

Authors:Huanran Chen, Yinpeng Dong, Zhengyi Wang, Xiao Yang, Chengqi Duan, Hang Su, Jun Zhu

Abstract: Recently, diffusion models have been successfully applied to improving adversarial robustness of image classifiers by purifying the adversarial noises or generating realistic data for adversarial training. However, the diffusion-based purification can be evaded by stronger adaptive attacks while adversarial training does not perform well under unseen threats, exhibiting inevitable limitations of these methods. To better harness the expressive power of diffusion models, in this paper we propose Robust Diffusion Classifier (RDC), a generative classifier that is constructed from a pre-trained diffusion model to be adversarially robust. Our method first maximizes the data likelihood of a given input and then predicts the class probabilities of the optimized input using the conditional likelihood of the diffusion model through Bayes' theorem. Since our method does not require training on particular adversarial attacks, we demonstrate that it is more generalizable to defend against multiple unseen threats. In particular, RDC achieves $73.24\%$ robust accuracy against $\ell_\infty$ norm-bounded perturbations with $\epsilon_\infty=8/255$ on CIFAR-10, surpassing the previous state-of-the-art adversarial training models by $+2.34\%$. The findings highlight the potential of generative classifiers by employing diffusion models for adversarial robustness compared with the commonly studied discriminative classifiers.
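
The Bayes' theorem step mentioned in this abstract, turning class-conditional likelihoods into class probabilities, can be sketched as below. This is an illustrative sketch and not the authors' RDC code: the log-likelihood values are hypothetical placeholders, the uniform prior is an assumption, and how a diffusion model produces or approximates log p(x | y) is outside the scope of the example.

```python
import math


def class_posteriors(log_likelihoods, log_priors=None):
    """Compute p(y | x) from log p(x | y) via Bayes' theorem.

    Uses log-sum-exp so that very negative log-likelihoods do not underflow.
    A uniform prior over classes is assumed when log_priors is None.
    """
    k = len(log_likelihoods)
    if log_priors is None:
        log_priors = [-math.log(k)] * k
    joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    m = max(joint)
    log_evidence = m + math.log(sum(math.exp(j - m) for j in joint))
    return [math.exp(j - log_evidence) for j in joint]


# Hypothetical per-class log-likelihoods of one input (illustration only).
log_px_given_y = [-3050.0, -3042.0, -3047.5]
posteriors = class_posteriors(log_px_given_y)
predicted = max(range(len(posteriors)), key=posteriors.__getitem__)
print(predicted, posteriors)
```

Because the evidence term is shared by all classes, the predicted class is simply the one with the largest log-likelihood plus log-prior; normalization only matters when calibrated probabilities are needed.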

55.Delving Deeper into Data Scaling in Masked Image Modeling

Authors:Cheng-Ze Lu, Xiaojie Jin, Qibin Hou, Jun Hao Liew, Ming-Ming Cheng, Jiashi Feng

Abstract: Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we take a step further and propose to investigate this problem in a more practical setting. Specifically, we utilize the web-collected Coyo-700M dataset. We randomly sample varying numbers of training images from the Coyo dataset and construct a series of sub-datasets, containing 0.5M, 1M, 5M, 10M, and 100M images, for pre-training. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models. The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical. We hope these observations could provide valuable insights for future research on MIM.

56.Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation

Authors:Tong Xu, Micol Spitale, Hao Tang, Lu Liu, Hatice Gunes, Siyang Song

Abstract: Generating facial reactions in a human-human dyadic interaction is complex and highly dependent on the context since more than one facial reaction can be appropriate for the speaker's behaviour. This has challenged existing machine learning (ML) methods, whose training strategies enforce models to reproduce a specific (not multiple) facial reaction from each input speaker behaviour. This paper proposes the first multiple appropriate facial reaction generation framework that re-formulates the one-to-many mapping facial reaction generation problem as a one-to-one mapping problem. This means that we approach this problem by considering the generation of a distribution of the listener's appropriate facial reactions instead of multiple different appropriate facial reactions, i.e., 'many' appropriate facial reaction labels are summarised as 'one' distribution label during training. Our model consists of a perceptual processor, a cognitive processor, and a motor processor. The motor processor is implemented with a novel Reversible Multi-dimensional Edge Graph Neural Network (REGNN). This allows us to obtain a distribution of appropriate real facial reactions during the training process, enabling the cognitive processor to be trained to predict the appropriate facial reaction distribution. At the inference stage, the REGNN decodes an appropriate facial reaction by using this distribution as input. Experimental results demonstrate that our approach outperforms existing models in generating more appropriate, realistic, and synchronized facial reactions. The improved performance is largely attributed to the proposed appropriate facial reaction distribution learning strategy and the use of a REGNN. The code is available at

57.ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers

Authors:Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang

Abstract: Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks. (ii) Additionally, we introduce the detail capture module, which just consists of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViT on image matting with concise adaptation. It inherits many superior properties from ViT to matting, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting, where our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.

58.High Speed Human Action Recognition using a Photonic Reservoir Computer

Authors:Enrico Picco, Piotr Antonik, Serge Massar

Abstract: The recognition of human actions in videos is one of the most active research fields in computer vision. The canonical approach consists of more or less complex preprocessing stages of the raw video data, followed by a relatively simple classification algorithm. Here we address recognition of human actions using the reservoir computing algorithm, which allows us to focus on the classifier stage. We introduce a new training method for the reservoir computer, based on "Timesteps Of Interest", which combines in a simple way short and long time scales. We study the performance of this algorithm using both numerical simulations and a photonic implementation based on a single non-linear node and a delay line on the well-known KTH dataset. We solve the task with high accuracy and speed, to the point of allowing for processing multiple video streams in real time. The present work is thus an important step towards developing efficient dedicated hardware for video processing.

59.MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Authors:Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

Abstract: The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

60.Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation

Authors:Chang Liu, Henghui Ding, Yulun Zhang, Xudong Jiang

Abstract: We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformer only uses the language input for attention weight calculation, which does not explicitly fuse language features in its output. Thus, its output feature is dominated by vision information, which limits the model's ability to comprehensively understand the multi-modal information, and brings uncertainty for the subsequent mask decoder to extract the output mask. To address this issue, we propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities. Based on $\mathrm{M^3Dec}$, we further propose Iterative Multi-modal Interaction ($\mathrm{IMI}$) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction ($\mathrm{LFR}$) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that our proposed approach significantly improves the baseline and outperforms state-of-the-art referring image segmentation methods on RefCOCO series datasets consistently.

61.Training on Thin Air: Improve Image Classification with Generated Data

Authors:Yongchao Zhou, Hshmat Sahak, Jimmy Ba

Abstract: Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning.

62.Visual Programming for Text-to-Image Generation and Evaluation

Authors:Jaemin Cho, Abhay Zala, Mohit Bansal

Abstract: As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that our VPGen offers better control over counts/spatial relations/scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations with a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. We hope our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. Website:

63.A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Authors:Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

Abstract: Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.

64.Mitigating Biased Activation in Weakly-supervised Object Localization via Counterfactual Learning

Authors:Feifei Shao, Yawei Luo, Lei Chen, Ping Liu, Yi Yang, Jun Xiao

Abstract: In this paper, we focus on an under-explored issue of biased activation in prior weakly-supervised object localization methods based on Class Activation Mapping (CAM). We analyze the cause of this problem from a causal view and attribute it to the co-occurring background confounders. Following this insight, we propose a novel Counterfactual Co-occurring Learning (CCL) paradigm to synthesize the counterfactual representations via coupling constant foreground and unrealized backgrounds in order to cut off their co-occurring relationship. Specifically, we design a new network structure called Counterfactual-CAM, which embeds the counterfactual representation perturbation mechanism into the vanilla CAM-based model. This mechanism is responsible for decoupling foreground as well as background and synthesizing the counterfactual representations. By training the detection model with these synthesized representations, we compel the model to focus on the constant foreground content while minimizing the influence of distracting co-occurring background. To the best of our knowledge, it is the first attempt in this direction. Extensive experiments on several benchmarks demonstrate that Counterfactual-CAM successfully mitigates the biased activation problem, achieving improved object localization accuracy.

65.Boundary Attention Mapping (BAM): Fine-grained saliency maps for segmentation of Burn Injuries

Authors:Mahla Abdolahnejad, Justin Lee, Hannah Chan, Alex Morzycki, Olivier Ethier, Anthea Mo, Peter X. Liu, Joshua N. Wong, Colin Hong, Rakesh Joshi

Abstract: Burn injuries can result from mechanisms such as thermal, chemical, and electrical insults. A prompt and accurate assessment of burns is essential for deciding definitive clinical treatments. Currently, the primary approach for burn assessments, via visual and tactile observations, is approximately 60%-80% accurate. The gold standard is biopsy and a close second would be non-invasive methods like Laser Doppler Imaging (LDI) assessments, which have up to 97% accuracy in predicting burn severity and the required healing time. In this paper, we introduce a machine learning pipeline for assessing burn severities and segmenting the regions of skin that are affected by burns. Segmenting 2D colour images of burns allows for the injured versus non-injured skin to be delineated, clearly marking the extent and boundaries of the localized burn/region-of-interest, even during remote monitoring of a burn patient. We trained a convolutional neural network (CNN) to classify four severities of burns. We built a saliency mapping method, Boundary Attention Mapping (BAM), that utilises this trained CNN for the purpose of accurately localizing and segmenting the burn regions from skin burn images. We demonstrated the effectiveness of our proposed pipeline through extensive experiments and evaluations using two datasets: 1) a larger skin burn image dataset consisting of 1684 skin burn images of four burn severities, and 2) an LDI dataset that consists of a total of 184 skin burn images with their associated LDI scans. The CNN trained using the first dataset achieved an average F1-Score of 78% and micro/macro-average ROC of 85% in classifying the four burn severities. Moreover, a comparison between the BAM results and LDI results for measuring injury boundary showed that the segmentations generated by our method achieved 91.60% accuracy, 78.17% sensitivity, and 93.37% specificity.
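The accuracy, sensitivity, and specificity quoted above are standard pixel-wise measures for comparing a predicted binary mask with a reference mask. The sketch below shows how they are commonly computed. The function name and the toy masks are illustrative, and the abstract does not state that the authors evaluated in exactly this way.

```python
def segmentation_metrics(pred, truth):
    """Compare two flattened binary masks (1 = burn, 0 = healthy skin)."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1          # burn pixel correctly marked
        elif p and not t:
            fp += 1          # healthy pixel marked as burn
        elif not p and t:
            fn += 1          # burn pixel missed
        else:
            tn += 1          # healthy pixel correctly left unmarked
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


metrics = segmentation_metrics(
    pred=[1, 1, 0, 0, 0, 1, 0, 0],
    truth=[1, 0, 0, 0, 1, 1, 0, 0],
)
```

When healthy skin dominates the image, accuracy and specificity can be high even if a sizeable share of burn pixels is missed, so sensitivity is worth reading alongside them.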

66.SAMScore: A Semantic Structural Similarity Metric for Image Translation Evaluation

Authors:Yunxiang Li, Meixu Chen, Wenxuan Yang, Kai Wang, Jun Ma, Alan C. Bovik, You Zhang

Abstract: Image translation has wide applications, such as style transfer and modality conversion, usually aiming to generate images having both high degrees of realism and faithfulness. These problems remain difficult, especially when it is important to preserve semantic structures. Traditional image-level similarity metrics are of limited use, since the semantics of an image are high-level, and not strongly governed by pixel-wise faithfulness to an original image. Towards filling this gap, we introduce SAMScore, a generic semantic structural similarity metric for evaluating the faithfulness of image translation models. SAMScore is based on the recent high-performance Segment Anything Model (SAM), which can perform semantic similarity comparisons with standout accuracy. We applied SAMScore on 19 image translation tasks, and found that it is able to outperform all other competitive metrics on all of the tasks. We envision that SAMScore will prove to be a valuable tool that will help to drive the vibrant field of image translation, by allowing for more precise evaluations of new and evolving translation models. The code is available at

67.What can generic neural networks learn from a child's visual experience?

Authors:A. Emin Orhan, Brenden M. Lake

Abstract: Young children develop sophisticated internal models of the world based on their egocentric visual experience. How much of this is driven by innate constraints and how much is driven by their experience? To investigate these questions, we train state-of-the-art neural networks on a realistic proxy of a child's visual experience without any explicit supervision or domain-specific inductive biases. Specifically, we train both embedding models and generative models on 200 hours of headcam video from a single child collected over two years. We train a total of 72 different models, exploring a range of model architectures and self-supervised learning algorithms, and comprehensively evaluate their performance in downstream tasks. The best embedding models perform at 70% of a highly performant ImageNet-trained model on average. They also learn broad semantic categories without any labeled examples and learn to localize semantic categories in an image without any location supervision. However, these models are less object-centric and more background-sensitive than comparable ImageNet-trained models. Generative models trained with the same data successfully extrapolate simple properties of partially masked objects, such as their texture, color, orientation, and rough outline, but struggle with finer object details. We replicate our experiments with two other children and find very similar results. Broadly useful high-level visual representations are thus robustly learnable from a representative sample of a child's visual experience without strong inductive biases.

68.A Neural Space-Time Representation for Text-to-Image Personalization

Authors:Yuval Alaluf, Elad Richardson, Gal Metzer, Daniel Cohen-Or

Abstract: A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.

69.LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Authors:Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang

Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts like numerical and spatial relations to layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves performance comparable to that of human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT achieves comparable performance to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.
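To make the "style sheet language" idea concrete, the sketch below serializes labeled bounding boxes into CSS-like rules of the kind that could be placed in an LLM prompt as an in-context demonstration. The property names, the pixel units, and the function name are assumptions for illustration, not the format used by LayoutGPT.

```python
def layout_to_css(boxes):
    """Render (label, (left, top, width, height)) pairs as CSS-like rules."""
    lines = []
    for label, (left, top, width, height) in boxes:
        lines.append(
            f"{label} {{left: {left}px; top: {top}px; "
            f"width: {width}px; height: {height}px}}"
        )
    return "\n".join(lines)


demo = layout_to_css([("dog", (10, 20, 30, 40)), ("ball", (60, 45, 12, 12))])
```

A structured, familiar syntax like this gives the model explicit numeric fields to fill in, which may be easier to get right than free-form coordinate descriptions.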

70.Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape

Authors:Rundi Wu, Ruoshi Liu, Carl Vondrick, Changxi Zheng

Abstract: Synthesizing novel 3D models that resemble the input example has long been pursued by researchers and artists in computer graphics. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would incur large memory and computational costs. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our model can generate 3D shapes of various types with better quality than prior methods.

71.RoMa: Revisiting Robust Losses for Dense Feature Matching

Authors:Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, Michael Felsberg

Abstract: Dense feature matching is an important computer vision task that involves estimating all correspondences between two images of a 3D scene. In this paper, we revisit robust losses for matching from a Markov chain perspective, yielding theoretical insights and large gains in performance. We begin by constructing a unifying formulation of matching as a Markov chain, based on which we identify two key stages which we argue should be decoupled for matching. The first is the coarse stage, where the estimated result needs to be globally consistent. The second is the refinement stage, where the model needs precise localization capabilities. Inspired by the insight that these stages concern distinct issues, we propose a coarse matcher following the regression-by-classification paradigm that provides excellent globally consistent, albeit not exactly localized, matches. This is followed by a local feature refinement stage using well-motivated robust regression losses, yielding extremely precise matches. Our proposed approach, which we call RoMa, achieves significant improvements compared to the state-of-the-art. Code is available at

72.Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets

Authors:Brandon Smith, Miguel Farinha, Siobhan Mackenzie Hall, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain

Abstract: Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale, but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in-situ. This is problematic because commonly-used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. However, existing image editing methods have limitations and sometimes produce low-quality images, so we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to a more realistic understanding of bias in vision-language models.