arXiv daily

Computer Vision and Pattern Recognition (cs.CV)

Tue, 20 Jun 2023

1.Spatiotemporal Pyramidal CNN with Depth-Wise Separable Convolution for Eye Blinking Detection in the Wild

Authors:Lan Anh Thi Nguy, Bach Nguyen Gia, Thanh Tu Thi Nguyen, Kamioka Eiji, Tan Xuan Phan

Abstract: Eye blinking detection in the wild plays an essential role in deception detection, driving fatigue detection, etc. Although numerous attempts have already been made, most of them have encountered difficulties, such as the derived eye images having different resolutions as the distance between the face and the camera changes, or the requirement of a lightweight detection model to obtain a short inference time in order to perform in real time. In this research, two problems are addressed: how the eye blinking detection model can learn efficiently from eye pictures of different resolutions in diverse conditions, and how to reduce the size of the detection model for faster inference. As a potential solution to the first problem, we propose upsampling and downsampling the input eye images to the same resolution, and then determine which interpolation method yields the highest detection performance. For the second problem, although a recent spatiotemporal convolutional neural network used for eye blinking detection has a strong capacity to extract both spatial and temporal characteristics, it still has a high number of network parameters, leading to high inference time. Therefore, using Depth-wise Separable Convolution rather than conventional convolution layers inside each branch is considered in this paper as a feasible solution.
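
The abstract does not include an implementation; a minimal PyTorch sketch of a depth-wise separable convolution of the kind it substitutes for standard convolutions (channel sizes below are placeholders, not the paper's configuration):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # groups=in_ch convolves each input channel with its own kernel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride, padding, groups=in_ch)
        # 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard convolution
std = nn.Conv2d(64, 128, 3, padding=1)
dws = DepthwiseSeparableConv2d(64, 128)
print(sum(p.numel() for p in std.parameters()))   # ~74k parameters
print(sum(p.numel() for p in dws.parameters()))   # ~9k parameters
```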

2.Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

Authors:Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, Manolis Savva

Abstract: We contribute the Habitat Synthetic Scene Dataset, a dataset of 211 high-quality 3D scenes, and use it to test navigation agent generalization to realistic 3D environments. Our dataset represents real interiors and contains a diverse set of 18,656 models of real-world objects. We investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find and navigate to objects (ObjectGoal navigation). By comparing to synthetic 3D scene datasets from prior work, we find that scale helps in generalization, but the benefits quickly saturate, making visual fidelity and correlation to real-world scenes more important. Our experiments show that agents trained on our smaller-scale dataset can match or outperform agents trained on much larger datasets. Surprisingly, we observe that agents trained on just 122 scenes from our dataset outperform agents trained on 10,000 scenes from the ProcTHOR-10K dataset in terms of zero-shot generalization in real-world scanned environments.

3.RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Authors:Zilun Zhang, Tiancheng Zhao, Yulong Guo, Jianwei Yin

Abstract: Pre-trained Vision-Language Foundation Models utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform domain-specific transfer for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain Foundation Model (DFM), bridging the gap between the General Foundation Model (GFM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and captioning label-only RS datasets with a pre-trained VLM. The result is the first large-scale RS image-text paired dataset. Additionally, we tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DFM. Experimental results show that our proposed dataset is highly effective for various tasks, improving upon the baseline by 8% to 16% in zero-shot classification tasks, and obtaining good results in both Vision-Language Retrieval and Semantic Localization tasks. Finally, we show successful results of training the RS Stable Diffusion model using RS5M, uncovering more use cases of the dataset.

4.Progressive Neural Representation for Sequential Video Compilation

Authors:Haeyong Kang, DaHyun Kim, Jaehong Yoon, Sung Ju Hwang, Chang D Yoo

Abstract: Neural Implicit Representations (NIR) have gained significant attention recently due to their ability to represent complex and high-dimensional data. Unlike explicit representations, which require storing and manipulating individual data points, implicit representations capture information through a learned mapping function without explicitly representing the data points themselves. Existing methods often prune or quantize neural networks after training to accelerate encoding/decoding speed, yet we find that conventional methods fail to transfer learned representations to new videos. This work studies the continuous expansion of implicit video representations as videos arrive sequentially over time, where the model can only access the videos from the current session. We propose a novel neural video representation, Progressive Neural Representation (PNR), that finds an adaptive substructure from the supernet for a given video based on the Lottery Ticket Hypothesis. At each training session, our PNR transfers the learned knowledge of the previously obtained subnetworks to learn the representation of the current video while keeping the past subnetwork weights intact. Therefore it can almost perfectly preserve the decoding ability of the NIR on previous videos (i.e., avoid catastrophic forgetting). We demonstrate the effectiveness of our proposed PNR on neural sequential video representation compilation on the novel UVG8/17 video sequence benchmarks.
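
The abstract does not detail how past subnetwork weights are kept intact; one plausible mechanism, sketched in PyTorch with binary per-session weight masks (all names are ours, not the authors'):

```python
import torch
import torch.nn as nn

class SessionMaskedLinear(nn.Module):
    """Linear layer whose weights are partitioned into per-video subnetworks."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        # 1 where a weight already belongs to a previous video's subnetwork
        self.register_buffer("frozen_mask", torch.zeros(out_f, in_f))

    def forward(self, x, active_mask):
        # frozen entries are used but detached, so past subnetworks stay intact;
        # only the entries selected for the current video receive gradients
        # (masks are assumed to be disjoint binary tensors of shape (out_f, in_f))
        w = self.weight.detach() * self.frozen_mask + self.weight * active_mask
        return x @ w.t()

    def end_session(self, active_mask):
        # after training on a video, add its subnetwork to the frozen set
        self.frozen_mask.add_(active_mask).clamp_(0.0, 1.0)
```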

5.Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging

Authors:Siming Zheng, Xin Yuan

Abstract: We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured in a single measurement. The underlying principle of reconstructing multi-frame images from only one single measurement is to solve an ill-posed inverse problem. By combining optimization algorithms and neural networks, deep unfolding networks (DUNs) have achieved remarkable results in solving inverse problems. In this paper, our proposed model is built under the DUN framework and we propose a 3D Convolution-Transformer Mixture (CTM) module with an efficient and scalable 3D attention model plugged in, which fully learns the correlation between temporal and spatial dimensions by virtue of the Transformer. To the best of our knowledge, this is the first time that a Transformer has been employed for video SCI reconstruction. Besides, to further investigate the high-frequency information during the reconstruction process, which is neglected in previous studies, we introduce variance estimation characterizing the uncertainty on a pixel-by-pixel basis. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) results (with a 1.2 dB gain in PSNR over the previous SOTA algorithm). We will release the code.
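
For context, the video SCI forward model described in the first sentence can be sketched as follows (frame count, resolution, and mask pattern are arbitrary):

```python
import torch

def sci_measurement(frames, masks):
    """frames, masks: (B, H, W) tensors; B high-speed frames are modulated by
    per-frame binary masks and summed into a single 2D snapshot measurement."""
    return (frames * masks).sum(dim=0)

frames = torch.rand(8, 256, 256)                    # 8 high-speed frames
masks = (torch.rand(8, 256, 256) > 0.5).float()     # random binary coding masks
y = sci_measurement(frames, masks)                  # one compressed measurement
```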

6.Meerkat Behaviour Recognition Dataset

Authors:Mitchell Rogers, Gaël Gendron, David Arturo Soriano Valdez, Mihailo Azhar, Yang Chen, Shahrokh Heidari, Caleb Perelini, Padriac O'Leary, Kobe Knowles, Izak Tait, Simon Eyre, Michael Witbrock, Patrice Delmas

Abstract: Recording animal behaviour is an important step in evaluating the well-being of animals and further understanding the natural world. Current methods for documenting animal behaviour within a zoo setting, such as scan sampling, require excessive human effort, are unfit for around-the-clock monitoring, and may produce human-biased results. Several animal datasets already exist that focus predominantly on wildlife interactions, with some extending to action or behaviour recognition. However, there is limited data in a zoo setting or data focusing on the group behaviours of social animals. We introduce a large meerkat (Suricata suricatta) behaviour recognition video dataset with diverse annotated behaviours, including group social interactions, tracking of individuals within the camera view, skewed class distribution, and varying illumination conditions. This dataset includes videos from two positions within the meerkat enclosure at the Wellington Zoo (Wellington, New Zealand), with 848,400 annotated frames across 20 videos and 15 unannotated videos.

7.Depth and DOF Cues Make A Better Defocus Blur Detector

Authors:Yuxin Jin, Ming Qian, Jincheng Xiong, Nan Xue, Gui-Song Xia

Abstract: Defocus blur detection (DBD) separates in-focus and out-of-focus regions in an image. Previous approaches mistakenly treat homogeneous in-focus areas as defocus blur regions, likely because they do not consider the internal factors that cause defocus blur. Inspired by the relationship between depth, depth of field (DOF), and defocus, we propose an approach called D-DFFNet, which incorporates depth and DOF cues in an implicit manner. This allows the model to understand the defocus phenomenon in a more natural way. Our method proposes a depth feature distillation strategy to obtain depth knowledge from a pre-trained monocular depth estimation model and uses a DOF-edge loss to understand the relationship between DOF and depth. Our approach outperforms state-of-the-art methods on public benchmarks and a newly collected large benchmark dataset, EBD. Source code and the EBD dataset are available at: https://github.com/yuxinjin-whu/D-DFFNet.
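
A minimal sketch of a depth feature distillation loss of the kind described (our reading of the abstract, not the released code; the projection layer and feature shapes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_feature_distillation_loss(student_feat, depth_feat, proj):
    """student_feat: (B, Cs, Hs, Ws) features from the DBD student;
    depth_feat: (B, Cd, Hd, Wd) features from a frozen monocular depth model;
    proj: e.g. nn.Conv2d(Cs, Cd, 1) aligning the channel dimensions."""
    target = depth_feat.detach()                     # the depth teacher is not updated
    pred = F.interpolate(proj(student_feat), size=target.shape[-2:],
                         mode="bilinear", align_corners=False)
    return F.mse_loss(pred, target)

# Example usage with random tensors standing in for real features
proj = nn.Conv2d(64, 128, kernel_size=1)
loss = depth_feature_distillation_loss(torch.randn(2, 64, 32, 32),
                                        torch.randn(2, 128, 64, 64), proj)
```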

8.Augmenting Sub-model to Improve Main Model

Authors:Byeongho Heo, Taekyung Kim, Sangdoo Yun, Dongyoon Han

Abstract: Image classification has improved with the development of training techniques. However, these techniques often require careful parameter tuning to balance the strength of regularization, limiting their potential benefits. In this paper, we propose a novel way to use regularization called Augmenting Sub-model (AugSub). AugSub consists of two models: the main model and the sub-model. While the main model employs conventional training recipes, the sub-model leverages the benefit of additional regularization. AugSub achieves this by mitigating adverse effects through a relaxed loss function similar to a self-distillation loss. We demonstrate the effectiveness of AugSub with three drop techniques: dropout, drop-path, and random masking. Our analysis shows that all AugSub variants improve performance, with the training loss converging even faster than in regular training. Among the three, AugMask is identified as the most practical method due to its performance and cost efficiency. We further validate AugMask across diverse training recipes, including DeiT-III, ResNet, MAE fine-tuning, and Swin Transformer. The results show that AugMask consistently provides significant performance gains. AugSub provides a practical and effective solution for introducing additional regularization under various training recipes. Code is available at https://github.com/naver-ai/augsub.
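
A minimal sketch of the AugSub idea for the random-masking variant (AugMask) as we read the abstract; mask_fn, the temperature, and the loss weighting are our placeholders, not the authors' recipe:

```python
import torch
import torch.nn.functional as F

def augsub_step(model, images, labels, mask_fn, tau=1.0):
    logits_main = model(images)                      # main model: conventional recipe
    loss_main = F.cross_entropy(logits_main, labels)

    logits_sub = model(mask_fn(images))              # sub-model: same weights, masked input
    # relaxed, self-distillation-style loss: match the main model's soft prediction
    loss_sub = F.kl_div(F.log_softmax(logits_sub / tau, dim=-1),
                        F.softmax(logits_main.detach() / tau, dim=-1),
                        reduction="batchmean")
    return loss_main + loss_sub
```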

9.KiUT: Knowledge-injected U-Transformer for Radiology Report Generation

Authors:Zhongzhen Huang, Xiaofan Zhang, Shaoting Zhang

Abstract: Radiology report generation aims to automatically generate a clinically accurate and coherent paragraph from the X-ray image, which could relieve radiologists from the heavy burden of report writing. Although various image caption methods have shown remarkable performance in the natural image field, generating accurate reports for medical images requires knowledge of multiple modalities, including vision, language, and medical terminology. We propose a Knowledge-injected U-Transformer (KiUT) to learn multi-level visual representation and adaptively distill the information with contextual and clinical knowledge for word prediction. In detail, a U-connection schema between the encoder and decoder is designed to model interactions between different modalities. And a symptom graph and an injected knowledge distiller are developed to assist the report generation. Experimentally, we outperform state-of-the-art methods on two widely used benchmark datasets: IU-Xray and MIMIC-CXR. Further experimental results prove the advantages of our architecture and the complementary benefits of the injected knowledge.

10.Masked Diffusion Models are Fast Learners

Authors:Jiachen Lei, Peng Cheng, Zhongjie Ba, Kui Ren

Abstract: Diffusion models have emerged as the de-facto technique for image generation, yet they entail significant computational overhead, hindering the technique's broader application in the research community. We propose a prior-based denoising training framework, the first to incorporate the pre-train and fine-tune paradigm into the diffusion model training process, which substantially improves training efficiency and shows potential in facilitating various downstream tasks. Our approach centers on masking a high proportion (e.g., up to 90%) of the input image and employing masked score matching to denoise the visible areas, thereby guiding the diffusion model to learn more salient features from training data as prior knowledge. By utilizing this masked learning process in a pre-training stage, we efficiently train the ViT-based diffusion model on CelebA-HQ 256x256 in the pixel space, achieving a 4x acceleration and enhancing the quality of generated images compared to DDPM. Moreover, our masked pre-training technique is universally applicable to various diffusion models that directly generate images in the pixel space and facilitates learning pre-trained models with excellent generalizability: a diffusion model pre-trained on VGGFace2 attains a 46% quality improvement through fine-tuning with merely 10% local data. Our code is available at https://github.com/jiachenlei/maskdm.
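
The masked score matching objective described here could be sketched roughly as follows (not the authors' code; the DDPM noise schedule, patch size, and normalization are our assumptions):

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(model, x0, t, alpha_bar, mask_ratio=0.9, patch=16):
    """x0: (B, C, H, W) clean images; t: (B,) timesteps; alpha_bar: cumulative schedule."""
    B, C, H, W = x0.shape
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(B, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise                  # standard DDPM forward process

    # random patch mask: 1 = visible, 0 = masked (roughly 90% of patches hidden)
    keep = (torch.rand(B, 1, H // patch, W // patch, device=x0.device) > mask_ratio).float()
    keep = F.interpolate(keep, size=(H, W), mode="nearest")

    pred = model(xt * keep, t)                                   # denoise only the visible areas
    return ((pred - noise) ** 2 * keep).sum() / (keep.sum() * C).clamp(min=1.0)
```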

11.RoMe: Towards Large Scale Road Surface Reconstruction via Mesh Representation

Authors:Ruohong Mei, Wei Sui, Jiaxin Zhang, Qian Zhang, Tao Peng, Cong Yang

Abstract: Large-scale road surface reconstruction is becoming important to autonomous driving systems, as it provides valuable training and testing data. In this paper, we introduce a simple yet efficient method, RoMe, for large-scale Road surface reconstruction via Mesh representations. To simplify the problem, RoMe decomposes a 3D road surface into a triangle mesh and a multilayer perceptron network that models the road elevation implicitly. To retain fine surface details, each mesh vertex has two extra attributes, namely color and semantics. To improve the efficiency of RoMe in large-scale environments, a novel waypoint sampling method is introduced. As such, RoMe can properly preserve road surface details, with computational complexity only linear in the road area. In addition, to improve the accuracy of RoMe, extrinsics optimization is proposed to mitigate inaccurate extrinsic calibrations. Experimental results on popular public datasets also demonstrate the high efficiency and accuracy of RoMe.

12.CrossKD: Cross-Head Knowledge Distillation for Dense Object Detection

Authors:Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, Qibin Hou

Abstract: Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation, which is generally observed to be better than prediction mimicking. In this paper, we show that the inconsistency of the optimization objectives between the ground-truth signals and distillation targets is the key reason for the inefficiency of prediction mimicking. To alleviate this issue, we present a simple yet effective distillation scheme, termed CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. Such a distillation manner relieves the student's head from receiving contradictory supervision signals from the ground-truth annotations and the teacher's predictions, greatly improving the student's detection performance. On MS COCO, with only prediction mimicking losses applied, our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods for object detection. Code is available at https://github.com/jbwang1997/CrossKD.
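
A minimal sketch of the cross-head distillation signal as the abstract describes it (the detection-specific losses and head structure are abstracted away; names are ours):

```python
import torch
import torch.nn.functional as F

def crosskd_loss(student_feat, teacher_head, teacher_pred):
    """student_feat: intermediate features from the student's detection head;
    teacher_head: frozen remainder of the teacher's detection head;
    teacher_pred: the teacher's own predictions for the same image."""
    cross_pred = teacher_head(student_feat)          # student features -> teacher head
    # the cross-head prediction mimics the teacher, so the student's own head is
    # left to follow the ground-truth supervision without conflicting targets
    return F.mse_loss(cross_pred, teacher_pred.detach())
```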

13.HabiCrowd: A High Performance Simulator for Crowd-Aware Visual Navigation

Authors:An Dinh Vuong, Toan Tien Nguyen, Minh Nhat VU, Baoru Huang, Dzung Nguyen, Huynh Thi Thanh Binh, Thieu Vo, Anh Nguyen

Abstract: Visual navigation, a foundational aspect of Embodied AI (E-AI), has been significantly studied in the past few years. While many 3D simulators have been introduced to support visual navigation tasks, scarcely any work has been directed towards incorporating human dynamics, leaving a gap between simulation and real-world applications. Furthermore, current 3D simulators incorporating human dynamics have several limitations, particularly in terms of computational efficiency, which is a key promise of E-AI simulators. To overcome these shortcomings, we introduce HabiCrowd, the first standard benchmark for crowd-aware visual navigation that integrates a crowd dynamics model with diverse human settings into photorealistic environments. Empirical evaluations demonstrate that our proposed human dynamics model achieves state-of-the-art performance in collision avoidance, while exhibiting superior computational efficiency compared to its counterparts. We leverage HabiCrowd to conduct several comprehensive studies on crowd-aware visual navigation tasks and human-robot interactions. The source code and data can be found at https://habicrowd.github.io/.

14.Multi-task Collaborative Pre-training and Individual-adaptive-tokens Fine-tuning: A Unified Framework for Brain Representation Learning

Authors:Ning Jiang, Gongshu Wang, Tianyi Yan

Abstract: Structural magnetic resonance imaging (sMRI) provides accurate estimates of the brain's structural organization, and learning invariant brain representations from sMRI is an enduring issue in neuroscience. Previous deep representation learning models ignore the fact that the brain, as the core of human cognitive activity, is distinct from other organs whose primary attribute is anatomy. Therefore, capturing the semantic structure that dominates interindividual cognitive variability is key to accurately representing the brain. Given that this high-level semantic information is subtle, distributed, and interdependently latent in the brain structure, sMRI-based models need to capture fine-grained details and understand how they relate to the overall global structure. However, existing models are optimized by simple objectives, making features collapse into homogeneity and hindering the simultaneous representation of fine-grained information and holistic semantics, which causes a lack of biological plausibility and limits the interpretation of cognition. Here, we propose MCIAT, a unified framework that combines Multi-task Collaborative pre-training and Individual-Adaptive-Tokens fine-tuning. Specifically, we first synthesize restorative learning, age prediction auxiliary learning and adversarial learning as a joint proxy task for deep semantic representation learning. Then, a mutual-attention-based token selection method is proposed to highlight discriminative features. The proposed MCIAT achieves state-of-the-art diagnosis performance on the ADHD-200 dataset compared with several sMRI-based approaches and shows superior generalization on the MCIC and OASIS datasets. Moreover, we studied 12 behavioral tasks and found significant associations between cognitive functions and MCIAT-established representations, which verifies the interpretability of our proposed framework.

15.MultiEarth 2023 Deforestation Challenge -- Team FOREVER

Authors:Seunghan Park, Dongoo Lee, Yeonju Choi, SungTae Moon

Abstract: Accurately estimating deforestation from satellite imagery is an important problem, since this approach can analyse extensive areas without direct human access. However, it is not a simple problem, because of the difficulty of observing the ground surface clearly under the extensive cloud cover of the long rainy season. In this paper, we present a multi-view learning strategy to predict deforestation status in the Amazon rainforest area with the latest deep neural network models. A multi-modal dataset consisting of three types of satellite imagery, Sentinel-1, Sentinel-2 and Landsat 8, is utilized to train and predict deforestation status. The MMSegmentation framework is selected to apply comprehensive data augmentation and diverse networks. The proposed method effectively and accurately predicts the deforestation status of new queries.

16.MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Authors:Yongzhu Miao, Shasha Li, Jintao Tang, Ting Wang

Abstract: Prompt tuning, like CoOp, has recently shown promising visual recognition and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since this uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.

17.Stable and Consistent Prediction of 3D Characteristic Orientation via Invariant Residual Learning

Authors:Seungwook Kim, Chunghyun Park, Yoonwoo Jeong, Jaesik Park, Minsu Cho

Abstract: Learning to predict reliable characteristic orientations of 3D point clouds is an important yet challenging problem, as different point clouds of the same class may have largely varying appearances. In this work, we introduce a novel method to decouple the shape geometry and semantics of the input point cloud to achieve both stability and consistency. The proposed method integrates shape-geometry-based SO(3)-equivariant learning and shape-semantics-based SO(3)-invariant residual learning, where a final characteristic orientation is obtained by calibrating an SO(3)-equivariant orientation hypothesis using an SO(3)-invariant residual rotation. In experiments, the proposed method not only demonstrates superior stability and consistency but also exhibits state-of-the-art performances when applied to point cloud part segmentation, given randomly rotated inputs.

18.Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards

Authors:Alexander van Meekeren, Maya Aghaei, Klaas Dijkstra

Abstract: Deep object detection models have achieved notable successes in recent years, but one major obstacle remains: the requirement for a large amount of training data. Obtaining such data is a tedious and time-consuming process, leading to the exploration of new research avenues like synthetic data generation techniques. In this study, we explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection and compare it to a baseline model trained on real-world data. After creating a dataset of realistic apple trees with prompt engineering and a previously trained Stable Diffusion model, the custom dataset was annotated and evaluated by training a YOLOv5m object detection model to predict apples in a real-world apple detection dataset. YOLOv5m was chosen for its rapid inference time and minimal hardware demands. Results demonstrate that the model trained on generated data slightly underperforms a baseline model trained on real-world images when evaluated on a set of real-world images. However, these findings remain highly promising, as the average precision differences are only 0.09 and 0.06, respectively. Qualitative results indicate that the model can accurately predict the location of apples, except in cases of heavy shading. These findings illustrate the potential of synthetic data generation techniques as a viable alternative to the collection of extensive training data for object detection models.

19.Multi-Scale Occ: 4th Place Solution for CVPR 2023 3D Occupancy Prediction Challenge

Authors:Yangyang Ding, Luying Huang, Jiachen Zhong

Abstract: In this report, we present the 4th place solution for the CVPR 2023 3D occupancy prediction challenge. We propose a simple method called Multi-Scale Occ for occupancy prediction based on the lift-splat-shoot framework, which introduces multi-scale image features for generating better multi-scale 3D voxel features with temporal fusion of multiple past frames. Post-processing, including model ensembling, test-time augmentation, and class-wise thresholding, is adopted to further boost the final performance. As shown on the leaderboard, our proposed occupancy prediction method ranks 4th with 49.36 mIoU.
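
Class-wise thresholding as a post-processing step could look roughly like the following sketch (the free-space convention and the threshold values are our placeholders, not the authors' settings):

```python
import torch

def classwise_threshold(probs, thresholds, free_class=0):
    """probs: (C, X, Y, Z) per-class occupancy probabilities;
    thresholds: length-C list of per-class confidence thresholds."""
    labels = probs.argmax(dim=0)                        # most likely class per voxel
    conf = probs.max(dim=0).values
    thr = torch.as_tensor(thresholds, device=probs.device)[labels]
    labels[conf < thr] = free_class                     # low-confidence voxels become free space
    return labels
```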

20.UM-CAM: Uncertainty-weighted Multi-resolution Class Activation Maps for Weakly-supervised Fetal Brain Segmentation

Authors:Jia Fu, Tao Lu, Shaoting Zhang, Guotai Wang

Abstract: Accurate segmentation of the fetal brain from Magnetic Resonance Images (MRI) is important for prenatal assessment of fetal development. Although deep learning has shown the potential to achieve this task, it requires a large, finely annotated dataset that is difficult to collect. To address this issue, weakly-supervised segmentation methods with image-level labels have gained attention, which are commonly based on class activation maps from a classification network trained with image tags. However, most of these methods suffer from incomplete activation regions due to low-resolution localization without detailed boundary cues. To this end, we propose a novel weakly-supervised method with image-level labels based on semantic features and context information exploration. We first propose an Uncertainty-weighted Multi-resolution Class Activation Map (UM-CAM) to generate high-quality pixel-level supervision. Then, we design a Geodesic distance-based Seed Expansion (GSE) method to provide context information for rectifying the ambiguous boundaries of UM-CAM. Extensive experiments on a fetal brain dataset show that our UM-CAM can provide more accurate activation regions with fewer false positives than existing CAM variants, and that our proposed method outperforms state-of-the-art weakly-supervised methods with image-level labels.
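
The abstract does not spell out the weighting; one plausible uncertainty-weighted fusion of multi-resolution CAMs (the entropy-based weights are our assumption) is sketched below:

```python
import math
import torch
import torch.nn.functional as F

def um_cam(cams, size):
    """cams: list of (B, 1, h_i, w_i) foreground activation maps in [0, 1];
    returns a fused (B, 1, H, W) map at the target resolution `size`."""
    fused_num, fused_den = 0.0, 0.0
    for cam in cams:
        cam = F.interpolate(cam, size=size, mode="bilinear", align_corners=False)
        p = cam.clamp(1e-6, 1 - 1e-6)
        entropy = -(p * p.log() + (1 - p) * (1 - p).log())      # pixel-wise uncertainty
        weight = 1.0 - entropy / math.log(2.0)                  # certain pixels get more weight
        fused_num = fused_num + weight * cam
        fused_den = fused_den + weight
    return fused_num / fused_den.clamp(min=1e-6)
```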

21.EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model

Authors:Lianying Yin, Yijun Wang, Tianyu He, Jinming Liu, Wei Zhao, Bohan Li, Xin Jin, Jianxin Lin

Abstract: Although previous co-speech gesture generation methods are able to synthesize motions in line with speech content, it is still challenging to handle diverse and complicated motion distributions. The key challenges are: 1) the one-to-many nature of the mapping between speech content and gestures; 2) the correlation modeling between the body joints. In this paper, we present a novel framework (EMoG) to tackle the above challenges with denoising diffusion models: 1) To alleviate the one-to-many problem, we incorporate emotion clues to guide the generation process, making the generation much easier; 2) To model joint correlation, we propose to decompose the difficult gesture generation into two sub-problems: joint correlation modeling and temporal dynamics modeling. Then, the two sub-problems are explicitly tackled with our proposed Joint Correlation-aware transFormer (JCFormer). Through extensive evaluations, we demonstrate that our proposed method surpasses previous state-of-the-art approaches, offering substantial superiority in gesture synthesis.

22.Pushing the Limits of 3D Shape Generation at Scale

Authors:Wang Yu, Xuelin Qian, Jingyang Huo, Tiejun Huang, Bo Zhao, Yanwei Fu

Abstract: We present a significant breakthrough in 3D shape generation by scaling it to unprecedented dimensions. Through the adaptation of the Auto-Regressive model and the utilization of large language models, we have developed a remarkable model with an astounding 3.6 billion trainable parameters, establishing it as the largest 3D shape generation model to date, named Argus-3D. Our approach addresses the limitations of existing methods by enhancing the quality and diversity of generated 3D shapes. To tackle the challenges of high-resolution 3D shape generation, our model incorporates tri-plane features as latent representations, effectively reducing computational complexity. Additionally, we introduce a discrete codebook for efficient quantization of these representations. Leveraging the power of transformers, we enable multi-modal conditional generation, facilitating the production of diverse and visually impressive 3D shapes. To train our expansive model, we leverage an ensemble of publicly-available 3D datasets, consisting of a comprehensive collection of approximately 900,000 objects from renowned repositories such as ModelNet40, ShapeNet, Pix3D, 3D-Future, and Objaverse. This diverse dataset empowers our model to learn from a wide range of object variations, bolstering its ability to generate high-quality and diverse 3D shapes. Extensive experimentation demonstrates the remarkable efficacy of our approach in significantly improving the visual quality of generated 3D shapes. By pushing the boundaries of 3D generation, introducing novel methods for latent representation learning, and harnessing the power of transformers for multi-modal conditional generation, our contributions pave the way for substantial advancements in the field. Our work unlocks new possibilities for applications in gaming, virtual reality, product design, and other domains that demand high-quality and diverse 3D objects.

23.TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting

Authors:Liang Liao, Taorong Liu, Delin Chen, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin'ichi Satoh

Abstract: Image inpainting for completing complicated semantic environments and diverse hole patterns of corrupted images is challenging even for state-of-the-art learning-based inpainting methods trained on large-scale data. A reference image capturing the same scene as a corrupted image offers informative guidance for completing it, as it shares texture and structure priors similar to those of the holes in the corrupted image. In this work, we propose a transformer-based encoder-decoder network, named TransRef, for reference-guided image inpainting. Specifically, the guidance is conducted progressively through a reference embedding procedure, in which the reference features are subsequently aligned and fused with the features of the corrupted image. For precise utilization of the reference features for guidance, a reference-patch alignment (Ref-PA) module is proposed to align the patch features of the reference and corrupted images and harmonize their style differences, while a reference-patch transformer (Ref-PT) module is proposed to refine the embedded reference features. Moreover, to facilitate research on reference-guided image restoration tasks, we construct a publicly accessible benchmark dataset containing 50K pairs of input and reference images. Both quantitative and qualitative evaluations demonstrate the efficacy of the reference information and the superiority of the proposed method over state-of-the-art methods in completing complex holes. Code and dataset can be accessed at https://github.com/Cameltr/TransRef.

24.3D Keypoint Estimation Using Implicit Representation Learning

Authors:Xiangyu Zhu, Dong Du, Haibin Huang, Chongyang Ma, Xiaoguang Han

Abstract: In this paper, we tackle the challenging problem of 3D keypoint estimation of general objects using a novel implicit representation. Previous works have demonstrated promising results for keypoint prediction through direct coordinate regression or heatmap-based inference. However, these methods are commonly studied for specific subjects, such as human bodies and faces, which possess fixed keypoint structures. They also suffer in several practical scenarios where explicit or complete geometry is not given, including images and partial point clouds. Inspired by the recent success of advanced implicit representation in reconstruction tasks, we explore the idea of using an implicit field to represent keypoints. Specifically, our key idea is employing spheres to represent 3D keypoints, thereby enabling the learnability of the corresponding signed distance field. Explicit keypoints can be extracted subsequently by our algorithm based on the Hough transform. Quantitative and qualitative evaluations also show the superiority of our representation in terms of prediction accuracy.
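
A minimal sketch of representing keypoints as spheres through a signed distance field, as the abstract suggests (the radius is an arbitrary illustration, not the paper's value):

```python
import torch

def keypoint_sphere_sdf(query, keypoints, radius=0.05):
    """query: (N, 3) query points; keypoints: (K, 3) sphere centers.
    Returns, for each query point, the signed distance to the nearest keypoint
    sphere (negative inside a sphere, positive outside)."""
    dist = torch.cdist(query, keypoints)     # (N, K) Euclidean distances to all centers
    return dist.min(dim=1).values - radius
```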

25.Audio-Driven 3D Facial Animation from In-the-Wild Videos

Authors:Liying Lu, Tianke Zhang, Yunfei Liu, Xuangeng Chu, Yu Li

Abstract: Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head. Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs. Consequently, their generalization capability remains limited. In this paper, we propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model. The abundance of easily accessible 2D talking-head videos equips our model with a robust generalization capability. By combining these videos with existing 3D face reconstruction methods, our model excels in generating consistent and high-fidelity lip synchronization. Additionally, our model proficiently captures the speaking styles of different individuals, allowing it to generate 3D talking-heads with distinct personal styles. Extensive qualitative and quantitative experimental results demonstrate the superiority of our method.

26.Bullying10K: A Neuromorphic Dataset towards Privacy-Preserving Bullying Recognition

Authors:Yiting Dong, Yang Li, Dongcheng Zhao, Guobin Shen, Yi Zeng

Abstract: The prevalence of violence in daily life poses significant threats to individuals' physical and mental well-being. Using surveillance cameras in public spaces has proven effective in proactively deterring and preventing such incidents. However, concerns regarding privacy invasion have emerged due to their widespread deployment. To address the problem, we leverage Dynamic Vision Sensor (DVS) cameras to detect violent incidents while preserving privacy, since they capture pixel brightness variations instead of static imagery. We introduce the Bullying10K dataset, encompassing various actions, complex movements, and occlusions from real-life scenarios. It provides three benchmarks for evaluating different tasks: action recognition, temporal action localization, and pose estimation. With 10,000 event segments, totaling 12 billion events and 255 GB of data, Bullying10K contributes significantly by balancing violence detection and personal privacy preservation. It also poses a new challenge for neuromorphic datasets. It will serve as a valuable resource for training and developing privacy-protecting video systems. Bullying10K opens new possibilities for innovative approaches in these domains.

27.NeRF synthesis with shading guidance

Authors:Chenbin Li, Yu Xin, Gaoyi Liu, Xiang Zeng, Ligang Liu

Abstract: The emerging Neural Radiance Field (NeRF) shows great potential in representing 3D scenes and can render photo-realistic images from novel views given only sparse input views. However, utilizing NeRF to reconstruct real-world scenes requires images from different viewpoints, which limits its practical application. This problem can be even more pronounced for large scenes. In this paper, we introduce a new task called NeRF synthesis that utilizes the structural content of a NeRF patch exemplar to construct a new radiance field of large size. We propose a two-phase method for synthesizing new scenes that are continuous in geometry and appearance. We also propose a boundary constraint method to synthesize scenes of arbitrary size without artifacts. Specifically, we control the lighting effects of synthesized scenes using shading guidance instead of decoupling the scene. We demonstrate that our method can generate high-quality results with consistent geometry and appearance, even for scenes with complex lighting. We can also synthesize new scenes on curved surfaces with arbitrary lighting effects, which enhances the practicality of our proposed NeRF synthesis approach.

28.Computing a human-like reaction time metric from stable recurrent vision models

Authors:Lore Goetschalckx, Lakshmi Narasimhan Govindarajan, Alekh Karkada Ashok, Aarit Ahuja, David L. Sheinberg, Thomas Serre

Abstract: The meteoric rise in the adoption of deep neural networks as computational models of vision has inspired efforts to "align" these models with humans. One dimension of interest for alignment includes behavioral choices, but moving beyond characterizing choice patterns to capturing temporal aspects of visual decision-making has been challenging. Here, we sketch a general-purpose methodology to construct computational accounts of reaction times from a stimulus-computable, task-optimized model. Specifically, we introduce a novel metric leveraging insights from subjective logic theory summarizing evidence accumulation in recurrent vision models. We demonstrate that our metric aligns with patterns of human reaction times for stimulus manipulations across four disparate visual decision-making tasks spanning perceptual grouping, mental simulation, and scene categorization. This work paves the way for exploring the temporal alignment of model and human visual strategies in the context of various other cognitive tasks toward generating testable hypotheses for neuroscience.

29.Deep Double Self-Expressive Subspace Clustering

Authors:Ling Zhao, Yunpeng Ma, Shanxiong Chen, Jun Zhou

Abstract: Deep subspace clustering based on auto-encoders has received wide attention. However, most auto-encoder-based subspace clustering methods do not utilize the structural information in the self-expressive coefficient matrix, which limits the clustering performance. In this paper, we propose a double self-expressive subspace clustering algorithm. The key idea of our solution is to view the self-expressive coefficients as a feature representation of the examples in order to obtain a second coefficient matrix. Then, we use the two coefficient matrices to construct the affinity matrix for spectral clustering. We find that this can reduce the subspace-preserving representation error and improve connectivity. To further enhance the clustering performance, we propose a self-supervised module based on contrastive learning, which can further improve the performance of the trained network. Experiments on several benchmark datasets demonstrate that the proposed algorithm can achieve better clustering than state-of-the-art methods.
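
A minimal sketch of the final clustering step as described, combining two self-expressive coefficient matrices into an affinity for spectral clustering (the symmetrization and equal weighting are our assumptions):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_coefficients(C1, C2, n_clusters):
    """C1, C2: (N, N) self-expressive coefficient matrices for N samples."""
    A1 = 0.5 * (np.abs(C1) + np.abs(C1).T)          # symmetrized affinity from the first matrix
    A2 = 0.5 * (np.abs(C2) + np.abs(C2).T)          # ... and from the second
    affinity = 0.5 * (A1 + A2)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(affinity)
```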

30.Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion

Authors:Simone Bianco, Luigi Celona, Marco Donzella, Paolo Napoletano

Abstract: State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training. This dataset contains annotations provided by human annotators, who typically produce captions averaging around ten tokens. However, this constraint presents a challenge in effectively capturing complex scenes and conveying detailed information. Furthermore, captioning models tend to exhibit bias towards the ``average'' caption, which captures only the more general aspects. What would happen if we were able to automatically generate longer captions, thereby making them more detailed? Would these captions, evaluated by humans, be more or less representative of the image content compared to the original MS-COCO captions? In this paper, we present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused, resulting in richer captions. Our proposed method leverages existing models from the literature, eliminating the need for additional training. Instead, it utilizes an image-text based metric to rank the captions generated by SoTA models for a given image. Subsequently, the top two captions are fused using a Large Language Model (LLM). Experimental results demonstrate the effectiveness of our approach, as the captions generated by our model exhibit higher consistency with human judgment when evaluated on the MS-COCO test set. By combining the strengths of various SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich, informative nature of human-generated descriptions. This advance opens up new possibilities for generating captions that are more suitable for the training of both vision-language and captioning models.
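
A rough sketch of the ranking step described here, using CLIP as the image-text metric (our choice of scorer, not necessarily the paper's; the LLM fusion call is omitted and only indicated in a comment):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_two_captions(image_path, captions):
    """Score each candidate caption against the image and keep the two best."""
    image = Image.open(image_path)
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]   # image-text similarity per caption
    best = scores.topk(2).indices.tolist()
    return [captions[i] for i in best]

# The two selected captions would then be passed to an LLM with a fusion prompt,
# e.g. "Combine these two captions into one detailed description: ..."
```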

31.BEVScope: Enhancing Self-Supervised Depth Estimation Leveraging Bird's-Eye-View in Dynamic Scenarios

Authors:Yucheng Mao, Ruowen Zhao, Tianbao Zhang, Hang Zhao

Abstract: Depth estimation is a cornerstone of perception in autonomous driving and robotic systems. The considerable cost and relatively sparse data acquisition of LiDAR systems have led to the exploration of cost-effective alternatives, notably self-supervised depth estimation. Nevertheless, current self-supervised depth estimation methods grapple with several limitations: (1) the failure to adequately leverage informative multi-camera views; (2) the limited capacity to handle dynamic objects effectively. To address these challenges, we present BEVScope, an innovative approach to self-supervised depth estimation that harnesses Bird's-Eye-View (BEV) features. Concurrently, we propose an adaptive loss function, specifically designed to mitigate the complexities associated with moving objects. Empirical evaluations conducted on the nuScenes dataset validate our approach, demonstrating competitive performance. Code will be released at https://github.com/myc634/BEVScope.

32.Annotation Cost Efficient Active Learning for Content Based Image Retrieval

Authors:Julia Henkel, Genc Hoxha, Gencer Sumbul, Lars Möllenbrok, Begüm Demir

Abstract: Deep metric learning (DML) based methods have been found very effective for content-based image retrieval (CBIR) in remote sensing (RS). For accurately learning the model parameters of deep neural networks, most of the DML methods require a high number of annotated training images, which can be costly to gather. To address this problem, in this paper we present an annotation cost efficient active learning (AL) method (denoted as ANNEAL). The proposed method aims to iteratively enrich the training set by annotating the most informative image pairs as similar or dissimilar, while accurately modelling a deep metric space. This is achieved by two consecutive steps. In the first step the pairwise image similarity is modelled based on the available training set. Then, in the second step the most uncertain and diverse (i.e., informative) image pairs are selected to be annotated. Unlike the existing AL methods for CBIR, at each AL iteration of ANNEAL a human expert is asked to annotate the most informative image pairs as similar/dissimilar. This significantly reduces the annotation cost compared to annotating images with land-use/land cover class labels. Experimental results show the effectiveness of our method. The code of ANNEAL is publicly available at https://git.tu-berlin.de/rsim/ANNEAL.

33.Collision Avoidance Detour for Multi-Agent Trajectory Forecasting

Authors:Hsu-kuang Chiu, Stephen F. Smith

Abstract: We present our approach, Collision Avoidance Detour (CAD), which won the 3rd place award in the 2023 Waymo Open Dataset Challenge - Sim Agents, held at the 2023 CVPR Workshop on Autonomous Driving. To satisfy the motion prediction factorization requirement, we partition all the valid objects into three mutually exclusive sets: Autonomous Driving Vehicle (ADV), World-tracks-to-predict, and World-others. We use different motion models to forecast their future trajectories independently. Furthermore, we also apply collision avoidance detour resampling, additive Gaussian noise, and velocity-based heading estimation to improve the realism of our simulation result.

34.SkyGPT: Probabilistic Short-term Solar Forecasting Using Synthetic Sky Videos from Physics-constrained VideoGPT

Authors:Yuhao Nie, Eric Zelikman, Andea Scott, Quentin Paletta, Adam Brandt

Abstract: In recent years, deep learning-based solar forecasting using all-sky images has emerged as a promising approach for alleviating uncertainty in PV power generation. However, the stochastic nature of cloud movement remains a major challenge for accurate and reliable solar forecasting. With the recent advances in generative artificial intelligence, the synthesis of visually plausible yet diversified sky videos has potential for aiding in forecasts. In this study, we introduce SkyGPT, a physics-informed stochastic video prediction model that is able to generate multiple possible future images of the sky with diverse cloud motion patterns, by using past sky image sequences as input. Extensive experiments and comparison with benchmark video prediction models demonstrate the effectiveness of the proposed model in capturing cloud dynamics and generating future sky images with high realism and diversity. Furthermore, we feed the generated future sky images from the video prediction models for 15-minute-ahead probabilistic solar forecasting for a 30-kW roof-top PV system, and compare it with an end-to-end deep learning baseline model SUNSET and a smart persistence model. Better PV output prediction reliability and sharpness are observed by using the predicted sky images generated with SkyGPT compared with other benchmark models, achieving a continuous ranked probability score (CRPS) of 2.81 (13% better than SUNSET and 23% better than smart persistence) and a Winkler score of 26.70 for the test set. Although an arbitrary number of futures can be generated from a historical sky image sequence, the results suggest that 10 future scenarios is a good choice that balances probabilistic solar forecasting performance and computational cost.
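
For reference, the CRPS reported above can be estimated for an ensemble forecast with the standard ensemble estimator CRPS = E|X - y| - 0.5 E|X - X'|; a small NumPy sketch (the sample values are made up):

```python
import numpy as np

def crps_ensemble(samples, observation):
    """samples: ensemble forecast members; observation: the realized value."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - observation))                 # E|X - y|
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))  # 0.5 * E|X - X'|
    return term1 - term2

print(crps_ensemble([28.0, 30.5, 26.2, 31.0], 29.3))
```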

35.GenPlot: Increasing the Scale and Diversity of Chart Derendering Data

Authors:Brendan Artley

Abstract: Vertical bars, horizontal bars, dot, scatter, and line plots provide a diverse set of visualizations to represent data. To understand these plots, one must be able to recognize textual components, locate data points in a plot, and process diverse visual contexts to extract information. In recent works such as Pix2Struct, Matcha, and Deplot, OCR-free chart-to-text translation has achieved state-of-the-art results on visual language tasks. These results underline the importance of chart-derendering as a pre-training objective, yet existing datasets provide a fixed set of training examples. In this paper, we propose GenPlot, a plot generator that can generate billions of additional plots for chart-derendering using synthetic data.

36.Data-Driven but Privacy-Conscious: Pedestrian Dataset De-identification via Full-Body Person Synthesis

Authors:Maxim Maximov, Tim Meinhardt, Ismail Elezi, Zoe Papakipos, Caner Hazirbas, Cristian Canton, Laura Leal-Taixé

Abstract: The advent of data-driven technology solutions is accompanied by an increasing concern with data privacy. This is of particular importance for human-centered image recognition tasks, such as pedestrian detection, re-identification, and tracking. To highlight the importance of privacy issues and motivate future research, we introduce the Pedestrian Dataset De-Identification (PDI) task. PDI evaluates the degree of de-identification and downstream task training performance for a given de-identification method. As a first baseline, we propose IncogniMOT, a two-stage full-body de-identification pipeline based on image synthesis via generative adversarial networks. The first stage replaces target pedestrians with synthetic identities. To improve downstream task performance, we then apply stage two, which blends and adapts the synthetic image parts into the data. To demonstrate the effectiveness of IncogniMOT, we generate a fully de-identified version of the MOT17 pedestrian tracking dataset and analyze its application as training data for pedestrian re-identification, detection, and tracking models. Furthermore, we show how our data is able to narrow the synthetic-to-real performance gap in a privacy-conscious manner.

37.Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Authors:Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, Vincent Sitzmann

Abstract: Denoising diffusion models are a powerful type of generative models used to capture complex distributions of real-world signals. However, their applicability is limited to scenarios where training samples are readily available, which is not always the case in real-world applications. For example, in inverse graphics, the goal is to generate samples from a distribution of 3D scenes that align with a given image, but ground-truth 3D scenes are unavailable and only 2D images are accessible. To address this limitation, we propose a novel class of denoising diffusion probabilistic models that learn to sample from distributions of signals that are never directly observed. Instead, these signals are measured indirectly through a known differentiable forward model, which produces partial observations of the unknown signal. Our approach involves integrating the forward model directly into the denoising process. This integration effectively connects the generative modeling of observations with the generative modeling of the underlying signals, allowing for end-to-end training of a conditional generative model over signals. During inference, our approach enables sampling from the distribution of underlying signals that are consistent with a given partial observation. We demonstrate the effectiveness of our method on three challenging computer vision tasks. For instance, in the context of inverse graphics, our model enables direct sampling from the distribution of 3D scenes that align with a single 2D input image.

38.How can objects help action recognition?

Authors:Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

Abstract: Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects or their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works which either drop tokens at the cost of accuracy, or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. And second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets.

39.Dense Video Object Captioning from Disjoint Supervision

Authors:Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

Abstract: We propose a new task and model for dense video object captioning -- detecting, tracking, and captioning trajectories of all objects in a video. This task unifies spatial and temporal understanding of the video, and requires fine-grained language description. Our model for dense video object captioning is trained end-to-end and consists of different modules for spatial localization, tracking, and captioning. As such, we can train our model with a mixture of disjoint tasks, and leverage diverse, large-scale datasets which supervise different parts of our model. This results in noteworthy zero-shot performance. Moreover, by finetuning a model from this initialization, we can further improve our performance, surpassing strong image-based baselines by a significant margin. Although we are not aware of other work performing this task, we are able to repurpose existing video grounding datasets for our task, namely VidSTG and VLN. We show our task is more general than grounding, and models trained on our task can directly be applied to grounding by finding the bounding box with the maximum likelihood of generating the query sentence. Our model outperforms dedicated, state-of-the-art models for spatial grounding on both VidSTG and VLN.

40.Learning Profitable NFT Image Diffusions via Multiple Visual-Policy Guided Reinforcement Learning

Authors:Huiguo He, Tianfu Wang, Huan Yang, Jianlong Fu, Nicholas Jing Yuan, Jian Yin, Hongyang Chao, Qi Zhang

Abstract: We study the task of generating profitable Non-Fungible Token (NFT) images from user-input texts. Recent advances in diffusion models have shown great potential for image generation. However, existing works can fall short in generating visually-pleasing and highly-profitable NFT images, mainly due to the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT image, and 2) effective optimization metrics for generating high-quality NFT images. To solve these challenges, we propose a Diffusion-based generation framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for NFT images. The proposed framework consists of a large language model (LLM), a diffusion-based image generator, and a series of visual rewards by design. First, the LLM enhances a basic human input (such as "panda") by generating more comprehensive NFT-style prompts that include specific visual attributes, such as "panda with Ninja style and green background." Second, the diffusion-based image generator is fine-tuned using a large-scale NFT dataset to capture fine-grained image styles and accessory compositions of popular NFT elements. Third, we further propose to utilize multiple visual-policies as optimization goals, including visual rarity levels, visual aesthetic scores, and CLIP-based text-image relevances. This design ensures that our proposed Diffusion-MVP is capable of minting NFT images with high visual quality and market value. To facilitate this research, we have collected the largest publicly available NFT image dataset to date, consisting of 1.5 million high-quality images with corresponding texts and market values. Extensive experiments including objective evaluations and user studies demonstrate that our framework can generate NFT images showing more visually engaging elements and higher market value, compared with SOTA approaches.

41.Self-supervised Multi-task Learning Framework for Safety and Health-Oriented Connected Driving Environment Perception using Onboard Camera

Authors:Shaocheng Jia, Wei Yao

Abstract: Cutting-edge connected vehicle (CV) technologies have drawn much attention in recent years. The real-time traffic data captured by a CV can be shared with other CVs and data centers so as to open new possibilities for solving diverse transportation problems. However, imagery captured by onboard cameras in a connected environment has not been sufficiently investigated, especially for safety- and health-oriented visual perception. In this paper, a bidirectional process of image synthesis and decomposition (BPISD) approach is proposed, yielding a novel self-supervised multi-task learning framework that simultaneously estimates the depth map, atmospheric visibility, airlight, and PM2.5 mass concentration; depth and visibility are closely associated with traffic safety, while airlight and PM2.5 mass concentration are directly correlated with human health. Both the training and testing phases of the proposed system require only a single image as input. Due to the innovative training pipeline, the depth estimation network can handle various levels of visibility and overcome inherent problems in current image-synthesis-based depth estimation, thereby generating high-quality depth maps even in low-visibility situations and further benefiting accurate estimation of visibility, airlight, and PM2.5 mass concentration. Extensive experiments on data synthesized from KITTI and real-world data collected in Beijing demonstrate that the proposed method can (1) achieve depth-estimation performance competitive with state-of-the-art methods when taking clear images as input; (2) predict vivid depth maps for images contaminated by various levels of haze; and (3) accurately estimate visibility, airlight, and PM2.5 mass concentrations. Beneficial applications can be developed based on the presented work to improve traffic safety, air quality, and public health.
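
The coupling between depth, visibility, and airlight can be illustrated with the classical Koschmieder atmospheric scattering model commonly used for haze synthesis; the sketch below is a generic formulation, not necessarily the exact synthesis procedure used in the paper.

```python
# Minimal sketch of the standard haze model I = J * t + A * (1 - t), with
# transmission t = exp(-beta * depth) and beta derived from visibility via
# the 2% contrast threshold (beta = 3.912 / V).
import numpy as np

def synthesize_hazy(clear_img, depth_m, visibility_m, airlight):
    """clear_img: HxWx3 in [0,1]; depth_m: HxW in metres; airlight: scalar or 3-vector in [0,1]."""
    beta = 3.912 / visibility_m                    # extinction coefficient
    t = np.exp(-beta * depth_m)[..., None]         # transmission map, HxWx1
    return clear_img * t + np.asarray(airlight) * (1.0 - t)
```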

42.Using super-resolution for enhancing visual perception and segmentation performance in veterinary cytology

Authors:Jakub Caputa, Maciej Wielgosz, Daria Łukasik, Paweł Russek, Jakub Grzeszczyk, Michał Karwatowski, Szymon Mazurek, Rafał Frączek, Anna Śmiech, Ernest Jamro, Sebastian Koryciak, Agnieszka Dąbrowska-Boruch, Marcin Pietroń, Kazimierz Wiatr

Abstract: The primary objective of this research was to enhance the quality of semantic segmentation in cytology images by incorporating super-resolution (SR) architectures. An additional contribution was the development of a novel dataset aimed at improving imaging quality in the presence of inaccurate focus. Our experimental results demonstrate that the integration of SR techniques into the segmentation pipeline can lead to a significant improvement of up to 25% in the mean average precision (mAP) segmentation metric. These findings suggest that leveraging SR architectures holds great promise for advancing the state of the art in cytology image analysis.
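
A minimal sketch of the pipeline idea follows, assuming a callable segmentation model and an optional learned SR model; bicubic upscaling stands in for the SR network when none is available. These names and the 2x factor are illustrative, not the paper's configuration.

```python
# Hypothetical sketch of inserting a super-resolution stage before segmentation.
import torch
import torch.nn.functional as F

def segment_with_sr(image, seg_model, sr_model=None, scale=2):
    """image: (1, 3, H, W) tensor in [0, 1]; seg_model and sr_model are assumed callables."""
    if sr_model is not None:
        upscaled = sr_model(image)                 # learned SR, if available
    else:
        upscaled = F.interpolate(image, scale_factor=scale,
                                 mode="bicubic", align_corners=False)
    return seg_model(upscaled)                     # masks predicted at the enhanced resolution
```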

43.Multiverse Transformer: 1st Place Solution for Waymo Open Sim Agents Challenge 2023

Authors:Yu Wang, Tiebiao Zhao, Fan Yi

Abstract: This technical report presents our 1st place solution for the Waymo Open Sim Agents Challenge (WOSAC) 2023. Our proposed MultiVerse Transformer for Agent simulation (MVTA) effectively leverages transformer-based motion prediction approaches and is tailored for closed-loop simulation of agents. In order to produce simulations with a high degree of realism, we design novel training and sampling methods and implement a receding horizon prediction mechanism. In addition, we introduce a variable-length history aggregation method to mitigate the compounding error that can arise during closed-loop autoregressive execution. On the WOSAC leaderboard, our MVTA and its enhanced version MVTE reach realism meta-metric scores of 0.5091 and 0.5168, respectively, outperforming all other methods.
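
A generic sketch of receding-horizon rollout for closed-loop simulation is shown below: the model predicts a short horizon, only the first step(s) are committed, and the executed states are fed back as context. The `predict_horizon` interface and the horizon/commit values are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of a receding-horizon rollout loop for agent simulation.
def receding_horizon_rollout(predict_horizon, initial_history, total_steps,
                             horizon=8, commit=1):
    """predict_horizon(history) -> list of >= `commit` future states (assumed interface)."""
    history = list(initial_history)
    trajectory = []
    while len(trajectory) < total_steps:
        plan = predict_horizon(history)            # predict a full horizon
        step = plan[:commit]                       # execute only the first `commit` steps
        trajectory.extend(step)
        history.extend(step)                       # feed executed states back as context
    return trajectory[:total_steps]
```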

44.LNL+K: Learning with Noisy Labels and Noise Source Distribution Knowledge

Authors:Siqi Wang, Bryan A. Plummer

Abstract: Learning with noisy labels (LNL) is challenging as the model tends to memorize noisy labels, which can lead to overfitting. Many LNL methods detect clean samples by maximizing the similarity between samples in each category, which does not make any assumptions about likely noise sources. However, we often have some knowledge about the potential source(s) of noisy labels. For example, an image mislabeled as a cheetah is more likely a leopard than a hippopotamus due to their visual similarity. Thus, we introduce a new task called Learning with Noisy Labels and noise source distribution Knowledge (LNL+K), which assumes we have some knowledge about likely source(s) of label noise that we can take advantage of. Under this assumption, methods are better equipped to distinguish hard negatives between categories from genuine label noise. In addition, this enables us to explore datasets where the noise may represent the majority of samples, a setting that breaks a critical premise of most methods developed for the LNL task. We explore several baseline LNL+K approaches that integrate noise source knowledge into state-of-the-art LNL methods across three diverse datasets and three types of noise, where we report a 5-15% boost in performance compared with the unadapted methods. Critically, we find that LNL methods do not generalize well in every setting, highlighting the importance of directly exploring our LNL+K task.
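
A toy sketch of how noise-source knowledge might enter a clean-sample filter: a sample is flagged as noisy only when it is more similar to one of its label's known noise-source classes than to the labelled class itself. The prototype-based similarity test below is an illustrative simplification, not the paper's method.

```python
# Hypothetical sketch of a noise-source-aware label filter.
import numpy as np

def is_likely_noisy(feature, label, class_prototypes, noise_sources):
    """class_prototypes: {class: mean feature}; noise_sources: {class: [likely source classes]}."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    own_sim = cos(feature, class_prototypes[label])
    source_sims = [cos(feature, class_prototypes[s]) for s in noise_sources.get(label, [])]
    # Flag as noisy only if some known noise source explains the sample better
    # than the labelled class does (e.g. "cheetah" image closer to "leopard").
    return bool(source_sims) and max(source_sims) > own_sim
```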

45.NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement

Authors:Marcos V. Conde, Javier Vazquez-Corral, Michael S. Brown, Radu Timofte

Abstract: 3D lookup tables (3D LUTs) are a key component for image enhancement. Modern image signal processors (ISPs) have dedicated support for these as part of the camera rendering pipeline. Cameras typically provide multiple options for picture styles, where each style is usually obtained by applying a unique handcrafted 3D LUT. Current approaches for learning and applying 3D LUTs are notably fast, yet not memory-efficient, since multiple 3D LUTs must be stored. For this reason and other implementation limitations, their use on mobile devices is less common. In this work, we propose a Neural Implicit LUT (NILUT), an implicitly defined continuous 3D color transformation parameterized by a neural network. We show that NILUTs are capable of accurately emulating real 3D LUTs. Moreover, a NILUT can be extended to incorporate multiple styles into a single network with the ability to blend styles implicitly. Our novel approach is memory-efficient, controllable, and can complement previous methods, including learned ISPs. Code, models, and dataset are available at: https://github.com/mv-lab/nilut
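
A minimal sketch of the implicit-LUT idea, assuming a small MLP that maps an input RGB value plus a learned style embedding to an output RGB value; the layer sizes and style conditioning below are illustrative, not the paper's architecture.

```python
# Hypothetical sketch of an implicit neural LUT: one small network replaces
# several handcrafted 3D LUTs, selected (and potentially blended) via a style id.
import torch
import torch.nn as nn

class TinyImplicitLUT(nn.Module):
    def __init__(self, n_styles=4, hidden=64, style_dim=8):
        super().__init__()
        self.style_emb = nn.Embedding(n_styles, style_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, rgb, style_idx):
        """rgb: (N, 3) in [0, 1]; style_idx: (N,) long tensor of style ids."""
        z = self.style_emb(style_idx)
        return self.mlp(torch.cat([rgb, z], dim=-1))

# Apply to an image flattened to per-pixel RGB values.
lut = TinyImplicitLUT()
pixels = torch.rand(1024, 3)
styles = torch.zeros(1024, dtype=torch.long)
out = lut(pixels, styles)
```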

46.LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching

Authors:Duy M. H. Nguyen, Hoang Nguyen, Nghiem T. Diep, Tan N. Pham, Tri Cao, Binh T. Nguyen, Paul Swoboda, Nhat Ho, Shadi Albarqouni, Pengtao Xie, Daniel Sonntag, Mathias Niepert

Abstract: Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained deep networks on ImageNet and vision-language foundation models trained on web-scale data are prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed via a combinatorial graph-matching objective; and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, in both in-distribution and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves on previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.
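
A heavily simplified sketch of the matching idea: embeddings of two augmented views of the same batch are paired by solving a linear assignment problem over their pairwise similarities. The paper's second-order (combinatorial) graph-matching objective and its gradient estimation for the black-box solver are not reproduced here.

```python
# Hypothetical, simplified sketch: match two augmented views of a batch by
# maximizing total pairwise similarity with a linear assignment solver.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_views(emb_a, emb_b):
    """emb_a, emb_b: (N, D) L2-normalized embeddings of two views of the same batch."""
    sim = emb_a @ emb_b.T                          # pairwise cosine similarities
    rows, cols = linear_sum_assignment(-sim)       # maximize total similarity
    return rows, cols, sim[rows, cols].mean()      # matched indices and average agreement

emb_a = np.random.randn(16, 128); emb_a /= np.linalg.norm(emb_a, axis=1, keepdims=True)
emb_b = np.random.randn(16, 128); emb_b /= np.linalg.norm(emb_b, axis=1, keepdims=True)
rows, cols, score = match_views(emb_a, emb_b)
```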