arXiv daily

Computer Vision and Pattern Recognition (cs.CV)

Thu, 04 May 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Thu, 03 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Thu, 22 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Tue, 13 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Thu, 01 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Wed, 24 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.LayoutDM: Transformer-based Diffusion Model for Layout Generation

Authors:Shang Chai, Liansheng Zhuang, Fengying Yan

Abstract: Automatic layout generation that can synthesize high-quality layouts is an important tool for graphic design in many applications. Though existing methods based on generative models such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) have progressed, they still leave much room for improving the quality and diversity of the results. Inspired by the recent success of diffusion models in generating high-quality images, this paper explores their potential for conditional layout generation and proposes Transformer-based Layout Diffusion Model (LayoutDM) by instantiating the conditional denoising diffusion probabilistic model (DDPM) with a purely transformer-based architecture. Instead of using convolutional neural networks, a transformer-based conditional Layout Denoiser is proposed to learn the reverse diffusion process to generate samples from noised layout data. Benefitting from both transformer and DDPM, our LayoutDM is of desired properties such as high-quality generation, strong sample diversity, faithful distribution coverage, and stationary training in comparison to GANs and VAEs. Quantitative and qualitative experimental results show that our method outperforms state-of-the-art generative models in terms of quality and diversity.

2.Unsupervised Domain Adaptation for Neuron Membrane Segmentation based on Structural Features

Authors:Yuxiang An, Dongnan Liu, Weidong Cai

Abstract: AI-enhanced segmentation of neuronal boundaries in electron microscopy (EM) images is crucial for automatic and accurate neuroinformatics studies. To enhance the limited generalization ability of typical deep learning frameworks for medical image analysis, unsupervised domain adaptation (UDA) methods have been applied. In this work, we propose to improve the performance of UDA methods on cross-domain neuron membrane segmentation in EM images. First, we designed a feature weight module considering the structural features during adaptation. Second, we introduced a structural feature-based super-resolution approach to alleviating the domain gap by adjusting the cross-domain image resolutions. Third, we proposed an orthogonal decomposition module to facilitate the extraction of domain-invariant features. Extensive experiments on two domain adaptive membrane segmentation applications have indicated the effectiveness of our method.

3.High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

Authors:Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, Yong Liu

Abstract: Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.

4.Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation

Authors:Renshen Wang, Yasuhisa Fujii, Alessandro Bissacco

Abstract: Text reading order is a crucial aspect in the output of an OCR engine, with a large impact on downstream tasks. Its difficulty lies in the large variation of domain specific layout structures, and is further exacerbated by real-world image degradations such as perspective distortions. We propose a lightweight, scalable and generalizable approach to identify text reading order with a multi-modal, multi-task graph convolutional network (GCN) running on a sparse layout based graph. Predictions from the model provide hints of bidimensional relations among text lines and layout region structures, upon which a post-processing cluster-and-sort algorithm generates an ordered sequence of all the text lines. The model is language-agnostic and runs effectively across multi-language datasets that contain various types of images taken in uncontrolled conditions, and it is small enough to be deployed on virtually any platform including mobile devices.

5.Prompt-ICM: A Unified Framework towards Image Coding for Machines with Task-driven Prompts

Authors:Ruoyu Feng, Jinming Liu, Xin Jin, Xiaohan Pan, Heming Sun, Zhibo Chen

Abstract: Image coding for machines (ICM) aims to compress images to support downstream AI analysis instead of human perception. For ICM, developing a unified codec to reduce information redundancy while empowering the compressed features to support various vision tasks is very important, which inevitably faces two core challenges: 1) How should the compression strategy be adjusted based on the downstream tasks? 2) How to well adapt the compressed features to different downstream tasks? Inspired by recent advances in transferring large-scale pre-trained models to downstream tasks via prompting, in this work, we explore a new ICM framework, termed Prompt-ICM. To address both challenges by carefully learning task-driven prompts to coordinate well the compression process and downstream analysis. Specifically, our method is composed of two core designs: a) compression prompts, which are implemented as importance maps predicted by an information selector, and used to achieve different content-weighted bit allocations during compression according to different downstream tasks; b) task-adaptive prompts, which are instantiated as a few learnable parameters specifically for tuning compressed features for the specific intelligent task. Extensive experiments demonstrate that with a single feature codec and a few extra parameters, our proposed framework could efficiently support different kinds of intelligent tasks with much higher coding efficiency.

6.How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning

Authors:Vittorio Pippi, Silvia Cascianelli, Christopher Kermorvant, Rita Cucchiara

Abstract: Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets. Nonetheless, those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is very relevant for valuable but small collections of documents preserved in historical archives, for which obtaining sufficient annotated training data is costly or, in some cases, unfeasible. To overcome this challenge, a possible solution is to pretrain HTR models on large datasets and then fine-tune them on small single-author collections. In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model. Through extensive experimental analysis, also considering the amount of fine-tuning lines, we give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as little as five real fine-tuning lines.

7.Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Authors:Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang, Ying Tai, Yong Liu

Abstract: Multimodal-driven talking face generation refers to animating a portrait with the given pose, expression, and gaze transferred from the driving image and video, or estimated from the text and audio. However, existing methods ignore the potential of text modal, and their generators mainly follow the source-oriented feature rearrange paradigm coupled with unstable GAN frameworks. In this work, we first represent the emotion in the text prompt, which could inherit rich semantics from the CLIP, allowing flexible and generalized emotion control. We further reorganize these tasks as the target-oriented texture transfer and adopt the Diffusion Models. More specifically, given a textured face as the source and the rendered face projected from the desired 3DMM coefficients as the target, our proposed Texture-Geometry-aware Diffusion Model decomposes the complex transfer problem into multi-conditional denoising process, where a Texture Attention-based module accurately models the correspondences between appearance and geometry cues contained in source and target conditions, and incorporate extra implicit information for high-fidelity talking face generation. Additionally, TGDM can be gracefully tailored for face swapping. We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes. Extensive experiments demonstrate the superiority of our method.

8.Boundary-aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval

Authors:Tan Pan, Furong Xu, Xudong Yang, Sifeng He, Chen Jiang, Qingpei Guo, Feng Qian Xiaobo Zhang, Yuan Cheng, Lei Yang, Wei Chu

Abstract: Image retrieval plays an important role in the Internet world. Usually, the core parts of mainstream visual retrieval systems include an online service of the embedding model and a large-scale vector database. For traditional model upgrades, the old model will not be replaced by the new one until the embeddings of all the images in the database are re-computed by the new model, which takes days or weeks for a large amount of data. Recently, backward-compatible training (BCT) enables the new model to be immediately deployed online by making the new embeddings directly comparable to the old ones. For BCT, improving the compatibility of two models with less negative impact on retrieval performance is the key challenge. In this paper, we introduce AdvBCT, an Adversarial Backward-Compatible Training method with an elastic boundary constraint that takes both compatibility and discrimination into consideration. We first employ adversarial learning to minimize the distribution disparity between embeddings of the new model and the old model. Meanwhile, we add an elastic boundary constraint during training to improve compatibility and discrimination efficiently. Extensive experiments on GLDv2, Revisited Oxford (ROxford), and Revisited Paris (RParis) demonstrate that our method outperforms other BCT methods on both compatibility and discrimination. The implementation of AdvBCT will be publicly available at https://github.com/Ashespt/AdvBCT.

9.Semantic-aware Generation of Multi-view Portrait Drawings

Authors:Biao Ma, Fei Gao, Chang Jiang, Nannan Wang, Gang Xu

Abstract: Neural radiance fields (NeRF) based methods have shown amazing performance in synthesizing 3D-consistent photographic images, but fail to generate multi-view portrait drawings. The key is that the basic assumption of these methods -- a surface point is consistent when rendered from different views -- doesn't hold for drawings. In a portrait drawing, the appearance of a facial point may changes when viewed from different angles. Besides, portrait drawings usually present little 3D information and suffer from insufficient training data. To combat this challenge, in this paper, we propose a Semantic-Aware GEnerator (SAGE) for synthesizing multi-view portrait drawings. Our motivation is that facial semantic labels are view-consistent and correlate with drawing techniques. We therefore propose to collaboratively synthesize multi-view semantic maps and the corresponding portrait drawings. To facilitate training, we design a semantic-aware domain translator, which generates portrait drawings based on features of photographic faces. In addition, use data augmentation via synthesis to mitigate collapsed results. We apply SAGE to synthesize multi-view portrait drawings in diverse artistic styles. Experimental results show that SAGE achieves significantly superior or highly competitive performance, compared to existing 3D-aware image synthesis methods. The codes are available at https://github.com/AiArt-HDU/SAGE.

10.Edge-aware Consistent Stereo Video Depth Estimation

Authors:Elena Kosheleva, Sunil Jaiswal, Faranak Shamsafar, Noshaba Cheema, Klaus Illgner-Fehns, Philipp Slusallek

Abstract: Video depth estimation is crucial in various applications, such as scene reconstruction and augmented reality. In contrast to the naive method of estimating depths from images, a more sophisticated approach uses temporal information, thereby eliminating flickering and geometrical inconsistencies. We propose a consistent method for dense video depth estimation; however, unlike the existing monocular methods, ours relates to stereo videos. This technique overcomes the limitations arising from the monocular input. As a benefit of using stereo inputs, a left-right consistency loss is introduced to improve the performance. Besides, we use SLAM-based camera pose estimation in the process. To address the problem of depth blurriness during test-time training (TTT), we present an edge-preserving loss function that improves the visibility of fine details while preserving geometrical consistency. We show that our edge-aware stereo video model can accurately estimate the dense depth maps.

11.Point2Tree(P2T) -- framework for parameter tuning of semantic and instance segmentation used with mobile laser scanning data in coniferous forest

Authors:Maciej Wielgosz, Stefano Puliti, Phil Wilkes, Rasmus Astrup

Abstract: This article introduces Point2Tree, a novel framework that incorporates a three-stage process involving semantic segmentation, instance segmentation, optimization analysis of hyperparemeters importance. It introduces a comprehensive and modular approach to processing laser points clouds in Forestry. We tested it on two independent datasets. The first area was located in an actively managed boreal coniferous dominated forest in V{\aa}ler, Norway, 16 circular plots of 400 square meters were selected to cover a range of forest conditions in terms of species composition and stand density. We trained a model based on Pointnet++ architecture which achieves 0.92 F1-score in semantic segmentation. As a second step in our pipeline we used graph-based approach for instance segmentation which reached F1-score approx. 0.6. The optimization allowed to further boost the performance of the pipeline by approx. 4 \% points.

12.LatentAugment: Dynamically Optimized Latent Probabilities of Data Augmentation

Authors:Koichi Kuriyama

Abstract: Although data augmentation is a powerful technique for improving the performance of image classification tasks, it is difficult to identify the best augmentation policy. The optimal augmentation policy, which is the latent variable, cannot be directly observed. To address this problem, this study proposes $\textit{LatentAugment}$, which estimates the latent probability of optimal augmentation. The proposed method is appealing in that it can dynamically optimize the augmentation strategies for each input and model parameter in learning iterations. Theoretical analysis shows that LatentAugment is a general model that includes other augmentation methods as special cases, and it is simple and computationally efficient in comparison with existing augmentation methods. Experimental results show that the proposed LatentAugment has higher test accuracy than previous augmentation methods on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.

13.Modelling Spatio-Temporal Interactions for Compositional Action Recognition

Authors:Ramanathan Rajendiran, Debaditya Roy, Basura Fernando

Abstract: Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed. Humans can abstract away the action from the appearance of the objects and their context which is referred to as compositionality of actions. Compositional action recognition deals with imparting human-like compositional generalization abilities to action-recognition models. In this regard, extracting the interactions between humans and objects forms the basis of compositional understanding. These interactions are not affected by the appearance biases of the objects or the context. But the context provides additional cues about the interactions between things and stuff. Hence we need to infuse context into the human-object interactions for compositional action recognition. To this end, we first design a spatial-temporal interaction encoder that captures the human-object (things) interactions. The encoder learns the spatio-temporal interaction tokens disentangled from the background context. The interaction tokens are then infused with contextual information from the video tokens to model the interactions between things and stuff. The final context-infused spatio-temporal interaction tokens are used for compositional action recognition. We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset where we obtain a new state-of-the-art result of 83.8% top-1 accuracy outperforming recent important object-centric methods by a significant margin. Our approach of explicit human-object-stuff interaction modeling is effective even for standard action recognition datasets such as Something-Something-V2 and Epic-Kitchens-100 where we obtain comparable or better performance than state-of-the-art.

14.Caption Anything: Interactive Image Description with Diverse Multimodal Controls

Authors:Teng Wang, Jinrui Zhang, Junjie Fei, Yixiao Ge, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao, Ying Shan, Feng Zheng

Abstract: Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, $\textit{e.g.}$, looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodel controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.

15.Semi-supervised Domain Adaptation via Prototype-based Multi-level Learning

Authors:Xinyang Huang, Chuang Zhu, Wenkai Chen

Abstract: In semi-supervised domain adaptation (SSDA), a few labeled target samples of each class help the model to transfer knowledge representation from the fully labeled source domain to the target domain. Many existing methods ignore the benefits of making full use of the labeled target samples from multi-level. To make better use of this additional data, we propose a novel Prototype-based Multi-level Learning (ProML) framework to better tap the potential of labeled target samples. To achieve intra-domain adaptation, we first introduce a pseudo-label aggregation based on the intra-domain optimal transport to help the model align the feature distribution of unlabeled target samples and the prototype. At the inter-domain level, we propose a cross-domain alignment loss to help the model use the target prototype for cross-domain knowledge transfer. We further propose a dual consistency based on prototype similarity and linear classifier to promote discriminative learning of compact target feature representation at the batch level. Extensive experiments on three datasets, including DomainNet, VisDA2017, and Office-Home demonstrate that our proposed method achieves state-of-the-art performance in SSDA.

16.In-situ Anomaly Detection in Additive Manufacturing with Graph Neural Networks

Authors:Sebastian Larsen, Paul A. Hooper

Abstract: Transforming a design into a high-quality product is a challenge in metal additive manufacturing due to rare events which can cause defects to form. Detecting these events in-situ could, however, reduce inspection costs, enable corrective action, and is the first step towards a future of tailored material properties. In this study a model is trained on laser input information to predict nominal laser melting conditions. An anomaly score is then calculated by taking the difference between the predictions and new observations. The model is evaluated on a dataset with known defects achieving an F1 score of 0.821. This study shows that anomaly detection methods are an important tool in developing robust defect detection methods.

17.Avatar Knowledge Distillation: Self-ensemble Teacher Paradigm with Uncertainty

Authors:Yuan Zhang, Weihua Chen, Yichen Lu, Tao Huang, Xiuyu Sun, Jian Cao

Abstract: Knowledge distillation is an effective paradigm for boosting the performance of pocket-size model, especially when multiple teacher models are available, the student would break the upper limit again. However, it is not economical to train diverse teacher models for the disposable distillation. In this paper, we introduce a new concept dubbed Avatars for distillation, which are the inference ensemble models derived from the teacher. Concretely, (1) For each iteration of distillation training, various Avatars are generated by a perturbation transformation. We validate that Avatars own higher upper limit of working capacity and teaching ability, aiding the student model in learning diverse and receptive knowledge perspectives from the teacher model. (2) During the distillation, we propose an uncertainty-aware factor from the variance of statistical differences between the vanilla teacher and Avatars, to adjust Avatars' contribution on knowledge transfer adaptively. Avatar Knowledge Distillation AKD is fundamentally different from existing methods and refines with the innovative view of unequal training. Comprehensive experiments demonstrate the effectiveness of our Avatars mechanism, which polishes up the state-of-the-art distillation methods for dense prediction without more extra computational cost. The AKD brings at most 0.7 AP gains on COCO 2017 for Object Detection and 1.83 mIoU gains on Cityscapes for Semantic Segmentation, respectively.

18.Weakly-supervised Micro- and Macro-expression Spotting Based on Multi-level Consistency

Authors:Wang-Wang Yu, Kai-Fu Yang, Hong-Mei Yan, Yong-Jie Li

Abstract: Most micro- and macro-expression spotting methods in untrimmed videos suffer from the burden of video-wise collection and frame-wise annotation. Weakly-supervised expression spotting (WES) based on video-level labels can potentially mitigate the complexity of frame-level annotation while achieving fine-grained frame-level spotting. However, we argue that existing weakly-supervised methods are based on multiple instance learning (MIL) involving inter-modality, inter-sample, and inter-task gaps. The inter-sample gap is primarily from the sample distribution and duration. Therefore, we propose a novel and simple WES framework, MC-WES, using multi-consistency collaborative mechanisms that include modal-level saliency, video-level distribution, label-level duration and segment-level feature consistency strategies to implement fine frame-level spotting with only video-level labels to alleviate the above gaps and merge prior knowledge. The modal-level saliency consistency strategy focuses on capturing key correlations between raw images and optical flow. The video-level distribution consistency strategy utilizes the difference of sparsity in temporal distribution. The label-level duration consistency strategy exploits the difference in the duration of facial muscles. The segment-level feature consistency strategy emphasizes that features under the same labels maintain similarity. Experimental results on two challenging datasets -- CAS(ME)$^2$ and SAMM-LV -- demonstrate that MC-WES is comparable to state-of-the-art fully-supervised methods.

19.Incremental 3D Semantic Scene Graph Prediction from RGB Sequences

Authors:Shun-Cheng Wu, Keisuke Tateno, Nassir Navab, Federico Tombari

Abstract: D semantic scene graphs are a powerful holistic representation as they describe the individual objects and depict the relation between them. They are compact high-level graphs that enable many tasks requiring scene reasoning. In real-world settings, existing 3D estimation methods produce robust predictions that mostly rely on dense inputs. In this work, we propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence. Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network. The proposed pipeline simultaneously reconstructs a sparse point map and fuses entity estimation from the input images. The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities. Extensive experiments on the 3RScan dataset show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches.

20.Age-Invariant Face Embedding using the Wasserstein Distance

Authors:Eran Dahan, Yosi Keller

Abstract: In this work, we study face verification in datasets where images of the same individuals exhibit significant age differences. This poses a major challenge for current face recognition and verification techniques. To address this issue, we propose a novel approach that utilizes multitask learning and a Wasserstein distance discriminator to disentangle age and identity embeddings of facial images. Our approach employs multitask learning with a Wasserstein distance discriminator that minimizes the mutual information between the age and identity embeddings by minimizing the Jensen-Shannon divergence. This improves the encoding of age and identity information in face images and enhances the performance of face verification in age-variant datasets. We evaluate the effectiveness of our approach using multiple age-variant face datasets and demonstrate its superiority over state-of-the-art methods in terms of face verification accuracy.

21.Radiance Field Gradient Scaling for Unbiased Near-Camera Training

Authors:Julien Philip, Valentin Deschaintre

Abstract: NeRF acquisition typically requires careful choice of near planes for the different cameras or suffers from background collapse, creating floating artifacts on the edges of the captured scene. The key insight of this work is that background collapse is caused by a higher density of samples in regions near cameras. As a result of this sampling bias, near-camera volumes receive significantly more gradients, leading to incorrect density buildup. We propose a gradient scaling approach to counter-balance this bias, removing the need for near planes, while preventing background collapse. Our method can be implemented in a few lines, does not induce any significant overhead, and is compatible with most NeRF implementations.

22.Multi-Modality Deep Network for JPEG Artifacts Reduction

Authors:Xuhao Jiang, Weimin Tan, Qing Lin, Chenxi Ma, Bo Yan, Liquan Shen

Abstract: In recent years, many convolutional neural network-based models are designed for JPEG artifacts reduction, and have achieved notable progress. However, few methods are suitable for extreme low-bitrate image compression artifacts reduction. The main challenge is that the highly compressed image loses too much information, resulting in reconstructing high-quality image difficultly. To address this issue, we propose a multimodal fusion learning method for text-guided JPEG artifacts reduction, in which the corresponding text description not only provides the potential prior information of the highly compressed image, but also serves as supplementary information to assist in image deblocking. We fuse image features and text semantic features from the global and local perspectives respectively, and design a contrastive loss built upon contrastive learning to produce visually pleasing results. Extensive experiments, including a user study, prove that our method can obtain better deblocking results compared to the state-of-the-art methods.

23.Towards End-to-End Semi-Supervised Table Detection with Deformable Transformer

Authors:Tahira Shehzadi, Khurram Azeem Hashmi, Didier Stricker, Marcus Liwicki, Muhammad Zeshan Afzal

Abstract: Table detection is the task of classifying and localizing table objects within document images. With the recent development in deep learning methods, we observe remarkable success in table detection. However, a significant amount of labeled data is required to train these models effectively. Many semi-supervised approaches are introduced to mitigate the need for a substantial amount of label data. These approaches use CNN-based detectors that rely on anchor proposals and post-processing stages such as NMS. To tackle these limitations, this paper presents a novel end-to-end semi-supervised table detection method that employs the deformable transformer for detecting table objects. We evaluate our semi-supervised method on PubLayNet, DocBank, ICADR-19 and TableBank datasets, and it achieves superior performance compared to previous methods. It outperforms the fully supervised method (Deformable transformer) by +3.4 points on 10\% labels of TableBank-both dataset and the previous CNN-based semi-supervised approach (Soft Teacher) by +1.8 points on 10\% labels of PubLayNet dataset. We hope this work opens new possibilities towards semi-supervised and unsupervised table detection methods.

24.MTLSegFormer: Multi-task Learning with Transformers for Semantic Segmentation in Precision Agriculture

Authors:Diogo Nunes Goncalves, Jose Marcato Junior, Pedro Zamboni, Hemerson Pistori, Jonathan Li, Keiller Nogueira, Wesley Nunes Goncalves

Abstract: Multi-task learning has proven to be effective in improving the performance of correlated tasks. Most of the existing methods use a backbone to extract initial features with independent branches for each task, and the exchange of information between the branches usually occurs through the concatenation or sum of the feature maps of the branches. However, this type of information exchange does not directly consider the local characteristics of the image nor the level of importance or correlation between the tasks. In this paper, we propose a semantic segmentation method, MTLSegFormer, which combines multi-task learning and attention mechanisms. After the backbone feature extraction, two feature maps are learned for each task. The first map is proposed to learn features related to its task, while the second map is obtained by applying learned visual attention to locally re-weigh the feature maps of the other tasks. In this way, weights are assigned to local regions of the image of other tasks that have greater importance for the specific task. Finally, the two maps are combined and used to solve a task. We tested the performance in two challenging problems with correlated tasks and observed a significant improvement in accuracy, mainly in tasks with high dependence on the others.

25.MEDIC: A Multimodal Empathy Dataset in Counseling

Authors:Zhou'an_Zhu, Xin Li, Jicai Pan, Yufei Xiao, Yanan Chang, Feiyi Zheng, Shangfei Wang

Abstract: Although empathic interaction between counselor and client is fundamental to success in the psychotherapeutic process, there are currently few datasets to aid a computational approach to empathy understanding. In this paper, we construct a multimodal empathy dataset collected from face-to-face psychological counseling sessions. The dataset consists of 771 video clips. We also propose three labels (i.e., expression of experience, emotional reaction, and cognitive reaction) to describe the degree of empathy between counselors and their clients. Expression of experience describes whether the client has expressed experiences that can trigger empathy, and emotional and cognitive reactions indicate the counselor's empathic reactions. As an elementary assessment of the usability of the constructed multimodal empathy dataset, an interrater reliability analysis of annotators' subjective evaluations for video clips is conducted using the intraclass correlation coefficient and Fleiss' Kappa. Results prove that our data annotation is reliable. Furthermore, we conduct empathy prediction using three typical methods, including the tensor fusion network, the sentimental words aware fusion network, and a simple concatenation model. The experimental results show that empathy can be well predicted on our dataset. Our dataset is available for research purposes.

26.Neuromorphic Sensing for Yawn Detection in Driver Drowsiness

Authors:Paul Kielty, Mehdi Sefidgar Dilmaghani, Cian Ryan, Joe Lemley, Peter Corcoran

Abstract: Driver monitoring systems (DMS) are a key component of vehicular safety and essential for the transition from semiautonomous to fully autonomous driving. A key task for DMS is to ascertain the cognitive state of a driver and to determine their level of tiredness. Neuromorphic vision systems, based on event camera technology, provide advanced sensing of facial characteristics, in particular the behavior of a driver's eyes. This research explores the potential to extend neuromorphic sensing techniques to analyze the entire facial region, detecting yawning behaviors that give a complimentary indicator of tiredness. A neuromorphic dataset is constructed from 952 video clips (481 yawns, 471 not-yawns) captured with an RGB color camera, with 37 subjects. A total of 95200 neuromorphic image frames are generated from this video data using a video-to-event converter. From these data 21 subjects were selected to provide a training dataset, 8 subjects were used for validation data, and the remaining 8 subjects were reserved for an "unseen" test dataset. An additional 12300 frames were generated from event simulations of a public dataset to test against other methods. A CNN with self-attention and a recurrent head was designed, trained, and tested with these data. Respective precision and recall scores of 95.9 percent and 94.7 percent were achieved on our test set, and 89.9 percent and 91 percent on the simulated public test set, demonstrating the feasibility to add yawn detection as a sensing component of a neuromorphic DMS.

27.APR: Online Distant Point Cloud Registration Through Aggregated Point Cloud Reconstruction

Authors:Quan Liu, Yunsong Zhou, Hongzi Zhu, Shan Chang, Minyi Guo

Abstract: For many driving safety applications, it is of great importance to accurately register LiDAR point clouds generated on distant moving vehicles. However, such point clouds have extremely different point density and sensor perspective on the same object, making registration on such point clouds very hard. In this paper, we propose a novel feature extraction framework, called APR, for online distant point cloud registration. Specifically, APR leverages an autoencoder design, where the autoencoder reconstructs a denser aggregated point cloud with several frames instead of the original single input point cloud. Our design forces the encoder to extract features with rich local geometry information based on one single input point cloud. Such features are then used for online distant point cloud registration. We conduct extensive experiments against state-of-the-art (SOTA) feature extractors on KITTI and nuScenes datasets. Results show that APR outperforms all other extractors by a large margin, increasing average registration recall of SOTA extractors by 7.1% on LoKITTI and 4.6% on LoNuScenes.

28.Additive Class Distinction Maps using Branched-GANs

Authors:Elnatan Kadar, Jonathan Brokman, Guy Gilboa

Abstract: We present a new model, training procedure and architecture to create precise maps of distinction between two classes of images. The objective is to comprehend, in pixel-wise resolution, the unique characteristics of a class. These maps can facilitate self-supervised segmentation and objectdetection in addition to new capabilities in explainable AI (XAI). Our proposed architecture is based on image decomposition, where the output is the sum of multiple generative networks (branched-GANs). The distinction between classes is isolated in a dedicated branch. This approach allows clear, precise and interpretable visualization of the unique characteristics of each class. We show how our generic method can be used in several modalities for various tasks, such as MRI brain tumor extraction, isolating cars in aerial photography and obtaining feminine and masculine face features. This is a preliminary report of our initial findings and results.

29.Aligning Bird-Eye View Representation of Point Cloud Sequences using Scene Flow

Authors:Minh-Quan Dao, Vincent Frémont, Elwan Héry

Abstract: Low-resolution point clouds are challenging for object detection methods due to their sparsity. Densifying the present point cloud by concatenating it with its predecessors is a popular solution to this challenge. Such concatenation is possible thanks to the removal of ego vehicle motion using its odometry. This method is called Ego Motion Compensation (EMC). Thanks to the added points, EMC significantly improves the performance of single-frame detectors. However, it suffers from the shadow effect that manifests in dynamic objects' points scattering along their trajectories. This effect results in a misalignment between feature maps and objects' locations, thus limiting performance improvement to stationary and slow-moving objects only. Scene flow allows aligning point clouds in 3D space, thus naturally resolving the misalignment in feature spaces. By observing that scene flow computation shares several components with 3D object detection pipelines, we develop a plug-in module that enables single-frame detectors to compute scene flow to rectify their Bird-Eye View representation. Experiments on the NuScenes dataset show that our module leads to a significant increase (up to 16%) in the Average Precision of large vehicles, which interestingly demonstrates the most severe shadow effect. The code is published at https://github.com/quan-dao/pc-corrector.

30.UPDExplainer: an Interpretable Transformer-based Framework for Urban Physical Disorder Detection Using Street View Imagery

Authors:Chuanbo Hu, Shan Jia, Fan Zhang, Changjiang Xiao, Mindi Ruan, Jacob Thrasher, Xin Li

Abstract: Urban Physical Disorder (UPD), such as old or abandoned buildings, broken sidewalks, litter, and graffiti, has a negative impact on residents' quality of life. They can also increase crime rates, cause social disorder, and pose a public health risk. Currently, there is a lack of efficient and reliable methods for detecting and understanding UPD. To bridge this gap, we propose UPDExplainer, an interpretable transformer-based framework for UPD detection. We first develop a UPD detection model based on the Swin Transformer architecture, which leverages readily accessible street view images to learn discriminative representations. In order to provide clear and comprehensible evidence and analysis, we subsequently introduce a UPD factor identification and ranking module that combines visual explanation maps with semantic segmentation maps. This novel integrated approach enables us to identify the exact objects within street view images that are responsible for physical disorders and gain insights into the underlying causes. Experimental results on the re-annotated Place Pulse 2.0 dataset demonstrate promising detection performance of the proposed method, with an accuracy of 79.9%. For a comprehensive evaluation of the method's ranking performance, we report the mean Average Precision (mAP), R-Precision (RPrec), and Normalized Discounted Cumulative Gain (NDCG), with success rates of 75.51%, 80.61%, and 82.58%, respectively. We also present a case study of detecting and ranking physical disorders in the southern region of downtown Los Angeles, California, to demonstrate the practicality and effectiveness of our framework.

31.Forward-Forward Contrastive Learning

Authors:Md. Atik Ahamed, Jin Chen, Abdullah-Al-Zubaer Imran

Abstract: Medical image classification is one of the most important tasks for computer-aided diagnosis. Deep learning models, particularly convolutional neural networks, have been successfully used for disease classification from medical images, facilitated by automated feature learning. However, the diverse imaging modalities and clinical pathology make it challenging to construct generalized and robust classifications. Towards improving the model performance, we propose a novel pretraining approach, namely Forward Forward Contrastive Learning (FFCL), which leverages the Forward-Forward Algorithm in a contrastive learning framework--both locally and globally. Our experimental results on the chest X-ray dataset indicate that the proposed FFCL achieves superior performance (3.69% accuracy over ImageNet pretrained ResNet-18) over existing pretraining models in the pneumonia classification task. Moreover, extensive ablation experiments support the particular local and global contrastive pretraining design in FFCL.

32.Image Captioners Sometimes Tell More Than Images They See

Authors:Honori Udo, Takafumi Koshinaka

Abstract: Image captioning, a.k.a. "image-to-text," which generates descriptive text from given images, has been rapidly developing throughout the era of deep learning. To what extent is the information in the original image preserved in the descriptive text generated by an image captioner? To answer that question, we have performed experiments involving the classification of images from descriptive text alone, without referring to the images at all, and compared results with those from standard image-based classifiers. We have evaluate several image captioning models with respect to a disaster image classification task, CrisisNLP, and show that descriptive text classifiers can sometimes achieve higher accuracy than standard image-based classifiers. Further, we show that fusing an image-based classifier with a descriptive text classifier can provide improvement in accuracy.

33.FUSegNet: A Deep Convolutional Neural Network for Foot Ulcer Segmentation

Authors:Mrinal Kanti Dhar, Taiyu Zhang, Yash Patel, Zeyun Yu

Abstract: This paper presents FUSegNet, a new model for foot ulcer segmentation in diabetes patients, which uses the pre-trained EfficientNet-b7 as a backbone to address the issue of limited training samples. A modified spatial and channel squeeze-and-excitation (scSE) module called parallel scSE or P-scSE is proposed that combines additive and max-out scSE. A new arrangement is introduced for the module by fusing it in the middle of each decoder stage. As the top decoder stage carries a limited number of feature maps, max-out scSE is bypassed there to form a shorted P-scSE. A set of augmentations, comprising geometric, morphological, and intensity-based augmentations, is applied before feeding the data into the network. The proposed model is first evaluated on a publicly available chronic wound dataset where it achieves a data-based dice score of 92.70%, which is the highest score among the reported approaches. The model outperforms other scSE-based UNet models in terms of Pratt's figure of merits (PFOM) scores in most categories, which evaluates the accuracy of edge localization. The model is then tested in the MICCAI 2021 FUSeg challenge, where a variation of FUSegNet called x-FUSegNet is submitted. The x-FUSegNet model, which takes the average of outputs obtained by FUSegNet using 5-fold cross-validation, achieves a dice score of 89.23%, placing it at the top of the FUSeg Challenge leaderboard. The source code for the model is available on https://github.com/mrinal054/FUSegNet.

34.Adversarially-Guided Portrait Matting

Authors:Sergej Chicherin, Karen Efremyan

Abstract: We present a method for generating alpha mattes using a limited data source. We pretrain a novel transformerbased model (StyleMatte) on portrait datasets. We utilize this model to provide image-mask pairs for the StyleGAN3- based network (StyleMatteGAN). This network is trained unsupervisedly and generates previously unseen imagemask training pairs that are fed back to StyleMatte. We demonstrate that the performance of the matte pulling network improves during this cycle and obtains top results on the used datasets. Furthermore, StyleMatteGAN provides high-resolution, privacy-preserving portraits with alpha mattes, making it suitable for various image composition tasks. Our code is available at https://github.com/chroneus/stylematte

35.OSDaR23: Open Sensor Data for Rail 2023

Authors:Rustam Tagiew, Martin Köppel, Karsten Schwalbe, Patrick Denzler, Philipp Neumaier, Tobias Klockau, Martin Boekhoff, Pavel Klasek, Roman Tilly

Abstract: For driverless train operation on mainline railways, several tasks need to be implemented by technical systems. One of the most challenging tasks is to monitor the train's driveway and its surroundings for potential obstacles due to long braking distances. Machine learning algorithms can be used to analyze data from vision sensors such as infrared (IR) and visual (RGB) cameras, lidars, and radars to detect objects. Such algorithms require large amounts of annotated data from objects in the rail environment that may pose potential obstacles, as well as rail-specific objects such as tracks or catenary poles, as training data. However, only very few datasets are publicly available and these available datasets typically involve only a limited number of sensors. Datasets and trained models from other domains, such as automotive, are useful but insufficient for object detection in the railway context. Therefore, this publication presents OSDaR23, a multi-sensor dataset of 21 sequences captured in Hamburg, Germany, in September 2021. The sensor setup consisted of multiple calibrated and synchronized IR/RGB cameras, lidars, a radar, and position and acceleration sensors front-mounted on a railway vehicle. In addition to raw data, the dataset contains 204091 polyline, polygonal, rectangle and cuboid annotations for 20 different object classes. This dataset can also be used for tasks going beyond collision prediction, which are listed in this paper.

36.Evaluating Post-hoc Interpretability with Intrinsic Interpretability

Authors:José Pereira Amorim, Pedro Henriques Abreu, João Santos, Henning Müller

Abstract: Despite Convolutional Neural Networks having reached human-level performance in some medical tasks, their clinical use has been hindered by their lack of interpretability. Two major interpretability strategies have been proposed to tackle this problem: post-hoc methods and intrinsic methods. Although there are several post-hoc methods to interpret DL models, there is significant variation between the explanations provided by each method, and it a difficult to validate them due to the lack of ground-truth. To address this challenge, we adapted the intrinsical interpretable ProtoPNet for the context of histopathology imaging and compared the attribution maps produced by it and the saliency maps made by post-hoc methods. To evaluate the similarity between saliency map methods and attribution maps we adapted 10 saliency metrics from the saliency model literature, and used the breast cancer metastases detection dataset PatchCamelyon with 327,680 patches of histopathological images of sentinel lymph node sections to validate the proposed approach. Overall, SmoothGrad and Occlusion were found to have a statistically bigger overlap with ProtoPNet while Deconvolution and Lime have been found to have the least.

37.NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads

Authors:Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, Matthias Nießner

Abstract: We focus on reconstructing high-fidelity radiance fields of human heads, capturing their animations over time, and synthesizing re-renderings from novel viewpoints at arbitrary time steps. To this end, we propose a new multi-view capture setup composed of 16 calibrated machine vision cameras that record time-synchronized images at 7.1 MP resolution and 73 frames per second. With our setup, we collect a new dataset of over 4700 high-resolution, high-framerate sequences of more than 220 human heads, from which we introduce a new human head reconstruction benchmark. The recorded sequences cover a wide range of facial dynamics, including head motions, natural expressions, emotions, and spoken language. In order to reconstruct high-fidelity human heads, we propose Dynamic Neural Radiance Fields using Hash Ensembles (NeRSemble). We represent scene dynamics by combining a deformation field and an ensemble of 3D multi-resolution hash encodings. The deformation field allows for precise modeling of simple scene movements, while the ensemble of hash encodings helps to represent complex dynamics. As a result, we obtain radiance field representations of human heads that capture motion over time and facilitate re-rendering of arbitrary novel viewpoints. In a series of experiments, we explore the design choices of our method and demonstrate that our approach outperforms state-of-the-art dynamic radiance field approaches by a significant margin.

38.Contrastive Mean Teacher for Domain Adaptive Object Detectors

Authors:Shengcao Cao, Dhiraj Joshi, Liang-Yan Gui, Yu-Xiong Wang

Abstract: Object detectors often suffer from the domain gap between training (source domain) and real-world applications (target domain). Mean-teacher self-training is a powerful paradigm in unsupervised domain adaptation for object detection, but it struggles with low-quality pseudo-labels. In this work, we identify the intriguing alignment and synergy between mean-teacher self-training and contrastive learning. Motivated by this, we propose Contrastive Mean Teacher (CMT) -- a unified, general-purpose framework with the two paradigms naturally integrated to maximize beneficial learning signals. Instead of using pseudo-labels solely for final predictions, our strategy extracts object-level features using pseudo-labels and optimizes them via contrastive learning, without requiring labels in the target domain. When combined with recent mean-teacher self-training methods, CMT leads to new state-of-the-art target-domain performance: 51.9% mAP on Foggy Cityscapes, outperforming the previously best by 2.1% mAP. Notably, CMT can stabilize performance and provide more significant gains as pseudo-label noise increases.

39.Learning Hand-Held Object Reconstruction from In-The-Wild Videos

Authors:Aditya Prakash, Matthew Chang, Matthew Jin, Saurabh Gupta

Abstract: Prior works for reconstructing hand-held objects from a single image rely on direct 3D shape supervision which is challenging to gather in real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of in-the-wild raw video data showing hand-object interactions. In this paper, we automatically extract 3D supervision (via multiview 2D supervision) from such raw video data to scale up the learning of models for hand-held object reconstruction. This requires tackling two key challenges: unknown camera pose and occlusion. For the former, we use hand pose (predicted from existing techniques, e.g. FrankMocap) as a proxy for object pose. For the latter, we learn data-driven 3D shape priors using synthetic objects from the ObMan dataset. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments on the MOW and HO3D datasets show the effectiveness of these supervisory signals at predicting the 3D shape for real-world hand-held objects without any direct real-world 3D supervision.

40.TUVF: Learning Generalizable Texture UV Radiance Fields

Authors:An-Chieh Cheng, Xueting Li, Sifei Liu, Xiaolong Wang

Abstract: Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity texture given shapes of 3D assets, which has been relatively less explored compared with generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on real-world object datasets where we achieve not only realistic synthesis but also substantial improvements over state-of-the-arts on texture controlling and editing. Project Page: https://www.anjiecheng.me/TUVF

41.Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization

Authors:Connor Z. Lin, Koki Nagano, Jan Kautz, Eric R. Chan, Umar Iqbal, Leonidas Guibas, Gordon Wetzstein, Sameh Khamis

Abstract: There is a growing demand for the accessible creation of high-quality 3D avatars that are animatable and customizable. Although 3D morphable models provide intuitive control for editing and animation, and robustness for single-view face reconstruction, they cannot easily capture geometric and appearance details. Methods based on neural implicit representations, such as signed distance functions (SDF) or neural radiance fields, approach photo-realism, but are difficult to animate and do not generalize well to unseen data. To tackle this problem, we propose a novel method for constructing implicit 3D morphable face models that are both generalizable and intuitive for editing. Trained from a collection of high-quality 3D scans, our face model is parameterized by geometry, expression, and texture latent codes with a learned SDF and explicit UV texture parameterization. Once trained, we can reconstruct an avatar from a single in-the-wild image by leveraging the learned prior to project the image into the latent space of our model. Our implicit morphable face models can be used to render an avatar from novel views, animate facial expressions by modifying expression codes, and edit textures by directly painting on the learned UV-texture maps. We demonstrate quantitatively and qualitatively that our method improves upon photo-realism, geometry, and expression accuracy compared to state-of-the-art methods.

42.OctFormer: Octree-based Transformers for 3D Point Clouds

Authors:Peng-Shuai Wang

Abstract: OctFormer can not only serve as a general and effective backbone for 3D point cloud segmentation and object detection but also have linear complexity and is scalable for large-scale point clouds. The key challenge in applying transformers to point clouds is reducing the quadratic, thus overwhelming, computation complexity of attentions. To combat this issue, several works divide point clouds into non-overlapping windows and constrain attentions in each local window. However, the point number in each window varies greatly, impeding the efficient execution on GPU. Observing that attentions are robust to the shapes of local windows, we propose a novel octree attention, which leverages sorted shuffled keys of octrees to partition point clouds into local windows containing a fixed number of points while permitting shapes of windows to change freely. And we also introduce dilated octree attention to expand the receptive field further. Our octree attention can be implemented in 10 lines of code with open-sourced libraries and runs 17 times faster than other point cloud attentions when the point number exceeds 200k. Built upon the octree attention, OctFormer can be easily scaled up and achieves state-of-the-art performances on a series of 3D segmentation and detection benchmarks, surpassing previous sparse-voxel-based CNNs and point cloud transformers in terms of both efficiency and effectiveness. Notably, on the challenging ScanNet200 dataset, OctFormer outperforms sparse-voxel-based CNNs by 7.3 in mIoU. Our code and trained models are available at https://wang-ps.github.io/octformer.

43.Personalize Segment Anything Model with One Shot

Authors:Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, Hongsheng Li

Abstract: Driven by large-data pre-training, Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework, revolutionizing the segmentation models. Despite the generality, customizing SAM for specific visual concepts without man-powered prompting is under explored, e.g., automatically segmenting your pet dog in different images. In this paper, we propose a training-free Personalization approach for SAM, termed as PerSAM. Given only a single image with a reference mask, PerSAM first localizes the target concept by a location prior, and segments it within other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement. In this way, we effectively adapt SAM for private use without any training. To further alleviate the mask ambiguity, we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the entire SAM, we introduce two learnable weights for multi-scale masks, only training 2 parameters within 10 seconds for improved performance. To demonstrate our efficacy, we construct a new segmentation dataset, PerSeg, for personalized evaluation, and test our methods on video object segmentation with competitive performance. Besides, our approach can also enhance DreamBooth to personalize Stable Diffusion for text-to-image generation, which discards the background disturbance for better target appearance learning. Code is released at https://github.com/ZrrSkywalker/Personalize-SAM

44.NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds

Authors:Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang

Abstract: This paper proposes NeuralEditor that enables neural radiance fields (NeRFs) natively editable for general shape editing tasks. Despite their impressive results on novel-view synthesis, it remains a fundamental challenge for NeRFs to edit the shape of the scene. Our key insight is to exploit the explicit point cloud representation as the underlying structure to construct NeRFs, inspired by the intuitive interpretation of NeRF rendering as a process that projects or "plots" the associated 3D point cloud to a 2D image plane. To this end, NeuralEditor introduces a novel rendering scheme based on deterministic integration within K-D tree-guided density-adaptive voxels, which produces both high-quality rendering results and precise point clouds through optimization. NeuralEditor then performs shape editing via mapping associated points between point clouds. Extensive evaluation shows that NeuralEditor achieves state-of-the-art performance in both shape deformation and scene morphing tasks. Notably, NeuralEditor supports both zero-shot inference and further fine-tuning over the edited scene. Our code, benchmark, and demo video are available at https://immortalco.github.io/NeuralEditor.

45.Controllable Visual-Tactile Synthesis

Authors:Ruihan Gao, Wenzhen Yuan, Jun-Yan Zhu

Abstract: Deep generative models have various content creation applications such as graphic design, e-commerce, and virtual Try-on. However, current works mainly focus on synthesizing realistic visual outputs, often ignoring other sensory modalities, such as touch, which limits physical interaction with users. In this work, we leverage deep generative models to create a multi-sensory experience where users can touch and see the synthesized object when sliding their fingers on a haptic surface. The main challenges lie in the significant scale discrepancy between vision and touch sensing and the lack of explicit mapping from touch sensing data to a haptic rendering device. To bridge this gap, we collect high-resolution tactile data with a GelSight sensor and create a new visuotactile clothing dataset. We then develop a conditional generative model that synthesizes both visual and tactile outputs from a single sketch. We evaluate our method regarding image quality and tactile rendering accuracy. Finally, we introduce a pipeline to render high-quality visual and tactile outputs on an electroadhesion-based haptic device for an immersive experience, allowing for challenging materials and editable sketch inputs.

46.Tracking through Containers and Occluders in the Wild

Authors:Basile Van Hoorick, Pavel Tokmakov, Simon Stent, Jie Li, Carl Vondrick

Abstract: Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce $\textbf{TCOW}$, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.

47.ZipIt! Merging Models from Different Tasks without Training

Authors:George Stoica, Daniel Bolya, Jakob Bjorner, Taylor Hearn, Judy Hoffman

Abstract: Typical deep visual recognition models are capable of performing the one task they were trained on. In this paper, we tackle the extremely difficult problem of combining completely distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model to the space of the other then adds them together. While this works for models trained on the same task, we find that this fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to additionally allow for merging features within each model by defining a general "zip" operation. Second, we add support for partially zipping the models up until a specified layer, naturally creating a multi-head model. We find that these two changes combined account for a staggering 20-60% improvement over prior work, making the merging of models trained on disjoint tasks feasible.