arXiv daily

Computer Vision and Pattern Recognition (cs.CV)

Tue, 13 Jun 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Thu, 03 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Thu, 22 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Thu, 01 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Wed, 24 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Thu, 04 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.DenseLight: Efficient Control for Large-scale Traffic Signals with Dense Feedback

Authors:Junfan Lin, Yuying Zhu, Lingbo Liu, Yang Liu, Guanbin Li, Liang Lin

Abstract: Traffic Signal Control (TSC) aims to reduce the average travel time of vehicles in a road network, which in turn enhances fuel utilization efficiency, air quality, and road safety, benefiting society as a whole. Due to the complexity of long-horizon control and coordination, most prior TSC methods leverage deep reinforcement learning (RL) to search for a control policy and have witnessed great success. However, TSC still faces two significant challenges. 1) The travel time of a vehicle is delayed feedback on the effectiveness of TSC policy at each traffic intersection since it is obtained after the vehicle has left the road network. Although several heuristic reward functions have been proposed as substitutes for travel time, they are usually biased and not leading the policy to improve in the correct direction. 2) The traffic condition of each intersection is influenced by the non-local intersections since vehicles traverse multiple intersections over time. Therefore, the TSC agent is required to leverage both the local observation and the non-local traffic conditions to predict the long-horizontal traffic conditions of each intersection comprehensively. To address these challenges, we propose DenseLight, a novel RL-based TSC method that employs an unbiased reward function to provide dense feedback on policy effectiveness and a non-local enhanced TSC agent to better predict future traffic conditions for more precise traffic control. Extensive experiments and ablation studies demonstrate that DenseLight can consistently outperform advanced baselines on various road networks with diverse traffic flows. The code is available at https://github.com/junfanlin/DenseLight.

2.Marking anything: application of point cloud in extracting video target features

Authors:Xiangchun Xu

Abstract: Extracting retrievable features from video is of great significance for structured video database construction, video copyright protection and fake video rumor refutation. Inspired by point cloud data processing, this paper proposes a method for marking anything (MA) in the video, which can extract the contour features of any target in the video and convert it into a feature vector with a length of 256 that can be retrieved. The algorithm uses YOLO-v8 algorithm, multi-object tracking algorithm and PointNet++ to extract contour of the video detection target to form spatial point cloud data. Then extract the point cloud feature vector and use it as the retrievable feature of the video detection target. In order to verify the effectiveness and robustness of contour feature, some datasets are crawled from Dou Yin and Kinetics-700 dataset as experimental data. For Dou Yin's homogenized videos, the proposed contour features achieve retrieval accuracy higher than 97% in Top1 return mode. For videos from Kinetics 700, the contour feature also showed good robustness for partial clip mode video tracing.

3.Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

Authors:Yuheng Yang, Haipeng Chen, Zhenguang Liu, Yingda Lyu, Beibei Zhang, Shuang Wu, Zhibo Wang, Kui Ren

Abstract: Action recognition has long been a fundamental and intriguing problem in artificial intelligence. The task is challenging due to the high dimensionality nature of an action, as well as the subtle motion details to be considered. Current state-of-the-art approaches typically learn from articulated motion sequences in the straightforward 3D Euclidean space. However, the vanilla Euclidean space is not efficient for modeling important motion characteristics such as the joint-wise angular acceleration, which reveals the driving force behind the motion. Moreover, current methods typically attend to each channel equally and lack theoretical constrains on extracting task-relevant features from the input. In this paper, we seek to tackle these challenges from three aspects: (1) We propose to incorporate an acceleration representation, explicitly modeling the higher-order variations in motion. (2) We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention, where different representations (i.e., streams) supplement each other towards a more precise action recognition while attention capitalizes on those important channels. (3) We explore feature-level supervision for maximizing the extraction of task-relevant information and formulate this into a mutual information loss. Empirically, our approach sets the new state-of-the-art performance on three benchmark datasets, NTU RGB+D, NTU RGB+D 120, and NW-UCLA. Our code is anonymously released at https://github.com/ActionR-Group/Stream-GCN, hoping to inspire the community.

4.Parametric Implicit Face Representation for Audio-Driven Facial Reenactment

Authors:Ricong Huang, Peiwen Lai, Yipeng Qin, Guanbin Li

Abstract: Audio-driven facial reenactment is a crucial technique that has a range of applications in film-making, virtual avatars and video conferences. Existing works either employ explicit intermediate face representations (e.g., 2D facial landmarks or 3D face models) or implicit ones (e.g., Neural Radiance Fields), thus suffering from the trade-offs between interpretability and expressive power, hence between controllability and quality of the results. In this work, we break these trade-offs with our novel parametric implicit face representation and propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads. Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models, thereby taking the best of both explicit and implicit methods. In addition, we propose several new techniques to improve the three components of our framework, including i) incorporating contextual information into the audio-to-expression parameters encoding; ii) using conditional image synthesis to parameterize the implicit representation and implementing it with an innovative tri-plane structure for efficient learning; iii) formulating facial reenactment as a conditional image inpainting problem and proposing a novel data augmentation technique to improve model generalizability. Extensive experiments demonstrate that our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.

5.Binary Radiance Fields

Authors:Seungjoo Shin, Jaesik Park

Abstract: In this paper, we propose binary radiance fields (BiRF), a storage-efficient radiance field representation employing binary feature encoding that encodes local features using binary encoding parameters in a format of either $+1$ or $-1$. This binarization strategy lets us represent the feature grid with highly compact feature encoding and a dramatic reduction in storage size. Furthermore, our 2D-3D hybrid feature grid design enhances the compactness of feature encoding as the 3D grid includes main components while 2D grids capture details. In our experiments, binary radiance field representation successfully outperforms the reconstruction performance of state-of-the-art (SOTA) efficient radiance field models with lower storage allocation. In particular, our model achieves impressive results in static scene reconstruction, with a PSNR of 31.53 dB for Synthetic-NeRF scenes, 34.26 dB for Synthetic-NSVF scenes, 28.02 dB for Tanks and Temples scenes while only utilizing 0.7 MB, 0.8 MB, and 0.8 MB of storage space, respectively. We hope the proposed binary radiance field representation will make radiance fields more accessible without a storage bottleneck.

6.I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models

Authors:Raz Lapid, Moshe Sipper

Abstract: Modern image-to-text systems typically adopt the encoder-decoder framework, which comprises two main components: an image encoder, responsible for extracting image features, and a transformer-based decoder, used for generating captions. Taking inspiration from the analysis of neural networks' robustness against adversarial perturbations, we propose a novel gray-box algorithm for creating adversarial examples in image-to-text models. Unlike image classification tasks that have a finite set of class labels, finding visually similar adversarial examples in an image-to-text task poses greater challenges because the captioning system allows for a virtually infinite space of possible captions. In this paper, we present a gray-box adversarial attack on image-to-text, both untargeted and targeted. We formulate the process of discovering adversarial perturbations as an optimization problem that uses only the image-encoder component, meaning the proposed attack is language-model agnostic. Through experiments conducted on the ViT-GPT2 model, which is the most-used image-to-text model in Hugging Face, and the Flickr30k dataset, we demonstrate that our proposed attack successfully generates visually similar adversarial examples, both with untargeted and targeted captions. Notably, our attack operates in a gray-box manner, requiring no knowledge about the decoder module. We also show that our attacks fool the popular open-source platform Hugging Face.

7.Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model

Authors:Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa

Abstract: Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called \textit{Paste, Inpaint and Harmonize via Denoising} (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image which is subsequently inserted into a background image to serve as an initialization capturing both scene context and subject identity in one. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module to guide the pre-trained diffusion model to seamlessly blend the inserted subject into the scene naturally. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven ability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at \url{https://sites.google.com/view/phd-demo-page}.

8.Learning to Estimate 6DoF Pose from Limited Data: A Few-Shot, Generalizable Approach using RGB Images

Authors:Panwang Pan, Zhiwen Fan, Brandon Y. Feng, Peihao Wang, Chenxin Li, Zhangyang Wang

Abstract: The accurate estimation of six degrees-of-freedom (6DoF) object poses is essential for many applications in robotics and augmented reality. However, existing methods for 6DoF pose estimation often depend on CAD templates or dense support views, restricting their usefulness in realworld situations. In this study, we present a new cascade framework named Cas6D for few-shot 6DoF pose estimation that is generalizable and uses only RGB images. To address the false positives of target object detection in the extreme few-shot setting, our framework utilizes a selfsupervised pre-trained ViT to learn robust feature representations. Then, we initialize the nearest top-K pose candidates based on similarity score and refine the initial poses using feature pyramids to formulate and update the cascade warped feature volume, which encodes context at increasingly finer scales. By discretizing the pose search range using multiple pose bins and progressively narrowing the pose search range in each stage using predictions from the previous stage, Cas6D can overcome the large gap between pose candidates and ground truth poses, which is a common failure mode in sparse-view scenarios. Experimental results on the LINEMOD and GenMOP datasets demonstrate that Cas6D outperforms state-of-the-art methods by 9.2% and 3.8% accuracy (Proj-5) under the 32-shot setting compared to OnePose++ and Gen6D.

9.Rethinking Adversarial Training with A Simple Baseline

Authors:Hong Liu, Shin'ichi Satoh

Abstract: We report competitive results on RobustBench for CIFAR and SVHN using a simple yet effective baseline approach. Our approach involves a training protocol that integrates rescaled square loss, cyclic learning rates, and erasing-based data augmentation. The outcomes we have achieved are comparable to those of the model trained with state-of-the-art techniques, which is currently the predominant choice for adversarial training. Our baseline, referred to as SimpleAT, yields three novel empirical insights. (i) By switching to square loss, the accuracy is comparable to that obtained by using both de-facto training protocol plus data augmentation. (ii) One cyclic learning rate is a good scheduler, which can effectively reduce the risk of robust overfitting. (iii) Employing rescaled square loss during model training can yield a favorable balance between adversarial and natural accuracy. In general, our experimental results show that SimpleAT effectively mitigates robust overfitting and consistently achieves the best performance at the end of training. For example, on CIFAR-10 with ResNet-18, SimpleAT achieves approximately 52% adversarial accuracy against the current strong AutoAttack. Furthermore, SimpleAT exhibits robust performance on various image corruptions, including those commonly found in CIFAR-10-C dataset. Finally, we assess the effectiveness of these insights through two techniques: bias-variance analysis and logit penalty methods. Our findings demonstrate that all of these simple techniques are capable of reducing the variance of model predictions, which is regarded as the primary contributor to robust overfitting. In addition, our analysis also uncovers connections with various advanced state-of-the-art methods.

10.UOD: Universal One-shot Detection of Anatomical Landmarks

Authors:Heqin Zhu, Quan Quan, Qingsong Yao, Zaiyi Liu, S. kevin Zhou

Abstract: One-shot medical landmark detection gains much attention and achieves great success for its label-efficient training process. However, existing one-shot learning methods are highly specialized in a single domain and suffer domain preference heavily in the situation of multi-domain unlabeled data. Moreover, one-shot learning is not robust that it faces performance drop when annotating a sub-optimal image. To tackle these issues, we resort to developing a domain-adaptive one-shot landmark detection framework for handling multi-domain medical images, named Universal One-shot Detection (UOD). UOD consists of two stages and two corresponding universal models which are designed as combinations of domain-specific modules and domain-shared modules. In the first stage, a domain-adaptive convolution model is self-supervised learned to generate pseudo landmark labels. In the second stage, we design a domain-adaptive transformer to eliminate domain preference and build the global context for multi-domain data. Even though only one annotated sample from each domain is available for training, the domain-shared modules help UOD aggregate all one-shot samples to detect more robust and accurate landmarks. We investigated both qualitatively and quantitatively the proposed UOD on three widely-used public X-ray datasets in different anatomical domains (i.e., head, hand, chest) and obtained state-of-the-art performances in each domain.

11.NeuS-PIR: Learning Relightable Neural Surface using Pre-Integrated Rendering

Authors:Shi Mao, Chenming Wu, Zhelun Shen, Liangjun Zhang

Abstract: Recent advances in neural implicit fields enables rapidly reconstructing 3D geometry from multi-view images. Beyond that, recovering physical properties such as material and illumination is essential for enabling more applications. This paper presents a new method that effectively learns relightable neural surface using pre-intergrated rendering, which simultaneously learns geometry, material and illumination within the neural implicit field. The key insight of our work is that these properties are closely related to each other, and optimizing them in a collaborative manner would lead to consistent improvements. Specifically, we propose NeuS-PIR, a method that factorizes the radiance field into a spatially varying material field and a differentiable environment cubemap, and jointly learns it with geometry represented by neural surface. Our experiments demonstrate that the proposed method outperforms the state-of-the-art method in both synthetic and real datasets.

12.Enhanced Multimodal Representation Learning with Cross-modal KD

Authors:Mengxi Chen, Linyu Xing, Yu Wang, Ya Zhang

Abstract: This paper explores the tasks of leveraging auxiliary modalities which are only available at training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted mutual information maximization-based objective leads to a short-cut solution of the weak teacher, i.e., achieving the maximum mutual information by simply making the teacher model as weak as the student model. To prevent such a weak solution, we introduce an additional objective term, i.e., the mutual information between the teacher and the auxiliary modality model. Besides, to narrow down the information gap between the student and teacher, we further propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets have shown that the proposed method outperforms a range of state-of-the-art approaches for video recognition, video retrieval and emotion classification.

13.Sea Ice Segmentation From SAR Data by Convolutional Transformer Networks

Authors:Nicolae-Catalin Ristea, Andrei Anghel, Mihai Datcu

Abstract: Sea ice is a crucial component of the Earth's climate system and is highly sensitive to changes in temperature and atmospheric conditions. Accurate and timely measurement of sea ice parameters is important for understanding and predicting the impacts of climate change. Nevertheless, the amount of satellite data acquired over ice areas is huge, making the subjective measurements ineffective. Therefore, automated algorithms must be used in order to fully exploit the continuous data feeds coming from satellites. In this paper, we present a novel approach for sea ice segmentation based on SAR satellite imagery using hybrid convolutional transformer (ConvTr) networks. We show that our approach outperforms classical convolutional networks, while being considerably more efficient than pure transformer models. ConvTr obtained a mean intersection over union (mIoU) of 63.68% on the AI4Arctic data set, assuming an inference time of 120ms for a 400 x 400 squared km product.

14.Localization of Just Noticeable Difference for Image Compression

Authors:Guangan Chen, Hanhe Lin, Oliver Wiedemann, Dietmar Saupe

Abstract: The just noticeable difference (JND) is the minimal difference between stimuli that can be detected by a person. The picture-wise just noticeable difference (PJND) for a given reference image and a compression algorithm represents the minimal level of compression that causes noticeable differences in the reconstruction. These differences can only be observed in some specific regions within the image, dubbed as JND-critical regions. Identifying these regions can improve the development of image compression algorithms. Due to the fact that visual perception varies among individuals, determining the PJND values and JND-critical regions for a target population of consumers requires subjective assessment experiments involving a sufficiently large number of observers. In this paper, we propose a novel framework for conducting such experiments using crowdsourcing. By applying this framework, we created a novel PJND dataset, KonJND++, consisting of 300 source images, compressed versions thereof under JPEG or BPG compression, and an average of 43 ratings of PJND and 129 self-reported locations of JND-critical regions for each source image. Our experiments demonstrate the effectiveness and reliability of our proposed framework, which is easy to be adapted for collecting a large-scale dataset. The source code and dataset are available at https://github.com/angchen-dev/LocJND.

15.Lookaround Optimizer: $k$ steps around, 1 step average

Authors:Jiangtao Zhang, Shunyu Liu, Jie Song, Tongtian Zhu, Zhengqi Xu, Mingli Song

Abstract: Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization. Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner (i.e., the weights are averaged after the entire training process is finished), which significantly degrades the diversity between networks and thus impairs the effectiveness in ensembling. In this paper, inspired by weight average, we propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization. Specifically, Lookaround iterates two steps during the whole training period: the around step and the average step. In each iteration, 1) the around step starts from a common point and trains multiple networks simultaneously, each on transformed data by a different data augmentation, and 2) the average step averages these trained networks to get the averaged network, which serves as the starting point for the next iteration. The around step improves the functionality diversity while the average step guarantees the weight locality of these networks during the whole training, which is essential for WA to work. We theoretically explain the superiority of Lookaround by convergence analysis, and make extensive experiments to evaluate Lookaround on popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs, demonstrating clear superiority over state-of-the-arts. Our code is available at https://github.com/Ardcy/Lookaround.

16.E2E-LOAD: End-to-End Long-form Online Action Detection

Authors:Shuqiang Cao, Weixin Luo, Bairui Wang, Wei Zhang, Lin Ma

Abstract: Recently, there has been a growing trend toward feature-based approaches for Online Action Detection (OAD). However, these approaches have limitations due to their fixed backbone design, which ignores the potential capability of a trainable backbone. In this paper, we propose the first end-to-end OAD model, termed E2E-LOAD, designed to address the major challenge of OAD, namely, long-term understanding and efficient online reasoning. Specifically, our proposed approach adopts an initial spatial model that is shared by all frames and maintains a long sequence cache for inference at a low computational cost. We also advocate an asymmetric spatial-temporal model for long-form and short-form modeling effectively. Furthermore, we propose a novel and efficient inference mechanism that accelerates heavy spatial-temporal exploration. Extensive ablation studies and experiments demonstrate the effectiveness and efficiency of our proposed method. Notably, we achieve 17.3 (+12.6) FPS for end-to-end OAD with 72.4%~(+1.2%), 90.3%~(+0.7%), and 48.1%~(+26.0%) mAP on THMOUS14, TVSeries, and HDD, respectively, which is 3x faster than previous approaches. The source code will be made publicly available.

17.Robustness of SAM: Segment Anything Under Corruptions and Beyond

Authors:Yu Qiao, Chaoning Zhang, Taegoo Kang, Donghun Kim, Shehbaz Tariq, Chenshuang Zhang, Choong Seon Hong

Abstract: Segment anything model (SAM), as the name suggests, is claimed to be capable of cutting out any object. SAM is a vision foundation model which demonstrates impressive zero-shot transfer performance with the guidance of a prompt. However, there is currently a lack of comprehensive evaluation of its robustness performance under various types of corruptions. Prior works show that SAM is biased towards texture (style) rather than shape, motivated by which we start by investigating SAM's robustness against style transfer, which is synthetic corruption. With the effect of corruptions interpreted as a style change, we further evaluate its robustness on 15 common corruptions with 5 severity levels for each real-world corruption. Beyond the corruptions, we further evaluate the SAM robustness on local occlusion and adversarial perturbations. Overall, this work provides a comprehensive empirical study on the robustness of the SAM under corruptions and beyond.

18.Dynamically Masked Discriminator for Generative Adversarial Networks

Authors:Wentian Zhang, Haozhe Liu, Bing Li, Jinheng Xie, Yawen Huang, Yuexiang Li, Yefeng Zheng, Bernard Ghanem

Abstract: Training Generative Adversarial Networks (GANs) remains a challenging problem. The discriminator trains the generator by learning the distribution of real/generated data. However, the distribution of generated data changes throughout the training process, which is difficult for the discriminator to learn. In this paper, we propose a novel method for GANs from the viewpoint of online continual learning. We observe that the discriminator model, trained on historically generated data, often slows down its adaptation to the changes in the new arrival generated data, which accordingly decreases the quality of generated results. By treating the generated data in training as a stream, we propose to detect whether the discriminator slows down the learning of new knowledge in generated data. Therefore, we can explicitly enforce the discriminator to learn new knowledge fast. Particularly, we propose a new discriminator, which automatically detects its retardation and then dynamically masks its features, such that the discriminator can adaptively learn the temporally-vary distribution of generated data. Experimental results show our method outperforms the state-of-the-art approaches.

19.Effects of Data Enrichment with Image Transformations on the Performance of Deep Networks

Authors:Hakan Temiz

Abstract: Images cannot always be expected to come in a certain standard format and orientation. Deep networks need to be trained to take into account unexpected variations in orientation or format. For this purpose, training data should be enriched to include different conditions. In this study, the effects of data enrichment on the performance of deep networks in the super resolution problem were investigated experimentally. A total of six basic image transformations were used for the enrichment procedures. In the experiments, two deep network models were trained with variants of the ILSVRC2012 dataset enriched by these six image transformation processes. Considering a single image transformation, it has been observed that the data enriched with 180 degree rotation provides the best results. The most unsuccessful result was obtained when the models were trained on the enriched data generated by the flip upside down process. Models scored highest when trained with a mix of all transformations.

20.Automatic and Accurate Classification of Hotel Bathrooms from Images with Deep Learning

Authors:Hakan Temiz

Abstract: Hotel bathrooms are one of the most important places in terms of customer satisfaction, and where the most complaints are reported. To share their experiences, guests rate hotels, comment, and share images of their positive or negative ratings. An important part of the room images shared by guests is related to bathrooms. Guests tend to prove their satisfaction or dissatisfaction with the bathrooms with images in their comments. These Positive or negative comments and visuals potentially affect the prospective guests. In this study, two different versions of a deep learning algorithm were designed to classify hotel bathrooms as satisfactory (good) or unsatisfactory (bad, when any defects such as dirtiness, deficiencies, malfunctions were present) by analyzing images. The best-performer between the two models was determined as a result of a series of extensive experimental studies. The models were trained for each of 144 combinations of 5 hyper-parameter sets with a data set containing more than 11 thousand bathroom images, specially created for this study. The "HotelBath" data set was shared also with the community with this study. Four different image sizes were taken into consideration: 128, 256, 512 and 1024 pixels in both directions. The classification performances of the models were measured with several metrics. Both algorithms showed very attractive performances even with many combinations of hyper-parameters. They can classify bathroom images with very high accuracy. Suh that the top algorithm achieved an accuracy of 92.4% and an AUC (area under the curve) score of 0.967. In addition, other metrics also proved the success...

21.Generative Watermarking Against Unauthorized Subject-Driven Image Synthesis

Authors:Yihan Ma, Zhengyu Zhao, Xinlei He, Zheng Li, Michael Backes, Yang Zhang

Abstract: Large text-to-image models have shown remarkable performance in synthesizing high-quality images. In particular, the subject-driven model makes it possible to personalize the image synthesis for a specific subject, e.g., a human face or an artistic style, by fine-tuning the generic text-to-image model with a few images from that subject. Nevertheless, misuse of subject-driven image synthesis may violate the authority of subject owners. For example, malicious users may use subject-driven synthesis to mimic specific artistic styles or to create fake facial images without authorization. To protect subject owners against such misuse, recent attempts have commonly relied on adversarial examples to indiscriminately disrupt subject-driven image synthesis. However, this essentially prevents any benign use of subject-driven synthesis based on protected images. In this paper, we take a different angle and aim at protection without sacrificing the utility of protected images for general synthesis purposes. Specifically, we propose GenWatermark, a novel watermark system based on jointly learning a watermark generator and a detector. In particular, to help the watermark survive the subject-driven synthesis, we incorporate the synthesis process in learning GenWatermark by fine-tuning the detector with synthesized images for a specific subject. This operation is shown to largely improve the watermark detection accuracy and also ensure the uniqueness of the watermark for each individual subject. Extensive experiments validate the effectiveness of GenWatermark, especially in practical scenarios with unknown models and text prompts (74% Acc.), as well as partial data watermarking (80% Acc. for 1/4 watermarking). We also demonstrate the robustness of GenWatermark to two potential countermeasures that substantially degrade the synthesis quality.

22.Area is all you need: repeatable elements make stronger adversarial attacks

Authors:Dillon Niederhut

Abstract: Over the last decade, deep neural networks have achieved state of the art in computer vision tasks. These models, however, are susceptible to unusual inputs, known as adversarial examples, that cause them to misclassify or otherwise fail to detect objects. Here, we provide evidence that the increasing success of adversarial attacks is primarily due to increasing their size. We then demonstrate a method for generating the largest possible adversarial patch by building a adversarial pattern out of repeatable elements. This approach achieves a new state of the art in evading detection by YOLOv2 and YOLOv3. Finally, we present an experiment that fails to replicate the prior success of several attacks published in this field, and end with some comments on testing and reproducibility.

23.Compositionally Equivariant Representation Learning

Authors:Xiao Liu, Pedro Sanchez, Spyridon Thermos, Alison Q. O'Neil, Sotirios A. Tsaftaris

Abstract: Deep learning models often need sufficient supervision (i.e. labelled data) in order to be trained effectively. By contrast, humans can swiftly learn to identify important anatomy in medical images like MRI and CT scans, with minimal guidance. This recognition capability easily generalises to new images from different medical facilities and to new tasks in different settings. This rapid and generalisable learning ability is largely due to the compositional structure of image patterns in the human brain, which are not well represented in current medical models. In this paper, we study the utilisation of compositionality in learning more interpretable and generalisable representations for medical image segmentation. Overall, we propose that the underlying generative factors that are used to generate the medical images satisfy compositional equivariance property, where each factor is compositional (e.g. corresponds to the structures in human anatomy) and also equivariant to the task. Hence, a good representation that approximates well the ground truth factor has to be compositionally equivariant. By modelling the compositional representations with learnable von-Mises-Fisher (vMF) kernels, we explore how different design and learning biases can be used to enforce the representations to be more compositionally equivariant under un-, weakly-, and semi-supervised settings. Extensive results show that our methods achieve the best performance over several strong baselines on the task of semi-supervised domain-generalised medical image segmentation. Code will be made publicly available upon acceptance at https://github.com/vios-s.

24.Low-Resource White-Box Semantic Segmentation of Supporting Towers on 3D Point Clouds via Signature Shape Identification

Authors:Diogo Lavado, Cláudia Soares, Alessandra Micheletti, Giovanni Bocchi, Alex Coronati, Manuel Silva, Patrizio Frosini

Abstract: Research in 3D semantic segmentation has been increasing performance metrics, like the IoU, by scaling model complexity and computational resources, leaving behind researchers and practitioners that (1) cannot access the necessary resources and (2) do need transparency on the model decision mechanisms. In this paper, we propose SCENE-Net, a low-resource white-box model for 3D point cloud semantic segmentation. SCENE-Net identifies signature shapes on the point cloud via group equivariant non-expansive operators (GENEOs), providing intrinsic geometric interpretability. Our training time on a laptop is 85~min, and our inference time is 20~ms. SCENE-Net has 11 trainable geometrical parameters and requires fewer data than black-box models. SCENE--Net offers robustness to noisy labeling and data imbalance and has comparable IoU to state-of-the-art methods. With this paper, we release a 40~000 Km labeled dataset of rural terrain point clouds and our code implementation.

25.Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images

Authors:Ming Y. Lu, Bowen Chen, Andrew Zhang, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Long Phi Le, Yung-Sung Chuang, Faisal Mahmood

Abstract: Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small to medium sized-images, neither of which are applicable to the emerging field of computational pathology where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pre-train our text encoder. By effectively leveraging strong pre-trained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero.

26.LMD: Light-weight Prediction Quality Estimation for Object Detection in Lidar Point Clouds

Authors:Tobias Riedlinger, Marius Schubert, Sarina Penquitt, Jan-Marcel Kezmann, Pascal Colling, Karsten Kahl, Lutz Roese-Koerner, Michael Arnold, Urs Zimmermann, Matthias Rottmann

Abstract: Object detection on Lidar point cloud data is a promising technology for autonomous driving and robotics which has seen a significant rise in performance and accuracy during recent years. Particularly uncertainty estimation is a crucial component for down-stream tasks and deep neural networks remain error-prone even for predictions with high confidence. Previously proposed methods for quantifying prediction uncertainty tend to alter the training scheme of the detector or rely on prediction sampling which results in vastly increased inference time. In order to address these two issues, we propose LidarMetaDetect (LMD), a light-weight post-processing scheme for prediction quality estimation. Our method can easily be added to any pre-trained Lidar object detector without altering anything about the base model and is purely based on post-processing, therefore, only leading to a negligible computational overhead. Our experiments show a significant increase of statistical reliability in separating true from false predictions. We propose and evaluate an additional application of our method leading to the detection of annotation errors. Explicit samples and a conservative count of annotation error proposals indicates the viability of our method for large-scale datasets like KITTI and nuScenes. On the widely-used nuScenes test dataset, 43 out of the top 100 proposals of our method indicate, in fact, erroneous annotations.

27.PSSTRNet: Progressive Segmentation-guided Scene Text Removal Network

Authors:Guangtao Lyu, Anna Zhu

Abstract: Scene text removal (STR) is a challenging task due to the complex text fonts, colors, sizes, and background textures in scene images. However, most previous methods learn both text location and background inpainting implicitly within a single network, which weakens the text localization mechanism and makes a lossy background. To tackle these problems, we propose a simple Progressive Segmentation-guided Scene Text Removal Network(PSSTRNet) to remove the text in the image iteratively. It contains two decoder branches, a text segmentation branch, and a text removal branch, with a shared encoder. The text segmentation branch generates text mask maps as the guidance for the regional removal branch. In each iteration, the original image, previous text removal result, and text mask are input to the network to extract the rest part of the text segments and cleaner text removal result. To get a more accurate text mask map, an update module is developed to merge the mask map in the current and previous stages. The final text removal result is obtained by adaptive fusion of results from all previous stages. A sufficient number of experiments and ablation studies conducted on the real and synthetic public datasets demonstrate our proposed method achieves state-of-the-art performance. The source code of our work is available at: \href{https://github.com/GuangtaoLyu/PSSTRNet}{https://github.com/GuangtaoLyu/PSSTRNet.}

28.Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity

Authors:Mu Zhou, Lucas Stoffl, Mackenzie Mathis, Alexander Mathis

Abstract: Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 47.2 AP on OCHuman, an improvement of 8.6% and 4.9% over the prior art, respectively. Furthermore, we show that our method has excellent performance on non-crowded datasets such as COCO, and strongly improves the performance on multi-animal benchmarks involving mice, fish and monkeys.

29.Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data

Authors:Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi

Abstract: We present Viewset Diffusion: a framework for training image-conditioned 3D generative models from 2D data. Image-conditioned 3D generative models allow us to address the inherent ambiguity in single-view 3D reconstruction. Given one image of an object, there is often more than one possible 3D volume that matches the input image, because a single image never captures all sides of an object. Deterministic models are inherently limited to producing one possible reconstruction and therefore make mistakes in ambiguous settings. Modelling distributions of 3D shapes is challenging because 3D ground truth data is often not available. We propose to solve the issue of data availability by training a diffusion model which jointly denoises a multi-view image set.We constrain the output of Viewset Diffusion models to a single 3D volume per image set, guaranteeing consistent geometry. Training is done through reconstruction losses on renderings, allowing training with only three images per object. Our design of architecture and training scheme allows our model to perform 3D generation and generative, ambiguity-aware single-view reconstruction in a feed-forward manner. Project page: szymanowiczs.github.io/viewset-diffusion.

30.VISION Datasets: A Benchmark for Vision-based InduStrial InspectiON

Authors:Haoping Bai, Shancong Mou, Tatiana Likhomanenko, Ramazan Gokberk Cinbis, Oncel Tuzel, Ping Huang, Jiulong Shan, Jianjun Shi, Meng Cao

Abstract: Despite progress in vision-based inspection algorithms, real-world industrial challenges -- specifically in data availability, quality, and complex production requirements -- often remain under-addressed. We introduce the VISION Datasets, a diverse collection of 14 industrial inspection datasets, uniquely poised to meet these challenges. Unlike previous datasets, VISION brings versatility to defect detection, offering annotation masks across all splits and catering to various detection methodologies. Our datasets also feature instance-segmentation annotation, enabling precise defect identification. With a total of 18k images encompassing 44 defect types, VISION strives to mirror a wide range of real-world production scenarios. By supporting two ongoing challenge competitions on the VISION Datasets, we hope to foster further advancements in vision-based industrial inspection.

31.Image Captioners Are Scalable Vision Learners Too

Authors:Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer

Abstract: Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.

32.Continuous Cost Aggregation for Dual-Pixel Disparity Extraction

Authors:Sagi Monin, Sagi Katz, Georgios Evangelidis

Abstract: Recent works have shown that depth information can be obtained from Dual-Pixel (DP) sensors. A DP arrangement provides two views in a single shot, thus resembling a stereo image pair with a tiny baseline. However, the different point spread function (PSF) per view, as well as the small disparity range, makes the use of typical stereo matching algorithms problematic. To address the above shortcomings, we propose a Continuous Cost Aggregation (CCA) scheme within a semi-global matching framework that is able to provide accurate continuous disparities from DP images. The proposed algorithm fits parabolas to matching costs and aggregates parabola coefficients along image paths. The aggregation step is performed subject to a quadratic constraint that not only enforces the disparity smoothness but also maintains the quadratic form of the total costs. This gives rise to an inherently efficient disparity propagation scheme with a pixel-wise minimization in closed-form. Furthermore, the continuous form allows for a robust multi-scale aggregation that better compensates for the varying PSF. Experiments on DP data from both DSLR and phone cameras show that the proposed scheme attains state-of-the-art performance in DP disparity estimation.

33.MOFI: Learning Image Representations from Noisy Entity Annotated Images

Authors:Wentao Wu, Aleksei Timofeev, Chen Chen, Bowen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jon Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang

Abstract: We present MOFI, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: ($i$) pre-training data, and ($ii$) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. The approach is simple, does not require costly human annotation, and can be readily scaled up to billions of image-text pairs mined from the web. Through this method, we have created Image-to-Entities (I2E), a new large-scale dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes, including supervised pre-training, contrastive pre-training, and multi-task learning. For constrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations.

34.Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Authors:Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy

Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

35.Hidden Biases of End-to-End Driving Models

Authors:Bernhard Jaeger, Kashyap Chitta, Andreas Geiger

Abstract: End-to-end driving systems have recently made rapid progress, in particular on CARLA. Independent of their major contribution, they introduce changes to minor system components. Consequently, the source of improvements is unclear. We identify two biases that recur in nearly all state-of-the-art methods and are critical for the observed progress on CARLA: (1) lateral recovery via a strong inductive bias towards target point following, and (2) longitudinal averaging of multimodal waypoint predictions for slowing down. We investigate the drawbacks of these biases and identify principled alternatives. By incorporating our insights, we develop TF++, a simple end-to-end method that ranks first on the Longest6 and LAV benchmarks, gaining 14 driving score over the best prior work on Longest6.

36.GeneCIS: A Benchmark for General Conditional Image Similarity

Authors:Sagar Vaze, Nicolas Carion, Ishan Misra

Abstract: We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.

37.Neural Scene Chronology

Authors:Haotong Lin, Qianqian Wang, Ruojin Cai, Sida Peng, Hadar Averbuch-Elor, Xiaowei Zhou, Noah Snavely

Abstract: In this work, we aim to reconstruct a time-varying 3D model, capable of rendering photo-realistic renderings with independent control of viewpoint, illumination, and time, from Internet photos of large-scale landmarks. The core challenges are twofold. First, different types of temporal changes, such as illumination and changes to the underlying scene itself (such as replacing one graffiti artwork with another) are entangled together in the imagery. Second, scene-level temporal changes are often discrete and sporadic over time, rather than continuous. To tackle these problems, we propose a new scene representation equipped with a novel temporal step function encoding method that can model discrete scene-level content changes as piece-wise constant functions over time. Specifically, we represent the scene as a space-time radiance field with a per-image illumination embedding, where temporally-varying scene changes are encoded using a set of learned step functions. To facilitate our task of chronology reconstruction from Internet imagery, we also collect a new dataset of four scenes that exhibit various changes over time. We demonstrate that our method exhibits state-of-the-art view synthesis results on this dataset, while achieving independent control of viewpoint, time, and illumination.

38.XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Authors:Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Fahad Shahbaz Khan

Abstract: The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks. Such models are trained on massive datasets comprising billions of public image-text pairs with diverse tasks. However, their performance on task-specific domains, such as radiology, is still under-investigated and potentially limited due to a lack of sophistication in understanding biomedical images. On the other hand, conversational medical models have exhibited remarkable success but have mainly focused on text-based analysis. In this paper, we introduce XrayGPT, a novel conversational medical vision-language model that can analyze and answer open-ended questions about chest radiographs. Specifically, we align both medical visual encoder (MedClip) with a fine-tuned large language model (Vicuna), using a simple linear transformation. This alignment enables our model to possess exceptional visual conversation abilities, grounded in a deep understanding of radiographs and medical domain knowledge. To enhance the performance of LLMs in the medical context, we generate ~217k interactive and high-quality summaries from free-text radiology reports. These summaries serve to enhance the performance of LLMs through the fine-tuning process. Our approach opens up new avenues the research for advancing the automated analysis of chest radiographs. Our open-source demos, models, and instruction sets are available at: https://github.com/mbzuai-oryx/XrayGPT.