Computer Vision and Pattern Recognition (cs.CV)
Tue, 30 May 2023
1.Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation
Authors:Neau Maëlic, Paulo Santos, Anne-Gwenn Bosser, Cédric Buche
Abstract: Learning to compose visual relationships from raw images in the form of scene graphs is a highly challenging task due to contextual dependencies, but it is essential in computer vision applications that depend on scene understanding. However, no current approaches in Scene Graph Generation (SGG) aim at providing useful graphs for downstream tasks. Instead, the main focus has primarily been on the task of unbiasing the data distribution for predicting more fine-grained relations. That being said, all fine-grained relations are not equally relevant and at least a part of them are of no use for real-world applications. In this work, we introduce the task of Efficient SGG that prioritizes the generation of relevant relations, facilitating the use of Scene Graphs in downstream tasks such as Image Generation. To support further approaches in this task, we present a new dataset, VG150-curated, based on the annotations of the popular Visual Genome dataset. We show through a set of experiments that this dataset contains more high-quality and diverse annotations than the one usually adopted by approaches in SGG. Finally, we show the efficiency of this dataset in the task of Image Generation from Scene Graphs. Our approach can be easily replicated to improve the quality of other Scene Graph Generation datasets.
2.SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing
Authors:Nazmul Karim, Umar Khalid, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard
Abstract: Text-to-Image (T2I) diffusion models have achieved remarkable success in synthesizing high-quality images conditioned on text prompts. Recent methods have tried to replicate the success by either training text-to-video (T2V) models on a very large number of text-video pairs or adapting T2I models on text-video pairs independently. Although the latter is computationally less expensive, it still takes a significant amount of time for per-video adaption. To address this issue, we propose SAVE, a novel spectral-shift-aware adaptation framework, in which we fine-tune the spectral shift of the parameter space instead of the parameters themselves. Specifically, we take the spectral decomposition of the pre-trained T2I weights and only control the change in the corresponding singular values, i.e. spectral shift, while freezing the corresponding singular vectors. To avoid drastic drift from the original T2I weights, we introduce a spectral shift regularizer that confines the spectral shift to be more restricted for large singular values and more relaxed for small singular values. Since we are only dealing with spectral shifts, the proposed method reduces the adaptation time significantly (approx. 10 times) and has fewer resource constrains for training. Such attributes posit SAVE to be more suitable for real-world applications, e.g. editing undesirable content during video streaming. We validate the effectiveness of SAVE with an extensive experimental evaluation under different settings, e.g. style transfer, object replacement, privacy preservation, etc.
3.LayerDiffusion: Layered Controlled Image Editing with Diffusion Models
Authors:Pengzhi Li, QInxuan Huang, Yikang Ding, Zhiheng Li
Abstract: Text-guided image editing has recently experienced rapid development. However, simultaneously performing multiple editing actions on a single image, such as background replacement and specific subject attribute changes, while maintaining consistency between the subject and the background remains challenging. In this paper, we propose LayerDiffusion, a semantic-based layered controlled image editing method. Our method enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy combined with layered diffusion training. During the diffusion process, an iterative guidance strategy is used to generate a final image that aligns with the textual description. Experimental results demonstrate the effectiveness of our method in generating highly coherent images that closely align with the given textual description. The edited images maintain a high similarity to the features of the input image and surpass the performance of current leading image editing methods. LayerDiffusion opens up new possibilities for controllable image editing.
4.Improving Deep Representation Learning via Auxiliary Learnable Target Coding
Authors:Kangjun Liu, Ke Chen, Yaowei Wang, Kui Jia
Abstract: Deep representation learning is a subfield of machine learning that focuses on learning meaningful and useful representations of data through deep neural networks. However, existing methods for semantic classification typically employ pre-defined target codes such as the one-hot and the Hadamard codes, which can either fail or be less flexible to model inter-class correlation. In light of this, this paper introduces a novel learnable target coding as an auxiliary regularization of deep representation learning, which can not only incorporate latent dependency across classes but also impose geometric properties of target codes into representation space. Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations owing to enlarging between-class margins in representation space and favoring equal semantic correlation of learnable target codes respectively. Experimental results on several popular visual classification and retrieval benchmarks can demonstrate the effectiveness of our method on improving representation learning, especially for imbalanced data.
5.ShuffleMix: Improving Representations via Channel-Wise Shuffle of Interpolated Hidden States
Authors:Kangjun Liu, Ke Chen, Lihua Guo, Yaowei Wang, Kui Jia
Abstract: Mixup style data augmentation algorithms have been widely adopted in various tasks as implicit network regularization on representation learning to improve model generalization, which can be achieved by a linear interpolation of labeled samples in input or feature space as well as target space. Inspired by good robustness of alternative dropout strategies against over-fitting on limited patterns of training samples, this paper introduces a novel concept of ShuffleMix -- Shuffle of Mixed hidden features, which can be interpreted as a kind of dropout operation in feature space. Specifically, our ShuffleMix method favors a simple linear shuffle of randomly selected feature channels for feature mixup in-between training samples to leverage semantic interpolated supervision signals, which can be extended to a generalized shuffle operation via additionally combining linear interpolations of intra-channel features. Compared to its direct competitor of feature augmentation -- the Manifold Mixup, the proposed ShuffleMix can gain superior generalization, owing to imposing more flexible and smooth constraints on generating samples and achieving regularization effects of channel-wise feature dropout. Experimental results on several public benchmarking datasets of single-label and multi-label visual classification tasks can confirm the effectiveness of our method on consistently improving representations over the state-of-the-art mixup augmentation.
6.HQDec: Self-Supervised Monocular Depth Estimation Based on a High-Quality Decoder
Authors:Fei Wang, Jun Cheng
Abstract: Decoders play significant roles in recovering scene depths. However, the decoders used in previous works ignore the propagation of multilevel lossless fine-grained information, cannot adaptively capture local and global information in parallel, and cannot perform sufficient global statistical analyses on the final output disparities. In addition, the process of mapping from a low-resolution feature space to a high-resolution feature space is a one-to-many problem that may have multiple solutions. Therefore, the quality of the recovered depth map is low. To this end, we propose a high-quality decoder (HQDec), with which multilevel near-lossless fine-grained information, obtained by the proposed adaptive axial-normalized position-embedded channel attention sampling module (AdaAxialNPCAS), can be adaptively incorporated into a low-resolution feature map with high-level semantics utilizing the proposed adaptive information exchange scheme. In the HQDec, we leverage the proposed adaptive refinement module (AdaRM) to model the local and global dependencies between pixels in parallel and utilize the proposed disparity attention module to model the distribution characteristics of disparity values from a global perspective. To recover fine-grained high-resolution features with maximal accuracy, we adaptively fuse the high-frequency information obtained by constraining the upsampled solution space utilizing the local and global dependencies between pixels into the high-resolution feature map generated from the nonlearning method. Extensive experiments demonstrate that each proposed component improves the quality of the depth estimation results over the baseline results, and the developed approach achieves state-of-the-art results on the KITTI and DDAD datasets. The code and models will be publicly available at \href{https://github.com/fwucas/HQDec}{HQDec}.
7.Wide & deep learning for spatial & intensity adaptive image restoration
Authors:Yadong Wang, Xiangzhi Bai
Abstract: Most existing deep learning-based image restoration methods usually aim to remove degradation with uniform spatial distribution and constant intensity, making insufficient use of degradation prior knowledge. Here we bootstrap the deep neural networks to suppress complex image degradation whose intensity is spatially variable, through utilizing prior knowledge from degraded images. Specifically, we propose an ingenious and efficient multi-frame image restoration network (DparNet) with wide & deep architecture, which integrates degraded images and prior knowledge of degradation to reconstruct images with ideal clarity and stability. The degradation prior is directly learned from degraded images in form of key degradation parameter matrix, with no requirement of any off-site knowledge. The wide & deep architecture in DparNet enables the learned parameters to directly modulate the final restoring results, boosting spatial & intensity adaptive image restoration. We demonstrate the proposed method on two representative image restoration applications: image denoising and suppression of atmospheric turbulence effects in images. Two large datasets, containing 109,536 and 49,744 images respectively, were constructed to support our experiments. The experimental results show that our DparNet significantly outperform SoTA methods in restoration performance and network efficiency. More importantly, by utilizing the learned degradation parameters via wide & deep learning, we can improve the PSNR of image restoration by 0.6~1.1 dB with less than 2% increasing in model parameter numbers and computational complexity. Our work suggests that degraded images may hide key information of the degradation process, which can be utilized to boost spatial & intensity adaptive image restoration.
8.High-Performance Inference Graph Convolutional Networks for Skeleton-Based Action Recognition
Authors:Ziao Li, Junyi Wang, Guhong Nie
Abstract: Recently, significant achievements have been made in skeleton-based human action recognition with the emergence of graph convolutional networks (GCNs). However, the state-of-the-art (SOTA) models used for this task focus on constructing more complex higher-order connections between joint nodes to describe skeleton information, which leads to complex inference processes and high computational costs, resulting in reduced model's practicality. To address the slow inference speed caused by overly complex model structures, we introduce re-parameterization and over-parameterization techniques to GCNs, and propose two novel high-performance inference graph convolutional networks, namely HPI-GCN-RP and HPI-GCN-OP. HPI-GCN-RP uses re-parameterization technique to GCNs to achieve a higher inference speed with competitive model performance. HPI-GCN-OP further utilizes over-parameterization technique to bring significant performance improvement with inference speed slightly decreased. Experimental results on the two skeleton-based action recognition datasets demonstrate the effectiveness of our approach. Our HPI-GCN-OP achieves an accuracy of 93% on the cross-subject split of the NTU-RGB+D 60 dataset, and 90.1% on the cross-subject benchmark of the NTU-RGB+D 120 dataset and is 4.5 times faster than HD-GCN at the same accuracy.
9.Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? A Metric for Unsupervised Evaluation of Domain Adaptation
Authors:Jianfei Yang, Hanjie Qian, Yuecong Xu, Lihua Xie
Abstract: Unsupervised domain adaptation (UDA) involves adapting a model trained on a label-rich source domain to an unlabeled target domain. However, in real-world scenarios, the absence of target-domain labels makes it challenging to evaluate the performance of deep models after UDA. Additionally, prevailing UDA methods typically rely on adversarial training and self-training, which could lead to model degeneration and negative transfer, further exacerbating the evaluation problem. In this paper, we propose a novel metric called the \textit{Transfer Score} to address these issues. The transfer score enables the unsupervised evaluation of domain adaptation models by assessing the spatial uniformity of the classifier via model parameters, as well as the transferability and discriminability of the feature space. Based on unsupervised evaluation using our metric, we achieve three goals: (1) selecting the most suitable UDA method from a range of available options, (2) optimizing hyperparameters of UDA models to prevent model degeneration, and (3) identifying the epoch at which the adapted model performs optimally. Our work bridges the gap between UDA research and practical UDA evaluation, enabling a realistic assessment of UDA model performance. We validate the effectiveness of our metric through extensive empirical studies conducted on various public datasets. The results demonstrate the utility of the transfer score in evaluating UDA models and its potential to enhance the overall efficacy of UDA techniques.
10.Align, Perturb and Decouple: Toward Better Leverage of Difference Information for RSI Change Detection
Authors:Supeng Wang, Yuxi Li, Ming Xie, Mingmin Chi, Yabiao Wang, Chengjie Wang, Wenbing Zhu
Abstract: Change detection is a widely adopted technique in remote sense imagery (RSI) analysis in the discovery of long-term geomorphic evolution. To highlight the areas of semantic changes, previous effort mostly pays attention to learning representative feature descriptors of a single image, while the difference information is either modeled with simple difference operations or implicitly embedded via feature interactions. Nevertheless, such difference modeling can be noisy since it suffers from non-semantic changes and lacks explicit guidance from image content or context. In this paper, we revisit the importance of feature difference for change detection in RSI, and propose a series of operations to fully exploit the difference information: Alignment, Perturbation and Decoupling (APD). Firstly, alignment leverages contextual similarity to compensate for the non-semantic difference in feature space. Next, a difference module trained with semantic-wise perturbation is adopted to learn more generalized change estimators, which reversely bootstraps feature extraction and prediction. Finally, a decoupled dual-decoder structure is designed to predict semantic changes in both content-aware and content-agnostic manners. Extensive experiments are conducted on benchmarks of LEVIR-CD, WHU-CD and DSIFN-CD, demonstrating our proposed operations bring significant improvement and achieve competitive results under similar comparative conditions. Code is available at https://github.com/wangsp1999/CD-Research/tree/main/openAPD
11.LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding
Authors:Yi Tu, Ya Guo, Huan Chen, Jinyang Tang
Abstract: Visually-rich Document Understanding (VrDU) has attracted much research attention over the past years. Pre-trained models on a large number of document images with transformer-based backbones have led to significant performance gains in this field. The major challenge is how to fusion the different modalities (text, layout, and image) of the documents in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and layout modalities in a unified model and produce adaptive and robust multi-modal representations for downstream tasks. Experimental results show that our proposed method can achieve state-of-the-art results on a wide variety of VrDU problems, including form understanding, receipt understanding, and document image classification.
12.Towards Accurate Data-free Quantization for Diffusion Models
Authors:Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Abstract: In this paper, we propose an accurate data-free post-training quantization framework of diffusion models (ADP-DM) for efficient image generation. Conventional data-free quantization methods learn shared quantization functions for tensor discretization regardless of the generation timesteps, while the activation distribution differs significantly across various timesteps. The calibration images are acquired in random timesteps which fail to provide sufficient information for generalizable quantization function learning. Both issues cause sizable quantization errors with obvious image generation performance degradation. On the contrary, we design group-wise quantization functions for activation discretization in different timesteps and sample the optimal timestep for informative calibration image generation, so that our quantized diffusion model can reduce the discretization errors with negligible computational overhead. Specifically, we partition the timesteps according to the importance weights of quantization functions in different groups, which are optimized by differentiable search algorithms. We also select the optimal timestep for calibration image generation by structural risk minimizing principle in order to enhance the generalization ability in the deployment of quantized diffusion model. Extensive experimental results show that our method outperforms the state-of-the-art post-training quantization of diffusion model by a sizable margin with similar computational cost.
13.Diffusion-Stego: Training-free Diffusion Generative Steganography via Message Projection
Authors:Daegyu Kim, Chaehun Shin, Jooyoung Choi, Dahuin Jung, Sungroh Yoon
Abstract: Generative steganography is the process of hiding secret messages in generated images instead of cover images. Existing studies on generative steganography use GAN or Flow models to obtain high hiding message capacity and anti-detection ability over cover images. However, they create relatively unrealistic stego images because of the inherent limitations of generative models. We propose Diffusion-Stego, a generative steganography approach based on diffusion models which outperform other generative models in image generation. Diffusion-Stego projects secret messages into latent noise of diffusion models and generates stego images with an iterative denoising process. Since the naive hiding of secret messages into noise boosts visual degradation and decreases extracted message accuracy, we introduce message projection, which hides messages into noise space while addressing these issues. We suggest three options for message projection to adjust the trade-off between extracted message accuracy, anti-detection ability, and image quality. Diffusion-Stego is a training-free approach, so we can apply it to pre-trained diffusion models which generate high-quality images, or even large-scale text-to-image models, such as Stable diffusion. Diffusion-Stego achieved a high capacity of messages (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp with 90% accuracy) as well as high quality (with a FID score of 2.77 for 1.0 bpp on the FFHQ 64$\times$64 dataset) that makes it challenging to distinguish from real images in the PNG format.
14.Real-World Image Variation by Aligning Diffusion Inversion Chain
Authors:Yuechen Zhang, Jinbo Xing, Eric Lo, Jiaya Jia
Abstract: Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. Our investigation uncovers that this domain gap originates from a latents' distribution gap in different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes diffusion models to generate image variations from a single image exemplar. Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To attain this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that our proposed approach outperforms existing methods with respect to semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
15.Hybrid Representation Learning via Epistemic Graph
Authors:Jin Yuan, Yang Zhang, Yangzhou Du, Zhongchao Shi, Xin Geng, Jianping Fan, Yong Rui
Abstract: In recent years, deep models have achieved remarkable success in many vision tasks. Unfortunately, their performance largely depends on intensive training samples. In contrast, human beings typically perform hybrid learning, e.g., spontaneously integrating structured knowledge for cross-domain recognition or on a much smaller amount of data samples for few-shot learning. Thus it is very attractive to extend hybrid learning for the computer vision tasks by seamlessly integrating structured knowledge with data samples to achieve more effective representation learning. However, such a hybrid learning approach remains a great challenge due to the huge gap between the structured knowledge and the deep features (learned from data samples) on both dimensions and knowledge granularity. In this paper, a novel Epistemic Graph Layer (EGLayer) is developed to enable hybrid learning, such that the information can be exchanged more effectively between the deep features and a structured knowledge graph. Our EGLayer is composed of three major parts: (a) a local graph module to establish a local prototypical graph through the learned deep features, i.e., aligning the deep features with the structured knowledge graph at the same granularity; (b) a query aggregation model to aggregate useful information from the local graphs, and using such representations to compute their similarity with global node embeddings for final prediction; and (c) a novel correlation loss function to constrain the linear consistency between the local and global adjacency matrices.
16.Decomposed Human Motion Prior for Video Pose Estimation via Adversarial Training
Authors:Wenshuo Chen, Xiang Zhou, Zhengdi Yu, Zhaoyu Zheng, Weixi Gu, Kai Zhang
Abstract: Estimating human pose from video is a task that receives considerable attention due to its applicability in numerous 3D fields. The complexity of prior knowledge of human body movements poses a challenge to neural network models in the task of regressing keypoints. In this paper, we address this problem by incorporating motion prior in an adversarial way. Different from previous methods, we propose to decompose holistic motion prior to joint motion prior, making it easier for neural networks to learn from prior knowledge thereby boosting the performance on the task. We also utilize a novel regularization loss to balance accuracy and smoothness introduced by motion prior. Our method achieves 9\% lower PA-MPJPE and 29\% lower acceleration error than previous methods tested on 3DPW. The estimator proves its robustness by achieving impressive performance on in-the-wild dataset.