Computer Vision and Pattern Recognition (cs.CV)
Fri, 28 Jul 2023
1.DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable Kendall's Rank Correlation
Authors:Kaipeng Zheng, Huishuai Zhang, Weiran Huang
Abstract: Few-shot learning aims to adapt models trained on the base dataset to novel tasks where the categories are not seen by the model before. This often leads to a relatively uniform distribution of feature values across channels on novel classes, posing challenges in determining channel importance for novel tasks. Standard few-shot learning methods employ geometric similarity metrics such as cosine similarity and negative Euclidean distance to gauge the semantic relatedness between two features. However, features with high geometric similarities may carry distinct semantics, especially in the context of few-shot learning. In this paper, we demonstrate that the importance ranking of feature channels is a more reliable indicator for few-shot learning than geometric similarity metrics. We observe that replacing the geometric similarity metric with Kendall's rank correlation only during inference is able to improve the performance of few-shot learning across a wide range of datasets with different domains. Furthermore, we propose a carefully designed differentiable loss for meta-training to address the non-differentiability issue of Kendall's rank correlation. Extensive experiments demonstrate that the proposed rank-correlation-based approach substantially enhances few-shot learning performance.
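The contrast between a geometric metric and a channel-ranking metric can be illustrated with a minimal sketch. It computes cosine similarity and Kendall's rank correlation (tau-a) between two feature vectors, plus a smooth surrogate that replaces the sign function with tanh. The surrogate and its `alpha` parameter are assumptions chosen for illustration; the abstract does not specify the form of the authors' differentiable loss.

```python
import math
from itertools import combinations


def cosine_similarity(x, y):
    """Geometric similarity: cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def kendall_tau(x, y):
    """Kendall's rank correlation (tau-a) over all channel pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length >= 2")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
        score += (prod > 0) - (prod < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


def soft_kendall_tau(x, y, alpha=10.0):
    """Smooth surrogate (assumption): sign(d) is replaced by tanh(alpha * d)."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        total += math.tanh(alpha * (x[i] - x[j])) * math.tanh(alpha * (y[i] - y[j]))
    n_pairs = len(x) * (len(x) - 1) // 2
    return total / n_pairs
```

Because the surrogate is built from tanh, it has gradients everywhere, and it approaches the exact statistic as `alpha` grows; the hard version only depends on the ordering of channels, not on their magnitudes.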
2.DocDeshadower: Frequency-aware Transformer for Document Shadow Removal
Authors:Shenghong Luo, Ruifeng Xu, Xuhang Chen, Zinuo Li, Chi-Man Pun, Shuqiang Wang
Abstract: The presence of shadows significantly impacts the visual quality of scanned documents. However, the existing traditional techniques and deep learning methods used for shadow removal have several limitations. These methods either rely heavily on heuristics, resulting in suboptimal performance, or require large datasets to learn shadow-related features. In this study, we propose the DocDeshadower, a multi-frequency Transformer-based model built on Laplacian Pyramid. DocDeshadower is designed to remove shadows at different frequencies in a coarse-to-fine manner. To achieve this, we decompose the shadow image into different frequency bands using Laplacian Pyramid. In addition, we introduce two novel components to this model: the Attention-Aggregation Network and the Gated Multi-scale Fusion Transformer. The Attention-Aggregation Network is designed to remove shadows in the low-frequency part of the image, whereas the Gated Multi-scale Fusion Transformer refines the entire image at a global scale with its large perceptive field. Our extensive experiments demonstrate that DocDeshadower outperforms the current state-of-the-art methods in both qualitative and quantitative terms.
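The frequency decomposition mentioned above can be sketched as a small Laplacian pyramid. This is a simplified illustration, not the authors' implementation: it assumes a grayscale image stored as nested lists with even dimensions at every level, and it uses 2x2 box averaging and nearest-neighbor upsampling in place of the Gaussian filtering a standard Laplacian pyramid would use.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block (even dimensions assumed)."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]


def upsample(img):
    """Double the resolution by repeating every pixel in both directions."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def build_laplacian_pyramid(img, levels):
    """Return (bands, residual): high-frequency bands from fine to coarse,
    plus the remaining low-frequency image."""
    bands = []
    current = [[float(v) for v in row] for row in img]
    for _ in range(levels):
        low = downsample(current)
        up = upsample(low)
        # The band stores what the coarser level cannot represent.
        bands.append([[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(current, up)])
        current = low
    return bands, current


def reconstruct(bands, residual):
    """Invert the decomposition, going from coarse to fine."""
    current = residual
    for band in reversed(bands):
        up = upsample(current)
        current = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, band)]
    return current
```

A coarse-to-fine method would process `residual` first and then refine using the bands; the decomposition itself is lossless, so any change in the output comes from the processing applied to each level.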
3.TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts
Authors:Hanrong Ye, Dan Xu
Abstract: Learning discriminative task-specific features simultaneously for multiple distinct tasks is a fundamental problem in multi-task learning. Recent state-of-the-art models consider directly decoding task-specific features from one shared task-generic feature (e.g., feature from a backbone layer), and utilize carefully designed decoders to produce multi-task features. However, as the input feature is fully shared and each task decoder also shares decoding parameters for different input samples, it leads to a static feature decoding process, producing less discriminative task-specific representations. To tackle this limitation, we propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces and decoding task-specific features in a dynamic manner. Specifically, TaskExpert introduces a set of expert networks to decompose the backbone feature into several representative task-generic features. Then, the task-specific features are decoded by using dynamic task-specific gating networks operating on the decomposed task-generic features. Furthermore, to establish long-range modeling of the task-specific representations from different layers of TaskExpert, we design a multi-task feature memory that updates at each layer and acts as an additional feature expert for dynamic task-specific feature decoding. Extensive experiments demonstrate that our TaskExpert clearly outperforms previous best-performing methods on all 9 metrics of two competitive multi-task learning benchmarks for visual scene understanding (i.e., PASCAL-Context and NYUD-v2). Codes and models will be made publicly available at https://github.com/prismformore/Multi-Task-Transformer
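The dynamic decoding idea, in which a gating network weights several expert outputs per input sample, can be sketched as follows. This is a generic mixture-of-experts illustration under stated assumptions: experts and gates are plain linear maps, and the names `expert_weights` and `gate_weights` are hypothetical. It does not reproduce the TaskExpert architecture or its feature memory.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def decode_task_feature(feature, expert_weights, gate_weights):
    """Mix expert outputs with input-dependent gates for one task.

    expert_weights: one matrix per expert (shared across tasks).
    gate_weights:   one row per expert (specific to this task).
    """
    expert_outputs = [matvec(w, feature) for w in expert_weights]
    gates = softmax(matvec(gate_weights, feature))
    dim = len(expert_outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, expert_outputs))
             for d in range(dim)]
    return mixed, gates
```

Since the gates are computed from the input feature, two samples can receive different expert mixtures for the same task, which is the contrast with a static decoder that the abstract draws.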
4.Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation
Authors:Yueh-Ning Ku, Mikhail Kuznetsov, Shaunak Mishra, Paloma de Juan
Abstract: Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). It is common for DPA catalogs to be in the scale of millions (corresponding to the scale of products which can be bought from the e-commerce platform). However, not all product images in the catalog may be appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, products just placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product in the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics, and human evaluation. In addition, we show how our staging approach can enable animations of moving products leading to a video ad from a product image.
5.Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF
Authors:Haotian Bai, Yiqi Lin, Yize Chen, Lin Wang
Abstract: The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction for applications such as virtual reality and gaming. In particular, PlenOctree (POT)[1], an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal as the scene complexity evolves continuously with updates to cached color and density, necessitating refining the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the dynamic PlenOctree (DOT), which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. Firstly, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces the sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT enhances visual quality, reduces parameters by over 55.15%/68.84%, and provides 1.7/1.9 times the FPS on NeRF-synthetic and Tanks & Temples, respectively. Project homepage: https://vlislab22.github.io/DOT. [1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
6.Supervised Homography Learning with Realistic Dataset Generation
Authors:Hai Jiang, Haipeng Li, Songchen Han, Haoqiang Fan, Bing Zeng, Shuaicheng Liu
Abstract: In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content consistency module and a quality assessment module. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method achieves state-of-the-art performance and existing supervised methods can be also improved based on the generated dataset. Code and dataset are available at https://github.com/megvii-research/RealSH.
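The role of a sampled homography as a ground-truth label can be illustrated with a short sketch. It applies a 3x3 homography to image points, composes two homographies, and expresses a homography as the displacement of the four image corners. The corner-offset label is a common parameterization in supervised homography estimation and is an assumption here; the abstract does not state which label format the authors use.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)


def compose(A, B):
    """Matrix product A @ B: apply B first, then A."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def corner_offsets(H, width, height):
    """Label a homography by how far it moves the four image corners."""
    corners = [(0.0, 0.0), (float(width), 0.0),
               (float(width), float(height)), (0.0, float(height))]
    warped = [apply_homography(H, c) for c in corners]
    return [(u - x, v - y) for (x, y), (u, v) in zip(corners, warped)]
```

In a data generation loop of the kind described, `compose` would combine a pre-estimated homography with a newly sampled one, and the sampled one would supply the training label.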
7.Prompt Guided Transformer for Multi-Task Dense Prediction
Authors:Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu
Abstract: Task-conditional architecture offers an advantage in parameter efficiency but falls short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance and model parameters is an important and difficult problem. In this paper, we introduce a simple and lightweight task-conditional model called Prompt Guided Transformer (PGT) to address this challenge. Our approach designs a Prompt-conditioned Transformer block, which incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation across multiple tasks. This block is integrated into both the shared encoder and decoder, enhancing the capture of intra- and inter-task features. Moreover, we design a lightweight decoder to further reduce parameter usage, which accounts for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, demonstrate that our approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, and maintains a favorable balance between performance and parameter size.
8.AffineGlue: Joint Matching and Robust Estimation
Authors:Daniel Barath, Dmytro Mishkin, Luca Cavalli, Paul-Edouard Sarlin, Petr Hruby, Marc Pollefeys
Abstract: We propose AffineGlue, a method for joint two-view feature matching and robust estimation that reduces the combinatorial complexity of the problem by employing single-point minimal solvers. AffineGlue selects potential matches from one-to-many correspondences to estimate minimal models. Guided matching is then used to find matches consistent with the model, suffering less from the ambiguities of one-to-one matches. Moreover, we derive a new minimal solver for homography estimation, requiring only a single affine correspondence (AC) and a gravity prior. Furthermore, we train a neural network to reject ACs that are unlikely to lead to a good model. AffineGlue is superior to the SOTA on real-world datasets, even when assuming that the gravity direction points downwards. On PhotoTourism, the AUC@10° score is improved by 6.6 points compared to the SOTA. On ScanNet, AffineGlue makes SuperPoint and SuperGlue achieve similar accuracy as the detector-free LoFTR.
9.Uncertainty-aware Unsupervised Multi-Object Tracking
Authors:Kai Liu, Sheng Jin, Zhihang Fu, Ze Chen, Rongxin Jiang, Jieping Ye
Abstract: Without manually annotated identities, unsupervised multi-object trackers struggle to learn reliable feature embeddings. This makes the similarity-based inter-frame association stage error-prone as well, giving rise to an uncertainty problem. The frame-by-frame accumulated uncertainty prevents trackers from learning a feature embedding that is consistent over time. To avoid this uncertainty problem, recent self-supervised techniques have been adopted, but they fail to capture temporal relations, so the inter-frame uncertainty still exists. In fact, this paper argues that although the uncertainty problem is inevitable, it is possible to leverage the uncertainty itself to improve the learned consistency in turn. Specifically, an uncertainty-based metric is developed to verify and rectify the risky associations. The resulting accurate pseudo-tracklets boost learning the feature consistency, and accurate tracklets can incorporate temporal information into spatial transformation. This paper proposes a tracklet-guided augmentation strategy to simulate tracklets' motion, which adopts a hierarchical uncertainty-based sampling mechanism for hard sample mining. The ultimate unsupervised MOT framework, namely U2MOT, is proven effective on the MOT-Challenges and VisDrone-MOT benchmarks. U2MOT achieves SOTA performance among the published supervised and unsupervised trackers.
10.Deep Learning Pipeline for Automated Visual Moth Monitoring: Insect Localization and Species Classification
Authors:Dimitri Korsch, Paul Bodesheim, Joachim Denzler
Abstract: Biodiversity monitoring is crucial for tracking and counteracting adverse trends in population fluctuations. However, automatic recognition systems are rarely applied so far, and experts evaluate the generated data masses manually. Especially the support of deep learning methods for visual monitoring is not yet established in biodiversity research, compared to other areas like advertising or entertainment. In this paper, we present a deep learning pipeline for analyzing images captured by a moth scanner, an automated visual monitoring system of moth species developed within the AMMOD project. We first localize individuals with a moth detector and afterward determine the species of detected insects with a classifier. Our detector achieves up to 99.01% mean average precision and our classifier distinguishes 200 moth species with an accuracy of 93.13% on image cutouts depicting single insects. Combining both in our pipeline improves the accuracy for species identification in images of the moth scanner from 79.62% to 88.05%.
11.Implicit neural representation for change detection
Authors:Peter Naylor, Diego Di Carlo, Arianna Traviglia, Makoto Yamada, Marco Fiorucci
Abstract: Detecting changes that occurred in a pair of 3D airborne LiDAR point clouds, acquired at two different times over the same geographical area, is a challenging task because of unmatching spatial supports and acquisition system noise. Most recent attempts to detect changes on point clouds are based on supervised methods, which require large labelled datasets that are unavailable in real-world applications. To address these issues, we propose an unsupervised approach that comprises two components: a Neural Field (NF) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. NFs offer a grid-agnostic representation to encode bi-temporal point clouds with unmatched spatial support that can be regularised to increase high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, leading to a significant increase in detection capabilities. We apply our method to a benchmark dataset of simulated LiDAR point clouds for urban sprawling. The dataset offers different challenging scenarios with different resolutions, input modalities and noise levels, allowing a multi-scenario comparison of our method with the current state-of-the-art. We outperform the previous methods on this dataset by a 10% margin in the intersection over union metric. In addition, we apply our method to a real-world scenario to identify illegal excavation (looting) of archaeological sites and confirm that the detections match findings from field experts.
12.Automated Visual Monitoring of Nocturnal Insects with Light-based Camera Traps
Authors:Dimitri Korsch, Paul Bodesheim, Gunnar Brehm, Joachim Denzler
Abstract: Automatic camera-assisted monitoring of insects for abundance estimations is crucial to understand and counteract ongoing insect decline. In this paper, we present two datasets of nocturnal insects, especially moths as a subset of Lepidoptera, photographed in Central Europe. One of the datasets, the EU-Moths dataset, was captured manually by citizen scientists and contains species annotations for 200 different species and bounding box annotations for those. We used this dataset to develop and evaluate a two-stage pipeline for insect detection and moth species classification in previous work. We further introduce a prototype for an automated visual monitoring system. This prototype produced the second dataset consisting of more than 27,000 images captured on 95 nights. For evaluation and bootstrapping purposes, we annotated a subset of the images with bounding boxes enframing nocturnal insects. Finally, we present first detection and classification baselines for these datasets and encourage other scientists to use this publicly available data.
13.Cross-Modal Concept Learning and Inference for Vision-Language Models
Authors:Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He
Abstract: Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance upon the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.
14.Local and Global Information in Obstacle Detection on Railway Tracks
Authors:Matthias Brucker, Andrei Cramariuc, Cornelius von Einem, Roland Siegwart, Cesar Cadena
Abstract: Reliable obstacle detection on railways could help prevent collisions that result in injuries and potentially damage or derail the train. Unfortunately, generic object detectors do not have enough classes to account for all possible scenarios, and datasets featuring objects on railways are challenging to obtain. We propose utilizing a shallow network to learn railway segmentation from normal railway images. The limited receptive field of the network prevents overconfident predictions and allows the network to focus on the locally very distinct and repetitive patterns of the railway environment. Additionally, we explore the controlled inclusion of global information by learning to hallucinate obstacle-free images. We evaluate our method on a custom dataset featuring railway images with artificially augmented obstacles. Our proposed method outperforms other learning-based baseline methods.
15.Non-invasive Diabetes Detection using Gabor Filter: A Comparative Analysis of Different Cameras
Authors:Christina A. Garcia, Patricia Angela R. Abu, Rosula SJ. Reyes
Abstract: This paper compares and explores the performance of mobile device cameras and a laptop camera as convenient tools for capturing images for non-invasive detection of Diabetes Mellitus (DM) using facial block texture features. Participants aged 20 to 79 years were chosen for the dataset. 12 MP and 7 MP mobile cameras and a laptop camera were used to take the photos under normal lighting conditions. Extracted facial blocks were classified using k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM). 100 images were captured, preprocessed, filtered using Gabor filters, and iterated. Performance of the system was measured in terms of accuracy, specificity, and sensitivity. The best performance of 96.7% accuracy, 100% sensitivity, and 93% specificity was achieved with the 12 MP back camera using SVM with 100 images.
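The three reported measures follow directly from the confusion matrix of a binary classifier. The sketch below shows how they are computed; the labels in the usage note are hypothetical and are not data from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels in {0, 1} (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Sensitivity: fraction of true positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of true negatives that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

For example, `binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])` has one missed positive and one false alarm among four positives and four negatives, so all three measures equal 0.75.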
16.Improving Image Quality of Sparse-view Lung Cancer CT Images with a Convolutional Neural Network
Authors:Annika Ries, Tina Dorosti, Johannes Thalhammer, Daniel Sasse, Andreas Sauter, Felix Meurer, Ashley Benne, Franz Pfeiffer, Daniela Pfeiffer
Abstract: Purpose: To improve the image quality of sparse-view computed tomography (CT) images with a U-Net for lung cancer detection and to determine the best trade-off between number of views, image quality, and diagnostic confidence. Methods: CT images from 41 subjects (34 with lung cancer, seven healthy) were retrospectively selected (01.2016-12.2018) and forward projected onto 2048-view sinograms. Six corresponding sparse-view CT data subsets at varying levels of undersampling were reconstructed from sinograms using filtered backprojection with 16, 32, 64, 128, 256, and 512 views, respectively. A dual-frame U-Net was trained and evaluated for each subsampling level on 8,658 images from 22 diseased subjects. A representative image per scan was selected from 19 subjects (12 diseased, seven healthy) for a single-blinded reader study. The selected slices, for all levels of subsampling, with and without post-processing by the U-Net model, were presented to three readers. Image quality and diagnostic confidence were ranked using pre-defined scales. Subjective nodule segmentation was evaluated utilizing sensitivity (Se) and Dice Similarity Coefficient (DSC) with 95% confidence intervals (CI). Results: The 64-projection sparse-view images resulted in Se = 0.89 and DSC = 0.81 [0.75,0.86] while their counterparts, post-processed with the U-Net, had improved metrics (Se = 0.94, DSC = 0.85 [0.82,0.87]). Fewer views lead to insufficient quality for diagnostic purposes. For increased views, no substantial discrepancies were noted between the sparse-view and post-processed images. Conclusion: Projection views can be reduced from 2048 to 64 while maintaining image quality and the confidence of the radiologists on a satisfactory level.
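Two quantities in this abstract lend themselves to a short sketch: the Dice Similarity Coefficient used to score segmentations, and the selection of a sparse subset of projection views. The masks are assumed to be flattened binary lists, and the uniform angular spacing in `sparse_view_indices` is an assumption; the abstract does not state how the views were subsampled.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for flattened binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    pred = [bool(p) for p in pred_mask]
    true = [bool(t) for t in true_mask]
    overlap = sum(1 for p, t in zip(pred, true) if p and t)
    total = sum(pred) + sum(true)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


def sparse_view_indices(n_full, n_sparse):
    """Indices of n_sparse evenly spaced views out of n_full (assumed uniform)."""
    if n_sparse <= 0 or n_full % n_sparse != 0:
        raise ValueError("n_sparse must be positive and divide n_full")
    step = n_full // n_sparse
    return list(range(0, n_full, step))
```

Under the uniform-spacing assumption, reducing 2048 views to 64 keeps every 32nd projection, which is the reduction the conclusion refers to.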
17.Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation
Authors:Jaime Corsetti, Davide Boscaini, Fabio Poiesi
Abstract: Recent works on 6D object pose estimation focus on learning keypoint correspondences between images and object models, and then determine the object pose through RANSAC-based algorithms or by directly regressing the pose with end-to-end optimisations. We argue that learning point-level discriminative features is overlooked in the literature. To this end, we revisit Fully Convolutional Geometric Features (FCGF) and tailor it for object 6D pose estimation to achieve state-of-the-art performance. FCGF employs sparse convolutions and learns point-level features using a fully-convolutional network by optimising a hardest contrastive loss. We can outperform recent competitors on popular benchmarks by adopting key modifications to the loss and to the input data representations, by carefully tuning the training strategies, and by employing data augmentations suitable for the underlying problem. We carry out a thorough ablation to study the contribution of each modification.
18.YOLOv8 for Defect Inspection of Hexagonal Directed Self-Assembly Patterns: A Data-Centric Approach
Authors:Enrique Dehaerne, Bappaditya Dey, Hossein Esfandiar, Lander Verstraete, Hyo Seon Suh, Sandip Halder, Stefan De Gendt
Abstract: Shrinking pattern dimensions leads to an increased variety of defect types in semiconductor devices. This has spurred innovation in patterning approaches such as Directed self-assembly (DSA) for which no traditional, automatic defect inspection software exists. Machine Learning-based SEM image analysis has become an increasingly popular research topic for defect inspection with supervised ML models often showing the best performance. However, little research has been done on obtaining a dataset with high-quality labels for these supervised models. In this work, we propose a method for obtaining coherent and complete labels for a dataset of hexagonal contact hole DSA patterns while requiring minimal quality control effort from a DSA expert. We show that YOLOv8, a state-of-the-art neural network, achieves defect detection precisions of more than 0.9 mAP on our final dataset which best reflects DSA expert defect labeling expectations. We discuss the strengths and limitations of our proposed labeling approach and suggest directions for future work in data-centric ML-based defect inspection.
19.Few-shot Image Classification based on Gradual Machine Learning
Authors:Na Chen, Xianming Kuang, Feiyu Liu, Kehao Wang, Qun Chen
Abstract: Few-shot image classification aims to accurately classify unlabeled images using only a few labeled samples. The state-of-the-art solutions are built by deep learning, which focuses on designing increasingly complex deep backbones. Unfortunately, the task remains very challenging due to the difficulty of transferring the knowledge learned in training classes to new ones. In this paper, we propose a novel approach based on the non-i.i.d paradigm of gradual machine learning (GML). It begins with only a few labeled observations, and then gradually labels target images in the increasing order of hardness by iterative factor inference in a factor graph. Specifically, our proposed solution extracts indicative feature representations by deep backbones, and then constructs both unary and binary factors based on the extracted features to facilitate gradual learning. The unary factors are constructed based on class center distance in an embedding space, while the binary factors are constructed based on k-nearest neighborhood. We have empirically validated the performance of the proposed approach on benchmark datasets by a comparative study. Our extensive experiments demonstrate that the proposed approach can improve the SOTA performance by 1-5% in terms of accuracy. More notably, it is more robust than the existing deep models in that its performance can consistently improve as the size of query set increases while the performance of deep models remains essentially flat or even becomes worse.
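The idea of labeling query images in increasing order of hardness can be illustrated with a much simpler stand-in for the factor-graph inference: nearest-class-center scoring with an easy-to-hard ordering. This is an assumption-laden sketch, not the GML algorithm. The distance margin used as a hardness proxy and the function names are hypothetical, and no k-nearest-neighbor binary factors are modeled.

```python
import math


def class_centers(support):
    """support maps each label to a list of feature vectors; returns mean vectors."""
    centers = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        centers[label] = [sum(v[d] for v in vectors) / len(vectors)
                          for d in range(dim)]
    return centers


def easy_to_hard_order(queries, centers):
    """Return (query_index, predicted_label, margin) tuples, easiest first.

    The margin is the gap between the second-nearest and nearest class
    center; a large margin is treated as an easy query (needs >= 2 classes).
    """
    ranked = []
    for idx, q in enumerate(queries):
        dists = sorted((math.dist(q, c), label) for label, c in centers.items())
        margin = dists[1][0] - dists[0][0]
        ranked.append((idx, dists[0][1], margin))
    ranked.sort(key=lambda item: -item[2])
    return ranked
```

A gradual scheme would commit to the labels at the front of this list first and use them as extra evidence for the harder queries that follow.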
20.Panoptic Scene Graph Generation with Semantics-prototype Learning
Authors:Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann
Abstract: Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicate) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e. different predicates for the same object pairs. Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to measure the invariance of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potential biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets.
21.Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding
Authors:Jiachen Kang, Wenjing Jia, Xiangjian He, Kin Man Lam
Abstract: Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with a novel pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the art in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already approached the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.
22.OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic Segmentation of Road Scenes
Authors:Fei Teng, Jiaming Zhang, Kunyu Peng, Kailun Yang, Yaonan Wang, Rainer Stiefelhagen
Abstract: Light field cameras can provide rich angular and spatial information to enhance image semantic segmentation for scene understanding in the field of autonomous driving. However, the extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resources of intelligent vehicles. Besides, inappropriate compression leads to information corruption and data loss. To excavate representative information, we propose an Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and discovers the angular information from sub-aperture images to generate a semantically-consistent result. To avoid feature loss during network propagation and simultaneously streamline the redundant information from the light field camera, we present a simple yet very effective Sub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular features without any additional memory cost. Furthermore, to address the mismatched spatial information across viewpoints, we present the Center Angular Rectification Module (CARM), which realizes feature resorting and prevents feature occlusion caused by asymmetric information. Our proposed OAFuser achieves state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain of +4.53%. The source code of OAFuser will be made publicly available at https://github.com/FeiBryantkit/OAFuser.
23.Integrated Digital Reconstruction of Welded Components: Supporting Improved Fatigue Life Prediction
Authors:Anders Faarbæk Mikkelstrup, Morten Kristiansen
Abstract: In the design of offshore jacket foundations, fatigue life is crucial. Post-weld treatment has been proposed to enhance the fatigue performance of welded joints, with high-frequency mechanical impact (HFMI) treatment in particular shown to improve fatigue performance significantly. Automated HFMI treatment has improved quality assurance and can lead to cost-effective design when combined with accurate fatigue life prediction. However, the finite element method (FEM), commonly used for predicting fatigue life in complex or multi-axial joints, relies on a basic CAD depiction of the weld, failing to consider the actual weld geometry and defects. Including the actual weld geometry in the FE model improves fatigue life prediction and possible crack location prediction but requires a digital reconstruction of the weld. Current digital reconstruction methods are time-consuming or require specialised scanning equipment and potential component relocation. The proposed framework instead uses an industrial manipulator combined with a line scanner to integrate digital reconstruction as part of the automated HFMI treatment setup. This approach applies standard image processing, simple filtering techniques, and non-linear optimisation for aligning and merging overlapping scans. A screened Poisson surface reconstruction finalises the 3D model to create a meshed surface. The outcome is a generic, cost-effective, flexible, and rapid method for digital reconstruction of welded parts, aiding in component design, overall quality assurance, and documentation of the HFMI treatment.
24.CLIP Brings Better Features to Visual Aesthetics Learners
Authors:Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li
Abstract: The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to its subjective and expensive labeling procedure. In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm is proposed, namely CSKD. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetic learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we notice an alleviation of the feature collapse issue, which in turn showcases the necessity of feature alignment instead of training directly based on the CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.
25.Scaling Data Generation in Vision-and-Language Navigation
Authors:Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao
Abstract: Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which uses 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-standing generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also enables different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.
26.Multi-layer Aggregation as a key to feature-based OOD detection
Authors:Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat
Abstract: Deep learning models are easily disturbed by variations in the input images that were not observed during the training stage, resulting in unpredictable outputs. Detecting such Out-of-Distribution (OOD) images is particularly crucial in the context of medical image analysis, where the range of possible abnormalities is extremely wide. Recently, a new category of methods has emerged, based on the analysis of the intermediate features of a trained model. These methods can be divided into two groups: single-layer methods that consider the feature map obtained at a fixed, carefully chosen layer, and multi-layer methods that consider the ensemble of the feature maps generated by the model. While promising, a proper comparison of these algorithms is still lacking. In this work, we compared various feature-based OOD detection methods on a broad spectrum of OOD data (20 types), representing approximately 7800 3D MRIs. Our experiments shed light on two phenomena. First, multi-layer methods consistently outperform single-layer approaches, which tend to have inconsistent behaviour depending on the type of anomaly. Second, the OOD detection performance highly depends on the architecture of the underlying neural network.
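The difference between single-layer and multi-layer scoring can be illustrated with a minimal sketch. It is not the scoring function of any method compared in the paper: the abstract does not name one, so the per-layer score (an RMS z-score, i.e. a diagonal-covariance distance to the training features), the averaging rule, and the simulated three-layer features are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, pstdev

def fit_layer_stats(train_features):
    """Per-dimension (mean, std) of one layer's features over the training set."""
    return [(mean(dim), pstdev(dim) or 1e-8) for dim in zip(*train_features)]

def layer_score(feature, stats):
    """RMS z-score of a feature vector: a diagonal-covariance distance (assumed score)."""
    return math.sqrt(mean(((x - m) / s) ** 2 for x, (m, s) in zip(feature, stats)))

def multi_layer_score(features_per_layer, stats_per_layer):
    """Aggregate the per-layer scores by averaging (an assumed aggregation rule)."""
    return mean(layer_score(f, s) for f, s in zip(features_per_layer, stats_per_layer))

rng = random.Random(0)

def fake_features(shifted_layer=None):
    """Hypothetical stand-in for a network: 3 layers x 8 feature dimensions."""
    return [[rng.gauss(5.0 if layer == shifted_layer else 0.0, 1.0) for _ in range(8)]
            for layer in range(3)]

# Fit per-layer statistics on simulated in-distribution training features.
train = [fake_features() for _ in range(200)]
stats = [fit_layer_stats([sample[layer] for sample in train]) for layer in range(3)]

in_dist = fake_features()
ood = fake_features(shifted_layer=2)  # anomaly that only shows up in the last layer

single = layer_score(ood[0], stats[0])         # single-layer detector fixed on layer 0
multi = multi_layer_score(ood, stats)          # multi-layer detector
baseline = multi_layer_score(in_dist, stats)   # multi-layer score of a normal sample
```

In this toy setup the detector fixed on layer 0 sees nothing unusual, while the aggregated score rises above the in-distribution baseline, which is the kind of anomaly-dependent behaviour the abstract attributes to single-layer methods.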
27.TrackAgent: 6D Object Tracking via Reinforcement Learning
Authors:Konstantin Röhrl, Dominik Bauer, Timothy Patten, Markus Vincze
Abstract: Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptually complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement using a reinforcement learning (RL) agent that jointly solves for both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers.
28.PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding
Authors:Davide Boscaini, Fabio Poiesi
Abstract: The recent trend in deep learning methods for 3D point cloud understanding is to propose increasingly sophisticated architectures either to better capture 3D geometries or by introducing possibly undesired inductive biases. Moreover, prior works introducing novel architectures compared their performance on the same domain, devoting less attention to their generalization to other domains. We argue that the ability of a model to transfer the learnt knowledge to different domains is an important feature that should be evaluated to exhaustively assess the quality of a deep network architecture. In this work we propose PatchMixer, a simple yet effective architecture that extends the ideas behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our approach are the processing of local patches instead of the whole shape to promote robustness to partial point clouds, and the aggregation of patch-wise features using an MLP as a simpler alternative to the graph convolutions or the attention mechanisms that are used in prior works. We evaluated our method on the shape classification and part segmentation tasks, achieving superior generalization performance compared to a selection of the most relevant deep architectures.
29.SimDETR: Simplifying self-supervised pretraining for DETR
Authors:Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos
Abstract: DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to help alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector's performance. However, existing methods have their own limitations, like keeping the detector's backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works in both the full- and low-data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.
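Ingredient (ii), pseudo-labels produced via clustering, can be sketched as follows. The abstract does not say which clustering algorithm or which features are used, so plain k-means, the farthest-point initialisation, and the toy 2-D "region features" below are assumptions chosen only to show the idea of turning unlabeled feature vectors into class-like ids.

```python
import math

def kmeans_pseudo_labels(features, k, iters=20):
    """Assign each feature vector a pseudo-class id with plain k-means."""
    # Farthest-point initialisation keeps the example deterministic.
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(features, key=lambda f: min(math.dist(f, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid gives the pseudo-label.
        labels = [min(range(k), key=lambda j: math.dist(f, centroids[j])) for f in features]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical features of six object proposals forming two well-separated groups.
region_features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans_pseudo_labels(region_features, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The resulting ids could then serve as classification targets for a discriminative loss, in place of the human annotations that unsupervised pretraining does not have.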
30.MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
Authors:Ruopeng Gao, Limin Wang
Abstract: As a video task, Multi-Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.
31.Semi-Supervised Object Detection in the Open World
Authors:Garvita Allabadi, Ana Lucic, Peter Pao-Huang, Yu-Xiong Wang, Vikram Adve
Abstract: Existing approaches for semi-supervised object detection assume a fixed set of classes present in training and unlabeled datasets, i.e., in-distribution (ID) data. The performance of these techniques significantly degrades when they are deployed in the open world, because the unlabeled and test data may contain objects that were not seen during training, i.e., out-of-distribution (OOD) data. The two key questions that we explore in this paper are: can we detect these OOD samples and if so, can we learn from them? With these considerations in mind, we propose the Open World Semi-supervised Detection framework (OWSSD) that effectively detects OOD data along with a semi-supervised learning pipeline that learns from both ID and OOD data. We introduce an ensemble-based OOD detector consisting of lightweight auto-encoder networks trained only on ID data. Through extensive evaluation, we demonstrate that our method performs competitively against state-of-the-art OOD detection algorithms and also significantly boosts the semi-supervised learning performance in open-world scenarios.
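The idea of an ensemble of auto-encoders trained only on ID data can be sketched with a deliberately tiny model. Everything concrete here is an assumption rather than the OWSSD design: each "network" is a one-unit, tied-weight linear auto-encoder x -> (w.x) w, the ensemble members differ only by bootstrap resampling and initialisation, the OOD score is the mean reconstruction error, and the 2-D data are synthetic.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruction_error(w, x):
    """Squared error of the tied-weight linear auto-encoder x -> (w.x) w."""
    z = dot(w, x)
    return sum((xi - z * wi) ** 2 for xi, wi in zip(x, w))

def train_autoencoder(data, rng, steps=300, lr=0.01):
    """Fit w by batch gradient descent on the mean reconstruction error."""
    w = [rng.uniform(-0.5, 0.5) for _ in data[0]]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            z = dot(w, x)                                   # 1-D code
            r = [xi - z * wi for xi, wi in zip(x, w)]       # reconstruction residual
            rw = dot(r, w)
            for j, xj in enumerate(x):
                grad[j] -= 2.0 * (z * r[j] + rw * xj) / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def ensemble_score(models, x):
    """OOD score: reconstruction error averaged over the ensemble (assumed rule)."""
    return sum(reconstruction_error(w, x) for w in models) / len(models)

rng = random.Random(0)
# Hypothetical ID features: 2-D points close to the line y = 2x.
id_data = []
for _ in range(100):
    t = rng.gauss(0.0, 1.0)
    id_data.append((t + rng.gauss(0.0, 0.05), 2 * t + rng.gauss(0.0, 0.05)))

# Each ensemble member sees its own bootstrap resample of the ID data.
models = [train_autoencoder([rng.choice(id_data) for _ in id_data], rng) for _ in range(5)]

id_score = ensemble_score(models, (1.0, 2.0))    # lies on the ID line
ood_score = ensemble_score(models, (2.0, -1.0))  # orthogonal to the ID line
```

Because the models only learn to reconstruct the ID structure, the point off the line is reconstructed poorly and receives the higher score; a threshold on that score (not specified in the abstract) would then separate ID from OOD samples.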