Computer Vision and Pattern Recognition (cs.CV)
Fri, 01 Sep 2023
1.Interpretable Medical Imagery Diagnosis with Self-Attentive Transformers: A Review of Explainable AI for Health Care
Authors:Tin Lai
Abstract: Recent advancements in artificial intelligence (AI) have facilitated its widespread adoption in primary medical services, addressing the demand-supply imbalance in healthcare. Vision Transformers (ViT) have emerged as state-of-the-art computer vision models, benefiting from self-attention modules. However, compared to traditional machine-learning approaches, deep-learning models are complex and often treated as a "black box", which creates uncertainty about how they operate. Explainable Artificial Intelligence (XAI) refers to methods that explain and interpret machine-learning models' inner workings and how they arrive at decisions, which is especially important in the medical domain for guiding the healthcare decision-making process. This review summarises recent ViT advancements and interpretative approaches to understanding the decision-making process of ViT, enabling transparency in medical diagnosis applications.
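For context on the self-attention modules mentioned above, here is a minimal single-head scaled dot-product attention sketch of the kind used inside ViT blocks. It is purely illustrative (random weights, hypothetical token/feature sizes) and is not tied to the review or to any specific ViT implementation; the attention map it returns is the quantity that many XAI visualisation methods inspect.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head scaled dot-product self-attention.

    x: (num_tokens, dim) patch embeddings; w_q/w_k/w_v: (dim, dim) projections.
    Returns the attended tokens and the (tokens, tokens) attention map."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over keys
    return attn @ v, attn

rng = np.random.default_rng(0)
tokens, dim = 197, 64                                  # e.g. 196 patches + [CLS]
x = rng.normal(size=(tokens, dim))
out, attn_map = self_attention(x, *(rng.normal(size=(dim, dim)) for _ in range(3)))
print(out.shape, attn_map.shape)                       # (197, 64) (197, 197)
```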
2.SparseSat-NeRF: Dense Depth Supervised Neural Radiance Fields for Sparse Satellite Images
Authors:Lulin Zhang, Ewelina Rupnik
Abstract: Digital surface model generation using traditional multi-view stereo matching (MVS) performs poorly over non-Lambertian surfaces, with asynchronous acquisitions, or at discontinuities. Neural radiance fields (NeRF) offer a new paradigm for reconstructing surface geometries using a continuous volumetric representation. NeRF is self-supervised, does not require ground-truth geometry for training, and provides an elegant way to include physical parameters about the scene in its representation, thus potentially remedying the challenging scenarios where MVS fails. However, NeRF and its variants require many views to produce convincing scene geometry, which is rare in Earth-observation satellite imaging. In this paper we present SparseSat-NeRF (SpS-NeRF), an extension of Sat-NeRF adapted to sparse satellite views. SpS-NeRF employs dense depth supervision guided by a cross-correlation similarity metric provided by traditional semi-global MVS matching. We demonstrate the effectiveness of our approach on stereo and tri-stereo Pleiades 1B/WorldView-3 images, and compare against NeRF and Sat-NeRF. The code is available at https://github.com/LulinZhang/SpS-NeRF
3.Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution
Authors:Charles Laroche, Andrés Almansa, Eva Coupete
Abstract: Using diffusion models to solve inverse problems is a growing field of research. Current methods assume the degradation to be known and provide impressive results in terms of restoration quality and diversity. In this work, we leverage the efficiency of those models to jointly estimate the restored image and the unknown parameters of the degradation model. In particular, we design an algorithm based on the well-known Expectation-Maximization (EM) estimation method and diffusion models. Our method alternates between approximating the expected log-likelihood of the inverse problem using samples drawn from a diffusion model and a maximization step to estimate the unknown model parameters. For the maximization step, we also introduce a novel blur-kernel regularization based on a Plug & Play denoiser. Because diffusion models are slow to sample from, we also provide a fast version of our algorithm. Extensive experiments on blind image deblurring demonstrate the effectiveness of our method compared to other state-of-the-art approaches.
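To make the alternation concrete, below is a toy 1-D sketch of an EM loop of this shape. The E-step that would normally draw posterior samples from a diffusion model is replaced by a regularised Wiener deconvolution plus noise, and the Plug & Play kernel regulariser is replaced by a crude projection, so the snippet only shows the structure of the alternation, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def blur(x, k):
    """Circular 1-D convolution: toy degradation model y = k * x + noise."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n=x.size)))

def e_step(y, k, n=16, eps=1e-2):
    """Stand-in for the diffusion E-step: approximate samples x ~ p(x | y, k)."""
    K = np.fft.fft(k, n=y.size)
    x_hat = np.real(np.fft.ifft(np.conj(K) * np.fft.fft(y) / (np.abs(K) ** 2 + eps)))
    return [x_hat + 0.05 * rng.normal(size=y.size) for _ in range(n)]

def m_step(y, samples, k_size, eps=1e-3):
    """Maximise the expected log-likelihood over the kernel (closed form in Fourier)."""
    num = sum(np.conj(np.fft.fft(x)) * np.fft.fft(y) for x in samples)
    den = sum(np.abs(np.fft.fft(x)) ** 2 for x in samples) + eps
    k = np.real(np.fft.ifft(num / den))[:k_size]
    k = np.clip(k, 0, None)                       # crude projection in place of the
    return k / (k.sum() + 1e-12)                  # Plug & Play kernel regulariser

# Synthetic problem: recover a blur kernel from one blurry signal.
x_true = rng.normal(size=256)
k_true = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
y = blur(x_true, k_true) + 0.01 * rng.normal(size=256)

k_est = np.ones(5) / 5                            # flat initial kernel
for _ in range(10):                               # EM alternation
    samples = e_step(y, k_est)
    k_est = m_step(y, samples, k_size=5)
print(np.round(k_est, 3))
```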
4.Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
Authors:Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, Qijun Chen
Abstract: As the most essential property of a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from weak spatial and temporal alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess strong motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across the UCF101, HMDB51, and Diving48 datasets. Code is available at https://github.com/ZMHH-H/FIMA.
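As a rough illustration of the ingredients named above, the sketch below computes frame differences as a cheap motion signal and an instance-level InfoNCE loss between hypothetical RGB and motion features; the pixel-level supervision, motion decoder, and foreground sampling of FIMA are not reproduced here.

```python
import numpy as np

def frame_difference(clip):
    """Cheap motion proxy used by frame-difference-based video contrastive methods.

    clip: (T, H, W) grayscale frames; returns (T-1, H, W) temporal differences."""
    return np.abs(np.diff(clip, axis=0))

def info_nce(q, k, temperature=0.1):
    """Instance-level InfoNCE between RGB features q and motion features k, both
    (N, D); this is the coarse alignment that fine-grained methods refine."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature                    # (N, N), diagonal = positives
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
clip = rng.random((8, 32, 32))
print(frame_difference(clip).shape)                   # (7, 32, 32)
print(info_nce(rng.normal(size=(4, 128)), rng.normal(size=(4, 128))))
```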
5.Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture
Authors:Shaohua Pan, Qi Ma, Xinyu Yi, Weifeng Hu, Xiong Wang, Xingkang Zhou, Jijunnan Li, Feng Xu
Abstract: Either RGB images or inertial signals have been used for the task of motion capture (mocap), but combining them is a new and interesting topic. We believe the combination is complementary and able to solve the inherent difficulties of using a single input modality: occlusions, extreme lighting/texture, and out-of-view subjects for visual mocap, and global drift for inertial mocap. To this end, we propose a method that fuses monocular images and sparse IMUs for real-time human motion capture. Our method contains a dual coordinate strategy to fully exploit the IMU signals with different goals in motion capture. Specifically, besides one branch transforming the IMU signals to the camera coordinate system to combine with the image information, there is another branch that learns from the IMU signals in the body root coordinate system to better estimate body poses. Furthermore, a hidden state feedback mechanism is proposed for both branches to compensate for their own drawbacks in extreme input cases. Thus our method can easily switch between the two kinds of signals or combine them in different cases to achieve robust mocap. Quantitative and qualitative results demonstrate that, by delicately designing the fusion method, our technique significantly outperforms state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation. Our code is available for research at https://shaohua-pan.github.io/robustcap-page/.
6.ARFA: An Asymmetric Receptive Field Autoencoder Model for Spatiotemporal Prediction
Authors:Wenxuan Zhang, Xuechao Zou, Li Wu, Jianqiang Huang, Xiaoying Wang
Abstract: Spatiotemporal prediction aims to generate future sequences from patterns learned in historical contexts. It holds significant importance in numerous domains, including traffic flow prediction and weather forecasting. However, existing methods face challenges in handling spatiotemporal correlations, as they commonly adopt encoder and decoder architectures with identical receptive fields, which adversely affects prediction accuracy. This paper proposes an Asymmetric Receptive Field Autoencoder (ARFA) model to address this issue. Specifically, we design receptive-field modules of different sizes tailored to the distinct functionalities of the encoder and decoder. In the encoder, we introduce a large-kernel module for global spatiotemporal feature extraction. In the decoder, we develop a small-kernel module for local spatiotemporal information reconstruction. To address the scarcity of meteorological prediction data, we construct RainBench, a large-scale radar echo dataset specific to the precipitation characteristics of inland regions in China, for precipitation prediction. Experimental results demonstrate that ARFA achieves consistent state-of-the-art performance on two mainstream spatiotemporal prediction datasets and our RainBench dataset, affirming the effectiveness of our approach. This work not only explores a novel method from the perspective of receptive fields but also provides data support for precipitation prediction, thereby advancing future research in spatiotemporal prediction.
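A minimal PyTorch sketch of the asymmetric-receptive-field idea follows. The channel counts, kernel sizes, and single-frame input are hypothetical choices for illustration, not the paper's architecture: the point is simply that the encoder uses large kernels (global context) while the decoder uses small kernels (local reconstruction).

```python
import torch
import torch.nn as nn

class ToyARFA(nn.Module):
    """Toy asymmetric-receptive-field autoencoder: large-kernel encoder,
    small-kernel decoder (illustrative only)."""
    def __init__(self, channels=16):
        super().__init__()
        self.encoder = nn.Sequential(                 # large receptive field
            nn.Conv2d(1, channels, kernel_size=7, padding=3), nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=7, padding=3), nn.GELU(),
        )
        self.decoder = nn.Sequential(                 # small receptive field
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):                             # x: (B, 1, H, W) radar frame
        return self.decoder(self.encoder(x))

frames = torch.randn(2, 1, 64, 64)
print(ToyARFA()(frames).shape)                        # torch.Size([2, 1, 64, 64])
```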
7.Human trajectory prediction using LSTM with Attention mechanism
Authors:Amin Manafi Soltan Ahmadi, Samaneh Hoseini Semnani
Abstract: In this paper, we propose a human trajectory prediction model that combines a Long Short-Term Memory (LSTM) network with an attention mechanism. To do so, we use attention scores to determine which parts of the input data the model should focus on when making predictions. Attention scores are calculated for each input feature, with a higher score indicating greater significance of that feature in predicting the output. Initially, these scores are determined for the target human's position and velocity and the neighboring individuals' positions and velocities. By using attention scores, our model can prioritize the most relevant information in the input data and make more accurate predictions. We extract attention scores from our attention mechanism and integrate them into the trajectory prediction module to predict human future trajectories. To achieve this, we introduce a new neural layer that processes the attention scores after extracting them and concatenates them with positional information. We evaluate our approach on the publicly available ETH and UCY datasets and measure its performance using the final displacement error (FDE) and average displacement error (ADE) metrics. We show that our modified algorithm performs better than the Social LSTM in predicting the future trajectory of pedestrians in crowded spaces. Specifically, our model achieves an improvement of 6.2% in ADE and 6.3% in FDE compared to the Social LSTM results in the literature.
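As an illustrative sketch (not the authors' exact layer), the PyTorch module below scores each neighbour with a soft attention weight, pools a context vector, concatenates it with the target pedestrian's position/velocity, and feeds the result to an LSTM; all dimensions and the prediction head are hypothetical.

```python
import torch
import torch.nn as nn

class AttnTrajectoryLSTM(nn.Module):
    """Attention over neighbour states + LSTM over the target's trajectory."""
    def __init__(self, feat_dim=4, hidden=64):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)            # one attention score per neighbour
        self.lstm = nn.LSTM(feat_dim * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)               # predicted (x, y) offset

    def forward(self, target, neighbours):
        # target: (B, T, 4) position+velocity; neighbours: (B, T, N, 4)
        attn = torch.softmax(self.score(neighbours), dim=2)   # (B, T, N, 1)
        context = (attn * neighbours).sum(dim=2)              # attention-pooled (B, T, 4)
        h, _ = self.lstm(torch.cat([target, context], dim=-1))
        return self.head(h[:, -1])                             # next-step displacement

model = AttnTrajectoryLSTM()
print(model(torch.randn(8, 12, 4), torch.randn(8, 12, 5, 4)).shape)  # torch.Size([8, 2])
```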
8.Robust Point Cloud Processing through Positional Embedding
Authors:Jianqiao Zheng, Xueqian Li, Sameera Ramasinghe, Simon Lucey
Abstract: End-to-end trained per-point embeddings are an essential ingredient of any state-of-the-art 3D point cloud processing such as detection or alignment. Methods like PointNet, or the more recent point cloud transformer -- and its variants -- all employ learned per-point embeddings. Despite impressive performance, such approaches are sensitive to out-of-distribution (OOD) noise and outliers. In this paper, we explore the role of an analytical per-point embedding based on the criterion of bandwidth. The concept of bandwidth enables us to draw connections with an alternate per-point embedding -- positional embedding, particularly random Fourier features. We present compelling robust results across downstream tasks such as point cloud classification and registration with several categories of OOD noise.
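A minimal sketch of one such analytical per-point embedding, random Fourier features over xyz coordinates, is shown below. The feature count and bandwidth are hypothetical values chosen for illustration, not the paper's exact formulation; the frequency scale plays the role of the bandwidth criterion discussed in the abstract (small bandwidth yields smoother, more OOD-robust embeddings).

```python
import numpy as np

def random_fourier_embedding(points, num_features=256, bandwidth=1.0, seed=0):
    """Analytical (non-learned) per-point embedding via random Fourier features.

    points: (N, 3) xyz coordinates -> (N, num_features) sin/cos embedding."""
    rng = np.random.default_rng(seed)
    freqs = rng.normal(scale=bandwidth, size=(points.shape[1], num_features // 2))
    proj = 2 * np.pi * points @ freqs                       # (N, num_features/2)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

cloud = np.random.default_rng(1).uniform(-1, 1, size=(1024, 3))
emb = random_fourier_embedding(cloud)
print(emb.shape)                                            # (1024, 256)
```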
9.MuraNet: Multi-task Floor Plan Recognition with Relation Attention
Authors:Lingxiao Huang, Jung-Hsuan Wu, Chiching Wei, Wilson Li
Abstract: The recognition of information in floor plan data requires the use of detection and segmentation models. However, relying on several single-task models can result in ineffective utilization of relevant information when multiple tasks are present simultaneously. To address this challenge, we introduce MuraNet, an attention-based multi-task model for segmentation and detection tasks in floor plan data. In MuraNet, we adopt a unified encoder called MURA as the backbone with two separate branches: an enhanced segmentation decoder branch and a decoupled detection head branch based on YOLOX, for segmentation and detection tasks respectively. The architecture of MuraNet is designed to leverage the fact that walls, doors, and windows usually constitute the primary structure of a floor plan's architecture. By jointly training the model on both detection and segmentation tasks, we believe MuraNet can effectively extract and utilize relevant features for both tasks. Our experiments on the CubiCasa5k public dataset show that MuraNet improves convergence speed during training compared to single-task models like U-Net and YOLOv3. Moreover, we observe improvements in the average AP and IoU in detection and segmentation tasks, respectively. Our ablation experiments demonstrate that the attention-based unified backbone of MuraNet achieves better feature extraction in floor plan recognition tasks, and that the use of decoupled multi-head branches for different tasks further improves model performance. We believe that our proposed MuraNet model can address the disadvantages of single-task models and improve the accuracy and efficiency of floor plan data recognition.
10.Dense Voxel 3D Reconstruction Using a Monocular Event Camera
Authors:Haodong Chen, Vera Chung, Li Tan, Xiaoming Chen
Abstract: Event cameras are sensors inspired by biological systems that specialize in capturing changes in brightness. These emerging cameras offer many advantages over conventional frame-based cameras, including high dynamic range, high frame rates, and extremely low power consumption. Due to these advantages, event cameras have increasingly been adopted in various fields, such as frame interpolation, semantic segmentation, odometry, and SLAM. However, their application in 3D reconstruction for VR applications is underexplored. Previous methods in this field mainly focused on 3D reconstruction through depth map estimation. Methods that produce dense 3D reconstructions generally require multiple cameras, while methods that utilize a single event camera can only produce a semi-dense result. Other single-camera methods that can produce dense 3D reconstructions rely on creating a pipeline that either incorporates the aforementioned methods or other existing Structure from Motion (SfM) or Multi-view Stereo (MVS) methods. In this paper, we propose a novel approach for solving dense 3D reconstruction using only a single event camera. To the best of our knowledge, our work is the first attempt in this regard. Our preliminary results demonstrate that the proposed method can produce visually distinguishable dense 3D reconstructions directly, without requiring pipelines like those used by existing methods. Additionally, we have created a synthetic dataset with 39,739 object scans using an event camera simulator. This dataset will help accelerate other relevant research in this field.
11.VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Authors:Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
Abstract: In this paper, we present VideoGen, a text-to-video generation approach that can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map the latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach are: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained on unlabeled video data, thus benefiting from high-quality, easily available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation.
12.Fine-grained Recognition with Learnable Semantic Data Augmentation
Authors:Yifan Pu, Yizeng Han, Yulin Wang, Junlan Feng, Chao Deng, Gao Huang
Abstract: Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their random editing-region behavior is prone to destroying the discriminative visual cues residing in subtle regions. In this paper, we propose diversifying the training data at the feature level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC-Aircraft, NABirds) demonstrate that our method significantly improves the generalization performance of several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
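The core feature-level augmentation step can be pictured as below: each deep feature is shifted along a direction sampled from a per-sample covariance matrix. The identity covariances in the example are hypothetical stand-ins rather than outputs of the paper's covariance prediction network, and the scaling factor is illustrative.

```python
import numpy as np

def semantic_augment(features, covariances, strength=0.5, seed=0):
    """Translate deep features along random semantic directions (sketch).

    features: (N, D) deep features; covariances: (N, D, D) per-sample covariances,
    e.g. as would be predicted by a covariance network."""
    rng = np.random.default_rng(seed)
    augmented = np.empty_like(features)
    for i, (f, cov) in enumerate(zip(features, covariances)):
        direction = rng.multivariate_normal(np.zeros(f.size), cov)
        augmented[i] = f + strength * direction            # feature-level, not pixel-level
    return augmented

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
covs = np.stack([np.eye(8) * 0.1 for _ in range(4)])       # stand-in covariances
print(semantic_augment(feats, covs).shape)                  # (4, 8)
```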
13.Selective Scene Text Removal
Authors:Hayato Mitani, Akisato Kimura, Seiichi Uchida
Abstract: Scene text removal (STR) is the image transformation task to remove text regions in scene images. The conventional STR methods remove all scene text. This means that the existing methods cannot select text to be removed. In this paper, we propose a novel task setting named selective scene text removal (SSTR) that removes only target words specified by the user. Although SSTR is a more complex task than STR, the proposed multi-module structure enables efficient training for SSTR. Experimental results show that the proposed method can remove target words as expected.
14.Improving the matching of deformable objects by learning to detect keypoints
Authors:Felipe Cadar, Welerson, Vaishnavi Kanagasabapathi, Guilherme Potje, Renato Martins, Erickson R. Nascimento
Abstract: We propose a novel learned keypoint detection method to increase the number of correct matches for the task of non-rigid image correspondence. By leveraging true correspondences acquired by matching annotated image pairs with a specified descriptor extractor, we train an end-to-end convolutional neural network (CNN) to find keypoint locations that are more appropriate to the considered descriptor. For that, we apply geometric and photometric warpings to images to generate a supervisory signal, allowing the optimization of the detector. Experiments demonstrate that our method enhances the Mean Matching Accuracy of numerous descriptors when used in conjunction with our detection method, while outperforming the state-of-the-art keypoint detectors on real images of non-rigid objects by 20 p.p. We also apply our method on the complex real-world task of object retrieval where our detector performs on par with the finest keypoint detectors currently available for this task. The source code and trained models are publicly available at https://github.com/verlab/LearningToDetect_PRL_2023
15.Unsupervised bias discovery in medical image segmentation
Authors:Nicolás Gaggion, Rodrigo Echeveste, Lucas Mansilla, Diego H. Milone, Enzo Ferrante
Abstract: It has recently been shown that deep learning models for anatomical segmentation in medical images can exhibit biases against certain sub-populations defined in terms of protected attributes like sex or ethnicity. In this context, auditing fairness of deep segmentation models becomes crucial. However, such an audit process generally requires access to ground-truth segmentation masks for the target population, which may not always be available, especially when going from development to deployment. Here we propose a new method to anticipate model biases in biomedical image segmentation in the absence of ground-truth annotations. Our unsupervised bias discovery method leverages the reverse classification accuracy framework to estimate segmentation quality. Through numerical experiments in synthetic and realistic scenarios we show how our method is able to successfully anticipate fairness issues in the absence of ground-truth labels, constituting a novel and valuable tool in this field.
16.dacl10k: Benchmark for Semantic Bridge Damage Segmentation
Authors:Johannes Flotzinger, Philipp J. Rösch, Thomas Braml
Abstract: Reliably identifying reinforced concrete defects (RCDs) plays a crucial role in assessing the structural integrity, traffic safety, and long-term durability of concrete bridges, which represent the most common bridge type worldwide. Nevertheless, available datasets for the recognition of RCDs are small in terms of size and class variety, which calls into question their usability in real-world scenarios and their role as a benchmark. Our contribution to this problem is "dacl10k", an exceptionally diverse RCD dataset for multi-label semantic segmentation comprising 9,920 images derived from real-world bridge inspections. dacl10k distinguishes 12 damage classes as well as 6 bridge components that play a key role in building assessment and in recommending actions, such as restoration works, traffic load limitations or bridge closures. In addition, we examine baseline models for dacl10k, which are subsequently evaluated. The best model achieves a mean intersection-over-union of 0.42 on the test set. dacl10k, along with our baselines, will be openly accessible to researchers and practitioners, representing the currently largest dataset in terms of number of images and class diversity for semantic segmentation in the bridge inspection domain.
17.A Theoretical and Practical Framework for Evaluating Uncertainty Calibration in Object Detection
Authors:Pedro Conde, Rui L. Lopes, Cristiano Premebida
Abstract: The proliferation of Deep Neural Networks has resulted in machine learning systems becoming increasingly present in various real-world applications. Consequently, there is a growing demand for highly reliable models in these domains, making the problem of uncertainty calibration pivotal when considering the future of deep learning. This is especially true for object detection systems, which are commonly deployed in safety-critical applications such as autonomous driving and robotics. For this reason, this work presents a novel theoretical and practical framework to evaluate object detection systems in the context of uncertainty calibration. The robustness of the proposed uncertainty calibration metrics is shown through a series of representative experiments. Code for the proposed uncertainty calibration metrics is available at: https://github.com/pedrormconde/Uncertainty_Calibration_Object_Detection.
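For readers unfamiliar with calibration, the sketch below computes the standard Expected Calibration Error over detection confidences. This is a common baseline notion of calibration, not the detection-specific metrics proposed in the paper, and the matching rule (e.g. IoU >= 0.5 against ground truth) is assumed purely for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: binned gap between mean confidence and empirical accuracy.

    confidences: (N,) predicted scores in [0, 1];
    correct: (N,) 1 if the detection matched a ground-truth box, else 0."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap                 # weight by fraction of samples in bin
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(size=1000)
hits = (rng.uniform(size=1000) < conf ** 2).astype(float)   # simulated over-confident detector
print(round(expected_calibration_error(conf, hits), 3))
```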
18.An Improved Encoder-Decoder Framework for Food Energy Estimation
Authors:Jack Ma, Jiangpeng He, Fengqing Zhu
Abstract: Dietary assessment is essential to maintaining a healthy lifestyle. Automatic image-based dietary assessment is a growing field of research due to the increasing prevalence of image-capturing devices (e.g. mobile phones). In this work, we estimate food energy from a single monocular image, a difficult task due to the limited amount of hard-to-extract energy information present in an image. To do so, we employ an improved encoder-decoder framework for energy estimation; the encoder transforms the image into a representation embedded with food energy information in an easier-to-extract format, from which the decoder then extracts the energy information. To implement our method, we compile a high-quality food image dataset verified by registered dietitians containing eating scene images, food-item segmentation masks, and ground-truth calorie values. Our method improves upon previous caloric estimation methods by over 10% and 30 kCal in terms of MAPE and MAE respectively.
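For reference, the two reported error metrics can be computed as follows; the calorie values in the example are hypothetical and not taken from the paper's dataset.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (%)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error, here in kCal."""
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical ground-truth and predicted calorie values for a few eating scenes.
truth = np.array([520.0, 310.0, 680.0, 450.0])
pred = np.array([495.0, 330.0, 640.0, 470.0])
print(f"MAPE: {mape(truth, pred):.1f}%  MAE: {mae(truth, pred):.1f} kCal")
```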
19.Asymmetric double-winged multi-view clustering network for exploring Diverse and Consistent Information
Authors:Qun Zheng, Xihong Yang, Siwei Wang, Xinru An, Qi Liu
Abstract: In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) is becoming an active research topic; it aims to mine the potential relationships between different views. Most existing DCMVC algorithms focus on exploring the consistency information in deep semantic features, while ignoring the diverse information in shallow features. To fill this gap, in this paper we propose CodingNet, a novel multi-view clustering network that explores diverse and consistent information simultaneously. Specifically, instead of utilizing a conventional auto-encoder, we design an asymmetric network structure to extract shallow and deep features separately. Then, by aligning the similarity matrix of the shallow features with the zero matrix, we ensure diversity in the shallow features, thus offering a better description of multi-view data. Moreover, we propose a dual contrastive mechanism that maintains consistency for deep features at both the view-feature and pseudo-label levels. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets, outperforming most state-of-the-art multi-view clustering algorithms.
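The diversity objective on shallow features can be pictured roughly as follows; the cosine-similarity formulation and the exclusion of the diagonal are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def shallow_diversity_loss(shallow_features):
    """Push the pairwise similarity matrix of shallow features towards zero,
    keeping shallow representations of different samples decorrelated.

    shallow_features: (N, D) features from the shallow branch of one view."""
    z = shallow_features / np.linalg.norm(shallow_features, axis=1, keepdims=True)
    sim = z @ z.T                                        # (N, N) cosine similarities
    off_diag = sim - np.diag(np.diag(sim))               # self-similarity is always 1
    return np.mean(off_diag ** 2)                        # Frobenius-style penalty

rng = np.random.default_rng(0)
print(round(shallow_diversity_loss(rng.normal(size=(32, 64))), 4))
```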
20.A Machine Vision Method for Correction of Eccentric Error: Based on Adaptive Enhancement Algorithm
Authors:Fanyi Wang, Pin Cao, Yihui Zhang, Haotian Hu, Yongying Yang
Abstract: In the procedure of surface defect detection for large-aperture aspherical optical elements, it is vitally important to accurately adjust the optical axis of the element to be coaxial with the mechanical spin axis. Therefore, a machine vision method for eccentric error correction is proposed in this paper. Focusing on the severe defocus blur of the reference crosshair image caused by the imaging characteristics of the aspherical optical element, which may lead to failure of the correction, an Adaptive Enhancement Algorithm (AEA) is proposed to strengthen the crosshair image. AEA consists of the existing Guided Filter Dark Channel Dehazing Algorithm (GFA) and the proposed lightweight Multi-scale Densely Connected Network (MDC-Net). The enhancement effect of GFA is excellent but time-consuming, while the enhancement effect of MDC-Net is slightly inferior but strongly real-time. As AEA is executed dozens of times during each correction procedure, its real-time performance is very important. Therefore, by setting an empirical threshold on the definition evaluation function SMD2, GFA and MDC-Net are applied to highly and slightly blurred crosshair images respectively, so as to ensure the enhancement effect while saving as much time as possible. AEA has robust runtime performance, taking an average of 0.2721 s and 0.0963 s to execute GFA and MDC-Net respectively on ten 200×200-pixel Region of Interest (ROI) images with different degrees of blur. With our method, the eccentricity error can be reduced to within 10 μm.
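SMD2 is a classical image-sharpness (definition) evaluation function; assuming its common form (the product of horizontal and vertical gray-level differences summed over pixels), the sketch below shows how such a score could be thresholded to route an ROI to GFA or MDC-Net. The threshold value is purely illustrative, not the paper's empirical setting.

```python
import numpy as np

def smd2(gray):
    """Common SMD2 sharpness measure: sum over pixels of
    |f(x,y)-f(x+1,y)| * |f(x,y)-f(x,y+1)|. Low values indicate heavy defocus blur.

    gray: (H, W) grayscale ROI image."""
    g = gray.astype(np.float64)
    dx = np.abs(np.diff(g, axis=1))[:-1, :]
    dy = np.abs(np.diff(g, axis=0))[:, :-1]
    return float((dx * dy).sum())

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(200, 200)).astype(np.float64)
blurred = np.ones((200, 200)) * 128.0                      # featureless, heavily blurred
print(smd2(sharp) > smd2(blurred))                          # True: sharper image scores higher

# Hypothetical routing by an empirical threshold, mirroring the strategy above:
THRESHOLD = 1e6                                             # illustrative value only
enhance = "GFA" if smd2(blurred) < THRESHOLD else "MDC-Net"
print(enhance)
```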
21.SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation
Authors:Youhong Wang, Yunji Liang, Hao Xu, Shaohui Jiao, Hongkai Yu
Abstract: Recently, self-supervised monocular depth estimation has gained popularity with numerous applications in autonomous driving and robotics. However, existing solutions primarily seek to estimate depth from immediate visual features, and struggle to recover fine-grained scene details with limited generalization. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structures from motion. In SQLdepth, we propose a novel Self Query Layer (SQL) to build a self-cost volume and infer depth from it, rather than inferring depth from feature maps. The self-cost volume implicitly captures the intrinsic geometry of the scene within a single frame. Each individual slice of the volume signifies the relative distances between points and objects within a latent space. Ultimately, this volume is compressed to the depth map via a novel decoding approach. Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance (AbsRel = 0.082 on KITTI, 0.052 on KITTI with improved ground truth, and 0.106 on Cityscapes), achieving 9.9%, 5.5% and 4.5% error reduction from the previous best. In addition, our approach showcases reduced training complexity, computational efficiency, improved generalization, and the ability to recover fine-grained scene details. Moreover, the self-supervised pre-trained and metric fine-tuned SQLdepth can surpass existing supervised methods by significant margins (AbsRel = 0.043, 14% error reduction). The self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code and pre-trained weights are available at https://github.com/hisfog/SQLdepth-Impl.
22.Trust your Good Friends: Source-free Domain Adaptation by Reciprocal Neighborhood Clustering
Authors:Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, Shangling Jui, Jian Yang
Abstract: Domain adaptation (DA) aims to alleviate the domain shift between source domain and target domain. Most DA methods require access to the source data, but often that is not possible (e.g. due to data privacy or intellectual property). In this paper, we address the challenging source-free domain adaptation (SFDA) problem, where the source pretrained model is adapted to the target domain in the absence of source data. Our method is based on the observation that target data, which might not align with the source domain classifier, still forms clear clusters. We capture this intrinsic structure by defining local affinity of the target data, and encourage label consistency among data with high local affinity. We observe that higher affinity should be assigned to reciprocal neighbors. To aggregate information with more context, we consider expanded neighborhoods with small affinity values. Furthermore, we consider the density around each target sample, which can alleviate the negative impact of potential outliers. In the experimental results we verify that the inherent structure of the target features is an important source of information for domain adaptation. We demonstrate that this local structure can be efficiently captured by considering the local neighbors, the reciprocal neighbors, and the expanded neighborhood. Finally, we achieve state-of-the-art performance on several 2D image and 3D point cloud recognition datasets.
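The reciprocal-neighbour notion used above can be sketched as follows: two target samples are linked only if each appears in the other's k-nearest-neighbour list in feature space. The feature dimensions and k below are hypothetical, and the affinity weighting, expanded neighbourhoods, and density terms of the paper are omitted.

```python
import numpy as np

def reciprocal_neighbours(features, k=5):
    """Boolean (N, N) mask of reciprocal k-nearest-neighbour pairs.

    features: (N, D) target-domain features (cosine similarity is used here)."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                       # exclude self
    knn = np.argsort(-sim, axis=1)[:, :k]                # indices of k nearest neighbours
    is_neighbour = np.zeros_like(sim, dtype=bool)
    rows = np.repeat(np.arange(len(z)), k)
    is_neighbour[rows, knn.ravel()] = True
    return is_neighbour & is_neighbour.T                 # keep reciprocal pairs only

rng = np.random.default_rng(0)
mask = reciprocal_neighbours(rng.normal(size=(100, 32)), k=5)
print(mask.sum(), mask.shape)                            # number of reciprocal links, (100, 100)
```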
23.Impact of Image Context for Single Deep Learning Face Morphing Attack Detection
Authors:Joana Pimenta, Iurii Medvedev, Nuno Gonçalves
Abstract: The increase in security concerns due to technological advancements has led to the popularity of biometric approaches that utilize physiological or behavioral characteristics for enhanced recognition. Face recognition systems (FRSs) have become prevalent, but they are still vulnerable to image manipulation techniques such as face morphing attacks. This study investigates the impact of the alignment settings of input images on deep learning face morphing detection performance. We analyze the interconnections between the face contour and image context and suggest optimal alignment conditions for face morphing detection.
24.Discrete Morphological Neural Networks
Authors:Diego Marcondes, Junior Barrera
Abstract: A classical approach to designing binary image operators is Mathematical Morphology (MM). We propose Discrete Morphological Neural Networks (DMNN) for binary image analysis to represent W-operators and estimate them via machine learning. A DMNN architecture, represented by a Morphological Computational Graph, is designed as in the classical heuristic design of morphological operators, in which the designer combines a set of MM operators and Boolean operations based on prior information and theoretical knowledge. Then, once the architecture is fixed, instead of adjusting its parameters (i.e., structural elements or maximal intervals) by hand, we propose a lattice gradient descent algorithm (LGDA) to train these parameters based on a sample of input and output images under the usual machine learning approach. We also propose a stochastic version of the LGDA that is more efficient and scalable, and obtains small errors in practical problems. The class represented by a DMNN can be quite general or specialized according to expected properties of the target operator, i.e., prior information, and the semantics expressed by algebraic properties of classes of operators is a differential relative to other methods. The main contribution of this paper is the merger of the two main paradigms for designing morphological operators, conciliating classical heuristic design with automatic design via machine learning. We apply the DMNN to recognize the boundary of digits with noise, and we discuss many topics for future research.
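For readers new to MM, the snippet below hand-composes a few classical morphological operations of the kind a designer would place in such a computational graph; in a DMNN the graph stays fixed while the structuring elements (fixed here) become trainable parameters. The example is illustrative only and does not implement the LGDA.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

# A hand-designed morphological "graph": opening (erosion then dilation) to remove
# speckle, and a Boolean combination to extract a morphological boundary.
rng = np.random.default_rng(0)
image = rng.random((64, 64)) > 0.5                        # noisy binary image
selem = np.ones((3, 3), dtype=bool)                       # structuring element (fixed here;
                                                          # a trainable parameter in a DMNN)
opened = binary_dilation(binary_erosion(image, selem), selem)   # opening removes speckle
boundary = image & ~binary_erosion(image, selem)                 # internal boundary
print(opened.shape, int(boundary.sum()))
```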
25.Time Series Analysis of Urban Liveability
Authors:Alex Levering, Diego Marcos, Devis Tuia
Abstract: In this paper we explore deep learning models to monitor longitudinal liveability changes in Dutch cities at the neighbourhood level. Our liveability reference data is defined by a country-wide yearly survey based on a set of indicators combined into a liveability score, the Leefbaarometer. We pair this reference data with yearly-available high-resolution aerial images, which creates yearly time steps at which liveability can be monitored. We deploy a convolutional neural network trained on aerial imagery from 2016 and the Leefbaarometer score to predict liveability at new time steps (2012 and 2020). The results in a city used for training (Amsterdam) and one never seen during training (Eindhoven) show some trends which are difficult to interpret, especially in light of the differences in image acquisitions at the different time steps. This demonstrates the complexity of liveability monitoring across time periods and the necessity for more sophisticated methods that compensate for changes unrelated to liveability dynamics.
26.CityDreamer: Compositional Generative Model of Unbounded 3D Cities
Authors:Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Abstract: In recent years, extensive research has focused on 3D natural scene generation, but the domain of 3D city generation has not received as much exploration. This is due to the greater challenges posed by 3D city generation, mainly because humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities, which separates the generation of building instances from other background objects, such as roads, green lands, and water areas, into distinct modules. Furthermore, we construct two datasets, OSM and GoogleEarth, containing a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.
27.Iterative Multi-granular Image Editing using Diffusion Models
Authors:K J Joseph, Prateksha Udhayanan, Tripti Shukla, Aishwarya Agarwal, Srikrishna Karanam, Koustava Goswami, Balaji Vasan Srinivasan
Abstract: Recent advances in text-guided image synthesis have dramatically changed how creative professionals generate artistic and aesthetically pleasing visual assets. To fully support such creative endeavors, the process should possess the ability to: 1) iteratively edit the generations and 2) control the spatial reach of desired changes (global, local or anything in between). We formalize this pragmatic problem setting as Iterative Multi-granular Editing. While there has been substantial progress with diffusion-based models for image synthesis and editing, they are all one-shot (i.e., no iterative editing capabilities) and do not naturally yield multi-granular control (i.e., covering the full spectrum of local-to-global edits). To overcome these drawbacks, we propose EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent iteration strategy, which re-purposes a pre-trained diffusion model to facilitate iterative editing. This is complemented by a gradient control operation for multi-granular control. We introduce a new benchmark dataset to evaluate our newly proposed setting. We conduct exhaustive quantitative and qualitative evaluation against recent state-of-the-art approaches adapted to our task to bring out the mettle of EMILIE. We hope our work will attract attention to this newly identified, pragmatic problem setting.
28.Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
Authors:Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
Abstract: We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. Through parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, requiring no 3D instruction data while exhibiting superior 3D and multi-modal question-answering capacity. We hope our work may inspire the community to extend 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
29.OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Authors:Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby
Abstract: Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as the bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a completely new pipeline, namely OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to extract interesting objects. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D input-free, easy-to-train, and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Furthermore, OpenIns3D allows for effortless switching of 2D detectors without re-training. When integrated with state-of-the-art 2D open-world models such as ODISE and GroundingDINO, superb results are observed on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries, including those that require intricate reasoning and world knowledge. The code and model will be made publicly available.