Computer Vision and Pattern Recognition (cs.CV)
Tue, 25 Apr 2023
1.MixNeRF: Memory Efficient NeRF with Feature Mixed-up Hash Table
Authors:Yongjae Lee, Li Yang, Deliang Fan
Abstract: Neural radiance field (NeRF) has shown remarkable performance in generating photo-realistic novel views. Since the emergence of NeRF, many studies have been conducted, among which managing features with explicit structures such as grids has achieved exceptionally fast training by reducing the complexity of multilayer perceptron (MLP) networks. However, storing features in dense grids requires significantly large memory space, which leads to memory bottleneck in computer systems and thus large training time. To address this issue, in this work, we propose MixNeRF, a memory-efficient NeRF framework that employs a mixed-up hash table to improve memory efficiency and reduce training time while maintaining reconstruction quality. We first design a \textit{mixed-up hash table} to adaptively mix part of multi-level feature grids into one and map it to a single hash table. Following that, in order to obtain the correct index of a grid point, we further design an \textit{index transformation} method that transforms indices of an arbitrary level grid to those of a canonical grid. Extensive experiments benchmarking with state-of-the-art Instant-NGP, TensoRF, and DVGO, indicate our MixNeRF could achieve the fastest training time on the same GPU hardware with similar or even higher reconstruction quality. Source code is available at \url{https://github.com/nfyfamr/MixNeRF}.
2.ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds
Authors:Xiangze Jia, Hui Zhou, Xinge Zhu, Yandong Guo, Ji Zhang, Yuexin Ma
Abstract: In this paper, we propose a novel self-supervised motion estimator for LiDAR-based autonomous driving via BEV representation. Different from usually adopted self-supervised strategies for data-level structure consistency, we predict scene motion via feature-level consistency between pillars in consecutive frames, which can eliminate the effect caused by noise points and view-changing point clouds in dynamic scenes. Specifically, we propose \textit{Soft Discriminative Loss} that provides the network with more pseudo-supervised signals to learn discriminative and robust features in a contrastive learning manner. We also propose \textit{Gated Multi-frame Fusion} block that learns valid compensation between point cloud frames automatically to enhance feature extraction. Finally, \textit{pillar association} is proposed to predict pillar correspondence probabilities based on feature distance, and whereby further predicts scene motion. Extensive experiments show the effectiveness and superiority of our \textbf{ContrastMotion} on both scene flow and motion prediction tasks. The code is available soon.
3.Unsupervised Synthetic Image Refinement via Contrastive Learning and Consistent Semantic and Structure Constraints
Authors:Ganning Zhao, Tingwei Shen, Suya You, C. -C. Jay Kuo
Abstract: Ensuring the realism of computer-generated synthetic images is crucial to deep neural network (DNN) training. Due to different semantic distributions between synthetic and real-world captured datasets, there exists semantic mismatch between synthetic and refined images, which in turn results in the semantic distortion. Recently, contrastive learning (CL) has been successfully used to pull correlated patches together and push uncorrelated ones apart. In this work, we exploit semantic and structural consistency between synthetic and refined images and adopt CL to reduce the semantic distortion. Besides, we incorporate hard negative mining to improve the performance furthermore. We compare the performance of our method with several other benchmarking methods using qualitative and quantitative measures and show that our method offers the state-of-the-art performance.
4.MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes
Authors:Han Wang, Jiayuan Zhang, Lipeng Wan, Xingyu Chen, Xuguang Lan, Nanning Zheng
Abstract: Manipulation relationship detection (MRD) aims to guide the robot to grasp objects in the right order, which is important to ensure the safety and reliability of grasping in object stacked scenes. Previous works infer manipulation relationship by deep neural network trained with data collected from a predefined view, which has limitation in visual dislocation in unstructured environments. Multi-view data provide more comprehensive information in space, while a challenge of multi-view MRD is domain shift. In this paper, we propose a novel multi-view fusion framework, namely multi-view MRD network (MMRDN), which is trained by 2D and 3D multi-view data. We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions to learn the consistent representations. Besides, taking advantage of position information within the 3D data, we select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects. Finally, the features of multi-view 2D and 3D data are concatenated to predict the pairwise relationship of objects. Experimental results on the challenging REGRAD dataset show that MMRDN outperforms the state-of-the-art methods in multi-view MRD tasks. The results also demonstrate that our model trained by synthetic data is capable to transfer to real-world scenarios.
5.Detection of Pavement Cracks by Deep Learning Models of Transformer and UNet
Authors:Yu Zhang, Lin Zhang
Abstract: Fracture is one of the main failure modes of engineering structures such as buildings and roads. Effective detection of surface cracks is significant for damage evaluation and structure maintenance. In recent years, the emergence and development of deep learning techniques have shown great potential to facilitate surface crack detection. Currently, most reported tasks were performed by a convolutional neural network (CNN), while the limitation of CNN may be improved by the transformer architecture introduced recently. In this study, we investigated nine promising models to evaluate their performance in pavement surface crack detection by model accuracy, computational complexity, and model stability. We created 711 images of 224 by 224 pixels with crack labels, selected an optimal loss function, compared the evaluation metrics of the validation dataset and test dataset, analyzed the data details, and checked the segmentation outcomes of each model. We find that transformer-based models generally are easier to converge during the training process and have higher accuracy, but usually exhibit more memory consumption and low processing efficiency. Among nine models, SwinUNet outperforms the other two transformers and shows the highest accuracy among nine models. The results should shed light on surface crack detection by various deep-learning models and provide a guideline for future applications in this field.
6.Application of Segment Anything Model for Civil Infrastructure Defect Assessment
Authors:Mohsen Ahmadi, Ahmad Gholizadeh Lonbar, Abbas Sharifi, Ali Tarlani Beris, Mohammadsadegh Nouri, Amir Sharifzadeh Javidi
Abstract: This research assesses the performance of two deep learning models, SAM and U-Net, for detecting cracks in concrete structures. The results indicate that each model has its own strengths and limitations for detecting different types of cracks. Using the SAM's unique crack detection approach, the image is divided into various parts that identify the location of the crack, making it more effective at detecting longitudinal cracks. On the other hand, the U-Net model can identify positive label pixels to accurately detect the size and location of spalling cracks. By combining both models, more accurate and comprehensive crack detection results can be achieved. The importance of using advanced technologies for crack detection in ensuring the safety and longevity of concrete structures cannot be overstated. This research can have significant implications for civil engineering, as the SAM and U-Net model can be used for a variety of concrete structures, including bridges, buildings, and roads, improving the accuracy and efficiency of crack detection and saving time and resources in maintenance and repair. In conclusion, the SAM and U-Net model presented in this study offer promising solutions for detecting cracks in concrete structures and leveraging the strengths of both models that can lead to more accurate and comprehensive results.
7.Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint
Authors:Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao
Abstract: Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize temporal boundaries of actions for the video, given only video-level category labels in the training datasets. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classificationproblem, i.e., generating the temporal class activation map (T-CAM) for localization. However, with only classification loss, the model would be sub-optimized, i.e., the action-related scenes are enough to distinguish different class labels. Regarding other actions in the action-related scene ( i.e., the scene same as positive actions) as co-scene actions, this sub-optimized model would misclassify the co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions. The proposed Bi-SCC firstly adopts a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions in the inter-video; Then, a semantic consistency constraint (SCC) is used to enforce the predictions of the original video and augmented video to be consistent, hence suppressing the co-scene actions. However, we find that this augmented video would destroy the original temporal context. Simply applying the consistency constraint would affect the completeness of localized positive actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene actions while ensuring the integrity of positive actions, by cross-supervising the original and augmented videos. Finally, our proposed Bi-SCC can be applied to current WTAL approaches, and improve their performance. Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
8.Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation
Authors:Junde Wu, Rao Fu, Huihui Fang, Yuanpei Liu, Zhaowei Wang, Yanwu Xu, Yueming Jin, Tal Arbel
Abstract: The Segment Anything Model (SAM) has recently gained popularity in the field of image segmentation. Thanks to its impressive capabilities in all-round segmentation tasks and its prompt-based interface, SAM has sparked intensive discussion within the community. It is even said by many prestigious experts that image segmentation task has been "finished" by SAM. However, medical image segmentation, although an important branch of the image segmentation family, seems not to be included in the scope of Segmenting "Anything". Many individual experiments and recent studies have shown that SAM performs subpar in medical image segmentation. A natural question is how to find the missing piece of the puzzle to extend the strong segmentation capability of SAM to medical image segmentation. In this paper, we present a possible solution by fine-tuning the pretrained SAM model following parameter-efficient fine-tuning paradigm with Adapter. Although this work is still one of a few to transfer the popular NLP technique Adapter to computer vision cases, this simple implementation shows surprisingly good performance on medical image segmentation. A medical image adapted SAM, which we have dubbed Medical SAM Adapter (MSA), shows superior performance on 19 medical image segmentation tasks with various image modalities including CT, MRI, ultrasound image, fundus image, and dermoscopic images. MSA outperforms a wide range of state-of-the-art (SOTA) medical image segmentation methods, such as nnUNet, TransUNet, UNetr, MedSegDiff, and so on. Code will be released at: https://github.com/WuJunde/Medical-SAM-Adapter.
9.Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures
Authors:Eugenia Iofinova, Alexandra Peste, Dan Alistarh
Abstract: Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression. Yet, several recent works have raised the issue that pruning may induce or exacerbate bias in the output of the compressed model. Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is not well-understood. In this work, we systematically investigate and characterize this phenomenon in Convolutional Neural Networks for computer vision. First, we show that it is in fact possible to obtain highly-sparse models, e.g. with less than 10% remaining weights, which do not decrease in accuracy nor substantially increase in bias when compared to dense models. At the same time, we also find that, at higher sparsities, pruned models exhibit higher uncertainty in their outputs, as well as increased correlations, which we directly link to increased bias. We propose easy-to-use criteria which, based only on the uncompressed model, establish whether bias will increase with pruning, and identify the samples most susceptible to biased predictions post-compression.
10.Shape-Net: Room Layout Estimation from Panoramic Images Robust to Occlusion using Knowledge Distillation with 3D Shapes as Additional Inputs
Authors:Mizuki Tabata, Kana Kurata, Junichiro Tamamatsu
Abstract: Estimating the layout of a room from a single-shot panoramic image is important in virtual/augmented reality and furniture layout simulation. This involves identifying three-dimensional (3D) geometry, such as the location of corners and boundaries, and performing 3D reconstruction. However, occlusion is a common issue that can negatively impact room layout estimation, and this has not been thoroughly studied to date. It is possible to obtain 3D shape information of rooms as drawings of buildings and coordinates of corners from image datasets, thus we propose providing both 2D panoramic and 3D information to a model to effectively deal with occlusion. However, simply feeding 3D information to a model is not sufficient to utilize the shape information for an occluded area. Therefore, we improve the model by introducing 3D Intersection over Union (IoU) loss to effectively use 3D information. In some cases, drawings are not available or the construction deviates from a drawing. Considering such practical cases, we propose a method for distilling knowledge from a model trained with both images and 3D information to a model that takes only images as input. The proposed model, which is called Shape-Net, achieves state-of-the-art (SOTA) performance on benchmark datasets. We also confirmed its effectiveness in dealing with occlusion through significantly improved accuracy on images with occlusion compared with existing models.
11.Spatiotemporal Graph Convolutional Recurrent Neural Network Model for Citywide Air Pollution Forecasting
Authors:Van-Duc Le
Abstract: Citywide Air Pollution Forecasting tries to precisely predict the air quality multiple hours ahead for the entire city. This topic is challenged since air pollution varies in a spatiotemporal manner and depends on many complicated factors. Our previous research has solved the problem by considering the whole city as an image and leveraged a Convolutional Long Short-Term Memory (ConvLSTM) model to learn the spatiotemporal features. However, an image-based representation may not be ideal as air pollution and other impact factors have natural graph structures. In this research, we argue that a Graph Convolutional Network (GCN) can efficiently represent the spatial features of air quality readings in the whole city. Specially, we extend the ConvLSTM model to a Spatiotemporal Graph Convolutional Recurrent Neural Network (Spatiotemporal GCRNN) model by tightly integrating a GCN architecture into an RNN structure for efficient learning spatiotemporal characteristics of air quality values and their influential factors. Our extensive experiments prove the proposed model has a better performance compare to the state-of-the-art ConvLSTM model for air pollution predicting while the number of parameters is much smaller. Moreover, our approach is also superior to a hybrid GCN-based method in a real-world air pollution dataset.
12.Pseudo Labels Refinement with Intra-camera Similarity for Unsupervised Person Re-identification
Authors:Pengna Li, Kangyi Wu, Sanping Zhou. Qianxin Huang, Jinjun Wang
Abstract: Unsupervised person re-identification (Re-ID) aims to retrieve person images across cameras without any identity labels. Most clustering-based methods roughly divide image features into clusters and neglect the feature distribution noise caused by domain shifts among different cameras, leading to inevitable performance degradation. To address this challenge, we propose a novel label refinement framework with clustering intra-camera similarity. Intra-camera feature distribution pays more attention to the appearance of pedestrians and labels are more reliable. We conduct intra-camera training to get local clusters in each camera, respectively, and refine inter-camera clusters with local results. We hence train the Re-ID model with refined reliable pseudo labels in a self-paced way. Extensive experiments demonstrate that the proposed method surpasses state-of-the-art performance.
13.Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medical Segmentation
Authors:Peilun Shi, Jianing Qiu, Sai Mu Dalike Abaxi, Hao Wei, Frank P. -W. Lo, Wu Yuan
Abstract: We examine the recent Segment Anything Model (SAM) on medical images, and report both quantitative and qualitative zero-shot segmentation results on nine medical image segmentation benchmarks, covering various imaging modalities, such as optical coherence tomography (OCT), magnetic resonance imaging (MRI), and computed tomography (CT), as well as different applications including dermatology, ophthalmology, and radiology. Our experiments reveal that while SAM demonstrates stunning segmentation performance on images from the general domain, for those out-of-distribution images, e.g., medical images, its zero-shot segmentation performance is still limited. Furthermore, SAM demonstrated varying zero-shot segmentation performance across different unseen medical domains. For example, it had a 0.8704 mean Dice score on segmenting under-bruch's membrane layer of retinal OCT, whereas the segmentation accuracy drops to 0.0688 when segmenting retinal pigment epithelium. For certain structured targets, e.g., blood vessels, the zero-shot segmentation of SAM completely failed, whereas a simple fine-tuning of it with small amount of data could lead to remarkable improvements of the segmentation quality. Our study indicates the versatility of generalist vision foundation models on solving specific tasks in medical imaging, and their great potential to achieve desired performance through fine-turning and eventually tackle the challenges of accessing large diverse medical datasets and the complexity of medical domains.
14.Change detection needs change information: improving deep 3D point cloud change detection
Authors:Iris de Gélis Magellium Institut de Recherche en Informatique et Systèmes Aléatoires IRISA - UMR 6074 - Université Bretagne Sud, Thomas Corpetti Littoral - Environnement - Télédétection - Géomatique LETG - UMR 6554 - Université Rennes 2, Sébastien Lefèvre Institut de Recherche en Informatique et Systèmes Aléatoires IRISA - UMR 6074 - Université Bretagne Sud
Abstract: Change detection is an important task to rapidly identify modified areas, in particular when multi-temporal data are concerned. In landscapes with complex geometry such as urban environment, vertical information turn out to be a very useful knowledge not only to highlight changes but also to classify them into different categories. In this paper, we focus on change segmentation directly using raw 3D point clouds (PCs), to avoid any loss of information due to rasterization processes. While deep learning has recently proved its effectiveness for this particular task by encoding the information through Siamese networks, we investigate here the idea of also using change information in early steps of deep networks. To do this, we first propose to provide the Siamese KPConv State-of-The-Art (SoTA) network with hand-crafted features and especially a change-related one. This improves the mean of Intersection over Union (IoU) over classes of change by 4.70\%. Considering that the major improvement was obtained thanks to the change-related feature, we propose three new architectures to address 3D PCs change segmentation: OneConvFusion, Triplet KPConv, and Encoder Fusion SiamKPConv. All the three networks take into account change information in early steps and outperform SoTA methods. In particular, the last network, entitled Encoder Fusion SiamKPConv, overtakes SoTA with more than 5% of mean of IoU over classes of change emphasizing the value of having the network focus on change information for change detection task.
15.Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur
Authors:Peng Dai, Yinda Zhang, Xin Yu, Xiaoyang Lyu, Xiaojuan Qi
Abstract: Rendering novel view images is highly desirable for many applications. Despite recent progress, it remains challenging to render high-fidelity and view-consistent novel views of large-scale scenes from in-the-wild images with inevitable artifacts (e.g., motion blur). To this end, we develop a hybrid neural rendering model that makes image-based representation and neural 3D representation join forces to render high-quality, view-consistent images. Besides, images captured in the wild inevitably contain artifacts, such as motion blur, which deteriorates the quality of rendered images. Accordingly, we propose strategies to simulate blur effects on the rendered images to mitigate the negative influence of blurriness images and reduce their importance during training based on precomputed quality-aware weights. Extensive experiments on real and synthetic data demonstrate our model surpasses state-of-the-art point-based methods for novel view synthesis. The code is available at https://daipengwa.github.io/Hybrid-Rendering-ProjectPage.
16.Dynamic Video Frame Interpolation with integrated Difficulty Pre-Assessment
Authors:Ban Chen, Xin Jin, Youxin Chen, Longhai Wu, Jie Chen, Jayoon Koo, Cheul-hee Hahm
Abstract: Video frame interpolation(VFI) has witnessed great progress in recent years. While existing VFI models still struggle to achieve a good trade-off between accuracy and efficiency: fast models often have inferior accuracy; accurate models typically run slowly. However, easy samples with small motion or clear texture can achieve competitive results with simple models and do not require heavy computation. In this paper, we present an integrated pipeline which combines difficulty assessment with video frame interpolation. Specifically, it firstly leverages a pre-assessment model to measure the interpolation difficulty level of input frames, and then dynamically selects an appropriate VFI model to generate interpolation results. Furthermore, a large-scale VFI difficulty assessment dataset is collected and annotated to train our pre-assessment model. Extensive experiments show that easy samples pass through fast models while difficult samples inference with heavy models, and our proposed pipeline can improve the accuracy-efficiency trade-off for VFI.
17.Bayesian Optimization Meets Self-Distillation
Authors:HyunJae Lee, Heon Song, Hyeonsoo Lee, Gi-hyeon Lee, Suyeong Park, Donggeun Yoo
Abstract: Bayesian optimization (BO) has contributed greatly to improving model performance by suggesting promising hyperparameter configurations iteratively based on observations from multiple training trials. However, only partial knowledge (i.e., the measured performances of trained models and their hyperparameter configurations) from previous trials is transferred. On the other hand, Self-Distillation (SD) only transfers partial knowledge learned by the task model itself. To fully leverage the various knowledge gained from all training trials, we propose the BOSS framework, which combines BO and SD. BOSS suggests promising hyperparameter configurations through BO and carefully selects pre-trained models from previous trials for SD, which are otherwise abandoned in the conventional BO process. BOSS achieves significantly better performance than both BO and SD in a wide range of tasks including general image classification, learning with noisy labels, semi-supervised learning, and medical image analysis tasks.
18.Exploring the Mutual Influence between Self-Supervised Single-Frame and Multi-Frame Depth Estimation
Authors:Jie Xiang, Yun Wang, Lifeng An, Haiyang Liu, Jian Liu
Abstract: Although both self-supervised single-frame and multi-frame depth estimation methods only require unlabeled monocular videos for training, the information they leverage varies because single-frame methods mainly rely on appearance-based features while multi-frame methods focus on geometric cues. Considering the complementary information of single-frame and multi-frame methods, some works attempt to leverage single-frame depth to improve multi-frame depth. However, these methods can neither exploit the difference between single-frame depth and multi-frame depth to improve multi-frame depth nor leverage multi-frame depth to optimize single-frame depth models. To fully utilize the mutual influence between single-frame and multi-frame methods, we propose a novel self-supervised training framework. Specifically, we first introduce a pixel-wise adaptive depth sampling module guided by single-frame depth to train the multi-frame model. Then, we leverage the minimum reprojection based distillation loss to transfer the knowledge from the multi-frame depth network to the single-frame network to improve single-frame depth. Finally, we regard the improved single-frame depth as a prior to further boost the performance of multi-frame depth estimation. Experimental results on the KITTI and Cityscapes datasets show that our method outperforms existing approaches in the self-supervised monocular setting.
19.Local Implicit Ray Function for Generalizable Radiance Field Representation
Authors:Xin Huang, Qi Zhang, Ying Feng, Xiaoyu Li, Xuan Wang, Qing Wang
Abstract: We propose LIRF (Local Implicit Ray Function), a generalizable neural rendering approach for novel view rendering. Current generalizable neural radiance fields (NeRF) methods sample a scene with a single ray per pixel and may therefore render blurred or aliased views when the input views and rendered views capture scene content with different resolutions. To solve this problem, we propose LIRF to aggregate the information from conical frustums to construct a ray. Given 3D positions within conical frustums, LIRF takes 3D coordinates and the features of conical frustums as inputs and predicts a local volumetric radiance field. Since the coordinates are continuous, LIRF renders high-quality novel views at a continuously-valued scale via volume rendering. Besides, we predict the visible weights for each input view via transformer-based feature matching to improve the performance in occluded areas. Experimental results on real-world scenes validate that our method outperforms state-of-the-art methods on novel view rendering of unseen scenes at arbitrary scales.
20.Inverting the Imaging Process by Learning an Implicit Camera Model
Authors:Xin Huang, Qi Zhang, Ying Feng, Hongdong Li, Qing Wang
Abstract: Representing visual signals with implicit coordinate-based neural networks, as an effective replacement of the traditional discrete signal representation, has gained considerable popularity in computer vision and graphics. In contrast to existing implicit neural representations which focus on modelling the scene only, this paper proposes a novel implicit camera model which represents the physical imaging process of a camera as a deep neural network. We demonstrate the power of this new implicit camera model on two inverse imaging tasks: i) generating all-in-focus photos, and ii) HDR imaging. Specifically, we devise an implicit blur generator and an implicit tone mapper to model the aperture and exposure of the camera's imaging process, respectively. Our implicit camera model is jointly learned together with implicit scene models under multi-focus stack and multi-exposure bracket supervision. We have demonstrated the effectiveness of our new model on a large number of test images and videos, producing accurate and visually appealing all-in-focus and high dynamic range images. In principle, our new implicit neural camera model has the potential to benefit a wide array of other inverse imaging tasks.
21.Class Attention Transfer Based Knowledge Distillation
Authors:Ziyao Guo, Haonan Yan, Hui Li, Xiaodong Lin
Abstract: Previous knowledge distillation methods have shown their impressive performance on model compression tasks, however, it is hard to explain how the knowledge they transferred helps to improve the performance of the student network. In this work, we focus on proposing a knowledge distillation method that has both high interpretability and competitive performance. We first revisit the structure of mainstream CNN models and reveal that possessing the capacity of identifying class discriminative regions of input is critical for CNN to perform classification. Furthermore, we demonstrate that this capacity can be obtained and enhanced by transferring class activation maps. Based on our findings, we propose class attention transfer based knowledge distillation (CAT-KD). Different from previous KD methods, we explore and present several properties of the knowledge transferred by our method, which not only improve the interpretability of CAT-KD but also contribute to a better understanding of CNN. While having high interpretability, CAT-KD achieves state-of-the-art performance on multiple benchmarks. Code is available at: https://github.com/GzyAftermath/CAT-KD.
22.Depth-Relative Self Attention for Monocular Depth Estimation
Authors:Kyuhong Shim, Jiyoung Kim, Gusang Lee, Byonghyo Shim
Abstract: Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome the limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are overly exploited, the network can be biased on RGB information without considering the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, the features of similar depth can become more likely to each other and thus less prone to misused visual hints. We show that the proposed model achieves competitive results in monocular depth estimation benchmarks and is less biased to RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model for unseen depths.
23.Quantifying the Effect of Image Similarity on Diabetic Foot Ulcer Classification
Authors:Imran Chowdhury Dipto, Bill Cassidy, Connah Kendrick, Neil D. Reeves, Joseph M. Pappachan, Vishnu Chandrabalan, Moi Hoon Yap
Abstract: This research conducts an investigation on the effect of visually similar images within a publicly available diabetic foot ulcer dataset when training deep learning classification networks. The presence of binary-identical duplicate images in datasets used to train deep learning algorithms is a well known issue that can introduce unwanted bias which can degrade network performance. However, the effect of visually similar non-identical images is an under-researched topic, and has so far not been investigated in any diabetic foot ulcer studies. We use an open-source fuzzy algorithm to identify groups of increasingly similar images in the Diabetic Foot Ulcers Challenge 2021 (DFUC2021) training dataset. Based on each similarity threshold, we create new training sets that we use to train a range of deep learning multi-class classifiers. We then evaluate the performance of the best performing model on the DFUC2021 test set. Our findings show that the model trained on the training set with the 80\% similarity threshold images removed achieved the best performance using the InceptionResNetV2 network. This model showed improvements in F1-score, precision, and recall of 0.023, 0.029, and 0.013, respectively. These results indicate that highly similar images can contribute towards the presence of performance degrading bias within the Diabetic Foot Ulcers Challenge 2021 dataset, and that the removal of images that are 80\% similar from the training set can help to boost classification performance.
24.Segment anything, from space?
Authors:Simiao Ren, Francesco Luzi, Saad Lahrichi, Kaleb Kassaw, Leslie M. Collins, Kyle Bradbury, Jordan M. Malof
Abstract: Recently, the first foundation model developed specifically for vision tasks was developed, termed the "Segment Anything Model" (SAM). SAM can segment objects in input imagery based upon cheap input prompts, such as one (or more) points, a bounding box, or a mask. The authors examined the zero-shot image segmentation accuracy of SAM on a large number of vision benchmark tasks and found that SAM usually achieved recognition accuracy similar to, or sometimes exceeding, vision models that had been trained on the target tasks. The impressive generalization of SAM for segmentation has major implications for vision researchers working on natural imagery. In this work, we examine whether SAM's impressive performance extends to overhead imagery problems, and help guide the community's response to its development. We examine SAM's performance on a set of diverse and widely-studied benchmark tasks. We find that SAM does often generalize well to overhead imagery, although it fails in some cases due to the unique characteristics of overhead imagery and the target objects. We report on these unique systematic failure cases for remote sensing imagery that may comprise useful future research for the community. Note that this is a working paper, and it will be updated as additional analysis and results are completed.
25.PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling
Authors:Zhe Li, Zerong Zheng, Yuxiao Liu, Boyao Zhou, Yebin Liu
Abstract: Creating pose-driven human avatars is about modeling the mapping from the low-frequency driving pose to high-frequency dynamic human appearances, so an effective pose encoding method that can encode high-fidelity human details is essential to human avatar modeling.To this end, we present PoseVocab, a novel pose encoding method that encourages the network to discover the optimal pose embeddings for learning the dynamic human appearance. Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses. To achieve pose generalization and temporal consistency, we sample key rotations in $so(3)$ of each joint rather than the global pose vectors, and assign a pose embedding to each sampled key rotation. These joint-structured pose embeddings not only encode the dynamic appearances under different key poses, but also factorize the global pose embedding into joint-structured ones to better learn the appearance variation related to the motion of each joint. To improve the representation ability of the pose embedding while maintaining memory efficiency, we introduce feature lines, a compact yet effective 3D representation, to model more fine-grained details of human appearances. Furthermore, given a query pose and a spatial position, a hierarchical query strategy is introduced to interpolate pose embeddings and acquire the conditional pose feature for dynamic human synthesis. Overall, PoseVocab effectively encodes the dynamic details of human appearance and enables realistic and generalized animation under novel poses. Experiments show that our method outperforms other state-of-the-art baselines both qualitatively and quantitatively in terms of synthesis quality. Code is available at https://github.com/lizhe00/PoseVocab.
26.The Potential of Visual ChatGPT For Remote Sensing
Authors:Lucas Prado Osco, Eduardo Lopes de Lemos, Wesley Nunes Gonçalves, Ana Paula Marques Ramos, José Marcato Junior
Abstract: Recent advancements in Natural Language Processing (NLP), particularly in Large Language Models (LLMs), associated with deep learning-based computer vision techniques, have shown substantial potential for automating a variety of tasks. One notable model is Visual ChatGPT, which combines ChatGPT's LLM capabilities with visual computation to enable effective image analysis. The model's ability to process images based on textual inputs can revolutionize diverse fields. However, its application in the remote sensing domain remains unexplored. This is the first paper to examine the potential of Visual ChatGPT, a cutting-edge LLM founded on the GPT architecture, to tackle the aspects of image processing related to the remote sensing domain. Among its current capabilities, Visual ChatGPT can generate textual descriptions of images, perform canny edge and straight line detection, and conduct image segmentation. These offer valuable insights into image content and facilitate the interpretation and extraction of information. By exploring the applicability of these techniques within publicly available datasets of satellite images, we demonstrate the current model's limitations in dealing with remote sensing images, highlighting its challenges and future prospects. Although still in early development, we believe that the combination of LLMs and visual models holds a significant potential to transform remote sensing image processing, creating accessible and practical application opportunities in the field.
27.Methods and datasets for segmentation of minimally invasive surgical instruments in endoscopic images and videos: A review of the state of the art
Authors:Tobias Rueckert Regensburg Medical Image Computing, Daniel Rueckert Artificial Intelligence in Healthcare and Medicine, Klinikum rechts der Isar, Technical University of Munich, Germany Department of Computing, Imperial College London, UK, Christoph Palm Regensburg Medical Image Computing Regensburg Center of Health Sciences and Technology
Abstract: In the field of computer- and robot-assisted minimally invasive surgery, enormous progress has been made in recent years based on the recognition of surgical instruments in endoscopic images. Especially the determination of the position and type of the instruments is of great interest here. Current work involves both spatial and temporal information with the idea, that the prediction of movement of surgical tools over time may improve the quality of final segmentations. The provision of publicly available datasets has recently encouraged the development of new methods, mainly based on deep learning. In this review, we identify datasets used for method development and evaluation, as well as quantify their frequency of use in the literature. We further present an overview of the current state of research regarding the segmentation and tracking of minimally invasive surgical instruments in endoscopic images. The paper focuses on methods that work purely visually without attached markers of any kind on the instruments, taking into account both single-frame segmentation approaches as well as those involving temporal information. A discussion of the reviewed literature is provided, highlighting existing shortcomings and emphasizing available potential for future developments. The publications considered were identified through the platforms Google Scholar, Web of Science, and PubMed. The search terms used were "instrument segmentation", "instrument tracking", "surgical tool segmentation", and "surgical tool tracking" and result in 408 articles published between 2015 and 2022 from which 109 were included using systematic selection criteria.
28.Flickr-PAD: New Face High-Resolution Presentation Attack Detection Database
Authors:Diego Pasmino, Carlos Aravena, Juan Tapia, Christoph Busch
Abstract: Nowadays, Presentation Attack Detection is a very active research area. Several databases are constituted in the state-of-the-art using images extracted from videos. One of the main problems identified is that many databases present a low-quality, small image size and do not represent an operational scenario in a real remote biometric system. Currently, these images are captured from smartphones with high-quality and bigger resolutions. In order to increase the diversity of image quality, this work presents a new PAD database based on open-access Flickr images called: "Flickr-PAD". Our new hand-made database shows high-quality printed and screen scenarios. This will help researchers to compare new approaches to existing algorithms on a wider database. This database will be available for other researchers. A leave-one-out protocol was used to train and evaluate three PAD models based on MobileNet-V3 (small and large) and EfficientNet-B0. The best result was reached with MobileNet-V3 large with BPCER10 of 7.08% and BPCER20 of 11.15%.
29.Face Feature Visualisation of Single Morphing Attack Detection
Authors:Juan Tapia, Christoph Busch
Abstract: This paper proposes an explainable visualisation of different face feature extraction algorithms that enable the detection of bona fide and morphing images for single morphing attack detection. The feature extraction is based on raw image, shape, texture, frequency and compression. This visualisation may help to develop a Graphical User Interface for border policies and specifically for border guard personnel that have to investigate details of suspect images. A Random forest classifier was trained in a leave-one-out protocol on three landmarks-based face morphing methods and a StyleGAN-based morphing method for which morphed images are available in the FRLL database. For morphing attack detection, the Discrete Cosine-Transformation-based method obtained the best results for synthetic images and BSIF for landmark-based image features.
30.A Strong and Reproducible Object Detector with Only Public Datasets
Authors:Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang
Abstract: This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without any test time augmentation. It explores the combination of the powerful FocalNet-Huge backbone with the effective Stable-DINO detector. Different from existing SOTA models that utilize an extensive number of parameters and complex training techniques on large-scale private data or merged data, our model is exclusively trained on the publicly available dataset Objects365, which ensures the reproducibility of our approach.
31.CompletionFormer: Depth Completion with Convolutions and Vision Transformers
Authors:Zhang Youmin, Guo Xianda, Poggi Matteo, Zhu Zheng, Huang Guan, Mattoccia Stefano
Abstract: Given sparse depths and the corresponding RGB images, depth completion aims at spatially propagating the sparse measurements throughout the whole image to get a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model the long-range relationship between pixels. While recent fully Transformer-based architecture has reported encouraging results with the global receptive field, the performance and efficiency gaps to the well-developed CNN models still exist because of its deteriorative local feature details. This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods. Code is available at \url{https://github.com/youmi-zym/CompletionFormer}.
32.DQS3D: Densely-matched Quantization-aware Semi-supervised 3D Detection
Authors:Huan-ang Gao, Beiwen Tian, Pengfei Li, Hao Zhao, Guyue Zhou
Abstract: In this paper, we study the problem of semi-supervised 3D object detection, which is of great importance considering the high annotation cost for cluttered 3D indoor scenes. We resort to the robust and principled framework of selfteaching, which has triggered notable progress for semisupervised learning recently. While this paradigm is natural for image-level or pixel-level prediction, adapting it to the detection problem is challenged by the issue of proposal matching. Prior methods are based upon two-stage pipelines, matching heuristically selected proposals generated in the first stage and resulting in spatially sparse training signals. In contrast, we propose the first semisupervised 3D detection algorithm that works in the singlestage manner and allows spatially dense training signals. A fundamental issue of this new design is the quantization error caused by point-to-voxel discretization, which inevitably leads to misalignment between two transformed views in the voxel domain. To this end, we derive and implement closed-form rules that compensate this misalignment onthe-fly. Our results are significant, e.g., promoting ScanNet [email protected] from 35.2% to 48.5% using 20% annotation. Codes and data will be publicly available.