Ridge regression baseline model outperforms deep learning method for cancer genetic dependency prediction

By: Chang, D.; Zhang, X.

Accurately predicting genetic or other cellular vulnerabilities of unscreened, or difficult-to-screen, cancer samples would enable substantial advances in precision oncology. We re-analyzed a recently published deep learning method for predicting cancer genetic dependencies from omics profiles. After implementing a ridge regression baseline model with an alternative, simplified problem setup, we obtained a model that outperforms the original deep learning method. Our study demonstrates the importance of problem formulation in machine learning applications and underscores the need for rigorous comparisons with baseline approaches.
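To illustrate what a ridge regression baseline looks like in practice, here is a minimal sketch using the closed-form solution. It is not the authors' pipeline: the synthetic data, the feature count, the `alpha` value, and the helper names `fit_ridge` and `predict_ridge` are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean(axis=0)
    Xc = X - x_mean  # center features so the intercept is not penalized
    yc = y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def predict_ridge(X, coef, intercept):
    return X @ coef + intercept


# Synthetic stand-in for (omics features -> dependency score); purely hypothetical.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
X = rng.normal(size=(n_samples, n_features))
true_coef = rng.normal(size=n_features)
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

# Hold out the last 50 samples for evaluation.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

coef, intercept = fit_ridge(X_train, y_train, alpha=1.0)
pred = predict_ridge(X_test, coef, intercept)
pearson_r = np.corrcoef(pred, y_test)[0, 1]
print(f"held-out Pearson r: {pearson_r:.3f}")
```

A baseline like this has a single hyperparameter (`alpha`), which makes it cheap to tune and a useful reference point before comparing against a more complex model.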
Multiplexed tumor profiling with generative AI accelerates histopathology workflows and improves clinical predictions

By: Pati, P.; Karkampouna, S.; Bonollo, F.; Comperat, E.; Radic, M.; Spahn, M.; Martinelli, A.; Wartenberg, M.; Kruithof-de Julio, M.; Rapsomaniki, M. A.

Understanding the spatial heterogeneity of tumors and its links to disease is a cornerstone of cancer biology. Emerging spatial technologies offer unprecedented capabilities towards this goal, but several limitations hinder their clinical adoption. To date, histopathology workflows still heavily depend on hematoxylin & eosin (H&E) and serial immunohistochemistry (IHC) staining, a cumbersome and tissue-exhaustive process that yields unaligned tissue images. We propose the VirtualMultiplexer, a generative AI toolkit that translates real H&E images to matching IHC images for several markers based on contrastive learning. The VirtualMultiplexer learns from unpaired H&E and IHC images and introduces a novel multi-scale loss to ensure consistent and biologically reliable stainings. The virtually multiplexed images enabled training a Graph Transformer that simultaneously learns from the joint spatial distribution of several markers to predict clinically relevant endpoints. Our results indicate that the VirtualMultiplexer achieves rapid, robust and precise generation of virtually multiplexed imaging datasets of high staining quality that are indistinguishable from the real ones. We successfully employed transfer learning to generate realistic virtual stainings across tissue scales, patient cohorts, and cancer types with no need for model fine-tuning. Crucially, the generated images are not only realistic but also clinically relevant, as they greatly improved the prediction of different clinical endpoints across patient cohorts and cancer types, speeding up histopathology workflows and accelerating spatial biology.
Gut Microbiome Dynamics and Predictive Value in Hospitalized COVID-19 Patients: A Comparative Analysis of Shallow and Deep Shotgun Sequencing

By: Kopera, K.; Gromowski, T.; Wydmanski, W.; Skonieczna-Zydecka, K.; Muszynska, A.; Zielinska, K.; Wierzbicka-Wos, A.; Kaczmarczyk, M.; Kadaj-Lipka, R.; Cembrowska-Lech, D.; Januszkiewicz, K.; Koftis, K.; Witkiewicz, W.; Nalewajska, M.; Feret, W.; Marlicz, W.; Labaj, P. P.; Rydzewska, G.; Kosciolek, T.

The COVID-19 pandemic caused by SARS-CoV-2 has led to a wide range of clinical presentations, with respiratory symptoms being common. However, emerging evidence suggests that the gastrointestinal (GI) tract is also affected, with angiotensin-converting enzyme 2, a key receptor for SARS-CoV-2, abundantly expressed in the ileum and colon. The virus has been detected in GI tissues and fecal samples, even in cases with negative respiratory results. GI symptoms have been associated with an increased risk of ICU admission and mortality. The gut microbiome, a complex ecosystem of around 40 trillion bacteria, plays a crucial role in immunological and metabolic pathways. Dysbiosis of the gut microbiota, characterized by a loss of beneficial microbes and decreased microbial diversity, has been observed in COVID-19 patients, potentially contributing to disease severity. We conducted a comprehensive gut microbiome study in 204 hospitalized COVID-19 patients using both shallow and deep shotgun sequencing methods. We aimed to track microbiota composition changes induced by hospitalization, link these alterations to clinical procedures (antibiotics administration) and outcomes (ICU referral, survival), and assess the predictive potential of the gut microbiome for COVID-19 prognosis. Shallow shotgun sequencing was evaluated as a cost-effective diagnostic alternative for clinical settings.
miRVim: Three-dimensional miRNA Structure Data Server

By: Sahu, V. K.; Parida, A. S.; Ranjan, A.; Basu, S.

MicroRNAs (miRNAs), a distinct category of non-coding RNAs, exert multifaceted regulatory functions in a variety of organisms, including humans, animals, and plants. The inventory of identified miRNAs stands at approximately 60,000 across all species, with 1,926 identified in Homo sapiens. Their theranostic role has been explored by researchers over the last few decades, positioning them as prominent therapeutic targets as our understanding of RNA targeting advances. However, the limited availability of experimentally determined miRNA structures has constrained drug discovery efforts relying on virtual screening or computational methods, including machine learning. To address this limitation, miRVim has been developed, providing a repository of human miRNA structures derived from both two-dimensional (MXFold2, CentroidFold, and RNAFold) and three-dimensional (RNAComposer and 3dRNA) structure prediction algorithms, in addition to experimentally available structures from the RCSB PDB repository. This data server aims to facilitate computational data analysis for drug discovery, opening new avenues for advancing technologies such as machine learning-based predictions in the field. The publicly accessible structures provided by miRVim, available at https://mirna.in/miRVim, offer a valuable resource for the research community, advancing the field of miRNA-related computational analysis and drug discovery.
Identifying images in the biology literature that are problematic for people with a color-vision deficiency

By: Stevens, H. P.; Winegar, C. V.; Oakley, A. F.; Piccolo, S. R.

To help maximize the impact of scientific journal articles, authors must ensure that article figures are accessible to people with color-vision deficiencies. Up to 8% of males and 0.5% of females experience a color-vision deficiency. For deuteranopia, the most common color-vision deficiency, we evaluated images published in biology-oriented research articles between 2012 and 2022. Out of 66,253 images, 56,816 (85.6%) included at least one color contrast that could be problematic for people with moderate-to-severe deuteranopia ("deuteranopes"). However, after informal evaluations, we concluded that spatial distances and within-image labels frequently mitigated potential problems. We systematically reviewed 4,964 images, comparing each against a simulated version that approximates how it appears to deuteranopes. We identified 636 (12.8%) images that would be difficult for deuteranopes to interpret. Although still prevalent, the frequency of this problem has decreased over time. Articles from cell-oriented biology subdisciplines were most likely to be problematic. We used machine-learning algorithms to automate the identification of problematic images. For a hold-out test set of 879 additional images, a convolutional neural network classified images with an area under the receiver operating characteristic curve of 0.89. To enable others to apply this model, we created a Web application where users can upload images, view deuteranopia-simulated versions, and obtain predictions about whether the images are problematic. Such efforts are critical to ensuring the biology literature is interpretable to diverse audiences.
Generalizable features for the diagnosis of infectious disease, autoimmunity and cancer from adaptive immune receptor repertoires

By: Xu, Z.; Ismanto, H. S.; Saputri, D. S.; Haruna, S.; Wilamowski, J.; Teraguchi, S.; Sengupta, A.; Standley, D. M.

Adaptive immune receptor repertoires (AIRRs) have emerged as promising biomarkers for disease diagnosis and clinical prognosis. However, their high diversity and limited sharing between donors pose unique challenges when employing them as features for machine learning-based diagnostics. In this study, we investigate the commonly used approach of representing each receptor as a member of a "clonotype cluster". We then construct a feature vector for each donor from clonotype cluster frequencies (CCFs). We find that CCFs are sparse features and that classifiers trained on them do not generalize well to new donors. To overcome this limitation, we introduce a novel approach where we transform cluster frequencies using an adjacency matrix built from pairwise similarities of all receptors. This transformation produces a new feature, termed paratope cluster occupancies (PCOs). Leveraging publicly available AIRR datasets encompassing infectious diseases (COVID-19, HIV), autoimmune diseases (autoimmune hepatitis, type 1 diabetes), and cancer (colorectal cancer, non-small cell lung cancer), we demonstrate that PCOs exhibit lower sparsity compared to CCFs. Furthermore, we establish that classifiers trained on PCOs exhibit improved generalizability and overall classification performance (median ROC AUC 0.893) when compared to CCFs (median ROC AUC 0.714) over the six diseases. Our findings highlight the potential of utilizing PCOs as a feature representation for AIRR analysis in diverse disease contexts.
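The general idea of transforming sparse cluster frequencies with an adjacency matrix can be sketched on a toy example. This is an illustrative simplification, not the authors' PCO definition: the abstract builds the adjacency from pairwise receptor similarities, whereas this sketch assumes a hypothetical cluster-level similarity matrix and threshold, and the names `cluster_frequencies` and `smooth_with_adjacency` are invented here.

```python
import numpy as np


def cluster_frequencies(cluster_ids, n_clusters):
    """Fraction of a donor's receptors assigned to each cluster (sparse)."""
    counts = np.bincount(cluster_ids, minlength=n_clusters)
    return counts / counts.sum()


def smooth_with_adjacency(freqs, adjacency):
    """Spread each cluster's frequency onto its adjacent clusters."""
    return adjacency @ freqs


n_clusters = 5

# Hypothetical donor: three receptors, falling into clusters 0, 0 and 3.
donor_cluster_ids = np.array([0, 0, 3])
freqs = cluster_frequencies(donor_cluster_ids, n_clusters)

# Hypothetical similarity between clusters: decays with index distance.
idx = np.arange(n_clusters)
similarity = 1.0 - np.abs(idx[:, None] - idx[None, :]) / 4.0

# Assumed threshold: clusters are "adjacent" when similarity >= 0.75,
# which here links each cluster to itself and its immediate neighbours.
adjacency = (similarity >= 0.75).astype(float)

occupancies = smooth_with_adjacency(freqs, adjacency)

print("nonzero frequency entries:", np.count_nonzero(freqs))
print("nonzero occupancy entries:", np.count_nonzero(occupancies))
```

In this toy case the frequency vector has two nonzero entries while the transformed vector has none that are zero, which mirrors the lower sparsity the abstract reports for the transformed features.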
Streamlining remote nanopore data access with slow5curl

By: Wong, B.; Ferguson, J. M.; Gamaarachchi, H.; Deveson, I. W.

As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in a large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl
Benchmarking AlphaSC: A Leap in Single-Cell Data Processing

By: Vuong, H.; Luu, T.; Nguyen, N.; Tra, N.; Nguyen, H.; Nguyen, H.; Truong, T.; Pham, S.

We benchmarked AlphaSC, BioTuring's GPU-accelerated single-cell analytics package, against other popular tools including Scanpy, Seurat, and RAPIDS. The results demonstrate that AlphaSC operates thousands of times faster than Seurat and Scanpy. Additionally, it surpasses RAPIDS, another GPU-accelerated package from NVIDIA, by an order of magnitude in terms of speed while also consuming considerably less RAM and GPU memory. Importantly, this significant increase in AlphaSC's performance does not compromise its quality.
permGWAS2: Enhanced and Accelerated Permutation-based Genome-Wide Association Studies

By: John, M.; Korte, A.; Grimm, D. G.

Motivation: Permutation-based significance thresholds have been shown to be a robust alternative to Bonferroni-based significance thresholds in genome-wide association studies (GWAS). However, the implementation of permutation-based thresholds is computationally demanding. The recently published method permGWAS introduced a batch-wise approach using 4D tensors to efficiently compute permutation-based GWAS. However, running multiple univariate tests in parallel leads to many repetitive computations and increased use of computational resources. More importantly, the previous version of permGWAS does not take into account the population structure when permuting the phenotype. Results: We propose permGWAS2, an improved and accelerated version that uses a block matrix decomposition to reduce redundant computations. It also introduces an alternative permutation strategy that takes into account the population structure during permutation. We show that this improved framework provides a more streamlined approach to performing permutation-based GWAS with a lower false discovery rate compared to the previous version and the traditional Bonferroni correction. Availability: permGWAS2 is open-source and publicly available on GitHub for download: https://github.com/grimmlab/permGWAS.
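For readers unfamiliar with permutation-based thresholds, the basic idea can be sketched as follows: permute the phenotype many times, record the strongest association found in each permutation, and take a high quantile of those maxima as the significance threshold. This is a generic max-statistic sketch on synthetic data, not the permGWAS or permGWAS2 implementation. It assumes a plain correlation statistic, uses no linear mixed model and no population-structure-aware permutation, and the function names and parameters are hypothetical.

```python
import numpy as np


def abs_correlations(genotypes, phenotype):
    """Absolute Pearson correlation of every marker with the phenotype."""
    g = genotypes - genotypes.mean(axis=0)
    p = phenotype - phenotype.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (p ** 2).sum())
    return np.abs(g.T @ p) / denom


def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """(1 - alpha) quantile of the per-permutation maximum statistic."""
    rng = np.random.default_rng(seed)
    max_stats = np.empty(n_perm)
    for i in range(n_perm):
        # Naive permutation: shuffles samples freely, ignoring population structure.
        permuted = rng.permutation(phenotype)
        max_stats[i] = abs_correlations(genotypes, permuted).max()
    return np.quantile(max_stats, 1 - alpha)


# Synthetic data: 300 samples, 500 markers coded 0/1/2, one causal marker (index 10).
rng = np.random.default_rng(1)
n_samples, n_markers = 300, 500
genotypes = rng.integers(0, 3, size=(n_samples, n_markers)).astype(float)
phenotype = 1.0 * genotypes[:, 10] + rng.normal(size=n_samples)

threshold = permutation_threshold(genotypes, phenotype)
observed = abs_correlations(genotypes, phenotype)
significant = np.flatnonzero(observed > threshold)
print("threshold on |r|:", round(float(threshold), 3))
print("markers above threshold:", significant)
```

The loop recomputes every marker statistic for every permutation, which is where the computational cost described in the abstract comes from.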
Long COVID: G Protein-Coupled Receptors (GPCRs) responsible for persistent post-COVID symptoms

By: Das, S.; Kumar, S.

By early December 2022, COVID-19 had significantly affected the lives of people all around the world, with over 630 million documented cases and over 6 million deaths. A recent clinical analysis revealed that under certain conditions, a patient's disease symptoms are more likely to persist. Long COVID is characterised by many symptoms that continue long after the SARS-CoV-2 infection has resolved. This work utilised computational methods to analyse the persistence of COVID symptoms after recovery and to identify the relevant genes. Based on functional similarity, differentially expressed genes (DEGs) of SARS-CoV-2 infection and 255 symptoms of long COVID were examined, and potential genes were identified based on the rank of functional similarity. Then, hub genes were identified by analysing the interactions between proteins. Using the identified key genes and the drug-gene interaction score, FDA drugs that could serve as potential alternatives were identified. Gene ontology terms and pathways were also identified for the 255 distinct symptoms. A website (https://longcovid.omicstutorials.com/) with a list of significant genes identified as biomarkers and potential treatments for each symptom was created. All of the hub genes associated with the symptoms, GNGT1, GNG12, GNB3, GNB4, GNG13, GNG8, GNG3, GNG7, GNG10, and GNAI1, were found to be associated with G-protein coupled receptors. This demonstrates that persistent COVID infection affects various organ systems and promotes chronic inflammation following infection. CTLA4, PTPN22, KIT, KRAS, NF1, RET, and CTNNB1 were identified as the common genes that regulate T-cell immunity via GPCR and cause a variety of symptoms, including autoimmunity, cardiovascular, dermatological, general symptoms, gastrointestinal, pulmonary, reproductive, genitourinary, and endocrine symptoms (RGEM). Among other functions, they were found to be involved in the positive regulation of protein localization to the cell cortex, the regulation of triglyceride metabolism, the binding of G protein-coupled receptors, the binding of G protein-coupled serotonin receptors, the heterotrimeric G-protein complex, and the cell cortex region. These biomarker data, together with the accompanying gene ontology and pathway information, are intended to aid in determining the cause and improving the efficacy of treatment.