AlzGenPred: A CatBoost based method using network features to classify the Alzheimers Disease associated genes from the high throughput sequencing data

This paper is a preprint and has not been certified by peer review.

Authors

Shukla, R.; Singh, T. R.

Abstract

Background and Objective: Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by memory loss. Advances in next-generation sequencing technologies have made an enormous amount of AD-associated genomics data available. However, how these genes are involved in AD remains an open research question, because the existing algorithms are based on statistical techniques. Therefore, AlzGenPred was developed to identify AD-associated genes from a large set of data.
Methods: To develop AlzGenPred, we compiled a benchmark dataset consisting of 1086 AD and non-AD genes and used them as the positive and negative datasets. We generated several features, including fused features, and evaluated them with machine learning (ML) methods. A hyperparameter tuning approach was then applied and the final model was selected. The proposed method was validated using the AlzGene and transcriptomics datasets and is provided as a standalone tool.
Results: A total of 13504 features from eight different encoding schemes were generated from these sequences and evaluated using 16 ML algorithms. The evaluation revealed that network-based features can classify AD genes, whereas sequence-based features cannot. We then generated 24 different fused features (6020 D) from the sequence-based features and fed them into a two-step LightGBM-based recursive feature selection method, which increased accuracy by up to 5-7%. The eight selected fused features, together with CKSAAP, were then used for hyperparameter tuning, but they showed <70% accuracy. Therefore, network-based features were used to build the CatBoost-based ML method, AlzGenPred, which achieved 96.55% accuracy and 98.99% AUROC. The developed method was tested on the AlzGene dataset, where it showed 96.43% accuracy. The model was then also validated using the transcriptomics dataset.
Conclusion: Validation of AlzGenPred using the AlzGene dataset and the transcriptomics dataset obtained from human, mouse, and ES-derived neural cells revealed that it can classify omics data and sort the AD-associated genes. The predicted genes can be used directly in the wet lab for further testing, which will reduce labor costs and time. AlzGenPred is developed as a standalone package and is available to users at https://www.bioinfoindia.org/alzgenpred/ and https://github.com/shuklarohit815/AlzGenPred.
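
As a rough illustration of one sequence encoding named in the Results, CKSAAP (composition of k-spaced amino acid pairs), the sketch below counts residue pairs separated by 0 to `max_gap` positions in a protein sequence and normalizes each count by the number of pairs observed at that gap. The function name `cksaap`, the `max_gap` default of 5, the feature key format, and the choice to skip pairs containing non-standard residues are all assumptions made for this example; they are not taken from the AlzGenPred package, whose implementation may differ.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(sequence, max_gap=5):
    """Illustrative CKSAAP encoding of a protein sequence.

    Returns a dict mapping keys such as 'AC.gap0' to the fraction of
    residue pairs at that gap that equal the given pair. Pairs that
    contain a non-standard residue are skipped (an assumption here).
    """
    seq = sequence.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = {}
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        # Residues at positions i and i + gap + 1 form one k-spaced pair.
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        for pair in pairs:
            # Sequences too short for this gap yield all-zero features.
            features[f"{pair}.gap{gap}"] = counts[pair] / total if total else 0.0
    return features
```

With these assumptions each gap contributes 400 pair frequencies, so `cksaap(seq)` with the default `max_gap=5` returns 2400 values per sequence. Such a vector could then be passed to any classifier.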
