VirusPredictor

VirusPredictor (version 1.0) comprises three sections: DNA sequence transformation and feature selection; training and testing of a three-class XGBoost virus prediction model; and training and testing of a six-class XGBoost subgroup prediction model. VirusPredictor first classifies each query sequence into one of three classes: infectious virus, endogenous retrovirus (ERV), or non-ERV human. Sequences predicted to be infectious viruses are then further classified into one of six virus taxonomic classes: dsDNA, ssDNA, Retro-transcribing, ssRNA(-), ssRNA(+), or dsRNA. VirusPredictor is written in Python. Please note that this version cannot recognize sequences outside these three categories (virus, ERV, and non-ERV human): every sequence is assigned to one of the three, even if it belongs to none of them. We are adding an option to mark ambiguous sequences as unknown, and we are adding new functions to distinguish other sequence categories.
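
The two-stage flow described above can be sketched as follows. This is only an illustration, not the VirusPredictor source code: the function name two_stage_predict and the two predictor callables are hypothetical stand-ins for the trained three-class and six-class XGBoost models, and the assumption that each model returns an integer class index in the order listed above is ours.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Class labels taken from the description above; the index order is an assumption.
TOP_LEVEL_CLASSES = ["infectious virus", "ERV", "non-ERV human"]
VIRUS_SUBGROUPS = ["dsDNA", "ssDNA", "Retro-transcribing",
                   "ssRNA(-)", "ssRNA(+)", "dsRNA"]


def two_stage_predict(
    features: Sequence[Sequence[float]],
    predict_three_class: Callable[[Sequence[float]], int],
    predict_six_class: Callable[[Sequence[float]], int],
) -> List[Dict[str, Optional[str]]]:
    """Run the three-class step, then the six-class step on virus calls only.

    Both predictors are hypothetical callables that map one feature vector
    to an integer class index; in practice they would wrap trained models.
    """
    results = []
    for row in features:
        top_label = TOP_LEVEL_CLASSES[predict_three_class(row)]
        subgroup = None
        # Only sequences predicted as infectious virus reach the second model.
        if top_label == "infectious virus":
            subgroup = VIRUS_SUBGROUPS[predict_six_class(row)]
        results.append({"class": top_label, "subgroup": subgroup})
    return results
```

Because the first step always returns one of the three labels, a sequence from any other source still receives one of them, which is the limitation noted above.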

Software and Dataset Download

Download: VirusPredictor Version 1.0 (rar) or VirusPredictor Version 1.0 (tar) and Software Manual and FAQ

Download: ERV sequences and Non-ERV human sequences

Update log

Recent major updates:
1.      Added support for input sequences in both FASTQ and FASTA formats; FASTQ input is automatically converted to FASTA before the next steps.
2.      Added input file and output file directories to help users easily add their query sequences and locate the prediction results.
3.      Updated software code to automatically identify an input sequence with missing nucleotides and accelerate calculation speed.
4.      Corrected several bugs in the Python code.
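
As an illustration of items 1 and 3 above, the following minimal sketch converts FASTQ records to FASTA and flags sequences with missing nucleotides. It is not the VirusPredictor implementation. The function names are hypothetical, the parser assumes unwrapped four-line FASTQ records, and treating any character other than A, C, G, or T (for example N) as a missing nucleotide is our assumption.

```python
from typing import Iterable, Iterator, Tuple


def fastq_to_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, seq, plus, _quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record at line {i + 1}")
        yield header[1:], seq.upper()


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (read_id, sequence) pairs as FASTA text; qualities are dropped."""
    return "".join(f">{read_id}\n{seq}\n" for read_id, seq in records)


def has_missing_nucleotides(seq: str) -> bool:
    """Return True if the sequence contains anything other than A, C, G, T."""
    return any(base not in "ACGT" for base in seq.upper())
```

A caller could, for example, write a warning to the output file for every record where has_missing_nucleotides returns True.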

Full update log:

Ver 1.0
08/01/2024      Added a .tar version
02/19/2024      Added three additional test files (“test_virus.fasta”, “test_ERV.fasta” and “test_non-ERV.fasta”) to the packages
12/28/2023      Updated model names to be recognized directly by main functions
11/19/2023      Adjusted input-/output-files directory setting to be easily utilized by users
10/28/2023      Updated the training step of both XGBoost models for GPU version users to accelerate model training speed
09/14/2023      Optimized the logic of the Python code
09/14/2023      Optimized the structure of the models

Ver 0.9
06/30/2023      Corrected bugs in the main Python script
05/19/2023      Tested the performance of the six-class model on test sequences in three length ranges, i.e., 150-350, 850-950, and 2,000-5,000 bp
04/14/2023      Tested the performance of the three-class model on test sequences in three length ranges, i.e., 150-350, 850-950, and 2,000-5,000 bp

Ver 0.8
02/03/2023      Added input file and output file directories to help users easily add their query sequences and locate the prediction results
01/15/2023      Corrected three bugs in the k-tuple method in the Python script

Ver 0.7
11/27/2022      Corrected a bug in the recoding method in the Python script
11/25/2022      Updated the macro average metrics in the Python code

Ver 0.6
09/20/2022      Utilized random forest algorithm to evaluate the performance of different top features to find the optimal subset of features
08/16/2022      Added sequences cut to ten graded lengths to the testing datasets
07/11/2022      Added sequences cut to ten graded lengths to the training datasets

Ver 0.5
05/16/2022      Corrected bugs for checking input files
05/08/2022      Extended the input file format to accept FASTQ format of sequences
05/08/2022      Updated the Python code to automatically identify sequences with missing nucleotides and report a warning in the output file

Ver 0.4
02/23/2022      Added macro average precision, recall, and F1 score metrics for model evaluation
02/17/2022      Applied MinMaxScaler normalization to the training and testing datasets to improve the models' accuracy

Ver 0.3
12/05/2021      Retrained the models with grid-search strategy to obtain new hyperparameters of the models
11/21/2021      Updated the non-ERV human dataset to train more powerful models

Ver 0.2
09/01/2021      Updated the dimension of input dataset and re-trained the models
08/21/2021      Added three new methods for numerically encoding sequences to obtain more information from input sequences for the models

Ver 0.1
05/09/2021      Released VirusPredictor Version 0.1 (testing version)
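
The update log refers to a k-tuple method and to MinMaxScaler normalization. The sketch below shows one common way such features are computed: overlapping k-tuple (k-mer) frequencies followed by min-max scaling of each feature column to the range [0, 1]. It is an illustration under our own assumptions, not the VirusPredictor code: the function names, the default k = 3, and the choice to skip k-tuples containing non-ACGT characters are all hypothetical.

```python
from itertools import product
from typing import Dict, List


def ktuple_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """Return the relative frequency of every overlapping k-tuple over ACGT."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-tuples containing N or other symbols are skipped.
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Scale each column to [0, 1]; constant columns are mapped to 0.0."""
    if not rows:
        return []
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

In a real training workflow the minimum and maximum would be computed on the training data and reused for the test data, which is what scikit-learn's MinMaxScaler does through its fit and transform methods.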

Note: we regularly update the software to add new functions, fix bugs, and make other improvements. If you would like to use the latest version, please send your email address to the authors so that we can notify you when new versions become available.

Citation: Guangchen Liu, Xun Chen, Yihui Luan, and Dawei Li. VirusPredictor: XGBoost-based software to predict virus-related sequences in human data. Bioinformatics. 2024. PMID: 38597887.

Questions?
The software has been tested on multiple servers and by different users. If you have any questions about installation, error messages, or interpretation of results, feel free to contact the authors: gch_liu [at] 163.com or dawei.li [at] ttuhsc.edu.

Please report any bugs to us at your earliest convenience! Thank you very much!

LI LAB Bioinformatics and Genomics