VirusPredictor: XGBoost software to predict virus-related sequences
VirusPredictor (version 1.0) comprises three sections: DNA sequence
transformation and feature selection; training and testing of a
three-class XGBoost virus prediction model; and training and testing of
a six-class XGBoost subgroup prediction model. VirusPredictor first
classifies each query sequence into one of three classes: infectious
virus, endogenous retrovirus (ERV), or non-ERV human. Sequences
predicted as infectious virus are then further classified into one of
six virus taxonomic classes: dsDNA, ssDNA, Retro-transcribing,
ssRNA(-), ssRNA(+), or dsRNA. VirusPredictor is written in Python.
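The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the VirusPredictor code itself: the function name `two_stage_predict`, the callables `predict_three` and `predict_six`, and the integer encoding of the class labels are all assumptions made for the example. In practice the two callables would wrap the trained XGBoost models.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Class names come from the description above; the index order is assumed.
THREE_CLASSES = ["infectious virus", "ERV", "non-ERV human"]
SIX_CLASSES = ["dsDNA", "ssDNA", "Retro-transcribing",
               "ssRNA(-)", "ssRNA(+)", "dsRNA"]

Features = List[List[float]]
Predictor = Callable[[Features], List[int]]


def two_stage_predict(
    features: Features,
    predict_three: Predictor,  # hypothetical wrapper around the three-class model
    predict_six: Predictor,    # hypothetical wrapper around the six-class model
) -> List[Tuple[str, Optional[str]]]:
    """Return (three-class label, six-class label or None) for each sequence."""
    stage_one = [THREE_CLASSES[i] for i in predict_three(features)]

    # Only sequences predicted as infectious virus reach the second model.
    virus_rows = [idx for idx, label in enumerate(stage_one)
                  if label == "infectious virus"]
    subgroup: Dict[int, str] = {}
    if virus_rows:
        stage_two = predict_six([features[idx] for idx in virus_rows])
        subgroup = {idx: SIX_CLASSES[j] for idx, j in zip(virus_rows, stage_two)}

    return [(label, subgroup.get(idx)) for idx, label in enumerate(stage_one)]
```

Sequences assigned to ERV or non-ERV human receive `None` as their subgroup, which mirrors the statement that only infectious virus candidates are classified further.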
Please note that this version cannot recognize sequences outside these
three categories (virus, ERV, and non-ERV human): every sequence is
assigned to one of the three, even if it belongs to none of them. We are
adding an option to mark ambiguous sequences as unknown, and we are
expanding the software with new functions to distinguish other sequence categories.
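Until such an option is available, one common workaround is to apply a confidence threshold to the class probabilities and report low-confidence sequences as unknown. The sketch below assumes that per-class probabilities are available for each sequence; the function name `label_with_unknown` and the default threshold of 0.8 are arbitrary choices for illustration and do not describe how the authors plan to implement the feature.

```python
from typing import List, Sequence

# Label order is an assumption; it must match the model's class encoding.
LABELS = ["infectious virus", "ERV", "non-ERV human"]


def label_with_unknown(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.8,  # illustrative value, not taken from the software
) -> List[str]:
    """Assign the top class, or "unknown" if its probability is below threshold."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append(LABELS[best] if row[best] >= threshold else "unknown")
    return results
```

A threshold only flags sequences the model is unsure about. A sequence from an unseen category can still receive a confident but wrong label, so this is a partial mitigation at best.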
Software and Dataset Download
Download: VirusPredictor Version 1.0 (rar) or VirusPredictor Version 1.0 (tar), and the Software Manual and FAQ
Download: ERV sequences and Non-ERV human sequences
Update log
Recent major updates:
1. Added support for input sequences in both FASTQ and FASTA formats; FASTQ
input is automatically converted to FASTA before the next steps.
2. Added input file and output file directories to help users easily add
their query sequences and locate the prediction results.
3. Updated the code to automatically identify input sequences with
missing nucleotides and to speed up calculation.
4. Corrected several bugs in the Python code.
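As an illustration of updates 1 and 3, the sketch below converts simple four-line FASTQ records to FASTA and flags sequences that contain characters other than A, C, G, or T. It is not the code shipped with VirusPredictor: the function names are hypothetical, wrapped (multi-line) FASTQ records are not handled, and treating any non-ACGT character (such as N) as a "missing nucleotide" is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def fastq_to_fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from four-line FASTQ records."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(cleaned), 4):
        header, sequence, plus = cleaned[i], cleaned[i + 1], cleaned[i + 2]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record number {i // 4 + 1}")
        # Drop the leading "@" and discard the quality line.
        yield header[1:], sequence.upper()


def has_missing_nucleotides(sequence: str) -> bool:
    """Return True if the sequence contains anything other than A, C, G, or T."""
    return any(base not in "ACGT" for base in sequence.upper())


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as FASTA text, one sequence per line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

A caller could pass an open file handle to `fastq_to_fasta_records`, write the result of `to_fasta` to the input directory, and emit a warning for every record where `has_missing_nucleotides` returns `True`.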
Full update log:
Ver 1.0
08/01/2024 A .tar version is added
02/19/2024 Added three additional test files (“test_virus.fasta”, “test_ERV.fasta” and “test_non-ERV.fasta”) to the packages
12/28/2023 Updated model names to be recognized directly by main functions
11/19/2023 Adjusted input-/output-files directory setting to be easily utilized by users
10/28/2023 Updated the training step of both XGBoost models for GPU version users to accelerate model training speed
09/14/2023 Optimized the logic of the Python code
09/14/2023 Optimized the structure of the models
Ver 0.9
06/30/2023 Corrected bugs in the main Python script
05/19/2023 Tested the performance of the six-class model on three gradient length test sequences, i.e., 150-350, 850-950, and 2,000-5,000 bp
04/14/2023 Tested the performance of the three-class model on three gradient length test sequences, i.e., 150-350, 850-950, and 2,000-5,000 bp
Ver 0.8
02/03/2023 Added input file and output file directories to help users easily add their query sequences and locate the prediction results
01/15/2023 Corrected three bugs in the k-tuple method in the Python script
Ver 0.7
11/27/2022 Corrected a bug in the recoding method in the Python script
11/25/2022 Updated the macro average metrics in the Python code
Ver 0.6
09/20/2022 Utilized random forest algorithm to evaluate the performance of different top features to find the optimal subset of features
08/16/2022 Added ten cut length gradient sequences into the testing datasets
07/11/2022 Added ten cut length gradient sequences into the training datasets
Ver 0.5
05/16/2022 Corrected bugs for checking input files
05/08/2022 Extended the input file format to accept FASTQ format of sequences
05/08/2022 Updated the Python code to automatically identify sequences with missing nucleotides and report a warning in the output file
Ver 0.4
02/23/2022 Added macro average precision, recall, and F1 score metrics for model evaluation
02/17/2022 Utilized MinMaxScaler strategy to normalize the training and testing datasets to improve the models’ accuracies
Ver 0.3
12/05/2021 Retrained the models with grid-search strategy to obtain new hyperparameters of the models
11/21/2021 Updated the non-ERV human dataset to train more powerful models
Ver 0.2
09/01/2021 Updated the dimension of input dataset and re-trained the models
08/21/2021 Added three new sequence numerical methods to obtain more information from input sequences for the models
Ver 0.1 released
05/09/2021 Released VirusPredictor Version 0.1 (testing version)
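For readers unfamiliar with the macro average precision, recall, and F1 score metrics mentioned in the log, the sketch below computes them from true and predicted labels in plain Python. It is an independent illustration rather than the evaluation code used by VirusPredictor; the function name `macro_metrics` is hypothetical, and macro F1 is taken here as the unweighted mean of the per-class F1 scores.

```python
from typing import Dict, Sequence


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute macro-averaged precision, recall, and F1 over all observed classes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # A class with no predictions (or no true members) scores 0 by convention.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }
```

Because every class contributes equally to the average, macro metrics are not dominated by the largest class, which matters when the classes are unevenly represented in the test data.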
Note: We constantly update the software to add new functions, fix bugs,
and make other improvements. If you would like to use the latest
version, please send your email address to the authors so that we can
notify you when new versions become available.
Citation:
Guangchen Liu, Xun Chen, Yihui Luan, and Dawei Li. VirusPredictor: XGBoost-based
software to predict virus-related sequences in human data. Bioinformatics. 2024. PMID: 38597887.
Questions?
The software has been tested on multiple servers and by different users. If
you have any questions about installation, error messages, or
interpretation of results, feel free to contact the authors: gch_liu [at] 163.com or dawei.li [at] ttuhsc.edu.
Please report any bugs to us at your earliest convenience! Thank you very much!