

ERVcaller: Identify endogenous retrovirus and other transposable element insertions


We developed a new software tool (ERVcaller) for identifying and estimating the allele frequency of non-reference transposable element insertions, particularly ERVs, from short read sequence data. It is well known that standard reference-based alignment and assembly approaches for short read sequence data generally fail to properly assemble large sequence insertions that are not present in the reference genome used. Reads derived from such insertions typically either fail to align or align anomalously.
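
As an illustration of what "anomalous" can mean in practice, the sketch below sorts SAM records into coarse categories using the standard SAM flag bits and the CIGAR string. It is a minimal example and not ERVcaller's own code: the function name `classify_read`, the category labels, and the 20 bp clipping threshold are assumptions made for this example.

```python
import re

# SAM flag bits, as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8


def classify_read(flag: int, cigar: str, min_clip: int = 20) -> str:
    """Return a coarse category for one SAM record (labels are hypothetical)."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if (flag & FLAG_PAIRED) and (flag & FLAG_MATE_UNMAPPED):
        return "one_end_unmapped"
    # Long soft clips (S) suggest a read that spans an insertion breakpoint.
    clips = [int(n) for n in re.findall(r"(\d+)S", cigar)]
    if clips and max(clips) >= min_clip:
        return "split"
    # Paired reads that the aligner did not mark as a proper pair.
    if (flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR):
        return "discordant"
    return "concordant"


if __name__ == "__main__":
    for flag, cigar in [(4, "*"), (73, "150M"), (99, "100M50S"), (97, "150M")]:
        print(flag, cigar, classify_read(flag, cigar))
```

Reads in every category except "concordant" are the kind of candidates that a tool of this type would remap against transposable element reference sequences.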

ERVcaller takes such anomalous reads and remaps them to sets of reference transposable element sequences of interest to identify the locations of insertions not present in the original reference genome. It further uses this data to determine whether each sequenced individual is homozygous for the insertion allele, heterozygous, or homozygous for the pre-insertion allele, thus allowing for estimates of population (or case-control group) allele frequencies for each insertion.
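
To make the allele frequency step concrete, the following sketch counts insertion alleles across diploid genotype calls written in VCF style. It assumes the usual VCF encoding ("0/0" for homozygous pre-insertion, "0/1" for heterozygous, "1/1" for homozygous insertion, "./." for missing); the function name `insertion_allele_frequency` is hypothetical and is not part of ERVcaller.

```python
def insertion_allele_frequency(genotypes):
    """Estimate the insertion allele frequency from diploid genotype strings.

    Missing genotypes ("./.") are excluded from both numerator and
    denominator. Returns None when no genotype is callable.
    """
    alt_alleles = 0
    total_alleles = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing call: contributes no information
        alt_alleles += sum(1 for a in alleles if a == "1")
        total_alleles += len(alleles)
    if total_alleles == 0:
        return None
    return alt_alleles / total_alleles


if __name__ == "__main__":
    cases = ["0/1", "1/1", "0/0", "./."]
    controls = ["0/0", "0/0", "0/1", "0/0"]
    print("cases:", insertion_allele_frequency(cases))
    print("controls:", insertion_allele_frequency(controls))
```

Computing the frequency separately per group, as in the usage lines, gives the case-control comparison described above.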

The overall approach is similar to that of several earlier tools for mobile element identification, such as RetroSeq. ERVcaller improves on them by combining the best aspects of multiple older tools, accepting a more diverse set of input data, and providing more detailed output, most notably genotype data. Benchmark comparisons also show notable improvements in sensitivity, precision, and/or speed over each previous approach.


ERVcaller download

Download ERVcaller Version 1.4
and software manual and FAQ

Note: we frequently update the software to add new functions, fix bugs, and make other improvements. If you would like to use the latest version, please send your email address to us (dawei.li at ttuhsc.edu) so that we can notify you when new versions become available.

Questions?

The software has been tested on multiple servers and by different users. If you have any questions about installation, error messages, or interpretation of results, feel free to contact the authors.


Recent major updates:

1) Further increased accuracy
2) Added Phred-scaled genotype quality and likelihoods
3) Significantly sped up the genotyping process
4) Added a function to distinguish missing genotypes from non-insertion (no TE insertion) genotypes in the combined VCF file for population genomics studies
5) Corrected multiple bugs


Full update log:

# Updates (v1.4):
#       02/15/2019:             Redesigned the genotyping process to increase its speed significantly
#       02/10/2019:             Added scripts to distinguish missing genotypes from non-insertion (no TE insertion) genotypes for all samples in the combined VCF file
#       02/06/2019:             Corrected the output coordinates of TE insertions with TSD
#       02/02/2019:             Further standardized the VCF format for use with bcftools
#       02/01/2019:             Added Phred-scaled genotype quality and likelihoods
#       01/29/2019:             Adjusted the length of the reciprocally aligned reference genomic region using the estimated insert size and SD, which significantly reduced false positives
#       01/29/2019:             Added a function to estimate insert size and its standard deviation (SD)
#       01/38/2019:             Corrected multiple bugs in the main Perl script
#       01/24/2019:             Corrected a bug in the script combining VCF files from multiple samples
#
# Updates (v1.3):
#       11/20/2018:             Added the scripts to merge various samples into a list of known TE loci or TE loci detected from the analyzed samples
#       11/12/2018:             Updated the output to VCF v4.2 format
#       11/05/2018:             Fixed support for BAM files generated by Bowtie2
#
# Updates (v1.2):
#       11/01/2018:             Further optimized the speed of validation steps
#       10/21/2018:             Added support for multiple BAM files as input
#       10/10/2018:             Optimized the validation steps to increase the specificity
#
# Updates (v1.1):
#       09/02/2018:             Optimized the validation steps to significantly increase the speed
#       08/28/2018:             Updated the -S parameter, which specifies the length of split reads used (20 bp by default; >=40 bp is recommended for reads of 150 bp in length)
#       08/10/2018:             Added a component to support BAM files whose chromosome IDs follow different naming conventions, such as "Chr1", "chr1", "1", and "NC_000001.11"
#       07/17/2018:             Corrected bugs in checking input files
#       07/17/2018:             Corrected errors in detecting and genotyping TE insertions using single-end sequencing data
#       07/16/2018:             Re-formatted the output files
#       07/15/2018:             Released ERVcaller Version 1.1 and software manual
#
# Release (v1.0):
#       05/27/2018:             Released ERVcaller Version 1.0 (a testing version) and software manual
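
The v1.4 log above mentions estimating the insert size and its standard deviation (SD). As a rough illustration of that idea, the sketch below computes both from the TLEN column of paired-end SAM records. It is an assumption-laden example, not ERVcaller's implementation: the function name `estimate_insert_size` and the `max_size` outlier cutoff are hypothetical.

```python
import statistics


def estimate_insert_size(tlens, max_size=10000):
    """Return (mean, sample SD) of insert sizes from SAM TLEN values.

    Only positive TLEN values are kept, so each read pair is counted once
    (the mate carries the same value with a negative sign). Values above
    max_size are dropped as likely mapping artifacts.
    """
    sizes = [t for t in tlens if 0 < t <= max_size]
    if len(sizes) < 2:
        raise ValueError("need at least two usable template lengths")
    return statistics.mean(sizes), statistics.stdev(sizes)


if __name__ == "__main__":
    tlens = [300, -300, 310, 290, 0, 500000, 300]
    mean, sd = estimate_insert_size(tlens)
    print(f"insert size: {mean:.1f} +/- {sd:.1f}")
```

An estimate of this kind can then bound how far from a candidate breakpoint supporting read pairs are expected to align.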

Citation:
 Chen X, Li D*. ERVcaller: Identifying polymorphic endogenous retrovirus and other transposable element insertions using whole-genome sequencing data. Bioinformatics. 2019 Oct 15;35(20):3913-3922. PMID: 30895294. (* corresponding author).

Please report any bugs to us at your earliest convenience!  Thank you very much!