JustOrthologs: A Fast, Accurate, and User-Friendly Ortholog Identification Algorithm
Novelty of Approach
-Does not use all-versus-all BLAST comparisons
-Uses conservation in CDS region length to reduce pairwise comparisons
-Uses dinucleotide composition to further reduce runtime
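The two filters listed above can be illustrated with a minimal sketch. This is not the JustOrthologs implementation: the function names, the "shared fraction" rule, the L1 distance, and both thresholds (`min_shared_fraction`, `max_profile_distance`) are hypothetical choices made only to show how CDS region lengths and dinucleotide composition could rule out a gene pair before any costly sequence comparison.

```python
from collections import Counter
from itertools import product

# The 16 possible dinucleotides over the DNA alphabet.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_profile(seq):
    """Return the relative frequency of each of the 16 dinucleotides in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


def shared_cds_lengths(cds_a, cds_b):
    """Count CDS region lengths present in both genes (multiset intersection)."""
    lengths_a = Counter(len(region) for region in cds_a)
    lengths_b = Counter(len(region) for region in cds_b)
    return sum((lengths_a & lengths_b).values())


def is_candidate_pair(cds_a, cds_b,
                      min_shared_fraction=0.5,    # hypothetical threshold
                      max_profile_distance=0.1):  # hypothetical threshold
    """Cheap pre-filter: keep a pair only if it passes both checks."""
    if not cds_a or not cds_b:
        return False
    # Filter 1: enough CDS regions must have identical lengths.
    shared = shared_cds_lengths(cds_a, cds_b)
    if shared / max(len(cds_a), len(cds_b)) < min_shared_fraction:
        return False
    # Filter 2: dinucleotide composition must be similar (L1 distance).
    profile_a = dinucleotide_profile("".join(cds_a))
    profile_b = dinucleotide_profile("".join(cds_b))
    distance = sum(abs(x - y) for x, y in zip(profile_a, profile_b))
    return distance <= max_profile_distance
```

Each gene is represented here as a list of CDS region sequences. Only pairs that pass both inexpensive checks would go on to a more detailed comparison, which is how filters of this kind avoid all-versus-all alignment.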
Results
-Reduced ortholog identification runtime by 96%
-Maintained overall precision and accuracy
-Genes with more CDS regions have higher precision and accuracy
-Confirmed gene annotations for 384,120 genes
-Grouped 1,675,415 genes in previously unreported ortholog groups
-Identified 51,429 potentially mislabeled genes
-Annotated 622,843 ortholog groups
Implications
-Whole genome analyses are now possible
-Algorithm not based on pairwise BLAST comparisons
-Annotate more orthologous genes
-Provide functional insights for genes of unknown functions
-Phylogenetic inference
Whole Genome Comparison of Different Species
Species 1 | Species 2 | Number of Genes in Species 1 | Number of Genes in Species 2 | Number of Shared Ortholog Annotations from HGNC | True Positives Reported | False Positives Reported | Unnamed Genes Reported in Orthologous Pairs | Precision (%) | Recall (%) |
Homo sapiens | Pan paniscus | 20 088 | 17 900 | 14 653 | 14 119 | 462 | 905 | 96.83 | 96.36 |
Homo sapiens | Equus caballus | 20 088 | 16 691 | 12 725 | 8 229 | 150 | 246 | 98.21 | 64.67 |
Homo sapiens | Falco peregrinus | 20 088 | 12 643 | 10 659 | 841 | 38 | 35 | 95.68 | 7.89 |
Gallus gallus | Falco peregrinus | 16 420 | 12 643 | 9 163 | 5 132 | 139 | 597 | 97.36 | 56.01 |
Astyanax mexicanus | Danio rerio | 21 920 | 22 408 | 5 832 | 683 | 296 | 688 | 69.77 | 11.71 |
Cynoglossus semilaevis | Danio rerio | 19 450 | 22 408 | 5 699 | 199 | 104 | 205 | 65.68 | 3.49 |
Oncorhynchus kisutch | Salmo salar | 30 680 | 40 642 | 2 800 | 2 424 | 183 | 18 300 | 92.98 | 86.57 |
Oreochromis niloticus | Pundamilia nyererei | 27 785 | 21 832 | 8 645 | 8 326 | 94 | 9 857 | 98.88 | 96.31 |
Alligator mississippiensis | Crocodylus porosus | 17 492 | 13 837 | 10 993 | 10 238 | 4 | 1 615 | 99.96 | 93.13 |
Mus musculus | Rattus norvegicus | 21 815 | 21 481 | 15 199 | 12 183 | 720 | 279 | 94.42 | 80.16 |
Bos taurus | Capra hircus | 17 980 | 19 208 | 12 894 | 11 929 | 97 | 1 337 | 99.19 | 92.52 |
Bos taurus | Vicugna pacos | 17 980 | 16 297 | 11 411 | 7 991 | 18 | 502 | 99.78 | 70.03 |
Calypte anna | Haliaeetus leucocephalus | 12 225 | 14 150 | 9 825 | 7 041 | 15 | 662 | 99.79 | 71.66 |
Calypte anna | Chaetura pelagica | 12 225 | 11 852 | 8 770 | 6 565 | 14 | 695 | 99.79 | 74.86 |
Prunus avium | Prunus mume | 24 179 | 22 628 | 0 | 0 | 0 | 14 004 | N/A | N/A |
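The Precision and Recall columns above are consistent with the standard definitions, with precision taken as true positives over all reported named pairs and recall as true positives over the shared HGNC annotations. These formulas are inferred from the tabulated values rather than stated in the text, and the function name below is illustrative. A minimal sketch:

```python
def precision_recall(true_positives, false_positives, shared_annotations):
    """Return (precision %, recall %), or None where the value is undefined."""
    reported = true_positives + false_positives
    precision = 100 * true_positives / reported if reported else None
    recall = (100 * true_positives / shared_annotations
              if shared_annotations else None)
    return precision, recall


# Homo sapiens vs. Pan paniscus row
p, r = precision_recall(14119, 462, 14653)
print(f"{p:.2f} {r:.2f}")  # 96.83 96.36

# Prunus avium vs. Prunus mume row: no shared annotations, so both are N/A
print(precision_recall(0, 0, 0))  # (None, None)
```

Unnamed genes reported in orthologous pairs do not enter either ratio, since no reference annotation exists to score them against.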
Large Ortholog Groups Recovered Using JustOrthologs
Genes with the Same Annotation | Genes with Other Annotations | Genes with Unknown Annotations | Total Genes | Reason for Other Annotations |
127 | 0 | 63 | 190 | N/A |
178 | 0 | 7 | 185 | N/A |
172 | 1 | 7 | 180 | XP_018109801.1 has 100% BLAST identity with NP_001087532.1, which is annotated the same as the other 172 genes. |
155 | 2 | 21 | 178 | The nucleotide composition and exon length of XP_001959559.1 and XP_002071834.1 are similar to XP_010179458.1. However, the alignment is very different. These two genes are probably incorrectly reported as orthologous by JustOrthologs. |
169 | 0 | 9 | 178 | N/A |
169 | 1 | 5 | 175 | XP_414807.2 has a 99% BLAST identity with XP_015732072.1 from a closely related species, which is annotated the same as the other 169 genes. |
166 | 0 | 5 | 171 | N/A |
165 | 1 | 5 | 171 | NP_068697.1 is annotated Trp53inp1 instead of TP53INP1. |
163 | 1 | 6 | 170 | XP_014347657.1 is annotated LRRC8E instead of LRRC8C. |
165 | 0 | 4 | 169 | N/A |
161 | 0 | 7 | 168 | N/A |
162 | 0 | 5 | 167 | N/A |
161 | 1 | 4 | 166 | XP_020368157.1 is incorrectly reported as orthologous by JustOrthologs. The CDS region lengths matched some exons in XP_005866852.1, but the alignment of the sequences was very poor. |
163 | 0 | 3 | 166 | N/A |
152 | 1 | 13 | 166 | XP_018123052.1 is annotated grb10.L instead of GRB10. |
161 | 0 | 4 | 165 | N/A |
156 | 0 | 9 | 165 | N/A |
159 | 0 | 6 | 165 | N/A |
160 | 0 | 5 | 165 | N/A |
160 | 0 | 4 | 164 | N/A |
159 | 0 | 5 | 164 | N/A |
158 | 0 | 5 | 163 | N/A |
156 | 1 | 5 | 162 | XP_017312051.1 is incorrectly reported as orthologous by JustOrthologs. The CDS region lengths matched several exons within XP_020920808.1, but the alignment of the sequences was poor. |
156 | 0 | 5 | 161 | N/A |
158 | 0 | 3 | 161 | N/A |
153 | 0 | 7 | 160 | N/A |
149 | 0 | 9 | 158 | N/A |
154 | 0 | 3 | 157 | N/A |
146 | 0 | 11 | 157 | N/A |
153 | 0 | 4 | 157 | N/A |