Benchmark datasets
The following datasets, HomFam and OXFam, were used as benchmark datasets in "Application of the MAFFT sequence alignment program to large data - reexamination of the usefulness of chained guide trees".
HomFam | This HomFam dataset is a modified version of the original HomFam dataset constructed by the authors of Clustal Omega. It contains a total of 89 HOMSTRAD families as reference multiple sequence alignments, together with the corresponding Pfam sequences to be aligned. Secondary structure information from HOMSTRAD (α-helix, β-strand or 3₁₀-helix) is retained as capital letters in the reference alignments. Families containing repeated sequences were removed, and the character 'U' in the sequences was replaced with 'X'. The order of sequences was randomized in every sequence set to prevent artifacts induced by the input sequence order in the MSA calculation. |
OXFam | The OXFam dataset contains 165 OXBench reference alignments and their corresponding Pfam sequences. The construction procedure was almost the same as that of HomFam. During construction, 53 sequence families that possibly included repeated sequences, were multi-domain families, or shared homologous relationships with other families were excluded from the original 218 OXBench sequence families. As with HomFam above, information on structurally conserved regions (SCRs) is retained as capital letters in the reference alignments. The order of sequences was randomized in every sequence set to prevent artifacts induced by the input sequence order in the MSA calculation. |
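The two sequence-level steps described for both datasets, replacing 'U' with 'X' and randomizing the input order, can be sketched in Python as follows. This is only an illustration, not the authors' actual pipeline: the helper names `read_fasta` and `preprocess` and the use of a fixed seed are assumptions made for the example.

```python
import random


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def preprocess(records, seed=0):
    """Replace 'U' with 'X' and shuffle the record order (hypothetical helper)."""
    cleaned = [(name, seq.replace("U", "X")) for name, seq in records]
    rng = random.Random(seed)  # seeded so that the shuffled order is reproducible
    rng.shuffle(cleaned)
    return cleaned


example = ">a\nMKUV\n>b\nACDE\n>c\nGGUU\n"
for name, seq in preprocess(read_fasta(example), seed=1):
    print(f">{name}\n{seq}")
```

Seeding the shuffle keeps the randomized order fixed across runs, so every aligner can be given the same input file.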
Benchmark results for large-scale datasets
The following data are benchmark results for popular, high-speed multiple sequence aligners on currently available large-scale benchmark datasets.
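The result tables below group the input files into the size classes Small (0, 3000], Medium (3000, 10000] and Large (more than 10000), and also report a value over all files. A minimal Python sketch of that grouping follows. It assumes that the size is the number of sequences in a file, which the tables do not state explicitly; the function names `size_class` and `pooled_mean` are hypothetical.

```python
def size_class(n_sequences):
    """Return the size class of a file, using the ranges from the table headers."""
    if n_sequences <= 0:
        raise ValueError("expected a positive number of sequences")
    if n_sequences <= 3000:
        return "Small"   # (0, 3000]
    if n_sequences <= 10000:
        return "Medium"  # (3000, 10000]
    return "Large"       # more than 10000


def pooled_mean(class_means, class_counts):
    """Combine per-class means into one mean over all files, weighted by file count."""
    total_files = sum(class_counts.values())
    weighted = sum(class_means[c] * class_counts[c] for c in class_counts)
    return weighted / total_files


# SP scores of MAFFT FFT-NS-1 on HomFam and the file counts, copied from the table.
means = {"Small": 0.8971, "Medium": 0.8560, "Large": 0.7415}
counts = {"Small": 38, "Medium": 32, "Large": 19}

print(size_class(93), size_class(3000), size_class(93681))
print(round(pooled_mean(means, counts), 4))
```

With these inputs the pooled value rounds to 0.8491, the figure given in the All column for FFT-NS-1 on HomFam. This suggests that the All column is a mean over all files rather than a mean of the three class means.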
HomFam
Method | Small (0, 3000], 38 files | Medium (3000, 10000], 32 files | Large (10000, ∞), 19 files | All [93, 93681], 89 files |
---|---|---|---|---|
Mean SP / TC score | ||||
MAFFT - FFT-NS-1 | 0.8971 / 0.7656 | 0.8560 / 0.6448 | 0.7415 / 0.5069 | 0.8491 / 0.6669 |
MAFFT - FFT-NS-1 (memsavetree) | 0.8980 / 0.7846 | 0.8573 / 0.6469 | 0.7023 / 0.4639 | 0.8416 / 0.6667 |
MAFFT - FFT-NS-2 | 0.9074 / 0.7806 | 0.8957 / 0.7298 | 0.7795 / 0.5646 | 0.8759 / 0.7162 |
MAFFT - FFT-NS-2 (memsavetree) | 0.9085 / 0.8022 | 0.8926 / 0.7179 | 0.7134 / 0.4764 | 0.8611 / 0.7023 |
MAFFT - Randomchain | 0.8699 / 0.7315 | 0.8671 / 0.6967 | 0.7106 / 0.4932 | 0.8349 / 0.6681 |
MAFFT - PartTree (partsize=50) | 0.8443 / 0.6549 | 0.8190 / 0.5817 | 0.6148 / 0.3609 | 0.7862 / 0.5658 |
MAFFT - PartTree (partsize=1000) | 0.8840 / 0.7465 | 0.8406 / 0.6307 | 0.6844 / 0.4321 | 0.8258 / 0.6377 |
MAFFT - DPPartTree (partsize=50) | 0.8892 / 0.7610 | 0.8599 / 0.6578 | 0.7142 / 0.4601 | 0.8413 / 0.6597 |
MAFFT - DPPartTree (partsize=1000) | 0.8918 / 0.7690 | 0.8684 / 0.6914 | 0.7546 / 0.5454 | 0.8541 / 0.6934 |
MAFFT - Sparsecore (p=100) | 0.9105 / 0.7939 | 0.9004 / 0.7275 | 0.7945 / 0.5943 | 0.8821 / 0.7274 |
MAFFT - Sparsecore (p=500) | 0.9267 / 0.8315 | 0.9167 / 0.7573 | 0.8045 / 0.6148 | 0.8970 / 0.7586 |
MAFFT - Sparsecore (p=1000) | 0.9405 / 0.8628 | 0.9228 / 0.7746 | 0.8159 / 0.6283 | 0.9075 / 0.7810 |
MAFFT - Sparsecore (p=100, memsavetree) | 0.9238 / 0.8380 | 0.9094 / 0.7428 | 0.7641 / 0.5469 | 0.8845 / 0.7416 |
MAFFT - Sparsecore (p=500, memsavetree) | 0.9221 / 0.8233 | 0.9213 / 0.7782 | 0.8175 / 0.6204 | 0.8995 / 0.7638 |
MAFFT - Sparsecore (p=1000, memsavetree) | 0.9392 / 0.8599 | 0.9327 / 0.8052 | 0.7907 / 0.5899 | 0.9052 / 0.7826 |
Clustal Omega | 0.9148 / 0.8057 | 0.8693 / 0.7152 | 0.6871 / 0.4449 | 0.8498 / 0.6961 |
Clustal Omega - Full | 0.9088 / 0.8086 | 0.8806 / 0.7365 | 0.6692 / 0.4386 | 0.8475 / 0.7037 |
Clustal Omega - Randomchain | 0.8798 / 0.7580 | 0.8309 / 0.6918 | - / - | - / - |
Muscle 1 iteration | 0.8094 / 0.6640 | 0.7572 / 0.5672 | - / - | - / - |
Muscle 1 iteration - Randomchain | 0.8224 / 0.6720 | 0.8189 / 0.6771 | 0.7001 / 0.4471 | 0.7951 / 0.6258 |
Muscle 2 iteration | 0.8078 / 0.6606 | 0.6949 / 0.4645 | - / - | - / - |
Muscle 2 iteration - Randomchain | 0.8437 / 0.7053 | 0.8251 / 0.6528 | 0.7425 / 0.5274 | 0.8154 / 0.6484 |
UPP -fast | 0.8616 / 0.7466 | 0.8407 / 0.7087 | 0.7700 / 0.5853 | 0.8345 / 0.6985 |
UPP -default | 0.8678 / 0.7492 | 0.8708 / 0.7570 | 0.7956 / 0.6330 | 0.8535 / 0.7272 |
MAFFT - G-INS-1 | 0.9358 / 0.8549 | 0.9520 / 0.8480 | 0.8844 / 0.7441 | 0.9306 / 0.8288 |
Total CPU time (min) | ||||
MAFFT - FFT-NS-1 | 1.2 | 15 | 140 | 160 |
MAFFT - FFT-NS-1 (memsavetree) | 2.0 | 21 | 240 | 260 |
MAFFT - FFT-NS-2 | 2.9 | 36 | 420 | 460 |
MAFFT - FFT-NS-2 (memsavetree) | 5.6 | 66 | 910 | 990 |
MAFFT - Randomchain | 2.0 | 15 | 71 | 88 |
MAFFT - PartTree (partsize=50) | 1.5 | 11 | 35 | 47 |
MAFFT - PartTree (partsize=1000) | 3.0 | 21 | 71 | 94 |
MAFFT - DPPartTree (partsize=50) | 8.1 | 47 | 100 | 160 |
MAFFT - DPPartTree (partsize=1000) | 55 | 270 | 490 | 820 |
MAFFT - Sparsecore (p=100) | 7.4 | 61 | 580 | 650 |
MAFFT - Sparsecore (p=500) | 160 | 390 | 790 | 1300 |
MAFFT - Sparsecore (p=1000) | 810 | 2100 | 1500 | 4400 |
MAFFT - Sparsecore (p=100, memsavetree) | 9.8 | 94 | 1200 | 1300 |
MAFFT - Sparsecore (p=500, memsavetree) | 150 | 440 | 1400 | 2000 |
MAFFT - Sparsecore (p=1000, memsavetree) | 730 | 2100 | 2200 | 5000 |
Clustal Omega | 21 | 160 | 300 | 480 |
Clustal Omega - Full | 44 | 570 | 5400 | 6000 |
Clustal Omega - Randomchain | 130 | 18000 | - | - |
Muscle 1 iteration | 3.5 | 36 | - | - |
Muscle 1 iteration - Randomchain | 1.7 | 9.4 | 45 | 56 |
Muscle 2 iteration | 9.0 | 120 | - | - |
Muscle 2 iteration - Randomchain | 3.0 | 17 | 69 | 89 |
UPP -fast | 53 | 190 | 260 | 500 |
UPP -default | 360 | 1600 | 2400 | 4400 |
MAFFT - G-INS-1 | (370) | (5200) | (44000) | (49000) |
Versions:
MAFFT 7.294; Clustal Omega 1.2.1; Muscle 3.8.31; UPP 2.0
Execution commands, in the same order as the methods listed in the table above:
mafft --retree 1 --maxiterate 0 input
mafft --retree 1 --maxiterate 0 --memsavetree input
mafft input
mafft --memsavetree input
mafft --randomchain --randomseed seed input
mafft --parttree --partsize 50 input
mafft --parttree --partsize 1000 input
mafft --dpparttree --partsize 50 input
mafft --dpparttree --partsize 1000 input
mafft-sparsecore.rb -s seed -p 100 -i input
mafft-sparsecore.rb -s seed -p 500 -i input
mafft-sparsecore.rb -s seed -p 1000 -i input
mafft-sparsecore.rb -s seed -p 100 -A "--memsavetree" -i input
mafft-sparsecore.rb -s seed -p 500 -A "--memsavetree" -i input
mafft-sparsecore.rb -s seed -p 1000 -A "--memsavetree" -i input
clustalo -i input
clustalo --full -i input
clustalo --pileup -i input
muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -in input
muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -usetree randomchain -in input
muscle -maxiters 2 -in input
muscle -maxiters 2 -usetree randomchain -in input
run_upp.py -m amino -B 100 -s input
run_upp.py -m amino -s input
mafft --globalpair --thread 10 input
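The SP (sum-of-pairs) and TC (total column) scores in the table above compare each computed alignment with its reference alignment. The scoring program used for these benchmarks is not named here, so the following Python code is only a sketch of the usual definitions under stated assumptions: only residues written as capital letters in the reference are scored, columns with fewer than two scored residues are skipped, and the function names `residue_columns` and `sp_tc_scores` are hypothetical.

```python
from itertools import combinations

GAPS = "-."


def residue_columns(aligned_seq):
    """For one aligned sequence, list the column index of each residue in order."""
    return [col for col, ch in enumerate(aligned_seq) if ch not in GAPS]


def sp_tc_scores(reference, test):
    """Return (SP, TC) of `test` measured against `reference`.

    Both arguments map sequence name -> aligned sequence. `test` may contain
    additional sequences; only those named in `reference` are used.
    """
    names = list(reference)
    test_cols = {n: residue_columns(test[n]) for n in names}
    next_residue = {n: 0 for n in names}  # residue counter per reference sequence
    n_columns = len(reference[names[0]])

    pairs_total = pairs_ok = cols_total = cols_ok = 0
    for col in range(n_columns):
        placed = []  # test-alignment columns of the scored residues in this column
        for n in names:
            ch = reference[n][col]
            if ch in GAPS:
                continue
            if ch.isupper():  # assumption: lowercase residues are not scored
                placed.append(test_cols[n][next_residue[n]])
            next_residue[n] += 1
        if len(placed) < 2:
            continue
        cols_total += 1
        cols_ok += int(len(set(placed)) == 1)  # whole column reproduced
        for a, b in combinations(placed, 2):
            pairs_total += 1
            pairs_ok += int(a == b)  # this residue pair is aligned in the test
    sp = pairs_ok / pairs_total if pairs_total else 0.0
    tc = cols_ok / cols_total if cols_total else 0.0
    return sp, tc


reference = {"s1": "ACD-E", "s2": "ACDFE", "s3": "A-DFE"}
test = {"s1": "ACDE-", "s2": "ACDFE", "s3": "A-DFE"}
print(sp_tc_scores(reference, test))
```

In this toy example the last residue of `s1` is misplaced in the test alignment, so one reference column and two residue pairs are lost. TC penalizes the whole column, while SP still credits the pair that remains aligned.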
OXFam
Method | Small (0, 3000], 74 files | Medium (3000, 10000], 59 files | Large (10000, ∞), 32 files | All [19, 81503], 165 files |
---|---|---|---|---|
Mean SP / TC score | ||||
MAFFT - FFT-NS-1 | 0.9248 / 0.8936 | 0.8813 / 0.8142 | 0.7884 / 0.7018 | 0.8828 / 0.8280 |
MAFFT - FFT-NS-1 (memsavetree) | 0.9157 / 0.8840 | 0.8652 / 0.7933 | 0.7648 / 0.6963 | 0.8684 / 0.8152 |
MAFFT - FFT-NS-2 | 0.9300 / 0.9016 | 0.8853 / 0.8253 | 0.8207 / 0.7404 | 0.8928 / 0.8430 |
MAFFT - FFT-NS-2 (memsavetree) | 0.9246 / 0.8921 | 0.8869 / 0.8182 | 0.7668 / 0.6901 | 0.8805 / 0.8265 |
MAFFT - Randomchain | 0.9010 / 0.8622 | 0.8571 / 0.7989 | 0.7674 / 0.6827 | 0.8594 / 0.8048 |
MAFFT - PartTree (partsize=50) | 0.9180 / 0.8664 | 0.8246 / 0.7327 | 0.6825 / 0.5857 | 0.8389 / 0.7641 |
MAFFT - PartTree (partsize=1000) | 0.9046 / 0.8612 | 0.8299 / 0.7480 | 0.7108 / 0.6195 | 0.8403 / 0.7739 |
MAFFT - DPPartTree (partsize=50) | 0.9154 / 0.8769 | 0.8508 / 0.7779 | 0.7568 / 0.6686 | 0.8616 / 0.8011 |
MAFFT - DPPartTree (partsize=1000) | 0.9261 / 0.8843 | 0.8584 / 0.7906 | 0.7843 / 0.7099 | 0.8744 / 0.8169 |
MAFFT - Sparsecore (p=100) | 0.9438 / 0.9143 | 0.9058 / 0.8546 | 0.8364 / 0.7613 | 0.9094 / 0.8633 |
MAFFT - Sparsecore (p=500) | 0.9441 / 0.9155 | 0.9228 / 0.8807 | 0.8391 / 0.7645 | 0.9161 / 0.8738 |
MAFFT - Sparsecore (p=1000) | 0.9533 / 0.9319 | 0.9257 / 0.8897 | 0.8427 / 0.7707 | 0.9220 / 0.8855 |
MAFFT - Sparsecore (p=100, memsavetree) | 0.9483 / 0.9190 | 0.8845 / 0.8267 | 0.8480 / 0.7806 | 0.9060 / 0.8592 |
MAFFT - Sparsecore (p=500, memsavetree) | 0.9328 / 0.9030 | 0.9167 / 0.8783 | 0.8383 / 0.7694 | 0.9087 / 0.8682 |
MAFFT - Sparsecore (p=1000, memsavetree) | 0.9543 / 0.9340 | 0.9276 / 0.8896 | 0.8319 / 0.7699 | 0.9210 / 0.8863 |
Clustal Omega | 0.9257 / 0.8842 | 0.8735 / 0.8118 | 0.7409 / 0.6408 | 0.8712 / 0.8111 |
Clustal Omega - Full | 0.9244 / 0.8886 | 0.8595 / 0.7839 | 0.7440 / 0.6688 | 0.8662 / 0.8085 |
Clustal Omega - Randomchain | 0.8905 / 0.8452 | 0.8477 / 0.7888 | - / - | - / - |
Muscle 1 iteration | 0.8450 / 0.7782 | 0.6365 / 0.5220 | - / - | - / - |
Muscle 1 iteration - Randomchain | 0.8797 / 0.8268 | 0.8464 / 0.7846 | 0.6937 / 0.6067 | 0.8317 / 0.7690 |
Muscle 2 iteration | 0.8555 / 0.8000 | 0.6896 / 0.5818 | - / - | - / - |
Muscle 2 iteration - Randomchain | 0.8995 / 0.8540 | 0.8371 / 0.7719 | 0.7229 / 0.6309 | 0.8429 / 0.7814 |
UPP -fast | 0.9327 / 0.9028 | 0.8940 / 0.8535 | 0.7878 / 0.7196 | 0.8908 / 0.8496 |
UPP -default | 0.9415 / 0.9138 | 0.9068 / 0.8676 | 0.8211 / 0.7601 | 0.9057 / 0.8675 |
MAFFT - G-INS-1 | 0.9572 / 0.9358 | 0.9485 / 0.9147 | 0.8749 / 0.8212 | 0.9381 / 0.9060 |
Total CPU time (min) | ||||
MAFFT - FFT-NS-1 | 2.4 | 31 | 160 | 200 |
MAFFT - FFT-NS-1 (memsavetree) | 3.8 | 50 | 270 | 330 |
MAFFT - FFT-NS-2 | 5.5 | 81 | 470 | 560 |
MAFFT - FFT-NS-2 (memsavetree) | 11 | 170 | 1000 | 1200 |
MAFFT - Randomchain | 3.4 | 30 | 96 | 130 |
MAFFT - PartTree (partsize=50) | 2.7 | 22 | 59 | 83 |
MAFFT - PartTree (partsize=1000) | 6.2 | 51 | 140 | 190 |
MAFFT - DPPartTree (partsize=50) | 16 | 110 | 210 | 340 |
MAFFT - DPPartTree (partsize=1000) | 120 | 780 | 1500 | 2400 |
MAFFT - Sparsecore (p=100) | 11 | 130 | 660 | 800 |
MAFFT - Sparsecore (p=500) | 190 | 640 | 1000 | 1900 |
MAFFT - Sparsecore (p=1000) | 1100 | 3500 | 2800 | 7500 |
MAFFT - Sparsecore (p=100, memsavetree) | 17 | 210 | 1300 | 1500 |
MAFFT - Sparsecore (p=500, memsavetree) | 200 | 720 | 1700 | 2600 |
MAFFT - Sparsecore (p=1000, memsavetree) | 1100 | 3400 | 3300 | 7900 |
Clustal Omega | 27 | 220 | 590 | 840 |
Clustal Omega - Full | 82 | 1300 | 7100 | 8400 |
Clustal Omega - Randomchain | 170 | 4600 | - | - |
Muscle 1 iteration | 7.1 | 94 | - | - |
Muscle 1 iteration - Randomchain | 3.1 | 21 | 53 | 77 |
Muscle 2 iteration | 18 | 320 | - | - |
Muscle 2 iteration - Randomchain | 5.5 | 38 | 85 | 130 |
UPP -fast | 92 | 380 | 540 | 1000 |
UPP -default | 660 | 3300 | 4900 | 8900 |
MAFFT - G-INS-1 | (760) | (13000) | (71000) | (86000) |
Versions:
MAFFT 7.294; Clustal Omega 1.2.1; Muscle 3.8.31; UPP 2.0
Execution commands, in the same order as the methods listed in the table above:
mafft --retree 1 --maxiterate 0 input
mafft --retree 1 --maxiterate 0 --memsavetree input
mafft input
mafft --memsavetree input
mafft --randomchain --randomseed seed input
mafft --parttree --partsize 50 input
mafft --parttree --partsize 1000 input
mafft --dpparttree --partsize 50 input
mafft --dpparttree --partsize 1000 input
mafft-sparsecore.rb -s seed -p 100 -i input
mafft-sparsecore.rb -s seed -p 500 -i input
mafft-sparsecore.rb -s seed -p 1000 -i input
mafft-sparsecore.rb -s seed -p 100 -A "--memsavetree" -i input
mafft-sparsecore.rb -s seed -p 500 -A "--memsavetree" -i input
mafft-sparsecore.rb -s seed -p 1000 -A "--memsavetree" -i input
clustalo -i input
clustalo --full -i input
clustalo --pileup -i input
muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -in input
muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -usetree randomchain -in input
muscle -maxiters 2 -in input
muscle -maxiters 2 -usetree randomchain -in input
run_upp.py -m amino -B 100 -s input
run_upp.py -m amino -s input
mafft --globalpair --thread 10 input
ContTest
Method | Small (0, 3000], 15 files | Medium (3000, 10000], 70 files | Large (10000, ∞), 51 files | All [1467, 43912], 136 files |
---|---|---|---|---|
Mean ContTest score | ||||
MAFFT - FFT-NS-1 | 0.3830 | 0.4803 | 0.5231 | 0.4856 |
MAFFT - FFT-NS-1 (memsavetree) | 0.3915 | 0.4857 | 0.5076 | 0.4835 |
MAFFT - FFT-NS-2 | 0.4081 | 0.4874 | 0.5439 | 0.4998 |
MAFFT - FFT-NS-2 (memsavetree) | 0.3980 | 0.5029 | 0.5525 | 0.5099 |
MAFFT - Randomchain | 0.4406 | 0.5227 | 0.5997 | 0.5425 |
MAFFT - PartTree (partsize=50) | 0.3812 | 0.4030 | 0.4288 | 0.4103 |
MAFFT - PartTree (partsize=1000) | 0.3883 | 0.4351 | 0.4523 | 0.4364 |
MAFFT - DPPartTree (partsize=50) | 0.3723 | 0.4289 | 0.4817 | 0.4424 |
MAFFT - DPPartTree (partsize=1000) | 0.3779 | 0.4555 | 0.4988 | 0.4632 |
MAFFT - Sparsecore (p=100) | 0.3747 | 0.5005 | 0.5771 | 0.5153 |
MAFFT - Sparsecore (p=500) | 0.3883 | 0.5180 | 0.6046 | 0.5361 |
MAFFT - Sparsecore (p=1000) | 0.3808 | 0.5237 | 0.6198 | 0.5440 |
MAFFT - Sparsecore (p=100, memsavetree) | 0.3878 | 0.5143 | 0.5927 | 0.5298 |
MAFFT - Sparsecore (p=500, memsavetree) | 0.3981 | 0.5264 | 0.6107 | 0.5438 |
MAFFT - Sparsecore (p=1000, memsavetree) | 0.3535 | 0.5324 | 0.6126 | 0.5428 |
Clustal Omega | 0.3039 | 0.4291 | 0.4262 | 0.4142 |
Clustal Omega - Full | 0.3080 | 0.4585 | 0.4640 | 0.4440 |
Clustal Omega - Randomchain | 0.4328 | 0.5324 | 0.5703 | 0.5357 |
Muscle 1 iteration | 0.2701 | 0.3678 | 0.3414 | 0.3471 |
Muscle 1 iteration - Randomchain | 0.3817 | 0.5217 | 0.5957 | 0.5340 |
Muscle 2 iteration | 0.3206 | 0.3817 | 0.3254 | 0.3538 |
Muscle 2 iteration - Randomchain | 0.4442 | 0.5289 | 0.6141 | 0.5515 |
UPP -fast | 0.3515 | 0.5139 | 0.5744 | 0.5187 |
UPP -default | 0.3555 | 0.5254 | 0.5936 | 0.5323 |
MAFFT - G-INS-1 | 0.3853 | 0.5445 | 0.6582 | 0.5696 |
Total CPU time (min) | ||||
MAFFT - FFT-NS-1 | 0.48 | 20 | 150 | 170 |
MAFFT - FFT-NS-1 (memsavetree) | 0.84 | 36 | 240 | 280 |
MAFFT - FFT-NS-2 | 1.2 | 54 | 440 | 500 |
MAFFT - FFT-NS-2 (memsavetree) | 2.5 | 120 | 990 | 1100 |
MAFFT - Randomchain | 0.56 | 16 | 88 | 100 |
MAFFT - PartTree (partsize=50) | 0.44 | 11 | 50 | 61 |
MAFFT - PartTree (partsize=1000) | 1.0 | 24 | 120 | 140 |
MAFFT - DPPartTree (partsize=50) | 2.3 | 53 | 160 | 210 |
MAFFT - DPPartTree (partsize=1000) | 14 | 260 | 770 | 1000 |
MAFFT - Sparsecore (p=100) | 2.2 | 77 | 650 | 730 |
MAFFT - Sparsecore (p=500) | 30 | 250 | 930 | 1200 |
MAFFT - Sparsecore (p=1000) | 160 | 1100 | 2100 | 3400 |
MAFFT - Sparsecore (p=100, memsavetree) | 3.4 | 140 | 1300 | 1500 |
MAFFT - Sparsecore (p=500, memsavetree) | 40 | 310 | 1600 | 2000 |
MAFFT - Sparsecore (p=1000, memsavetree) | 180 | 1200 | 2800 | 4200 |
Clustal Omega | 5.0 | 130 | 460 | 600 |
Clustal Omega - Full | 16 | 830 | 5600 | 6400 |
Clustal Omega - Randomchain | 28 | 6400 | 110000 | 120000 |
Muscle 1 iteration | 2.0 | 81 | 550 | 630 |
Muscle 1 iteration - Randomchain | 0.53 | 13 | 50 | 63 |
Muscle 2 iteration | 3.9 | 230 | 2500 | 2700 |
Muscle 2 iteration - Randomchain | 0.86 | 22 | 78 | 100 |
UPP -fast | 17 | 230 | 500 | 750 |
UPP -default | 130 | 2000 | 4600 | 6700 |
MAFFT - G-INS-1 | (110) | (7100) | (48000) | (55000) |
Versions:
MAFFT 7.294; Clustal Omega 1.2.1; Muscle 3.8.31; UPP 2.0
Execution commands, in the same order as the methods listed in the table above:
mafft --retree 1 --maxiterate 0 input
mafft --retree 1 --maxiterate 0 --memsavetree input
mafft input
mafft --memsavetree input
mafft --randomchain --randomseed seed input
mafft --parttree --partsize 50 input
mafft --parttree --partsize 1000 input
mafft --dpparttree --partsize 50 input
mafft --dpparttree --partsize 1000 input
mafft-sparsecore.rb -s seed -p 100 -i input
mafft-sparsecore.rb -s seed -p 500 -i input
mafft-sparsecore.rb -s seed -p 1000 -i input
mafft-sparsecore.rb -s seed -p 100 -A "--memsavetree" -i input
mafft-sparsecore.rb -s seed -p 500 -A "--memsavetree" -i input
mafft-sparsecore.rb -s seed -p 1000 -A "--memsavetree" -i input
clustalo -i input
clustalo --full -i input
clustalo --pileup -i input
muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -in input
muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -usetree randomchain -in input
muscle -maxiters 2 -in input
muscle -maxiters 2 -usetree randomchain -in input
run_upp.py -m amino -B 100 -s input
run_upp.py -m amino -s input
mafft --globalpair --thread 10 input
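The "Total CPU time" rows sum the CPU time of one method over all files in a size class. How the times were actually measured is not described here. One possible way to collect such totals on a Unix-like system is sketched below in Python; the helper names `cpu_minutes` and `total_cpu_minutes` are hypothetical, and a trivial Python command stands in for an aligner so that the sketch runs without MAFFT installed.

```python
import os
import subprocess
import sys


def cpu_minutes(command, stdout_path=os.devnull):
    """Run `command` and return the child's user + system CPU time in minutes."""
    before = os.times()
    with open(stdout_path, "w") as out:
        subprocess.run(command, stdout=out, stderr=subprocess.DEVNULL, check=True)
    after = os.times()  # children_* fields include the child that just finished
    seconds = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return seconds / 60.0


def total_cpu_minutes(command_prefix, input_files):
    """Sum the CPU time of one method over several input files.

    Assumes the input file is the last argument, as in the commands listed above.
    """
    return sum(cpu_minutes(command_prefix + [path]) for path in input_files)


# Stand-in command; for a real run this could be e.g. ["mafft", "--retree", "1",
# "--maxiterate", "0", "input"] with the alignment written to `stdout_path`.
demo = [sys.executable, "-c", "sum(range(10**6))"]
print(f"{cpu_minutes(demo):.4f} CPU min")
```

CPU time, unlike wall-clock time, adds up the work done by all threads of the child process.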