For a while now I've felt that Trait-o-matic is insanely slow -- I couldn't understand why it took so long. As some of you know, I've been working on integrating Trait-o-matic's genome analysis functionality into GET-Evidence's code to create a single unified codebase that people can install, use, and develop.<br>
<br>Well, I finally tested the issue this morning. If I run the core "gff_dbsnp_query.py" on the command line myself, it takes 16 minutes. I wrote a small (30-line) Perl program to perform the same matching of genome variants to dbSNP entries, and it took only 1 minute. I think the difference is that my Perl program runs through two pre-sorted files (genome variants and dbSNP) in parallel, while Trait-o-matic's Python program matches the genome variant file, line by line, against a MySQL database containing the dbSNP data. I'm not clear on exactly why this means we're taking a 16-fold hit.<br>
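<br>To make the pre-sorted approach concrete, here is a rough Python sketch of that kind of merge. It is not the Perl program itself, and the input layout is an assumption for illustration: tab-separated lines of chromosome, position, and a payload, with both files sorted the same way by (chromosome, position). The names keyed and merge_sorted are hypothetical. As a simplification, it reports only the first dbSNP entry at a position.<br>

```python
def keyed(lines):
    """Parse 'chrom<TAB>pos<TAB>payload' lines into ((chrom, pos), payload).

    The three-column layout is an assumption for this sketch, not the
    real GFF or dbSNP file format.
    """
    for line in lines:
        chrom, pos, payload = line.rstrip("\n").split("\t", 2)
        yield (chrom, int(pos)), payload


def merge_sorted(variants, dbsnp):
    """Yield (variant_payload, dbsnp_payload_or_None) for every variant.

    Both inputs must be sorted ascending by the same (chrom, pos) key.
    Each input is read once, front to back, with no per-line lookup.
    """
    dbsnp_iter = iter(dbsnp)
    current = next(dbsnp_iter, None)
    for key, payload in variants:
        # Skip dbSNP entries that sort before this variant.
        while current is not None and current[0] < key:
            current = next(dbsnp_iter, None)
        if current is not None and current[0] == key:
            yield payload, current[1]
        else:
            yield payload, None


# Tiny in-memory stand-ins for the two pre-sorted files.
variant_lines = ["chr1\t100\tA/G", "chr1\t250\tC/T", "chr2\t50\tG/G"]
dbsnp_lines = ["chr1\t100\trs111", "chr1\t200\trs222", "chr2\t50\trs333"]

matches = list(merge_sorted(keyed(variant_lines), keyed(dbsnp_lines)))
for variant, rsid in matches:
    print(variant, rsid)
```

<br>With real files, the two lists would be replaced by open file handles; since each file is only read forward, neither needs to fit in memory.<br>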
<br>I suspect all the other steps in Trait-o-matic processing can be improved in the same way: moving through two pre-sorted files simultaneously rather than loading one into MySQL and then querying MySQL. Can anyone tell me why we shouldn't make the change? (Please be specific; "maybe Xiaodi did it this way for some reason we don't know" isn't helpful.) The dbSNP database is *only* used by this gff_dbsnp_query.py program, so I can't see any other reason for having it.<br>
<br> -- Madeleine<br>