[GET-dev] trait-o-matic's mysql queries seem unnecessarily slow

Thu Sep 16 14:55:31 EDT 2010

For a while now I've felt Trait-o-matic is insanely slow -- I couldn't
understand why it took so long. As some of you know, I've been working on
integrating Trait-o-matic's genome analysis functionality into
GET-Evidence's code to create a single unified codebase that people can
install, use, and develop.

Well, I finally tested the issue this morning. If I run the core
"gff_dbsnp_query.py" on the command line myself, it takes 16 minutes. I
wrote a small (30 line) Perl program to perform the same matching of genome
variants to dbSNPs and it took only 1 minute. I think the difference is that
my Perl program runs through two pre-sorted files (genome variant &
dbSNP), while Trait-o-matic's Python program matches the genome variant
file, line by line, against a MySQL database containing the dbSNP data. I'm
not clear on exactly why this means we're taking a 16-fold hit.
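
To make the comparison concrete, here is a rough Python sketch of the
"walk two pre-sorted files in step" idea. It is not the Perl program
mentioned above and not Trait-o-matic code. The function name merge_match
and the (chrom, pos, ...) tuple layout are assumptions made for
illustration, and it assumes both inputs are sorted by (chromosome,
position) under the same ordering.

```python
def merge_match(variants, dbsnp):
    """Match sorted genome variants against sorted dbSNP records.

    Hypothetical record layout (an assumption, not the real file format):
      variants yields (chrom, pos, info)
      dbsnp    yields (chrom, pos, rsid)
    Both must be sorted ascending by (chrom, pos) with the same ordering.
    Yields (chrom, pos, info, rsid), with rsid set to None when no dbSNP
    record shares the variant's position.
    """
    dbsnp_iter = iter(dbsnp)
    current = next(dbsnp_iter, None)
    for chrom, pos, info in variants:
        key = (chrom, pos)
        # Skip dbSNP records that sort before this variant. Each dbSNP
        # record is read once, so the whole pass is a single linear scan.
        while current is not None and (current[0], current[1]) < key:
            current = next(dbsnp_iter, None)
        if current is not None and (current[0], current[1]) == key:
            # Only the first dbSNP record at a position is reported here.
            yield (chrom, pos, info, current[2])
        else:
            yield (chrom, pos, info, None)
```

Each input is read exactly once, front to back, so no index lookup or
per-line database query is involved. In real use the two arguments would
be generators that parse the sorted files line by line.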

I suspect all the other steps in Trait-o-matic processing can be improved in
the same way: simultaneously moving through pre-sorted files rather than
loading one into MySQL and then querying MySQL. Can anyone tell me why we
shouldn't make the change? (Please be specific; "maybe Xiaodi did it this
way for some reason we don't know" isn't helpful.) The dbSNP database is
*only* used by this gff_dbsnp_query.py program, so I can't see any other
reason for having it.

  -- Madeleine