[GET-dev] trait-o-matic's mysql queries seem unnecessarily slow

Thu Sep 16 15:30:50 EDT 2010

> I wrote code for another project that does exactly this.  In that
> project, UNIX sort / IO / compression was the bottleneck for
> data-processing.

It takes 1 minute 24 seconds to sort the dbSNP file. That only needs to be
done once, at installation, and I doubt it takes any longer than loading the
file into a MySQL database.

It takes 12 seconds to sort the genome file. Also, you are probably going to
re-use that sorted file in other parts of genome processing that can be
improved in this same way.
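
As a rough sketch of why pre-sorting helps (this is not the actual
Trait-o-matic code): once both files are sorted on the same key, a single
pass over each is enough to find the matching positions, with no database
lookups. The function name merge_join, the (chromosome, position) keys, and
the sample records below are all made up for illustration. The sketch
assumes each genome position appears only once, that dbSNP may list several
entries per position, and that both inputs are sorted in the same order
Python uses to compare the keys.

```python
def merge_join(genome, dbsnp):
    """Yield (key, genome_value, dbsnp_value) for keys found in both inputs.

    Both inputs are iterables of (key, value) pairs sorted ascending by key.
    Keys in `genome` are assumed unique; `dbsnp` may repeat a key.
    """
    genome_iter, dbsnp_iter = iter(genome), iter(dbsnp)
    g = next(genome_iter, None)
    d = next(dbsnp_iter, None)
    while g is not None and d is not None:
        if g[0] < d[0]:
            # Genome position has no dbSNP entry; move on in the genome file.
            g = next(genome_iter, None)
        elif g[0] > d[0]:
            # dbSNP entry is not in this genome; move on in the dbSNP file.
            d = next(dbsnp_iter, None)
        else:
            # Same position in both files. Advance only the dbSNP side so
            # that further dbSNP entries at this position also get matched.
            yield g[0], g[1], d[1]
            d = next(dbsnp_iter, None)


# Hypothetical sample data: ((chromosome, position), value)
genome_rows = [(("chr1", 100), "A/G"), (("chr1", 250), "C/T"), (("chr2", 50), "G/G")]
dbsnp_rows = [(("chr1", 100), "rs1"), (("chr1", 200), "rs2"),
              (("chr2", 50), "rs3"), (("chr2", 50), "rs4")]

for key, genotype, rsid in merge_join(genome_rows, dbsnp_rows):
    print(key, genotype, rsid)
```

In practice the two inputs would be generators that parse the sorted files
line by line, so neither file has to fit in memory.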

> Databases are almost always slower than flat-files when you're not
> doing any writing.  I would also expect to see significant
> performance gains using code that is specifically designed to our data
> - in this case, pre-sorting is a big win!  If you want to make an even
> faster app, try using C to read in a binary file instead of ASCII
> data; however, then you lose the advantage of human-readable files.
>

Yes, as far as I can tell we aren't doing any writing to the MySQL dbsnp
database, nor any other MySQL databases used by Trait-o-matic's genome
processing. GET-Evidence's evidence database is obviously a different
issue.  :-)

For now I'm happy to keep the human-readable files: I know how to write the
Perl programs, and it's an easy and large improvement. (I'll probably learn
enough to do it in Python instead, but that's still not too hard.) Anything in
C and binary would take a lot of learning on my part, and I don't know how
significant the gain would be.

  -- Madeleine