[GET-dev] Now with tasty indels!

Madeleine Price Ball meprice at gmail.com
Thu Dec 9 16:41:18 EST 2010


Lots of improvements to GET-Evidence, and lots of new people signing up for
this list!

The biggest improvement, in terms of functionality, is analysis of
insertions and deletions in coding regions -- in other words, GET-Evidence
can now tell us if and where a frameshift occurred, or whether an amino acid
was deleted or inserted. To do this, Tom and I had to settle on a standard
method for reporting indels in the uploaded genome data, and on how to report
any amino acid changes we find. I wrote up two guides for this; you can
currently see them on my dev site:
http://mball.freelogy.org/guide_upload_and_source_file_formats
http://mball.freelogy.org/guide_amino_acid_calls

In the process I ended up deciding to use UCSC's transcript annotation,
which means there are some changes that require you to redo some
installation steps. I'm putting a more detailed explanation of what I did in
a postscript to this email.

Also, Tom has added DataTables (http://www.datatables.net/) so we can get
awesome genome reports that users can sort and filter. He's also adding some
ability to link to my.personalgenomes.org profile pages. These changes also
come with some installation steps you'll need to redo.

Presumably Tom will pull my stuff soon, so here's the combined list of
things you need to redo (I may have missed some though):
* remove the old config file: "sudo rm /home/trait/config.py"
* Do these again (should create new config.py and run setup-external-data):
   cd ~/get-evidence/server/script/
   source config-local.sh
   sudo -u $USER ./install-user.sh
* edit /home/trait/config.py to have the SQL database password (same as in
public_html/config.py)
* re-run "make install" (to get DataTables)
* go to "install.php" again (to get SQL tables updated)

There are other new items in the INSTALL. I'm not sure what else might need
to be done. As always, feel free to post an email here if you have problems.

   - Madeleine

P.S. More details about what I did.

I've updated gff_nonsynonymous_filter_from_file.py. Previously it would
just take 3 bases, translate them to an amino acid, then compare to see if
the amino acid changed. Now it builds the whole coding sequence for both the
reference and variant genotypes, translates each one, then calls amino acid
differences between the two. In the process I added a sanity check: it turns
out a lot of refFlat transcript annotations are not "sane" -- many of them,
for example, have some frameshift mistake that results in stop codons
throughout the sequence. So now we throw out any transcript whose reference
translation contains such stop codons -- a long sequence is *very* unlikely
to be free of stop codons by chance, so passing that check is a good sign
that the transcript is good.
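
To make the idea concrete, here is a minimal sketch in Python. It is not the
actual code in gff_nonsynonymous_filter_from_file.py; the function names
(translate, is_sane_transcript, first_difference) are made up for
illustration, and the exact form of the sanity check (no stop codon before
the last codon) is an assumption based on the description above.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence codon by codon ('*' = stop, 'X' = unknown).

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def is_sane_transcript(reference_cds):
    """Assumed sanity check: no stop codon anywhere before the final codon."""
    protein = translate(reference_cds)
    return bool(protein) and "*" not in protein[:-1]


def first_difference(ref_protein, var_protein):
    """Return (1-based position, ref residue, variant residue) or None."""
    for pos, (r, v) in enumerate(zip(ref_protein, var_protein), start=1):
        if r != v:
            return pos, r, v
    if len(ref_protein) != len(var_protein):
        n = min(len(ref_protein), len(var_protein))
        return n + 1, ref_protein[n:n + 1] or "-", var_protein[n:n + 1] or "-"
    return None


reference = "ATGGCTGAATGA"   # toy coding sequence: Met-Ala-Glu-stop
substituted = "ATGGCTGATTGA"  # single base change in the third codon
frameshifted = "ATGGTGAATGA"  # one base deleted from the second codon

if is_sane_transcript(reference):
    print(first_difference(translate(reference), translate(substituted)))
    print(first_difference(translate(reference), translate(frameshifted)))
```

Translating the whole sequence, rather than one codon at a time, is what lets
a deletion show up as a change in every downstream residue instead of being
missed or misreported as a single substitution.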

That said, the stuff I was getting out of it was still making me unhappy --
e.g. frameshifts called from one of ten annotated transcripts when the
variant was intronic in the other nine. Based on a recommendation from Ivan
Adzhubey I decided we needed to move to UCSC's annotation. One great thing
about this
is that UCSC has annotated "canonical transcripts". Previously we'd report
several different amino acid variants for a single genetic change from
different transcripts -- I think this kind of made things a mess, since you
only want one variant page in the end. Using canonical transcripts solves
that.

But attaching gene names to the UCSC transcripts was a bit of a mess. I
added a download of HGNC canonical gene names. Where possible I used that to
directly infer the gene name for a transcript. Otherwise I used refFlat to
connect transcript ID to gene name, as long as the gene name was on HGNC's
list and matched the correct chromosome (a weak sanity check, but we had none
at all before). Failing those, I used kgXref's suggested gene name, again
checking against the HGNC list and chromosome. If all of these failed, I used
the transcript ID as the "gene name".
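
The fallback order described above could be sketched like this in Python.
This is only an illustration: the function name, the dictionary-based lookup
tables, and the example transcript IDs are all hypothetical, and the real
code reads these mappings from the downloaded HGNC, refFlat and kgXref files.

```python
def resolve_gene_name(transcript_id, chrom, hgnc_direct, refflat, kgxref, hgnc_chrom):
    """Pick a gene name for a transcript using the fallback order above.

    All arguments after `chrom` are hypothetical lookup tables:
      hgnc_direct: transcript ID -> gene name taken directly from HGNC
      refflat:     transcript ID -> gene name according to refFlat
      kgxref:      transcript ID -> gene name suggested by kgXref
      hgnc_chrom:  HGNC gene name -> chromosome
    """
    # 1. Direct HGNC mapping, where one exists.
    name = hgnc_direct.get(transcript_id)
    if name:
        return name
    # 2. refFlat, then 3. kgXref; accept a name only if it is on the HGNC
    #    list and HGNC places it on the same chromosome as the transcript.
    for source in (refflat, kgxref):
        candidate = source.get(transcript_id)
        if candidate and hgnc_chrom.get(candidate) == chrom:
            return candidate
    # 4. Nothing usable: fall back to the transcript ID itself.
    return transcript_id


# Toy data, invented for the example.
hgnc_chrom = {"GENE1": "chr1", "GENE2": "chr2"}
hgnc_direct = {"tx_a": "GENE1"}
refflat = {"tx_b": "GENE2", "tx_c": "GENE2"}
kgxref = {"tx_c": "GENE1", "tx_d": "NOT_IN_HGNC"}

for tx, chrom in [("tx_a", "chr1"), ("tx_b", "chr2"), ("tx_c", "chr1"), ("tx_d", "chr1")]:
    print(tx, resolve_gene_name(tx, chrom, hgnc_direct, refflat, kgxref, hgnc_chrom))
```

Here tx_c shows the chromosome check at work: the refFlat name is rejected
because HGNC places it on a different chromosome, so the kgXref name is used
instead.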