[GET-dev] Complete Genomics meeting notes
Madeleine Price Ball
meprice at gmail.com
Tue Sep 14 15:26:02 EDT 2010
Complete Genomics held a small, very informative meeting about processing
and interpreting their current data set (version 1.3). Notably this doesn't
help with older files (I hear there are some gnarly binaries out there), but
they're very open about most of the process and software tools. I missed the
first of the four sessions, but I got presentation slides for all sessions
and sample data sets, and took some notes on the other three sessions.
Email me and I can give you a link that contains: (1) the presentation
slides, (2) my notes, (3) a sample data set of chr21 data from HapMap
NA19240, (4) some training documents.
Of particular note, Complete Genomics is now producing an annotation of
variant impacts, akin to Trait-o-matic's amino acid change detection, that I
think we should imitate. It includes things we've wanted to add (splice
sites, whether an indel is frameshift or in-frame) as well as things I
hadn't thought of (missense at start junction, whether an in-frame indel
creates a new amino acid or not).
Here's my version of the different annotations a variant could have, adapted
from their list:
* NO-CHANGE - Reference allele
* UNTRANSCRIBED - Not in a transcribed region (I made this category up,
they have a blank value.)
* UNDEFINED - transcribed but not translated to protein, e.g. noncoding gene
or UTR (CGI treats UTR differently, calling it "blank" along with
untranscribed variants.)
* COMPATIBLE - substitution in coding region, synonymous
* MISSENSE - substitution in coding region, non-synonymous amino acid change
& not NONSENSE, NONSTOP, or MISSTART
* MISSTART - substitution in coding region, amino acid change at start codon
(special case of MISSENSE)
* NONSENSE - substitution in coding region creates premature stop codon
* NONSTOP - substitution in coding region changes stop codon to amino acid,
creates run-through
* FRAMESHIFT - insertion or deletion in coding region causes frameshift
* INSERT - insertion in coding region, in-frame, preserves original amino
acids
* INSERT+ - insertion in coding region, in-frame, destroys original amino
acid(s) as well as adding new ones
* DELETE - deletion in coding region, in-frame, does not create a new amino
acid
* DELETE+ - deletion in coding region, in-frame, also creates a new amino
acid
* ACCEPTOR - variant is at splice junction, inside region defined as
canonical acceptor sequence ("AG") + 13 adjacent intronic bases
* ACCEPTOR-DISRUPT - variant changes the canonical acceptor sequence ("AG")
* DONOR - variant is at splice junction, inside region defined as canonical
donor sequence ("GT") + 4 adjacent intronic bases
* DONOR-DISRUPT - variant changes the canonical donor sequence ("GT")
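As a rough illustration of how the substitution categories above could be told apart, here is a minimal Python sketch. It is not CGI's or Trait-o-matic's code: the function name `classify_substitution` and the `is_start_codon` flag are hypothetical, it assumes the standard genetic code and codons already read on the coding strand, and the order of the checks (for example, treating any amino acid change at the start codon as MISSTART) is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref_codon, alt_codon, is_start_codon=False):
    """Label a single-codon substitution (hypothetical helper).

    ref_codon / alt_codon: three-letter DNA codons on the coding strand.
    is_start_codon: True if this codon is the transcript's start codon.
    """
    if ref_codon == alt_codon:
        return "NO-CHANGE"
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "COMPATIBLE"   # synonymous
    if is_start_codon:
        return "MISSTART"     # assumption: checked before NONSENSE
    if alt_aa == "*":
        return "NONSENSE"     # premature stop
    if ref_aa == "*":
        return "NONSTOP"      # stop codon lost, read-through
    return "MISSENSE"


if __name__ == "__main__":
    print(classify_substitution("CAG", "CAA"))        # Gln -> Gln
    print(classify_substitution("CAG", "TAG"))        # Gln -> stop
    print(classify_substitution("TAG", "CAG"))        # stop -> Gln
    print(classify_substitution("ATG", "GTG", True))  # change at start codon
```

UNTRANSCRIBED, UNDEFINED, and the splice labels depend on gene-model coordinates rather than codon content, so they are left out of this sketch.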
The difference between DELETE and DELETE+ is illustrated by the following
(view in monospace font):
DELETE:
  ...Lys-Val-Gln...  ---->  ...Lys-Gln...
  ...AAA-GTG-CAG...  ---->  ...AAA-CAG...
         ^^^                   ^^^ ^^^  neither of these is new
DELETE+:
  ...Lys-Val-Gln...  ---->  ...Met-Gln...
  ...AAA-GTG-CAG...  ---->  ...ATG-CAG...
      ^^ ^                     ^^^      Met is new
Similarly, an INSERT+ is an insert that doesn't neatly occur between
triplets and results in the loss of an original amino acid as well as the
addition of new ones.
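One way to sketch the DELETE/DELETE+ and INSERT/INSERT+ distinction in Python is to translate the reference and variant coding sequences and ask whether the shorter protein is just the longer one with a contiguous run of amino acids removed. This is only an illustration under stated assumptions: `classify_indel`, `translate`, and `is_contiguous_removal` are hypothetical names, both sequences are assumed to start on a codon boundary, and the "contiguous removal" test is an interpretation of "does not create a new amino acid", not CGI's documented rule.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame coding sequence, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def is_contiguous_removal(longer, shorter):
    """True if `shorter` equals `longer` with one contiguous block cut out."""
    gap = len(longer) - len(shorter)
    return any(longer[:i] + longer[i + gap:] == shorter
               for i in range(len(shorter) + 1))


def classify_indel(ref_cds, alt_cds):
    """Label an insertion or deletion given reference and variant coding DNA."""
    diff = len(alt_cds) - len(ref_cds)
    if diff == 0:
        raise ValueError("not an insertion or deletion")
    if diff % 3 != 0:
        return "FRAMESHIFT"
    ref_prot, alt_prot = translate(ref_cds), translate(alt_cds)
    if diff < 0:
        return "DELETE" if is_contiguous_removal(ref_prot, alt_prot) else "DELETE+"
    return "INSERT" if is_contiguous_removal(alt_prot, ref_prot) else "INSERT+"


if __name__ == "__main__":
    print(classify_indel("AAAGTGCAG", "AAACAG"))  # Lys-Val-Gln -> Lys-Gln
    print(classify_indel("AAAGTGCAG", "ATGCAG"))  # Lys-Val-Gln -> Met-Gln
```

With the Lys-Val-Gln sequence, removing the whole GTG codon leaves Lys-Gln (DELETE), while removing three bases that straddle the first two codons leaves Met-Gln (DELETE+).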
There was some suggestion that the ACCEPTOR region be broader, like up to 50
bases rather than 15. I think I would also be tempted to apply an ACCEPTOR
or DONOR label to variants that are inside exons but occur near splice
junctions. I guess this would mean we can't apply just one label per
variant, since a variant could be "MISSENSE" as well as "ACCEPTOR" when it
occurs in an exon +1bp relative to the canonical acceptor site.
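A small Python sketch of what multi-label splice annotation might look like follows. Everything here is hypothetical: the function `splice_labels`, the 0-based half-open intron coordinates on the plus strand, and especially the `exon_flank` parameter, which implements the "also label nearby exonic bases" idea and is not part of the CGI list. Whether a DISRUPT call should replace or accompany the plain ACCEPTOR/DONOR label is also an assumption; this sketch returns only the more specific label.

```python
def splice_labels(pos, intron_start, intron_end,
                  acceptor_len=15, donor_len=6, exon_flank=0):
    """Return the set of splice labels for a substitution at 0-based `pos`.

    The intron spans [intron_start, intron_end) on the plus strand, so the
    canonical donor ("GT") is its first two bases and the canonical acceptor
    ("AG") its last two. acceptor_len=15 is AG + 13 intronic bases and
    donor_len=6 is GT + 4 intronic bases. exon_flank (hypothetical) extends
    both regions by that many exonic bases. Assumes the intron is long
    enough that the donor and acceptor regions do not overlap.
    """
    labels = set()
    if intron_start <= pos < intron_start + 2:
        labels.add("DONOR-DISRUPT")
    elif intron_start - exon_flank <= pos < intron_start + donor_len:
        labels.add("DONOR")
    if intron_end - 2 <= pos < intron_end:
        labels.add("ACCEPTOR-DISRUPT")
    elif intron_end - acceptor_len <= pos < intron_end + exon_flank:
        labels.add("ACCEPTOR")
    return labels


if __name__ == "__main__":
    # A missense variant in the first exonic base after an intron at [100, 200):
    coding_label = {"MISSENSE"}
    print(coding_label | splice_labels(200, 100, 200, exon_flank=1))
```

Returning a set rather than a single string is what lets a variant carry both "MISSENSE" and "ACCEPTOR" at once.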
At any rate, I think that Complete Genomics has a nice list of items we can
use, and I think we might as well use the same vocabulary.
-- Madeleine