Complete Genomics held a small, very informative meeting about processing and interpreting their current data set (version 1.3). Notably, this doesn't help with older files (I hear there are some gnarly binaries out there), but they're very open about most of the process and software tools. I missed the first of the four sessions, but I got presentation slides for all sessions and sample data sets, and I took some notes on the other three sessions.<br>

<br>Email me and I can give you a link that contains: (1) the presentation slides, (2) my notes, (3) a sample data set of chr21 data from HapMap NA19240, (4) some training documents.<br><br>Of particular note, Complete Genomics is now producing an annotation of variant impacts, akin to Trait-o-matic's amino acid change detection, that I think we should imitate. It includes things we've wanted to add (splice sites, whether an indel is frameshift or in-frame) as well as things I hadn't thought of (missense at start junction, whether an in-frame indel creates a new amino acid or not).<br>

<br>Here's my version of the different annotations a variant could have, adapted from their list:<br><br>* NO-CHANGE - Reference allele<br>

* UNTRANSCRIBED - Not in a transcribed region  (I made this category up, they have a blank value.)<br>* UNDEFINED - transcribed but not translated to protein, e.g. noncoding gene or UTR (CGI treats UTR differently, calling it "blank" along with untranscribed variants.)<br>

* COMPATIBLE - substitution in coding region, synonymous<br>* MISSENSE - substitution in coding region, non-synonymous amino acid change & not NONSENSE, NONSTOP, or MISSTART<br>* MISSTART - substitution in coding region, amino acid change at start codon (special case of MISSENSE)<br>

* NONSENSE - substitution in coding region creates premature stop codon<br>* NONSTOP - substitution in coding region changes stop codon to amino acid, creates run-through<br>* FRAMESHIFT - insertion or deletion in coding region causes frameshift<br>

* INSERT - insertion in coding region, in-frame, preserves original amino acids<br>* INSERT+ - insertion in coding region, in-frame, destroys original amino acid(s) as well as adding new ones<br>* DELETE - deletion in coding region, in-frame, does not create a new amino acid<br>

* DELETE+ - deletion in coding region, in-frame, also creates a new amino acid<br>* ACCEPTOR - variant is at splice junction, inside region defined as canonical acceptor + 13 adjacent intronic bases<br>* ACCEPTOR-DISRUPT - variant changes the canonical acceptor sequence ("AG")<br>

* DONOR - variant is at splice junction, inside region defined as canonical donor sequence ("GT") + 4 adjacent intronic bases<br>* DONOR-DISRUPT - variant changes the canonical donor sequence ("GT")<br>
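<br>To make the substitution categories above concrete, here is a minimal Python sketch of how a single-codon substitution might be sorted into NO-CHANGE, COMPATIBLE, MISSTART, NONSENSE, NONSTOP, or MISSENSE. This is my own illustration, not Complete Genomics' code: the function name `classify_substitution`, the `is_start_codon` flag, and the order in which the checks are applied are all assumptions, and it uses the standard genetic code only.<br>

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref_codon, alt_codon, is_start_codon=False):
    """Label a substitution confined to one codon of a coding region.

    Hypothetical helper: the precedence of the checks is an assumption,
    not something taken from the Complete Genomics documentation.
    """
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon == alt_codon:
        return "NO-CHANGE"
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "COMPATIBLE"   # synonymous
    if is_start_codon:
        return "MISSTART"     # amino acid change at the start codon
    if alt_aa == "*":
        return "NONSENSE"     # premature stop codon
    if ref_aa == "*":
        return "NONSTOP"      # stop codon becomes an amino acid
    return "MISSENSE"


if __name__ == "__main__":
    for ref, alt in [("GTG", "GTA"), ("AAA", "AGA"), ("AAA", "TAA"), ("TAA", "CAA")]:
        print(ref, "->", alt, classify_substitution(ref, alt))
```

A real annotator would also need the transcript model to decide whether a position is coding at all, which is what separates these labels from UNTRANSCRIBED and UNDEFINED.<br>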

<br>The difference between DELETE and DELETE+ is illustrated by the following (view in monospace font):<br><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">DELETE:</span><br style="font-family: courier new,monospace;">

<span style="font-family: courier new,monospace;">...Lys-Val-Gln... ----> ...Lys-Gln...</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">...AAA-GTG-CAG... ----> ...AAA-CAG...</span><br style="font-family: courier new,monospace;">

<span style="font-family: courier new,monospace;">       ^^^                 ^^^ ^^^ neither of these is new</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">DELETE+:</span><br style="font-family: courier new,monospace;">

<span style="font-family: courier new,monospace;">...Lys-Val-Gln... ----> ...Met-Gln...</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">...AAA-GTG-CAG... ----> ...ATG-CAG...</span><br style="font-family: courier new,monospace;">

<span style="font-family: courier new,monospace;">    ^^ ^                   ^^^ Met is new</span><br><br>Similarly, an INSERT+ is an insert that doesn't neatly occur between triplets and results in the loss of an original amino acid as well as the addition of new ones.<br>
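<br>The DELETE vs. DELETE+ and INSERT vs. INSERT+ distinction can be sketched in Python by translating the coding sequence before and after the indel and asking whether the shorter protein is just the longer one with a contiguous run of residues removed. This is only my reading of the categories, not Complete Genomics' actual rule; the names `translate`, `is_contiguous_removal`, and `classify_indel` are hypothetical, and the inputs are assumed to be in-frame coding sequence starting at a codon boundary.<br>

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate complete codons of a coding sequence; trailing bases are ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def is_contiguous_removal(longer, shorter):
    """True if `shorter` equals `longer` with one contiguous run of residues removed."""
    gap = len(longer) - len(shorter)
    if gap < 0:
        return False
    prefix = 0
    while prefix < len(shorter) and longer[prefix] == shorter[prefix]:
        prefix += 1
    return longer[prefix + gap:] == shorter[prefix:]


def classify_indel(ref_cds, alt_cds):
    """Label an insertion or deletion in a coding region (assumed criteria)."""
    diff = len(alt_cds) - len(ref_cds)
    if diff == 0:
        raise ValueError("sequences have equal length; not an indel")
    if diff % 3 != 0:
        return "FRAMESHIFT"
    ref_protein, alt_protein = translate(ref_cds), translate(alt_cds)
    if diff < 0:
        # Deletion: no new residue if the result is the reference minus a run.
        return "DELETE" if is_contiguous_removal(ref_protein, alt_protein) else "DELETE+"
    # Insertion: original residues preserved if the reference is the result minus a run.
    return "INSERT" if is_contiguous_removal(alt_protein, ref_protein) else "INSERT+"


if __name__ == "__main__":
    ref = "AAAGTGCAG"                      # Lys-Val-Gln
    print(classify_indel(ref, "AAACAG"))   # whole GTG codon removed
    print(classify_indel(ref, "ATGCAG"))   # AA + G removed across two codons
```

The two calls at the bottom correspond to the DELETE and DELETE+ examples above; an insertion placed inside a codon rather than between codons is handled the same way and comes out as INSERT+ whenever it changes an original residue.<br>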

<br>There was some suggestion that the ACCEPTOR region be broader, up to 50 bases rather than 15. I think I would also be tempted to apply an ACCEPTOR or DONOR label to variants that lie inside exons but close to splice junctions. I guess this would mean we can't apply just one label per variant, since a variant could be "MISSENSE" as well as "ACCEPTOR" when it occurs in an exon +1bp relative to the canonical acceptor site.<br>
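<br>Here is a rough Python sketch of what multi-label splice annotation could look like, returning a set of labels instead of a single one. Everything about the interface is an assumption of mine: the function name `splice_labels`, the signed `offset` convention (positive n = nth intronic base counting from the junction, negative n = nth exonic base), the `exon_margin` parameter, and the choice to report only the -DISRUPT label for the two canonical bases. The default flank sizes (13 and 4) are the ones from the list above.<br>

```python
def splice_labels(offset, site, coding_label=None,
                  acceptor_flank=13, donor_flank=4, exon_margin=0):
    """Return the set of labels for a variant near a splice junction.

    offset: signed distance from the exon/intron junction (hypothetical
            convention): 1 and 2 are the canonical intronic bases,
            -1 is the exonic base adjacent to the junction. Zero is invalid.
    site:   "ACCEPTOR" or "DONOR".
    coding_label: optional label already assigned from the coding change,
            e.g. "MISSENSE".
    exon_margin: number of exonic bases next to the junction that also
            receive the splice label (0 disables this).
    """
    if site not in ("ACCEPTOR", "DONOR"):
        raise ValueError("site must be 'ACCEPTOR' or 'DONOR'")
    if offset == 0:
        raise ValueError("offset is 1-based on both sides; 0 is not valid")

    labels = set()
    if coding_label:
        labels.add(coding_label)

    flank = acceptor_flank if site == "ACCEPTOR" else donor_flank
    if 1 <= offset <= 2:
        labels.add(site + "-DISRUPT")   # canonical AG / GT bases
    elif 3 <= offset <= 2 + flank:
        labels.add(site)                # adjacent intronic bases
    elif -exon_margin <= offset <= -1:
        labels.add(site)                # exonic bases near the junction
    return labels


if __name__ == "__main__":
    # Exonic variant 1bp from the acceptor site that is also a missense change.
    print(splice_labels(-1, "ACCEPTOR", coding_label="MISSENSE", exon_margin=1))
    # Broader acceptor region: 2 canonical + 48 flanking = 50 bases.
    print(splice_labels(40, "ACCEPTOR", acceptor_flank=48))
```

Returning a set keeps the vocabulary unchanged while letting one variant carry both a coding label and a splice label.<br>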

<br>At any rate, I think that Complete Genomics has a nice list of items we can use, and I think we might as well use the same vocabulary.<br><br>  -- Madeleine<br>