[GET-dev] Autoscores + Counsyl variants

Abraham Rosenbaum rosenbaum4 at gmail.com
Fri May 21 10:27:27 EDT 2010


We designate NBLOSUM=10 for stop codons, so I think we have that built
in already. In terms of a more graded system, Mike Chou has expressed
an interest in a grading system based upon predicted
protein-modification motif changes -- we can turn this into a real
effort and say we are working on it.
-Abraham

On Fri, May 21, 2010 at 10:22 AM, Kimberly Robasky <krobasky at gmail.com> wrote:
> Another suggestion - add an autoscore point to anything that creates a
> stop codon in a coding region?
>
> On Fri, May 21, 2010 at 9:53 AM, Kimberly Robasky <krobasky at gmail.com> wrote:
>> I think if your scoring values were more graded, your autoscore data
>> would be less "lumpy".
>>
>> For example, I think it's too conservative to require your BLOSUM
>> scores to be >3, so I would suggest breaking it down to 1 point
>> for >= -3 and 2 points for > 3, and maybe even boosting your
>> other scores to compensate.  Here's why:
>>
>> Look at mutations for Serine; mutating away from that S kills any
>> phosphorylation motif that might be there.  This will almost certainly
>> cause some kind of phenotype, but reading down the column of serine,
>> you won't find any >3's at all.  However, you find 16 that are >= -3.
>> That's more than any other residue. Tyrosine, the other important
>> phosphorylation residue, comes in second place with 15 scores
>> that are >-3.
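
A minimal sketch of the graded scheme suggested above, assuming the value being thresholded is the NBLOSUM-style score mentioned at the top of the thread (higher meaning a more disruptive substitution). The function name `blosum_points` is hypothetical and not part of the existing autoscore code:

```python
def blosum_points(nblosum: int) -> int:
    """Graded points for a substitution score (assumed: higher = more disruptive).

    Thresholds are the ones proposed in the message above:
    2 points for > 3, 1 point for >= -3, otherwise 0.
    """
    if nblosum > 3:
        return 2
    if nblosum >= -3:
        return 1
    return 0


if __name__ == "__main__":
    for score in (10, 4, 3, -3, -4):
        print(score, blosum_points(score))
```

Under this sketch a stop codon (NBLOSUM=10) would get 2 points, and a change scoring -3 would get 1 point rather than 0.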
>>
>> I think it's particularly worth emphasizing BLOSUM scores, given what
>> we learned from Shamil about how polyphen works, and that BLOSUM is
>> the only indicator we have for conservation (even if it's only a rough
>> one).
>>
>> -Kim
>>
>> On Fri, May 21, 2010 at 9:02 AM, Abraham Rosenbaum <rosenbaum4 at gmail.com> wrote:
>>> To help in troubleshooting:
>>> CFTR Ser1255Stop should have 6 stars (stop codon, OMIM, GeneReviews).
>>> Some of the ACADM variants have 0 stars; this gene has a GeneTests entry,
>>> is available for testing and is present in OMIM.
>>> According to the latest download we have >80,000 nsSNPs in our
>>> database (we should emphasize this point) but the variant_flat does
>>> not produce a list of splice variants or all synonymous entries. I
>>> think that it would be a good idea to get this data so that we can
>>> further de-emphasize our reliance on exons.
>>> -Abraham
>>>
>>> On Fri, May 21, 2010 at 8:53 AM, Madeleine Price Ball <meprice at gmail.com> wrote:
>>>> I've uploaded a new copy of the Counsyl variants list; there were some
>>>> ^M's (carriage returns) in there, invisible to us when making the
>>>> Google spreadsheet.
>>>> http://mad.printf.net/counsyl_variants.csv
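
One way to guard against stray ^M characters before parsing, sketched under the assumption that the list is an ordinary comma-separated file. The helper name `read_rows` is hypothetical, and the actual column layout of counsyl_variants.csv is not assumed here:

```python
import csv
import io


def read_rows(text):
    """Parse CSV text after normalizing carriage returns to newlines."""
    # Convert Windows (\r\n) and bare (\r) line endings to \n.
    cleaned = text.replace("\r\n", "\n").replace("\r", "\n")
    # Skip the empty rows that blank lines would otherwise produce.
    return [row for row in csv.reader(io.StringIO(cleaned)) if row]


if __name__ == "__main__":
    sample = "a,b\r\nc,d\r"
    print(read_rows(sample))
```

Note that this treats every bare carriage return as a line break, which may not be right if one sits inside a quoted field.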
>>>>
>>>> I guess LAMB3-R635X should have at least 5 points?
>>>> 2 for nonsense mutation
>>>> 2 for being in OMIM
>>>> 1 for being a GeneTests testable gene
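
The tally above can be sketched as a simple weighted sum. The weights are the ones listed in this message; the `Evidence` class and its field names are hypothetical, and other contributions to the real autoscore (BLOSUM, GeneReviews) are deliberately left out:

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    is_nonsense: bool      # variant creates a stop codon
    in_omim: bool          # variant is present in OMIM
    gene_testable: bool    # gene is testable per GeneTests


def tally(e: Evidence) -> int:
    """Sum the three contributions: 2 + 2 + 1 when all apply."""
    return 2 * e.is_nonsense + 2 * e.in_omim + 1 * e.gene_testable


if __name__ == "__main__":
    lamb3_r635x = Evidence(is_nonsense=True, in_omim=True, gene_testable=True)
    print(tally(lamb3_r635x))
```

With all three flags set, as assumed here for LAMB3-R635X, the sum is 5.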
>>>>
>>>> I don't know if we should worry about how "lumpy" the database data
>>>> is. Since much of the database is imported from OMIM we expect it to
>>>> have a lot of 2's -- the profile for an individual will look
>>>> different. Here's the list for PGP1:
>>>> http://mad.printf.net/PGP1_nsSNPs.csv
>>>>
>>>> It's strange for the Counsyl list to have any 0's. Sasha pointed out
>>>> yesterday that almost by definition these are genes that "have testing
>>>> available". Maybe the names aren't matching, or maybe Counsyl tests
>>>> them but they don't have per-gene testing available as listed on
>>>> GeneTests.
>>>>
>>>> Should we worry about how "lumpy" the Counsyl list looks? Sasha--any
>>>> luck on getting an HCM list from Heidi?
>>>>
>>>>     - Madeleine
>>>>
>>>> On Fri, May 21, 2010 at 12:41 AM, Tom Clegg
>>>> <tom at scalablecomputingexperts.com> wrote:
>>>>> Autoscores for all of the Counsyl variants are attached.
>>>>>
>>>>> There were a few lines that look like they were corrupted by some
>>>>> translation process (I ignored them):
>>>>>
>>>>> ",nsSNP8S
>>>>> ",nsSNP58Q
>>>>> ",nsSNP52W
>>>>>
>>>>> Distribution of autoscores for counsyl variants:  (select
>>>>> autoscore,count(variant_id) from counsyl_autoscore group by autoscore)
>>>>>
>>>>> +-----------+-------------------+
>>>>> | autoscore | count(variant_id) |
>>>>> +-----------+-------------------+
>>>>> | 0         |                33 |
>>>>> | 1         |                 4 |
>>>>> | 2         |               119 |
>>>>> | 3         |                 4 |
>>>>> | 4         |               129 |
>>>>> +-----------+-------------------+
>>>>>
>>>>> Distribution of autoscores for all variants:  (cut -f40 latest-flat.tsv |
>>>>> tail -n +2 | sort -n | uniq -c)
>>>>>   62304 0
>>>>>    3993 1
>>>>>   10473 2
>>>>>    1512 3
>>>>>    3378 4
>>>>> Presumably *some* variants should be getting scores >4 -- I'll have to look
>>>>> at this tomorrow (examples welcome).
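
For anyone without the shell tools handy, a rough Python equivalent of the cut/tail/sort/uniq pipeline above. It assumes a tab-separated file with one header line and the autoscore in column 40, as in the command; the function name `autoscore_distribution` is hypothetical:

```python
from collections import Counter


def autoscore_distribution(lines, column=40):
    """Count values in a 1-based tab-separated column, skipping the header."""
    rows = iter(lines)
    next(rows, None)  # skip the header line, like `tail -n +2`
    counts = Counter()
    for line in rows:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= column:  # ignore rows too short to have the column
            counts[fields[column - 1]] += 1
    return counts


if __name__ == "__main__":
    demo = ["id\tautoscore\n", "v1\t0\n", "v2\t2\n", "v3\t0\n"]
    for value, n in sorted(autoscore_distribution(demo, column=2).items()):
        print(n, value)
```

As a sanity check on the figures quoted above, the all-variants counts sum to 81,660 and the Counsyl counts sum to 289.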
>>>>> The "in genetests?" contribution to the above autoscores is based on whether
>>>>> the gene is *listed* in genetests, not whether its record indicates "test
>>>>> available"... contrary to what I told Madeleine today.  I've fixed that just
>>>>> now, and the scores are being recalculated.  (64 of the 836 genes in
>>>>> genetests are "no test available")
>>>>> Tom
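
The distinction between "listed in genetests" and "test available" could look something like the sketch below. The record layout (a dict with a `test_available` flag) and the function name `genetests_point` are assumptions for illustration, not the actual schema or code:

```python
def genetests_point(gene, genetests):
    """Return 1 only if the gene is listed AND its record says a test is available."""
    record = genetests.get(gene)
    if record is None:
        return 0  # gene is not in genetests at all
    return 1 if record.get("test_available", False) else 0


if __name__ == "__main__":
    # Placeholder gene names, made up for the example.
    genetests = {
        "GENE_A": {"test_available": True},
        "GENE_B": {"test_available": False},
    }
    for gene in ("GENE_A", "GENE_B", "GENE_C"):
        print(gene, genetests_point(gene, genetests))
```

The earlier behaviour described above would correspond to returning 1 for any listed gene, including the "no test available" ones.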
>>>>>
>>>>> _______________________________________________
>>>>> GET-dev mailing list
>>>>> GET-dev at lists.freelogy.org
>>>>> http://lists.freelogy.org/mailman/listinfo/get-dev
>>>>>
>>>>>



