[GET-dev] Re: Latest draft

Xiaodi Wu xiaodi.wu at gmail.com
Mon May 31 16:39:29 EDT 2010


Bug 39 has been addressed; I've committed the change to my own
production branch at GitHub (the only one that doesn't have merge
issues given how much has changed since I last edited) and sent a pull
request to Tom, so he should be able to work that into his branch now.
Might be useful to share this info with the BioPython team, since the
original matrices were derived from their source, and it stands to
reason that they've been using the wrong matrix values for BLOSUM100
all along and still are.

Re: bug 22, this is a problem beyond our control. dbSNP (via the link
you sent me) agrees with Trait-o-matic and labels rs3798220 as I1891M.
This is based on NP_005568.2. As noted in that file,

"Depending on the individual, the encoded protein contains 2-43 copies
of kringle-type domains. The allele represented here contains 15
copies of the kringle-type repeats and corresponds to that found in
the reference genome sequence. [provided by RefSeq]. Sequence Note:
This gene is highly polymorphic in length and number of exons due to
variation in the number of kringle IV-2 repeats which vary from 2-43
copies among individuals. This RefSeq record was created from the
reference genome assembly based on the exon representation found in
DQ452068.1 whose sequence is consistent with the reference genome
sequence, and includes 15 copies of the kringle IV-2 repeats."

So the traditional designation of I4399M is based on a sequence with a
different number of kringle repeats, which is no longer used. There's
nothing we can do about this except to note that, in literature prior
to 2006 (apparently, that's when this record superseded the previous
one; I'm not sure how many kringles are in that one), I4399M refers to
what is currently designated I1891M.
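
To make the renumbering concrete, here is a minimal sketch of how a
legacy designation could be translated to the current one. It assumes
that the two records differ only in the number of kringle repeats, so
that every residue downstream of the repeat region is shifted by the
same constant (4399 - 1891 = 2508). That assumption is mine and has
only been checked against this single SNP; the function name is
hypothetical and not part of Trait-o-matic.

```python
import re

# Offset derived from the one known pair: I4399M (legacy) / I1891M (current).
# Assumption: it applies only to residues downstream of the kringle repeats.
LEGACY_TO_CURRENT_OFFSET = 4399 - 1891  # 2508 residues


def legacy_to_current(designation, offset=LEGACY_TO_CURRENT_OFFSET):
    """Shift a designation such as 'I4399M' into the current numbering."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", designation)
    if match is None:
        raise ValueError("expected a designation like 'I4399M'")
    ref, position, alt = match.groups()
    shifted = int(position) - offset
    if shifted < 1:
        # The position falls at or before the offset, so the simple
        # constant shift cannot describe it.
        raise ValueError("position is not covered by the constant offset")
    return f"{ref}{shifted}{alt}"


print(legacy_to_current("I4399M"))  # I1891M
```

Positions inside or upstream of the repeat region cannot be converted
this way, so anything other than a downstream residue should be
checked by hand.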


On Sun, May 30, 2010 at 1:09 PM, Madeleine Price Ball
<meprice at fas.harvard.edu> wrote:
> On Sun, May 30, 2010 at 1:28 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>> Hi Madeleine,
>>
>> I can commit the change for bug 39 (BLOSUM100) if you update me on
>> what the new workflow is in terms of the git repository that's the
>> current master, etc. It takes only a few minutes and I may as well do
>> it, since I've worked on the original.
>
> I believe this one is the master:
> http://github.com/tomclegg/trait-o-matic/
>
>> Re: your comment in bug 22
>> (amino acid positions), refFlat does have splice variants. Your best
>> bet for figuring out where the problem lies is to double-check
>> whether the SNP has been re-mapped. With every version of dbSNP (and,
>> obviously, every major release of the reference genome, but that's not
>> what we're interested in here), some of the SNP rs number--genome
>> position correlations are changed to reflect better data. This might
>> be one of those cases. Otherwise, it's hard to believe how amino acid
>> 1891 and amino acid 4399 could be confused; the alternative is that
>> they are not and really do point back to the same genome position, in
>> which case Trait-o-matic is designed to detect and look up both.
>
> Well, there aren't any splice variants for this produced by
> trait-o-matic, we checked that. When I look up the SNP on dbSNP:
> http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs3798220
>
> I don't know how dbSNP would cause the problem, but I don't think
> that's the issue. I find the same position for 36.3 reference genome
> build as in the P0 trait-o-matic output, chr6 160881127.
>
> Maybe someone needs to sit down with the refFlat file and figure out
> whether it matches UCSC annotation and whether those match the 4399
> position published.
>
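
For the refFlat check suggested above, a rough sketch of how one could
compute the amino acid number that a refFlat row implies for a genomic
position. It assumes the usual UCSC refFlat layout (tab-separated:
geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd,
exonCount, exonStarts, exonEnds, with 0-based half-open coordinates)
and a 1-based input position. The function name is hypothetical and
this is not how Trait-o-matic itself does the lookup.

```python
def amino_acid_position(refflat_line, genomic_pos):
    """Return the 1-based codon number for a 1-based genomic position,
    or None if the position is not in a coding exon of this transcript."""
    fields = refflat_line.rstrip("\n").split("\t")
    strand = fields[3]
    cds_start, cds_end = int(fields[6]), int(fields[7])
    exon_starts = [int(x) for x in fields[9].rstrip(",").split(",")]
    exon_ends = [int(x) for x in fields[10].rstrip(",").split(",")]

    pos = genomic_pos - 1  # convert to refFlat's 0-based coordinates
    if not cds_start <= pos < cds_end:
        return None  # outside the coding region (UTR or beyond)
    if not any(s <= pos < e for s, e in zip(exon_starts, exon_ends)):
        return None  # intronic

    # Count the coding bases that come before pos in transcript order:
    # to the left on the + strand, to the right on the - strand.
    if strand == "+":
        lo, hi = cds_start, pos
    else:
        lo, hi = pos + 1, cds_end
    preceding = sum(
        max(0, min(e, hi) - max(s, lo))
        for s, e in zip(exon_starts, exon_ends)
    )
    return preceding // 3 + 1
```

Running this on the refFlat row for the gene with position 160881127
on chr6 would show which numbering the file implies; the result can
then be compared with the UCSC annotation and the published position.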



