[GET-dev] Blosum100 matrix correct in GET?

Kimberly Robasky krobasky at gmail.com
Wed May 26 18:19:03 EDT 2010


FYI - [Pertsemlidis and Fondon, 2002] just cites the ftp website from
whence you can download a matrix that was calculated using the
algorithm described in:

Henikoff S, Henikoff JG. Amino acid substitution matrices from protein
blocks. Proc Natl Acad Sci USA. 1992;89:10915–10919. [PMC free
article] [PubMed]

The original BLOSUM100 was apparently in the above paper's
supplemental materials, but I can't seem to find them now.  Probably
we should cite [Henikoff & Henikoff, 1992] on GET?


On Wed, May 26, 2010 at 3:50 PM, Madeleine Price Ball <meprice at gmail.com> wrote:
> Ok, looking at substitution_matrix.py:
> http://github.com/tomclegg/trait-o-matic/blob/master/core/utils/substitution_matrix.py
>
> I'm trying to see how well using the official blosum100 compares
> against the behavior we've been getting from whatever this matrix is.
> The only thing that I think we care about is whether a particular
> transition is scored as "being over 3" (below -3, whatever).
>
> For now I'm ignoring transitions to or from termination codons. I'll
> discuss this later. So, of the 400 amino acid pairs we can look at, I
> calculate that 228 are meeting the "over 3" (below -3) threshold. For
> the official NBLOSUM100 matrix from ncbi:
> ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM100  A threshold of -4
> gets me 240 pairs, a threshold of -5 gets me 196.
>
> Good news: all 228 from the old matrix are in the set of 240 caught by
> the -4 threshold, and all 196 caught by the -5 threshold are in the
> set of 228 from the old matrix. I think we should use the new matrix
> with a -4 threshold.
>
> Kim created this ticket (which I've updated with stuff covered in this email):
> https://trac.scalablecomputingexperts.com/ticket/39
>
> I found this reference for the BLOSUM100 matrix, associated with
> NCBI's BLAST website:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC138974/
> Pertsemlidis and Fondon, "Having a BLAST with bioinformatics" Genome
> Biol. 2001; 2(10): reviews2002.1–reviews2002.10.
> Published online 2001 September 27.
>
> Regarding termination codons:
> We currently treat these as having 10 points, this is what the
> official matrix does as well. In my opinion, we should treat amino
> acid -> termination as very bad, but I don't think termination ->
> amino acid merits the same badness. BLOSUM is symmetric and scores
> these two as the same, but I don't think we should implement it this
> way. It definitely isn't worth 10 points, and I'm not even sure it
> should be worth "-4", all you're doing is tacking on extra amino acids
> onto the end of a protein ... but we can go with "-4" for now.
>
> Do we know where the termination codon score is currently implemented
> in the program? I don't see it in substitution_matrix.py
>
>  - Madeleine
>
> On Tue, May 25, 2010 at 2:04 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>> I hadn't looked into it before. Keep in mind that we use the negative
>> of the BLOSUM value; so we agree on G>A.
>>
>> As to the other ones, the discrepancy originates in the source data.
>> The values are in substitution_matrix.py, and the BLOSUM100 there
>> gives me G>V as -5 and G>R as -4. I got these values from the source
>> posted at <http://portal.open-bio.org/pipermail/biopython/2000-September/000418.html>
>> which is in turn derived, for BLOSUM100, from the no-longer
>> functioning URL
>> <http://www.embl-heidelberg.de/~vogt/matrices/blosum100.cmp>. I can't
>> explain why that source gives those values for BLOSUM100. I guess it's
>> just an inaccurate source, and we can modify it as necessary.
>>
>>
>> On Tue, May 25, 2010 at 9:10 AM, Alexander Wait Zaranek
>> <awaitz at post.harvard.edu> wrote:
>>> I seem to remember you had looked into this.
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Kimberly Robasky <krobasky at gmail.com>
>>> Date: Tue, May 25, 2010 at 10:01 AM
>>> Subject: [GET-dev] Blosum100 matrix correct in GET?
>>> To: get-dev at lists.freelogy.org
>>>
>>>
>>> I sent this to Tom & Sasha initially, and Xaodi is looking into it,
>>> but Sasha encouraged me to post it here (where Xaodi can post a response):
>>>
>>> G->A, G->V, G->R in Blosum100 from NCBI are, respectively, -1, -8, -6:
>>> ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM100
>>>
>>> But on GET, its 1, 5, 4:
>>> http://evidence.personalgenomes.org/RHO-G51A
>>> http://evidence.personalgenomes.org/RHO-G51V
>>> http://evidence.personalgenomes.org/RHO-G51R
>>>
>>> What am I missing?
>>> -Kim
>>>
>>> _______________________________________________
>>> GET-dev mailing list
>>> GET-dev at lists.freelogy.org
>>> http://lists.freelogy.org/mailman/listinfo/get-dev
>>>
>>
>> _______________________________________________
>> GET-dev mailing list
>> GET-dev at lists.freelogy.org
>> http://lists.freelogy.org/mailman/listinfo/get-dev
>>
>
> _______________________________________________
> GET-dev mailing list
> GET-dev at lists.freelogy.org
> http://lists.freelogy.org/mailman/listinfo/get-dev
>




More information about the Arvados mailing list