[GET-dev] Blosum100 matrix correct in GET?

Madeleine Price Ball meprice at gmail.com
Wed May 26 15:50:02 EDT 2010


Ok, looking at substitution_matrix.py:
http://github.com/tomclegg/trait-o-matic/blob/master/core/utils/substitution_matrix.py

I'm trying to see how well using the official blosum100 compares
against the behavior we've been getting from whatever this matrix is.
The only thing that I think we care about is whether a particular
transition is scored as "being over 3" (below -3, whatever).

For now I'm ignoring transitions to or from termination codons. I'll
discuss this later. So, of the 400 amino acid pairs we can look at, I
calculate that 228 are meeting the "over 3" (below -3) threshold. For
the official NBLOSUM100 matrix from ncbi:
ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM100  A threshold of -4
gets me 240 pairs, a threshold of -5 gets me 196.

Good news: all 228 from the old matrix are in the set of 240 caught by
the -4 threshold, and all 196 caught by the -5 threshold are in the
set of 228 from the old matrix. I think we should use the new matrix
with a -4 threshold.

Kim created this ticket (which I've updated with stuff covered in this email):
https://trac.scalablecomputingexperts.com/ticket/39

I found this reference for the BLOSUM100 matrix, associated with
NCBI's BLAST website:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC138974/
Pertsemlidis and Fondon, "Having a BLAST with bioinformatics" Genome
Biol. 2001; 2(10): reviews2002.1–reviews2002.10.
Published online 2001 September 27.

Regarding termination codons:
We currently treat these as having 10 points, this is what the
official matrix does as well. In my opinion, we should treat amino
acid -> termination as very bad, but I don't think termination ->
amino acid merits the same badness. BLOSUM is symmetric and scores
these two as the same, but I don't think we should implement it this
way. It definitely isn't worth 10 points, and I'm not even sure it
should be worth "-4", all you're doing is tacking on extra amino acids
onto the end of a protein ... but we can go with "-4" for now.

Do we know where the termination codon score is currently implemented
in the program? I don't see it in substitution_matrix.py

  - Madeleine

On Tue, May 25, 2010 at 2:04 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
> I hadn't looked into it before. Keep in mind that we use the negative
> of the BLOSUM value; so we agree on G>A.
>
> As to the other ones, the discrepancy originates in the source data.
> The values are in substitution_matrix.py, and the BLOSUM100 there
> gives me G>V as -5 and G>R as -4. I got these values from the source
> posted at <http://portal.open-bio.org/pipermail/biopython/2000-September/000418.html>
> which is in turn derived, for BLOSUM100, from the no-longer
> functioning URL
> <http://www.embl-heidelberg.de/~vogt/matrices/blosum100.cmp>. I can't
> explain why that source gives those values for BLOSUM100. I guess it's
> just an inaccurate source, and we can modify it as necessary.
>
>
> On Tue, May 25, 2010 at 9:10 AM, Alexander Wait Zaranek
> <awaitz at post.harvard.edu> wrote:
>> I seem to remember you had looked into this.
>>
>>
>> ---------- Forwarded message ----------
>> From: Kimberly Robasky <krobasky at gmail.com>
>> Date: Tue, May 25, 2010 at 10:01 AM
>> Subject: [GET-dev] Blosum100 matrix correct in GET?
>> To: get-dev at lists.freelogy.org
>>
>>
>> I sent this to Tom & Sasha initially, and Xaodi is looking into it,
>> but Sasha encouraged me to post it here (where Xaodi can post a response):
>>
>> G->A, G->V, G->R in Blosum100 from NCBI are, respectively, -1, -8, -6:
>> ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM100
>>
>> But on GET, its 1, 5, 4:
>> http://evidence.personalgenomes.org/RHO-G51A
>> http://evidence.personalgenomes.org/RHO-G51V
>> http://evidence.personalgenomes.org/RHO-G51R
>>
>> What am I missing?
>> -Kim
>>
>> _______________________________________________
>> GET-dev mailing list
>> GET-dev at lists.freelogy.org
>> http://lists.freelogy.org/mailman/listinfo/get-dev
>>
>
> _______________________________________________
> GET-dev mailing list
> GET-dev at lists.freelogy.org
> http://lists.freelogy.org/mailman/listinfo/get-dev
>




More information about the Arvados mailing list