[GET-dev] Blosum100 matrix correct in GET?
xiaodi.wu at gmail.com
Wed May 26 18:54:59 EDT 2010
Termination is hardcoded into the system. That'll be hard to modify.
The reason we use a symmetric matrix is because it's not possible to
know for sure whether the ancestral allele or the newer allele is the
stop codon without other info (e.g. phylogenetic calculations); so
symmetry is used so that scoring is not contingent upon that
determination. You can always throw in a guess based on allele
frequencies, but then you may as well go the whole way and leverage
the full advantage of knowing allele frequencies (it's a good proxy
for how deleterious the mutation might be). I did implement precisely
that on snp-dev at one point and showed last year that it was many
times better than just using a substitution matrix and comparable with
many more complicated techniques like SIFT. This change didn't quite
make it into the mainline repository, though, so it's been written
As to the former question, we use the negative because the score is
meant to indicate how deleterious we predict the mutation to be.
Hence, more deleterious should intuitively line up with a larger
number. So I just took the additive inverse of the BLOSUM.
On Wed, May 26, 2010 at 5:19 PM, Kimberly Robasky <krobasky at gmail.com> wrote:
> FYI - [Pertsemlidis and Fondon, 2002] just cites the ftp website from
> whence you can download a matrix that was calculated using the
> algorithm described in:
> Henikoff S, Henikoff JG. Amino acid substitution matrices from protein
> blocks. Proc Natl Acad Sci USA. 1992;89:10915–10919. [PMC free
> article] [PubMed]
> The original BLOSUM100 was apparently in the above paper's
> supplemental materials, but I can't seem to find them now. Probably
> we should cite [Henikoff & Henikoff, 1992] on GET?
> On Wed, May 26, 2010 at 3:50 PM, Madeleine Price Ball <meprice at gmail.com> wrote:
>> Ok, looking at substitution_matrix.py:
>> I'm trying to see how well using the official blosum100 compares
>> against the behavior we've been getting from whatever this matrix is.
>> The only thing that I think we care about is whether a particular
>> transition is scored as "being over 3" (below -3, whatever).
>> For now I'm ignoring transitions to or from termination codons. I'll
>> discuss this later. So, of the 400 amino acid pairs we can look at, I
>> calculate that 228 are meeting the "over 3" (below -3) threshold. For
>> the official NBLOSUM100 matrix from ncbi:
>> ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM100 A threshold of -4
>> gets me 240 pairs, a threshold of -5 gets me 196.
>> Good news: all 228 from the old matrix are in the set of 240 caught by
>> the -4 threshold, and all 196 caught by the -5 threshold are in the
>> set of 228 from the old matrix. I think we should use the new matrix
>> with a -4 threshold.
>> Kim created this ticket (which I've updated with stuff covered in this email):
>> I found this reference for the BLOSUM100 matrix, associated with
>> NCBI's BLAST website:
>> Pertsemlidis and Fondon, "Having a BLAST with bioinformatics" Genome
>> Biol. 2001; 2(10): reviews2002.1–reviews2002.10.
>> Published online 2001 September 27.
>> Regarding termination codons:
>> We currently treat these as having 10 points, this is what the
>> official matrix does as well. In my opinion, we should treat amino
>> acid -> termination as very bad, but I don't think termination ->
>> amino acid merits the same badness. BLOSUM is symmetric and scores
>> these two as the same, but I don't think we should implement it this
>> way. It definitely isn't worth 10 points, and I'm not even sure it
>> should be worth "-4", all you're doing is tacking on extra amino acids
>> onto the end of a protein ... but we can go with "-4" for now.
>> Do we know where the termination codon score is currently implemented
>> in the program? I don't see it in substitution_matrix.py
>> - Madeleine
>> On Tue, May 25, 2010 at 2:04 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>>> I hadn't looked into it before. Keep in mind that we use the negative
>>> of the BLOSUM value; so we agree on G>A.
>>> As to the other ones, the discrepancy originates in the source data.
>>> The values are in substitution_matrix.py, and the BLOSUM100 there
>>> gives me G>V as -5 and G>R as -4. I got these values from the source
>>> posted at <http://portal.open-bio.org/pipermail/biopython/2000-September/000418.html>
>>> which is in turn derived, for BLOSUM100, from the no-longer
>>> functioning URL
>>> <http://www.embl-heidelberg.de/~vogt/matrices/blosum100.cmp>. I can't
>>> explain why that source gives those values for BLOSUM100. I guess it's
>>> just an inaccurate source, and we can modify it as necessary.
>>> On Tue, May 25, 2010 at 9:10 AM, Alexander Wait Zaranek
>>> <awaitz at post.harvard.edu> wrote:
>>>> I seem to remember you had looked into this.
>>>> ---------- Forwarded message ----------
>>>> From: Kimberly Robasky <krobasky at gmail.com>
>>>> Date: Tue, May 25, 2010 at 10:01 AM
>>>> Subject: [GET-dev] Blosum100 matrix correct in GET?
>>>> To: get-dev at lists.freelogy.org
>>>> I sent this to Tom & Sasha initially, and Xaodi is looking into it,
>>>> but Sasha encouraged me to post it here (where Xaodi can post a response):
>>>> G->A, G->V, G->R in Blosum100 from NCBI are, respectively, -1, -8, -6:
>>>> But on GET, its 1, 5, 4:
>>>> What am I missing?
>>>> GET-dev mailing list
>>>> GET-dev at lists.freelogy.org
>>> GET-dev mailing list
>>> GET-dev at lists.freelogy.org
>> GET-dev mailing list
>> GET-dev at lists.freelogy.org
More information about the Arvados