[GET-dev] Mapping build 37 SNPs to build 36 genomes?

Fri Jun 18 23:18:30 EDT 2010

On Fri, Jun 18, 2010 at 5:49 PM, Tom Clegg
<tom at scalablecomputingexperts.com> wrote:
> On Fri, Jun 18, 2010 at 11:07 AM, Kimberly Robasky <krobasky at gmail.com>
> wrote:
>> I'm trying to understand how GET is mapping build 37 SNPs to build 36
>> genomes
>
> The short answer is that it doesn't.  Trait-o-matic assumes that all GFF
> coordinates are hg18 (36.3).  Madeleine and I have talked about this briefly
> a while ago.  It seems like a first step might be to accept/store
> coordinates as "build:chromosome:position" instead of just
> "chromosome:position".

Yes, I think storing build is essential.  Even on the variant - this
is how dbSNP does it.  For most SNPs I've seen in there, you can find
the 36.3 and the 37 mappings in dbSNP, but for this AGAP7 variant,
dbSNP only has a build 37 mapping, so it would seem this is important?
 See my note later down on this (about creating a "Build" section on
the variant page)

> Then, the nsSNP engine can be extended to support
> multiple references by using the appropriate version of refFlat and dbsnp.

I'm with you so far.

>> How does GET know that NA19240 has variant rs77023418?:
>> http://evidence.personalgenomes.org/AGAP7-Thr362Asn
>> You see NA19240 at the bottom has no chr/coordinate, nor does this
>> page cross-reference the dbSNP id.
>
> I don't think this is a build 36/37 issue.  It's just that the current
> NA19240 data set does not list any AGAP7 variants but the previous one does.
>
> GET-Evidence knows that a single human might have multiple data sets, so it
> still remembers that NA19240 has that variant, even though the latest data
> set does not include it.  I think there are 3 bugs to fix here:
>
> (1) GET-Evidence should be more careful to link to the data set that
> actually has the variant (rather than linking to all current data sets for
> that variant).

Maybe both?  You could add dataset_id column to the variant_occurs
table, then reference both that one and the genome_id to get pointers
to both the reference dataset for the variant in question as well as
the most recent dataset.  If you agree with this approach, let me
know, and I'll log it into trac.

> (2) Trait-o-matic should be able to display both data sets publicly, without
> having both appear on the "public samples" front page.

I would love to see dataset revisions available as links off each of
the "public samples" dataset pages.  How about an option for
"updating" a dataset that would automatically create links to previous
revisions (maybe add a box on the genome submission form for specifing
the url to replace?).

You could do this with a new table, maybe genome_versions, with
columns genome_id, job_id, version.  Then, probably the easiest thing
(but not so elegant) is to create another new table, genome_current
mapping genome_id to job_id so you don't have to constantly hunt
through genome_versions to find the most current dataset.

trac it?

> (3) GET-Evidence should remove the "genome X" sections when a public data
> set becomes non-public (basically "stop linking to a previously public data
> set" is not automated so it's no big surprise that this didn't happen
> properly).  IIRC it does correctly handle "public data set is recomputed and
> no longer has variant V" so it might not take much to fix this.

IIRC, you did RC :) -- I think I saw that in import_genomes.php

But would this account for the 15% or so variants that have no entry
in variant_occurs?

Also, I think it would be really nice if GET could report the source
of the variant right on the page (e.g., did somebody add it manually?
Is it from pharmGKB?...?)  I'm pretty sure that at least some of those
15% are from pharmgkb, but I'd have to look a little closer to
confirm.  See my note below on adding a section called "Source" below
for more.

> The second bug probably isn't a super high priority if we're going to
> revise/replace the "public samples" and "trait-o-matic report" functionality
> using GET-Evidence.
>
> chr10   MAQ     SNP     51135377        51135377        .       +       .
>     alleles G/T;amino_acid AGAP7 T362N;ref allele G;ref_allele G

Hey, where did you find that?  I looked for 51135377/8/9, in my job
125 gff files but didn't find it - curious.  Is that from the old
NA19240 dataset?

>> I've been slogging through source trying to figure out how genome id
>> gets mapped to variant in the edits/snap_latest/snap_release tables,
>> but to no avail.  It doesn't seem to have been mapped in the makefile
>> or install.php, either.  Could this be an artifact from some previous
>> source code base?
>
> snap_latest has a row with variant_id=1078, genome_id=25 -- that corresponds
> to "Dec 27 2009 Genome Importing Robot added [NA19240]" in the edit history.
>
> (When #3 is fixed you'll see a "removed [NA19240]" edit corresponding to
> this.)
>
> For sake of explanation let's look at a variant that isn't affected by bug
> (3) above:
>
> http://evidence.personalgenomes.org/RHO-Gly51Ala
>
> mysql> select * from variant_occurs where variant_id = 83444;
> +------------+------+------------+--------------+------+-----------+--------+
> | variant_id | rsid | dataset_id | zygosity     | chr  | chr_pos   | allele
> |
> +------------+------+------------+--------------+------+-----------+--------+
> |      83444 |    0 | T/snp/179  | heterozygous | chr3 | 130730418 | C
>  |
> +------------+------+------------+--------------+------+-----------+--------+
> 1 row in set (0.00 sec)
> mysql> select * from datasets where dataset_id = 'T/snp/179';
> +------------+-----------+--------------------------------------------+------+
> | dataset_id | genome_id | dataset_url                                | sex
>  |
> +------------+-----------+--------------------------------------------+------+
> | T/snp/179  |        17 | http://snp.med.harvard.edu/results/job/179 | M
>  |
> +------------+-----------+--------------------------------------------+------+
> 1 row in set (0.00 sec)
> mysql> select * from genomes where genome_id = 17;
> +-----------+-----------------+---------------+
> | genome_id | global_human_id | name          |
> +-----------+-----------------+---------------+
> |        17 | snp-17          | James Sherley |
> +-----------+-----------------+---------------+
> 1 row in set (0.00 sec)

Nicely explained.

>> More broadly, I found this because I'm trying to map all variants to
>> coordinates that I use to compute conservation, but I have no
>> coordinates for around 15% of the variants, including this AGAP7
>> variant.
> The "variant_occurs" table (and the variants table, to get the AA
> coordinates) should give you coordinates for all the variants that occur in
> the current versions of all the public genomes.
> In general, the AGAP7-Thr362Asn situation (genome entry despite no data set
> evidence) will still happen even with bug (3) fixed -- if someone (other
> than the genome importing robot) writes something in the text field for that
> genome, the genome importer will leave it alone.  Perhaps it would be
> helpful to have an annotation on the web page to explain this: "No current
> data sets indicate that this genome has this variant."  Someone want to
> offer a more concise version of that?

Again, I think it would be more helpful to have a section on the
bottom of the variant page, maybe: "Source" that points to the source
of the variant; maybe the possible values are:
[current dataset|obsolete dataset|private
dataset|dbSNP|pharmGKB|OMIM|gwas|HGMD|manual submission]

I would also really like to see a section at the bottom named "Build"
referring to builds in which this variant is found.  Again, I see
dbSNP for an example of what I have in mind.

thanks Tom!!  Very thorough, and very helpful as always.

-Kim

> Tom
>
> _______________________________________________
> GET-dev mailing list
> GET-dev at lists.freelogy.org
> http://lists.freelogy.org/mailman/listinfo/get-dev
>
>