[GET-dev] Mapping build 37 SNPs to build 36 genomes?

Tom Clegg tom at scalablecomputingexperts.com
Fri Jun 18 17:49:45 EDT 2010


On Fri, Jun 18, 2010 at 11:07 AM, Kimberly Robasky <krobasky at gmail.com>
wrote:
> I'm trying to understand how GET is mapping build 37 SNPs to build 36
> genomes

The short answer is that it doesn't.  Trait-o-matic assumes that all GFF
coordinates are hg18 (36.3).  Madeleine and I talked about this briefly a
while ago.  A first step might be to accept/store coordinates as
"build:chromosome:position" instead of just "chromosome:position".  Then the
nsSNP engine can be extended to support multiple references by using the
version of refFlat and dbSNP that matches each build.
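
A minimal sketch of the "build:chromosome:position" idea, in Python.  The
names here (Coordinate, parse_coordinate, ANNOTATION_SETS) and the file names
are made up for illustration and are not part of Trait-o-matic; the only
thing taken from above is that coordinates without a build are treated as
hg18.

```python
from typing import NamedTuple, Tuple

# Assumption: coordinates with no build prefix are hg18, as Trait-o-matic
# currently assumes for all GFF input.
DEFAULT_BUILD = "hg18"

# Hypothetical mapping from build to the (refFlat, dbSNP) files to use.
# The file names are placeholders, not real paths.
ANNOTATION_SETS = {
    "hg18": ("refFlat.hg18.txt", "dbsnp.hg18.txt"),
    "hg19": ("refFlat.hg19.txt", "dbsnp.hg19.txt"),
}


class Coordinate(NamedTuple):
    build: str
    chromosome: str
    position: int


def parse_coordinate(text: str, default_build: str = DEFAULT_BUILD) -> Coordinate:
    """Parse 'build:chromosome:position' or legacy 'chromosome:position'."""
    parts = text.split(":")
    if len(parts) == 3:
        build, chromosome, position = parts
    elif len(parts) == 2:
        build = default_build
        chromosome, position = parts
    else:
        raise ValueError("expected [build:]chromosome:position, got %r" % text)
    return Coordinate(build, chromosome, int(position))


def annotation_files(coord: Coordinate) -> Tuple[str, str]:
    """Return the (refFlat, dbSNP) pair for the coordinate's build."""
    try:
        return ANNOTATION_SETS[coord.build]
    except KeyError:
        raise ValueError("unsupported build: %s" % coord.build)
```

Keeping the two-part form valid means existing hg18 data would not need to
be rewritten before newer builds are accepted.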

> How does GET know that NA19240 has variant rs77023418?:
> http://evidence.personalgenomes.org/AGAP7-Thr362Asn
> You see NA19240 at the bottom has no chr/coordinate, nor does this
> page cross-reference the dbSNP id.

I don't think this is a build 36/37 issue.  It's just that the current
NA19240 data set does not list any AGAP7 variants but the previous one does.

GET-Evidence knows that a single human might have multiple data sets, so it
still remembers that NA19240 has that variant, even though the latest data
set does not include it.  I think there are 3 bugs to fix here:

(1) GET-Evidence should be more careful to link to the data set that
actually has the variant (rather than linking to all current data sets for
that variant).

(2) Trait-o-matic should be able to display both data sets publicly, without
having both appear on the "public samples" front page.

(3) GET-Evidence should remove the "genome X" sections when a public data
set becomes non-public.  Basically, "stop linking to a previously public
data set" is not automated, so it's no big surprise that this didn't happen
properly.  IIRC it does correctly handle "public data set is recomputed and
no longer has variant V", so it might not take much to fix this.

The second bug probably isn't a super high priority if we're going to
revise/replace the "public samples" and "trait-o-matic report" functionality
using GET-Evidence.

chr10   MAQ     SNP     51135377        51135377        .       +       .       alleles G/T;amino_acid AGAP7 T362N;ref allele G;ref_allele G

> I've been slogging through source trying to figure out how genome id
> gets mapped to variant in the edits/snap_latest/snap_release tables,
> but to no avail.  It doesn't seem to have been mapped in the makefile
> or install.php, either.  Could this be an artifact from some previous
> source code base?

snap_latest has a row with variant_id=1078, genome_id=25 -- that corresponds
to "Dec 27 2009 Genome Importing Robot added [NA19240]" in the edit history.

(When bug (3) is fixed you'll see a "removed [NA19240]" edit corresponding
to this.)

For the sake of explanation, let's look at a variant that isn't affected by
bug (3) above:

http://evidence.personalgenomes.org/RHO-Gly51Ala

mysql> select * from variant_occurs where variant_id = 83444;
+------------+------+------------+--------------+------+-----------+--------+
| variant_id | rsid | dataset_id | zygosity     | chr  | chr_pos   | allele |
+------------+------+------------+--------------+------+-----------+--------+
|      83444 |    0 | T/snp/179  | heterozygous | chr3 | 130730418 | C      |
+------------+------+------------+--------------+------+-----------+--------+
1 row in set (0.00 sec)

mysql> select * from datasets where dataset_id = 'T/snp/179';
+------------+-----------+--------------------------------------------+------+
| dataset_id | genome_id | dataset_url                                | sex  |
+------------+-----------+--------------------------------------------+------+
| T/snp/179  |        17 | http://snp.med.harvard.edu/results/job/179 | M    |
+------------+-----------+--------------------------------------------+------+
1 row in set (0.00 sec)

mysql> select * from genomes where genome_id = 17;
+-----------+-----------------+---------------+
| genome_id | global_human_id | name          |
+-----------+-----------------+---------------+
|        17 | snp-17          | James Sherley |
+-----------+-----------------+---------------+
1 row in set (0.00 sec)
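
The three lookups above can be chained into a single join.  Here is a rough
sketch in Python, using an in-memory SQLite database as a stand-in for the
real MySQL one.  Table and column names and the sample rows are copied from
the output above; the column types and the helper names (build_demo_db,
genomes_with_variant) are assumptions for illustration.

```python
import sqlite3

# Join variant_occurs -> datasets -> genomes, mirroring the three queries.
JOIN_SQL = """
SELECT g.name, g.global_human_id, d.dataset_url,
       o.chr, o.chr_pos, o.allele, o.zygosity
FROM variant_occurs AS o
JOIN datasets AS d ON d.dataset_id = o.dataset_id
JOIN genomes  AS g ON g.genome_id  = d.genome_id
WHERE o.variant_id = ?
"""


def build_demo_db():
    """Create an in-memory database holding only the rows shown above."""
    conn = sqlite3.connect(":memory:")
    # Column types are guesses; only the names come from the real tables.
    conn.executescript("""
        CREATE TABLE variant_occurs (
            variant_id INTEGER, rsid INTEGER, dataset_id TEXT,
            zygosity TEXT, chr TEXT, chr_pos INTEGER, allele TEXT);
        CREATE TABLE datasets (
            dataset_id TEXT, genome_id INTEGER, dataset_url TEXT, sex TEXT);
        CREATE TABLE genomes (
            genome_id INTEGER, global_human_id TEXT, name TEXT);
        INSERT INTO variant_occurs VALUES
            (83444, 0, 'T/snp/179', 'heterozygous', 'chr3', 130730418, 'C');
        INSERT INTO datasets VALUES
            ('T/snp/179', 17, 'http://snp.med.harvard.edu/results/job/179', 'M');
        INSERT INTO genomes VALUES (17, 'snp-17', 'James Sherley');
    """)
    return conn


def genomes_with_variant(conn, variant_id):
    """Return one row per data set in which the variant occurs."""
    return conn.execute(JOIN_SQL, (variant_id,)).fetchall()


if __name__ == "__main__":
    for row in genomes_with_variant(build_demo_db(), 83444):
        print(row)
```

The same SELECT should work against MySQL as written, since it uses only
plain inner joins on the key columns shown in the output.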

> More broadly, I found this because I'm trying to map all variants to
> coordinates that I use to compute conservation, but I have no
> coordinates for around 15% of the variants, including this AGAP7
> variant.

The "variant_occurs" table (and the variants table, to get the AA
coordinates) should give you coordinates for all the variants that occur in
the current versions of all the public genomes.

In general, the AGAP7-Thr362Asn situation (genome entry despite no data set
evidence) will still happen even with bug (3) fixed -- if someone (other
than the genome importing robot) writes something in the text field for that
genome, the genome importer will leave it alone.  Perhaps it would be
helpful to have an annotation on the web page to explain this: "No current
data sets indicate that this genome has this variant."  Someone want to
offer a more concise version of that?
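
To make the proposed annotation concrete, here is a small Python sketch of
the check.  It assumes the page already has two lists for a variant: the
genome ids listed for it (e.g. from snap_latest) and the genome ids that
still have it in a current data set (e.g. via variant_occurs and datasets).
The function name and the way these lists are obtained are hypothetical;
this is not how GET-Evidence is actually written.

```python
# Wording taken from the suggestion above; a shorter version is welcome.
NOTE = "No current data sets indicate that this genome has this variant."


def genomes_needing_note(listed_genome_ids, supported_genome_ids):
    """Return listed genomes that no current data set supports.

    listed_genome_ids:    genome ids shown on the variant page
    supported_genome_ids: genome ids with the variant in a current data set
    """
    supported = set(supported_genome_ids)
    return [g for g in listed_genome_ids if g not in supported]


def annotations(listed_genome_ids, supported_genome_ids):
    """Map each unsupported genome id to the explanatory note."""
    unsupported = genomes_needing_note(listed_genome_ids, supported_genome_ids)
    return {g: NOTE for g in unsupported}
```

For the AGAP7 case, genome 25 is listed but has no supporting data set, so
annotations([25], []) would attach the note to genome 25, while a genome
that is both listed and supported gets no note.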

Tom