Lots of improvements to GET-Evidence, and lots of new people signing up for this list!<br><br>The biggest improvement, in terms of functionality, is analysis of insertions and deletions in coding regions -- in other words, it tells us if and where a frameshift occurred, or if an amino acid was deleted or inserted. To do this, Tom and I had to settle on a standard method for reporting these in the uploaded genome data, and on how to report any amino acid changes we did find. I wrote up two guides for this; you can currently see them on my dev site:<br>
<a href="http://mball.freelogy.org/guide_upload_and_source_file_formats">http://mball.freelogy.org/guide_upload_and_source_file_formats</a><br><a href="http://mball.freelogy.org/guide_amino_acid_calls">http://mball.freelogy.org/guide_amino_acid_calls</a><br>
<br>In the process I ended up deciding to use UCSC's transcript annotation, which means there are some changes that require you to redo some installation steps. I'm putting a more detailed explanation of what I did as a postscript to this email.<br>
<br>Also, Tom has added DataTables (<a href="http://www.datatables.net/">http://www.datatables.net/</a>) so we can get awesome genome reports users can sort and filter. And he's adding some ability to link to <a href="http://my.personalgenomes.org">my.personalgenomes.org</a> profile pages. So that also has some installation steps you'll need to redo. <br>
<br>Presumably Tom will pull my stuff soon, so here's the combined list of things you need to redo (I may have missed some though):<br>* remove the old config.py: "sudo rm /home/trait/config.py"<br>* Do these again (should create new config.py and run setup-external-data):<br>
cd ~/get-evidence/server/script/<br> source config-local.sh<br> sudo -u $USER ./install-user.sh<br>* edit /home/trait/config.py to have the SQL database password (same as in public_html/config.py)<br>* re-run "make install" (to get DataTables)<br>
* go to "install.php" again (to get SQL tables updated)<br><br>There are other new items in the INSTALL. I'm not sure what else might need to be done. As always, feel free to post an email here if you have problems.<br>
<br> - Madeleine<br><br>P.S. More details about what I did.<br><br>I've updated gff_nonsynonymous_filter_from_file.py -- previously it would just take 3 bases, translate them to an amino acid, then check whether the amino acid changed. Now it concatenates the whole coding sequence around the reference and variant genotypes, calls amino acids on the whole thing, then calls amino acid differences between the two. In the process I added a sanity check: it turns out a lot of refFlat transcript annotations are not "sane" -- many of them, for example, have some frameshift mistake that results in stop codons throughout the sequence. So now we throw out any transcript whose coding sequence has stop codons scattered through it -- a sequence read in the wrong frame is *very* unlikely to avoid stop codons by chance, so passing that check is a good test for a good transcript.<br>
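To make the whole-sequence approach concrete, here is a minimal Python sketch. It is not the code in gff_nonsynonymous_filter_from_file.py: the function names, the returned tuple, and the exact form of the sanity check (no stop codon before the last codon) are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence codon by codon ('*' = stop, 'X' = unknown).

    A trailing partial codon is ignored.
    """
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - 2, 3))


def is_sane_transcript(cds):
    """Assumed sanity check: no stop codon before the final codon."""
    return "*" not in translate(cds)[:-1]


def aa_difference(ref_cds, var_cds):
    """Compare translations of the full reference and variant sequences.

    Returns None if the proteins are identical, otherwise a tuple
    (1-based position, reference residues, variant residues, is_frameshift).
    """
    ref, var = translate(ref_cds), translate(var_cds)
    if ref == var:
        return None
    # Skip the residues shared at the start of both proteins.
    start = 0
    while start < min(len(ref), len(var)) and ref[start] == var[start]:
        start += 1
    # A length change that is not a multiple of 3 shifts the reading frame,
    # so everything downstream of the first difference is reported.
    if (len(var_cds) - len(ref_cds)) % 3 != 0:
        return (start + 1, ref[start:], var[start:], True)
    # Otherwise also trim the residues shared at the end.
    end = 0
    while end < min(len(ref), len(var)) - start and ref[-1 - end] == var[-1 - end]:
        end += 1
    return (start + 1, ref[start:len(ref) - end], var[start:len(var) - end], False)
```

With a toy reference such as ATGGCTAAATGA, removing a whole codon comes back as a single deleted amino acid, while removing one base is flagged as a frameshift from the first changed residue onward.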
<br>That said, the stuff I was getting out of it was still making me unhappy -- e.g. a frameshift called from one of ten annotated transcripts when the variant was intronic in the other nine. Based on a recommendation from Ivan Adzhubey I decided we needed to move to UCSC's annotation. One great thing about this is that UCSC has annotated "canonical transcripts". Previously we'd report several different amino acid variants for a single genetic change from different transcripts -- I think this kind of made things a mess, since you only want one variant page in the end. Using canonical transcripts solves that.<br>
<br>But attaching gene names to the UCSC transcripts was a bit of a mess. I added a download of HGNC canonical gene names. Where possible I used that to directly infer the name from a transcript. Otherwise I used refFlat to connect the transcript ID to a gene name, as long as the gene name was on HGNC's list and matched the correct chromosome (a weak sanity check, but we had none at all before). Failing that, I used kgXref's suggested gene name, again checking against the HGNC list and chromosome. If all of these failed, I used the transcript ID as the "gene name".<br>
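The fallback order above can be sketched roughly as follows. This is only an illustration: the function name and the four lookup dictionaries are hypothetical stand-ins for the downloaded HGNC, refFlat and kgXref data, and how HGNC maps directly to a transcript is assumed rather than taken from the real code.

```python
def pick_gene_name(transcript_id, chrom, hgnc_by_transcript, hgnc_chrom,
                   refflat_gene, kgxref_gene):
    """Choose a gene name for a UCSC transcript using the fallback order.

    hgnc_by_transcript: transcript ID -> HGNC gene name (direct link, assumed)
    hgnc_chrom:         HGNC gene name -> chromosome
    refflat_gene:       transcript ID -> gene name from refFlat
    kgxref_gene:        transcript ID -> gene name suggested by kgXref
    """
    # 1. A direct HGNC link wins outright.
    direct = hgnc_by_transcript.get(transcript_id)
    if direct:
        return direct
    # 2. refFlat, then 3. kgXref: accept a name only if it is on the HGNC
    #    list and HGNC places it on the same chromosome as the transcript.
    for source in (refflat_gene, kgxref_gene):
        candidate = source.get(transcript_id)
        if candidate and hgnc_chrom.get(candidate) == chrom:
            return candidate
    # 4. Nothing passed the checks: fall back to the transcript ID itself.
    return transcript_id
```

Keeping the chromosome check inside the loop means refFlat and kgXref are held to the same rule, and a name that fails the check simply falls through to the next source.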