[GET-dev] pgp sequence format

Ruth McCole rmccole at genetics.med.harvard.edu
Fri Jan 28 14:24:41 EST 2011


I have been working with PGP sequences that I downloaded in GFF format 
from here: http://evidence.personalgenomes.org/genomes

I was wondering why insertions are represented with a start coordinate 
greater than the end coordinate. I understand this is meant to 
distinguish insertions from deletions and SNPs, but the violation of 
the GFF format has been pretty annoying. In order to analyze the file 
in either Galaxy or BEDTools, I have changed it so that if start > end, 
'INDEL' is replaced by 'INS' and the start and end coordinates are 
swapped.
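
For illustration, here is a minimal sketch of the change described above. 
It is not the script mentioned in this message. It assumes a 
tab-separated GFF file with the 'INDEL' label in column 3 and the start 
and end coordinates in columns 4 and 5; the function name 
normalize_gff_line is hypothetical.

```python
def normalize_gff_line(line):
    """Return a GFF line with start <= end.

    Rows with start > end get their coordinates swapped, and an 'INDEL'
    label on such a row is changed to 'INS'.
    """
    # Pass comments and blank lines through untouched.
    if line.startswith("#") or not line.strip():
        return line

    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")

    # Leave rows alone if they do not look like GFF feature rows.
    if len(fields) < 5:
        return line
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError:
        return line

    if start > end:
        fields[3], fields[4] = str(end), str(start)
        # Assumption: the variant label sits in the feature column.
        if fields[2] == "INDEL":
            fields[2] = "INS"

    return "\t".join(fields) + newline
```

Applying this to every line of a file and writing the results out 
should give a file where no row has start > end. Whether the swapped 
coordinates still describe the insertion site correctly depends on how 
the original file defines them, so that would need checking.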

Is there any reason why I shouldn't do this? Please don't ask for the 
Python script that does this, because it has a bug which means it's OK 
for the sequence I'm primarily working on, but not necessarily for any 
other. I am trying to fix this.

Many thanks,

Ruth McCole
Postdoctoral Researcher
Wu Lab

Department of Genetics
Harvard Medical School
77 Avenue Louis Pasteur
Boston, MA 02115
