[GET-dev] pgp sequence format

Tom Clegg tom at clinicalfuture.com
Fri Jan 28 15:42:05 EST 2011


Hi Ruth,

On Fri, Jan 28, 2011 at 11:24 AM, Ruth McCole
<rmccole at genetics.med.harvard.edu> wrote:
> I was wondering about why a representation of insertions was chosen where
> the start coordinate is greater than the end. I understand this is in order
> to distinguish insertions from deletions and SNPs, but the violation of gff
> format has been pretty annoying.

Agreed.  We found this choice annoying too even when we were making
it.  GFF can't describe a zero-length feature so we decided to come up
with a "nearly GFF" way to do it.  The long term solution is to use a
different file format which has a native way to describe this type of
insertion.  Meanwhile, at least we can process the available data.

In GFF the "end" coordinate is supposed to be the "ending position of
the feature, inclusive".  Our reasoning was that if you instead
interpret the end coordinate as "the position of the base immediately
following the feature, minus one" then the coordinates for all legal
GFF features still have the same meaning, but you have the bonus of
being able to describe a zero-length feature as well.

Of course, start>end is explicitly disallowed in GFF, but all of the
alternatives we came up with seemed to involve changing the meaning of
the coordinates when describing an insertion.  For example, if we used
start=end, existing GFF programs would incorrectly interpret a 1bp
insertion as a SNP.  We figured raising an error was preferable to
silently getting the data wrong.

> In order to analyze the file in either
> galaxy or bedtools, I have changed it so that if start>end, 'INDEL' is
> replaced by 'INS', and the start and end coordinates are swopped.
>
> Is there any reason why I shouldn't do this?

I don't think swapping start and end will give you the right results.

For example, for the insertion "acgt -> acAgt" our files will say start=3 end=2.

If you swap them to start=2 end=3, the feature would likely be
interpreted as "acgt -> aAt".

> Please don't ask for the python
> script to do this because it has a bug in it which means its ok for the
> sequence I'm primarily working on, but not necessarily any other. I am
> trying to fix this.

I think the way to turn these into legal correct GFF files (without
just tossing these lines out) would be to look up the relevant
positions in hg18 and encode the above example as something like
start=3 end=3 ... alleles AG; ref_allele G.

Tom




More information about the Arvados mailing list