[BioC] gff files: how to tell if right-open interval convention used?
Simon Anders
anders at embl.de
Fri Apr 1 10:53:13 CEST 2011
Hi
I guess this all depends on which GFF specification you consider
authoritative. Different institutions have published slightly
contradictory specs.
According to the Sanger web site, the original GFF spec is due to Durbin
and Haussler, and the current version (according to Sanger) is GFF2,
which, at
http://www.sanger.ac.uk/resources/software/gff/spec.html
specifies:
"<start>, <end>: Integers. <start> must be less than or equal to <end>.
Sequence numbering starts at 1, so these numbers should be between 1 and
the length of the relevant sequence, inclusive. "
A feature is typically a stretch of consecutive base pairs which make up
something, say, an exon, so I guess they did not have zero-length
features in mind when writing this. So, if start is less than or equal
to end, a single-base-pair feature would be denoted with start=end.
Furthermore, it should be possible to indicate the total chromosome as a
feature, and then, we can fulfill the second sentence of the quote above
only by using closed intervals.
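To make the closed-interval reading concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not part of any GFF library. It assumes 1-based coordinates with both start and end included, as in the Sanger quote above.

```python
def closed_length(start, end):
    """Number of bases in a 1-based closed interval [start, end]."""
    if start > end:
        raise ValueError("the quoted spec requires start <= end")
    return end - start + 1


def closed_to_slice(start, end):
    """Convert a 1-based closed interval to a 0-based half-open Python slice."""
    return slice(start - 1, end)


seq = "ACGTACGTAC"  # toy sequence of length 10

# A single-base feature is written with start == end.
print(closed_length(4, 4))          # 1
print(seq[closed_to_slice(4, 4)])   # T

# The whole sequence is the feature (1, len(seq)), so both numbers stay
# "between 1 and the length of the relevant sequence, inclusive".
print(closed_length(1, len(seq)))   # 10
```

Under a half-open convention, the whole sequence would need end = len(seq) + 1 in 1-based coordinates, which would violate the second sentence of the quote.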
The UCSC Genome Browser team also interpret it that way. At
http://genome.ucsc.edu/FAQ/FAQformat.html#format3
they write
"[column] 5. end - The ending position of the feature (inclusive)."
and I guess, "inclusive" means closed.
Only later, somebody came up with the idea of zero-length (a.k.a.
inter-base) features such as positions of deletions, and suggested GFF3.
Unfortunately, zero-length features can only be represented by a
half-open convention, as otherwise, we cannot distinguish whether a
feature with start equal to end means a single base pair or an
inter-base position. Hence, to me, this GFF3 proposal at
http://www.sequenceontology.org/resources/gff3.html, which Hervé quoted,
seems to be ill-defined. It implies that it is half-open, without saying
so explicitly, and so breaks backwards compatibility.
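The ambiguity can be sketched in a few lines of Python. The helper names are hypothetical. The point is only that the same (start, end) pair yields a different length depending on which convention the reader assumes.

```python
def length_closed(start, end):
    """Length if both start and end are included: [start, end]."""
    return end - start + 1


def length_half_open(start, end):
    """Length if end is excluded: [start, end)."""
    return end - start


record = (5, 5)  # a feature line with start equal to end

print(length_closed(*record))     # 1: a single base pair
print(length_half_open(*record))  # 0: an inter-base position

# With start <= end, the closed convention can never give length zero:
shortest = min(length_closed(s, e) for s in range(1, 6) for e in range(s, 6))
print(shortest)                   # 1
```

So a file with start equal to end cannot be interpreted unless the spec states which convention applies.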
<rant> Every half-way competent computer scientist knows that specifying
_clearly_ whether the end is included is the very first thing one does
when drafting a spec for anything involving intervals, because there are
ample examples of specs using either choice. It baffles me that most
genomics file format specs are so unclear in such things. Is our field
really that unprofessional? </rant>
Simon