[GET-dev] Python style guide, pylint, and generators

Wed Mar 2 12:54:12 EST 2011

Python style guide:
Following a consistent style makes code more readable, which helps others
join a project. I was recently pointed to this Python style guide (PEP 8),
which I'm now trying to follow and which I think should be the style for
Python code in our project: http://www.python.org/dev/peps/pep-0008/

To that end, my latest commit updating gff_getevidence_map.py attempts to
adhere to the style guide:
https://github.com/madprime/get-evidence/blob/940417ccc5a74ec6a235f537575347ee38ab991e/server/gff_getevidence_map.py

pylint:
I've also been shown pylint, a useful tool for checking whether Python code
is well written. Run with its default configuration, my new
gff_getevidence_map.py gets a pylint score of 9.25/10. Perhaps we should aim
for a minimum score (e.g. 8) in all our Python code?

Generators:
Python generator objects allow you to chain together functions so the output
of one feeds into the input of another, like a UNIX pipe. Here's a nice
review of Python generators: http://www.dabeaz.com/generators/Generators.pdf

This can be a big win for us because a lot of our data processing involves
"picking up a GFF file, analyzing each variant and possibly adding data,
then putting it back into a GFF file". The GFF file (or whatever genome
format we use) should only need to be read once! We should aim to have any
script which analyzes data from a genome GFF file act as a generator that
returns GFF strings -- that way we can pull the GFF data through all
analyses instead of re-reading it.
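
To make the idea concrete, here is a minimal sketch of chained generators
(written for Python 3). The names read_gff and annotate and the sample GFF
line are made up for illustration; they are not functions from our scripts.
Each stage takes an iterable of GFF lines and yields GFF lines, so the
stages can be stacked like commands in a UNIX pipe:

```python
import io


def read_gff(lines):
    """Yield non-empty GFF lines from any iterable of strings."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line


def annotate(gff_lines, tag):
    """Hypothetical analysis step: append an attribute to each data line."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line  # header and comment lines pass through unchanged
        else:
            yield line + ";" + tag


# Made-up input; in practice this would be an open GFF file handle.
gff_text = "##gff-version 3\nchr1\tsrc\tSNP\t100\t100\t.\t+\t.\talleles A/G\n"

# The input is read once and pulled through both analysis stages.
pipeline = annotate(annotate(read_gff(io.StringIO(gff_text)), "step1"), "step2")
result = list(pipeline)
```

Nothing is read until list() starts pulling lines out of the last stage.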

For example, before we might have done this:
 gff_dbsnp_query.match2dbSNP_to_file(gff_input_file, dbsnp_file,
     dbsnp_output_file)
And then passed the location of 'dbsnp_output_file' (which has GFF-formatted
data) to the next step:
 gff_nonsynonymous_filter.predict_nonsynonymous_to_file(dbsnp_output_file,
     twobit_path, transcripts_file, output_file)

But now we can do this:
 dbsnp_gen = gff_dbsnp_query.match2dbSNP(gff_input_file, dbsnp_file)
And pass that generator instead:
 gff_nonsynonymous_filter.predict_nonsynonymous_to_file(dbsnp_gen,
     twobit_path, transcripts_file, output_file)
As a result, we are not unnecessarily writing and re-reading the GFF data.
The processing happens in one step as a pipeline, with the GFF input read
just far enough by gff_dbsnp_query to produce the next line requested by
gff_nonsynonymous_filter.
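
The "read just far enough" behaviour can be seen in a small sketch (Python
3). The functions source and consumer_stage are stand-ins I made up for the
upstream and downstream scripts, and the log list exists only to record the
order in which things happen:

```python
def source(records, log):
    """Stand-in for the upstream reader: logs each record as it is read."""
    for rec in records:
        log.append("read " + rec)
        yield rec


def consumer_stage(upstream, log):
    """Stand-in for the downstream filter: logs each record it processes."""
    for rec in upstream:
        log.append("process " + rec)
        yield rec.upper()


log = []
stage = consumer_stage(source(["a", "b", "c"], log), log)

# Asking for one result pulls exactly one record through the whole chain.
first = next(stage)
# At this point log == ["read a", "process a"]; "b" and "c" are still unread.
```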

The new gff_getevidence_map.py is also modified so it can now act as a GFF
generator if passed an optional output_file parameter (this causes it to
switch from generating JSON output lines to generating the GFF input lines
and writing the JSON lines to that file). On the to-do list is fixing any
remaining non-generator scripts. In the future, any new scripts that analyze
the genome GFF should be written to generate the genome GFF as output
(whether or not it's modified along the way).
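
As a rough sketch of that dual-mode pattern (Python 3), and not the actual
code in gff_getevidence_map.py: the function evidence_map, the JSON fields
it emits, and the sample GFF line below are all assumptions for
illustration. Without output_file it yields JSON lines; with output_file it
writes the JSON lines to that file and yields the GFF lines so they can be
passed on to the next stage:

```python
import json
import os
import tempfile


def evidence_map(gff_lines, output_file=None):
    """Yield JSON lines, or, if output_file is given, write JSON there
    and yield the GFF input lines instead (hypothetical sketch)."""
    out = open(output_file, "w") if output_file else None
    try:
        for line in gff_lines:
            if line.startswith("#"):
                if out is not None:
                    yield line  # keep headers in the GFF stream
                continue
            fields = line.split("\t")
            # Made-up JSON payload: chromosome and start position only.
            json_line = json.dumps({"chromosome": fields[0],
                                    "start": int(fields[3])})
            if out is not None:
                out.write(json_line + "\n")
                yield line
            else:
                yield json_line
    finally:
        if out is not None:
            out.close()


gff = ["##gff-version 3", "chr1\tsrc\tSNP\t100\t100\t.\t+\t.\talleles A/G"]

# Mode 1: no output_file, so the generator yields JSON lines.
json_only = list(evidence_map(gff))

# Mode 2: JSON goes to the file and the GFF lines are passed through.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.json")
    passed_through = list(evidence_map(gff, output_file=path))
    with open(path) as handle:
        written = handle.read().splitlines()
```

One thing to keep in mind with this design: the file is only written as the
generator is consumed, so a later stage has to iterate over it to the end.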

     - Madeleine