[ARVADOS] updated: b7fe2ea36c87fa648f019c20679b50ab462aec5a

Wed Oct 15 15:17:12 EDT 2014

Summary of changes:
 crunch_scripts/run-command                      | 114 +++++++++++++++---------
 doc/_config.yml                                 |   1 +
 doc/user/topics/arv-run.html.textile.liquid     |  67 ++++++++++++++
 doc/user/topics/run-command.html.textile.liquid |  48 ++++++++--
 sdk/cli/bin/arv                                 |   4 +-
 sdk/python/arvados/commands/run.py              |  16 ++--
 6 files changed, 194 insertions(+), 56 deletions(-)
 create mode 100644 doc/user/topics/arv-run.html.textile.liquid

       via  b7fe2ea36c87fa648f019c20679b50ab462aec5a (commit)
       via  0a37e2d631fd98e2766245c4719586d38bdf10c8 (commit)
       via  2ec515c8ef7f9cae426a6490d1317333718e1d5e (commit)
      from  c79e86aff4cb20413cf0f09c52fe5066ca197deb (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.


commit b7fe2ea36c87fa648f019c20679b50ab462aec5a
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date:   Wed Oct 15 15:16:41 2014 -0400

    3609: Added documentation page.  Added to "arv" frontend command.  Bug fix to
    print help when there are no command line arguments.

diff --git a/doc/_config.yml b/doc/_config.yml
index 61bfb6f..b03a18d 100644
--- a/doc/_config.yml
+++ b/doc/_config.yml
@@ -32,6 +32,7 @@ navbar:
       - user/topics/keep.html.textile.liquid
     - Run a pipeline on the command line:
       - user/topics/running-pipeline-command-line.html.textile.liquid
+      - user/topics/arv-run.html.textile.liquid
       - user/reference/sdk-cli.html.textile.liquid
     - Develop a new pipeline:
       - user/tutorials/intro-crunch.html.textile.liquid
diff --git a/doc/user/topics/arv-run.html.textile.liquid b/doc/user/topics/arv-run.html.textile.liquid
new file mode 100644
index 0000000..b406e6b
--- /dev/null
+++ b/doc/user/topics/arv-run.html.textile.liquid
@@ -0,0 +1,67 @@
+---
+layout: default
+navsection: userguide
+title: "Using arv-run"
+...
+
+The @arv-run@ command enables you create Arvados pipelines at the command line that fan out to multiple concurrent tasks across Arvado compute nodes.
+
+{% include 'tutorial_expectations' %}
+
+h1. Quick introduction
+
+Run one @grep@ task per file, and redirect the output to output.txt
+
+<notextile>
+<pre>
+$ <span class="userinput">cd ~/keep/by_id/3229739b505d2b878b62aed09895a55a+142</span>
+$ <span class="userinput">arv-run grep -H -n ATTGGAGGAAAGATGAGTGAC -- *.fastq \> output.txt</span>
+Running pipeline qr1hi-d1hrv-mg3bju0u7r6w241
+</pre>
+</notextile>
+
+h1. Usage
+
+ at arv-run@ takes a command or command pipeline, along with stdin and stdout redirection, and creates an Arvados pipeline to run the command.  The syntax is designed to mimic standard shell syntax, so it is usually necessary to quote the metacharacters < > and | as either \< \> and \| or '<' '>' and '|'.
+
+ at arv-run@ introspects the command line to determine which arguments are file inputs.  If you specify a file that is only available on the local filesystem, it will be first uploaded to Arvados, and then the command line will be rewritten to refer to the newly uploaded file.  @arv-run@ also works together with @arv-mount@ to identify if a file specified on the command line is part of an Arvados collection.  If so, the command line will be rewritten to refer to the file within the collection without any upload necessary.
+
+ at arv-run@ will parallelize on the files listed on the command line after @-- at .  You may specify @--batch-size N@ after the @--@ but before listing any files to specify how many files to provide put on the command line for each task (see below for example).
+
+You may use stdin @<@ redirection on multiple files.  This will create a separate task for each input file.
+
+You are only permitted to supply a single file name for stdout @>@ redirection.  If there are multiple tasks, their output will be collated at the end of the pipeline.  Alternately, you may use "run-command":run-command.html parameter substitution in the file name to generate different filenames for each task.
+
+Multiple commands connected by pipes all execute in the same container.  If you need to capture intermediate results of a pipe, use the @tee@ command.
+
+ at arv-run@ commands always run inside a Docker image.  By default, this is "arvados/jobs".  Use @arv --docker-image IMG@ to specify the image to use.  Note: the Docker image must be uploaded to Arvados using @arv keep docker at .
+
+Use @arv-run --dry-run@ to print out the final Arvados pipeline generated by @arv-run@ without submitting it.
+
+By default, the pipeline will be submitted to your configured Arvado instance.  Use @arv-run --local@ to run the command locally using "arv-crunch-job".
+
+h1. Examples
+
+Run one @grep@ task per file, with each input files piped from stdin.  Redirect the output to output.txt.
+
+<notextile>
+<pre>
+$ <span class="userinput">arv-run grep -H -n ATTGGAGGAAAGATGAGTGAC \< *.fastq \> output.txt</span>
+</pre>
+</notextile>
+
+Run @cat | grep@ once per file.  Redirect the output to output.txt.
+
+<notextile>
+<pre>
+$ <span class="userinput">arv-run cat -- *.fastq \| grep -H -n ATTGGAGGAAAGATGAGTGAC \> output.txt</span>
+</pre>
+</notextile>
+
+Run @bwa@ for pairs of fastq files in "inputs" using the reference human_g1k_v37.fasta.
+
+<notextile>
+<pre>
+<span class="userinput">arv-run --docker-image arvados/jobs-java-bwa-samtools bwa mem reference/human_g1k_v37.fasta -- --batch-size 2 inputs/*.fastq \> '$(task.uuid).sam'</span>
+</pre>
+</notextile>
diff --git a/sdk/cli/bin/arv b/sdk/cli/bin/arv
index 9b486d2..59bdfae 100755
--- a/sdk/cli/bin/arv
+++ b/sdk/cli/bin/arv
@@ -112,7 +112,7 @@ def init_config
   end
 end
 
-subcommands = %w(keep pipeline tag ws edit)
+subcommands = %w(keep pipeline run tag ws edit)
 
 def check_subcommands client, arvados, subcommand, global_opts, remaining_opts
   case subcommand
@@ -142,6 +142,8 @@ def check_subcommands client, arvados, subcommand, global_opts, remaining_opts
       puts "Available methods: run"
     end
     abort
+  when 'run'
+    exec `which arv-run`.strip, *remaining_opts
   when 'tag'
     exec `which arv-tag`.strip, *remaining_opts
   when 'ws'
diff --git a/sdk/python/arvados/commands/run.py b/sdk/python/arvados/commands/run.py
index 551f955..a15a457 100644
--- a/sdk/python/arvados/commands/run.py
+++ b/sdk/python/arvados/commands/run.py
@@ -60,6 +60,10 @@ def statfile(prefix, fn):
 def main(arguments=None):
     args = arvrun_parser.parse_args(arguments)
 
+    if len(args.args) == 0:
+        arvrun_parser.print_help()
+        return
+
     reading_into = 2
 
     slots = [[], [], []]

commit 0a37e2d631fd98e2766245c4719586d38bdf10c8
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date:   Wed Oct 15 13:53:06 2014 -0400

    3609: Use run-command batch function instead implementing it in run.py.  Permit
    arv-run output redirection to not have a space between > or < and the first
    filename.

diff --git a/sdk/python/arvados/commands/run.py b/sdk/python/arvados/commands/run.py
index e118a9e..551f955 100644
--- a/sdk/python/arvados/commands/run.py
+++ b/sdk/python/arvados/commands/run.py
@@ -64,10 +64,14 @@ def main(arguments=None):
 
     slots = [[], [], []]
     for c in args.args:
-        if c == '>':
+        if c.startswith('>'):
             reading_into = 0
-        elif c == '<':
+            if len(c) > 1:
+                slots[reading_into].append(c[1:])
+        elif c.startswith('<'):
             reading_into = 1
+            if len(c) > 1:
+                slots[reading_into].append(c[1:])
         elif c == '|':
             reading_into = len(slots)
             slots.append([])
@@ -165,9 +169,7 @@ def main(arguments=None):
                 inp = "input%i" % (s-2)
                 groupargs = group_parser.parse_args(slots[2][i+1:])
                 if groupargs.batch_size:
-                    component["script_parameters"][inp] = []
-                    for j in xrange(0, len(groupargs.args), groupargs.batch_size):
-                        component["script_parameters"][inp].append(groupargs.args[j:j+groupargs.batch_size])
+                    component["script_parameters"][inp] = {"batch":groupargs.args, "size":groupargs.batch_size}
                     slots[s] = slots[s][0:i] + [{"foreach": inp, "command": "$(%s)" % inp}]
                 else:
                     component["script_parameters"][inp] = groupargs.args

commit 2ec515c8ef7f9cae426a6490d1317333718e1d5e
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date:   Wed Oct 15 13:35:31 2014 -0400

    3609: Further improve list handling.  Update documentation to new preferred
    syntax.  Add "batch" function to run-command.

diff --git a/crunch_scripts/run-command b/crunch_scripts/run-command
index fbfd511..e21089e 100755
--- a/crunch_scripts/run-command
+++ b/crunch_scripts/run-command
@@ -97,36 +97,64 @@ class SigHandler(object):
             sp.send_signal(signum)
         self.sig = signum
 
+# http://rightfootin.blogspot.com/2006/09/more-on-python-flatten.html
+def flatten(l, ltypes=(list, tuple)):
+    ltype = type(l)
+    l = list(l)
+    i = 0
+    while i < len(l):
+        while isinstance(l[i], ltypes):
+            if not l[i]:
+                l.pop(i)
+                i -= 1
+                break
+            else:
+                l[i:i + 1] = l[i]
+        i += 1
+    return ltype(l)
+
 def add_to_group(gr, match):
     m = match.groups()
     if m not in gr:
         gr[m] = []
     gr[m].append(match.group(0))
 
-def expand_item(p, c, flatten=True):
+def var_items(p, c, key):
+    if "var" in c:
+        # Var specifies
+        return (c["var"], get_items(p, c[key]))
+    else:
+        if isinstance(c[key], list):
+            return (None, get_items(p, c[key]))
+        m = re.match("^\$\((.*)\)$", c[key])
+        if m and m.group(1) in p:
+            return (m.group(1), get_items(p, c[key]))
+        else:
+            # backwards compatible, foreach specifies bare parameter name to use
+            return (c[key], get_items(p, p[c[key]]))
+
+def expand_item(p, c):
     if isinstance(c, dict):
         if "foreach" in c and "command" in c:
-            var = c["foreach"]
-            items = get_items(p, p[var])
+            var, items = var_items(p, c, "foreach")
             r = []
             for i in items:
                 params = copy.copy(p)
                 params[var] = i
-                r.extend(expand_item(params, c["command"]))
+                r.append(expand_item(params, c["command"]))
             return r
         if "list" in c and "index" in c and "command" in c:
-            var = c["list"]
-            items = get_items(p, p[var])
+            var, items = var_items(p, c, "list")
             params = copy.copy(p)
             params[var] = items[int(c["index"])]
-            return expand_list(params, c["command"])
+            return expand_item(params, c["command"])
         if "regex" in c:
             pattern = re.compile(c["regex"])
             if "filter" in c:
-                items = get_items(p, p[c["filter"]])
+                var, items = var_items(p, c, "filter")
                 return [i for i in items if pattern.match(i)]
             elif "group" in c:
-                items = get_items(p, p[c["group"]])
+                var, items = var_items(p, c, "group")
                 groups = {}
                 for i in items:
                     match = pattern.match(i)
@@ -134,50 +162,46 @@ def expand_item(p, c, flatten=True):
                         add_to_group(groups, match)
                 return [groups[k] for k in groups]
             elif "extract" in c:
-                items = get_items(p, p[c["extract"]])
+                var, items = var_items(p, c, "extract")
                 r = []
                 for i in items:
                     match = pattern.match(i)
                     if match:
                         r.append(list(match.groups()))
                 return r
+        if "batch" in c and "size" in c:
+            var, items = var_items(p, c, "batch")
+            sz = int(c["size"])
+            r = []
+            for j in xrange(0, len(items), sz):
+                r.append(items[j:j+sz])
+            return r
     elif isinstance(c, list):
-        return expand_list(p, c)
+        return [expand_item(p, arg) for arg in c]
     elif isinstance(c, basestring):
-        if flatten:
-            return [subst.do_substitution(p, c)]
+        m = re.match("^\$\((.*)\)$", c)
+        if m and m.group(1) in p:
+            return expand_item(p, p[m.group(1)])
         else:
             return subst.do_substitution(p, c)
 
     return []
 
-def expand_list(p, l, flatten=True):
-    if isinstance(l, basestring):
-        return expand_item(p, l)
-    elif flatten:
-        return [exp for arg in l for exp in expand_item(p, arg, flatten)]
-    else:
-        return [expand_item(p, arg, flatten) for arg in l]
-
-def get_items(p, value, flatten=True):
-    if isinstance(value, dict):
-        return expand_item(p, value)
-
+def get_items(p, value):
+    value = expand_item(p, value)
     if isinstance(value, list):
-        return expand_list(p, value, flatten)
-
-    fn = subst.do_substitution(p, value)
-    mode = os.stat(fn).st_mode
-    prefix = fn[len(os.environ['TASK_KEEPMOUNT'])+1:]
-    if mode is not None:
-        if stat.S_ISDIR(mode):
-            items = [os.path.join(fn, l) for l in os.listdir(fn)]
-        elif stat.S_ISREG(mode):
-            with open(fn) as f:
-                items = [line.rstrip("\r\n") for line in f]
-        return items
-    else:
-        return None
+        return value
+    elif isinstance(value, basestring):
+        mode = os.stat(value).st_mode
+        prefix = value[len(os.environ['TASK_KEEPMOUNT'])+1:]
+        if mode is not None:
+            if stat.S_ISDIR(mode):
+                items = [os.path.join(value, l) for l in os.listdir(value)]
+            elif stat.S_ISREG(mode):
+                with open(value) as f:
+                    items = [line.rstrip("\r\n") for line in f]
+            return items
+    raise Exception("get_items did not yield a list")
 
 stdoutname = None
 stdoutfile = None
@@ -187,7 +211,7 @@ stdinfile = None
 def recursive_foreach(params, fvars):
     var = fvars[0]
     fvars = fvars[1:]
-    items = get_items(params, params[var], False)
+    items = get_items(params, params[var])
     logger.info("parallelizing on %s with items %s" % (var, items))
     if items is not None:
         for i in items:
@@ -205,9 +229,10 @@ def recursive_foreach(params, fvars):
                     }).execute()
                 else:
                     if isinstance(params["command"][0], list):
-                        logger.info(expand_list(params, params["command"], False))
+                        for c in params["command"]:
+                            logger.info(flatten(expand_item(params, c)))
                     else:
-                        logger.info(expand_list(params, params["command"], True))
+                        logger.info(flatten(expand_item(params, params["command"])))
     else:
         logger.error("parameter %s with value %s in task.foreach yielded no items" % (var, params[var]))
         sys.exit(1)
@@ -252,9 +277,10 @@ try:
 
     cmd = []
     if isinstance(taskp["command"][0], list):
-        cmd.append(expand_list(taskp, taskp["command"], False))
+        for c in taskp["command"]:
+            cmd.append(flatten(expand_item(taskp, c)))
     else:
-        cmd.append(expand_list(taskp, taskp["command"], True))
+        cmd.append(flatten(expand_item(taskp, taskp["command"])))
 
     if "task.stdin" in taskp:
         stdinname = subst.do_substitution(taskp, taskp["task.stdin"])
diff --git a/doc/user/topics/run-command.html.textile.liquid b/doc/user/topics/run-command.html.textile.liquid
index 6d3e87b..c025008 100644
--- a/doc/user/topics/run-command.html.textile.liquid
+++ b/doc/user/topics/run-command.html.textile.liquid
@@ -77,7 +77,32 @@ If the value is a JSON object, it is evaluated as a list function described belo
 
 h2. List functions
 
-When @run-command@ is evaluating a list (such as "command"), in addition to string parameter substitution, you can use list item functions.  Note: in the following functions, you specify the name of a user parameter to act on; you cannot provide the list value directly in line.
+When @run-command@ is evaluating a list (such as "command"), in addition to string parameter substitution, you can use list item functions.  In the following functions, you can either specify the name of a user parameter to act on or provide list value directly in line, for example, the following two fragments yield the same result:
+
+<pre>
+{
+  "command": ["echo", {"foreach": "$(a)", "command": ["--something", "$(a)"]}],
+  "a": ["alice", "bob"]
+}
+</pre>
+
+<pre>
+{
+  "command": ["echo", {"foreach": ["alice", "bob"], "var":"a", "command": ["--something", "$(a)"]}],
+}
+</pre>
+
+Note: when you provide the list inline with "foreach" or "index", you must include the "var" parameter to specify the substitution variable name to use when evaluating the command fragment.
+
+You can also nest functions:
+
+<pre>
+{
+  "command": ["echo", {"foreach": {"filter": ["alice", "bob"], "regex": "b.*"},
+                       "var":"a",
+                       "command": ["--something", "$(a)"]}]
+}
+</pre>
 
 h3. foreach
 
@@ -85,7 +110,7 @@ The @foreach@ list item function (not to be confused with the @task.foreach@ dir
 
 <pre>
 {
-  "command": ["echo", {"foreach": "a", "command": ["--something", "$(a)"]}],
+  "command": ["echo", {"foreach": "$(a)", "command": ["--something", "$(a)"]}],
   "a": ["alice", "bob"]
 }
 </pre>
@@ -96,7 +121,7 @@ This function extracts a single item from a list.  The value of @index@ is zero-
 
 <pre>
 {
-  "command": ["echo", {"list": "a", "index": 1, "command": ["--something", "$(a)"]}],
+  "command": ["echo", {"list": "$(a)", "var":"a", "index": 1, "command": ["--something", "$(a)"]}],
   "a": ["alice", "bob"]
 }
 </pre>
@@ -107,7 +132,7 @@ Filter the list so that it only includes items that match a regular expression.
 
 <pre>
 {
-  "command": ["echo", {"filter": "a", "regex": "b.*"}],
+  "command": ["echo", {"filter": "$(a)", "regex": "b.*"}],
   "a": ["alice", "bob"]
 }
 </pre>
@@ -118,7 +143,7 @@ Generate a list of lists, where items are grouped on common subexpression match.
 
 <pre>
 {
-  "command": ["echo", {"foreach": "b", "command":["--group", {"foreach": "b", "command":"$(b)"}]}],
+  "command": ["echo", {"foreach": "$(b)", "command":["--group", {"foreach": "b", "command":"$(b)"}]}],
   "a": ["alice", "bob", "carol", "dave"],
   "b": {"group": "a", "regex": "[^a]*(a?).*"}
 }
@@ -130,12 +155,23 @@ Generate a list of lists, where items are split by subexpression match.  Items w
 
 <pre>
 {
-  "command": ["echo", {"foreach": "b", "command":[{"foreach": "b", "command":"$(b)"}]}],
+  "command": ["echo", {"foreach": "$(b)", "command":[{"foreach": "$(b)", "command":"$(b)"}]}],
   "a": ["alice", "bob", "carol", "dave"],
   "b": {"extract": "a", "regex": "(.+)(a)(.*)"}
 }
 </pre>
 
+h3. batch
+
+Generate a list of lists, where items are split into batch size.  If the list does not divide evenly into batch sizes, the last batch will be short.  The following example evaluates to @["echo", "--something", "alice", "bob", "--something", "carol", "dave"]@
+
+<pre>
+{
+  "command": ["echo", {"foreach":{"batch": "$(a)", "size": 2}, "var":"a", "command":["--something", "$(a)"]}],
+  "a": ["alice", "bob", "carol", "dave"]
+}
+</pre>
+
 h2. Directives
 
 Directives alter the behavior of run-command.  All directives are optional.

-----------------------------------------------------------------------


hooks/post-receive
--