[ARVADOS] updated: 3fb3557b13cce8fee3bab0f81ab03b78fbef67dd
git at public.curoverse.com
Tue Mar 11 12:21:31 EDT 2014
Summary of changes:
doc/_config.yml | 3 +-
...nning-pipeline-command-line.html.textile.liquid | 14 +++++++-----
doc/user/topics/tutorial-job1.html.textile.liquid | 22 ++-----------------
.../topics/tutorial-parallel.html.textile.liquid | 9 +------
.../tutorials/intro-crunch.html.textile.liquid | 17 +++++++++++++++
.../tutorial-firstscript.html.textile.liquid | 2 +-
.../tutorials/tutorial-keep.html.textile.liquid | 2 +-
7 files changed, 34 insertions(+), 35 deletions(-)
create mode 100644 doc/user/tutorials/intro-crunch.html.textile.liquid
via 3fb3557b13cce8fee3bab0f81ab03b78fbef67dd (commit)
from 005928e7cbe6abbe418588f3eb652b3dee16e544 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
commit 3fb3557b13cce8fee3bab0f81ab03b78fbef67dd
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date: Tue Mar 11 12:22:57 2014 -0400
More documentation updates and reorganization.
diff --git a/doc/_config.yml b/doc/_config.yml
index bc71282..20b3f25 100644
--- a/doc/_config.yml
+++ b/doc/_config.yml
@@ -22,13 +22,14 @@ navbar:
- user/getting_started/community.html.textile.liquid
- Tutorials:
- user/tutorials/tutorial-keep.html.textile.liquid
+ - user/tutorials/intro-crunch.html.textile.liquid
- user/tutorials/tutorial-pipeline-workbench.html.textile.liquid
- user/tutorials/tutorial-firstscript.html.textile.liquid
- user/tutorials/tutorial-new-pipeline.html.textile.liquid
- user/tutorials/running-external-program.html.textile.liquid
- Intermediate topics:
- - user/topics/tutorial-job1.html.textile.liquid
- user/topics/running-pipeline-command-line.html.textile.liquid
+ - user/topics/tutorial-job1.html.textile.liquid
- user/topics/tutorial-job-debug.html.textile.liquid
- user/topics/tutorial-parallel.html.textile.liquid
- user/topics/tutorial-trait-search.html.textile.liquid
diff --git a/doc/user/topics/running-pipeline-command-line.html.textile.liquid b/doc/user/topics/running-pipeline-command-line.html.textile.liquid
index b8ee8ed..3f85077 100644
--- a/doc/user/topics/running-pipeline-command-line.html.textile.liquid
+++ b/doc/user/topics/running-pipeline-command-line.html.textile.liquid
@@ -4,7 +4,7 @@ navsection: userguide
title: "Running a pipeline on the command line"
...
-Run the pipeline using @arv pipeline run@, using the UUID that you received from @arv pipeline create@:
+It is possible to run pipelines on the command line using @arv pipeline run@ with the UUID that you received from @arv pipeline create@:
<notextile>
<pre><code>$ <span class="userinput">arv pipeline run --template qr1hi-p5p6p-xxxxxxxxxxxxxxx</span>
@@ -31,7 +31,7 @@ The Keep locators of the output of each of @"do_hash"@ and @"filter"@ component
0f1d6bcf55c34bed7f92a805d2d89bbf alice.txt
504938460ef369cd275e4ef58994cffe bob.txt
8f3b36aff310e06f3c5b9e95678ff77a carol.txt
-$ <span class="userinput">arv keep get 735ac35adf430126cf836547731f3af6+56</span>
+$ <span class="userinput">arv keep get 735ac35adf430126cf836547731f3af6+56/0-filter.txt</span>
0f1d6bcf55c34bed7f92a805d2d89bbf alice.txt
</code></pre>
</notextile>
@@ -53,7 +53,7 @@ Notice that the pipeline definition explicitly specifies the Keep locator for th
</code></pre>
</notextile>
-What if we want to run the pipeline on a different input block? One option is to define a new pipeline template, but would potentially result in clutter with many pipeline templates defined for one-off jobs. Instead, you can override values in the input of the component like this:
+You can specify values for pipeline component script_parameters like this:
<notextile>
<pre><code>$ <span class="userinput">arv pipeline run --template qr1hi-d1hrv-vxzkp38nlde9yyr do_hash::input=33a9f3842b01ea3fdf27cc582f5ea2af+242</span>
@@ -78,9 +78,11 @@ filter qr1hi-8i9sb-j347g1sqovdh0op fb728f0ffe152058fa64b9aeed344cb5+54
Now check the output:
<notextile>
-<pre><code>$ <span class="userinput">arv keep ls -s fb728f0ffe152058fa64b9aeed344cb5+54</span>
-0 0-filter.txt
+<pre><code>$ <span class="userinput">arv keep get 880b55fb4470b148a447ff38cacdd952+54/md5sum.txt</span>
+44b8ae3fde7a8a88d2f7ebd237625b4f var-GS000016015-ASM.tsv.bz2
+$ <span class="userinput">arv keep get fb728f0ffe152058fa64b9aeed344cb5+54/0-filter.txt</span>
+
</code></pre>
</notextile>
-Here the filter script output is empty, so none of the files in the collection have hash code that start with 0.
+Since none of the files in the collection has a hash code that starts with 0, the output of the filter component is empty.
diff --git a/doc/user/topics/tutorial-job1.html.textile.liquid b/doc/user/topics/tutorial-job1.html.textile.liquid
index 796f684..c4db2db 100644
--- a/doc/user/topics/tutorial-job1.html.textile.liquid
+++ b/doc/user/topics/tutorial-job1.html.textile.liquid
@@ -4,27 +4,15 @@ navsection: userguide
title: "Running a Crunch job on the command line"
...
-This tutorial introduces the concepts and use of the Crunch job system using the @arv@ command line tool and Arvados Workbench.
+This tutorial introduces how to run individual Crunch jobs using the @arv@ command line tool.
*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.baseurl}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html*
-In "retrieving data using Keep,":tutorial-keep.html we downloaded a file from Keep and did some computation with it (specifically, computing the md5 hash of the complete file). While a straightforward way to accomplish a computational task, there are several obvious drawbacks to this approach:
-* Large files require significant time to download.
-* Very large files may exceed the scratch space of the local disk.
-* We are only able to use the local CPU to process the file.
-
-The Arvados "Crunch" framework is designed to support processing very large data batches (gigabytes to terabytes) efficiently, and provides the following benefits:
-* Increase concurrency by running tasks asynchronously, using many CPUs and network interfaces at once (especially beneficial for CPU-bound and I/O-bound tasks respectively).
-* Track inputs, outputs, and settings so you can verify that the inputs, settings, and sequence of programs you used to arrive at an output is really what you think it was.
-* Ensure that your programs and workflows are repeatable with different versions of your code, OS updates, etc.
-* Interrupt and resume long-running jobs consisting of many short tasks.
-* Maintain timing statistics automatically, so they're there when you want them.
-
-For your first job, you will run the "hash" crunch script using the Arvados system. The "hash" script computes the md5 hash of each file in a collection.
+You will create a job to run the "hash" crunch script. The "hash" script computes the md5 hash of each file in a collection.
h2. Jobs
-A "job" is a single run of a specific version of a crunch script with a specific input.
+Crunch pipelines consist of one or more jobs. A "job" is a single run of a specific version of a crunch script with a specific input. You can also run jobs individually.
A request to run a crunch job is described using a JSON object. For example:
@@ -231,7 +219,3 @@ The log collection consists of one log file named with the job id. You can acce
2013-12-16_20:44:53 qr1hi-8i9sb-1pm1t02dezhupss 7575 finish
</code></pre>
</notextile>
-
-<hr>
-
-This concludes the first tutorial. In the next tutorial, we will "write a crunch job script.":tutorial-firstscript.html
diff --git a/doc/user/topics/tutorial-parallel.html.textile.liquid b/doc/user/topics/tutorial-parallel.html.textile.liquid
index 2e4cf78..cf83900 100644
--- a/doc/user/topics/tutorial-parallel.html.textile.liquid
+++ b/doc/user/topics/tutorial-parallel.html.textile.liquid
@@ -4,7 +4,7 @@ navsection: userguide
title: "Parallel Crunch tasks"
...
-In the tutorial "writing a crunch script,":tutorial-firstscript.html our script used a "for" loop to compute the md5 hashes for each file in sequence. This approach, while simple, is not able to take advantage of the compute cluster with multiple nodes and cores to speed up computation by running tasks in parallel. This tutorial will demonstrate how to create parallel Crunch tasks.
+In the previous tutorials, we used @arvados.job_setup.one_task_per_input_file()@ to automatically parallelize our jobs by creating a separate task per file. For some types of jobs, you may need to split the work up differently, for example by creating tasks to process different segments of a single large file. This tutorial demonstrates how to create Crunch tasks directly.
Start by entering the @crunch_scripts@ directory of your git repository:
@@ -65,7 +65,7 @@ EOF</span>
Because the job ran in parallel, each instance of parallel-hash creates a separate @md5sum.txt@ as output. Arvados automatically collates these files into a single collection, which is the output of the job:
<notextile>
-<pre><code>~/<b>you</b>/crunch_scripts$ <span class="userinput">arv keep get e2ccd204bca37c77c0ba59fc470cd0f7+162</span>
+<pre><code>~/<b>you</b>/crunch_scripts$ <span class="userinput">arv keep ls e2ccd204bca37c77c0ba59fc470cd0f7+162</span>
md5sum.txt
md5sum.txt
md5sum.txt
@@ -76,9 +76,4 @@ md5sum.txt
</code></pre>
</notextile>
-h2. The one job per file pattern
-
-This example demonstrates how to schedule a new task per file. Because this is a common pattern, the Crunch Python API contains a convenience function to "queue a task for each input file":{{site.baseurl}}/sdk/python/crunch-utility-libraries.html#one_task_per_input which reduces the amount of boilerplate code required to handle parallel jobs.
-
-Next, "Constructing a Crunch pipeline":tutorial-new-pipeline.html
diff --git a/doc/user/tutorials/intro-crunch.html.textile.liquid b/doc/user/tutorials/intro-crunch.html.textile.liquid
new file mode 100644
index 0000000..46b4d6c
--- /dev/null
+++ b/doc/user/tutorials/intro-crunch.html.textile.liquid
@@ -0,0 +1,17 @@
+---
+layout: default
+navsection: userguide
+title: Introduction to Crunch
+...
+
+In "getting data from Keep,":tutorial-keep.html#arv-get we downloaded a file from Keep and did some computation with it (specifically, computing the md5 hash of the complete file). While a straightforward way to accomplish a computational task, there are several obvious drawbacks to this approach:
+* Large files require significant time to download.
+* Very large files may exceed the scratch space of the local disk.
+* We are only able to use the local CPU to process the file.
+
+The Arvados "Crunch" framework is designed to support processing very large data batches (gigabytes to terabytes) efficiently, and provides the following benefits:
+* Increase concurrency by running tasks asynchronously, using many CPUs and network interfaces at once (especially beneficial for CPU-bound and I/O-bound tasks respectively).
+* Track inputs, outputs, and settings so you can verify that the inputs, settings, and sequence of programs you used to arrive at an output is really what you think it was.
+* Ensure that your programs and workflows are repeatable with different versions of your code, OS updates, etc.
+* Interrupt and resume long-running jobs consisting of many short tasks.
+* Maintain timing statistics automatically, so they're there when you want them.
diff --git a/doc/user/tutorials/tutorial-firstscript.html.textile.liquid b/doc/user/tutorials/tutorial-firstscript.html.textile.liquid
index 0582d53..5d0a4da 100644
--- a/doc/user/tutorials/tutorial-firstscript.html.textile.liquid
+++ b/doc/user/tutorials/tutorial-firstscript.html.textile.liquid
@@ -64,7 +64,7 @@ Make the file executable:
notextile. <pre><code>~/<b>you</b>/crunch_scripts$ <span class="userinput">chmod +x hash.py</span></code></pre>
{% include 'notebox_begin' %}
-The steps below describe how to execute the script after committing changes to git. To test the script locally, please see the "debugging a crunch script":{{site.baseurl}}/user/topics/tutorial-job-debug.html page.
+The steps below describe how to execute the script after committing changes to git. To run a script locally for testing, please see "debugging a crunch script":{{site.baseurl}}/user/topics/tutorial-job-debug.html .
{% include 'notebox_end' %}
diff --git a/doc/user/tutorials/tutorial-keep.html.textile.liquid b/doc/user/tutorials/tutorial-keep.html.textile.liquid
index 9fbdb2a..0321760 100644
--- a/doc/user/tutorials/tutorial-keep.html.textile.liquid
+++ b/doc/user/tutorials/tutorial-keep.html.textile.liquid
@@ -81,7 +81,7 @@ You may access collections through the "Collections section of Arvados Workbench
* "https://{{ site.arvados_workbench_host }}/collections/c1bad4b39ca5a924e481008009d94e32+210":https://{{ site.arvados_workbench_host }}/collections/c1bad4b39ca5a924e481008009d94e32+210
* "https://{{ site.arvados_workbench_host }}/collections/887cd41e9c613463eab2f0d885c6dd96+83/alice.txt":https://{{ site.arvados_workbench_host }}/collections/887cd41e9c613463eab2f0d885c6dd96+83/alice.txt
-h2. Using arv-get
+h2(#arv-get). Using arv-get
You can view the contents of a collection using @arv keep ls@:
-----------------------------------------------------------------------
hooks/post-receive
--
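
Note for readers of this commit: the tutorials touched above revolve around a "hash" script that computes the md5 hash of each file in a collection, with one task per input file. The following is a minimal local sketch of that idea only. It does not use the Arvados SDK or Keep; it assumes the "collection" is an ordinary local directory, and the names @md5_of_file@ and @hash_directory@ are hypothetical helpers invented for illustration, not part of any Arvados API. A thread pool stands in for Crunch tasks.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of one file (hypothetical helper).

    The file is read in fixed-size chunks so memory use stays bounded
    even for very large inputs.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_directory(directory, max_workers=4):
    """Hash every regular file directly inside `directory`, one task per file.

    Assumption: a local directory stands in for a Keep collection, and a
    thread pool stands in for Crunch tasks. Returns one "<md5> <name>"
    line per file, sorted by file name.
    """
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so digests line up with files.
        digests = list(pool.map(md5_of_file, files))
    return ["%s %s" % (d, f.name) for d, f in zip(digests, files)]


if __name__ == "__main__":
    import sys

    # Usage: python hash_sketch.py [directory]
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for line in hash_directory(target):
        print(line)
```

Running the sketch against a directory prints one line per file in the same "hash, then file name" layout as the @md5sum.txt@ output quoted in the diff. The real Crunch script additionally records its inputs and stores its output in Keep, which this sketch makes no attempt to reproduce.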