[ARVADOS] updated: 26d74dc0524c87c5dcc0c76040ce413a4848b57a
git at public.curoverse.com
Tue Oct 14 10:58:55 EDT 2014
Summary of changes:
presentations/barcamp/dependencies.go | 20 +++
presentations/barcamp/genomes_640.jpg | Bin 0 -> 90926 bytes
presentations/barcamp/goroutine.go | 10 ++
presentations/barcamp/keep.slide | 302 ++++++++++++++++++++++++++++++++++
presentations/barcamp/lolwut.jpg | Bin 0 -> 58786 bytes
presentations/barcamp/server.go | 15 ++
6 files changed, 347 insertions(+)
create mode 100644 presentations/barcamp/dependencies.go
create mode 100644 presentations/barcamp/genomes_640.jpg
create mode 100644 presentations/barcamp/goroutine.go
create mode 100644 presentations/barcamp/keep.slide
create mode 100644 presentations/barcamp/lolwut.jpg
create mode 100644 presentations/barcamp/server.go
via 26d74dc0524c87c5dcc0c76040ce413a4848b57a (commit)
from 6bdfed00c27c6034ffe4ad79a05bc9cadd9b9489 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
commit 26d74dc0524c87c5dcc0c76040ce413a4848b57a
Author: Tim Pierce <twp at curoverse.com>
Date: Tue Oct 14 10:58:18 2014 -0400
Adding Keep slides from BarCamp Boston 9.
No issue #.
diff --git a/presentations/barcamp/dependencies.go b/presentations/barcamp/dependencies.go
new file mode 100644
index 0000000..cf8b480
--- /dev/null
+++ b/presentations/barcamp/dependencies.go
@@ -0,0 +1,20 @@
+import (
+ "bufio"
+ "bytes"
+ "container/list"
+ "crypto/md5"
+ "encoding/json"
+ "fmt"
+ "github.com/gorilla/mux" // HL
+ "io"
+ "log"
+ "net/http"
+ "os"
+ "regexp"
+ "runtime"
+ "strconv"
+ "strings"
+ "syscall"
+ "time"
+)
+
diff --git a/presentations/barcamp/genomes_640.jpg b/presentations/barcamp/genomes_640.jpg
new file mode 100644
index 0000000..c64ae34
Binary files /dev/null and b/presentations/barcamp/genomes_640.jpg differ
diff --git a/presentations/barcamp/goroutine.go b/presentations/barcamp/goroutine.go
new file mode 100644
index 0000000..99ae0d0
--- /dev/null
+++ b/presentations/barcamp/goroutine.go
@@ -0,0 +1,10 @@
+func master() {
+ go start_slave()
+ // The slave may continue to run after master() returns.
+}
+
+func start_slave() {
+ for work_to_do() {
+ // ...
+ }
+}
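As the comment in goroutine.go notes, a goroutine may keep running after the function that started it returns. When the caller does need the result, one common way to wait is `sync.WaitGroup`. The following is a minimal sketch of that pattern, not code from Keep; `runWorkers` and its summing workload are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers is a hypothetical helper: it starts n goroutines, waits for
// all of them to finish, and returns the sum of the ids they reported.
func runWorkers(n int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex // guards total
		total int
	)
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			mu.Lock()
			total += id
			mu.Unlock()
		}(i)
	}
	wg.Wait() // block until every worker has called Done
	return total
}

func main() {
	fmt.Println(runWorkers(4)) // prints 10 (1+2+3+4)
}
```

Without the `wg.Wait()` call, `runWorkers` could return before any worker has run, which is the behavior the slide's comment describes.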
diff --git a/presentations/barcamp/keep.slide b/presentations/barcamp/keep.slide
new file mode 100644
index 0000000..a99d59d
--- /dev/null
+++ b/presentations/barcamp/keep.slide
@@ -0,0 +1,302 @@
+# This file is a Go slide presentation.
+#
+# To render it:
+#
+# $ go get code.google.com/p/go.tools/cmd/present
+#
+# Then, in the directory containing the slide file:
+# $ $GOPATH/bin/present
+
+Keep: Open Source Content-Addressed Storage
+How We Turned a Big Hot Mess of Perl Into a Sweet Go Ride
+11 Oct 2014
+Tags: golang, cas, content addressed storage, porting
+
+Tim Pierce
+Senior Software Engineer, Curoverse
+twp at curoverse.com
+http://curoverse.com/
+@qwrrty
+
+* Overview
+
+- The problem: large-scale data management is hard
+- The solution: content-addressed storage and federation
+- History of Warehouse and Keep
+- Keep: motivations and design goals
+- What we learned from Go
+
+* The problem: data management is hard.
+
+* The problem
+
+Managing data for scientific research is hard.
+
+It's too easy to lose data:
+
+ rm result<tab> ---OH WAIT CRAP NO
+ ./generate_results.py -o results1.csv ---OH WAIT CRAP NO
+
+Or lose track of how we got it.
+
+ $ ls -l results/
+ -rw-r----- 1 twp twp 859458786 Sep 2 15:39 results1.csv
+ -rw-r----- 1 twp twp 758489475 Sep 3 15:51 results2.csv
+ -rw-r----- 1 twp twp 958747348 Sep 4 11:46 results3.csv
+ -rw-r----- 1 twp twp 795984373 Sep 5 17:12 results4.csv
+ -rw-r----- 1 twp twp 833857373 Sep 6 9:38 results5.csv
+ -rw-r----- 1 twp twp 894847636 Sep 7 12:46 results6.csv
+ -rw-r----- 1 twp twp 847476854 Oct 2 12:17 results_umm.csv
+ -rw-r----- 1 twp twp 766845784 Sep 12 19:08 results_wednesday_i_think.csv
+ -rw-r----- 1 twp twp 932875738 Sep 8 18:32 results_whatever.csv
+
+* The problem
+
+Why is federation important?
+
+.image genomes_640.jpg
+
+Because the alternative is snail-mailing hard drives of data all over the world.
+
+* The solution: Keep
+
+* Keep: the open source content-addressed storage system.
+
+Design goals:
+
+- Gracefully handle data sets measured in the terabytes and petabytes.
+- Multi-tenant architecture
+- Lightweight permissions system
+- Minimize external dependencies
+- Data federation
+
+* What is Content-Addressed Storage?
+
+A very large key/value store, in which the _address_ of a data object is the hash of its _content_. Example:
+
+ $ head -c 10000000 /dev/urandom > /tmp/stuff
+ $ md5sum /tmp/stuff
+ c54a33209b03905476bf971e722c683d /tmp/stuff
+
+This file can only be stored under the address `c54a33209b03905476bf971e722c683d`.
+
+ PUT /c54a33209b03905476bf971e722c683d
+ -> HTTP/1.1 200 OK
+ c54a33209b03905476bf971e722c683d+10000000
+
+Attempting to store it under a different name yields an HTTP 4xx error.
+
+ PUT /ffffffffffffffffffffffffffffffff
+ -> HTTP/1.1 422 Hash mismatch in request
+
+With CAS, if you store the same data blob twice, you always get the same key back.
+
+Best known example: Git object IDs.
+
+* Why is CAS useful?
+
+Permanent storage.
+
+- Blobs cannot be overwritten.
+- Even on purpose!
+
+Determine quickly whether a large data blob is present in the store.
+
+These characteristics make content-addressed storage extremely well suited to any system which demands accountability on very large data sets, such as:
+
+- Scientific computing
+- Accounting data management
+- Photo retouching
+
+* Existing alternatives
+
+CAS systems typically have names like: EMC, NetApp, IBM.
+
+Cost for a storage device and licensing tends to be on the order of $2,000-$10,000 per terabyte.
+
+Not open source, therefore, poopy.
+
+A few open source alternatives exist, notably [[https://camlistore.org/][Camlistore]], but some features are missing or immature:
+
+- Multi-tenant support
+- Blob permissions
+
+* Keep architecture
+
+Data is written in blocks up to 64MB.
+
+Smart client, dumb server. Client is responsible for:
+
+- Structured data (directories, folders, names, etc.)
+- Replication
+
+Default implementation on top of POSIX filesystem. Cheap, easy to deploy.
+
+ -rw------- 1 keep keep 14551778 Oct 10 17:06 /keep/87b/87b0f2f2eb0c1f90c6da46309a799cc0
+ -rw------- 1 keep keep 8154404 Oct 10 17:06 /keep/cc5/cc57ebe000aed447e1e481569e1a8abd
+ -rw------- 1 keep keep 7239086 Oct 10 17:06 /keep/ae8/ae8a8b29d9fb6325ee93d951cdae896f
+ -rw------- 1 keep keep 14455989 Oct 10 17:06 /keep/e92/e928a4d4b5c3ea903914d178bbfdb035
+
+Keep Volumes can be implemented on top of any backend service:
+
+- RAID
+- Amazon S3
+- Google Cloud Storage
+
+* Keep permissions
+
+Goals:
+
+- Require permission to read blocks
+- No hard dependencies on external authentication services
+
+Solution: _permission_hints_.
+
+Block requests are accompanied by a timestamped signature, e.g.:
+
+ 87b0f2f2eb0c1f90c6da46309a799cc0+14551778 + Abcf33732294c3e1fe16e39cea3114c9461274645 @ 5438550a
+ \--------- block locator string --------/ \------------ SHA-1 signature ----------/ timestamp
+
+Signatures are derived from the block hash, the user's OAuth token,
+the expiration timestamp, and a server-side signing secret.
+
+Permission hints can be generated by the authentication server.
+
+The Keep server can verify a valid permission signature instantly, _without_even_having_to_contact_any_other_service._
+
+* From Perl to Go
+
+Original Keep implementation, "Warehouse", written in Perl for the Harvard Personal Genome Project.
+
+Many drawbacks:
+
+- Perl
+- Eats ALL the memory
+- Not multithreaded (see also: Perl)
+- Slooooooooooooow. Slow slow slow.
+- Slow.
+- Did I mention Perl?
+
+* What did Go give us?
+
+So we rewrote Warehouse in Go.
+
+- 3 weeks to working prototype (supports GET and PUT)
+- 6 weeks to production (including permissions)
+
+Advantages:
+
+- Easier dependency management
+- Clean concurrency architecture
+- Rapid develop/build/test/deploy cycle
+
+* Dependency management: Warehouse
+
+Welcome to dependency hell.
+
+ Package: libwarehouse-perl
+ Architecture: all
+ Depends: ${perl:Depends}, ${misc:Depends}, libdbi-perl, libwww-perl,
+ libio-stringy-perl, libtimedate-perl, libgnupg-interface-perl,
+ libunix-syslog-perl, libbsd-resource-perl, libio-compress-zlib-perl,
+ libdigest-sha-perl, bioperl, gcc, g++, libstdc++6, bison, perlmagick,
+ imagemagick, gnuplot, bzip2, libbz2-dev, libfftw3-3, libfftw3-dev,
+ libxml-simple-perl, ghostscript, xsltproc, libyaml-perl,
+ libjson-perl, realpath, psmisc
+ Recommends: libhttp-ghttp-perl, libwhisker2-perl, libfuse-perl
+ Description: Warehouse -- Client and Server library for the storage warehouse.
+ Warehouse -- Client and Server library for the Free Factory storage warehouse.
+
+Deployment dependencies include MogileFS, MySQL, PGP, others.
+
+The stuff of nightmares.
+
+* Dependency management: Keep
+
+.code dependencies.go
+
+Exactly one third-party dependency: [[https://github.com/gorilla/mux][github.com/gorilla/mux]] (HTTP request routing).
+
+* Concurrency: Perl
+
+.image lolwut.jpg
+
+* Concurrency: Keep
+
+Go implements concurrency in "goroutines" -- small tasks that run independently of each other and may be run on other threads.
+
+The Go runtime is responsible for managing threads and scheduling tasks.
+
+Writing concurrent code is as simple as:
+
+.code goroutine.go
+
+* Concurrency: Keep
+
+The standard Go libraries provide a rich set of tools for writing concurrent applications.
+
+This is a complete implementation of a working multithreaded HTTP server in Go:
+
+.code server.go
+
+The HTTP library's `http.Server` type handles each request in its own goroutine.
+
+* Rapid development cycles
+
+Very fast build cycles:
+
+ hitchcock:/home/twp/arvados/services/keepstore% wc handlers.go keepstore.go perms.go volume.go work_queue.go
+ ...
+ 1515 6183 45084 total
+ hitchcock:/home/twp/arvados/services/keepstore% time go build
+ go build 1.01s user 0.15s system 99% cpu 1.154 total
+
+Testing:
+
+ hitchcock:/home/twp/arvados/services/keepstore% wc *_test.go
+ ...
+ 1830 6233 50798 total
+ hitchcock:/home/twp/arvados/services/keepstore% go test
+ ...
+ PASS
+ ok _/home/twp/arvados/services/keepstore 1.018s
+
+* Rapid development cycles
+
+ PASS
+ ok _/home/twp/arvados/services/keepstore 1.018s
+
+Why does the test take that long?
+
+Oh right.
+
+ // Sleep for 1s, then put the block again. The volume
+ // should report a more recent mtime.
+ //
+ // TODO(twp): this would be better handled with a mock Time object.
+ // Alternatively, set the mtime manually to some moment in the past
+ // (maybe a v.SetMtime method?)
+ //
+ time.Sleep(time.Second)
+
+* Lessons from Go
+
+You don't have to choose between performance and rapid prototyping.
+
+Extremely fast design and test cycle.
+
+Where Perl celebrates "laziness, impatience, and hubris," Go makes it
+easy to do the right thing.
+
+Go makes laziness awkward.
+
+* Where do we go from here?
+
+Keep is open source ([[https://github.com/curoverse/arvados/blob/master/agpl-3.0.txt][AGPLv3]])
+
+Keep source code is at [[https://github.com/curoverse/arvados/tree/master/services/keepstore]]
+
+The full source code repository: [[https://github.com/curoverse/arvados]]
+
+We are eager for contributors and new ideas for how to use Keep!
+
diff --git a/presentations/barcamp/lolwut.jpg b/presentations/barcamp/lolwut.jpg
new file mode 100644
index 0000000..0fba831
Binary files /dev/null and b/presentations/barcamp/lolwut.jpg differ
diff --git a/presentations/barcamp/server.go b/presentations/barcamp/server.go
new file mode 100644
index 0000000..91f519a
--- /dev/null
+++ b/presentations/barcamp/server.go
@@ -0,0 +1,15 @@
+package main
+
+import (
+ "log"
+ "net/http"
+)
+
+func main() {
+ http.HandleFunc("/hello", helloHandler)
+ log.Fatal(http.ListenAndServe(":8080", nil))
+}
+
+func helloHandler(w http.ResponseWriter, r *http.Request) {
+ w.Write([]byte("Hello!\n"))
+}
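Building on server.go, the content-addressing rule from the slides (a block may only be stored under the MD5 of its content, otherwise the server answers 422) can be sketched as an HTTP handler. This is an illustration under assumptions, not Keep's handler: `locatorFor`, `checkPut`, and `putHandler` are hypothetical names, nothing is written to disk, and the 64MB block limit is not enforced. `main` drives the handler in-process with `net/http/httptest` instead of listening on a port.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// locatorFor returns "<md5 hex>+<size>", the locator form shown in the slides.
func locatorFor(data []byte) string {
	return fmt.Sprintf("%x+%d", md5.Sum(data), len(data))
}

// checkPut decides the response for a PUT of data to path: 200 plus the
// locator when the path is the MD5 of the data, 422 otherwise.
func checkPut(path string, data []byte) (int, string) {
	locator := locatorFor(data)
	hash := strings.SplitN(locator, "+", 2)[0]
	if strings.TrimPrefix(path, "/") != hash {
		return http.StatusUnprocessableEntity, "Hash mismatch in request\n"
	}
	return http.StatusOK, locator + "\n"
}

// putHandler reads the request body and applies checkPut. A real server
// would also persist the block; that step is omitted here.
func putHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != "PUT" {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	data, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	status, body := checkPut(r.URL.Path, data)
	w.WriteHeader(status)
	io.WriteString(w, body)
}

func main() {
	data := []byte("hello\n")
	hash := strings.SplitN(locatorFor(data), "+", 2)[0]
	for _, path := range []string{"/" + hash, "/ffffffffffffffffffffffffffffffff"} {
		req := httptest.NewRequest("PUT", path, bytes.NewReader(data))
		rec := httptest.NewRecorder()
		putHandler(rec, req)
		fmt.Printf("PUT %s -> %d %s", path, rec.Code, rec.Body.String())
	}
}
```

To serve it for real, register `putHandler` with `http.HandleFunc("/", putHandler)` and call `http.ListenAndServe` as server.go does.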
-----------------------------------------------------------------------
hooks/post-receive