[arvados] Tuning keep performance

Fri Aug 4 14:25:53 EDT 2017

Try using the Collection class instead of CollectionWriter, and set
put_threads in the Collection constructor. In our experiments, I think we
found that 4-6 threads gave the best throughput.
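
As a rough sketch of that suggestion (not tested against your cluster): the code below assumes an SDK version whose Collection constructor accepts put_threads, and the names copy_in_chunks, push_file and CHUNK_SIZE are made up for illustration. It also reads the source file in fixed-size chunks. The loop in the quoted script iterates over the result of filein.read(), which in Python walks the data one element at a time, so each write() may be receiving a single byte. That is a separate thing worth checking.

```python
import io

CHUNK_SIZE = 1 << 20  # 1 MiB per write; an assumed value, tune as needed


def copy_in_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst in fixed-size chunks and return the bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of file
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def push_file(local_path, collection_path, collection_name, put_threads=4):
    """Hypothetical helper: store one local file in a new collection."""
    # Imported here so the copy helper above can be tried without the SDK.
    import arvados
    import arvados.collection

    api = arvados.api('v1')
    # put_threads sets how many threads upload blocks to Keep in parallel.
    coll = arvados.collection.Collection(api_client=api,
                                         put_threads=put_threads)
    with open(local_path, 'rb') as filein, \
            coll.open(collection_path, 'wb') as col_file:
        copied = copy_in_chunks(filein, col_file)
    coll.save_new(name=collection_name)
    return copied


if __name__ == "__main__":
    # Stand-in streams so the copy loop can be tried without a cluster.
    source = io.BytesIO(b"x" * 2500)
    sink = io.BytesIO()
    print(copy_in_chunks(source, sink, chunk_size=1000))
```

Something like push_file(fileinfo.path, 'raw_data/' + fileinfo.filename, 'basespace run') would be the call from your script. The collection name there is a placeholder, and the replication setting from your CollectionWriter call is left out of the sketch.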

On Fri, Aug 4, 2017 at 2:08 PM, George Chlipala <gchlip2 at uic.edu> wrote:

> We have an application that we use to download data from Illumina
> Basespace directly to our servers.  Previously we wrote directly
> to disk, and average data transfer speeds were > 10 MB/s.  We modified the
> application (a Python script) to push the data into Arvados via a
> CollectionWriter.  Now we are seeing data transfer speeds of 200-400 kB/s.
> Both the Arvados servers and our Basespace application are on the same
> subnet and connected via 1 Gbps Ethernet.
>
> I have set up the keepstore volume to serialize, and I have the default
> buffer setting.
>
> Here is our keepstore configuration (keepstore.yml).
>
> BlobSignatureTTL: 96h0m0s
> BlobSigningKeyFile: /etc/arvados/keepstore/blob-signing.key
> Debug: false
> EnableDelete: true
> Listen: :25107
> LogFormat: text
> MaxBuffers: 100
> MaxRequests: 0
> PIDFile: ""
> RequireSignatures: false
> SystemAuthTokenFile: /etc/arvados/keepstore/system-auth.key
> TrashCheckInterval: 24h0m0s
> TrashLifetime: 96h0m0s
> Volumes:
> - DirectoryReplication: 0
>   ReadOnly: false
>   Root: /mnt/keep
>   Serialize: true
>   Type: Directory
>
> Also, I have checked the socket connections on the system hosting the
> application, and it is connecting directly to the keepstore server.
>
> Are there any other items to look at in order to improve performance?
>
> For reference, here are snippets from our push application.  The
> following lines create the CollectionWriter.
>
> self.arv = arvados.api(token=arv_token, host=arvados_api_host)
> self.writer = CollectionWriter(self.arv, replication=replication)
>
> The following lines push the data.  The fileinfo object is a custom
> class that has the path and filename for the file fetched from
> Basespace.  We fetch the file from Basespace and save it to a temp
> directory in case there are issues during the download.  I have checked and
> the download speed is > 10 MB/s.
>
> with open(fileinfo.path, 'rb') as filein, self.writer.open('./raw_data/' + fileinfo.filename) as col_file:
>     logging.info("Adding file {0} to Arvados collection".format(fileinfo.filename))
>     for data in filein.read():
>         col_file.write(data)
>         fileinfo.byte_count += len(data)
>
>     col_file.close()
>     filein.close()
>
> Any help would be greatly appreciated!
>
> George Chlipala, Ph.D.
> Senior Research Specialist
> Research Resources Center
> University of Illinois at Chicago
>
> phone: 312-413-1700
> email: gchlip2 at uic.edu