[arvados] Tuning keep performance
Peter Amstutz
peter.amstutz at curoverse.com
Fri Aug 4 14:25:53 EDT 2017
Try using the Collection class instead of CollectionWriter, and set
put_threads in the Collection constructor (in our experiments, I think
4-6 threads gave the best throughput).
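
A minimal sketch of this suggestion, assuming the Python SDK's
arvados.collection.Collection accepts the api_client, put_threads and
replication_desired keyword arguments, and that Collection.open() accepts
the 'wb' mode in the installed version. The helper names make_collection
and add_file, and the default of 4 threads, are illustrative only.

```python
import shutil


def make_collection(api_client, put_threads=4, replication=None):
    """Create an empty, writable Collection that uploads blocks with several threads."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from arvados.collection import Collection

    return Collection(api_client=api_client,
                      put_threads=put_threads,          # 4-6 suggested in this thread
                      replication_desired=replication)  # None leaves the cluster default


def add_file(collection, local_path, dest_path, chunk_size=1 << 20):
    """Copy a local file into the collection, chunk_size bytes per write."""
    # Assumption: 'wb' is an accepted mode for Collection.open() in your SDK version.
    with open(local_path, 'rb') as src, collection.open(dest_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

With the variables from the quoted script this might be called as
add_file(make_collection(self.arv), fileinfo.path, 'raw_data/' + fileinfo.filename),
followed by save_new() on the collection once all files are written.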
On Fri, Aug 4, 2017 at 2:08 PM, George Chlipala <gchlip2 at uic.edu> wrote:
> We have an application that we use to download data from Illumina
> Basespace directly to our servers. Previously we wrote directly to disk,
> and average data transfer speeds were > 10 MB/s. We modified the
> application (a Python script) to push the data into Arvados via a
> CollectionWriter. Now we are seeing data transfer speeds of 200-400 kB/s.
> Both the Arvados servers and our Basespace application are on the same
> subnet and connected via 1 Gbps ethernet.
>
> I have set up the keepstore volume to serialize, and I have the default
> buffer setting.
>
> Here is our keepstore configuration (keepstore.yml).
>
> BlobSignatureTTL: 96h0m0s
> BlobSigningKeyFile: /etc/arvados/keepstore/blob-signing.key
> Debug: false
> EnableDelete: true
> Listen: :25107
> LogFormat: text
> MaxBuffers: 100
> MaxRequests: 0
> PIDFile: ""
> RequireSignatures: false
> SystemAuthTokenFile: /etc/arvados/keepstore/system-auth.key
> TrashCheckInterval: 24h0m0s
> TrashLifetime: 96h0m0s
> Volumes:
> - DirectoryReplication: 0
>   ReadOnly: false
>   Root: /mnt/keep
>   Serialize: true
>   Type: Directory
>
> Also I have checked the socket connections on the system hosting the
> application and it is directly connecting to the keepstore server.
>
> Are there any other items to look at in order to improve performance?
>
> For reference, here are snippets from our push application. The
> following are the lines associated with creating the CollectionWriter.
>
> self.arv = arvados.api(token=arv_token, host=arvados_api_host)
> self.writer = CollectionWriter(self.arv, replication=replication)
>
> The following lines show how we push the data. The fileinfo object
> is a custom class that has the path and filename for the file fetched from
> Basespace. We fetch the file from Basespace and save it to a temp
> directory in case there are issues during the download. I have checked and
> the download speed is > 10 MB/s.
>
> with open(fileinfo.path, 'rb') as filein, \
>         self.writer.open('./raw_data/' + fileinfo.filename) as col_file:
>     logging.info("Adding file {0} to Arvados collection".format(
>         fileinfo.filename))
>     for data in filein.read():
>         col_file.write(data)
>         fileinfo.byte_count += len(data)
>
> col_file.close()
> filein.close()
>
> Any help would be greatly appreciated!
>
> George Chlipala, Ph.D.
> Senior Research Specialist
> Research Resources Center
> University of Illinois at Chicago
>
> phone: 312-413-1700
> email: gchlip2 at uic.edu
>
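
A side note on the quoted loop: "for data in filein.read()" reads the whole
file into memory and then iterates over the result one byte at a time, so
every col_file.write(data) call carries a single byte. That may account for
part of the slowdown, though the thread does not confirm it. A minimal
sketch of a chunked copy that still tracks the byte count follows;
copy_in_chunks and the 1 MiB CHUNK_SIZE are illustrative names and values,
not part of the Arvados SDK or the original script.

```python
import io
from functools import partial

CHUNK_SIZE = 1 << 20  # 1 MiB per write; an illustrative value, not a tuned one


def copy_in_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst in chunk_size pieces and return the number of bytes copied."""
    total = 0
    # iter(callable, sentinel) keeps calling src.read(chunk_size) until it returns b''.
    for chunk in iter(partial(src.read, chunk_size), b''):
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the temp file and the collection file.
    source = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5))
    target = io.BytesIO()
    print(copy_in_chunks(source, target))
```

In the quoted script, filein would play the role of src and col_file the
role of dst, with the return value added to fileinfo.byte_count.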