...

After evaluating several tools, the HPC admins settled on two, ecs-sync and rclone, as 'best of breed' for moving data between ROSS and Grace's cluster storage system, where the /home, /scratch, and /storage directories live.  The ecs-sync program is the faster and more efficient of the two for bulk data moves.  It consumes fewer compute resources and less memory than rclone and, once the thread count has been tuned to its sweet spot, moves data significantly faster.  The rclone program has more features than ecs-sync, including ways to browse data in ROSS, to mount a ROSS bucket as if it were a POSIX file system, and to synchronize content using a familiar rsync-like command syntax.  While ecs-sync is best suited to fast, bulk moves, rclone works very well for nuanced access to ROSS and for small transfers.  The rclone program also works well for moving data between a workstation or laptop and ROSS.

...

| Option | Description |
|---|---|
| accessKey | The S3 access ID for ROSS's S3 API, which is the object user's username in the ECS world. |
| secretKey | The S3 access secret for ROSS associated with the object user, which can be copied from the ECS portal by the namespace administrator. |
| port | The port to use to access ROSS.  Should be 9020 if protocol is set to http, or 9021 if protocol is set to https. |
| protocol | The protocol to use to access ROSS, either http or https. |
| bucket | The bucket to read objects from or write objects to. |
| create-bucket | By default, the target bucket must exist. If true, this option will create the bucket if it does not exist. |
| enableVHosts | Specifies whether virtual hosted buckets will be used (i.e. bucket name in the hostname instead of in the path). The default is path-style buckets. |
| keyPrefix | Specifies a string to be prepended to the name generated for the object.  For example, if keyPrefix is set to "prefix/", the source is a file system, and ecs-sync is copying a file with a path "subdir/subdir/filename", then the object name will be "prefix/subdir/subdir/filename". |
| apacheClientEnabled | Disabling this will use the native Java HTTP protocol handler, which can be faster in some situations, but is buggy. |
| geoPinningEnabled | Enables geo-pinning. This will use a standard algorithm to select a consistent VDC for each object key or bucket name, taking into account where the request is made and where the VDCs that hold the data are. |
| includeVersions | Enable to transfer all versions of every object. NOTE: this will overwrite all versions of each source key in the target system if any exist! |
| mpuEnabled | Enables multi-part upload (MPU).  Large files will be split into multiple streams and (if possible) sent in parallel. |
| mpuPartSizeMb | Sets the part size to use when multipart upload is required (objects over 5GB). Default is 128MB, minimum is 4MB. |
| mpuThreadCount | The number of threads to use for multipart upload (only applicable for file sources). |
| mpuThresholdMb | Sets the size threshold (in MB) above which an upload becomes a multipart upload. |
| preserveDirectories | If enabled, directories are stored in S3 as empty objects to preserve empty dirs and metadata from the source. |
| remoteCopy | If enabled, a remote-copy command is issued instead of streaming the data.  Remote-copy can be much faster than the streaming alternative.  Remote-copy can only be used when the source and target are the same system (e.g. both are on ROSS). |
| resetInvalidContentType | When set to true (the default), any invalid content-type is reset to the default (application/octet-stream). Turn this off to fail these objects (ECS does not allow invalid content-types). |
| smartClientEnabled | The smart client is enabled by default. Use this option to turn it off when using a load balancer (which presumably would perform a function similar to smart client) or a fixed set of nodes. |
| socketConnectTimeoutMs | Sets the connection timeout in milliseconds (default is 15000ms). |
| socketReadTimeoutMs | Sets the read timeout in milliseconds (default is 0ms). |
| urlEncodeKeys | Enables URL-encoding of object keys in bucket listings. Use this if a source bucket has illegal XML characters in key names. |
| vdcs | Sets which Virtual Data Center, and which nodes in that virtual data center, ecs-sync will communicate with when copying.  The format is "vdc-name(host,..)[,vdc-name(host,..)][,..]".  The smart client capability will load balance across the active nodes, skipping the inactive ones.  We use this in favor of the host option.  We use this option above, listing the current UAMS vdc and nodes. |
| host | This is an alternative to the vdcs option, where host can be a comma-separated list of host network names, or can be the name of a load balancer. We do not provide an example of how to use the host option, as we prefer the vdcs option. |
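
The port, keyPrefix, and vdcs rules described in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the VDC name "vdc1", and the host names are hypothetical and are not part of ecs-sync, and the sketch assumes keyPrefix is applied by plain string concatenation, as the description suggests.

```python
from typing import Dict, List

# Illustrative helpers only; none of these functions are part of ecs-sync.


def default_port(protocol: str) -> int:
    """Return the ROSS port that matches the protocol (9020 http, 9021 https)."""
    ports = {"http": 9020, "https": 9021}
    if protocol not in ports:
        raise ValueError(f"protocol must be 'http' or 'https', got {protocol!r}")
    return ports[protocol]


def object_key(key_prefix: str, relative_path: str) -> str:
    """Prepend keyPrefix to the source-relative path to form the object name."""
    return key_prefix + relative_path


def vdcs_value(vdcs: Dict[str, List[str]]) -> str:
    """Format a vdcs value as vdc-name(host,..)[,vdc-name(host,..)]."""
    return ",".join(
        f"{name}({','.join(hosts)})" for name, hosts in vdcs.items()
    )


# Hypothetical VDC and node names, used only to show the string format.
example_vdcs = {"vdc1": ["node1.example.org", "node2.example.org"]}

print(default_port("https"))
print(object_key("prefix/", "subdir/subdir/filename"))
print(vdcs_value(example_vdcs))
```

Checking the port against the protocol before launching a large copy is a cheap way to catch the most common mismatch (https on 9020 or http on 9021) early.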

rclone

Although neither as fast nor as efficient as ecs-sync, the rclone program (https://www.rclone.org/) is one of the most versatile tools for accessing object stores such as ROSS.  It includes features for browsing various types of object and cloud storage systems, as well as local files, using POSIX file system conventions, with commands such as ls (list objects/files in a path), lsd (list directories/containers/buckets in a path), copy, move, delete, etc.  For some object/cloud storage systems, rclone can mount buckets as if they were network file systems, though with reduced functionality and speed, so that users can use familiar commands to access bucket contents instead of the rclone commands or object API.  But the most popular feature of rclone is the ability to sync a directory tree to a bucket on an object store using a familiar rsync-like syntax.  The rclone site includes documentation on how to install, configure, and use rclone.  The rclone program is licensed under the permissive, open-source MIT license, and hence is free to use and distribute.
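
As a rough illustration of the commands mentioned above, the following Python sketch assembles a few rclone command lines and prints them without running anything. The remote name "ross", the bucket name "mybucket", and the /scratch path are hypothetical placeholders; the remote is assumed to have been defined beforehand with rclone config.

```python
import shlex

# Hypothetical names: replace with your own rclone remote, bucket, and path.
REMOTE = "ross"
BUCKET = "mybucket"
LOCAL_DIR = "/scratch/myuser/results"


def rclone_cmd(subcommand: str, *args: str) -> str:
    """Build a shell-safe rclone command line (this does not execute it)."""
    return shlex.join(["rclone", subcommand, *args])


# List the buckets visible on the remote.
print(rclone_cmd("lsd", f"{REMOTE}:"))

# List the objects in one bucket.
print(rclone_cmd("ls", f"{REMOTE}:{BUCKET}"))

# Preview an rsync-like sync of a directory tree into the bucket.
# --dry-run reports what would change without transferring or deleting.
print(rclone_cmd("sync", LOCAL_DIR, f"{REMOTE}:{BUCKET}/results", "--dry-run"))
```

Note that rclone sync makes the destination match the source, which includes deleting destination objects that are absent from the source, so previewing with --dry-run first is a sensible habit.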

...