...

If you wish to access an existing group namespace as an object user, please contact the administrator of that namespace and ask to be added as an object user.  If you need assistance determining what namespaces are available and who the namespace administrators are, feel free to contact the HPC Admins via e-mail (hpcadmin@uams.edu).

...

The primary protocols that the HPC Admins support are S3 and Swift.  We have more experience with S3, so most of this article assumes S3.  Note that S3 and Swift can be used interchangeably and simultaneously on the same bucket.
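
For example, assuming you already have a bucket and an object user, a minimal S3 access sketch from Python with boto3 might look like the following.  The endpoint URL, bucket name, and credentials are placeholders, not the actual ROSS values; substitute the S3 endpoint and object-user credentials provided for your namespace.

    # Minimal sketch of S3 access to ROSS using boto3.
    # Endpoint, credentials, and bucket name are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://object.example.edu",   # placeholder S3 endpoint; ask the HPC Admins for the real one
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # List the first few objects in a bucket you can access.
    resp = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])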

The file-based protocols NFS and HDFS must be enabled on a bucket at creation time if you intend to use them, since the underlying system adds additional metadata when file access is enabled.  Once enabled, NFS and HDFS can likewise be used interchangeably with the S3 and Swift object protocols on the same bucket.  However, a bucket with file access enabled loses some object features, such as life cycle management, so consider carefully before enabling file access.  (You can still access a bucket using file system semantics with some external tools even when the bucket is not enabled for file access.)

...

  1. Many object tools have options to create and manipulate buckets, which is quite convenient and generally portable (i.e. would work on any object storage system), but also limited.  In general, tools cannot create buckets with ECS options, such as encryption, replication, and file access; for those you have to use one of the other bucket creation options.  Object tools only need the object user's access credentials.  Please see the tools' documentation for details.
  2. The RESTful APIs, including S3, Swift, and an ECS-specific management API, can be used to create and manipulate buckets.  With the appropriate headers and parameters, ECS options can be enabled.  More information can be found in the ECS Data Access Guide and the ECS API Reference (hint - use the search function).  A minimal example of creating a bucket through the plain S3 API is sketched after this list.
    1. The protocol-specific S3 or Swift REST APIs need an object user's credentials, and the namespace administrator must have given appropriate permissions to the object user.
    2. The ECS-specific management APIs (for which we do not provide user support) require namespace administrator credentials.
  3. Use the ECS Portal.  This is the simplest option that gives full control over bucket characteristics.  The ECS Portal can only be accessed by the namespace administrator.  (That would be you, for your personal namespace.)  The namespace administrator logs into the ECS Portal (https://ross.hpc.uams.edu) with the credentials tied to that namespace.  Details on how to use the ECS Portal to manage buckets can be found in the Buckets chapter of the ECS Administration Guide.
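
To illustrate the difference between the generic and ECS-specific routes, here is a hedged Python/boto3 sketch of creating a bucket through the plain S3 API.  The endpoint and credentials are placeholders, and a bucket created this way gets only default settings; ECS options such as encryption, replication, and file access require the ECS extension headers or the ECS Portal described above.

    # Sketch of creating a bucket with default settings via the plain S3 API.
    # Endpoint and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://object.example.edu",   # placeholder S3 endpoint
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    s3.create_bucket(Bucket="my-new-bucket")          # no ECS-specific options set here
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])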

...

In theory, a bucket can hold millions of files.  We have noticed, though, that with path-prefix searching, the more objects that share a particular prefix, the longer the search takes.  This is not dissimilar to POSIX filesystems, where the more files there are in a directory, the longer a directory lookup takes; but because an object store has a flat namespace (no directory hierarchy), such searches can be much slower than on a typical POSIX file system.  The total number of objects in a bucket also has a mild, though not dramatic, impact on the speed of looking up particular objects.  Again, keep this in mind when planning your use of ROSS.  One way to get around this speed penalty is to use an external database (e.g. sqlite) to track which buckets and objects your data resides in.
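
A hedged sketch of that external-database idea, using Python's built-in sqlite3: keep a small local index of which bucket and key each piece of data lives in, then look objects up directly instead of relying on slow prefix searches.  The table and column names here are purely illustrative.

    import sqlite3

    con = sqlite3.connect("ross_index.db")
    con.execute("CREATE TABLE IF NOT EXISTS objects (dataset TEXT, bucket TEXT, key TEXT)")

    # Record where the data went at upload time.
    con.execute(
        "INSERT INTO objects VALUES (?, ?, ?)",
        ("run-2021-06", "lab-results", "run-2021-06/sample_001.csv"),
    )
    con.commit()

    # Later, find the exact bucket/key without a prefix search on ROSS.
    for bucket, key in con.execute(
        "SELECT bucket, key FROM objects WHERE dataset = ?", ("run-2021-06",)
    ):
        print(bucket, key)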

Tools recommendations for accessing ROSS

...

Because ROSS is inherently accessible in parallel (remember the 23+ storage nodes), the best performance is gained when operations proceed in parallel.  Keep that in mind if building your own scripts (e.g. using ECS's variant of s3curl) or when comparing tools.  More parallelism (i.e. multiple transfers happening simultaneously across multiple storage nodes) generally yields better performance, up to a limit.  Eventually the parallel transfers become limited by other factors such as memory or network bandwidth constraints.  Generally there is a 'sweet spot' in the number of parallel transfer threads that can run before stepping on each other's toes.
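
As a rough illustration of the sweet-spot idea (not a tuned or official script), the following Python/boto3 sketch uploads files with a configurable number of threads; time it with different max_workers values to see where your own transfers level off.  The endpoint, credentials, bucket, and file paths are placeholders.

    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://object.example.edu",   # placeholder S3 endpoint
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    def upload(path: Path) -> str:
        s3.upload_file(str(path), "my-bucket", path.name)
        return path.name

    files = list(Path("data").glob("*.dat"))

    # Too few workers underuses the storage nodes; too many runs into memory
    # and network bandwidth limits.  Adjust max_workers and measure.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for name in pool.map(upload, files):
            print("uploaded", name)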

...

After evaluating several tools, the HPC admins settled on 2 tools, ecs-sync and rclone, as 'best of breed' for moving data between ROSS and Grace's cluster storage system, where the /home, /scratch, and /storage directories live.  The ecs-sync program is the more efficient and speedier of the two for bulk data moves.  It consumes fewer compute resources and less memory than rclone.  When properly tweaked for number of threads (i.e. when the sweet spot is found) it moves data significantly faster than rclone.  The rclone program has more features than ecs-sync, including ways to browse data in ROSS, to mount a ROSS bucket as if it were a POSIX file system, and to synchronize content using a familiar rsync-like command syntax.  While ecs-sync is great for fast, bulk moves, rclone works very well for nuanced access to ROSS and small transfers.  The rclone program also works quite nicely for moving data between a workstation or laptop and ROSS.

ecs-sync

The ecs-sync program is specifically designed for the parallel bulk moving of data from one storage technology to another.  It comes from the Dell/EMC support labs and is what EMC support engineers use for migrating data.  It certainly is possible to install ecs-sync outside of Grace, for example on a lab workstation, to rapidly move data to or from ROSS.  However, in this article we only discuss the use case of moving data between ROSS and Grace using the ecs-sync installed by the HPC Admins.

...