
Overview

ROSS (short for Research Object Storage System) is a multi-protocol, web-based storage system.  The primary 4.2 PB storage unit for ROSS resides on the UAMS campus.  In the coming months the UAMS unit will be connected with a 2 PB sister unit sited at the University of Arkansas in Fayetteville, to allow for a limited amount of off-site, replicated storage.  ROSS currently has 23 storage nodes, with 15 at UAMS and 8 at UA Fayetteville.   Future expansions could increase the number of storage nodes and capacity.

As the name implies, ROSS is an object store.  An object store does not have a directory tree structure like conventional POSIX file systems do.  (Grace's storage systems are POSIX file systems.)  Instead, objects have names and live in buckets, essentially a one-level directory hierarchy.  People often name objects using what looks like a POSIX-style file path.  For example, "directory1/directory2/directory3/directory4/filename" might be an object name.  The RESTful APIs used to access ROSS can do a prefix search, listing just the objects whose names start with a particular prefix.  For example, "directory1/directory2/" would pull back all the objects that start with that prefix.  So, with object naming conventions that use the equivalent of POSIX path names, one can mimic a POSIX file system.  Unfortunately, prefix search is quite inefficient, making the simulated directory lookup slower than a native POSIX file system.  Nevertheless, some tools exist that can mimic a POSIX-style file system using an object store behind the scenes.  With a proper naming convention, it is fairly easy to back up and restore POSIX directory trees to and from an object store efficiently.  This is what makes ROSS an excellent backing storage system for Grace.  (A backing storage system is the larger but slower storage system that backs a smaller but faster storage cache in a hierarchical storage model.  Grace's storage system should be considered cache.)
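For example, assuming an S3 tool such as rclone (described later in this article) has been configured with a remote pointing at ROSS - the remote name "ross" and the bucket name below are purely illustrative - a prefix listing might look like this:

Example prefix listing (illustrative names)
rclone ls ross:mybucket/directory1/directory2/

Only the objects whose names start with "directory1/directory2/" are returned, which is what makes the bucket appear to contain subdirectories.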

Remember, the HPC Admins do not back up Grace's shared storage system, since it is only intended to be a staging or scratch area (aka a cache) used to run HPC jobs.  In other words, the primary copy of data should be elsewhere, not on Grace.  ROSS is one option for keeping the primary copy of data.  In fact, the main reason UAMS purchased ROSS is as the primary location for storing research data.

Data in ROSS starts out triple replicated on 3 different storage nodes, but is then converted to erasure coding, which provides resilience without the capacity cost of full replication.  ROSS currently uses 12/16 erasure coding, meaning that for every 12 blocks of data stored it writes 16 blocks.  Up to four of those blocks can be lost and the data can still be reconstructed; only the loss of a fifth block would make the data unrecoverable.  In contrast, most RAID systems only have a 1 or 2 drive redundancy.  ROSS scatters the 16 blocks across the storage nodes in a data center, improving the performance of data retrievals and further improving the resilience of the system (less chance that a node failure blocks access).  Data can be accessed from any of the storage nodes.  Currently the two systems (UAMS and UARK) are isolated from each other, so replication between them is not yet possible, but plans to join the two are in progress.  Eventually all of the content in ROSS, regardless of which campus it physically lives on, will be accessible from either campus.  Of course, if data living on one campus is accessed from the other campus, the access will be somewhat slower due to networking delays and bandwidth limitations between the campuses.
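To put the overhead in concrete terms: storing 12 TB of user data under 12/16 erasure coding consumes 16 TB of raw capacity (an overhead factor of 16/12, or about 1.33x), whereas keeping three full replicas of the same 12 TB would consume 36 TB (3x).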

As mentioned, ROSS will soon have an option for replicating data to Fayetteville.  However, even in this case the copy in Fayetteville should not be considered a true backup copy.  Replication is good for maintaining data that needs high availability and equivalent performance regardless of which campus it is accessed from.  Replication also doubles the storage cost, since every replicated terabyte consumes capacity on both campuses.  We still recommend that researchers keep backup or archive copies of data somewhere else, even if replication is turned on.

For data that meets OURRstore requirements (relatively static, STEM related, neither clinical nor regulated), archiving to OURRstore is a very cost-effective way to get resilient, long-term, offsite copies for little extra money (just media and shipping costs).  Other options for backup include the UAMS campus Research NAS or cloud storage such as Azure, Google, Amazon, or IBM.  USB or bare drives are also an option for backup, but are not recommended, as they are quite error prone if not stored and managed properly.

If you are unfamiliar with the object storage terms that ROSS documents use, we recommend reading the Access to the Research Object Store System (ROSS) article on this wiki for an overview.


Costs for using ROSS

The College of Medicine (COM) Associate Dean of Research, in consultation with an advisory group at UAMS, recommended charging an up-front payment of $70 per TB for the use of ROSS, good for the 5-year life of the storage.  This is less than half the current replacement-plus-maintenance cost of the underlying hardware ($159 per TB).  If one wants replicated storage (i.e. one copy in Little Rock, one copy in Fayetteville), the cost would be double, or $140 per TB for the pair of copies.

In comparison, Amazon would charge about $276 per TB for S3 Standard (the closest comparable AWS storage class) over the same 5-year period, plus data access and networking charges.  The access and networking fees can be higher than the storage fees, depending on access patterns.

The oversight of ROSS is moving from the UAMS COM to the Arkansas Research Computing Cooperative (ARCC), formed by a memorandum of understanding between UAMS and UA Fayetteville.  ARCC may revisit the charge for using ROSS.  One option being considered is a free tier, with yearly charges for any use over the free tier by individuals or groups (labs, departments, research projects, etc.).  Also being considered is whether payment would be 'up front' for the expected life of the storage, or would be simply billed periodically (e.g. annually) as the storage is used.

Until pricing changes, users should be prepared for the up-front $70 per TB per copy pricing set by the COM.  Although we are not collecting fees at the moment, retroactive fee collection could begin after the ARCC steering committee settles on the fee schedule.  Any storage in use at the time that ARCC sets the fee schedule would be charged the lower of the $70 fee imposed by the COM or the new fees set by ARCC, for 5 years of storage.  Additional storage would be charged at the fee schedule set by ARCC.

Restrictions on data that can be stored on ROSS

There are few restrictions on the types of data that can be stored on ROSS.  Almost anything is allowed, as long as the data storage complies with governmental, IRB, and UAMS rules and regulations.

Keep in mind that any campuses that participate in ARCC, and by extension the Arkansas Research Platform (ARP), have access to ROSS.  Unlike the UAMS Research NAS, which is locked down behind UAMS firewalls and hence only accessible inside the UAMS campus, ROSS is located in the ARCC Science DMZ, accessible by a number of campuses, both within Arkansas and potentially beyond.  As such, it is inappropriate to store fully identified patient (PHI) or personal (PII) data in ROSS, as doing so could violate UAMS HIPAA or FERPA policies.  ROSS does have the ability to restrict access to buckets and to do server-side, data-at-rest encryption, but these capabilities have not been evaluated as to whether they are sufficient for HIPAA or FERPA compliance.  For now, ROSS should not be used for data that is regulated by HIPAA, FERPA, or any other governmental regulation.  De-identified human subject data is allowed.

In addition, by using ROSS, you agree that you are complying with any rules or restrictions placed by your IRB protocol on data you store in ROSS.

Do not use ROSS for regulated data (e.g. data that must comply with HIPAA, FERPA, or other rules).  If your data is sensitive, please turn on server-side encryption.  You are responsible for complying with any applicable IRB protocol.

Server-side encryption can be turned on either at the namespace level, where all buckets in the namespace are required to be encrypted, or at the bucket level for namespaces that do not have 'encryption required' turned on.  If you want namespace-level encryption, please inform the HPC Admins when requesting a namespace.  The encryption choice must be made at namespace or bucket creation time and cannot be changed afterwards.  (Should a change be needed after creation, one can copy data from an unencrypted bucket to an encrypted one, then destroy the unencrypted bucket, or vice versa.)
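As a rough sketch of that copy-then-destroy workaround using rclone (covered later in this article), with purely illustrative remote and bucket names:

Moving data to an encrypted bucket (illustrative names)
rclone copy ross:plain-bucket ross:encrypted-bucket --progress
rclone check ross:plain-bucket ross:encrypted-bucket   # verify the copy before deleting anything
rclone purge ross:plain-bucket                         # removes the unencrypted bucket and its contents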

Requesting access to ROSS

Before requesting access to ROSS, in addition to reading this article, please also read the Access to the Research Object Store System (ROSS) article on this wiki, since it describes key concepts needed to understand how storage is laid out and accessed. 

To initiate access to ROSS for a new personal or group (e.g. project, lab, department) namespace, please send a request via e-mail to HPCAdmin@uams.edu.  In your request, please indicate approximately how much storage you intend to use in ROSS for the requested namespace, divided into local and replicated amounts.  The HPC Admins will use this information to set the initial quotas and for capacity planning.  You will be allowed to request quota increases if needed and space is available.  We would also appreciate a brief description of what you will be using the storage for; knowing what the storage is used for assists us in drumming up support (and possibly dollars) for expanding the system.

If you wish to access an existing group namespace as an object user, please contact the namespace administrator for that namespace and ask to be added as an object user for that namespace.

Using ROSS

In experiments we have noticed that write times to ROSS are considerably slower than read times, and slower than many POSIX file systems; in other words, it takes longer to store new data into ROSS than to pull existing data out of it.  We also notice (as is typical of most file systems) that transfers of large objects go significantly faster than transfers of many tiny objects.  Please keep these facts in mind when planning your use of ROSS - ROSS favors reads over writes and big things over little things.

In theory, a bucket can hold millions of objects.  We have noticed, though, that the more objects share a path prefix (typically related to the number of objects in a bucket), the longer a prefix search takes.  This is not dissimilar to POSIX file systems, where the more files there are in a directory, the longer a directory lookup takes; but on an object store, which has a flat namespace (no directory hierarchy), such searches can be much slower.
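One common pattern for working within these characteristics (a suggestion, not a requirement) is to bundle many small files into a single archive before uploading, so ROSS stores one large object instead of thousands of tiny ones; the paths and names below are placeholders:

Bundling small files before upload (illustrative names)
tar -czf mydata.tar.gz mydata/
rclone copy mydata.tar.gz ross:mybucket/archives/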

What APIs are supported by ROSS

ROSS supports multiple object protocol APIs.  The most common is S3, followed by Swift (from the OpenStack family).  Although ROSS can support the NFS protocol, our experiments show that the native NFS support is slower for many use cases than other options that simply emulate a POSIX file system using object protocols.  ROSS can also support HDFS (used in Hadoop clusters) and a couple of Dell/EMC proprietary protocols (ATMOS and CAS); however, we have not tested any of these other protocols, so the HPC Admins can only offer limited support for them.  The primary protocols that the HPC Admins support are S3 and Swift.  We have more experience with S3, hence most of this article assumes S3.  Later we describe our two favorite tools for accessing ROSS, but you are not required to use our suggested tools.

How do I get access credentials?

In order to use either S3 or Swift, you need credentials.  The credentials are tied to a particular namespace.
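As an illustration of where those credentials go once you have them, an rclone remote for an S3-compatible store such as ROSS is typically defined in ~/.config/rclone/rclone.conf along the following lines; the remote name, endpoint, and key values are placeholders, not actual ROSS values:

Example rclone remote definition (placeholder values)
[ross]
type = s3
provider = Other
access_key_id = <your-S3-access-key>
secret_access_key = <your-S3-secret-key>
endpoint = <ROSS-S3-endpoint-URL>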

How do I create a bucket or container?

For your personal namespace, you can create buckets (the S3 term) or containers (the equivalent Swift term) from the administration API.

If it is a group namespace (e.g. lab, project, department), the namespace administrator can create buckets with certain characteristics for you.

In either case, object users can also use APIs and tools to create buckets or containers.  However, many of the ECS-specific features (e.g. replication, encryption, file access) are not available when using the bucket-creation options of tools and the APIs.  If you need any of those options, it is better to create the bucket or container with ROSS's web interface.  Using the APIs is beyond the scope of this document.
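For example, a basic bucket (without any of the ECS-specific options mentioned above) can be created with an S3 tool such as rclone; the remote and bucket names here are placeholders:

Creating a bucket with rclone (illustrative names)
rclone mkdir ross:my-new-bucket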

What tools do you recommend for accessing ROSS

The bottom line - any tool that uses any of the supported protocols theoretically could work with ROSS.  There are a number of free and commercial tools that work nicely with object stores like ROSS.  Like most things in life, there are pros and cons to different tools.  Here is some guidance as to what we look for in tools.

Because ROSS is inherently accessible in parallel (remember the 23+ storage nodes), the best performance is gained when operations proceed in parallel.  Keep that in mind if building your own scripts (e.g. using ECS's variant of s3curl) or when comparing tools.  More parallelism (i.e. multiple transfers happening simultaneously across multiple storage nodes) generally yields better performance, up to a limit.  Eventually the parallel transfers become limited by other factors such as memory or network bandwidth constraints.  Generally there is a 'sweet spot' in the number of parallel transfer threads that can run before they start stepping on each other's toes.
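As one example of tuning parallelism, rclone exposes the number of simultaneous transfers and checkers as command-line flags; the values and paths below are only starting points to experiment with, not recommended settings:

Adjusting rclone parallelism (illustrative values)
rclone copy /scratch/mydata ross:mybucket/mydata --transfers 8 --checkers 16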

The S3 protocol includes support for multipart upload of objects, where a large object upload can be broken into multiple pieces and transferred in parallel.  Tools that support multipart upload likely will have better performance than tools that don't.  In addition, tools that support moving entire directory trees and that work in parallel will have better performance than tools that move files one object at a time.
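For multipart uploads specifically, rclone's S3 backend lets you adjust the part size and the per-object upload concurrency; again, the values shown are illustrative, and the sweet spot depends on your data and the network:

Tuning multipart uploads with rclone (illustrative values)
rclone copy bigfile.dat ross:mybucket/ --s3-chunk-size 64M --s3-upload-concurrency 4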

After evaluating several tools, the HPC Admins settled on two tools, ecs-sync and rclone, as 'best of breed' for moving data between ROSS and Grace's cluster storage system, where the /home, /scratch, and /storage directories live.  The ecs-sync program is the more efficient and the faster of the two for bulk data moves.  It consumes fewer compute resources and less memory than rclone, and when properly tweaked for the number of threads (i.e. when the sweet spot is found) it moves data significantly faster than rclone.  The rclone program has more features than ecs-sync, including ways to browse data in ROSS, to mount a ROSS bucket as if it were a POSIX file system, and to synchronize content using a familiar rsync-like command syntax.  While ecs-sync is great for fast, bulk moves, rclone works very well for nuanced access to ROSS and small transfers.
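A few sketches of the rclone capabilities mentioned above, using placeholder remote, bucket, and path names (mounting requires FUSE support on the node where the command is run):

Common rclone operations (illustrative names)
rclone lsd ross:                                     # list buckets visible to your credentials
rclone ls ross:mybucket                              # list objects in a bucket
rclone sync /scratch/mydata ross:mybucket/mydata     # rsync-like synchronization
rclone mount ross:mybucket /home/<user>/ross-mount --daemon   # present a bucket as a POSIX-style file system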

ecs-sync

The ecs-sync program is specifically designed for moving data from one storage technology to another.  It comes from the Dell/EMC support labs, and is what EMC support engineers use for migrating data.  Due to security issues, we do not offer the job-oriented service mode described in the ecs-sync documentation.  (If the ecs-sync maintainers ever fix the security holes, we could reconsider.)  Instead we only support running ecs-sync in what its documentation calls "Alternate (legacy) CLI execution", where a user runs ecs-sync as a command instead of queuing up a job.  A command alias exists on Grace's login and compute nodes for running ecs-sync.  The alias actually runs the command on a data transfer node, so it does not bog down the node on which the command is issued.  In other words, feel free to use ecs-sync from the login node, from the command prompt in Grace's Open OnDemand portal, or even from a compute job if needed, though it is somewhat wasteful of resources to run ecs-sync from a compute job.

The syntax for calling ecs-sync interactively on Grace is:

Interactive ecs-sync command
ecs-sync --xml-config <config-file>.xml

The <config-file>.xml file contains the instructions for the transfer: what is to be moved, where it is to be moved, how many threads to use, and which storage nodes to use.  Of course, replace <config-file> with a name of your choosing.  We will describe the XML config file contents later.

When running interactively, you see all the messages coming back from ecs-sync as it does its job, giving instant feedback.  But if the shell dies, the command may stop.  To avoid this, you can use nohup to run ecs-sync as a background job that continues after you log out, redirecting standard output and standard error into a file, for example using the following syntax:

Running ecs-sync in the background
nohup ecs-sync --xml-config <config-file>.xml > <log-file>.log &

Again, replace <config-file> and <log-file> with any name you wish.
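To keep an eye on a backgrounded ecs-sync run, you can watch the log file as it grows and confirm the job is still running from the shell that launched it, for example:

Checking on a background ecs-sync run
tail -f <log-file>.log   # follow the progress messages as they are written
jobs                     # confirm the background job is still running in this shell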


