Digital Collections S3 Bucket Setup and Architecture

The Digital Collections app divides its content across a fairly large number of S3 buckets. Different content has different needs with regard to lifecycle management, storage class, and permissions, and we have used separate buckets to divide along those lines. Separate buckets can also be helpful for AWS cost reporting. The exact division of buckets we currently have may not be optimal; we were learning as we went. It might be nice to combine some buckets, but copying/moving lots of S3 keys can be cumbersome/expensive.

Digital Collections app S3 buckets are controlled by terraform infrastructure-as-code, from terraform configuration in this repository: https://github.com/sciencehistory/terraform_scihist_digicoll/

The terraform configuration currently controls the buckets and all of their configuration, including replication, but does not (as of this writing) control the IAM roles/policies related to bucket access. But see https://github.com/sciencehistory/terraform_scihist_digicoll/issues/6
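
For orientation, a bucket definition in that terraform repository might look very roughly like the sketch below (resource names here are illustrative assumptions, not copied from the actual config):

  # Hypothetical sketch of a terraform-managed bucket plus versioning
  # (AWS provider v4+ syntax).
  resource "aws_s3_bucket" "originals" {
    bucket = "scihist-digicoll-production-originals"
  }

  resource "aws_s3_bucket_versioning" "originals" {
    bucket = aws_s3_bucket.originals.id
    versioning_configuration {
      status = "Enabled"
    }
  }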

Do not make manual changes to S3 configuration controlled by terraform, except for a test/spike whose results will be immediately reflected back into the terraform config. If terraform gets out of sync with what's actually deployed, it winds up a mess.

Buckets

In any conflict with this documentation, the terraform configuration should be considered the ultimate authority on our S3 buckets. But here is an overview of all of our buckets:


scihist-digicoll-production-originals

Contains all original files (with some exceptions). Public.

=> scihist-digicoll-production-originals-backup

A second copy of originals in another AWS region, kept synchronized with AWS replication rules

scihist-digicoll-production-originals-video

We keep video original files in a separate bucket for tracking purposes. Public.

=> scihist-digicoll-production-originals-video-backups

Similar to originals-backup, but for the separate videos bucket.

scihist-digicoll-production-derivatives

Standard location for derivative files – contains most derivative files, but derivatives marked "restricted" are kept in a separate prefix in the originals bucket. (For most derivatives, we want them in a public bucket so we can use cacheable URLs for thumbnails and such.) Public.

=> scihist-digicoll-production-derivatives-backup

A backup of derivative files in another region for quicker recovery. Kept synchronized by AWS replication rules. 

scihist-digicoll-production-ondemand-derivatives

A "cache" location for our "large" derivatives that are created on-demand in background jobs (multi-image PDFs and zips). It has a lifecycle rule that deletes old objects, so it functions as a cache. Public. Could probably be put somewhere else.
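
As a hedged illustration of that lifecycle rule (the real rule lives in the terraform repo; the rule name and the 30-day window here are assumptions):

  # Hypothetical sketch: expire cached on-demand derivatives after some days.
  resource "aws_s3_bucket_lifecycle_configuration" "ondemand_derivatives" {
    bucket = "scihist-digicoll-production-ondemand-derivatives"

    rule {
      id     = "expire-old-cached-derivatives"   # assumed rule name
      status = "Enabled"
      filter {}   # applies to the whole bucket

      expiration {
        days = 30   # assumed value; the app re-generates expired PDFs/zips on request
      }
    }
  }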

scihist-digicoll-production-dzi

"DZI" (Deep Zoom) tiles used for pan-and-zoom. public. 
=> scihist-digicoll-production-dzi-backup

A backup of DZI files in another region for quicker recovery. Kept synchronized by AWS replication rules.

scihist-digicoll-production-ingest-mount

Bucket we use for mounting on Windows desktops; the app's "choose from cloud" function (powered by the browse-everything gem) then lets you choose files from it for ingest.

scihist-digicoll-production-uploads

Used as the shrine "cache" location – if someone uploads a file through the browser, it goes here first, as a temporary holding location until shrine "promotes" it to the appropriate originals bucket.
scihist-digicoll-production-public

A newer bucket we created with the intention of holding multiple things intended to be public. It doesn't have much in it at present.

A folder called maintenance_page contains a page used when the system is planned to be down.

A folder called IT_Department contains an SHI logo file that is used by the Teams desktop app.


The staging environment has all(?) of these same buckets, with the word production replaced with staging in the bucket names – except staging doesn't have any -backup buckets.


Backup Buckets – motivation/use

There are many different hypothetical uses for a second copy of data. Some of them may have different and even conflicting requirements. We haven't necessarily fully tested and spec'd out our backups for which uses they might be suitable – more work could be done here. Some original historical thoughts can be found at Backups and Recovery (Historical notes).

Some notes on our backups:

  • Our backup buckets are populated by AWS replication rules that should automatically keep them in sync.
  • Our backup buckets are intentionally in a separate AWS region, per AWS best practices (one region may go down). But this does increase bandwidth costs for making copies or for recovery.
  • The backup buckets have versioning turned on, but (I think?) only keep old versions for 30 days; see the sketch after this list.
  • The originals buckets' content is irreplaceable, so the backup copy may be especially important. (Note the separate video originals backup.) But derivatives and DZIs can both be re-created from originals if lost; a backup/redundant copy here serves to get us back up either faster or more cheaply than re-creating them.
  • There is an additional copy of some material (only originals?) in an on-site institute backup. 
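
A minimal sketch of the 30-day versioning point above, assuming the values noted (check the terraform repo for the real configuration):

  # Hypothetical sketch: keep noncurrent (old) object versions for ~30 days.
  resource "aws_s3_bucket_lifecycle_configuration" "originals_backup" {
    bucket = "scihist-digicoll-production-originals-backup"

    rule {
      id     = "expire-old-versions"   # assumed rule name
      status = "Enabled"
      filter {}   # applies to the whole bucket

      noncurrent_version_expiration {
        noncurrent_days = 30   # assumed, per the note above
      }
    }
  }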

Possible hypothetical uses for backup/redundancy copies might include:

  • User error that corrupts or deletes data, want to retrieve it from the second location (note our backups may only keep 30 days of history, so would have to be caught within that time frame)
  • data layer corruption at primary copy, want to restore. (Extremely unlikely on S3 architecture, although anything is possible)
  • temporary outage at present location, want to keep application up through outage of unknown length. (Definitely happens on S3 now and then)
  • permanent disappearance of the primary storage location (unlikely but could happen on S3, probably more likely than corruption)
  • meeting professional standards of some kinds for preservation


Possible mechanisms for recovery from S3 backup copies

We haven't really tested most of these. 

For individual file data loss or corruption

Either human error or storage layer. Individual files could be copied back over from backup locations. Note that backup locations may only keep old versions for 30 days. 

For a temporary outage

We could point a running app at the backup buckets using heroku config variables, and put it in READ-ONLY mode (by locking out staff logins), to keep the app up through a temporary outage of our S3 us-east-1 buckets.

This might incur additional expense because of cross-region bandwidth, if our live app is still running in us-east-1 but is pointed at us-west buckets. Not sure how significant. 

After the temporary outage is over, the app would be pointed back at the normal buckets.

For a permanent outage or loss

Or this could be human error – accidental deletion of entire buckets, or something like that.

Hypothetically we could create new buckets and copy all the data over, but the cost of cross-region data transfer could be significant. Unless we permanently move our AWS infrastructure to the us-west region!

Replication setup

Some historical notes on how we set up replication. It's currently mostly controlled by terraform, although IAM roles may not be. 

Replication needs to be set up manually. Every backup bucket must be set up as the target of replication from the bucket it backs up. Right now that is the originals and derivatives buckets, but this may change. (A terraform sketch of the equivalent configuration appears after the console steps below.)

Replication happens across regions; this means you cannot use replication within the same region.

Replication is also done across regions to reduce problems when an AWS region goes "down".

  1. Go to the source bucket (originals, derivatives, or dzi) in the S3 console on AWS
  2. Select the management tab
  3. Press Replication
  4. Add rule (versioning must be enabled)
  5. Select entire bucket for replication
  6. Set the destination as the backup version of the bucket
  7. Select the IAM role kithe-TIER-backup
  8. For the rule name, call it backup
  9. Keep status as Enabled.
  10. Then save the rule
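
For reference, the terraform equivalent of the console steps above might look roughly like this sketch (resource names and the role ARN are placeholders, not copied from the actual config):

  # Hypothetical sketch of a cross-region replication rule (AWS provider v4+).
  resource "aws_s3_bucket_replication_configuration" "originals" {
    # Versioning must already be enabled on the source bucket.
    bucket = "scihist-digicoll-production-originals"

    # The kithe-TIER-backup IAM role mentioned above ("TIER" being production
    # or staging); the ARN here is a placeholder.
    role = "arn:aws:iam::ACCOUNT_ID:role/kithe-TIER-backup"

    rule {
      id     = "backup"
      status = "Enabled"

      filter {}   # empty filter = replicate the entire bucket

      delete_marker_replication {
        status = "Disabled"   # assumed; needed whenever filter is used
      }

      destination {
        # The backup bucket lives in a different AWS region (us-west, per the notes above).
        bucket = "arn:aws:s3:::scihist-digicoll-production-originals-backup"
      }
    }
  }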

Additional "local" copies

In addition to the extra copy on S3, we make another set of copies in on-site Science History Institute storage, which is also backed up to tape on-site.

  • SyncBackPro runs 3 nightly mirror jobs from AWS S3 to the local server Promethium at 6pm
    • scihist-digicoll-production-originals -> D:\Backup Folders\AWS S3 - Digital Collections - Images
    • chf-hydra-backup\PGSql -> D:\Backup Folders\AWS S3 - Digital Collections - SQL
    • chf-hydra-backup\Aspace -> D:\Backup Folders\AWS S3 - ArchivesSpace - SQL
  • Promethium data is backed up to LTO tape daily.
  • Weekly and monthly tapes are in an off-site rotation.  Annual tapes are kept off-site for 5 years.