Amazon S3

  • Key / value store for objects
    • 💡 Great for big objects, not so great for small objects (latency)
    • It's object storage
      • Object storage => Every change rewrites the entire object.
      • Block storage => You can send only the changed blocks; data is split into chunks, so it can be updated incrementally.
  • S3 is a global service (not region specific)
    • ❗ A bucket however is attached to a region
  • Supports IPv4 and IPv6.
  • 💡 Use cases: Static files, Key value store for big files e.g. movies, Website hosting
  • Objects (files) are stored in buckets (directories)
  • Buckets
    • ❗ Bucket names must be unique across all existing bucket names in Amazon S3.
    • Buckets are defined at the region level and must have a globally unique name, as the name appears in the following URLs:
      • https://s3-<region>.amazonaws.com/<bucket>/<object>
      • https://<bucket>.s3.amazonaws.com/<object> (virtual host style addressing)
    • You can set default encryption
      • Options are: None, AES-256 (SSE-S3) and AWS-KMS (SSE-KMS)
      • 📝Popular question: How to ensure that all objects are encrypted at upload?
        • Enable default encryption, or use a bucket policy that denies PUTs without an encryption header (see the encryption sketch after this list).
    • ❗ 100 S3 buckets per account -> soft limit
    • Folders
      • Can be created under buckets
      • ❗ They are pseudo-folders and are treated as objects by S3
        • S3 has a flat structure with no hierarchy
        • But presentation in console gives the sense of "Directories"
    • Version Control
      • Can enable versioning at bucket level (see the versioning sketch after this list)
        • ❗📝 Once enabled, it cannot be disabled, only suspended.
      • Overwriting the same key creates a new version
      • Best practice, as it allows you to
        • Roll back to a previous version
        • Protect against accidental i.e. unintended deletes
          • Deleting a file does not really delete versioned files
          • It only adds a delete marker; previous versions can still be downloaded.
          • Deleting a delete marker -> Restores the object & adds a new delete marker.
            • ❗ Only the owner of a bucket can permanently remove a delete marker e.g. delete a version.
          • 💡 Best practice: Allow only specific users to delete & enforce MFA
      • 📝Any file that is not versioned prior to enabling versioning will have the version null.
  • Objects
    • Have a key which is FULL path
      • e.g. <my_bucket>/my_file.txt or <my_bucket>/my_folder1_another_folder/my_file.txt
      • ❗ There is no concept of directories within buckets, just key names with slashes.
    • Object values are the contents of the file
      • ❗ Max size is 5 TB, min size is 0 bytes (an empty, "touched" file)
      • ❗ If file is more than 5 GB, you need to use multi part upload
      • Metadata is a list of text key/value pairs.
        • Can be system or user metadata.
        • ❗ Cannot be modified after creation
    • Tags are unicode key/value pairs.
      • ❗ Up to 10
      • 💡 Useful for security / lifecycle / replication rules
  • Logging
    • Object-level logging
    • Can enable CloudWatch request metrics: e.g. total number of HTTP GET/PUT, 4XX etc.
    • Can enable CloudTrail for auditing API calls (auditing is done by CloudTrail, not CloudWatch).
  • 📝S3 Consistency Model
    • Read after write consistency for PUTS of new objects
      • E.g. PUT 200 -> GET 200
      • ❗ Eventually consistent if you GET a key to check whether the object exists before creating it
        • E.g. GET 404 -> PUT 200 -> GET 404, because the initial 404 response gets cached
    • Eventual consistency for DELETEs and PUTS of existing objects
      • Read after updating object -> Might get an older version
        • E.g. PUT 200 -> PUT 200 -> GET 200 (might be older version)
      • Delete -> Can retrieve it for a short time
        • E.g. DELETE 200 -> GET 200
  • Make an object publicly accessible
    • On object level
      • Set a predefined grant (also known as a canned ACL) for a bucket in Amazon S3.
        • In Console -> Bucket -> Permissions -> Public access settings -> Untick "Block new public bucket policies" and "Block public and cross-account access if the bucket has public policies"
      • Update the object's ACL -> Ensure your object is set to be publicly accessible.
    • On bucket level
      • Use a bucket policy that grants public read access to a specific object tag/prefix
  • Lifecycle Policies
    • Set of rules to automate moving data between storage tiers, to save storage costs (see the lifecycle sketch after this list).
    • Transition actions
      • When objects are transitioned to another storage class
      • E.g. General Purpose --(60 days)--> Infrequent Access --(180 days)--> glacier
      • ❗ Objects must be stored at least 30 days in Standard before transitioning to Standard IA or One Zone-IA
    • Expiration actions
      • Expire & delete an object after a certain time period.
      • E.g. access log files can be set to delete after a specified period of time
    • You can combine transition & expiration actions.
    • You can filter the resources by a key name prefix, e.g. tax/ will only include tax/doc1.txt.
    • You can configure it to
      • Clean-up expired object delete markers
      • Affect the "Current version" and/or "Previous versions"
  • Multipart Upload
    • Recommended for files larger than 100 MB; required for files larger than 5 GB (see the multipart sketch after this list)
    • Allows uploading parts concurrently
      • Improved throughput
      • Quick recovery from any network failures
      • Pause and resume object uploads
      • Begin an upload before you know the final object size.
    • When a multipart upload is aborted, all storage consumed by its uploaded parts is freed & deleted.
      • S3 supports a bucket lifecycle rule to abort multipart uploads that don't complete within a specified number of days.
  • Event Notifications
    • Can notify SNS, SQS or Lambda when events occur, e.g. s3:ObjectCreated:*, s3:ObjectRemoved:*.
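
A minimal boto3 (Python) sketch of the two "encrypt at upload" answers above, assuming configured AWS credentials; `my-bucket` is a placeholder:

```python
import json

import boto3

s3 = boto3.client("s3")

# Option 1: default encryption - S3 encrypts new objects with SSE-S3 (AES-256).
s3.put_bucket_encryption(
    Bucket="my-bucket",  # placeholder bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Option 2: deny any PutObject call that doesn't declare server-side encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
    }],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```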
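
A small sketch of the behavior described under Version Control; bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning; afterwards it can only be suspended, never disabled.
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Overwriting the same key now creates a new version instead of replacing it.
s3.put_object(Bucket="my-bucket", Key="report.txt", Body=b"v1")
s3.put_object(Bucket="my-bucket", Key="report.txt", Body=b"v2")

# A plain delete only adds a delete marker; older versions stay retrievable.
s3.delete_object(Bucket="my-bucket", Key="report.txt")
response = s3.list_object_versions(Bucket="my-bucket", Prefix="report.txt")
for version in response.get("Versions", []):
    print(version["VersionId"], version["IsLatest"])
print(len(response.get("DeleteMarkers", [])), "delete marker(s)")
```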
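
A lifecycle-policy sketch matching the transition example above (Standard -> IA after 60 days -> Glacier after 180 days); the prefix and the expiration period are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under tax/ through cheaper tiers, then expire them.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-tax-docs",
            "Filter": {"Prefix": "tax/"},   # only keys like tax/doc1.txt
            "Status": "Enabled",
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},    # illustrative expiration action
        }],
    },
)
```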
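
A multipart-upload sketch using boto3's transfer manager, which splits large files into concurrently uploaded parts; the thresholds below are illustrative:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above multipart_threshold are split into parts and uploaded in
# parallel; a failed part is retried without restarting the whole upload.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above 100 MB
    multipart_chunksize=16 * 1024 * 1024,   # 16 MB parts
    max_concurrency=8,                      # parts uploaded concurrently
)
s3.upload_file("backup.tar", "my-bucket", "backups/backup.tar", Config=config)
```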

S3 Pricing

  • Charged mainly based on
    • Storage (per GB)
    • Requests
      • Fee for each request made to S3 APIs for managing objects.
        • 💡 Usually very cheap, but can be a good idea to minimize the requests to optimize costs.
      • See also Glacier for Glacier retrieval fees.
    • Data transfer
    • Optional fees for cross-region replication, transfer acceleration, management, and analytics
  • Requester pays
    • Requester pays for requests & data transfer out of the bucket instead of the bucket owner (the owner still pays for storage).
    • Requester must be an authenticated AWS identity (not anonymous) & must send the requester-pays flag; AWS bills them directly (see the sketch after this list).
  • See S3 pricing page and AWS Pricing Calculator
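
A requester-pays sketch with placeholder names; the bucket owner sets the flag once, and each requester must explicitly accept the charges:

```python
import boto3

s3 = boto3.client("s3")

# Bucket owner: shift request & data-transfer costs to requesters.
s3.put_bucket_request_payment(
    Bucket="my-bucket",
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# Requester: must be authenticated and send the requester-pays flag,
# otherwise the request is rejected.
obj = s3.get_object(
    Bucket="my-bucket",
    Key="big-file.bin",
    RequestPayer="requester",
)
```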

S3 Storage Classes

  • Storage class is set per object (a minimal sketch follows the tables below)

  • S3 Storage Tiers

    | Storage | Description | Use-cases | Retrieval latency | Cost |
    | ------- | ----------- | --------- | ----------------- | ---- |
    | Standard | General purpose | Big data analytics, mobile & gaming applications, content distribution... | ms | 💲💲💲💲💲💲 |
    | Standard IA | Infrequent access | Data that's less frequently accessed but requires rapid access, e.g. a file you retrieve once a month | ms | 💲💲💲💲💲 |
    | One Zone-Infrequent Access | Low latency & high throughput performance | Storing secondary backup copies of on-premises data, or data you can recreate over time | ms | 💲💲💲💲 |
    | Reduced Redundancy Storage (RRS) | Deprecated, don't use it; use One Zone-Infrequent Access instead | Noncritical, reproducible data at lower levels of redundancy than Standard | ms | 💲💲💲 |
    | Intelligent Tiering | ML moves objects between IA and Standard tiers based on changing access patterns | Unpredictable workloads, changing access patterns, lack of experience with storage optimization | ms | (small) monitoring + auto-tiering fee |
    | Glacier | Low-cost cold storage | DR back-ups, media asset archival | Expedited: 1-5 min, Standard: 3-5 hr, Bulk: 5-12 hr | 💲💲 + retrieval fee |
    | Glacier Deep Archive | Lowest-cost, coldest storage | Magnetic tapes / VTL / regulatory archiving | 12-48 hours | 💲 + retrieval fee |
  • S3 Storage Tiers Reliability

    | | Standard | Reduced Redundancy Storage | Standard Infrequent Access | One Zone-Infrequent Access | S3 Intelligent Tiering | Glacier |
    | --- | --- | --- | --- | --- | --- | --- |
    | Durability | 99.999999999% | 99.99% | 99.999999999% | 99.999999999% | 99.999999999% | 99.999999999% |
    | Availability | 99.99% | 99.99% | 99.99% | 99.5% | 99.9% | No SLA |
    | AZ | ≥3 | ≥2 | ≥3 | 1 | ≥3 | ≥3 |
    | Concurrent facility fault tolerance | 2 | 1 | 2 | 0 | 1 | 1 |
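
Since the storage class is a per-object attribute, it can be set at write time or changed later by copying the object onto itself; a minimal sketch with placeholder names:

```python
import boto3

s3 = boto3.client("s3")

# Choose the storage class when writing the object...
s3.put_object(
    Bucket="my-bucket",
    Key="logs/2019-01.gz",
    Body=b"...",
    StorageClass="STANDARD_IA",
)

# ...or change it later with an in-place copy.
s3.copy_object(
    Bucket="my-bucket",
    Key="logs/2019-01.gz",
    CopySource={"Bucket": "my-bucket", "Key": "logs/2019-01.gz"},
    StorageClass="GLACIER",
)
```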

Glacier

  • Glacier storage classes move objects to Amazon Glacier, which is integrated with S3.
  • Concepts
    • Archive: stores data (one or more files) with a unique archive ID.
      • 📝The content of the archive is immutable, meaning that after an archive is created it cannot be updated.
    • Vault: collection of archives where you upload data to.
  • Retrieval
    • ❗ S3 objects stored through Glacier can only be retrieved with S3 APIs, as S3 maps the object path to an archive ID (see the restore sketch after this list).

    • Options

      | Attribute | Expedited | Standard | Bulk |
      | --------- | --------- | -------- | ---- |
      | Data | Subset of archives | Any archive | Large amounts |
      | Time | 1-5 minutes | 3-5 hours | 5-12 hours |
      | Pricing | 💲💲💲💲 | 💲💲 | 💲 |
      • Expedited: Access subset of archives in 1-5 minutes.
        • On-demand: Accepted except for rare high load situations.
        • Provisioned: Buy capacity to ensure that your retrieval capacity is available.
      • Standard: Access any archive in 3-5 hours.
      • Bulk: Access large amounts in 5-12 hours
    • ❗ Deep Archive has a single retrieval option, taking around 12-48 hours.

  • ❗ You cannot upload to Glacier directly from the console; it's possible only through the AWS CLI.
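
A restore sketch for a Glacier-class object through the S3 API; the names and the 7-day window are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to restore a temporary copy of a Glacier-class object for 7 days.
s3.restore_object(
    Bucket="my-bucket",
    Key="archive/old-report.pdf",
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},  # or "Expedited" / "Bulk"
    },
)

# The "Restore" field of head_object shows whether the restore is in
# progress or completed.
head = s3.head_object(Bucket="my-bucket", Key="archive/old-report.pdf")
print(head.get("Restore"))
```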

Availability

  • Cross region replication (CRR)
    • Copying is asynchronous
    • On bucket level (entire bucket) or object level (based on prefix/tags)
    • ❗📝 Must enable versioning (source and destination)
      • CRR is built on top of versioning functionality.
      • 🤗 As the replication is asynchronous, it needs a 'non-changing' copy of the data during replication
        • Versioning makes this easy: files can still be modified (upload, delete, etc.) while replication is in progress.
    • Buckets must be in different AWS regions & can be in different accounts
    • Requires IAM permissions to S3
    • Delete markers & deletions of individual versions / delete markers are not replicated.
    • Use cases: compliance, lower latency access, replication across accounts
    • Flow (see the replication sketch after this list):
      1. Create a bucket, and another new one as the replica
      2. Original bucket -> Management -> Replication -> Add rule
        • Apply to entire bucket or "prefix or tags"
      3. Choose if objects encrypted with AWS KMS will also be encrypted
        • The other account must also have access to AWS KMS
      4. For replicated objects you can change
        • Storage class
        • Ownership to destination bucket owner
      5. Set IAM rule
        • Role gets created automatically that gives replication permissions to the original bucket
      • ❗ Does not replicate:
        • Existing files (only new files are replicated, with the same metadata and ACLs)
        • Replicas of other objects, lifecycle actions and configurations, objects without bucket owner permissions, SSE-C & SSE-KMS encrypted objects (can be configured to replicate)...
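
A replication sketch mirroring the flow above; both buckets must already have versioning enabled, and the role ARN / bucket names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Versioning must be enabled on both buckets before this call succeeds.
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        # Role granting S3 permission to read the source & write the replica.
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-all",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix = entire bucket
            "Destination": {
                "Bucket": "arn:aws:s3:::destination-bucket",
                "StorageClass": "STANDARD_IA",  # optional: change class on replicas
            },
        }],
    },
)
```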

S3 Websites

  • S3 can host static websites and make them accessible on the web (see the setup sketch after this list).
    • 📝 The website URL will be => bucket name + s3-website + region + amazonaws.com.
      • <bucket-name>.s3-website-<AWS-region>.amazonaws.com
      • <bucket-name>.s3-website.<AWS-region>.amazonaws.com
  • If you get 403 (Forbidden) error; make sure bucket policy allows public reads!
  • Supports redirects through a metadata tag.
  • In bucket -> Properties -> Static website hosting
    • Select Use this bucket to host a website (optionally specify default index and error files)
    • Another option is to redirect requests (to another place)
    • S3 CORS (CORS=Cross Origin Resource Sharing)
      • 📝 If you request data from another S3 bucket, you need to enable CORS
      • Allows you to limit which websites can request your files in S3 (and this way limit costs).
      • E.g. a website in bucket1 loads an image from bucket2; bucket2 must then return CORS headers allowing bucket1.
  • Prerequisites
    • ❗ The bucket must have the same name as your domain or subdomain.
    • (Optionally) Use CloudFront for SSL/TLS
    • A registered domain name.
    • Route 53 as DNS service for the domain to configure DNS.
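
A static-website sketch covering hosting, the public-read policy (avoids the 403), and CORS; `example.com` and the assets bucket are hypothetical:

```python
import json

import boto3

s3 = boto3.client("s3")

# Enable static website hosting with default index / error documents.
s3.put_bucket_website(
    Bucket="example.com",  # must match the domain name
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)

# Public-read policy so the website doesn't answer with 403 (Forbidden).
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example.com/*",
    }],
}
s3.put_bucket_policy(Bucket="example.com", Policy=json.dumps(policy))

# CORS on a second (hypothetical) assets bucket so only the website's
# origin may fetch files from it.
s3.put_bucket_cors(
    Bucket="example-assets",
    CORSConfiguration={
        "CORSRules": [{
            "AllowedOrigins": ["http://example.com.s3-website-us-east-1.amazonaws.com"],
            "AllowedMethods": ["GET"],
            "AllowedHeaders": ["*"],
        }],
    },
)
```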