Advanced AWS CLI with Jupyter Notebooks (Part 2): Getting started with managing Amazon S3 with AWS CLI
In a previous post we saw how to set up Jupyter Notebooks to work with the AWS CLI natively. In this installment, we will build on that and bring some additional power to the table by mixing in a bit of Python magic. We will take Amazon S3 as the service to work through the examples. Let’s start.
Setup Notebook for AWS CLI
First we quickly go through the same steps as we did previously to get this notebook to work with the AWS CLI – setting up environment variables, IPython shell aliases, and enabling auto-magic.
%env AWS_PROFILE = handsonaws-demo
%rehashx
%automagic 1
aws whoami
As noted in the previous post, even though we have enabled auto-magic here, some commands will still only work with the shell magic prefix !. Luckily, if you run into one of those, the fix is as simple as inserting the ! prefix for that cell alone. You do not need to turn off auto-magic for the notebook or do anything else that impacts other cells.
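For instance, if a command refuses to run under auto-magic, you can fall back to the explicit shell escape in just that cell (a minimal sketch – any command will do, sts get-caller-identity is only an illustration):
!aws sts get-caller-identity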
Amazon S3 Basics
AWS CLI offers two ways of interacting with S3 buckets and objects – the s3 and s3api commands. The former is a higher-level command that offers a limited number of abstracted S3 operations that make working with S3 buckets and objects very easy. The latter, s3api, more closely replicates all the API methods offered by S3 and allows more fine-grained control when you need it. We will mostly look at s3 here.
There are also two other S3 commands offered by the CLI – s3control and s3outposts – that are for specialized purposes outside the scope of this post. The s3control command gives you access to control plane operations on S3 like S3 Batch Operations, Public Access blocking, and Storage Lens.
One useful reference is the help documentation available through the CLI itself. The subcommand help displays the help documentation for each command along with the list of all subcommands available under it.
Here is one place where the regular automagic doesn’t work, at least for me. The aws ... help command seems to internally invoke the cat command to display the help manual pages – which for some reason does not work from within my notebook setup on Win10, even when I alias the cat command to type. I need to look deeper into this.
The cell magic %%bash comes in handy here. The output is verbose – collapsing the output or enabling auto-scroll on it makes it more manageable in the notebook context. You can scroll up to this cell whenever you need to refer to it.
%%bash
aws s3 help
%%bash
aws s3api help
As you can see, the s3api command has a much larger set of subcommands, closely replicating the API methods.
We will start with the basics of the s3 command – listing buckets and objects, uploading and downloading objects, etc. – with an eye on some quirks.
One thing you will notice right away is that the s3 command provides a more free-form text output and does not support the --output parameter with which you can force the output to show up as JSON or YAML. We will see that the s3api command, on the other hand, provides JSON output by default like most other CLI commands.
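For instance, the s3api listing of your buckets comes back as structured JSON by default, and the --output parameter lets you switch it to YAML or a table (a quick illustration of the contrast):
aws s3api list-buckets --output yaml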
Reading S3 Buckets
List all buckets you own –
aws s3 ls
List all objects of a specific bucket –
aws s3 ls s3://s3demo-20210410/
By default, any prefixes are listed with PRE preceding them, and the objects under those prefixes are omitted. This mimics the more common directory structure that we see in our local machines’ file systems. To recursively list all objects in the bucket – including those in the folders (more correctly, prefixes) –
aws s3 ls s3://s3demo-20210410/ --recursive
The file sizes are listed as a total number of bytes by default. To make them easier for humans to read, you can use the --human-readable option that shows the file sizes in Bytes/KiB/MiB/GiB/TiB/PiB/EiB.
aws s3 ls s3://s3demo-20210410/ --recursive --human-readable
You can use the --summarize option to display a summary of the total number of objects and their aggregate size.
aws s3 ls s3://s3demo-20210410/ --recursive --human-readable --summarize
Note that wherever the s3 subcommand does not deal with the local file system, you can drop the s3:// prefix to the bucket name as there is no ambiguity (unlike, for example, the cp command we will see later). But you might still prefer to keep the prefix anyway to make it very obvious to the reader (which could be yourself in the future).
Here is an example with the s3:// prefix omitted.
aws s3 ls s3demo-20210410 --recursive --human-readable --summarize
At this stage, if you prefer this format of ‘directory listing’ for all your buckets, you can save yourself some keystrokes and create an alias shortcut for it. (We covered AWS CLI aliases in the previous post.)
aws s3dir s3demo-20210410
As you may have guessed, the above command uses an alias s3dir as a shorthand for s3 ls --recursive --human-readable --summarize. You can echo the contents of the alias file for reference –
!type C:\Users\Raghu\.aws\cli\alias
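In case you don’t have the file handy, the relevant entry looks roughly like this (a sketch assuming the alias lives in the standard [toplevel] section of the CLI alias file):
[toplevel]
s3dir = s3 ls --recursive --human-readable --summarize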
One thing you may not anticipate is that the s3 ls command returns all objects and prefixes whose key starts with the string provided – even a partial match – as if there were a trailing wildcard. Here is an example –
aws s3 ls s3://s3demo-20210410/test
aws s3dir s3://s3demo-20210410/test
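If you only want the keys under one exact folder, anchor the match by including the trailing slash – for example (assuming a prefix named testprefix/ exists in the bucket):
aws s3 ls s3://s3demo-20210410/testprefix/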
Writing to S3 buckets
Let’s create a file locally and upload it to S3.
echo helloworld > demo.txt
aws s3 cp demo.txt s3://s3demo-20210410/testprefix
While the above may look correct, there is a simple mistake that might trip up beginners. If you are thinking of uploading a file to a “folder” (more correctly, a prefix) in an S3 bucket, then you should take care to include the trailing forward-slash / at the end of the prefix. Otherwise the last string testprefix in the example command above is assumed to be the name of the object itself in the target bucket (root).
aws s3dir s3demo-20210410
As you can see, a new object named “testprefix” (size 13 Bytes) got created, and no new object was uploaded under the prefix. Let us get rid of this mistake and upload it correctly – using a trailing / after the prefix.
aws s3 rm s3://s3demo-20210410/testprefix
aws s3 cp demo.txt s3://s3demo-20210410/testprefix/
You can see the command reports the file uploaded with the correct prefix and key this time. By default the key name in the target bucket stays the same as the local filename.
aws s3dir s3demo-20210410
You can specify a key name explicitly if you want the uploaded file to have a different name than it has on the local machine.
aws s3 cp demo.txt s3://s3demo-20210410/testprefix/newdemo.txt
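The cp command works in the other direction too – downloading from S3 to the local file system – by simply swapping source and target (a quick sketch using the object we just uploaded; the local filename is arbitrary):
aws s3 cp s3://s3demo-20210410/testprefix/newdemo.txt newdemo-local.txt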
Similar to recursive listing of objects, you can also upload to S3 recursively – that is, including all folders and subfolders under a path – using the --recursive parameter.
I have a folder structure set up here locally for demo purposes with a few files in it. There is also one subfolder (subfolder2a) that is empty.
!dir demoupload /s /b
aws s3 cp demoupload s3://s3demo-20210410/demoupload
Here we tried uploading a folder without the --recursive parameter, and hence this error. This leaves a 0-byte object named demoupload in the target bucket that we need to delete.
Note that we also failed to provide a trailing / after the target prefix, but that is optional when you use the --recursive parameter.
aws s3 cp demoupload s3://s3demo-20210410/demoupload --recursive
aws s3dir s3demo-20210410
You may notice that no empty prefix named subfolder2a was created in the target bucket, as that source subfolder was empty.
If you specifically want to create an empty folder (prefix) in a target bucket, you need to use the put-object operation of the s3api command as follows.
aws s3api put-object --bucket s3demo-20210410 --key newfolder/
aws s3dir s3demo-20210410
Combining the power of Python with CLI
Using Jupyter notebooks, we can bring a bit of Python magic into the mix to make life easier and more interesting.
Python Variables
For a start, we can assign frequently used values to Python variables and reuse them inside AWS CLI commands by enclosing the variable names in {} characters.
bucketname = "s3demo-20210410"
aws s3dir {bucketname}
Another handy use for Python is to create a random string or timestamp that you want to use repeatedly within your notebook across multiple commands. Date-time stamps that we repeatedly use in constructing S3 bucket names or prefixes are one good use case.
from datetime import datetime
rundatestamp = datetime.now().strftime('%Y%m%d')
runtimestamp = datetime.now().strftime('%Y%m%d%H%M%S')
prefixbydate = datetime.now().strftime('%Y/%m/%d')
prefixbyhour = datetime.now().strftime('%Y/%m/%d/%H')
print(prefixbydate)
print(prefixbyhour)
aws s3 cp demo.txt s3://{bucketname}/{prefixbydate}/
aws s3dir {bucketname}/{prefixbydate}
newbucketname = "s3demo"
newbucketname = newbucketname + '-' + rundatestamp
print(newbucketname)
Or a string of random digits to ensure an S3 bucket name is always unique.
import random
randomsuffix = str(random.randint(1000, 9999))
newbucketname = "s3demo"
newbucketname = newbucketname + '-' + randomsuffix
print(newbucketname)
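If you prefer something with a lower chance of collision than a four-digit number, Python’s uuid module works just as well (a variation on the same idea – not used in the rest of this post):
import uuid
newbucketname = "s3demo-" + uuid.uuid4().hex[:8]
print(newbucketname)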
Creating and Deleting Buckets
The mb (make bucket) subcommand creates a new bucket. The --region option lets you create the bucket in a specific region rather than your profile’s default.
aws s3 mb s3://{newbucketname}
aws s3 mb s3://{newbucketname} --region us-east-1
Using the %%time cell magic you can time the execution of any command block (cell) if that is of interest.
%%time
aws s3 cp demoupload s3://{newbucketname}/{prefixbydate}/ --recursive
aws s3dir {newbucketname}
Before you can delete a bucket, you need to ensure it has no objects remaining in it; otherwise the rb (remove bucket) command fails.
aws s3 rb s3://{newbucketname}
Emptying a bucket involves recursively deleting all objects.
%%time
aws s3 rm s3://{newbucketname} --recursive
Now you can remove the empty bucket –
aws s3 rb s3://{newbucketname}
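As an aside, the rb command also accepts a --force flag that first empties the bucket and then removes it, so next time you could do both steps in one shot:
aws s3 rb s3://{newbucketname} --force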
We will look at more bucket operations and more advanced uses of the CLI in subsequent installments.
Additional Tips & Notes
- Useful extensions for JupyterLab
  - Collapsible Headings – https://github.com/aquirdTurtle/Collapsible_Headings
  - Spellchecker – https://github.com/jupyterlab-contrib/spellchecker
- GNU Utils for Win32 – https://sourceforge.net/projects/unxutils/
- Git-bash for Windows – https://gitforwindows.org/
References
- AWS CLI (v2) reference for the s3 command – https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html
- AWS CLI (v2) reference for the s3api command – https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/index.html