AWS setup and usage


As we start to do more involved statistical inference using Stan, you may want to run your calculations on more powerful machines. While a local installation on your own machine may suffice, it will serve you well to have more expansive computing resources available. There are many options for this, including Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and Caltech’s own high performance computing center. In this lesson, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup. The next part of this lesson serves as a quick reference for how to spin up instances and use them after you have completed the setup outlined in this part of the lesson.

1. Create an Amazon Web Services account

You should create an AWS account if you have not already. For a total cost of about $20, you can have over 100 hours of compute time on a four core machine, which should be plenty for the entire course.

2. Launch your instance

Log on to the AWS Console by going to https://aws.amazon.com and clicking on the button at the upper right of the page.

Once you are on the AWS Console page, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed and the data sets pre-loaded. The AMI is available in Oregon (us-west-2). Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.

  1. To launch an instance with this AMI, choose EC2 among the services available from your AWS console. You can select EC2 from the Services pulldown menu at the top left of your screen.

  2. After selecting EC2, you will see a menu of options the left pane. Under Images there, click AMIs.

  3. The resulting menu will default to AMIs Owned by me (you likely do not have any). Select instead Public images.

  4. In the search menu, search for bebi103, and the class AMI should appear. If it does not, double check to make sure your region is Oregon.

  5. You will see the bebi103 AMI listed. Right click on it and select Launch instance from AMI. (You may also select Request Spot instance if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.)

  6. You will then be taken to the Launch an instance page. In the Name and tags pane, you can give your instance a name, which I recommend, since it will help you find it more easily in your menu of instances when you come back to it. I simply named mine bebi103.

  7. You can skip the Application and OS Images pane because you already selected your AMI.

  8. In the Instance type pane, you are given many choices of instance types. For our beginning usage of Stan, I recommend a c5.xlarge instance, which as 4 cores and costs about 17¢/hour to use. When we start doing simulation based calibration (SBC), which is the most computationally intensive part of the course, I would use a c5.2xlarge instance or larger (8 or more cores).

  9. In the Key pair (login) pane, you will need to create a new key pair if you do not already have an AWS key pair. Click Create new key pair and in the pop-up window, you can enter a name for the key pair (something like bebi103_aws_keypair would be fine). Leave the default radio buttons selected and click Create key pair. Once you create the key pair, it will be downloaded to your local machine. DO NOT, I repeat, DO NOT store the key pair in any git repository, or anything that is backed up to the cloud, like Dropbox. ONLY store it locally on your machine and never, ever let it out to the internet. You only need to create the key pair once; you can reuse it going forward.

  10. In the Network settings pane, you can select sources to allow SSH traffic. You can leave this as “Anywhere” and leave everything else as the defaults. If you are concerned about security, you can instead only allow SSH traffic from your IP address, but this may prove inconvenient if you want to access your instance from home and on campus.

  11. In the Configure storage pane, select 30 GiB gp2 root volume. This should be enough storage for installed software, data sets, MCMC samples, and the rest of your work.

  12. Click Launch instance at the bottom right of the right Summary pane.

After following those steps, your instance will spin up. You can view your running instances by clicking on Instances on the dashboard on the left part of the screen.

It will take a while for your instance to spin up. When the Instance State says running and the Status Checks are complete, your instance is ready for you.

3. Connect to your instance

Now that your instance is launched, you can connect to it using your computer and the ssh protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash or zsh. In Windows, this is accomplished using GitBash, which you should install if you have not already. For macOS, use Terminal.

  1. Identify where you put your keypair file. For the purposes of this exercise, I will assume that you have a directory in your home directory called key_pairs/ and that your keypair file is ~/key_pairs/bebi103_aws_keypair.pem.

  2. Change permissions on your keypair for security. Do this in the terminal using

    chmod 400 ~/key_pairs/bebi103_aws_keypair.pem

  3. Open a new GitBash (Windows) or Terminal (macOS) window.

  4. SSH into your instance in the terminal. To do this, click on your instance on the Instances page from the dashboard. Clink on the instance you started. At the bottom of the webpage will appear information about your instance, including the Public IPv4 address. It will look something like 54.92.67.113. Copy this. In what following, I refer to this as <IPv4 Public IP>. SSH into your instance by doing the following on the command line.

    ssh -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>

  5. (optional, may only work for macOS) To avoid having to use -i "~/key_pairs/bebi103_aws_keypair.pem" each time, you can add your keypair to your bash or zsh profile by doing, e.g. for zsh,

    echo ssh-add -K $HOME/keypairs/bebi103_aws_keypair.pem >> ~/.zshrc; source ~/.zshrc

  6. Now that you are SSH’d into the instance, you can access it through the command line. You may notice that the bebi103 environment is already active. From here, the world is your oyster. For example, you can clone a repository where you keep your work for the class.

    git clone https://github.com/my_user_name/my_favorite_repository.git

  7. The data for the course are located in the ~/data folder. So, to keep the appropriate relative path, you should make a directory for your homework, ~homework/ where each .ipynb file is stored.

  8. Whenever you log in to your instance, you should update the instance in case I updated any software or added data sets. To do this, run

    bebi103_update

on the command line after SSH-ing into your instance.

  1. You now have your repository and all of the data for the course on your instance! Any documents you create or edit on your AWS instance can be managed with git and push/pulled to/from GitHub.

4. Launch JupyterLab

While connected to your instance, you can launch JupyterLab by executing

jupyter lab --no-browser

on the command line. You will see output like this:

To access the server, open this file in a browser:
    file:///home/ec2-user/.local/share/jupyter/runtime/jpserver-30060-open.html
Or copy and paste one of these URLs:
    http://localhost:8888/lab?token=e52184f06c9fb0f9ceea176b1d51d9cb36c72a019e688fed
 or http://127.0.0.1:8888/lab?token=e52184f06c9fb0f9ceea176b1d51d9cb36c72a019e688fed

Keep this window open.

In order to use JupyterLab through a browser on your machine, you need to set up a socket. To do so, open up another GitBash or Terminal window and execute the following.

ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L 8000:localhost:8888 ec2-user@<IPv4 Public IP>

This sets up a socket connecting port 8888 on your EC2 instance to port 8000 on your local machine. You can change these numbers as necessary. For example, in the URL listed above that you got with you launched JupyterLab, the port may be localhost:8889, in which case you need to substitute 8889 for 8888 in your ssh command. You may also want a different local port if you already have a JupyterLab instance running on port 8000, e.g., 8001. In what follows, I will use port number 8000 and 8888, which you will probably use 90% of the time, but you can make changes as you see fit.

After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute 8000 for 8888. That is, direct your browser to

http://localhost:8000/lab?token=e52184f06c9fb0f9ceea176b1d51d9cb36c72a019e688fed

You will now have JupyterLab up and running!

Note that you may be running JupyterLab locally on your own machine. You should make sure you do not use the same port number of any JupyterLab instance running on your local machine when you launch JupyterLab on AWS. You can specify the port number to be, for example 8890, by launching JupyterLab with

jupyter lab --no-browser --port 8890

If you do that, make sure you use the corresponding port numbers when setting up your socket.

5. Copying results to and from AWS to your local machine

As you work on notebooks and create new files you want to save, you may want to move them to your local machine. If you are working on a notebook or .stan file, the best option is to use git and commit and push those files to a git repository directly from the command line on your EC2 instance.

Some files, though, such as MCMC results or intermediate data processing results, are not meant to be under version control. For these file, you an use scp. Within your GitBash or Terminal window on your local machine (you probably have to open yet another window), you can copy files as follows.

scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/my_file.csv ./

This command will copy files from your EC2 instance to your working directory. Simply put the full path to the file you want to transfer after the colon above (remember ~/ means “home directory”). The second argument of scp is where you want to copy the file.

Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).

scp -i "~/key_pairs/bebi103_aws_keypair.pem" my_file.txt ec2-user@<IPv4 Public IP>:~/

6. Exiting

When you are finished with your session, you can shut down your notebook in the browser. If the shutdown does not give you a new prompt in the terminal window, you can do a hard shutdown of JupyterLab by pressing Ctrl-c.

After you are finished with your work on your instance, you should stop your instance. To do this, go back to the Instances page on your EC2 console. Right click your instance and click Stop instance. Do not terminate your instance unless you really want to. Terminating an instance will get rid of any changes you made to it.

7. Seriously. Stop your instances if you are not using them.

If your instance is not stopped and you leave it running, you will get charged! You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.

8. Using your instance again

After the initial setup and your first session with your instance, you will likely start it up again. This is much easier.

In the AWS console, navigate to the Instances page, either through the left menu or via the EC2 dashboard. Find the instance you set up (mine is named bebi103), right click it, and select Start instance. After it has spun up, (Instance state is Running), you can access the Public IPv4 address, and fire up JupyterLab as per the above instructions.

9. Terminate your instances after the class is over

After the class is over, you might want to terminate your instance. This is because the storage in your instance (stored using AWS’s EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier accessibility expires in a year if you are new to AWS, and/or your promo codes expire, you will start getting bills for your EBS usage. These get wiped if you terminate your instance and you will not get billed.