Welcome to Bioinformatics in the Cloud Workshop’s documentation!

Cloud technologies are emerging as a critical tool in Bioinformatics analysis as datasets grow exponentially in number and size. However, the set of cloud technologies and concepts necessary to deploy Bioinformatics analysis is rapidly evolving and complex. This series of workshops will introduce the cloud analysis paradigm using the Amazon Web Services (AWS) platform, cover some current strategies for deploying Bioinformatics data and applications for analysis, and give students some hands-on experience with these topics. The workshop will also highlight FireCloud, a scalable Bioinformatics cloud solution provided by the Broad Institute.

Note

Bioinformatics knowledge is not required, as the materials are intended to be sufficiently generic to allow users familiar with the prerequisite concepts to deploy their own applications in the cloud. The workshop simply uses a bioinformatics analysis as the use case for the hands-on materials.

Prerequisites

This workshop is fairly technical. You will need a good understanding of the following to maximally benefit from the materials:

  • Ability to use linux/command line
  • Programming in python, Java, C/C++, or other comparable languages
  • General familiarity with how the linux operating system works

Attendees are expected to bring their own (preferably Mac/Linux) laptops.

Time & Location

  • Session 1 - Cloud Concepts: Monday July 30th 2PM-5PM
  • Session 2 - Packaging and Deploying Applications: Wednesday August 1st 2PM-5PM
  • Session 3 - FireCloud Case Study: Thursday August 2nd 2PM-5PM

Location: Life Sciences and Engineering Building (LSEB) 103

Registration

Registration is now closed.

Online Materials

Nota bene

This content is under construction!

Cloud Concepts Workshop

This is day 1 of the “Bioinformatics in the Cloud” workshop. In this session, you will learn basic cloud concepts and terminologies and work on setting up your own cloud instance and running an application on the cloud.

Workshop Outline:

  • Introduction to the cloud (~10min) @Dileep
  • Cloud concepts (~15min) @Sebastian
  • Deployment Walkthrough: Web Console (~35min) @Sebastian
  • Break (~5min)
  • Deployment Walkthrough: CLI (~30min) @Dileep
  • Working with deployed resources (~10min) @Dileep
  • Hands-on section (~40min)
  • Machine Learning on the Cloud (~30min) @Gerard

Prerequisites

Participants are required to have access to the following resources before attending the workshop:

  • AWS account
    Access to the AWS account through BU
  • Web browser
    A modern web browser is needed to log into the AWS management console
  • A terminal emulator and SSH client
    A terminal emulator and ssh client are needed to log in remotely to our AWS instance
  • AWS CLI
    A working installation of the AWS CLI
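
To confirm these tools are in place before the session, you can check their versions from a terminal (a quick sanity check; exact version numbers will vary):

$ aws --version     # prints something like aws-cli/1.16.x ...
$ ssh -V            # prints something like OpenSSH_7.x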

What is Cloud Computing?

Cloud computing provides near-instant access to arbitrary amounts of compute resources. The resources exist on servers managed by the cloud provider, so you avoid the hassle of maintaining hardware yourself.

_images/rise-of-clouds.jpg
Key advantages
_images/Advantage-of-Cloud-Computing.png
  1. High availability - your files are always available across multiple systems
  2. Fault tolerant - automatic backups enable recovery from failure
  3. Scalability and Elasticity - easily scale compute resources to fit new requirements within minutes

There are various cloud providers, the most popular ones include Amazon (Amazon Web Services), Google (Google Compute Engine) and Microsoft (Azure).

Common use-cases
  1. Web hosting
  2. Storage
  3. Software as a Service
  4. Big Data Analytics
  5. Test and Development

Cloud concepts

Virtual Machines

Virtual Machines (VMs) emulate the architecture and functionality of physical computers in the cloud. In AWS, the VM service is called Elastic Compute Cloud (EC2); instances can be created with different operating systems (e.g. Linux, Windows) and different vCPU, memory, and storage configurations. Using EC2 eliminates the need to invest in hardware up-front: you can launch as many or as few virtual servers as needed and scale up or down as your requirements change (the EC2 instances section below covers this in more detail).

_images/vm.png
Storage Units

AWS also provides storage services for VMs; the two main options, depending on your needs, are:

  1. Elastic Block Storage (EBS): block level storage volumes that can be mounted directly to an EC2 instance
  2. Simple Storage Service (S3): object storage organized into buckets, accessible through an API or the command line
_images/ebs_s3.jpg
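
The practical difference shows up in how you access each service. The commands below are a minimal sketch, assuming an EBS volume already attached to your instance as /dev/xvdf and an S3 bucket named my-bucket (both placeholders):

# EBS: appears as a block device on the instance; format and mount it before use
$ sudo mkfs -t ext4 /dev/xvdf
$ sudo mkdir /data && sudo mount /dev/xvdf /data

# S3: accessed over the network through the API or the AWS CLI
$ aws s3 cp results.txt s3://my-bucket/results.txt
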
Databases

Relational Database Service (RDS) allows you to set up, operate, and scale relational databases (e.g. MySQL) in the cloud.

Serverless

Serverless computing removes the need to manage and operate servers for your applications, and provides scalable, cost-efficient execution. In AWS this is offered by services such as AWS Lambda (covered below).

The AWS infrastructure

_images/amazon_services.png

This workshop involves working with the Amazon Web Services (AWS) cloud infrastructure, but the concepts in this workshop will apply to other cloud computing services as well. The only difference involves the exact terms used to describe services and actions.

0. AWS and the web console
  1. Creating an account

In order to use AWS you will need to create an account, and in order to create instances and use the other services covered in this workshop, you will need to associate a credit card with that account. For the purposes of this workshop we will provide you with pre-existing AWS accounts, but you will need to create your own account for any future use.

  2. Logging into the AWS console

To log into AWS, go to aws.amazon.com and hit the Sign in to the Console button as shown below.

_images/main_page.png
  3. AWS regions

An AWS Region is a physical warehouse of servers (data centers) and other computer hardware that Amazon maintains. At any point in time you can only operate in one region. After logging in, the current region is shown in the upper right corner of the console.

Regions are important for several reasons:

  1. When you launch a service like an EC2 instance, it will be confined to the region you launched it in. If you switch regions later, you will not see this instance.
  2. The cost of usage for many AWS resources varies by region.
  3. Since different regions are located in different parts of the world, your choice of region might add significant networking overhead to the performance of your application.
_images/aws_console.png

At the time of writing the following AWS regions exist:

Region Name                | Region         | Endpoint                             | Protocol
US East (Ohio)             | us-east-2      | rds.us-east-2.amazonaws.com          | HTTPS
US East (N. Virginia)      | us-east-1      | rds.us-east-1.amazonaws.com          | HTTPS
US West (N. California)    | us-west-1      | rds.us-west-1.amazonaws.com          | HTTPS
US West (Oregon)           | us-west-2      | rds.us-west-2.amazonaws.com          | HTTPS
Asia Pacific (Tokyo)       | ap-northeast-1 | rds.ap-northeast-1.amazonaws.com     | HTTPS
Asia Pacific (Seoul)       | ap-northeast-2 | rds.ap-northeast-2.amazonaws.com     | HTTPS
Asia Pacific (Osaka-Local) | ap-northeast-3 | rds.ap-northeast-3.amazonaws.com     | HTTPS
Asia Pacific (Mumbai)      | ap-south-1     | rds.ap-south-1.amazonaws.com         | HTTPS
Asia Pacific (Singapore)   | ap-southeast-1 | rds.ap-southeast-1.amazonaws.com     | HTTPS
Asia Pacific (Sydney)      | ap-southeast-2 | rds.ap-southeast-2.amazonaws.com     | HTTPS
Canada (Central)           | ca-central-1   | rds.ca-central-1.amazonaws.com       | HTTPS
China (Beijing)            | cn-north-1     | rds.cn-north-1.amazonaws.com.cn      | HTTPS
China (Ningxia)            | cn-northwest-1 | rds.cn-northwest-1.amazonaws.com.cn  | HTTPS
EU (Frankfurt)             | eu-central-1   | rds.eu-central-1.amazonaws.com       | HTTPS
EU (Ireland)               | eu-west-1      | rds.eu-west-1.amazonaws.com          | HTTPS
EU (London)                | eu-west-2      | rds.eu-west-2.amazonaws.com          | HTTPS
EU (Paris)                 | eu-west-3      | rds.eu-west-3.amazonaws.com          | HTTPS
South America (São Paulo)  | sa-east-1      | rds.sa-east-1.amazonaws.com          | HTTPS
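
Rather than relying on this table, you can also list the regions available to your account directly from the AWS CLI (introduced later in this session):

$ aws ec2 describe-regions --output table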

VPC: Virtual Private Cloud. Your private section of AWS, where you can place AWS resources and allow or restrict access to them.

1. EC2 instances

Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the Amazon Web Services (AWS) cloud. This service allows you to configure and rent computers to meet your compute needs on an as-needed basis. Using EC2 eliminates the need to invest in hardware up-front. EC2 can be used to launch as many or as few virtual servers as needed, configure security and networking, and manage storage. Amazon EC2 enables scaling up or down to handle changes in requirements or spikes in popularity, reducing the need to forecast traffic.

Instances come in various shapes and sizes. Some instance types are geared towards CPU-intensive tasks, while others are optimized for memory or storage. Some of the available options are shown in the figure below, and more information can be found here.

_images/ec2_instances.png

The following sections outline the various steps involved in setting up an EC2 instance:

  1. AMI selection

An Amazon Machine Image (AMI) is a preconfigured template for launching an instance. It packages the various applications you need for your server (including the operating system and additional software). There are four main options when selecting an AMI: Quick Start, My AMIs, AWS Marketplace, and Community AMIs. These options can be seen in the image below on the left sidebar. Select the desired AMI and then proceed to the next step.

_images/instance_launch.png
  2. Instance type selection

Once an AMI is selected, the next step is to choose an instance type. If choosing an AMI is equivalent to choosing the software you want on your computer, then choosing an instance type is equivalent to choosing the hardware. Broadly speaking, the different instance types vary in the number of CPUs, the amount of RAM, and the amount and kind of storage. The price per hour for each option is not listed here; to get the price of a particular instance, look up its name on the EC2 pricing list. Once you are ready, proceed to the next step by pressing the Next: Configure Instance Details button.

_images/instance_types.png
  3. Instance general configuration

Once you have selected your instance type, the next step is to configure your instance. This step involves many advanced concepts that will not be covered in detail in this tutorial. Using the Number of instances option you can launch multiple instances with the same AMI and hardware configuration at the same time. You can also Request Spot instances; spot instances offer spare compute capacity at steep discounts, but they are reclaimed whenever EC2 needs the capacity back. Shutdown behavior determines what happens to the instance when it is shut down from within the AMI.

For this tutorial we will proceed with the default values for all the options.

_images/instance_configure.png
  4. Instance storage configuration

The next step is to configure the storage that will be available to the instance. The storage that you start with depends on the type of instance you have selected. In the image below we have an 8 GiB EBS root volume; this is the root volume where the operating system will live. By default, this volume is deleted when the instance is terminated, but this behavior can be changed. The Add New Volume button can be used to add additional storage to our instance. The two kinds of storage available are:

  • ephemeral or Instance store storage
  • EBS storage
_images/instance_storage.png
  5. Instance tagging

When dealing with multiple instances, tagging provides a simple way to track usage and billing information for groups of related instances.

_images/instance_tags.png
  6. Instance security

Login to instances is secured using key pairs: AWS stores the public key, and the user stores the private key in a secure place. Security groups act as a firewall that lets you specify the protocols, ports, and source IP ranges that can reach your instances.

_images/instance_security.png
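
For reference, the same security group setup can also be scripted with the AWS CLI. The sketch below uses a placeholder group name (my-ssh-sg) and a placeholder source IP range; substitute your own values:

# create a security group and open port 22 (SSH) to a specific IP range
$ aws ec2 create-security-group --group-name my-ssh-sg \
    --description "Allow SSH access"
$ aws ec2 authorize-security-group-ingress --group-name my-ssh-sg \
    --protocol tcp --port 22 --cidr 203.0.113.0/24
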
  7. Instance review

Elastic IP addresses provide static IPv4 addresses for dynamic cloud computing and can be associated with an instance after it is launched.

_images/instance_review.png

Created instance

_images/instance_created.png

Create a key and save it

_images/instance_key.png

Key file example. The file can be saved as Test.pem

-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAhEpF18lIUouMH8qia/BSB70vrQVq/mTTkiRbsACB78rzy3XGRMfvwUseIsGY
H6SDOAFrRlmTrAArH5A0t2TZ8PKrq7b9FtEAvMCeE7rWEiqBblAWiER0k1pbnIqyKJJCo1YRSUs0
oNMdvjB4CUylYraSsSNFYJG5gRwcNhBENLDVnDS79geQcPLu/JeEiJ9V+w+CCYAG40f7li/TuULr
rSy6Oq6jgn2Gy7rrHU7XHU5hcEvxuSeoLb8h/bH1N+cN/H7x3ipEjIDdA2ScCkRXum1V6/kTFQFq
vDG0lqoTlmTNKgDGpb+rdzJgOg/3QX4RSrX/c0W6aFkV9Ib/jQxT+wIDAQABAoIBADAvWXc6wpQG
bjiaN0T3mPlmqHnuEkWs9f8yLQ9TcACmvNwr/tbIuISAVu6z8zP7WSxKIAfU0twAh7SMcxclrdh8
m5kFIvRvlkQqKKnpENY3E0PZ+gsSXB/b9qhzQGdUtt8Fl3BJ61Z07016HA7PEyJ8e7v3q+p7ycTE
N2Zd0GocRIX8zxdRo9GS8ouS0QcFgNF8KblzlJ6Vs0gI7o7mIRZIm9vWkuR9Lp9uEPD2flUIvN3z
yRmY/FE/R1yc76Uq+g8eywifRAh+GFyyO8PmFoYRni4Ki6+tEIFaq5JauT0JJF66EZeZP8ZKoWm9
1K30Ucti2D5l8t+CpbBM5JxhmjECgYEAxz1ET42F1sBGYqNn5hmfjrRp+YF3EYz2awRSibOeerpJ
Bh1QZeB7/QD3wcB00XFiMu/3haP9xs4eesjSSug+1F59nyzDplNsybz1sYpUQwP9LjX0loUCIb8r
3O2VdLJ5ZJ9dfNgpStC/wi7kkr8xjK5XiHgP6DLk6+H1Lr2d+kMCgYEAqfpUseZ/sm1vYt80LlWI
r8ozsUmzuISRspGVUppyDD47Iyj/1mkiWnsFDDl07oBcFIUFIEd1rkJNB3gXKSr76kcY0X4lav7a
0dvse2T9PC/pLSFkax9UjVnydCN8ElyNoXI2wT5HuLDjjCmHBD/4E9ZOO201JICSbRxaykl17+kC
gYEAxRiWuxwFiqwq9Okxny856LIRJAIvB+2q17Mu84n8/OvL0YCuSBoKjf6nGcSJy6eevUUmV84i
/sho3o5Lek7F2NCg9RYTdjaRKAEGDNwK/0Cy9UPq8fwiX7/+ZE+jyg3EiQYeNaKhNqHLEQ3SkFkT
a1gMv7QGCG5QiAi/w71QyoECgYARcn+VDyrWXsNLK8wIYYE5QhESRpVrADiQUr84DmBcf1rEniW8
lWgQT4ZSHeexv300If9Hs+4RZ/7OIHaIJEBdaNTUVBV1KRm+5sscU15m+if+GOpc0Id2RuBLKYVH
wTZMdxPFvCXSgF2q+mxAdGx7ZMj88pW83HGrP3jWQLoZWQKBgQCX5jxy3QXlPpwDppqwKKBQ8cGn
YDDQHCeD5LhrVCUqo5DCobswzmGKU/xEqYsqlk/Mz1Zkvg4FbJwJDgQGkSyAu071NLi0O6w27dm+
UHuvF5mCDdAHWirFUBSiebxOpEQnkZ9IPXUUCSC6IQvPFbdGN8G3WjoER6Lw121Q4rJxGA==
-----END RSA PRIVATE KEY-----
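
The same key pair can also be created from the AWS CLI instead of the web console. This is a sketch assuming the key name Test used above; the private key material is written to Test.pem, which must be readable only by you:

$ aws ec2 create-key-pair --key-name Test \
    --query 'KeyMaterial' --output text > Test.pem
$ chmod 400 Test.pem
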
2. S3 buckets

Simple storage service (S3)

This service allows the storage of large volumes of data, which can be accessed through an API or a command line interface such as the aws-cli.

_images/s3_main.png

Setting up an S3 bucket

  1. Create S3 bucket
_images/s3_create.png
  2. Change permissions
_images/s3_permissions.png
  3. Review
_images/s3_review.png
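
The same bucket operations can be performed with the AWS CLI (covered in detail later). A minimal sketch, using a placeholder bucket name that must be globally unique:

$ aws s3 mb s3://my-unique-bucket-name                    # create a bucket
$ aws s3 ls                                               # list your buckets
$ aws s3 cp local_file.txt s3://my-unique-bucket-name/    # upload an object
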
3. EBS

Elastic Block Storage

EBS allows you to rent storage and mount it directly to your EC2 instance. In contrast to S3, an EBS volume can only be attached to one EC2 instance at a time, and its storage prices are higher.
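
For illustration, an EBS volume can also be created and attached from the AWS CLI. This is a sketch with placeholder IDs; the volume must be created in the same Availability Zone as the instance it will attach to:

$ aws ec2 create-volume --availability-zone us-east-1a --size 20
$ aws ec2 attach-volume --volume-id <volume-id> --instance-id <instance-id> --device /dev/sdf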

4. RDS
Relational Database Service
5. AWS Lambda
Run code without thinking about servers. Pay only for the compute time you consume.
6. Pricing

When you create EC2 instances or S3 buckets you are renting computing power from Amazon, for which you will be charged. Once you start an instance you are charged hourly.

Amazon provides a pricing list and a monthly price calculator.

CloudFormation

AWS CloudFormation is a service that helps deploy infrastructure as code. You create a template that describes all the AWS resources that you want (like Amazon EC2 instances or Amazon RDS DB instances), and AWS CloudFormation takes care of provisioning and configuring those resources for you. You don’t need to individually create and configure AWS resources and figure out what’s dependent on what; AWS CloudFormation handles all of that. There are similar resources for other services as well, for example, Azure Resource Manager for Microsoft Azure.

Advantages
  1. Simplify infrastructure management
  2. Quickly replicate your infrastructure
  3. Reproducible infrastructure deployment
  4. Easily control and track changes to your infrastructure
  5. Automatic resource removal
Concepts
  • Templates: The CloudFormation template is a JSON or YAML formatted text file that contains the configuration information about the AWS resources you want to create. When fed to CloudFormation, the template directs it to create the required resources on AWS. Templates can also be created using the AWS CloudFormation Designer.
  • Stacks: When you use AWS CloudFormation, you manage related resources as a single unit called a stack. You create, update, and delete a collection of resources by creating, updating, and deleting stacks. All the resources in a stack are defined by the stack’s AWS CloudFormation template. You can create, update or delete stacks by using the AWS CloudFormation console, API, or AWS CLI.
  • Change Sets: If you need to make changes to the running resources in a stack, you update the stack. Before making changes to your resources, you can generate a change set, which is a summary of your proposed changes. Change sets allow you to see how your changes might impact your running resources, especially critical resources, before implementing them.
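
For example, a change set can be created, reviewed, and applied from the AWS CLI. The sketch below assumes a stack named <stack> and a modified template saved as main.yaml (both placeholders):

$ aws cloudformation create-change-set --stack-name <stack> \
    --change-set-name my-changes --template-body file://main.yaml
$ aws cloudformation describe-change-set --stack-name <stack> --change-set-name my-changes
$ aws cloudformation execute-change-set --stack-name <stack> --change-set-name my-changes
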
Template Components

The anatomy of a CloudFormation template:

{
  "AWSTemplateFormatVersion": "version date",
  "Description": "description of the template",
  "Parameters": {"set of parameters"},
  "Mappings": {"set of mappings"},
  "Conditions": {"set of conditions"},
  "Resources": {"set of resources"},
  "Outputs": {"set of outputs"}
}

All templates consist of the following:

  1. Parameters: Values to pass to your template at run time (during stack creation), such as the specifics of the EC2 instance or S3 bucket you need. A parameter is an effective way to specify sensitive information, such as user names and passwords, or unique information that you don’t want to store in the template itself. You can refer to parameters from the Resources and Outputs sections of the template. Multiple parameters can be passed, such as the EC2 instance type, SSH security protocols, etc. For example, the code section below defines an InstanceTypeParameter for an EC2 instance.
{
  "Parameters": {
    "InstanceTypeParameter": {
      "Type": "String",
      "Default": "t2.micro",
      "AllowedValues": [
        "t2.micro",
        "m1.small",
        "m1.large"
      ],
      "Description": "Enter t2.micro, m1.small, or m1.large. Default is t2.micro."
    }
  }
}
  2. Mappings: A mapping of keys and associated values that you use to specify conditional parameter values, similar to a lookup table. You can match a key to a corresponding value by using the Fn::FindInMap intrinsic function in the Resources and Outputs sections. In this example, the mapping matches the corresponding AMI for a given AWS region:
{
  "Mappings": {
    "RegionMap": {
      "us-east-1": {
        "32": "ami-6411e20d"
      },
      "us-west-1": {
        "32": "ami-c9c7978c"
      },
      "eu-west-1": {
        "32": "ami-37c2f643"
      },
      "ap-southeast-1": {
        "32": "ami-66f28c34"
      },
      "ap-northeast-1": {
        "32": "ami-9c03a89d"
      }
    }
  }
}
  3. Conditions: Conditions that control whether certain resources are created or whether certain resource properties are assigned a value during stack creation or update. For example, you could conditionally create a resource that depends on whether the stack is for a production or test environment.
  4. Resources: The Resources section specifies the stack resources and their properties, such as an Amazon EC2 instance or an Amazon S3 bucket. This is the only part of the template that is mandatory. Each resource is listed separately and specifies the properties that are necessary for creating that particular resource. You can refer to resources in the Resources and Outputs sections of the template. The following code section describes an EC2Instance resource and an InstanceSecurityGroup resource. The resource declaration begins with a string that specifies the logical name for the resource.
{
  "Resources": {
    "EC2Instance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "InstanceType": "InstanceType",
        "SecurityGroups": [
          "InstanceSecurityGroup"
        ],
        "KeyName": "KeyName",
        "ImageId": "ami-08f569078da6ad4c2"
      }
    },
    "InstanceSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Enable SSH access via port 22",
        "SecurityGroupIngress": [
          {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "CidrIp": "SSHLocation"
          }
        ]
      }
    }
  }
}
  5. Outputs: Describes the values that are returned whenever you view your stack’s properties. For example, you can declare an output for an EC2 instance to display its ID and Availability Zone:
{
  "Outputs": {
    "InstanceId": {
      "Description": "InstanceId of the newly created EC2 instance",
      "Value": "EC2Instance"
    },
    "AZ": {
      "Description": "Availability Zone of the newly created EC2 instance",
      "Value": {
        "Fn::GetAtt": [
          "EC2Instance",
          "AvailabilityZone"
        ]
      }
    }
  }
}

Note

  1. The Resource Type attribute has the format AWS::ProductIdentifier::ResourceType. For example, the Resource Type for an S3 bucket is AWS::S3::Bucket and that for an EBS volume is AWS::EC2::Volume.
  2. The Ref function returns the value of the object it refers to. The Ref function can also set a resource’s property to the value of another resource.
  3. Depending on the resource type, some properties are required, while others are optional and are assigned default values.
  4. Some resources can have multiple properties, and some properties can have one or more subproperties.
Best Practices

Take a look at the official best practices to be able to use AWS CloudFormation more effectively and securely.

AWS CLI

The AWS CLI is an open source tool built on top of the AWS SDK for Python (Boto) that provides commands for interacting with AWS services. With minimal configuration, you can start using all of the functionality provided by the AWS Management Console from your favorite terminal program.

For installation instructions refer to the official documentation.

Advantages
  1. Easy to install
  2. Supports all Amazon Web Services
  3. Easy to use
  4. Can be incorporated in shell scripts for automation and reproducibility
Setting up your profile

Before you can start using the aws-cli you need to configure the CLI with your AWS credentials. The aws configure command is the fastest way to set this up. This command automatically generates the credentials file at ~/.aws/credentials and the config file at ~/.aws/config.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json

The AWS CLI will prompt you for four pieces of information. AWS Access Key ID and AWS Secret Access Key are your account credentials.

Alternatively you can manually create and populate these files.

~/.aws/credentials

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

~/.aws/config

[default]
region=us-east-1
output=json

If you have multiple profiles, you can configure additional named profiles using the --profile option:

$ aws configure --profile user2
AWS Access Key ID [None]: AKIAI44QH8DHBEXAMPLE
AWS Secret Access Key [None]: je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: text
Commands

Help:

To get help when using the AWS CLI, you can simply add help at the end of a command or sub-command.

$ aws help
$ aws ec2 help
$ aws ec2 describe-instances help

The help for each command is divided into six sections: Name, Description, Synopsis, Options, Examples and Output.

Command Structure:

$ aws <command> <sub-command> [options and parameters]

Specifying parameter values

$ aws ec2 create-key-pair --key-name my-key-pair

Output:

The AWS CLI supports three different output formats:

  • json
  • Tab-delimited text
  • ASCII formatted table

The default output format is chosen during the configuration step of aws configure. This can be changed by editing the config file or setting the AWS_DEFAULT_OUTPUT environment variable.

Additionally, per-command output can be changed using the --output option:

$ aws swf list-domains --registration-status REGISTERED --output text
# Example output
$ aws ec2 describe-volumes
{
    "Volumes": [
        {
            "AvailabilityZone": "us-west-2a",
            "Attachments": [
                {
                    "AttachTime": "2013-09-17T00:55:03.000Z",
                    "InstanceId": "i-a071c394",
                    "VolumeId": "vol-e11a5288",
                    "State": "attached",
                    "DeleteOnTermination": true,
                    "Device": "/dev/sda1"
                }
            ],
            "VolumeType": "standard",
            "VolumeId": "vol-e11a5288",
            "State": "in-use",
            "SnapshotId": "snap-f23ec1c8",
            "CreateTime": "2013-09-17T00:55:03.000Z",
            "Size": 30
        },
        {
            "AvailabilityZone": "us-west-2a",
            "Attachments": [
                {
                    "AttachTime": "2013-09-18T20:26:16.000Z",
                    "InstanceId": "i-4b41a37c",
                    "VolumeId": "vol-2e410a47",
                    "State": "attached",
                    "DeleteOnTermination": true,
                    "Device": "/dev/sda1"
                }
            ],
            "VolumeType": "standard",
            "VolumeId": "vol-2e410a47",
            "State": "in-use",
            "SnapshotId": "snap-708e8348",
            "CreateTime": "2013-09-18T20:26:15.000Z",
            "Size": 8
        }
    ]
}

You can query the resultant output using the --query option.

$ aws ec2 describe-instances --instance-ids i-0787e4282810ef9cf --query 'Reservations[0].Instances[0].PublicIpAddress'
"54.183.22.255"

Examples:

The following examples show the interface in action performing various tasks and demonstrate how powerful it can be.

# deleting an s3 bucket
aws s3 rb s3://bucket-name --force
# start ec2 instances
aws ec2 start-instances --instance-ids i-34hj23ie
Miscellaneous
  • Try the aws-shell to get a more interactive command line experience.
  • Use jq to parse the json outputs from various cli commands.
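
For example, the volume IDs from the describe-volumes output shown earlier can be extracted with a short jq filter (a sketch; requires jq to be installed):

$ aws ec2 describe-volumes --output json | jq -r '.Volumes[].VolumeId'
vol-e11a5288
vol-2e410a47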

Cheat sheet

AWS CLI
  1. Configuring your AWS CLI
$ aws configure --profile <profile>
CloudFormation
  1. Deploying your stack using the AWS CLI via CloudFormation
$ aws --profile <profile> cloudformation create-stack --stack-name <stack> [--template-body <template>] [--parameters <parameters>]

Note

Local files need to be prefixed with file://

  2. Verify and check stack deployment using the AWS CLI
$ aws --profile <profile> cloudformation describe-stacks [--stack-name <stack>]
  3. List resources of a stack using the AWS CLI
$ aws --profile <profile> cloudformation list-stack-resources --stack-name <stack>
  4. Validate your CloudFormation template using the AWS CLI
$ aws --profile <profile> cloudformation validate-template --template-body <template>
  5. Update your stack using the AWS CLI
$ aws --profile <profile> cloudformation update-stack --stack-name <stack> [--template-body <template>] [--parameters <parameters>]
EC2 Instance
  1. Connecting to the deployed EC2 instance via ssh
$ ssh -i <key.pem> user@<publicip>

To obtain the PublicIpAddress of your instance:

$ aws ec2 describe-instances --instance-ids i-0787e4282810ef9cf --query 'Reservations[0].Instances[0].PublicIpAddress'

Note

The key file (key.pem) must be readable only by the user, e.g. chmod 400 key.pem

  1. Key - The key specified must be at the path indicated. It must be the private key. Permissions on the key must be restricted to the owner and the key must be associated with the instance.
  2. User - The user name must match the default user name associated with the AMI you used to launch the instance. For an Ubuntu AMI, this is ubuntu. For an Amazon Linux AMI, it is ec2-user.
  3. Instance - The public IP address or DNS name of the instance. Verify that the address is public and that port 22 is open to your local machine on the instance’s security group.
S3 bucket
  1. Copy an object from an S3 bucket to EC2 instance or local machine
$ aws s3 cp s3://my_bucket/my_folder/my_file.ext my_copied_file.ext
  2. Copy an object from an EC2 instance or local machine to an S3 bucket
$ aws s3 cp my_copied_file.ext s3://my_bucket/my_folder/my_file.ext
  3. Download an entire Amazon S3 bucket to a local directory on your instance
$ aws s3 sync s3://remote_S3_bucket local_directory
Exercise
  1. Configure your AWS CLI
  2. Run the cloudformation describe-stacks and ec2 describe-instances commands to look at existing stacks or instances
  3. Try to --query the output or display the --output in different formats
  4. Combine the EC2 and S3 templates to create one template that launches both an EC2 instance and an S3 bucket
  5. Validate the template using the cloudformation validate-template command
  6. Update the ImageId to “ami-08f569078da6ad4c2” and run the cloudformation update-stack command
  7. Connect to the EC2 instance using the pem file
  8. Copy the contents of this S3 bucket (s3://buaws-training-shared/test_reads.fastq.gz) to the instance
  9. Delete the stack using the cloudformation delete-stack command

Cloud App Deployment Workshop

This is day 2 of the “Bioinformatics in the Cloud” workshop. In this session, you will learn about containerization software, how to execute docker containers on an AWS EC2 instance, and how to package your own applications into docker images.

Prerequisites

docker

This workshop assumes you have an environment where docker is installed. If you followed workshop 1, the EC2 instance you deployed already has docker installed and configured. If not, you may follow this setup guide to use docker on your own resources.

Creating a CloudFormation Stack

You may use the following template and parameters to create a CloudFormation stack with an EC2 instance that has docker pre-installed:

You may download these files as-is to create your AWS stack; just be sure to change the stack name to something unique:

$ aws configure
AWS Access Key ID [****************QNQG]:
AWS Secret Access Key [****************z5bv]:
Default region name [us-east-1]:
Default output format [json]:
$ aws cloudformation create-stack --template-body file://main.yaml \
  --parameters file://buaws-training-ec2-parameters.json \
  --stack-name ec2-stack-studentXX
{
    "StackId": "arn:aws:cloudformation:us-east-1:438027732470:stack/test-stack-AL/18c...."
}

When your stack creation is complete, you should ssh to the instance using the appropriate private key:

$ ssh -i buawsawstrainec2.pem ec2-user@<IP from cloudformation output>
The authenticity of host 'XX.XXX.XX.XX (XX.XXX.XX.XX)' can't be established.
RSA key fingerprint is 5d:e6:c4:f6:35:a5:9e:85:66:a4:b3:af:56:86:20:93.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'XX.XXX.XX.XX' (RSA) to the list of known hosts.
Last login: Mon Jul 23 16:27:27 2018 from nowhere

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2018.03-release-notes/
2 package(s) needed for security, out of 4 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-172-31-19-57 ~]$

Containerization

Motivation

Science today faces a reproducibility crisis. Key findings published across scientific disciplines are not corroborated when other scientists test them independently. A survey conducted by Nature asked scientists which factors they thought contributed most to the crisis; over 80% of respondents felt that ‘Methods, code unavailable’ contributed to irreproducible research.

For many scientists, software and analysis have become an indispensable and increasingly unavoidable component of their research. Critical findings now arise from the analysis of data using tools developed in house as well as tools published by others. These components are usually integrated by custom ‘glue code’ that connects them.

This environment poses a new set of challenges to scientists who use computational methods in their research:

  • How do we write analysis code that is robust and reproducible?
  • How can we concisely communicate our code with other researchers?
  • How do we share analysis code with other researchers in a form that can be easily executed?

As computational analyses and tools become more complex, so do the environments needed to execute them. Modern software packages often require hundreds of supporting software packages, provided either by a particular operating system or by third parties. Further, each of these dependencies has a specific version, or set of versions, that is needed for the package to run. The author of a package could in principle record all of these packages and their dependencies and provide the list with their software distribution, but maintaining such a list and ensuring cross-platform compatibility is a major challenge. Environment management software such as miniconda is available to address this challenge, but it introduces additional complexity: the environment manager is itself another software dependency, package availability depends largely on community support, and third-party packages may not be supported across all platforms. A superior solution for managing and deploying complex software environments is to create containerized applications.

Containerization

Containerization, also known as operating-system-level virtualization, is a technology that enables the encapsulation and execution of sets of software and their dependencies in a platform-agnostic manner. A software container is a file that has been built by specific containerization software, e.g. docker or singularity, to contain all of the necessary software and instructions to run.

What is a container?

Generally speaking, a container is a file that specifies a collection of software that can run in a particular execution environment. The execution environment is provided by the containerization software, e.g. docker, such that the container doesn’t have to be aware of the particular machine it is running on. This means that a container will be portable to any environment where the containerization software can run, thus eliminating the need for software authors (i.e. us) to worry about whether or not our code will run on any given hardware/OS/etc.

At the time of writing (July 2018), docker is by far the most popular containerization software. docker has been open source since its release in 2013 and an enormous docker community has grown since. Due to its popularity, this workshop will use docker exclusively as the vehicle for demonstrating containerization of custom applications.

A more recent containerization tool called singularity addresses some of the usability shortcomings of docker. If docker is not available on your computational resources due to security concerns, then singularity may be an option. The containerization concepts are identical between docker and singularity, and all of the content of this workshop is easily adaptable from docker to singularity.

Introduction to docker

docker

docker is an open source software project supported and provided for free by Docker Inc. The software is available for Mac OS, Windows, and Linux operating systems. From its initial open source announcement in 2013, docker is:

a LinuX Container (LXC) technology augmented with a high level API providing a lightweight virtualization solution that runs Unix processes in isolation. It provides a way to automate software deployment in a secure and repeatable environment.

(emphasis added). docker containers are:

  • automated because every docker container contains all of its own configuration and is run with the same executable interface, and thus can be started automatically without manual intervention
  • secure because each runs in its own environment isolated from the host and other containers
  • repeatable because the container behavior is guaranteed to be the same on any system that runs the docker software

These three properties make docker an excellent solution to the problems faced by scientists who wish to write reproducible analysis and applications.

docker concepts

There are four critical concepts needed to get started as a docker user:

images

A docker image is a description of a software environment and its configuration. The concept of an image is abstract: images are not run directly. Instead, images are used to instantiate containers, which are runnable. For those familiar with object oriented programming, an image is to a container as a class is to an object.

docker images are usually created, or built, with a Dockerfile. Images are often created using other images as a base and adding more application-specific configuration and software. For example, a common base image contains a standard ubuntu installation upon which other software is installed. While it is possible to build an image interactively without writing a Dockerfile, this practice is highly discouraged due to its irreproducibility.

Images can either be stored locally or in a public or private Image Registry. In any case, in order to create a container based off of an image, the image must be present in the local docker installation. When building an image locally, the image is automatically added to the local registry. When using an image published on a public registry like Docker Hub, the image is first pulled to the local installation and then used to create a container.

Most docker images have a version associated with them. This enables the image to change over time while maintaining backwards compatibility and reproducibility. The image version is specified at build time.
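
For example, a specific version is selected by appending a tag to the image name when pulling or building; omitting the tag defaults to latest:

$ docker pull python:3.6     # pull the image tagged 3.6
$ docker pull python         # no tag given, equivalent to python:latest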

container

A container is an instance created from an image. You can think of a container as all of the software described by the image bundled together in a form that can be run. Each container is created from a single image.

By default, containers lack the permissions to communicate with the world outside its immediate docker execution environment. When a container is run, the user can specify locations on the host system that are exposed to the docker container by binding files and directories explicitly. The container can only read and write data to locations it is given permission to access. Containers that run services, like web servers, can also be granted access to certain ports on the host system at run time to allow communication outside of the host. In general, a docker container can only be granted access to the resources available to the user running the container (e.g. a normal user without elevated privileges cannot bind to reserved ports 0-1024 on linux).

Dockerfiles

A Dockerfile is a text file that contains the instructions for building an image. It is the preferred method for building docker images, over creating them interactively.

Dockerfiles are organized into sections that specify different aspects of an image. The following is a simple Dockerfile from the docs:

# Use an official Python runtime as a parent image
# This implicitly looks for and pulls the docker image named 'python'
# annotated with version '2.7-slim' from Docker Hub (if it was not already
# pulled locally)
FROM python:2.7-slim

# Set the working directory to /app inside the container
# The /app directory is created implicitly inside the container
WORKDIR /app

# Copy the current (host) directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
# The file requirements.txt was copied into /app during the ADD step above
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Make port 80 available to the world outside this container
# This implies that app.py runs a web server on port 80
EXPOSE 80

# Define environment variable $NAME
ENV NAME World

# Run app.py when the container launches
CMD ["python", "app.py"]

The commands in all capital letters at the beginning of the line are Dockerfile commands that perform different configuration operations on the image.

Image Registry

Image registries are servers that store and host docker images. The software to run a Docker Registry is freely available, but Docker Hub is by far the most popular public registry. Docker images for your own apps can be freely published to and listed on Docker Hub for others to pull and use. Other free registries exist, including Amazon Elastic Container Registry and Google Cloud Container Registry.

Exercise

Navigate to Docker Hub and locate the python repository. Explore the page until you find the Dockerfile for python version 3.7-stretch and view it. What parent image was used to build the python:3.7-stretch image?

Locate the parent image on Docker Hub and examine its Dockerfile. What parent image was used to build this image?

Continue looking up the parent images of each Dockerfile you find until you reach the root image. What is its name?

Running docker

Nota Bene

You must be using a computer with docker installed to complete the exercises on this page. If you are attending the BU workshop, refer to the page on connecting to your EC2 instance for instructions on how to SSH into your instance.

Your First Docker Container

Containers are run using the command:

$ docker run <image name>[:<tag>]

The <image name> must be a recognized docker image name either on the local machine or on Docker Hub. The optional :<tag> specifies a particular version of the image to run.

Exercise

Run a container for the hello-world docker image hosted on Docker Hub.

If you need help, try running docker and docker run without any arguments to see usage information.

Read the text output by the container after it has been run.

Pulling docker images

As part of running a container from a public docker image, the image itself is pulled and stored locally. This only occurs once for each version of an image; subsequently run containers will use the local copy of the image.

If you have never run any docker containers in this environment before, there should be no local images listed by the docker images command:

$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
$

To verify that the hello-world image has been pulled, we again use the docker images command after running the container:

$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
hello-world         latest              2cb0d9787c4d        2 weeks ago         1.85kB
$

This output tells us that we have the latest version of the hello-world image in our local registry.

We can pull images explicitly, rather than doing so implicitly with a docker run call, using the docker pull command:

$ docker pull nginx

This may be useful if we do not want to run a container immediately, or want to perform our own modifications to the image locally prior to running.

Exercise

Pull the nginx image using the docker pull command. Verify that the latest image of nginx has been pulled using docker images.

Managing docker containers
Running detached containers

The hello-world container runs, prints its message, and then exits. If we were running a docker container that provided a service, we would want the container to persist running until we chose to shut it down. An example of this is the nginx web server, which we can run with the command:

$ docker run -d -p 8080:80 nginx

Here, the -d flag tells docker to keep the container running and return control to the command line when it is finished setting up the container. The -p 8080:80 means forward port 80, the default port for HTTP traffic, on the container to the unrestricted port 8080 on the local machine. When control has returned to the command line, we can verify that the container is still running using the docker ps command:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                  NAMES
49af27e82231        nginx               "nginx -g 'daemon of…"   4 minutes ago       Up 4 minutes        0.0.0.0:8080->80/tcp   elastic_mcnulty
$

Exercise

Run an nginx container as above. Verify that the container is running with docker ps.

If specified correctly, the local port 8080 should behave as if it is a web server. Verify that this is the case by running:

$ curl localhost:8080

Attaching data volumes to containers

Scientific analyses almost always utilize some form of data. Docker containers are intended to execute code, and are not designed to house data. Directories and data volumes that exist on the host machine can be mounted in the container at run time to enable the container to read and write data to the host:

$ docker run -d -p 8080:80 --mount type=bind,source="$PWD"/data,target=/data nginx

The directory named data in the current host directory will be mounted as /data in the root directory of the container.

Stopping running containers

When a docker container has been run in a detached state, it runs until it is stopped or encounters an error. To stop a running container, we need either the CONTAINER ID or NAMES attribute of the running container from docker ps:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                  NAMES
49af27e82231        nginx               "nginx -g 'daemon of…"   4 minutes ago       Up 4 minutes        0.0.0.0:8080->80/tcp   elastic_mcnulty
$ docker stop 49af27e82231 # could also have provided elastic_mcnulty
49af27e82231
$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
$

Stopping a container sends a signal telling it to shut down gracefully. A stopped container no longer runs, but its record remains on the machine (see the note below), and it can be restarted with docker start if needed.

Nota Bene

Docker maintains a record of all containers that have been run on a machine. After they have been stopped, docker ps does not show them, but the containers still exist. To see a list of all containers that have been run, use docker ps -a.

It is good practice to remove old containers if they are no longer needed. You can do this with the command docker container prune.
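
For example:

$ docker ps -a               # list all containers, including stopped ones
$ docker container prune     # remove all stopped containers (asks for confirmation)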

Creating docker images
Building a custom image

Chances are there is not an existing docker image that does exactly what you want (but check first!). To create your own image, you must write a Dockerfile. As an example, we will create an image that has the python package scipy installed for us to use. It is common convention to create a new directory named for the image you wish to create and to place a text file named Dockerfile in it. In the scipy directory, our Dockerfile contains:

# pull a current version of python3
FROM python:3.6

# install scipy with pip
RUN pip install scipy

# when the container is run, put us directly into a python3 interpreter
CMD ["python3"]

To build this docker images, we use the docker build command from within the scipy directory containing the Dockerfile:

$ docker build --tag scipy:latest .
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM python:3.6
 ---> 638817465c7d
Step 2/3 : RUN pip install scipy
 ---> Running in 1eef65d3b6fd
Collecting scipy
  Downloading https://files.pythonhosted.org/...
Collecting numpy>=1.8.2 (from scipy)
  Downloading https://files.pythonhosted.org/...
Installing collected packages: numpy, scipy
Successfully installed numpy-1.15.0 scipy-1.1.0
Removing intermediate container 1eef65d3b6fd
 ---> 7f34e9147bef
Step 3/3 : CMD ["python3"]
 ---> Running in 5c9d778426e6
Removing intermediate container 5c9d778426e6
 ---> e27603f4ffaf
Successfully built e27603f4ffaf
Successfully tagged scipy:latest
$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
scipy               latest              e27603f4ffaf        About a minute ago   1.15GB
python              3.6                 638817465c7d        25 hours ago         922MB
$

The --tag scipy:latest argument gives our image a name when it is listed in docker images. Notice also that the python:3.6 image has been pulled in the process of building the scipy image.

Now that we have built our image, we can run a container from it and connect to it using docker run with two additional flags:

$ docker run -i -t scipy
Python 3.6.0 (default, Jul 17 2018, 11:04:33)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>>

The -i flag tells docker we want to use the container interactively, and the -t flag connects our current terminal to the container so that we may send and receive information to and from the terminal.

Exercise

Create a new Dockerfile where you will install the most recent version of R. Use ubuntu:bionic as the base image. You may follow these instructions, without using the sudo command.

Hint: Use a different RUN line for each command.

Solution

Passing containers CLI arguments

The CMD Dockerfile command specifies a standalone executable to run when a container starts. However, sometimes it is convenient to pass command line arguments to a container, for example to run an analysis pipeline on different files, or on files whose names are not known at build time. For instance, we might want to run the following:

$ docker run python process_fastq.py some_reads.fastq.gz

The CMD command does not allow command line arguments to be passed to the run command. Instead, the ENTRYPOINT command is used to prefix a set of commands to any command line arguments passed to docker:

FROM python:3.6

# we will mount the current working directory to /cwd when the container is run
WORKDIR /cwd

RUN pip install pysam

# ENTRYPOINT instead of CMD
ENTRYPOINT ["python3"]

Any command line arguments passed to docker will be appended to the command(s) specified in the ENTRYPOINT.

If a container is intended to run files that exist on the host, the docker run command must also be supplied with a mount point so the container can access the files. In the example above, the WORKDIR is specified as /cwd, so we bind the current working directory of the host to /cwd in the container, giving it access to the files process_fastq.py and some_reads.fastq in the current directory:

$ docker run --mount type=bind,source=$PWD,target=/cwd <image name> process_fastq.py some_reads.fastq

Packaging your own application

Workflow Overview

The simplest workflow for building a docker container with your own code usually follows these steps:

  1. Identify an appropriate image
  2. Identify additional dependencies needed for your application
  3. Install those dependencies with the appropriate RUN commands
  4. Add your code to the image, either with ADD or git
  5. Specify an appropriate CMD or ENTRYPOINT specification
  6. Build your image, repeating 2-4 if needed until success
  7. Run a container of your image, test behavior
  8. Iterate, if needed
Preparing docker image for your code
Choosing a base image

The first step in creating a docker container is choosing an appropriate base image. In general, picking the most specific image that meets your requirements is desirable. For example, if you are packaging a python app, it is likely advantageous to choose a python base image with the appropriate python version rather than pulling an ubuntu base image and installing python using RUN commands.

Installing dependencies

Once a base image is chosen, any additional dependencies need to be installed. For Debian-based images, the apt package manager is used to manage additional packages. For Fedora-based images, the yum package manager is used. Be sure to check which linux distribution a more specific image is based on to know which package manager to use.

Annoyance Alert

In practice, it can be hard to know all of the additional system packages that need to be installed. Often, building an image to completion and running it to identify errors is the most expedient way to create an image.

Occasionally, a software package dependency, or a specific version of one, is not available in the software repositories of the base image's linux distribution. In these cases, it might be necessary to download and install precompiled binaries manually, or to build the package from source. For example, here is a Dockerfile that installs a specific version of samtools from a source release available on github:

FROM ubuntu:bionic

RUN apt update

# need these packages to download and build samtools:
# https://github.com/samtools/samtools/blob/1.9/INSTALL
RUN apt install -y wget gcc libz-dev ncurses-dev libbz2-dev liblzma-dev \
    libcurl3-dev libcrypto++-dev make
RUN wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 && \
    tar jxf samtools-1.9.tar.bz2 && \
    cd samtools-1.9 && ./configure && make install

CMD ["samtools"]
Putting your code into a docker image

Once your dependencies are installed, the final step is to move your own code into the image. There are two primary strategies for doing so:

  • Copy source files into the image using the ADD command in the Dockerfile
  • Clone a git repository into the image from a publicly hosted repo like github or bitbucket

Nota Bene

In any case, it is a good idea to use git or another source code versioning system to develop your code, hosted publicly if possible. Your Dockerfile should be developed and tracked along with your code, so that both can evolve over time while maintaining reproducibility.

Locally

The local strategy is convenient when developing software. Running development code in a docker container ensures your testing and debugging environment is consistent with the execution environment where your code will ultimately run. To build from a local source tree:

  1. Create a Dockerfile in the root directory where your code resides
  2. Prepare the Dockerfile for your code as in Preparing docker image for your code
  3. Copy all of the source files into a directory (e.g. /app) in the container with ADD . /app
  4. Perform any setup that comes bundled with your package source (e.g. pip install -r requirements.txt or python setup.py) with the RUN command
  5. Set the CMD entry point appropriately for your app
  6. Build your image with an appropriate tag
  7. Run and test your application, ideally with unit tests

Assuming we have written a python application named app.py, from within the source code directory containing the application we could write the following Dockerfile:

# Use an official Python runtime as a parent image
FROM python:2.7-slim

# Copy the current (host) directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# mount the current working directory to /cwd when the container is run
WORKDIR /cwd

# Run app.py when the container launches
ENTRYPOINT ["python", "app.py"]

When a container is run, app.py will be run directly and passed any additional arguments specified to the docker run command.

Cloning from github/bitbucket

For software projects hosted on github or bitbucket, or when it is not desirable to include a Dockerfile along with your application source code, the Dockerfile can be set to clone and install a git repo instead of adding code locally. Instead of using the ADD command from above, use RUN git clone <repo url>:

FROM python:3.6

# have to install git to clone
RUN apt-get update && apt-get install -y git

# git clone repo instead of ADD
RUN git clone https://bitbucket.org/bubioinformaticshub/docker_test_app /app
RUN pip install --trusted-host pypi.python.org -r /app/requirements.txt

# mount the current working directory to /cwd when the container is run
WORKDIR /cwd

# use ENTRYPOINT so we can pass files on the command line
ENTRYPOINT ["python", "/app/app.py"]

Cloning a public repo into a Docker container in this way has the advantage that the environment where you write your code can be the same or different than the platform where the code is run.

There is one additional caveat to this method of adding code to your image. To save on build time, docker caches the sequential steps in your Dockerfile when building an image, and only reruns the steps from the point where a change has been made. The ADD command automatically detects when local files have changed and re-copies them into the container on docker build. Cloning a repo from bitbucket or github, however, does not re-trigger a rebuild of that step when the repo changes. When cloning your application from a public git repo, the --no-cache flag must be provided to your docker build command:

$ docker build --no-cache --tag app:latest .

This invalidates all build cache and re-clones your repo on each build.

Running your docker container

Once your code has been loaded into an image, containers for your image can be run in the normal way with docker run. Any host directories containing files needed for the analysis must be mounted:

$ docker run --mount type=bind,source=/data,target=/data \
    --mount type=bind,source=$PWD,target=/cwd app \
    --in=/data/some_data.txt --out=/data/some_data_output.csv

Remember that any time your code changes you will need to rebuild your image, including --no-cache if you pull your code from a git repo.

Publishing your docker image

Once your docker image is complete and your app is ready to share, you can create a free account on Docker Hub and upload your image. Be sure to provide a full description of what the image does, what software it contains, and how to run it, specifying any directories the container expects to have mounted in order to access data (e.g. /data). Alternatively, you might host your image on the Amazon Elastic Container Registry or the Google Container Registry. If your app will primarily be executed on AWS or Google Cloud, it may be preferable to publish your image to the corresponding registry.
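As a minimal sketch, assuming a Docker Hub account with the hypothetical username yourusername and the app:latest image built above:

$ docker login
$ docker tag app:latest yourusername/app:latest
$ docker push yourusername/app:latest

Once pushed, others can fetch the image with docker pull yourusername/app:latest.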

Hands On Exercise
Writing the Dockerfile

Write, build, and run a Dockerfile that:

  1. Uses the python:3.6 base image
  2. Installs git with apt
  3. Clones the repo docker_test_app
  4. Installs the dependencies using the requirements.txt file in the repo
  5. Configures the ENTRYPOINT to run the script in the repo with python3
Running the container with data from an S3 bucket

Nota Bene

When you run this app, you should specify the -t flag to your docker run command.

Try running the container using docker run with no arguments to see the usage.

A fastq file that can be passed to this script has been made available on a shared S3 bucket. You will download this file to your local instance using the AWS CLI. First, run aws configure and provide your access keys, specifying us-east-1 as the region. The bucket address of the file is:

s3://buaws-training-shared/test_reads.fastq.gz

Download the file using the AWS CLI and pass it to the app using docker run. You must mount the directory where you downloaded the fastq file using the --mount command line option as above.
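Putting the steps together, a minimal sketch of the workflow (test_app:latest is a hypothetical tag for the image you built above, and the exact arguments to the app depend on the usage message you saw earlier):

$ aws configure      # enter your access keys and specify us-east-1 as the region
$ aws s3 cp s3://buaws-training-shared/test_reads.fastq.gz .
$ docker run -t --mount type=bind,source=$PWD,target=/cwd \
    test_app:latest test_reads.fastq.gz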

FireCloud Workshop

This is day 3 of the “Bioinformatics in the Cloud” workshop. In this session, you will learn about the FireCloud platform: how to run workflows, upload data, and create methods.

Workshop Outline:

  • Introduction to FireCloud from Broad pipeline outreach coordinator Kate Noblett (~10min)
  • FireCloud Intro Presentation (~15min)
  • FireCloud Guided Tour (~25min)
  • FireCloud $5 Pipeline Hands-On (~15min)
  • Break (~5min)
  • FireCloud Custom Data and Method Hands-On (~40min)
  • Explore FireCloud (Rest of Workshop Time)

Prerequisites

Note

The participants are required to have access to the following resources before attending the workshop

  • FireCloud account
    Credits to run workflows ($300 in free credits on sign up)
  • portal.firecloud.org
    Make an account and connect it to a Google account

Five dollar genome analysis pipeline

Clone and run a featured workspace

Open up portal.firecloud.org

Find the pipeline
  1. Navigate to the workspace tab
  2. Navigate to “Featured Workspaces”
  3. Click on “five-dollar-genome-analysis-pipeline”
_images/workspace.png
Clone the workspace
  1. Append your name to the workspace name (to make it unique)
  2. Clone the workspace
_images/clone_name.png
Find the workflow
  1. Navigate to the Method Configurations tab
_images/navigate_method.png
Select the pipeline
  1. Select the five-dollar-genome-analysis-pipeline
_images/method.png
Launch the analysis
  1. Launch the analysis
_images/launch_analysis.png
Select the sample
  1. Select either sample_id to run the analysis on
  2. Launch
_images/choose_data_launch.png
Monitor the pipeline
  1. Monitor the submitted job, well done!
_images/submitted.png

Upload data and run a custom method

Setup

Please download the materials for this section FireCloud Files

Create a workspace in FireCloud
  1. Workspaces > Create a new workspace
    1. name: hello_gatk_fc_YOUR_NAME
    2. billing project: YOUR_PROJECT
Add workspace attributes
  1. Workspaces > Summary > Workspace attributes > Import Attributes
    1. data_bundle > FireCloud > workspaceAttributes.tsv
  2. When it is uploaded, look at the workspace attributes section to see if the upload was successful
Set up data model
  1. Workspaces > Data > Import Metadata > Import from file
    1. Upload in this order:
      1. data_bundle > FireCloud > participant.txt
      2. data_bundle > FireCloud > sample.txt
  2. Once uploaded, check the two tables in the data tab to confirm the upload was successful.
Put WDL on FireCloud
  1. Method Repository > Create New Method
    1. namespace: YOUR_NAME
    2. name: hello_gatk_fc
    3. wdl: load from file
      1. This WDL calls HaplotypeCaller in GVCF mode, which takes a BAM input and outputs a GVCF file of variant likelihoods.
      2. The FireCloud version also specifies a docker image and other runtime settings: the memory and disk size of the machine we will request from Google’s cloud, and the number of times we will retry on a preemptible machine.
      3. Notice that you can type in the WDL field to edit if needed.
    4. documentation: We won’t be filling this out today, but in general documentation here is highly recommended, as it is helpful for others who may want to run your method.
    5. Upload
Import configuration to workspace
  1. Method Repository > your method > Export to Workspace
    1. Use Blank Configuration
      1. Name: hello_gatk_fc
      2. Root Entity Type: sample
      3. Destination Workspace: YOUR_PROJECT/hello_gatk_fc_YOUR_NAME
    2. Would you like to go to the edit page now? Yes

    3. Note

      If you get a “Synchronize Access to Method” popup, grant Read permission.

Fill in method config
  1. Workspace > Method Configurations > hello_gatk_fc
  2. Select the Edit Configuration button to fill it in. There are 3 types of inputs.
    1. In the data model
      1. You’ll find this value in your data tab. Since it is under the sample section, and your root entity type is sample, simply type this. and allow autocomplete to guide you.
      2. eg: inputBam = this.inputBam
    2. In the workspace attributes
      1. You’ll find this value in your workspace attributes section under the summary tab. To find it, type in workspace. and let autocomplete guide you.
      2. eg: refDict = workspace.refDict
    3. Hard-coded
      1. These are values which are not in your data model or workspace attributes. They are fixed numbers or strings that are typed in here. You can find the values for these inputs in the inputs json in your data bundle (data_bundle > hello_gatk > hello_gatk_fc.inputs.json)
      2. eg: disk_size = 10
      3. eg: java_opt = "-Xmx2G"
  3. Fill in the remaining inputs on your own, or with help from your neighbors.
  4. Fill out the output. It won’t auto-complete, but we want to write it to the data model. The value should be this.output_gvcf
  5. Save the configuration
Run
  1. Refresh the page and check for the yellow refresh-credentials banner BEFORE running. This isn’t typically an issue in normal use, but because a workshop involves a lot of starting and stopping, the idle time can cause your credentials to time out. If you launch with expired credentials, a Rawls error is thrown that doesn’t appear until after the job has been submitted and queued, which can be frustrating.
  2. Method Config > Launch Analysis > Select sample > Launch
  3. Watch & refresh from the monitor tab. Click the view link when it appears, and open the timing diagram to see what’s happening.
