Amazon AWS Cloud EC2 bootstrap with userdata cloudinit github puppet

Cloud infrastructure lets devs and ops be agile, nimble, and efficient while keeping steps, procedures, and processes repeatable. One very important piece of getting there is automation, and it starts the moment a new VM instance is launched, by way of userdata. I’ve talked about this before in a previous post: “Let userdata tell Puppet how to configure your cloud VMs”. This is a follow-up post describing a better, more flexible way to automatically turn a VM running a vanilla operating system into a fully functional server running your applications, configured exactly the way you need them, without any human intervention.

Enter CloudInit

Automate all the things

CloudInit is a Linux package that handles early initialization of a cloud instance. It was originally developed for Ubuntu, but it has since been adopted by Red Hat as well and is available in RHEL 6. Cloud-init data is text submitted as userdata with the request to launch an instance. When the instance boots (with the cloud-init package already installed), the userdata is retrieved, and if it is in a cloud-init format the cloud-init package on the instance processes it.

CloudInit is supported on any cloud that has the concept of userdata that can be specified at instance launch. Amazon originated the idea, and AWS EC2 instances support it. Other cloud platforms support it as well, including OpenStack and any public cloud offering built on OpenStack, such as Rackspace.
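For reference, here is what handing that userdata to a new instance can look like using the awscli (a sketch; the AMI id, key name, instance profile name, and userdata filename are placeholders for your own):

# Sketch: launch an instance and pass the cloud-init userdata file at launch time
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type m1.small \
  --key-name uberapp-key \
  --iam-instance-profile Name=uberapp-bootstrap-role \
  --user-data file://userdata-multipart.txt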

CloudInit provides hooks at many points of the OS boot process, giving you the power to do just about anything you can dream up for configuration, or for pre-configuration preparation with another tool like Puppet. Below is an example of multipart cloud-init userdata combining a ‘cloud-config’ part and a ‘user-script’ (user-data script) part:

Content-Type: multipart/mixed; boundary="===============0035287898381899620=="
MIME-Version: 1.0

--===============0035287898381899620==
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
# Cloud-Init Hints:
# * Some default settings are in /etc/cloud/cloud.cfg
# * Some examples at: http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/files/head:/doc/examples/
# * CloudInit Module sourcecode at: http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/files/head:/cloudinit/config/

preserve_hostname: true
manage_etc_hosts: false

# dynamically set hostname using the instance's instanceid
bootcmd:
 - cloud-init-per instance my_set_hostname sh -xc "echo uberweb-$INSTANCE_ID > /etc/hostname; hostname -F /etc/hostname"
 - cloud-init-per instance my_etc_hosts sh -xc "sed -i -e '/^127.0.1.1/d' /etc/hosts; echo 127.0.1.1 uberweb-$INSTANCE_ID.uberapp.com uberweb-$INSTANCE_ID >> /etc/hosts"

# Add apt repositories
apt_sources:
 # Enable "multiverse" repos
 - source: deb $MIRROR $RELEASE multiverse
 - source: deb-src $MIRROR $RELEASE multiverse
 - source: deb $MIRROR $RELEASE-updates multiverse
 - source: deb-src $MIRROR $RELEASE-updates multiverse
 - source: deb http://security.ubuntu.com/ubuntu $RELEASE-security multiverse
 - source: deb-src http://security.ubuntu.com/ubuntu $RELEASE-security multiverse
 # Enable "partner" repos
 - source: deb http://archive.canonical.com/ubuntu $RELEASE partner
 - source: deb-src http://archive.canonical.com/ubuntu $RELEASE partner
 # Enable PuppetLabs repos (for latest version of puppet)
 - source: deb http://apt.puppetlabs.com $RELEASE main dependencies
   keyid: 4BD6EC30    # GPG key ID published on a key server
   filename: puppetlabs.list
 - source: deb-src http://apt.puppetlabs.com $RELEASE main dependencies
   keyid: 4BD6EC30    # GPG key ID published on a key server
   filename: puppetlabs.list

# Run 'apt-get update' on first boot
apt_update: true

# Run 'apt-get upgrade' on first boot
apt_upgrade: true

# Reboot after package install/upgrade if needed (e.g. if kernel update)
apt_reboot_if_required: true

# Install additional packages on first boot
packages:
 - wget
 - git
 - puppet
 - rubygems   # Used to install librarian-puppet
 - python-pip # Used to install awscli

# run commands
# runcmd contains a list of either lists or a string
# each item will be executed in order
runcmd:
 # Install AWS CLI
 - pip install awscli
 # Install librarian-puppet for retrieving dependent puppet modules from github
 - gem install librarian-puppet
 # Get the github ssh key out of s3 (IAM providing access to S3)
 # so we can clone from our private github repos
 - aws --region us-east-1 s3 cp s3://uberappaccount.bootstrap-bucket/github-rsa-key /root/.ssh/id_rsa && chmod 600 /root/.ssh/id_rsa
 # Add github.com to known_hosts
 - ssh -T -oStrictHostKeyChecking=no git@github.com

# set the locale
locale: en_US.UTF-8

# timezone: set the timezone for this instance (ALWAYS use UTC!)
timezone: UTC

# Log all cloud-init process output (info & errors) to a logfile
output: {all: ">> /var/log/cloud-init-output.log"}

# final_message written to log when cloud-init processes are finished
final_message: "System boot (via cloud-init) is COMPLETE, after $UPTIME seconds. Finished at $TIMESTAMP"

--===============0035287898381899620==
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="user-script.txt"

#!/bin/bash
# Clone our puppet repo from private github using the deploy key installed above
# (git clone requires /etc/puppet to be empty or absent)
git clone -b master git@github.com:uberco/puppet-uberapp.git /etc/puppet/
# Run librarian-puppet
cd /etc/puppet && HOME=/root librarian-puppet install
# Get hiera data files for puppet
mkdir -p /etc/puppet/hieradata && aws --region us-east-1 s3 cp s3://uberappaccount.bootstrap-bucket/hieradata/ /etc/puppet/hieradata/ --recursive && chmod 600 /etc/puppet/hieradata/*

# set up custom Puppet Facter facts
mkdir -p /etc/facter/facts.d
cat << 'EOF' > /etc/facter/facts.d/uberapp
#!/usr/bin/env python
facts = {
    "node_type":"uber-web", # 'uber-web' OR uber-producer' OR 'uber-worker'
    "bootstrap_s3path":"uberappaccount.bootstrap-bucket", # no trailing slash
    "uber_queue_high":"uber-hpq",
    "uber_queue_low":"uber-lpq", 
    "uber_dead_letter_queue":"uber-dlq",
    "uber_log_level":"INFO",
    "uber_deploy":"uber-web-1.0.3.war",
    "uber_notification_email":"admin@uberapp.com",
    "lumberjack_files":'{ "apache-logs": { "paths": ["/var/log/apache2/*.log"], "fields": {"type": "apache"} }, "syslog": { "paths": ["/var/log/syslog"], "fields": { "type": "syslog" } } }',
    "lumberjack_deb":"lumberjack_0.1.2_amd64.deb",
    "lumberjack_ensure":"present"
}
for k in facts:
    print "%s=%s" % (k,facts[k])
EOF
chmod +x /etc/facter/facts.d/uberapp

# Run puppet
puppet apply /etc/puppet/manifests/site.pp

--===============0035287898381899620==--
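By the way, you don’t have to build that multipart MIME envelope by hand. The ‘write-mime-multipart’ helper that ships with the cloud-utils package can assemble it for you (a sketch, assuming the two parts are saved as cloud-config.txt and user-script.txt):

# Sketch: combine the cloud-config and user-script parts into one multipart userdata file
apt-get install -y cloud-utils
write-mime-multipart --output=userdata-multipart.txt \
  cloud-config.txt:text/cloud-config \
  user-script.txt:text/x-shellscript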

There is a lot going on in the example cloud-init data above. Let’s cover some of it, though hopefully most of it is self-explanatory (if you are reading this in the first place).

Cloud-config section

The first section, ‘bootcmd’, is used to set the hostname on the machine. By default an instance’s hostname is set to its “instance id”, a hash-like string that isn’t very human friendly, although it is a unique identifier that lets you find the instance in the web console. A very useful trick is to add a prefix to the instance id and use that as the hostname. In the example above I set the hostname to “uberweb-$INSTANCE_ID”, where $INSTANCE_ID is replaced by the instance’s actual instance id. The reason this is so handy is that monitoring tools (like Splunk, Logstash, etc.) that collect log data from an instance usually report the server’s hostname with every event they send to a central aggregation service. When searching through log events in a central aggregation portal you usually want to find certain types of machines quickly, and one way to do that is to search on the hostname prefix. In this case, searching for log events from machines whose hostname starts with “uberweb” quickly finds events for all machines in my web server cluster. I could do the same thing for database servers or “workers” in a batch processing application.
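If you want to see what cloud-init substitutes for $INSTANCE_ID, the value comes from the instance metadata service, so you can check it (and the resulting hostname) from the instance itself; for example:

# Fetch this instance's id from the EC2 metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
echo "instance id: $INSTANCE_ID"
# The hostname set by bootcmd should now look something like "uberweb-i-0a1b2c3d"
hostname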

A couple of straightforward sections: ‘apt_sources’ adds additional software repositories for installing software through package management, and ‘packages’ installs some software packages.

Next, ‘runcmd’ can run arbitrary bash commands. The “Amazon command line interface” (awscli) Python library gets installed, as does librarian-puppet. Here awscli is used to retrieve an ssh key from AWS S3; that key provides access to our private github repositories, from which we will later clone a repo containing the puppet config that finalizes the configuration of this instance (master-less puppet). If you are running on OpenStack you can do the same kind of shared key / config retrieval from Swift object storage by installing and using the python-swiftclient library, as sketched below. The S3 or Swift storage locations can be private so that only authorized machines with the correct credentials can access and download the contained files.
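For the OpenStack case, a rough Swift equivalent of the S3 key retrieval might look like the following (a sketch, assuming the python-swiftclient CLI is installed and your OS_* auth environment variables are set; the container name is a placeholder):

# Sketch: pull the github deploy key from a private Swift container instead of S3
pip install python-swiftclient
swift download bootstrap-container github-rsa-key -o /root/.ssh/id_rsa
chmod 600 /root/.ssh/id_rsa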

Next, set the locale and timezone. Always use UTC no matter where you or your server are located! The ‘output’ section specifies where cloud-init writes its info and error log entries. As soon as your instance is accessible you can log in and ‘tail’ this log file to see what is happening in the cloud-init process. The ‘final_message’ section specifies a final message to write to that log when cloud-init finishes.
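For example, once ssh is up you can follow the bootstrap and wait for the final_message:

# Follow cloud-init's progress; the final_message shows up here when everything is done
tail -f /var/log/cloud-init-output.log
# Or, after the fact, confirm that the boot finished
grep COMPLETE /var/log/cloud-init-output.log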

User-script section

This section just contains bash commands. The first thing we do is clone our ‘puppet-uberapp’ repo from github. This repo contains the puppet configs for finishing the complete installation and configuration of an ‘uberapp’ web application server. We run ‘librarian-puppet’ to pull down any dependent puppet modules from github. Then we copy the Puppet hiera data files out of S3 using the awscli installed earlier. I must mention that awscli is able to access S3 content without any AWS credentials being specified because the instance was launched with an AWS IAM role that grants access to the S3 bucket containing the files we retrieve. I can’t stress enough how useful IAM is and how much it helps in managing security and access controls.
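To give an idea of what that IAM piece looks like, a read-only policy along these lines attached to the instance’s role would cover the S3 retrievals above (a sketch only; the role and policy names are made up and your bucket name will differ):

# Sketch: attach a read-only S3 policy to the instance's IAM role
cat << 'EOF' > bootstrap-s3-read.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::uberappaccount.bootstrap-bucket",
        "arn:aws:s3:::uberappaccount.bootstrap-bucket/*"
      ]
    }
  ]
}
EOF
aws iam put-role-policy --role-name uberapp-bootstrap-role \
  --policy-name bootstrap-s3-read --policy-document file://bootstrap-s3-read.json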

Next, we create an executable Python file that provides custom Puppet Facter facts, which can be used to make configuration decisions or supply custom config values when Puppet runs. These facts can be anything you can think of to help you configure your instance with Puppet. In this example I tell the instance what type of node to become, name the AWS SQS queues to use, specify a Java WAR file to deploy, give the S3 bucket and path Puppet should use to retrieve more files from S3, and set some Lumberjack / Logstash facts for installing and configuring the agent that forwards log data to a Logstash receiver for storage in ElasticSearch.
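If you want to sanity-check the custom facts on a running instance, you can run the fact script directly or ask Facter for one of the values (assuming Facter 1.7+, which reads external facts from /etc/facter/facts.d):

# The external fact script just prints key=value pairs
/etc/facter/facts.d/uberapp
# Facter should report the same values
facter node_type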

Finally, we run Puppet! And… only a few short minutes after instance launch we have a fully configured and running application server that is forwarding log events to a central log aggregation server. I can watch my instance join the rest of my nodes in a tool like Kibana now.

Take it a step further and you can use your cloud-init bootstrapping with AWS Auto Scaling / CloudFormation or OpenStack Heat.
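For example, the same multipart userdata file can be attached to an Auto Scaling launch configuration so that every instance the group launches bootstraps itself the same way (a sketch; the names and AMI id are placeholders):

# Sketch: reuse the multipart userdata in an Auto Scaling launch configuration
aws autoscaling create-launch-configuration \
  --launch-configuration-name uberweb-lc \
  --image-id ami-xxxxxxxx \
  --instance-type m1.small \
  --key-name uberapp-key \
  --iam-instance-profile uberapp-bootstrap-role \
  --user-data file://userdata-multipart.txt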

If anything needs more clarification please let me know. If anyone has better ways of doing things like this, I would like to hear about them.

Go forth and automate!