Let userdata tell Puppet how to configure your cloud VMs

Updated post: Cloud instance bootstrap with userdata, cloudinit, github, and puppet.

Puppet is a tool that automates system administration tasks as well as the installation and configuration of software such as web servers, application servers, databases, and performance monitoring tools.

By default, Puppet uses hostnames to decide which configurations to apply to a machine. Each remote machine managed by Puppet has a small piece of software installed called the Puppet agent. The agent reports to the central Puppet master on startup and, by default, every 30 minutes thereafter. By default the agent sends just its hostname, which is how the Puppet master knows how to configure the remote machine. The Puppet master holds Puppet manifests that map configurations to particular hostnames, and hostname matching supports patterns, so you can apply the same configuration to more than one machine. For example:


webserver1.example.com, webserver2.example.com

could both have the Apache web server installed, with the PHP and Python Apache modules installed and configured.


dbmysql.app1.example.com, dbmysql.app2.example.com

could both have MySQL installed with certain databases and tables created for app1 and app2.
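
To make the hostname matching concrete, here is a minimal Ruby sketch of the idea. It is not how Puppet implements node matching; the `NODE_PATTERNS` table, the role names, and the `roles_for` method are hypothetical and exist only for illustration.

```ruby
# Hypothetical mapping from hostname patterns to configuration roles.
# This only illustrates pattern-based matching; it is not Puppet's
# implementation.
NODE_PATTERNS = {
  /\Awebserver\d+\.example\.com\z/    => ['apache', 'mod_php', 'mod_python'],
  /\Adbmysql\.app\d+\.example\.com\z/ => ['mysql']
}.freeze

# Return the roles for a hostname, or an empty list if nothing matches.
def roles_for(hostname)
  NODE_PATTERNS.each do |pattern, roles|
    return roles if pattern.match?(hostname)
  end
  []
end

puts roles_for('webserver2.example.com').inspect
```

Every machine whose name fits a pattern gets the same roles, which is exactly why the hostname alone cannot express per-instance differences.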

This default hostname-based configuration is limiting, though, because it does not allow for highly customized configurations based on values specified when a VM is started. For example, multiple machines may have the same applications installed but belong to separate environments (development vs. testing vs. production) that need different values in their configuration files.

If you can specify variables for config files, such as IP addresses, hosts file entries, or the location of a load balancer for an auto-scale node, this is where the REAL highly customized power begins. If your machine configurations don’t rely on matching hostnames but are instead applied based on any number of variables you choose to specify when starting a VM, you have a very flexible and agile pattern to build upon.

Enter Facter

The Puppet agent can report lots of detailed information and metadata, known as “facts”, via an add-on called Facter. These facts about the remote machine, along with its hostname, are reported to the Puppet master on each check-in. Facts are basically key=value pairs. By default, a fixed set of facts is collected and submitted, but Facter can be extended to supply any additional information you have access to programmatically. It is this added information supplied via facts that allows highly customized configurations to be made.

Examples of default “facts” collected:

architecture => i386
ipaddress =>
is_virtual => true
kernel => Linux
kernelmajversion => 2.6
operatingsystem => CentOS
operatingsystemrelease => 5.5
physicalprocessorcount => 0
processor0 => Intel(R) Core(TM)2 Duo CPU     P8800  @ 2.66GHz
processorcount => 1

Examples of custom facts:

database => MySQL
webserver => apache2
apache-mods => php,python
cassandra => cluster:dev1,keyspace:kickassapp
cassandra-column-families => [script to create column families]
java-version => 1.6
DNS => dnsserver1,dnsserver2,dnsserver3
hosts => terracotta, someotherhost
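
As a rough illustration of how facts like these can drive configuration decisions instead of the hostname, here is a small Ruby sketch. The `EXAMPLE_FACTS` hash, `fact_list`, and `plan_for` are hypothetical names, and the "plan" strings only stand in for whatever your Puppet manifests would actually do with the fact values.

```ruby
# Hypothetical custom facts, shaped like the examples above.
EXAMPLE_FACTS = {
  'webserver'   => 'apache2',
  'apache-mods' => 'php,python',
  'database'    => 'MySQL',
  'DNS'         => 'dnsserver1,dnsserver2,dnsserver3'
}.freeze

# Split a comma-separated fact value into a clean list.
def fact_list(facts, name)
  facts.fetch(name, '').split(',').map(&:strip).reject(&:empty?)
end

# Decide what to configure from the facts rather than from the hostname.
def plan_for(facts)
  plan = []
  plan << "install #{facts['webserver']}" if facts['webserver']
  fact_list(facts, 'apache-mods').each { |mod| plan << "enable apache module #{mod}" }
  plan << "install #{facts['database']}" if facts['database']
  fact_list(facts, 'DNS').each { |server| plan << "add nameserver #{server}" }
  plan
end

puts plan_for(EXAMPLE_FACTS)
```

Two machines with identical hostnames patterns but different facts would get different plans, which is the flexibility the hostname-only approach lacks.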

To start learning more about Facter and how to use it see: Facter 101.

So… how do you get Puppet/Facter to report these custom facts? When using an Amazon EC2 compatible cloud, the API for starting a VM allows “user-data” to be associated with the VM. This user-data can be any textual data; in our case it will be key=value pairs, one pair per line. Facter needs to be extended with a custom fact parser that reads the user-data for the VM it is running on and submits each user-data key as a fact name and each user-data value as the fact value. Below is the code I use to read my user-data into Facter to generate my custom facts. It is contained in a file named userdata_facts.rb. The REST call that retrieves the user-data goes to the IP address 169.254.169.254. This is a specially routed IP address for VM instances that run within EC2 compatible clouds. To read more about this IP address and the EC2 Query API see the AWS documentation.

# /etc/puppet/modules/ec2/lib/facter/userdata_facts.rb
# @author Erik Paulsson

# This script will take user-data associated with a cloud instance
# and create "facts" out of them for use within Puppet.
# This script assumes that the user-data is supplied as key=value
# pairs, one pair per line.

require 'facter'

# Assumption: the standard EC2 metadata endpoint for user-data.
cmd = '/usr/bin/wget -q -O - http://169.254.169.254/latest/user-data'
result = `#{cmd}`

lines = result.split("\n")

lines.each do |line|
    keyval = line.split('=', 2)
    key = keyval[0]
    val = keyval[1]
    Facter.add(key) do
        setcode { val }
    end
end

# Get the instance-id and set it as a fact.
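# Assumption: the standard EC2 metadata endpoint for the instance ID.
cmd = '/usr/bin/wget -q -O - http://169.254.169.254/latest/meta-data/instance-id'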
result = `#{cmd}`
Facter.add('instance-id') do
    setcode { result }
end
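
The parsing step in the script above can also be pulled out into a plain function, which makes it easy to try without a metadata service. The following is only a sketch: `parse_userdata` is a hypothetical helper, the sample values are made up, and skipping blank lines, `#` comment lines, and lines without an `=` is my assumption rather than something the script above does.

```ruby
# Parse key=value user-data into a hash, one pair per line.
# Blank lines, lines starting with '#', and lines without '=' are skipped
# (an assumption; the script above does not filter them).
def parse_userdata(text)
  text.each_line.each_with_object({}) do |line, facts|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    # Split on the first '=' only, so values may contain '=' themselves.
    key, val = line.split('=', 2)
    next if val.nil? || key.strip.empty?

    facts[key.strip] = val.strip
  end
end

# Made-up sample user-data for illustration.
sample = "puppetmaster_ip=10.0.0.5\nwebserver=apache2\n\njdbc_url=jdbc:mysql://db/app?x=1\nnot a pair\n"
parse_userdata(sample).each { |key, val| puts "#{key} => #{val}" }
```

In a custom fact file, each resulting pair would then be registered with `Facter.add(key)` and `setcode`, as in the script above.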

Even specify which Puppet master to use

One addition that makes this pattern even more flexible is specifying in user-data which Puppet master should configure the VM. I like to do this by specifying a user-data key=value pair of:

puppetmaster_ip=<IP address of your Puppet master>

But… how does the Puppet agent know to use this IP to reach the Puppet master? We need to bundle a small script into our VM image that gets run on boot at a run level before the Puppet agent starts up. By default the Puppet agent tries to contact the Puppet master by using the hostname ‘puppet’. All our script has to do is add a host entry for ‘puppet’ using the IP address we specified with our ‘puppetmaster_ip’ user-data key/value pair.

Here is the script I use for that which I call ‘vminit’:

#!/bin/sh
# This script runs on boot before the puppet agent starts.
# The sole purpose of this script is to add an entry into the hosts file
# for the host "puppet" so that when the puppet agent starts it knows
# where to find the correct puppet master to get its configuration from.

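# Constants
# Assumed values: a hypothetical working directory and the standard
# EC2 metadata endpoint for user-data. Adjust both to your environment.
basedir=/var/lib/vminit
ec2_userdata_url=http://169.254.169.254/latest/user-data
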
mkdir -p $basedir

if [ ! -f $basedir/ec2_userdata ]; then
        wget $ec2_userdata_url -O $basedir/ec2_userdata 2>&1
        # Note, if file does not exist it can't be fetched

        puppetmaster_ip=`grep "puppetmaster_ip" $basedir/ec2_userdata | cut -d '=' -f2`

tee -a /etc/hosts <<EOF
$puppetmaster_ip puppet
EOF
else
        echo "EC2 userdata file already exists, not downloading or initializing."
fi
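
The shell script above appends the entry once and relies on the saved user-data file to avoid running twice. As a hedged alternative sketch in Ruby, the same step can be written as a pure function that replaces any existing "puppet" entry, so it is safe to run repeatedly. The `with_puppet_host` method is hypothetical, the IP addresses are made up, and actually writing the result back to /etc/hosts is left to the caller.

```ruby
# Return new hosts-file content with a "puppet" entry pointing at the
# puppetmaster_ip found in the user-data. Any existing "puppet" entry is
# replaced, so running this more than once does not add duplicates.
def with_puppet_host(hosts_text, userdata_text)
  line = userdata_text.each_line.find { |l| l.start_with?('puppetmaster_ip=') }
  return hosts_text if line.nil?

  ip = line.split('=', 2)[1].strip
  return hosts_text if ip.empty?

  kept = hosts_text.each_line.map(&:chomp).reject do |entry|
    # Ignore trailing comments, then look for "puppet" among the names
    # that follow the address field.
    fields = entry.sub(/#.*/, '').split
    fields.drop(1).include?('puppet')
  end
  (kept + ["#{ip} puppet"]).join("\n") + "\n"
end

# Made-up sample data for illustration.
hosts    = "127.0.0.1 localhost\n10.0.0.1 puppet\n"
userdata = "webserver=apache2\npuppetmaster_ip=10.0.0.5\n"
puts with_puppet_host(hosts, userdata)
```

If the user-data carries no `puppetmaster_ip` pair, the hosts content is returned unchanged and the agent falls back to resolving the hostname ‘puppet’ however the machine normally would.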

I want to stress that this technique will work for any cloud infrastructure that supports the Amazon EC2 API.

This is a high-level overview of how to design for and use Puppet to its full potential when operating in an EC2 compatible cloud environment. Obviously, some of the details needed to glue all this together and make it functional are missing. If there is interest in this topic I can write a follow-up to cover more of those details.