Operations nerds - anyone interested in talking about infrastructure?

rachelandrew · August 8, 2013, 6:44am

Our product isn’t SaaS, however we have a fair bit of infrastructure supporting Perch. For example we serve hosted demos, have the site and associated payment process/license validation, host our own support and so on. I’m currently in the middle of a process of moving everything off a scrappy collection of virtual servers, all managed differently, to a new infrastructure that I am keeping consistent using Puppet.

We’re also doing things like splitting the marketing website from all of the business functions. The site is currently essentially what we launched with over four years ago, which makes it really difficult to do things like A/B testing of landing page conversions, because touching the site is a bit scary when we don’t have a sane deployment method.

I’m something of an old school sysadmin, so I’ve had to learn lots of new tools recently as part of trying to sort our systems out. I’d be happy to share what I’ve learned and would love to know what other people are doing, especially in terms of automation, deployment, monitoring and security.

imsickofmaps · August 8, 2013, 7:10am

o/

We use Chef, not Puppet, but as far as I’m concerned the key thing is infrastructure-as-code, not what the code is.

I don’t use even half of Chef’s power, but its place in our set of tools has been awesome.

We use Vagrant to run the same OS locally as on our servers at Hetzner, and it runs the system’s Chef recipes every time we boot our laptops. Then I use Fabric (a Python-based tool) to run tasks against the servers from inside the VM, like staging bootstrap (which rsyncs the recipes and makes sure the environment is up to date) and then staging deploy (which rsyncs the code, updates configs and restarts services). Using rsync means I don’t end up with stuff on the servers that shouldn’t be there (the biggest problem with people using git for deployment).
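As a rough illustration of the rsync-based deploy step described above, here is a minimal sketch in plain Python (not the poster’s actual Fabric tasks). The host name, paths and the use of `--delete` and `--exclude` are assumptions for the example; `--delete` is what removes files on the server that no longer exist locally.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, host, dest_dir, excludes=(".git",)):
    """Build an rsync command that mirrors source_dir onto host:dest_dir."""
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    # -a: archive mode, -z: compress in transit,
    # --delete: remove remote files that are absent locally (an assumption here).
    cmd = ["rsync", "-az", "--delete"]
    for pattern in excludes:
        cmd.append("--exclude=" + pattern)
    cmd += [source, f"{host}:{dest_dir}"]
    return cmd


def deploy(source_dir, host, dest_dir, dry_run=True):
    """Print the command (dry run) or actually run it over SSH."""
    cmd = build_rsync_command(source_dir, host, dest_dir)
    if dry_run:
        print(shlex.join(cmd))
        return cmd
    subprocess.run(cmd, check=True)  # raises CalledProcessError if rsync fails
    return cmd
```

Calling `deploy("./app", "staging.example.com", "/srv/app")` with the default `dry_run=True` only prints the command, which is a safe way to check what would be synced before running it for real.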

From a business sense I put this in the same category as test-driven development. It’s an overhead initially to get it set up, but once it’s done it’s a massive liberator. I can essentially think of something, build it and deploy it against the real data on our site in minutes, not hours. If I screw up, I just revert the commit and redeploy.

Also, I have 8 low-powered VMs on Hetzner (we use Riak as our datastore) and it costs me approx. 100 euros in total. When a server goes “funny” (technical term for “I rebooted and it didn’t solve the problem”), I just remove it from the cluster, blow it up and create another one. That takes 15 minutes, instead of spending 30 minutes trying to understand the root cause before you can even work out if it’s solvable.

The other thing I’d recommend is Pushover. It lets you wire up a whole bunch of events in your app to trigger push notifications on your phone. For example I have notifications for new orders, deploys finishing, net new support tickets because all those things are important to me.
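For anyone curious what wiring an app event to Pushover might look like, here is a minimal sketch using only the Python standard library. It assumes Pushover’s message endpoint and its `token`, `user`, `message` and optional `title` fields; the token and key values are placeholders you would replace with your own.

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_pushover_request(app_token, user_key, message, title=None):
    """Build (but do not send) a POST request for a Pushover notification."""
    fields = {"token": app_token, "user": user_key, "message": message}
    if title:
        fields["title"] = title
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(PUSHOVER_URL, data=data, method="POST")


def notify(app_token, user_key, message, title=None):
    """Send the notification and return the HTTP status code."""
    request = build_pushover_request(app_token, user_key, message, title)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A hook such as `notify(APP_TOKEN, USER_KEY, "Deploy finished", title="staging")` could then be called at the end of a deploy script or from an order handler.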

rachelandrew · August 8, 2013, 7:24am

I looked at both Chef and Puppet and Puppet just seemed to sit better with my brain! However infrastructure as code has been a huge revelation to me. I’m using Vagrant as well - as an aside, for anyone who wants a simple intro to Vagrant and also uses PHP, have a look at PuPHPet.

I also think this stuff can be a huge liberator, and really fits well with most of our situations as tiny teams trying to do it all. There is obviously an overhead at first, but knowing your infrastructure is backed up in Git, as code and that you can recreate any part of it on a new box easily, is very nice.

Thanks for the link to Pushover, haven’t come across that and it looks useful.

cemhurturk · August 9, 2013, 8:48am

At Sendloop, we use VirtualBox instances that are exactly the same installation as our production servers. However, I guess it’s time to switch to Vagrant sometime in the near future.

In addition, we mainly use Git for deployment, with complex hook scripts. For the database, we use MySQL servers. Because Sendloop has a huge load (approx. 10 million emails are sent every day through Sendloop servers, which means a load of 100+ million transactions on the MySQL servers), we split users across multiple MySQL servers. I know this is a very old technique, but it works.
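The post does not say how users are assigned to servers, so the following is only a generic sketch of the idea of splitting users across several MySQL servers. The host names and the modulo rule are assumptions for illustration, not Sendloop’s actual scheme.

```python
# Hypothetical shard hosts; a real deployment would load these from configuration.
SHARD_HOSTS = [
    "mysql-1.internal.example",
    "mysql-2.internal.example",
    "mysql-3.internal.example",
]


def shard_for_user(user_id, hosts=SHARD_HOSTS):
    """Return the database host that holds all rows for the given user."""
    if not hosts:
        raise ValueError("at least one shard host is required")
    # Modulo keeps the mapping stable as long as the host list does not change.
    return hosts[user_id % len(hosts)]
```

One caveat with a plain modulo rule is that adding a server changes the mapping for most users, so setups that expect to grow often keep an explicit user-to-shard lookup table instead.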

noel · August 8, 2013, 12:59pm

We have an almost identical setup to @imsickofmaps: Chef, Vagrant, and rsync. I don’t love Chef, but it gets the job done. My main Chef tip is to just run Chef Solo. Chef Solo is the simple system – you upload the scripts to the server, run them, and away you go. The creators of Chef are trying to sell their mindbogglingly complicated hosted solution, so the docs hardly mention Chef Solo.

radiac · August 8, 2013, 1:44pm

I’ve been looking at moving from custom scripts to puppet or chef; I’ve only skim-read the docs so far, but the recommended pull model of agent/master (or chef’s server/nodes) strikes me as overcomplicated, and a bit less secure and reliable than pushing updates. I’m thinking of doing something similar to @noel, pushing and applying puppet manifests using fabric - am I just being paranoid, or missing something?

Out of interest, what do people use for monitoring?

casey · August 8, 2013, 2:25pm

I’m in the same boat as radiac, I think. We only have 10ish physical servers and I get by with my own (really basic) scripts. I only set up 1 or 2 new hosts a year so I’ve been putting off improving our process.

Lately I’ve been looking through the Ansible docs and it seems nice.

The usual Nagios for monitoring. I also really like Munin for graphing trends. It does a lot out of the box with a very easy setup, and you can whip up a plugin to graph something new in a few minutes. For performance monitoring, New Relic is awesome for Rails and well worth the money.
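To give a feel for how small a Munin plugin can be, here is a minimal sketch in Python. A plugin is just an executable that prints graph metadata when called with `config` and prints `field.value` lines otherwise. The field name `load` and the graph labels are choices made for this example (Munin already ships its own load plugin), and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import sys


def config_lines():
    """Graph metadata Munin asks for when the plugin is run with 'config'."""
    return [
        "graph_title Load average (1 min)",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]


def value_lines(load_1min):
    """Current reading, in the 'field.value N' form Munin expects."""
    return [f"load.value {load_1min:.2f}"]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        lines = value_lines(os.getloadavg()[0])
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Swapping the body of `value_lines` for a queue length, signup count or any other number is usually all it takes to graph something new.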

bradleyboy · August 8, 2013, 3:44pm

Like many of the replies above, we spent years living on a cobbled-together “script”, which was a text file full of bash commands to run when adding a new server to our cluster. It took time and was not the sort of thing you wanted to have to do under pressure. Chef always melted my brain when trying to learn it, but I finally forced myself a few years ago and am happy I did. We use Joyent to manage our 5-server cluster for SlideShowPro.com, and adding a new box is a single command from the terminal thanks to Chef + Joyent’s CLI tools. And all those tools are wrapped in our own repo, so anyone can do this once they’ve git cloned.

Working with a small team for all these years, automation has become an addiction of mine. We’re now almost fully automated in our development environment and production release workflow for Koken. If you are setting up a new development machine, you check out the repo and then run a single command to install all the dev tools and local server configurations to get you up and running. Need to push a new release? Again, that is a single terminal command that packages the release, pushes it to the server, updates release notes, etc. Like I said, I’m addicted. It’s an investment of time upfront, but the peace of mind that comes later is well worth it.

rachelandrew · August 9, 2013, 5:44am

A lot of people use Puppet masterless and sync manifests via Git etc. I’ve gone for a standard Puppetmaster setup. I intended to run masterless, but the more I read, the more sense it made to run a puppetmaster, and it is less complicated than it looks once you get it going.

cemhurturk · August 9, 2013, 6:53am

In addition to my original post, we use Pingdom for server uptime monitoring, Cockpito.com for in-app metric monitoring, and Sendloop’s Engage feature for user tracking and event-driven messaging.

Deployment is not automated; we mainly use Git for deployment, supported by some custom scripts.

We also use Bugsnag for fatal error detection and notification.

As a side note, I will definitely take a look at Puppet for automation.

imsickofmaps · August 9, 2013, 8:47am

We use NewRelic’s free monitoring. If you can put up with their sales people emailing you to upgrade all the time it’s great. Funny thing is, I intend to use NewRelic once I can justify the expense but the more they email the more I just want to abuse the free-tier.

cemhurturk · August 9, 2013, 8:53am

You may also try CopperEgg which is similar to NewRelic but cheaper.

pellisuk · August 9, 2013, 8:56am

In my day job (DevOps) we are using Chef & Vagrant to manage our AWS servers. We host our own Chef server & repo. It’s doing a good job of reducing the number of repetitive tasks we have. In my bootstrapping I’m still using hosted web space, but I’ve found FTPloy to be very useful. Anything that is pushed to my Bitbucket master branch is deployed to the server; staging for testing can be done by pushing to a different branch.

wamberg · August 9, 2013, 4:43pm

I’d like to throw SaltStack into the mix for infrastructure management options. It seems like a lot of people have trouble getting the hang of Chef; I’m one of those people. I tried Salt after yet another Chef recipe upgrade broke my will. Here’s why I’m sticking with Salt.

Salt just works better for me. It might have something to do with my Python background. While Salt is written in Python, I haven’t found myself writing any Python to set up a machine. The Salt States are written in YAML. YAML has its own problems but I’ll take it over Ruby any day. For example, it has been immensely more simple for me to set up nginx and uwsgi servers in Salt with YAML and Jinja templates. I’ve also had more success doing git-based deployments.

Beyond States (Recipes in Chef), I found the client/server setup so much easier with Salt. My Chef server always had memory problems. I’m sure this has more to do with my administration ability than Chef, but I couldn’t keep the Chef server running without a beam process grabbing all the memory and taking down my infrastructure house of cards. Even when my Chef server was up and running, I felt like I was tied to Chef’s web interface to manage my nodes. I don’t feel the same way with Salt. For example, in Salt, managing minion access to the server is a few terminal commands.

One issue is that the Salt States I write will only work on Ubuntu Linux machines. Maybe there are guidelines to writing Salt States that are more portable. One of the benefits of Chef (and Puppet, I don’t have any personal experience with it though) is that your recipe should work on a variety of Linux distros. This makes my States less likely to be useful for the devops community if I were to share them. My experience with Chef recipes though is that they don’t always work in every distro anyway. I seemed to have a penchant for finding a Chef recipe that was written by a RedHat sysadmin and never tested in Ubuntu.

If you’re having trouble with your current infrastructure manager, give Salt a shot.

dennis · August 10, 2013, 5:28pm

We use dedicated servers/racks hosted by Hetzner for our TestRail Hosted infrastructure and our websites. We prefer dedicated machines as they offer much better performance compared to cloud offerings like AWS in our experience. The hardware side is completely managed by Hetzner and they are flexible enough to configure private switches/networks, have API-switchable failover IPs, can install BBUs etc. We currently manage our ~20 servers with a homegrown set of shell scripts + Subversion for configuration management, which has been working well for us.

That said, provisioning new machines is more work than it should be and we plan to switch to one of the available configuration management tools. We briefly looked into using Chef, Puppet or Ansible, and having tested Ansible as part of some smaller projects with Vagrant, I think we will go with this tool. I especially like the fact that Ansible doesn’t require you to install an agent on the managed servers and that it uses simple SSH connections. Does anyone have experience using Ansible for their production systems here?

smtm · November 29, 2013, 8:48pm

What do you use to manage the traffic distribution to the cluster at Hetzner? A separate machine with a loadbalancer or some other setup?

I just moved to Hetzner from shared hosting with my hotel management Rails app. Initially I built the boxes with Vagrant and Puppet on my local machine and deployed that to Hetzner. Now I am going with vhosts, but Hetzner doesn’t support failover IPs for those. So I am looking for alternatives.

starr · November 30, 2013, 4:24am

If you happen to be using ruby & rails you might be interested in my service honeybadger.io. We’re working to give devs a complete monitoring solution that doesn’t cost an arm and a leg.

Sorry for the blatant plug. I figure I’m allowed one every 20 posts.

starr · November 30, 2013, 4:27am

Our ops guy recently switched from chef to ansible and is swearing by it.

dennis · December 12, 2013, 4:03pm

We use dedicated pairs of routers with HAProxy + Pacemaker/Corosync for failover. You can use Hetzner’s failover nets to automatically reroute traffic when you switch the master via Corosync.

dennis · December 12, 2013, 4:05pm

We also migrated our homegrown solution to Ansible recently and really like it. The nice thing about Ansible is that it doesn’t require any agent/service installations and works just via SSH. We also use it to provision our Vagrant dev machines.