January 18th, 2012
In deference to Adam Young, I’m going to try and write a series of broken-down posts on FUDCon, rather than one or two giant mish-mash-y summaries.
So, this one’s about the presentation I gave, titled ‘Cloud 0.1’, with a subtitle I haven’t quite nailed down yet, but which is something like ‘Why Not to Spend Lots of Time and Energy Running Your Own Infrastructure Much Worse Than Google Would, And How To Do It If You Insist’.
I’ve had the idea for a while now, but being lazy, didn’t write anything at all until the day before FUDCon, nor make any slides. Then I pitched it. To my surprise, it got enough votes to be scheduled. To my consternation, it got scheduled in the very first timeslot – so I had no time to finish my half-written notes, make any slides, do a runthrough, or generally do any of the stuff that would make it into a good talk.
Instead I got up, read my introduction, then improvised inexpertly for an hour. Many thanks to the dozen people who showed up and managed to avoid falling asleep or throwing rotten fruit.
The way I presented the talk was to spend a while talking about the many reasons it’s not a good idea to run your own infrastructure and the few reasons it is, then spend quite a while giving a 10,000 foot overview of how to set up a mail and web server, then spend the last 15 minutes briefly going over some rather neat webapps I run on my servers, and IRC/IM proxying. However, in hindsight, I think the most valuable bits are the consideration of whether you should run your own infrastructure, and the notes on neat, not-necessarily-well-known webapps and so on you can use if you do. The mailserver / webserver stuff is just too complex for a one hour presentation. So, since my notes are terrible, personal shorthand gibberish, and I have no slides, instead of giving you those, I’ll write a post about the same topics. Deal?
When I talk about ‘infrastructure’ I’m talking about the services that support your computing. The classic, old-school example is running your own mail server; other bits that come into the talk are a personal web server and IRC / IM proxying servers.
In the past it was pretty hard to find managed ways of doing any of those things, and it was fairly common for geeky types with personal internet connections to DIY. If you look at the internet of, say, 1995, it was kind of designed as a giant interoperable network of nodes which would provide these kinds of services to a group of users, and geeky types would essentially act as a node unto themselves – they were a service provider of one, providing services to themselves, and maybe a few friends and family, instead of relying on mail and web hosting services provided by their ISPs, which were inevitably crappy and limited.
These days it’s much less common, for a good reason: you can almost always get someone else to do it for you, much cheaper and better than you would do it yourself.
This forms the ‘why you probably shouldn’t do this’ side of the argument. There is just about nothing you can achieve by hosting your own mailserver which Google won’t do much better in exchange for sending you some ads and assimilating your personal information into the future Skynet, or which a service like Fastmail won’t do much better in exchange for a frankly pretty small cost – a cost which will almost certainly be less than the value of the time and money you’ll invest into doing it yourself. This is not surprising. There are huge, huge economies of scale built into infrastructure provision. Doing it for a user base of 1-5, on a hobbyist basis, is unsurprisingly vastly less efficient than doing it for ten million people on a very very professional basis.
The other disadvantages to self-hosting really just derive from this fact. You will almost certainly screw up more than a hosted provider will: you will break the server by deploying some dodgy app or an untested update. You will have less capacity (wave goodbye to your self-hosted blog when you get slashdotted, for example). You will almost certainly have less redundancy – I know I don’t have any kind of failover on this webserver. You will almost certainly fail to take adequate backups. All these are boring, menial things which any decent hosted provider will do better just because it’s part of doing a professional job. You won’t because you’re doing this for fun, and those things are not fun.
Briefly, paid or ‘free’ (ad-supported / personal data supported) hosting services can provide you with almost anything you can host yourself, and do it much more efficiently. So why would you ever want to do it yourself? There are only a few reasons:
Necessity. I’m sticking this up at the top to make sure you don’t miss one of the best bits of this lengthy post. There are some things you can self-host that, to the best of my knowledge, you can’t actually get from a paid provider. The thing I know about is IRC/IM proxying. There’s no hosted provider of this that I know of. There’s a bit of this post down the bottom which explains what this is and, briefly, how to do it. If you’re a heavy user of IRC and/or IM you may well want to do it, because it’s really useful. So if you skip a lot of this post, do read that bit.
Education. You can learn quite a lot about how the internet (still, more or less) works by doing this stuff yourself. It will certainly teach you things. The internet is a somewhat different beast in practice these days, with so much of it existing inside Google’s and Facebook’s monstrously internally complex domains, but at a certain level it still works _more or less_ how the RFCs of the Internet Past declare it works, and running your own services will teach you quite a lot about that.
Control. Obviously, the higher the level of functionality that you outsource, the less control you have over the implementation. This seems like a really big reason, but it often isn’t. When it comes to mail, a hosted mail provider will almost always provide everything you want. You just don’t need really fine grained control over the server configuration. You do not need to control the maximum simultaneous connection count to the IMAP server. You want a service that delivers your mail, allows you to send mail, allows you to organize your mail, and filters spam out for you. That’s really pretty much it. Gmail certainly achieves all these things. So do dozens of other services. Again, when it comes to web hosting, often what you want is a WordPress instance. You do not need deep control over the server’s PHP configuration. It’s more likely to irritate you than help you. There are cases where you actually need such control, as opposed to just maybe finding it cool that you have it, but those cases are fairly uncommon.
Fun. Yeah, it’s worth mentioning this. Some of us have very strange mindsets which find battling obscure MTA configuration to be an interesting way to spend our time. I’ve checked with medical professionals, and this is an incurable condition. Sorry. We just have to live with it. If you’re a fellow sufferer, you may self-host for no reason other than that you enjoy doing it.
Privacy. This is probably the largest remaining really valid reason. If you use a ‘free’ service for your infrastructure, you should always keep in mind that you almost certainly no longer own your stuff in any practical sense. If you use Gmail, Google pretty much owns your email. You don’t. They can look at it, use it to develop Skynet, send it to the government, and just generally do whatever the hell they like with it. In strict point of fact this is not entirely true – there are some legal restraints on what they can do with ‘your’ data – but I find it’s an excellent rule of thumb to work from. When dealing with such services I find it pays more or less to assume that everything you put into them will immediately be forwarded to the police and all your worst enemies, and then used to generate large amounts of advertising that will be mailed to you. Doing so avoids you being shocked in future when some of those things actually happen.
Paid services are a somewhat different ball of wax, in that you are not offering up your data in exchange for some services, but actually paying for the services. You therefore have a reasonable expectation that you will retain most of the ownership of your data. If you use a decent service provider, the contract you have with them may even possibly bear this out. However, there are still several problems, mostly legal ones. Your hoster can almost certainly be obliged to nuke your services and probably turn over your data to law enforcement under the terms of various bits of legislation, depending on where you are and where they are. Even if they’re not obliged to, they may well do so if asked by a sufficiently powerful body (like the government, or Universal Studios), on the basis that pissing you off is probably less damaging to them than pissing off the government. If you host your own services, this becomes much more unlikely.
It remains only to point out that, in brutal point of fact, this is often unlikely to be a consideration, but it is still worth bearing in mind, and though it’s not a huge issue for me, I do still value the fact that it’d be quite difficult for anyone to kill or forcibly access my mail or private web content.
In relation to this last point, it’s worth remembering that ‘self-hosting’ vs ‘using a provider’ is more of a spectrum than a binary state. Even those of us who ‘self-host’ are inevitably going to be outsourcing some stuff to someone. I use No-IP for DNS registration, for instance, so in theory someone could at least knock happyassassin.net offline by leaning on No-IP. I don’t have control over that level of things. But still, No-IP doesn’t own or even have access to any of my actual data, only my DNS records.
At the general level, even if you decide you want to ‘self-host’, you have a lot of flexibility in terms of what level you want to control yourself and what you want to pay someone else to look after for you. You don’t have to actually buy physical hardware and host everything off an internet connection you personally control. If that’s at, or near, the extreme ‘self-hosting’ end of the spectrum, then moving towards ‘completely managed’, we have:
* Stick your own hardware in a co-lo (i.e. you outsource the physical internet connection)
* Use a service like Slicehost where you get full root access to a bare virtual server (i.e. you outsource the physical connection and the ‘hardware’ provision)
* Use a service which gives you access somewhere higher up the stack
Everything else is a variant on that last one. It really only matters what level you get access at. Maybe you get a pre-set web server instance in which you can run whatever webapps you want. So-called ‘PaaS clouds’, like Openshift, are really just this kind of managed hosting, in a way; ‘IaaS clouds’ are pretty much like Slicehost. Maybe you just get a managed instance of some specific app or service, like WordPress (or ‘email’). It comes down to how much control and privacy you need, with the trade-off for more control and privacy usually being more expense and complexity.
So, there’s the theoretical for-and-against of self-hosting. It comes down to the broad conclusion that you probably don’t want to do it, and even if you do, you’re probably better off going for something in the middle of the spectrum – Slicehost, or one of the new public clouds, or something like that – than really doing (almost) everything yourself.
Assuming you self-host, or are going to start trying, despite all the above: here’s some notes on actually doing it.
Getting a domain of your own is pretty much the Point 0 of self-hosting. It’s also, fortunately, pretty simple. You can find a lot of confusing information on the topic but essentially it boils down to: buy a domain name and then set up the information that says ‘this domain is associated with this IP address’ – DNS records. It is much simpler to do these two things together, through one service. I use No-IP – their prices are reasonable and I’ve had no problem with their service. There are many other providers. It’s really as simple as picking a domain – like my happyassassin.net – paying your fee, and then filling out a little form which says ‘www.happyassassin.net should point to IP address xxx.xxx.xxx.xxx, mail.happyassassin.net should point to IP address xxx.xxx.xxx.xxx’, and so on. If you’re going to host mail for your domain, you’d also need an MX record, which says ‘mail for any address at happyassassin.net goes to the server called mail.happyassassin.net’ (strictly, an MX record names a host rather than an IP address, and that host’s own record supplies the address). And that’s really pretty much it. If you’re really self-hosting, as in you own the machines and they’re hanging off your own internet connection, all those IP addresses should be your own IP address. You’re going to want a static IP for that.
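To make the DNS part a bit more concrete, here’s a rough Python sketch of what those records amount to. It’s a toy lookup table, not a real resolver: the example.net names and the 203.0.113.10 address are placeholders (both are reserved for documentation), and the `lookup` and `mail_targets` helpers are hypothetical names made up for illustration. It shows the MX record naming a host, and that host’s A record supplying the address.

```python
# Toy model of a zone: NOT a real DNS resolver, just the shape of the data.
# All names and addresses below are illustrative placeholders.
ZONE = {
    ("www.example.net", "A"): ["203.0.113.10"],
    ("mail.example.net", "A"): ["203.0.113.10"],
    # MX records hold (preference, hostname); lower preference is tried first.
    ("example.net", "MX"): [(10, "mail.example.net")],
}


def lookup(name, rtype):
    """Return the records of type `rtype` stored for `name` (empty if none)."""
    return ZONE.get((name, rtype), [])


def mail_targets(domain):
    """Return (host, ip) pairs a sending server would try, best preference first."""
    targets = []
    for _preference, host in sorted(lookup(domain, "MX")):
        for ip in lookup(host, "A"):
            targets.append((host, ip))
    return targets


if __name__ == "__main__":
    print(lookup("www.example.net", "A"))
    print(mail_targets("example.net"))
```

Note there’s nowhere in the MX entry to put a port number, only a preference and a host name.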
Mail is the most complex thing to self-host and probably the least sensible, as hosted mail providers really do have it all figured out. I’m not going to turn this into a comprehensive ‘how to host your own mail’ walkthrough, because there are many of those already, and if you’re going to do it, what you should do is get a hold of a good guide and follow it carefully. But I do have one thing to contribute. I find it helps to bear in mind there are broadly three functions of a mail server, at least in my mental model, and you can pretty much treat them separately:
1: Retrieve messages from your existing mail accounts and serve them back out via IMAP for you to read on your client machines
I do this using fetchmail to actually retrieve the mail, procmail to sort it into folders and spam-test it via spamassassin, and dovecot to serve it back out via IMAP. I would strongly recommend the use of dovecot, it really is the best IMAP server around. It’s efficient, actively developed, highly standard-compliant, and supports things like IDLE very well. Other IMAP servers generally fail at at least one of those things. The retrieving and serving out are kind of different functions, but it makes no sense to do one without the other, really. There’s no point aggregating the mail from your various accounts in one place without also setting up a convenient interface – i.e. a server – for you to access it with.
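For a feel of what the sorting step does, here’s a minimal Python sketch of the kind of decision a procmail rule makes. It is not procmail syntax and not my actual rules: the folder names and the `choose_folder` helper are hypothetical, and it assumes SpamAssassin has already tagged the message with an `X-Spam-Flag: YES` header, which is how it normally marks spam.

```python
from email import message_from_string


def choose_folder(raw_message):
    """Pick a mail folder for one message, based only on its headers."""
    msg = message_from_string(raw_message)
    # Assumption: SpamAssassin has already run and added X-Spam-Flag.
    if (msg.get("X-Spam-Flag") or "").strip().upper() == "YES":
        return "Spam"
    # Mailing list software usually adds a List-Id header.
    if msg.get("List-Id"):
        return "Lists"
    return "INBOX"


if __name__ == "__main__":
    sample = "From: someone@example.net\nList-Id: <devel.example.net>\n\nHello\n"
    print(choose_folder(sample))
```

The real pipeline does the same thing with fetchmail handing each message to procmail, which files it into a folder that dovecot then serves over IMAP.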
2: Act as an SMTP server for your outgoing mail
When you want to send mail you send it through an SMTP server, right. Most people know that. Running your own SMTP server, for your personal use, has the advantage that you don’t have to keep changing to an SMTP server that’s accessible from the network you’re currently on. (Though, of course, if you just use Gmail, you can send outgoing mail from anything…)
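From the client side, using your own SMTP server looks roughly like the Python sketch below. Everything specific in it is an assumption for illustration: the mail.example.net host name, the use of port 587 with STARTTLS, and the login step all depend on how your server is actually configured.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Assemble a simple plain-text message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_own_server(msg, host, port=587, username=None, password=None):
    """Hand the message to your own SMTP server.

    Assumes the server accepts submissions on `port` and supports STARTTLS.
    """
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)


if __name__ == "__main__":
    message = build_message("me@example.net", "you@example.org", "Hi", "Hello there")
    print(message["Subject"])
    # Not called here, since it needs a real server:
    # send_via_own_server(message, "mail.example.net", username="me", password="...")
```

The point is that the host in that last call never changes, whichever network your laptop happens to be on.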
3: Accept incoming mail from anyone to mail addresses at a domain you own
This is the most complicated case, probably. The fact that I’m set up to do this is why you can mail me at happyassassin.net, my own domain. When you send a mail there, your mail provider sees that mail to happyassassin.net is supposed to go to an IP address I own, and sends it there. That IP address actually is my own IP address, and connections to port 25 on that IP address are forwarded by my router to my mail server, which accepts the mail and sticks it into my mail folders just like fetchmail/procmail do for the email addresses I don’t administer myself.
I’m not going to explain in detail how to achieve all the above, but the key point is to remember these functions are distinct – you can do any one of them without doing the others. Where it’s easy to get confused is that you usually would use the same application, the same process, to do functions 2 and 3. I use postfix, because it’s marginally less insane than sendmail. But it’s best to think of them as two separate operations, and do one and then the other. If you think in terms of ‘how do I set up postfix’, you’re likely to get confused – finding guides for function 3 when what you really wanted was function 2. I know I did.
Another little note on that topic: the sketch of happyassassin.net mail I gave is, strictly speaking, incorrect. Your mail provider doesn’t really see that mail for happyassassin.net should go to my IP address: it sees that mail for happyassassin.net should go to No-IP. Why? Well, because I host my servers off my home internet connection, and that has port 25 blocked. Most home internet connections do. The way email actually works, mail for a domain is always initially delivered on port 25. The DNS record which says ‘mail for happyassassin.net goes to IP XXX’ cannot say ‘IP XXX on port 26’. It just says ‘IP XXX’. The port is hard-coded in the standards. So if you have a connection on which port 25 is blocked, you really can’t be the server that initially receives mail for your own domain. No-IP provide a neat service to get around this, called mail reflector. Essentially you set up your DNS records so that mail for your domain goes to No-IP’s server, and you tell No-IP the actual port of your server. Then No-IP’s server simply forwards mail straight through to your server. They don’t store it or have any access to it, except in the case that your server is down – they will keep it on theirs until your server comes back up, then forward it on. It’s a neat way around the port 25 problem, which costs $40 a year – at which price you could instead have Fastmail handle your entire mail setup, including your own domain’s mail. Again, like I said, self-hosting is almost never actually economically sensible.
Setting up a web server, at the 10,000 feet scale, isn’t very difficult. Basically, you do ‘yum install httpd’ (or equivalent), and you’re done. You already registered www.mydomain.com and pointed it to your server’s IP address. Now you set your router to forward traffic on port 80 to the appropriate box, and you’re done. People going to www.mydomain.com will see a ‘hello world!’ post that’s the default homepage for Apache. Oh, and you do want to use Apache. There are alternatives, but they’re rarely what you want for self-hosting, and you will find much more help with configuring Apache than configuring anything else.
These days, you’re likely not going to be faffing around creating static content and dumping it in /var/www/html on your server. You really want to run webapps – you probably want to run a WordPress blog, for instance. Essentially your web server is providing useful services for you.
The 10,000 foot overview of how to install web apps is similarly simple: yum them. The most common ones are packaged. WordPress is: you can just do ‘yum install wordpress’. There are guides for the finicky bits of configuration.
There’s one stumbling block you’ll hit for most webapps, so I’ll mention it quickly: they almost all need a database. Web apps rarely store things as files on your local disk, because that’s silly. They want access to an SQL database instead, and they’ll store their configuration, your blog posts, and whatever else in there. You almost certainly want to use MySQL for this. MySQL will be packaged in any sane distro. Once you install it, it will probably be configured with no root password and an anonymous (guest) account. You will want to set a root password and remove the anonymous account. There are guides to how to do this in the excellent MySQL documentation. Then, for each webapp you install, you’ll likely create a new database specially for that webapp, with a user account specially for that webapp which has access to the database. You can do this with a one-line command. The webapp will ask for a MySQL username and password as part of its setup process; feed it the username and password you created especially for it. That way, no webapp can access another’s data; only root will have access to all the databases, and you should only use the root account for any manual poking of the database you personally have to do. Never give the root password to any webapp (or any other person). The most popular webapps, like WordPress, tend to have the MySQL setup well documented, and you can apply the documentation to any other webapp which just needs a simple MySQL config to work. Which is most of them.
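The one-database-one-user pattern can be sketched as a small Python helper that prints the SQL for you to run as MySQL root. This is an illustration under assumptions, not the only way to do it: the `webapp_db_sql` name is made up, it spells the work out as three statements for clarity, and the 16-character name limit is a deliberately conservative choice. Check your MySQL version’s documentation before relying on the exact syntax.

```python
import re

# Conservative whitelist for database/user names: letters, digits, underscore.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]{1,16}")
_SAFE_HOST = re.compile(r"[A-Za-z0-9_.-]+")


def webapp_db_sql(app, password, host="localhost"):
    """Return SQL statements giving one webapp its own database and user."""
    if not _SAFE_NAME.fullmatch(app):
        raise ValueError("unsafe database/user name: %r" % app)
    if not _SAFE_HOST.fullmatch(host):
        raise ValueError("unsafe host: %r" % host)
    # This sketch refuses awkward passwords rather than trying to escape them.
    if "'" in password or "\\" in password:
        raise ValueError("password must not contain quotes or backslashes")
    return [
        f"CREATE DATABASE `{app}`;",
        f"CREATE USER '{app}'@'{host}' IDENTIFIED BY '{password}';",
        # The user can touch its own database and nothing else.
        f"GRANT ALL PRIVILEGES ON `{app}`.* TO '{app}'@'{host}';",
    ]


if __name__ == "__main__":
    for statement in webapp_db_sql("wordpress", "change-me"):
        print(statement)
```

You’d paste the printed statements into a `mysql` session opened as root, then give the webapp only the per-app username and password.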
That’s web serving. Here are some of the webapps I run on my server. You may not have known about some, and find ‘em useful.
WordPress. Well, everyone knows about WordPress. It’s a blogging platform. If you want to have a blog on your server, you’re probably going to want to run WordPress. It’s well documented, easy to set up, hugely popular (and hence well supported), does everything you need from a blog, and has a bewildering array of plugins. Of course, if all you want is to have a blog, it’s almost certainly a better idea just to get it hosted by wordpress.com than faff around with setting up your own web server.
Roundcubemail. This is a webmail front end. Combined with my mailserver, it’s the last puzzle piece in extremely painfully replicating the functionality of Gmail – it gives me a pretty snazzy web front end to my mail, for the rare cases where I’m on someone else’s system and don’t want to set up an IMAP client, or something. It also came in quite handy at one FUDCon when the port blocking was so tight that IMAP clients didn’t actually work. Roundcube is a very very good webmail app, it has all the functionality of a desktop mail client, is pretty fast, and has a very snazzy interface. The old-school choice, Squirrelmail, is about as functional but nowhere near as pretty.
tt-rss is a news reader webapp. Running it is like hosting your own Google Reader, essentially. It’s a lot nicer than just running separate news reader clients on each of your client machines, because it means your read/unread state is always in sync. But of course, you could always just…use Google Reader. It’s not like knowledge of what RSS feeds you like is likely to be astonishingly private information.
MyTinyToDo is a very simple todo list webapp. I tried for years to find a big stonking egroupware suite – contacts, calendaring, and tasks, essentially – which would cover those things and sync well with my desktop clients and my phone. I never quite did. But mytinytodo handles one piece of the equation – tasks – just fine. I haven’t bothered trying to sync it with desktop clients / phone because you can just use the web interface very easily on any of those devices, it renders nicely on phones. Of course, you could always just use a hosted service like Remember The Milk.
OwnCloud is a ‘personal cloud’ server, or to avoid the buzzwordiness, it’s basically just a file server webapp. You point it at a place where files live and it makes them available through a web frontend and also via WebDAV (which lets you mount them as a shared drive on most OSes). It pretty much just does that, but it does it quite well and easily. At FUDCon, Jeroen gave me a long list of things that are wrong with it, and Jeroen is massively smarter than me so I’m sure he’s right, but all I know is it does what I ask it to. It’s handy for, say, storing your (encrypted!) password database, or a document you want accessible from anywhere. I store a lot of my notes in it. Your hosted equivalent would be, say, Dropbox.
Finally (man, 4000 words? Anyone still awake?) we come to the one thing I host myself, find useful, and could not find a hosted-provider equivalent of: IRC and IM proxying.
This achieves for IRC and IM what using a mail server achieves for mail, or using a web feed reader achieves for news: you can use many clients without them conflicting, and with the state preserved between them. How it works is essentially that you run an app which acts as both an IRC client and an IRC server. It connects to all your IRC servers, and then on your client machines, instead of connecting directly to Freenode or EFnet, you connect to the proxy, which also acts as an IRC server. It then forwards all the traffic to you.
What does this get you? Well, you can sign in from six different clients at once – and instead of each looking to the rest of the world like a separate user, they all act as ‘you’. You can have part of a conversation from your laptop, part from your phone, and part from your desktop, and the outside world won’t know the difference.
Also, as the proxy’s always logged in, you can disconnect all your client machines, and the proxy will keep storing conversations, including any private messages. Then the next time you connect a client, you’ll get a log of all the channel traffic that happened while you were away, and any PMs you got sent will show up. It’s very handy.
Finally it’ll give you a handy central store of logs. It’s just a much better way to IRC.
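If it helps, the behaviour described above can be modelled in a few lines of Python. This is a toy in-memory sketch of the idea only: it speaks no actual IRC, and the `BacklogProxy` class and its method names are hypothetical, not how any real proxy is implemented.

```python
class BacklogProxy:
    """Toy model of an IRC proxy: fan out live traffic, buffer it while away."""

    def __init__(self):
        self.clients = {}   # client name -> list of lines delivered to it
        self.backlog = []   # lines received while no client was attached
        self.log = []       # central log of everything ever received

    def from_network(self, line):
        """Handle one line arriving from the IRC network."""
        self.log.append(line)
        if self.clients:
            # Every attached client sees the same traffic.
            for inbox in self.clients.values():
                inbox.append(line)
        else:
            # Nobody is connected, so keep the line for later.
            self.backlog.append(line)

    def attach(self, name):
        """Connect a client; it is first handed anything that was missed."""
        inbox = list(self.backlog)
        self.backlog.clear()
        self.clients[name] = inbox
        return inbox

    def detach(self, name):
        """Disconnect a client; the proxy itself stays on the network."""
        self.clients.pop(name, None)


if __name__ == "__main__":
    proxy = BacklogProxy()
    proxy.from_network("<friend> are you around?")
    laptop = proxy.attach("laptop")
    print(laptop)
```

Here the laptop gets the message that arrived before it connected, which is exactly the ‘log of what happened while you were away’ behaviour.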
I use Bip as an IRC proxy. It’s very easy to set up – really, you just install it and give it a list of IRC networks and channels you use, and tell it your nickname. Then you run it, and set up your IRC clients to connect to it, not directly to the networks. And you’re done. It’s probably the easiest thing you can self-host, as well as being the most useful.
On the same machine I run Bitlbee, which is an IM proxy – it connects out to MSN, Jabber, ICQ, AIM and so on, and also acts as an IRC server, effectively turning IM traffic into IRC traffic. I then have Bip use my Bitlbee server, so when I’m using MSN, my desktop is connected to my Bip instance, which is connected to my Bitlbee instance, which is connected to MSN. Fun, huh? Bitlbee can also actually connect to Twitter and Identi.ca, effectively turning your ‘social network’ traffic into IRC. You can tweet just by typing a message into your IRC client, and tweets from people you follow pop up as IRC messages. It’s a fun interface if you’re used to using IRC.
So…that’s my self-hosting story. Why you probably shouldn’t do it, and some things you might want to run if you do. Hope it’s helpful!