Sunday, February 15, 2009

Disruptive Backup Platform

In many data center environments where commercial backup software such as Legato NetWorker or Veritas NetBackup is deployed, it is mostly used to back up and restore files. For a minority of backups, specialized software is used for better application integration, such as RMAN for Oracle databases. In recent years nearline storage has become an interesting alternative to tapes - one of the outcomes is that all commercial backup software supports it in one way or another.

But the real question is - do you actually need the commercial software, or would it be more flexible and more cost effective to build your backup solution on open source software and commodity hardware? In some cases the answer is obvious - take a NAS server/cluster, for example. It makes a lot of sense to set up a second server with the same HW and storage and replicate all your data to it. If you lose the data on your main server, you can either restore it from the spare or promote the spare to be the live server, which makes your "restore" almost instantaneous compared to a full copy of the data. Later on, once you have fixed your main server, it can become the backup one. Not only does this restore service MUCH more quickly when you lose data, it will almost certainly be cheaper than a legacy solution based on commercial software, tape libraries, etc.

The above example is a rather special case - what about a more general approach? What about OS backups and data backups which do not require special co-operation with an application? In many environments that covers well over 90% of all backups. I would argue that a combination of some really clever open source software can provide a valuable and even better backup solution for that 90% of servers than the legacy approach. In fact some companies are already doing exactly that - one of them is Joyent (see slide 37), and there are others.

Recently I was asked to architect and build such a solution for my current employer. It was something we had had in mind for some time but never got around to. Until recently...

The main reason was cost savings - we figured out that we should be able to save a lot of money by building a backup platform for the 90% case rather than extending the current legacy platform.

Before I started to implement a prototype I needed to understand the requirements and constraints; here are the most important ones, in no particular order:
  • it has to be significantly more cost effective than legacy platform
  • it has to work on many different Unix platforms and ideally Windows
  • it has to provide a copy of the data in a remote location
  • some data needs to end up on tape anyway
  • only basic file backup and restore functionality - no bare metal restore
  • it has to be able to scale to thousands of clients
  • due to other projects, priorities and business requirements I need to implement a working prototype very quickly (a couple of weeks) and I can't commit my resources 100% to it
  • it has to be easy to use for sysadmins
After thinking about it I came up with some design goals:
  • use only free and open source software which is well known
  • provide a backup tool which will hide all the complexities
  • the solution has to be hardware agnostic and has to be able to reliably utilize commodity hardware
  • the solution has to scale horizontally
  • some concepts should be implemented as close to commercial software as possible to avoid "being different" for no particular reason
Based on the above requirements and design goals I came up with the following implementation decisions. Notice that I have omitted a lot of implementation and design details here - it's just an overview.


The Backup Algorithm

The idea is to automatically create a dedicated filesystem for each client and, each time data has been copied (backed up) to that filesystem, create a snapshot of it. Each snapshot in effect represents a specific backup. Notice that the next time you run a backup for the same client it will be an incremental copy - in fact, once the first full backup is done, all future backups will always be incremental. This puts less load on your network and your servers compared to most commercial backup software, where for all practical purposes you need to do a full backup on a regular basis (TSM being an exception here, although it has its own set of problems by doing so).
To expire old backups, snapshots are removed when they are older than the global retention policy or a client-specific one. Both procedures are outlined below, followed by a short sketch of how they might look in shell.

Backup:
  1. create a dedicated filesystem (if not present) for a client
  2. rsync data from the client
  3. create a snapshot of the client filesystem after rsync has finished (successfully or not)
Retention policy:
  1. check for global retention policy
  2. check for a local (client specific) retention policy
  3. delete all snapshots older than the local or global retention policy
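
To make the above concrete, here is a minimal sketch of how one backup cycle and the retention check could be glued together on the backup server. It is only an illustration - the pool name, filesystem layout, retention value and rsync options are assumptions of mine, not the actual tool:

    #!/usr/bin/bash
    # Minimal sketch of one backup cycle for a single client (illustrative only).
    # Assumes a pool named "backup", one filesystem per client under backup/clients
    # (mounted at the default mountpoints), root ssh access to the client and a
    # retention policy expressed in days.
    CLIENT="$1"                                  # e.g. web01.example.com (hypothetical)
    FS="backup/clients/${CLIENT}"
    RETENTION_DAYS=30                            # example global retention policy

    # 1. create a dedicated filesystem for the client if it is not present yet
    zfs list "$FS" > /dev/null 2>&1 || zfs create -p "$FS"

    # 2. rsync data from the client (pull model)
    rsync -a --delete --numeric-ids \
        --exclude=/proc --exclude=/sys --exclude=/tmp \
        "root@${CLIENT}:/" "/${FS}/"

    # 3. snapshot the filesystem whether rsync succeeded or not, so even a
    #    partial transfer is captured as a backup
    zfs snapshot "${FS}@$(date +%Y-%m-%d_%H%M%S)"

    # retention: destroy snapshots older than the retention policy
    CUTOFF=$(( $(date +%s) - RETENTION_DAYS * 86400 ))
    for snap in $(zfs list -H -t snapshot -o name -r "$FS"); do
        created=$(zfs get -Hp -o value creation "$snap")   # creation time in epoch seconds
        [ "$created" -lt "$CUTOFF" ] && zfs destroy "$snap"
    done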


Software

Because we are building an in-house solution it has to be easy to maintain for sysadmins. That means using well-known and proven technologies that sysadmins are familiar with and know how to use.

I chose rsync for file synchronization. It has been available for most Unix (and Windows) platforms for years, most sysadmins are familiar with it, and it is still actively developed. The important thing here is that rsync is used only for file transfer, so if another tool turns out to be more convenient in the future it should be very easy to switch to it.

When it comes to the OS and filesystem the choice is almost obvious: OpenSolaris + ZFS. There are many very good reasons why, and I will try to explain some of them. When you look at the above requirements again you will see that we need an easy way to quickly create lots of filesystems on demand out of a common storage pool, so you don't have to worry about pre-provisioning storage for each client - all you care about is whether you have enough storage for the current set of clients and retention policy. Then you need a snapshotting feature which has minimal impact on performance, scales to thousands of snapshots and again doesn't require pre-provisioning any dedicated space for snapshots, as you don't know in advance how much disk space you will need - it has to be very flexible and it shouldn't impose any unnecessary restrictions. Ideally the filesystem should also support transparent compression so you can save some disk space; transparent deduplication would be perfect too. Additionally the filesystem should be scalable in terms of block sizes and number of inodes - you don't want to tweak each filesystem for each client separately - it should just work by dynamically adjusting to your data. Another important feature is some kind of data and metadata checksumming. This matters because the idea is to utilize commodity hardware, which is generally less reliable. The only filesystem which provides all of the above features (except for dedup) is ZFS.
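As an illustration of how little per-filesystem tuning is needed, properties like compression can be set once on a parent dataset and are inherited by every per-client filesystem created under it. The dataset names below are hypothetical:

    # Set properties once on the parent dataset; new children inherit them.
    zfs set compression=on backup/clients     # transparent compression (lzjb)
    zfs set atime=off      backup/clients     # skip access-time updates during rsync
    # checksumming of data and metadata is on by default in ZFS

    # A new per-client filesystem picks the settings up automatically:
    zfs create backup/clients/web01.example.com
    zfs get -r compression,checksum backup/clients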


Hardware

Ideally the hardware of choice should have some desired features: as low a $/GB ratio as possible (for the whole solution: server + storage), it should be as compact as possible (TB per 1RU as high as possible), it should be able to sustain at least several hundred MB/s of write throughput and it should provide at least a couple of GbE links. We settled on Sun x4500 servers, which deliver on all of these requirements. If you haven't checked them out yet, here is a basic spec: up to 48x 1TB SATA disk drives, 2x quad-core CPUs, up to 64GB of RAM, 4x on-board GbE - and all of this in 4U.

One can configure the disks in many different ways - I have chosen a configuration with 2x disks for the operating system, 2x global hot spares, and 44x disks arranged in 4x RAID-6 (RAID-Z2) groups making one large pool. That configuration provides really good reliability and performance while maximizing available disk space.
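Such a pool could be created roughly as in the sketch below; the device names are hypothetical and would have to match the actual 44 data disks and 2 spares on the box:

    # 4x RAID-Z2 groups of 11 disks each plus 2 global hot spares (device names are made up)
    zpool create backup \
        raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 \
        raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 \
        spare  c4t0d0 c4t1d0

    zpool status backup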


The Architecture

A server with lots of storage should be deployed with a relatively large network pipe, either by aggregating a couple of GbE links or by using a 10GbE card. A second, identical server is deployed in a remote location and asynchronous replication is set up between the two. The moment there is a need for more storage, another pair of servers is deployed in the same manner as the first one. The new pair is entirely independent from the previous one and can be built on different HW (whatever is best at the time of purchase), and additional clients are then added to the new servers. This provides horizontal scaling for all components (network, storage, CPU, ...) and the most cost/performance effective solution moving forward.

Additionally, a legacy backup client can be installed on one of the servers in each pair to provide tape backups for selected or all data. The advantage is that only one client license is required per pair instead of hundreds - one for each client. This alone can lower licensing and support costs considerably.


The Backup Tool

The most important part of the backup solution is the tool to manage backups. Sysadmins shouldn't play directly with underlying technologies like ZFS, rsync or other utilities - they should be given a tool which hides all the complexities and allows them to perform 99% of backup related operations. It not only makes the platform easier to use but also minimizes the risk of people making a mistake. Sure, there will be some bugs in the tool at the beginning, but every time a bug is fixed it is for the benefit of all backups and the same issue shouldn't happen again. Here are some design goals and constraints for the tool:
  • has to be written in a language most sysadmins are familiar with
  • scripting language is preferred
  • has to be easy to use
  • has to hide most implementation details
  • has to protect from common failures
  • all common backup operations should be implemented (backup, restore, retention policy, archiving, deleting backups, listing backups, etc.)
  • it is not about creating a commercial product(*)
(*) the tool is only for internal use, so it doesn't have to be modular and it doesn't need to implement all features as long as the most common operations are covered - one of the most important goals here is to keep it simple so it is easy to understand and maintain by sysadmins

Given the above requirements it has been written in BASH with only a couple of external commands like date, grep, etc. - all of them common tools all sysadmins are familiar with and all of them standard tools delivered with OpenSolaris. The entire tool is just a single file with one required config file and another one which is optional.
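To give a flavour of the split between the tool and its configuration, here is a purely hypothetical example of what the required config file might contain - the file name and variables are illustrative, not the actual ones:

    # /etc/backup.conf - hypothetical example
    POOL="backup"                      # ZFS pool holding all client filesystems
    CLIENTS_FS="backup/clients"        # parent dataset, one child filesystem per client
    DEFAULT_RETENTION_DAYS=30          # global retention policy
    RSYNC_OPTS="-a --delete --numeric-ids"
    LOG_DIR="/var/log/backup"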

All basic operations have already been implemented except for archiving (WIP) and restore. I left the restore functionality as one of the last features to implement because it is relatively easy, and for the time being the tricky part is to prove that the backup concept actually works and scales to lots of different clients. In the meantime, if a file or set of files needs to be restored, all team members know how to do it - it is, after all, just copying file(s) from a read-only filesystem (snapshot) on one server to another, be it with rsync, NFS, tar+scp, ... At some point (sooner rather than later) the basic restore functionality will need to be implemented to minimize the possibility of a sysadmin making a mistake while restoring files.
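For reference, such a manual restore is straightforward because every snapshot is exposed read-only under the .zfs/snapshot directory of the client's filesystem; the client, dataset and snapshot names below are made up:

    # List the available snapshots (backups) for a client:
    ls /backup/clients/web01.example.com/.zfs/snapshot/

    # Copy a lost directory from a chosen snapshot back to the client:
    rsync -a \
        /backup/clients/web01.example.com/.zfs/snapshot/2009-02-14_0200/etc/ \
        root@web01.example.com:/etc/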

Another feature to be tested soon is replication of all or selected clients to the remote server. There are several ways to implement it, with rsync and zfs send|recv looking like the best choices. I prefer zfs send|recv as it should be much faster than rsync in this environment.
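A minimal sketch of the zfs send|recv variant, assuming root ssh between the pair and made-up dataset and snapshot names - after an initial full send, every subsequent run only ships the increment between the last two snapshots:

    # Initial full replication of a client's filesystem to the remote server:
    zfs send backup/clients/web01.example.com@2009-02-14_0200 | \
        ssh backup2.example.com zfs recv -F backup/clients/web01.example.com

    # Later runs send only the delta between the previous and the latest snapshot:
    zfs send -i @2009-02-14_0200 backup/clients/web01.example.com@2009-02-15_0200 | \
        ssh backup2.example.com zfs recv backup/clients/web01.example.com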


The End Result

Given the time and resource constraints it is more of a proof of concept or a prototype which has already become a production tool, but the point is that after over a month in production it seems to be working pretty well so far, with well over 100 clients in regular daily backup and more clients being added every day. We are keeping a close eye on it and will add more features if needed in the future. It is still work in progress (and probably always will be) but it is good enough for us as a replacement for most backups. I will post some examples of how to use the tool in another blog entry soon.

It is an in-house solution which will have to be supported internally - but we believe it is worth it and it saves a lot of money. It took only a couple of weeks to come up with a working solution, and hopefully there won't be much to fix or add after a while, so we are not expecting the cost of maintaining the tool to be high. It is also easier and cheaper to train people on how to use it compared to commercial software. That doesn't mean such an approach is best for every environment - it is not. But it is a compelling alternative for many environments. Only time will tell how it works in the long term - so far so good.

Open Source and commodity hardware being disruptive again.

7 comments:

  1. Would you be willing to share your thoughts / ideas about deploying an rsync based backup infrastructure for Windows servers ? I have recently moved 60 TB of backup data to Solaris + ZFS but have yet to find a decent alternative to CIFS for our Windows servers ...

  2. native cifs support is in the kernel or you can use samba.

    search docs.sun.com !

    you might want to ask on one of the mailing lists as well.

    regards,
    Andreas

  3. You should probably use robocopy instead of rsync on Windows, or else you won't be backing up the NTFS ACLs. I'm pretty sure the new Solaris CIFS server supports ACLs, so there shouldn't be a problem on that side.

    Also, you need some kind of VSS strategy or else you will never back up the registry hives or any other locked files. You can use the vshadow tool to create the shadow copy and then use robocopy to send the files.

  4. bacula
    http://www.bacula.org/en/

  5. very creative! unfortunately it looks sketchy if you don't build enough redundancy into the layers, or do you have enough bandwidth to push all that data? If your project is successful it may be hard to scale.

    I am attempting to make opensolaris mount multiple CIFS shares and rsync incremental backups between these CIFS mounted volumes. I am running into bugs like "Error: value too large for defined data type" when I attempt these sorts of things with rsync.

    I may yet give up, make everything ZFS filesystems, and have the "clients" mount the CIFS data from the backend ZFS servers. Sort of like a poor-mans NAS. At least with ZFS I can snapshot and version to infinity, and I can get "fast" backup to tape on the "ZFS backend".

  6. Re redundancy and scalability - well, the pool is protected with dual-parity RAID and then all (or selected) backups/archives are replicated to another server in a different data center. So even if the entire pool is lost you still have a copy of all your backups in a remote location. From a scalability point of view - I'm doing link aggregation on each server, and additionally, thanks to the incremental approach, bandwidth requirements are not that high. Then remember that in order to scale you add another server (a pair of them for additional redundancy), which scales storage, network, CPU, etc...

  7. That is a great idea and would give my old hard drives and even computers life again. Great post.
