Robert Milkowski's blog: February 2009

Tuesday, February 24, 2009

ZFS in the Trenches

A very good presentation by Ben Rockwood - an excellent place to start with ZFS tuning.

Monday, February 23, 2009

BART on large file systems

I wanted to use bart(1M) to quickly compare contents of two file systems. But it didn't work...


# bart create /some/filesystem/
! Version 1.0
! Monday, February 23, 2009 (11:53:57)
# Format:
#fname D size mode acl dirmtime uid gid
#fname P size mode acl mtime uid gid
#fname S size mode acl mtime uid gid
#fname F size mode acl mtime uid gid contents
#fname L size mode acl lnmtime uid gid dest
#fname B size mode acl mtime uid gid devnode
#fname C size mode acl mtime uid gid devnode
#

And it simply exits.


# truss bart create /some/filesystem/
[...]
statvfs("/some/filesystem", 0x08086E48) Err#79 EOVERFLOW
[...]
#

It should probably be statvfs64(), but let's check what's going on.


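# cat statvfs.d
#!/usr/sbin/dtrace -Fs

/* Start tracing when bart enters the statvfs system call. */
syscall::statvfs:entry
/execname == "bart"/
{
 self->on=1;
}

/* Stop tracing when the system call returns. */
syscall::statvfs:return
/self->on/
{
 self->on=0;
}

/* Empty action: with -F this just prints the indented function name. */
fbt:::entry
/self->on/
{
}

/* For fbt return probes arg1 holds the function's return value. */
fbt:::return
/self->on/
{
 trace(arg1);
}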

# ./statvfs.d >/tmp/a &
# bart create /some/filesystem/
# cat /tmp/a
CPU FUNCTION                              
2  -> statvfs32
[...]
2    -> cstatvfs32                       
2      -> fsop_statfs                    
[...]
2      <- fsop_statfs                                       0
2    <- cstatvfs32                                         79
[...]
2  <- statvfs32                                            79
#

It should have used statvfs64() which should have used cstat64_32()

It seems that /usr/bin/bart is a 32-bit binary compiled without largefile(5) awareness. There is already a bug open for exactly this case - 6436517.

While DTrace wasn't strictly necessary here, it helped me see very quickly what was actually going on in the kernel and why the call fails, especially combined with a glance at the source thanks to OS OpenGrok. It helped to find the bug.

PS. The workaround in my case is to temporarily set a zfs/fs quota on /some/filesystem to 1TB - and then it works.
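A rough way to see why the quota helps: a non-largefile statvfs() reports block counts in 32-bit fields, so the call fails with EOVERFLOW once a count no longer fits. The sketch below assumes the file system reports its size in 512-byte units (my assumption, not something visible in the truss output), and the function name fits_statvfs32 is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: does a file system of a given size fit into the 32-bit block
# counters of a non-largefile statvfs()? The 512-byte unit is an assumption.

FRSIZE=512            # assumed reporting unit in bytes
MAX32=4294967295      # largest value of an unsigned 32-bit counter

# fits_statvfs32 SIZE_IN_BYTES -> prints "fits" or "EOVERFLOW"
fits_statvfs32() {
    local blocks=$(( $1 / FRSIZE ))
    if (( blocks <= MAX32 )); then
        echo "fits"
    else
        echo "EOVERFLOW"
    fi
}

TIB=$(( 1024 * 1024 * 1024 * 1024 ))
fits_statvfs32 "$TIB"             # 1TB quota: 2^31 units
fits_statvfs32 $(( 4 * TIB ))     # 4TB file system: 2^33 units
```

Under that assumption anything below 2TB fits, which would be consistent with the 1TB quota making bart work again.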

Friday, February 20, 2009

7210 Review

InfoWorld reviews the 7210 NAS appliance - the ultimate NetApp killer :)

Prawo Jazdy

BBC reports:

Details of how police in the Irish Republic finally caught up with the country's most reckless driver have emerged, the Irish Times reports.

He had been wanted from counties Cork to Cavan after racking up scores of speeding tickets and parking fines.
However, each time the serial offender was stopped he managed to evade justice by giving a different address.
But then his cover was blown.
It was discovered that the man every member of the Irish police's rank and file had been looking for - a Mr Prawo Jazdy - wasn't exactly the sort of prized villain whose apprehension leads to an officer winning an award.
In fact he wasn't even human.
"Prawo Jazdy is actually the Polish for driving licence and not the first and surname on the licence," read a letter from June 2007 from an officer working within the Garda's traffic division.

"Having noticed this, I decided to check and see how many times officers have made this mistake.
"It is quite embarrassing to see that the system has created Prawo Jazdy as a person with over 50 identities."
The officer added that the "mistake" needed to be rectified immediately and asked that a memo be circulated throughout the force.
In a bid to avoid similar mistakes being made in future relevant guidelines were also amended.
And if nothing else is learnt from this driving-related debacle, Irish police officers should now know at least two words of Polish.
As for the seemingly elusive Mr Prawo Jazdy, he has presumably become a cult hero among Ireland's largest immigrant population.

Thursday, February 19, 2009

The Backup Tool

In my previous blog entry I wrote an overview of an in-house backup solution which seems to be a good enough replacement for over 90% of the backups currently done by Netbackup in our environment. I promised to show some examples of how it actually works. I can't give you output from a live system, so I will show some examples from a test one. Let's go through a couple of examples then.

Please keep in mind that it is still more of a working prototype than a finished product (and it will most certainly stay that way to some extent).

To list all backups (I run this on an empty system)

# backup -l
CLIENT NAME                                                                     REFER   USED  RATIO  RETENTION

Let's run a backup for a client mk-archive-1.test


# backup -c mk-archive-1.test
Creating new file system archive-2/backup/mk-archive-1.test
Using generic rules file: /archive-2/conf/standard-os.rsync.rules
Using client rules file: /archive-2/conf/mk-archive-1.test.rsync.rules
Starting rsync
Creating snapshot archive-2/backup/mk-archive-1.test@rsync-2009-02-19_15:11--2009-02-19_15:14
Log file: /archive-2/logs/mk-archive-1.test.rsync.2009-02-19_15:11--2009-02-19_15:14
#

Above you can see that it uses two config files - one is a global file describing the includes/excludes applied to all clients, and the second is an include/exclude file for that specific client. In many cases you don't need to create the client file yourself - the tool will create an empty one for you.

Let's list all our backups then.


# backup -lv
CLIENT NAME                                                                     REFER   USED  RATIO  RETENTION
mk-archive-1.test                                                               1.15G  1.15G  1.75x    35 (global)
mk-archive-1.test@rsync-2009-02-19_15:11--2009-02-19_15:14                      1.15G      0  1.75x
#

The snapshot defines a backup, and I put the start and end date of the backup in its name.

If you want to schedule a backup from cron you do not need any verbose output - there is an option "-q" which keeps the tool quiet.


# backup -q -c mk-archive-1.test
#
# backup -lv
CLIENT NAME                                                                     REFER   USED  RATIO  RETENTION
mk-archive-1.test                                                               1.15G  1.16G  1.75x    35 (global)
mk-archive-1.test@rsync-2009-02-19_15:11--2009-02-19_15:14                      1.15G  6.63M  1.75x
mk-archive-1.test@rsync-2009-02-19_15:16--2009-02-19_15:16                      1.15G      0  1.75x
#

Now let's change the retention policy for the client to 15 days.


# backup -c mk-archive-1.test -e 15
#
# backup -lv
CLIENT NAME                                                                        REFER   USED  RATIO  RETENTION
mk-archive-1.uk.test                                                               1.15G  1.16G  1.75x    15 (local)
mk-archive-1.uk.test@rsync-2009-02-19_15:11--2009-02-19_15:14                      1.15G  6.63M  1.75x
mk-archive-1.uk.test@rsync-2009-02-19_15:16--2009-02-19_15:16                      1.15G      0  1.75x
#

To start the expiry process for old backups (not that there is anything to expire on this empty system...):


# backup -E

Expiry started on                       : 2009-02-19_17:21
Expiry finished on                      : 2009-02-19_17:21
Global retention policy                 :    35
Total number of deleted backups         :     0
Total number of preserved backups       :     0
Log file                                : /archive-2/logs/retention_2009-02-19_17:21--2009-02-19_17:21

You can also expire backups for all clients or for a specific client according to the global and client-specific retention policies, generate reports, list all currently active backups, etc. The current usage information for the tool looks like:


# backup -h

usage: backup {-c client_name} [-r rsync_destination] [-hvq]
     backup [-lvF]
     backup [-Lv]
     backup {-R date} [-v]
     backup {-E} [-v] [-n] [-c client_name]
     backup {-e days} {-c client_name}
     backup {-D backup_name} [-f]
     backup {-A} {-c client_name} [-n] [-f] [-ff]

This script starts remote client backup using rsync.

OPTIONS:
 -h      Show this message
 -r      Rsync destination. If not specified then it will become Client_name/ALL/
 -c      Client name (FQDN)
 -v      Verbose
 -q      Quiet (no output)
 -l      list all clients in a backup
         -v will also include all backups for each client
         -vF will list only backups which are marked as FAILED
 -e      sets a retention policy for a client
         if number of days is zero then client retention policy is set to global
         if client_name is "global" then set a global retention policy
 -L      list all running backups
         -v more verbose output
         -vv even more verbose output
 -R      Show report for backups from a specified date ("today" "yesterday" are also allowed)
         -v list failed backups
         -vv list failed and successful backups
 -E      expire (delete) backups according to a retention policy
         -c client_name expires backup only for specified client
         -v more verbose output
         -n simulate only - do not delete anything
 -D      deletes specified backup
         -f forces deletion  of a backup - this is required to delete a backup if
         there are no more successful backups for the client
 -A      archive specified client - only one backup is allowed in order to achive client
         -c client_name - valid client name, this option is mandatory
         -n simulate only - do not archive anything
         -f deletes all backup for a client except most recent one and archives client
         -ff archives a client along with all backups
 -I      Initializes file systems within a pool (currently: archive-1)

EXAMPLES:

BACKUP
  In order to immediatelly start a backup for a given client:

    backup -c XXX.yyy.zz
    backup -r XXX.yyy.zz/ALL/ -c XXX.yyy.zz

  Above two commands are doing exactly the same - the first version is preffered.
  The 2nd version is useful when doing backups over ssh tunnel or via a dedicated backup interface
  when it is required to connect to different address that a client name. For example, in order
  to start a backup for a client XXX.yyy.zz t via ssh tunnel at localhost:5001 issue:

  backup -r localhost:5001/ALL/ -c XXX.yyy.zz

RETENTION POLICY

  backup -E               - expire backups according to retention policy
  backup -e 30 -c global  - sets global retention policy to 30 days
  backup -l               - list all clients in backup including their retention policy

Sunday, February 15, 2009

Disruptive Backup Platform

In many data center environments where commercial backup software like Legato Networker or Veritas Netbackup is deployed, it is mostly used to back up and restore files. For a minority of backups, special software such as RMAN for Oracle databases is used for better integration. In recent years nearline storage has become an interesting alternative to tapes - one of the outcomes is that all commercial backup software supports it one way or another.

But the real question is - do you actually need the commercial software, or would it be more flexible and more cost effective to build your backup solution on open source software and commodity hardware? In some cases the answer is obvious - for example a NAS server/cluster. It makes a lot of sense to set up another server, identical in hardware and storage, and replicate all your data to it. If you lose data on your main server you can restore it from the spare one, or make the spare one the live server, which makes your "restore" almost instantaneous compared to a full copy of the data. Later on, once you have fixed your main server, it can become the backup one. Not only does this give you MUCH quicker service restoration after data loss, but it will almost certainly be cheaper than a legacy solution based on commercial software, tape libraries, etc.

The above example is a rather special case - what about a more general approach? What about OS backups and data backups which do not require special co-operation with an application? In many environments that covers well over 90% of all backups. I would argue that a combination of some really clever open source software can provide a valuable, and even better, backup solution for those 90% of servers than the legacy approach. In fact some companies are already doing exactly that - one of them for example is Joyent (see slide 37), and there are others.

Recently I was asked to architect such a solution for my current employer. It was something we had had in mind here for some time but never got to. Until recently...

The main reason was cost saving - we figured out that we should be able to save a lot of money if we build a backup platform for the 90% of cases rather than extend the current legacy platform.

Before I started to implement a prototype I needed to understand the requirements and constraints. Here are the most important ones, in no particular order:

it has to be significantly more cost effective than legacy platform
it has to work on many different Unix platforms and ideally Windows
it has to provide a remote copy in a remote location
some data needs to end-up on tape anyway
only basic file backup and restore functionality - no bare metal restore
it has to be able to scale to thousands of clients
due to other projects, priorities and business requirements I need to implement a working prototype very quickly (couple of weeks) and I can't commit my resources 100% to it
it has to be easy to use for sysadmins

After thinking about it I came up with some design goals:

use only free and open source software which is well known
provide a backup tool which will hide all the complexities
the solution has to be hardware agnostic and has to be able to reliably utilize commodity hardware
the solution has to scale horizontally
some concepts should be implemented as close to commercial software as possible to avoid "being different" for no particular reason

Based on the above requirements and design goals I came up with the following implementation decisions. Notice that I omitted a lot of implementation and design details here - it's just an overview.

The Backup Algorithm

The idea is to automatically create a dedicated filesystem for each client and, each time data has been copied (backed up) to the filesystem, create a snapshot of it. The snapshot in effect represents a specific backup. Notice that the next time you run a backup for the same client it will be an incremental copy - in fact, once you have done the first full backup, all future backups will always be incremental. This should put less load on your network and your servers compared to most commercial backup software, where for all practical purposes you need to do a full backup on a regular basis (TSM being an exception here, but it has its own set of problems as a result).
In order to expire old backups, snapshots older than the global retention policy or a client-specific policy will be removed.

Backup:

create a dedicated filesystem (if not present) for a client
rsync data from the client
create a snapshot for the client filesystem after rsync finished (successfully or not)
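The three steps above could be sketched roughly as follows. This is not the actual tool: the function names are made up, the dataset prefix and snapshot naming are copied from the sample output in the follow-up post, and the rsync daemon module name "ALL" and the default mountpoint are assumptions. By default the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of the three backup steps. Dry-run by default: commands are printed,
# not executed. Export RUN="" to really run them (needs ZFS and rsync).

BACKUP_ROOT="archive-2/backup"   # dataset prefix as seen in the sample output
RUN="${RUN-echo}"

# snapshot_name CLIENT START END -> dataset@rsync-START--END
snapshot_name() {
    printf '%s/%s@rsync-%s--%s\n' "$BACKUP_ROOT" "$1" "$2" "$3"
}

# backup_client CLIENT START END
# START and END are passed in to keep the sketch deterministic; a real tool
# would read the clock before and after rsync.
backup_client() {
    local client="$1" start="$2" end="$3"
    # 1. dedicated file system (zfs create -p succeeds if it already exists)
    $RUN zfs create -p "$BACKUP_ROOT/$client"
    # 2. copy the data; module "ALL" and the mountpoint are assumptions
    $RUN rsync -a "$client::ALL/" "/$BACKUP_ROOT/$client/"
    # 3. snapshot regardless of how rsync finished
    $RUN zfs snapshot "$(snapshot_name "$client" "$start" "$end")"
}

backup_client mk-archive-1.test 2009-02-19_15:11 2009-02-19_15:14
```

Taking the snapshot even after a failed rsync keeps whatever was transferred; how the real tool marks such a backup as FAILED is not shown here.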

Retention policy:

check for global retention policy
check for a local (client specific) retention policy
delete all snapshots older than the local or global retention policy
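A minimal sketch of the retention decision, assuming snapshot names carry the end timestamp after "--" as in the tool's listings. The function names are hypothetical, and computing the cutoff date from the number of days is left out because date arithmetic differs between platforms. Nothing is deleted here; the selected names are only printed.

```shell
#!/usr/bin/env bash
# Sketch of the retention decision only; nothing is destroyed.

# effective_retention LOCAL_DAYS GLOBAL_DAYS
# A local value of 0 (or empty) means "use the global policy".
effective_retention() {
    local local_days="${1:-0}" global_days="$2"
    if (( local_days > 0 )); then
        echo "$local_days"
    else
        echo "$global_days"
    fi
}

# expired_snapshots CUTOFF
# Reads snapshot names on stdin and prints those whose end timestamp
# (the part after "--") is older than CUTOFF. Timestamps of the form
# YYYY-MM-DD_HH:MM compare correctly as plain strings.
expired_snapshots() {
    local cutoff="$1" snap end
    while IFS= read -r snap; do
        end="${snap##*--}"
        if [[ "$end" < "$cutoff" ]]; then
            echo "$snap"
        fi
    done
}

effective_retention 15 35     # local policy set: 15
effective_retention 0 35      # no local policy: falls back to 35
printf '%s\n' \
    'c@rsync-2009-01-01_02:00--2009-01-01_02:10' \
    'c@rsync-2009-02-19_15:11--2009-02-19_15:14' |
    expired_snapshots 2009-02-04_00:00
```

Each name printed by expired_snapshots would then be a candidate for zfs destroy.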

Software

Because we are building an in-house solution it has to be easy for sysadmins to maintain. This means I should use well-known and proven technologies that sysadmins are familiar with.

I chose rsync for file synchronization. It has been available for most Unix (and Windows) platforms for years and most sysadmins are familiar with it. It is also being actively developed. The important thing here is that rsync is used only for file transfer, so if another tool becomes more convenient in the future it should be very easy to switch to it.

When it comes to the OS and filesystem choice it is almost obvious: Open Solaris + ZFS. There are many very good reasons why and I will try to explain some of them. When you look at the above requirements again you will see that we need an easy way to quickly create lots of filesystems on demand while using a common storage pool, so you don't have to worry about pre-provisioning storage for each client - all you care about is whether you have enough storage for the current set of clients and retention policy. Then you need a snapshotting feature which has minimal impact on performance, scales to thousands of snapshots and, again, doesn't require you to pre-provision any dedicated space for snapshots, as you don't know in advance how much disk space you will need - it has to be very flexible and it shouldn't impose any unnecessary restrictions. Ideally the filesystem should also support transparent compression so you can save some disk space. It would be perfect to have transparent deduplication as well. Additionally the filesystem should be scalable in terms of block sizes and number of inodes - you don't want to tweak each filesystem for each client separately - it should just work by dynamically adjusting to your data. Another important feature of the filesystem should be some kind of data and metadata checksumming. This is important because the idea is to utilize commodity hardware, which is generally less reliable. The only filesystem which provides all of the above features (except for dedup) is ZFS.

Hardware

Ideally the hardware of choice should have a few desired features: as low a $/GB ratio as possible (for the whole solution: server+storage), it should be as compact as possible (TB per 1RU should be as high as possible), it should be able to sustain at least several hundred MB/s of write throughput, and it should provide at least a couple of GbE links. We settled on Sun x4500 servers, which deliver on all of these requirements. If you haven't checked them out yet, here is the basic spec: up to 48x 1TB SATA disk drives, 2x quad-core CPUs, up to 64GB of RAM, 4x on-board GbE - and all of this in 4U.

One can configure the disks in many different ways - I have chosen a configuration with 2x disks for the operating system, 2x global hot spares, and 44x disks arranged in 4x RAID-6 (RAID-Z2) groups making one large pool. That configuration provides really good reliability and performance while maximizing available disk space.
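As a quick sanity check of that layout, the disk accounting can be written down as a small calculation. The function name is made up, and the result is a count of data disks before any file system overhead, not a measured capacity.

```shell
#!/usr/bin/env bash
# Sketch: how many disks' worth of data space the described layout leaves.

# data_disks TOTAL OS_DISKS HOT_SPARES RAIDZ_GROUPS PARITY_PER_GROUP
data_disks() {
    local pool=$(( $1 - $2 - $3 ))        # disks left for the pool
    local per_group=$(( pool / $4 ))      # disks in each RAID-Z2 group
    echo $(( $4 * (per_group - $5) ))     # subtract parity in every group
}

# 48 disks, 2 for the OS, 2 hot spares, 4 groups, 2 parity disks per group
data_disks 48 2 2 4 2
```

With 1TB drives that is roughly 36TB of raw data space out of 48TB, with every group able to survive two disk failures.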

The Architecture

A server with lots of storage should be deployed with a relatively large network pipe, either by using link aggregation across a couple of GbE links or by using a 10GbE card. A second, identical server is deployed in a remote location and asynchronous replication is set up between them. The moment there is a need for more storage, one deploys another pair of servers in the same manner as the first. The new pair is entirely independent of the previous one and can be built on different HW (whatever is best at the time of purchase). Additional clients are then added to the new servers. This provides horizontal scaling for all components (network, storage, CPU, ...) and the most cost/performance effective solution moving forward.

Additionally a legacy backup client can be installed on one of the servers in a pair to provide tape backups for selected or all data. The advantage is that only one client license is required per pair instead of one for each of hundreds of clients. This alone can lower licensing and support costs considerably.

The backup tool

The most important part of the backup solution will be a tool to manage backups. Sysadmins shouldn't play directly with the underlying technologies like ZFS, rsync or other utilities - they should be provided with a tool which hides all the complexities and allows them to perform 99% of backup related operations. It will not only make the platform easier to use but will also minimize the risk of people making mistakes. Sure, there will be some bugs in the tool at the beginning, but every time a bug is fixed it benefits all backups and the issue shouldn't happen again. Here are some design goals and constraints for the tool:

has to be written in a language most sysadmins are familiar with
scripting language is preferred
has to be easy to use
has to hide most implementation details
has to protect from common failures
all common backup operations should be implemented (backup, restore, retention policy, archiving, deleting backups, listing backups, etc.)
it is not about creating a commercial product(*)

(*) the tool is only for internal use so it doesn't have to be modular, it doesn't need to implement all features as long as most common operations are covered - one of the most important goals here is to keep it simple so it is easy to understand and maintain by sysadmins

Given the above requirements it has been written in BASH with only a couple of external commands being used like date, grep, etc. - all of them are common tools all sysadmins are familiar with and all of them are standard tools delivered with Open Solaris. The entire tool is just a single file with one required config file and another one which is optional.

All basic operations have already been implemented except for archiving (WIP) and restore. I left the restore functionality as one of the last features to implement because it is relatively easy, and for the time being the tricky part is to prove that the backup concept actually works and scales for lots of different clients. In the meantime, if a file or set of files needs to be restored, all team members know how to do it - after all it is about copying files from a RO filesystem (snapshot) on one server to another - be it rsync, nfs, tar+scp, ... - they all know how to do it. At some point (sooner rather than later) the basic restore functionality will need to be implemented to minimize the possibility of a sysadmin making a mistake while restoring files.

Another feature to be tested soon is replication of all or selected clients to a remote server. There are several ways to implement it, of which rsync and zfs send|recv seem to be the best choices. I prefer the zfs send|recv solution as it should be much faster than rsync in this environment.
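For illustration, an incremental zfs send|recv replication of one client could look roughly like the pipeline printed below. This is only a sketch: the function name, host, dataset, snapshot and pool names are placeholders, transport over ssh is an assumption, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: build (and print) an incremental replication pipeline.
# All names passed in below are placeholders, not the real setup.

# replication_cmd DATASET FROM_SNAP TO_SNAP REMOTE_HOST REMOTE_POOL
replication_cmd() {
    printf 'zfs send -R -I %s@%s %s@%s | ssh %s zfs recv -d %s\n' \
        "$1" "$2" "$1" "$3" "$4" "$5"
}

replication_cmd archive-2/backup/mk-archive-1.test snapA snapB remote-host archive-3
```

The -I option sends all intermediate snapshots between the two named ones, so every backup taken since the last replication run arrives on the remote side.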

The End Result

Given the time and resource constraints it is more of a proof of concept or a prototype which has already become a production tool, but the point is that after over a month in production it seems to work pretty well so far, with well over 100 clients in regular daily backup and more clients being added every day. We are keeping a close eye on it and will add more features if needed in the future. It is still a work in progress (and probably always will be) but it is good enough for us as a replacement for most backups. I will post some examples of how to use the tool in another blog entry soon.

It is an in-house solution which will have to be supported internally - but we believe it is worth it and it saves a lot of money. It took only a couple of weeks to come up with a working solution and hopefully there won't be much to fix or implement after some time, so we are not expecting the cost of maintaining the tool to be high. It is also easier and cheaper to train people to use it compared to commercial software. That doesn't mean that such an approach is best for every environment - it is not. But it is a compelling alternative for many environments. Only time will tell how it works in the long term - so far so good.

Open Source and commodity hardware being disruptive again.

Intel X25-M and fragmentation

Friday, February 13, 2009

Transactional Memory in Rock - early tests

Monday, February 09, 2009

Open Storage at Digitar

Tuesday, February 03, 2009

SPEC SFS uselessness

When it comes to benchmarks one always needs to be very cautious and sceptical, not to mention that one should always ask the question - how is it relevant to my environment?
Then there are benchmarks which are basically useless for a customer - according to Bryan SPEC SFS is one of them. I never looked into details of the SFS benchmark before and his post is really an eye opener.

Monday, February 02, 2009

DTrace on Linux

I wonder if it ever will be included in main Linux distros - would be very nice.
On the other hand is SystemTap bound to fail? It's been years now and still no real progress...

ZFS send performance

Recently some ZFS related performance fixes were integrated into Open Solaris builds 102-105. I was mostly interested in 'zfs send' improvements (i.e. 6418042). Brent Jones reported an improvement 'on the order of 5-100 times faster' for replicating zfs file systems using zfs send|receive - that got my attention :)

I did a test on x4500 with a pool which is a 4x raidz2(11 disks each) + 2 hot spares.
I have many file systems there with relatively lots of files.
The test was to compare times of 'ptime zfs send -R -I A B >/dev/null' on build 101 and 105. I repeated the test several times to be sure that I'm getting consistent results. There are many snapshots between A and B and the output stream size is ~111GB. Additionally a file system with A and B snapshots has lzjb compression enabled.

Results are really good! On b101 it takes about 1010s on average to complete while on b105 it takes about 213s on average - that's a 4.7x improvement!
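The arithmetic behind those numbers, as a small sketch using integer math only. The function names are made up, and the throughput figures are derived from the approximate ~111GB stream size (taking 1GB = 1024MB), so they are rough.

```shell
#!/usr/bin/env bash
# Sketch: speedup and approximate throughput from the timings above.

# speedup_x10 OLD_SECONDS NEW_SECONDS -> speedup times ten, truncated
speedup_x10() { echo $(( $1 * 10 / $2 )); }

# throughput_mb STREAM_GB SECONDS -> MB/s, truncated (1GB = 1024MB)
throughput_mb() { echo $(( $1 * 1024 / $2 )); }

s=$(speedup_x10 1010 213)
echo "speedup: $(( s / 10 )).$(( s % 10 ))x"
echo "b101: ~$(throughput_mb 111 1010) MB/s"
echo "b105: ~$(throughput_mb 111 213) MB/s"
```

In other words, the same stream goes from a little over 100 MB/s on b101 to over 500 MB/s on b105.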