Archive for the ‘linux’ Category

How linux LVM snapshots work, part 1

February 12, 2010

Contents:

Introduction

Creating an example volume

Creating a snapshot

Data consistency

Usage

Introduction

Since everything I’ll show is based on linux LVM, If you haven’t used any LVM before, it is necessary to read LVM HOWTO to familiarize yourself with LVM concepts, and if you used one, but not on linux, it’s still good have a look there for available commands and naming.

Before we can talk about all the cool things that can be done with Oracle and LVM, we’ll need to talk about LVM snapshots themselves. Not only what they do, but also how they work.

And since in linux LVM is just a frontend (much more user-friendly) to a subsystem called Device Mapper, we’ll focus on that backend part more, because at some point we’ll reach the limits of LVM and will be forced to use device mapper only.

In part one we’ll create a small test environment to show how snapshots work and in part two I’ll discuss the internals and more specific snapshot features.

Creating an example volume

First, let’s create a demo volume and take a look at it.

[root@tullus ~]# lvcreate -n demo -L 100M vg00
Rounding up size to full physical extent 128,00 MB
Logical volume "demo" created

And since LVM is just a frontend to device mapper, let’s have a look what our command really did:
it created an entry in /dev/mapper directory named VolumeGroupName-LogicalVolumeName

[root@tullus ~]# ls -l /dev/mapper/*demo*
brw-rw---- 1 root disk 253, 11 feb 11 10:56 /dev/mapper/vg00-demo

This entry coresponds to the following map

[root@tullus ~]# dmsetup table /dev/mapper/vg00-demo
0 262144 linear 9:2 872677760

The fields mean:
0 – first block of device vg00-demo that the entry describes
262144 – number of blocks in the entry
linear – type of mapping – linear means 1:1 block mapping
9:2 – major and minor of the source device
872677760 starting block on the source device

So what this map really means? It means that block 0 of vg00 will really be block 872677760 of device 9:2 (/dev/md2 in my case) and so on. The map can contain many entries. If we extend our volume by another 128MB, we’ll see it

[root@tullus ~]# lvextend -L+128M /dev/vg00/demo /dev/hdb5
Extending logical volume demo to 256.00 MB
Logical volume demo successfully resized
[root@tullus ~]# dmsetup table vg00-demo
0 262144 linear 9:2 872677760
262144 262144 linear 3:69 628752768

So now there are two entries with the second saying, that 262144 blocks (128MB) of device vg00-demo starting from block 262144 are really mapped to device 3:69 starting from block 628752768 of /dev/hdb5 and so on.

We can now create a filesystem on it and put some contents there.

[root@tullus ~]# mkfs.ext3 -m 0 /dev/vg00/demo
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
65536 inodes, 262144 blocks
0 blocks (0.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
32 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729, 204801, 221185

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@tullus ~]# mkdir /mnt/demo
[root@tullus ~]# mount /dev/vg00/demo /mnt/demo
[root@tullus ~]# echo "This is the first file" > /mnt/demo/first
[root@tullus ~]# echo "This is the second file" > /mnt/demo/second
[root@tullus ~]# ls /mnt/demo -l
total 14
-rw-r--r-- 1 root root 23 feb 11 11:12 first
drwx------ 2 root root 12288 feb 11 11:11 lost+found
-rw-r--r-- 1 root root 24 feb 11 11:12 second

Creating a snapshot

When we have out test subject ready, we can briefly look at what snapshots do.

[root@tullus ~]# lvcreate -s -n demo_snap -l 1 /dev/vg00/demo
Logical volume "demo_snap" created
[root@tullus ~]# lvscan | grep demo
ACTIVE Original '/dev/vg00/demo' [256,00 MB] inherit
ACTIVE Snapshot '/dev/vg00/demo_snap' [64,00 MB] inherit

As we can see, the snapshot was created and is one extent in size. From the operating system point of view it’ll be a block device containing the information from /dev/vg00/demo from the moment the snapshot was taken.

Let’s mount it and make some changes to our files.

[root@tullus ~]# mkdir /mnt/demo_snap
[root@tullus ~]# mount /dev/vg00/demo_snap /mnt/demo_snap
[root@tullus ~]# df -hP | grep demo
/dev/mapper/vg00-demo 248M 11M 238M 5% /mnt/demo
/dev/mapper/vg00-demo_snap 248M 11M 238M 5% /mnt/demo_snap
[root@tullus ~]# ls -l /mnt/demo_snap/
total 14
-rw-r--r-- 1 root root 23 feb 11 11:12 first
drwx------ 2 root root 12288 feb 11 11:11 lost+found
-rw-r--r-- 1 root root 24 feb 11 11:12 second

The contents are the same – it even claims that there’s the same amount of free space as on the original volume, which may be illusive, but we’ll get to that.

[root@tullus ~]# echo "changing second file on snapshot" >> /mnt/demo_snap/second
[root@tullus ~]# echo "changing first file on original volume" >> /mnt/demo/first

So we changed one file on the original volume and the other on the snapshot. The results are:
[root@tullus ~]# cat /mnt/demo/first
This is the first file
changing first file on original volume
[root@tullus ~]# cat /mnt/demo_snap/first
This is the first file

[root@tullus ~]# cat /mnt/demo/second
This is the second file
[root@tullus ~]# cat /mnt/demo_snap/second
This is the second file
changing second file on snapshot

So we can see that both devices are independent from each other and changes made on one of them don’t show up on the other.

Data consistency

I won’t discuss this issue in details because this subject generally is quite broad, but a few words have need to be said. When taking a snapshot, from a filesystem’s point of view it’s like pulling the plug on the drive at that moment, so similar restrictions apply, since we mostly take snapshots of mounted filesystems. Therefore in the beginning it will be in an inconsistent state on the snapshot. For journaled filesystem like ext3 it is not a problem, since automatic recovery will take place upon mount, but i.e. ext2 will complain about it:
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended

It is (or at least I think so) safe if ext3 is mounted with data=ordered (default) or data=journal options, but I’m not sure what will happen when data=writeback is used – the man page for mount says that it may cause data (not internal metadata) consistency issues after a crash, so the same probably applies to snapshots, but for the same reason you probably don’t want to use it to store your database in the first place;)

Another thing worth thinking about is sync. I’m not sure, but I think that lvm syncs the volume before taking a snapshot, and personally I don’t call sync before creating snapshots, but it may be worth considering.

Summing up – when using default settings on Centos for ext3 (data=ordered) I’ve never had and data corruption problems so far,  but it’s always a good idea to have data consistency in mind.

Usage

Ok, so what are snapshots good for? Very many things

  • consistent filesystem/database backups – at least thats what I see people most often do
  1. Create a snapshot
  2. Take a backup based on a snapshot
  3. Delete a snapshot

It’s probably the reason snapshots were created in the first place, and it doesn’t even require them to be writeable – in fact in AIX they are read only by design, unfortunately.

This usage however implies that snapshots have mostly short lifespan and don’t contain important data. It is probably the reason snapshot handling in LVM lacks few features that could be very useful. Most painful is that you can’t move a snapshot using pvmove, you can’t mirror a snapshot and they don’t grow automatically when they fill up, so if they fill up, they will fail, but in those usage cases it won’t matter anyway. You also can’t copy a snapshot and take a snapshot of a snapshot (but the other is device mapper limit, not LVM).

But since in linux they are writeable, more interesting things can be done with them, but that is also when lack of above mentioned features becomes a pain…

to be continued…