Posted: 2010-07-18 16:41:36 by Alasdair Keyes
Working in hosting means that I often have to deal with large amounts of data, either as customer files on a shared hosting platform or as VHDs on a virtual platform.
Storing lots of data in any form is a pain. It's unwieldy, hard to back up and hard to move, and storage costs a lot: not only the money spent on disks but, if it isn't done in a smart way, the cabinet space and the extra cooling and power for servers that exist just for their storage capacity. With the recent arrival of cheap SAN hardware with iSCSI networking, data storage for small companies is getting easier. However, the amount of storage used by any company always grows faster than you'd like.
Many SAN providers offer deduplication, which can save a lot of space. But do we really need to buy SAN technology to get the benefits of deduplication? Many companies like ours have lots of servers and lots of data that we would certainly like to slim down. I started investigating Open Source deduplication and found a couple of contenders.
First of all is lessfs. lessfs is an Open Source deduplication filesystem written using FUSE. I thought I'd give it a go, and it seems pretty good, so I thought I'd write a quick tutorial on how to use it. For this demo I used Ubuntu 10.04 LTS x86_64. I chose this over my favourite, CentOS, as its software and libraries are far more current, which I found helped when installing dependencies. It was installed into a VMware VM with a 3GB ext3 / partition and a 10GB ext3 partition mounted on /data.
Install Ubuntu 10.04
Update the OS and reboot
root@dedup:~# apt-get update && apt-get upgrade && reboot
Install dependencies required by lessfs
root@dedup:~# apt-get install gcc tokyocabinet-bin libtokyocabinet-dev libmhash-dev pkg-config libfuse-dev zlib1g-dev libbz2-dev
Fetch the latest lessfs source from http://sourceforge.net/projects/lessfs/files/
root@dedup:~# wget "http://downloads.sourceforge.net/project/lessfs/lessfs/lessfs-1.1.3/lessfs-1.1.3.tar.gz?use_mirror=ovh&ts=1279236748"
Extract the files
root@dedup:~# tar zxf lessfs-1.1.3.tar.gz
Configure, make and install lessfs
root@dedup:~# cd lessfs-1.1.3
root@dedup:~# ./configure && make && make install
Copy the sample config file from etc/lessfs.cfg into your system /etc
root@dedup:~# cp -a etc/lessfs.cfg /etc/
Because we have used the default lessfs location of /data, we don't need to touch the config file (unless you wish to tune it in any way). We just have to create a couple of directories that it expects, plus a mount point. I've chosen /mnt/less
root@dedup:~# mkdir -p /data/mta
root@dedup:~# mkdir -p /data/dta
root@dedup:~# mkdir -p /mnt/less
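If you'd rather check which paths the config file actually refers to before creating directories, something like the following rough sketch may help. The helper name lessfs_paths is my own invention, and it makes no assumption about the setting names in lessfs.cfg: it just prints every non-comment line that mentions the base directory.

```shell
# Hypothetical helper: list the active (non-comment) lines of a lessfs config
# that mention a given base directory, to see which directories must exist.
# Usage: lessfs_paths [config-file] [base-dir]
lessfs_paths() {
    cfg=${1:-/etc/lessfs.cfg}
    base=${2:-/data}
    # Drop comment lines, then keep only lines that mention the base directory.
    grep -v '^[[:space:]]*#' "$cfg" | grep -F "$base"
}
```

Running lessfs_paths with no arguments looks at /etc/lessfs.cfg and /data, which matches the defaults used in this walkthrough.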
Now mount your fuse file system into /mnt/less
root@dedup:~# lessfs /etc/lessfs.cfg /mnt/less
If all goes well you'll see no output and be returned to your prompt. Run mount to check that it's mounted
root@dedup:~# mount | grep less
lessfs on /mnt/less type fuse.lessfs (rw,nosuid,nodev,max_read=131072,default_permissions,allow_other)
We're in business! You can now write data to /mnt/less
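If you want to script that check, for example before a backup job writes to the mount, here is a minimal sketch. The function name is_lessfs_mounted is hypothetical, and it assumes the filesystem type shows up as fuse.lessfs in /proc/mounts, as it does in the mount output above.

```shell
# Hypothetical helper: succeed if the given mount point is listed as a
# fuse.lessfs mount. Reads /proc/mounts by default; a second argument can
# name another file in the same format (device, mount point, type, ...).
is_lessfs_mounted() {
    mountpoint=$1
    table=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "fuse.lessfs" { found = 1 }
        END { exit !found }
    ' "$table"
}
```

A typical use would be `is_lessfs_mounted /mnt/less || lessfs /etc/lessfs.cfg /mnt/less`, which only mounts if nothing is there yet.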
We could stop there, but it's worth investigating the power of deduplication and seeing how lessfs stacks up. lessfs creates a stats file in your mount at .lessfs/lessfs_stats. It shows how much space each file takes up and how much is actually stored after its deduplication. Let's have a look at the file
root@dedup:~# cat /mnt/less/.lessfs/lessfs_stats
INODE SIZE COMPRESSED_SIZE FILENAME
7 0 0 lessfs_stats
Let's check how much data is being used by lessfs at the moment
root@dedup:~# du -hs /data
64M /data
This space is just lessfs's database and other storage mechanisms. We'll write some data and see how it goes. Let's create a 200MB file full of zeros. This should be easily dedup'd as it's all identical.
root@dedup:~# dd if=/dev/zero of=/mnt/less/zero bs=1M count=200
root@dedup:~# ls -alh /mnt/less/zero
-rw-r--r-- 1 root root 200M 2010-07-16 19:40 /mnt/less/zero
You can see that the system sees it as using 200MB. What does lessfs think we've used?
root@dedup:~# head /mnt/less/.lessfs/lessfs_stats
INODE SIZE COMPRESSED_SIZE FILENAME
7 0 0 lessfs_stats
8 209715200 1628 zero
That's pretty good, 200MB compressed into about 1.6KB! But lessfs could be lying. As we know the actual data is stored in /data, how much has that grown?
root@dedup:~# du -hs /data
64M /data
Absolutely nothing, it's looking good. It's worth storing some real-world data...
root@dedup:~# scp -r 192.168.1.1:/Aerosmith /mnt/less/
This will allow some dedup, but let's see what happens if we store the same data again in a different folder
root@dedup:~# scp -r 192.168.1.1:/Aerosmith /mnt/less/Aerosmith_copy
root@dedup:~# du -hs /mnt/less/Aerosmith/
249M /mnt/less/Aerosmith/
root@dedup:~# du -hs /mnt/less/Aerosmith_copy/
249M /mnt/less/Aerosmith_copy/
That's 500 MB we've copied to that filesystem, yet we're only using 308MB in total in the /data folder
root@dedup:~# du -hs /data/
308M /data/
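To put a rough number on that, here's the arithmetic as a small shell sketch. The function name dedup_saving is made up for this example, and the figures are the rounded du values from this run, so treat the result as approximate.

```shell
# Hypothetical helper: print how much space was saved, given the logical size
# of the data written and the growth of the backing store (same unit for both).
dedup_saving() {
    logical=$1
    stored=$2
    saved=$((logical - stored))
    percent=$((saved * 100 / logical))   # integer division, rounds down
    echo "saved ${saved} of ${logical} (${percent}%)"
}

# Figures from this run: 2 x 249MB written, /data grew from 64MB to 308MB.
dedup_saving $((249 * 2)) $((308 - 64))
# prints: saved 254 of 498 (51%)
```

In other words, the second copy cost next to nothing, and nearly all of the growth in /data comes from storing the MP3s once.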
If we check the lessfs_stats file, we actually seem to see a slight increase in compressed size over the original size. I'm not sure if this is a calculation issue or one related to block size etc. However, you wouldn't expect much gain from compressing files that are already compressed, such as MP3s, anyway.
45 3727130 3727159 08 - Cryin'.mp3
46 2942524 2942547 05 - Don't Stop.mp3
47 3301450 3301476 01 - Eat The Rich.mp3
The interesting bit is when we check the section related to the duplicate set of files...
131 3727130 0 08 - Cryin'.mp3
132 2942524 0 05 - Don't Stop.mp3
133 3301450 0 01 - Eat The Rich.mp3
It is deduplicating them perfectly. It has worked out that the files are copies, and the copies use little to no space on disk.
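Rather than eyeballing individual rows, you can total the two size columns. This is only a sketch: lessfs_totals is a name I've made up, and it assumes the stats file has one record per line in the order INODE SIZE COMPRESSED_SIZE FILENAME, as in the output shown here.

```shell
# Hypothetical helper: sum the SIZE and COMPRESSED_SIZE columns of a
# lessfs_stats-style file. Assumes one record per line in the order
# INODE SIZE COMPRESSED_SIZE FILENAME. Filenames may contain spaces,
# since only fields 2 and 3 are read.
lessfs_totals() {
    stats=${1:-/mnt/less/.lessfs/lessfs_stats}
    awk '
        # Only count rows whose size fields are numeric (skips the header).
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            size += $2
            stored += $3
        }
        # %.0f avoids integer overflow in awk builds with a 32-bit %d.
        END { printf "logical=%.0f stored=%.0f\n", size, stored }
    ' "$stats"
}
```

Calling lessfs_totals with no arguments reads the stats file under /mnt/less and prints both totals in bytes.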
There are plenty of options available with lessfs. Tweaking the config file allows for encryption, dynamic defrag, transaction logging (to enable recovery after a crash) and a choice of compression level by using different compression systems (BZIP, GZIP, Deflate etc).
As with anything like this, I wouldn't use it on data I couldn't afford to lose until I was very sure it was stable. However, it shows plenty of promise for what lies ahead in the world of data storage.