OS Deduplication with LessFS

Posted: 2010-07-18 16:41:36 by Alasdair Keyes

Direct Link | RSS feed


Working in hosting means that I often have to deal with large amounts of data. Either in customer files on a shared hosting platform or as VHDs on a Virtual platform.

Storing lots of data in any form is a pain, it's unweildy, hard to backup, hard to move and storage costs a lot, not only for money on buying disks, but if not done in a smart way, takes up cabinet space and extra money in cooling/powering servers just for their storage capacity. With the recent evolution of cheap SAN hardware with iSCSI networking data storage for small companies is getting easier, however the amount of storage being used by any company is always increasing faster than you'd like.

Many SAN providers have deduplication which can save a lot of space, however, do we really need to buy SAN technology to get the benefits of deduplication? Many companies like ours have lots of server and lots of data which we would certainly like to slim down on. I started investigating into Open Source deduplication and found a couple of contenders.

First of all is lessfs. LessFS is an Open Source deduplication filesystem written using FUSE. I thought I'd give it a go and it seems pretty good so I thought I'd run a quick tutorial on how to use it. For this demo, I used Ubuntu 10.04 LTS x86_64. I chose this over my favourite choice of CentOS as it's software and libraries are far more current which I found would help in the installing of dependencies. This was installed into a VMware VM with a 3GB / ext3 partition and a 10GB ext3 partition mounted onto /data

We could stop there but it's worth investigating the power of Deduplication and seeing how lessfs stacks up. Now lessfs creates a stats folder in your mount under .lessfs/lessfs_stats it shows how much space has been used to store files and how much has been saved using it's deduplication, lets have a look at the file

root@dedup:~# cat /mnt/less/.lessfs/lessfs_stats
  INODE             SIZE  COMPRESSED_SIZE  FILENAME
      7                0                0  lessfs_stats

Lets check how much data is being used by lessfs at the moment

root@dedup:~#  du -hs /data
64M      /data

This space is just lessfs's database and other storage mechanisms. We'll write some data and see how it goes. Lets create a 200MB file full of zeros, this should be easily dedup'd as it's all identical.

root@dedup:~# dd if=/dev/zero of=/mnt/less/zero bs=1M count=200
root@dedup:~# ls -alh /mnt/less/zero
-rw-r--r-- 1 root root 200M 2010-07-16 19:40 /mnt/less/zero

You can see that the system sees it using 200MB, what does less fs think we've used

root@dedup:~# head /mnt/less/.lessfs/lessfs_stats
  INODE             SIZE  COMPRESSED_SIZE  FILENAME
      7                0                0  lessfs_stats
      8        209715200             1628  zero

That's pretty good, 200MB compressed into 1K! But lessfs could be lying, as we know the actual data is stored in /data, how much has that grown.

root@dedup:~# du -hs /data
64M      /data

Absolutely nothing, it's looking good. It's worth storing some real-world data...

root@dedup:~# scp -r 192.168.1.1:/Aerosmith /mnt/less/

This will allow some dedup, but lets see what happens if we store the same data again into a different folder

root@dedup:~# scp -r 192.168.1.1:/Aerosmith /mnt/less/Aerosmith_copy
root@dedup:~# du -hs /mnt/less/Aerosmith/
249M    /mnt/less/Aerosmith/
root@dedup:~# du -hs /mnt/less/Aerosmith_copy/
249M    /mnt/less/aerosmith_copy/

That's 500 MB we've copied to that filesystem, yet we're only using 308MB in total in the /data folder

root@dedup:~# du -hs /data/
308M    /data/

If we check the lessfs_stats file, we actually seem to see an increase in size vs compressed size. I'm not sure if this is a calculation issue or one related to block size etc. However, you wouldn't expect much with compressed files anyway.

     45          3727130          3727159  08 - Cryin'.mp3
     46          2942524          2942547  05 - Don't Stop.mp3
     47          3301450          3301476  01 - Eat The Rich.mp3

The interesting bit is when we check the section related to the duplcate set of files...

    131          3727130                0  08 - Cryin'.mp3
    132          2942524                0  05 - Don't Stop.mp3
    133          3301450                0  01 - Eat The Rich.mp3

it is deduplicating it perfectly. It has worked out that the files are copies and the copies using little to no space on disk.

There are plenty of options available with lessfs, tweaking the config file allows for encryption, dynamic defrag and transaction logging (to enable recovery after a crash) and the level of compression by using different compression systems (BZIP, GZIP, Deflate etc).

As with anything like this, I wouldn't use it on data I couldn't afford to lose until I was very sure it was stable. However, it shows plenty of promise for what lies ahead in the world of data storage.


If you found this useful, please feel free to donate via bitcoin to 1NT2ErDzLDBPB8CDLk6j1qUdT6FmxkMmNz

IT Consultancy Services

I'm now available for IT consultancy and software development services - Cloudee LTD.



Happy user of Digital Ocean (Affiliate link)


Version:master-619e08f203


Validate HTML 5