Backups and archives

Do you keep backups of your data?

If not, go and make one now, and then come back and read the rest of this article.

If you do, do you really have backups (plural)? Are those backups stored on reliable media (eg: a Raid system)? Are they stored in different geographical locations? If you use tape or optical disks for your backups, do you have any real idea how long the data on them will remain readable?

Finally, when was the last time you checked that you could successfully restore a file, a directory, a file system or an entire machine from backup?

Anyway, this article is more about archives, by which I mean a series of backups kept on a regular basis and either never deleted, or at least only thinned-out after some period of time.

Archives need not take up the amount of space you might think.

The simple approach to creating a daily archive would be to take a backup, take another backup the day afterwards, then take another backup the day after that, and so on. This requires N times the amount of storage of your original data, for N days' worth of archives. That would, for most people, very quickly become very expensive in terms of storage media.

However, there is a better way (assuming you're using Linux to store the backups on, and you're using Ext3, Ext4 or some other file system which supports hard links).

Consider the following bash script:

archive.sh
#!/bin/bash

# Where the archive lives on this (backup) server
arcdir=/home/user/Archive

# Yesterday's date, split into YYYY/MM and DD
yearmonth=$(date -d yesterday +%Y/%m)
day=$(date -d yesterday +%d)

# The day before yesterday, as YYYY/MM/DD
previous=$(date -d "-2 days" +%Y/%m/%d)

# Mirror the live data into the archive's working copy ($arcdir/user)
rsync -a --exclude=lost+found --delete FileServer:/home/user "$arcdir"

# Create this month's directory if it doesn't already exist
mkdir -p "$arcdir/$yearmonth"

# Hard-link (not copy) the working copy into this month's directory...
cp -al "$arcdir/user" "$arcdir/$yearmonth"

# ...and rename it to yesterday's day-of-month
mv "$arcdir/$yearmonth/user" "$arcdir/$yearmonth/$day"

# Report which files changed since the previous day's copy (cron emails this)
[ -e "$arcdir/$previous" ] && diff -qr --no-dereference "$arcdir/$previous" "$arcdir/$yearmonth/$day"

This script, which is run on a backup server separate from the main machine holding your files (referred to as FileServer in the script), performs the following tasks:

  1. Sets up a few variables such as the path to the archive and some dates in useful formats
  2. Performs an rsync with delete from the current files stored on the backup server to the main archive directory
    • This creates an identical copy of the files as they currently are
  3. Creates a subdirectory under the main archive path named YYYY/MM for the current year and month
  4. Copies using hard links (the -l option to cp) from the local copy of all the files to the YYYY/MM directory, creating a directory YYYY/MM/user
  5. Renames this new directory to YYYY/MM/DD using yesterday's date (it's assumed that this script will run as a cron job in the early hours of the morning, as in the example crontab entry after this list, and therefore what is being created is a copy of the files as they were at the end of the previous day)
  6. Performs a recursive diff on the files between yesterday and the day before
    • This step can be omitted if you don't want to receive a daily email telling you which files are new, which have been deleted, and which have been changed
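
A crontab entry on the backup server along the following lines (the path to the script is my assumption) would do the early-morning scheduling mentioned in step 5, running the script at 03:30 every day, and cron will then email you the output of the diff in step 6:

30 3 * * * /home/user/bin/archive.sh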

The important part here is the hard links. This means that two filenames in different directories (for example 2022/09/23/personal/CV.txt and 2022/09/24/personal/CV.txt) can point to a single copy of the actual file stored on disk (assuming it hasn't changed between the daily rsyncs).
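
You can see this sharing for yourself: ls -i prints a file's inode number, and stat -c %h its link count. A quick sketch (the file names, and the inode number shown, are purely illustrative):

$ echo hello > original.txt
$ cp -l original.txt linked.txt
$ ls -i original.txt linked.txt
1234567 linked.txt
1234567 original.txt
$ stat -c %h original.txt
2

Both names refer to the same inode, so the data exists on disk only once.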

Ext4 supports up to 65000 hard links to a single file, so if you take a daily backup of a file which never changes, you will encounter a "too many hard links" problem after about 178 years. I don't regard that as a problem.
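
If you want to check the limit for the file system holding your own archive, getconf can query it (the path here is just an example), and stat -c %h on any archived file shows how many links it has accumulated so far:

$ getconf LINK_MAX /home/user/Archive
65000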

The filenames themselves do take a small amount of space, but nothing like the amount of space taken up by the files. Here's an example:

$ du -sh /home/user/Archive/user
33G     /home/user/Archive/user
$ du -sh /home/user/Archive
83G     /home/user/Archive

So, 33 Gbytes of the 83 Gbytes under Archive is the current copy of the data, and because we started off on day one by doing a cp to that day's date, we can assume that another 33 Gbytes is used there. This means that the "archive" aspect of all this (individually dated copies of the files as they were on each day) is using 83 - 33 - 33 = 17 Gbytes.

When I mention that this Archive directory contains 645 daily copies of the data (ie: 17 Gbytes / 645 = 27 Mbytes per day on average) I think you can see that this is a pretty space-efficient way of storing what is effectively a copy of your data exactly as it was on each day going back in time.
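
If you want to work out the same average for your own archive, counting the dated directories is enough (the glob assumes the YYYY/MM/DD layout created by the script above):

$ ls -d /home/user/Archive/[0-9][0-9][0-9][0-9]/*/* | wc -l
645

Divide the overhead figure you calculated from du by that count and you have your per-day cost.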

If the space taken up really does become too much, you can safely delete any days' copies you wish (perhaps 6 out of 7 to turn it into a weekly copy for stuff older than 6 months, for example) without losing anything other than data which only ever existed on the days you delete.
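
Here's a minimal sketch of such a thinning; the 6-month cutoff, the choice of keeping Sundays, and the paths are all assumptions, and the rm is deliberately behind an echo so you can dry-run it first:

#!/bin/bash

arcdir=/home/user/Archive
cutoff=$(date -d "-6 months" +%Y%m%d)

for dir in "$arcdir"/[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]; do
    d=${dir#"$arcdir"/}               # e.g. 2022/09/23
    ymd=${d//\//}                     # e.g. 20220923
    dow=$(date -d "${d//\//-}" +%u)   # 1 = Monday ... 7 = Sunday
    if [ "$ymd" -lt "$cutoff" ] && [ "$dow" -ne 7 ]; then
        echo rm -rf "$dir"            # remove the echo once you trust it
    fi
done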

I thoroughly recommend using Raid for reliable file systems on unreliable disks, backups to protect against data loss, and archives to enable a roll-back (or individual file recovery) from any date in the past.

Anyone who falls victim to data-encrypting ransomware would be able to re-install their O/S, and then recover their data from any day in the past before the virus infected their system.

External backups

You might be surprised at how some people do their backups.

For example:

  • Don't copy your data onto removable media such as tape or re-writeable optical disk, and then next time you want to take a backup, copy it to the same tape or disk again. If the copy goes horribly wrong, you don't have a backup and you've destroyed the previous one.
  • If you copy your data to removable media such as tape or optical disk, don't leave the backup media in the same place as the computer containing the data. If someone steals one, they'll probably take both. If the computer catches fire, it'll destroy the backup too.
  • Make sure you can (and know how to) restore data from a backup when necessary. Think about whether your backup strategy allows you to restore a single file, requires you to restore an entire file system, or does something in between.
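
For the rsync-and-hard-link archive described above, restoring is just a copy back out of the dated tree. A single file, for example (the date and path are purely illustrative):

$ cp -a /home/user/Archive/2022/09/23/personal/CV.txt /home/user/personal/

and a whole directory as it was on a given day can be pulled back the same way, or with rsync -a in the other direction.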
