Rsync… What R you syncing about?

rsync is simultaneously one of the most well known and most underutilized command line utilities I’ve come across. Most developers/admins have one or two uses for it, but frequently fall back on the trusty cp and scp commands. My guess is that its name (rsync stands for Remote Sync) leads people to assume it’s primarily for remote file sync, and while it certainly excels at this task, it’s fantastic for syncing local files as well.

Why sync when you can copy?

Copying a file from one location to another sounds simple enough for our old friend cp, but rsync has some very human friendly advantages:

It can be canceled mid-copy and then resumed later, even when copying a single file (see the --partial flag below).
When copying directories, it only copies changed files, and only copies diffs between existing files (see below for more on the rsync algorithm).
It has a ton of options available to handle compression, preserve permissions, and even show a progress bar (surprisingly useful for large transfers).

Because of these features, I find myself almost always using it in place of cp or scp.

The Algorithm

Admittedly, I am not a computer scientist, so I’m not going to attempt a full explanation of rsync’s algorithm, and will instead focus on the basic concept behind it. I recommend the official docs if you want a more in-depth explanation.

First, the destination file (if it exists) is split into non-overlapping, fixed-size blocks of S bytes.
Then, for each block, two checksums are calculated: a 32-bit rolling checksum and a stronger 128-bit MD4 checksum.
The source file is then scanned, and a checksum is calculated for the block at each byte offset and compared against the checksums from the destination file (see below for more info on this comparison and the rolling checksum).
The destination file is then (re)constructed from a sequence of instructions, where each instruction is either a reference to an existing block in the destination file, or literal data.
If the file does not already exist, there will be no blocks to compare against the source, and all data from the source file is treated as new literal data.

The key here is that literal data is only sent for the blocks that do not match between the source and destination files.
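To make the steps above a little more concrete, here is a minimal Python sketch of the two ends of the process: computing the per-block checksums of the destination file, and rebuilding a file from a list of instructions. It is an illustration only, not rsync's actual code. The function names, the tiny block size, and the instruction format are all made up for this example, and MD5 stands in for MD4 because MD4 is often unavailable in Python's hashlib.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


def weak_checksum(block: bytes) -> int:
    """A 32-bit checksum in the spirit of rsync's rolling checksum."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def strong_checksum(block: bytes) -> str:
    # Assumption: MD5 is used here as a stand-in for the 128-bit MD4 checksum.
    return hashlib.md5(block).hexdigest()


def signatures(dest: bytes, block_size: int = BLOCK_SIZE):
    """Split dest into fixed-size blocks and checksum each one."""
    sigs = []
    for index, start in enumerate(range(0, len(dest), block_size)):
        block = dest[start:start + block_size]
        sigs.append((index, weak_checksum(block), strong_checksum(block)))
    return sigs


def apply_instructions(dest: bytes, instructions, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild a file from block references and literal data."""
    out = bytearray()
    for kind, value in instructions:
        if kind == "block":      # reuse a block that already exists in dest
            out += dest[value * block_size:(value + 1) * block_size]
        else:                    # "literal": new data that had to be sent
            out += value
    return bytes(out)


old = b"aaaabbbbcccc"
instructions = [("block", 0), ("literal", b"XY"), ("block", 2)]
print(apply_instructions(old, instructions))  # b'aaaaXYcccc'
```

Only the two bytes in the literal instruction would need to be transferred in this example. Everything else is rebuilt from blocks the destination already has.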

The Rolling Checksum

The most important part of this process is the 32-bit rolling checksum. It allows rsync to figure out what parts of a file need to be copied, in one quick pass, which is particularly valuable when moving files between two hosts (less network overhead).

What is a checksum? A checksum is the result of applying a hash function to a block of data. Provided that the data given to the checksum is the same, the output will never change. Essentially, it’s a quick and easy way for a computer to confirm whether two blocks of data match. As an example, we can get the MD5 checksum of a file with the md5 command line tool (md5sum on most Linux systems).

Say we have file1.txt which has the contents:

some data

The MD5 checksum of that file can be returned by running:

➜ ~ md5 file1.txt 
MD5 (file1.txt) = 5febbef14389ebcfc3e501fa1091adcb

If we also have file2.txt and it has the same checksum:

➜ ~ md5 file2.txt 
MD5 (file2.txt) = 5febbef14389ebcfc3e501fa1091adcb

We can be confident that file1.txt and file2.txt have the same contents:

some data

While we won’t dive into the math behind these checksums, it’s important to remember that the more bits we use (the md5 command above creates a 128 bit checksum):

The smaller your chances of getting an identical checksum for two non-identical inputs (more accurate matching).
The more computationally expensive the checksum operation is for your computer.
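The same comparison can be sketched in a few lines of Python using the standard hashlib module. The byte strings below are made-up stand-ins for the contents of the files, not the real files from the example above.

```python
import hashlib

# Hypothetical file contents, used only for illustration.
file1 = b"some data\n"
file2 = b"some data\n"
file3 = b"some dat4\n"

md5_1 = hashlib.md5(file1).hexdigest()
md5_2 = hashlib.md5(file2).hexdigest()
md5_3 = hashlib.md5(file3).hexdigest()

print(md5_1 == md5_2)  # True: same contents, same checksum
print(md5_1 == md5_3)  # False: one changed byte gives a different checksum

# Checksum sizes in bits. More bits means fewer accidental matches,
# at the cost of more work per checksum.
print(hashlib.md5().digest_size * 8)     # 128
print(hashlib.sha256().digest_size * 8)  # 256
```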

It’s on a roll

rsync uses slightly different checksums than the MD5 checksum in the example above, but the idea of using them to check for changes in data is the same.

To figure out what data needs to be synced, rsync first splits the destination file into blocks of equal size, for example 500 bytes, and gets two checksums for each block. The first checksum is a stronger but slower 128 bit MD4 checksum, and the second is a weaker but faster 32 bit checksum. Next, rsync gets the 32 bit checksum of the first block of the source file and compares it to each of the 32 bit checksums from the destination file. If there is no match during this comparison, a checksum is calculated for a new block starting 1 byte into the source file, and the comparison process starts again.

So for our example with a block size of 500 bytes

A 32 bit checksum is calculated for the data from byte 0 to byte 499 of the source file.
This is compared against the 32 bit checksums of each block in our destination file.
If there is a match:
- The stronger 128 bit MD4 checksum is calculated for the matching block in the source file.
- This is compared against the 128 bit MD4 checksum for the corresponding block in the destination file. If this is also a match, a reference to that block is added to the set of instructions used for creating the final destination file, and the process continues with the block that starts right after it.
If there isn’t a match, the first byte of the block is added to the final instruction set as new data, and the process is repeated for a new block of data from byte 1 to byte 500, then 2 to 501, 3 to 502 etc.

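Because the 32 bit checksum is cheap to compute, and because the checksum for a block can be derived from the checksum of the block one byte earlier without re-reading the whole block, rsync is able to “roll” quickly through the file with the process above, and only compute the stronger 128 bit MD4 checksum when there’s a potential match already detected.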
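The rolling process described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not rsync's real implementation: the names weak_parts, roll, and delta are invented for this example, MD5 stands in for the MD4 checksum, and a trailing partial block in the destination is ignored for brevity.

```python
import hashlib

M = 1 << 16  # both halves of the weak checksum are kept modulo 2^16


def weak_parts(block: bytes):
    """Compute the two 16-bit halves of the weak checksum from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, size: int):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b


def delta(source: bytes, dest: bytes, size: int):
    """Describe source as references to dest blocks plus literal data."""
    # Index the destination's full-size blocks by weak checksum.
    table = {}
    for index in range(len(dest) // size):
        block = dest[index * size:(index + 1) * size]
        table.setdefault(weak_parts(block), []).append(
            (index, hashlib.md5(block).digest()))

    instructions, literal = [], bytearray()
    pos = 0
    a = b = None  # weak checksum of source[pos:pos + size], None if stale
    while pos + size <= len(source):
        if a is None:
            a, b = weak_parts(source[pos:pos + size])
        match = None
        for index, strong in table.get((a, b), []):
            # Weak checksum matched: confirm with the strong checksum.
            if hashlib.md5(source[pos:pos + size]).digest() == strong:
                match = index
                break
        if match is not None:
            if literal:
                instructions.append(("literal", bytes(literal)))
                literal = bytearray()
            instructions.append(("block", match))
            pos += size      # jump past the matched block
            a = b = None
        else:
            literal.append(source[pos])  # this byte is new data
            if pos + size < len(source):
                a, b = roll(a, b, source[pos], source[pos + size], size)
            pos += 1         # slide the window by one byte
    literal += source[pos:]  # leftover bytes shorter than one block
    if literal:
        instructions.append(("literal", bytes(literal)))
    return instructions


print(delta(b"aaaaXbbbbcccc", b"aaaabbbbcccc", 4))
# [('block', 0), ('literal', b'X'), ('block', 1), ('block', 2)]
```

Inserting a single byte shifts everything after it, yet only that byte ends up as literal data, because the window slides one byte at a time until it lines up with an existing block again.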

Who cares though?

At the end of the day, rsync’s algorithm is there to ensure that in one pass, we’re able to get a quick and accurate way of transferring only the parts of a source file or directory that are different from the destination file or directory. With this ability, we don’t need to worry about interruptions in the transfer, as rsync won’t duplicate work it’s already done, ultimately increasing the efficiency and reliability of our file transfers.

Practical tricks

-a for all the things

The -a (--archive) option for rsync is one that I use almost every time. It wraps the rlptgoD options all into one flag. Specifically -a will:

-r recursively sync directories. (See next section for how to use trailing / characters on directories to change the behavior of this option).
-l preserve symlinks
-p preserve the unix permissions
-t preserve the last modified timestamp
-g preserve the unix group ownership
-o preserve the unix user ownership
-D preserve device files (for example /dev/device1) and other special unix file types.

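As you can see, the -a option covers most of what is needed for the destination to match the source. It does not preserve hard links, ACLs, or extended attributes (see -H, -A, and -X), and it does not delete files from the destination that no longer exist in the source (see --delete).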

To / or not to /

When working with directories (-r or -a), the trailing / on a source path determines what exactly gets synced.

For example, say we have a directory src with three files in it:

/path/to/src/
├── file1.txt
├── file2.txt
└── file3.txt

If we omit the trailing / when syncing src to a new directory dest

rsync -a /path/to/src /path/to/dest

rsync will sync the directory src into the directory dest, resulting in:

/path/to/dest/
└── src
    ├── file1.txt
    ├── file2.txt
    └── file3.txt

Instead if we include the trailing /

rsync -a /path/to/src/ /path/to/dest

rsync will sync the contents of src into dest, resulting in:

/path/to/dest/
├── file1.txt
├── file2.txt
└── file3.txt

Examples

Now the fun part! Below are some of my most common uses of rsync. To see all options and their usage, use rsync --help.

Sync single file to a new directory

rsync /path/to/src/file.txt /path/to/dest/that/does/not/exist/

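rsync will create the final destination directory if it doesn’t exist, as long as its parent directory does. To create missing parent directories as well, newer versions of rsync (3.2.3 and later) provide the --mkpath option.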

Sync single file to a new directory on a remote machine

rsync /path/to/src/file.txt USERNAME@HOSTNAME:/path/to/dest/

rsync uses ssh by default, so you will need ssh access to the remote machine for this to work.

Sync large file

rsync \
  --partial \
  --progress \
  --stats \
  /path/to/src/file.txt /path/to/dest/

The --partial flag tells rsync to keep the partially transferred file if the sync does not complete. In the (likely) event you need to stop and restart the transfer, this will allow you to leverage rsync’s algorithm to only sync data that is not already present in the destination file. (When both paths are local, rsync defaults to --whole-file, so add --no-whole-file if you want the delta algorithm to be used.)
--progress prints a progress meter and human readable output to stdout. It also automatically enables rsync’s verbose output.
--stats prints stats about the sync, giving you some insight into how the rsync algorithm is working.

Simple backup of home directory

rsync -a --progress --stats ~/ /mnt/backup/home/

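This will create/update a mirror image of your home dir on a backup disk mounted at /mnt/backup. Because of the trailing / after the ~, the contents of /mnt/backup/home/ will mirror the contents of your home directory ~. Note that files you delete from your home directory are not removed from the backup unless you add the --delete flag.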

Exclude and Include filters

rsync -a \
  --progress \
  --stats \
  --exclude "not-important-stuff/*" \
  /path/to/src/ /path/to/dest/

Similarly, you can exclude all but one subdirectory by adding an --include flag for the directory itself and another for its nested contents:

rsync -a \
  --progress \
  --stats \
  --include "important-stuff" \
  --include "important-stuff/**" \
  --exclude "*" /path/to/src/ /path/to/dest/

Or we can exclude all files with the extension .ignore

rsync -a \
  --progress \
  --stats \
  --exclude "*.ignore" \
  /path/to/src/ /path/to/dest/

Dry run

rsync -a --progress --dry-run /path/to/src/ /path/to/dest/

The --dry-run option provides output similar to a sync, without actually syncing. This can be very useful for testing the --exclude and --include filters.

So what R you Syncing about?

I hope this helps you sync more and copy less!
