Rsync… What R you syncing about?
rsync is simultaneously one of the most well known, but most underutilized, command line utilities I’ve come across. Most developers/admins have one or two uses for it, but frequently fall back on the trusty cp and scp commands. My guess is that its name (rsync stands for Remote Sync) leads people to assume it’s primarily for remote file sync, and while it certainly excels at this task, it’s fantastic for syncing local files as well.
Why sync when you can copy?
Copying a file from one location to another sounds simple enough for our old friend cp, but rsync has some very human friendly advantages:
- It can be canceled mid copy, and then be resumed later, even when copying a single file.
- When copying directories, it only copies changed files, and only copies diffs between existing files (see below for more on the rsync algorithm).
- It has a ton of options available to handle compression, preserve permissions, and even show a progress bar (surprisingly useful for large transfers).
Because of these features, I find myself almost always using it in place of cp or scp.
The Algorithm
Admittedly, I am not a computer scientist, so I’m not going to attempt a full explanation of rsync’s algorithm, and instead will focus on the basic concept behind it. I recommend the official docs if you want a more in-depth explanation.
- First, the destination file (if it exists) is split into non-overlapping fixed-size blocks of S bytes.
- Then, for each block, two checksums are calculated: a 32-bit rolling checksum, and a stronger 128-bit MD4 checksum.
- The source file is then analyzed in the same way, and ultimately compared against the checksums from the destination file (see below for more info on this comparison, and the rolling checksum).
- The destination file is then (re)constructed from a sequence of instructions, where each instruction is either a reference to an existing block in the destination file, or literal data.
- If the file does not already exist, there will be no blocks to compare against the source, and all data from the source file is treated as new literal data.
The key here is that literal data is only sent for the blocks that do not match between the source and destination files.
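The reconstruction step can be sketched in a few lines of Python. This is only an illustration of the idea, not rsync’s actual code: the function names split_blocks and apply_instructions are hypothetical, and the instruction stream is modeled as a plain list where an int is a reference to an existing block and bytes are literal data.

```python
def split_blocks(data: bytes, block_size: int) -> list:
    """Split data into non-overlapping blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def apply_instructions(old: bytes, instructions: list, block_size: int) -> bytes:
    """Rebuild a file from block references (int) and literal data (bytes)."""
    blocks = split_blocks(old, block_size)
    out = bytearray()
    for item in instructions:
        if isinstance(item, int):
            out += blocks[item]  # reuse a block the destination already has
        else:
            out += item          # literal data that had to be sent
    return bytes(out)


old = b"aaaabbbbcccc"
# Hypothetical instruction stream: keep block 0, add new bytes, keep block 2.
instructions = [0, b"XXXX", 2]
new = apply_instructions(old, instructions, block_size=4)
print(new)  # b'aaaaXXXXcccc'
```

Only the four literal bytes would need to travel over the wire here; the other eight are rebuilt from blocks the destination already holds.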
The Rolling Checksum
The most important part of this process is the 32-bit rolling checksum. It allows rsync to figure out what parts of a file need to be copied in one quick pass, which is particularly valuable when moving files between two hosts (less network overhead).
What is a checksum?
A checksum is the result of applying a hash function to a block of data. Provided that the data given to the checksum is the same, the output should never change. Essentially, it’s a quick and easy way for a computer to confirm if two blocks of data match. As an example, we can get the MD5 checksum of a file with the md5 command line tool (md5sum on most Linux systems).
Say we have file1.txt, which has the contents:
some data
The MD5 checksum of that file can be returned by running:
➜ ~ md5 file1.txt
MD5 (file1.txt) = 5febbef14389ebcfc3e501fa1091adcb
If we also have file2.txt and it has the same checksum:
➜ ~ md5 file2.txt
MD5 (file2.txt) = 5febbef14389ebcfc3e501fa1091adcb
We can be confident that both file1.txt and file2.txt have the same contents:
some data
While we won’t dive into the math behind these checksums, it’s important to remember that the more bits we use (the md5 command above creates a 128-bit checksum):
- The smaller your chances of getting an identical checksum for two non-identical inputs (more accurate matching).
- The more computationally expensive the checksum operation is for your computer.
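The same comparison can be done programmatically. Here is a minimal Python sketch using the standard hashlib module; the helper name md5_hex and the sample byte strings are made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of data as a hex string."""
    return hashlib.md5(data).hexdigest()


a = md5_hex(b"some data\n")
b = md5_hex(b"some data\n")
c = md5_hex(b"some date\n")  # a single byte differs

print(a == b)      # True: identical input gives an identical checksum
print(a == c)      # False: one changed byte gives a different checksum
print(len(a) * 4)  # 128: 32 hex digits at 4 bits each
```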
It’s on a roll
rsync uses slightly different checksums than the MD5 checksum in the example above, but the idea of using them to check for changes in data is the same.
To figure out what data needs to be synced, rsync first splits the destination file into blocks of equal size, for example 500 bytes, and gets two checksums for each block. The first checksum is a stronger, but slower, 128-bit MD4 checksum, and the second is a weaker, but faster, 32-bit checksum.
Next, rsync gets the 32-bit checksum of the first block of the source file, and compares it to each of the 32-bit checksums from the destination file. If there is no match during this comparison, a checksum is calculated for a new block, starting 1 byte into the source file, and the comparison process starts again.
So for our example with a block size of 500 bytes:
- A 32-bit checksum is calculated for the data from byte 0 to byte 499 of the source file.
- This is compared against the 32-bit checksums of each block in our destination file.
- If there is a match:
  - The stronger 128-bit MD4 checksum is calculated for the matching block in the source file.
  - This is compared against the 128-bit MD4 checksum for the corresponding block in the destination file. If this is a match, a reference to the block is added to the set of instructions used for creating the final destination file.
- If there isn’t a match, the data is added to the final instruction set as new data, and the process is repeated for a new block of data from byte 1 to byte 500, then 2 to 501, 3 to 502, etc.
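The search described above can be sketched in Python. This is a simplified illustration, not rsync’s actual implementation: zlib.adler32 and hashlib.md5 stand in for rsync’s own 32-bit and MD4 checksums, the weak checksum is recomputed at every offset rather than rolled, and the names signatures and delta are hypothetical. After a confirmed match, the sketch continues from the end of the matched block.

```python
import hashlib
import zlib


def signatures(dest: bytes, size: int) -> dict:
    """Map each weak checksum to a list of (block index, strong checksum)."""
    table = {}
    for index, start in enumerate(range(0, len(dest), size)):
        block = dest[start:start + size]
        weak = zlib.adler32(block)            # stand-in for the 32-bit checksum
        strong = hashlib.md5(block).digest()  # stand-in for the MD4 checksum
        table.setdefault(weak, []).append((index, strong))
    return table


def delta(source: bytes, dest: bytes, size: int) -> list:
    """Return instructions: int = existing block index, bytes = literal data."""
    table = signatures(dest, size)
    instructions, literal, pos = [], bytearray(), 0
    while pos < len(source):
        window = source[pos:pos + size]
        match = None
        # Cheap check first; the strong checksum is only computed for candidates.
        for index, strong in table.get(zlib.adler32(window), []):
            if hashlib.md5(window).digest() == strong:
                match = index
                break
        if match is None:
            literal.append(source[pos])  # no match: this byte is new data
            pos += 1                     # slide the window by one byte
        else:
            if literal:
                instructions.append(bytes(literal))
                literal = bytearray()
            instructions.append(match)   # reference to an existing block
            pos += len(window)           # continue after the matched block
    if literal:
        instructions.append(bytes(literal))
    return instructions


print(delta(b"aaaaXXbbbbcccc", b"aaaabbbbcccc", 4))  # [0, b'XX', 1, 2]
```

In this small example only the two inserted bytes are emitted as literal data; everything else is expressed as references to blocks the destination already has.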
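Because the 32-bit checksum is less expensive to compute, rsync is able to “roll” quickly through the file with the process above, and only compute the stronger 128-bit MD4 checksum when there’s a potential match already detected. It is also designed so that the checksum of the block starting at byte k+1 can be derived cheaply from the checksum of the block starting at byte k, rather than being recomputed from scratch.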
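To make the “rolling” part concrete, here is a small Python sketch of a checksum in the same spirit as rsync’s weak checksum: two running sums kept modulo 2**16. It is a simplified illustration rather than rsync’s exact code, and the function names weak_checksum, roll, and combine are hypothetical. The point is that sliding the window by one byte takes a couple of arithmetic operations, no matter how large the block is.

```python
M = 1 << 16  # both halves of the checksum are kept modulo 2**16


def weak_checksum(block: bytes) -> tuple:
    """Compute the (a, b) pair for a block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_size: int) -> tuple:
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_size * out_byte + a) % M
    return a, b


def combine(a: int, b: int) -> int:
    """Pack both halves into a single 32-bit value."""
    return a + (b << 16)


data = b"the quick brown fox jumps over the lazy dog"
size = 8
a, b = weak_checksum(data[:size])
for start in range(1, len(data) - size + 1):
    # Update in constant time instead of re-reading all `size` bytes.
    a, b = roll(a, b, data[start - 1], data[start + size - 1], size)
    assert (a, b) == weak_checksum(data[start:start + size])
print("rolled checksums match recomputed ones")
```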
Who cares though?
At the end of the day, rsync’s algorithm is there to ensure that, in one pass, we have a quick and accurate way of transferring only the parts of a source file or directory that are different from the destination file or directory. With this ability, we don’t need to worry about interruptions in the transfer, as rsync won’t duplicate work it’s already done, ultimately increasing the efficiency and reliability of our file transfers.
Practical tricks
-a for all the things
The -a (--archive) option for rsync is one that I use almost every time. It wraps the rlptgoD options all into one flag. Specifically, -a will:
- -r recursively sync directories. (See the next section for how a trailing / on a directory changes the behavior of this option.)
- -l preserve symlinks
- -p preserve the unix permissions
- -t preserve the last modified timestamp
- -g preserve the unix group ownership
- -o preserve the unix user ownership
- -D preserve device files (for example /dev/device1) and other special unix file types.
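As you can see, the -a option pretty much makes sure that everything copied to the destination matches the source as closely as possible. (It does not preserve hard links, ACLs, or extended attributes; those need -H, -A, and -X.)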
To / or not to /
When working with directories (-r or -a), the trailing / on a source path determines what exactly gets synced.
For example, say we have a directory src with three files in it:
/path/to/src/
├── file1.txt
├── file2.txt
└── file3.txt
If we omit the trailing / when syncing src to a new directory dest:
rsync -a /path/to/src /path/to/dest
rsync will sync the directory src into the directory dest, resulting in:
/path/to/dest/
└── src
    ├── file1.txt
    ├── file2.txt
    └── file3.txt
Instead, if we include the trailing /:
rsync -a /path/to/src/ /path/to/dest
rsync will sync the contents of src into dest, resulting in:
/path/to/dest/
├── file1.txt
├── file2.txt
└── file3.txt
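If it helps to see the rule written down, here is a tiny Python model of it. This is not how rsync is implemented; destination_for is a hypothetical helper that only encodes the trailing / rule shown in the two examples above.

```python
from pathlib import PurePosixPath


def destination_for(src_arg: str, dest_arg: str, relative_file: str) -> str:
    """Return where a file inside src_arg lands under dest_arg."""
    dest = PurePosixPath(dest_arg)
    if not src_arg.endswith("/"):
        # No trailing slash: the source directory itself is recreated in dest.
        dest = dest / PurePosixPath(src_arg).name
    return str(dest / relative_file)


print(destination_for("/path/to/src", "/path/to/dest", "file1.txt"))
# /path/to/dest/src/file1.txt
print(destination_for("/path/to/src/", "/path/to/dest", "file1.txt"))
# /path/to/dest/file1.txt
```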
Examples
Now the fun part! Below are some of my most common uses of rsync.
To see all options, and their usage, use rsync --help.
Sync single file to a new directory
rsync /path/to/src/file.txt /path/to/dest/that/does/not/exist/
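rsync will create the final directory of the destination path if it doesn’t exist. Missing parent directories are not created automatically; on rsync 3.2.3 and later you can add --mkpath to create them.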
Sync single file to a new directory on a remote machine
rsync /path/to/src/file.txt USERNAME@HOSTNAME:/path/to/dest/
rsync uses ssh by default, so you will need ssh access to the remote machine for this to work.
Sync large file
rsync \
--partial \
--progress \
--stats \
/path/to/src/file.txt /path/to/dest/
- The --partial flag tells rsync to keep a partially transferred file if the sync does not complete. In the (likely) event you need to stop and restart the transfer, this will allow you to leverage rsync’s algorithm to only sync data that is not already present in the destination file.
- --progress prints a progress meter and human readable output. It also automatically enables rsync’s verbose output.
- --stats prints stats about the sync, giving you some insight into how the rsync algorithm is working.
Simple backup of home directory
rsync -a --progress --stats ~/ /mnt/backup/home/
This will create/update a mirror image of your home dir on a backup disk mounted at /mnt/backup. Because of the trailing / after the ~, the contents of /mnt/backup/home/ will mirror the contents of your home directory ~.
Exclude and Include filters
rsync -a \
--progress \
--stats \
--exclude "not-important-stuff/*" \
/path/to/src/ /path/to/dest/
Similarly, you can exclude all but one subdirectory by adding an --include flag for the directory itself and another for its nested contents:
rsync -a \
--progress \
--stats \
--include "important-stuff" \
--include "important-stuff/**" \
--exclude "*" \
/path/to/src/ /path/to/dest/
Or we can exclude all files with the extension .ignore:
rsync -a \
--progress \
--stats \
--exclude "*.ignore" \
/path/to/src/ /path/to/dest/
Dry run
rsync -a --progress --dry-run /path/to/src/ /path/to/dest/
The --dry-run option provides output similar to a sync, without actually syncing. This can be very useful for testing the --exclude and --include filters.
So what R you Syncing about?
I hope this helps you sync more and copy less!