From: Rasmus Borup Hansen
Subject: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Mon, 11 Aug 2014 15:55:20 +0200
Hi! I recently had to copy a lot of files, and even though I have 20 years of experience with various Unix variants, I was still surprised by the behaviour of cp. I think my observations should be shared with the community.
The setup: An old Dell server (2 cores, 2 GB RAM initially, 10 GB later, running Ubuntu Trusty) with a new Dell storage enclosure (MD 1200) containing twelve 4 TB disks configured as RAID 6 for a total capacity of 40 TB, allowing two drives
to fail simultaneously. The server is used for our off-site backup, and the only thing it does is write data to the disks. We use rsnapshot for that, so most of the files have a high link count (30+).
One morning I was notified that a disk had failed. No big deal, this happens now and then. I called Dell and the next day I had a replacement disk. While the array was rebuilding, the replacement disk failed, and in the meantime another disk had
also failed. Dell's support wisely suggested that I not simply replace the failed disks, as the array might have been punctured. Apparently, and as I understand it, disks are only reported as failed when they have sufficiently
many bad blocks, and if you're unlucky you can lose data if 3 corresponding blocks on different disks become bad within a short time, so that the RAID controller does not have a chance to detect the failures, recalculate the data
from the parity, and store it somewhere else. So even though only two drives flashed red, data might have been lost.
Having almost used up the capacity, we decided to order another storage enclosure, copy the files from the old one to the new one, and then get the old one into a trustworthy state and use it to extend the total capacity. Normally
I'd have copied/moved the files at block level (e.g. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks. I browsed the net for other people's experiences
with copying many files and quickly decided that cp would do the job nicely. Knowing that preserving the hardlinks would require bookkeeping of which files had already been copied, I also ordered 8 GB more RAM for the server and
configured more swap space.
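
To illustrate the kind of bookkeeping meant here, the following is a minimal Python sketch, not cp's actual code. It remembers each multiply-linked source file by its (device, inode) pair and recreates the link at the destination instead of copying the data again. The function name copy_tree_preserving_hardlinks is made up for this example, and the sketch only handles directories and regular files (symlinks and special files are skipped). The point is that the dictionary grows with the number of multiply-linked inodes, which is why memory matters for a tree like ours.

```python
import os
import shutil
import stat


def copy_tree_preserving_hardlinks(src_root, dst_root):
    """Copy directories and regular files, recreating hard links.

    Hypothetical helper for illustration only. Returns the number of
    multiply-linked inodes that had to be remembered.
    """
    # (st_dev, st_ino) of a source file -> path of its first copy
    seen = {}
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            st = os.lstat(src)
            if not stat.S_ISREG(st.st_mode):
                continue  # this sketch skips symlinks and special files
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in seen:
                # Same inode already copied: link to the earlier copy.
                os.link(seen[key], dst)
            else:
                shutil.copy2(src, dst)
                if st.st_nlink > 1:
                    # Only files with other links need to be remembered.
                    seen[key] = dst
    return len(seen)
```

Every entry stays in the dictionary until the whole copy has finished, because another link to the same inode may turn up anywhere later in the tree.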
When the new hardware had arrived I started the copying, and at first it proceeded nicely at around 300-400 MB/s as measured with iotop. After a while the speed decreased considerably, because most of the time was spent creating
hardlinks, and it takes time to ensure that the filesystem is always in a consistent state. We use XFS, and we were probably suffering from not having disabled write barriers, which can be done when the RAID controller has a write cache
with a trustworthy battery backup. As expected, the memory usage of the cp command increased steadily and was soon in the gigabytes.
After some days of copying the first real surprise came: I noticed that the copying had stopped, and cp did not make any system calls at all according to strace. Reading the source code revealed that cp keeps track of which files
have been copied in a hash table that has to be resized now and then to avoid too many collisions. Once the RAM has been used up, this becomes a slow operation.
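
A toy example may make it clearer why a resize can stall everything. The Python sketch below is not the hash table cp uses; the class ToyHashTable, its starting size and its load threshold are all assumptions chosen for illustration. It shows the general pattern: when the table gets too full, a larger bucket array is allocated and every existing entry is re-inserted, so a single resize touches the whole table. If much of that table has been pushed out to swap, each touch may mean a disk access.

```python
class ToyHashTable:
    """Chained hash table that doubles its bucket count when too full."""

    def __init__(self, n_buckets=8, max_load=0.8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0
        self.max_load = max_load
        self.resizes = 0
        self.entries_moved = 0  # total entries re-inserted by all resizes

    def _index(self, key, n_buckets):
        return hash(key) % n_buckets

    def insert(self, key, value):
        bucket = self.buckets[self._index(key, len(self.buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            self._resize(2 * len(self.buckets))

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key, len(self.buckets))]:
            if k == key:
                return v
        return default

    def _resize(self, n_buckets):
        # Every entry is visited and re-inserted: O(size) work in one go,
        # during which no new inserts (no copying, for cp) can happen.
        new_buckets = [[] for _ in range(n_buckets)]
        for bucket in self.buckets:
            for k, v in bucket:
                new_buckets[self._index(k, n_buckets)].append((k, v))
                self.entries_moved += 1
        self.buckets = new_buckets
        self.resizes += 1
```

Inserting a few thousand (device, inode) keys and then looking at resizes and entries_moved shows that resizes become rarer as the table grows, but each one moves more entries than the previous one, which matches the ever longer pauses I saw.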
Trusting that resizing the hash table would eventually finish, I let the cp command continue, and after a while it started copying again. It stopped again to resize the hash table a couple of times, each resize taking more
time than the last. Finally, after 10 days of copying and hash table resizing, the new file system used as many blocks and inodes as the old one according to df, but to my surprise the cp command didn't exit. Looking at the source again, I found that
cp disassembles its hash table data structures nicely after copying (the forget_all call). Since the virtual size of the cp process was now more than 17 GB and the server only had 10 GB of RAM, it did a lot of swapping.
I had started cp with the "-v" option and piped its output (both stdout and stderr) to a tee command to capture the output in a (big!) logfile. The logfile ended in the middle of a line, which meant that some of the output from cp was still
buffered somewhere. Wanting the buffers to be flushed so that I had a complete logfile, I gave cp more than a day to finish disassembling its hash table before giving up and killing the process.
As I write this, I'm running an "ls -laR" on both file systems to be sure that everything was copied. But unless the last missing part of the output from cp contained more error messages, it appears that only a single file had I/O
errors (luckily we had another copy of it).
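
For a small tree, the same kind of check can be sketched in Python instead of comparing two "ls -laR" listings by hand. This is only an illustration: the function names tree_summary and compare_trees are made up, and the sketch compares just the file type plus, for regular files, the size and link count. It also keeps both summaries in memory, so for hundreds of millions of files one would have to stream and compare sorted listings instead.

```python
import os
import stat


def tree_summary(root):
    """Map each path relative to root to (file type, size, link count).

    Size and link count are only recorded for regular files, since they
    are not comparable across filesystems for directories.
    """
    summary = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # do not follow symlinks
            rel = os.path.relpath(path, root)
            if stat.S_ISREG(st.st_mode):
                summary[rel] = (stat.S_IFMT(st.st_mode), st.st_size, st.st_nlink)
            else:
                summary[rel] = (stat.S_IFMT(st.st_mode), None, None)
    return summary


def compare_trees(old_root, new_root):
    """Return (missing, extra, differing) lists of relative paths."""
    old, new = tree_summary(old_root), tree_summary(new_root)
    missing = sorted(set(old) - set(new))    # in old tree only
    extra = sorted(set(new) - set(old))      # in new tree only
    differing = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return missing, extra, differing
```

Three empty lists mean the trees agree on names, types, sizes and link counts. File contents are not compared.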
I know this is not going to happen right away, but it would be nice if cp somehow used a data structure where the bookkeeping could be done while waiting for I/O instead of letting it pile up. And unless old systems without
working memory management must be supported, I don't see any harm in simply removing the call to the forget_all function towards the end of cp.c.
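
The idea of ending the process without tearing anything down can be sketched in Python. This is an analogy, not a patch for cp.c, and the function name finish_without_teardown is made up for this example. It flushes the output buffers explicitly (which would also have saved my logfile) and then lets the operating system reclaim all memory at once, without walking any data structures.

```python
import os
import sys


def finish_without_teardown(status=0):
    """Flush output, then exit immediately without freeing anything.

    os._exit() ends the process without running cleanup handlers or
    flushing stdio buffers, so the buffers are flushed by hand first.
    The kernel then releases the process's memory in one go.
    """
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)
```

A program would call this as its very last step, after all the real work has been written out. Anything that relies on cleanup at exit (temporary files, atexit handlers) has to be handled before the call.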
To summarise the lessons I learned:
If you trust that your hardware and your filesystem are OK, use block-level copying if you're copying an entire filesystem. It'll be faster unless the filesystem has lots of free space, and in any case it will require less memory.
If you copy many files and want to preserve hardlinks, make sure you have enough memory if you copy at file level.
Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.
The number of hard drives flashing red is not the same as the number of hard drives with bad blocks. With RAID 6 you don't need three drives flashing red to lose data if you're unlucky; fewer can be enough. The same is true for RAID 5,
where you can lose data with only one or no drive flashing red, if you're really unlucky.
I hope this can help or at least be interesting for someone.