btar — Bad tar Clone

This writeup will walk through how to create a clone (ish) of the classic UNIX tar command, with one additional feature.

Motivation

Consider the following quote from the Wikipedia page for tar:

The tar format was designed without a centralized index or table of contents for files and their properties for streaming to tape backup devices. The archive must be read sequentially to list or extract files. For large tar archives, this causes a performance penalty, making tar archives unsuitable for situations that often require random access to individual files.

I read this and thought, "wow that seems like an oversight", and then, with the hubris of a recent college graduate, I thought "I can do better".

Tar Overview

tar converts a set of files into a single file. This new file, known as an "archive" (hence the name tape archive, tape being the magnetic storage medium) can later be converted back into the original set of files.

One of the great things about files is that they are just bytes, so to amalgamate a set of files, all you really have to do is concatenate their contents. Add in some metadata about which file starts where, and this is pretty much it.

Design Overview

Random access was the one feature I wanted to add, so I had this in mind when beginning the design process. To get random access to a given file in the archive, I would have to know where all the files are without reading the entire archive.

We can model this information with a method that is extremely similar to the concept of Extents in file systems. Essentially, each file is identified in the archive by a tuple of (offset, length), and all of these tuples are stored at the head of the archive.

All you need now is a way to associate a file name with a single tuple, and the archive is complete.

File Names

Most of the tar implementations I have read about have all had fixed length requirements for file names. The benefit to having a maximum file length is that it simplifies the layout of metadata. The downside is that you now have an upper bound on the name of a file, which is normally much smaller than the real upper bound provided by the OS. In addition, fixed length file name fields lead to wasted space when the majority of your files have short names.

btar stores file names separately from extent metadata. Much like how a compiler will separate code and data, the archive separates all fixed length and variable length data. In both cases, it is not necessary, but makes things easier down the line.

Archive Format

The btar archive is structured as follows:

  1. Archive Header: The archive header contains a magic number that identifies the file as a btar archive (the sequence of bytes "\x9a\x00\xA1\xFC""btar"). In addition, it stores when the archive was created, how many files are in it, and the total length of all metadata (that is, all data that is not actual file data).

  2. File Metadata Blocks: After the archive header comes the list of (offset, length) tuples for each file. The number tuples is known from the archive header.

  3. File Names: Next comes all of the file names. They are stored as consecutive NULL terminated strings, and can be of any length.

  4. File Data: Finally, the raw data of every file is concatenated together. There is no separation between the bytes of two files (unlike tar) as the offsets between them are known.

Here is a comment from the top of btar.h giving a visual description of the format.

/*
 *        BTAR ARCHIVE FORMAT
 * +-----------------------------------+
 * | +-------------------------------+ |
 * | | Magic Number (8 bytes)        | |
 * | | Creation Date (8 bytes)       | |   One master block
 * | | Number of Files (8 bytes)     | |
 * | | Total Header Length (8 bytes) | |
 * | +-------------------------------+ |
 * |                                   |
 * | +------------------------------+  |
 * | | +--------------------------+ |  |
 * | | | File Length (8 bytes)    | |  |   Many file metadata blocks
 * | | | Archive Offset (8 bytes) | |  |   (one per file in archive)
 * | | +--------------------------+ |  |
 * | |            ...               |  |
 * | |            ...               |  |
 * | |            ...               |  |
 * | +------------------------------+  |
 * |                                   |
 * | +---------------+                 |
 * | | "filename1"\0 |                 |   Null terminated file names
 * | | "filename2"\0 |                 |   (one per file in archive)
 * | |    ...        |                 |
 * | | "filenameN"\0 |                 |
 * | +---------------+                 |
 * |                                   |
 * |           (end of header)         |
 * |                                   |
 * | +-------------------------------+ |
 * | |                               | |  Bytes of files in archive
 * | |         Raw File Data         | |  (one extent per file)
 * | |                               | |  (indexed by metadata blocks)
 * | +-------------------------------+ |
 * +-----------------------------------+
 *
*/

Here is a look at a small archive (see example at bottom) in binary format.

  1. Magic Bytes
  2. Creation Timestamp (Tue Apr 27 00:18:16 2021)
  3. Number of files in archive (4)
  4. Size of archive metadata (120 bytes)
  5. Extent Offset (1 per file)
  6. Extent Length (1 per file)
  7. Null terminated file names (4 of them)
  8. File Data (4 adjacent blocks)

$ xxd archive.btar
00000000: 9a00 a1fc 6274 6172 b8ba 8760 0000 0000 ....btar...`....
00000010: 0400 0000 0000 0000 7800 0000 0000 0000 ........x.......
00000020: 0000 0000 0000 0000 0400 0000 0000 0000 ................
00000030: 0400 0000 0000 0000 0400 0000 0000 0000 ................
00000040: 0800 0000 0000 0000 0400 0000 0000 0000 ................
00000050: 0c00 0000 0000 0000 1800 0000 0000 0000 ................
00000060: 612e 7478 7400 622e 7478 7400 632e 7478 a.txt.b.txt.c.tx
00000070: 7400 642e 7478 7400 6161 610a 6262 620a t.d.txt.aaa.bbb.
00000080: 6363 630a 6464 6464 640a 6464 6464 640a ccc.ddddd.ddddd.
00000090: 6464 6464 640a 6464 6464 640a           ddddd.ddddd.

Implementation

Here is the process for extracting a file from the archive:

  1. Open the file and read the first sizeof(struct archive_header) bytes.
  2. Skip ahead num_files * sizeof(struct file_metadata) bytes to skip of the extent tuples to get to the file names.
  3. Read all of the file names into memory (we know the total length of all the file names from the archive header).
  4. Search for the given file name to extract. Because all the file names are are NULL terminated, it is easy to figure out the index of a given file name.
  5. With the index of the desired file, look into the extent list and find the offset and length of the file.
  6. Seek to the proper offset, read the correct number of bytes, and write them to a file.

To create an archive from a set of files:

  1. Open all the files first to make sure they exist and are readable.
  2. Compute how large the archive metadata will be based off of the length of all file names and how many files there are.
  3. Seek past all the metadata and transfer all of the files contiguously into the archive.
  4. Seek to the beginning of the file and write the metadata.

2038 Problem

All length related fields are stored with 64 bit quantities to avoid size related issues. If only 32 bit quantities were used, the maximum file size that could be included in an archive would be 4GB, which is not an unreasonable file size any more.

Testing

To test, I wrote a script that generates 30 files full of random bytes, ranging from 0b to 200KB in size. I archive them all together, unarchive them, and compare the files from the archive to the originals.

Usage

Here is the API for btar

 $ ./btar
btar -- usage

o To create an archive from a set of files:
  > btar pack archive [file1 ... fileN]

o To extract all (or certain) files from an archive:
  > btar unpack archive [file1 ... fileN]

o To view the contents of an archive:
  > btar list archive

Example

Here is a walkthrough of using the program.

$ cat {a,b,c,d}.txt
aaa
bbb
ccc
ddddd
ddddd
ddddd
ddddd
$ ./btar pack archive.btar {a,b,c,d}.txt
info: creating archive `archive.btar` from `4` files
info: opening file `a.txt`
info: opening file `b.txt`
info: opening file `c.txt`
info: opening file `d.txt`
info: btar header will occupy first 120 bytes of archive
info: - archive header: 32 bytes
info: - file metadata: 64 bytes (16 bytes * 4 files)
info: - filenames: 24 bytes
info: transferring file `a.txt` to `archive.btar`
info: transferring file `b.txt` to `archive.btar`
info: transferring file `c.txt` to `archive.btar`
info: transferring file `d.txt` to `archive.btar`
$ rm {a,b,c,d}.txt
$ ./btar unpack archive.btar a.txt b.txt c.txt d.txt
info: reading archive `archive.btar`
`archive.btar` is a valid btar archive, created at Tue Apr 27 00:18:16 2021
Archive contains 4 files and has a header length of 120 bytes
archive `archive.btar` contains the following files
 o `a.txt` (bytes=4, offset=0)
 o `b.txt` (bytes=4, offset=4)
 o `c.txt` (bytes=4, offset=8)
 o `d.txt` (bytes=24, offset=12)
info: extracting file `a.txt` at offset `120` with length `4`
info: extracting file `b.txt` at offset `124` with length `4`
info: extracting file `c.txt` at offset `128` with length `4`
info: extracting file `d.txt` at offset `132` with length `24`
info: ... done
$ cat {a,b,c,d}.txt
aaa
bbb
ccc
ddddd
ddddd
ddddd
ddddd

Concessions

It is hard to determine where to draw the line in toy projects like this between "this is educational, the big concepts matter" and "this needs to be used in production".

For my previous project, bthreads, I put off lots of corner cases to make everything clean and tidy for educational purposes. I chose to do the same here. For example, all of the files in an archive have to be in the same directory, and file permissions are not stored. These are "easy" features to add, but the educational value of adding them is small compared to the work they would take. You can conceptually "get" how these things would be done and don't necessarily need to do them yourself if you want to follow along.

One concession I did not make was skimping out on the 64 bit quantities. Sticking to 32 bit lengths would have made the file format smaller, but would have had a very tangible impact on the max archive size.

I pointed this out in an earlier section because I think this is a crucial lesson that computer programmers have had to learn the hard way many times (Y2K, 2038 Problem, FAT file system limitations, IPv4 address limitations, early Minecraft world size restriction, ...). We have got to stop doing this to ourselves! A whole class of overflow related bugs disappears when using the full word size of modern processors (ie: 2^32 dollars is a reasonable amount of money while 2^64 dollars is not), and I think we should be leveraging this.

Partly, I think that the ergonomics of the C language are to blame. For example, the definitions in <inttypes.h> like uint64_t are quite the mouthful and discourage their own use. It is so much easier to just write for (int i = 0 ... than it is to "do the right thing" and be careful with your types.

Other languages have fancy keywords like u64 that I think would be nice (I guess you could typedef these).

Conclusion

The simplicity of btar comes mostly from the fact that the file abstraction is so straightforward. Files are just bytes, so many files together are just many bytes together. I am still confused as to why the original tar did not provide random access, maybe someone a little older can clue me in!