A lightweight, safe archive reading library for previewing and extracting archive contents without external dependencies (beyond compression libraries).
CupidArchive provides a clean 3-layer architecture for reading archive files:
- IO Layer - Safe stream abstraction with byte limits to prevent zip bombs
- Filter Layer - Decompression wrappers (gzip, bzip2, deflate, xz)
- Format Layer - Archive format parsers (TAR, ZIP)
The library is designed with safety as a primary concern:
- ArcLimits are enforced across parsing, decompression, and extraction
- Every stream has a hard byte limit to mitigate zip bombs
- Extraction is openat()-anchored and rejects path traversal (Zip-Slip)
- Decompression filters treat truncated input as an error (no silent partial output)
- Formats: TAR (ustar + pax + GNU long name extensions), ZIP (central directory + streaming mode, ZIP64 support), 7z (single-file, LZMA/LZMA2), compressed single files (.gz, .bz2, .xz as virtual archives)
- Compression: gzip (zlib), bzip2 (libbz2), deflate (zlib, for ZIP), xz/lzma (liblzma)
- Entry Types: Regular files, directories, symlinks, and hardlinks (TAR); files and directories only (ZIP)
- Operations: Reading, previewing, and extraction
With XZ filter support, .tar.xz (and .txz) files are read as native TAR archives. Standalone .xz files, like .gz and .bz2 files, are presented through arc_compressed.c as virtual archives with a single entry.
- Layered safety – The ArcStream + filter + reader + extractor pipeline guarantees hard byte limits, Zip-Slip-safe extraction, and scoped ownership traps at every boundary.
- Compression-aware detection – arc_reader.c rewinds compressed streams, reclones gzip/bzip2/xz filters, and reports TAR/ZIP/single-file formats so previews never read past the sniffed header.
- Single-file pseudo archives – .gz, .bz2, and .xz files surface as one-entry archives through arc_compressed.c, making previews and extractions consistent with full archives.
- Openat-/O_NOFOLLOW-backed extraction – arc_extract.c builds directories with mkdir_p_at(), copies data with 64 KB buffers, and respects O_NOFOLLOW to avoid symlink races.
- Resource limits everywhere – ArcLimits guard entry counts, name lengths, extra/comment bytes, decompressed volume, and nesting depth so malformed archives hit a ceiling before wrecking anything.
- Read-only – This library only reads/previews/extracts archives; there is no archive creation or modification API.
- Hardlinks are copied – TAR hardlink entries fall back to regular file copies because inode tracking/relink passes are not implemented.
- Metadata is partial – Extraction preserves permissions and timestamps, but ownership (uid/gid) is not restored and ZIP symlinks/hardlinks are unsupported.
- Encrypted ZIP entries are unsupported – The ZIP parser recognizes the encryption flag but cannot decrypt password-protected entries.
- XZ support depends on liblzma – When lzma.h is unavailable, arc_filter_xz() returns ENOSYS and .xz archives cannot be read.
- 7z support is limited – Only single-file, single-folder 7z archives with LZMA/LZMA2 (or copy) are supported. No encryption, multi-volume, or solid multi-file archives yet.
Most public APIs use safe defaults, but you can pass explicit limits via the _ex APIs:
const ArcLimits *arc_default_limits(void);
ArcReader *arc_open_path_ex(const char *path, const ArcLimits *limits);
ArcReader *arc_open_stream_ex(ArcStream *stream, const ArcLimits *limits);
Limits include:
- max_entries: max entries parsed from ZIP central directory
- max_name: max entry name/path bytes
- max_extra: max ZIP extra/comment bytes
- max_uncompressed_bytes: cap on decompressed output (zip-bomb mitigation)
- max_nested_depth: max path depth (components) during extraction
The IO layer provides a unified stream interface using a virtual function table (vtable) pattern. This allows the same interface to be backed by different implementations.
struct ArcStream {
const struct ArcStreamVtable *vtable; // Function pointers
int64_t byte_limit; // Hard limit on total bytes
int64_t bytes_read; // Total bytes read so far
void *user_data; // Implementation-specific data
};
- File Descriptor Stream (arc_stream_from_fd)
  - Backed by a file descriptor
  - Uses read() for reading
  - Supports lseek() for seeking
  - Tracks position internally
  - Does NOT close the file descriptor (caller owns it)
- Memory Stream (arc_stream_from_memory)
  - Backed by a memory buffer
  - Uses memcpy() for reading
  - Supports seeking within buffer bounds
  - Does NOT free the buffer (caller owns it)
  - Default byte limit is buffer size if not specified
- Substream (arc_stream_substream)
  - Bounded view of another stream
  - Creates a window into a parent stream
  - Used for reading individual archive entry data
  - Automatically seeks parent stream to correct position
  - Does NOT close parent stream (caller owns it)
Every stream enforces a hard byte limit to prevent zip bombs:
- Limits are checked before each read operation
- When limit is reached, reads return 0 (EOF)
- Limits are enforced at the implementation level (fd_read, mem_read, substream_read)
- Decompression filters also enforce limits on decompressed data
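The limit rule above can be illustrated with a minimal sketch. It is not the library's code: `DemoMemStream` and `demo_mem_read` are hypothetical names, and the struct only mirrors the `byte_limit` / `bytes_read` fields described earlier. The read is clamped to the remaining budget, and once the budget is spent it reports EOF by returning 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical stand-in for a memory-backed stream (not the library's type). */
typedef struct DemoMemStream {
    const unsigned char *data; /* caller-owned buffer */
    size_t size;               /* total bytes in the buffer */
    size_t pos;                /* current read position */
    int64_t byte_limit;        /* hard cap on total bytes handed out */
    int64_t bytes_read;        /* running total handed out so far */
} DemoMemStream;

/* Read up to n bytes, clamped to the remaining data and the remaining budget.
 * Returns 0 (EOF) once either is exhausted, -1 on bad arguments. */
static ssize_t demo_mem_read(DemoMemStream *s, void *buf, size_t n)
{
    if (!s || !buf) return -1;

    int64_t budget = s->byte_limit - s->bytes_read;
    if (budget <= 0) return 0; /* limit reached: behave like EOF */

    size_t avail = s->size - s->pos;
    if (n > avail) n = avail;
    if ((uint64_t)n > (uint64_t)budget) n = (size_t)budget;

    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    s->bytes_read += (int64_t)n;
    return (ssize_t)n;
}
```

With a 10-byte buffer and a limit of 6, two 4-byte reads return 4 and then 2, and every later read returns 0.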
Stream API:
- arc_stream_read() - Read up to n bytes (enforces byte limit)
- arc_stream_seek() - Seek to offset (if supported)
- arc_stream_tell() - Get current position (if supported)
- arc_stream_close() - Close and free stream
The filter layer wraps underlying streams to provide decompression. Filters are themselves streams, allowing them to be chained.
Gzip filter:
- Uses zlib's inflateInit2() with 16 + MAX_WBITS for gzip format
- Maintains a 64KB input buffer
- Streams decompression (doesn't require seeking)
- Does NOT support seeking (returns ESPIPE)
- Tracks decompressed bytes for tell() operation
- Does NOT close underlying stream (caller owns it)
- Truncated input fails: if input ends before Z_STREAM_END, returns -1 and sets errno = EINVAL
Bzip2 filter:
- Uses libbz2's BZ2_bzDecompressInit()
- Maintains a 64KB input buffer
- Streams decompression
- Does NOT support seeking (returns ESPIPE)
- Tracks decompressed bytes for tell() operation
- Does NOT close underlying stream (caller owns it)
XZ filter:
- Uses liblzma's lzma_stream_decoder() to stream-decompress .xz data
- Maintains a 64KB input buffer
- Streams decompression (no seeking)
- Does NOT close the underlying stream (the reader owns it)
- Truncated input fails: LZMA_BUF_ERROR with no progress becomes errno = EINVAL
Raw deflate filter:
- Uses zlib's inflateInit2() with -MAX_WBITS for raw deflate (no gzip wrapper)
- Used internally by ZIP format for deflate-compressed entries
- Maintains a 64KB input buffer
- Streams decompression
- Does NOT support seeking (returns ESPIPE)
- Tracks decompressed bytes for tell() operation
- Does NOT close underlying stream (caller owns it)
- Truncated input fails: if input ends before Z_STREAM_END, returns -1 and sets errno = EINVAL
The TAR format implementation supports:
Supported TAR Variants:
- ustar format - Standard POSIX TAR with a 100-byte name field (paths up to 255 chars via the prefix field)
- pax extended headers - For long paths (>100 chars) and large files (>8GB)
- Old TAR format - Pre-ustar format (detected by absence of magic)
TAR Header Structure:
- Fixed 512-byte blocks
- Octal-encoded numeric fields (mode, size, mtime, etc.)
- Checksum verification (sum of all bytes with checksum field as spaces)
- Support for prefix field (ustar) for paths up to 255 chars
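As a rough illustration of the header rules above, the sketch below parses an octal field and verifies the checksum of one 512-byte header block. The helper names are hypothetical. The offsets (checksum at byte 148, 8 bytes long) come from the ustar header layout, not from this library's source.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAR_BLOCK  512u
#define DEMO_CHKSUM_OFF 148u /* ustar layout: checksum field offset */
#define DEMO_CHKSUM_LEN 8u   /* ustar layout: checksum field length */

/* Parse an octal numeric field: skip leading spaces, stop at NUL or space.
 * Returns -1 if a non-octal character is found. */
static int64_t demo_tar_octal(const unsigned char *field, size_t len)
{
    int64_t v = 0;
    size_t i = 0;
    while (i < len && field[i] == ' ') i++;
    for (; i < len && field[i] != '\0' && field[i] != ' '; i++) {
        if (field[i] < '0' || field[i] > '7') return -1;
        v = (v << 3) | (field[i] - '0');
    }
    return v;
}

/* Sum all 512 header bytes, counting the checksum field itself as spaces. */
static unsigned demo_tar_checksum(const unsigned char *block)
{
    unsigned sum = 0;
    for (size_t i = 0; i < DEMO_TAR_BLOCK; i++) {
        int in_chk = i >= DEMO_CHKSUM_OFF && i < DEMO_CHKSUM_OFF + DEMO_CHKSUM_LEN;
        sum += in_chk ? (unsigned)' ' : block[i];
    }
    return sum;
}

/* Nonzero when the stored checksum matches the computed one. */
static int demo_tar_header_ok(const unsigned char *block)
{
    int64_t stored = demo_tar_octal(block + DEMO_CHKSUM_OFF, DEMO_CHKSUM_LEN);
    return stored >= 0 && (unsigned)stored == demo_tar_checksum(block);
}
```

A reader would typically run a check like `demo_tar_header_ok()` before trusting any other field in the block.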
Entry Types Supported:
- '0' or '\0' - Regular file
- '5' - Directory
- '2' - Symlink
- '1' - Hardlink
- 'x' - pax extended header (per-file)
- 'g' - pax global header
- 'L' - GNU long filename (applied to next entry)
- 'K' - GNU long linkname (applied to next entry)
PAX Extended Headers:
PAX (Portable Archive Interchange) extended headers provide POSIX-compliant per-file and global overrides for header fields:
PAX per-file records (typeflag = 'x'):
- path - Overrides filename (supports arbitrary length)
- linkpath - Overrides symlink target (supports arbitrary length)
- size - Overrides file size (supports large files)
- uid, gid, mtime, mode - Override metadata (common in the wild)
- Applied to the next real entry (skipped in entry iteration)
PAX global records (typeflag = 'g'):
- Sets defaults for all following entries (until overridden)
- Same fields as per-file records
PAX record parsing:
- Records are decimal-length lines: LEN key=value\n, where LEN counts the whole record, including its own digits, the space, and the trailing newline
- Reads exactly the payload length from TAR header
- Safely skips padding to 512-byte boundaries
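The record format above can be parsed with a small bounds-checked routine. This is a sketch under the POSIX pax record definition; `demo_pax_record` is a hypothetical helper and does not reflect how the library structures its parser.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parse one "LEN key=value\n" record starting at buf[0], with len bytes available.
 * On success returns the record length and points key/val into buf
 * (the slices are NOT NUL-terminated). Returns 0 on malformed input. */
static size_t demo_pax_record(const char *buf, size_t len,
                              const char **key, size_t *key_len,
                              const char **val, size_t *val_len)
{
    size_t rec = 0, i = 0;

    /* Decimal length prefix, with an overflow guard. */
    while (i < len && buf[i] >= '0' && buf[i] <= '9') {
        if (rec > (SIZE_MAX - 9) / 10) return 0;
        rec = rec * 10 + (size_t)(buf[i] - '0');
        i++;
    }
    if (i == 0 || i >= len || buf[i] != ' ') return 0;

    /* Shortest valid record: digits + ' ' + '=' + '\n'. */
    if (rec > len || rec < i + 3) return 0;
    if (buf[rec - 1] != '\n') return 0;

    const char *k = buf + i + 1;
    size_t body = rec - i - 2; /* bytes between the space and the newline */
    const char *eq = memchr(k, '=', body);
    if (!eq) return 0;

    *key = k;
    *key_len = (size_t)(eq - k);
    *val = eq + 1;
    *val_len = body - *key_len - 1;
    return rec;
}
```

Calling it in a loop and advancing by the returned length walks every record in an extended-header payload; a return of 0 means the payload should be rejected.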
GNU Long Name Extensions:
GNU tar extensions for long names:
Long filename (typeflag = 'L'):
- Contains filename too long for ustar header
- Applied to the next real entry
Long linkname (typeflag = 'K'):
- Contains symlink target too long for ustar header
- Applied to the next real entry
TAR Reader State:
typedef struct TarReader {
ArcStream *stream; // Underlying stream
ArcEntry current_entry; // Current entry data
bool entry_valid; // Whether entry data is available
int64_t entry_data_offset; // Stream offset of entry data
int64_t entry_data_remaining; // Bytes remaining in entry
bool eof; // End of archive reached
} TarReader;
Key Implementation Details:
- Entry data is NOT read automatically - must call arc_open_data() or arc_skip_data()
- Entry remains valid until next arc_next() call or explicit arc_skip_data()
- Data is padded to 512-byte block boundaries
- Zero blocks indicate end of archive
The ZIP format implementation supports:
ZIP Features:
- Central Directory parsing - Fast listing using central directory (standard ZIP files)
- Streaming mode - Falls back to local header parsing when central directory is missing
- ZIP64 support - Files >4GB, archives >4GB, >65535 entries via ZIP64 EOCD + locator + extra fields
- Data descriptor support - Handles ZIPs created with streaming (bit 3 set in general purpose flags)
- Compression methods: Store (0) and Deflate (8)
- Directory detection - Detected by filename ending with /
- Encryption detection - Flags encrypted entries (extraction not supported)
ZIP64 Features:
- Automatically detects ZIP64 archives when EOCD fields contain 0xFFFFFFFF
- Reads ZIP64 End of Central Directory Locator (signature 0x07064b50)
- Parses ZIP64 End of Central Directory Record (signature 0x06064b50)
- Handles ZIP64 Extended Information Extra Field (0x0001) in both central and local headers
- Supports 64-bit file sizes, compressed sizes, and local header offsets
- Works in both central directory and streaming modes
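To make the ZIP64 detection step concrete, here is a hedged sketch that locates the End of Central Directory record in a memory buffer and checks its fields for the sentinel values. The function names are hypothetical. The field offsets and the EOCD signature 0x06054b50 follow the published ZIP format. The 0xFFFF check on the 16-bit entry counts is an assumption taken from that format, since the text above only mentions 0xFFFFFFFF.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_EOCD_SIG 0x06054b50u
#define DEMO_EOCD_MIN 22u /* fixed EOCD size without the trailing comment */

static uint16_t demo_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t demo_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Scan backwards for the EOCD record. Returns its offset, or -1 if absent. */
static long demo_find_eocd(const unsigned char *buf, size_t len)
{
    if (len < DEMO_EOCD_MIN) return -1;
    for (size_t i = len - DEMO_EOCD_MIN + 1; i-- > 0; ) {
        if (demo_le32(buf + i) != DEMO_EOCD_SIG) continue;
        /* The declared comment length must fit inside the buffer. */
        uint16_t comment_len = demo_le16(buf + i + 20);
        if (i + DEMO_EOCD_MIN + comment_len <= len) return (long)i;
    }
    return -1;
}

/* Nonzero when any EOCD field holds a sentinel that defers to ZIP64 records. */
static int demo_eocd_needs_zip64(const unsigned char *eocd)
{
    return demo_le16(eocd + 8) == 0xFFFFu ||      /* entries on this disk */
           demo_le16(eocd + 10) == 0xFFFFu ||     /* total entries */
           demo_le32(eocd + 12) == 0xFFFFFFFFu || /* central directory size */
           demo_le32(eocd + 16) == 0xFFFFFFFFu;   /* central directory offset */
}
```

When the second function reports a sentinel, a reader would go on to look for the ZIP64 locator (0x07064b50) just before the EOCD, as listed above.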
ZIP Reader State:
Internals are not part of the public API (opaque ArcReader), but conceptually the ZIP reader tracks:
- Current entry metadata + offsets
- Whether it’s using central-directory mode vs streaming mode
- Underlying stream (and optional owned underlying stream when wrapped by a filter)
Data Descriptor Support:
- Handles ZIPs created with streaming where sizes aren't known at header time
- Detects data descriptors via general purpose bit flag 3 (0x0008)
- For uncompressed entries: searches for data descriptor signature (0x08074b50) after compressed data
- For compressed entries: decompresses until EOF, then reads data descriptor
- Supports both signed (with signature) and unsigned data descriptor formats
- Falls back gracefully when data descriptors can't be found
Key Implementation Details:
- Entry data is NOT read automatically - must call arc_open_data() or arc_skip_data()
- Entry remains valid until next arc_next() call or explicit arc_skip_data()
- Central directory mode: reads all entries from central directory first
- Streaming mode: reads entries sequentially from local file headers
- Supports both compressed (deflate) and uncompressed (store) entries
The unified reader API provides format-agnostic access to archives.
Format Detection:
- Detects whole-file compression (gzip/bzip2/xz) and sniffs the decompressed header
- Checks for ZIP first (PK signatures)
- Otherwise checks TAR via ustar magic or valid TAR checksum (and rejects all-zero blocks)
- Returns {format, compression_type} so callers can recreate a fresh filter for the real reader
Compression Detection:
- Gzip: Magic bytes 0x1f 0x8b
- Bzip2: Magic bytes 'B' 'Z' 'h'
- XZ: Magic bytes 0xFD 0x37 0x7A 0x58 (compressed streams handled via liblzma filter)
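The magic-byte checks listed above fit in a few lines. The sketch below is illustrative only: the enum and function names are hypothetical and are not the library's `compression_type` values.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
typedef enum {
    DEMO_COMP_NONE,
    DEMO_COMP_GZIP,
    DEMO_COMP_BZIP2,
    DEMO_COMP_XZ
} DemoCompression;

/* Classify a buffer by its leading magic bytes; n is the number of bytes available. */
static DemoCompression demo_detect_compression(const unsigned char *p, size_t n)
{
    static const unsigned char xz_magic[4] = { 0xFD, 0x37, 0x7A, 0x58 };

    if (n >= 2 && p[0] == 0x1f && p[1] == 0x8b) return DEMO_COMP_GZIP;
    if (n >= 3 && p[0] == 'B' && p[1] == 'Z' && p[2] == 'h') return DEMO_COMP_BZIP2;
    if (n >= 4 && memcmp(p, xz_magic, sizeof xz_magic) == 0) return DEMO_COMP_XZ;
    return DEMO_COMP_NONE;
}
```

Each check guards on the available length first, so a short or empty file is classified as uncompressed instead of being read out of bounds.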
Format Types:
- ARC_FORMAT_TAR (0) - TAR format
- ARC_FORMAT_ZIP (1) - ZIP format
- ARC_FORMAT_7Z (3) - 7z format (limited)
Reader Lifecycle:
- arc_open_path() / arc_open_stream() (or *_ex variants) - Opens archive
- arc_next() - Iterate through entries
- arc_open_data() or arc_skip_data() - Handle entry data
- arc_close() - Clean up
Ownership note (filtered streams): Filters do not close their underlying stream for composability. Readers track this via:
- base.stream: what the format reads (may be a filter)
- base.owned_stream: underlying stream to also close (e.g. the file stream under a gzip filter)
Entry Management:
- arc_next() allocates path and link_target (caller must free)
- arc_entry_free() frees allocated fields
- Entry structure is copied to caller, but strings are allocated
The extraction layer provides full archive extraction capabilities.
arc_extract_to_path()
- Extracts all entries from an archive
- Creates subdirectories as needed using mkdirat() via an openat()-anchored traversal
- Preserves permissions and timestamps (optional)
- Returns error count (0 = success, >0 = some errors)
arc_extract_entry()
- Extracts a single entry
- Must be called immediately after arc_next() while entry data is available
- Creates parent directories automatically
- Handles files, directories, symlinks (TAR only), and hardlinks (TAR only)
Directory Creation:
- Uses mkdir_p_at() (openat() + mkdirat()) to create parent directories recursively
- Default mode: 0755
- Handles existing directories gracefully (EEXIST)
- Does not follow symlinks while traversing (O_NOFOLLOW)
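The directory-creation rules above suggest a loop like the one below. It is a sketch of the general openat()/mkdirat() technique, not the library's mkdir_p_at(). The name `demo_mkdir_p_at`, the 255-byte component cap, and the choice of EINVAL for rejected paths are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create every component of a relative path beneath dirfd.
 * Existing directories are accepted; symlinks are never followed.
 * Returns 0 on success, -1 with errno set on failure. */
static int demo_mkdir_p_at(int dirfd, const char *relpath, mode_t mode)
{
    if (!relpath || relpath[0] == '/') { errno = EINVAL; return -1; }

    int cur = dup(dirfd); /* private handle so the caller's fd stays untouched */
    if (cur < 0) return -1;

    const char *p = relpath;
    while (*p) {
        while (*p == '/') p++;
        size_t len = strcspn(p, "/");
        if (len == 0) break;

        char name[256];
        if (len >= sizeof name) { close(cur); errno = ENAMETOOLONG; return -1; }
        memcpy(name, p, len);
        name[len] = '\0';

        if (strcmp(name, "..") == 0) { close(cur); errno = EINVAL; return -1; }

        if (mkdirat(cur, name, mode) != 0 && errno != EEXIST) {
            int saved = errno;
            close(cur);
            errno = saved;
            return -1;
        }

        /* Step into the component; O_NOFOLLOW fails if it is a symlink. */
        int next = openat(cur, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
        int saved = errno;
        close(cur);
        if (next < 0) { errno = saved; return -1; }
        cur = next;
        p += len;
    }
    close(cur);
    return 0;
}
```

Because each step opens the next component relative to the previous directory fd, a symlink swapped into the path mid-extraction causes openat() to fail instead of redirecting writes elsewhere.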
File Extraction:
- Uses 64KB buffer for copying
- Creates files with openat(..., O_WRONLY | O_CREAT | O_TRUNC | O_NOFOLLOW, ...)
- Preserves permissions if requested
- Sets timestamps using futimens() (fd-based) if requested
Symlink Extraction (TAR only):
- Removes existing file/symlink first (unlinkat())
- Creates symlink with symlinkat()
- Does NOT preserve permissions (symlinks don't have separate permissions)
- ZIP format does not support symlinks
Hardlink Extraction (TAR only):
- Currently extracts as regular file (hardlink creation requires inode tracking)
- Future enhancement: track inode mappings and create links in second pass
- ZIP format does not support hardlinks
Attribute Preservation:
- Permissions: fchmod() with mode & 0777 (only user/group/other bits)
- Timestamps: futimens() with mtime from entry
- Ownership: Not currently preserved (would require chown() and root privileges)
Security:
- Rejects absolute paths and any .. path components (Zip-Slip prevention)
- All extraction operations are anchored to a destination directory fd (openat() family)
- Uses O_NOFOLLOW during traversal and file creation to prevent symlink races
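The first rule above (reject absolute paths and any `..` component) can be sketched as a pure string check. `demo_path_is_safe` is a hypothetical helper. Its `max_depth` parameter stands in for the max_nested_depth limit described earlier and is an assumption about how such a cap could be applied.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when path is relative, has no ".." component, and has at most
 * max_depth real components; returns 0 otherwise. */
static int demo_path_is_safe(const char *path, size_t max_depth)
{
    if (!path || path[0] == '\0' || path[0] == '/') return 0;

    size_t depth = 0;
    const char *p = path;
    while (*p) {
        size_t len = strcspn(p, "/");
        if (len == 2 && p[0] == '.' && p[1] == '.') return 0; /* traversal */
        if (len > 0 && !(len == 1 && p[0] == '.')) depth++;   /* "." adds no depth */
        p += len;
        while (*p == '/') p++;
    }
    return depth <= max_depth;
}
```

The check is component-based, so a name such as `a..b` is accepted while `a/../b` is rejected. It complements the fd-anchored extraction and does not replace it, since a string check alone cannot detect symlinks already present on disk.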
#include <inttypes.h>
#include <stdio.h>
#include "cupidarchive/arc_reader.h"

// Example 1: list entries and read the start of each entry's data
ArcReader *reader = arc_open_path("archive.tar.gz");
if (!reader) {
    // Handle error (errno describes the failure)
    return;
}
ArcEntry entry;
while (arc_next(reader, &entry) == 0) {
    // entry.size is uint64_t, so print it with PRIu64 rather than %lu
    printf("Entry: %s (size: %" PRIu64 ")\n", entry.path, entry.size);
    // Optionally read entry data
    ArcStream *data = arc_open_data(reader);
    if (data) {
        char buffer[4096];
        ssize_t n = arc_stream_read(data, buffer, sizeof(buffer));
        // ... process the n bytes in buffer (n < 0 signals an error) ...
        arc_stream_close(data);
    }
    arc_entry_free(&entry);
}
arc_close(reader);

#include "cupidarchive/arc_reader.h"

// Example 2: extract entire archive
ArcReader *reader = arc_open_path("archive.tar.gz");
if (!reader) {
    // Handle error
    return;
}
int result = arc_extract_to_path(reader, "/tmp/extracted", true, true);
if (result != 0) {
    // Handle error: a positive value is the number of entries that failed
}
arc_close(reader);

// Or extract entries one by one
ArcReader *reader2 = arc_open_path("archive.tar");
if (!reader2) {
    // Handle error
    return;
}
ArcEntry entry;
while (arc_next(reader2, &entry) == 0) {
    int rc = arc_extract_entry(reader2, &entry, "/tmp/extracted", true, true);
    if (rc < 0) {
        // Handle error for this entry
    }
    arc_entry_free(&entry);
}
arc_close(reader2);

typedef struct ArcEntry {
char *path; // Normalized path (allocated, caller must free)
uint64_t size; // File size in bytes
uint32_t mode; // File mode/permissions
uint64_t mtime; // Modification time (Unix timestamp)
uint8_t type; // Entry type (ARC_ENTRY_*)
char *link_target; // Symlink target (if applicable, allocated, caller must free)
uint32_t uid; // User ID
uint32_t gid; // Group ID
} ArcEntry;
Entry type constants:
- ARC_ENTRY_FILE (0) - Regular file
- ARC_ENTRY_DIR (1) - Directory
- ARC_ENTRY_SYMLINK (2) - Symbolic link
- ARC_ENTRY_HARDLINK (3) - Hard link
- ARC_ENTRY_OTHER (4) - Other (device files, etc.)
cd cupidarchive
make
This builds libcupidarchive.a (static library) in the root directory.
- Library: libcupidarchive.a (static archive)
- Object files: obj/*.o (compiled source files)
- Source files: src/*.c and src/*.h
Link against the library:
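gcc -o myapp myapp.c -Lcupidarchive -lcupidarchive -lz -lbz2
Add -llzma when the library was built with XZ support.
The library expects:
- -Icupidarchive - For #include "cupidarchive.h"
- -Icupidarchive/src - For internal headers (automatically included)
Dependencies:
- zlib - For gzip and deflate decompression (-lz)
- libbz2 - For bzip2 decompression (-lbz2)
- liblzma - For xz/lzma decompression (-llzma); optional, but without it .xz input cannot be read
- Standard C library - POSIX.1-2008 features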
- Hard byte limits: Every stream has a byte_limit that cannot be exceeded
- Limit enforcement: Limits checked before each read operation
- Automatic limits: File streams get 10x file size limit (for compressed archives)
- Substream limits: Automatically set to entry size
- Filter limits: Decompression filters enforce limits on decompressed data
- All array accesses are bounds-checked
- TAR header parsing validates field sizes
- Path normalization prevents buffer overflows
- Substream operations validate offset and length
- Comprehensive error codes via errno
- NULL pointer checks throughout
- Graceful degradation (e.g., hardlinks fall back to file copy)
- Resource cleanup on errors
- All allocated memory is properly freed
- Entry strings are allocated and must be freed by caller
- Streams clean up their internal data on close
- No memory leaks in normal operation
- Full archive extraction: arc_extract_to_path() extracts all entries
- Single entry extraction: arc_extract_entry() extracts one entry at a time
- Directory creation: Automatically creates parent directories as needed (mkdir_p_at() via openat()/mkdirat())
- Permission preservation: Optional preservation of file permissions (ownership is not restored)
- Timestamp preservation: Optional preservation of modification times
- Symlink support: Creates symlinks correctly (TAR format only)
- Hardlink handling: Hardlink entries are extracted as regular file copies (TAR format only; proper inode tracking is a future enhancement)
The library includes a comprehensive test suite. To run tests:
cd cupidarchive
make test
This will:
- Build the library (if not already built)
- Compile all test executables
- Run all tests and report results
Run tests with AddressSanitizer for memory error detection:
cd cupidarchive/tests
make test-asan
Run tests with Valgrind for detailed memory analysis:
cd cupidarchive/tests
make test-valgrind
See tests/README.md for more information about the test suite.
The stream abstraction uses a vtable pattern for polymorphism:
struct ArcStreamVtable {
ssize_t (*read)(ArcStream *stream, void *buf, size_t n);
int (*seek)(ArcStream *stream, int64_t off, int whence);
int64_t (*tell)(ArcStream *stream);
void (*close)(ArcStream *stream);
};
Each stream type (fd, memory, substream, filter) implements its own vtable.
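To show how one implementation plugs into such a table, here is a reduced sketch with only read and close slots. All `Demo*` and `demo_*` names are hypothetical. The real ArcStream also carries seek, tell, and byte-limit state, which are left out to keep the example short.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */

typedef struct DemoStream DemoStream;

/* Reduced vtable: just the two operations this sketch needs. */
typedef struct DemoVtable {
    ssize_t (*read)(DemoStream *s, void *buf, size_t n);
    void (*close)(DemoStream *s);
} DemoVtable;

struct DemoStream {
    const DemoVtable *vtable; /* selects the implementation */
    void *user_data;          /* implementation-specific state */
};

/* State for the memory-backed implementation. */
typedef struct DemoMem {
    const unsigned char *data;
    size_t size;
    size_t pos;
} DemoMem;

static ssize_t demo_mem_vread(DemoStream *s, void *buf, size_t n)
{
    DemoMem *m = s->user_data;
    size_t avail = m->size - m->pos;
    if (n > avail) n = avail;
    memcpy(buf, m->data + m->pos, n);
    m->pos += n;
    return (ssize_t)n;
}

static void demo_mem_vclose(DemoStream *s)
{
    free(s->user_data); /* frees the state, not the caller's buffer */
    free(s);
}

static const DemoVtable demo_mem_vtable = { demo_mem_vread, demo_mem_vclose };

static DemoStream *demo_stream_from_memory(const void *data, size_t size)
{
    DemoStream *s = malloc(sizeof *s);
    DemoMem *m = malloc(sizeof *m);
    if (!s || !m) { free(s); free(m); return NULL; }
    m->data = data;
    m->size = size;
    m->pos = 0;
    s->vtable = &demo_mem_vtable;
    s->user_data = m;
    return s;
}

/* Generic entry points: callers never see which implementation is behind s. */
static ssize_t demo_stream_read(DemoStream *s, void *buf, size_t n)
{
    return s->vtable->read(s, buf, n);
}

static void demo_stream_close(DemoStream *s)
{
    if (s) s->vtable->close(s);
}
```

A filter or fd-backed stream would supply its own vtable and state while reusing the same two generic entry points, which is what lets filters be chained on top of any stream.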
TAR format requires 512-byte block alignment:
- Entry headers are 512 bytes
- Entry data is padded to 512-byte boundaries
- End of archive is indicated by two consecutive zero blocks
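The alignment arithmetic implied by these rules is small enough to show directly. The helpers below are hypothetical names for this sketch: one computes how many padding bytes follow an entry of a given size, and one tests for the two-zero-block end marker.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAR_BLOCK 512u

/* Padding bytes that follow `size` bytes of entry data (0 to 511). */
static uint64_t demo_tar_padding(uint64_t size)
{
    return (DEMO_TAR_BLOCK - size % DEMO_TAR_BLOCK) % DEMO_TAR_BLOCK;
}

/* Total bytes the entry data occupies on disk, including padding.
 * Assumes size is far below UINT64_MAX, so the addition cannot wrap. */
static uint64_t demo_tar_padded_size(uint64_t size)
{
    return size + demo_tar_padding(size);
}

/* Nonzero when the 1024 bytes at two_blocks are all zero (end of archive). */
static int demo_tar_is_end(const unsigned char *two_blocks)
{
    for (size_t i = 0; i < 2 * DEMO_TAR_BLOCK; i++) {
        if (two_blocks[i] != 0) return 0;
    }
    return 1;
}
```

For example, a 1000-byte entry occupies 1024 bytes of data blocks, so the next header begins 1024 bytes after the start of that entry's data.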
Paths are normalized to:
- Remove leading ./
- Remove duplicate slashes (//)
- Preserve absolute paths (normalization keeps the leading /; extraction later rejects such paths)
- Handle ustar prefix + name combination
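A minimal sketch of the first three normalization rules follows. `demo_normalize_path` is a hypothetical function and does not cover the ustar prefix + name join. It returns a newly allocated string that the caller frees, mirroring the ownership convention used for entry paths.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of `in` with leading "./" runs removed and
 * duplicate slashes collapsed. A leading '/' on an absolute path is kept.
 * Returns NULL on allocation failure; the caller frees the result. */
static char *demo_normalize_path(const char *in)
{
    char *out = malloc(strlen(in) + 1); /* output never grows past the input */
    if (!out) return NULL;

    const char *p = in;
    while (p[0] == '.' && p[1] == '/') { /* strip "./", ".//", "././" ... */
        p += 2;
        while (*p == '/') p++;
    }

    size_t o = 0;
    for (; *p; p++) {
        if (*p == '/' && o > 0 && out[o - 1] == '/') continue; /* collapse "//" */
        out[o++] = *p;
    }
    out[o] = '\0';
    return out;
}
```

Only leading `./` is removed in this sketch, so an interior `a/./b` is left as it is.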
- Read first 4 bytes to detect compression
- Check compression magic bytes (gzip, bzip2, xz)
- If compressed (gzip/bzip2/xz), wrap with filter and read again
- Check for ZIP format first (magic bytes 'P' 'K' with ZIP signatures)
- If not ZIP, read first 512 bytes to check for TAR format
- Check for ustar magic or old TAR indicators
- Reset stream position and return format code
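The ZIP-then-TAR ordering in the steps above can be sketched on an in-memory header. The names are hypothetical. The ustar magic offset (257) and the PK signature pairs come from the TAR and ZIP formats. The checksum-based fallback for old TAR headers mentioned earlier is left out of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch (not the ARC_FORMAT_* values). */
typedef enum { DEMO_FMT_UNKNOWN, DEMO_FMT_ZIP, DEMO_FMT_TAR } DemoFormat;

/* Classify already-decompressed header bytes; n is the number of bytes available. */
static DemoFormat demo_detect_format(const unsigned char *p, size_t n)
{
    /* ZIP first: "PK" followed by a known signature pair
     * (local header, end of central directory, or data descriptor/spanning). */
    if (n >= 4 && p[0] == 'P' && p[1] == 'K' &&
        ((p[2] == 3 && p[3] == 4) ||
         (p[2] == 5 && p[3] == 6) ||
         (p[2] == 7 && p[3] == 8))) {
        return DEMO_FMT_ZIP;
    }

    /* TAR next: the ustar magic sits at offset 257 of the first 512-byte block. */
    if (n >= 512 && memcmp(p + 257, "ustar", 5) == 0) {
        return DEMO_FMT_TAR;
    }

    return DEMO_FMT_UNKNOWN;
}
```

Since this works on a buffer instead of a live stream, the caller would still need to rewind or recreate the stream before handing it to the real parser, as the last step above describes.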
- Entry data is NOT automatically read
- arc_next() only reads the header
- arc_open_data() creates a substream for entry data
- arc_skip_data() seeks past entry data
- Entry remains valid until next arc_next() or arc_skip_data()
- zstd compression support
- Expand 7z support (solid/multi-file, more coders, encrypted headers)
- RAR format support (read-only)
- Progress callbacks for extraction
- Extraction filters (exclude patterns)
- Proper hardlink handling (inode tracking)
- Ownership preservation (chown support)
- Archive creation (write support)
- ZIP encryption support (password-protected archives)
This library is licensed under the GNU General Public License v3.0 (GPL-3.0).
This means:
- You are free to use, modify, and distribute this library
- If you distribute a modified version of the library, you must release your changes under GPL-3.0
- If you distribute a project that uses this library, that project must also be licensed under GPL-3.0 (or a compatible license)
See the LICENSE file in the parent directory for the full license text.