Benchmarks
Spoiler: This bottlenecks any NVMe 😀
Info
All tests were executed in the following environment:
- Library Version: 0.3.0-preview (17th May 2023)
- zstd version: 1.5.2 (MSVC)
- lz4 version: K4os.Compression.LZ4 1.3.5
- CPU: AMD Ryzen 9 5900X (12C/24T)
- RAM: 32GB DDR4-3000 (16-17-17-35)
- OS: Windows 11 22H2 (Build 22621)
- Storage: Samsung 980 Pro 1TB (NVMe) [PCI-E 3.0 x4]
Some of the compression benchmarks are very dated, as ZStd 1.5.5 has since made huge improvements in handling incompressible data.
Nonetheless, they are still useful for reference.
Common Data Sets
These are the data sets which are used in multiple benchmarks below.
Textures
This dataset consists primarily of DDS BC7 textures, with maximum dimensions between 1024 and 8192, and a total size of 2.11GB.
Texture overhauls make up the majority of large mods out there; therefore, a good compression ratio on this data set is important.
Log Files
This dataset consists of 189 Reloaded-II Logs from end users, across various games, with a total size of 12.4MiB
Lightly Compressed Files
This dataset consists of every .one archive from the 2004 release of Sonic Heroes, with non-English files removed (168 MiB total).
Sometimes mods have to ship using a game's native compression and/or formats; in which case they are not very compressible, as the data is already compressed.
Many older games, and some remasters of older games, use custom compression, usually some variant of basic LZ77. Here we test on data compressed using SEGA's PRS compression scheme, which is based on LZ77 with Run-Length Encoding (RLE). This was a common compression scheme in many SEGA games over roughly 15 years.
This data set was thrown in as a bonus, to see what happens!
Block Size (Logs)
Investigates the effect of block size on large files with repeating patterns.
This test was not in-memory, thus throughput is limited by NVMe bottlenecks. Throughput is provided for reference only.
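For reference, the size side of a block-size sweep like the ones below can be approximated in a few lines of code. This is a minimal sketch using K4os.Compression.LZ4 (the binding listed in the environment info above), not the actual benchmark harness; the dataset path is a placeholder.

```csharp
using System;
using System.IO;
using K4os.Compression.LZ4;

// Compress a dataset in independent blocks of a given size and report the
// total compressed size; re-running with different block sizes reproduces
// the size side of the sweeps below.
static long CompressedSize(byte[] data, int blockSize, LZ4Level level)
{
    long total = 0;
    var target = new byte[LZ4Codec.MaximumOutputSize(blockSize)];
    for (int offset = 0; offset < data.Length; offset += blockSize)
    {
        int length = Math.Min(blockSize, data.Length - offset);
        int written = LZ4Codec.Encode(data.AsSpan(offset, length), target.AsSpan(), level);
        // A non-positive result means the block could not be compressed into
        // the target; count it as stored uncompressed.
        total += written > 0 ? written : length;
    }
    return total;
}

var data = File.ReadAllBytes("dataset.bin"); // placeholder path
foreach (var blockSize in new[] { 32767, 65535, 131071, 262143, 524287 })
    Console.WriteLine($"{blockSize}: {CompressedSize(data, blockSize, LZ4Level.L12_MAX)} bytes");
```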
ZStandard Only
The chosen level applies to both chunked and solid compression.
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
-1 | 32767 | 1,150,976 | 1.778 | 239.74MiB/s | 1.944 |
-1 | 65535 | 1,077,248 | 1.665 | 226.90MiB/s | 1.839 |
-1 | 131071 | 970,752 | 1.500 | 215.36MiB/s | 1.746 |
-1 | 262143 | 872,448 | 1.348 | 186.86MiB/s | 1.515 |
-1 | 524287 | 770,048 | 1.190 | 169.42MiB/s | 1.373 |
-1 | 1048575 | 708,608 | 1.095 | 153.09MiB/s | 1.241 |
-1 | 2097151 | 667,648 | 1.032 | 153.09MiB/s | 1.241 |
-1 | 4194303 | 659,456 | 1.019 | 136.63MiB/s | 1.107 |
-1 | 8388607 | 647,168 | 1.000 | 123.36MiB/s | 1.000 |
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
9 | 32767 | 909,312 | 2.220 | 204.94MiB/s | 2.032 |
9 | 65535 | 819,200 | 2.001 | 192.52MiB/s | 1.908 |
9 | 131071 | 733,184 | 1.790 | 186.86MiB/s | 1.852 |
9 | 262143 | 630,784 | 1.541 | 181.52MiB/s | 1.800 |
9 | 524287 | 548,864 | 1.340 | 167.19MiB/s | 1.658 |
9 | 1048575 | 495,616 | 1.210 | 156.87MiB/s | 1.556 |
9 | 2097151 | 442,368 | 1.080 | 149.49MiB/s | 1.482 |
9 | 4194303 | 413,696 | 1.010 | 127.06MiB/s | 1.259 |
9 | 8388607 | 409,600 | 1.000 | 100.84MiB/s | 1.000 |
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
16 | 32767 | 884,736 | 2.180 | 71.38MiB/s | 2.754 |
16 | 65535 | 790,528 | 1.949 | 69.43MiB/s | 2.677 |
16 | 131071 | 712,704 | 1.756 | 64.50MiB/s | 2.486 |
16 | 262143 | 598,016 | 1.474 | 52.72MiB/s | 2.034 |
16 | 524287 | 548,864 | 1.353 | 90.12MiB/s | 3.476 |
16 | 1048575 | 491,520 | 1.212 | 76.09MiB/s | 2.934 |
16 | 2097151 | 442,368 | 1.091 | 64.50MiB/s | 2.486 |
16 | 4194303 | 413,696 | 1.020 | 44.27MiB/s | 1.708 |
16 | 8388607 | 405,504 | 1.000 | 25.93MiB/s | 1.000 |
Level 16 doesn't yield much improvement; lower levels are already good with repeating data.
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
22 | 131071 | 696,320 | 2.072 | 15.55MiB/s | 3.945 |
22 | 524287 | 512,000 | 1.525 | 17.97MiB/s | 4.561 |
22 | 1048575 | 438,272 | 1.305 | 18.23MiB/s | 4.626 |
22 | 8388607 | 335,872 | 1.000 | 3.94MiB/s | 1.000 |
Level 22 excels at large blocks due to its larger window size, but it is too slow to be practical.
LZ4 Only
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
12 | 32767 | 1,159,168 | 1.490 | 158.83MiB/s | 4.326 |
12 | 65535 | 1,069,056 | 1.373 | 153.09MiB/s | 4.169 |
12 | 131071 | 983,040 | 1.263 | 138.11MiB/s | 3.764 |
12 | 262143 | 913,408 | 1.174 | 125.81MiB/s | 3.428 |
12 | 524287 | 839,680 | 1.079 | 112.45MiB/s | 3.064 |
12 | 1048575 | 806,912 | 1.037 | 116.57MiB/s | 3.176 |
12 | 2097151 | 786,432 | 1.011 | 92.75MiB/s | 2.526 |
12 | 4194303 | 786,432 | 1.011 | 68.31MiB/s | 1.860 |
12 | 8388607 | 778,240 | 1.000 | 36.72MiB/s | 1.000 |
Block Size (Recompressed Files)
Investigates the effect of block size on already lightly compressed data (w/ uncompressed headers).
This test was not in-memory, thus throughput may be subject to NVMe bottlenecks.
ZStd 1.5.4 and above bring large performance improvements for incompressible data; however, only 1.5.2 was available at the time of testing.
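A common mitigation for incompressible blocks, regardless of codec version, is to fall back to the raw bytes whenever compression does not shrink a block. The sketch below shows the general technique; the `compress` delegate is a stand-in for whichever codec is in use, and this is not necessarily how Nx implements it.

```csharp
using System;

// Generic 'store raw if compression does not help' fallback; `compress`
// stands in for whichever codec is in use (zstd, LZ4, ...).
static (byte[] Payload, bool IsCompressed) PackBlock(
    byte[] block, Func<byte[], byte[]> compress)
{
    byte[] compressed = compress(block);
    // Already-compressed input usually grows slightly when recompressed;
    // in that case, ship the raw bytes and flag the block as uncompressed.
    return compressed.Length < block.Length
        ? (compressed, true)
        : (block, false);
}
```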
ZStandard Only
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
9 | 32767 | 139,653,120 | 1.355 | 163.88MiB/s | 1.068 |
9 | 65535 | 139,575,296 | 1.354 | 163.77MiB/s | 1.068 |
9 | 262143 | 136,527,872 | 1.321 | 162.61MiB/s | 1.060 |
9 | 524287 | 135,581,696 | 1.314 | 160.95MiB/s | 1.049 |
9 | 1048575 | 129,404,928 | 1.253 | 153.41MiB/s | 1.000 |
9 | 2097151 | 122,429,440 | 1.186 | 174.43MiB/s | 1.137 |
9 | 4194303 | 113,074,176 | 1.092 | 193.88MiB/s | 1.264 |
9 | 8388607 | 105,893,888 | 1.000 | 209.17MiB/s | 1.363 |
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
16 | 32767 | 137,244,672 | 1.323 | 103.38MiB/s | 1.064 |
16 | 262143 | 134,094,848 | 1.291 | 102.71MiB/s | 1.058 |
16 | 1048575 | 127,381,504 | 1.227 | 99.98MiB/s | 1.029 |
16 | 4194303 | 111,185,920 | 1.071 | 102.58MiB/s | 1.056 |
16 | 8388607 | 103,788,544 | 1.000 | 97.13MiB/s | 1.000 |
Level | Block Size | Size | Throughput |
---|---|---|---|
-1 | 8388607 | 154,382,336 | 1297.30MiB/s |
It seems ZStd can improve on existing LZ77-only compression schemes in cases where Huffman coding is available.
This is why only levels greater than -1 show improvement.
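This can be checked by comparing level -1 against a positive level on such data. The sketch below uses the managed ZstdSharp port purely for illustration (the benchmarks above used a native binding), and the input path is a placeholder for any PRS-compressed file.

```csharp
using System;
using System.IO;
using ZstdSharp; // NuGet: ZstdSharp.Port (illustrative; any zstd binding works)

var input = File.ReadAllBytes("sample.prs"); // placeholder: LZ77/PRS-compressed data

// Positive levels apply entropy (Huffman/FSE) coding more aggressively,
// which is where the gains on LZ77-only data come from.
foreach (int level in new[] { -1, 9 })
{
    using var compressor = new Compressor(level);
    var output = compressor.Wrap(input);
    Console.WriteLine($"level {level,3}: {input.Length:N0} -> {output.Length:N0} bytes");
}
```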
LZ4 Only
Level | Block Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
12 | 32767 | 155,537,408 | 1.021 | 319.52MiB/s | 1.146 |
12 | 262143 | 153,378,816 | 1.007 | 340.32MiB/s | 1.221 |
12 | 1048575 | 152,670,208 | 1.003 | 371.06MiB/s | 1.331 |
12 | 4194303 | 152,317,952 | 1.000 | 343.02MiB/s | 1.231 |
12 | 8388607 | 152,264,704 | 1.000 | 278.74MiB/s | 1.000 |
Chunk Size (Textures)
Investigates the effect of chunk size on large, well compressible files.
ZStandard Only
Note: Due to the intricacies of ZStd, the 1MiB chunk size is omitted from the results, as it produces the same output as 4MiB.
Level | Chunk Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
-1 | 4194304 | 2,211,012,608 | 1.0003 | 1554.50MiB/s | 1.000 |
-1 | 8388608 | 2,210,525,184 | 1.0001 | 1738.17MiB/s | 1.118 |
-1 | 16777216 | 2,210,295,808 | 1.000 | 1900.24MiB/s | 1.222 |
Level | Chunk Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
9 | 4194304 | 1,802,686,464 | 1.009 | 142.77MiB/s | 1.061 |
9 | 8388608 | 1,791,832,064 | 1.003 | 134.58MiB/s | 1.000 |
9 | 16777216 | 1,786,757,120 | 1.000 | 143.72MiB/s | 1.068 |
Level | Chunk Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
16 | 4194304 | 1,784,340,480 | 1.014 | 73.38MiB/s | 1.233 |
16 | 8388608 | 1,767,153,664 | 1.005 | 66.71MiB/s | 1.121 |
16 | 16777216 | 1,759,150,080 | 1.000 | 59.52MiB/s | 1.000 |
Level | Chunk Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
22 | 4194304 | 1,769,832,448 | 1.023 | 50.76MiB/s | 1.565 |
22 | 8388608 | 1,747,853,312 | 1.010 | 42.33MiB/s | 1.304 |
22 | 16777216 | 1,730,514,944 | 1.000 | 32.45MiB/s | 1.000 |
LZ4 Only
LZ4 has some niche applications; fast texture loading is one of them.
Level | Chunk Size | Size | Ratio (Size) | Throughput | Ratio (Throughput) |
---|---|---|---|---|---|
12 | 4194304 | 2,013,491,200 | 1.001 | 364.04MiB/s | 1.103 |
12 | 8388608 | 2,012,303,360 | 1.000 | 357.71MiB/s | 1.084 |
12 | 16777216 | 2,011,750,400 | 1.000 | 330.12MiB/s | 1.000 |
LZ4 trades a lot of file size for decompression speed.
Depending on the use case this might be acceptable, but for longer-term archival, ZStd level 9+ is preferred.
Thread Scaling: Packing
Files were packed with a 16MB chunk size; chunk size did not seem to affect compression speed.
ZStandard Only
Packing Speed with ZStd Only, per Compression Level
Native ZStd library used; speeds are within margin of error on all runtimes.
Level | Thread Count | Throughput | Ratio (Throughput) |
---|---|---|---|
-1 | 1 | 567.11MiB/s | 1.000 |
-1 | 2 | 904.47MiB/s | 1.595 |
-1 | 3 | 1190.45MiB/s | 2.099 |
-1 | 4 | 1308.24MiB/s | 2.306 |
I (Sewer) cannot test more than 4 threads on my system due to I/O bottlenecks, but I expect the scaling to continue linearly.
Level | Thread Count | Throughput | Ratio (Throughput) |
---|---|---|---|
9 | 1 | 60.22MiB/s | 1.00 |
9 | 2 | 92.33MiB/s | 1.53 |
9 | 3 | 117.53MiB/s | 1.95 |
9 | 4 | 134.76MiB/s | 2.24 |
9 | 8 | 179.55MiB/s | 2.98 |
9 | 12 | 172.57MiB/s | 2.86 |
9 | 24 (SMT) | 142.57MiB/s | 2.37 |
Level | Thread Count | Throughput | Ratio (Throughput) |
---|---|---|---|
16 | 1 | 8.79MiB/s | 1.00 |
16 | 2 | 13.49MiB/s | 1.53 |
16 | 3 | 18.70MiB/s | 2.13 |
16 | 4 | 22.45MiB/s | 2.56 |
16 | 8 | 37.25MiB/s | 4.24 |
16 | 12 | 46.25MiB/s | 5.26 |
16 | 24 (SMT) | 59.71MiB/s | 6.79 |
The throughput numbers are very dated; use them as a reference for thread scaling performance only.
Testing with NexusMods.Archives.Nx 0.5.0 has 12-thread, level 9 compression running in excess of 250MiB/s, considerably faster than the numbers here.
LZ4 Only
Packing Speed with LZ4 Only, Compression Level 12
Thread Count | Throughput | Ratio (Throughput) |
---|---|---|
1 | 32.00 MiB/s | 1.00 |
2 | 52.94 MiB/s | 1.65 |
3 | 82.22 MiB/s | 2.57 |
4 | 102.72 MiB/s | 3.21 |
8 | 192.76 MiB/s | 6.02 |
12 | 256.30 MiB/s | 8.01 |
24 (SMT) | 346.98 MiB/s | 10.84 |
Scaling for LZ4 is mostly linear with real core count.
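Scaling like this can be measured with a sketch along the following lines: compress independent chunks under a capped degree of parallelism and time the run. This again uses K4os.Compression.LZ4 and assumes the data is already split into chunks; it models in-memory packing only, whereas the real benchmark also includes I/O.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using K4os.Compression.LZ4;

// Time LZ4 packing of independent chunks at a given degree of parallelism;
// calling this with 1, 2, 4, ... threads reproduces a scaling curve.
static double ThroughputMiBps(byte[][] chunks, int threads)
{
    var sw = Stopwatch.StartNew();
    Parallel.ForEach(
        chunks,
        new ParallelOptions { MaxDegreeOfParallelism = threads },
        chunk =>
        {
            var target = new byte[LZ4Codec.MaximumOutputSize(chunk.Length)];
            LZ4Codec.Encode(chunk.AsSpan(), target.AsSpan(), LZ4Level.L12_MAX);
        });
    double totalMiB = chunks.Sum(c => (long)c.Length) / (1024.0 * 1024.0);
    return totalMiB / sw.Elapsed.TotalSeconds;
}

// Example usage (chunks = pre-split 16MB pieces of the dataset):
// foreach (var t in new[] { 1, 2, 4, 8, 12 })
//     Console.WriteLine($"{t} threads: {ThroughputMiBps(chunks, t):F2} MiB/s");
```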
Thread Scaling: Extraction
All benchmarks under this section are in-memory, as the speeds exceed what consumer-grade NVMe drives can achieve.
Due to the layout of the archive format, the nature of the test data (i.e. many small files vs. big files) has no effect on performance (outside of the standard filesystem/storage inefficiencies when writing many small files to disk).
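In other words, each chunk decompresses independently into a known slice of the output, so the work parallelizes the same way whether the bytes belong to one big file or many small ones. A rough sketch of this idea follows; the chunk-table shape here is hypothetical, not the real archive format.

```csharp
using System;
using System.Threading.Tasks;
using K4os.Compression.LZ4;

// Hypothetical chunk table entry; the real archive layout is more involved.
record Chunk(byte[] CompressedData, int DecompressedOffset, int DecompressedSize);

static class Extractor
{
    // Decompress independent chunks in parallel into one preallocated buffer;
    // no locking is needed because the output ranges never overlap.
    public static void ExtractInMemory(Chunk[] chunks, byte[] output) =>
        Parallel.ForEach(chunks, chunk =>
            LZ4Codec.Decode(
                chunk.CompressedData,
                output.AsSpan(chunk.DecompressedOffset, chunk.DecompressedSize)));
}
```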
Some reference speeds:
Compression Method | Speed |
---|---|
memcpy | ~16.62 GiB/s |
lz4 1.9.2 (native), 1 thread, [lzbench] | ~4.204 GiB/s |
ZStandard Only
Decompression Speed when Extracting with ZStandard Only
Files were packed with 16MB Chunk Size, and ZStd Level 16
Native ZStd library used; speeds are within margin of error on all runtimes.
.NET 7 / 8 Preview 3
Thread Count | Speed |
---|---|
1 | ~1.01 GiB/s |
2 | ~1.99 GiB/s |
3 | ~2.87 GiB/s |
4 | ~3.70 GiB/s |
6 | ~5.15 GiB/s |
8 | ~6.35 GiB/s |
10 | ~7.38 GiB/s |
12 | ~7.81 GiB/s |
24 | ~7.19 GiB/s ‼️ |
There is an observed dropoff with hyperthreading, presumably due to cache inefficiency.
Ideally we could detect the real (physical) core count, but this is hard; the OS deliberately abstracts it away.
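For what it's worth, on Windows the physical core count can be queried via WMI. A hedged sketch, assuming the System.Management package (Windows-only):

```csharp
using System;
using System.Management; // NuGet: System.Management (Windows-only)

// Environment.ProcessorCount reports *logical* processors (24 on a 5900X);
// WMI's Win32_Processor can report physical cores (12) instead.
int physicalCores = 0;
using (var searcher = new ManagementObjectSearcher("SELECT NumberOfCores FROM Win32_Processor"))
    foreach (var cpu in searcher.Get())
        physicalCores += Convert.ToInt32(cpu["NumberOfCores"]);

Console.WriteLine($"Logical: {Environment.ProcessorCount}, Physical: {physicalCores}");
```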
LZ4 Only
Decompression Speed when Extracting with LZ4 Only
Files were packed with 16MB Chunk Size, and LZ4 Level 12
.NET 7
Thread Count | Speed |
---|---|
1 | ~2.83 GiB/s |
2 | ~5.56 GiB/s |
3 | ~8.09 GiB/s |
4 | ~10.00 GiB/s |
6+ | ~11.63 GiB/s |
.NET 8 Preview 3
Thread Count | Speed |
---|---|
1 | ~3.82 GiB/s |
2 | ~7.36 GiB/s |
3 | ~10.41 GiB/s |
4+ | ~11.72 GiB/s |
8+ | ~12.10 GiB/s |
.NET 8 Preview 3 NativeAOT
Thread Count | Speed |
---|---|
1 | ~3.26 GiB/s |
2 | ~6.33 GiB/s |
3 | ~9.20 GiB/s |
4 | ~11.09 GiB/s |
6+ | ~11.93 GiB/s |
Presets
The following presets have been created...
Fast/Random Access Preset
This preset is designed to optimise random access, and is intended for use when low-latency previews are needed, such as in the Nexus App.
- Solid Algorithm: ZStandard
- Chunked Algorithm: ZStandard
- Solid Compression Level: -1
- Chunked Compression Level: 9
Archival/Upload Preset
This preset is designed for all other use cases, providing a fair balance overall.
- Solid Algorithm: ZStandard
- Chunked Algorithm: ZStandard
- Solid Compression Level: 16
- Chunked Compression Level: 9
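Expressed as code, the two presets boil down to a pair of settings bundles. The record and enum names below are illustrative, mirroring the CLI flags shown later; they are not the library's actual API.

```csharp
// Illustrative preset definitions mirroring the values above.
enum Algorithm { ZStandard, LZ4 }

record PackPreset(
    Algorithm SolidAlgorithm, Algorithm ChunkedAlgorithm,
    int SolidLevel, int ChunkedLevel)
{
    public static readonly PackPreset RandomAccess =
        new(Algorithm.ZStandard, Algorithm.ZStandard, SolidLevel: -1, ChunkedLevel: 9);

    public static readonly PackPreset Archival =
        new(Algorithm.ZStandard, Algorithm.ZStandard, SolidLevel: 16, ChunkedLevel: 9);
}
```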
Deduplication of Chunks
Tested on NexusMods.Archives.Nx 0.5.0
The benchmarks use a 1M block and chunk size unless specified otherwise.
This measures the overhead of chunked deduplication.
SOLID deduplication is 'free'; I've been unable to observe a regression larger than the margin of error (< 0.2%).
"Dedupe All" means that both SOLID and chunked files are deduplicated.
Testing time was a bit limited this time around.
Skyrim 202X 9.0 - Architecture
Skyrim's Most Popular Texture Pack
651 textures, total 11.6GiB in size.
Scenario | Throughput (Run 1) | Throughput (Run 2) | Throughput (Run 3) | Size (MB) |
---|---|---|---|---|
Solid Only | 302.97 MiB/s | 307.27 MiB/s | 309.29 MiB/s | 9,264.76 |
Dedupe All | 309.06 MiB/s | 304.67 MiB/s | 310.13 MiB/s | 8,749.81 |
Deduplicating chunked files reduced the size by about 515 MB with minimal impact on packing time.
And with 16MB chunks as an additional point of reference:
Scenario | Throughput (Run 1) | Throughput (Run 2) | Throughput (Run 3) | Size (MiB) |
---|---|---|---|---|
Solid Only | 287.79 MiB/s | 287.61 MiB/s | 293.61 MiB/s | 8,908.04 |
Dedupe All | 282.46 MiB/s | 286.96 MiB/s | 287.43 MiB/s | 8,414.91 |
Larger chunk size slightly improved packing times and reduced file sizes compared to 1M chunks.
Stardew Valley 1.6.8
The full game, Steam version.
Scenario | Throughput (Run 1) | Throughput (Run 2) | Throughput (Run 3) | Size (MB) |
---|---|---|---|---|
Solid Only | 263.20 MiB/s | 259.67 MiB/s | 262.01 MiB/s | 456.49 |
Dedupe All | 261.52 MiB/s | 256.04 MiB/s | 269.83 MiB/s | 455.97 |
The performance difference and space saving were negligible.
Adachi over Everyone Mod
An unreleased meme mod sitting on my hard drive, 16.2GB in size.
5,657 items; it contains 57 duplicated files totalling 15.8GiB, while the remaining ~400MB are unique files.
Scenario | Time (ms) | Throughput | Size (MiB) |
---|---|---|---|
Solid Only | 43,572 | 394.93 MiB/s | 5,349.69 |
Dedupe All | 8,005 | 2135.97 MiB/s | 277.67 |
Deduplication reduced the file size by about 95% and improved packing speed by roughly 5.4x.
Summary
Some additional, quicker testing was also done on other sources while benchmarking.
Currently, at the time of writing, deduplication of chunks works by hashing the first page (4096 bytes) of the first chunk, and then using that to quickly determine if there's a potential duplicate. If there is, the entire file is hashed and compared.
There are a lot of nuances and extra optimizations in there, but this is the general idea.
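A condensed sketch of that two-stage check is below. The hash function (xxHash64 via System.IO.Hashing) and the in-memory table are illustrative choices, and a real implementation would cache full hashes rather than recompute them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Hashing; // NuGet: System.IO.Hashing (illustrative choice)
using System.Linq;

static class Deduplicator
{
    // Short hash (first 4096 bytes) -> paths of files seen with that hash.
    private static readonly Dictionary<ulong, List<string>> Seen = new();

    public static bool IsDuplicate(string path)
    {
        ulong shortHash = HashFirstPage(path);
        if (!Seen.TryGetValue(shortHash, out var candidates))
        {
            Seen[shortHash] = new() { path };
            return false; // cheap path: no candidate, no full-file read
        }

        // Potential duplicate: only now pay for full-file hashes.
        // (A real implementation would cache these instead of recomputing.)
        ulong fullHash = HashWholeFile(path);
        if (candidates.Any(c => HashWholeFile(c) == fullHash))
            return true;

        candidates.Add(path);
        return false;
    }

    private static ulong HashFirstPage(string path)
    {
        using var stream = File.OpenRead(path);
        Span<byte> page = stackalloc byte[4096];
        int read = stream.ReadAtLeast(page, page.Length, throwOnEndOfStream: false);
        return XxHash64.HashToUInt64(page[..read]);
    }

    private static ulong HashWholeFile(string path) =>
        XxHash64.HashToUInt64(File.ReadAllBytes(path));
}
```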
From extended testing, the general conclusions are:
- Performance is lost when files have same short hash but different content.
- This is common in some textures, as many may have transparent borders.
- This leads to prematurely reading the full file when it is not necessary.
- Huge files with same first 4096 bytes lead to slowdown.
- Realistic max overhead is ~5% of throughput.
- In practice, however, it's within margin of error (~2%).
- Successful finding of duplicates improves packing speed.
- Because we need to compress less.
Comparison to Common Archiving Solutions
Tests here were run on .NET 8 Preview 3 NativeAOT; there aren't currently any significant differences between runtimes here.
All tests were run with the optimal thread count (best-case scenario).
NX is entirely I/O bottlenecked here (on a PCI-E 3.0 drive); therefore, in-memory benchmarks are also provided.
Compression Scripts
Zip (Maximum):
"7z.exe" a -tzip -mtp=0 -mm=Deflate -mmt=on -mx7 -mfb=64 -mpass=3 -bb0 -bse0 -bsp2 -mtc=on -mta=on "output" "input"
Zip (Maximum, Optimized):
"7z.exe" a -tzip -mtp=0 -mm=Deflate -mmt=on -mx7 -mfb=64 -mpass=1 -bb0 -bse0 -bsp2 -mtc=on -mta=on "output" "input"
7z (Normal):
7z.exe" a -t7z -m0=LZMA2 -mmt=on -mx5 -md=16m -mfb=32 -ms=4g -mqs=on -sccUTF-8 -bb0 -bse0 -bsp2 -mtc=on -mta=on "output" "input"
7z (Ultra):
7z.exe" a -t7z -m0=LZMA2 -mmt=on -mx9 -md=64m -mfb=64 -ms=16g -mqs=on -sccUTF-8 -bb0 -bse0 -bsp2 -mtc=on -mta=on "output" "input"
Nx (Archival Preset):
NexusMods.Archives.Nx.Cli.exe pack --source "input" --target "output" --solid-algorithm ZStandard --chunked-algorithm ZStandard --solidlevel 16 --chunkedlevel 9
Nx (Random Access Preset):
NexusMods.Archives.Nx.Cli.exe pack --source "input" --target "output" --solid-algorithm ZStandard --chunked-algorithm ZStandard --solidlevel -1 --chunkedlevel 9
The Zip (Optimized) settings cannot be set via the GUI, only via command-line parameters.
Unpacking Scripts
7z/Zip:
7z.exe" x -aos "-ooutput" -bb0 -bse0 -bsp2 -pdefault -sccUTF-8 -snz "input"
Nx:
./NexusMods.Archives.Nx.Cli.exe extract --source input --target "output"
Textures
Packing:
Method | Time Taken | Ratio (Time) | Size | Ratio (Size) |
---|---|---|---|---|
Zip (Maximum) | 17.385s | 2.34 | 1,957,310,894 | 1.24 |
Zip (Optimized) | 7.443s | 1.00 | 1,961,872,315 | 1.25 |
7z (Normal) | 73.572s | 9.88 | 1,616,082,220 | 1.03 |
7z (Ultra) | 120.746s | 16.23 | 1,570,741,070 | 1.00 |
Nx (Random Access, 12T) | 12.909s | 1.73 | 1,787,568,128 | 1.14 |
Nx (Archival, 12T) | 12.955s | 1.74 | 1,786,634,240 | 1.14 |
Unpacking:
Method | Time Taken | Ratio (Time) |
---|---|---|
Zip | 13.857s | 50.76 |
7z (Ultra) | 6.238s | 22.86 |
7z (Normal) | 3.705s | 13.57 |
Nx (12T) | 1.172s | 4.29 |
Nx [In-Memory] (12T) | 0.273s | 1.00 |
Logs
Packing:
Method | Time Taken | Ratio (Time) | Size | Ratio (Size) |
---|---|---|---|---|
Zip (Maximum) | 0.149s | 2.07 | 758,928 | 2.18 |
Zip (Optimized) | 0.115s | 1.60 | 768,374 | 2.21 |
7z (Normal) | 0.297s | 4.13 | 378,780 | 1.09 |
7z (Ultra) | 0.545s | 7.57 | 347,574 | 1.00 |
Nx (Archival, 12T) | 0.194s | 2.69 | 491,520 | 1.41 |
Nx (Random Access, 12T) | 0.072s | 1.00 | 708,608 | 2.04 |
Unpacking:
Method | Time Taken | Ratio (Time) |
---|---|---|
Zip | 0.214s | 167.19 |
7z (Ultra) | 0.225s | 175.78 |
7z (Normal) | 0.267s | 208.59 |
Nx (Random Access) | 0.127s | 99.22 |
Nx (Archival) | 0.127s | 99.22 |
Nx (Random Access) [In-Memory] (1T) | 0.00304s | 2.38 |
Nx (Archival) [In-Memory] (1T) | 0.00300s | 2.34 |
Nx (Random Access) [In-Memory] (4T) | 0.0016s | 1.25 |
Nx (Archival) [In-Memory] (4T) | 0.00128s | 1.00 |
Extraction is I/O bottlenecked; Windows is slow to create small files.
Random Access mode is not faster here due to the nature of the data set. For larger compressed blocks, however, it achieves ~3.5x the speed.