Implement brotli's version of lz_extend. #364
base: master
Conversation
Force-pushed from bf13c0d to 26d4beb.
Thanks for looking into this! This is a part of the code where it feels like there should be something better, but it's been hard to find something that can actually be shown to be better.

When claiming a certain percent performance improvement, especially for small differences, it would be helpful to mention which CPU and compiler were used. In a couple of quick tests, I don't see the 2-3% improvement on the CPU and compiler combinations I currently do the most testing on: AMD Zen 1 with the latest versions of gcc and clang, and an arm64 processor with clang. It's either a negligible improvement or even a slight regression.

I think you're probably onto something, though: loops should only be unrolled when it provides a clear benefit, and writing the code to increment the pointers is probably a better choice than using indexing, especially since many non-x86 CPUs do not support indexed addressing, so the compiler would have to convert the code to incrementing the pointers anyway.

Before making a change here, it would be helpful to really understand how the two separate changes (re-rolling the loop, and not using indexed addressing) each affect the generated assembly with gcc and clang, and what result that has on different CPUs. That should allow making a more informed choice, and maybe point to something even better.
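To make the "re-rolled loop with incrementing pointers" idea concrete, here is a minimal sketch of a byte-at-a-time match extender written in that style. The function name and signature are illustrative only, not libdeflate's actual `lz_extend` API:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch (not libdeflate's code): extend a match one byte
 * at a time using incrementing pointers rather than indexed addressing.
 * Returns the total match length, starting from an already-known
 * matching prefix of start_len bytes. */
static size_t
lz_extend_simple(const uint8_t *strptr, const uint8_t *matchptr,
                 size_t start_len, size_t max_len)
{
    const uint8_t *p = strptr + start_len;
    const uint8_t *q = matchptr + start_len;
    const uint8_t *end = strptr + max_len;

    /* Pointers advance directly; no base+index address computation. */
    while (p < end && *p == *q) {
        p++;
        q++;
    }
    return (size_t)(p - strptr);
}
```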
I'm testing on Epyc Rome (Zen 2) with clang built from HEAD. It could also be the dataset I'm testing on: a pile of minified JavaScript and CSS. What is your preferred dataset?
This provides a 2-3% speed up for compression.

Before:
```
Compression          Total      Compressed         |       Compression        |
Method               Size       Size       Ratio   | Iters  Time        Rate         |
----------------------------------------------------------------------------------
libdeflate-gzip:1    270062086  85630362   3.154   |  20    26.493 s    194.43 MB/s  |
libdeflate-gzip:2    270062086  84037129   3.214   |  20    35.747 s    144.10 MB/s  |
libdeflate-gzip:3    270062086  82391861   3.278   |  20    39.707 s    129.73 MB/s  |
libdeflate-gzip:4    270062086  81420541   3.317   |  20    43.029 s    119.71 MB/s  |
libdeflate-gzip:5    270062086  78832080   3.426   |  20    50.630 s    101.74 MB/s  |
libdeflate-gzip:6    270062086  78021372   3.461   |  20    63.719 s     80.84 MB/s  |
libdeflate-gzip:7    270062086  77594012   3.480   |  20    87.918 s     58.59 MB/s  |
libdeflate-gzip:8    270062086  77190199   3.499   |  20   147.452 s     34.93 MB/s  |
libdeflate-gzip:9    270062086  77156775   3.500   |  20   191.025 s     26.97 MB/s  |
```
After:
```
Compression          Total      Compressed         |       Compression        |
Method               Size       Size       Ratio   | Iters  Time        Rate         |
----------------------------------------------------------------------------------
libdeflate-gzip:1    270062086  85630362   3.154   |  20    26.228 s    196.39 MB/s  |
libdeflate-gzip:2    270062086  84037129   3.214   |  20    34.950 s    147.38 MB/s  |
libdeflate-gzip:3    270062086  82391861   3.278   |  20    39.140 s    131.61 MB/s  |
libdeflate-gzip:4    270062086  81420541   3.317   |  20    41.927 s    122.86 MB/s  |
libdeflate-gzip:5    270062086  78832080   3.426   |  20    50.023 s    102.97 MB/s  |
libdeflate-gzip:6    270062086  78021372   3.461   |  20    61.799 s     83.35 MB/s  |
libdeflate-gzip:7    270062086  77594012   3.480   |  20    85.528 s     60.23 MB/s  |
libdeflate-gzip:8    270062086  77190199   3.499   |  20   145.867 s     35.31 MB/s  |
libdeflate-gzip:9    270062086  77156775   3.500   |  20   189.208 s     27.22 MB/s  |
```
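For readers unfamiliar with the technique being ported: brotli's match-length routine compares a machine word at a time and uses a trailing-zero count to locate the first differing byte in a mismatching word. Here is a hedged sketch of that idea; it assumes a little-endian target and the GCC/clang `__builtin_ctzll` builtin, and the function name is illustrative rather than either project's actual code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of word-at-a-time match extension in the spirit of brotli's
 * FindMatchLengthWithLimit (assumes little-endian; not the real code). */
static size_t
find_match_length(const uint8_t *s1, const uint8_t *s2, size_t limit)
{
    size_t matched = 0;

    /* Compare 8 bytes per iteration. */
    while (matched + 8 <= limit) {
        uint64_t w1, w2;
        memcpy(&w1, s1 + matched, 8);   /* memcpy = safe unaligned load */
        memcpy(&w2, s2 + matched, 8);
        uint64_t x = w1 ^ w2;
        if (x != 0) {
            /* On little-endian, the lowest set bit of the XOR
             * identifies the first differing byte. */
            return matched + ((size_t)__builtin_ctzll(x) >> 3);
        }
        matched += 8;
    }
    /* Byte-at-a-time tail for the final 0-7 bytes. */
    while (matched < limit && s1[matched] == s2[matched])
        matched++;
    return matched;
}
```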
Force-pushed from 26d4beb to 32a7739.
For quick testing I just use an amalgamation of a bunch of different files. For more detailed testing I do look at different types of files specifically, including types that users of libdeflate have mentioned are important to them, for example bioinformatics data. I should probably add more JavaScript and CSS to my testing.

Testing on a file containing DNA sequencing data, I happen to see a ~2% improvement with clang, which is similar to what you're reporting. I'm guessing that the significant factor here is the distribution of match lengths in the data. At the same time, on that file I don't see that improvement with gcc, and just building with gcc instead of clang results in a 10% improvement on that file in the first place.

I'm thinking your change is probably a benefit overall, but I'd like to really understand what's going on here, as there may be a way to do even better.
A quick test suggests that just removing the loop unrolling and the gotos is better than this change: (quality 6) HEAD: 81.82 MB/s
Just switching to pointer arithmetic also seems better than HEAD, but admittedly the variability of my measurements is pretty high: (quality 6) HEAD: 80.82 MB/s
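To illustrate the two variants being compared here, this sketch writes the same byte-compare loop once with indexed addressing and once with incrementing pointers. Both functions are hypothetical illustrations, not libdeflate's code; on ISAs without an indexed addressing mode the compiler must lower the indexed form to pointer increments anyway, which is why the two can generate different code on x86 but often identical code elsewhere:

```c
#include <stdint.h>
#include <stddef.h>

/* Indexed form: every access computes base + index. */
static size_t
match_len_indexed(const uint8_t *a, const uint8_t *b, size_t limit)
{
    size_t i = 0;
    while (i < limit && a[i] == b[i])
        i++;
    return i;
}

/* Pointer form: the pointers themselves advance. */
static size_t
match_len_ptr(const uint8_t *a, const uint8_t *b, size_t limit)
{
    const uint8_t *p = a;
    const uint8_t *end = a + limit;
    while (p < end && *p == *b) {
        p++;
        b++;
    }
    return (size_t)(p - a);
}
```

Comparing the generated assembly for these two (e.g. with `gcc -O2 -S` and `clang -O2 -S`) is one way to see the effect the reviewer describes.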
This provides a 2-3% speed up for compression, with no change to the bytestream. See here for the brotli version.