
Improve ImageSearch 2 #35

Open
iseahound opened this issue Dec 12, 2023 · 31 comments
Labels: enhancement (New feature or request)

Comments

@iseahound (Owner)

This is the second attempt at improving ImageSearch.

The following fixes a bug where the last row of the image is not searched.

https://godbolt.org/z/oEoMW7zEK
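For context, a missed last row usually comes from an exclusive bound on the outer scan loop. A minimal scalar sketch with illustrative names (not the code in the Compiler Explorer link):

```c
// Sketch: scan every top-left offset where the needle still fits inside the
// haystack. Writing "<" instead of "<=" here skips the last valid row/column.
static int imagesearch_scalar(const unsigned int *haystack, int hw, int hh,
                              const unsigned int *needle,   int nw, int nh) {
    for (int y = 0; y <= hh - nh; y++)        // "<=" keeps the last row
        for (int x = 0; x <= hw - nw; x++) {  // and the last column
            int ok = 1;
            for (int j = 0; ok && j < nh; j++)
                for (int i = 0; ok && i < nw; i++)
                    ok = haystack[(y + j) * hw + (x + i)] == needle[j * nw + i];
            if (ok) return y * hw + x;        // linear offset of the match
        }
    return -1;                                // not found
}
```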

@iseahound (Owner, Author)

Use -O3 for the above script. For some reason it wasn't allocating variables into registers correctly and was using the stack instead, which is slower.

@iseahound (Owner, Author)

https://godbolt.org/z/WPWjf8c96

@martona commented Dec 15, 2023

I don't know if you'd be interested in an external contribution. I'm working on an imagesearch replacement with more features (tolerance and % of pixels matched). I'm getting close to finishing it but I'm not dead set on releasing it as standalone.

I have 512 and 128-bit SIMD working, and also have the pedestrian base code that doesn't use vectors, even though that's not a strict necessity because every 64-bit CPU should have at least SSE2. Pretty happy with what I've got so far, at least with the speed.

I still need to write the 256-bit part, but that should be easy with how modular the thing ended up being. The unknown is the CPUID stuff and sorting machines into v4/v3/v2/v1, which largely describes the vectorization tier and also maps to the gcc -march option used.

Anyway, lmk if there's any interest.
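For the CPUID/tier sorting mentioned above, GCC and Clang's builtin feature checks cover most of it. A rough sketch; the feature lists per tier are abbreviated and the v1-v4 mapping is an assumption based on the x86-64 psABI levels:

```c
#include <stdio.h>

// Sketch: sort the running CPU into the v4/v3/v2/v1 buckets that roughly
// correspond to gcc's -march=x86-64-v* levels. Feature lists abbreviated.
static int cpu_tier(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw"))
        return 4;   // 512-bit path
    if (__builtin_cpu_supports("avx2"))
        return 3;   // 256-bit path
    if (__builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("popcnt"))
        return 2;   // 128-bit path with the v2 extras
    return 1;       // baseline x86-64 (SSE2)
}

int main(void) { printf("x86-64-v%d\n", cpu_tier()); return 0; }
```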

@iseahound (Owner, Author)

Sure thing! Having as many approaches as possible is the best solution. I'm still working on optimizing the basic ImageSearch functionality by avoiding the O(n²) subimage matching loop as much as possible.

@iseahound (Owner, Author) commented Dec 15, 2023

@martona There is a requirement for a "focused pixel" (an additional pixel at the center of the image, or one chosen by some algorithm). See: https://godbolt.org/z/j6bvo1YPd

An explanation: Doing 2 pixel searches in parallel reduces the need to enter the subimage matching loop (for lack of a better term). Also, if the top-left pixel is transparent, it's necessary to have a visible pixel somewhere as a backup.
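A minimal scalar sketch of the two-pixel prefilter, assuming 32-bit ARGB pixels and an illustrative focused-pixel offset (fx, fy); the caller is assumed to guarantee the needle fits at (x, y):

```c
// Sketch: reject a candidate position (x, y) cheaply by checking only the
// needle's top-left pixel and one "focused" pixel before running the full
// subimage comparison. Both reads are O(1), so most positions never enter
// the subimage matching loop.
static int prefilter(const unsigned int *haystack, int hw,
                     const unsigned int *needle,   int nw,
                     int x, int y, int fx, int fy)
{
    return haystack[ y       * hw + x      ] == needle[0]             // top-left
        && haystack[(y + fy) * hw + x + fx ] == needle[fy * nw + fx]; // focused
}
```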

@iseahound (Owner, Author) commented Dec 15, 2023

Notes:
#33 - Completed optimizations using parallel PixelSearch before ImageSearch.
#36 - Adding variation to ImageSearch (wind0204)

@martona commented Dec 15, 2023

I'll tidy things up and give you access to a repo later today (I'm in the US, East), so you can check it out for yourself.

My take on the C stuff you're doing... don't sweat optimizing it. I mean you can, it's fun, but vectors are so much faster. I don't have recent SSE2 numbers at hand, but AVX512 is an order of magnitude faster than what you can achieve with wrangling standard C code. SSE would be slower, probably half the speed of AVX (much of it because it's missing integer cmple/cmpge for some reason). 256-bit SIMD should be the sweet spot.

Just to give you a rough idea, my test haystack is 822x987, the needle (75x51) is found at 680x430 or so, and AVX512 can do it in 2ms with the benchmark written in AHK and using ImagePutBuffers for both haystack and needle, so there's some overhead. If I'm not forcing a top-left pixel match (which is an option), meaning the subimage search is entered for every haystack pixel, it's still just 10ms.

But, to be fair, I ignore the A channel completely. I suppose handling it in a non-fancy way wouldn't slow things down much. I just never had the need for it. It's usually about finding stuff that came from the screen, on the screen.
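As an aside on the missing integer cmple/cmpge: the usual SSE2 workaround is min/max plus cmpeq. A small sketch of the per-byte range test (not taken from either codebase):

```c
#include <emmintrin.h>

// Sketch: unsigned per-byte test for lo <= px <= hi without cmple/cmpge.
// px >= lo  iff  max(px, lo) == px;  px <= hi  iff  min(px, hi) == px.
static __m128i in_range_epu8(__m128i px, __m128i lo, __m128i hi) {
    __m128i ge_lo = _mm_cmpeq_epi8(_mm_max_epu8(px, lo), px);
    __m128i le_hi = _mm_cmpeq_epi8(_mm_min_epu8(px, hi), px);
    return _mm_and_si128(ge_lo, le_hi);   // 0xFF where in range, 0x00 otherwise
}
```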

@martona commented Dec 15, 2023

Great work on the ImagePutBuffer project btw, it's immensely useful and a joy to use.

@iseahound (Owner, Author) commented Dec 15, 2023

Thanks! I personally have zero use for ImageSearch/PixelSearch myself, so any work on it is purely for learning, academic, or for fun. With that said, one common use case is when the target image is a sprite (meaning there is a character surrounded by transparent pixels). By ignoring those regions, one can search for a non-square collection of pixels. (You can do: ImagePutBuffer({sprite: "cat.png"}) or buf.TransColor()).

SIMD intrinsics are best when repetitive data can be streamed, such as in a loop. There's also some outer logic which controls how often the subimage loop is entered and exited, and that can be highly optimized by the CPU's branch predictor. For example, if one branch is taken 90% of the time versus 95%, the 95% version can be over 2x faster because it is mispredicted half as often. To me, optimization is like a layer cake: each level of complexity has different things that should be optimized. For example, the current ImageSearch + variation is unoptimized, because preprocessing can split the target image into high and low variation images instead of encoding SIMD instructions to compute color + variation and color - variation inside the loop.

My testing showed a 3.4-3.8x speedup for 128-bit registers, maybe a 50% increase over that for 256-bit, and maybe 20% more for 512-bit. It's wildly CPU dependent, with better hardware coming closer to the theoretical 2x per doubling of register width.
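A tiny sketch of the preprocessing idea mentioned above (splitting the target into low and high variation images), assuming 8-bit channels and a single variation value v; the names are illustrative:

```c
// Sketch: precompute per-byte lower/upper bounds for the needle once, so the
// search loop only does range tests instead of computing color +/- variation.
static void split_needle(const unsigned char *needle, int bytes,
                         unsigned char v,
                         unsigned char *lo, unsigned char *hi) {
    for (int i = 0; i < bytes; i++) {
        lo[i] = (unsigned char)((needle[i] > v)       ? needle[i] - v : 0);   // saturate at 0
        hi[i] = (unsigned char)((needle[i] < 255 - v) ? needle[i] + v : 255); // saturate at 255
    }
}
```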

@wind0204 (Contributor)

Awesome, I can't wait to see which SIMD instructions are being used. :D

@martona commented Dec 16, 2023

I see. Well, transparency can be handled trivially in the lo/hi mask precalculation phase where you can just set 0/0xffffffff for the masks in the locations that you want to ignore.

I sent you an invite to the repo with imagesearch. I just extracted it from the project I made it for, so it's pretty barebones re. tests and stuff.
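A sketch of that trivial transparency handling, layered on a lo/hi precalculation like the one sketched a couple of comments up; the assumption that alpha sits in the fourth byte of each pixel is illustrative:

```c
// Sketch: widen the bounds to [0x00, 0xFF] for needle pixels that should be
// ignored, so any haystack value passes the range test at those locations.
static void ignore_transparent(unsigned char *lo, unsigned char *hi,
                               const unsigned char *needle_argb, int pixels) {
    for (int i = 0; i < pixels; i++)
        if (needle_argb[4 * i + 3] == 0) {          // fully transparent pixel
            for (int c = 0; c < 4; c++) {
                lo[4 * i + c] = 0x00;
                hi[4 * i + c] = 0xFF;
            }
        }
}
```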

@iseahound (Owner, Author) commented Dec 16, 2023

@martona It seems we have the exact same approach: https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch2x.c
which is fantastic.

From my understanding, you're currently searching for one image, correct? The approach with popcount/BitScanForward required a lot of extra C code on my end, so I simply allowed it to fall back to the "v1" approach when the mask returned positive.

`__m128i vmask = _mm_set1_epi32(0xFFFFFFFF);`

Should this be -1 in my code?

@iseahound (Owner, Author)

@wind0204

Some additional results:

  • Replacing div with cmov didn't produce a significant speedup. In addition, it's the 3rd check after the focused pixel and the top-left pixel, so the additional code for `left` wasn't worth it. (A sketch of the two alternatives follows after the link below.)
  • I also tried removing the line `if (trans || c1 == *(current))` because it seemed redundant, but there was no difference in speed.

https://godbolt.org/z/M6nofK9hs
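A sketch of the two alternatives behind the first bullet above (illustrative only, not the code in the link):

```c
// (a) Recover the column from a linear offset when it's needed: one DIV.
static int column_div(int offset, int width) {
    return offset % width;
}

// (b) Carry x along and wrap it at the row boundary; the ternary typically
//     compiles to CMP + CMOV rather than a branch.
static int column_step(int x, int width) {
    x = x + 1;
    return (x == width) ? 0 : x;
}
```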

@martona commented Dec 16, 2023

@iseahound no, -1/0xffffffff doesn't matter.

lzcnt is relatively new; in the GCC v4/v3... -march buckets it's only in v4, same as AVX512. But bsf is just an extra subtraction to do the same thing. Doing it without either is very costly; there's a pretty good C version in my code, but it's a noticeable hit.

Same for popcnt: it ends up being a whole lot of bit twiddling.

I think both are worth it though; even if you have to fall back to mucking around with the bits, there's no way it's not faster than doing it byte-by-byte in C.
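For reference, a portable trailing-zero-count fallback might look like the following sketch; this is not martona's C version, just one common way to do it without BSF/TZCNT:

```c
// Sketch: index of the lowest set bit without BSF/TZCNT. Isolate the lowest
// set bit, then find its position with a shift loop. Returns 0 for mask == 0,
// so callers should test the mask first.
static int ctz_fallback(unsigned int mask) {
    mask &= -mask;           // two's-complement trick: keep only the lowest set bit
    int n = 0;
    while (mask >>= 1)
        n++;
    return n;
}
```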

@martona commented Dec 16, 2023

@wind0204

I was going to invite you to the repo, but GitHub wasn't giving me any results when I put your name in the invite-collaborator box. Weird.

So I made the repo public. You can see it at https://github.com/martona/imgutil

@iseahound (Owner, Author) commented Dec 16, 2023

I think I'm hitting theoretical maximums with basic ImageSearch.

580.70 fps - imagesearch1 https://godbolt.org/z/M6nofK9hs
749.15 fps - pixelsearch1x https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch1x.c

It's an SSE2 vs basic C comparison, so I think this works quite well. Next step is to vectorize.
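For the vectorization step, a bare-bones SSE2 scan of four pixels per iteration looks roughly like this sketch (not the code in pixelsearch1x.c; names are illustrative):

```c
#include <emmintrin.h>

// Sketch: compare 4 ARGB pixels at a time against one color and return the
// linear offset of the first hit; the remainder is handled with scalar code.
static int pixelscan_sse2(const unsigned int *p, int count, unsigned int color) {
    __m128i vcolor = _mm_set1_epi32((int)color);
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i v   = _mm_loadu_si128((const __m128i *)(p + i));
        int     hit = _mm_movemask_epi8(_mm_cmpeq_epi32(v, vcolor));
        if (hit)
            return i + __builtin_ctz((unsigned)hit) / 4;  // byte index -> pixel lane
    }
    for (; i < count; i++)
        if (p[i] == color) return i;
    return -1;
}
```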

@iseahound (Owner, Author)

Will probably revisit this at a later date. Having trouble with the assembly: using -Os produces the best binaries, but the most-used variables seem to be kept on the stack instead of in registers. Also not sure why -O3 insists on movdqa instead of movups when the latter instruction is one byte shorter.

@wind0204 (Contributor) commented Dec 18, 2023

> @wind0204
>
> Some additional results:
>
> * Replacing div with cmov didn't produce a significant speedup. In addition, it's the 3rd check after the focused pixel and the top-left pixel, so the additional code for `left` wasn't worth it.

I guess it didn't provide a performance boost because it was not in the subimage loop? I still believe that a CMP + CMOV (one constant) combination is better than a DIV, though I believe that without any profiling or benchmarking experience and without knowing the truth.

I agree that a third check wouldn't affect the performance much; it was the first check out of the two checks in the code I wrote.

> * I also tried removing the line `if (trans || c1 == *(current))` because it seemed redundant, but there was no difference in speed.

Does that mean it's better to remove the line for simpler code?

@wind0204 (Contributor) commented Dec 18, 2023

> 749.15 fps - pixelsearch1x https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch1x.c

I wonder how much faster it would perform with movdqa instead of movdqu in pixelsearch1x.c, since I think it's worth 16-byte aligning big haystack images. ;-p 16 bytes is actually 4 pixels, so it would be beneficial in almost all cases if the performance boost is meaningful.

@iseahound (Owner, Author) commented Dec 18, 2023

> Does that mean it's better to remove the line for simpler code?

Not sure; the idea is that it happens before the x-domain check. This has a side effect of hiding the div until it's needed. You're right that div is "slower", but only in a high-performance scenario. Even when I try to trigger div more, the benchmarks don't show a difference here.

@wind0204 (Contributor) commented Dec 18, 2023

> > Does that mean it's better to remove the line for simpler code?
>
> Not sure; the idea is that it happens before the x-domain check. This has a side effect of hiding the div until it's needed. You're right that div is "slower", but only in a high-performance scenario. Even when I try to trigger div more, the benchmarks don't show a difference here.

As you can see from the quote below, I was talking about removing the line `if (trans || c1 == *(current))`, not the CMOV instruction! And I think the code's simplicity does help, though not as much as throughput does.

> * I also tried removing the line `if (trans || c1 == *(current))` because it seemed redundant, but there was no difference in speed.
>
> Does that mean it's better to remove the line for simpler code?

@iseahound (Owner, Author)

Yes, that's exactly what I meant.

@wind0204 (Contributor) commented Dec 18, 2023

> Yes, that's exactly what I meant.

Aha, you meant that the line lowers the frequency of the DIV operation, so removing the redundant check doesn't really improve throughput. I'm sorry for my poor reading skills, and for making the thread even longer! It's embarrassing :(

@wind0204 (Contributor) commented Dec 21, 2023

> > 749.15 fps - pixelsearch1x https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch1x.c
>
> I wonder how much faster it would perform with movdqa instead of movdqu in pixelsearch1x.c, since I think it's worth 16-byte aligning big haystack images. ;-p 16 bytes is actually 4 pixels, so it would be beneficial in almost all cases if the performance boost is meaningful.

According to this article, it's worth a try. I'm sharing this just in case you're interested.

@iseahound (Owner, Author)

That's a good catch! I didn't understand your previous comment before, but the article made it much clearer. I had assumed the aligned version would be used - for some reason I thought `loadu` meant "load unsigned".

@wind0204 (Contributor)

I thought the image buffers were not aligned to 16 bytes -- are they? According to the article I shared, if the data is already aligned, the performance of the unaligned vector move instructions is no worse on modern processors; ChatGPT 3.5 thinks it's worth using the aligned mov instructions though.

@iseahound (Owner, Author)

I'm not sure. They're allocated using Buffer, which should give a nice alignment by default. There's actually working DirectX screen capture inside ImagePut, I believe; I just never hooked it up to the callable APIs.

I have no recollection but apparently... I said:

> That's a good question. I suppose the answer, "that's just the way it is" doesn't really answer your question satisfactorily, so I went ahead and did some research. Turns out, in the video memory (VRAM) DirectX uses something called High Level Shader Language (HLSL) which necessitates a 16 byte boundary. That means the screen width should be divisible by 4, because 1 pixel (ARGB) has 4 channels, and each color channel is 1 byte. So I made a mistake earlier, when I said it should be divisible by 16; the screen width needs to be divisible by 4.
>
> As for why? It's probably faster to compute in blocks of 128 bits (16 bytes), because of how the video card is designed. Now keep in mind, when you're doing DirectX screen capture, what's really happening behind the scenes is a memory transfer from the video card (VRAM) to your computer's memory (RAM). And the really clever part about DirectX11 is that it doesn't copy the memory until you ask for it. That's why DX11 is so fast compared to BitBlt.
>
> TL;DR Alignment on memory = faster.

https://www.autohotkey.com/boards/viewtopic.php?t=92622

@iseahound (Owner, Author)

MFCreateAlignedMemoryBuffer?

@iseahound (Owner, Author)

Or just overallocate an extra 512 bits (for AVX512) and "start" the memory at that boundary.
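A sketch of that over-allocate-and-round-up idea in plain C (not the ImagePut code); C11's aligned_alloc or MSVC's _aligned_malloc would achieve the same thing with less bookkeeping:

```c
#include <stdlib.h>
#include <stdint.h>

// Sketch: allocate 64 extra bytes (512 bits) and return the first address in
// the block that lands on a 64-byte boundary. Keep the raw pointer for free().
static void *alloc_aligned_512bit(size_t bytes, void **raw_out) {
    void *raw = malloc(bytes + 64);
    if (!raw) return NULL;
    *raw_out = raw;
    return (void *)(((uintptr_t)raw + 63) & ~(uintptr_t)63);
}
```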

@wind0204 (Contributor)

> I still believe that a CMP + CMOV (one constant) combination is better than a DIV, though I believe that without any profiling or benchmarking experience and without knowing the truth.

I've found a good document on the throughput (or rather the reciprocal throughput) of each instruction: https://www.agner.org/optimize/instruction_tables.pdf
It's interesting to see FDIV's much higher throughput than integer division's, by the way.

CMOV is as quick as MOV, and CMP is very quick too, sometimes even quicker in terms of throughput (1/0.25 > 1/0.3). Now I can say that I actually know it, not just believe it... 😄
