Improve ImageSearch 2 #35
Use -O3 for the above script. For some reason it wasn't allocating variables correctly into registers and was instead spilling them to the stack, which is slower.
I don't know if you'd be interested in an external contribution. I'm working on an ImageSearch replacement with more features (tolerance and % of pixels matched). I'm getting close to finishing it, but I'm not dead set on releasing it as a standalone. I have 512-bit and 128-bit SIMD working, and also the pedestrian base code that doesn't use vectors, even though that's not strictly necessary because every 64-bit CPU has at least SSE2. Pretty happy with what I've got so far, at least with the speed. I still need to write the 256-bit part, but that should be easy with how modular the thing ended up being. The unknown is the CPUID stuff and sorting computers into v4/v3/v2/v1, which largely describes the vectorization tier as well as the gcc -march option used. Anyway, let me know if there's any interest.
Sure thing! The more approaches, the better. I'm still working on optimizing the basic ImageSearch functionality by avoiding the O(n²) subimage matching loop as much as possible.
Saved work (notes to self):
@martona There is a requirement for a "focused pixel" (an additional pixel in the center of the image, or one determined by some algorithm). See: https://godbolt.org/z/j6bvo1YPd An explanation: doing 2 pixel searches in parallel reduces the need to enter the subimage matching loop (for lack of a better term). Also, if the top-left pixel is transparent, it's necessary to have a visible pixel somewhere as a backup.
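The two-pixel prefilter described above can be sketched roughly like this (a minimal illustration, not the actual ImagePut or imgutil code; all function and parameter names here are hypothetical). The idea is that checking the needle's top-left pixel and a "focused" pixel near its center rejects almost every candidate position before the expensive full subimage comparison runs:

```c
#include <stdint.h>

// Hypothetical sketch: two cheap pixel tests gate the full O(n^2)
// subimage comparison. (fx, fy) is the offset of the "focused" pixel
// within the needle, focus_px its expected value.
static int matches_at(const uint32_t *hay, int hay_w,
                      const uint32_t *needle, int nw, int nh,
                      int x, int y, uint32_t focus_px, int fx, int fy)
{
    // Prefilter 1: top-left pixel of the needle.
    if (hay[y * hay_w + x] != needle[0])
        return 0;
    // Prefilter 2: the focused pixel.
    if (hay[(y + fy) * hay_w + (x + fx)] != focus_px)
        return 0;
    // Full subimage comparison, entered only after both prefilters pass.
    for (int j = 0; j < nh; j++)
        for (int i = 0; i < nw; i++)
            if (hay[(y + j) * hay_w + (x + i)] != needle[j * nw + i])
                return 0;
    return 1;
}
```

If the focused pixel is chosen from a high-variation region of the needle, the probability of entering the inner loop on a non-match drops sharply.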
I'll tidy things up and give you access to a repo later today (I'm in the US, East); you can check it out for yourself. My take on the C stuff you're doing: don't sweat optimizing it. I mean you can, it's fun, but vectors are so much faster. I don't have recent SSE2 numbers at hand, but AVX-512 is an order of magnitude faster than what you can achieve by wrangling standard C code. SSE would be slower, probably half the speed of AVX (much of it because it's missing integer cmple/cmpge for some reason). 256-bit SIMD should be the sweet spot. Just to give you a rough idea: my test haystack is 822x987, the needle (75x51) is found at 680x430 or so, and AVX-512 can do it in 2ms with the benchmark written in AHK and using ImagePutBuffers for both haystack and needle, so there's some overhead. If I'm not forcing a top-left pixel match (which is an option), meaning the subimage search is entered for every haystack pixel, it's still just 10ms. But, to be fair, I ignore the A channel completely. I suppose handling it in a non-fancy way wouldn't slow things down much; I just never had the need for it. It's usually about finding stuff that came from the screen, on the screen.
Great work on the ImagePutBuffer project btw, it's immensely useful and a joy to use.
Thanks! I personally have zero use for ImageSearch/PixelSearch myself, so any work on it is purely for learning, academic, or for fun. With that said, one common use case is when the target image is a sprite (meaning there is a character surrounded by transparent pixels). By ignoring those regions, one can search for a non-square collection of pixels.

SIMD intrinsics are best for when repetitive data can be streamed, such as in a loop. There's some outer logic which controls how often the subimage loop is entered/exited, and the CPU can optimize that heavily. For example, if one branch is taken 90% vs 95% of the time, the 95% approach can be over 2x faster because the branch predictor can anticipate it that much more reliably.

To me, optimization is like a layer cake: each level of complexity has different things that should be optimized. For example, the current ImageSearch + variation is unoptimized, because preprocessing can be done to split the target image into high- and low-variation images instead of encoding a SIMD instruction to do it. My testing showed a 3.4-3.8x speedup for 128-bit registers, maybe a 50% increase over that for 256, and maybe 20% for 512. It's wildly CPU dependent, with better results nearing the theoretical 2x on better hardware.
Awesome, I can't wait to see which SIMD instructions are being used. :D
I see. Well, transparency can be handled trivially in the lo/hi mask precalculation phase, where you can just set 0/0xffffffff for the masks in the locations that you want to ignore. I sent you an invite to the repo with the imagesearch. I just extracted it from the project I made it for, so it's pretty barebones re: tests and such.
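The lo/hi precalculation with "don't care" entries could look roughly like this (a hypothetical sketch of the idea, not the imgutil code; function name, channel order, and the alpha-threshold convention are my assumptions). Each needle pixel becomes a per-channel [lo, hi] range derived from the tolerance, and a fully transparent pixel gets the range [0x00000000, 0xFFFFFFFF] so any haystack pixel passes:

```c
#include <stdint.h>

// Hypothetical sketch: per-pixel lo/hi range masks for a tolerance search,
// with transparent needle pixels widened to match anything.
static void precalc_masks(const uint32_t *needle, int n, uint8_t tol,
                          uint32_t *lo, uint32_t *hi)
{
    for (int i = 0; i < n; i++) {
        uint32_t px = needle[i];
        if ((px >> 24) == 0) {          // fully transparent: ignore this pixel
            lo[i] = 0x00000000u;
            hi[i] = 0xFFFFFFFFu;
            continue;
        }
        uint32_t l = 0, h = 0;
        for (int c = 0; c < 3; c++) {   // B, G, R channels of a BGRA pixel
            uint32_t v  = (px >> (8 * c)) & 0xFF;
            uint32_t cl = v > tol ? v - tol : 0;          // clamp at 0
            uint32_t ch = v + tol > 255 ? 255 : v + tol;  // clamp at 255
            l |= cl << (8 * c);
            h |= ch << (8 * c);
        }
        lo[i] = l;
        hi[i] = h | 0xFF000000u;        // accept any alpha in the haystack
    }
}
```

The match test then becomes a pair of range comparisons per pixel, which vectorizes cleanly.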
@martona It seems we have the exact same approach: https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch2x.c From my understanding, you're currently searching for one image, correct? The approach with popcount/BitScanForward required a lot of extra C code on my end, so I simply allowed it to fall back to the "v1" approach when the mask returned positive.
should this be -1 instead of 0xffffffff?
Some additional results:
@iseahound No, -1/0xffffffff doesn't matter. lzcnt is relatively new; with the GCC v4/v3... march buckets it's only in v4, same as AVX512. But bsf is just an extra subtraction to do the same thing, and doing it without is very costly; there's a pretty good C version in my code, but it is a noticeable hit. Same for popcnt, it ends up being a whole lot of bit twiddling. I think both are worth it though: even if you have to fall back to mucking around with the bits, there's no way it's not faster than falling back to doing it byte-by-byte in C.
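For context, the bit-scan step being discussed can be sketched like this (an illustrative SSE2 fragment, not taken from imgutil; the function name is hypothetical). `_mm_movemask_epi8` collapses the 16-byte compare result into 16 bits, and `__builtin_ctz` (which compiles to BSF/TZCNT on x86) finds the first set bit; dividing the byte offset by 4 yields the pixel index:

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical sketch: find the index (0-3) of the first pixel in a
// 4-pixel block equal to target, or -1 if none match.
static int first_match_in_block(const uint32_t *hay, uint32_t target)
{
    __m128i t  = _mm_set1_epi32((int)target);
    __m128i h  = _mm_loadu_si128((const __m128i *)hay);  // load 4 pixels
    __m128i eq = _mm_cmpeq_epi32(h, t);                  // per-lane compare
    int mask   = _mm_movemask_epi8(eq);                  // 16 byte-mask bits
    if (mask == 0)
        return -1;                    // no pixel in this block matches
    return __builtin_ctz(mask) / 4;   // byte offset -> pixel offset
}
```

On targets without BSF-class instructions, `__builtin_ctz` falls back to the bit-twiddling loop mentioned above, which is exactly the "noticeable hit" case.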
I was going to invite you to the repo, but GitHub wasn't giving me any results when I put your name in the invite-collaborator box. Weird. So I made the repo public; you can see it at https://github.com/martona/imgutil
I think I'm hitting the theoretical maximum with basic ImageSearch: 580.70 fps - imagesearch1: https://godbolt.org/z/M6nofK9hs It's an SSE2 vs basic C comparison, so I think this works quite well. Next step is to vectorize.
Will probably revisit this at a later date. Having trouble with the assembly, using
I guess it didn't provide a performance boost because it was not in the subimage loop? I still believe that a CMP + CMOV (one constant) combination is better than a DIV, though I believe that without any profiling/benchmarking experience or knowing the truth. I agree, the third check wouldn't affect the performance much; it was the first of the two checks in the code I wrote.
Does that mean it's better to remove the line for simpler code?
I wonder how much faster it would perform with movdqa instead of movdqu in pixelsearch1x.c, since I think it's worth 16-byte aligning big haystack images. ;-p 16 bytes is actually just 4 pixels, so it would be beneficial in almost all cases if the performance boost is meaningful.
Not sure; the idea is that it happens before the x-domain check. This has a side effect of hiding the div until it's needed. You're right that div is "slower", but only in a high-performance scenario. Even when I try to trigger div more often, the benchmarks don't show a difference here.
As you can see from the quote below, I was talking about removing the line.
Yes, that's exactly what I meant.
Aha, you meant that the line lowers the frequency of the DIV operation, so removing the redundant operation doesn't really improve throughput. I'm sorry for my poor reading skills, and for making the thread even longer! It's embarrassing :(
According to this article, it's worth a try. I'm sharing this just in case you're interested.
That's a good catch! I didn't understand your previous comment before, but the article made it much clearer. I assumed the aligned version would be used - for some reason I thought loadu meant "load unsigned".
I thought the image buffers were not aligned to 16 bytes - are they? According to the article I shared, if the data is already aligned, the performance of the unaligned vector move instructions is no worse on modern processors; ChatGPT 3.5 thinks it's worth using the aligned mov instructions, though.
I'm not sure. They're allocated using Buffer, which should produce a nice alignment by default. There's actually a working DirectX screen capture inside ImagePut, I believe; I just never hooked it up to the callable APIs. I have no recollection, but apparently... I said:
MFCreateAlignedMemoryBuffer?
Or just overallocate an extra 512 bits (64 bytes, for AVX-512) and "start" the memory at that boundary.
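The overallocate-and-round-up trick can be sketched like this (a generic technique, not code from either project; function names are hypothetical). Allocate extra slack plus room to stash the original pointer, round the address up to a 64-byte boundary, and keep the real pointer just below the returned one so it can be freed later:

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical sketch: 64-byte aligned allocation via overallocation,
// suitable for AVX-512's 512-bit (64-byte) aligned loads.
static void *aligned_alloc64(size_t size)
{
    void *raw = malloc(size + 63 + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t p = (uintptr_t)raw + sizeof(void *); // leave room for the stash
    p = (p + 63) & ~(uintptr_t)63;                 // round up to 64 bytes
    ((void **)p)[-1] = raw;                        // remember real allocation
    return (void *)p;
}

static void aligned_free64(void *p)
{
    if (p)
        free(((void **)p)[-1]);   // free the original malloc'd pointer
}
```

On C11 targets, `aligned_alloc(64, size)` (or `_aligned_malloc` on MSVC) does the same job without the manual bookkeeping.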
I've found a good document on the throughput (or rather, the reciprocal throughput) of each instruction: https://www.agner.org/optimize/instruction_tables.pdf CMOV is as quick as MOV, and CMP is very quick too, sometimes even quicker considering the throughput (1/0.25 > 1/0.3). Now I can say that I know it for a fact, not just believe it... 😄
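The DIV-vs-CMP+CMOV trade-off discussed above can be made concrete with a small sketch (an illustration of the general idea, not either project's code; names are hypothetical). Converting a linear pixel offset into (x, y) with DIV costs tens of cycles, but when positions are visited in order, a running counter with a compare and two conditional selects, which compilers typically emit as CMP + CMOV, avoids the division entirely:

```c
// Hypothetical sketch: advance (x, y) over an image row-by-row without
// a DIV. The compare and the two ternaries map to CMP + CMOV, each
// roughly a 1-cycle operation per the Agner Fog tables.
typedef struct { int x, y; } coord;

static coord next_coord(coord c, int width)
{
    c.x += 1;
    int wrap = (c.x == width);   // CMP: did we pass the last column?
    c.x = wrap ? 0 : c.x;        // CMOV: reset x on wrap
    c.y = wrap ? c.y + 1 : c.y;  // CMOV: bump y on wrap
    return c;
}
```

This only pays off when offsets are sequential; for a single random offset, the one-off `y = i / width; x = i % width;` is still the simplest correct choice.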
This is the second attempt at improving ImageSearch.
The following fixes a bug where the last row of the image is not searched.
https://godbolt.org/z/oEoMW7zEK