'Waiting for recursive lock' crash (CCL1.11, CCL1.12 on ARM32) #428
Comments
This sounds like a bug that we've struggled with on 32-bit ARM for a long time.
[From the mailing list] A test case is here: https://github.com/jetmonk/openmcl-thread-test Running (threadtest2) triggers a modified with-lock macro that grabs a lock, checks that the current process in fact owns it, runs the body, and again checks for ownership before releasing. After many iterations, an error is thrown when the macro discovers that some other process stole the lock. I thought it might be a lack of ARM DMB instructions ("ensures that all memory accesses are completed before new memory access is committed.") in atomic routines, which have been discussed elsewhere online as necessary for ARM mutexes. But I tried stuffing (dmb) everywhere in ARM/arm-misc.lisp and it didn't fix the problem.
[From a followup post] On my pine64 device, running CCL 1.12.1 on Debian arm64 "testing",
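The owner-checking pattern described above can be sketched as follows. This is not the code from the linked repository, which is written in Common Lisp for CCL; it is a language-neutral illustration in Python, and the names `CheckedLock`, `LockStolenError` and `run_workers` are hypothetical. It assumes that recording the owner in a field next to the mutex is an acceptable stand-in for asking the lock itself who owns it.

```python
import threading


class LockStolenError(RuntimeError):
    """Raised when the recorded owner is not the thread doing the check."""


class CheckedLock:
    """A mutex that records its owner so ownership can be verified."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # ident of the holding thread, or None

    def _check(self, when):
        me = threading.get_ident()
        if self._owner != me:
            raise LockStolenError(f"{when}: owner is {self._owner!r}, expected {me}")

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        self._check("after acquire")  # first ownership check
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._check("before release")  # second ownership check
        finally:
            # Always clear the owner and release, even if the check failed.
            self._owner = None
            self._lock.release()
        return False  # do not swallow exceptions raised by the body


def run_workers(n_threads=4, iterations=10_000):
    """Hammer one CheckedLock from several threads; return (counter, errors)."""
    lock = CheckedLock()
    counter = 0
    errors = []

    def worker():
        nonlocal counter
        try:
            for _ in range(iterations):
                with lock:
                    counter += 1
        except LockStolenError as e:
            errors.append(e)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, errors
```

With a working lock, `run_workers()` should return an empty error list and a counter equal to `n_threads * iterations`. A non-empty error list corresponds to the "some other process stole the lock" failure that the test case reports after many iterations.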
This is probably not a synchronization issue, as we're experiencing the crashes on a single-core arm32 system. The last known good version is 1.11.5.
@varjagg are you reporting that ccl 1.11.5 works OK, and you see the issue on later versions?
@xrme yep, we've been testing our system on 1.11.5 for ages, but when we tried to move to 1.12.2 it started to melt down. Same for any release between the two. At first we suspected our code, but yesterday it fell apart just in swank interaction on a fresh image. But now that I've reproduced it again, I'm not even sure it's the exact same issue. The "Unhandled exception 4 at…" part is identical, but it blows up on multiplication now:
Then
Another backtrace. Now from socket code:
…however, one of the other threads appears to have a bad frame:
I don't remember what ARM-specific changes took place from 1.11.5 to newer releases. Clearly something must have happened, though. I wonder if there's some runtime (gc, exception handling) issue. Signal 4 is SIGILL, by the way.
SIGILL? What instruction is the CPU executing at that moment? Is it possible to disassemble whatever is at the program counter at the time of the signal?
SIGILL is frequently used normally for GC and various other traps, so it's not surprising to get it. The lisp kernel debugger can print out the registers (with "r"). We'd need to attach a debugger to find out more.
This is fairly reproducible on our target, so I'll get the register values tomorrow. Is there anything specific needed to instrument armcl for the debugger?
If you end up in the lisp kernel debugger, you should be able to attach gdb. https://trac.clozure.com/ccl/wiki/CclUnderGdb might help a little. I think armcl is already compiled with #454 is a bug on the ARM where calling
OK, so I poked around a bit. Not sure how helpful it is, but here it goes:
Copy this code to /tmp/bug.lisp.
After several iterations it crashes:
It works fine on amd64 (but I need ARM32). If I remove the file operations it works fine, but I need a lot of streaming.
I didn't use WITH-OPEN-FILE, in order to check whether the :ABORT argument of CLOSE is the cause. It can be NIL.
I tried each of the values (:private :lock (:external nil)) for the :sharing argument of OPEN, because sometimes (with or without an explicit JOIN-PROCESS) a condition arises:
If I specify N different file names, the bug persists.
If I enclose the entire worker in a lock, the bug persists.