imsnif opened issue #10283:
Hey,
In the context of Zellij, ever since a recent (still unreleased) upgrade of wasmtime, I've been getting intermittent and seemingly random segfaults.
I am afraid I don't have a way to reproduce them, but I have been gathering some information about them and am hoping the maintainers can help pin-point the issue or provide more insight.
Version info:
wasmtime version: 29.0.1
(probably) did not occur in wasmtime version: 21.0.2

What happens: Zellij crashes, without a panic, and syslog tells me that it got a segfault:
```
[343995.595910] async-std/runti[2487041]: segfault at 705a8ea73970 ip 0000705a8ea73970 sp 0000705a8ea73948 error 15 likely on CPU 3 (core 1, socket 0)
```
With some debug logging, I managed to narrow the crash down to this block in our app: https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596
I have a core dump, but unfortunately not one with debug symbols, so I'm not sure it's a lot of help.
This happens about once or twice a day for me and so is extremely gnarly to pin-point. I'm not 100% sure this is a wasmtime issue, but the crash location would seem to indicate that it is.
Any help in troubleshooting or mitigating this somehow would be appreciated. On my part I'm going to try to get a coredump with some debug symbols so we'll hopefully be less blind.
Thanks!
alexcrichton commented on issue #10283:
`segfault at 705a8ea73970 ip 0000705a8ea73970`

It's suspicious that the segfaulting address and the ip are the same, meaning that this looks like it's trying to execute code that's either not mapped or not executable. What OS are you on? That may be some cache-coherency thing and/or a bug in Wasmtime, perhaps? (Not that anything about that has changed recently, so I don't know what would have changed in the 21->29 update other than perhaps making it more likely one way or another.)
imsnif commented on issue #10283:
I'm on Arch Linux
alexcrichton commented on issue #10283:
Can you run `uname -a` and paste that? In addition to Linux I'd be curious what architecture you're using (e.g. x64, aarch64, etc.) and possibly what kernel version too.
imsnif commented on issue #10283:
Sure: Linux green 6.12.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 22 Nov 2024 16:04:27 +0000 x86_64 GNU/Linux
imsnif edited a comment on issue #10283:
Sure:
```
Linux green 6.12.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 22 Nov 2024 16:04:27 +0000 x86_64 GNU/Linux
```
alexcrichton commented on issue #10283:
Ah ok, x64 probably doesn't have any cache coherency issues or things like that. Can you open up the coredump and get the exact faulting instruction? E.g. if you open it up and run `disas`, does it print anything? If nothing, can you do `disas $pc-40,$pc+30` to see the window around the faulting instruction?
alexcrichton edited a comment on issue #10283:
Ah ok, x64 probably doesn't have any cache coherency issues or things like that. Can you open up the coredump and get the exact faulting instruction? E.g. if you open it up and run `disas`, does it print anything? If nothing, can you do `disas $pc-40,$pc+30` to see the window around the faulting instruction?

EDIT: also if possible a stack trace via `bt` would be helpful too
imsnif commented on issue #10283:
Sure, here you go (I hope I did it right).
```
Core was generated by `/home/aram/backup/dev-zellij --server /run/user/1000/zellij/0.42.0/zellij'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007b103cdb1870 in ?? ()
[Current thread is 1 (LWP 800590)]
(gdb) disas
No function contains program counter for selected frame.
(gdb) disas $pc-40,$pc+30
Dump of assembler code from 0x7b103cdb1848 to 0x7b103cdb188e:
   0x00007b103cdb1848:  mov    $0xcedceff9,%ecx
   0x00007b103cdb184d:  pop    %rcx
   0x00007b103cdb184e:  add    %al,(%rax)
   0x00007b103cdb1850:  jo     0x7b103cdb186a
   0x00007b103cdb1852:  fstpt  (%rax,%rdx,1)
   0x00007b103cdb1855:  jnp    0x7b103cdb1857
   0x00007b103cdb1857:  add    %bl,%bl
   0x00007b103cdb1859:  lea    (bad),%edx
   0x00007b103cdb185a:  rcr    $1,%ah
   0x00007b103cdb185c:  (bad)
   0x00007b103cdb185d:  pop    %rcx
   0x00007b103cdb185e:  add    %al,(%rax)
   0x00007b103cdb1860:  call   0x7b104d17f37d
   0x00007b103cdb1865:  jnp    0x7b103cdb1867
   0x00007b103cdb1867:  add    %al,(%rax)
   0x00007b103cdb1869:  add    %al,(%rax)
   0x00007b103cdb186b:  add    %al,(%rax)
   0x00007b103cdb186d:  add    %al,(%rax)
   0x00007b103cdb186f:  add    %al,(%rax)
   0x00007b103cdb1871:  add    %al,(%rax)
   0x00007b103cdb1873:  add    %al,(%rax)
   0x00007b103cdb1875:  add    %al,(%rax)
   0x00007b103cdb1877:  add    %ch,(%rax)
   0x00007b103cdb1879:  call   0x7b10b7eb0c83
   0x00007b103cdb187e:  add    %al,(%rax)
   0x00007b103cdb1880:  add    %eax,(%rax)
   0x00007b103cdb1882:  add    %al,(%rax)
   0x00007b103cdb1884:  add    %al,(%rax)
   0x00007b103cdb1886:  add    %al,(%rax)
   0x00007b103cdb1888:  clc
   0x00007b103cdb1889:  push   %rbx
   0x00007b103cdb188a:  sbb    %bl,%dh
   0x00007b103cdb188c:  (bad)
   0x00007b103cdb188d:  pop    %rcx
End of assembler dump.
(gdb) bt
#0  0x00007b103cdb1870 in ?? ()
#1  0x000059cedceff9b9 in ?? ()
#2  0x00007b103cdb1870 in ?? ()
#3  0x000059cedcd08ddb in ?? ()
#4  0x00007b103cdb18e8 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb)
```
I'm also going to start working with an unstripped version so that next time this happens we can hopefully get some more info.
bjorn3 commented on issue #10283:
Looks like the disassembly got mangled due to starting at the wrong offset. `disas/r $pc-40,$pc+30` shows the raw bytes for each instruction too. Also try varying the `-40` a bit until one instruction has `0x00007b103cdb1870` as its start address.
imsnif commented on issue #10283:
Is this better?
```
(gdb) disas $pc-10,$pc+30
Dump of assembler code from 0x7b103cdb1866 to 0x7b103cdb188e:
   0x00007b103cdb1866:  add    %al,(%rax)
   0x00007b103cdb1868:  add    %al,(%rax)
   0x00007b103cdb186a:  add    %al,(%rax)
   0x00007b103cdb186c:  add    %al,(%rax)
   0x00007b103cdb186e:  add    %al,(%rax)
=> 0x00007b103cdb1870:  add    %al,(%rax)
   0x00007b103cdb1872:  add    %al,(%rax)
   0x00007b103cdb1874:  add    %al,(%rax)
   0x00007b103cdb1876:  add    %al,(%rax)
   0x00007b103cdb1878:  sub    %ch,%al
   0x00007b103cdb187a:  add    $0x7b0ff4,%eax
   0x00007b103cdb187f:  add    %al,(%rcx)
   0x00007b103cdb1881:  add    %al,(%rax)
   0x00007b103cdb1883:  add    %al,(%rax)
   0x00007b103cdb1885:  add    %al,(%rax)
   0x00007b103cdb1887:  add    %bh,%al
   0x00007b103cdb1889:  push   %rbx
   0x00007b103cdb188a:  sbb    %bl,%dh
   0x00007b103cdb188c:  (bad)
   0x00007b103cdb188d:  pop    %rcx
End of assembler dump.
(gdb) bt
#0  0x00007b103cdb1870 in ?? ()
#1  0x000059cedceff9b9 in ?? ()
#2  0x00007b103cdb1870 in ?? ()
#3  0x000059cedcd08ddb in ?? ()
#4  0x00007b103cdb18e8 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb)
```
cfallin commented on issue #10283:
It appears execution fell into bad data somehow: `add %al,(%rax)` is the disassembly of all-zero bytes.

Perhaps try starting further back -- did execution slide into this from legitimate code just prior, somehow? Or perhaps a jump landed here. Are you able to capture the crash under `rr` (the reversible debugger) and backwards-step (`rsi`, reverse-step-instruction) to see where we came from?
imsnif commented on issue #10283:
Would it be helpful if I attach the coredump here?
cfallin commented on issue #10283:
Probably not without the binary and other information -- in general it's difficult to jump into the context of another project embedding us and work things out. Capturing the crash in `rr` if you can is probably the most likely way to make progress: that would allow us to see exactly how things became corrupted (if they did) or went off the rails otherwise.
imsnif commented on issue #10283:
I have unfortunately not managed to find a way to reproduce the crash. It happens about 1-2 times a day of heavy work with the application. Would it help some more if I provide a backtrace of the core dump with the debug symbols?
cfallin commented on issue #10283:
Landing in the middle of zero data is the sort of thing that requires a bit more info to debug, unfortunately -- a backtrace might offer some clues, but the only way to really get enough information to zero in on a fix (if this is a Wasmtime bug) would be, again, an `rr` trace -- we really need to see register values, see how control flow reached a place, see how struct fields got corrupted if they did, etc.
cfallin commented on issue #10283:
A few questions that might help triage as well:
- Does your application that embeds Wasmtime include any unsafe Rust code, or have any libraries written in unsafe languages (e.g. C/C++) linked in, other than libc?
- Do the failures correlate with heavy load? Wasm execution? Level of concurrency? Or do they happen randomly?
- A bit of a stretch, but I recently encountered some perplexing kernel crashes on one of my machines and initially assumed the issue must have been some subtle race condition in the filesystem code that was crashing. It turned out that I had bad RAM. So:
- Does this crashing issue occur only on one machine or on multiple?
- If only one machine, has it had any other stability issues? Could you run `memtest86` (if x86/PC hardware) as a baseline check?
imsnif commented on issue #10283:
* Does your application that embeds Wasmtime include any unsafe Rust code, or have any libraries written in unsafe languages (e.g. C/C++) linked in, other than libc?
Yes and yes. Zellij has 500+ dependencies. I did debug it down to these lines though: https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596 - it definitely happens here. The only unsafe thing that's happening in this range has to do with wasmtime (unless I'm missing something, which is always a possibility). Otherwise it's serialization, acquiring a lock on an `Arc<Mutex>` and pushing into a `VecDeque`.
* Do the failures correlate with heavy load? Wasm execution? Level of concurrency? Or do they happen randomly?
Anecdotally, it seems to me that this happens when the system itself is under heavy load (e.g. compiling Rust). This is a 10+ year old laptop, so it struggles often, and these tend to be the times in which this happens - but definitely not only then.
* Does this crashing issue occur only on one machine or on multiple?
So far only on my machine. I unfortunately do not have access to others. I have not heard this reported, but that's because this only happens (to the best of my knowledge) in unreleased code. I'm currently very wary of releasing said code (I'm supposed to do so in the coming week or two) and finding out, because these sorts of crashes will definitely cause many users to rage-quit the app. I'm hoping we get to the bottom of this one way or the other.
* If only one machine, has it had any other stability issues? Could you run `memtest86` (if x86/PC hardware) as a baseline check?
This was a good idea, but unfortunately all is well. I ran memtest86+ overnight, 10 passes with no errors.
imsnif commented on issue #10283:
And I have a proper backtrace with debug symbols!
```
(gdb) bt
#0  0x00007f134ab5f880 in ?? ()
#1  0x000057ceab37a049 in wasmtime_setjmp_29_0_1 ()
#2  0x000057ceaad40b6b in wasmtime::runtime::vm::traphandlers::catch_traps ()
#3  0x000057ceaadf72d1 in wasmtime::runtime::func::typed::TypedFunc<Params,Results>::call ()
#4  0x000057ceaad679ba in zellij_server::plugins::wasm_bridge::apply_event_to_plugin ()
#5  0x000057ceab032cc6 in async_task::raw::RawTask<F,T,S>::run ()
#6  0x000057ceaa2f9b71 in async_global_executor::threading::thread_main_loop ()
#7  0x000057ceaa2ff776 in std::sys::backtrace::__rust_begin_short_backtrace ()
#8  0x000057ceaa2f7a21 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#9  0x000057ceaa71a09b in alloc::boxed::{impl#28}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at alloc/src/boxed.rs:1972
#10 alloc::boxed::{impl#28}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at alloc/src/boxed.rs:1972
#11 std::sys::pal::unix::thread::{impl#2}::new::thread_start () at std/src/sys/pal/unix/thread.rs:105
#12 0x00007f134c1ad39d in ?? () from /usr/lib/libc.so.6
#13 0x00007f134c23249c in ?? () from /usr/lib/libc.so.6
```
imsnif commented on issue #10283:
I'm very happy to provide whichever information you need. I'm happy to send you the core dump as well if you think that will help (though it was compiled to my machine, so I'm not sure if you'd be able to work with it?)
This does not seem related to any specific wasm module, it happens with the built-in plugins in Zellij. I can point to the code if you'd like.
I'm in a bit of a precarious situation here. This is a very rare crash (as I mentioned, it happens about twice a day for me when working with the app as my daily driver and IDE) - and so as much as I would want to, I don't know how to give you what you're asking for. I simply can't reproduce it.
On the other hand - I also can't release the software in this case. This would be devastating for the application and its stability (assuming it's not somehow a problem local to my case). My only other recourse is to downgrade back to version 21, which is the currently released one and hope for the best. I don't think this would be a good solution.
Could you please help me, maybe with educated guesses? Maybe with a way to mitigate this error, now that we know it's happening in the call function? I would be fine with a `try_call` or some such that lets me totally restart the plugin. Not a great experience, but better than crashing the whole app.
imsnif edited a comment on issue #10283:
I'm very happy to provide whichever information you need. I'm happy to send you the core dump as well if you think that will help (though it was compiled to my machine, so I'm not sure if you'd be able to work with it?)
This does not seem related to any specific wasm module, it happens with the built-in plugins in Zellij. I can point to the code if you'd like.
I'm in a bit of a precarious situation here. This is a very rare crash (as I mentioned, it happens about twice a day for me when working with the app as my daily driver and IDE) - and so as much as I would want to, I don't know how to give you what you're asking for. I simply can't reproduce it.
On the other hand - I also can't release the software in this case. This would be devastating for the application and its stability (assuming it's not somehow a problem local to my case). My only other recourse is to downgrade back to version 21, which is the currently released one and hope for the best. I don't think this would be a good solution.
Could you please help me, maybe with educated guesses? Maybe with a way to mitigate this error, now that we know it's happening in the call function? I would be fine with a `try_call` or some such that lets me totally restart the plugin. Not a great experience, but better than crashing the whole app.
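For context, a trap-recovery path along these lines is expressible with the existing API today, since wasm traps already come back as `Err` from `TypedFunc::call`. A minimal sketch, assuming an import-free module with a unit-typed `update` export (all names hypothetical, and note this would not catch a host-level segfault like the one in this issue):

```rust
use wasmtime::{Engine, Instance, Module, Store, TypedFunc};

// Hypothetical sketch, not Zellij's actual code: wasm traps surface as
// Err from call(), so an embedder can throw away the wedged Store and
// re-instantiate the plugin instead of aborting the whole app.
fn call_or_restart(
    engine: &Engine,
    module: &Module, // assumed to have no imports, for brevity
    store: &mut Store<()>,
    update: &TypedFunc<(), ()>,
) -> anyhow::Result<()> {
    if let Err(trap) = update.call(&mut *store, ()) {
        eprintln!("plugin trapped, restarting it: {trap:?}");
        // Everything instantiated in the old store is dropped with it.
        *store = Store::new(engine, ());
        let instance = Instance::new(&mut *store, module, &[])?;
        let update = instance.get_typed_func::<(), ()>(&mut *store, "update")?;
        update.call(&mut *store, ())?;
    }
    Ok(())
}
```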
bjorn3 commented on issue #10283:
@imsnif Do you have any plugins installed that are written in Go or another GC'ed language, by the way? Or is this with just the default Zellij plugins?
imsnif commented on issue #10283:
Just the default Zellij ones.
bjorn3 edited a comment on issue #10283:
@imsnif Do you have any plugins installed that are written in Go or another GC'ed language by the way? Or is this with just the default Zellij plugins?
bjorn3 commented on issue #10283:
That at least eliminates the possibility that this is a bug in the GC support of Wasmtime, as all default plugins are written in Rust, which doesn't use Wasm GC.
alexcrichton commented on issue #10283:
My guess is that back-and-forth debugging over an asynchronous chat mechanism such as issue comments here is unlikely to turn up much of a solution, with most of the low-hanging-fruit possibilities already having been weeded out. I think @cfallin is right in that to make any progress on this you'd probably need to be able to upload an artifact of some kind here, e.g. a core dump or an `rr` trace or similar. That's of course not guaranteed to yield a solution, but otherwise only getting snippets here and there is a bit of a stunted debugging experience from our end.

Another thing that might be useful: is there one wasm module which is causing issues? Multiple? One particular export? Multiple? Basically getting a more detailed picture of the crash might yield insights with respect to the shape of what Wasmtime is doing. I realize this is probably difficult to gather from your end due to the non-reproducible nature of the crash, but if you're able to install some verbose logging and correlate that with a crash it could perhaps prove valuable.
alexcrichton commented on issue #10283:
At least from my perspective I want to be able to help you @imsnif, but the context here is very sparse. Others might have more context, but the context I at least have is:
- This is non-reproducible and spurious. Any requests for more information are understood to possibly take a while to fulfill while waiting for a crash.
- This crash is a segfault where the ip itself seems to be both the address of the faulting instruction and the address causing the fault. This was present in the OP's syslog logs, and I'm assuming that's the same for all other crashes shown here because the syslog hasn't been recaptured.
- The stack trace points at compiled wasm code, which as expected doesn't have a symbol associated with it.
That's unfortunately not really much to go on. "Segfault in wasm code" could range anywhere from a critical security bug in the runtime to some other random thread unknowingly unmapping memory. Debugging is in general a pretty interactive experience insofar as we don't have a runbook which says "run this command and it'll file an issue"; instead, debugging issues like this requires a lot of back-and-forth with what's being debugged. This is all of course under the limitations you're describing, which is a spurious crash that is not easily reproducible.
It's also worth pointing out that at least from my perspective I'm no zellij expert myself; rather, I'm not familiar with the codebase at all. Not being familiar with a codebase can severely hinder debugging because there's so much unknown context of what's going on. Now of course you're in this bucket with respect to Wasmtime as well (I'm assuming you're not intimately familiar with Wasmtime's codebase), and that's something I personally very much want to respect. I don't expect you to be able to provide the perfect crash log/trace that narrows down this issue in a few minutes from our perspective, but at the same time I'd hope you can be sympathetic to our end as well: "segfault in wasm code" is not much to go on in terms of debugging.
I also want to very much respect the pressures you have in play as well with respect to releasing. From my (probably naive) perspective it seems like you probably want to downgrade to Wasmtime 21 while this is debugged in parallel. Either that or have some sort of time box for investigating this and after that downgrade for a release. I'd love to give you an estimate of how long this will take or some mitigation to apply to make it less likely or workaround or something like that. With "segfault in wasm code" though that option unfortunately isn't a reality.
Some other information that could possibly help:
- I see in Zellij there's support for both Winch and Cranelift. Which is being used?
- I see in Zellij there's support for deserializing compiled modules from disk? Is that being used?
- If so, is it possible that the file is being modified on disk while it's in use? There's some documentation about the `unsafe` here. Also if so, perhaps you could test disabling this to see if it improves things? (A minimal sketch of the API in question follows after this list.)
- I've got no idea if I can inspect a core dump from your machine on my machine, but I figure it can't hurt. I'm no core dump expert myself.
- What are you currently doing to reproduce? Is there a sequence of things you do which after 15 minutes reproduces? Or do you just work in the background and once a week it crashes? This is somewhat rhetorical, but if you do have a method of reproducing, even if it takes a while, that's where again `rr` could be invaluable to reproduce the crash.
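For reference, a minimal sketch of the deserialization path in question (`Module::deserialize_file` is the real API; the helper around it is hypothetical). The `unsafe` contract is that the bytes must be a trusted, valid precompiled module that remains unchanged for the lifetime of the `Module`:

```rust
use std::path::Path;
use wasmtime::{Engine, Module};

// Hypothetical helper around the API under discussion. The call is
// `unsafe` because Wasmtime mmap()s the file: the contents must be a
// trusted precompiled module and must not change while the Module lives.
fn load_cached(engine: &Engine, path: &Path) -> anyhow::Result<Module> {
    unsafe { Module::deserialize_file(engine, path) }
}
```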
imsnif commented on issue #10283:
Hey @alexcrichton - first things first - I want to make sure I don't come across as unsympathetic or unappreciative of your situation. I very much get where you are coming from and understand the difficulty of debugging this very theoretical issue. You are not at my machine, and even if you were - this is not consistently reproducible and very broad. I'm not expecting any magical solutions and am very grateful for the support I'm getting from all of you. Thank you for continuing to work with me on this!
I see in Zellij there's support for deserializing compiled modules from disk? Is that being used?
If so, is it possible that the file is being modified on disk while it's in use? There's some documentation about the unsafe here. Also if so, perhaps you could test disabling this to see if it improves things?

I want to concentrate on this, because I think this is a very good guess and could very well be the issue.
The way we load plugins (modules) is to first compile them and then keep the compiled assets in our cache folder. The cache folder is indexed by a sha256 hash of the wasm-file.
When developing I use the Winch compiler, while in production we use the Cranelift compiler. As far as I understand it, these two would have different compiled assets but the same hash index. Could it be that while developing I overwrite the cache entry with an asset compiled by a different compiler and thus cause an already-instantiated module to crash? Am I understanding the docs correctly that even already-instantiated modules page code in from the file they were loaded from on disk?
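To make the suspicion concrete, a minimal sketch of a cache key derived from the wasm bytes alone (hypothetical code, not Zellij's actual implementation; assumes the sha2 crate):

```rust
use sha2::{Digest, Sha256};
use std::path::{Path, PathBuf};

// Hypothetical sketch of the suspected scheme: the key depends only on
// the source wasm bytes, so a Winch-compiled artifact and a
// Cranelift-compiled artifact for the same plugin land at the SAME path,
// and one build can overwrite the file the other build already mmap'd.
fn cache_path(cache_dir: &Path, wasm_bytes: &[u8]) -> PathBuf {
    let hash = Sha256::digest(wasm_bytes);
    let hex: String = hash.iter().map(|b| format!("{b:02x}")).collect();
    cache_dir.join(format!("{hex}.cwasm"))
}
```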
cfallin commented on issue #10283:
Yes, we `mmap()` the `.cwasm` file directly, so if the data at offsets in the file corresponding to the running code are changed, then the running code in memory will be changed underneath the instruction pointer, wherever it is. This could very well explain the RIP-in-the-middle-of-zeroes seen above.

Question: when you write out the files into the cache directory, do you open as writable and truncate, then write, or do you write a new file then rename it over the old file? I suspect the latter ("atomically replacing the file", as described here, often the method used by text editors on save) is what one wants. The effect of this is that the old file will be unlinked from the directory when replaced, but Wasmtime will still hold the file open (via the mmap) and so the old inode with old data will continue to exist and be used.
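A minimal sketch of that write-then-rename pattern, using only the standard library (helper name hypothetical):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical sketch of "write a new file, then rename it over the old
// one". rename() atomically swaps the directory entry, so a process that
// already mmap()'d the old file keeps the old inode (and old bytes)
// alive; nothing changes under its instruction pointer.
fn replace_atomically(dest: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = dest.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush the data before the rename makes it visible
    fs::rename(&tmp, dest)?;
    Ok(())
}
```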
Separately, it might be reasonable to add the compiler config to your cache key, so both variants can remain cached, but if the replacement is done properly as above then keying them separately should be "just an optimization" rather than load-bearing for correctness.
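And the keying suggestion as a sketch, building on the hypothetical cache path above:

```rust
use std::path::{Path, PathBuf};

// Hypothetical variant that also keys on the compiler strategy, so Winch
// and Cranelift artifacts coexist in the cache instead of colliding.
fn keyed_cache_path(cache_dir: &Path, compiler: &str, wasm_hash_hex: &str) -> PathBuf {
    cache_dir.join(format!("{compiler}-{wasm_hash_hex}.cwasm"))
}
```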
imsnif commented on issue #10283:
We don't take any special care of files in our cache. Until now I've honestly been treating it as completely disposable, and routinely used "delete the cache" as a way to solve issues for both myself and users.
Alright - looks like we've got a lead. I'm going to first try and run a version that uses a totally separate cache than my dev one to see if I still experience crashes. I'll make sure to report back here.
alexcrichton commented on issue #10283:
Nice! Definitely seems like a promising lead.
One other bit of context I can add is that nothing about this has changed in the 21->29 release history, so that would raise a few questions for me:
- Did you try downgrading to 21 to see if it fixed anything? If you did and it appeared to fix things that's worrisome, but if you hadn't done that yet just wanted to confirm.
- Did you implement support for Winch + Cranelift in the development of Zellij between when Wasmtime was added at 21 and now? That might help explain why it wasn't there originally but is cropping up now.
Or, well, I suppose this sort of assumes changing files is the cause. Although if the answers are "no, 21 hasn't been tested yet" and "yes, winch+cranelift support was added in this window", that's further fuel to the fire of "thanks, mmap unix semantics".
imsnif commented on issue #10283:
I have not yet tested version 21 (was planning to try tomorrow). Version 21 though was the first wasmtime version we went to production with after switching from wasmer. So until now I have never had a situation where I was working locally with a production and dev version that shared the same cache for a significant amount of time. It could very well be that this was all a red herring.
imsnif closed issue #10283:
Hey,
In the context of Zellij, ever since a recent (still unreleased) upgrade of wasmtime, I've been getting intermittent and seemingly random segfaults.
I am afraid I don't have a way to reproduce them, but I have been gathering some information about them and am hoping the maintainers can help pin-point the issue or provide more insight.
Version info:
wasmtime version: 29.0.1
(probably) did not occur in wasmtime version: 21.0.2

What happens: Zellij crashes, without a panic, and syslog tells me that it got a segfault:
```
[343995.595910] async-std/runti[2487041]: segfault at 705a8ea73970 ip 0000705a8ea73970 sp 0000705a8ea73948 error 15 likely on CPU 3 (core 1, socket 0)
```
With some debug logging, I managed to narrow the crash down to this block in our app: https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596
I have a core dump, but unfortunately not one with debug symbols, so I'm not sure it's a lot of help.
This happens about once or twice a day for me and so is extremely gnarly to pin-point. I'm not 100% sure this is a wasmtime issue, but the crash location would seem to indicate that it is.
Any help in troubleshooting or mitigating this somehow would be appreciated. On my part I'm going to try to get a coredump with some debug symbols so we'll hopefully be less blind.
Thanks!
imsnif commented on issue #10283:
Hi all - I'm coming back to report that I've been working with a separate cache folder on my dev instance for 3 days and have experienced no crashes. I think we hit the nail on the head. Very cool and much appreciated.
TL;DR for those who don't want to read the whole thread: this was an issue very likely stemming from this `unsafe`: https://docs.rs/wasmtime/latest/wasmtime/struct.Module.html#method.deserialize_file. The `deserialize_file` method requires that "the file referenced must remain unchanged and a valid precompiled module for the entire lifetime of the Module". The Zellij cache was being shared between instances that used different compilers but the same cache indices for modules.

I was not aware of this unsafe, and now that I am I'm going to get around it one way or another on the Zellij side (likely by keeping the modules in memory, if resource utilization / performance permits).
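For illustration, a minimal sketch of that in-memory approach (hypothetical helper; still `unsafe`, since the bytes must be a trusted precompiled module, but immune to the file changing on disk afterwards):

```rust
use std::path::Path;
use wasmtime::{Engine, Module};

// Hypothetical sketch: read the cached artifact into an owned buffer and
// deserialize from memory, so later writes to the cache file cannot
// alter the running code. Costs a full copy of the artifact in RAM.
fn load_cached_in_memory(engine: &Engine, path: &Path) -> anyhow::Result<Module> {
    let bytes = std::fs::read(path)?;
    unsafe { Module::deserialize(engine, &bytes) }
}
```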
Thanks everyone for the excellent support!
Last updated: Apr 17 2025 at 01:31 UTC