Skip to content

SEGFAULT and/or SEGABRT with GtkCanvas #87

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nHackel opened this issue Mar 7, 2025 · 14 comments · Fixed by #89
Closed

SEGFAULT and/or SEGABRT with GtkCanvas #87

nHackel opened this issue Mar 7, 2025 · 14 comments · Fixed by #89
Labels
bug Something isn't working

Comments

@nHackel
Copy link
Contributor

nHackel commented Mar 7, 2025

Hello,

for I think a few years now, we've had sporadic segmenation faults in our Gtk4.jl GUI. With recent versions we've been able to somewhat reliably produce the error and it looked like the CairoSurfaces of the GtkCanvas were somtimes garbage collected.

Usually we didnt't see any Julia-level errors or assertions about that. Today we tried to change our GtkCanvas printing from Cairo.jl to CairoMakie.jl and there we got some more Julia stacktraces and log messages.

I've been able to produce this MWE:

using Gtk4, CairoMakie, Cairo

function Base.setproperty!(surface::CairoSurface, fieldname::Symbol, val)
    if fieldname == :ptr && val == C_NULL
        @show stracktrace()
    end
    setfield!(surface, fieldname, val)
end

config = CairoMakie.ScreenConfig(1.0, 1.0, :good, true, false, nothing)
CairoMakie.activate!()

canvas = GtkCanvas()
w = GtkWindow(canvas,"CairoMakie example")

@guarded draw(canvas) do widget
    # global f, ax, p = lines(1:10) # no stacktrace
    f, ax, p = lines(1:10) # stacktrace

    CairoMakie.autolimits!(ax)
    screen = CairoMakie.Screen(f.scene, config, Gtk4.cairo_surface(canvas))
    CairoMakie.resize!(f.scene, Gtk4.width(widget), Gtk4.height(widget))
    CairoMakie.cairo_draw(screen, f.scene)
end

which prints a stacktrace if the surface access is illegal. So on the hand the "bug" is partially in the user code, because the figure isn't referenced and thus can be garbage collected.

But I'm not sure why that garbage collection triggers the CairoSurface to be also destroyed. I think that part might be an issue in Gtk4.jl. With this MWE I've also not managed to create a segfault, so something "regenerates" the cairo surface?

I saw here that the canvas get be reinitialized when resized, so maybe it happens there

@nHackel
Copy link
Contributor Author

nHackel commented Mar 7, 2025

My solution for our UI package for now will be to have something akin to:

mutable struct CairoMakieCanvas
  canvas::GtkCanvas
  figure::CairoMakie.Figure
end

function draw(fig::Figure, canvas::CairoMakieCanvas)
    canvas.figure = fig
    @guarded draw(canvas.canvas) do widget
	# ... CairoMakie.cairo_draw and so 
    end
end

and update the figure and draw

@jwahlstrand
Copy link
Member

When the canvas is resized, the CairoImageSurface backing store has to be replaced, since there's no way to resize the surface. I wonder if we should be creating the CairoMakie screen using a separate CairoImageSurface that is then copied onto the GtkCanvas backing store in the draw function. I'll bet it would not affect performance much.

Alternatively, could you call cairo_surface_reference when creating the screen, to keep the surface alive while it's being drawn on? You could unreference it after the draw using cairo_surface_destroy.

@nHackel
Copy link
Contributor Author

nHackel commented Mar 8, 2025

I'm not sure if I understand the situation correctly. The replacement due to resize would happen in either case right? And also with the the workaround I mentioned in my comment?

In the MWE the stacktrace seemed to only be printed in the case without the global keyword. If resizing replaces the store in either case, then maybe the GC'ed figure also has another effect.

I played around a bit more with the workaround I posted and so far I wasn't able to crash the GUI again. I'll only be able to a do proper stress test on monday with live image reconstructions and multiple sensor readings in parallel

@nHackel
Copy link
Contributor Author

nHackel commented Mar 9, 2025

Ah okay I see the issue with having to redraw the surface now. What is still unclear to me why we get an issue when we construct the figure locally and don't have an outside reference to it. Maybe that issue is not being reproduced by the MWE correctly.

For the reference issue we could try an approach with a package extension on CairoMakie and using ScopedValues. In the extension we could define a new draw method as follows:

const figure_cache = ScopedValue(CairoMakie.Figure)()

function Gtk4.draw(fig::CairoMakie.Figure, canvas::GtkCanvas)
  @with figure_cache => fig begin
    @guarded draw(canvas) do widget
      f = figure_cache[]
      screen = CairoMakie.Screen(f.scene, config, Gtk4.cairo_surface(canvas))
      CairoMakie.resize!(f.scene, Gtk4.width(widget), Gtk4.height(widget))
      CairoMakie.cairo_draw(screen, f.scene)
    end
  end
end

And then I think each instance of this function call should have its own figure_cache. I haven't tried it out yet (because Julia is broken on my PC atm), but once I do I can make a PR if I see that helping us with our problem

@jwahlstrand
Copy link
Member

For me the stacktraces appear whether or not the global keyword is used... it makes sense that the surface should get GC'ed in either case. I was just wondering if the crash issue was being caused by code writing to the surface after it was GC'ed, in which case adding a reference should be able to fix it.

Do the crashes occur more when you use multiple threads?

@nHackel
Copy link
Contributor Author

nHackel commented Mar 9, 2025

Yes so with the CairoMakie figures we would get a few assertion errors for Makie where the surface was a C_NULL. If we had a lot of drawing updates either because we had many plots being updated "at the same time" or updates with a quick succession we had our segfaults. Those two cases I can test tomorrow more thoroughly with the CairoMakieCanvas workaround. The scoped value one I can test in 1-2 weeks.

In those cases we also always have more than two threads, but I think most updates to the drawing function happen within @idle_add, so via the main loop task. And we also have julia -t x,1, so only one thread in the Gtk4 thread pool.

The issue where I tested my workaround I could reproduce by opening two windows with I think 5 GtkCanvases each. That just SEGFAULT/SEBARBT reliably recently and it happened with just one thread too. All the data and code one would need for that is open-source, so I can give you a reproducer for that, but it's not a small MWE

@nHackel
Copy link
Contributor Author

nHackel commented Mar 9, 2025

Oh and we just added a reference to the figure, not to the surface. Now I'm unsure if that actually fixed it for us or if we just dont encounter our race condition anymore

@nHackel
Copy link
Contributor Author

nHackel commented Mar 10, 2025

Okay so we are adding/updating the draw function outside of an @idle_add. From what I can tell the draw call goes to ccall(gtk_widget_queue_draw), which seems to schedule it during the next frame update. Which should also happen in the main loop right?

I think I'm still missing something, because it looks like when we throw away the surface and resize if, that should happen "atomically" from a GTK loop perspective and the draw calls shouldn't be invoked when a surface is currently empty.

It might be that our GUI is keeping some reference to the wrong CairoContext and tries to draw to it and now that I switched over to making most things in Makie I accidentally fixed it

I did notice that we currently have multiple draw calls per resize in Gtk4. One in resize itself and one in inner resize

@nHackel
Copy link
Contributor Author

nHackel commented Mar 10, 2025

So far we also don't have anymore crashes when doing "online" reconstructions, which we used to immidiately have last week. In that setting we have a thread receiving data and and another thread periodically reconstructing a 3D image which we then display on such grid of canvases:

Image

Now each canvas is updated with its own figure which it caches until the next image is reconstructed. Once I'm back at the system I can also try resizing everything during this process.

I also didn't find any function our the GUI that keeps a reference to a CairoContext that is outside of a draw function call

@nHackel
Copy link
Contributor Author

nHackel commented Mar 20, 2025

@jwahlstrand I was trying to get a better MWE and I can reproduce a similar crash with it:

using Gtk4, CairoMakie, Cairo

config = CairoMakie.ScreenConfig(1.0, 1.0, :good, true, false, nothing)
CairoMakie.activate!()

grid = GtkGrid()
grid[1,1] = GtkCanvas()
grid[1,2] = GtkCanvas()
grid[2,1] = GtkCanvas()
grid[2,2] = GtkCanvas()
grid.row_homogeneous = true
grid.column_homogeneous = true
w = GtkWindow(grid,"CairoMakie example")

function showData(canvas, data::AbstractMatrix)
  @guarded draw(canvas) do widget
      f, ax, p = heatmap(data)
      CairoMakie.autolimits!(ax)
      screen = CairoMakie.Screen(f.scene, config, Gtk4.cairo_surface(canvas))
      CairoMakie.resize!(f.scene, Gtk4.width(widget), Gtk4.height(widget))
      CairoMakie.cairo_draw(screen, f.scene)
  end
end


function showData(canvas, data::AbstractVector)
  @guarded draw(canvas) do widget
      f, ax, p = lines(data)
      CairoMakie.autolimits!(ax)
      screen = CairoMakie.Screen(f.scene, config, Gtk4.cairo_surface(canvas))
      CairoMakie.resize!(f.scene, Gtk4.width(widget), Gtk4.height(widget))
      CairoMakie.cairo_draw(screen, f.scene)
  end
end

function updateData(g, i)
  data = zeros(32, 32)
  val = mod1(div(i[], 10), size(data, 1))
  data[val, :] .= 1
  showData(g[1,1], data)
  showData(g[1,2], data')
  showData(g[2,1], data[1, :])
  showData(g[2,2], data)
  i[] = i[] + 1
end

index = Ref(1)
timer = Timer(timer -> updateData(grid, index), 0, interval = 0.05)

I also tried this variant with keeping a vector of 4 figures which I update but there I get the same error. So it behaves a bit different to our setup, because there the caching seems to have helped.

@tknopp tknopp added the bug Something isn't working label Mar 23, 2025
@tknopp
Copy link
Member

tknopp commented Mar 23, 2025

I can reproduce here on Mac. The stack trace is:

, ┌ Warning: Error in @guarded callback
│   exception =
│    AssertionError: surface.ptr != C_NULL
│    Stacktrace:
│      [1] get_render_type
│        @ ~/.julia/packages/CairoMakie/JchUZ/src/cairo-extension.jl:79 [inlined]
│      [2] device_scaling_factor
│        @ ~/.julia/packages/CairoMakie/JchUZ/src/screen.jl:125 [inlined]
│      [3] CairoMakie.Screen(scene::Scene, config::CairoMakie.ScreenConfig, surface::Cairo.CairoSurfaceBase{UInt32})
│        @ CairoMakie ~/.julia/packages/CairoMakie/JchUZ/src/screen.jl:341

So it seems that the surface returned by Gtk4.cairo_surface is not valid.

@jwahlstrand
Copy link
Member

jwahlstrand commented Mar 23, 2025

I can also reproduce on Linux.

When the lines that create the canvases are replaced with:

canvas11 = grid[1,1] = GtkCanvas()
canvas12 = grid[1,2] = GtkCanvas()
canvas21 = grid[2,1] = GtkCanvas()
canvas22 = grid[2,2] = GtkCanvas()

it works. So if the canvases (the Julia objects) are kept around, the surfaces survive.

This suggests maybe there is a bug in how Gtk4 handles the case when the Julia object goes out of scope but the corresponding GObject isn't destroyed. When that happens, the Julia object is supposed to be saved in Gtk4.GLib.gc_preserve_glib.

@tknopp
Copy link
Member

tknopp commented Mar 23, 2025

Interesting. I would have assumed that the grid should somehow hold the (Julia) reference to all its children.
Do you have enough insight into that code? I never understood the GObject <-> Julia magic.

@jwahlstrand
Copy link
Member

jwahlstrand commented Mar 23, 2025

I understand what the code in gtype.jl does, but I don't understand in many cases why it's written that way. There are many cases where objects are kept around that should be garbage collected, leading to effective memory leaks. All my attempts to fix that have resulted in segfaults. It definitely also has issues when there is more than one thread.

Interestingly, I went back to the original code (where the canvases are just put in the grid) and tried running just the first iteration of the timer, then starting the timer in the REPL, and the code works. When I did this, the canvases all appeared in Gtk4.GLib.gc_preserve_glib as I expected. Until I call GC.gc(), when it crashes. Not sure what this means yet.

jwahlstrand added a commit that referenced this issue Mar 23, 2025
jwahlstrand added a commit that referenced this issue Mar 24, 2025
…n `gc_preserve_glib`

This prevents these objects, which are "subclasses" such as GtkCanvas, from being garbage
collected at all. Previously, the code was supposed to intercept the garbage collection
and do this "weak to strong" conversion if the object survives being unreferenced by
Julia, but #87 seems to indicate a problem.
This is a band-aid fix for that issue.
jwahlstrand added a commit that referenced this issue Apr 18, 2025
…n `gc_preserve_glib` (#89)

This prevents these objects, which are "subclasses" such as GtkCanvas, from being garbage
collected at all. Previously, the code was supposed to intercept the garbage collection
and do this "weak to strong" conversion if the object survives being unreferenced by
Julia, but #87 seems to indicate a problem.
This is a band-aid fix for that issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants