Skip to content

Conversation

@toddsundsted
Copy link
Contributor

@toddsundsted toddsundsted commented Nov 23, 2025

Parser contexts created with xmlNewParserCtxt() and htmlNewParserCtxt() are never freed. This PR adds missing cleanup calls (xmlFreeParserCtxt() and htmlFreeParserCtxt()) in four methods in src/xml.cr. I believe this is a regression introduced in PR #15906.

The following spec demonstrates the problem. I had considered adding this to the suite of tests, but decided against it because it is non-deterministic. Running this against a pre-1.17.x version succeeds. Running this against the version with this PR in place succeeds. Running this against 1.17.x to 1.18.x results in over 1GB of memory allocated but not freed. (More critically, if you adjust the number of iterations, the memory allocated scales linearly with iterations and never plateaus.)

require "spec"
require "xml"

describe "XML memory management" do
  it "does not leak parser contexts" do
    xml = "<root><item>test</item></root>"

    GC.collect
    baseline_rss = get_rss_kb

    1_000_000.times { XML.parse(xml) }

    GC.collect
    final_rss = get_rss_kb
    growth = final_rss - baseline_rss

    (growth < 50_000).should be_true
  end
end

private def get_rss_kb : Int64
  {% if flag?(:darwin) %}
    `ps -o rss= -p #{Process.pid}`.strip.to_i64
  {% else %}
    File.read("/proc/self/status").lines.each do |line|
      return line.split[1].to_i64 if line.starts_with?("VmRSS:")
    end
    0_i64
  {% end %}
end

Copy link
Contributor

@ysbaddaden ysbaddaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed 🤦

Polish: maybe there could be a pair #parse_xml(&) and #parse_html(&) methods. For example:

private def parse_xml(&)
  ctxt = LibXML.xmlNewParserCtxt
  begin
    from_ptr(ctxt) { yield ctxt }
  ensure
    LibXML.xmlFreeParserCtxt(ctxt)
  end
end

parse_xml do |ctxt|
  LibXML.xmlCtxtReadMemory(ctxt, ...)
end

@ysbaddaden ysbaddaden added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:serialization labels Nov 24, 2025
@straight-shoota straight-shoota added the kind:regression Something that used to correctly work but no longer works label Nov 24, 2025
@straight-shoota straight-shoota added this to the 1.19.0 milestone Nov 24, 2025
@toddsundsted
Copy link
Contributor Author

Indeed 🤦

Polish: maybe there could be a pair #parse_xml(&) and #parse_html(&) methods. For example:

private def parse_xml(&)
  ctxt = LibXML.xmlNewParserCtxt
  begin
    from_ptr(ctxt) { yield ctxt }
  ensure
    LibXML.xmlFreeParserCtxt(ctxt)
  end
end

parse_xml do |ctxt|
  LibXML.xmlCtxtReadMemory(ctxt, ...)
end

addressed in 36c769a

@Sija
Copy link
Contributor

Sija commented Nov 25, 2025

@toddsundsted Why was that closed? :(

@straight-shoota
Copy link
Member

Seems like a clerical error. 😆
The same source branch is reused in #16412 which autoclosed this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:bug A bug in the code. Does not apply to documentation, specs, etc. kind:regression Something that used to correctly work but no longer works topic:stdlib:serialization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants