Cannot extract relative reference links in Markdown #1657

wks opened this issue Mar 19, 2025 · 1 comment
wks commented Mar 19, 2025

Test case:

Inline [link1](target1.md)

Reference [link2][link2]

[link2]: target2.md

Collapsed [link3][]

[link3]: target3.md

Shortcut [link4]

[link4]: target4.md

Shortcut [link5] with full URL

[link5]: file:///path/to/target5.md

Save this as ~/junk/lychee/baz.md and process it with lychee baz.md --dump -vv, and it prints:

file:///home/wks/junk/lychee/target1.md (baz.md)
file:///path/to/target5.md (baz.md)

lychee successfully extracts the link to target1.md and resolves the relative path to a URL starting with file:///....

However, link2, link3, and link4 are not extracted. link5 is extracted as well, but its target is a full URL rather than a filename.

I think the problem is in the handling of links in the markdown parser.

// excerpt from lychee-lib/src/extract/markdown.rs

pub(crate) fn extract_markdown(input: &str, include_verbatim: bool) -> Vec<RawUri> {
// ...
                match link_type {
                    LinkType::Inline => {
                        Some(vec![RawUri {
                            text: dest_url.to_string(),
                            element: Some("a".to_string()),
                            attribute: Some("href".to_string()),
                        }])
                    }
                    LinkType::Reference |
                    LinkType::ReferenceUnknown |
                    LinkType::Collapsed |
                    LinkType::CollapsedUnknown |
                    LinkType::Shortcut |
                    LinkType::ShortcutUnknown |
                    LinkType::Autolink |
                    LinkType::Email =>
                        Some(extract_raw_uri_from_plaintext(&dest_url)),

For inline links, the code simply treats dest_url as the href. For every other kind of link, it calls extract_raw_uri_from_plaintext, which relies on heuristics to detect URLs. As a result, a target that doesn't look like a URL, such as foo_bar_baz.md in [label]: foo_bar_baz.md, is ignored.
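One possible direction, shown only as a sketch, is to treat the already-resolved destination of reference, collapsed, and shortcut links the same way as inline links, and to keep the plaintext heuristic for autolinks and emails only. Everything below is a simplified stand-in and not lychee's or the Markdown parser's actual API: `LinkKind`, `RawLink`, `href`, `looks_like_url`, and `extract_link` are hypothetical names, and `looks_like_url` is a toy replacement for the real heuristic.

```rust
/// Hypothetical, simplified stand-in for the parser's link kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkKind {
    Inline,
    Reference,
    Collapsed,
    Shortcut,
    Autolink,
    Email,
}

/// Hypothetical stand-in for the extracted link record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawLink {
    pub text: String,
    pub element: Option<String>,
    pub attribute: Option<String>,
}

/// Build a record that treats the destination as an `href`, as the inline arm does.
fn href(dest_url: &str) -> RawLink {
    RawLink {
        text: dest_url.to_string(),
        element: Some("a".to_string()),
        attribute: Some("href".to_string()),
    }
}

/// Toy heuristic (an assumption): accept only strings that carry a scheme.
fn looks_like_url(text: &str) -> bool {
    text.contains("://") || text.starts_with("mailto:")
}

/// Sketch of the proposed dispatch: link kinds whose destination was already
/// resolved by the parser are taken verbatim; only autolinks and emails still
/// go through the heuristic.
pub fn extract_link(kind: LinkKind, dest_url: &str) -> Vec<RawLink> {
    match kind {
        LinkKind::Inline
        | LinkKind::Reference
        | LinkKind::Collapsed
        | LinkKind::Shortcut => vec![href(dest_url)],
        LinkKind::Autolink | LinkKind::Email => {
            if looks_like_url(dest_url) {
                vec![href(dest_url)]
            } else {
                Vec::new()
            }
        }
    }
}
```

With this dispatch, `extract_link(LinkKind::Shortcut, "target4.md")` yields one record whose text is `target4.md`, whereas the same filename passed through the heuristic arm would be dropped. Whether the `*Unknown` variants should be handled the same way is left open here.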

mre added the bug label on Mar 20, 2025
mre commented Mar 20, 2025

Haven't looked too deeply into it, but could be a duplicate of #1574. There is #1624 if you want to give the current development branch a try.
