Broken images #5

Closed
peterjc opened this issue Feb 5, 2024 · 10 comments

Comments

@peterjc
Member

peterjc commented Feb 5, 2024

e.g. https://open-bio.org/wiki/Codefest_2016 has an image https://open-bio.org/w/images/2/22/FamiLAB-Logo3.gif

In the current rendering, the page tries to show https://obf.github.io/wiki/_FamiLAB-Logo3.gif instead (not found).

So (a) where is the image, and (b) why the leading underscore?

@peterjc
Member Author

peterjc commented Feb 5, 2024

It appears the image(s) are not in the SQLite conversion of the XML dump. Are they in the XML dump?

@peterjc
Member Author

peterjc commented Feb 5, 2024

It appears I didn't dump the images when getting the XML:

$ grep "File:FamiLAB-Logo3.gif" obf_mediawiki_dump.xml -A 16 -B 1
  <page>
    <title>File:FamiLAB-Logo3.gif</title>
    <ns>6</ns>
    <id>415</id>
    <revision>
      <id>4351</id>
      <timestamp>2016-03-17T13:41:46Z</timestamp>
      <contributor>
        <username>Chapmanb</username>
        <id>24</id>
      </contributor>
      <comment>FamiLAB logo</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="12">FamiLAB logo</text>
      <sha1>kg4rg7v2ofe0ejbxa4gz1t1qvwk92bj</sha1>
    </revision>
  </page>
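
The page entry above holds only the description text, not the image itself, so it may help to list every File: page in the dump to see which images still need fetching. A minimal sketch, assuming the dump is small enough to parse in memory; the helper names `local_name` and `file_titles` are made up for illustration and are not part of any existing script:

```python
import xml.etree.ElementTree as ET

FILE_NAMESPACE = "6"  # <ns>6</ns> marks File: pages in the dump shown above


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def file_titles(xml_text: str) -> list[str]:
    """Return the titles of all File: pages found in a MediaWiki XML dump."""
    titles = []
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        # Map each direct child (title, ns, id, ...) to its text content.
        fields = {local_name(child.tag): (child.text or "") for child in page}
        if fields.get("ns") == FILE_NAMESPACE:
            titles.append(fields.get("title", ""))
    return titles
```

Calling `file_titles(open("obf_mediawiki_dump.xml", encoding="utf-8").read())` should give one title per image page, which can then be compared against the files actually downloaded.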

@peterjc
Member Author

peterjc commented Feb 5, 2024

Looks like at least 80 images:

$ cat *.md | grep "<img" | sort | uniq | wc -l
      80
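
The pipeline above counts unique lines containing `<img`, which is only an approximation: the same image used with a different title counts twice. A rough sketch of counting distinct `src` values instead, assuming the `<img ... src="...">` form seen in the converted Markdown; `unique_image_sources` is a hypothetical helper:

```python
import re

# Matches the src attribute of an <img> tag, even when the tag spans lines.
IMG_SRC = re.compile(r'<img\s[^>]*?src="([^"]*)"')


def unique_image_sources(markdown_texts):
    """Collect the distinct src values of <img> tags across several documents."""
    sources = set()
    for text in markdown_texts:
        sources.update(IMG_SRC.findall(text))
    return sources
```

Passing the contents of each `*.md` file and taking `len()` of the result gives the number of distinct image files referenced.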

@peterjc
Member Author

peterjc commented Feb 5, 2024

Confirmed: I missed the include-files option in the initial dump. Addressed and documented in 8b7d369

@peterjc
Member Author

peterjc commented Feb 7, 2024

It seems images in MediaWiki are not forced to start with a capital letter...

@peterjc
Member Author

peterjc commented Feb 7, 2024

I think the handful of images with a leading underscore are a bug in my script:

$ grep 'src="_' *.md
BOSC_2014.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
BOSC_2014_Schedule.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
BOSC_2015.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
BOSC_2015_Schedule.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
BOSC_2016.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
BOSC_2016_Schedule.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
Codefest_2014.md:<img src="_Aws-logo.jpeg" title="AWS logo|link=http://aws.amazon.com/"
Codefest_2014.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
Codefest_2015.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
Codefest_2016.md:<img src="_FamiLAB-Logo3.gif" title="FamiLAB|link=http://familab.org"
Codefest_2016.md:<img src="_Arvados.png" title="Arvados logo|link=http://arvados.org"
Codefest_2017.md:<img src="_Brmlab.png" title="Brmlab|link=http://brmlab.cz"

@peterjc
Member Author

peterjc commented Feb 7, 2024

Some of the images are breaking as %E2%80%8E has been added to the URL, e.g. https://obf.github.io/wiki/SB_logo_navy.png%E2%80%8E instead of https://obf.github.io/wiki/SB_logo_navy.png on https://obf.github.io/wiki/Codefest_2017

This is apparently the left-to-right mark (https://en.wikipedia.org/wiki/Left-to-right_mark), and it is present in the HTML snippet within the Markdown:

$ hexdump -C Codefest_2017.md  | grep navy
00003610  6e 61 76 79 2e 70 6e 67  e2 80 8e 22 0a 74 69 74  |navy.png...".tit|

It was in the raw mediawiki dump too:

$ hexdump -C Codefest_2017.mediawiki  | grep navy -A 1
00002ed0  65 3a 53 42 5f 6c 6f 67  6f 5f 6e 61 76 79 2e 70  |e:SB_logo_navy.p|
00002ee0  6e 67 e2 80 8e 7c 32 32  30 70 78 7c 63 65 6e 74  |ng...|220px|cent|
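
The bytes `e2 80 8e` are the UTF-8 encoding of U+200E, which percent-encodes to `%E2%80%8E` in the URL. One possible cleanup, sketched here under the assumption that no image filename legitimately contains invisible format characters, is to drop everything in Unicode category Cf before writing the `src`; `strip_invisible` is a hypothetical helper, not the fix actually committed:

```python
import unicodedata
from urllib.parse import quote


def strip_invisible(name: str) -> str:
    """Remove Unicode format characters (category Cf), such as U+200E."""
    return "".join(ch for ch in name if unicodedata.category(ch) != "Cf")


raw = "SB_logo_navy.png\u200e"
print(quote(raw))                   # the broken URL suffix seen on the site
print(quote(strip_invisible(raw)))  # the intended filename
```

Category Cf also covers the right-to-left mark and zero-width joiners, so this is broader than strictly needed for the case above.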

@peterjc
Member Author

peterjc commented Feb 7, 2024

The leading underscores stem from a leading space in the MediaWiki version, which MediaWiki apparently ignores:

$ grep "File: " *.mediawiki
BOSC_2014.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
BOSC_2014_Schedule.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
BOSC_2015.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
BOSC_2015_Schedule.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
BOSC_2016.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
BOSC_2016_Schedule.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
Codefest_2014.mediawiki:|rowspan="2"| [[File: Aws-logo.jpeg|150px|center|AWS logo|link=http://aws.amazon.com/]]
Codefest_2014.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
Codefest_2015.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
Codefest_2016.mediawiki:| [[File: FamiLAB-Logo3.gif|400px|left|FamiLAB|link=http://familab.org]]
Codefest_2016.mediawiki:| [[File: Arvados.png|150px|center|Arvados logo|link=http://arvados.org]]
Codefest_2017.mediawiki:| [[File: Brmlab.png|400px|left|Brmlab|link=http://brmlab.cz]]

It looks like a corner case pandoc does not currently support.

Edit: logged as jgm/pandoc#9425
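
Until that is handled upstream, a workaround could normalise the link target before it becomes a filename. This is only a sketch: `image_filename` is a hypothetical helper, it assumes the target is the text between `[[` and the first `|`, and it is neither pandoc's logic nor the script's actual fix. Case is deliberately left alone.

```python
def image_filename(target: str) -> str:
    """Turn a MediaWiki file link target into a filename.

    Example input: 'File: Arvados.png' (note the space after the colon).
    """
    prefix, sep, name = target.partition(":")
    if not sep:
        name = prefix  # no namespace prefix present
    # Strip first, so a leading space never becomes a leading underscore,
    # then apply the usual MediaWiki space-to-underscore convention.
    return name.strip().replace(" ", "_")
```

With this, `File: Arvados.png` and `File:Arvados.png` both map to `Arvados.png`.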

@peterjc
Member Author

peterjc commented Feb 7, 2024

I believe all the images are working now, closing.

@peterjc peterjc closed this as completed Feb 7, 2024
peterjc added a commit that referenced this issue Feb 9, 2024
peterjc added a commit that referenced this issue Feb 9, 2024
@peterjc
Member Author

peterjc commented Feb 9, 2024

Lots were broken by peterjc/mediawiki_to_git_md#35: the image names were wrongly put into title case.
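
To illustrate why title casing is a problem for filenames on a case-sensitive host, here is a small sketch using Python's `str.title()`. Whether the change in #35 used `str.title()` is an assumption on my part, and `changed_by_title_case` is a hypothetical helper for spotting affected names:

```python
def changed_by_title_case(names):
    """Return the names that title casing would alter, with the altered form."""
    return {name: name.title() for name in names if name.title() != name}


examples = ["SB_logo_navy.png", "Arvados.png", "FamiLAB-Logo3.gif"]
for original, broken in changed_by_title_case(examples).items():
    print(f"{original} -> {broken}")
```

`str.title()` treats every run of letters as a word, so it rewrites the extension and any mixed-case part of the name as well as the first letter.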

@peterjc peterjc reopened this Feb 9, 2024
@peterjc peterjc closed this as completed in 912acf2 Feb 9, 2024