This script uses the standard exported XML file from WordPress, and creates a folder/file structure that contains all of the blog posts, converted to markdown format. It will also download all of the images.
Instructions for exporting your information from WordPress can be found here.
This is a fork of Swizec's improvements that is targeting migration to Docusaurus blogs. I didn't bother to adapt the metadata generation, so that may not work as intended for your use-case.
It worked to convert our ten-year-old blog. Your mileage may vary.
This program puts its output into the ./out folder, and all images go to /img.
All posts are written to ./out in the YYYY-MM-DD-title-of-post.md style.
All images are written to ./out//img. If you have duplicate image file names, this will be an issue :)
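A minimal sketch of how the naming could work. Both helpers are hypothetical, not the script's actual functions. The image helper assumes you flatten the upload path into the file name, which is one way to make duplicate file names less likely.

```javascript
// Hypothetical helper: build a "YYYY-MM-DD-title-of-post.md" file name.
function postFileName(published, title) {
  // Keep only the date part of the ISO timestamp (UTC).
  const date = new Date(published).toISOString().slice(0, 10);
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `${date}-${slug}.md`;
}

// Hypothetical helper: flatten the URL path into the image file name,
// so /2016/10/a.jpg and /2017/03/a.jpg no longer collide.
function imageFileName(imageUrl) {
  const { pathname } = new URL(imageUrl);
  return pathname.replace(/^\/+/, "").replace(/\//g, "-");
}

console.log(postFileName("2018-01-10", "Always put side effects last"));
console.log(imageFileName("https://example.com/wp-content/uploads/2016/10/tower.jpg"));
```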
Export your WordPress site to XML, then:
$ git clone https://github.com/Swizec/wordpress-to-markdown
# download your wordpress xml
# change filename on convert.js line 27
$ yarn
$ yarn convert
# sip margaritas
This uses xml2js to parse the XML, and then uses a combination of Rehype, Remark, Prettier, and custom plugins to convert your blogs into prettified Markdown. Embeds become links so your site can run modern-style embedding.
wordpress-to-markdown performs a number of conversions for you that make stuff work better with modern static site builders. Read my article for technical details on how this works.
I wanted to create full frontmatter without manual edits. That means:
---
title: 'Always put side effects last'
description: ""
published: 2018-01-10
redirect_from:
- /blog/always-put-side-effects-last/swizec/8057
categories: "Startups, Technical"
hero: ./img/wp-content-uploads-2016-10-salesforce-tower-panorama-1024x358.jpg
---
Title from post title, description based on meta data, a publish date, keep old URL for redirects, combine categories and tags into categories, find a good hero/social image.
Data comes from digging around Wordpress exports and figuring out what fits.
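Here's a rough sketch of how that frontmatter could be assembled. The shape of the `post` object and its field names are assumptions for illustration; the real export data looks different and needs more digging.

```javascript
// Hypothetical post shape: { title, description, published, link, categories, tags, hero }.
function buildFrontmatter(post) {
  // Combine categories and tags, dropping duplicates while keeping order.
  const categories = [...new Set([...post.categories, ...post.tags])].join(", ");
  const lines = [
    "---",
    // JSON strings are valid YAML double-quoted strings, so quoting is handled for us.
    `title: ${JSON.stringify(post.title)}`,
    `description: ${JSON.stringify(post.description || "")}`,
    `published: ${post.published}`,
    "redirect_from:",
    // Keep the old URL path so the new site can redirect to the post.
    `  - ${new URL(post.link).pathname}`,
    `categories: ${JSON.stringify(categories)}`,
  ];
  if (post.hero) lines.push(`hero: ${post.hero}`);
  lines.push("---");
  return lines.join("\n");
}

console.log(
  buildFrontmatter({
    title: "Always put side effects last",
    description: "",
    published: "2018-01-10",
    link: "https://example.com/blog/always-put-side-effects-last/swizec/8057",
    categories: ["Startups"],
    tags: ["Technical"],
    hero: "./img/hero.jpg",
  })
);
```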
Wordpress HTML is pretty good. Plop it in an HTML parser and, like, it won't choke ... but it won't parse correctly either.
We change double newlines to paragraph breaks. Wordpress doesn't wrap paragraphs in <p></p> tags.
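A minimal sketch of the idea, not the script's actual plugin. `wrapParagraphs` is a hypothetical name, and the list of block-level tags it leaves alone is an assumption. Note that it would also split on blank lines inside a code block, so a real version needs to protect those first.

```javascript
// Hypothetical helper: turn blank-line-separated chunks into <p> paragraphs.
function wrapParagraphs(html) {
  const blockTag = /^<(p|pre|blockquote|ul|ol|h[1-6]|div|table)\b/i;
  return html
    .replace(/\r\n/g, "\n") // normalize Windows line endings
    .split(/\n{2,}/) // a blank line separates paragraphs
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    // Leave chunks that already start with a block-level tag untouched.
    .map((chunk) => (blockTag.test(chunk) ? chunk : `<p>${chunk}</p>`))
    .join("\n\n");
}

console.log(wrapParagraphs("First paragraph\n\nSecond paragraph"));
```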
I wrote about fixing bad code blocks in my article "You though computer science has no place in webdev? Here's a fun coding challenge".
Your challenge is that this isn't valid HTML:
<pre lang="javascript">
class ReportSize extends React.Component {
  refCallback = element => {
    if (element) {
      this.props.getSize(element.getBoundingClientRect());
    }
  };
  render() {
    return (
      <div ref={this.refCallback} style={{ border: "1px solid red" }}>
        {faker.lorem.paragraphs(Math.random() * 10)}
      </div>
    );
  }
}
</pre>
JSX tags get parsed as HTML and break your code block. You want each <pre> block to include a <code></code> tag as well. Otherwise Markdown stringifying doesn't work right.
Fixing this is tricky and this script does it for you.
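To give a feel for the problem, here's a simplified regex-based sketch. It is not how the script does it, and the function names are hypothetical. It assumes the code inside `<pre>` is raw (not already entity-escaped) and never contains a literal `</pre>`.

```javascript
// Escape characters that an HTML parser would otherwise treat as markup.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: escape the body of each <pre> and wrap it in <code>.
function fixCodeBlocks(html) {
  return html.replace(
    /<pre(\s[^>]*)?>([\s\S]*?)<\/pre>/g,
    (match, attrs = "", body) => {
      // Already wrapped in <code>: leave the block alone.
      if (/^\s*<code[\s>]/.test(body)) return match;
      return `<pre${attrs}><code>${escapeHtml(body)}</code></pre>`;
    }
  );
}

console.log(fixCodeBlocks('<pre lang="javascript">\n<div ref={x}>hi</div>\n</pre>'));
```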
There are lots of ways to embed 3rd party content on a Wordpress site. You can use plain old links pasted on their own line, shortcodes, and full HTML embeds.
Markdown site generators like to use plain links.
You want to change code like:
<blockquote class="twitter-tweet">
  <p lang="en" dir="ltr">
    A script that converts Wordpress dumps into clean Markdown may have been the
    dumbest project I ever took on. Sooooo many edge cases 😅
    <a href="https://t.co/z8dPUMrBGk">pic.twitter.com/z8dPUMrBGk</a>
  </p>
  — Swizec Teller (@Swizec)
  <a
    href="https://twitter.com/Swizec/status/1298308910072307713?ref_src=twsrc%5Etfw"
    >August 25, 2020</a
  >
</blockquote>
<script
  async
  src="https://platform.twitter.com/widgets.js"
  charset="utf-8"
></script>
Into Markdown that's a link:
https://twitter.com/Swizec/status/1298308910072307713
Your site generator can take this and turn it into an embed. When it starts as a blockquote, you'll have trouble.
The script does this for you. If you find an unsupported service, PRs welcome :)
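For Twitter, the conversion can be sketched with a couple of regexes. This is an illustration under assumptions, not the script's real plugin: `tweetEmbedsToLinks` is a hypothetical name, and it only handles the standard embed markup shown above.

```javascript
// Hypothetical helper: replace a Twitter embed (blockquote plus optional
// widgets.js script tag) with the bare status URL on its own line.
function tweetEmbedsToLinks(html) {
  const embed =
    /<blockquote class="twitter-tweet"[^>]*>[\s\S]*?<\/blockquote>(\s*<script[^>]*platform\.twitter\.com\/widgets\.js[^>]*>\s*<\/script>)?/g;
  return html.replace(embed, (block) => {
    // The status link is the canonical tweet URL; the query string is dropped.
    const status = block.match(/https:\/\/twitter\.com\/\w+\/status\/\d+/);
    // If no status URL is found, keep the original markup.
    return status ? `\n\n${status[0]}\n\n` : block;
  });
}
```

A plain link on its own line is what most Markdown site generators look for when they build embeds.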
Shortcodes are a semi-standard system of snippets. Denoted by [], they give CMS users the ability to go beyond writing text.
These were popular on internet forums of the late 2000s. Wordpress supports them to this day. Don't know about others.
I wanted to get rid of most and preserve any embeds.
You can identify an embed because it's a closed shortcode prefixed with the name of a service followed by a link.
[tweet https://twitter.com/Swizec/status/1298308910072307713]
The gnarly ones are Wordpress's almost-HTML shortcodes. The big issue on my site was the [caption][/caption] shortcodes.
We convert all that to standard markdown. ✌️
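A sketch of both conversions, assuming regexes are good enough for your content. `convertShortcodes` is a hypothetical name, the service list beyond `tweet` is an assumption, and turning a caption into a Markdown image with the caption as alt text is one possible choice, not necessarily what the script outputs.

```javascript
// Hypothetical helper: embed shortcodes become bare links,
// [caption] shortcodes become Markdown images.
function convertShortcodes(text) {
  return text
    // [tweet https://...] -> the link on its own line.
    // Only "tweet" comes from the example above; the other names are assumed.
    .replace(/\[(tweet|youtube|vimeo)\s+(https?:\/\/[^\s\]]+)\s*\]/g, "\n\n$2\n\n")
    // [caption ...]<img src="..." /> Caption text[/caption] -> ![Caption text](src)
    .replace(/\[caption[^\]]*\]([\s\S]*?)\[\/caption\]/g, (_, inner) => {
      const img = inner.match(/<img[^>]*\ssrc="([^"]+)"[^>]*>/);
      if (!img) return inner.trim(); // no image found: keep the inner content
      const caption = inner.replace(/<[^>]+>/g, "").trim(); // strip tags, keep the text
      return `![${caption}](${img[1]})`;
    });
}

console.log(convertShortcodes("[tweet https://twitter.com/Swizec/status/1298308910072307713]"));
```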
This one was frustrating. Embed links can include underscores, like when you embed a tweet from @_developit.
Markdown stringification escapes underscores because it thinks they're emphasis. It doesn't understand that some text nodes are really links, even though they aren't link nodes.
https://twitter.com/_developit/status/1300154097170083842
That breaks your embed machinery so we fix it. 🤪
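One way to sketch the fix is a post-processing pass over the stringified Markdown. `unescapeUrlUnderscores` is a hypothetical helper, and the URL pattern is a simplification; the script's real fix may live elsewhere in the pipeline.

```javascript
// Hypothetical helper: undo "\_" escapes, but only inside bare URLs,
// so escaped underscores in regular prose stay escaped.
function unescapeUrlUnderscores(markdown) {
  return markdown.replace(/https?:\/\/[^\s)>\]]+/g, (url) => url.replace(/\\_/g, "_"));
}

console.log(unescapeUrlUnderscores("https://twitter.com/\\_developit/status/1300154097170083842"));
```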
The MIT License (MIT)
Copyright (c) 2013 Jason Young
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.