With the module installed, log in to the CMS and you'll see a new ModelAdmin menu item labelled "Migration". Select it and you'll see a dropdown menu and a 'Create' button. Choose the 'Static Site Content Source' option in the dropdown and select 'Create'; 'New Connector' will appear in the list of Connectors. Click it to open.
Fill in the fields, referring to the field descriptions for context. The "Name" field is purely for your own reference. The "Base URL" field should take the form https://www.example.org. If the site you wish to import is a MOSS (Microsoft Office SharePoint Server) site with /Pages/some-page.aspx-styled URLs, select 'MOSS-style URLs' under "URL Processing", then select the 'Save' button.
Go to the "Crawl" tab and select "Crawl Site". Leave it running; it will take some time to complete, depending on the number of pages and assets in the site being crawled.
- If you reopen the Connector admin in a different browser (so that it has a different session cookie), you can see the current status of the crawling process.
- Ensure you have your browser's developer tools open before you start the crawl. Occasionally the crawl will die for unknown reasons, usually when your localhost/container has run out of resources, and your browser's logs will help with debugging. Note also that an import/extraction log with more information is available:
```
tail -f /tmp/exodus.log
```
- If the host you're running the module on is behind a proxy, adapt the following in `app/_config/config.yml`:

```yml
PhpTek\Exodus\Extract\StaticSiteContentExtractor:
  curl_opts_proxy:
    hostname: my-gateway.co.nz
    port: 1234
```
Once the crawling is complete (a message will show in the CMS UI), you'll see all the URLs laid out underneath the connector. The URL structure (i.e. where the slashes are) is used to build a hierarchy of URLs. (Note: at present no hierarchy is displayed in the central CMS pane, just a flat bullet list.)
Now it's time to configure one or more schemas, which determine how each of the crawled URIs is transformed into a native Silverstripe object. For "Page" content, we will configure some CSS selectors (used to query the source website's DOM) for the import.
Go to the "Main" tab and in the "Parent Page To Import Into" field, select a page. Sometimes it's helpful to create an "imported content" parent page in the Pages section of the CMS first. Then in the "Parent Folder To Import Into" field, select a parent folder to import files and images into e.g. "Uploads".
Now select the "Start Importing" button (see Protip #2, above). This may also take some time and doesn't have robust resume functionality. However, error handling has improved since the "staticsiteconnector" days.
Next, select the "Add Schema" button and fill out the fields as follows for page content (files and images need to be configured similarly, but of course have no CSS rules to configure):
- Priority:
1
- URL Pattern:
.*
- Data Type:
Page
- Mime-Types:
text/html
Now select the "Add Rule" button and immediately select "Save" - this allows you to select from the "Field Name" dropdown menu.
- Specify a field to import into, usually "Title" or "Content"
- Specify a CSS selector, e.g. `body#content h1` or `body#content section.main` respectively (this depends on the CSS classes used in the source website)
- If you have different CSS selectors for different pages, create multiple Import Rules; just note that the first rule that actually returns content will be used.
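As a rough illustration of what a selector such as `body#content h1` matches, here is a minimal Python sketch using only the standard library (the sample markup is invented; the module itself queries the DOM in PHP):

```python
import xml.etree.ElementTree as ET

# A trimmed, invented sample of a crawled page.
html = (
    '<html><body id="content">'
    '<h1>Welcome to Example.org</h1>'
    '<section class="main"><p>Body copy here.</p></section>'
    '</body></html>'
)

root = ET.fromstring(html)

# "body#content h1" -> an <h1> inside <body id="content">
body = root.find(".//body[@id='content']")
title = body.find(".//h1").text

# "body#content section.main" -> the main section in the same body
content = body.find(".//section[@class='main']/p").text

print(title)    # Welcome to Example.org
print(content)  # Body copy here.
```

Here `title` would land in `SiteTree.Title` and `content` in `SiteTree.Content`, per the rules described below.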
Open sample pages in the CMS' site tree for review. Here you can assess whether or not the Import Rules have worked as expected. If they haven't, debug them, modify your import rules and schema, then retry.
Using simple CSS selectors you can control which parts of a source page are mapped to a particular field within the SiteTree class.
That's it! There are quite a few steps but it's easier than copy & pasting all those pages!
Schema is the name given to a collection of rules that instruct the module how a crawled website's markup should be formatted and stored as/in Silverstripe objects during an import.
Each rule in a schema is predicated on a CSS selector, which defines the exact DOM fragment on a given page of the crawled site to transform, and the respective DataObject-subclass field within Silverstripe where the transformed content should be stored.
The schema field "URLs Applied To" is where you define regular expressions to match URLs from the legacy site to the imported DataTypes in the new site. Each pattern is matched against the absolute URLs crawled from the legacy site, so you'll need to include the protocol and domain in your URL patterns to make them absolute as well, e.g. "https://www.legacysite.com/news/*".
The actual regular expression match is located in `src/Model/StaticSiteContentSource.php`, in the function `schemaCanParseURL()`:

```php
if (preg_match("|^$appliesTo|", $url) == 1) {
    ...
}
```
- For help with regular expressions, use an online service like Rubular.com or PHPLiveRegex.com, both of which allow "live" regular expression editing.
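As a sanity check for your patterns, the matching logic above can be mirrored in a few lines of Python (the URLs and patterns here are illustrative):

```python
import re

def schema_can_parse_url(applies_to: str, url: str) -> bool:
    """Mirrors preg_match("|^$appliesTo|", $url): the pattern is
    anchored to the start of the URL but not to the end."""
    return re.match(applies_to, url) is not None

# A news-specific pattern only matches under /news/ ...
news = r"https://www\.legacysite\.com/news/.*"
print(schema_can_parse_url(news, "https://www.legacysite.com/news/2019/story"))  # True
print(schema_can_parse_url(news, "https://www.legacysite.com/about-us"))         # False

# ... while the catch-all pattern matches everything.
print(schema_can_parse_url(".*", "https://www.legacysite.com/about-us"))         # True
```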
The priority order of your schemas is important: the "Applies To" URL patterns are matched against the imported URLs in priority order until the first matching schema is found. This means you need to order your schemas with the most specific patterns first (e.g. `CustomNewsPage`, `NewsPage`, `NewsHolder` etc.), then gradually filter down the priority order to the default catch-all patterns for `Page`, `Image` and `File`.
The default catch-all patterns are:
+-----------------+-----------+--------------------------------------+
| Url Applies To | Data Type | Mime-types |
+-----------------+-----------+--------------------------------------+
| .* | Page | text/html |
| .* | Image | image/(png|jpe?g|gif) |
| .* | File | application/(vnd.*|msword|pdf|xml) |
+-----------------+-----------+--------------------------------------+
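Putting priority order, URL patterns and mime-types together, schema selection behaves roughly like this sketch (the field names and example schemas are abbreviations for illustration, not the module's actual data structures):

```python
import re

# Schemas listed most-specific first; the catch-alls sit last.
schemas = [
    {"applies_to": r".*/news/.*", "data_type": "NewsPage", "mime": r"text/html"},
    {"applies_to": r".*",         "data_type": "Page",     "mime": r"text/html"},
    {"applies_to": r".*",         "data_type": "Image",    "mime": r"image/(png|jpe?g|gif)"},
]

def find_schema(url, mime):
    """Return the first schema whose URL pattern and mime-type both match."""
    for schema in schemas:  # already in priority order
        if re.match(schema["applies_to"], url) and re.fullmatch(schema["mime"], mime):
            return schema
    return None

print(find_schema("https://example.org/news/story", "text/html")["data_type"])  # NewsPage
print(find_schema("https://example.org/about", "text/html")["data_type"])       # Page
print(find_schema("https://example.org/logo.png", "image/png")["data_type"])    # Image
```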
Notes:
The example below is based on your import using a subclass of `SiteTree`. You should also configure a separate Import Rule for importing files and images.

This rule takes the content of the crawled site's <h1> element and imports it into the `SiteTree.Title` field, which forms your imported page's <title> element.
- Field Name:
Title
- CSS Selector:
h1
- Exclude CSS Selector: [Optional]
- Element attribute: [Optional]
- Convert to plain text: Check this to remove all markup found in the crawled site
- Schema: Select "Page" or your custom Silverstripe `DataObject` subclass to import content into
This rule takes the content of the crawled site's <h1> element and imports it into the `SiteTree.MenuTitle` field. This is used in the CMS' SiteTree list.
- Field Name:
MenuTitle
- CSS Selector:
h1
- Exclude CSS Selector: [Optional]
- Element attribute: [Optional]
- Convert to plain text: Check this box to remove all markup found in the crawled site
- Schema: Select "Page" or your custom Silverstripe `DataObject` subclass to import content into
This rule takes the crawled site's main body content (excluding any <h1> elements); in this example we pretend it's all wrapped in a div#content element. This then forms the content used in the `SiteTree.Content` field.
- Field Name:
Content
- CSS Selector:
div#content
- Exclude CSS Selector:
h1
- Element attribute: [Optional]
- Convert to plain text: Leave this unchecked; you'll probably want to keep all the crawled site's markup, as it's being imported into an HTMLText field type, eventually editable in the CMS via the WYSIWYG editor
- Schema: Select "Page" or your custom Silverstripe `DataObject` subclass to import content into
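The Exclude CSS Selector effectively subtracts matching nodes from the selected fragment before the import. As a rough Python illustration of selecting `div#content` while excluding `h1` (the sample markup is invented):

```python
import xml.etree.ElementTree as ET

html = (
    '<html><body><div id="content">'
    '<h1>Page Title</h1><p>First paragraph.</p><p>Second paragraph.</p>'
    '</div></body></html>'
)

root = ET.fromstring(html)
content = root.find(".//div[@id='content']")

# Drop every node matched by the exclude selector ("h1") ...
for h1 in content.findall("h1"):
    content.remove(h1)

# ... then serialise what remains as the Content field value.
fragment = "".join(ET.tostring(child, encoding="unicode") for child in content)
print(fragment)  # <p>First paragraph.</p><p>Second paragraph.</p>
```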
This rule collects the contents of a crawled page's <meta> (description) element and imports it into the `SiteTree.MetaDescription` field. You can obviously adapt this to suit other <meta> elements you wish to import.
- Field Name:
MetaDescription
- CSS Selector:
meta[name=description]
- Exclude CSS Selector:
- Element attribute:
value
- Convert to plain text: Check this box to remove all markup found in the crawled site (v.unlikely!)
- Schema: Select "Page" or your custom Silverstripe `DataObject` subclass to import content into
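When the Element attribute field is set, the rule reads an attribute's value rather than the element's text content. A minimal Python sketch of what `meta[name=description]` selects (the sample markup is invented; in the HTML itself the description text lives in the meta element's `content` attribute):

```python
import xml.etree.ElementTree as ET

html = (
    '<html><head>'
    '<meta name="description" content="A short summary of the page." />'
    '</head><body /></html>'
)

root = ET.fromstring(html)

# meta[name=description] -> the matching <meta> element ...
meta = root.find(".//meta[@name='description']")

# ... from which the rule reads an attribute instead of text content.
description = meta.get("content")
print(description)  # A short summary of the page.
```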
After the import has completed, the content may* still contain URLs and asset paths that reference static URLs from the imported, legacy site.

*If you selected "Automatically run link-rewrite task", all links will have been automatically converted for you; see below for more information.
There is a built-in `BuildTask` subclass which will modify URLs in imported content, replacing the `src` and `href` attributes of links, images and files with Silverstripe CMS shortcodes that reference the imported assets and page IDs.
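Conceptually, the task swaps crawled URLs for Silverstripe shortcodes. A much-simplified Python sketch (the URL-to-ID map and markup are invented; the real task also handles images, files and many edge cases):

```python
import re

# Invented mapping from crawled legacy URLs to imported page IDs.
url_to_id = {
    "https://www.legacysite.com/about-us": 12,
    "https://www.legacysite.com/news/story": 34,
}

content = (
    '<p>See <a href="https://www.legacysite.com/about-us">about us</a> and '
    '<a href="https://www.legacysite.com/news/story">the story</a>.</p>'
)

def rewrite(match):
    page_id = url_to_id.get(match.group(1))
    if page_id is None:
        return match.group(0)  # leave unknown links untouched
    # Silverstripe's shortcode for internal page links
    return 'href="[sitetree_link,id=%d]"' % page_id

rewritten = re.sub(r'href="([^"]+)"', rewrite, content)
print(rewritten)
```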
If you check the "Automatically run link-rewrite task" box under the "Main" tab in the CMS' UI, this will happen seamlessly for you. This should probably just be the default...
See: `src/Task/StaticSiteRewriteLinksTask.php`
For hints on usage, run the task from the command-line without any arguments.
If enabled in the external-content "Import" section, this task can be run automatically once the import itself has completed. This is useful for Silverstripe setups where you may not have shell access to the server to run `BuildTask`s.
There is a comprehensive CMS report, "Imported links rewrite report", which is available after each import. You can use its data to analyse your imports and rewritten links, helping you tweak your crawl and import rules and pinpoint exactly what's failing and how you might fix it manually if need be.