Releases: fmacpro/horseman-article-parser
0.9.0
- Allows passing rules for returning an article's title & content. This is useful when
the parser is unable to return the desired title or content, e.g.
rules: [
  {
    host: 'www.bbc.co.uk',
    content: () => {
      var j = window.$
      j('article section, article figure, article header').remove()
      return j('article').html()
    }
  },
  {
    host: 'www.youtube.com',
    title: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text
    },
    content: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[1].videoSecondaryInfoRenderer.description.runs[0].text
    }
  }
]
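As a rough illustration of how host-keyed rules like these could be applied, the sketch below picks the rule whose host matches a URL's hostname. The helper name findRule, the exact-match behaviour, and the stubbed title/content functions are assumptions made for illustration, not the parser's actual internals (the real rule functions run in the page and read from window).

```javascript
// Hypothetical helper: not part of horseman-article-parser's API.
// Picks the first rule whose `host` equals the hostname of the given URL.
function findRule (rules, url) {
  const hostname = new URL(url).hostname
  return rules.find(rule => rule.host === hostname) || null
}

// Stubbed rules: real rules would read from the page via `window`.
const rules = [
  { host: 'www.bbc.co.uk', content: () => 'bbc content' },
  { host: 'www.youtube.com', title: () => 'video title', content: () => 'video description' }
]

const rule = findRule(rules, 'https://www.youtube.com/watch?v=abc123')
// A rule may define `title`, `content`, or both, so check before calling.
const title = rule && rule.title ? rule.title() : null
console.log(title)
```

Because each rule may define only one of title or content, a caller would need to fall back to the default extraction for whichever is missing.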
0.8.54
0.8.53
0.8.52
- sidebar keyword removed from unlikely candidates regex & handled unexpected redirects ( fixes #47 )
- article body identification rules (regexes) moved to options
- exposed original html of document on response object ( #48 )
- dependency security updates
- amended the default puppeteer.goto waitUntil option to be networkidle2 rather than domcontentloaded
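If waiting for networkidle2 is too slow for a given site, the earlier behaviour can presumably be restored through the options. The fragment below is a sketch that assumes the goto options are nested under puppeteer.goto in the options object, as the wording of the note suggests; the URL is a placeholder, and the exact shape should be checked against the project README.

```javascript
// Assumption: goto options live under `puppeteer.goto` in the options object.
const options = {
  url: 'https://www.example.com/article', // placeholder URL
  puppeteer: {
    goto: {
      waitUntil: 'domcontentloaded' // the default before 0.8.52
    }
  }
}
```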
0.8.51
0.8.5
- Allow compromise plugins to be passed in
- Update docs
Compromise is the natural language processor that allows horseman-article-parser to return topics, e.g. people, places & organisations. You can now pass custom plugins to compromise to modify or add to the word lists, like so:
/** add some names */
let testPlugin = function(Doc, world) {
  world.addWords({
    'rishi': 'FirstName',
    'sunak': 'LastName',
  })
}
const options = {
  url: 'https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies',
  enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords'],
  nlp: {
    plugins: [testPlugin]
  }
}
This allows us to match - for example - names which are not in the base compromise word lists.
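To make the plugin contract concrete, the sketch below calls a plugin of the same shape against a stand-in world object that only records the words it is given. The fakeWorld stub and the name namesPlugin are assumptions for illustration; the real compromise world object does far more than this.

```javascript
// Stand-in for compromise's `world` object: it only records added words.
// This stub is an assumption for illustration, not compromise's real API surface.
const fakeWorld = {
  words: {},
  addWords (newWords) {
    Object.assign(this.words, newWords)
  }
}

// Same shape as the plugin above: a function taking (Doc, world).
const namesPlugin = function (Doc, world) {
  world.addWords({
    rishi: 'FirstName',
    sunak: 'LastName'
  })
}

// Invoke the plugin directly, passing the stub in place of the real world.
namesPlugin(null, fakeWorld)
console.log(fakeWorld.words)
```

Testing a plugin against a stub like this is a cheap way to check the word list before passing it to the parser.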
0.8.4
0.8.3
- Refactor title processing
Title processing is now off by default and can be turned on. It can also be configured, as below:
var options = {
  title: {
    useBestTitlePart: true, // true turns on the title processing
    commonSeparatingCharacters: [' | ', ' _ ', ' - ', '«', '»', ' — ', ' — ', ' – '],
    minimumTitlePartLength: 10
  }
}
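One plausible reading of these options is sketched below: split the title on each separator, discard parts shorter than the minimum length, and keep the longest remaining part. The function bestTitlePart and this selection strategy are assumptions for illustration, not the parser's actual implementation, and the example title is made up.

```javascript
// Hypothetical sketch: not the parser's actual implementation.
// Splits a title on the configured separators and returns the longest part
// that meets the minimum length, or the original title if none does.
function bestTitlePart (title, options) {
  if (!options.useBestTitlePart) return title

  let parts = [title]
  for (const separator of options.commonSeparatingCharacters) {
    parts = parts.flatMap(part => part.split(separator))
  }

  const candidates = parts
    .map(part => part.trim())
    .filter(part => part.length >= options.minimumTitlePartLength)

  if (candidates.length === 0) return title
  return candidates.reduce((best, part) => (part.length > best.length ? part : best))
}

const titleOptions = {
  useBestTitlePart: true,
  commonSeparatingCharacters: [' | ', ' - ', ' – '],
  minimumTitlePartLength: 10
}

console.log(bestTitlePart('Budget 2020: what it means for you | Example News', titleOptions))
```

Falling back to the full title when no part is long enough avoids returning a fragment such as a bare site name.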