Extract the main article, main image, and metadata from a given URL.
View screenshots for more info.
```bash
npm install article-parser
```
Then:
```js
const { extract } = require('article-parser');

const url = 'https://goo.gl/MV8Tkh';

extract(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});
```
Since v4, article-parser focuses only on its main mission: extracting the main readable content from a given webpage, such as a blog post or news entry. Although it can still handle other kinds of content, such as YouTube videos or SoundCloud media, those are just additions.
extract() extracts data from the specified URL or from full HTML page content. It returns a Promise.
Here is how we can use article-parser:
```js
import { extract } from 'article-parser';

const getArticle = async (url) => {
  try {
    const article = await extract(url);
    return article;
  } catch (err) {
    console.trace(err);
  }
};
```
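As noted above, extract() also accepts full HTML page content instead of a URL. A minimal sketch; the HTML string here is just an illustration:

```js
import { extract } from 'article-parser';

// a hypothetical HTML document fetched or generated elsewhere
const html = `
  <html>
    <head><title>Sample post</title></head>
    <body><article><p>Hello world...</p></article></body>
  </html>
`;

extract(html).then((article) => {
  console.log(article);
}).catch((err) => {
  console.trace(err);
});
```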
Compared to v3, the structure of the article object has changed as well. It now looks like this:
```js
{
  "url": URI String,
  "title": String,
  "description": String,
  "image": URI String,
  "author": String,
  "content": HTML String,
  "published": Date String,
  "source": String, // original publisher
  "links": Array, // list of alternative links
  "ttr": Number, // time to read in seconds, 0 = unknown
}
```
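Once extraction succeeds, these fields can be read directly. A minimal sketch reusing the getArticle helper defined above (assumed to run inside an async function):

```js
// inside an async function
const article = await getArticle('https://goo.gl/MV8Tkh');
if (article) {
  console.log(article.title);
  console.log(article.source);                  // original publisher
  console.log(`~${article.ttr} seconds to read`);
}
```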
In addition, this lib provides several methods to customize the default settings. Don't touch them unless you have a reason to do so.
- setParserOptions(Object parserOptions)
- getParserOptions()
- setNodeFetchOptions(Object nodeFetchOptions)
- getNodeFetchOptions()
- setSanitizeHtmlOptions(Object sanitizeHtmlOptions)
- getSanitizeHtmlOptions()
Here are the default parser options:
```js
{
  wordsPerMinute: 300,
  urlsCompareAlgorithm: 'levenshtein',
}
```
Read the string-comparison docs for more info about urlsCompareAlgorithm.
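For example, to assume a slower reading speed when estimating ttr, the defaults could be overridden with setParserOptions. A minimal sketch; the value 250 is just an illustration:

```js
const { setParserOptions } = require('article-parser');

// assume 250 words per minute when computing the "ttr" field
setParserOptions({
  wordsPerMinute: 250,
  urlsCompareAlgorithm: 'levenshtein',
});
```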
And here are the default node-fetch options:

```js
{
  headers: {
    'user-agent': 'article-parser/4.0.0',
  },
  timeout: 30000,
  redirect: 'follow',
  compress: true,
  agent: false,
}
```
Read node-fetch docs for more info.
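For example, to identify requests with a custom User-Agent and fail faster on slow hosts, the defaults could be overridden with setNodeFetchOptions. A minimal sketch; the header and timeout values are just illustrations:

```js
const { setNodeFetchOptions } = require('article-parser');

setNodeFetchOptions({
  headers: {
    'user-agent': 'my-crawler/1.0',
  },
  timeout: 10000,     // give up after 10 seconds
  redirect: 'follow',
  compress: true,
  agent: false,
});
```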
And the default sanitize-html options:

```js
{
  allowedTags: [
    'h1', 'h2', 'h3', 'h4', 'h5',
    'u', 'b', 'i', 'em', 'strong',
    'div', 'span', 'p', 'article', 'blockquote', 'section',
    'pre', 'code',
    'ul', 'ol', 'li', 'dd', 'dl',
    'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfoot',
    'label',
    'fieldset', 'legend',
    'img', 'picture',
    'br', 'p', 'hr',
    'a',
  ],
  allowedAttributes: {
    a: ['href'],
    img: ['src', 'alt'],
  },
}
```
Read sanitize-html docs for more info.
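For example, to also keep h6 headings and the title attribute on images in the extracted content, the defaults could be overridden with setSanitizeHtmlOptions. A minimal sketch; the tag and attribute choices here are just illustrations, not the library's defaults:

```js
const { setSanitizeHtmlOptions } = require('article-parser');

setSanitizeHtmlOptions({
  allowedTags: [
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
    'p', 'article', 'blockquote', 'section',
    'pre', 'code',
    'ul', 'ol', 'li',
    'img', 'picture',
    'br', 'hr',
    'a',
  ],
  allowedAttributes: {
    a: ['href'],
    img: ['src', 'alt', 'title'],
  },
});
```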
- Article Parser demo:
- Example FaaS with Google Cloud Function
```bash
git clone https://github.com/ndaidong/article-parser.git
cd article-parser
npm install
npm test
```
The MIT License (MIT)