Newspaper3kli stands for the "kommand-line" interface over Newspaper3k.
It is a thin layer on top of Newspaper3k that supports Unix-style usage (URLs as arguments or piped input) and parallel downloads (using asyncio), so that large batches of articles download faster.
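To illustrate how asyncio can speed up a batch of downloads, here is a minimal sketch of the general pattern. It is not taken from Newspaper3kli's source, and the project's own implementation may differ. The names fetch_article, fetch_all and limit are hypothetical, and the fetch is a stub so the example runs offline. It assumes Python 3.9+ for asyncio.to_thread.

```python
import asyncio


def fetch_article(url: str) -> dict:
    """Hypothetical stand-in for a blocking download-and-parse step."""
    # A real version might call newspaper.Article(url).download() and .parse().
    return {"url": url, "title": f"stub title for {url}"}


async def fetch_all(urls: list[str], limit: int = 5) -> list[dict]:
    """Run the blocking fetches in worker threads, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> dict:
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs
            return await asyncio.to_thread(fetch_article, url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


results = asyncio.run(fetch_all([
    "https://hello.world/article/2020",
    "https://hello.world/article/2019",
]))
print([r["url"] for r in results])
```

The semaphore caps how many downloads run at once, which is usually kinder to the target site than firing every request simultaneously.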
In addition to the requirements, make sure you have nltk's punkt package installed (via nltk.download() in interactive Python) for Newspaper3k's article.nlp() to work properly.
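As a non-interactive alternative, the check and download could be scripted roughly as follows. This is a sketch assuming nltk is already installed; ensure_punkt is a hypothetical helper name, not part of Newspaper3kli or Newspaper3k.

```python
def ensure_punkt() -> None:
    """Download NLTK's punkt tokenizer data only if it is not already present."""
    import nltk  # imported lazily so defining the helper does not require nltk

    try:
        # Raises LookupError when the resource is missing from nltk's data path
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")


# Call once before relying on article.nlp():
# ensure_punkt()
```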
# assuming your OS has pip3 as default
pip install newspaper3kli==0.1.0
Overview of available parameters
usage: newspaper3kli [-h] [-o OUTPUT] [-u] [--keep-html] [urls [urls ...]]

positional arguments:
  urls                  URL to download content from (single download)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output path to store the results
  -u, --disable-verify-ssl
                        Flag to disable SSL certificate verification.
  --keep-html           Flag to save content with HTML..
newspaper3kli https://hello.world/article/2020 \
https://hello.world/article/2019
TXT is the simplest file format for reading with Newspaper3kli.
Assuming the TXT file has the following content (one URL per line):
https://hello.world/article/2020
https://hello.world/article/2019
cat /path/to/this/file.txt | newspaper3kli
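How Newspaper3kli treats blank lines or repeated URLs in piped input is not stated here, so it may be worth tidying the list first. The sketch below shows one way to do that; clean_urls is a hypothetical helper, and the sample list stands in for the lines of a real file.

```python
from typing import Iterable, List


def clean_urls(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blank lines, and remove duplicates (order kept)."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Sample input standing in for the lines of a TXT file
sample = [
    "https://hello.world/article/2020\n",
    "\n",
    "  https://hello.world/article/2019 \n",
    "https://hello.world/article/2020\n",
]
print("\n".join(clean_urls(sample)))
```

Passing sys.stdin instead of the sample list would let a script like this sit between cat and newspaper3kli in the pipeline.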
CSV parsing depends on a tool like awk or cut to split the columns.
Content sample
url,tags,date
https://hello.world/article/2020,some|thing,2020-01-01T00:00:00
https://hello.world/article/2019,some|thing,2019-01-01T00:00:00
Processing
# $1 is the column holding the URLs (change it to match your file); NR > 1 skips the header row
cat /path/to/this/file.csv | awk -F, 'NR > 1 { print $1 }' | newspaper3kli
For any other character-delimited content, simply change -F, (comma) to the desired delimiter, e.g. -F'\t' for TSV.
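Splitting on a bare delimiter breaks when a field is quoted and contains that delimiter itself. In that case, a small script using Python's csv module may be more robust. The following is a sketch under that assumption; urls_from_csv is a hypothetical helper, and the sample reuses the content above with one quoted tags field.

```python
import csv
import io
from typing import Iterable, Iterator


def urls_from_csv(rows: Iterable[str], column: str = "url") -> Iterator[str]:
    """Yield the non-empty values of `column` from CSV text with a header row."""
    for record in csv.DictReader(rows):
        value = (record.get(column) or "").strip()
        if value:
            yield value


# In-memory sample; the first data row has a quoted field containing a comma
sample = io.StringIO(
    "url,tags,date\n"
    'https://hello.world/article/2020,"some,thing",2020-01-01T00:00:00\n'
    "https://hello.world/article/2019,some|thing,2019-01-01T00:00:00\n"
)
urls = list(urls_from_csv(sample))
print("\n".join(urls))
```

Reading from a file opened with open(path, newline="") and printing each URL to stdout would let the result be piped into newspaper3kli in the same way as the awk command.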
When no path is specified through the --output parameter, the default path is the output directory inside Newspaper3kli's installation directory.
Files are named after each article and are stored in pairs:
- JSON for metadata;
- HTML for content.
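The stored pairs can be read back for further processing. The sketch below assumes each pair shares a base name and uses the .json and .html extensions, which is a guess based on the description above. load_pairs, the file names and the title field are all made up for the demo.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def load_pairs(output_dir: Path) -> Dict[str, Tuple[dict, str]]:
    """Map each article name to (metadata, html) for every complete pair."""
    pairs = {}
    for json_path in sorted(output_dir.glob("*.json")):
        html_path = json_path.with_suffix(".html")
        if not html_path.exists():
            continue  # skip metadata files without a matching content file
        metadata = json.loads(json_path.read_text(encoding="utf-8"))
        pairs[json_path.stem] = (metadata, html_path.read_text(encoding="utf-8"))
    return pairs


# Demo on a throwaway directory with made-up file names and fields
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "hello-world.json").write_text(json.dumps({"title": "Hello"}), encoding="utf-8")
    (out / "hello-world.html").write_text("<p>Hello</p>", encoding="utf-8")
    (out / "orphan.json").write_text("{}", encoding="utf-8")
    pairs = load_pairs(out)

print(sorted(pairs))
```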
Thanks to dsynkov for the work on newspaper-bulk, the source of inspiration and of some code for this project.