A wrapper over the newspaper3k library that adds support for Nepali news sites.
- Works for the majority of Nepali news sites.
- The list of tested sites (50+) is given in sites.json.
- Currently only the title and text fields are guaranteed to have data.
- Extraction of images and dates is also supported on most sites; check that it works on the sites you need before relying on it.
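One way to check which fields a given site actually populates is to inspect the parsed article before relying on it. The sketch below is only illustrative: `populated_fields` is a hypothetical helper, not part of this package, and it assumes the wrapper's Article exposes `top_image` and `publish_date` the same way newspaper3k does. A stand-in object is used here so the snippet runs without network access.

```python
from types import SimpleNamespace


def populated_fields(article, fields=("title", "text", "top_image", "publish_date")):
    """Map each field name to True if the parsed article has a non-empty value.

    Hypothetical helper: attribute names follow newspaper3k's Article and are
    assumed (not verified) to be exposed unchanged by the wrapper.
    """
    return {name: bool(getattr(article, name, None)) for name in fields}


# Stand-in for an article that has already been downloaded and parsed.
sample = SimpleNamespace(
    title="Example headline",
    text="Example body text",
    top_image="",        # no image extracted
    publish_date=None,   # no date extracted
)

report = populated_fields(sample)
print(report)
```

With a real article, pass the object returned by Article(...) after calling download() and parse().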
If you run into trouble during installation while following the steps below, visit newspaper3k for detailed usage and installation help.
$ git clone https://github.com/pykancha/newspaper3k_wrapper.git
$ pip install -r requirements.txt
$ python setup.py install
$ python download_corpora.py
Once you have run the python download_corpora.py command on your machine, you can use:
python -m pip install git+https://github.com/pykancha/newspaper3k_wrapper.git#egg=newspaper_wrapper
to install it as a regular package without cloning the repo into your folder.
You can get the download_corpora.py file without cloning the repo through:
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py -o download_corpora.py
In your requirements.txt file, add:
-e git+https://github.com/pykancha/newspaper3k_wrapper.git#egg=newspaper_wrapper
With Poetry, use the command:
poetry add git+https://github.com/pykancha/newspaper3k_wrapper.git
Alternatively, add the following to the dependencies section of your pyproject.toml file:
newspaper3k_wrapper = { git = "https://github.com/pykancha/newspaper3k_wrapper.git" }
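After any of the installation routes above, a quick way to confirm that the package can be imported is to look it up by module name. This is a minimal sketch: `is_installed` is a hypothetical helper, and the module name `newspaper_wrapper` is taken from the usage example in this README.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found on sys.path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once one of the installation routes has succeeded.
print("newspaper_wrapper importable:", is_installed("newspaper_wrapper"))
```

This only checks that the module can be located; it does not verify that the corpora from download_corpora.py are present.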
>>> from newspaper_wrapper import Article
>>> url = 'https://www.himalkhabar.com/news/113640'
>>> article = Article(url, language='hi')
>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.nlp()
>>> print(article.title, article.text)
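When processing many URLs, the download and parse steps above can be wrapped in a small function. The sketch below is one possible shape, not part of the package: `extract` and its `article_cls` parameter are hypothetical, and the parameter exists only so the function can be exercised with a stand-in class. It returns only the title and text, since those are the fields this README says are guaranteed.

```python
def extract(url, article_cls=None, language="hi"):
    """Download and parse one article, returning its title and text as a dict.

    Hypothetical helper: `article_cls` defaults to the wrapper's Article and
    can be swapped for a stand-in with the same constructor and methods.
    """
    if article_cls is None:
        # Imported lazily so the function can be defined without the package.
        from newspaper_wrapper import Article
        article_cls = Article

    article = article_cls(url, language=language)
    article.download()  # fetch the page HTML
    article.parse()     # extract title, text and other fields
    return {"url": url, "title": article.title, "text": article.text}


# Example call (requires network access and the package to be installed):
# result = extract("https://www.himalkhabar.com/news/113640")
# print(result["title"])
```

The nlp() step is left out here because it is not needed to obtain the title and text.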
Refer to: Docs - Adding new languages
$ git clone https://github.com/pykancha/newspaper3k_wrapper
$ cd newspaper3k_wrapper
$ python -m pip install -e .
$ python download_corpora.py
Make your changes, then run the tests:
$ python tests/unit_tests.py
$ python tests/unit_tests.py fulltext