topic-crawler

Automated topic discovery crawler

How It Works

1. Given a starting topic, the crawler downloads the Wikipedia article for that topic.
2. The HTML is parsed with BeautifulSoup4 and all internal links are identified.
3. Each link to another Wikipedia article is then explored, returning to step 1.
4. This continues until the max crawl distance is reached, which keeps the crawler from exploring associations indefinitely.
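
The link-identification step can be sketched roughly as follows. This is not the code from `extract_outbound_links.py`: the project uses BeautifulSoup4, while this sketch swaps in the standard library's `html.parser` so it runs without extra dependencies. The names `WikiLinkParser` and `extract_article_links` are hypothetical, and the rule that article links start with `/wiki/` and contain no colon is an assumption about Wikipedia's URL layout.

```python
from html.parser import HTMLParser


class WikiLinkParser(HTMLParser):
    """Collects link targets that look like internal Wikipedia articles."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: article links start with /wiki/, and namespace pages
        # (File:, Help:, Category:, ...) contain a colon.
        if href.startswith("/wiki/") and ":" not in href:
            # Drop any #section fragment so each article is listed once.
            title = href[len("/wiki/"):].split("#")[0]
            if title and title not in self.links:
                self.links.append(title)


def extract_article_links(html):
    """Return article titles linked from the given HTML, in page order."""
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links
```

Keeping the links in page order matters if a per-node limit later takes only the first few of them.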

This process creates a tree structure, where the starting topic is the root node. Related topics (identified through page links) become child nodes, and topics related to those nodes become nodes one level further down.

The height of the tree is limited by a setting called max crawl distance, while the number of branches per node is limited by another setting called crawl limit per node.
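
To make the two limits concrete, here is a minimal sketch of a bounded crawl. It is an illustration under stated assumptions rather than the logic in `discover_topics.py`: the function name `crawl`, the `get_links` callable, the parameter names, and the breadth-first order with a visited set are all choices made for this example.

```python
from collections import deque


def crawl(start, get_links, max_crawl_distance=2, crawl_limit_per_node=3):
    """Return (parent, child) topic pairs found within the given limits.

    get_links is a hypothetical callable mapping a topic to a list of
    linked topics; in practice it would download and parse the article.
    """
    edges = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        topic, distance = queue.popleft()
        # Height limit: nodes at the maximum distance are not expanded.
        if distance >= max_crawl_distance:
            continue
        # Branch limit: follow only the first few links of each node.
        for linked in get_links(topic)[:crawl_limit_per_node]:
            edges.append((topic, linked))
            if linked not in visited:
                visited.add(linked)
                queue.append((linked, distance + 1))
    return edges
```

With both limits in place, the number of pages fetched is bounded by roughly the crawl limit per node raised to the power of the max crawl distance, so small increases in either setting can enlarge the crawl considerably.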

End Result

After the crawling process is complete, the topic associations are merged into a data file that can be visualized in any tool capable of creating a network chart. For the examples below, I used Qlik Sense.
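
One plausible shape for the merged data file is a two-column edge list, which most network chart tools can load. The sketch below assumes that format; the actual output of `merge_outlink_files.py` is not described here, and the function name `write_edge_list` and the column names `source` and `target` are hypothetical.

```python
import csv
import io


def write_edge_list(edges, out_file):
    """Write unique (source, target) topic pairs as CSV rows."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["source", "target"])
    seen = set()
    for source, target in edges:
        # Skip associations that were already written.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        writer.writerow([source, target])


# Usage: write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_edge_list([("Atmosphere", "Air_quality"), ("Atmosphere", "Air_quality")], buffer)
```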

