ThingScraper

The Thingiverse Popular 3D Printing Models Web Scraper

1. Description

ThingScraper was created as part of the Israeli Tech Challenge (ITC) Data Science fellows program. The purpose of this project is to scrape information about 3D printable models from the popular website Thingiverse.

2. Installation

Clone the git repo to your local machine:

git clone https://github.com/Shlomigreen/ThingScrape

Project requirements:

The code mainly relies on Selenium WebDriver and Python.
In order to build a database out of the scraped data, PyMySQL is also required.

pip install -r requirements.txt

WebDriver: In order to perform scraping using Selenium, a webdriver for your browser of choice is required. Download one of the following that matches your browser's version:

Chrome
Firefox
For additional browsers please refer to selenium download page

Note that the Browser object in our code only supports Chrome, Firefox, Internet Explorer and Safari.

You will need to provide the webdriver's path inside the personal configuration file: personal_config.py.

3. Usage

Direct usage: import ThingScraper objects into python project
Command line interface: run python main.py from command line with acceptable tags

3.1. Direct usage objects

Direct usage is possible by importing several classes from ThingScraper.py:

a Browser object: handles the browser of choice for requesting and obtaining web information.
Thing, User and Make objects: each receives and holds information about a single thing (model), user or make. They use a Browser object for some of their functionality.

Basic usage

from ThingScraper import Browser, Thing, User, Make

# Define a new browser instance
# Browser(browser_name, browser_webdriver_path)
browser = Browser('chrome', 'chromedriver')

# Define a new thing instance
# by giving 'thing_id' or 'url' arguments
thing = Thing(thing_id='4734271')

# Attach browser to thing instance
thing.set_browser(browser)

# Open up thing page in attached browser and break it down to elements
thing.fetch_all()

# Convert found elements into useful information
thing.parse_all()

# Print out obtained information
thing.print_info()

# Close browser
browser.close()

Expected output:

https://www.thingiverse.com/thing:4734271
	thing_id = 4734271
	model_name = stackable crate
	username = brainchecker
	uploaded = 2021-01-23T00:00:00
	thing_files = 3
	comments = 47
	makes = 17
	remixes = 12
	tags = ['box', 'container', 'crate', 'stackable']
	print_settings = {'printer_brand': None, 'printer_model': None, 'rafts': 'no', 'supports': 'yes', 'resolution': '0.2', 'infill': '5', 'filament_brand': 'esun, bq', 'filament_color': 'orange, grass green', 'filament_material': 'pla'}
	license = Creative Commons - Attribution
	remix = None
	category = Containers

Obtain thing's makes and remixes

# Get a set of make ids for a thing instance
makes_set = thing.get_makes(max_makes=MAX_MAKES_TO_SCAN)

# Get the remixes of a thing instance as a dict,
# where the keys are the remix ids and the values are Thing objects with 'likes' properties
remix_dict = thing.get_remixes(max_remixes=MAX_REMIXES_TO_SCAN)

3.2 Command line interface (CLI)

When running the program through the CLI, the first positional argument is the type of object we want to scrape. It should be one of: {Thing, User, Make, Remix, API, All}

We can give several arguments of this type, and they will be executed in the order given.

python main.py Thing
python main.py Thing API User
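
As an illustration of how several ordered type arguments can be accepted, here is a minimal argparse sketch. It is not the project's actual cli.py; the function name build_parser and the attribute name types are assumptions made for this example.

```python
import argparse

# Object types listed in this README, including the 'All' shorthand.
TYPES = ['Thing', 'User', 'Make', 'Remix', 'API', 'All']


def build_parser():
    """Hypothetical parser: one or more type names, kept in the order given."""
    parser = argparse.ArgumentParser(prog='main.py')
    parser.add_argument('types', nargs='+', choices=TYPES,
                        help='object types to scrape, executed in order')
    return parser


# Equivalent to: python main.py Thing API User
args = build_parser().parse_args(['Thing', 'API', 'User'])
print(args.types)
```

Because the positional argument uses nargs='+', argparse keeps the values in a list in the same order they were typed, which is what allows them to be executed in order.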

The 'All' option is a shorthand; the following commands are identical:

python main.py All
python main.py Thing Remix Make API User

To quickly scrape all data types and save the result, we can use:

python main.py All -n items_per_page --google-app-name "PERSONAL-KEY" -J

To open the help menu use:

python main.py -h
python main.py --help

Tags

The following tags can be added:

-n, --num-items (int)

Used to indicate how many items to mine.

When we provide many search arguments, we should also provide a 'num-items' for each search argument:

python main.py Thing User API Make -n 5 1 4 2

Note that in the example above the 'User' type doesn't take a 'num-items' value by default, but because of the order in which the types were given we still have to supply a value in its position to reach the values that follow.

In this example the value given for 'User' (1) is ignored.

We can also provide a non-matching number of arguments, for example the following arguments are identical:

python main.py Thing User API -n 5 4 4
python main.py Thing User API -n 5 4
python main.py Thing User API -n 5 4 4 5 6 7 8

When not enough arguments are provided the last argument is used as a substitute.

When too many arguments are provided, the extras are ignored.
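
The two rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the project's code; the function name fit_num_items is hypothetical.

```python
def fit_num_items(types, num_items):
    """Return exactly one num-items value per type.

    Missing values are filled with the last value given; extras are dropped.
    """
    if not num_items:
        raise ValueError('at least one num-items value is required')
    # Pad with the last value (a negative repeat count yields an empty list).
    padded = list(num_items) + [num_items[-1]] * (len(types) - len(num_items))
    # Drop anything beyond the number of types.
    return padded[:len(types)]


types = ['Thing', 'User', 'API']
print(fit_num_items(types, [5, 4, 4]))              # [5, 4, 4]
print(fit_num_items(types, [5, 4]))                 # [5, 4, 4]
print(fit_num_items(types, [5, 4, 4, 5, 6, 7, 8]))  # [5, 4, 4]
```

All three calls give the same result, matching the three identical commands shown above.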

If we are using a shorthand command (like 'All') we have to give it a 'num-items' corresponding to each action it represents, for example the following are identical:

python main.py All -n 1 2
python main.py All -n 1 2 2 2
python main.py Thing Remix Make API User -n 1 2
python main.py Thing Remix Make API User -n 1 2 2 2

Because 'All' is a shorthand for 5 commands, it takes 5 values, but since the last command ('User') usually doesn't need a value we can leave it out, and if the trailing values are identical we can omit the repeats as well.
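
A minimal sketch of how the 'All' shorthand could be expanded before the 'num-items' values are matched to it. The expansion order is taken from the equivalent commands above; the function name expand_types is an assumption for this example and is not taken from the project's code.

```python
# Order taken from: python main.py Thing Remix Make API User
ALL_TYPES = ['Thing', 'Remix', 'Make', 'API', 'User']


def expand_types(types):
    """Replace each 'All' with the five types it stands for (hypothetical helper)."""
    expanded = []
    for name in types:
        if name == 'All':
            expanded.extend(ALL_TYPES)
        else:
            expanded.append(name)
    return expanded


print(expand_types(['All']))  # ['Thing', 'Remix', 'Make', 'API', 'User']
```

After expansion, each 'num-items' value is matched to a type by position, which is why 'All -n 1 2' behaves like 'Thing Remix Make API User -n 1 2'.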

-N, --Name (str)

The name of the file. Used when exporting to json.

-B, --Browser (str)

The name of the browser. Used to configure selenium simulation.

-D, --Driver (str)

The path to the browser's webdriver. Used to configure selenium simulation.

-J, --save-json (bool)

Save a copy of the data in a json file at the end of the run.

-j, --load-json (bool)

Load previously saved data from a json file at the start of the run.

-v, --volume (int)

Set how much text to output to the command line:

10 = quiet
20 = normal
30 = debug
40 = verbose

If the provided level is not in the list, it will be set to the nearest value above.

Normal by default.
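
The rounding rule for unlisted levels can be sketched as follows. This is an illustration only: the function name normalize_volume is hypothetical, and capping values above 40 at 40 is an assumption, since the text does not say what happens when no listed value lies above.

```python
# Levels listed above.
LEVELS = {10: 'quiet', 20: 'normal', 30: 'debug', 40: 'verbose'}


def normalize_volume(level):
    """Return the listed level itself, or the nearest listed level above it."""
    for listed in sorted(LEVELS):
        if level <= listed:
            return listed
    # Assumption: anything above the highest listed level is capped to it.
    return max(LEVELS)


print(normalize_volume(20))  # 20
print(normalize_volume(25))  # 30
```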

--google-app-name (str)

Google developer code used to access Google APIs. A default value is provided in the personal configuration file.

--headless (bool)

Runs the scraper in headless mode (no visible browser).

-d, --database (bool)

If indicated, a database will be created on the MySQL server (specified in parameters, or by default in the Database/config.py file).

--not-all-users (bool)

Search only for the exact number of users specified in the 'num-items' tag.

--mysql-host (str)

Set the host name of the MySQL server. The default is in the Database/config.py file.

--mysql-user (str)

Set the username for the MySQL server. The default is in the Database/config.py file.

--mysql-password (str)

Set the password for the MySQL server. The default is in the Database/config.py file.

4. Configurations

4.1. Personal configurations (personal_config.py)

browser: str representation for browser to use (One of: chrome, firefox, iexplorer, safari). Default: chrome.
driver_path: either a relative or absolute path for the webdriver file location for the provided browser. Default: chromedriver.
def_save_name: the name of the exported file from CLI. Default: save.
wait_timeout_: the time to wait in seconds for web element to be available.
pages_to_scan: the number of pages to scan from the explore url.
max_makes_to_scan: the maximum number of makes to scan per thing.
max_remixes_to_scan: the maximum number of remixes to scan per thing.
implicitly_wait: the number of seconds to wait on some JavaScript-heavy pages (e.g. makes and remixes).
google_ktree_API_key: A token to use Google's APIs: Knowledge Graph Search API.
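
For orientation, a personal_config.py built from the options above might look like the sketch below. Only the defaults for browser, driver_path and def_save_name are stated in this README; every other value is a placeholder chosen for illustration, and the variable names are copied as listed above.

```python
# personal_config.py (illustrative sketch, not the file shipped with the project)

browser = 'chrome'            # default stated above
driver_path = 'chromedriver'  # default stated above; relative or absolute path
def_save_name = 'save'        # default stated above

# The values below are placeholders, not the project's defaults.
wait_timeout_ = 10            # seconds to wait for a web element (name as listed above)
pages_to_scan = 1             # pages to scan from the explore url
max_makes_to_scan = 10        # upper bound of makes scanned per thing
max_remixes_to_scan = 10      # upper bound of remixes scanned per thing
implicitly_wait = 2           # seconds to wait on JavaScript-heavy pages
google_ktree_API_key = 'PERSONAL-KEY'  # replace with your own Knowledge Graph Search API key
```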

5. Database

Once a JSON file has been created by scraping some things, a MySQL database can be created using the build_database function from Database/build_db.py.

from Database.build_db import build_database

build_database(json_path, db_name='thingiverse.db', drop_existing=True)
# json_path: the path to the JSON file created from CLI
# db_name: the name under which the created database is saved. Default: 'thingiverse.db'
# drop_existing: if True, drop the database first if it exists. Default: True.

5.1. ERD

5.2 Tables and fields

Users

Holds the information for all scraped users

Column	Description
user_id	automatically incremented id inside the database
username	the username for each user
followers	the number of followers the user has
following	the number of users the user follows
designs	the number of designs (things) posted by the user
collections	the number of collections created by the user
makes	the number of makes the user has posted for different designs
likes	the number of likes the user has on their profile
skill_level	the self-estimated skill level the user set for themselves (can be null)

Things

Holds scraped information of things and remixes

Column	Description
thing_id	automatically incremented id inside the database
thingiverse_id	the thing (or remix) id as provided by thingiverse
user_id	foreign key for the creator user as it exists in the Users table (can be null if the user was not scraped)
model_name	the model name given by the creator
uploaded	the date the thing was uploaded in ISO8601
files	the number of files posted for the model
comments	the number of comments the model has
makes	the number of makes (prints) the model has
remixes	the number of remixes (modifications) the model has
likes	the number of likes the model has
setting_id	foreign key for print settings found in Print_settings table (can be null if no print settings were provided)
license	usage license as provided by the user
remix_id	if the thing is a remix, this is the thing_id of another scraped thing (can be null if the remix source was not scraped)
thingiverse_remix	the source thing for this remix in thingiverse id (can be null if has no original)
category	the category to which the thing was posted to by the user

Makes

Holds the information for all scraped makes

Column	Description
make_id	automatically incremented id inside the database
thingiverse_id	the make id as provided by thingiverse
thing_id	foreign key for the thing id found in Things; the thing the make was made from
user_id	foreign key for the user id found in Users; the creator who posted the make
uploaded	the date and time the make was uploaded in ISO8601 format
comments	number of comments for the make
likes	number of likes for the make
views	number of views the make has
category	the category to which the thing was posted to by the user
setting_id	foreign key for print settings found in Print_settings table (can be null if no print settings were provided)

Print settings

Print settings that creators used to print a model (posted as a thing, remix or make).

Column	Description
setting_id	automatically incremented id inside the database
printer_brand	the brand of the printer used to print the model
printer_model	the specific model of the printer used to print the model
rafts	indicates if rafts were used when printing: 0 = no, 1 = yes, -1 = doesn't matter, NULL = wasn't indicated
supports	indicates if supports were used when printing: 0 = no, 1 = yes, -1 = doesn't matter, NULL = wasn't indicated
resolution	the printing resolution used
infill	percentage of infill used for printing
filament_brand	brand of filament used for printing. Note: for makes, this field also holds the color and material used.
filament_color	color of the filament used for printing
filament_material	type of material used for printing
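
The integer codes used for rafts and supports can be illustrated with a small mapping. This is a sketch under assumptions, not the project's conversion code: the function name encode_flag is hypothetical, and the exact text scraped for the "doesn't matter" case is assumed.

```python
# 'yes' and 'no' appear in the example output earlier in this README;
# the "doesn't matter" label is an assumed spelling.
CODES = {'no': 0, 'yes': 1, "doesn't matter": -1}


def encode_flag(value):
    """Map a scraped rafts/supports value to 0, 1, -1 or None (stored as NULL)."""
    if value is None:
        return None  # setting was not indicated
    # Unrecognised text is also treated as not indicated (an assumption).
    return CODES.get(value.strip().lower())


print(encode_flag('no'))   # 0
print(encode_flag('yes'))  # 1
print(encode_flag(None))   # None
```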

Tags

Column	Description
tag_id	automatically incremented id inside the database
tag	text of tag used in thing or remix post

Titles

Column	Description
title_id	automatically incremented id inside the database
title	text of the title used in a user profile

Since a user can have multiple titles and a title can belong to multiple users, a many-to-many table title_user exists.

6. License & Contributing

Created by Konstantin Krivokon and Shlomi Abuchatzera Green.

Creative Commons usage 2021.