@@ -0,0 +1,73 @@
Class: CS41
Date: March 11, 2020
Project Partners: Elizabeth Fitzgerald & Christopher Moffitt
Google Drive URL for Presentation: https://drive.google.com/file/d/1G11pi4g7jpeK87NpjaQiVwEEr8vu6Rvx/view
=========================================
Requirements:
-------------------------
Simply download the code as is, then run the main Python script, wallscraper.py, with the command-line argument wallpapers.
ex:
python wallscraper.py wallpapers

For the periodic-running extension (more involved):
Save the wallpapers.plist file to the ~/Library/LaunchAgents folder.
Then enter the following in Terminal to start the launchd job:
$ launchctl load ~/Library/LaunchAgents/wallpapers.plist
$ launchctl start wallpapers

=========================================
Technical Details:
-------------------------
This project performs several tasks. It scrapes data, downloads that data, avoids downloading the same image twice, provides a command-line utility, and periodically runs itself. Each of these tasks was built from smaller parts that came together to make the whole.
-------------------------
Task 1: Scraping Data
-------------------------
(A) Better familiarize ourselves with JSON objects and how to work with them
--

(B) Write the query code to collect the JSON objects from reddit
-- (see the sketch below)

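A minimal sketch of the query step, mirroring the approach taken in wallscraper.py further down in this commit (the full version adds error handling and a subreddit-validity check):

    import requests

    def query(subreddit):
        # Reddit serves a JSON view of any subreddit at /r/<name>.json
        url = "https://reddit.com/r/" + subreddit + ".json"
        headers = {'User-Agent': "Wallscraper Script by @cmoffitt"}
        r = requests.get(url, headers=headers)
        return r.json()
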
(C) Build a class for Reddit Posts that stores the most relevant JSON information as attributes
-- In order to do this, we had to organize the collected JSON data into a single, neat dictionary in the __init__ function. The dictionary only collects certain attributes (those in the attr list). With these attributes in mind, the function goes through each post characteristic scraped from the JSON data and stores only the desired attributes in a dictionary. If a post lacks a particular attribute, that attribute is assigned the value None. (A sketch of this pattern follows.)

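A sketch of that __init__ pattern, shortened to a few attributes; the full script iterates over a longer attr list and uses try/except, which dict.get() reproduces here:

    class RedditPost:
        def __init__(self, data):
            # Keep only the attributes we care about; any missing key becomes None
            attr = ["title", "score", "url", "preview"]
            self.data = {k: data["data"].get(k) for k in attr}
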
-------------------------
Task 2: Downloading Data
-------------------------
(A) Implement the download function in the RedditPost class (moderate)
-- The download function only runs on posts whose urls contain the string ".jpg" or ".png".
-- The download function sorts images into different folders based on their size, and titles them in the format "wallpapers/[image size]/[title].png".
-- When creating this function, we had to handle the case of the path to a folder not existing. The command "os.makedirs(path)" accounts for this.
-- Once the path is known to exist, we use the requests package to collect the content of the post's url, and then write that to the existing filepath. (See the sketch below.)

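A minimal sketch of the makedirs-then-write step (save_image is a hypothetical helper name; the real download method also builds the folder name from the image dimensions and deduplicates, as described in Task 3):

    import os
    import requests

    def save_image(url, path, filename):
        # Create the destination folder if it does not already exist
        if not os.path.exists(path):
            os.makedirs(path)
        # Fetch the raw image bytes and write them to the target file
        img_data = requests.get(url).content
        with open(os.path.join(path, filename), 'wb') as f:
            f.write(img_data)
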
(B) Better familiarize ourselves with magic methods and implement the __str__(self) method
-- This function simply allowed us to print post data more cleanly by printing the object itself. It was fairly simple, and only required basic string concatenation.

(C) Test downloading one image
-- We started by downloading just the first collected image post, to confirm that it went to the right folder.

(D) Download all images generated by the initial query
-- Once we downloaded one image correctly, we simply ran all the RedditPost objects through a for loop in the main function.

-------------------------
Task 3: Wallpaper Deduplication
-------------------------
(A) Keep track of previously seen images
-- In order to keep track of previously seen images across different runs of the program, we used the pickle package.
-- We created a list of seen wallpapers and saved it to the project folder as a pickle file. We then save every new post's url content to that list.

(B) Check new images against those already seen
-- With this file, we can load it any time we're saving a new post, scan through it to make sure there is no matching content, then save the new wallpaper's content to the list before dumping the list back into the pickle file. (See the sketch below.)

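A sketch of that load-check-dump cycle (is_new_wallpaper is a hypothetical helper name, and this assumes the pickle file already exists):

    import pickle

    def is_new_wallpaper(img_data, pickle_path="seen_wallpapers.pickle"):
        # Load the list of previously downloaded image contents
        with open(pickle_path, 'rb') as f:
            seen = pickle.load(f)
        if img_data in seen:
            return False
        # Record the new image and write the list back to disk
        seen.append(img_data)
        with open(pickle_path, 'wb') as f:
            pickle.dump(seen, f)
        return True
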
-------------------------
Task 4: Implementing Command Line Utility
-------------------------
(A) Allow the user to specify which subreddit's posts to download through command line arguments
-- We did this by importing the sys package.
-- We then just had to pass sys.argv[1] to our query function. (See the sketch below.)
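A minimal sketch of that entry point (the real main function also counts posts and scores before downloading):

    import sys

    if __name__ == '__main__':
        subreddit = sys.argv[1]  # e.g. "wallpapers"
        print("Scraping r/" + subreddit)
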
-------------------------
Task 5: Configuring the script to run automatically
-------------------------
(A) Work with Mac LaunchAgents to have our wallscraper script run every hour
-- Create a new .plist file for wallscraper in ~/Library/LaunchAgents and set the start interval to every hour (see wallpapers.plist below)
-- Use Terminal to start the launch agent
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>wallpapers</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python</string>
        <string>/Users/cmoffitt/cs41-env/Assignments/FinalProject/lab-5/wallscraper.py</string>
        <!-- Subreddit to scrape -->
        <string>wallpapers</string>
    </array>
    <!-- Run every hour -->
    <key>StartInterval</key>
    <integer>3600</integer><!-- seconds -->
</dict>
</plist>
@@ -0,0 +1,148 @@
#!/usr/bin/env python3
"""
Reddit Wallscraper
Course: CS 41
Name: Chris Moffit and Elizabeth Fitzgerald
SUNet: cmoffitt and elizfitz

Scrapes image posts from a subreddit and downloads them as wallpapers,
sorted into folders by image resolution, skipping images seen before.
"""
import requests
import sys
import re
import os
import pickle

# Uses the requests module to query reddit for the JSON listing of a subreddit
def query(subreddit):
    URL_START = "https://reddit.com/r/"
    URL_END = ".json"
    url = URL_START + subreddit + URL_END
    print(url)
    headers = {'User-Agent': "Wallscraper Script by @cmoffitt"}
    r = None

    # Make request and catch exceptions
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print("HTTP Error:", errh)
        sys.exit(1)
    except requests.exceptions.ConnectionError:
        print("Error Connecting: No internet connection")
        sys.exit(1)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
        sys.exit(1)
    except requests.exceptions.RequestException as err:
        print("Oops, something else went wrong:", err)
        sys.exit(1)

    # Capture the JSON dict object for the subreddit if the response was successful
    print(r)
    if r.ok:
        json_data = r.json()
    else:
        print("The server did not return a successful response. Please try again.")
        sys.exit(1)

    # Check that this is a valid subreddit
    if not isValidSubreddit(json_data):
        print("Not a valid subreddit. Please try again.")
        sys.exit(1)

    return json_data

# Class defining one reddit post
class RedditPost:
    # Initializes one reddit post as a dictionary storing certain attributes from the JSON post object
    def __init__(self, data):
        post_data = data
        attr = ["subreddit", "is_self", "ups", "post_hint", "title", "downs", "score", "url", "domain", "permalink", "created_utc", "num_comments", "preview", "name", "over_18"]

        post_dict = {}
        for k in attr:
            try:
                post_dict[k] = post_data["data"][k]
            except KeyError:
                # A post that lacks this attribute gets None instead
                post_dict[k] = None

        self.data = post_dict

    # Downloads the post image to a file on the computer, preventing duplicate image downloads
    def download(self):
        if ".jpg" in self.data["url"] or ".png" in self.data["url"]:  # only download if the post actually links to an image
            # Format the correct name and path for the file
            name = re.sub(r'\[.*\]', '', self.data["title"])
            name = re.sub(" ", "", name)
            name = re.sub(r'[^a-zA-Z0-9]', "", name)
            path = "wallpapers/" + str(self.data["preview"]["images"][0]["source"]["width"]) + "x" + str(self.data["preview"]["images"][0]["source"]["height"]) + "/"
            filename = name + ".png"

            if not os.path.exists(path):
                os.makedirs(path)

            img_data = requests.get(self.data["url"]).content  # raw image bytes, also used to detect duplicates

            # Load the seen_wallpapers pickle file to compare against img_data and prevent
            # duplicate image downloading; start with an empty list on the first run
            if os.path.exists("seen_wallpapers.pickle"):
                with open("seen_wallpapers.pickle", 'rb') as f:
                    seen_wallpapers = pickle.load(f)
            else:
                seen_wallpapers = []

            if img_data not in seen_wallpapers:
                # Record the new image and write the list back to disk
                seen_wallpapers.append(img_data)
                with open("seen_wallpapers.pickle", 'wb') as f:
                    pickle.dump(seen_wallpapers, f)
                # Save the image to its file
                with open(os.path.join(path, filename), 'wb') as temp_file:
                    temp_file.write(img_data)

    def __str__(self):
        # Format: RedditPost({title} ({score}): {url})
        string = "RedditPost({" + self.data["title"] + "} ({" + str(self.data["score"]) + "}): {" + self.data["url"] + "})"
        return string


# Checks whether this is a valid subreddit by making sure the JSON dict object is properly filled with contents
def isValidSubreddit(json_data):
    return json_data['data']['dist'] != 0


def main(subreddit):
    q = query(subreddit)

    children = q['data']['children']
    postCount = 0   # To confirm we have all 25 "posts"
    scoreCount = 0  # To count posts with a score above 500

    RedditPosts = [RedditPost(post) for post in children]

    for post in RedditPosts:
        postCount += 1
        if post.data["score"] > 500:
            scoreCount += 1
        post.download()

    print("There were " + str(postCount) + " posts.")
    print(str(scoreCount) + " of those posts had a score over 500.")


if __name__ == '__main__':
    main(sys.argv[1])