Add cmoffitt_elizfitz
cooper-mj committed Mar 24, 2020
1 parent 0e63766 commit af74e22
Showing 4 changed files with 238 additions and 0 deletions.
Binary file modified .DS_Store
73 changes: 73 additions & 0 deletions cmoffitt_elizfitz/README.txt
@@ -0,0 +1,73 @@
Class: CS41
Date: March 11, 2020
Project Partners: Elizabeth Fitzgerald & Christopher Moffitt
Google Drive URL for Presentation: https://drive.google.com/file/d/1G11pi4g7jpeK87NpjaQiVwEEr8vu6Rvx/view
=========================================
Requirements:
-------------------------
Download the code as is, then run the main Python script, wallscraper.py, with the subreddit name (wallpapers) as a command line argument.
ex:
python wallscraper.py wallpapers

For the periodic-running extension (more involved):
Save the wallpapers.plist file to the ~/Library/LaunchAgents folder.
Then enter the following in Terminal to load and start the launchd job:
$ launchctl load ~/Library/LaunchAgents/wallpapers.plist
$ launchctl start wallpapers

=========================================
Technical Details:
-------------------------
This project performs several tasks: it scrapes post data from Reddit, downloads the images that data points to, avoids downloading the same image twice, provides a command line interface, and runs itself periodically. Each of these tasks was built from smaller parts, described below.
-------------------------
Task 1: Scraping Data
-------------------------
(A) Better familiarize ourselves with JSON objects and how to work with them
--

(B) Write the query code to collect the JSON objects from reddit
--

(C) Build a class for Reddit Posts that stores the most relevant JSON information as attributes
-- In order to do this, we organize the collected JSON data into a single, neat dictionary in the __init__ function. The dictionary only collects certain attributes (those in the attr list). With these attributes in mind, __init__ goes through each post characteristic scraped from the JSON data and stores only the desired attributes in the dictionary. If a post lacks a particular attribute, that attribute is assigned the value None (see the sketch below).
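
A rough sketch of that filtering pattern (filter_attrs is an illustrative stand-in for the body of __init__, with the attr list shortened):

    def filter_attrs(post_data):
        attr = ["title", "score", "url"]  # shortened from the real attr list
        filtered = {}
        for k in attr:
            try:
                filtered[k] = post_data["data"][k]
            except KeyError:
                filtered[k] = None  # missing attributes default to None
        return filtered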

-------------------------
Task 2: Downloading Data
-------------------------
(A) Implement the download function in the RedditPost class (moderate)
-- The download function only runs on posts whose URLs contain the string ".jpg" or ".png".
-- The download function sorts images into different folders based on their size and titles them in the format "wallpapers/[image size]/[title].png".
-- When creating this function, we had to handle the case where the path to a folder does not yet exist. The call os.makedirs(path) accounts for this by creating any missing folders.
-- Once the path is known to exist, we use the requests package to collect the content at the post's URL and write it to the resulting filepath (see the sketch below).
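
A condensed sketch of that flow (save_image and its parameters are illustrative names, not the actual method):

    import os
    import requests

    def save_image(url, title, width, height):
        path = "wallpapers/" + str(width) + "x" + str(height) + "/"
        if not os.path.exists(path):
            os.makedirs(path)  # creates any missing intermediate folders
        img_data = requests.get(url).content
        with open(os.path.join(path, title + ".png"), 'wb') as f:
            f.write(img_data)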

(B) Better familiarize ourselves with magic methods and implement the __str__(self) method
-- This method simply allows us to print post data more cleanly by printing the object itself. It was fairly simple and only required basic string concatenation.

(C) Test downloading one image
-- We started by simply downloading the first collected image post, to confirm that it went to the right folder.

(D) Download all images generated by initial query
-- Once we downloaded one image correctly, we simply ran all the RedditPost objects through a for loop in the main function.

-------------------------
Task 3: Wallpaper Deduplication
-------------------------
(A) Keep track of previously seen images
-- To keep track of previously seen images across different runs of the program, we used the pickle package.
-- We created a list of seen wallpapers and saved it to the project folder as a pickle file. We then save every new post's image content to that list.

(B) Check new images against those already seen
-- With this file, we can load the list any time we're saving a new post, scan it to make sure there is no matching content, then save the new wallpaper's content to the list before dumping the list back into the pickle file (see the sketch below).
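
The round-trip in sketch form (is_new_image is an illustrative wrapper; seen_wallpapers.pickle is the actual filename):

    import os
    import pickle

    def is_new_image(img_data, store="seen_wallpapers.pickle"):
        seen = []
        if os.path.exists(store):  # the first run starts with an empty list
            with open(store, 'rb') as f:
                seen = pickle.load(f)
        if img_data in seen:
            return False  # duplicate; skip the download
        seen.append(img_data)
        with open(store, 'wb') as f:
            pickle.dump(seen, f)
        return True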

-------------------------
Task 4: Implementing Command Line Utility
-------------------------
(A) Allow the user to specify which subreddit posts to download through command line arguments
-- We did this by importing the sys package.
-- We then just had to pass sys.argv[1] to our query function (see the sketch below).
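
In sketch form (the usage guard is hypothetical; the actual script reads sys.argv[1] directly):

    import sys

    if len(sys.argv) < 2:
        print("Usage: python wallscraper.py <subreddit>")
        sys.exit(1)
    subreddit = sys.argv[1]  # e.g. "wallpapers"
    print(subreddit)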
-------------------------
Task 5: Configuring the script to run automatically
-------------------------
(A) Work with Mac LaunchAgents to have our wallscraper script run every hour
-- Create a new .plist file for wallscraper in LaunchAgents and set the start interval to every hour (3600 seconds)
-- Use Terminal to load and start the launch agent
17 changes: 17 additions & 0 deletions cmoffitt_elizfitz/wallpapers.plist
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>wallpapers</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/python</string>
<string>/Users/cmoffitt/cs41-env/Assignments/FinalProject/lab-5/wallscraper.py</string>
<string>wallpapers</string>
</array>
<!-- Run every hour -->
<key>StartInterval</key>
<integer>3600</integer><!-- seconds -->
</dict>
</plist>
148 changes: 148 additions & 0 deletions cmoffitt_elizfitz/wallscraper.py
@@ -0,0 +1,148 @@
#!/usr/bin/env python3
"""
Reddit Wallscraper
Course: CS 41
Name: Chris Moffit and Elizabeth Fitzgerald
SUNet: cmoffitt and elizfitz

Queries a subreddit's JSON feed, downloads every image post that has not been
seen before, and sorts the images into folders by resolution.
"""
import requests
import sys
import re
import os
import pickle

# Uses the requests module to query Reddit for the JSON feed of the given subreddit
def query(subreddit):
    URL_START = "https://reddit.com/r/"
    URL_END = ".json"
    url = URL_START + subreddit + URL_END
    print(url)
    headers = {'User-Agent': "Wallscraper Script by @cmoffitt"}
    r = None

    # Make the request and catch exceptions
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print("HTTP error:", errh)
        sys.exit(1)
    except requests.exceptions.ConnectionError:
        print("Error connecting: no internet connection")
        sys.exit(1)
    except requests.exceptions.Timeout as errt:
        print("Timeout error:", errt)
        sys.exit(1)
    except requests.exceptions.RequestException as err:
        print("Oops, something else went wrong:", err)
        sys.exit(1)

    # Capture the JSON dict object for the subreddit if the response was successful
    print(r)
    if r.ok:
        json_data = r.json()
    else:
        print("The server did not return a successful response. Please try again.")
        sys.exit(1)

    # Check that this is a valid subreddit
    if not isValidSubreddit(json_data):
        print("Not a valid subreddit. Please try again.")
        sys.exit(1)

    return json_data

# Class defining one Reddit post
class RedditPost:
    # Initializes one Reddit post as a dictionary storing certain attributes from the JSON post object
    def __init__(self, data):
        post_data = data
        attr = ["subreddit", "is_self", "ups", "post_hint", "title", "downs", "score", "url", "domain", "permalink", "created_utc", "num_comments", "preview", "name", "over_18"]

        attributes = {}
        for k in attr:
            try:
                attributes[k] = post_data["data"][k]
            except KeyError:
                attributes[k] = None

        self.data = attributes

    # Downloads the post image to a file on the computer, preventing duplicate image downloading
    def download(self):
        url = self.data["url"]
        if url and (".jpg" in url or ".png" in url):  # only download if the post actually links to an image
            # Format the correct name and path for the file
            name = re.sub(r'\[.*\]', '', self.data["title"])
            name = re.sub(r'[^a-zA-Z0-9]', "", name)
            source = self.data["preview"]["images"][0]["source"]
            path = "wallpapers/" + str(source["width"]) + "x" + str(source["height"]) + "/"
            filename = name + ".png"

            if not os.path.exists(path):
                os.makedirs(path)

            img_data = requests.get(url).content  # raw image bytes, compared against previous downloads to prevent duplicates

            # Load the seen_wallpapers pickle file (starting with an empty list on the first run)
            seen_wallpapers = []
            if os.path.exists("seen_wallpapers.pickle"):
                with open("seen_wallpapers.pickle", 'rb') as f:
                    seen_wallpapers = pickle.load(f)

            # Only save the image if its content has not been seen before
            if img_data not in seen_wallpapers:
                seen_wallpapers.append(img_data)
                with open("seen_wallpapers.pickle", 'wb') as f:
                    pickle.dump(seen_wallpapers, f)
                # Save the image to a file
                with open(os.path.join(path, filename), 'wb') as temp_file:
                    temp_file.write(img_data)

    # Prints as: RedditPost(title (score): url)
    def __str__(self):
        return "RedditPost(" + self.data["title"] + " (" + str(self.data["score"]) + "): " + self.data["url"] + ")"



# Checks for a valid subreddit by making sure the JSON dict object is properly filled with contents
def isValidSubreddit(json_data):
    return json_data['data']['dist'] != 0


def main(subreddit):
    q = query(subreddit)

    children = q['data']['children']
    postCount = 0   # To confirm we have all 25 "posts"
    scoreCount = 0  # To count the posts with a score above 500

    reddit_posts = [RedditPost(post) for post in children]

    for post in reddit_posts:
        postCount += 1
        if post.data["score"] is not None and post.data["score"] > 500:
            scoreCount += 1
        post.download()

    print("There were " + str(postCount) + " posts.")
    print(str(scoreCount) + " of those posts had a score over 500.")


if __name__ == '__main__':
    main(sys.argv[1])
