I started this project during my first internship. I have learned a lot since then and have decided to rewrite it in Go! This project is no longer maintained; the replacement is located here: https://github.com/gordonpn/hot-flag-deals.
This project aims to scrape the content of the Hot Deals forums, keep track of interesting and relevant deals, and archive the rest. Relevant deals are emailed daily to a mailing list and then archived. A front-end also exists at https://gordon-pn.com/deals for viewing the current relevant deals.
Red Flag Deals does aggregate deals on its front page, but the Hot Deals forums are community driven and sourced by anybody. That is where this project comes in: it scrapes the Hot Deals forums several times per day and displays the results on a front-end.
With this project, I saved myself the chore of checking the (messy) forum a few times a day while still being aware of the good deals posted by the community.
Template made by @tiffzeng
- Maven: Dependency management
- Bootstrap: CSS framework for front-end
- jQuery: front-end
- Javalin: Web framework for Java for the back-end
- Spring Framework: Utilized Thymeleaf for email templates as well as some dependency injection
- jsoup: library to parse HTML documents
- Java 8+
- Apache Maven 3.6+
Clone the master branch into your workspace.
Compile and package using Maven.
```
mvn clean compile package
```

Edit configuration.json to your needs. You must set your Gmail address and password as environment variables. In my case, my production machine ran Linux and my test machines ran macOS and Windows. These settings are read by ConfigurationLoader.java.
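As a rough sketch of how credentials can be pulled from the environment (the variable names `GMAIL_ADDRESS` and `GMAIL_PASSWORD` and the helper are illustrative, not the project's actual `ConfigurationLoader.java`):

```java
// Minimal sketch of environment-based credential loading, assuming
// hypothetical variable names; the real names live in ConfigurationLoader.java.
public final class EnvCredentials {
    private EnvCredentials() { }

    // Returns the environment variable's value, or the fallback when unset.
    public static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        String address = envOrDefault("GMAIL_ADDRESS", "unset");
        String password = envOrDefault("GMAIL_PASSWORD", "unset");
        System.out.println("address=" + address);
    }
}
```

Using a fallback value makes it easy to tell at startup that a variable was never exported, instead of failing later with a null.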
The main class com.rfdhd.scraper.App is used for scraping the forum.
```
java -cp *.jar com.rfdhd.scraper.App
```

The main class com.rfdhd.scraper.DigestCreator is used for sending the daily digest email. It takes the content of dailyDigest.json as its source.
```
java -cp *.jar com.rfdhd.scraper.DigestCreator
```

The main class com.rfdhd.scraper.Start is used to start the back-end that responds to HTTP requests.
```
java -cp *.jar com.rfdhd.scraper.Start
```

The Scraper and the DigestCreator are both automated in Jenkins in order to have the most up-to-date information on deals.
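The project itself schedules these jobs through Jenkins; as a rough equivalent, a plain crontab could run them like this (the jar path and times are illustrative, not the project's actual schedule):

```shell
# Scrape the forums every four hours (jar path is illustrative)
0 */4 * * * java -cp /opt/deals/scraper.jar com.rfdhd.scraper.App
# Send the daily digest every morning at 08:00
0 8 * * * java -cp /opt/deals/scraper.jar com.rfdhd.scraper.DigestCreator
```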
- Use the Jsoup library to scrape data correctly.
- Save all the scraped data in a map.
- Save the unfiltered map into scrapings.json
- Try to read scrapings.json
- Remove duplicates before saving again
- Utility class to calculate information from a map.
- Filter the raw map using the utility class.
- Save the filtered map into currentLinks.json
- Parse direct link to product
- Create a configuration file in resources with the property of pages to scrape.
- Spring framework beans for configuration
- Refactor currentLinks to dailyDigest
- Refactor pastLinks to archive
- Add mailing list to configuration
- Add email settings to configuration
- When filtering for dailyDigest, read scrapings and get median votes count of all
- When scrapings are filtered, it must check with archive if an item has already been processed
- When email service reads from dailyDigest, move items to archive
- Set up Mail service
- Set up Thymeleaf engine
- Environment variables getter
- Implement Spring framework beans
- Add content body under h2
- Add a good readme.md
- Sort descending by votes before sending email
- Record thread start time
- Parse post date
- Keep the most recent version of the scraped posts
- Fix logic with scrapings (threads were not going to dailyDigest if they had previously been scraped with a low vote score, because they were already in scrapings)
- To fix these two issues:
- Make use of LinkedHashMap to preserve the order of insertion.
- Try to read the existing files before scraping, and put the new scrapings into those existing maps, thus updating values with identical keys.
- Only filter based on the median of pages scraped, not entire scrapings json.
- Save the interesting threads in dailyDigest disregarding the duplicates found in scrapings.
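The fixes above can be sketched as follows (the `Deal` record is illustrative, not the project's actual model class): a LinkedHashMap preserves insertion order, `putAll` updates values with identical keys, and the median is computed over the current scrape only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the merge-and-filter fixes, assuming a hypothetical Deal record.
public final class ScrapeMerge {
    public record Deal(String id, int votes) { }

    // putAll overwrites entries with identical keys, so a re-scraped
    // thread keeps its newest vote count; LinkedHashMap keeps insertion order.
    public static Map<String, Deal> merge(Map<String, Deal> existing,
                                          Map<String, Deal> scraped) {
        Map<String, Deal> merged = new LinkedHashMap<>(existing);
        merged.putAll(scraped);
        return merged;
    }

    // Median votes of the current scrape only, not the whole scrapings file.
    public static int medianVotes(List<Deal> scraped) {
        List<Integer> votes = scraped.stream().map(Deal::votes).sorted().toList();
        return votes.get(votes.size() / 2); // upper median for even sizes
    }
}
```

Merging before filtering means a thread that was scraped earlier with a low vote count can still make it into dailyDigest once its votes climb.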
- To fix these two issues:
- When preparing the email, remove the duplicates by comparing with archive.
- Filter out threads older than 72 hours when preparing email.
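The 72-hour freshness filter can be sketched with java.time (the `PostedDeal` record is illustrative, not the project's model class):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch of the email-preparation freshness filter: drop any thread
// whose post date is more than 72 hours before "now".
public final class FreshnessFilter {
    public record PostedDeal(String id, Instant postedAt) { }

    static final Duration MAX_AGE = Duration.ofHours(72);

    public static List<PostedDeal> recentOnly(List<PostedDeal> deals, Instant now) {
        return deals.stream()
                .filter(d -> Duration.between(d.postedAt(), now).compareTo(MAX_AGE) <= 0)
                .toList();
    }
}
```

Passing `now` as a parameter instead of calling Instant.now() inside the method keeps the filter deterministic and easy to test.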
- Read from config and template within the jar
- Add command line flags to differentiate between test and prod
- A front-end
- Finish implementing back-end for signing up
- Refactor how the configurations are acquired.
- Write tests
- Improve styling of email template
- Integrate MongoDB or lowdb for a database
