CeurDataAnalytics

This project is part of the "Technology for Big Data Management" course. Its aim was to develop a method for analyzing the CEUR-WS website.

Our objective was to extract data from the website, store it in a database, and build a graph from it, so that queries and analyses can return commonly searched information quickly.

Prerequisites

Before running the script, ensure you have the following installed:

Python 3.8 or higher
MongoDB (e.g. MongoDB Atlas)
Memgraph (for graph database)
Required Python packages (listed in requirements.txt)

Installation

Clone the repository:

git clone https://github.com/Bia104/CeurDataAnalytics.git
cd CeurDataAnalytics

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install dependencies:
```
pip install -r requirements.txt
```

Configuration

Create a .env file in the web-scraper directory and add the following environment variables:

BASE_URL = "https://ceur-ws.org/"
MONGO_URI = "mongodb://localhost:27017/" # Replace with the actual URI
DB_NAME = "ceur_ws"

Create the MongoDB database with the name ceur_ws (or the name you specified in the .env file).

If necessary, update the database names and connection details in the Jupyter notebooks to match your setup.
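As an illustration of how these three variables might be consumed, here is a minimal sketch. It assumes the .env values have already been loaded into the process environment (for example by a loader such as python-dotenv); `ScraperConfig` and `load_config` are hypothetical names and not necessarily what the project's code uses. The fallback values are the ones shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical container for the three documented settings."""
    base_url: str
    mongo_uri: str
    db_name: str


def load_config(env=None):
    """Build a ScraperConfig from a mapping of environment variables.

    Falls back to the example values from this README when a variable is missing.
    """
    env = os.environ if env is None else env
    return ScraperConfig(
        base_url=env.get("BASE_URL", "https://ceur-ws.org/"),
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017/"),
        db_name=env.get("DB_NAME", "ceur_ws"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without touching the real environment.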

Running the Script

To run the script, execute the following command:

python main.py

This will populate the MongoDB database with the scraped data. Logs are written to scraping.log, where they can be inspected.

Subsequently, you can run the two Jupyter notebooks:

etl/clear_dataset.ipynb: Cleans the dataset and prepares it for graph construction.
etl/create_nodes.ipynb: Constructs the graph in Memgraph from the cleaned dataset.

Components

To develop this project, we extended an existing web scraper written in Python.

The other unique additions are the following:

Parser
Along with the web scraper, a parser was developed to extract structured data from the content of each paper's PDF. This parser extracts information such as the title, authors, keywords, references, and abstract of each paper.
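To give a feel for this kind of extraction, the following is a rough sketch that works on text already extracted from a PDF. It is not the project's parser: `parse_paper_text` is a hypothetical name, and the heuristics (first non-empty line as title, "Abstract" and "Keywords" headings) are assumptions that will not hold for every paper layout.

```python
import re

# Capture text after an "Abstract" heading, stopping at a "Keywords" or
# "1. Introduction" heading, or at the end of the text.
_ABSTRACT_RE = re.compile(
    r"abstract[\s:.]*(?P<body>.+?)(?=\n\s*(?:keywords|1\.?\s+introduction)\b|\Z)",
    re.IGNORECASE | re.DOTALL,
)
# Capture the single line that follows a "Keywords" heading.
_KEYWORDS_RE = re.compile(r"keywords[\s:.]*(?P<body>[^\n]+)", re.IGNORECASE)


def parse_paper_text(text):
    """Extract a rough title, abstract, and keyword list from plain text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: the first non-empty line is the title.
    title = lines[0] if lines else None

    abstract_match = _ABSTRACT_RE.search(text)
    abstract = None
    if abstract_match:
        # Collapse line breaks and repeated spaces into single spaces.
        abstract = " ".join(abstract_match.group("body").split())

    keywords = []
    keywords_match = _KEYWORDS_RE.search(text)
    if keywords_match:
        raw = keywords_match.group("body")
        keywords = [k.strip() for k in re.split(r"[,;]", raw) if k.strip()]

    return {"title": title, "abstract": abstract, "keywords": keywords}
```

Authors and references are left out here because they usually need layout-aware handling that a few regular expressions cannot provide.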
Graph Construction
A Memgraph graph database is constructed from the MongoDB data to enable complex queries and analysis.

This enables queries such as:

MATCH (a:Author)-[:WROTE]->(p:Paper)-[:IN_VOLUME]->(v:Volume)
RETURN a.name, p.title, v.title;
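For a query like this to return results, the graph needs `Author`, `Paper`, and `Volume` nodes linked by `WROTE` and `IN_VOLUME` relationships. The sketch below shows one way a single paper record could be turned into a parameterized Cypher statement with those labels. It is an assumption about the loading step, not the code in etl/create_nodes.ipynb: the function name `paper_to_cypher` and the input keys `title`, `volume`, and `authors` are hypothetical.

```python
def paper_to_cypher(paper):
    """Return (query, params) that upserts one paper, its volume, and its authors.

    `paper` is assumed to be a dict with hypothetical keys
    "title", "volume", and an optional list "authors".
    """
    query = (
        "MERGE (v:Volume {title: $volume_title}) "
        "MERGE (p:Paper {title: $paper_title}) "
        "MERGE (p)-[:IN_VOLUME]->(v) "
        "WITH p "
        "UNWIND $authors AS author_name "
        "MERGE (a:Author {name: author_name}) "
        "MERGE (a)-[:WROTE]->(p)"
    )
    params = {
        "volume_title": paper["volume"],
        "paper_title": paper["title"],
        "authors": list(paper.get("authors", [])),
    }
    return query, params
```

Using `MERGE` rather than `CREATE` keeps repeated runs from duplicating nodes, and passing values as parameters avoids building query strings from scraped text. The returned pair can then be handed to whichever Memgraph client the notebooks use.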

Other Materials

Additional materials and documentation related to this project can be found in the utils directory.

Presentation.pdf contains the slides used for the project's final presentation, which cover the architecture, data structure, and results.
The query_collection.json file contains a collection of Cypher queries that can be imported and executed on the Memgraph database to retrieve various insights from the data.

Contact

For any questions feel free to reach out:

Federico Di Petta: federico.dipetta@studenti.unicam.it
Bianca Maria Cerino: biancamaria.cerino@studenti.unicam.it
