CORD-19 Research Paper Analysis

Project Purpose and Learning Objectives

This project aims to demonstrate a complete data science workflow, from data loading and cleaning to analysis, visualization, and building an interactive web application. The primary learning objectives include:

- Proficiency in using Python libraries like Pandas, Matplotlib, Seaborn, and Streamlit.
- Understanding and implementing data cleaning and preparation techniques.
- Performing exploratory data analysis and generating meaningful visualizations.
- Building an interactive web application to showcase data insights.
- Adhering to professional Git practices for version control.

Dataset Source

This project uses metadata.csv from the CORD-19 (COVID-19 Open Research Dataset) Research Challenge, available on Kaggle.

Steps Implemented

1. Data Loading & Basic Exploration

Downloaded and loaded metadata.csv into a Pandas DataFrame.
Displayed initial rows, DataFrame shape, and column data types.
Checked for missing values across all columns.
Generated basic descriptive statistics for numerical columns.
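
The exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of analysis.py: the helper names `load_metadata` and `explore` are hypothetical, and `low_memory=False` is an assumed setting to avoid mixed-type warnings on a large CSV.

```python
import pandas as pd


def load_metadata(path="metadata.csv"):
    """Load the CORD-19 metadata file into a DataFrame."""
    # low_memory=False (an assumption) makes pandas read each column in one pass.
    return pd.read_csv(path, low_memory=False)


def explore(df):
    """Collect the basic exploration outputs into one dictionary."""
    return {
        "head": df.head(),             # first five rows
        "shape": df.shape,             # (rows, columns)
        "dtypes": df.dtypes,           # data type of each column
        "missing": df.isnull().sum(),  # missing values per column
        "describe": df.describe(),     # descriptive statistics
    }
```

Calling `explore(load_metadata())` and printing each entry reproduces the checks listed in this step.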

2. Data Cleaning & Preparation

Handled missing values by dropping rows with missing abstract, publish_time, title, or journal.
Converted the publish_time column to datetime objects and extracted the year of publication.
Created a new derived feature: abstract_word_count.
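
A possible shape for the cleaning step is shown below. It is a sketch under stated assumptions: the function name `clean_metadata` and the derived column name `year` are hypothetical, and dropping rows whose date cannot be parsed is a choice made here, not something the project is confirmed to do.

```python
import pandas as pd

REQUIRED_COLUMNS = ["abstract", "publish_time", "title", "journal"]


def clean_metadata(df):
    """Return a cleaned copy with a year and an abstract_word_count column."""
    # Drop rows missing any of the four required fields.
    cleaned = df.dropna(subset=REQUIRED_COLUMNS).copy()

    # Convert to datetime; values that cannot be parsed become NaT.
    cleaned["publish_time"] = pd.to_datetime(cleaned["publish_time"], errors="coerce")
    # Assumption: rows with unparseable dates are discarded.
    cleaned = cleaned.dropna(subset=["publish_time"]).copy()

    cleaned["year"] = cleaned["publish_time"].dt.year
    # Count whitespace-separated words in each abstract.
    cleaned["abstract_word_count"] = cleaned["abstract"].str.split().str.len()
    return cleaned
```

If publish_time mixes formats (for example full dates alongside bare years), the parsing call may need adjusting for the pandas version in use.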

3. Data Analysis & Visualization

Counted the number of papers published each year.
Identified the top 10 journals by publication count.
Performed a simple word frequency analysis on paper titles to generate a word cloud.
Created the following visualizations:
- Line plot showing the number of publications over time.
- Bar chart displaying the top 10 publishing journals.
- Word cloud visualizing common terms in paper titles.
- Bar chart showing the distribution of paper counts by source (journal used as fallback if source_x is not available).
All visualizations were generated with Matplotlib and Seaborn and saved as PNG files in the plots/ directory.
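
The counting and plotting steps could look something like the sketch below. It assumes a cleaned DataFrame with `year`, `journal`, and `title` columns, uses plain Matplotlib only, and covers just the first two plots. The function names, the output file names, and the small stop-word list are illustrative assumptions. The word counts it returns could be passed to a word cloud library.

```python
import os
import re
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # write image files without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stop-word list (an assumption, not the project's list).
STOPWORDS = {"the", "of", "and", "in", "a", "an", "to", "for", "on", "with"}


def papers_per_year(df):
    """Number of papers per publication year, in chronological order."""
    return df["year"].value_counts().sort_index()


def top_journals(df, n=10):
    """The n journals with the most papers, largest first."""
    return df["journal"].value_counts().head(n)


def title_word_counts(df, n=50):
    """Most common lowercase words in titles, excluding stop words."""
    words = re.findall(r"[a-z]+", " ".join(df["title"].astype(str)).lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)


def save_plots(df, out_dir="plots"):
    """Save the line plot and the bar chart as PNG files; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []

    fig, ax = plt.subplots()
    papers_per_year(df).plot(kind="line", marker="o", ax=ax)
    ax.set(xlabel="Year", ylabel="Number of papers", title="Publications over time")
    paths.append(os.path.join(out_dir, "publications_over_time.png"))
    fig.savefig(paths[-1], bbox_inches="tight")
    plt.close(fig)

    fig, ax = plt.subplots()
    # Sort ascending so the largest bar ends up at the top of the chart.
    top_journals(df).sort_values().plot(kind="barh", ax=ax)
    ax.set(xlabel="Number of papers", title="Top publishing journals")
    paths.append(os.path.join(out_dir, "top_journals.png"))
    fig.savefig(paths[-1], bbox_inches="tight")
    plt.close(fig)

    return paths
```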

4. Streamlit Application

Developed an interactive web application (app.py) using Streamlit.
The app features:
- A clear title, description, and explanation of its purpose.
- Interactive widgets: a slider for filtering by publication year range and a dropdown for selecting specific journals.
- Dynamic display of the generated visualizations based on user selections.
- A sample table showing the head of the filtered dataset.
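
The widgets and filtering described above might be wired together along these lines. This is a sketch rather than the actual app.py: `filter_papers` and `render_app` are hypothetical names, the title text is a placeholder, and the DataFrame is assumed to have `year` and `journal` columns covering more than one year.

```python
import pandas as pd


def filter_papers(df, year_range, journal=None):
    """Keep rows within the inclusive year range and, optionally, one journal."""
    low, high = year_range
    mask = df["year"].between(low, high)
    if journal is not None:
        mask &= df["journal"] == journal
    return df[mask]


def render_app(df):
    """Draw the page: title, widgets, a chart, and a sample table."""
    # Imported here so filter_papers can be used without Streamlit installed.
    import streamlit as st

    st.title("CORD-19 Research Paper Explorer")  # placeholder title
    st.write("Filter papers by publication year and journal.")

    min_year, max_year = int(df["year"].min()), int(df["year"].max())
    # A tuple default turns the slider into a range slider.
    year_range = st.slider("Publication year", min_year, max_year, (min_year, max_year))
    choice = st.selectbox("Journal", ["All"] + sorted(df["journal"].unique()))

    filtered = filter_papers(df, year_range, None if choice == "All" else choice)
    st.line_chart(filtered["year"].value_counts().sort_index())
    st.dataframe(filtered.head())
```

In this layout, app.py would load and clean the data and then call `render_app(df)` at the bottom of the file. Keeping the filter in a plain function makes it testable without starting the web server.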

Instructions for Running the Project

Prerequisites

- Python 3.8+
- Git
- metadata.csv file downloaded from the CORD-19 Kaggle dataset and placed in the project root directory.

Setup and Installation

Clone the repository:

```
git clone <YOUR_GITHUB_REPO_URL>
cd Frameworks_Assignment
```

Create and activate a virtual environment:

```
python3 -m venv venv
# On Linux/macOS:
source venv/bin/activate
# On Windows (Command Prompt):
# venv\Scripts\activate.bat
# On Windows (PowerShell):
# venv\Scripts\Activate.ps1
```

Install dependencies:
```
pip install -r requirements.txt
```
Place metadata.csv: Ensure the metadata.csv file (downloaded from Kaggle) is placed directly in the Frameworks_Assignment directory.

Running the Analysis Script

To run the analysis.py script and generate the plots (saved in the plots/ directory):

```
python3 analysis.py
```

Running the Streamlit Application

To start the interactive Streamlit web application:

```
streamlit run app.py
```

Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

Key Findings and Reflections

The CORD-19 dataset is extensive, requiring robust data cleaning for effective analysis.
Publication trends show a significant increase in research output in recent years, particularly during the COVID-19 pandemic.
Certain journals consistently publish a high volume of research, indicating their prominence in the field.
Word clouds provide a quick visual summary of prevalent topics in research paper titles.
Streamlit offers a powerful and straightforward way to transform static analyses into interactive web applications, making insights more accessible to a broader audience.
Managing Python environments with venv is crucial for dependency management and avoiding conflicts.
