CORD-19 Research Paper Analysis

Project Purpose and Learning Objectives

This project demonstrates a complete data science workflow: loading and cleaning data, analyzing and visualizing it, and building an interactive web application. The primary learning objectives are:

  • Proficiency in using Python libraries like Pandas, Matplotlib, Seaborn, and Streamlit.
  • Understanding and implementing data cleaning and preparation techniques.
  • Performing exploratory data analysis and generating meaningful visualizations.
  • Building an interactive web application to showcase data insights.
  • Adhering to professional Git practices for version control.

Dataset Source

The dataset used in this project is metadata.csv from the CORD-19 (COVID-19 Open Research Dataset) Research Challenge available on Kaggle.

Steps Implemented

1. Data Loading & Basic Exploration

  • Downloaded and loaded metadata.csv into a Pandas DataFrame.
  • Displayed initial rows, DataFrame shape, and column data types.
  • Checked for missing values across all columns.
  • Generated basic descriptive statistics for numerical columns.
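The exploration steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual analysis.py: the helper names `load_metadata` and `explore` are hypothetical, and the inline sample rows are made up so the snippet runs without the real file.

```python
import io

import pandas as pd

# Made-up stand-in rows, not real CORD-19 records.
SAMPLE_CSV = """title,abstract,publish_time,journal,source_x
Paper A,Some abstract text,2020-03-01,Journal X,PMC
Paper B,,2020-04-10,Journal Y,Medline
Paper C,Another abstract here,2020-05-12,,PMC
"""


def load_metadata(source):
    """Read the metadata CSV from a path or a file-like object."""
    # low_memory=False makes pandas infer each column's dtype from the whole file.
    return pd.read_csv(source, low_memory=False)


def explore(df):
    """Collect the basic exploration summaries in one dict."""
    return {
        "head": df.head(),                # first rows
        "shape": df.shape,                # (rows, columns)
        "dtypes": df.dtypes,              # column data types
        "missing": df.isnull().sum().sort_values(ascending=False),
        "describe": df.describe(),        # numeric columns by default, when any exist
    }


sample = load_metadata(io.StringIO(SAMPLE_CSV))
summary = explore(sample)
print(summary["shape"])
print(summary["missing"])
```

With the real dataset, `load_metadata("metadata.csv")` would replace the in-memory sample.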

2. Data Cleaning & Preparation

  • Handled missing values by dropping rows with missing abstract, publish_time, title, or journal.
  • Converted the publish_time column to datetime objects and extracted the year of publication.
  • Created a new derived feature: abstract_word_count.
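A possible shape for the cleaning step is sketched below. The function name `clean_metadata` and the derived column name `year` are assumptions, as is the choice to drop rows whose publish_time cannot be parsed; the README does not say how unparseable dates were handled. The demo rows are invented.

```python
import pandas as pd

# Columns the README lists as required to be present.
REQUIRED_COLUMNS = ["abstract", "publish_time", "title", "journal"]


def clean_metadata(df):
    """Drop incomplete rows, parse dates, and add the derived columns."""
    cleaned = df.dropna(subset=REQUIRED_COLUMNS).copy()

    # Unparseable dates become NaT instead of raising.
    cleaned["publish_time"] = pd.to_datetime(cleaned["publish_time"], errors="coerce")
    # Assumption: rows with an unparseable date are dropped as well.
    cleaned = cleaned.dropna(subset=["publish_time"])

    cleaned["year"] = cleaned["publish_time"].dt.year.astype(int)
    # Whitespace-separated token count of each abstract.
    cleaned["abstract_word_count"] = cleaned["abstract"].str.split().str.len()
    return cleaned


# Invented demo rows, not real CORD-19 records.
demo = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
        "abstract": [
            "Masks reduce transmission",
            None,
            "Short abstract",
            "A longer abstract with six words",
        ],
        "publish_time": ["2020-03-01", "2020-04-10", "unknown", "2021-07-15"],
        "journal": ["Journal X", "Journal Y", "Journal Y", "Journal X"],
    }
)
cleaned_demo = clean_metadata(demo)
print(cleaned_demo[["title", "year", "abstract_word_count"]])
```

Note that the real publish_time column may mix full dates with year-only values, and how those parse can depend on the installed pandas version, so it is worth checking how many rows survive this step.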

3. Data Analysis & Visualization

  • Counted the number of papers published each year.
  • Identified the top 10 journals by publication count.
  • Performed a simple word frequency analysis on paper titles to generate a word cloud.
  • Created the following visualizations:
    • Line plot showing the number of publications over time.
    • Bar chart displaying the top 10 publishing journals.
    • Word cloud visualizing common terms in paper titles.
    • Bar chart showing paper counts by source, using source_x when that column is present and falling back to journal otherwise.
  • All visualizations were generated using Matplotlib and Seaborn, saved as PNG files in a plots/ directory.
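The analysis steps listed above might look something like this sketch. It assumes a cleaned DataFrame with `year`, `journal`, and `title` columns; all function names, the small `STOPWORDS` set, and the output file name `publications_per_year.png` are hypothetical. Only the line plot is shown, using plain Matplotlib, and the word frequencies are computed with the standard library rather than a word cloud package.

```python
import re
from collections import Counter
from pathlib import Path

import pandas as pd

# Hypothetical, deliberately small stop-word list for illustration.
STOPWORDS = {"the", "and", "of", "in", "for", "with", "from", "to", "on", "a", "an"}


def papers_per_year(df):
    """Number of papers per publication year, in chronological order."""
    return df["year"].value_counts().sort_index()


def top_journals(df, n=10):
    """The n journals with the most papers."""
    return df["journal"].value_counts().head(n)


def source_counts(df):
    """Paper counts by source_x, falling back to journal if source_x is absent."""
    column = "source_x" if "source_x" in df.columns else "journal"
    return df[column].value_counts()


def title_word_counts(df, n=20):
    """Most common title words, lowercased, ignoring stop words."""
    text = " ".join(df["title"].dropna()).lower()
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)


def save_year_plot(counts, out_dir="plots"):
    """Save a line plot of publications over time as a PNG and return its path."""
    import matplotlib

    matplotlib.use("Agg")  # render to files without needing a display
    import matplotlib.pyplot as plt

    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(counts.index, counts.values, marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of papers")
    ax.set_title("Publications over time")
    fig.tight_layout()
    path = out / "publications_per_year.png"
    fig.savefig(path)
    plt.close(fig)
    return path
```

The pairs returned by `title_word_counts` could then be handed to whichever word cloud tool the project uses.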

4. Streamlit Application

  • Developed an interactive web application (app.py) using Streamlit.
  • The app features:
    • A clear title, description, and explanation of its purpose.
    • Interactive widgets: a slider for filtering by publication year range and a dropdown for selecting specific journals.
    • Dynamic display of the generated visualizations based on user selections.
    • A sample table showing the head of the filtered dataset.
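As a rough idea of how such an app could be wired together, here is a minimal sketch. It is not the project's actual app.py: `filter_papers`, `load_clean_data`, `ALL_JOURNALS`, the widget labels, and the choice to offer only the 50 most frequent journals are all assumptions. Streamlit is imported inside `main()` so the filtering helper can be used and tested on its own.

```python
from pathlib import Path

import pandas as pd

DATA_PATH = Path("metadata.csv")
ALL_JOURNALS = "All journals"  # hypothetical label for "no journal filter"


def load_clean_data(path):
    """Load the columns the app needs and derive the publication year."""
    df = pd.read_csv(path, usecols=["title", "journal", "publish_time"], low_memory=False)
    df = df.dropna(subset=["title", "journal", "publish_time"]).copy()
    df["publish_time"] = pd.to_datetime(df["publish_time"], errors="coerce")
    df = df.dropna(subset=["publish_time"])
    df["year"] = df["publish_time"].dt.year.astype(int)
    return df


def filter_papers(df, year_range, journal=None):
    """Keep rows within the inclusive year range and, optionally, one journal."""
    first_year, last_year = year_range
    mask = df["year"].between(first_year, last_year)
    if journal and journal != ALL_JOURNALS:
        mask &= df["journal"] == journal
    return df[mask]


def main():
    import streamlit as st  # imported lazily; only needed when the UI runs

    df = load_clean_data(DATA_PATH)
    st.title("CORD-19 Data Explorer")
    st.write("Explore CORD-19 paper metadata by publication year and journal.")

    min_year, max_year = int(df["year"].min()), int(df["year"].max())
    year_range = st.slider("Publication year range", min_year, max_year, (min_year, max_year))

    # Assumption: limit the dropdown to the most frequent journals to keep it usable.
    journals = [ALL_JOURNALS] + list(df["journal"].value_counts().head(50).index)
    journal = st.selectbox("Journal", journals)

    filtered = filter_papers(df, year_range, journal)
    st.line_chart(filtered["year"].value_counts().sort_index())
    st.dataframe(filtered.head())


# Only start the UI when run as a script and the data file is present.
if __name__ == "__main__":
    if DATA_PATH.exists():
        main()
    else:
        print(f"{DATA_PATH} not found; place it in the project root first.")
```

Keeping the filter logic in a plain function means the slider and dropdown only supply values, so the same filtering can be reused in analysis.py or in unit tests.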

Instructions for Running the Project

Prerequisites

  • Python 3.8+
  • Git
  • metadata.csv file downloaded from the CORD-19 Kaggle dataset and placed in the project root directory.

Setup and Installation

  1. Clone the repository:
    git clone <YOUR_GITHUB_REPO_URL>
    cd Frameworks_Assignment
  2. Create and activate a virtual environment:
    python3 -m venv venv
    # On Linux/macOS:
    source venv/bin/activate
    # On Windows (Command Prompt):
    # venv\Scripts\activate.bat
    # On Windows (PowerShell):
    # venv\Scripts\Activate.ps1
  3. Install dependencies:
    pip install -r requirements.txt
  4. Place metadata.csv: Ensure the metadata.csv file (downloaded from Kaggle) is placed directly in the Frameworks_Assignment directory.

Running the Analysis Script

To run the analysis.py script and generate the plots (saved in the plots/ directory):

python3 analysis.py

Running the Streamlit Application

To start the interactive Streamlit web application:

streamlit run app.py

Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

Key Findings and Reflections

  • The CORD-19 dataset is extensive, requiring robust data cleaning for effective analysis.
  • Yearly publication counts show a marked increase in research output around the COVID-19 pandemic period.
  • Certain journals consistently publish a high volume of research, indicating their prominence in the field.
  • Word clouds provide a quick visual summary of prevalent topics in research paper titles.
  • Streamlit offers a powerful and straightforward way to transform static analyses into interactive web applications, making insights more accessible to a broader audience.
  • Managing Python environments with venv is crucial for dependency management and avoiding conflicts.