Educational Data Analysis Project

Project Overview
EDA Results Summary
ML Results Summary
Policy Recommendations
Collected Datasets
License
Project Progress
Folder Structure
Streamlit App Structure

Project Overview

This project analyzes educational data collected from United Nations Data (UNDATA). The data covers several aspects of education, including gender ratios, public expenditure, access to computers, and teaching staff across regions and countries. The datasets come from the Statistical Yearbook and help reveal trends and disparities in global education systems.

Supporting Materials

Presentation: Made using Canva
Kanban Board: Used for organization
Daily Task Planner Board: Used to keep on top of time management
Application: Created using Streamlit to present findings

EDA Results Summary

For a detailed view of the complete EDA results, please visit Full EDA Results (EDA_results.md).

1. Year

Span: 2005–2022, with a mean observation year of 2012.96.
Context: Concentration in high-income countries, reflecting trends shaped by events like the 2008 global financial crisis.

2. Region/Country/Area

High-Income Coverage: Strong data for countries like the U.S., Germany, and Japan.
Data Gaps: Sparse data from Albania, Moldova, and Montenegro.
Frequency Leaders: High reporting in China, U.S., India, and Russia.

3. Staff Compensation (% of Public Expenditure)

Average: 50.77%, higher in developed economies.
Outliers: Some developing nations report 0%, indicating reporting gaps or alternative funding.

4. Access to Computers by Education Level

Disparities:
- Primary: 44.6% access.
- Lower Secondary: 33.3%.
- Upper Secondary: 34.1%.
Digital Divide: High-income nations offer near-universal access; gaps persist in developing regions.

5. Capital Expenditure (% of Total Public Education Spending)

Low Average: 6.09%, with a focus on operational costs over infrastructure.
Regional Variance: Higher in developed nations; limited in smaller economies.

6. Other Current Expenditures

Average: 17.08%, reflecting operational cost allocations outside staff compensation.

7. Gross Enrollment Ratios (GER)

Primary and Secondary: Near-universal GER in developed nations; challenges in developing regions.
Gender Trends: Slight male advantage in early levels, with upper secondary education showing parity.

8. Ratio of Girls to Boys

Parity: Near-equal ratios across levels, with slight advantages for girls in upper secondary in developed regions.
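
As a rough illustration of how the two indicators above are usually defined, the sketch below computes a gross enrollment ratio and a girls-to-boys parity index. The function names and the sample figures are made up for illustration; they are not taken from the project's notebooks or from the UNDATA files.

```python
def gross_enrollment_ratio(enrolled: float, official_age_population: float) -> float:
    """GER in percent: total enrollment (any age) over the official-age population."""
    if official_age_population <= 0:
        raise ValueError("official_age_population must be positive")
    return 100.0 * enrolled / official_age_population


def parity_index(female_value: float, male_value: float) -> float:
    """Ratio of the female to the male value of an indicator; 1.0 means parity."""
    if male_value <= 0:
        raise ValueError("male_value must be positive")
    return female_value / male_value


# Hypothetical numbers, for illustration only.
ger_girls = gross_enrollment_ratio(enrolled=4_900, official_age_population=5_000)
ger_boys = gross_enrollment_ratio(enrolled=5_150, official_age_population=5_000)
gpi = parity_index(ger_girls, ger_boys)
print(f"GER girls: {ger_girls:.1f}%, GER boys: {ger_boys:.1f}%, GPI: {gpi:.3f}")
```

Because GER counts enrolled students of any age, it can exceed 100% when over-age or under-age students are enrolled, which is why the sketch does not cap the result.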

9. Teachers by Education Level

Shortages: Noticeable in lower secondary and upper secondary levels, particularly in under-resourced areas.

10. Teacher Qualifications

Regional Disparities: Higher qualification standards in developed countries, with gaps in developing nations.

Key Trends

Digital inequality persists.
Focus remains on operational costs over infrastructure.
Teacher availability and qualifications vary significantly.
Gender parity is largely achieved in enrollment.
Significant disparities exist between developed and developing regions.

ML Results Summary

For a detailed view of the complete ML results, please visit Full ML Results (ML_results.md).

KNN Model Performance

Classifier

Best Setup: 70/30 split, MinMaxScaler, 52.94% accuracy.
Challenges: Struggles with class imbalances, low recall and F1-scores.
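
To make the setup above concrete, here is a small, dependency-free sketch of a 70/30 split, min-max scaling, and a k-nearest-neighbours vote. It is not the project's code: the project most likely used a library implementation (MinMaxScaler is a scikit-learn class name), and the toy data, k=3, and all function names here are assumptions.

```python
import math
import random
from collections import Counter


def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle indices and hold out `test_fraction` of the samples for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


def fit_min_max(rows):
    """Learn the per-feature minimum and maximum from the training rows only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each feature using the training minimum and maximum."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(
        zip(train_rows, train_labels), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical, not project data).
rows = [[1, 10], [2, 12], [1.5, 11], [2.5, 9], [1.2, 10.5],
        [8, 50], [9, 52], [8.5, 51], [9.5, 49], [8.2, 50.5]]
labels = ["low"] * 5 + ["high"] * 5

X_train, X_test, y_train, y_test = train_test_split(rows, labels)
mins, maxs = fit_min_max(X_train)
X_train_s = apply_min_max(X_train, mins, maxs)
X_test_s = apply_min_max(X_test, mins, maxs)

predictions = [knn_predict(X_train_s, y_train, q) for q in X_test_s]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(f"accuracy: {accuracy:.2%}")
```

The scaler is fitted on the training rows only and then applied to the test rows, so no information from the held-out data leaks into the scaling step.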

Regressor

Best Setup: 60/40 split, StandardScaler, MSE of 16.68.
Insights: R-squared of 0.34; improvements needed in predictive accuracy.

Advanced Models

Random Forest vs. Stacked Regressor:
- Both achieve R-squared = 0.47.
- MSE: 13.41 (Random Forest), 13.26 (Stacked Regressor).
- Share of predictions within a tight tolerance of the true value: 0%.
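
The three figures above measure different things, which is why a moderate R-squared can sit next to a 0% tolerance accuracy. The sketch below shows one plausible way to compute them. The tolerance values and the sample numbers are assumptions, since the README does not state which tolerance the project used.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_within_tolerance(y_true, y_pred, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical values, for illustration only.
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 11.0, 17.0, 18.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
tight = accuracy_within_tolerance(y_true, y_pred, tolerance=0.5)
loose = accuracy_within_tolerance(y_true, y_pred, tolerance=1.0)
print(f"MSE={mse:.2f}, R2={r2:.2f}, within 0.5: {tight:.0%}, within 1.0: {loose:.0%}")
```

In this toy case the predictions follow the overall trend, so R-squared is fairly high, yet no single prediction lands within 0.5 of its target.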

Improvement Recommendations

For Classification:
- Address imbalances using techniques like SMOTE.
- Explore advanced classifiers (e.g., Gradient Boosting).
For Regression:
- Improve feature engineering and try models like LightGBM.
For Ensemble Models:
- Optimize hyperparameters and base model diversity.
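
SMOTE balances classes by synthesizing new minority samples through interpolation between neighbours. The sketch below deliberately shows the simpler random oversampling instead, which only duplicates existing minority rows, to illustrate the balancing idea without extra dependencies. The function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target - count)  # empty for the majority class
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy imbalanced data (hypothetical): four samples of class "a", two of class "b".
rows = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
labels = ["a", "a", "a", "a", "b", "b"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Whichever technique is used, resampling is normally applied to the training split only, so that duplicated or synthetic rows never appear in the evaluation data.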

Final Suggestions

Focus on balancing datasets for classification models.
Refine regression models with advanced methodologies and robust cross-validation.

Policy Recommendations

Caveat: These are mock UN recommendations written by Ceci for this project, based on the analysis above.

For a detailed view of the complete recommendations, please visit Full Recommendations (recommendations_page.md).

1. Enhance Educational Data Reporting and Accessibility

Standardize data collection and reporting systems across countries.
Encourage international organizations to support low-income regions with data reporting.

2. Invest in Teacher Training and Professional Development

Focus on improving teacher training, especially in developing regions.
Promote international partnerships to share best practices and provide financial incentives for teachers.

3. Address the Digital Divide through Technology Investment

Promote universal access to digital tools and infrastructure in secondary education.
Collaborate with the private sector to support digital literacy programs.

4. Promote Gender Parity in Secondary and Upper Secondary Education

Strengthen initiatives to reduce gender disparities in secondary education.
Implement scholarships, mentorship, and gender-sensitive policies to support female students.

5. Address Teacher Shortages in Low-Income and Conflict-Affected Regions

Increase teacher recruitment and professional development in underserved areas.
Provide targeted support through international partnerships.

6. Promote Balanced Investment in Education Infrastructure

Allocate balanced public education expenditure toward both current costs and long-term infrastructure investments.
Urge international financial institutions to prioritize infrastructure in low- and middle-income countries.

7. Support Long-Term Strategic Planning for Education Systems

Advocate for sustainable education policies focusing on teacher quality, gender equality, and infrastructure.
Align national education plans with global frameworks such as the SDGs and Education 2030 Agenda.

8. Facilitate International Collaboration and Knowledge Sharing

Foster global partnerships for knowledge exchange to address common educational challenges.
Support multilateral platforms for sharing research and scaling successful initiatives.

9. Develop Targeted Programs for Vulnerable Regions

Implement region-specific programs for areas affected by economic instability, conflict, or political challenges.
Mobilize funding for mobile schools, digital learning platforms, and community-based education solutions.

10. Enhance International Investment in Education

Urge increased financing for education in low-income and conflict-affected regions.
Support innovative financing mechanisms like education bonds and public-private partnerships.

Collected Datasets

The following datasets have been collected from UNDATA:

Ratio of Girls to Boys in Education
- Dataset ID: SYB67_319_202411
- Description: This dataset provides the ratio of girls to boys in education across different countries and regions.
Public Expenditure on Education and Access to Computers
- Dataset ID: SYB67_245_202411
- Description: This dataset presents the public expenditure on education, along with data on the availability of computers in educational institutions.
Teaching Staff in Education
- Dataset ID: SYB67_323_202411
- Description: This dataset outlines the number of teaching staff in the education sector across various countries and regions.
Education Statistics
- Dataset ID: SYB67_309_202411
- Description: This dataset provides a comprehensive overview of various education-related statistics, such as enrollment rates, graduation rates, and literacy rates.

Additional Datasets

Literacy Rates:

Youth Literacy Rate, Population 15-24 Years (%)
Identified as UNdata_Export_20241213_140703208 in the files.
Youth Literacy Rate, Population 15-24 Years, Gender Parity Index (GPI)
Identified as UNdata_Export_20241213_140708283 in the files.

License

This project is based on publicly available data from UNDATA. Please refer to the UN Data Usage Policy for licensing and attribution information.

Project Progress

Handling Data:

Change column names.
Drop the row at index 0 (which contained the column names).
Look for null values.
Describe data by Region Code and Value.
Get all unique values.
Fill null values with mode in the "Footnotes" column.
Gather further data.
Change and normalize format.
Filter dataset for the regions needed.
Keep only the years that have the most data for the largest number of countries.
Repeat the previous step for the regions.
Rename the files.
Load the files into a Jupyter notebook and compute a correlation matrix for each variable, first filling all NaN values with 0.
Plot several figures for the variables by year, mostly bar plots and violin plots, using Seaborn and Plotly and disregarding 0 values.
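
The first cleaning steps in this list can be sketched roughly as follows. This is a dependency-free illustration using only the standard library, whereas the project itself worked with dataframes in Jupyter. The raw layout, the column names, and the sample values are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical raw export: a generic header line, then a row that repeats
# the real column names, then the data.
RAW = """c0,c1,c2,c3,c4
Region Code,Region,Year,Value,Footnotes
4,Afghanistan,2010,0.69,Estimated
4,Afghanistan,2015,0.71,
8,Albania,2010,0.99,Estimated
8,Albania,2015,,Estimated
"""

NEW_NAMES = ["region_code", "region", "year", "value", "footnotes"]

reader = csv.reader(io.StringIO(RAW))
next(reader)  # discard the generic header; NEW_NAMES replaces it
records = [dict(zip(NEW_NAMES, row)) for row in reader]
records = records[1:]  # drop row 0, which only repeats the column names

# Look for null values (empty strings in a CSV).
null_counts = {name: sum(1 for r in records if r[name] == "") for name in NEW_NAMES}

# Fill missing footnotes with the most frequent non-empty footnote (the mode).
footnotes = [r["footnotes"] for r in records if r["footnotes"]]
mode_footnote = Counter(footnotes).most_common(1)[0][0]
for r in records:
    if not r["footnotes"]:
        r["footnotes"] = mode_footnote

# Normalize types; missing values stay as None rather than being guessed.
for r in records:
    r["year"] = int(r["year"])
    r["value"] = float(r["value"]) if r["value"] else None

# Filter the dataset for the regions needed.
wanted = {"Albania"}
filtered = [r for r in records if r["region"] in wanted]
print(null_counts)
print(filtered)
```

Filling with the mode suits a categorical column such as the footnotes. Filling numeric NaN values with 0 before a correlation matrix, as described above, is a different choice that can bias the correlations, which is presumably why the plots disregard 0 values.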

Preliminary EDA:

Started EDA with EDA_countries Jupyter notebook.
Placed all the formatted CSV dataframes.
Merged dataframes.
Created a preliminary correlation matrix.
Proceeded with analysis and description of relevant columns into a summary table.
Output the summary table into a CSV file.
Created violin and box plots.
Made more visualizations using Seaborn and Matplotlib.
Created a line plot for enrollment trends.
Moved on to more complex EDA.

Complex EDA:

(All of the relevant results are available in Full EDA Results)

Created a new EDA notebook: full_analytics.
Loaded only the merged CSV and checked for data types, null values, and duplicates.
Made a summary table and a categorical summary.
Plotted all the enrollment trend lines, plus one clean combined plot showing all of them together.
Made a trend table and interpreted the results.
Created a bar chart for staff compensation as expenditure data.
Created two more charts with better views of the relevant data.
Made a comparison of expenditure data by country (region).
Plotted a new correlation matrix for the countries being analyzed.
Created a histogram for teacher distribution by education level.
Made line plots for teacher distribution by level and year, and also for qualifications.
Created a table to iterate on the data for the analytics file.
Made an enrollment ratio plot.
Plotted a graph to detect outliers.
Created aggregate tables (only kept one in the file).
Made comparison scatter plots between expenditure and enrollment ratios, using only the aggregate data.
Created a missing data matrix.

MySQL Progress:

Attempted to create the database through the Python connector, without success.
Proceeded to create the database in MySQL Workbench directly.
Proceeding with each .csv file.
Creating a table for countries and for years separately, to relate all tables to each other.
Aggregating data from existing tables into a full table with only the chosen list of countries.
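
The layout described above, with separate country and year tables that the other tables refer to, might look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. All table names, column names, and values are hypothetical and are not taken from the project's mysql_scripts folder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (
    country_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE
);
CREATE TABLE years (
    year INTEGER PRIMARY KEY
);
CREATE TABLE enrollment (
    country_id INTEGER NOT NULL REFERENCES countries(country_id),
    year       INTEGER NOT NULL REFERENCES years(year),
    ger        REAL,
    PRIMARY KEY (country_id, year)
);
CREATE TABLE expenditure (
    country_id             INTEGER NOT NULL REFERENCES countries(country_id),
    year                   INTEGER NOT NULL REFERENCES years(year),
    staff_compensation_pct REAL,
    PRIMARY KEY (country_id, year)
);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany("INSERT INTO countries VALUES (?, ?)", [(1, "Albania"), (2, "Japan")])
conn.executemany("INSERT INTO years VALUES (?)", [(2010,), (2015,)])
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [(1, 2010, 95.0), (2, 2010, 101.5), (2, 2015, 100.8)],
)
conn.executemany(
    "INSERT INTO expenditure VALUES (?, ?, ?)",
    [(1, 2010, 60.0), (2, 2010, 55.5)],
)

# Aggregate into one "full" table restricted to the chosen countries.
chosen = ("Japan",)
full_table = conn.execute(
    """
    SELECT c.name, e.year, e.ger, x.staff_compensation_pct
    FROM enrollment AS e
    JOIN countries AS c ON c.country_id = e.country_id
    LEFT JOIN expenditure AS x
           ON x.country_id = e.country_id AND x.year = e.year
    WHERE c.name IN (?)
    ORDER BY e.year
    """,
    chosen,
).fetchall()
print(full_table)
```

The LEFT JOIN keeps country-year pairs that have enrollment data but no expenditure data, leaving the missing value as NULL instead of dropping the row.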

Machine Learning Models:

Make a list of models and run regressor and classifier versions whenever possible.
Decide a target variable.
Start running various models test by test.
Implement features into the models that run better.
Run both types of KNN.
Try multiple splits.
Get a confusion matrix.
Choose the splits to continue with and scaling types.
Apply grid search.
Apply SMOTE to KNN classifier.
Use random oversampler.
Use class weights.
Try again with KNN regressor.
Arrive at no conclusion.
Try regression ensemble models.
Attempt random forest regressor.
Attempt stacking regressor.

Streamlit App Development:

Decide on a schematic.
Create a Jupyter notebook for function aggregation and to keep things organized.
Make the folders.
Start creating functions.
Test app.
Create pages.
Test multiple times at every step.
Keep proceeding with EDA.
Decide to make a Markdown (.md) file to make it easier to import information into the EDA page.
Make an EDA page.
Import information in phases.
Test between phases.

Presentation Construction:

Look at available models.
Narrow down search.
Choose a color scheme and vector theme.
Pick one model.
Start presentation.
Decide on a title.
Make notes on progress time and steps (this whole list).
Decide what presentation must include.
Plot what to say, when to say it, and where to insert the information.
Decide main steps.
Finish presentation.

Folder Structure

The project is organized into the following folders:

working_notebooks: Notebooks that contain work in progress and are not organized.
- unused_notebooks
usable_notebooks: Cleaned-up notebooks that present findings in an organized manner.
- correlations
- plots
- barfigures
- full_analysis
data: Holds the data used for the project, in separate folders:
- raw:
  - literacy_rates
  - unused_data
    - enrollment_data
    - government_expenditure
- cleaned
- filtered
- merged
  - variables
  - years_vars
  - final
mysql_scripts: Collects MySQL scripts and tables created for the project directly in MySQL Workbench.
app_files: Contains all the files that pertain to the Streamlit app written for the project.
slides: Presentation PDF for ease of view.

Streamlit App Structure

The Streamlit app is structured as follows:

app_files/: Main folder containing app-related files.
- app.py: Main script to run the Streamlit app.
- data_loader.py: Handles data loading and processing.
- pages/: Contains individual pages of the app.
  - opener.py: Page that confirms data has been loaded and presents project details.
  - introduction.py: Provides project introduction and overview.
  - ml_results.py: Displays machine learning results and findings.
- utils_/: Folder for utility functions.
  - display.py: Helper functions for displaying data.

This app allows users to explore different aspects of the education data, verify the results of the machine learning models run for the project, and analyze trends in education. Navigation buttons within the app enable users to seamlessly switch between sections.
