- Project Overview
- EDA Results Summary
- ML Results Summary
- Policy Recommendations
- Collected Datasets
- Project Progress
- License
- Folder Structure
- Streamlit App Structure
This project involves analyzing educational data collected from the United Nations Data (UNDATA). The data covers various aspects of education, including gender ratios, public expenditure, access to computers, and teaching staff in different regions and countries. The datasets provided are from the Statistical Yearbook and are crucial for understanding trends and disparities in global education systems.
- Presentation: Made using Canva
- Kanban Board: Used for organization
- Daily Task Planner Board: Used for time management
- Application: Created using Streamlit to present findings
For a detailed view of the complete EDA results, please visit Full EDA Results.
- Span: 2005–2022, with an average focus year of 2012.96.
- Context: Concentration in high-income countries, reflecting trends shaped by events like the 2008 global financial crisis.
- High-Income Coverage: Strong data for countries like the U.S., Germany, and Japan.
- Data Gaps: Sparse data from Albania, Moldova, and Montenegro.
- Frequency Leaders: High reporting in China, U.S., India, and Russia.
- Average: 50.77%, higher in developed economies.
- Outliers: Some developing nations report 0%, indicating reporting gaps or alternative funding.
- Disparities:
- Primary: 44.6% access.
- Lower Secondary: 33.3%.
- Upper Secondary: 34.1%.
- Digital Divide: High-income nations offer near-universal access; gaps persist in developing regions.
- Low Average: 6.09%, with a focus on operational costs over infrastructure.
- Regional Variance: Higher in developed nations; limited in smaller economies.
- Average: 17.08%, reflecting operational cost allocations outside staff compensation.
- Primary and Secondary: Near-universal GER in developed nations; challenges in developing regions.
- Gender Trends: Slight male advantage in early levels, with upper secondary education showing parity.
- Parity: Near-equal ratios across levels, with slight advantages for girls in upper secondary in developed regions.
- Shortages: Noticeable in lower secondary and upper secondary levels, particularly in under-resourced areas.
- Regional Disparities: Higher qualification standards in developed countries, with gaps in developing nations.
- Digital inequality persists.
- Focus remains on operational costs over infrastructure.
- Teacher availability and qualifications vary significantly.
- Gender parity is largely achieved in enrollment.
- Significant disparities exist between developed and developing regions.
For a detailed view of the complete ML results, please visit Full ML Results.
- Best Setup: 70/30 split, MinMaxScaler, 52.94% accuracy.
- Challenges: Struggles with class imbalances, low recall and F1-scores.
- Best Setup: 60/40 split, StandardScaler, MSE of 16.68.
- Insights: R-squared of 0.34; improvements needed in predictive accuracy.
- Random Forest vs. Stacked Regressor:
- Both achieve R-squared = 0.47.
- MSE: 13.41 (Random Forest), 13.26 (Stacked Regressor).
- Accuracy within tight tolerance: 0%.
- For Classification:
- Address imbalances using techniques like SMOTE.
- Explore advanced classifiers (e.g., Gradient Boosting).
- For Regression:
- Improve feature engineering and try models like LightGBM.
- For Ensemble Models:
- Optimize hyperparameters and base model diversity.
- Focus on balancing datasets for classification models.
- Refine regression models with advanced methodologies and robust cross-validation.
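The regression figures reported above (MSE, R-squared, and accuracy within a tolerance) can be reproduced in spirit with the minimal sketch below. The function names, the sample values, and the tolerance of 0.5 are assumptions made for illustration; the tolerance actually used in the project is not stated here.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_within_tolerance(
    y_true: Sequence[float], y_pred: Sequence[float], tol: float
) -> float:
    """Share of predictions whose absolute error is at most `tol`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)


# Hypothetical targets and predictions, not project data.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 11.0, 15.0, 15.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
acc = accuracy_within_tolerance(y_true, y_pred, tol=0.5)  # assumed tolerance
```

In this toy example every prediction is off by exactly 1, so the model follows the trend yet no prediction lands within the assumed tolerance. A positive R-squared and 0% tolerance accuracy can therefore coexist.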
Caveat: These are mock UN recommendations, written by Ceci for this project and based on the analysis above.
For a detailed view of the complete Recommendations, please visit Full Recommendations.
- Standardize data collection and reporting systems across countries.
- Encourage international organizations to support low-income regions with data reporting.
- Focus on improving teacher training, especially in developing regions.
- Promote international partnerships to share best practices and provide financial incentives for teachers.
- Promote universal access to digital tools and infrastructure in secondary education.
- Collaborate with the private sector to support digital literacy programs.
- Strengthen initiatives to reduce gender disparities in secondary education.
- Implement scholarships, mentorship, and gender-sensitive policies to support female students.
- Increase teacher recruitment and professional development in underserved areas.
- Provide targeted support through international partnerships.
- Allocate balanced public education expenditure toward both current costs and long-term infrastructure investments.
- Urge international financial institutions to prioritize infrastructure in low- and middle-income countries.
- Advocate for sustainable education policies focusing on teacher quality, gender equality, and infrastructure.
- Align national education plans with global frameworks such as the SDGs and Education 2030 Agenda.
- Foster global partnerships for knowledge exchange to address common educational challenges.
- Support multilateral platforms for sharing research and scaling successful initiatives.
- Implement region-specific programs for areas affected by economic instability, conflict, or political challenges.
- Mobilize funding for mobile schools, digital learning platforms, and community-based education solutions.
- Urge increased financing for education in low-income and conflict-affected regions.
- Support innovative financing mechanisms like education bonds and public-private partnerships.
The following datasets have been collected from UNDATA:
- Ratio of Girls to Boys in Education
  - Dataset ID: SYB67_319_202411
  - Description: This dataset provides the ratio of girls to boys in education across different countries and regions.
- Public Expenditure on Education and Access to Computers
  - Dataset ID: SYB67_245_202411
  - Description: This dataset presents the public expenditure on education, along with data on the availability of computers in educational institutions.
-
- Dataset ID: SYB67_323_202411
- Description: This dataset outlines the number of teaching staff in the education sector across various countries and regions.
-
- Dataset ID: SYB67_309_202411
- Description: This dataset provides a comprehensive overview of various education-related statistics, such as enrollment rates, graduation rates, and literacy rates.
- Youth Literacy Rate, Population 15-24 Years (%)
  - Identified as UNdata_Export_20241213_140703208 in the files.
- Youth Literacy Rate, Population 15-24 Years, Gender Parity Index (GPI)
  - Identified as UNdata_Export_20241213_140708283 in the files.
This project is based on publicly available data from UNDATA. Please refer to the UN Data Usage Policy for licensing and attribution information.
- Change column names.
- Drop the row at index 0, which contained the column names.
- Look for null values.
- Describe data by Region Code and Value.
- Get all unique values.
- Fill null values with mode in the "Footnotes" column.
- Gather further data.
- Change and normalize format.
- Filter dataset for the regions needed.
- Keep only the years that have the most data for the largest number of countries.
- Repeat the previous step for the regions.
- Rename the files.
- Load files into a Jupyter notebook, run a correlation matrix for each variable, making sure all NaN values are filled with 0.
- Run several figures for the various variables by year, mostly bar plots and violin plots (disregarding 0 values using Seaborn and Plotly).
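The first cleaning steps in the list above could look roughly like the following pandas sketch. The column names "Region Code", "Value", and "Footnotes" come from the steps themselves; the remaining column names, the sample rows, and the helper `clean_export` are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical raw export: the first data row repeats the column names.
raw = pd.DataFrame(
    {
        0: ["Region Code", 4, 4, 8],
        1: ["Region", "Afghanistan", "Afghanistan", "Albania"],
        2: ["Year", 2005, 2010, 2010],
        3: ["Value", 0.59, 0.69, None],
        4: ["Footnotes", "Estimate.", None, "Estimate."],
    }
)


def clean_export(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, drop the header row, and fill missing footnotes."""
    out = df.copy()
    out.columns = ["region_code", "region", "year", "value", "footnotes"]
    # Row 0 only holds the column names, so it is dropped.
    out = out.drop(index=0).reset_index(drop=True)
    # Convert the numeric columns; unparseable values become NaN.
    out["year"] = pd.to_numeric(out["year"], errors="coerce").astype(int)
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    # Fill missing footnotes with the most frequent footnote (the mode).
    footnote_mode = out["footnotes"].mode()
    if not footnote_mode.empty:
        out["footnotes"] = out["footnotes"].fillna(footnote_mode.iloc[0])
    return out


cleaned = clean_export(raw)

# Null check, then the number of regions with a value in each year.
null_counts = cleaned.isna().sum()
coverage = cleaned.dropna(subset=["value"]).groupby("year")["region"].nunique()

# Filter to the regions needed (an assumed list).
regions_needed = ["Afghanistan"]
filtered = cleaned[cleaned["region"].isin(regions_needed)]
```

The `coverage` series is one way to decide which years to keep: years with the highest counts have data for the most regions.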
- Started EDA with EDA_countries Jupyter notebook.
- Placed all the formatted CSV dataframes.
- Merged dataframes.
- Created a preliminary correlation matrix.
- Proceeded with analysis and description of relevant columns into a summary table.
- Output the summary table into a CSV file.
- Created violin and box plots.
- Made more visualizations using Seaborn and Matplotlib.
- Created a line plot for enrollment trends.
- Moved on to more complex EDA.
(All of the relevant results are available in Full EDA Results)
- Created a new EDA notebook: full_analytics.
- Loaded only the merged CSV and checked for data types, null values, and duplicates.
- Made a summary table and a categorical summary.
- Plotted all the enrollment trend lines, plus one clean consolidated plot showing all of them together.
- Made a trend table and interpreted the results.
- Created a bar chart for staff compensation as expenditure data.
- Created two more charts with better views of the relevant data.
- Made a comparison of expenditure data by country (region).
- Plotted a new correlation matrix for the countries being analyzed.
- Created a histogram for teacher distribution by education level.
- Made line plots for teacher distribution by level and year, and also for qualifications.
- Created a table to iterate on the data for the analytics file.
- Made an enrollment ratio plot.
- Plotted a graph to detect outliers.
- Created aggregate tables (only kept one in the file).
- Made comparison scatter plots between expenditure and enrollment ratios, using only the aggregate data.
- Created a missing data matrix.
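A few of the checks in the list above (data types, null values, duplicates, a missing data matrix, a correlation matrix, and a trend table) can be sketched with pandas as follows. The frame `merged` and its column names are hypothetical stand-ins for the merged CSV, not the project's actual schema.

```python
import pandas as pd

# Hypothetical stand-in for the merged CSV.
merged = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2010, 2015, 2010, 2015],
        "enrollment_ratio": [95.0, 97.0, 60.0, None],
        "expenditure_pct": [5.0, 5.5, None, 3.0],
    }
)

# Basic health checks: data types, nulls per column, duplicated rows.
dtypes = merged.dtypes
null_counts = merged.isna().sum()
duplicate_rows = int(merged.duplicated().sum())

# Missing data matrix: True wherever a value is absent.
missing_matrix = merged.isna()

# Correlation matrix over the numeric indicators.
# DataFrame.corr excludes missing values pairwise.
corr = merged[["enrollment_ratio", "expenditure_pct"]].corr()

# Trend table: mean enrollment ratio per year (NaN values are skipped).
trend = merged.groupby("year")["enrollment_ratio"].mean()
```

Leaving missing values as NaN here, rather than filling them with 0, keeps absent observations out of the correlation; filling with 0 treats them as real measurements.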
- Attempted to create the database through the Python connector, without success.
- Proceeded to create the database in MySQL Workbench directly.
- Proceeding with each .csv file.
- Creating a table for countries and for years separately, to relate all tables to each other.
- Aggregating data from existing tables into a full table with only the chosen list of countries.
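The layout described above (separate country and year tables that relate the data tables to each other, then a full table restricted to chosen countries) can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in for MySQL Workbench, and every table name, column name, and row is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Lookup tables for countries and years, plus two data tables keyed on both.
conn.executescript(
    """
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE years (
        year_id INTEGER PRIMARY KEY,
        year    INTEGER NOT NULL UNIQUE
    );
    CREATE TABLE enrollment (
        country_id INTEGER NOT NULL REFERENCES countries(country_id),
        year_id    INTEGER NOT NULL REFERENCES years(year_id),
        gross_enrollment_ratio REAL,
        PRIMARY KEY (country_id, year_id)
    );
    CREATE TABLE expenditure (
        country_id INTEGER NOT NULL REFERENCES countries(country_id),
        year_id    INTEGER NOT NULL REFERENCES years(year_id),
        expenditure_pct REAL,
        PRIMARY KEY (country_id, year_id)
    );
    """
)

# Hypothetical rows, not project data.
conn.executemany("INSERT INTO countries VALUES (?, ?)", [(1, "Country A"), (2, "Country B")])
conn.executemany("INSERT INTO years VALUES (?, ?)", [(1, 2010), (2, 2015)])
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [(1, 1, 95.0), (1, 2, 97.0), (2, 1, 60.0)],
)
conn.executemany("INSERT INTO expenditure VALUES (?, ?, ?)", [(1, 1, 5.0), (2, 1, 3.0)])

# Full table for the chosen countries only; LEFT JOIN keeps rows
# that have enrollment data but no expenditure data.
chosen = ("Country A",)
rows = conn.execute(
    """
    SELECT c.name, y.year, e.gross_enrollment_ratio, x.expenditure_pct
    FROM enrollment AS e
    JOIN countries AS c ON c.country_id = e.country_id
    JOIN years AS y ON y.year_id = e.year_id
    LEFT JOIN expenditure AS x
        ON x.country_id = e.country_id AND x.year_id = e.year_id
    WHERE c.name IN (?)
    ORDER BY y.year
    """,
    chosen,
).fetchall()
```

Keying every data table on `country_id` and `year_id` is what lets the tables be joined into one aggregate; the same idea carries over to MySQL with its own syntax for keys.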
- Make a list of models and run regressor and classifier versions whenever possible.
- Decide a target variable.
- Start running various models test by test.
- Implement features into the models that run better.
- Run both types of KNN.
- Try multiple splits.
- Get a confusion matrix.
- Choose the splits to continue with and scaling types.
- Apply grid search.
- Apply SMOTE to KNN classifier.
- Use random oversampler.
- Use class weights.
- Try again with KNN regressor.
- Arrive at no conclusion.
- Try regression ensemble models.
- Attempt random forest regressor.
- Attempt stacking regressor.
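The KNN classifier steps above (multiple splits, two scaling types, grid search, and a confusion matrix) might be organized as in this scikit-learn sketch. The synthetic, imbalanced data from `make_classification`, the parameter grid, and the `f1_macro` scoring choice are all assumptions; SMOTE is omitted because it needs the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic imbalanced data standing in for the project's features and target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Assumed search space for the KNN step of the pipeline.
param_grid = {"knn__n_neighbors": [3, 5, 7], "knn__weights": ["uniform", "distance"]}

results = {}
for test_size in (0.3, 0.4):  # the 70/30 and 60/40 splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0
    )
    for scaler in (MinMaxScaler(), StandardScaler()):
        # Scaling lives inside the pipeline so it is fit on training folds only.
        pipe = Pipeline([("scale", scaler), ("knn", KNeighborsClassifier())])
        search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
        search.fit(X_train, y_train)
        y_pred = search.predict(X_test)
        results[(test_size, type(scaler).__name__)] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "confusion_matrix": confusion_matrix(y_test, y_pred),
            "best_params": search.best_params_,
        }
```

Each entry in `results` can then be compared to choose the split and scaler to continue with.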
- Decide on a schematic.
- Create a Jupyter notebook for function aggregation and to keep things organized.
- Make the folders.
- Start creating functions.
- Test app.
- Create pages.
- Test multiple times at every step.
- Keep proceeding with EDA.
- Decide to make an md file to make it easier to import information into the EDA page.
- Make an EDA page.
- Import information in phases.
- Test between phases.
- Look at available models.
- Narrow down search.
- Choose a color scheme and vector theme.
- Pick one model.
- Start presentation.
- Decide on a title.
- Make notes on progress time and steps (this whole list).
- Decide what presentation must include.
- Plan what to say, when to say it, and where to insert the information.
- Decide main steps.
- Finish presentation.
The project is organized into the following folders:
- working_notebooks: Notebooks that contain work in progress and are not organized.
  - unused_notebooks
- usable_notebooks: Organized notebooks that present the findings.
  - correlations
  - plots
  - barfigures
  - full_analysis
- data: Collects the data used for the project in separate folders:
  - raw:
    - literacy_rates
    - unused_data
    - enrollment_data
    - government_expenditure
  - cleaned
  - filtered
  - merged
  - variables
  - years_vars
  - final
- mysql_scripts: Collects MySQL scripts and tables created for the project directly in MySQL Workbench.
- app_files: Contains all the files that pertain to the Streamlit app written for the project.
- slides: Presentation PDF for ease of viewing.
The Streamlit app is structured as follows:
- app_files/: Main folder containing app-related files.
  - app.py: Main script to run the Streamlit app.
  - data_loader.py: Handles data loading and processing.
  - pages/: Contains individual pages of the app.
    - opener.py: Page that confirms data has been loaded and presents project details.
    - introduction.py: Provides project introduction and overview.
    - ml_results.py: Displays machine learning results and findings.
  - utils_/: Folder for utility functions.
    - display.py: Helper functions for displaying data.
This app allows users to explore different aspects of the education data, verify the results of the machine learning models run for the project, and analyze trends in education. Navigation buttons within the app enable users to seamlessly switch between sections.