The purpose of this project is to collect and analyze real estate listings in Oman, clean the data, engineer useful features, and build machine learning models to predict property sale prices based on key property characteristics.
- Website 1: https://www.dubizzle.com.om/en/properties/
- Website 2: https://hilalprp.com.om/
Web Scraping:
- Used Python with `requests`, `BeautifulSoup`, and `pandas` to scrape listing data.
- Extracted features: property title, city, area, price, number of bedrooms, bathrooms, garage, and listing type.
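
The scraping step could look roughly like the sketch below. It is not the project's actual script: the CSS class names (`.listing-card`, `.listing-title`, and so on), the helper names `fetch_html` and `parse_listings`, and the sample HTML are all hypothetical, and the real selectors have to be read from each site's markup.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical selectors: the real class names must be taken from each site's HTML.
FIELD_SELECTORS = {
    "title": ".listing-title",
    "city": ".listing-city",
    "area": ".listing-area",
    "price": ".listing-price",
    "bedrooms": ".listing-bedrooms",
    "bathrooms": ".listing-bathrooms",
    "garage": ".listing-garage",
    "listing_type": ".listing-type",
}


def fetch_html(url: str) -> str:
    """Download one results page and return its HTML."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return response.text


def parse_listings(html: str) -> pd.DataFrame:
    """Extract one row per listing card; fields that are absent become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(".listing-card"):  # hypothetical container class
        row = {}
        for field, selector in FIELD_SELECTORS.items():
            node = card.select_one(selector)
            row[field] = node.get_text(strip=True) if node else None
        rows.append(row)
    return pd.DataFrame(rows, columns=list(FIELD_SELECTORS))


# Made-up markup used only to exercise the parser without a network call.
SAMPLE_HTML = """
<div class="listing-card">
  <h2 class="listing-title">Villa for sale</h2>
  <span class="listing-city">Muscat</span>
  <span class="listing-area">350 SqM</span>
  <span class="listing-price">120,000 OMR</span>
  <span class="listing-bedrooms">4</span>
</div>
"""

listings = parse_listings(SAMPLE_HTML)
```

In practice, `fetch_html` would be called once per results page and its output passed to `parse_listings`; the raw strings (such as "120,000 OMR") are left untouched here so that unit stripping stays in the cleaning step.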
Data Cleaning:
- Removed text units (e.g., "OMR", "SqM") from price and area columns using regex.
- Converted numeric columns to floats.
- Filled missing values using median (for numeric data) or mode (for categorical data).
- Dropped rows with excessive missing information.
- Created new columns such as `price_per_sqm`, `city`, `government`, and `total rooms` based on property size.
- Applied label encoding for categorical features (e.g., location, city, listing type, and government).
- Normalized numerical features where needed.
- Combined data from two sources into one unified dataset.
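
The cleaning and feature steps listed above can be sketched as follows. This is an illustration under several assumptions, not the project's code: the column names, the `max_missing` threshold, the definition of `total_rooms` as bedrooms plus bathrooms, the choice of min-max scaling, and the order (combine first, then clean) are all guesses, and the rows are made up.

```python
import re

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

NUMERIC_COLS = ["price", "area", "bedrooms", "bathrooms"]
CATEGORICAL_COLS = ["city", "listing_type"]


def to_number(value):
    """Strip units such as "OMR" or "SqM" and return a float (NaN if nothing is left)."""
    if pd.isna(value):
        return float("nan")
    digits = re.sub(r"[^\d.]", "", str(value))
    return float(digits) if digits else float("nan")


def clean(df: pd.DataFrame, max_missing: int = 2) -> pd.DataFrame:
    # Drop rows with more than `max_missing` empty fields (threshold is an assumption).
    df = df.dropna(thresh=df.shape[1] - max_missing).reset_index(drop=True)
    for col in NUMERIC_COLS:
        df[col] = df[col].map(to_number)
        df[col] = df[col].fillna(df[col].median())  # median for numeric gaps
    for col in CATEGORICAL_COLS:
        df[col] = df[col].fillna(df[col].mode().iloc[0])  # mode for categorical gaps
    return df


def engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["price_per_sqm"] = df["price"] / df["area"]
    # Assumed definition: the README does not say how total rooms is computed.
    df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
    for col in CATEGORICAL_COLS:
        df[col + "_encoded"] = LabelEncoder().fit_transform(df[col])
    # Min-max scaling is one possible normalization; the README does not name the method.
    scaled_cols = ["area", "bedrooms", "bathrooms", "total_rooms"]
    df[scaled_cols] = MinMaxScaler().fit_transform(df[scaled_cols])
    return df


# Made-up rows standing in for the two scraped sources.
site_1 = pd.DataFrame({
    "price": ["120,000 OMR", "60000 OMR", None],
    "area": ["400 SqM", None, "150 SqM"],
    "bedrooms": ["4", "3", None],
    "bathrooms": ["3", "2", None],
    "city": ["Muscat", None, "Salalah"],
    "listing_type": ["Sale", "Sale", None],
})
site_2 = pd.DataFrame({
    "price": ["45,000 OMR", "90,000 OMR"],
    "area": ["150 SqM", "300 SqM"],
    "bedrooms": ["2", "3"],
    "bathrooms": ["1", "2"],
    "city": ["Sohar", "Muscat"],
    "listing_type": ["Sale", "Sale"],
})

combined = pd.concat([site_1, site_2], ignore_index=True)
dataset = engineer(clean(combined))
```

Here `price_per_sqm` is computed before `area` is scaled, so it stays in OMR per square metre. The third row of `site_1` is dropped because four of its six fields are empty.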
Used Scikit-learn to apply and evaluate different regression models:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor

Evaluation metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared Score (R²)
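
A minimal sketch of how the three regressors can be trained and scored with scikit-learn. The data is synthetic (generated from a made-up linear rule plus noise), so the numbers it prints are unrelated to the project's results; the feature choice, the 80/20 split, and the `random_state` values are assumptions for illustration. RMSE is taken as the square root of MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the project's dataset): area, bedrooms, city code.
rng = np.random.default_rng(0)
n = 500
area = rng.uniform(50, 500, n)
bedrooms = rng.integers(1, 7, n)
city_code = rng.integers(0, 5, n)
X = np.column_stack([area, bedrooms, city_code])
y = 300 * area + 5000 * bedrooms + 10000 * city_code + rng.normal(0, 5000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    results[name] = {
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": float(r2_score(y_test, predictions)),
    }

for name, scores in results.items():
    print(f"{name}: RMSE={scores['RMSE']:.2f}, R2={scores['R2']:.3f}")
```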
| Model | RMSE | R² Score |
|---|---|---|
| Linear Regression | 0.10 | 0.43 |
| Decision Tree | 0.02 | 0.98 |
| Random Forest | 0.01 | 0.99 |

Random Forest performed best, with the highest R² (0.99) and the lowest RMSE (0.01).

- Web scraping scripts or notebooks
- Data cleaning functions
- Final combined CSV file
- Feature engineering and modeling code
- A brief `README.md` file

Future improvements:
- Add more features like amenities, location coordinates, or property age.
- Scrape additional websites to expand the dataset.
- Deploy the model as a prediction tool via Flask or Streamlit.