This project, developed within the scope of the Mapindata Hackathon, aimed to leverage data not just as numbers, but as a tool to understand people, cities, and life itself. Our goal was to determine a "wealth score" for individuals living in or passing through a given region, and to categorize these individuals into multiple "persona" groups that reflect their multifaceted behaviors.
By blending diverse datasets—from restaurant and hotel segments to mobile signal data, demographic information, and amenity areas—we didn't just focus on income levels. Instead, we revealed an individual's "life value" based on various dimensions such as access, consumption habits, mobility, and lifestyle preferences. This allowed us to capture multiple personality traits like "Pet Lover," "Office Explorer," or "Gourmet Follower" through data, ultimately creating a unique and dynamic profile for each person.
Thanks to these insights, the project aims to serve as a decision-support system across fields ranging from smart city planning to target-audience analysis.
Run the Jupyter Notebook files in the project in the following order to create the necessary CSV files:

1. `99_kisi_id.ipynb`
2. `restoran_preprocessing.ipynb`
3. `restoran_cluster.ipynb`
4. `person1-dateplace.ipynb`
5. `person2-favoriler-timecluster.ipynb`
6. `person3-rich_score.ipynb`
7. `person4-persona.ipynb`
Files edited manually in Excel or Jupyter Notebook:
- `Clustered_Veteriner.csv`: classified in Excel as A, B, or C according to the neighborhood name.
- `Kahve_New.csv`: in Excel, the data types of the `latitude` and `longitude` columns were changed to float, and unnecessary columns were removed.
Note: Before running the notebooks, the Polygons folder and the Mobility Data Parquet files must be manually added to the data folder.
**99_kisi_id.ipynb**

This notebook creates a new DataFrame by selecting the first 99 unique `device_aid` values from the large mobility dataset (`MobilityDataMay2024.parquet`). Only 99 IDs are selected because reading the full DataFrame into memory exhausts RAM.
- **Installing Dask and Pandas:** `dask.dataframe` and `pandas` are imported.
- **Reading and Filtering Data:**
  - The Parquet file is read with Dask.
  - Unique IDs are taken from the `device_aid` column.
  - The first 99 IDs are selected, and the movement data is filtered down to these IDs.
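A minimal sketch of this step, assuming the file paths above; it illustrates the approach rather than reproducing the notebook's exact code:

```python
import dask.dataframe as dd

# Read the large mobility dataset lazily with Dask so it never
# has to fit in RAM all at once.
df = dd.read_parquet("data/MobilityDataMay2024.parquet")

# Collect the unique device IDs and keep only the first 99.
unique_ids = df["device_aid"].unique().compute()
selected_ids = list(unique_ids[:99])

# Filter the movement data down to the selected devices and save.
filtered = df[df["device_aid"].isin(selected_ids)].compute()
filtered.to_csv("99_kisi_data.csv", index=False)
```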
**restoran_preprocessing.ipynb**

This notebook reads and cleans the raw restaurant data to prepare it for analysis and clustering.
- **Loading Libraries:** the `pandas` and `numpy` packages are imported.
- **Reading Data:** the main dataset (`Hackathon_MainData.xlsx`) is loaded from Excel.
- **Column Selection:** a new DataFrame (`df`) is created by selecting the columns required for analysis:
  - Latitude, Longitude, District, Type, Modern/Traditional/Hotel values
  - Average Spending Amount, Restaurant Type
  - Map Profile & Population Score, Mapin Segment
- **Correcting Incorrect Values:**
  - Capitalization errors and missing Turkish characters in the district column are corrected.
  - The spelling error (`TRADITIONAL`) in the sales channel names is corrected.
  - The restaurant type 'Balik' is updated to 'Balık'.
- **Clearing Coded Values:** the letter prefixes are stripped from the Modern (`D`), Traditional (`R`), and Hotel (`H`) values.
- **Formatting Scores and Coordinates:**
  - Latitude, Longitude, and Map Profile/Population scores are converted from string to float.
  - Five-digit numbers are divided by 1000 to bring them to a thousandths scale.
- **Average Spend Conversion:** range values such as "10-20 TL" are converted to the midpoint of the range, and open-ended values such as "+30" are converted to their lower limit.
- **Filling in Missing Values:** `NaN` values in the Modern, Traditional, and Hotel columns are filled with 0.
- **Final Check and Save:** the cleaned DataFrame is printed to the screen and saved to `MainData_updated.csv`.
- **Column List:** the resulting columns are listed in the console.
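A condensed sketch of the cleaning logic. The column names `Modern`, `Traditional`, and `Hotel` are assumptions based on the steps above; the spend parser follows the conversion rule described:

```python
import pandas as pd
import numpy as np

def parse_spend(value):
    """Range values like "10-20 TL" become the midpoint;
    open-ended values like "+30" become the lower limit."""
    s = str(value).replace("TL", "").strip()
    try:
        if "-" in s:
            low, high = s.split("-", 1)
            return (float(low) + float(high)) / 2
        if s.startswith("+"):
            return float(s[1:])
        return float(s)
    except ValueError:
        return np.nan

df = pd.read_excel("Hackathon_MainData.xlsx")

# Strip the letter prefixes from the coded values, then fill NaN with 0.
for col, letter in [("Modern", "D"), ("Traditional", "R"), ("Hotel", "H")]:
    df[col] = pd.to_numeric(
        df[col].astype(str).str.replace(letter, "", regex=False),
        errors="coerce").fillna(0)

# Rescale five-digit scores to a thousandths scale.
for col in ["Map Profile Score", "Map Population Score"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df.loc[df[col] >= 10000, col] /= 1000

df["Ortalama Harcama Tutarı"] = df["Ortalama Harcama Tutarı"].apply(parse_spend)
df.to_csv("MainData_updated.csv", index=False)
```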
**restoran_cluster.ipynb**

This notebook applies KMeans clustering to the data cleaned in the previous step and separates the restaurants into clusters on the map.
- **Loading Libraries and Data:** `pandas`, `sklearn.cluster.KMeans`, `matplotlib`, and `seaborn` are imported; `MainData_updated.csv` is read.
- **Filling in Missing Average Spending:** empty `Ortalama Harcama Tutarı` values are filled with the column mean.
- **NaN Row Removal:** rows with a missing `Map Profile Score` or `Map Population Score` are removed.
- **One-Hot Encoding:** the `İlçe` column is converted to binary columns with `pd.get_dummies`.
- **Removing Unnecessary Columns:** the `Mapin Segment` and `Hotel Value` columns are deleted.
- **Restaurant Type Consolidation and Cleaning:** categories with few samples are merged or deleted.
- **Sales Channel Filter:** `Hotel` values are removed from the `Tür` column.
- **Removal of Zero-Scored Records:** rows with `Map Profile Score == 0` are removed.
- **Feature Selection and Normalization:** `Ortalama Harcama Tutarı` and `Map Profile Score` are selected and scaled with `StandardScaler`.
- **KMeans:** clusters are computed with `n_clusters=3`.
- **Visualization:** clusters are colored on a latitude-longitude scatter plot.
- **Elbow Method Analysis:** inertia is calculated for k values from 1 to 10.
- **Cluster Labelling and Saving:** the clusters are labelled "Zengin Restoran", "Orta Halli Restoran", and "Ucuz Restoran", and `Clustered_Restaurants.csv` is saved.
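A minimal sketch of the clustering core, using the column names above; mapping the three clusters to the cheap/mid/rich labels would require inspecting the cluster centres, which is omitted here:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("MainData_updated.csv")

# Fill missing average spend with the column mean; drop unusable rows.
spend = "Ortalama Harcama Tutarı"
df[spend] = df[spend].fillna(df[spend].mean())
df = df.dropna(subset=["Map Profile Score", "Map Population Score"])
df = df[df["Map Profile Score"] != 0]

# Scale the two clustering features and fit KMeans with three clusters.
scaled = StandardScaler().fit_transform(df[[spend, "Map Profile Score"]])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)

# Elbow check: inertia for k = 1..10 to validate the choice of k.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42)
            .fit(scaled).inertia_ for k in range(1, 11)]

df.to_csv("Clustered_Restaurants.csv", index=False)
```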
**person1-dateplace.ipynb**

This notebook matches the movement data of the 99 selected devices with different POI types.
- **Libraries:** `dask.dataframe`, `pandas`, `numpy`, `sklearn.neighbors.BallTree`.
- **Filter by 99 IDs:** the movement data is loaded from `99_kisi_data.csv`.
- **Time and Location Processing:**
  - The `timestamp` column is converted to datetime and a `zaman_bolumu` (time-of-day) column is added.
  - A `horizontal_accuracy` ≤ 200 filter is applied.
- **Primary Records:** the first record is selected for each `device_aid`, `grid_id`, `date`, `zaman_bolumu` combination.
- **Restaurant/Veterinarian/Coffee Shop Matching:** using `Clustered_Restaurants.csv`, `Clustered_Veteriner.csv`, and `Kahve_New.csv`, the closest POI is found with a BallTree using the Haversine metric.
- **POI Type Determination:** the `matched_*` columns are processed and a `Place` column is created by selecting the closest category.
- **Result:** the file `id-date-zaman_bolumu-place.csv` is saved.
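A sketch of the nearest-POI matching with `BallTree` and the haversine metric. The per-file coordinate column names are assumptions based on the notes above; the tree expects radians and returns distances in radians, so multiplying by the Earth's radius yields kilometres:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

def nearest_km(moves, pois, lat_col, lon_col):
    """Distance (km) from each movement point to its nearest POI."""
    tree = BallTree(np.radians(pois[[lat_col, lon_col]].to_numpy()),
                    metric="haversine")
    dist, _ = tree.query(
        np.radians(moves[["latitude", "longitude"]].to_numpy()), k=1)
    return dist.ravel() * EARTH_RADIUS_KM

moves = pd.read_csv("99_kisi_data.csv")
poi_files = [  # (file, label, lat column, lon column) -- names assumed
    ("Clustered_Restaurants.csv", "restoran", "Latitude", "Longitude"),
    ("Clustered_Veteriner.csv", "veteriner", "Latitude", "Longitude"),
    ("Kahve_New.csv", "kahve", "latitude", "longitude"),
]
for path, label, lat, lon in poi_files:
    moves[f"matched_{label}_km"] = nearest_km(moves, pd.read_csv(path), lat, lon)

# Place = whichever POI category is closest to the point.
km_cols = [f"matched_{label}_km" for _, label, _, _ in poi_files]
moves["Place"] = (moves[km_cols].idxmin(axis=1)
                  .str.replace("matched_", "", regex=False)
                  .str.replace("_km", "", regex=False))
```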
**person2-favoriler-timecluster.ipynb**

For each device, this notebook classifies the categories of visited venues, identifies favourite venues, and performs time-cluster analysis.
- **Data Loading:** `id-date-zaman_bolumu-place.csv` is read.
- **Category Dictionary:** groups such as Hospital, Park, Market, and Restaurant are defined by keywords.
- **`categorize` Function:** a category is assigned based on the venue name.
- **Pivot Tables:** visit counts are calculated by time-of-day period (`zaman_pivot`) and by venue category (`mekan_pivot`).
- **Time Clustering:** the optimal k is found with the elbow method, and a `time_cluster` label is assigned with KMeans.
- **Favourite Places:** frequently visited categories, by visit rate, are stored in the `favoriler` list.
- **CSV:** `id-favoriler-time_cluster.csv` is saved.
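A sketch of the categorisation and time-clustering steps. The keyword dictionary here is illustrative, and `n_clusters=3` stands in for whichever k the elbow plot suggests:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("id-date-zaman_bolumu-place.csv")

# Keyword-to-category dictionary (keywords illustrative, not exhaustive).
KEYWORDS = {"hastane": "Hospital", "park": "Park",
            "market": "Market", "restoran": "Restaurant"}

def categorize(place):
    """Assign a category based on keywords in the venue name."""
    name = str(place).lower()
    for key, cat in KEYWORDS.items():
        if key in name:
            return cat
    return "Other"

df["kategori"] = df["Place"].apply(categorize)

# Visit counts per device: by time-of-day and by venue category.
zaman_pivot = df.pivot_table(index="device_aid", columns="zaman_bolumu",
                             aggfunc="size", fill_value=0)
mekan_pivot = df.pivot_table(index="device_aid", columns="kategori",
                             aggfunc="size", fill_value=0)

# Elbow method over the time profile, then label with the chosen k.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42)
            .fit(zaman_pivot).inertia_ for k in range(1, 11)]
time_cluster = KMeans(n_clusters=3, n_init=10,
                      random_state=42).fit_predict(zaman_pivot)
```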
**person3-rich_score.ipynb**

This notebook calculates a "zenginlik skoru" (`rich_score`) from each device's neighbourhood, visit counts, and demographics.
- **Data and Demographics:** `id-favoriler-time_cluster.csv` and `Ilce_Demografi.xlsx` are loaded.
- **Night Neighbourhood Analysis:** the neighbourhood most frequently visited during the "night" period is selected as `oturdugu_mahalle` (home neighbourhood).
- **Visit Features:** visit counts for restaurants, luxury villas, and residential complexes are calculated with a pivot.
- **Neighbourhood Level Matching:** neighbourhood levels (`A`, `B`, `C`) read from Excel are attached to each device.
- **Feature Scaling:** the numerical features are normalised with `StandardScaler` and `MinMaxScaler`.
- **Rich Score Calculation:** scores are summed with weighted contribution factors (e.g. villa ×0.35, restaurant ×0.25, neighbourhood ×0.25).
- **CSV:** `id-mahalle-rich_score.csv` is created.
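A sketch of the weighted scoring. The feature column names (`villa_ziyaret`, `restoran_ziyaret`, `mahalle_seviye`) and the A/B/C numeric mapping are hypothetical; only the weights come from the description above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("id-favoriler-time_cluster.csv")

# Map the A/B/C neighbourhood level to a numeric value (assumed mapping).
df["mahalle_puan"] = df["mahalle_seviye"].map({"A": 3, "B": 2, "C": 1})

# Normalise the numeric features to [0, 1] before weighting.
features = ["villa_ziyaret", "restoran_ziyaret", "mahalle_puan"]
scaled = MinMaxScaler().fit_transform(df[features])

# Weighted sum (villa 0.35, restaurant 0.25, neighbourhood 0.25; the
# remaining 0.15 would cover other features not shown in this sketch).
weights = [0.35, 0.25, 0.25]
df["rich_score"] = (scaled * weights).sum(axis=1)

df.to_csv("id-mahalle-rich_score.csv", index=False)
```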
**person4-persona.ipynb**

This notebook combines the favourites, time clusters, and wealth score to create the final persona dataset.
- **Data Loading:** `id-favoriler-time_cluster.csv` and `id-mahalle-rich_score.csv` are read.
- **Consolidation:** the two DataFrames are concatenated with `pd.concat`.
- **Final CSV:** `final.csv` is saved and the process is complete.
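A sketch of the final merge. Concatenating with `axis=1` assumes both files share the same row order; a key-based merge on `device_aid` would be the defensive alternative:

```python
import pandas as pd

fav = pd.read_csv("id-favoriler-time_cluster.csv")
rich = pd.read_csv("id-mahalle-rich_score.csv")

# Combine favourites/time clusters with the rich score side by side.
final = pd.concat([fav, rich], axis=1)
final.to_csv("final.csv", index=False)
```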
