Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.
- Build a machine learning model to determine whether a click is fraudulent or not.
Competition link: TalkingData AdTracking Fraud Detection Challenge
Models: XGBoost and LightGBM
Evaluation Metric: area under the ROC curve (AUC-ROC)
Training and validation: Some models are trained on the data from 11.07-11.09, and others on the data from 11.07-11.08. 50 million randomly selected rows are used for validation.
Model | Public score | Private score | Final rank |
---|---|---|---|
LGBM | 0.98122 | 0.98206 | 223rd (Top 6%) bronze medal 🥉 |
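
The scores above are AUC-ROC values. As a rough illustration of how a random hold-out set can be drawn and scored with this metric, here is a minimal sketch. The helper `holdout_indices`, the synthetic labels and the synthetic scores are assumptions made for the example, not the code or data behind the results in the table.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def holdout_indices(n_rows, n_valid, seed=0):
    """Randomly split row positions into (train, validation) index arrays.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    return perm[n_valid:], perm[:n_valid]


# Tiny synthetic stand-in for the real click data (values are made up).
rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.05).astype(int)   # rare positive class
scores = 0.3 * y + rng.random(10_000)         # noisy scores, higher for positives

train_idx, valid_idx = holdout_indices(len(y), n_valid=2_000)

# AUC-ROC only depends on how the scores rank positives against negatives,
# so it stays meaningful even when positives are rare.
valid_auc = roc_auc_score(y[valid_idx], scores[valid_idx])
print(f"validation AUC: {valid_auc:.4f}")
```

On the real data, `n_valid` would be the 50 million rows mentioned above and `scores` would be the predicted probabilities from the trained model.
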
The libraries used are:
- numpy
- pandas
- matplotlib
- seaborn
- sklearn
- lightgbm
- xgboost
Challenges:
- Large dataset (TalkingData provides 185 million training samples, 7 GB in size.)
- Imbalanced Data
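
Both challenges can be eased with fairly standard techniques. The sketch below shows one possible approach, not necessarily the one used in this project: reading the CSV with compact integer dtypes to cut memory use, and deriving a `scale_pos_weight` value (a parameter accepted by both LightGBM and XGBoost) from the class ratio. The column names, dtypes, sample rows and helper functions are assumptions made for illustration.

```python
import io

import pandas as pd

# Assumed column names and dtypes; adjust them to the actual file.
# Small unsigned integers take far less memory than the default int64.
DTYPES = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}

# Made-up rows standing in for the real training file.
SAMPLE_CSV = """ip,app,device,os,channel,click_time,is_attributed
1001,3,1,13,379,2017-11-07 09:30:38,0
1002,3,1,19,379,2017-11-07 13:40:27,0
1003,12,1,13,245,2017-11-07 18:05:24,1
1004,14,1,13,478,2017-11-07 04:58:08,0
1005,12,1,17,259,2017-11-08 09:00:09,0
"""


def load_clicks(source, nrows=None):
    """Read click records with compact dtypes (hypothetical helper)."""
    return pd.read_csv(source, dtype=DTYPES, parse_dates=["click_time"], nrows=nrows)


def pos_weight(labels):
    """Ratio of negatives to positives, a common choice for scale_pos_weight."""
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return n_neg / max(n_pos, 1)


df = load_clicks(io.StringIO(SAMPLE_CSV))

# Parameter dict in the style expected by LightGBM's training API.
params = {
    "objective": "binary",
    "metric": "auc",
    "scale_pos_weight": pos_weight(df["is_attributed"]),
}
print(df.dtypes)
print(params)
```

With the real file, `load_clicks` would be given the CSV path instead of the in-memory sample, and `nrows` can limit how much is read while experimenting.
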
Exploratory Data Analysis (EDA):
Solution References: