This project analyzes purchasing behavior patterns to predict pregnancy status using logistic regression. The model achieves 84.5% accuracy in predicting pregnancy based on various shopping behaviors and lifestyle changes.
- Clone the repository:
git clone https://github.com/jfm56/regression_analysis.git
cd regression_analysis
- Install required dependencies:
pip install -r requirements.txt
Required dependencies:
- pandas==2.1.0
- numpy==1.24.3
- scikit-learn==1.3.0
- openpyxl==3.1.2
- matplotlib==3.7.1
- seaborn==0.12.2
Development dependencies:
- pytest==7.4.0
- pylint==2.17.5
- pytest-cov==4.1.0
- black==23.7.0
The analysis expects an Excel file (target.xlsx) with the following columns:
- Implied Gender
- Home/Apt/PO Box
- Pregnancy Test
- Birth Control
- Feminine Hygiene
- Folic Acid
- Prenatal Vitamins
- Prenatal Yoga
- Body Pillow
- Ginger Ale
- Sea Bands
- Stopped buying ciggies
- Cigarettes
- Smoking Cessation
- Stopped buying wine
- Wine
- Maternity Clothes
- Pregnant (target variable, 0 or 1)
An example file target.xlsx is included in the repository.
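Before running the analysis on a custom file, it may help to confirm that the header row matches the column list above. The following is a minimal sketch in plain Python; `EXPECTED_COLUMNS` and `missing_columns` are illustrative names and are not part of the project's code.

```python
# Column names copied from the list above. EXPECTED_COLUMNS and
# missing_columns are illustrative names, not part of the project.
EXPECTED_COLUMNS = [
    "Implied Gender",
    "Home/Apt/PO Box",
    "Pregnancy Test",
    "Birth Control",
    "Feminine Hygiene",
    "Folic Acid",
    "Prenatal Vitamins",
    "Prenatal Yoga",
    "Body Pillow",
    "Ginger Ale",
    "Sea Bands",
    "Stopped buying ciggies",
    "Cigarettes",
    "Smoking Cessation",
    "Stopped buying wine",
    "Wine",
    "Maternity Clothes",
    "Pregnant",
]


def missing_columns(actual_columns):
    """Return the expected columns that are absent, in the documented order."""
    # Strip surrounding whitespace so " Wine " still counts as "Wine".
    present = {str(name).strip() for name in actual_columns}
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

With pandas installed, the header of a real file could be checked with `missing_columns(pd.read_excel("target.xlsx").columns)`; an empty list means every expected column is present.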
- Place your data file (Excel format) in the project directory
- Run the analysis:
python -m regression_analysis.regression
The script will generate three visualization files:
- correlation_heatmap.png: Shows correlations between features
- scatter_plots.png: Displays relationships between key features and pregnancy
- feature_importance.png: Shows the importance of each feature
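A quick way to confirm that a run produced all three files is to check for them by name. This sketch assumes the images are written to the directory the script was run from, which the text above does not state; `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

# File names as listed above; the helper name is illustrative.
EXPECTED_OUTPUTS = (
    "correlation_heatmap.png",
    "scatter_plots.png",
    "feature_importance.png",
)


def missing_outputs(directory="."):
    """Return the expected plot files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]
```

Calling `missing_outputs()` after a run returns an empty list when all three plots exist.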
Model performance:
- Accuracy: 84.50%
- Precision (Pregnant): 92%
- Recall (Pregnant): 74%
- F1-Score (Pregnant): 82%
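The F1-score is the harmonic mean of precision and recall, so the reported 82% can be reproduced from the other two figures. A short check, using only the values quoted above (the function name is illustrative):

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values for the "Pregnant" class.
precision = 0.92
recall = 0.74

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.82"
```

The gap between precision and recall suggests the model is conservative: most of its "Pregnant" predictions are correct, but it misses roughly a quarter of the actual positives.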
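Strongest positive predictors (feature importance):
- Folic Acid (2.94)
- Prenatal Vitamins (2.22)
- Pregnancy Test (1.96)
- Maternity Clothes (1.72)
- Ginger Ale (1.41)
Strongest negative predictors (feature importance):
- Birth Control (-2.03)
- Feminine Hygiene (-1.73)
- Wine (-1.29)
- Cigarettes (-1.25)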
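If these scores are the fitted logistic regression coefficients (an assumption; the text does not say so explicitly), each one is a change in log-odds, and exponentiating it gives the factor by which the odds of pregnancy are multiplied for a one-unit increase in that feature. A sketch of the conversion:

```python
import math

# Scores copied from the lists above. ASSUMPTION: they are logistic
# regression coefficients (log-odds); the document does not state this.
scores = {
    "Folic Acid": 2.94,
    "Prenatal Vitamins": 2.22,
    "Pregnancy Test": 1.96,
    "Maternity Clothes": 1.72,
    "Ginger Ale": 1.41,
    "Birth Control": -2.03,
    "Feminine Hygiene": -1.73,
    "Wine": -1.29,
    "Cigarettes": -1.25,
}


def odds_ratio(coefficient):
    """Multiplicative change in odds for a one-unit increase in the feature."""
    return math.exp(coefficient)


# Print features from most positive to most negative score.
for feature, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature:<18} {score:+.2f}  odds x{odds_ratio(score):.2f}")
```

Under that assumption, a Folic Acid purchase multiplies the odds by roughly 19, while a Birth Control purchase reduces them to about 13% of their previous value. The interpretation also depends on how each feature was encoded and scaled.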
The scatter plots show the relationships between six key features and pregnancy status:
- X-axis: Feature value
- Y-axis: Pregnancy status (0 = Not Pregnant, 1 = Pregnant)
- Red trend lines indicate the direction and strength of relationships
- Upward trends suggest positive correlation with pregnancy
- Downward trends suggest negative correlation with pregnancy
Key observations:
- Folic Acid and Prenatal Vitamins show strong positive correlations
- Birth Control and Wine show strong negative correlations
- The spread of points around each trend line indicates how consistent that relationship is
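The direction of a trend line can also be checked numerically: for a least-squares line, the sign of the slope matches the sign of the correlation. The sketch below assumes the red lines are ordinary least-squares fits, which the text does not confirm, and runs on made-up toy values rather than on target.xlsx.

```python
def trend_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data, invented for illustration only (1 = bought item / pregnant).
bought_item = [1, 1, 1, 0, 0, 0, 1, 0]
pregnant = [1, 1, 0, 0, 0, 1, 1, 0]

slope, intercept = trend_line(bought_item, pregnant)
direction = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
print(f"slope={slope:.2f}, intercept={intercept:.2f} -> {direction} trend")
```

For a 0/1 feature, the slope equals the difference in pregnancy rate between buyers and non-buyers, which is one way to read the steepness of each line.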
The model identifies purchasing patterns that are most predictive of pregnancy:
- Health supplements (Folic Acid, Prenatal Vitamins) are the strongest positive indicators
- Contraceptives and lifestyle products (Birth Control, Wine, Cigarettes) are strong negative indicators
- Changes in purchasing behavior (stopping wine/cigarettes) are moderately strong indicators
- Install development dependencies:
pip install -r requirements.txt
- Run tests:
pytest
- Check code quality:
pylint regression_analysis/regression.py tests/*.py
- Check test coverage:
pytest --cov=regression_analysis --cov-report=html
The test suite includes:
- Model accuracy testing with realistic data
- Feature importance validation
- Error handling tests
- 90% code coverage
- Pull and run the published image from Docker Hub:
docker pull jmullen029/regression_analysis:latest
docker run jmullen029/regression_analysis:latest
- Build and run using Docker Compose:
docker-compose up --build
- Run tests in Docker:
docker-compose run regression pytest
The Docker image is automatically built and published to Docker Hub on every push to the main branch.
- The model uses logistic regression for binary classification (Pregnant/Not Pregnant)
- Categorical variables are encoded with label encoding
- The dataset is split 80/20 for training and testing
- Results include both positive and negative predictors for comprehensive analysis
- Includes comprehensive test suite with pytest
- Docker support for consistent development environment
- Code quality maintained with pylint
- Test coverage tracked with pytest-cov
- Automated dependency updates with Dependabot: weekly checks for Python packages, GitHub Actions, and Docker base images, with patch updates auto-merged
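The label encoding and 80/20 split mentioned in the notes above can be illustrated in plain Python. The project lists scikit-learn, whose `LabelEncoder` and `train_test_split` would normally handle these steps; the version below only shows what they do. The function names, the seed, and the sample category values are hypothetical.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def split_80_20(rows, seed=42):
    """Shuffle a copy of `rows` and split it 80% train / 20% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical values for the "Implied Gender" column.
genders = ["M", "F", "U", "F", "M"]
encoded, mapping = label_encode(genders)
print(mapping)  # {'F': 0, 'M': 1, 'U': 2}
print(encoded)  # [1, 0, 2, 0, 1]

train, test = split_80_20(range(10))
print(len(train), len(test))  # 8 2
```

Label encoding imposes an arbitrary numeric order on the categories, which a linear model treats as meaningful; that is worth keeping in mind when reading the score for an encoded column.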
