Autonomous AI Data Scientist Assistant - Complete EDA, Visualization & ML Pipeline
Spark Insights is an autonomous AI-powered data scientist that performs comprehensive Exploratory Data Analysis (EDA), generates visualizations suited to the data, builds machine learning models, and automatically creates professional PowerPoint presentations. It is built with the SmolagentsAI framework and powered by OpenAI's language models.
- Autonomous Analysis: Complete end-to-end data analysis without human intervention
- Smart Data Processing: Automatic data loading, cleaning, and preprocessing
- Intelligent Visualizations: Context-aware chart generation based on data characteristics
- ML Model Building: Automatic model selection, training, and evaluation
- PowerPoint Generation: Professional presentation creation with embedded visualizations
- Comprehensive EDA: Statistical analysis, correlation discovery, and pattern recognition
- Actionable Insights: AI-generated recommendations and business insights
- Interactive Interface: Simple command-line interface for data analysis
spark-insights/
├── agent.py                     # Main AI agent configuration with SmolagentsAI
├── main.py                      # Entry point and user interface
├── config.py                    # Configuration and system prompts
├── requirements.txt             # Python dependencies
├── .env                         # Environment variables (API keys)
├── tools/                       # Modular analysis tools
│   ├── __init__.py
│   ├── file_handler.py          # Data loading and file processing
│   ├── data_analysis.py         # Statistical analysis and EDA
│   ├── visualization.py         # Chart generation with matplotlib/seaborn
│   ├── ml_model.py              # Machine learning model building
│   ├── report_generator.py      # PowerPoint presentation creation
│   └── conversation_manager.py  # Chat and conversation handling
├── examples/                    # Sample datasets
│   ├── sales.csv                # Sample sales dataset
│   ├── amazon.csv               # Sample Amazon dataset
│   └── README.txt               # Dataset descriptions
├── plots/                       # Generated visualizations (auto-created)
├── artifacts/                   # ML model artifacts and outputs
├── env/                         # Python virtual environment
└── __pycache__/                 # Python cache files
The system uses the SmolagentsAI framework with the following tools:
- FileHandlerTool: Intelligent data loading for CSV/Excel files
- DataAnalysisTool: Comprehensive statistical analysis and EDA
- VisualizationTool: Context-aware chart generation
- MLModelTool: Automated machine learning pipeline
- ReportGeneratorTool: Professional PowerPoint creation
- ConversationManagerTool: Interactive chat capabilities
- Python 3.8 or higher
- OpenAI API key
- Git (for cloning)
# Clone the repository
git clone https://github.com/your-repo/spark-insights.git
cd spark-insights
# Create virtual environment
python -m venv env
# On Windows:
env\Scripts\activate
# On macOS/Linux:
source env/bin/activate
# Install dependencies
pip install -r requirements.txt
Create a .env file in the project root:
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
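Before the first run, it can help to confirm that the key is actually readable from the file. The following is a minimal, dependency-free sketch of such a check. It is not the project's own loader (requirements.txt suggests python-dotenv is used for that), and the helper name `read_env_file` is hypothetical.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper for illustration only: it skips blank lines and
    comments, and does not handle quoting or variable expansion.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


if __name__ == "__main__":
    key = read_env_file().get("OPENAI_API_KEY", "")
    if not key or key == "your_openai_api_key_here":
        print("OPENAI_API_KEY is missing or still set to the placeholder")
    else:
        print("OPENAI_API_KEY found")
```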
# Run the AI Data Scientist
python main.py
The system will prompt you for a data file path and then autonomously:
- Load and explore your dataset
- Perform comprehensive EDA with statistical analysis
- Generate intelligent visualizations based on your data
- Build relevant ML models (classification/regression/clustering)
- Extract actionable insights and patterns
- Create a PowerPoint presentation with all findings
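This README does not say how a "relevant" model family is chosen. One common heuristic, sketched here as an assumption rather than the project's actual logic, treats a target column with text labels or only a few distinct integer values as classification and everything else as regression. The function name and the `max_classes` threshold are hypothetical; when no target column exists at all, clustering would be the natural fallback.

```python
def infer_problem_type(target_values, max_classes=10):
    """Guess 'classification' or 'regression' from a target column.

    Assumed heuristic: non-numeric targets, or numeric targets with at most
    `max_classes` distinct integer-like values, are treated as class labels.
    """
    values = [v for v in target_values if v is not None and v != ""]
    if not values:
        raise ValueError("target column has no usable values")

    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            return "classification"  # text labels such as 'yes' / 'no'

    distinct = set(numeric)
    all_integer_like = all(x.is_integer() for x in distinct)
    if all_integer_like and len(distinct) <= max_classes:
        return "classification"
    return "regression"
```

A threshold like this misclassifies some columns (for example, a 1-to-5 rating that should be modelled as a number), so a real pipeline would likely let the user override the guess.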
Welcome to Spark Insights! Your Autonomous AI Data Scientist.
This system will automatically:
• Perform comprehensive Exploratory Data Analysis (EDA)
• Generate dataset-specific visualizations
• Build relevant ML models
• Extract meaningful insights
• Create a custom PowerPoint presentation
Please provide your data file path (CSV/Excel):
> examples/sales.csv
Starting autonomous data analysis...
Analysis Complete!
PowerPoint Report Created!
File: analysis_report.pptx
- Data Quality Assessment
  - Missing value analysis
  - Data type detection
  - Statistical summaries
  - Outlier identification
- Exploratory Data Analysis
  - Correlation analysis
  - Distribution analysis
  - Categorical variable exploration
  - Time series analysis (if applicable)
- Intelligent Visualizations
  - Histograms and distribution plots
  - Correlation heatmaps
  - Box plots and violin plots
  - Scatter plots and trend analysis
  - Bar charts for categorical data
- Machine Learning Pipeline
  - Automatic problem type detection (classification/regression)
  - Feature engineering and selection
  - Model training and evaluation
  - Feature importance analysis
  - Performance metrics and confusion matrices
- Professional Reporting
  - Executive summary
  - Dataset overview
  - Visualization slides with explanations
  - ML model results
  - Key insights and recommendations
  - Actionable conclusions
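To make the data quality items above concrete, here is a small standard-library sketch of two of them: missing value counts and outlier identification. It assumes Tukey's IQR fences as the outlier rule, which this README does not confirm. The project itself lists Pandas and NumPy for data processing, so its real implementation will differ. Both function names are hypothetical.

```python
import statistics


def missing_counts(rows):
    """Count empty or None values per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or str(value).strip() == "":
                counts[column] += 1
    return counts


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```

`statistics.quantiles` requires Python 3.8 or later, which matches the version listed in the prerequisites.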
Your analysis will generate:
- Multiple PNG visualizations in the plots/ directory
- analysis_report.pptx - Professional PowerPoint presentation
- Model artifacts in the artifacts/ directory
- Console output with step-by-step progress
- AI Framework: SmolagentsAI (Hugging Face)
- LLM: OpenAI GPT models
- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn, Plotly
- Machine Learning: Scikit-learn
- Reporting: python-pptx
- Configuration: python-dotenv, PyYAML
For detailed documentation on each component:
- Tools Documentation - Individual tool descriptions
- Configuration Guide - System prompts and settings
- Example Datasets - Sample data descriptions
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow Python PEP 8 style guidelines
- Add docstrings to all functions and classes
- Test your changes with sample datasets
- Update documentation for new features
# Core AI and ML
smolagents>=0.1.0
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
# LLM Integration
openai>=1.0.0
python-dotenv>=1.0.0
# Reporting
python-pptx>=0.6.21
# Additional utilities
requests>=2.31.0
click>=8.1.0
rich>=13.0.0
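After installing, a quick way to see which of these packages are actually present in the active environment is to query their installed versions. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8). It reports versions but does not compare them against the minimums above, and the `installed_versions` helper is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the requirements list above.
REQUIRED = [
    "smolagents", "pandas", "numpy", "scikit-learn", "matplotlib",
    "seaborn", "openai", "python-dotenv", "python-pptx",
    "requests", "click", "rich",
]


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:15} {version or 'NOT INSTALLED'}")
```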
- API Key Error
  Error: OpenAI API key not found
  Solution: Ensure your .env file contains OPENAI_API_KEY=your_key_here
- File Not Found
  File not found: data.csv
  Solution: Provide the full path to your data file
- Memory Issues with Large Files
  MemoryError: Unable to allocate array
  Solution: Use smaller datasets or increase system memory
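For the memory issue, one way to produce a smaller dataset without loading the whole file is to stream it and keep a uniform random sample of rows. The sketch below uses reservoir sampling (Algorithm R) with the standard `csv` module. It is a suggested workaround, not a feature of Spark Insights, and the function name and default sample size are hypothetical.

```python
import csv
import random


def sample_csv(src_path, dst_path, k=10_000, seed=0):
    """Write a uniform random sample of up to k data rows from src to dst.

    Uses reservoir sampling (Algorithm R), so the source file is streamed
    row by row and never held in memory as a whole. Returns the number of
    data rows written. Sampled rows are not kept in their original order.
    """
    rng = random.Random(seed)
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            raise ValueError("source file is empty")
        reservoir = []
        for index, row in enumerate(reader):
            if index < k:
                reservoir.append(row)
            else:
                slot = rng.randint(0, index)  # inclusive on both ends
                if slot < k:
                    reservoir[slot] = row
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

For example, `sample_csv("big.csv", "big_sample.csv", k=50_000)` writes a sample file whose path can then be given at the data file prompt; both file names here are placeholders.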
This project is licensed under the MIT License - see the LICENSE file for details.
- SmolagentsAI by Hugging Face for the agent framework
- OpenAI for advanced language model capabilities
- Python Data Science Community for excellent libraries
- Built with modern Python best practices for autonomous data analysis