SECmp is a search analysis system designed to collect, normalize, and compare search engine rankings, using Google Search results as the relevance baseline.
The system issues real-world queries to multiple search engines, extracts the top organic results, and evaluates ranking similarity against Google using percent overlap and Spearman rank correlation. SECmp focuses on ranking behavior, relevance divergence, and signal consistency across heterogeneous search platforms.
This project reflects core problems in information retrieval, ranking systems, and signal evaluation, and mirrors workflows used in search quality analysis and automated sourcing systems.
SECmp follows a two-stage pipeline:

Stage 1 (collection):

- Query Ingestion
  - Load a predefined query set (e.g. 100 real-world queries).
- Multi-Engine Search
  - Issue queries to supported search engines:
    - Google (baseline, pre-collected)
    - Bing
    - Yahoo!
    - DuckDuckGo
    - Ask
  - Apply randomized delays to avoid throttling.
- Result Extraction
  - Use Selenium for dynamic engines (Bing, Yahoo!).
  - Use Requests + BeautifulSoup for static or semi-static engines.
  - Decode redirect and tracking URLs to obtain canonical destination links.
- Result Normalization
  - Clean and normalize URLs to remove:
    - Scheme differences (http / https)
    - Trailing slashes
    - Tracking and redirect wrappers
- Structured Storage
  - Persist top-10 results per query into engine-specific JSON files.
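
The redirect decoding and URL normalization steps can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names, the `uddg` wrapper parameter (assumed here to be how DuckDuckGo wraps outbound links), and the tracking-parameter lists are all assumptions. Lowercasing the host is an extra step not listed above, added because hostnames are case-insensitive.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit

# Hypothetical rule sets for illustration; the real project may use others.
REDIRECT_PARAMS = ("uddg",)          # assumed redirect-wrapper parameter
TRACKING_PREFIXES = ("utm_",)        # assumed tracking-parameter prefix
TRACKING_KEYS = {"fbclid", "gclid"}  # assumed tracking-parameter names


def unwrap_redirect(url: str) -> str:
    """Return the destination carried inside a redirect wrapper, if any."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also percent-decodes
    for name in REDIRECT_PARAMS:
        target = query.get(name, [""])[0]
        if target.startswith(("http://", "https://")):
            return target
    return url


def normalize_url(url: str) -> str:
    """Reduce a URL to a scheme-free key with no trailing slash."""
    parts = urlsplit(unwrap_redirect(url.strip()))
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    # Keep only query parameters that are not known tracking parameters.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(kept)
    return host + path + ("?" + query if query else "")
```

With these rules, `http://Example.com/page` and `https://example.com/page/` both reduce to the same key, so they count as the same result when engines are compared.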

Stage 2 (evaluation):

- Baseline Alignment
  - Load Google Search results as the reference ranking.
- Overlap Analysis
  - Compute the number and percentage of overlapping URLs between each engine and Google.
- Ranking Correlation
  - Compute Spearman rank correlation coefficient for overlapping results.
  - Apply re-ranking logic to handle partial overlaps correctly.
- Reporting
  - Generate a summary CSV containing:
    - Overlap count
    - Percent overlap
    - Spearman correlation
    - Per-query results and overall averages
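
The overlap and correlation steps could be implemented along these lines. This is a sketch under stated assumptions, not the project's confirmed logic: `compare_rankings` is a hypothetical name, "re-ranking" is interpreted as renumbering the overlapping URLs 1..n within each list before applying the standard formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), and the conventions for zero or one overlapping result are assumed. URLs are expected to be normalized and unique within each list.

```python
from typing import List, Tuple


def compare_rankings(google: List[str], other: List[str]) -> Tuple[int, float, float]:
    """Return (overlap count, percent overlap, Spearman rho) for one query."""
    google_pos = {url: i for i, url in enumerate(google)}
    # Each overlapping URL as (position in Google, position in other engine).
    pairs = [(google_pos[url], j) for j, url in enumerate(other) if url in google_pos]
    n = len(pairs)
    percent = 100.0 * n / len(google) if google else 0.0

    if n == 0:
        return 0, percent, 0.0  # assumed convention: no overlap gives rho = 0
    if n == 1:
        # Assumed convention: one match is perfect agreement only if the
        # URL sits at the same position in both lists.
        g, o = pairs[0]
        return 1, percent, 1.0 if g == o else 0.0

    # Re-rank: renumber the overlapping URLs 1..n within each list.
    g_rank = {pos: r for r, pos in enumerate(sorted(p[0] for p in pairs), start=1)}
    o_rank = {pos: r for r, pos in enumerate(sorted(p[1] for p in pairs), start=1)}
    d_squared = sum((g_rank[g] - o_rank[o]) ** 2 for g, o in pairs)
    rho = 1 - 6 * d_squared / (n * (n * n - 1))
    return n, percent, rho
```

Re-ranking keeps the coefficient within [-1, 1] even when the matching URLs sit at scattered positions in the two top-10 lists.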
SECmp generates structured outputs for both stages.

Stage 1:

    output/
    └── task1/
        └── <timestamp>/
            ├── Bing_Results.json
            ├── Yahoo!_Results.json
            ├── DuckDuckGo_Results.json
            └── Ask_Results.json

Each JSON file maps each query to its top-10 ranked URLs.
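
As an illustration of that mapping, the sketch below writes and reads a query-to-URLs dictionary with the standard `json` module. The helper names `save_results` and `load_results` are hypothetical, and the exact key and value layout of the real files is assumed from the description above.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_results(results: Dict[str, List[str]], path: Path) -> None:
    """Write a {query: [ranked URLs]} mapping as pretty-printed JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)


def load_results(path: Path) -> Dict[str, List[str]]:
    """Read a {query: [ranked URLs]} mapping back from disk."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

List order carries the ranking, so the first URL under a query is that engine's top result.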

Stage 2:

    output/
    └── task2/
        └── evaluation.csv

The CSV includes:
- Query index
- Number of overlapping results
- Percent overlap with Google
- Spearman rank correlation coefficient
- Aggregate averages
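
A report with these columns might be produced as sketched below. The function name `write_report`, the column headers, and the label of the averages row are assumptions for illustration; only the set of fields comes from the list above.

```python
import csv
from typing import Iterable, TextIO, Tuple


def write_report(rows: Iterable[Tuple[int, float, float]], out: TextIO) -> None:
    """Write per-query (overlap, percent, rho) rows followed by their averages."""
    rows = list(rows)
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical header names.
    writer.writerow(["Query", "Overlap", "Percent Overlap", "Spearman"])
    for index, (overlap, percent, rho) in enumerate(rows, start=1):
        writer.writerow([f"Query {index}", overlap, percent, rho])
    if rows:
        count = len(rows)
        averages = [sum(row[k] for row in rows) / count for k in range(3)]
        writer.writerow(["Averages"] + averages)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `open("evaluation.csv", "w", newline="")`.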
- Language: Python
- Networking: requests
- HTML Parsing: BeautifulSoup (beautifulsoup4)
- Browser Automation: Selenium (Chrome WebDriver)
- Ranking Metrics: Spearman rank correlation
- Data Formats: JSON, CSV
- Python 3.8+
- Google Chrome
- ChromeDriver (compatible with your Chrome version)
Install dependencies:

    pip install -r requirements.txt

- Place query file under: ./assets/100QueriesSet*.txt
- Place Google baseline results under: ./assets/results/Google_Result*.json


    python secmp_task1.py

This will:
- Query each supported search engine
- Collect top-10 results per query
- Store results as JSON files

    python secmp_task2.py

This will:
- Compare each engine against Google
- Compute overlap and Spearman correlation
- Generate a summary CSV report
- Google results are treated as the relevance baseline.
- Randomized delays are enforced to reduce blocking risk.
- URL normalization is critical to ensure fair comparison.
- The system evaluates ranking behavior, not content quality.
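
The randomized delays noted above could be enforced with a small wrapper like the one below. This is a sketch only: `fetch_with_delays`, its parameters, and the 2 to 6 second default range are assumptions, and `fetch` stands in for whatever per-engine search function the project uses.

```python
import random
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def fetch_with_delays(
    queries: Iterable[str],
    fetch: Callable[[str], List[str]],
    min_delay: float = 2.0,   # assumed lower bound, in seconds
    max_delay: float = 6.0,   # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (query, results), pausing a random interval between requests."""
    for index, query in enumerate(queries):
        if index > 0:
            # Sleep before every request except the first.
            sleep(random.uniform(min_delay, max_delay))
        yield query, fetch(query)
```

The `sleep` parameter is injectable so the pacing logic can be exercised in tests without actually waiting.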