zhichzhang/SECmp

Search Engine Ranking Comparator

Introduction

SECmp is a search analysis system designed to collect, normalize, and compare search engine rankings, using Google Search results as the relevance baseline.

The system issues real-world queries to multiple search engines, extracts the top organic results, and evaluates ranking similarity against Google using percent overlap and Spearman rank correlation. SECmp focuses on ranking behavior, relevance divergence, and signal consistency across heterogeneous search platforms.

The project touches on core problems in information retrieval, ranking systems, and signal evaluation, and it mirrors workflows used in search quality analysis and automated sourcing systems.

Workflow

SECmp follows a two-stage pipeline:

1. Search Result Collection (Task 1)

  1. Query Ingestion

    • Load a predefined query set (e.g., 100 real-world queries).
  2. Multi-Engine Search

    • Issue queries to supported search engines:

      • Google (baseline, pre-collected)
      • Bing
      • Yahoo!
      • DuckDuckGo
      • Ask
    • Apply randomized delays to avoid throttling.

  3. Result Extraction

    • Use Selenium for dynamic engines (Bing, Yahoo!).
    • Use Requests + BeautifulSoup for static or semi-static engines.
    • Decode redirect and tracking URLs to obtain canonical destination links.
  4. Result Normalization

    • Clean and normalize URLs to remove:

      • Scheme differences (http / https)
      • Trailing slashes
      • Tracking and redirect wrappers
  5. Structured Storage

    • Persist top-10 results per query into engine-specific JSON files.
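
The redirect decoding and normalization steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the `WRAPPER_PARAMS` list of query-parameter names, and the host lowercasing are all assumptions made for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical query-parameter names in which a redirect wrapper may carry
# the real destination; adjust to what each engine actually emits.
WRAPPER_PARAMS = ("uddg", "url")


def unwrap_redirect(url: str) -> str:
    """Return the destination embedded in a redirect wrapper, if any."""
    query = parse_qs(urlsplit(url).query)  # parse_qs percent-decodes values
    for name in WRAPPER_PARAMS:
        for candidate in query.get(name, []):
            if candidate.startswith(("http://", "https://")):
                return candidate
    return url


def normalize_url(url: str) -> str:
    """Build a comparison key that ignores scheme and trailing slashes."""
    parts = urlsplit(unwrap_redirect(url.strip()))
    host = parts.netloc.lower()      # assumption: hosts compare case-insensitively
    path = parts.path.rstrip("/")    # drop trailing slashes
    key = host + path                # the scheme is deliberately left out
    if parts.query:
        key += "?" + parts.query
    return key
```

With this sketch, `http://Example.com/page/` and `https://example.com/page` both map to the key `example.com/page`. A legitimate destination that happens to carry its own `url=` parameter would be unwrapped by mistake, so a real implementation would likely restrict unwrapping to known wrapper hosts.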

2. Ranking Comparison & Evaluation (Task 2)

  1. Baseline Alignment

    • Load Google Search results as the reference ranking.
  2. Overlap Analysis

    • Compute the number and percentage of overlapping URLs between each engine and Google.
  3. Ranking Correlation

    • Compute the Spearman rank correlation coefficient for overlapping results.
    • Apply re-ranking logic to handle partial overlaps correctly.
  4. Reporting

    • Generate a summary CSV containing:

      • Overlap count
      • Percent overlap
      • Spearman correlation
      • Per-query results and overall averages
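
One possible reading of the overlap and correlation steps is sketched below. It rests on several assumptions: both lists hold normalized, duplicate-free URLs; percent overlap is measured against the length of the Google list; "re-ranking" means assigning ranks 1..n within the overlapping subset; and the handling of zero or one shared URL is a convention chosen for this example. The function name `compare_rankings` is hypothetical.

```python
from typing import List, Tuple


def compare_rankings(google: List[str], other: List[str]) -> Tuple[int, float, float]:
    """Return (overlap count, percent overlap, Spearman rho) for one query."""
    other_pos = {url: i for i, url in enumerate(other)}
    # Overlapping URLs, kept in Google's order.
    shared = [url for url in google if url in other_pos]
    n = len(shared)
    percent = 100.0 * n / len(google) if google else 0.0

    if n == 0:
        return 0, percent, 0.0
    if n == 1:
        # Assumed convention: the formula is undefined for n = 1, so score 1.0
        # only when the single shared URL sits at the same position in both lists.
        url = shared[0]
        return 1, percent, 1.0 if google.index(url) == other_pos[url] else 0.0

    # Re-rank within the overlap: ranks 1..n by Google order and by the
    # other engine's order.
    by_other = sorted(shared, key=other_pos.__getitem__)
    rerank_other = {url: rank for rank, url in enumerate(by_other, start=1)}
    d_squared = sum((g_rank - rerank_other[url]) ** 2
                    for g_rank, url in enumerate(shared, start=1))
    rho = 1.0 - 6.0 * d_squared / (n * (n * n - 1))
    return n, percent, rho
```

For example, comparing `["a", "b", "c", "d"]` with `["b", "a", "x", "d"]` gives three shared URLs, 75% overlap, and a coefficient of 0.5, because only the first two shared items swap places.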

Output

SECmp generates structured outputs for both stages:

Task 1 — Search Results

output/
└── task1/
    └── <timestamp>/
        ├── Bing_Results.json
        ├── Yahoo!_Results.json
        ├── DuckDuckGo_Results.json
        └── Ask_Results.json

Each JSON file maps:

Query → Top-10 ranked URLs
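
A file with that shape could be written and read back roughly as follows. The helper names, the timestamp format, and the ten-item truncation at write time are assumptions for illustration; only the directory layout and the query-to-URL-list mapping come from the description above.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, List


def save_results(engine: str, results: Dict[str, List[str]], root: Path) -> Path:
    """Write one engine's results to <root>/task1/<timestamp>/<engine>_Results.json."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    out_dir = root / "task1" / stamp
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{engine}_Results.json"
    # Keep at most ten URLs per query, preserving rank order.
    trimmed = {query: urls[:10] for query, urls in results.items()}
    path.write_text(json.dumps(trimmed, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def load_results(path: Path) -> Dict[str, List[str]]:
    """Read a results file back into a query -> ranked-URL-list mapping."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `save_results("Bing", {"some query": [...]}, Path("output"))` would produce a `Bing_Results.json` file under a fresh timestamped folder.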

Task 2 — Ranking Evaluation

output/
└── task2/
    └── evaluation.csv

The CSV includes:

  • Query index
  • Number of overlapping results
  • Percent overlap with Google
  • Spearman rank correlation coefficient
  • Aggregate averages
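
As a rough sketch of how such a report might be assembled for a single engine, the snippet below writes per-query rows followed by a row of averages. The column headers, row labels, and the `write_report` helper are hypothetical; the real `evaluation.csv` layout may differ.

```python
import csv
import io
from typing import List, Tuple

Row = Tuple[int, float, float]  # (overlap count, percent overlap, Spearman rho)


def write_report(rows: List[Row], out) -> None:
    """Write one line per query followed by a line of column averages."""
    writer = csv.writer(out)
    writer.writerow(["Query", "Overlap", "Percent Overlap", "Spearman"])  # assumed headers
    for index, (count, percent, rho) in enumerate(rows, start=1):
        writer.writerow([f"Query {index}", count, percent, rho])
    if rows:
        n = len(rows)
        writer.writerow(["Averages",
                         sum(r[0] for r in rows) / n,
                         sum(r[1] for r in rows) / n,
                         sum(r[2] for r in rows) / n])


# Usage: write to an in-memory buffer; a real run would pass an open file.
buffer = io.StringIO()
write_report([(3, 30.0, 0.5), (5, 50.0, -0.5)], buffer)
report_text = buffer.getvalue()
```

When writing to disk instead, open the file with `newline=""` as the `csv` module documentation recommends.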

Tech Stack

  • Language: Python
  • Networking: requests
  • HTML Parsing: BeautifulSoup (beautifulsoup4)
  • Browser Automation: Selenium (Chrome WebDriver)
  • Ranking Metrics: Spearman rank correlation
  • Data Formats: JSON, CSV

How to Run

1. Prerequisites

  • Python 3.8+
  • Google Chrome
  • ChromeDriver (compatible with your Chrome version)

Install dependencies:

pip install -r requirements.txt

2. Prepare Query & Baseline Data

  • Place the query file under:

    ./assets/100QueriesSet*.txt
    
  • Place the Google baseline results under:

    ./assets/results/Google_Result*.json
    

3. Run Search Result Collection (Task 1)

python secmp_task1.py

This will:

  • Query each supported search engine
  • Collect top-10 results per query
  • Store results as JSON files

4. Run Ranking Comparison (Task 2)

python secmp_task2.py

This will:

  • Compare each engine against Google
  • Compute overlap and Spearman correlation
  • Generate a summary CSV report

Notes

  • Google results are treated as the relevance baseline.
  • Randomized delays are enforced to reduce blocking risk.
  • URL normalization is critical to ensure fair comparison.
  • The system evaluates ranking behavior, not content quality.
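
The randomized delay mentioned in the notes can be as simple as the sketch below. The helper name `polite_sleep` and the default bounds of 2 to 6 seconds are assumptions for illustration, not values taken from the project.

```python
import random
import time


def polite_sleep(min_seconds: float = 2.0, max_seconds: float = 6.0) -> float:
    """Sleep for a uniformly random duration and return the delay used."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` between consecutive requests to the same engine spaces them irregularly, which is the behavior the note describes.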
