Q2BSTUDIO Auditor


A Python-based investigation tool for analyzing industrial-scale automated content generation systems. Developed to document and analyze the publishing patterns of q2bstudio.com.

Featured Investigation

This case was investigated and published by Numerama (December 2025):
Qui sont ces parasites qui pillent les articles à l'ère de l'IA ? On a remonté la piste d'une immense ferme à contenus ("Who are these parasites plundering articles in the AI era? We traced the trail of a huge content farm")

The investigation revealed an industrial-scale AI content farm publishing up to 10,275 articles per day (one every 8.4 seconds), with over 210,000 articles documented.


Background

This tool was created following the discovery that my technical article about OBSIDIAN Neural (an open-source AI music VST plugin) had been reproduced, translated, and republished by Q2BSTUDIO's automated system, with commercial links injected into the text.

When their blog showed 36,516+ pages of content, I developed this auditor to systematically document their publishing volume and patterns.

Full case study: https://dev.to/innermost_47/when-ai-content-systems-reproduce-content-without-attribution-a-documented-case-study-1h0g

Features

  • Systematic Blog Scraping: Crawls through all pagination pages of Q2BSTUDIO's blog
  • Data Extraction: Captures article titles, URLs, publication dates, and page numbers
  • Spanish Date Parsing: Handles Spanish-language date formats (see the sketch after this list)
  • Statistical Analysis: Generates comprehensive reports on publication patterns
  • Data Visualization: Creates charts showing daily article production, timelines, and comparative statistics
  • Wayback Machine Integration: Automatically archives sampled articles for evidence preservation
  • Checkpoint System: Saves progress periodically to prevent data loss
  • CSV Export: Exports all data in standard CSV format for further analysis
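
A minimal sketch of the Spanish date handling mentioned above, assuming dates appear as "7 de diciembre de 2025" (the month map, regex, and function name here are illustrative, not the exact parser in q2b_studio_auditor.py):

import re
from datetime import date

# Spanish month names mapped to month numbers (assumed input format: "7 de diciembre de 2025")
SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

def parse_spanish_date(text: str) -> date | None:
    """Parse dates like '7 de diciembre de 2025'; return None when the text does not match."""
    match = re.search(r"(\d{1,2})\s+de\s+([a-záéíóúñ]+)\s+de\s+(\d{4})", text.lower())
    if not match:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name)
    return date(int(year), month, int(day)) if month else None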

Requirements

  • Python 3.10+
  • pip (Python package manager)

Python Dependencies

requests
beautifulsoup4
matplotlib

All dependencies are listed in requirements.txt and will be installed automatically.

Installation

git clone https://github.com/innermost47/q2bs.git
cd q2bs
python -m venv env
source env/bin/activate  # On macOS/Linux
# or: env\Scripts\activate  # On Windows
pip install -r requirements.txt

Usage

Basic Scraping

python main.py

The script will:

  1. Detect the maximum number of blog pages (36,516+ as of December 2025)
  2. Ask for confirmation before scraping
  3. Scrape all articles systematically
  4. Generate statistical reports
  5. Create data visualizations
  6. Optionally archive a sample to Wayback Machine

Resume from Checkpoint

The auditor includes a checkpoint system that allows you to resume interrupted scraping sessions:

python main.py

When you run the script, you'll be presented with:

  1. A list of existing checkpoint directories
  2. An option to select a checkpoint to resume from
  3. An option to start fresh (new scraping)
  4. An option to run in visualization-only mode (skip scraping)

The script automatically:

  • Calculates which page to resume from based on article IDs (see the sketch after this list)
  • Avoids duplicate articles
  • Continues in the same output directory
  • Preserves all previously scraped data
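
A rough sketch of how such a resume-page estimate can work, assuming checkpoint.json stores the scraped articles with sequential numeric IDs and that the blog lists a fixed number of articles per page, newest first (the field names and per-page constant are illustrative, not the script's exact logic):

import json

ARTICLES_PER_PAGE = 9  # illustrative assumption about the blog's pagination

def resume_page_from_checkpoint(path: str, max_article_id: int) -> int:
    """Estimate the pagination page to resume from, given the highest article ID currently on the site."""
    with open(path, encoding="utf-8") as f:
        checkpoint = json.load(f)
    scraped_ids = [a["article_id"] for a in checkpoint["articles"]]  # assumed field names
    min_id = min(scraped_ids)
    # Newest articles sit on page 1, so lower (older) IDs appear on later pages.
    return (max_article_id - min_id) // ARTICLES_PER_PAGE + 1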

Example workflow:

Available checkpoints:
0. Start fresh (new scraping)
1. q2b_audit_20251208_103810 - 242,191 articles - 2025-12-08T10:38:10

Select checkpoint number (0 for fresh start): 1
Visualize only (skip scraping)? (yes/no): no

Loading checkpoint from: q2b_audit_20251208_103810
Loaded 242,191 articles from checkpoint
Min article ID scraped (last article): 91,643
Calculated resume page: 27,373

Resuming from page 27,373
This will scrape 10,184 pages (from 27,373 to 37,556). Continue? (yes/no):

Visualization-Only Mode

If you want to regenerate visualizations without re-scraping:

python main.py

Then:

  1. Select an existing checkpoint
  2. Answer "yes" to "Visualize only (skip scraping)?"

This will:

  • Load the existing data
  • Regenerate all reports
  • Create fresh visualizations
  • Skip the scraping phase entirely

Use cases:

  • Update graphs with new styling
  • Generate reports after manual data cleaning
  • Create visualizations for different time periods

Output Structure

After running, you'll get a timestamped directory:

q2b_audit_YYYYMMDD_HHMMSS/
├── articles.csv               # All articles with metadata
├── daily_summary.csv          # Articles per day
├── checkpoint.json            # Progress checkpoint
├── report.json                # Statistical analysis
├── archiving_checkpoint.json  # Wayback archiving progress
├── articles_archived.csv      # Articles with archive URLs
├── archive_report.json        # Archiving statistics
├── wayback_urls.txt           # List of Wayback URLs
└── graphs/
    ├── 1_daily_articles.png   # Daily production chart
    ├── 2_timeline.png         # Publication timeline
    └── 3_stats_summary.png    # Statistical summary

Key Findings (December 2025)

Based on analysis of the period November 20 - December 7, 2025:

  • 144,966 articles documented (partial dataset; scraping was interrupted by a system crash)
  • 8,401 articles per day on average
  • Peak day: 10,251 articles (December 4, 2025)
  • Frequency: one article every 10.3 seconds on average (86,400 seconds/day ÷ 8,401 articles/day)

For comparison (industry estimates - approximate):

  • TechCrunch: ~40 articles/day
  • The Verge: ~30 articles/day
  • The New York Times: ~250 articles/day

Files Description

q2b_studio_auditor.py

Core scraping engine that:

  • Fetches blog pages systematically (see the sketch after this list)
  • Parses article metadata
  • Handles Spanish date formats
  • Generates statistical reports
  • Manages checkpoints
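
A minimal sketch of that fetch-and-parse step, assuming ?page=N pagination and article links inside <article> elements (the base URL and CSS selector are assumptions, not the script's actual values):

import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.q2bstudio.com/blog"  # assumed pagination endpoint

def scrape_page(page_num: int) -> list[dict]:
    """Fetch one blog listing page and return basic metadata for each article found."""
    response = requests.get(BASE_URL, params={"page": page_num}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for node in soup.select("article"):  # selector is an assumption
        link = node.find("a", href=True)
        if link:
            articles.append({
                "title": link.get_text(strip=True),
                "url": link["href"],
                "page_num": page_num,
            })
    time.sleep(0.5)  # matches the documented delay between page requests
    return articles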

q2b_data_visualizer.py

Visualization module that creates:

  • Daily article production bar charts (see the sketch after this list)
  • Publication timeline graphs
  • Comparative statistics with major publishers
  • All charts with proper labeling and context
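
A small sketch of the daily-production chart, assuming daily_summary.csv contains 'date' and 'article_count' columns (the column names are assumptions; replace TIMESTAMP with your audit directory):

import csv
import matplotlib.pyplot as plt

# Load the per-day counts produced by the auditor (column names assumed)
dates, counts = [], []
with open("q2b_audit_TIMESTAMP/daily_summary.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        dates.append(row["date"])
        counts.append(int(row["article_count"]))

plt.figure(figsize=(12, 5))
plt.bar(dates, counts, color="steelblue")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Articles published")
plt.title("Daily article production")
plt.tight_layout()
plt.savefig("daily_articles.png")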

wayback_archiver.py

Archiving system that:

  • Submits URLs to Wayback Machine
  • Checks for existing archives (see the sketch after this list)
  • Handles retry logic for failed submissions
  • Exports archive URLs for verification
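
For the "checks for existing archives" step, the Internet Archive exposes a public availability endpoint; a minimal sketch, independent of the script's actual implementation:

import requests

def latest_snapshot(url: str) -> str | None:
    """Return the most recent Wayback Machine snapshot URL for `url`, or None if none is recorded."""
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    response.raise_for_status()
    snapshot = response.json().get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None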

main.py

Main entry point that orchestrates:

  • User confirmation workflow
  • Scraping execution
  • Report generation
  • Visualization creation
  • Optional archiving

Configuration

Scraping Parameters

In main.py, you can adjust:

# Scrape every Nth page (1 = all pages, 10 = every 10th page)
auditor.scrape_all_pages(max_page, start_page=1, sample_every=1)

# Archive sample size
archiver.archive_sample(sample_size=500)

Rate Limiting

The script includes built-in delays to avoid overwhelming the target server (see the sketch after this list):

  • 0.5 seconds between page requests
  • 3 seconds between Wayback Machine submissions
  • 5-second retry delay on timeouts
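
A sketch of how those delays can be applied around each request (the helper name is illustrative; the delay values match the ones listed above):

import time
import requests

def polite_get(url: str, retries: int = 3) -> requests.Response:
    """GET a URL, waiting 0.5 s before each attempt and retrying after 5 s on timeouts."""
    for _ in range(retries):
        time.sleep(0.5)  # delay between page requests
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.Timeout:
            time.sleep(5)  # retry delay on timeouts
    raise RuntimeError(f"Gave up on {url} after {retries} timeouts")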

Ethical Considerations

This tool was developed for legitimate investigative purposes:

  • Documenting publicly available information
  • Analyzing content patterns
  • Preserving evidence of plagiarism
  • Supporting transparency in automated publishing

Please use responsibly:

  • Respect robots.txt directives
  • Don't overwhelm servers with requests
  • Use for research and documentation purposes
  • Comply with applicable laws and terms of service

Known Issues

Wayback Machine Archiving

The Wayback Machine integration may experience issues:

  • Timeout errors: The Internet Archive's save service can be slow
  • Rate limiting: Heavy usage may trigger temporary blocks
  • Archive verification: Some archived URLs may not be immediately accessible
  • Incomplete snapshots: Not all pages successfully archive on first attempt

Workaround: The script saves all URLs to wayback_urls.txt so you can verify archives manually or re-submit failed URLs later.
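
To verify those archives manually, a quick sketch (assuming wayback_urls.txt contains one URL per line; replace TIMESTAMP with your audit directory):

import time
import requests

with open("q2b_audit_TIMESTAMP/wayback_urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        status = requests.head(url, timeout=30, allow_redirects=True).status_code
        print(f"{status}  {url}")
    except requests.RequestException as exc:
        print(f"ERROR  {url} ({exc})")
    time.sleep(3)  # stay gentle with the Wayback Machine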

Large Dataset Handling

For very large scrapes (30,000+ pages):

  • Consider using the sample_every parameter to sample pages instead of scraping every page
  • Monitor disk space (CSV files can grow large)
  • Be prepared for multi-hour runtime
  • The checkpoint system helps recover from crashes

Data Analysis Tips

Sample Visualizations

The tool generates three types of visualizations:

  1. Statistical Summary (last 4 weeks): 3_stats_summary.png
  2. Publication Timeline: 2_timeline.png
  3. Daily Production: 1_daily_articles.png

Using the CSV Files

import pandas as pd

# Load articles
df = pd.read_csv('q2b_audit_TIMESTAMP/articles.csv')

# Find articles by keyword
keyword_articles = df[df['title'].str.contains('AI', case=False, na=False)]

# Articles by date
daily_counts = df['date_parsed'].value_counts().sort_index()

# Most common page numbers (detect patterns)
page_distribution = df['page_num'].value_counts()
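
To reproduce the headline rate figures from the same DataFrame (continuing from the code above; assumes one row per article and that date_parsed holds the parsed publication date):

# Average articles per day and the implied publishing interval
per_day = df['date_parsed'].value_counts()
avg_per_day = per_day.mean()
print(f"Average: {avg_per_day:.0f} articles/day")
print(f"That is one article every {86400 / avg_per_day:.1f} seconds")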

Checking for Your Content

# Continues from the DataFrame loaded above
# Search for specific phrases from your own articles (joined phrases are treated as a regular expression)
your_phrases = ['your unique phrase', 'another phrase']
matches = df[df['title'].str.contains('|'.join(your_phrases), case=False, na=False)]
print(f"Found {len(matches)} potential matches")

Contributing

If you've experienced similar automated plagiarism or want to improve the tool:

  1. Fork the repository
  2. Create a feature branch
  3. Add your improvements
  4. Submit a pull request

Particularly welcome:

  • Better date parsing for multilingual content
  • Enhanced duplicate detection
  • Content similarity analysis
  • Additional visualization options

Legal

This tool is provided for research, documentation, and investigative journalism purposes. Users are responsible for ensuring their use complies with applicable laws, including:

  • Copyright law
  • Computer fraud and abuse laws
  • Terms of service agreements
  • Data protection regulations

The author makes no warranties about the tool's functionality or the accuracy of data collected.

Citation

If you use this tool in research or journalism, please cite:

CHARRETIER, A. (2025). Q2BSTUDIO Auditor.
GitHub: https://github.com/innermost47/q2bs
Case study: https://dev.to/innermost_47/when-ai-content-systems-reproduce-content-without-attribution-a-documented-case-study-1h0g

Contact

License

MIT License

Copyright (c) 2025 Anthony CHARRETIER

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
