The PHP Sitemap Generator is a script that recursively crawls a website, collects its URLs, and stores them in a database. It can also generate an XML sitemap from the collected URLs, which helps search engines discover and index the site's pages.
This script handles various tasks such as filtering out excluded patterns, avoiding duplicate entries, checking page statuses, and supporting dynamic database creation and updates.
- Automatic Database Setup: Creates a database and table if they do not exist.
- Asynchronous Page Loading: Uses cURL multi-handle for efficient web scraping.
- Page Status Validation: Ensures only active pages (status 200) are stored in the database.
- Exclude Patterns: Skips specific pages based on configurable patterns.
- Recursive Crawling: Traverses internal links on the website.
- XML Sitemap Generation: Automatically creates a sitemap.xml file with valid URLs from the database.
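The asynchronous loading and status validation features can be illustrated with a minimal sketch built on PHP's cURL multi-handle functions. The function name `fetchPages`, the timeout, and the shape of the returned array are assumptions made for illustration; the script's actual implementation may differ.

```php
<?php
// Hypothetical helper (not taken from the script): fetches several URLs in
// parallel and returns ['status' => int, 'body' => string] for each one.
function fetchPages(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true, // keep the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true, // follow redirects
            CURLOPT_TIMEOUT        => 10,   // assumed timeout in seconds
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity, up to 1 second
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'status' => curl_getinfo($ch, CURLINFO_RESPONSE_CODE),
            'body'   => (string) curl_multi_getcontent($ch),
        ];
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

A caller would then keep only the entries whose `status` equals 200 before storing them.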
- PHP 7.4 or higher
- PDO extension enabled for database interaction
- A valid package.json file containing database credentials:
    {
        "database": {
            "host": "localhost",
            "port": "3306",
            "db": "sitemap_db",
            "username": "root",
            "password": ""
        }
    }
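As a rough illustration of how these credentials might be consumed, the sketch below parses the JSON and builds a PDO connection string for MySQL. The function name `buildDsn` and the error messages are hypothetical. Because the script creates the database when it is missing, its first connection presumably omits `dbname`; this sketch only covers the case where the database already exists.

```php
<?php
// Hypothetical helper: turns the JSON configuration into PDO constructor arguments.
function buildDsn(string $json): array
{
    $config = json_decode($json, true);
    if (!is_array($config) || !isset($config['database'])) {
        throw new RuntimeException('package.json is invalid or has no "database" section');
    }

    $db = $config['database'];
    foreach (['host', 'port', 'db', 'username', 'password'] as $key) {
        if (!array_key_exists($key, $db)) {
            throw new RuntimeException("package.json: missing database.$key");
        }
    }

    $dsn = sprintf(
        'mysql:host=%s;port=%s;dbname=%s;charset=utf8mb4',
        $db['host'],
        $db['port'],
        $db['db']
    );

    return [$dsn, $db['username'], $db['password']];
}

// Usage (assumes package.json sits next to the script and MySQL is reachable):
// [$dsn, $user, $pass] = buildDsn(file_get_contents(__DIR__ . '/package.json'));
// $pdo = new PDO($dsn, $user, $pass, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
```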
- Clone the repository or download the script.
- Place the script files in your project directory.
- Ensure the package.json file is configured with the correct database credentials.
- Verify PHP is installed and properly configured on your system.
- Open the script and set the $baseUrl variable to the URL of your website:
    $baseUrl = 'https://example.com';
- Run the script via CLI or a web server.
- The script will:
  - Create the necessary database and table.
  - Crawl the website starting from the base URL.
  - Save valid URLs to the database.
  - Generate an XML sitemap file (sitemap.xml) in the script's directory.
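The crawl step described above can be sketched as a loop over a queue of pending URLs. This is a simplified illustration, not the script's actual code: the function name `crawl`, the injected `$fetch` callable, the regex-based link extraction, and the substring matching of exclude patterns are all assumptions. It also uses a queue instead of literal recursion, which visits the same pages without deep call stacks.

```php
<?php
// Hypothetical sketch: $fetch stands in for the real HTTP layer and must
// return ['status' => int, 'body' => string] for a given URL.
function crawl(string $baseUrl, callable $fetch, array $excludePatterns = []): array
{
    $parts  = parse_url($baseUrl);
    $host   = $parts['host'];
    $origin = $parts['scheme'] . '://' . $host; // assumption: default port only

    $queue = [$baseUrl];
    $seen  = [$baseUrl => true]; // guards against duplicate entries
    $valid = [];

    while ($queue) {
        $url  = array_shift($queue);
        $page = $fetch($url);
        if ($page['status'] !== 200) {
            continue; // only active pages are kept
        }
        $valid[] = $url;

        // A regex is enough for a sketch; a DOM parser would be more robust.
        preg_match_all('/<a\s[^>]*href=["\']([^"\'#]+)/i', $page['body'], $matches);
        foreach ($matches[1] as $href) {
            if (strpos($href, '//') === 0) {
                continue; // protocol-relative links are skipped in this sketch
            }
            if ($href[0] === '/') {
                $href = $origin . $href; // resolve root-relative links
            }
            if (parse_url($href, PHP_URL_HOST) !== $host) {
                continue; // external, relative, or non-HTTP link
            }
            foreach ($excludePatterns as $pattern) {
                if (strpos($href, $pattern) !== false) {
                    continue 2; // matches an exclude pattern
                }
            }
            if (!isset($seen[$href])) {
                $seen[$href] = true;
                $queue[] = $href;
            }
        }
    }

    return $valid;
}
```

Injecting `$fetch` keeps the traversal logic independent of the network, so it can be exercised with canned pages.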
You can modify the $excludePatterns array to exclude specific URL patterns during the crawl:
    $excludePatterns = [
        '/cart/',
        '/compare/',
        '/profile/',
        '/download/',
        '/search/'
    ];
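One plausible way to apply such patterns is a plain substring check against each URL, sketched below. The function name `isExcluded` is hypothetical, and substring matching is an assumption: the script may instead treat the entries as regular expressions, in which case `'/cart/'` would match "cart" anywhere in the URL.

```php
<?php
// Hypothetical helper: returns true when the URL contains any exclude pattern.
function isExcluded(string $url, array $excludePatterns): bool
{
    foreach ($excludePatterns as $pattern) {
        if (strpos($url, $pattern) !== false) {
            return true;
        }
    }
    return false;
}
```

Under this assumption, `https://example.com/cart/checkout` is skipped, while `https://example.com/cartoon` is not, because it lacks the trailing slash of `/cart/`.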
The script creates the following database table:
    CREATE TABLE sitemap_urls (
        id INT AUTO_INCREMENT PRIMARY KEY,
        url VARCHAR(255) NOT NULL,
        lastmod DATE NOT NULL,
        changefreq VARCHAR(20) DEFAULT 'weekly',
        priority DECIMAL(2,1) DEFAULT 0.8,
        status INT NOT NULL
    );
- Database: All crawled URLs are stored with metadata such as lastmod, changefreq, priority, and status.
- XML Sitemap: The sitemap.xml file contains all valid URLs ready for submission to search engines.
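To make the sitemap output concrete, here is a minimal sketch that turns rows shaped like the sitemap_urls table into sitemap XML. The function name `buildSitemapXml` is hypothetical, and the plain string building is chosen for brevity; the script itself may rely on an XML library.

```php
<?php
// Hypothetical helper: $rows are associative arrays with the columns
// url, lastmod, changefreq and priority (as in the sitemap_urls table).
function buildSitemapXml(array $rows): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

    $columns = [
        'url'        => 'loc',
        'lastmod'    => 'lastmod',
        'changefreq' => 'changefreq',
        'priority'   => 'priority',
    ];

    foreach ($rows as $row) {
        $xml .= "  <url>\n";
        foreach ($columns as $column => $tag) {
            // Escape &, <, > and quotes so the output stays well-formed XML.
            $value = htmlspecialchars((string) $row[$column], ENT_XML1 | ENT_QUOTES, 'UTF-8');
            $xml .= "    <$tag>$value</$tag>\n";
        }
        $xml .= "  </url>\n";
    }

    return $xml . "</urlset>\n";
}

// Usage (assumes $rows was read from sitemap_urls, e.g. WHERE status = 200):
// file_put_contents(__DIR__ . '/sitemap.xml', buildSitemapXml($rows));
```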
- If the package.json file is missing or incorrectly formatted, the script will terminate with an error.
- If a URL is inaccessible or returns a non-200 status code, it will not be saved in the database.
This project is open-source and available under the MIT License.
Contributions are welcome! Feel free to submit issues or pull requests to improve the script.
For any inquiries or support, contact roman@matviy.pp.ua or visit https://roman.matviy.pp.ua.