This Bash script filters a list of URLs by criteria such as file extensions, query parameters, and URL patterns. It supports whitelisting, blacklisting, and custom filters so you can refine the list to your needs.
- Filter URLs by whitelisting or blacklisting specific file extensions.
- Include or exclude URLs with query parameters.
- Retain or remove URLs with or without file extensions.
- Preserve blog-like content or trailing slashes in URLs.
- Save the filtered results to a specified output file or display them in the terminal.
- Bash Shell: Ensure you have a Linux/Unix-based system with a Bash shell (default on most Linux distributions and macOS).
- Basic Tools: The script requires `grep`, `sed`, and `cat`, which are typically included in most Linux distributions. Verify their availability by running:

  ```bash
  which grep sed cat
  ```
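If you prefer a scripted check, a short loop like the following reports any missing tool. This is a sketch of the manual check above; the `missing` variable and messages are illustrative, not part of the script:

```bash
# Check that each tool required by filter_urls.sh is on the PATH.
missing=""
for tool in grep sed cat; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    missing="$missing $tool"
  fi
done

if [ -n "$missing" ]; then
  echo "Missing tools:$missing"
else
  echo "All required tools are available"
fi
```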
- Download the Script: Save the script file as `filter_urls.sh`. You can create the file manually or download it from your repository.
1. Clone the Repository: Clone the repository to your local machine using the following command:

   ```bash
   git clone https://github.com/mrmrswd/filter-urls.git
   ```

2. Navigate to the Directory: Change into the repository directory:

   ```bash
   cd filter-urls
   ```
- Make the Script Executable: Run the following command to give the script execution permissions:

  ```bash
  chmod +x filter_urls.sh
  ```
- Move the Script to a Directory in Your PATH: To make the script available system-wide, move it to a directory included in your system's `PATH`:

  ```bash
  sudo mv filter_urls.sh /usr/local/bin/filter_urls
  ```

  After this step, you can run the script using the command `filter_urls`.

- Verify Installation: Ensure the script is installed and ready for use:

  ```bash
  filter_urls --help
  ```
If any required tools are missing, you can install them using your system's package manager:
- For Debian/Ubuntu:

  ```bash
  sudo apt update
  sudo apt install grep sed coreutils
  ```

- For Red Hat/Fedora:

  ```bash
  sudo dnf install grep sed coreutils
  ```

- For macOS (using Homebrew):

  ```bash
  brew install grep gnu-sed
  ```
```bash
filter_urls [-i input_file] [-o output_file] [-w whitelist] [-b blacklist] [-f filters]
```
| Option | Description |
|---|---|
| `-i, --input` | Input file containing URLs (required). |
| `-o, --output` | Output file to save filtered URLs (optional). |
| `-w, --whitelist` | Whitelist specific extensions (comma-separated, e.g., `php,html,asp`). |
| `-b, --blacklist` | Blacklist specific extensions (comma-separated, e.g., `jpg,png,css`). |
| `-f, --filters` | Apply filters (comma-separated, e.g., `hasparams,noparams,hasext,noext,keepcontent,keepslash`). |
- `hasparams`: Includes only URLs with query parameters (e.g., `?id=123`).
- `noparams`: Includes only URLs without query parameters.
- `hasext`: Includes only URLs with file extensions (e.g., `.html`, `.php`).
- `noext`: Includes only URLs without file extensions (e.g., `/page`).
- `keepcontent`: Retains blog-like content (e.g., `/blog`, `/posts`, `/articles`).
- `keepslash`: Preserves trailing slashes in URLs.
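To illustrate what these filters select, the core of `hasparams` and `noext` can be approximated with plain `grep`. This is a sketch of the idea only, not the script's actual implementation, and the variable names are illustrative:

```bash
# Sample URLs, in the same shape as a urls.txt input file.
urls='http://example.com/page.php?id=123
http://example.com/about.html
http://example.com/blog/why-cats-rule'

# hasparams: keep only URLs that contain a query string.
hasparams=$(printf '%s\n' "$urls" | grep -F '?')

# noext: drop URLs whose last path segment ends in a file extension,
# optionally followed by a query string.
noext=$(printf '%s\n' "$urls" | grep -Ev '\.[a-zA-Z0-9]+(\?.*)?$')

echo "$hasparams"   # http://example.com/page.php?id=123
echo "$noext"       # http://example.com/blog/why-cats-rule
```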
```bash
filter_urls -i urls.txt -f hasparams,noext
filter_urls -i urls.txt -o filtered_urls.txt -f hasext
filter_urls -i urls.txt -w php,html
filter_urls -i urls.txt -b jpg,png,css
filter_urls -i urls.txt -o filtered_urls.txt -f hasparams,keepslash -b jpg,png
```
The script will exit with an error if:
- The input file is not specified.
- The input file does not exist.
- If an output file is specified (`-o`), the filtered URLs are saved to that file.
- If no output file is specified, the filtered URLs are displayed in the terminal.
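The error handling described above can be sketched as a small validation helper. This is hypothetical: the function name `check_input` and the messages are my own, not taken from the script:

```bash
# Validate the input file the way the script's error handling describes:
# fail if no input file was given, or if the given file does not exist.
check_input() {
  if [ -z "$1" ]; then
    echo "Error: input file is not specified" >&2
    return 1
  fi
  if [ ! -f "$1" ]; then
    echo "Error: input file '$1' does not exist" >&2
    return 1
  fi
}

# Usage: check_input "$input_file" || exit 1
```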
Given a sample `urls.txt`:

```
http://example.com/page.php?id=123
http://example.com/page.php?id=456
http://example.com/blog/why-cats-rule
http://example.com/about.html
http://example.com/page/
http://example.com/assets/image.jpg
```

running:

```bash
filter_urls -i urls.txt -b jpg,png -f noext,keepslash
```

produces:

```
http://example.com/page/
http://example.com/blog/why-cats-rule
```
Feel free to customize and extend this script to suit your specific URL filtering needs!