A TypeScript library for scraping YouTube's autocomplete suggestions with intelligent deduplication.
- Scrapes YouTube's autocomplete API to get search suggestions
- Uses pglite for efficient similarity filtering
- Removes near-duplicate suggestions using trigram similarity
- Configurable similarity threshold
- TypeScript support
- Ready to deploy on the Apify platform
git clone https://github.com/yourusername/youtube-autocomplete-scraper.git
cd youtube-autocomplete-scraper
pnpm install
There are two ways to use this scraper:
Run the scraper locally by setting the required environment variables and running pnpm start:
# Set your input
export INPUT='{"query": "how to make"}'
# Run the scraper
pnpm start
The scraper will output results to the console and save them in the apify_storage directory.
This scraper is designed to run on the Apify platform. To deploy:
- Push this code to your Apify actor
- Set the input JSON in the Apify console:
{
  "query": "how to make",
  "similarityThreshold": 0.7,
  "maxResults": 100,
  "language": "en",
  "region": "US"
}
Under the hood, this scraper does a few key things:
- API Querying: Makes requests to YouTube's internal autocomplete API endpoint to get raw suggestions
- Deduplication: Uses pglite (a lightweight WASM build of Postgres) to filter out near-duplicate results:
  - Converts suggestions to trigrams (3-character sequences)
  - Calculates similarity scores between suggestions using trigram matching
  - Filters out suggestions that are too similar based on a configurable threshold
  - For example, "how to cook pasta" and "how to cook noodles" would likely be kept as distinct suggestions, while "how to make pancake" and "how to make pancakes" would be filtered as duplicates
- Result Processing: Cleans and normalizes the suggestions before returning them
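To make the deduplication step more concrete, here is a minimal sketch of trigram-based filtering in plain TypeScript. It is not the scraper's actual code, which delegates this work to pglite. The function names (trigrams, similarity, dedupe) are hypothetical, and the padding and word-splitting rules only approximate what Postgres trigram matching does (this version handles ASCII letters and digits only).

```typescript
// Build the set of trigrams for a string. Assumption: mimic Postgres-style
// trigram rules by lowercasing, splitting on non-alphanumeric characters,
// and padding each word with two leading spaces and one trailing space.
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

// Similarity = shared trigrams / total distinct trigrams (a value in 0..1).
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Keep a suggestion only if it is not too similar to one already kept.
// The 0.7 default is illustrative, taken from the example input.
function dedupe(suggestions: string[], threshold: number = 0.7): string[] {
  const kept: string[] = [];
  for (const candidate of suggestions) {
    const isDuplicate = kept.some(
      (existing) => similarity(existing, candidate) >= threshold
    );
    if (!isDuplicate) kept.push(candidate);
  }
  return kept;
}

const unique = dedupe([
  "how to make pancake",
  "how to make pancakes",
  "how to cook pasta",
  "how to cook noodles",
]);
console.log(unique);
```

Under these rules the two pancake suggestions score 0.9 and collapse into one, while the pasta and noodles suggestions score below 0.5 and are both kept. Note that this greedy approach keeps whichever suggestion appears first, so input order affects the result.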
The scraper accepts the following input parameters:
interface Input {
  query: string                 // The search query to get suggestions for
  similarityThreshold?: number  // How similar suggestions need to be to be considered duplicates (0-1)
  maxResults?: number           // Maximum number of suggestions to return
  language?: string             // Language code for suggestions
  region?: string               // Region code for suggestions
}
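As an illustration of how such input might be validated before use, the sketch below parses a raw JSON string (for example, the value of the INPUT environment variable when running locally) and fills in the optional fields. The parseInput helper is hypothetical, and the fallback values are simply copied from the example input above; the scraper's real defaults are not documented here and may differ.

```typescript
interface Input {
  query: string;
  similarityThreshold?: number;
  maxResults?: number;
  language?: string;
  region?: string;
}

// Hypothetical helper: parse raw JSON and apply illustrative fallbacks.
// The fallback values mirror the example input, not confirmed defaults.
function parseInput(raw: string): Required<Input> {
  const parsed = JSON.parse(raw) as Partial<Input>;

  if (typeof parsed.query !== "string" || parsed.query.trim() === "") {
    throw new Error("Input must include a non-empty 'query' string");
  }

  const similarityThreshold = parsed.similarityThreshold ?? 0.7;
  if (
    typeof similarityThreshold !== "number" ||
    similarityThreshold < 0 ||
    similarityThreshold > 1
  ) {
    throw new Error("'similarityThreshold' must be a number between 0 and 1");
  }

  const maxResults = parsed.maxResults ?? 100;
  if (!Number.isInteger(maxResults) || maxResults <= 0) {
    throw new Error("'maxResults' must be a positive integer");
  }

  return {
    query: parsed.query.trim(),
    similarityThreshold,
    maxResults,
    language: parsed.language ?? "en",
    region: parsed.region ?? "US",
  };
}

const input = parseInput('{"query": "how to make"}');
console.log(input);
```

Validating the threshold range up front means a typo such as 7 instead of 0.7 fails immediately with a clear message instead of silently disabling deduplication.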
The scraper outputs an array of unique autocomplete suggestions. Results are saved to the default dataset in Apify storage and can be accessed via the Apify API or console.
Contributions are welcome! Please feel free to submit a Pull Request.
License: MIT