Automated Data Collection with LLM, Google Custom Search, Video Transcription, Web Scraping, Instagram API, OpenAI Vision, and OpenAI Whisper.
This repo has turned into a playground for me to experiment with OpenAI LLMs.
This Python project automates data collection on user-provided topics.
- Query Generation: LLM creates relevant search queries.
- Data Collection: Uses NewsAPI, YouTube, and Google Search to collect data on specific topics.
- Data Scraping: Uses Playwright and BeautifulSoup to fetch and extract web content. Uses MoviePy to extract audio from videos. Uses the OpenAI Whisper model to transcribe the audio.
- Summarization: LLM summarizes scraped data into concise reports.
- End-to-End Automation: Fully automated from input to summarized output.
- Instagram Integration:
  - Read a user's Instagram.
  - Process pictures using OpenAI vision.
  - Create an object-weighted graph based on object occurrences in the pictures.
  - Summarize picture settings and subjects.
  - Summarize the user's Instagram in total.
- Chat Functionality:
  - Chat with the LLM.
  - Build a character from Instagram images.
  - Chat with the Instagram character.
- Chat Assistant Timeline Builder:
  - Timeline builder assistant.
  - Saves, updates, suggests, and finds milestones or memories for a person's life.
  - Uses GPT functions to add, update, delete, and find user milestones in the milestone store.
  - Allows users to upload images to associate with milestones.
  - Provides output for visualizing on a timeline in the front end.
- Python, OpenAI API, Google Custom Search API
- Playwright, BeautifulSoup, MoviePy, NewsAPI
- InstagramAPI, NetworkX
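
The object-weighted graph from the Instagram features can be read in more than one way. A minimal sketch of one reading, assuming nodes are detected objects and an edge's weight is the number of pictures in which both objects appear. The function name and the sample detections are hypothetical, and plain `Counter` objects are used here instead of NetworkX so the example has no dependencies:

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(pictures):
    """Return (node_weights, edge_weights) for objects detected per picture.

    `pictures` is a list of object-label lists, one list per picture.
    """
    node_weights = Counter()
    edge_weights = Counter()
    for labels in pictures:
        unique = sorted(set(labels))  # count each object once per picture
        node_weights.update(unique)
        # Every unordered pair seen in the same picture strengthens an edge.
        edge_weights.update(combinations(unique, 2))
    return node_weights, edge_weights


# Hypothetical detections, standing in for OpenAI vision output.
pictures = [
    ["dog", "beach", "ball"],
    ["dog", "beach"],
    ["coffee", "laptop"],
]
nodes, edges = build_cooccurrence_graph(pictures)
print(nodes.most_common(2))
print(edges.most_common(1))
```

If NetworkX is used for the real graph, the edge counts could be passed on as `(u, v, weight)` triples to `Graph.add_weighted_edges_from`.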
- Clone the repo:

      git clone https://github.com/your-username/your-repo-name.git
      cd your-repo-name

- Install dependencies:

      pip install -r requirements.txt

- Set up API keys: create a .env file with:

      OPENAI_API_KEY=your_openai_api_key
      GOOGLE_CUSTOM_SEARCH_API_KEY=your_google_custom_search_api_key
      GOOGLE_CX=your_google_cse_id
      NEWS_API_KEY=newsapi_key
      INSTAGRAM_CLIENT_ID=instagram_client_id
      INSTAGRAM_CLIENT_SECRET=instagram_client_secret
      INSTAGRAM_REDIRECT_URI=instagram_redirect_uri
      INSTAGRAM_ACCESS_CODE=instagram_access_code

- Run the script (currently only the "news" source is supported):

      python src/researcher/main.py --topic "Your topic here" --source "[news, google, youtube, all]"

- Check the results folder: find the result JSON file in the results folder.
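
Before running the script, it can help to confirm that the keys from the API key step are actually visible to Python. A minimal sketch, assuming the .env file has already been loaded into the environment (for example by a loader such as python-dotenv, which is an assumption about this project). Which keys count as required is also an assumption, and `missing_keys` is a hypothetical helper:

```python
import os

# Assumption: these four keys are needed for the news/Google/YouTube flow.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "GOOGLE_CUSTOM_SEARCH_API_KEY",
    "GOOGLE_CX",
    "NEWS_API_KEY",
]


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with a plain dict instead of the real environment.
example_env = {"OPENAI_API_KEY": "sk-test", "GOOGLE_CX": ""}
print(missing_keys(example_env))
```

Calling `missing_keys()` with no arguments checks `os.environ` instead of the example dict.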
- Input a topic.
- LLM generates search queries.
- Data source retrieves results.
- MoviePy extracts audio from videos for Whisper to transcribe, or Playwright and BeautifulSoup scrape web pages.
- LLM summarizes the scraped content.
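
The five steps above can be sketched as one function that chains them. This is an illustration, not the project's actual code: `run_pipeline` and its four callables are hypothetical stand-ins for the LLM, the data source, and the scraper or transcriber, and the lambdas below are stubs so the example runs without any API keys:

```python
from typing import Callable, Iterable


def run_pipeline(
    topic: str,
    generate_queries: Callable[[str], Iterable[str]],
    search: Callable[[str], Iterable[str]],
    extract_text: Callable[[str], str],
    summarize: Callable[[str], str],
) -> dict:
    """Chain the workflow steps; every callable is a hypothetical stand-in."""
    documents = []
    for query in generate_queries(topic):      # LLM generates search queries
        for url in search(query):              # data source retrieves results
            text = extract_text(url)           # transcribe video or scrape page
            if text:                           # skip sources that yield nothing
                documents.append({"query": query, "url": url, "text": text})
    combined = "\n\n".join(doc["text"] for doc in documents)
    return {
        "topic": topic,
        "sources": [doc["url"] for doc in documents],
        "summary": summarize(combined),        # LLM summarizes scraped content
    }


# Stubbed run: no network access, no API calls.
result = run_pipeline(
    "solar power",
    generate_queries=lambda topic: [f"{topic} news", f"{topic} research"],
    search=lambda query: [f"https://example.com/{query.replace(' ', '-')}"],
    extract_text=lambda url: f"Text from {url}",
    summarize=lambda text: text[:40],
)
print(result["sources"])
```

Passing the steps in as callables keeps each one replaceable, which also fits the to-do item about a more abstract data gatherer.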
- Support for more search engines.
- Advanced query generation and filtering.
- Customizable summarization options.
- Add UI for visualizing results
- Add subscription features for users to get updates on a schedule
- Web page scraping in web_page_reader.py
- More abstract data_gatherer.py class
- Chunking in OpenAI requests for large text, especially in video summarization
- Improve performance by reducing run time
- Add SerpApi
- Reduce the number of LLM calls and increase the relevancy of summary data
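
For the chunking item above, one possible approach is to split long text into overlapping pieces, summarize each piece, and then summarize the partial summaries. A minimal sketch, assuming a character budget is an acceptable proxy for the model's token limit. `chunk_text` and `summarize_long` are hypothetical helpers, and `summarize` stands in for the real LLM call:

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into pieces of at most max_chars, sharing `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so context carries across chunks
    return chunks


def summarize_long(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 8000,
    overlap: int = 200,
) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    parts = [summarize(chunk) for chunk in chunk_text(text, max_chars, overlap)]
    if len(parts) <= 1:
        return parts[0] if parts else ""
    return summarize("\n".join(parts))


# Tiny example with a stub summarizer that keeps the first two characters.
print(chunk_text("abcdefghij", max_chars=4, overlap=1))
print(summarize_long("abcdefghij", lambda t: t[:2], max_chars=4, overlap=1))
```

This costs one extra LLM call per chunk, so it trades against the goal of reducing LLM calls. For very long transcripts the joined partial summaries may themselves need another round of chunking.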