websitenearme

Chatbots built with LangChain and Pinecone to answer business questions.

Step 1. Set up a virtual environment and install the dependencies:

    mkdir websitenearme
    cd websitenearme
    python3 -m venv ./webnearme-venv
    source ./webnearme-venv/bin/activate
    git clone https://github.com/data-science-nerds/websitenearme.git
    pip install -r websitenearme/requirements.txt

Step 2. Export the data from the WordPress site using Tools > Export.

Step 3. Convert the raw XML data to text using scrape_website.py.
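As a rough illustration of what this conversion involves, here is a minimal sketch that pulls the title and body out of each item in a WordPress export. It is not the code in scrape_website.py, whose contents are not shown here. The function name `wxr_to_text` and the sample document are hypothetical, and the sketch assumes the export follows the usual layout where each post is an `<item>` with a `<content:encoded>` body.

```python
import xml.etree.ElementTree as ET

# Namespace that WordPress export files use for the post body element.
CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Hypothetical, heavily trimmed export used only for illustration.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>About us</title>
<content:encoded><![CDATA[<p>We build chatbots.</p>]]></content:encoded>
</item>
</channel>
</rss>"""


def wxr_to_text(xml_string: str) -> str:
    """Return the title and body of every <item>, one block per item."""
    root = ET.fromstring(xml_string)
    blocks = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        body = item.findtext(
            "content:encoded", default="", namespaces=CONTENT_NS
        ).strip()
        if title or body:
            blocks.append(f"{title}\n{body}".strip())
    # Separate items with a blank line so later steps can split on it.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(wxr_to_text(SAMPLE))
```

The body still contains HTML and WordPress block comments at this point, which is why a separate cleaning step follows.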

Step 4. Clean the XML data using data_cleanser.py.

The script reads the XML file, then removes every section enclosed by comments that contain any of the keywords in WP_TERMS. Next, it drops every line that contains any of the stop words in STOP_WORDS. Finally, it saves the cleaned and filtered content to a new text file whose name starts with the prefix clean_.
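The cleaning steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's cleaning script: the values in `WP_TERMS` and `STOP_WORDS` are placeholders, the function names are hypothetical, sections are assumed to be delimited by WordPress block comments of the form `<!-- wp:name -->` and `<!-- /wp:name -->`, and stop words are matched case-insensitively.

```python
import re
from pathlib import Path

# Placeholder values; the real lists are not shown in this README.
WP_TERMS = ["wp:image", "wp:gallery"]
STOP_WORDS = ["cookie", "subscribe"]


def remove_wp_sections(text: str, terms=WP_TERMS) -> str:
    """Delete each section wrapped in <!-- term ... --> and <!-- /term -->."""
    for term in terms:
        escaped = re.escape(term)
        pattern = re.compile(
            rf"<!--\s*{escaped}\b.*?-->.*?<!--\s*/{escaped}\s*-->",
            flags=re.DOTALL,  # let the section span several lines
        )
        text = pattern.sub("", text)
    return text


def drop_stop_word_lines(text: str, stop_words=STOP_WORDS) -> str:
    """Keep only lines that contain none of the stop words (case-insensitive)."""
    lowered = [word.lower() for word in stop_words]
    kept = [
        line
        for line in text.splitlines()
        if not any(word in line.lower() for word in lowered)
    ]
    return "\n".join(kept)


def clean_file(path: str) -> Path:
    """Clean one file and write the result next to it as clean_<name>.txt."""
    source = Path(path)
    raw = source.read_text(encoding="utf-8")
    cleaned = drop_stop_word_lines(remove_wp_sections(raw))
    target = source.with_name(f"clean_{source.stem}.txt")
    target.write_text(cleaned, encoding="utf-8")
    return target
```

Removing the comment-delimited sections before filtering by line matters here: a section can span many lines, and filtering first could delete one of its delimiting comments and leave the rest of the section behind.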
