PDF-Pilot is an AI-powered web application that lets users upload PDFs, ask questions about their content, and receive answers with the relevant text highlighted in the PDF.
You can quickly find answers to your questions within large PDF documents, without having to read through the entire content.
- Upload a PDF file
- Ask questions related to the PDF content
- Get answers to your questions
- View the relevant text highlighted in the PDF
- Frontend by Elite-AI-August
- Backend by Elite-AI-August
Follow these steps to run the web application on your local machine.
- Python 3.7+
- Node.js 12+
- Yarn or npm
- Clone the repository
git clone https://github.com/Elite-AI-August/PDF-Pilot
- Change the directory
cd PDF-Pilot
- Set up a virtual environment and install the Python dependencies (use python3 and pip3 if python and pip do not point to Python 3 on your system)
python -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
- Install the JavaScript dependencies
cd PDF-Pilot
npm install
- Set up API keys
a. Sign up for an AI21 Studio account to obtain an API key.
b. Sign up for an OpenAI account to obtain an API key.
c. Create a .env file in the root directory of the project and add your API keys:
AI21_API_KEY=your_ai21_api_key
OPENAI_API_KEY=your_openai_api_key
- Start the Flask server in the src directory of the project (python or python3)
cd src
python3 server.py
- Start the React development server in the root directory:
cd ..
npm start
- Open a browser and navigate to http://localhost:3000. You should see the PDF-Pilot web application.
- Upload a PDF, enter a question, and click "Submit" to see the AI-generated answer and the relevant text highlighted in the PDF.
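As a hedged illustration of the API-key step above, the sketch below reads KEY=value pairs from a .env-style file into os.environ using only the standard library. The helper name load_env_file is hypothetical, and the project itself may load the keys differently (for example through a package such as python-dotenv).

```python
import os
import tempfile


def load_env_file(path):
    """Hypothetical helper: copy KEY=value lines from a .env-style file into os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded


# Demo with a temporary file and placeholder values (not real keys).
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("AI21_API_KEY=your_ai21_api_key\nOPENAI_API_KEY=your_openai_api_key\n")
env_values = load_env_file(tmp.name)
os.remove(tmp.name)
```

After this runs, os.environ["AI21_API_KEY"] is available to code such as the segmentation request shown later in this guide.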
- Navigate to PDF-Pilot/PDF-Pilot_v1/src
- Edit the main() function in the HandoutAssistant.py script to provide the path to your input PDF file and the desired output PDF file. Also, enter the question you want the Handout Assistant to answer.
pdf_path = "/path/to/your/input.pdf"
output_pdf = "/path/to/your/output.pdf"
question = "Your question here"
- Run the script
python HandoutAssistant.py
- The answer, relevant text, and page number will be displayed in the console. The highlighted PDF will be saved to the specified output path.
- PDF text extraction and AI21 context segmentation
- Context-aware question answering with OpenAI's GPT-3
- FAISS-based efficient similarity search
- Automatic highlighting of relevant text in the PDF
- PDF text extraction: HandoutAssistant.py uses the PyMuPDF library (fitz) to extract text from the PDF document. The text is then stored in a data structure with the corresponding page numbers.
- Text segmentation: The extracted text is segmented into meaningful chunks using AI21 Studio's text segmentation API. These segments are assigned unique IDs and linked to their respective page numbers.
- Building the FAISS index: HandoutAssistant creates a FAISS index using the segmented text and OpenAI embeddings. This index is used to search for relevant text segments efficiently. (FAISS is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors.)
- Question answering: When a user asks a question, HandoutAssistant retrieves the most relevant text segments from the FAISS index. It then generates a prompt for OpenAI's GPT-3.5 engine, which uses the provided information to answer the question.
- Highlighting and page number identification: Once the answer is generated, HandoutAssistant identifies the page number and relevant text segment in the PDF. The PyMuPDF library is then used to highlight the identified text segment in the output PDF file.
Flowchart
Explained by Google Bard:
The program first uses the pdf_to_text() function to extract the text from the PDF file. This text is then segmented into smaller pieces using the segment_text() function. The assign_page_numbers_and_ids_to_segments() function then assigns page numbers and IDs to each segment. The build_faiss_index() function creates a vector store that can be used to search for relevant segments. The get_relevant_segments() function uses the vector store to search for the most relevant segments for a given question. The generate_prompt() function generates a prompt that can be used to ask an AI Q&A bot for an answer to a question. The process_pdf_and_get_answer() function uses the get_relevant_segments() and generate_prompt() functions to answer a question about the content of a PDF file. The main() function provides an example of how to use the process_pdf_and_get_answer() function.
Here is a more detailed explanation of each function:
- pdf_to_text(): Uses the fitz library (PyMuPDF) to extract the text from a PDF file. It returns the full text as a string, together with a list of per-page texts and their page numbers.
- segment_text(): Sends the extracted text to AI21 Studio's segmentation API, which splits it into smaller pieces. The segments are returned as a list of dictionaries, each containing the text of one segment.
- assign_page_numbers_and_ids_to_segments(): Assigns a page number and an ID to each segment in the list. The page number is that of the PDF page whose words overlap most with the segment. The IDs are assigned sequentially, starting with 1.
- build_faiss_index(): Creates a vector store that can be used to search for relevant segments. The vector store is created using the faiss library. The text from each segment is converted to a vector and added to the vector store.
- get_relevant_segments(): Uses the vector store to search for the most relevant segments for a given question. The question is first converted to a vector, which is then used to search the vector store for the most similar segments. These are returned as a list.
- generate_prompt(): Generates a prompt that can be used to ask an AI Q&A bot for an answer to a question. The prompt combines the question with the IDs and text of the most relevant segments.
- process_pdf_and_get_answer(): Uses the get_relevant_segments() and generate_prompt() functions to answer a question about the content of a PDF file. It first gets the most relevant segments for the question, then generates a prompt from the question and those segments, sends the prompt to the AI Q&A bot, and returns the bot's answer.
- main(): Provides an example of how to use the process_pdf_and_get_answer() function. It takes the path to a PDF file and the question you want answered, calls process_pdf_and_get_answer() to get the answer, and prints the answer to the console.
This guide provides a step-by-step explanation of the HandoutAssistant.py code and how it processes the PDF file to answer user questions.
HandoutAssistant.py uses the PyMuPDF library (fitz) to extract text from the PDF document. The PDFHandler class has a static method called pdf_to_text that takes a PDF path as input and extracts the text and page number for each page.
import fitz  # PyMuPDF

class PDFHandler:
    @staticmethod
    def pdf_to_text(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        page_texts = []
        for page in doc:
            page_text = page.get_text("text")
            text += page_text
            # page.number is zero-based
            page_texts.append({"text": page_text, "page_number": page.number})
        return text, page_texts
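To make the return value concrete, here is a small stand-in that mirrors the loop above without requiring PyMuPDF. The FakePage class, the pages_to_text name, and the sample strings are invented for illustration; the sketch assumes only that a page exposes get_text and a zero-based number, as the code above does.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Invented stand-in for a PDF page: just a number and some text."""
    number: int
    content: str

    def get_text(self, kind="text"):
        return self.content


def pages_to_text(pages):
    # Same accumulation as pdf_to_text, applied to any iterable of page-like objects.
    text = ""
    page_texts = []
    for page in pages:
        page_text = page.get_text("text")
        text += page_text
        page_texts.append({"text": page_text, "page_number": page.number})
    return text, page_texts


full_text, page_texts = pages_to_text([FakePage(0, "Intro.\n"), FakePage(1, "Details.\n")])
print(page_texts)
```

The first element of the tuple is the concatenated text passed on to segmentation; the second keeps the per-page text needed later to map segments back to pages.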
The extracted text is segmented into meaningful chunks using AI21 Studio's text segmentation API. The AI21Segmentation class has a static method called segment_text that takes the extracted text and returns a list of segments.
import os
import requests

class AI21Segmentation:
    @staticmethod
    def segment_text(text):
        url = "https://api.ai21.com/studio/v1/segmentation"
        payload = {
            "sourceType": "TEXT",
            "source": text
        }
        headers = {
            "accept": "application/json",
            "content-type": "application/json",
            "Authorization": f"Bearer {os.environ['AI21_API_KEY']}"
        }
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code == 200:
            json_response = response.json()
            return json_response.get("segments")
        else:
            # Callers must handle the None returned on failure.
            print(f"An error occurred: {response.status_code}")
            return None
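Because segment_text returns None when the request fails, downstream code needs to guard against that. The following is a minimal sketch of such a guard, not part of the project: the helper name extract_segments and the sample payload are made up, and only the segments and segmentText keys are taken from the code in this guide.

```python
def extract_segments(json_response):
    """Hypothetical helper: return segments with non-empty text, or [] if the call failed."""
    if not json_response:
        return []
    segments = json_response.get("segments") or []
    # Drop segments whose text is missing or only whitespace.
    return [s for s in segments if s.get("segmentText", "").strip()]


# Invented sample in the shape the code above expects from the API.
sample_response = {
    "segments": [
        {"segmentText": "Photosynthesis converts light into chemical energy."},
        {"segmentText": "   "},
        {"segmentText": "Chlorophyll absorbs mostly red and blue light."},
    ]
}

usable_segments = extract_segments(sample_response)
```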
The assign_page_numbers_and_ids_to_segments method of the HandoutAssistant class assigns unique IDs and corresponding page numbers to each segment.
def assign_page_numbers_and_ids_to_segments(self, segmented_text, page_texts):
    for idx, segment in enumerate(segmented_text):
        segment_text = segment["segmentText"]
        segment["id"] = idx + 1  # IDs start at 1
        max_overlap = 0
        max_overlap_page_number = None
        # Pick the page that shares the most distinct words with the segment.
        for page_text in page_texts:
            overlap = len(set(segment_text.split()).intersection(set(page_text["text"].split())))
            if overlap > max_overlap:
                max_overlap = overlap
                max_overlap_page_number = page_text["page_number"]
        # Store a 1-based page number, or None if no page shares any word with the segment.
        segment["page_number"] = max_overlap_page_number + 1 if max_overlap_page_number is not None else None
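The word-overlap heuristic can be tried on toy data. The sketch below restates it as a standalone function; the name best_page_for_segment and the two sample pages are invented for illustration, and returning None when nothing overlaps is a choice made here rather than something the project documents.

```python
def best_page_for_segment(segment_text, page_texts):
    """Return the 1-based page whose word set overlaps most with the segment, or None."""
    segment_words = set(segment_text.split())
    best_overlap, best_page = 0, None
    for page in page_texts:
        overlap = len(segment_words & set(page["text"].split()))
        if overlap > best_overlap:
            best_overlap, best_page = overlap, page["page_number"]
    return None if best_page is None else best_page + 1


# Invented pages with zero-based page numbers, as produced by pdf_to_text.
pages = [
    {"text": "Cells are the basic unit of life.", "page_number": 0},
    {"text": "Mitochondria produce energy for the cell.", "page_number": 1},
]

print(best_page_for_segment("Mitochondria produce energy", pages))
```

Note that the comparison is on whitespace-separated tokens, so "cell." and "cell" count as different words; segments made mostly of common words can therefore match the wrong page.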
Handout Assistant creates a FAISS index using the segmented text and OpenAI embeddings. This index is used to search for relevant text segments efficiently.
def build_faiss_index(self, questions_data):
    # Convert questions_data to a list of Documents
    documents = [
        Document(
            page_content=q_data["segmentText"],
            metadata={"id": q_data["id"], "page_number": q_data["page_number"]},
        )
        for q_data in questions_data
    ]
    # Create the FAISS index (vector store) using the langchain.FAISS.from_documents() method
    vector_store = langchain.FAISS.from_documents(documents, self.embedder)
    return vector_store
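Conceptually, a vector index stores one embedding per segment and returns the segments whose vectors lie closest to the query vector. The sketch below shows that idea with brute-force cosine similarity in plain Python. It is a stand-in for illustration, not FAISS, and the three-dimensional vectors and the top_k helper are invented; real embeddings have far more dimensions, and the index may use a different distance measure.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [segment_id for segment_id, _ in ranked[:k]]


# Invented toy "embeddings" keyed by segment id.
toy_index = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0], 3: [0.9, 0.1, 0.0]}
nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

A brute-force scan like this compares the query against every stored vector; libraries such as FAISS exist to make that search efficient on large collections.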
The get_relevant_segments method of the HandoutAssistant class retrieves the most relevant text segments from the FAISS index based on the user's question.
def get_relevant_segments(self, questions_data, user_question, faiss_index):
    retriever = faiss_index.as_retriever()
    retriever.search_kwargs = {"k": 5}  # retrieve the 5 closest segments
    docs = retriever.get_relevant_documents(user_question)
    relevant_segments = []
    for doc in docs:
        segment_id = doc.metadata["id"]
        # Map the retrieved document back to its original segment record.
        segment = next((segment for segment in questions_data if segment["id"] == segment_id), None)
        if segment:
            relevant_segments.append({
                "id": segment["id"],
                "segment_text": segment["segmentText"],
                "score": doc.metadata.get("score", None),
                "page_number": segment["page_number"]
            })
            # print the score and the element ID
            print(f"Element ID: {segment['id']}, Score: {doc.metadata.get('score', None)}")
    # The metadata built in build_faiss_index holds only "id" and "page_number", so unless a
    # score is set elsewhere it is None and this sort leaves the retrieval order unchanged.
    relevant_segments.sort(key=lambda x: x["score"] if x["score"] is not None else float('-inf'), reverse=True)
    return relevant_segments
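The lookup inside the loop scans questions_data once per retrieved document. One possible alternative, shown here only as a sketch with made-up data, is to build an id-to-segment dictionary once and reuse it. The function name collect_segments is hypothetical, and the retriever is replaced by a plain list of ids so the example runs on its own.

```python
def collect_segments(questions_data, retrieved_ids):
    """Map retrieved segment ids back to their records, preserving retrieval order."""
    by_id = {segment["id"]: segment for segment in questions_data}
    results = []
    for segment_id in retrieved_ids:
        segment = by_id.get(segment_id)
        if segment is not None:
            results.append({
                "id": segment["id"],
                "segment_text": segment["segmentText"],
                "page_number": segment["page_number"],
            })
    return results


# Invented segment records in the shape used throughout this guide.
questions_data = [
    {"id": 1, "segmentText": "First segment.", "page_number": 1},
    {"id": 2, "segmentText": "Second segment.", "page_number": 1},
    {"id": 3, "segmentText": "Third segment.", "page_number": 2},
]

found = collect_segments(questions_data, retrieved_ids=[3, 1, 99])
```

Unknown ids such as 99 are skipped, which matches how the method above ignores documents it cannot map back to a segment.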
The generate_prompt method of the HandoutAssistant class constructs the OpenAI API prompt using the user's question and relevant text segments.
def generate_prompt(self, question, relevant_segments):
    prompt = f"""
You are an AI Q&A bot. You will be given a question and a list of relevant text segments with their IDs. Please provide an accurate and concise answer based on the information provided, or indicate if you cannot answer the question with the given information. Also, please include the ID of the segment that helped you the most in your answer by writing <ID: > followed by the ID number.
Question: {question}
Relevant Segments:"""
    # Append each segment as: <id>. "<text>"
    for segment in relevant_segments:
        prompt += f'\n{segment["id"]}. "{segment["segment_text"]}"'
        print(f"Relevant Element ID: {segment['id']}")  # Debug output: IDs included in the prompt
    return prompt
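To see what the assembled prompt looks like, the sketch below applies the same formatting to two invented segments. The build_prompt function is a simplified stand-in with shortened instructions, not the project's method, and the question and segment texts are made up.

```python
def build_prompt(question, relevant_segments):
    # Shortened instructions for illustration; the full wording is in generate_prompt.
    prompt = (
        "Answer the question using the segments below and cite the most helpful one "
        "as <ID: n>.\n"
        f"Question: {question}\n"
        "Relevant Segments:"
    )
    # One numbered, quoted line per segment, as in generate_prompt.
    for segment in relevant_segments:
        prompt += f'\n{segment["id"]}. "{segment["segment_text"]}"'
    return prompt


segments = [
    {"id": 4, "segment_text": "Water boils at 100 degrees Celsius at sea level."},
    {"id": 7, "segment_text": "Boiling points drop at higher altitudes."},
]
prompt = build_prompt("At what temperature does water boil?", segments)
print(prompt)
```

Each segment appears on its own line prefixed with its ID, which is what allows the model to refer back to a specific segment in its answer.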
The get_answer_and_id method of the OpenAIAPI class sends the prompt to the OpenAI API and returns the answer along with the ID of the most helpful text segment.
def get_answer_and_id(self, prompt):
    # Uses the legacy Completions interface of the pre-1.0 openai package.
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        temperature=0.5,
        max_tokens=2000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    answer_text = response.choices[0].text.strip()
    # Only the first line of the completion is kept as the answer.
    lines = answer_text.split('\n')
    answer = lines[0].strip()
    answer = answer.replace("Answer:", "").strip()
    try:
        # Extract the <ID: n> tag from the answer, then strip it from the text.
        segment_id = int(re.search(r'<ID: (\d+)>', answer).group(1))
        answer = re.sub(r'<ID: \d+>', '', answer).strip()
    except AttributeError:
        # re.search returned None: the model did not include an ID tag.
        segment_id = None
    return answer, segment_id
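The tag parsing can be exercised without calling the API. The sketch below is a hypothetical variant, not the project's code: unlike the method above, which inspects only the first line of the completion, it searches the whole text, so a tag placed on a later line is still found. The helper name split_answer_and_id and the sample completions are invented.

```python
import re

# Matches tags such as <ID: 4>, allowing optional whitespace after the colon.
ID_TAG = re.compile(r"<ID:\s*(\d+)>")


def split_answer_and_id(completion_text):
    """Hypothetical helper: return (answer, segment_id); segment_id is None without a tag."""
    text = completion_text.replace("Answer:", "").strip()
    match = ID_TAG.search(text)
    segment_id = int(match.group(1)) if match else None
    answer = ID_TAG.sub("", text).strip()
    return answer, segment_id


answer, segment_id = split_answer_and_id("Answer: Water boils at 100 degrees Celsius.\n<ID: 4>")
print(answer, segment_id)
```

Whether to accept a tag anywhere in the completion is a design choice; it is more forgiving of formatting differences but also keeps any extra lines the model produced as part of the answer.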
Finally, the main function of the script combines all these methods to process the PDF, find the relevant segments, generate a prompt, get the answer, and highlight the relevant segment in the PDF.
def main():
    pdf_path = "/path/to/your/handout.pdf"
    output_pdf = "/path/to/your/output.pdf"
    question = "Your question here"
    handout_assistant = HandoutAssistant()
    answer, segment_id, segment_text, page_number = handout_assistant.answer_question(pdf_path, question)
    print(f"Answer: {answer}")
    print(f"Segment ID: {segment_id}")
    print(f"Segment Text: {segment_text}")
    print(f"Page Number: {page_number}")
    if segment_id is not None:
        handout_assistant.highlight_relevant_segment(pdf_path, output_pdf, segment_id, segment_text, page_number)
    else:
        print("No relevant segment found.")

if __name__ == "__main__":
    main()
With this main function, you can run the script to process the input PDF, ask a question, and receive an answer based on the relevant segments. The script will also highlight the relevant segment in the PDF, and save it as a new file with the highlighted section.