Support PDF in GDrive #56

yuvalsteuer · 2023-03-24T21:59:51Z

Add a new parser, similar to the existing ones (e.g. docx, txt, html, pptx). See them here:
https://github.com/GerevAI/gerev/tree/main/app/parsers

Add the file type:
app/data_source_api/basic_document.py
Google Drive support: app/data_sources/google_drive.py
...

rishi003 · 2023-03-25T13:11:34Z

I will be happy to take this one

yuvalsteuer · 2023-03-25T17:19:18Z

Great! When are you expecting to finish this?

rishi003 · 2023-03-25T18:03:17Z

I am trying to set up the dev environment right now, but I am running into issues in the following environment:

OS: WSL(Windows 10)
Nvidia: No

The following line:

STORAGE_PATH = Path('/opt/storage/') if IS_IN_DOCKER else Path(f'/home/{os.getlogin()}/.gerev/storage/')

gives me the error below:

FileNotFoundError: [Errno 2] No such file or directory

After some research, I found out that os.getlogin() is the culprit.

If you cannot provide any help with this issue, can you describe a proper environment setup that will be suitable for development?

yuvalsteuer · 2023-03-25T20:38:34Z

Just hardcode any path that is valid in WSL.
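
As an alternative to hardcoding, here is a minimal sketch of how the path could be resolved without calling os.getlogin(), which can fail when the process has no controlling terminal (as often happens under WSL or in services). It relies on Path.home() instead. The function name resolve_storage_path and the GEREV_STORAGE_PATH environment variable are hypothetical and not part of the project; only the two paths come from the line quoted above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_storage_path(is_in_docker: bool,
                         env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a storage directory without relying on os.getlogin()."""
    env = os.environ if env is None else env

    # Inside Docker, keep the fixed path from the original line.
    if is_in_docker:
        return Path('/opt/storage/')

    # Hypothetical override (not an existing project setting): lets a
    # developer point storage at any valid path, e.g. under WSL.
    override = env.get('GEREV_STORAGE_PATH')
    if override:
        return Path(override)

    # Path.home() resolves the home directory from the environment or the
    # user database, so it does not need a controlling terminal.
    return Path.home() / '.gerev' / 'storage'


STORAGE_PATH = resolve_storage_path(is_in_docker=False)
```

Note that Path.home() is not always identical to /home/<login name> (for example for root), so this changes the default location slightly on some systems.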

rishi003 · 2023-03-26T19:01:12Z

Currently, it is possible to parse the entire content of PDF files as text, but as is apparent from your parsers, the program needs it compiled into the following form:

Some title: related text
Some other title: related text

Am I right?

There is already a pull request that parses the entire PDF document as text.

If you have any enhancements or suggestions for that, I'll be more than willing to implement them.

Meanwhile, I am also researching how I can parse PDFs while keeping the hierarchical information intact.

Roey7 · 2023-03-26T19:36:09Z

Hey!
Just like I commented on that other PR, it should be pdf -> html,
then we parse html -> text.
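
A rough sketch of what the html -> text step could look like, producing the "Some title: related text" form discussed above. This is only an illustration using Python's standard html.parser, not the project's existing HTML parser. It assumes the pdf -> html step (not shown) emits real heading tags (h1 to h6), which many PDF converters do not. The names HeadingTextParser and html_to_titled_text are made up for this example.

```python
from html.parser import HTMLParser
from typing import List, Tuple

HEADING_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeadingTextParser(HTMLParser):
    """Collects (title, body fragments) pairs, one per heading."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: List[Tuple[str, List[str]]] = []
        self._in_heading = False
        self._heading_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            title = ' '.join(self._heading_parts).strip()
            self.sections.append((title, []))

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading:
            self._heading_parts.append(text)
        elif self.sections:
            # Attach text to the most recent heading.
            self.sections[-1][1].append(text)
        else:
            # Text before any heading is kept under an empty title.
            self.sections.append(('', [text]))


def html_to_titled_text(html: str) -> str:
    parser = HeadingTextParser()
    parser.feed(html)
    parser.close()
    lines = []
    for title, body in parser.sections:
        joined = ' '.join(body)
        lines.append(f'{title}: {joined}' if title else joined)
    return '\n'.join(lines)


sample = '<h1>Intro</h1><p>Hello <b>world</b></p><h2>Usage</h2><p>Run it.</p>'
print(html_to_titled_text(sample))
```

The sketch flattens all heading levels into one list and does not skip script or style content, so a real parser would need more care on both points.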

Roey7 · 2023-03-26T20:14:50Z

@rishi003 let's chat on discord! I could guide you a little bit :)

rishi003 · 2023-03-27T04:55:17Z

Sure, shall we discuss it on the Discord thread?

yuvalsteuer added the "good first issue" and "feature" labels · Mar 24, 2023
