Commit
* committing scraper_pdf folder. remove ignore for now
* Yikang's scrape pdf
* edited router
* edited nougat
* readme
* added requirements
* added pix2text
* fix api question
* try new function
* fix compare question
* update pdf_scrape readme
* fixed failed and passed test
* update readme and passed api test
* modified api
* ignore pdf file while comparing
* ignore extra pdf while comparing
* remove api main func
* move funcs in pdf converter class
1 parent ea2b52e, commit 6a2875e
Showing 8 changed files with 141 additions and 15 deletions.
@@ -1,13 +1,14 @@
 # Scrape_pdf
-First we will need to convert the pdf into a markdown format. We will use a tool called nougat.
+First we will need to convert the pdf into a markdown format. We will use two tools called nougat and pix2text.
 - run `pip install nougat-ocr` to install nougat
-- Go to `nougat.py` and choose the pdf you want to convert and the name of the folder you want to save your documents in.
-```
-pdf_to_md('~/Downloads/MLS.pdf', 'textbook')
+- run `pip install pix2tex` to install pix2text
+- Go to `Scrape_pdf.py` and choose the pdf you want to convert and the name of the folder you want to save your documents in.
+- change the path in `Scrape_pdf.py` to your file path and run it
 ```
 - After you get your markdown folder, run `header.py` to segment the contents of the markdown file into headers and contents.
 ```
 # TODO
 parser = MarkdownParser('textbook/MLS.mmd')
 ```
-- After you have set up the variables you can run `python3 scrape.py` and it will start scraping the website.
+- After you have set up the variables you can run `python3 scrape.py` and it will start scraping the website.
+nougat runs faster on a computer with a GPU
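The README describes `header.py` as segmenting the converted markdown into headers and contents, but the diff does not show how `MarkdownParser` does this. The following is only a minimal sketch of that idea, not the repository's implementation: `split_by_headers` and `HEADER_RE` are hypothetical names, and the choice to ignore `#` lines inside fenced code blocks is an assumption.

```python
import re
from typing import List, Tuple

# Hypothetical helper, not the repository's MarkdownParser.
# Matches ATX-style headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown_text: str) -> List[Tuple[int, str, str]]:
    """Return (level, header, content) triples in document order."""
    sections = []
    # Text before the first header is collected under level 0 with an empty title.
    level, title, body = 0, "", []
    in_fence = False
    for line in markdown_text.splitlines():
        # Assumption: lines inside ``` fences are never treated as headers.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            # A new header closes the section collected so far.
            sections.append((level, title, "\n".join(body).strip()))
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    sections.append((level, title, "\n".join(body).strip()))
    # Drop the preamble entry if the document starts directly with a header.
    return [s for s in sections if s[1] or s[2]]
```

To try it on nougat output, one could read the file named in the README, for example `split_by_headers(Path('textbook/MLS.mmd').read_text(encoding='utf-8'))` with `Path` imported from `pathlib`, and iterate over the returned triples.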
Empty file.