pdfplumber: extract text from multiple PDF pages #740
fpapso
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
Hi @fpapso, in your code you are only processing the first page (first_page = pdf.pages[0]). To process every page, loop over pdf.pages instead:
for page in pdf.pages:
    text = page.extract_text()
    # your code
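To make the loop above concrete for the "all pages into one CSV" case, here is a minimal sketch rather than a definitive implementation. The helper names collect_rows and pdf_tables_to_csv are hypothetical, it reuses the text-based table settings from the question, and it assumes every page can be parsed without first cropping to a bounding box (the question crops page 1 to a box; for a real statement you may need to compute a box per page).

```python
import csv

# Same text-based strategies the question passes to extract_table().
TABLE_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def collect_rows(pages, table_settings=TABLE_SETTINGS):
    """Gather non-empty table rows from every page-like object in `pages`.

    `pages` can be `pdf.pages` from pdfplumber, or any iterable of objects
    that provide an `extract_table(table_settings=...)` method.
    """
    rows = []
    for page in pages:
        table = page.extract_table(table_settings=table_settings)
        if table is None:  # pdfplumber returns None when no table is found
            continue
        for row in table:
            # Keep the row only if at least one cell has content.
            if any(cell not in (None, "") for cell in row):
                rows.append(row)
    return rows


def pdf_tables_to_csv(pdf_path, csv_path):
    """Hypothetical helper: write the rows from all pages to one CSV file."""
    import pdfplumber  # imported here so collect_rows works without it

    with pdfplumber.open(pdf_path) as pdf:
        rows = collect_rows(pdf.pages)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Calling pdf_tables_to_csv('/home/dev/tools/pdftool/test.pdf', 'output.csv') would then write one file for the whole statement. Note that the column header rows will likely repeat once per page, so you may want to drop the duplicates afterwards.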
-
Hi all, I have built this small script to extract information from a PDF bank statement, and it works fine when the PDF has only one page. But when the same bank statement has 100 or 1000 pages, I only get the first page. Does anyone have an idea for the right code to keep loading pages and extract the data from all pages in one file into one CSV file?
import pandas as pd
import numpy as np
import pdfplumber
import csv

pdf_file = '/home/dev/tools/pdftool/test.pdf'

with pdfplumber.open(pdf_file) as pdf:
    first_page = pdf.pages[0]
    rows = first_page.extract_text().split('\n')
rows[:200]

with pdfplumber.open(pdf_file) as pdf:
    first_page = pdf.pages[0]
    rows = first_page.extract_words()
for row in rows:
    if row['text'] == 'Bogførings-':
        x0 = row['x0']
        top = row['top']
    if row['text'] == '23-12-2021':
        bottom = row['bottom']
    if row['text'] == 'Trans-':
        x1 = row['x1']
box = (x0, top, x1, bottom)
box

with pdfplumber.open(pdf_file) as pdf:
    first_page = pdf.pages[0]
    page = first_page.crop(bbox=(box))  # (x0, top, x1, bottom)
    table = page.extract_table(table_settings={
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    })
for row in table[:9]:
    print(row)
table = [row for row in table if ''.join([str(i) for i in row]) != '']
df = pd.DataFrame(table)
df.head()
-------------Output------------------
Bogførings- | Valørdato | Registrerings- | Posteringsbeløb | Posteringssaldo | Posteringstekst | Modpost | System- | Trans-
dato | | dato | | | | reference | kode | kode
03-08-2020 | 03-08-2020 | 03-08-2020 | - 200,00 | 300,00 | Gebyr | | 85 | 630
04-08-2020 | 04-08-2020 | 04-08-2020 | 20.000,00 | 20.300,00 | Ovf. Martine Visa | 90400768106 | 3 | 255
13-08-2020 | 13-08-2020 | 13-08-2020 | - 811,25 | 19.488,75 | Dankort-transaktion | | 31 | 76
df.to_csv('output.csv')