messy-excel

Pandas Wrapper for splitting read_excel resulting DataFrame by Extra Spaces

Purpose

We noticed that especially in Japan, the format of Excel files are very messy. Multiple tables are displayed in one sheet and thus makes it hard for pandas read_excel function to detect the multiple tables. The read_html function does this perfectly well, as the <table> tags can tell us from and to where the data exists in one table. This wrapper library is an attempt to make the post-load process of read_excel easier by splitting the dataframes into multiple by detecting extra spaces.

Disclaimer

This library is currently made in very simple logic. It is not meant to be used in production level. An extra row or columns that consists of no data, will be detected as an extra space/splitter and will create new dataframes based on it.

Any contributions to this project is welcome!

Get Started

import the MessyExcel class and specify the excel file path as you would in read_excel. Currently the supported parameter is sheet_name.

from messy_excel import MessyExcel

# load data
me = MessyExcel('test_file.xlsx')
# datalist property will contains list of DataFrames
print(me.datalist[0])

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
messy_excel		messy_excel
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
test.py		test.py
test_file.xlsx		test_file.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

messy-excel

Purpose

Disclaimer

Get Started

About

Releases

Packages

Languages

License

helloryosuke/messy-excel

Folders and files

Latest commit

History

Repository files navigation

messy-excel

Purpose

Disclaimer

Get Started

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages