Skip to content
This repository has been archived by the owner on Mar 1, 2024. It is now read-only.
/ messy-excel Public archive

Pandas Wrapper for splitting read_excel resulting DataFrame by Extra Spaces

License

Notifications You must be signed in to change notification settings

helloryosuke/messy-excel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

messy-excel

Pandas Wrapper for splitting read_excel resulting DataFrame by Extra Spaces

Purpose

We noticed that especially in Japan, the format of Excel files are very messy. Multiple tables are displayed in one sheet and thus makes it hard for pandas read_excel function to detect the multiple tables. The read_html function does this perfectly well, as the <table> tags can tell us from and to where the data exists in one table. This wrapper library is an attempt to make the post-load process of read_excel easier by splitting the dataframes into multiple by detecting extra spaces.

sample_xl

Disclaimer

This library is currently made in very simple logic. It is not meant to be used in production level. An extra row or columns that consists of no data, will be detected as an extra space/splitter and will create new dataframes based on it.

Any contributions to this project is welcome!

Get Started

import the MessyExcel class and specify the excel file path as you would in read_excel. Currently the supported parameter is sheet_name.

from messy_excel import MessyExcel

# load data
me = MessyExcel('test_file.xlsx')
# datalist property will contains list of DataFrames
print(me.datalist[0])

About

Pandas Wrapper for splitting read_excel resulting DataFrame by Extra Spaces

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages