This is an example of a module used to automate the production of Excel files. NHS England publishes a number of Excel files, and does so with a significant amount of formatting.
The repo is organised around three example projects, in progressing levels of complexity. The advanced
project is very close to the real publication here; the other two examples pare this project down to make it more simple, straightforward, and concise.
Note Being adapted from the code used to produce the Appointments in General Practice publications, the sample templates used in this repo work with publicly available historical data.
NHS England and NHS Digital have merged.
Repository owner: NHS England Data Science
Email: datascience@nhs.net
To contact us raise an issue on Github or via email.
README.md
templates
|-- easy_project
| |-- __init__.py
| |-- easy_project.py
| |-- easy_template.xlsx
|-- medium_project
| |-- __init__.py
| |-- medium_project.py
| |-- medium_template.xlsx
|-- advanced_project
| |-- __init__.py
| |-- advanced_project.py
| |-- advanced_template.xlsx
| |-- table1_cell_tags.py
| |-- table_1.py
data
|-- appointment_data.csv
|-- practices_data.csv
|-- table1_data.csv
outputs
|-- .gitkeep
|-- advanced_output.xlsx
|-- easy_output.xlsx
|-- medium_output.xlsx
main.py
config.py
requirements.txt
utils.py
The RAP Community of Practice resources can help you if you are unsure about any of the steps below, including cloning repositories, and virtual environments.
- Clone this repo to your preferred working destination, and point your terminal to the folder you've cloned.
- Next, create and activate a python virtual environment using your preferred tool.
- Next we need to install the packages which this project requires. In your terminal, enter
pip install -r requirements.txt
and run the command. If you run into any problems, check that your terminal is pointed to the right place, and that your virtual environment is activated.
Your installation should now be complete, and you should be able to run the project.
The basic logic of all three example projects is the same:
main.py
calls amake_excel
function in the relevant project's script.- This function loads in the template
.xlsx
file for the example project. - It then uses various functions from
excel_functions
; one for each sheet of the target template publication.
These functions all follow the same basic logic, they:
- Open the relevant data file, as specified in the
config
file. - Select the relevant columns, join and re-order if necessary.
- The function specifies a sheet name: open this sheet in the template, and find the cell with
<start>
in it. - Write the data to this sheet, starting at this cell.
- Find the cell with
<end>
written in it. - Delete all rows between the last row which we've written data to, and this
<end>
cell.
This last step needs further explanation. The nature of the publication is such that the number of rows printed each publication might vary; different months have different numbers of days, new regions might be added to the scope, etc. Given this, the most practical solution is to allow for an over-abundance of white space in the
template
document, and then delete as appropriate.
This project writes two simple sheets: 2a
and 2b
. The functions for writing these sheets are straightforward: select the relevant data, and write it to the workbook.
-
Open
main.py
-
Comment out the lines which call the other projects:
#medium_project.make_excel_output() #advanced_project.make_excel_output()
by adding
#
to the beginning of each line -
Run
python main.py
in the terminal.
This project is the appropriate starting point when you have a number of separate CSV files, each of which you want to write to a separate sheet.
The easy project is very simple to adapt. The data_for_sheet_easy_a.csv
file is written, almost directly, to sheet Easy A
in the output file.
- Copy the template .xlsx file, and duplicate the example sheets within your new version, naming the sheets and their columns appropriately
- Add your CSV files to the
data
folder, and add functions to theconfig.py
file to load these - Copy and rename the
make_and_write_easy_a
function inutils.py
, and adapt it to your sheet and data source. - Open
easy_project.py
and replace the functions called withinmake_excel_output
with your new functions, and change the template path to your new template.
This adds a little complexity, in that we will now be handling data from multiple sources.
Sheets 3.x require data from the appointments
and practices
CSV files to be joined, according to region - This is handled by the combine_practices_with_appts
function. Once the data from these has been joined, we can simply write it to the appropriate sheet.
-
Open
main.py
-
Comment out the lines which call the other projects:
#easy_project.make_excel_output() #advanced_project.make_excel_output()
by adding
#
to the beginning of each line -
Run
python main.py
in the terminal.
This project is the appropriate starting point when you have CSV files which contain more information than you want o appear on any given individual sheet. The functions used in this project do more manipulation on the data once it's been loaded in. Some of the sheets produced here also require that data be joined from two separate CSV files. As such, the functions for producing each sheet here contain more logic than in the 'easy' project.
As in the easy project, load your data into the data
folder, and create the corresponding functions in config
to load your data in.
Then, in utils.py
, create the functions which curates (and possibly join) the data you need for your sheets.
As you can see in the functions we've got here, we've given the data in the CSV files a breakdown
column: this means that we're able to easily identify the relevant rows from a 'long' dataset without much logic.
Now, place your template .xlsx
file in the project folder, making sure that your target sheets contain the <start>
and <end>
tags, as in the example template.
Once this is done, you can adapt or replace the medium_project.py
file, and call it from main.py
in the usual way.
This involved a significant step up in complexity from the medium project. Many NHS England publications contain a summary sheet which is heavily formatted, and which contains information about data which will appear in the rest of the sheets in the publication at a more granular level.
The other sheets have a data layout which more-or-less echoes a Pandas dataframe, and as such a single dataframe can be written without much manipulation. Table 1 on the other hand has a number of specific formatting and presentation criteria. These include; including a range of months in the sheet, giving most information twice (once as a count, once as a percentage), displaying the information in a form which is the transpose of the convention (here the data fields are the rows, and the entries are the columns), having multiple blank spaces, and so on.
All of these features present us with difficulties for formatting this sheet. Should we create a very wide dataframe, and write its transpose? Or many small ones, and write them one by one?
The solution we have implemented is to assign each and every cell in a given 'column' a unique cell identifier, and to provide a matching CSV file specifically for the purpose of populating Table 1.
The functions in table_1.py
then search through for these specific tags and replace them with the appropriate values.
This allows us to minimise the risk of introducing errors, and use a simple function in a loop to write each cell of the sheet.
-
Open
main.py
-
Comment out the lines which call the other projects:
#easy_project.make_excel_output() #medium_project.make_excel_output()
by adding
#
to the beginning of each line -
Run
python main.py
in the terminal.
This project is more specific than the other two. It is appropriate for adapting to produce NHS England publications which contain a heavily formatted, high-level, summary sheet.
This involved more work than adapting the other two projects.
We will deal only with how to adapt the summary page, the Table 1
sheet, since the other sheets are equivalent to those in the 'medium' project.
- Make a copy of the publication which you are trying to automate. Leave all of the headers and column names intact, but delete all of the numbers/data from the cells.
- Next, fill those empty cells with written tags which correspond meaningfully to the data which will be written there. This is less laborious than it might look; using the column names to populate the tags, and copious amounts of copy-pasting, you can populate your template in < 10 minutes.
The data used to populate Table 1
in the example project is specifically written for this purpose; every cell tag combined with a month
yields a single value. Writing your CSV data in this format makes it much easier to incorporate into this pipeline, and save you from having to write much logic here.
If your summary sheet is formatted in a way similar to the example, then you will be able to implement it without changing too much of the code. You will need to create an equivalent to cell_1_tags.py
, and populate it with the tags from your project.
You might also need to replace month
with whichever index it is that you are iterating over, and adjust the logic accordingly.