In February 2017 we reported on some findings from an analysis of Art UK's digital archive, which catalogues more than 200,000 of the nation's oil paintings.
The story was the result of a lot of work using the programming language R to analyse the large number of records involved, with further cleaning and analysis often taking place in Excel. Below you'll find a number of files that detail how various questions were asked and answered.
The dataset is particularly large, so most of the initial analysis was done using R. See the scripts below in RMarkdown for an explanation.
- Summary by medium (CSV)
- Summary by catalogue (CSV)
- Summary by collection (CSV)
- Summary by title (CSV)
- Summary by title where titles containing a monarch's name are replaced by that monarch's name (CSV): this was an experimental dataset and was not used
- Summary by title where works mentioning royal terms are marked 'TRUE' (CSV): this was an experimental dataset and was not used
- List of schools (CSV)
- Bar chart: Most painted monarchs
- Bar chart: Most popular tags
- Pie chart: Gender breakdown of tags
- Bar chart: Artists with the most pieces in the collection
- Bar chart: Largest collections
- Cleaning up data to extract years: as detailed in the report, date entries were inconsistent, with many works of art having no date at all, or a date range. Those could be as specific as 1920-1922 or as general as c18thC. This is how we tried to clean up dates to get as many as possible into a format that could be used in calculations.
- Summarising by unknown artist
- Keyword counting: tags were in one column, comma separated, which required some coding to extract and count
- Identifying royals: we wanted to identify which members of royalty appeared more than once. The process and code are detailed here.
- This Excel workbook outlines some final cleaning processes in Excel following the work in R detailed above. For example, some artist names which contained 'school' were not unknown artists but included some identification. Using the `LEN` function was one way to identify outliers with long names which might not be unknown after all. In the final article we did not need an exact number, but this helped us to more confidently state the numbers that were indeed unknown.
- Generating regex for identifying royals in R: this spreadsheet shows how we used formulae in Excel to generate the regex that was needed in R to extract the names of royalty.
- Further cleaning on artworks mentioning royals: some paintings mentioning royalty were not of royalty: for example, a train named after a king. This spreadsheet shows some of the processes involved in identifying those.
- Summary by title - works mentioning monarchs only (CSV): this included extraction of text either side of the monarch's name to show context
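
To give a flavour of the date clean-up listed above, here is a minimal sketch in Python rather than the R we actually used. It is an illustration only, not our script: the function name `extract_year`, the rule of keeping the earliest year in a range, and the decision to leave century-only entries such as c18thC blank are all assumptions made for this example.

```python
import re

# A four-digit year between 1000 and 2099 that is not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def extract_year(raw):
    """Return the earliest four-digit year found in a date entry, or None.

    Assumption for this sketch: a range such as "1920-1922" is reduced to
    its first year, and vague entries such as "c18thC" yield None.
    """
    if not raw:
        return None
    years = [int(year) for year in YEAR_PATTERN.findall(raw)]
    return min(years) if years else None


# Hypothetical date entries in the styles described above.
entries = ["1920-1922", "c18thC", "", "1762"]
cleaned = {entry: extract_year(entry) for entry in entries}
```

Entries that come back as `None` are the ones that would still need handling by hand, or excluding from any calculation based on year.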
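
The keyword counting step can be sketched in a similar way. Again this is Python for illustration, not the R code linked above. The function name `count_tags`, the lower-casing of tags and the sample rows are assumptions for the example, not taken from the Art UK data.

```python
from collections import Counter


def count_tags(tag_cells):
    """Count tags across rows where each cell holds comma-separated tags."""
    counts = Counter()
    for cell in tag_cells:
        if not cell:
            continue  # skip artworks with no tags
        # Split on commas, trim whitespace, and lower-case so that
        # "Portrait" and "portrait" are counted together (an assumption).
        tags = [tag.strip().lower() for tag in cell.split(",") if tag.strip()]
        counts.update(tags)
    return counts


# Hypothetical tag cells, one per artwork.
rows = ["portrait, man, hat", "Portrait, woman", "", "landscape, man"]
tag_counts = count_tags(rows)
```

Calling `tag_counts.most_common(10)` would then give the ten most frequent tags, which is the kind of summary behind a "most popular tags" chart.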
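
For the royals work, the general idea of building one regex from a list of names and then pulling out the text either side of a match might look something like this. It is a Python sketch under stated assumptions, not the Excel-generated regex or R code we used: the helper names `build_royal_pattern` and `find_royal_with_context`, the 20-character context width and the short list of names are all hypothetical.

```python
import re


def build_royal_pattern(names):
    """Compile a single case-insensitive regex matching any of the names."""
    # Longest names first, so "Charles II" is tried before "Charles I".
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Word boundaries stop a name matching inside a longer word or numeral.
    return re.compile(r"\b(" + alternation + r")\b", flags=re.IGNORECASE)


def find_royal_with_context(title, pattern, width=20):
    """Return (name, text before, text after) for the first match, or None."""
    match = pattern.search(title)
    if match is None:
        return None
    before = title[max(0, match.start() - width):match.start()]
    after = title[match.end():match.end() + width]
    return match.group(1), before.strip(), after.strip()


# Hypothetical list of names and title.
pattern = build_royal_pattern(["Charles I", "Charles II", "Queen Victoria"])
result = find_royal_with_context("Portrait of Charles II in Armour", pattern)
```

Keeping the surrounding text is what makes it possible to spot titles that mention a monarch without depicting one, such as a train named after a king.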
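
The `LEN` check on artist names was done in Excel, but the same idea can be expressed in a few lines of Python. This is only a sketch: the function name `flag_long_school_names`, the threshold of 20 characters and the sample names are assumptions chosen for illustration.

```python
def flag_long_school_names(names, max_length=20):
    """Return names containing 'school' that are longer than max_length.

    Long entries are candidates for manual review, since they may carry
    some identification rather than being truly unknown artists.
    """
    return [
        name
        for name in names
        if "school" in name.lower() and len(name) > max_length
    ]


# Hypothetical artist entries.
artists = ["British School", "Italian School", "School of Anthony van Dyck"]
to_review = flag_long_school_names(artists)
```

As with the spreadsheet, the output is a shortlist for a person to check, not a definitive classification.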