% \documentclass[12pt,twocolumn]{article}
% Copernicus stuff
\documentclass[gmd,manuscript]{copernicus}
%\documentclass[gmd,manuscript]{../171128_Copernicus_LaTeX_Package/copernicus}
% page/line labeling and referencing
% from http://seananderson.ca/2013/04/28/cross-referencing-reviewer-replies-in-latex.html
% \newcommand{\pllabel}[1]{\label{p-#1}\linelabel{l-#1}}
% \newcommand{\plref}[1]{see page~\pageref{p-#1}, line~\lineref{l-#1}.}
% answer environment for reviewer responses
% \newenvironment{answer}{\color{blue}}{}
% \usepackage{enumitem}
% \hypersetup{colorlinks=true,urlcolor=blue,citecolor=red}
% \hypersetup{colorlinks=false}
% \newcommand{\degree}{\ensuremath{^\circ}}
% \newcommand{\order}{\ensuremath{\mathcal{O}}}
% \newcommand{\bibref}[1] { \cite{ref:#1}}
% \newcommand{\pipref}[1] {\citep{ref:#1}}
% \newcommand{\ceqref}[1] {\mbox{CodeBlock \ref{code:#1}}}
% \newcommand{\charef}[1] {\mbox{Chapter \ref{cha:#1}}}
% \newcommand{\eqnref}[1] {\mbox{Eq. \ref{eq:#1}}}
% \newcommand{\figref}[1] {\mbox{Figure \ref{fig:#1}}}
% \newcommand{\secref}[1] {\mbox{Section \ref{sec:#1}}}
% \newcommand{\appref}[1] {\mbox{Appendix \ref{sec:#1}}}
% \newcommand{\tabref}[1] {\mbox{Table \ref{tab:#1}}}
% \newcommand{\urlref}[2] {\href{#1}{#2}\footnote{\url{#1}, retrieved \today.}}
% \newcommand{\editorial}[1]{\protect{\color{red}#1}}
\runningtitle{WIP Paper Draft \today}
\runningauthor{Balaji et al.}
\begin{document}
\title{Requirements for a global data infrastructure in support of CMIP6}
\Author[1,2]{Venkatramani}{Balaji}
\Author[3]{Karl E.}{Taylor}
\Author[4]{Martin}{Juckes}
\Author[5,4]{Bryan N.}{Lawrence}
\Author[3]{Paul J.}{Durack}
\Author[6]{Michael}{Lautenschlager}
\Author[7,2]{Chris}{Blanton}
\Author[8]{Luca}{Cinquini}
\Author[9]{S\'ebastien}{Denvil}
\Author[10]{Mark}{Elkington}
\Author[9]{Francesca}{Guglielmo}
\Author[9,4]{Eric}{Guilyardi}
\Author[4]{David}{Hassell}
\Author[11]{Slava}{Kharin}
\Author[6]{Stefan}{Kindermann}
\Author[1,2]{Sergey}{Nikonov}
\Author[7,2]{Aparna}{Radhakrishnan}
\Author[6]{Martina}{Stockhause}
\Author[6]{Tobias}{Weigel}
\Author[3]{Dean}{Williams}
\affil[1]{Princeton University, Cooperative Institute for Climate
Science, Princeton, NJ 08540, USA}
\affil[2]{NOAA/Geophysical Fluid Dynamics Laboratory, Princeton, NJ 08540,
USA}
\affil[3]{PCMDI, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA}
\affil[4]{Science and Technology Facilities Council, Abingdon, UK}
\affil[5]{National Center for Atmospheric Science and University of
Reading, UK}
\affil[6]{Deutsches KlimaRechenZentrum GmbH, Hamburg, Germany}
\affil[7]{Engility Inc., NJ, USA}
\affil[8]{Jet Propulsion Laboratory (JPL), 4800 Oak Grove Drive,
Pasadena, CA 91109, USA}
\affil[9]{Institut Pierre-Simon Laplace, CNRS/UPMC, Paris, France}
\affil[10]{Met Office, FitzRoy Road, Exeter, EX1 3PB, UK}
\affil[11]{Canadian Centre for Climate Modelling and Analysis, Atmospheric Environment Service, University of Victoria, BC, Canada}
% \affil[10]{NCAR}
\correspondence{V. Balaji (\texttt{balaji@princeton.edu})}
\received{}
\pubdiscuss{} %% only important for two-stage journals
\revised{}
\accepted{}
\published{}
%% These dates will be inserted by Copernicus Publications during the typesetting process.
\firstpage{1}
\maketitle
% \pagebreak
\abstract{The World Climate Research Programme (WCRP)'s Working Group
on Climate Modelling (WGCM) Infrastructure Panel (WIP) was formed in
2014 in response to the explosive growth in size and complexity of
Coupled Model Intercomparison Projects (CMIPs) between CMIP3
(2005--06) and CMIP5 (2011--12). This article presents the WIP
recommendations for the global data infrastructure needed to support
CMIP design, future growth and evolution. Developed in close
coordination with those who build and run the existing
infrastructure (the Earth System Grid Federation; ESGF), the
recommendations are based on several principles beginning with the
need to separate requirements, implementation and operations. Other
important principles include the consideration of the diversity of
community needs around data -- a \emph{data ecosystem} -- the
importance of provenance, the need for automation, and the
obligation to measure costs and benefits.
This paper concentrates on requirements, recognising the diversity
of communities involved (modellers, analysts, software developers,
and downstream users). Such requirements include the need for
scientific reproducibility and accountability alongside the need to
record and track data usage. One key element is to adopt a
dataset-centric rather than system-centric focus, with the aim of
making the infrastructure less prone to systemic failure.
With these overarching principles and requirements, the WIP has
produced a set of position papers, which are summarized in the
latter pages of this document. They provide specifications for
managing and delivering model output, including strategies for
replication and versioning, licensing, data quality assurance,
citation, long-term archival, and dataset tracking. They also
describe a new and more formal approach for specifying what data,
and associated metadata, should be saved, which enables future data
volumes to be estimated, particularly for well-defined projects such
as CMIP6.
The paper concludes with a future-facing consideration of the global
data infrastructure evolution that follows from the blurring of
boundaries between climate and weather, and the changing nature of
published scientific results in the digital age. }
% \pagebreak
\introduction
\label{sec:intro}
CMIP6 \citep{ref:eyringetal2016a}, the latest Coupled Model
Intercomparison Project (CMIP), can trace its genealogy back to the
Charney Report \citep{ref:charneyetal1979}. This seminal report on the
links between CO$_2$ and climate was an authoritative summary of the
state of the science at the time and produced findings that have stood
the test of time \citep{ref:bonyetal2013}. It is often noted
\citep[see, e.g.,][]{ref:andrewsetal2012} that the range and uncertainty
bounds on equilibrium climate sensitivity generated in this report
have not fundamentally changed, despite the enormous increase in
resources devoted to analysing the problem in the decades since
\citep[see, e.g.,][]{ref:knuttietal2017}.
Beyond its enduring findings on climate sensitivity, the Charney
Report also gave rise to a methodology for the treatment of
uncertainties and gaps in understanding, which has been equally
influential, and is in fact the basis of CMIP itself. The Report can
be seen as one of the first uses of the \emph{multi-model ensemble}.
At the time, there were two models available representing the
equilibrium response of the climate system to a change in CO$_2$
forcing, one from Syukuro Manabe's group at NOAA's Geophysical Fluid
Dynamics Laboratory (NOAA-GFDL) and the other from James Hansen's
group at NASA's Goddard Institute for Space Studies (NASA-GISS). Then
as now, these groups marshalled vast state-of-the-art computing and
data resources to run very challenging simulations of the Earth
system. The report's results were based on an ensemble of three runs
from the Manabe group \citep[see, e.g.,][]{ref:manabewetherald1975} and
two from the Hansen group \citep[see, e.g.,][]{ref:hansenetal1981}.
The Atmospheric Model Intercomparison Project
\citep[AMIP:][]{ref:gates1992} was one of the first systematic
cross-model comparisons open to anyone who wished to participate. By
the time of the Intergovernmental Panel on Climate Change (IPCC)'s
First Assessment Report (FAR) in 1990 \citep{ref:houghtonetal1992},
the process had been formalized. At this stage, there were five models
participating in the exercise, and some of what is now called the
``Diagnosis, Evaluation, and Characterization of Klima'' \citep[DECK,
see][]{ref:eyringetal2016a} experiments\footnote{``Klima'' is German
for ``climate''.} had been standardized (AMIP, a pre-industrial
control, 1\% per year CO$_2$ increase to doubling, etc.). The future
``scenarios'' had emerged as well, for a total of five different
experimental protocols. Fast-forwarding to today,
\href{https://rawgit.com/WCRP-CMIP/CMIP6_CVs/master/src/CMIP6_source_id.html}{CMIP6
expects more than 100
models}\footnote{https://rawgit.com/WCRP-CMIP/CMIP6\_CVs/master/src/CMIP6\_source\_id.html,
retrieved \today.} from
\href{https://rawgit.com/WCRP-CMIP/CMIP6_CVs/master/src/CMIP6_institution_id.html}{more
than 40 modelling
centres}\footnote{https://rawgit.com/WCRP-CMIP/CMIP6\_CVs/master/src/CMIP6\_institution\_id.html,
retrieved \today.} \citep[in 27 countries, a stark contrast to the
US monopoly in][]{ref:charneyetal1979} to participate in the DECK and
historical experiments \citep[Table~2 of][]{ref:eyringetal2016a}, and
some subset of these to participate in one or more of the 23 MIPs
endorsed by the CMIP Panel \citep[Table~3 of][originally 21, with two
new MIPs more recently endorsed]{ref:eyringetal2016a}. The
\href{https://rawgit.com/WCRP-CMIP/CMIP6_CVs/master/src/CMIP6_experiment_id.html}{MIPs
call for 287
experiments}\footnote{https://rawgit.com/WCRP-CMIP/CMIP6\_CVs/master/src/CMIP6\_experiment\_id.html,
retrieved \today.}, a considerable expansion over CMIP5.
Alongside the experiments themselves is the
\href{http://clipc-services.ceda.ac.uk/dreq/index.html}{Data
Request}\footnote{http://clipc-services.ceda.ac.uk/dreq/index.html,
retrieved \today.} which defines, for each CMIP experiment, what
output each model should provide for analysis. The complexity of this
data request has also grown tremendously over the CMIP era. A typical
dataset from the FAR archive
(\href{https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=IPCC_DDC_FAR_GFDL_R15TRCT_D}{from
the GFDL R15
model}\footnote{https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=IPCC\_DDC\_FAR\_GFDL\_R15TRCT\_D,
retrieved \today.}) lists climatologies and time series of a few
basic climate variables such as surface air temperature, and the
dataset size is about 200~MB. The CMIP6 Data Request
\citep{ref:juckesetal2015} lists thousands of variables, from
8 modelling \emph{realms} (e.g. atmosphere, ocean, land, atmospheric
chemistry, land ice, ocean biogeochemistry and sea ice) from the
hundreds of experiments mentioned above. This growth in complexity is
testament to the modern understanding of many physical, chemical and
biological processes which were simply absent from the Charney
Report-era models.
The simulation output is now a primary scientific resource for
researchers the world over, rivaling the volume of observed weather
and climate data from the global array of sensors and satellites
\citep{ref:overpecketal2011}. Climate science, and observed and
simulated climate data in particular, have now become primary elements
in the ``vast machine'' \citep{ref:edwards2010} serving the global
climate and weather research enterprise.
% It could be worthwhile to quantify (in $USD) the impact, as forecasting
% in particular has yielded considerable social and economic gains
Managing and sharing this huge amount of data is an enterprise in its
own right -- and the solution established for CMIP5 was the global
Earth System Grid Federation
\citep[ESGF,][]{ref:williamsetal2011a,ref:williamsetal2015}. ESGF was
identified by the WCRP Joint Scientific Committee in 2013 as the
recommended infrastructure for data archiving and dissemination for
the Programme. A map of sites participating in the ESGF is shown in
Figure~\ref{fig:esgf} drawn from the
\href{https://portal.enes.org/data/is-enes-data-infrastructure/esgf}{IS-ENES
Data
Portal}\footnote{https://portal.enes.org/data/is-enes-data-infrastructure/esgf,
retrieved \today.}. The sites are diverse and responsive to many
national and institutional missions. With multiple agencies and
institutions, and many uncoordinated and possibly conflicting
requirements, the ESGF itself is a complex and delicate artifact to
manage.
\begin{figure*}
\begin{center}
\includegraphics[width=175mm]{images/esgf-map-2017.png}
\end{center}
\caption{Sites participating in the Earth System Grid Federation in
May 2017. Figure courtesy of the IS-ENES Data Portal.}
\label{fig:esgf}
\end{figure*}
The sheer size and complexity of this infrastructure emerged as a
matter of great concern at the end of CMIP5, when the growth in data
volume relative to CMIP3 (from 40~TB to 2~PB, a 50-fold increase in 6
years) suggested the community was on an unsustainable path. These
concerns led to the 2014 recommendation of the WGCM to form an
\emph{infrastructure panel} (based upon
\href{https://drive.google.com/file/d/0B7Pi4aN9R3k3OHpIWC16Z0JBX3c/view?usp=sharing}{a
proposal}\footnote{https://drive.google.com/file/d/0B7Pi4aN9R3k3OHpIWC16Z0JBX3c/view?usp=sharing,
retrieved \today.} at the 2013 annual meeting). The WGCM
Infrastructure Panel (WIP) was tasked with examining the global
computational and data infrastructure underpinning CMIP, and improving
communication between the teams overseeing the scientific and
experimental design of these globally coordinated experiments, and the
teams providing resources and designing that infrastructure. The
communication was intended to be two-way: providing input both to the
provisioning of infrastructure appropriate to the experimental design,
and informing the scientific design of the technical (and financial)
limits of that infrastructure.
This paper provides a summary of the findings by the WIP in the first
three years of activity since its formation in 2014, and the
consequent recommendations -- in the context of existing
organisational and funding constraints. In the text below, we refer to
\emph{findings}, \emph{requirements}, and \emph{recommendations}.
Findings refer to observations about the state of affairs:
technologies, resource constraints and the like, based upon our
analysis. Requirements are design goals that have been shared with
those building the infrastructure, such as the ESGF software and
security stack. Recommendations are our guidance to the community:
experiment designers, modelling centres, and the users of climate
data.
The intended audience for the paper is primarily the CMIP6 scientific
community. In particular, we aim to show how the scientific design of
CMIP6 as outlined in \cite{ref:eyringetal2016a} translates into
infrastructural requirements. We hope this will be instructive to the
MIP chairs and creators of multi-model experiments, highlighting the
resource implications of their experimental design, and to data
providers (modelling centres), explaining the sometimes opaque
requirements imposed upon them as a requisite for participation. By
describing how design of this infrastructure is severely constrained
by resources, we hope to provide a useful perspective to those who
find data acquisition and analysis a technical challenge. Finally, we
hope this will be of interest to general readers of the journal from
other geoscience fields, illuminating the particular character of
global data infrastructure for climate data, where the community of
users far outstrips, in numbers and diversity, the Earth system
modelling community itself.
In Section~\ref{sec:principles}, the principles and scientific
rationale underlying the requirements for global data infrastructure
are articulated. In Section~\ref{sec:dreq} the CMIP6 Data Request is
covered: standards and conventions, requirements for modelling centres
to process a complex data request, and projections of data volume. In
Section~\ref{sec:licensing}, the recent evolution in how data are
archived is reviewed alongside a licensing strategy consistent with
current practice and scientific principle. In Section~\ref{sec:cite}
issues surrounding data as a citable resource are discussed, including
the technical infrastructure for the creation of citable data, and the
documentation and other standards required to make data a first-class
scientific entity. In Section~\ref{sec:replica} the implications of
data replicas, and in Section~\ref{sec:version} issues surrounding
data versioning, retraction, and errata are addressed.
Section~\ref{sec:summary} provides an outlook for the future of global
data infrastructure, looking beyond CMIP6 towards a unified view of
the ``vast machine'' for weather and climate data and computation.
\section{Principles and Constraints}
\label{sec:principles}
This section lays out some of the principles and constraints which
have resulted from the evolution of infrastructure requirements since
the first CMIP experiment -- beginning with a historical context.
\subsection{Historical Context}
\label{sec:history}
In the pioneering days of CMIP, the community of participants was
small and well-knit, and all the issues involved in generating
datasets for common analysis from different modelling groups were
settled by mutual agreement (Ron Stouffer, personal communication).
Analysis was performed by the same community that performed the
simulations. The Program for Climate Model Diagnosis and
Intercomparison (PCMDI), established at Lawrence Livermore National
Laboratory (USA) in 1989, had championed the idea of more systematic
analysis of models, and in close cooperation with the climate
modelling centres, PCMDI assumed responsibility for much of the
day-to-day coordination of CMIP. Until CMIP3, the hosting of datasets
from different modelling groups could be managed at a single archival
site; PCMDI alone hosted the entire 40~TB archive.
From its earliest phases, CMIP grew in importance, and its results
have provided a major pillar that supports the periodic
Intergovernmental Panel on Climate Change (IPCC) assessment
activities. However, the explosive growth in the scope of CMIP,
especially between CMIP3 and CMIP5, represented a tipping point in the
supporting infrastructure. Not only was it clear that no one site
could manage all the data, the necessary infrastructure software and
operational principles could no longer be delivered and managed by
PCMDI alone.
For CMIP5, PCMDI sought help from a number of partners under the
auspices of the Global Organisation of Earth System Science Portals
(GO-ESSP). Many of the GO-ESSP partners who became the foundation
members and developers of the Earth System Grid Federation retargeted
existing research funding to help develop ESGF. The primary heritage
derived from the original U.S. Earth System Grid project funded by the
U.S. Department of Energy, but increasingly major contributions came
from new international partners. This meant that many aspects of the
ESGF system began from work which was designed in the context of
different requirements, collaborations and objectives. At the
beginning, none of the partners had funds for operational support for
the fledgling international federation, and even after the end of
CMIP5 proper (circa 2014), the ongoing ESGF has been sustained
primarily by small amounts of funding at a handful of the primary ESGF
sites. Most ESGF sites have had little or no formal operational
support. Many of the known limitations of the CMIP5 ESGF -- both in
terms of functionality and performance -- were a direct consequence of
this heritage.
With the advent of CMIP6 (along with sister projects such as
obs4MIPs, input4MIPs and CREATE-IP), it was clear that a fundamental
reassessment would be needed to address the evolving scientific and
operational requirements. That clarity led to the establishment of the
WIP, but it has yet to lead to any formal joint funding arrangement --
the ESGF and the data nodes within it remain funded (if at all; many
data nodes are marginal activities supported on a best-efforts basis) by
national agencies with disparate timescales and objectives. Several
critical software elements are also being developed through volunteer
effort and on shoestring budgets. This finding has been noted in the US
National Academies Report on ``A National Strategy for Advancing
Climate Modeling'' \citep{ref:nasem2012}, which warned of the
consequences of inadequate infrastructure funding.
\subsection{Infrastructural Principles}
\label{sec:infra-principles}
\begin{enumerate}
\item With greater complexity and a globally distributed data
resource, it has become clear that in the design of globally
coordinated scientific experiments, the global computational and
data infrastructure needs to be formally examined as an integrated
element.
The membership of the WIP, drawn as it is from experts in various
aspects of the infrastructure, is a direct consequence of this
requirement for integration. Representatives of modelling centres,
infrastructure developers, and stakeholders in the scientific design
of CMIP and its output comprise the panel membership. One of the
WIP's first acts was to consider three phases in the process of
infrastructure development: \emph{requirements},
\emph{implementation}, and \emph{operations}, all informed by the
builders of workflows at the modelling centres.
\begin{itemize}
\item The WIP, in consort with the WCRP's CMIP Panel, takes
responsibility to articulate \emph{requirements} for the
infrastructure.
\item The \emph{implementation} is in the hands of the
infrastructure developers, principally ESGF for the federated
archive \citep{ref:williamsetal2015}, but also related projects
like Earth System Documentation
\citep[\href{https://www.earthsystemcog.org/projects/es-doc-models/}{ES-DOC}\footnote{https://www.earthsystemcog.org/projects/es-doc-models/,
retrieved \today.},][]{ref:guilyardietal2013}.
\item In 2016 at the WIP's request, the CMIP6 Data Node
\emph{Operations} Team (CDNOT) was formed. It is charged with
ensuring that all the infrastructure elements needed by CMIP6 are
properly deployed and actually working as intended at the sites
hosting CMIP6 data. It is also responsible for the operational
aspects of the federation itself, including specifying what
versions of the toolchain are run at every site at any given time,
and organising coordinated version and security upgrades across
the federation.
\end{itemize}
Although there is now a clear separation of concerns into
requirements, implementation, and operations, close links are
maintained by cross-membership between the key bodies, including the
WIP itself, the CMIP Panel, the ESGF Executive Committee, and the
CDNOT.
\item\label{broad} With the basic fact of anthropogenic climate change
now well established \citep[see, e.g.,][]{ref:stockeretal2013}, the
scientific communities with an interest in CMIP are expanding. For
example, a substantial body of work has begun to emerge to examine
climate impacts. In addition to the specialists in Earth system
science -- who also design and run the experiments and produce the
model output -- those relying on CMIP output now include those
developing and providing climate services, as well as
\emph{consumers} from allied fields studying the impacts of climate
change on health, agriculture, natural resources, human migration,
and similar issues \citep{ref:mossetal2010}. This confronts us with
a \emph{scientific scalability} issue (the data during its lifetime
will be consumed by a community much larger, both in sheer numbers
and in breadth of interest and perspective, than the Earth
system modelling community itself), which needs to be addressed.
Accordingly, we note the requirement that infrastructure should
ensure maximum transparency and usability for user (consumer)
communities at some distance from the modelling (producer)
communities.
\item\label{repro} While CMIP and the IPCC are formally independent,
the CMIP archive is increasingly a reference in formulating climate
policy. Hence the \emph{scientific reproducibility}
\citep{ref:collinstabak2014} and the underlying \emph{durability}
and \emph{provenance} of data have now become matters of central
importance: being able to trace back, long after dataset creation,
from model output to the configuration of models and the procedures
and choices made along the way. This led the IPCC to require data
distribution centres (DDCs) that attempt to guarantee the archival
and dissemination of this data in perpetuity, and consequently to a
requirement in the CMIP context of achieving reproducibility. Given
the use of multi-model ensembles for both consensus estimates and
uncertainty bounds on climate projections, it is important to
document -- as precisely as possible, given the independent
genealogy and structure of many models -- the details and
differences among model configurations and analysis methods, to
deliver both the requisite provenance and the routes to
reproduction.
\item\label{analysis} With the expectation that CMIP DECK experiment
results should be routinely contributed to CMIP, opportunities now
exist for engaging in a more systematic and routine evaluation of
Earth System Models (ESMs). This has led to community efforts to
develop standard metrics of model ``quality''
\citep{ref:eyringetal2016,ref:gleckleretal2016}. Typical multi-model
analysis has hitherto taken the multi-model average, assigning equal
weight to each model, as the most likely estimate of climate
response. This ``model democracy'' \citep{ref:knutti2010} has been
called into question and there is now a considerable literature
exploring the potential of weighting models by quality
\citep{ref:knuttietal2017}. The development of standard metrics
would aid this kind of research.
To that end, there is now a requirement to enable, through the ESGF, a
framework for accommodating quasi-operational evaluation tools that
could routinely execute a series of standardized evaluation tasks.
This would provide data consumers with an increasingly (over time)
systematic characterization of models. It may be some time before a
fully operational system of this kind can be implemented, but
planning must start now.
In addition, there is an increased interest in climate analytics as
a service \citep{ref:balajietal2011,ref:schnaseetal2017}. This
follows the principle of placing analysis close to the data. Some
centres plan to add resources that combine archival and analysis
capabilities, e.g., NCAR's
\href{https://www2.cisl.ucar.edu/resources/cmip-analysis-platform}{CMIP
Analysis Platform}\footnote{https://www2.cisl.ucar.edu/resources/cmip-analysis-platform,
retrieved \today.}, or the UK's JASMIN
\citep{ref:lawrenceetal2013}. There are also new efforts to bring
climate data storage and analysis to the cloud era
\citep[e.g.,][]{ref:duffyetal2015}. Platforms such as
\href{http://pangeo-data.org/}{Pangeo}\footnote{http://pangeo-data.org/,
retrieved \today.} show promise in this realm, and widespread
experimentation and adoption are encouraged.
\item As the experimental design of CMIP has grown in complexity,
costs both in time and money have become a matter of great concern,
particularly for those designing, carrying out, and storing
simulations. In order to justify commitment of resources to CMIP,
mechanisms to identify costs and benefits in developing new models,
performing CMIP simulations, and disseminating the model output need
to be developed.
To quantify the scientific impact of CMIP, measures are needed to
\emph{track} the use of model output and its value to consumers. In
addition to usage quantification, credit and tracing data usage in
literature via citation of data is important. Current practice is at
best citing large data collections provided by a CMIP participant,
or all of CMIP. Accordingly, we note the need for a mechanism to
identify and \emph{cite} data provided by each modelling centre.
Alongside the intellectual contribution to model development, which
can be recognized by citation, there is a material cost to centres
in computing and data processing, which is both burdensome and
poorly understood by those requesting, designing and using the
results from CMIP experiments, who might not be in the business of
model development. The criteria for endorsement introduced in CMIP6
\citep[see Table~1 in][]{ref:eyringetal2016a} begin to grapple with
this issue, but the costs still need to be measured and recorded. To
begin documenting these costs for CMIP6, the ``Computational
Performance'' MIP project (CPMIP) \citep{ref:balajietal2017} has
been established, which will measure, among other things, throughput
(simulated years per day) and cost (core-hours and joules per
simulated year) as a function of model resolution and complexity.
New tools for estimating data volumes have also been developed; see
Section~\ref{sec:data-request} below.
\item\label{cmplx} Experimental specifications have become ever more
complex, making it difficult to verify that experiment
configurations conform to those specifications. Several modelling
centres have encountered this problem in preparing for CMIP6,
noting, for example, the challenging intricacies in dealing with
input forcing data \citep[see][]{ref:duracketal2018}, output
variable lists \citep{ref:juckesetal2015}, and crossover
requirements between the endorsed MIPs and the DECK
\citep{ref:eyringetal2016a}. Moreover, these protocols inevitably
evolve over time, as errors are discovered or enhancements proposed,
and centres need to adapt their workflows accordingly.
We note therefore a requirement to encode the protocols to be
directly ingested by workflows, in other words,
\emph{machine-readable experiment design}. The intent is to avoid,
as far as possible, errors in conformance to design requirements
introduced by the need for humans to transcribe and implement the
protocols, for instance, deciding what variables to save from what
experiments. This is accomplished by encoding most of the
specifications in standard, structured, and machine-readable text
formats (XML and JSON), which can be directly read by the scripts
running the model and post-processing, as explained further below in
Section~\ref{sec:dreq}. The requirement spans all of the
\emph{controlled vocabularies}
(\href{https://github.com/WCRP-CMIP/CMIP6_CVs}{CMIP6\_CVs}\footnote{https://github.com/WCRP-CMIP/CMIP6\_CVs,
retrieved \today.}: for instance, the names assigned to models,
experiments, and output variables) used in the CMIP protocols as
well as the CMIP6 Data Request \citep{ref:juckesetal2015}, which
must be stored in version-controlled, machine-readable formats.
Precisely documenting the \emph{conformance} of experiments to the
protocols \citep{ref:lawrenceetal2012} is an additional requirement.
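As an illustration of how such controlled vocabularies can be ingested
directly by a workflow, the sketch below loads a small JSON fragment
patterned loosely on the CMIP6\_CVs layout and checks an experiment
name against it. The excerpt and the helper function are illustrative
assumptions, not the authoritative schema.

```python
import json

# Hypothetical excerpt mimicking the layout of a CMIP6_CVs file
# (e.g. CMIP6_experiment_id.json); the authoritative vocabularies live
# in the version-controlled WCRP-CMIP/CMIP6_CVs repository.
cv_text = """
{
  "experiment_id": {
    "piControl":  {"activity_id": ["CMIP"], "experiment": "pre-industrial control"},
    "historical": {"activity_id": ["CMIP"], "experiment": "all-forcing simulation of the recent past"}
  }
}
"""

def validate_experiment(name, cv):
    """Check that an experiment name conforms to the controlled vocabulary."""
    return name in cv["experiment_id"]

cv = json.loads(cv_text)
assert validate_experiment("piControl", cv)      # conforms to the CV
assert not validate_experiment("picontrol", cv)  # case matters: rejected
```

Because every tool in the chain reads the same machine-readable source,
a misspelled or unregistered name is caught mechanically rather than by
human inspection.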
\item\label{snap} The transition from a unitary archive at PCMDI in
CMIP3 to a globally federated archive in CMIP5 led to many changes
in the way users interact with the archive, which impacts management
of information about users and complicates communications with them.
In particular, a growing number of data users no longer registered
or interacted directly with the ESGF. Rather, they relied on
secondary repositories, often copies of some portion of the ESGF
archive created by others at a particular time (see for instance the
\href{http://www.ipcc-data.org/docs/factsheets/TGICA_Fact_Sheet_CMIP5_data_provided_at_the_IPCC_DDC_Ver_1_2016.pdf
}{IPCC CMIP5 Data
Factsheet}\footnote{http://www.ipcc-data.org/docs/factsheets/TGICA\_Fact\_Sheet\_CMIP5\_data\_provided\_at\_the\_IPCC\_DDC\_Ver\_1\_2016.pdf
, retrieved \today.} for a discussion of the snapshots and their
coverage). This meant that reliance on the ESGF's inventory of
registered users for any aspect of the infrastructure -- such as
tracking usage, compliance with licensing requirements, or informing
users about errata or retractions -- could at best ensure partial
coverage of the user base.
This key finding implies a more distributed design for several
features outlined below, devolving many of them to the
datasets themselves rather than the archives. One may think of this
as a \emph{dataset-centric rather than system-centric} design (in
software terms, a \emph{pull} rather than \emph{push} design):
information is made available upon request at the user/dataset
level, relieving the ESGF implementation of an impossible burden.
\end{enumerate}
Based upon the above considerations, the WIP produced a set of
position papers (see Appendix~\ref{sec:wip}) encapsulating
specifications and recommendations for CMIP6 and beyond. These papers,
summarised below, are available from the
\href{https://www.earthsystemcog.org/projects/wip/}{WIP
website}\footnote{https://www.earthsystemcog.org/projects/wip/,
retrieved \today.}. As the WIP continues to develop additional
recommendations, they too will be made available. As requirements
evolve, a modified document will be released with a new version
number.
\section{A structured approach to data production}
\label{sec:dreq}
The CMIP6 data framework has evolved considerably from CMIP5, and
follows the principles of scientific reproducibility (Item~\ref{repro}
in Section~\ref{sec:principles}) and the recognition that the
complexity of the experimental design (Item~\ref{cmplx}) required far
greater degrees of automation within the production workflow
generating simulation results. As a starting point, all elements in
the experiment specifications must be recorded in structured text
formats (XML and JSON, for example), and any changes must be tracked
through careful version control. \emph{Machine-readable} specification
of all aspects of the model output configuration is a design goal, as
noted earlier.
The data request spans several elements discussed in sub-sections
below.
\subsection{CMIP6 Data Request}
\label{sec:data-request}
The \href{http://clipc-services.ceda.ac.uk/dreq/index.html}{CMIP6 Data
Request}\footnote{http://clipc-services.ceda.ac.uk/dreq/index.html,
retrieved \today.} specifies which variables should be archived for
each experiment. It is one of the most complex elements of the CMIP6
infrastructure, reflecting the intricacy of the new design outlined in
\cite{ref:eyringetal2016a}. The experimental design now involves three
tiers of experiments, where an individual modelling group may choose
which ones to perform; and variables grouped by scientific goals and
priorities, where again centres may choose which sets to publish,
based on interests and resource constraints. There are also
cross-experiment data requests, where for instance the design may
require a variable in one experiment to be compared against the same
variable from a different experiment. The modelling groups will then
need to take this into account before beginning their simulations. The
CMIP6 Data Request is a codification of the entire experimental design
into a structured set of machine-readable documents, which can in
principle be directly ingested in data workflows.
The CMIP6 Data Request \citep{ref:juckesetal2015} combines definitions
of variables and their output format with specifications of the
objectives they support and the experiments that they are required
for. The entire request is encoded in an XML database with rigorous
type constraints. Important elements of the request, such as units,
cell methods (expressing the subgrid processing implicit in the
variable definition), sampling frequencies, and time ``slices''
(subsets of the entire simulation period as defined in the
experimental design) for required output, are defined using controlled
vocabularies that ensure consistency of interpretation. The request is
designed to enable flexibility, allowing modelling centres to make
informed decisions about the variables they should submit to the CMIP6
archive from each experiment.
% The data request spans several elements.
% \begin{enumerate}
% \item specification of the parameter to be calculated in terms of a CF
% standard name and units,
% \item an output frequency,
% \item a structural specification which includes specification of
% dimensions and of subgrid processing.
% \end{enumerate}
In order to facilitate the cross-linking between the 2100 variables
from the 287 experiments, the request database allows MIPs to
aggregate variables and experiments into groups. This allows MIPs to
designate variable groups by priority and provides for queries that
return the list of variables needed from any given experiment at a
specified time slice and frequency.
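A toy sketch of this grouping mechanism follows; the group names,
request items, and query function are invented for illustration and do
not reflect the actual data-request schema.

```python
# Variable groups aggregate (variable, frequency) pairs; request items
# link a group to an experiment and a time slice. All names are
# hypothetical examples, not actual CMIP6 Data Request content.
variable_groups = {
    "ocean-monthly": [("tos", "mon"), ("sos", "mon")],
    "atmos-daily":   [("tas", "day"), ("pr", "day")],
}
# Request items: (variable group, experiment, time slice)
request_items = [
    ("ocean-monthly", "piControl",  "all-years"),
    ("atmos-daily",   "amip",       "1979-2014"),
    ("ocean-monthly", "historical", "1850-2014"),
]

def variables_for(experiment, frequency):
    """All (variable, time slice) pairs requested from an experiment
    at a given frequency."""
    out = []
    for group, expt, tslice in request_items:
        if expt != experiment:
            continue
        for var, freq in variable_groups[group]:
            if freq == frequency:
                out.append((var, tslice))
    return out

print(variables_for("historical", "mon"))
# [('tos', '1850-2014'), ('sos', '1850-2014')]
```

The indirection through groups is what lets a single request item pull
a MIP's variable list into another MIP's experiment (or into the DECK)
without duplicating the variable definitions.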
% The link between variables and
% experiments is then made through the following chain:
% \begin{itemize}
% \item A \emph{variable group}, aggregating variables with priorities
% specific to the MIP defining the group;
% \item A \emph{request link} associating a variable group with an
% objective and a set of request items;
% \item \emph{Request} items associating a particular time slice with a
% request link and a set of experiments.
% \end{itemize}
This formulation takes into account the complexities that arise when a
particular MIP requests that variables needed for their own
experiments should also be saved from a DECK experiment or from an
experiment proposed by a different MIP.
The data request supports a broad range of users, who are provided
with a variety of access points. These include the entire
codification in the form of a structured (XML) document, web pages, or
spreadsheets, as well as a python API and command-line tools, to
satisfy a wide variety of usage patterns for accessing the data
request information.
% \begin{enumerate}
% \item The XML database provides the reference document;
% \item Web pages provide a direct representation of the database
% content;
% \item Excel workbooks provide selected overviews for specific MIPs and
% experiments;
% \item A python library provides an interface to the database with some
% built-in support functions;
% \item A command line tool based on the python library allows quick
% access to simple queries.
% \end{enumerate}
The data request's machine-readable database has been an extraordinary
resource for the modelling centres. They can, for example, directly
integrate the request specifications with their workflows to ensure
that the correct set of variables is saved for each experiment they
plan to run. In addition, it has given them a new-found ability to
estimate the data volume associated with meeting a MIP's requirements,
a feature exploited below in Section~\ref{sec:dvol}.
\subsection{Model inputs}
\label{sec:data-inputs}
Datasets used by the model for configuration of model inputs
\citep[Input Datasets for Model Intercomparison Projects
(\texttt{input4MIPs}), see][]{ref:duracketal2018} as well as
observations for comparison with models \citep[Observations for Model
Intercomparison Projects (\texttt{obs4MIPs}),
see][]{ref:teixeiraetal2014,ref:ferraroetal2015} are both now
organised in the same way, and share many of the naming and metadata
conventions of the CMIP model output itself. The coherence of
standards across model inputs, outputs, and observational datasets is
a development that will enable the community to build a rich toolset
across all of these datasets. The datasets follow the versioning
methodologies described in Section~\ref{sec:version}.
\subsection{Data Reference Syntax}
\label{sec:data-drs}
The organisation of the model output follows the
\href{https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit?usp=sharing
}{Data Reference Syntax
(DRS)}\footnote{https://docs.google.com/document/d/1h0r8RZr\_f3-8egBMMh7aqLwy3snpD6\_MrDz1q8n5XUk/edit?usp=sharing
, retrieved \today.} first used in CMIP5, and now in a somewhat
modified form in CMIP6. The DRS depends on pre-defined
\emph{controlled vocabularies}
\href{https://github.com/WCRP-CMIP/CMIP6_CVs}{CMIP6\_CVs}\footnote{https://github.com/WCRP-CMIP/CMIP6\_CVs,
retrieved \today.} for various terms including: the names of
institutions, models, experiments, time frequencies, etc. The CVs are
now recorded as a version-controlled set of structured text documents,
and satisfy the requirement that there is a
\href{https://github.com/WCRP-CMIP/CMIP6_CVs }{single authoritative
source for any CV}\footnote{https://github.com/WCRP-CMIP/CMIP6\_CVs
, retrieved \today.}, on which all elements in the toolchain will
rely. The DRS elements that rely on these controlled vocabularies
appear as netCDF attributes and are used in constructing file names,
directory names, and unique identifiers of datasets that are essential
throughout the CMIP6 infrastructure. These aspects are covered in
detail in the
\href{https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_global_attributes_filenames_CVs_v6.2.6.pdf
}{CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and
CVs}\footnote{https://www.earthsystemcog.org/site\_media/projects/wip/CMIP6\_global\_attributes\_filenames\_CVs\_v6.2.6.pdf
, retrieved \today.} position paper. A new element in the DRS
indicates whether data have been stored on a native grid or have been
regridded (see discussion below in Section~\ref{sec:dvol} on the
potentially critical role of regridded output). This element of the
DRS will allow us to track the usage of the \emph{regridded subset} of
data and assess the relative popularity of native-grid vs.
standard-grid output.
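The sketch below composes and parses a filename following the general
CMIP6 DRS template
(\texttt{variable\_id\_table\_id\_source\_id\_experiment\_id\_member\_id\_grid\_label\_time\_range.nc});
the authoritative definition is the position paper above, and the field
values used here are illustrative (\texttt{EXAMPLE-ESM} is a made-up
model name).

```python
# Fields of the CMIP6 DRS filename template, in order; see the "CMIP6
# Global Attributes, DRS, Filenames, Directory Structure, and CVs"
# position paper for the authoritative specification.
DRS_FIELDS = ["variable_id", "table_id", "source_id",
              "experiment_id", "member_id", "grid_label", "time_range"]

def make_filename(**fields):
    return "_".join(fields[k] for k in DRS_FIELDS) + ".nc"

def parse_filename(name):
    parts = name[:-len(".nc")].split("_")
    return dict(zip(DRS_FIELDS, parts))

fname = make_filename(variable_id="tas", table_id="Amon",
                      source_id="EXAMPLE-ESM", experiment_id="historical",
                      member_id="r1i1p1f1", grid_label="gn",
                      time_range="185001-201412")
print(fname)  # tas_Amon_EXAMPLE-ESM_historical_r1i1p1f1_gn_185001-201412.nc

meta = parse_filename(fname)
assert meta["grid_label"] == "gn"  # "gn" denotes output on the native grid
```

Because the grid label is a filename component, usage statistics for
native-grid versus regridded data can be gathered by simple string
operations, without opening the files themselves.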
\subsection{CMIP6 data volumes}
\label{sec:dvol}
As noted, extrapolations based on CMIP3 and CMIP5 lead to some
alarming trends in data volume \citep[see
e.g.,][]{ref:overpecketal2011}. As seen in their Figure~2, model
output volumes from the various CMIP phases (1 through 6) are
beginning to rival observational data volumes. As noted in the
introduction, a particular problem for our community is the diverse
and very large user base for the data, many of whom are not climate
specialists, but downstream users of climate data studying the impacts
of climate change. This stands in contrast to other fields with
comparably large data holdings: data from the Large Hadron Collider
\citep[e.g.,][]{ref:aadetal2008}, for example, is primarily consumed
by high energy physicists and not of direct interest to anyone else.
A rigorous approach is needed to estimate future data volumes, rather
than relying on simple extrapolation. Contributions to the increase in
data volume include the systematic increase in model resolution and
complexity of the experimental protocol and data request. We consider
these separately:
\begin{description}
\item[Resolution] The median horizontal resolution of a CMIP model
tends to grow with time, and is expected to be typically 100~km
in CMIP6, compared to 200~km in CMIP5. Typically the temporal
resolution of the model (though not the data) is doubled as well,
for reasons of numerical stability. Thus, for an $N$-fold increase
in horizontal resolution, we require an $N^3$ increase in
computational capacity. The vertical resolution grows in a more
controlled fashion, at least as far as the data is concerned, as
often the requested output is reported on a standard set of
atmospheric levels that has not changed much over the years.
Similarly the temporal resolution of the data request does not
increase at the same rate as the model timestep: monthly averages
remain monthly averages. Thus, the $N^3$ increase in computational
capacity will result in an $N^2$ increase in data volume,
\emph{ceteris paribus}. Thus, data volume $V$ and computational
capacity $C$ are related as $V \sim C^{2/3}$, purely from the
point of view of resolution. Consequently, if centres then
experience an 8-fold increase in $C$ between CMIPs, we can expect a
doubling of model resolution and an approximate quadrupling of the
data volume (see discussion in the
\href{https://docs.google.com/document/d/1kZw3KXvhRAJdBrXHhXo4f6PDl_NzrFre1UfWGHISPz4/edit?ts=5995cbff
}{CMIP6 Output Grid Guidance
document}\footnote{https://docs.google.com/document/d/1kZw3KXvhRAJdBrXHhXo4f6PDl\_NzrFre1UfWGHISPz4/edit?ts=5995cbff
, retrieved \today.}).
A similar approximate doubling of model resolution occurred between
CMIP3 and CMIP5, but data volume increased 50-fold. What caused that
extraordinary increase?
\item[Complexity] The answer lies in the complexity of CMIP: the
complexity of the data request and of the experimental protocol. The
first component, the data request complexity, is related to that of
the science: the number of processes being studied, and the physical
variables required for the study, along with the large number of
satellite MIPs (23) that now comprise the CMIP6 project. In CPMIP
\citep{ref:balajietal2017}, we have attempted a rigorous definition
of this complexity, measured by the number of physical variables
simulated by the model. This, we argue, grows not smoothly like
resolution, but in very distinct generational step transitions, such
as the one from atmosphere-ocean models to Earth system models,
which, as shown in \cite{ref:balajietal2017}, involved a substantial
jump in complexity with regard to the number of physical, chemical,
and biological species being modelled. Many models of the CMIP5 era
added atmospheric chemistry and aerosol-cloud feedbacks, sometimes
with $\mathcal{O}(100)$ species. CMIP5 also marked the first time in
CMIP that ESMs were used to simulate changes in the carbon cycle.
% the following increase in complexity doesn't help explain the 50-fold increase
% which is what this paragraph is supposed to address
% the number of experiments (or number of years simulated) are
% primarily controlled by $C$, which you say is limited to 8-fold increase.
% need to restructure the argument.
The second component of complexity is the experimental protocol, and
the number of experiments themselves when comparing successive
phases of CMIP. The number of experiments (and years simulated) grew
from 12 in CMIP3 to about 50 in CMIP5, greatly inflating the data
produced. With the new structure of CMIP6, with a DECK and 23
endorsed MIPs, the number of experiments has grown tremendously
(from about 50 to 287). We propose, as a measure of experimental
complexity, the \emph{total number of simulated years (SYs)} called
for by the experimental protocol. Note that modelling centres must
make tradeoffs between experimental complexity and resolution in
deciding their level of participation in CMIP6, as discussed in
\cite{ref:balajietal2017}.
\end{description}
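The resolution-scaling argument above can be checked numerically: an
$N$-fold increase in horizontal resolution costs $N^3$ in compute
($N^2$ grid points times $N$ timesteps) but only $N^2$ in archived
data, i.e.\ $V \sim C^{2/3}$.

```python
# Numerical check of the scaling argument relating data volume V to
# computational capacity C via resolution alone.
def volume_growth(compute_growth):
    """Data-volume growth factor implied by a compute growth factor,
    assuming V ~ C**(2/3)."""
    return compute_growth ** (2.0 / 3.0)

C = 8.0               # 8-fold increase in computational capacity...
N = C ** (1.0 / 3.0)  # ...implies a doubling of horizontal resolution
assert abs(N - 2.0) < 1e-9
assert abs(volume_growth(C) - 4.0) < 1e-9  # and a quadrupling of data
```

This is the resolution contribution only; as the text goes on to argue,
the 50-fold CMIP3-to-CMIP5 growth must be attributed largely to
complexity rather than resolution.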
Two further steps have been proposed toward ensuring sustainable
growth in data volumes.
% Given the earlier arguments, it seems $C$ will limit growth of volume by itself
% Why are additional steps necessary?
The first of these is the consideration of standard horizontal
resolutions for saving data, as is already done for vertical and
temporal resolution in the data request. Cross-model analyses already
cast all data to a common grid in order to evaluate it as an ensemble,
typically at fairly low resolution. The studies of Knutti and
colleagues \citep[e.g.,][]{ref:knuttietal2017}, for example, are
performed on relatively coarse grids. Accordingly, for most
purposes atmospheric data on the ERA-40 grid
($2^\circ\times 2.5^\circ$) would suffice, with obvious exceptions for
experiments like those called for by HighResMIP
\citep{ref:haarsmaetal2016}. A similar conclusion applies for ocean
data (the World Ocean Atlas $1^\circ\times 1^\circ$ grid), with
extended discussion of the benefits and losses due to regridding
\citep[see][]{ref:griffiesetal2014,ref:griffiesetal2016}.
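To illustrate the conservation property at stake, the sketch below
implements a minimal first-order conservative coarsening on a uniform
grid, where averaging $2\times2$ blocks of equal-area cells preserves
the global integral exactly. This is a deliberately simplified
stand-in: real regridders must handle unequal cell areas and
curvilinear grids.

```python
# First-order conservative coarsening on a uniform grid: each coarse
# cell is the equal-weight average of the 2x2 fine cells it covers.
def coarsen_2x2(field):
    ny, nx = len(field), len(field[0])
    return [[(field[j][i] + field[j][i + 1] +
              field[j + 1][i] + field[j + 1][i + 1]) / 4.0
             for i in range(0, nx, 2)]
            for j in range(0, ny, 2)]

fine = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
coarse = coarsen_2x2(fine)

# Each coarse cell covers 4 fine cells of equal area, so the global
# integrals (area-weighted sums) match exactly:
fine_sum = sum(map(sum, fine))
coarse_sum = 4.0 * sum(map(sum, coarse))
assert abs(fine_sum - coarse_sum) < 1e-12
```

Schemes that lack this property (e.g.\ plain bilinear interpolation)
can silently break scalar budgets, which is one reason choosing a
regridding algorithm requires expertise.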
This has not been mandated for CMIP6 for a number of reasons. Firstly,
regridding is burdensome on many grounds: it requires considerable
expertise to choose appropriate algorithms for particular variables
(for instance, algorithms that guarantee exact conservation for
scalars, or that preserve streamlines for vector fields), and it can
be expensive in terms of computation and storage. Secondly, regridding
is irreversible (thus amounting to
``lossy'' data reduction) and non-commutative with certain basic
arithmetic operations such as multiplication (i.e., the product of
regridded variables does not in general equal the regridded output of
the product computed on the native grid). This can be problematic for
budget studies. However, the same issues would apply for
time-averaging and other operations long used in the field: much
analysis of CMIP output is performed on monthly-averaged data, which
is ``lossy'' compression along the time axis relative to the model's
time resolution.
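The non-commutativity with multiplication can be seen in the simplest
possible case, taking a two-cell average as a stand-in for regridding:

```python
# Two "native grid" fields of two cells each:
u = [1.0, 3.0]
v = [2.0, 4.0]

mean = lambda x: sum(x) / len(x)  # the toy "regridding" operator

# Regrid first, then multiply:
product_of_regridded = mean(u) * mean(v)                 # 2.0 * 3.0 = 6.0
# Multiply on the native grid, then regrid:
regridded_product = mean([a * b for a, b in zip(u, v)])  # (2 + 12) / 2 = 7.0

assert product_of_regridded != regridded_product
```

Exactly the same discrepancy arises along the time axis when products
of monthly means are used in place of monthly means of products, which
is why the text notes that time-averaging raises the same concerns.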
These issues have contributed to a lack of consensus in moving
forward, and the recommendations on regridding remain in flux. The
\href{https://docs.google.com/document/d/1kZw3KXvhRAJdBrXHhXo4f6PDl_NzrFre1UfWGHISPz4/edit?ts=5995cbff
}{CMIP6 Output Grid Guidance
document}\footnote{https://docs.google.com/document/d/1kZw3KXvhRAJdBrXHhXo4f6PDl\_NzrFre1UfWGHISPz4/edit?ts=5995cbff
, retrieved \today.} outlines a number of possible recommendations,
including the provision of ``weights'' to a target grid. Many of the
considerations around regridding, particularly for ocean data in
CMIP6, are discussed at length in \cite{ref:griffiesetal2016}.
There is a similar lack of consensus around whether or not to adopt a
common \emph{calendar} for particular experiments. In cases such as a
long-running control simulation where all years are equivalent and of
no historical significance, it is customary in this community to use
simplified calendars -- such as a Julian, a ``noleap'' (365-day) or
``equal-month'' (360-day) calendar -- rather than the Gregorian.
However, comparison across datasets using different calendars can be a
frustrating burden on the end-user. Even so, there is no consensus at
this point to impose a particular calendar.
As outlined below in Section~\ref{sec:replica}, both ESGF data nodes
and the creators of secondary repositories are given considerable
leeway in choosing data subsets for replication, based on their own
interests. The tracking mechanisms outlined in Section~\ref{sec:pid}
below will allow us to ascertain, after the fact, how widely used the
native grid data may be \emph{vis-\`a-vis} the regridded subset, and
allow us to recalibrate the replicas, as usage data becomes available.
We note also that the providers of at least one of the standard
metrics packages \citep[ESMValTool,][]{ref:eyringetal2016a} have
expressed a preference of standard grid data for their analysis, as
regridding from disparate grids increases the complexity of their
already overburdened infrastructure.
A second method of data reduction for the purposes of storage and
transmission is data compression. The netCDF4 software,
which is used in writing CMIP6 data, includes an option for lossless
compression or deflation \citep{ref:zivlempel1977} that relies on the
same technique used in standard tools such as \texttt{gzip}. In
practice, the reduction in data volume will depend upon the
``entropy'' or randomness in the data, with smoother data or fields
with many missing data points (e.g., masked land or ocean areas) being
compressed more.
Dealing with compressed data entails computational costs, not only
during its creation, but also every time the data are re-inflated.
There is also a subtle interplay with precision: for instance,
temperatures typically seen in climate models appear to deflate better
when expressed in Kelvin rather than Celsius, but that is because the
leading-order bits are then always the same, and thus the data are
effectively less precise. Deflation is also enhanced by
reorganising (``shuffling'') the data internally into chunks that have
spatial and temporal coherence.
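Both effects can be demonstrated with Python's standard \texttt{zlib}
module, which implements the same deflate algorithm used by netCDF4.
The \texttt{shuffle} helper below is a simplified stand-in for the
netCDF4/HDF5 shuffle filter, written here for illustration.

```python
import random, struct, zlib

random.seed(0)

# Entropy matters: a smooth field compresses far better than a noisy
# field of the same size under deflate.
smooth = struct.pack("<1000d", *[300.0 + 0.001 * i for i in range(1000)])
noisy = struct.pack("<1000d", *[random.uniform(250.0, 320.0) for _ in range(1000)])
assert len(zlib.compress(smooth, 2)) < len(zlib.compress(noisy, 2))

# A minimal byte "shuffle": the i-th byte of every 8-byte value is
# stored contiguously, so near-constant high-order bytes (e.g. of
# temperatures in Kelvin) form long runs the deflater can exploit.
def shuffle(buf, width=8):
    return b"".join(buf[i::width] for i in range(width))

def unshuffle(buf, width=8):
    n = len(buf) // width
    streams = [buf[i * n:(i + 1) * n] for i in range(width)]
    return bytes(b for group in zip(*streams) for b in group)

assert unshuffle(shuffle(smooth)) == smooth  # the filter is reversible
print(len(zlib.compress(shuffle(smooth), 2)), len(zlib.compress(smooth, 2)))
```

Shuffling is lossless and reversible; any gain comes purely from
rearranging bytes into more compressible order, which is why it can be
enabled in the netCDF4 output settings at little risk.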
Some argue for the use of more aggressive \emph{lossy} compression
methods \citep{ref:bakeretal2016}, but for CMIP6 it can be argued that
the resulting loss of precision and the consequences for scientific
results require considerably more evaluation by the community before
such methods can be accepted. However, as noted above, some lossy
methods of data reduction (e.g., time-averaging) have long been common
practice.
To help inform the discussion about compression, we undertook a
systematic study of typical model output files under lossless
compression, the results of which are
\href{https://public.tableau.com/profile/balticbirch\#!/vizhome/NC4/NetCDF4Deflation}{publicly
available}\footnote{https://public.tableau.com/profile/balticbirch\#!/vizhome/NC4/NetCDF4Deflation,
retrieved \today.}. The study indicates that standard \texttt{zlib}
compression in the netCDF4 library with the settings of
\texttt{deflate=2} (relatively modest, and computationally
inexpensive), and \texttt{shuffle} (which ensures better
spatiotemporal homogeneity) ensures the best compromise between
increased computational cost and reduced data volume. For an ESM, we
expect a total savings of about 50\%, with ocean, ice, and land realms
benefiting most (owing to large areas of the globe that are masked)
and atmospheric data benefiting least. This 50\% estimate has been
verified with sample output from one model whose compression rates
should be quite typical.
The \href{https://earthsystemcog.org/projects/wip/CMIP6DataRequest
}{DREQ}\footnote{https://earthsystemcog.org/projects/wip/CMIP6DataRequest
, retrieved \today.} alluded to above in Section~\ref{sec:dreq}
allows us to estimate expected data volumes. The software generates an
estimate given the model's resolution along with the experiments that
will be performed and the data one intends to save (using DREQ's
\emph{priority} attribute).
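A back-of-envelope version of such an estimate can be written in a few
lines; the formula and the numbers below are illustrative only, not
the data request's actual accounting (which tracks per-variable grids,
frequencies, and priorities).

```python
# Rough uncompressed volume: grid points x vertical levels x time
# samples x variables x 4 bytes (single precision). Illustrative only.
def estimate_volume_gb(nlon, nlat, nlev, n_time_samples, n_variables):
    return nlon * nlat * nlev * n_time_samples * n_variables * 4 / 1e9

# Hypothetical case: a 1-degree atmosphere, monthly means on 19
# standard pressure levels, 165 simulated years (1850-2014), 50
# requested variables:
print(round(estimate_volume_gb(360, 180, 19, 165 * 12, 50), 1))
# about 490 GB for this single experiment, before compression
```

Multiplying such per-experiment figures across the 287 experiments and
the intended ensemble sizes is what lets a centre weigh its level of
participation against its storage budget before any simulation is run.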
% With this information
% We are actually capturing this information in the registered content
% for the model source_id entries - see http://rawgit.com/WCRP-CMIP/CMIP6_CVs/master/src/CMIP6_source_id.html
% The json entry contains resolutions for each active model realm
% https://github.com/WCRP-CMIP/CMIP6_CVs/blob/master/CMIP6_source_id.json
% "unprecedented" is incorrect.
% In CMIP5 we had a sophisticated capability of estimating data volume
% We polled the groups to determine which experiments they planned
% to run and how large their ensembles would be.
% We also asked what resolution they would report output.
% From this we estimated in Nov. 2010 a total data volume of 2.5 petabytes
% (2.1 petabytes if only high-priority variables were reported), not too