---
output:
  bookdown::pdf_document2:
    template: templates/brief_template.tex
  bookdown::word_document2: default
  bookdown::html_document2: default
documentclass: book
bibliography: references.bib
editor_options:
  chunk_output_type: console
---
```{r echo=FALSE}
library(knitr)
library(kableExtra)  # table styling used in later chunks
library(tidyverse)   # tibble, dplyr, stringr, forcats and ggplot2 used in later chunks
```
<!-- Needed for leaving space to the quote, * is for no indentation after title -->
<!-- \titlespacing*{\chapter}{0pt}{80px}{35pt} -->
# Non-Life Insurance Pricing {#chap:nlip}
<!-- \chaptermark{Non-Life Insurance Pricing} -->
In this chapter we are going to provide an overview of how non-life insurance works from an actuarial point of view, with a specific focus on the retail pricing process. For more details on the mathematics and statistics behind the concepts introduced in this chapter, we refer to [@wuthrich-non-life-insurance-math-stats], [@wuthrich-data-analytics] and [@gigante2010tariffazione].
## Non-Life Insurance {#chap:non-life-ins}
The Italian Civil Code [@italian-civil-code] provides the following definition of insurance contract:
```{definition, ins-contr, name = "Insurance Contract, Art. 1882, Italian Civil Code"}
Insurance is the contract by which an insurer, in exchange for the payment of a certain premium, obliges itself, within the agreed limits:
\setlist{nolistsep}
\begin{enumerate}[noitemsep]
\item to pay an indemnity to the insured equivalent to the damage caused by an accident;
\item or to pay an income or a capital if a life-related event occurs.
\end{enumerate}
```
This definition identifies two parties: the _Insurer_ and the _Policyholder_. The policyholder pays the insurer a certain _Premium_ at the beginning of the insurance coverage and the insurer pays a benefit if a certain event (_Claim_) occurs. This event can happen zero, one or more times, so more than one claim is possible.
Usually, in non-life insurance, the benefit is the payment of a sum. This sum can be predetermined (e.g. in motor theft insurance, where the benefit is usually the value of the insured vehicle) or determined by the size of the claim (e.g. in \ac{mtpl} insurance, it depends on the damage the policyholder has caused to a third party). Regarding the "agreed limits", another peculiarity of non-life insurance is that the coverage period is a fixed amount of time, usually one year.
Starting from this legal definition, we can formalize a non-life insurance contract as follows.
Let:
* $\left]t_1, t_2\right]$, with $t_1<t_2$, be the coverage period;
* $P>0$ be the premium paid by the policyholder to the insurer;
* $N\in\mathbb{N}$ be the number of claims that occurred during the coverage period (_claims count_);
* $\tau_1, \tau_2, \dots, \tau_N$, with $t_1<\tau_1< \tau_2 < \dots < \tau_N<t_2$, be the timing of each claim;
* $Z_1, Z_2, \dots, Z_N > 0$ be the amount of each claim (_claims severities_ or _claims sizes_).
The total cost of claims for the insurer is
$$
S =
\begin{cases}
0 & \text{if } N=0 \\
\sum_{i=1}^{N}{Z_i} & \text{if } N>0
\end{cases}
$$
For simplicity, in the following we will simply write $S = \sum_{i=1}^{N}{Z_i}$, with the convention that $S = 0$ if $N=0$.
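To make the definition concrete, the total cost of claims can be simulated. The sketch below is illustrative only, with an arbitrary Poisson claims count and Gamma severities (both distributions are discussed later in this chapter) and arbitrary parameter values.

```{r sim-total-cost-claims}
# Simulate the total cost of claims S = Z_1 + ... + Z_N for one coverage
# period: draw the claims count N first, then one severity per claim.
set.seed(1)

simulate_total_cost <- function(lambda = 0.1, shape = 2, rate = 1 / 1500) {
  n <- rpois(1, lambda)                        # claims count N
  if (n == 0) return(0)                        # S = 0 when no claims occur
  sum(rgamma(n, shape = shape, rate = rate))   # S = sum of the severities
}

S <- replicate(10000, simulate_total_cost())
mean(S)   # Monte Carlo estimate of the expected total cost of claims
```

Note that most simulated periods produce $S=0$, since with a claim frequency of $0.1$ the large majority of policies experience no claims at all.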
Figure \@ref(fig:ins-cashflow) shows the cash flows corresponding to the insurance contract. From this representation we can interpret the policyholder's entering into an insurance contract as a way to exchange the uncertain negative cash flows $-Z_1, -Z_2, \dots, -Z_N$ for one single negative cash flow $-P$. Conversely, the insurer undertakes the negative cash flows $-Z_1, -Z_2, \dots, -Z_N$ in exchange for a positive cash flow $+P$.
```{tikz, ins-cashflow, fig.cap = "Insurance Contract cash flows.", fig.ext = 'pdf', fig.pos = "hbtp", cache = TRUE, echo = FALSE}
\newcommand{\ImageWidth}{11cm}
\usetikzlibrary{decorations.pathreplacing, positioning, arrows.meta}
\begin{tikzpicture}
% draw horizontal line
\draw[thick, -Triangle] (0, 0) -- (\ImageWidth, 0) node[font = \scriptsize, below left = 3pt and -8pt]{$t$};
\draw[very thick] (1cm, 0) -- (9cm, 0);
% draw vertical lines and times
\draw (1cm, -3pt) -- (1cm, 3pt) node[anchor = south] {$t_{1}$};
\draw (9cm, -3pt) -- (9cm, 3pt) node[anchor = south] {$t_{2}$};
\draw (2.5cm, -3pt) -- (2.5cm, 3pt) node[anchor = south] {$\tau_{1}$};
\draw (3.5cm, -3pt) -- (3.5cm, 3pt) node[anchor = south] {$\tau_{2}$};
\path (5.15cm, -3pt) -- (5.15cm, 3pt) node[anchor = south] {$\dots$};
\draw (6.8cm, -3pt) -- (6.8cm, 3pt) node[anchor = south] {$\tau_{N-1}$};
\draw (8.0cm, -3pt) -- (8.0cm, 3pt) node[anchor = south] {$\tau_{N}$};
% draw Policyholder cash flows
\node at (-1cm, -14pt) {Policyholder};
\node at (1cm, -14pt) {$-P$};
% draw Insurer cash flows
\node at (-1cm, -28pt) {Insurer};
\node at (1cm, -28pt) {$P$};
\node at (2.5cm, -28pt) {$-Z_1$};
\node at (3.5cm, -28pt) {$-Z_2$};
\node at (6.8cm, -28pt) {$-Z_{N-1}$};
\node at (8.0cm, -28pt) {$-Z_N$};
% \foreach \x in {0, 1, ..., 10}
% \draw (\x cm, -3pt) -- (\x cm, 3pt)
% node[anchor = south] {$t_{\x}$}
% ;
;
\end{tikzpicture}
```
The major difference between these cash flows is that $P$ is a certain amount, while $Z_1, Z_2, \dots, Z_N$, at time $t_1$, are uncertain in amount, in count ($N$) and in timing ($\tau_1, \tau_2, \dots, \tau_N$). So the policyholder, by paying a premium $P$, transfers his risk to the insurer.
This representation points out the inversion of the production cycle that is typical of the insurance business. From the insurer's point of view, the revenue emerges at the beginning of the economic activity, in $t_1$, while the costs emerge later. In most other economic activities the costs emerge before the product is sold, so the seller can choose a price that takes into account how much the product cost him. In insurance, the insurer, when selling its product (the insurance coverage), does not know the amount of claims it is going to pay for that product. It is therefore crucial to properly predict the future costs in order to determine an adequate premium.
From a statistical point of view, we can express this uncertainty by saying that $N$ and $Z_1, Z_2, \dots, Z_N$ are random variables, so that $\left\{N, Z_1, Z_2, \dots \right\}$ is a stochastic process. Usually, in non-life insurance pricing, the variables $\tau_1, \tau_2, \dots, \tau_N$ are not taken into account, because the coverage span is short and, from a financial point of view, the timing of the claim occurrences has a negligible effect.
Previously we said that $Z_1, Z_2, \dots, Z_N$ are all positive. This assumption corresponds to excluding the null claims, i.e. the claims that have been opened but result in no payment due by the insurer. For the values of $Z_i$ with $N<i$ we adopt the convention $\{N<i\} \Rightarrow \{Z_i = 0\}$, so $Z_{N+1}=0, \, Z_{N+2}=0, \, \dots$. Combining this convention with the positivity of the first $N$ severities, we can say that:
$$
\{N<i \} \Longleftrightarrow \{Z_i = 0\}
$$
## Non-Life Insurance Pricing {#chap:nlip-details}
In insurance, the premium that the insurer offers to the policyholder in exchange for the insurance coverage is not the same for every policyholder. The insurer evaluates the risk related to the policy and determines a "proper" premium taking into account both risk-related and commercial factors. The process of _pricing_ consists in defining the set of rules that determine this "proper" premium $P_i$ for a specific policyholder $i$, given the known information on him. In the next sections we are going to better explain what "proper" means.
### Compound Distribution hypotheses
The first step for evaluating the stochastic process $\left\{N, Z_1, Z_2, \dots \right\}$ is to introduce some probabilistic hypotheses. The usual hypotheses assumed are the following:
```{definition, comp-dist, name = "Compound distribution"}
Let's assume that:
1. for each $n>0$, the variables $Z_1|N=n,\ Z_2|N=n,\ \dots,\ Z_n|N=n$ are stochastically independent and identically distributed;
2. the probability distribution of $Z_i|N=n, \ i\le n$ does not depend on $n$.
Under these hypotheses we say that:
$$
S = \sum_{i=1}^{N}{Z_i}
$$
has a compound distribution.
```
The variable $Z_i|N=n$ used in this definition can be interpreted as the _claim severity for the $i$^th^ claim under the hypothesis that $n$ claims occurred_. The two hypotheses provided in definition \@ref(def:comp-dist) imply that the distribution of $Z_i|N=n, \ i\le n$ does not depend on $i$ nor on $n$. For this reason, in the following, we are going to use the notation $Z$ to represent a random variable with the distribution of $Z_i|N=n, \ i\le n$ and $F_Z(\cdot)$ for its cumulative distribution function (i.e. $F_Z(z) = P(Z\le z)$).
Let's consider the variable $Z_i|N\ge i$. We can interpret it as the _claim severity for the $i$^th^ claim under the hypothesis that the $i$^th^ claim occurred_. From the hypotheses provided in definition \@ref(def:comp-dist) it follows that $Z_i|N\ge i$ also has the same distribution as $Z_i|N=n, \ i\le n$, that is:
\begin{equation}
\label{eq:z}
P\left(Z_i \le z \middle| N\ge i \right) = F_Z(z)
\end{equation}
This result says that $Z$ can be considered as the _claim severity for a claim under the hypothesis that that claim occurred_. The proof of equation \@ref(eq:z) is reported in the appendix in section \@ref(chap:appendix-notes-on-compound-distribution).
### Distribution of the Total Cost of Claims {#chap:tcc-dist}
Under the hypotheses of definition \@ref(def:comp-dist), it is possible to obtain the full distribution of $S$ given the distribution of $N$ and $Z$. In this chapter we are going to provide only the formula of the expected value $E(S)$, but, with the same approach one can obtain all the moments.
The expected value of the total cost of claims $E(S)$ can be obtained from the expected value of the claims count $E(N)$ and the expected value of the claim severity $E(Z)$ as:
\begin{equation}
\label{eq:s}
E(S) = E(N)E(Z)
\end{equation}
This result tells us that, under the compound distribution hypotheses, $E(S)$ is easily obtained from $E(N)$ and $E(Z)$. This means we can model $E(N)$ and $E(Z)$ separately and, from them, obtain $E(S)$. The result is particularly useful in personalization (section \@ref(chap:personalization)) because, for each individual $i$, given the information we have on him, we can estimate his expected claims count $E(N_i)$ and his expected claim severity $E(Z_i)$ and obtain his expected total cost of claims as $E(S_i) = E(N_i) E(Z_i)$. The proof of equation \@ref(eq:s) is reported in the appendix in section \@ref(chap:appendix-notes-on-compound-distribution).
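Equation \@ref(eq:s) can also be checked numerically. The chunk below is a sketch with arbitrary Poisson and Gamma choices, comparing the Monte Carlo mean of $S$ with the product $E(N)E(Z)$.

```{r check-compound-mean}
# Check E(S) = E(N) E(Z) by simulation for a compound Poisson-Gamma model.
set.seed(123)
lambda <- 0.08                 # E(N) for the Poisson claims count
shape <- 2; rate <- 1 / 1000   # Gamma severity, so E(Z) = shape / rate

N <- rpois(50000, lambda)
S <- vapply(N, function(n) sum(rgamma(n, shape = shape, rate = rate)),
            numeric(1))

c(simulated = mean(S), theoretical = lambda * shape / rate)
```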
<!-- $x_i=(x_{i1}, x_{i2}, \dots, x_{ip})$ -->
### Risk Premium and Technical Price {#chap:risk-prem-tech-price}
The expected cost of claims $E(S)$ is important because it gives us a first interpretation of what "proper" premium means.
```{definition, risk-premium, name = "Risk Premium"}
Let $S$ be the total cost of claims of a policyholder; his _Risk Premium_ is given by:
$$
P^{(risk)} = E(S)
$$
```
The _Risk Premium_ is the premium that on average covers the total cost of claims. As mentioned above, since coverage spans are usually short, we do not take into account the timing of the claims, so we do not discount for the fact that the claims occur later than the premium payment.
It is clear that this premium, which only covers the cost of claims, is not "proper" in practice.
First of all, the insurer also has to cover the expenses related to the policy (commissions on sales and expenses related to claim settlement) and the general expenses of the company. Adding the expenses, we obtain the _Technical Price_.
```{definition, technical-price, name = "Technical Price"}
Let $S$ be the total cost of claims of a policyholder and $E$ the expenses related to his policy; his _Technical Price_ is given by:
$$
P^{(tech)} \ = \ E(S) + E \ = \ P^{(risk)} + E
$$
```
Secondly, even if the policyholder paid a premium that on average covers claims and expenses, undertaking that risk with nothing in return would not make sense for the insurer. So some further loadings must be added to the technical price, such as the loadings for the cost of capital, the risk margin and the profit margin.
The amount of the technical price with these loadings can be further modified based on business logic, as we are going to discuss in section \@ref(beyond-technical-pricing).
## Modeling and Personalization {#chap:personalization}
<!--
Often the insurance premium for a specific policyholder $i$ is represented as the product of a reference premium $P$ and a relative coefficient $\alpha_i$ as follows:
$$
P_i = \alpha_i P
$$
The coefficient $\alpha_i$ and the reference premium $P$ can be estimated separately. The process of defining the function for obtaining $\alpha_i$ is called personalization.
-->
In this section we are going to better explain how pricing based on policyholder information works.
### Pricing variables {#chap:pricing-variables}
Usually, for every policyholder, we have a certain amount of information that is considered relevant for his risk evaluation. This information must be reliable and observable at the moment of the underwriting of the policy.
In motor insurance, this information could be:
* Information on the insured vehicle: make, model, engine power, vehicle mass, age of the vehicle;
* General information on the policyholder: age, sex, address (region, city, postcode), ownership of a private garage where he parks the car;
* Insurance specific information of the policyholder: number of claims caused in the previous years, how long he has been covered, bonus-malus class;
* Policy options: amount of the maximum coverage, presence and amount of a deductible, presence of other insurance guarantees, how many drivers will drive the vehicle;
* Customer information on the policyholder: how many years he has been a customer of the insurer, how many other policies he owns;
* Telematic data: how many kilometers per year the policyholder traveled in the previous years, which kind of roads the policyholder traveled on, the speed maintained during the trips, how many times the policyholder exceeded the speed limit, how many sharp accelerations and decelerations per kilometer the policyholder performed.
These pieces of information are usually called _pricing variables_.
We must observe that some of these variables are available for every potential customer (such as age and address), while others are available only for policyholders that are already customers (such as telematic data, which is available only if the policyholder agreed to install in his car the device that collects it).
Moreover, even considering the variables that are available for every customer, it is important to be aware of how reliable they are. Some come from official documents (such as customer age and address or the bonus-malus class), but others may be declared by the customer, and his statements are not easily verifiable by the insurer (such as the ownership of a private garage or how many drivers will drive the vehicle).
The topic of variable reliability fits in the wider framework of fraud detection. Insurance companies put a lot of effort into preventing fraud, both with active measures, such as document checks and inspections, and with predictive fraud-detection models. The two most common categories of fraud are underwriting fraud (such as false declarations on insurance-related data) and settlement fraud (such as faking an accident). The customer information on the policyholder is usually important for predicting both. Customers that have a longer relationship with the company and own many policies are usually less likely to commit fraud.
Regarding variable reliability, the Italian national association of insurance companies ([ANIA](https://www.ania.it/)) has made big steps forward in recent years by collecting in its databases a lot of information about policyholders and vehicles and making it available to insurance companies. For example, by querying these databases at the moment of the quote request, it is possible to retrieve useful insurance-specific information, such as the number of claims caused by the customer in the previous years or how long he has been covered, and useful information on his vehicle, such as when it was registered or how many changes of ownership it has experienced.
One of the roles of the actuary is to understand how reliable the information on the policyholder is and to decide how to use that information.
### Pricing variables encoding {#chap:pricing-variables-encoding}
Formally, the pricing variables can be encoded as a vector of real numbers $\boldsymbol{x}_i=(x_{i1}, x_{i2}, \dots, x_{ip})\in\mathcal{X}\subseteq\mathbb{R}^p$. In the modeling framework they are also called explanatory variables, covariates, predictors or features.
The pricing variables can be of two types:
1. _Quantitative variables_: variables, like policyholder age or vehicle mass, that can be easily represented as a number;
2. _Qualitative variables_: variables, like policyholder sex or vehicle make, that represent a category and are usually represented with strings.
The quantitative variables, possibly transformed, are already suitable to be used.
To facilitate the use of the qualitative variables, they are usually encoded as sets of binary variables.
If a variable $x$ has only 2 possible modalities, it can easily be encoded as a binary variable $x'$ that assigns $0$ to one modality and $1$ to the other. For example, if $x = \text{sex}$, it can be encoded this way:
$$
x' = \begin{cases}
1 & \text{if } \text{sex } = \text{ `Male'} \\
0 & \text{if } \text{sex } = \text{ `Female'}
\end{cases}
$$
In general, if a variable $x$ has $K$ modalities, it can be encoded in $K-1$ binary variables $x'_1, x'_2, \dots, x'_{K-1}$. For example, if $x = \text{make}$ has 4 possible modalities ('Fiat', 'Alfa-Romeo', 'Lancia', 'Ferrari'), it can be encoded this way:
\begin{align*}
x'_1 & = \begin{cases}
1 & \text{if } \text{make } = \text{ `Fiat'} \\
0 & \text{otherwise} \\
\end{cases}
\\
x'_2 & = \begin{cases}
1 & \text{if } \text{make } = \text{ `Alfa-Romeo'} \\
0 & \text{otherwise} \\
\end{cases}
\\
x'_3 & = \begin{cases}
1 & \text{if } \text{make } = \text{ `Lancia'} \\
0 & \text{otherwise} \\
\end{cases}
\\
\end{align*}
The variables $x'_1$, $x'_2$, $x'_3$ are called dummy variables. We can observe that all the information about the make is embedded in just these 3 variables, so a fourth dummy variable indicating the modality 'Ferrari' is not needed. Indeed:
$$
\text{make } = \text{`Ferrari'} \ \Longleftrightarrow \ x'_1=x'_2=x'_3=0
$$
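In R, this $K-1$ dummy encoding is what `model.matrix()` produces by default for a factor. The sketch below reproduces the encoding above, setting 'Ferrari' as the reference modality.

```{r dummy-encoding-example}
# Dummy-variable encoding of a qualitative variable with K = 4 modalities.
# The first factor level becomes the reference modality and gets no column.
make <- factor(
  c("Fiat", "Alfa-Romeo", "Lancia", "Ferrari"),
  levels = c("Ferrari", "Fiat", "Alfa-Romeo", "Lancia")
)
X <- model.matrix(~ make)   # intercept plus K - 1 = 3 dummy columns
X[, -1]                     # the dummies x'_1, x'_2, x'_3
```

The 'Ferrari' row has all three dummies equal to zero, matching the equivalence above.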
<!-- In table \@ref(tab:dummy-variables) the dummy variable encoding is illustrated. -->
<!--
```{r, dummy-variables-table, echo = FALSE, cache = TRUE}
table <- tibble(
  Make = c("Fiat", "Alfa-Romeo", "Lancia", "Ferrari"),
  `$x'_1$` = c(1, 0, 0, 0),
  `$x'_2$` = c(0, 1, 0, 0),
  `$x'_3$` = c(0, 0, 1, 0)
)
table %>%
  kable(
    # format = "latex",
    booktabs = T,
    align = "lccc",
    vline = "",
    # toprule = "", midrule = "\\hline",
    # linesep = "", bottomrule = "",
    toprule = "", midrule = "\\midrule\\addlinespace",
    linesep = "", bottomrule = "",
    caption = "Dummy variables encoding.",
    label = "dummy-variables",
    escape = FALSE
  ) %>%
  kable_styling(
    position = "center",
    latex_options = "hold_position",
    full_width = FALSE
  ) %>%
  row_spec(0, bold = T)
```
-->
For some models it is suggested to also use the dummy variable that indicates the $K$^th^ modality. This encoding is called one-hot encoding and is mainly used with Neural Networks. For the models considered in this paper, the $K-1$ dummy variables encoding is preferred, so we will always use it.
In the following, when we use the notation $\boldsymbol{x}_i=(x_{i1}, x_{i2}, \dots, x_{ip})$, we assume that the qualitative variables have already been encoded as dummy variables, so that $(x_{i1}, x_{i2}, \dots, x_{ip})\in \mathcal{X} \subseteq \mathbb{R}^p$.
### Pricing Rule and Modeling
The pricing variables are used as input of a _Pricing Rule_.
```{definition, pricing-rule, name = "Pricing Rule"}
A _Pricing Rule_ is a function $f(\cdot)$ that from an instance of a set of pricing variables $\boldsymbol{x}_i\in\mathcal{X}$ returns a price:
$$
\begin{array}{rccl}
f: & \mathcal{X} & \longrightarrow & R_+ \\
& \boldsymbol{x}_i & \longmapsto & P_i \\
\end{array}
$$
```
The process of pricing consists in defining a Pricing Rule based on data observed in the past and assumptions on the future.
The first step in defining a Pricing Rule is to model the total cost of claims $S$ and obtain a pricing rule for the risk premium $P^{(risk)}$.
```{definition, modeling, name = "Modeling"}
Modeling a _response variable_ $Y$ means estimating a function
$$r:\mathcal{X}\rightarrow \mathcal{C}$$
that, given a set of explanatory variables $\boldsymbol{x}_i=(x_{i1}, x_{i2}, \dots, x_{ip})\in \mathcal{X} \subseteq \mathbb{R}^p$, returns the expected value of the response variable $E(Y)$ and possibly other moments of $Y$ or even the full distribution of $Y$.
```
In definition \@ref(def:modeling) we used a generic $\mathcal{C}$ as codomain of the function $r(\cdot)$ to not specify whether the model describes just $E(Y)$ (and so $\mathcal{C}=\mathbb{R}$) or something more, such as the couple $\left( E(Y), Var(Y) \right)$ or the full distribution of $Y$.
As we observed in section \@ref(chap:tcc-dist), under the compound distribution hypotheses we do not have to model the total cost of claims $S$ directly: we can model $N$ and $Z$ separately.
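As a sketch of this separation, the chunk below fits a frequency model on hypothetical simulated data and combines it with a mean severity to obtain a risk premium estimate. The data-generating process, the variable names and all parameter values are illustrative assumptions only.

```{r freq-sev-sketch}
# Sketch: model the claims count N and the severity Z separately, then
# combine the two estimates into E(S_i) = E(N_i) * E(Z_i).
set.seed(42)
n_pol <- 5000
age <- sample(18:80, n_pol, replace = TRUE)

# Hypothetical data-generating process: claim frequency decreases with age.
N <- rpois(n_pol, exp(-1.5 - 0.01 * age))
policies <- data.frame(age = age, N = N)

freq_fit <- glm(N ~ age, family = poisson(), data = policies)
mean_severity <- 2000   # assumed E(Z), e.g. from a separate severity model

# Expected claims count for a 30-year-old, times the expected severity:
E_N_30 <- predict(freq_fit, newdata = data.frame(age = 30), type = "response")
risk_premium_30 <- unname(E_N_30 * mean_severity)
risk_premium_30
```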
### Response variables and distributions {#chap:response-variables-and-distributions}
Usually in statistical modeling, the response variables are seen as random variables with a distribution belonging to a specified family.
#### Distribution for the Claims Count $N$ {#chap:dist-n}
The claims count $N$ is a discrete variable with values in $\{0, 1, 2, 3,\dots\}$. Even if in practice the number of claims can't be arbitrarily high, $N$ is usually modeled with distributions that assign a positive probability to all natural numbers. One of the most common distributions used for $N$ is the Poisson distribution.
```{definition, def-poisson, name = "Poisson Distribution"}
A random variable $N$ with support $\{0,1,2,3,\dots \}$ has a Poisson distribution if its probability function is:
$$
p_N(n) = P\left( N = n \right) = e^{-\lambda}\frac{\lambda^n}{n!}, \quad \lambda>0
$$
We will indicate it with the notation $N \sim Poisson(\lambda)$.
```
```{r, plot-poisson, echo = FALSE, fig.cap = "Poisson distribution for some values of $\\lambda$.", fig.align = "center", out.width = "90%", fig.width = 6, fig.height = 4, cache = TRUE}
tibble(x = 0:22) %>%
  crossing(
    tibble(
      lambda = c(.8, 1, 2.5, 10)
    )
  ) %>%
  mutate(
    y = dpois(x = x, lambda = lambda),
    lambda_label = str_c("lambda == ", lambda) %>%
      fct_inorder()
  ) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_segment(
    aes(xend = x),
    yend = 0
  ) +
  facet_wrap(
    ~lambda_label,
    labeller = label_parsed
  ) +
  coord_cartesian(
    xlim = c(0, 20)
  ) +
  labs(
    x = "n", y = "p(n)"
  )
```
The Poisson distribution is a parametric distribution that depends only on the parameter $\lambda$. Figure \@ref(fig:plot-poisson) shows the distribution for different values of $\lambda$. These plots show how, for larger values of $\lambda$, the distribution shifts towards larger values and becomes more spread out.
Indeed, the first two moments are:
\begin{align*}
E(N) & = \lambda \\
Var(N) & = \lambda
\end{align*}
Thus, increasing $\lambda$, both $E(N)$ and $Var(N)$ increase.
Looking at the distribution shape, we can see that:
* if $\lambda<1$, the mode is in $n=0$;
* if $\lambda=1$, $p(0)=p(1)=\frac{1}{e}$;
* if $\lambda>1$, the mode is in a value greater than $0$ and, as $\lambda$ increases, the distribution assumes a bell shape similar to the Normal distribution shape. The convergence to the Normal distribution can be obtained with the _Central Limit Theorem_.
In non-life insurance we are usually in the case $\lambda<1$: for example, the average claims frequency for motor third party liability insurance in Italy in 2017 was 5.68% [@ania-claim-freq].
<!-- ^[[ANIA yearly statistical report for motor third party liability](https://www.ania.it/ricerca-avanzata/-/asset_publisher/XIyLeujL9irt/content/id/113283)]. -->
The property $Var(N) = E(N)$ is an important constraint when the distribution is used in practice, as the observed data may show a different pattern. Often the observed data exhibits $Var(N) > E(N)$; this phenomenon is called _overdispersion_.
To address this issue it is possible to use more flexible distributions, such as the Negative-Binomial distribution, or to adopt fewer assumptions on the response variable distribution. A common technique is the Quasi-Poisson model, which we will describe in chapter \@ref(chap:models).
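A quick practical check for overdispersion is to compare the sample mean and variance of the observed counts, or to fit a quasi-Poisson model and inspect its estimated dispersion parameter. The sketch below uses simulated Negative-Binomial counts, for which $Var(N) > E(N)$ holds by construction; all parameter values are illustrative.

```{r overdispersion-check}
# Negative-Binomial counts: Var(N) = mu + mu^2 / size > mu = E(N).
set.seed(7)
N <- rnbinom(10000, size = 2, mu = 0.5)

c(mean = mean(N), variance = var(N))   # the variance exceeds the mean

# The quasi-Poisson fit estimates a dispersion parameter; values above 1
# signal overdispersion relative to the Poisson assumption.
fit <- glm(N ~ 1, family = quasipoisson())
summary(fit)$dispersion
```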
#### Exposure {#chap:exposure}
In section \@ref(chap:non-life-ins) we said that non-life insurances usually have a fixed coverage period, typically one year. Often, however, we work with portfolios of policies with different coverage periods. For example, this can be due to policies issued with shorter coverage periods or to policies that were terminated early. Moreover, in company data, insurance records are often organized by accounting year. This means that, if an insurance coverage $c$ spans two consecutive accounting years $a$ and $a+1$, it is stored as two records: the pair $(c, a)$ and the pair $(c, a+1)$. This situation is quite common, as coverages usually start during the year and not all on the first of January.
The coverage span of an insurance coverage is called _exposure_ and is usually measured in years-at-risk. For instance, if an insurance coverage spans 3 months, it corresponds to a quarter of a year, so the exposure, measured in years-at-risk, is $v=\frac{1}{4}$. The term year-at-risk comes from the fact that the exposure is the period in which the insurer is exposed to the risk of paying claims.
It is natural to assume that a policyholder with a longer exposure is expected to experience more claims. Since we have to work with policies with different exposures, the usual assumption that takes this aspect into account is the following: let $M$ be the number of claims the policyholder will experience during his exposure period of length $v$ (measured in years) and $N$ the number of claims he would experience during one year; then we assume $E(M) = v E(N)$.
This assumption can be further extended if we assume that the claims come from a _Poisson process_.
```{definition, def-process-count, name = "Counting Process"}
A stochastic process $\{N(t), t\ge0\}$ is called a \textit{counting process} if:
\begin{enumerate}
\item The values of $N(t)$ are natural numbers \\
$N(t) \in \{ 0, 1, 2, \dots \} \quad \forall t\ge 0$
\item The process is non-decreasing \\
$s < t \Rightarrow N(s) \le N(t)$
\end{enumerate}
```
In a counting process $\{N(t), t\ge0\}$:
* $N(t)$ can be interpreted as the number of events or arrivals that occur in the period $[0, t]$;
* $N(t) - N(s), \ s\le t$ can be interpreted as the number of events or arrivals that occur in the period $]s, t]$. $N(t) - N(s)$ is also called _increment_ of the process.
The counting process can be used to model the number of claims that occur to a specific policy.
```{definition, def-process-poisson, name = "Poisson Process"}
A counting process $\{N(t), t\ge0\}$ is a \textit{Poisson process} with intensity $\lambda$ if:
\begin{enumerate}
\item The increments of the process are stocastically independent \\
$\forall n\ge0, \forall s_1 < t_1 \le \dots \le s_n < t_n$ \\
$\Rightarrow \ N(t_1)-N(s_1), \dots, N(t_n)-N(s_n)$ are stocastically independent;
\item The probability of arrival in an interval is proportional to the size of the interval \\
$\forall t\ge 0, \forall \Delta t >0 \ \Rightarrow \ P\left( N(t + \Delta t) - N(t) = 1 \right) = \lambda \Delta t + \omicron (\Delta t)$ \\
where $\lim_{\Delta t \to 0}{\frac{\omicron(\Delta t)}{\Delta t}} = 0$
\item Multiple arrivals are excluded \\
$\forall t\ge 0, \forall \Delta t >0 \ \Rightarrow \ P\left( N(t + \Delta t) - N(t) \ge 2 \right) = \omicron (\Delta t)$
\item Almost surely there are no arrivals at time $0$ \\
$P\left( N(0) = 0 \right) = 1$
\end{enumerate}
```
Under these hypotheses we obtain the following result:
```{theorem, th-process-poisson, name = "Poisson Process"}
If $\{N(t), t\ge 0 \}$ is a Poisson process with intensity $\lambda$, then:
$$\forall t\ge 0, \forall \Delta t >0, \ \Rightarrow \ N(t + \Delta t) - N(t) \sim Poisson(\lambda \Delta t)$$
```
This result tells us that the distribution of the number of events in any interval $]t, t+\Delta t]$ depends only on the size of the interval $\Delta t$. Moreover, by the property of the Poisson distribution we saw in section \@ref(chap:dist-n), we get:
$$E(N(t + \Delta t) - N(t)) = \lambda \Delta t$$
So, the expected number of arrivals is proportional to the size of the interval $\Delta t$. The intensity of the process $\lambda$ can be also interpreted as the expected number of claims in a unit period.
If we assume that the claims occurring to a policy come from a Poisson process with intensity $\lambda$, and we observe that policy for the period $]t, t+v]$, the claims count $M$ in that exposure period is distributed as:
$$ M\sim Poisson(v \lambda) $$
In particular, if the observed period spans 1 year, we get:
$$ M = N \sim Poisson(\lambda) $$
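This relationship can be checked with a quick simulation in base R, using an illustrative intensity $\lambda = 0.1$ claims per year-at-risk and an exposure of one quarter:

```{r}
# Yearly counts N ~ Poisson(lambda), quarterly counts M ~ Poisson(v * lambda).
# lambda and the sample size are purely illustrative.
set.seed(1)
lambda <- 0.1
v <- 1 / 4

N <- rpois(1e5, lambda)
M <- rpois(1e5, v * lambda)

mean(N)  # close to lambda
mean(M)  # close to v * lambda, i.e. E(M) = v * E(N)
```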
<!--
Poisson distribution
- Definition
- Properties
- Other distributions
- Quasi-Poisson
Exposure (years-at-risk)
Poisson process
-->
#### Distribution for the Claim Severity $Z$
The claim severity $Z$ is a continuous variable with values in $[0, +\infty[$. As for the claims count $N$, even if in practice it cannot be arbitrarily high, it is usually modeled with distributions that assign a positive density to all the numbers in $]0, +\infty[$. As null claims are excluded, it is natural to assume $P\left( Z=0 \right) = 0$. One of the most common distributions used for $Z$ is the Gamma distribution.
```{definition, def-gamma, name = "Gamma Distribution"}
A random variable $Z$ with support $[0, +\infty[$ has a Gamma distribution, if its probability density function is:
$$
f_Z(z) = \frac{\rho^\alpha}{\Gamma(\alpha)}z^{\alpha-1}e^{-\rho z}, \quad \alpha > 0, \ \rho > 0
$$
where $\Gamma(\alpha) = \int_{0}^{+\infty}{z^{\alpha - 1} e^{-z} \mathrm{d} z}$.
We will indicate it with the notation $Z \sim Gamma(\alpha, \rho)$.
```
```{r, plot-gamma, echo = FALSE, fig.cap = "Gamma distribution for some values of $\\alpha$ and $\\rho$.", fig.align = "center", out.width = "90%", fig.width = 6, fig.height = 4, cache = TRUE}
tibble(x = seq(from = 0, to = 25, by = .01)) %>%
crossing(
tibble(
alpha = c(.8, 1, 2, 16),
rho = c(.2, .25, .5, 2)
)
) %>%
mutate(
y = dgamma(x = x, shape = alpha, rate = rho),
label = str_c("list(alpha == ", alpha, ", ", "rho == ", rho, ")") %>%
fct_inorder()
) %>%
ggplot(aes(x = x, y = y)) +
geom_line() +
facet_wrap(
~label,
labeller = label_parsed
) +
coord_cartesian(
ylim = c(0, .3),
xlim = c(0, 20)
) +
labs(
x = "z", y = "f(z)"
)
```
The Gamma distribution is a parametric distribution that depends on two parameters:
* $\alpha > 0$, called shape parameter
* $\rho > 0$, called rate parameter (the inverse of the scale parameter)
The first two moments of the Gamma distribution are:
\begin{align*}
E(Z) & = \frac{\alpha}{\rho} \\
Var(Z) & = \frac{\alpha}{\rho^2}
\end{align*}
In figure \@ref(fig:plot-gamma) the distribution is represented for different levels of $\alpha$ and $\rho$. These plots show how the shape changes as the values of $\alpha$ and $\rho$ change. We can see that:
* if $\alpha < 1$, $f_Z(\cdot)$ is not defined in $0$ and it has a vertical asymptote in $z = 0$. In $]0, +\infty[$ it is monotonically decreasing.
* If $\alpha = 1$, $f_Z(\cdot)$ starts from $f_Z(0) = \rho$ and then decreases monotonically. In this case, the density function becomes $f_Z(z) = \rho e^{-\rho z}$ and the distribution is also called exponential distribution.
* If $\alpha > 1$, $f_Z(\cdot)$ starts from $f_Z(0) = 0$, increases until the mode and then decreases.
In figure \@ref(fig:plot-gamma) the first three distributions represented have the same expected value $E(Z)=\frac{\alpha}{\rho} = 4$, but different shapes, while the fourth has a larger expected value $E(Z) = 8$. As the shape parameter $\alpha$ increases, the distribution assumes a bell shape similar to the Normal distribution one. The convergence to the Normal distribution can be obtained with the _Central Limit Theorem_.
Another parametrization often used for Gamma distribution is obtained by using the mean $\mu$ as a parameter:
$$
\mu = \frac{\alpha}{\rho}
$$
With this parametrization, the density function becomes:
$$
f_Z(z) = \frac{\left(\frac{\alpha}{\mu}\right)^\alpha}{\Gamma(\alpha)}z^{\alpha-1}e^{-\frac{\alpha}{\mu} z}, \quad \alpha > 0, \ \mu > 0
$$
The advantage of using the parameters $(\alpha, \mu)$ is that the link between $E(Z)$ and $Var(Z)$ becomes clearer:
\begin{align*}
E(Z) & = \mu \\
Var(Z) & = \frac{1}{\alpha}\mu^2
\end{align*}
Computing the coefficient of variation we then obtain:
$$CV(Z) = \frac{\sqrt{Var(Z)}}{E(Z)} = \frac{1}{\sqrt{\alpha}}$$
This result means that the coefficient of variation is constant (given the shape parameter $\alpha$). As we saw for the Poisson distribution, it is possible that observed data shows a different pattern. In chapter \@ref(chap:models), for the Gamma distribution, we will use the parametrization based on $(\alpha, \mu)$ instead of the one based on $(\alpha, \rho)$.
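In base R, a Gamma with the $(\alpha, \mu)$ parametrization can be sampled by setting `shape = alpha` and `rate = alpha / mu`; a short sketch with illustrative values $\alpha = 2$, $\mu = 4$:

```{r}
# Gamma in the (alpha, mu) parametrization: shape = alpha, rate = alpha / mu.
# The parameter values are purely illustrative.
set.seed(1)
alpha <- 2
mu <- 4

z <- rgamma(1e5, shape = alpha, rate = alpha / mu)

mean(z)          # close to mu
var(z)           # close to mu^2 / alpha
sd(z) / mean(z)  # close to 1 / sqrt(alpha), the constant CV
```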
Another characteristic of the Gamma distribution that could be problematic in modeling claims severity is that it has a light tail. This means that, as $z$ goes to $+\infty$, $f_Z(z)$ approaches $0$ quite fast. This could lead to a poor fitting for _large claims_. Other distributions with heavier tails are for example the _log-Normal_ and the _Pareto_.
#### Large Claims {#chap:large-claims}
Modeling large claims is quite difficult in practice because usually there is not much observed data on them, so it is hard to understand whether they are related to some risk factors (identifiable by the pricing variables) or they happen just by chance due to a distribution with heavy tails.
First of all, to model large claims, we must define what a large claim is. What is usually done in practice is to choose a threshold $\bar{z}$ and consider large all the claims whose size exceeds that threshold. The value $\bar{z}$ must be big enough that the claims above it can reasonably be considered large, but not so big that too few observed claims exceed it. One common choice for Motor Third Party Liability in European markets could be $\bar{z} = 100\, 000 \text{\euro}$.
```{definition, def-large-claim, name = "Large and Attritional Claims"}
Given a predetermined threshold $\bar{z}$, we say that:
\begin{itemize}
\item a claim $Z$ is a \textit{large claim} if $Z > \bar{z}$
\item a claim $Z$ is an \textit{attritional claim} if $Z \le \bar{z}$
\end{itemize}
For each claim $Z$ we call:
\begin{itemize}
\item \textit{Capped Claim Size} \\
$Z' = \min(Z, \bar{z})$;
\item \textit{Excess Over the Threshold} \\
$Z'' = \max(Z - \bar{z}, 0)$.
\end{itemize}
```
In figure \@ref(fig:large-claim) the _Capped Claim Size_ and the _Excess Over the Threshold_ are shown. It is easy to show that $Z$ can be decomposed as:
$$Z = Z' + Z''$$
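The decomposition can be sketched in base R with `pmin()` and `pmax()`, using an illustrative threshold and made-up claim sizes:

```{r}
# Capped claim size and excess over the threshold; the threshold and the
# claim sizes below are made up for illustration.
z_bar <- 100000
z <- c(2500, 40000, 180000, 95000, 350000)

z_capped <- pmin(z, z_bar)      # Z'  = min(Z, z_bar)
z_excess <- pmax(z - z_bar, 0)  # Z'' = max(Z - z_bar, 0)

all(z_capped + z_excess == z)   # TRUE: the decomposition Z = Z' + Z'' holds
```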
```{tikz, large-claim, fig.cap = "Large claims.", fig.ext = 'pdf', cache = TRUE, echo = FALSE}
\newcommand{\ImageWidth}{11cm}
\usetikzlibrary{decorations.pathreplacing, positioning, arrows.meta}
\begin{tikzpicture}
% draw horizontal lines
\draw[thick, -Triangle] (-0.5cm, 0) -- (\ImageWidth, 0);
\draw[thick, -Triangle] (0, -0.25cm) -- (0, 2/3*7cm);
\draw (-3pt, 2/3*4cm) node[anchor = east] {$\bar{z}$} -- (\ImageWidth, 2/3*4cm);
% draw vertical lines
\draw [thick] (1.5cm, 0cm) -- (1.5cm, 2/3*3cm) node[anchor = south] {$Z_{1}$};
\draw [thick] (1.5cm - 3pt, 2/3*3cm) -- (1.5cm + 3pt, 2/3*3cm);
\draw [thick] (3cm, 0cm) -- (3cm, 2/3*5cm) node[anchor = south] {$Z_{2}$};
\draw [thick] (3cm - 3pt, 2/3*5cm) -- (3cm + 3pt, 2/3*5cm);
\draw [thick] (4.5cm, 0cm) -- (4.5cm, 2/3*2cm) node[anchor = south] {$Z_{3}$};
\draw [thick] (4.5cm - 3pt, 2/3*2cm) -- (4.5cm + 3pt, 2/3*2cm);
\draw [thick] (6cm, 0cm) -- (6cm, 2/3*6cm) node[anchor = south] {$Z_{4}$};
\draw [thick] (6cm - 3pt, 2/3*6cm) -- (6cm + 3pt, 2/3*6cm);
\draw [thick] (7.5cm, 0cm) -- (7.5cm, 2/3*4.5cm) node[anchor = south] {$Z_{5}$};
\draw [thick] (7.5cm - 3pt, 2/3*4.5cm) -- (7.5cm + 3pt, 2/3*4.5cm);
% draw curly brackets
\draw [decorate, decoration = {brace, amplitude = 5pt}, xshift = -4pt, yshift = 0pt]
(4.5cm, 0cm) -- (4.5cm, 2/3*2cm) node [black, midway, xshift = -10pt]
{\footnotesize $Z'_3$};
\draw [decorate, decoration = {brace, amplitude = 5pt}, xshift = -4pt, yshift = 0pt]
(6cm, 0cm) -- (6cm, 2/3*4cm) node [black, midway, xshift = -10pt]
{\footnotesize $Z'_4$};
\draw [decorate, decoration = {brace, amplitude = 5pt}, xshift = -4pt, yshift = 0pt]
(6cm, 2/3*4cm) -- (6cm, 2/3*6cm) node [black, midway, xshift = -10pt]
{\footnotesize $Z''_4$};
% draw dots
\node at (9.5cm, 2/3*1.5) {\huge $\dots$};
;
\end{tikzpicture}
```
Given the total number of claims $N$, it can be decomposed as:
$$N = N^{(a)} + N^{(l)}$$
where
* $N^{(a)}$ is the attritional claims count, i.e. the number of claims with size $Z \le \bar{z}$;
* $N^{(l)}$ is the large claims count, i.e. the number of claims with size $Z > \bar{z}$;
Let's indicate with $Z_{(i)}$ the $i$^th^ claim in increasing order of size. Sorting the claims, we can separate the attritional claims from the large claims as follows:
$$
\underbrace{Z_{(1)}, Z_{(2)}, \dots, Z_{(N^{(a)})}}_{\text{Attritional Claims}},
\underbrace{Z_{(N^{(a)} + 1)}, Z_{(N^{(a)} + 2)}, \dots Z_{(N^{(a)} + N^{(l)})}}_{\text{Large Claims}}
$$
In order to model the large claims it is possible to use the following three decompositions of the total cost of claims $S$:
\begin{align}
\nonumber
S & = \underbrace{Z_{(1)} + Z_{(2)} + \dots + Z_{(N^{(a)})}}_{\text{Attritional Claims}} +
\underbrace{Z_{(N^{(a)} + 1)} + Z_{(N^{(a)} + 2)} + \dots Z_{(N^{(a)} + N^{(l)})}}_{\text{Large Claims}} \\
\label{large-claim-decomposition-1}
& = \underbrace{\sum_{i=1}^{N^{(a)}}{Z_{(i)}}}_{=S^{(a)}} +
\underbrace{\sum_{i = N^{(a)} + 1}^{N^{(a)} + N^{(l)}}{Z_{(i)}}}_{=S^{(l)}}
\ = \ S^{(a)} + S^{(l)} \\[12pt]
\label{large-claim-decomposition-2}
S & = \sum_{i=1}^{N}{Z_i}
\ = \ \sum_{i=1}^{N}{\left(
%\{Z_i|Z_i>\bar{z}\} I_{Z_i>\bar{z}} +
%\{Z_i|Z_i\le\bar{z}\} I_{Z_i\le\bar{z}}
Z_i I_{Z_i>\bar{z}} +
Z_i I_{Z_i\le\bar{z}}
\right)} \\[12pt]
\label{large-claim-decomposition-3}
S & = \sum_{i=1}^{N}{Z_i}
\ = \ \sum_{i=1}^{N}{\left(Z'_i + Z''_i\right)}
\ = \ \sum_{i=1}^{N}{\left(Z'_i + Z''_i I_{Z_i > \bar{z}}\right)}
\end{align}
These three decompositions of $S$ are useful because they provide three decompositions of $E(S)$:
\begin{align}
\nonumber
E(S) & = E(S^{(a)}) + E(S^{(l)}) \\
\label{large-claim-decomposition-expected-1}
& = E(N^{(a)}) E(Z|Z\le\bar{z}) + E(N^{(l)}) E(Z|Z>\bar{z}) \\[12pt]
\nonumber
E(S) & = E(N) E(Z) \\
\nonumber
& = E(N) \left[P(Z\le\bar{z}) E(Z|Z\le\bar{z}) + P(Z>\bar{z}) E(Z|Z > \bar{z}) \right] \\
\label{large-claim-decomposition-expected-2}
& = E(N) \left[\left( 1 - P(Z>\bar{z}) \right) E(Z|Z\le\bar{z}) + P(Z>\bar{z}) E(Z|Z > \bar{z})\right] \\[12pt]
\nonumber
E(S) & = E(N) E(Z) \\
\label{large-claim-decomposition-expected-3}
& = E(N) \left[E(Z') + P(Z>\bar{z}) E(Z'')\right]
\end{align}
<!-- ```{=latex} -->
<!-- \begin{equation} -->
<!-- \label{large-claim-decomposition-expected-1} -->
<!-- \begin{split} -->
<!-- E(S) & = E(S^{(a)}) + E(S^{(l)}) \\ -->
<!-- & = E(N^{(a)}) E(Z|Z\le\bar{z}) + E(N^{(l)}) E(Z|Z>\bar{z}) -->
<!-- \end{split} -->
<!-- \end{equation} -->
<!-- ``` -->
<!-- ```{=latex} -->
<!-- \begin{equation} -->
<!-- \label{large-claim-decomposition-expected-2} -->
<!-- \begin{split} -->
<!-- E(S) & = E(N) E(Z) \\ -->
<!-- & = E(N) \left[P(Z\le\bar{z}) E(Z|Z\le\bar{z}) + P(Z>\bar{z}) E(Z|Z > \bar{z}) \right] \\ -->
<!-- & = E(N) \left[\left( 1 - P(Z>\bar{z}) \right) E(Z|Z\le\bar{z}) + P(Z>\bar{z}) E(Z|Z > \bar{z})\right] -->
<!-- \end{split} -->
<!-- \end{equation} -->
<!-- ``` -->
<!-- ```{=latex} -->
<!-- \begin{equation} -->
<!-- \label{large-claim-decomposition-expected-3} -->
<!-- \begin{split} -->
<!-- E(S) & = E(N) E(Z) \\ -->
<!-- & = E(N) \left[E(Z') + P(Z>\bar{z}) E(Z'')\right] -->
<!-- \end{split} -->
<!-- \end{equation} -->
<!-- ``` -->
Equations \@ref(large-claim-decomposition-expected-1), \@ref(large-claim-decomposition-expected-2) and \@ref(large-claim-decomposition-expected-3) provide three approaches to model attritional and large claims:
1. Looking at \@ref(large-claim-decomposition-expected-1), we can model attritional claims and large claims separately. Modeling $N^{(a)}$ and $Z|Z\le\bar{z}$ we estimate the total cost of claims for the attritional part $S^{(a)}$; modeling $N^{(l)}$ and $Z|Z>\bar{z}$ we estimate the total cost of claims for the large part $S^{(l)}$.
2. Looking at \@ref(large-claim-decomposition-expected-2), we can model the overall claim count $N$, and then model the cost of the attritional claims $Z|Z\le\bar{z}$, the cost of the large claims $Z|Z>\bar{z}$ and the probability of exceeding the threshold $P(Z>\bar{z})$.
3. Looking at \@ref(large-claim-decomposition-expected-3), we can model the overall claim count $N$, and then model the capped claim size $Z'$, the excess over the threshold $Z''$ and the probability of exceeding the threshold $P(Z>\bar{z})$.
If the large claims component weighs a lot on the total cost of claims, these approaches could lead to quite different estimates of $E(S)$. In particular, if the number of large claims in the observed data is small, it will be hard to model both $N^{(l)}$ and $P(Z>\bar{z})$, so for these components the modeling process could lead to a flat model (i.e. a model without any explanatory variable) or an almost flat one (i.e. a model with just a few explanatory variables and with mild effects). However, with the first approach, a flat model for $N^{(l)}$ distributes the observed total cost of large claims proportionally to all the policies, while with the second and the third, a flat model for $P(Z>\bar{z})$ distributes the observed total cost of large claims proportionally to the expected number of claims $E(N)$. So, with the first approach, a flat model leads to more solidarity between policies, while, with the second and third, a flat model could exacerbate the differences identified by modeling $N$.
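On observed (or, here, simulated) data the three decompositions of $S$ coincide by construction; a quick numerical check in base R, with illustrative log-normal severities and threshold:

```{r}
# The three ways of splitting the total cost of claims S give the same total.
# The severity distribution and its parameters are purely illustrative.
set.seed(42)
z_bar <- 100000
z <- rlnorm(400, meanlog = 7.5, sdlog = 1.6)  # simulated claim sizes

is_large <- z > z_bar

S1 <- sum(z[!is_large]) + sum(z[is_large])          # S^(a) + S^(l)
S2 <- sum(z * (z <= z_bar)) + sum(z * (z > z_bar))  # indicator split
S3 <- sum(pmin(z, z_bar) + pmax(z - z_bar, 0))      # capped + excess

all.equal(S1, S2)  # TRUE
all.equal(S1, S3)  # TRUE
```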
For the second approach we must also introduce a distribution suitable for modeling $P(Z>\bar{z})$.
#### Binomial distribution
The _binomial distribution_ is used to model the count of events that occur (successes) in a fixed number of trials $n$. For example, we can use it to model the number of large claims within a fixed number $n$ of claims.
```{definition, def-binomial, name = "Binomial Distribution"}
A random variable $Y$ with support $\{0,1,2, \dots, n \}$ has a Binomial distribution, if its probability function is:
$$
p_Y(y) = P\left( Y = y \right) = \binom{n}{y} p^y (1-p)^{n-y}, \quad p \in [0, 1]
$$
We will indicate it with the notation $Y \sim Binom(n, p)$.
```
```{r, plot-binomial, echo = FALSE, fig.cap = "Binomial distribution for some values of $n$ and $p$.", fig.align = "center", out.width = "90%", fig.width = 6, fig.height = 4, cache = TRUE}
tibble(x = seq(from = 0, to = 10, by = 1)) %>%
crossing(
tibble(
n = c(1, 4, 10, 10),
p = c(.45, .2, .2, .5)
)
) %>%
filter(x <= n) %>%
mutate(
y = dbinom(x = x, size = n, prob = p),
# label = str_c("alpha = ", alpha, ", rho = ", rho) %>%
# fct_inorder()
label = str_c("list(italic(n) == ", n, ", italic(p) == ", p, ")") %>%
fct_inorder()
) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_segment(aes(xend = x),
yend = 0) +
facet_wrap(
~label,
labeller = label_parsed
) +
coord_cartesian(
# ylim = c(0, .3),
xlim = c(0, 10)
) +
labs(
x = "y", y = "p(y)"
) +
scale_x_continuous(breaks = seq(from = 0, to = 10, by = 2))
```
The binomial distribution is a parametric distribution that depends on the parameters $n$ and $p$: $n$ represents the number of trials, while $p$ represents the probability for a trial to succeed. The assumption is that the $n$ trials are independent and identical, so they all have the same probability $p$ of success.
The first two moments of the binomial distribution are:
\begin{align*}
E(Y) & = np \\
Var(Y) & = np(1-p)
\end{align*}
In figure \@ref(fig:plot-binomial) the distribution is represented for different levels of $n$ and $p$. From the plots we can see that:
* if $n = 1$, the binomial distribution assumes only the values $1$ (with probability $p$) and $0$ (with probability $1-p$). In this case it is also called _Bernoulli distribution_ and it can be used to model the indicator of an event $I_E$.
* If $n>1$, the binomial distribution assumes a shape centered on its expected value $E(Y)=np$ and fading for values of $y$ that move away from $E(Y)$. As $n$ increases, the distribution assumes a bell shape similar to the Normal distribution one. The convergence to the Normal distribution can be obtained with the _Central Limit Theorem_.
From the binomial distribution it is also possible to define the scaled binomial distribution by dividing its value by $n$.
```{definition, def-scaled-binomial, name = "Scaled Binomial Distribution"}
If $Y\sim Binom(n, p)$, and $Y' = \frac{Y}{n}$, we will say that $Y'$ has a \textit{Scaled Binomial Distribution} and we will indicate it with the notation $Y' \sim Binom(n, p)/n$.
The support of $Y'$ is $\{0, \frac{1}{n}, \frac{2}{n}, \dots, 1 \}$ and its probability function is:
$$
p_{Y'}(y') = P\left( Y' = y' \right) = \binom{n}{ny'} p^{ny'} (1-p)^{n-ny'}, \quad p \in [0, 1]
$$
```
In chapter \@ref(chap:models) we will use the Scaled Binomial Distribution.
In non-life insurance pricing, the binomial distribution can be used to model the probability for a claim to have specific characteristics. For example we can use it to model the probability that a certain claim is a large one, $P(Z>\bar{z})$, in order to model separately attritional claims severity $\{Z|Z\le\bar{z}\}$ and large claims severity $\{Z|Z>\bar{z}\}$, as we have seen in section \@ref(chap:large-claims).
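As a minimal sketch, the probability $P(Z>\bar{z})$ can be estimated with a flat (intercept-only) binomial model; the simulated claim sizes and all parameters below are purely illustrative:

```{r}
# Flat binomial model for the probability that a claim is large.
# Claim sizes are simulated; threshold and parameters are illustrative.
set.seed(7)
z_bar <- 100000
z <- rlnorm(2000, meanlog = 8, sdlog = 1.8)

y <- as.integer(z > z_bar)          # Bernoulli indicator: 1 if large claim

fit <- glm(y ~ 1, family = binomial)
p_hat <- unname(plogis(coef(fit)))  # fitted probability of a large claim

p_hat  # equals the empirical proportion mean(y)
```

With no explanatory variables, the fitted probability coincides with the observed proportion of large claims; adding pricing variables to the right-hand side of the formula would turn this into a non-flat model.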
Another example is the decomposition between claims with only material damages and claims with also bodily injuries. Modeling separately these two components is useful because they usually have a different distribution for the claim size.
As for large claims we can decompose $S$ in the following two ways:
\begin{align}
\nonumber
E(S) & = E(S^{(\text{things})}) + E(S^{(\text{inj})}) \\
\label{inj-claim-decomposition-expected-1}
& = E(N^{(\text{things})}) E(Z|\bar{J}) + E(N^{(\text{inj})}) E(Z|J) \\[12pt]
\nonumber
E(S) & = E(N) E(Z) \\
\nonumber
& = E(N) \left[P(\bar{J}) E(Z|\bar{J}) + P(J) E(Z|J) \right] \\
\label{inj-claim-decomposition-expected-2}
& = E(N) \left[\left( 1 - P(J) \right) E(Z|\bar{J}) + P(J) E(Z|J)\right]
\end{align}
where:
* $N^{(\text{things})}$ is the number of claims with only material damages;
* $N^{(\text{inj})}$ is the number of claims with injuries;
* $J$ is the event that a specific claim presents injuries; just as $Z$ is a representative for $Z_1, Z_2, \dots, Z_N$, $J$ is a representative for $J_1, J_2, \dots, J_N$.
Combining this decomposition with the large claims decomposition, we can go further and take into account both the presence or absence of injuries and whether or not a claim is large. One example could be:
\begin{align*}
E(S) & = E(N) \left[\left( 1 - P(J) \right) E(Z|\bar{J}) + P(J) E(Z|J) \right] \\[4pt]
& = E(N) \left\{ \right. \\
& \qquad \left( 1 - P(J) \right) E(Z|\bar{J}) \\
& \qquad + P(J) \left[ P(Z \le \bar{z} | J) E\left( Z \mid Z\le \bar{z} \land J \right) \right. \\
& \qquad \qquad + \left. P(Z > \bar{z} | J) E\left( Z \mid Z > \bar{z} \land J \right) \right] \\
& \qquad \left. \right\}
\end{align*}
This way, only the claims with injuries are decomposed between attritional and large. That makes sense because claims that don't produce injuries usually have small severities.
### Model fitting and available data {#chap:model-fitting-and-data-available}
Once we have chosen how to decompose $S$, we have to model the response variables needed for that decomposition ($N$, $Z$, $I_J$, ...) with the explanatory variables. Thus we have to estimate a function $r:\mathcal{X}\rightarrow \mathcal{C}$ as defined in definition \@ref(def:modeling).
In order to estimate $r(\cdot)$ we also have to make some assumptions on the distribution of the response variable and on the shape of $r(\cdot)$. We call _model_ a set of assumptions on the response variable and on the shape of $r(\cdot)$. We will discuss some of the most widespread models for claims count and claims severity in chapter \@ref(chap:models).
Once the model is defined, we have to estimate it using observed data. In general, to model a response variable $Y_i$ with the explanatory variables $\boldsymbol{x}_i=(x_{i1}, x_{i2}, \dots, x_{ip})\in \mathcal{X} \subseteq \mathbb{R}^p$, the observed data is in the form:
$$
\mathcal{D} = \left\{(\boldsymbol{x}_1, w_1, y_1), \ (\boldsymbol{x}_2, w_2, y_2), \ \dots, \ (\boldsymbol{x}_i, w_i, y_i), \ \dots, \ (\boldsymbol{x}_n, w_n, y_n)\right\}
$$
where:
* $n$ is the number of observations in the dataset;
* $\boldsymbol{x}_i\in \mathcal{X} \subseteq \mathbb{R}^p$ is the set of explanatory variables for the observation $i$;
* $w_i$ is the weight for the observation $i$;
* $y_i\in \mathcal{Y}\ \subseteq \mathbb{R}$ is the realization of the response variable $Y_i$ for the observation $i$.
What an observation is depends on the variable we are modeling. For instance:
* If we are modeling the yearly claim count $N_i$, each observation could be a policy (or a (policy, accounting year) pair), the weights could be the exposures $v_i$ and the realizations of the response variables could be the number of observed claims for that policy (or pair).
* If we are modeling the claim severity $Z_j$, each observation could be a claim $j$, the weights could all be $1$ and the realizations of the response variable could be the observed cost for the claim $j$. It is also possible to model the claim severity taking into account the total cost of claims for the policy $S_i = \sum_{j=1}^{N_i}{Z_j}$. In this case, each observation would be a policy $i$, the weights would be the number of claims for each policy $n_i$ and the realizations of response variables would be the total observed cost for the claims of the policy $i$.
* If we are modeling the occurrence of injuries in a claim $I_{Jj}$, each observation could be a claim $j$, the weights could be all $1$ and the realizations of response variables could be an indicator that assume the value $1$ if the claim $j$ caused injuries and $0$ otherwise. As for the claim severity, we can also aggregate data for policy, so each observation would be a policy $i$, the weights would be the number of claims $n_i$ for the policy $i$ and the realizations of response variables would be the number of claims that caused injuries among the claims of the policy $i$.
In each of these cases, $y_i$ is seen as a realization of the random variable $Y_i$. With an inferential process we obtain estimates of the distribution of $Y_i$ based on the observed realizations $y_i$.
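As an illustration, a claim-frequency dataset in the $(\boldsymbol{x}_i, w_i, y_i)$ layout could look like this (all variable names and values are made up):

```{r}
# A minimal frequency dataset: explanatory variables x_i, exposure as
# weight w_i, observed claim count y_i. All values are made up.
claims_data <- data.frame(
  age_band = c("18-25", "26-40", "41-65", "26-40"),  # x_i
  zone     = c("urban", "urban", "rural", "rural"),  # x_i
  exposure = c(1, 0.5, 1, 0.25),                     # w_i = v_i, years-at-risk
  n_claims = c(1, 0, 0, 1)                           # y_i, realization of N_i
)

claims_data
```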
#### Settlement process and IBNR claims
One of the challenges in non-life insurance pricing is that obtaining the observed data is not so straightforward. In many insurance coverages, such as \ac{mtpl}, the settlement process could last many years, so, if we want to develop models using data from recent years, not all the information is available. To better understand this aspect we have to discuss how the settlement process works.
In figure \@ref(fig:settlement-process) the settlement process for a claim is represented. At time $t_1$ the insured event (e.g. an accident) occurs. From this moment a liability for the insurer emerges, even if the insurer has not been notified yet. This liability is called _Outstanding Loss Liability_. In $t_2$ the claim is reported and the insurer is notified about the occurrence of the event. From this moment the settlement process starts. This process consists in evaluating the event and understanding the responsibilities of the parties and the extent of the damage. During this process, controversies between the parties can emerge and, particularly if injuries occurred, the damage evaluation can take a lot of time. When the situation is clear and everything is defined, the claim is settled and the liabilities are paid. In $t_3$ we have the settlement and in $t_4$ the claim is closed. It is possible that $t_4=t_3$, but this is not always the case. If the settlement process takes a long time and the insurer already knows he will have to pay something, he can make partial payments during the period $[t_2, t_3]$. These intermediate payments are made at times $\tau_1, \tau_2, \dots, \tau_n \in [t_2, t_3]$. It is also possible that a claim is opened and then gets closed without any payment. After the closing ($t_4$) it is also possible that a claim gets reopened and that more payments emerge.
```{tikz, settlement-process, fig.cap = "Claim timeline.", fig.ext = 'pdf', cache = TRUE, echo = FALSE}
\newcommand{\ImageWidth}{11cm}
\usetikzlibrary{decorations.pathreplacing, positioning, arrows.meta}
\begin{tikzpicture}
% draw horizontal line
\draw[thick, -Triangle] (0, 0) -- (\ImageWidth, 0) node[font = \scriptsize, below left = 3pt and -8pt]{$t$};
\draw[very thick] (0.5cm, 0) -- (9.5cm, 0);
% draw vertical lines and times
\draw (0.5cm, -3pt) -- (0.5cm, 3pt) node[anchor = south] {$t_{1}$};
\draw (2.5cm, -3pt) -- (2.5cm, 3pt) node[anchor = south] {$t_{2}$};
\draw (4.0cm, -3pt) -- (4.0cm, 3pt) node[anchor = south] {$\tau_{1}$};
\draw (4.5cm, -3pt) -- (4.5cm, 3pt) node[anchor = south] {$\tau_{2}$};
\path (5.25cm, -3pt) -- (5.255cm, 3pt) node[anchor = south] {$\dots$};
\draw (6cm, -3pt) -- (6cm, 3pt) node[anchor = south] {$\tau_{n}$};
\draw (7.5cm, -3pt) -- (7.5cm, 3pt) node[anchor = south] {$t_{3}$};
\draw (9.5cm, -3pt) -- (9.5cm, 3pt) node[anchor = south] {$t_{4}$};
% draw events names
\node[align=center] at (0.5cm, -16pt) {\small occurrence};
\node[align=center] at (2.5cm, -16pt) {\small reporting};
\node[align=center] at (5cm, -16pt) {\small intermediate \\ payments};
\node[align=center] at (7.5cm, -16pt) {\small settlement};
\node[align=center] at (9.5cm, -16pt) {\small closing};
% draw curly bracket
\draw [decorate, decoration = {brace, mirror, amplitude = 10pt}, xshift = 0pt, yshift = -10pt]
(2.5cm, -20pt) -- (7.5cm, -20pt) node [black, midway, xshift = 0pt, yshift = -15pt]
{\small settlement process};
\end{tikzpicture}
```
From the moment the claim is reported ($t_2$), the insurer estimates how much he is going to pay for that claim and allocates that sum in a reserve, called _case reserve_. As new information emerges and some payments are settled, the case reserve is updated. The aim of this reserve is to provide a best estimate of the future payments for the claims already emerged. As the claim gets settled, the sum of the paid and the reserved amounts converges to the final cost of the claim.
From this description it emerges that:
* In the period $]t_1, t_2[$ the insurer has an outstanding loss liability for an event that has not been reported yet. In this case we talk about an \ac{ibnyr}.
* In the period $[t_2, t_3[$ the insurer has an outstanding loss liability for an event that has been reported, but has not been fully settled yet, so this liability is just an estimate. In this case, if the case reserve is not large enough to cover all the future payments that will emerge for that claim, we talk about an \ac{ibner}.
#### Model fitting with available data
The \ac{ibnyr} and \ac{ibner} issue is particularly challenging when we have to perform a risk evaluation at a specific time $t$. In general $t_1, t_2,\dots$ are not known a priori, so we don't know whether more claims for accidents that occurred in the past will be reported in the future, and we don't know whether the ones already reported will experience a revaluation. This means that, in general, when we model $N$ and $Z$ at a specific time $t$, we can't observe the total number of claims that occurred for each policy, $n_i$, or the payments for each claim, $z_j$. What we can use is:
* $n_i^{(t)} = n_i^{(\text{reported in } t)}$
where:
* $n_i^{(\text{reported in } t)}$ is the number of reported claims in $t$ for the policy $i$;
* $z_j^{(t)} = z_j^{(\text{paid in }t)} + z_j^{(\text{reserved in } t)}$
where:
* $z_j^{(\text{paid in }t)}$ is the amount already paid in $t$ for the claim $j$;
* $z_j^{(\text{reserved in } t)}$ is the amount reserved in $t$ for the claim $j$.
When we use this data for modeling the total cost of claims, we must be particularly aware of what we are using. In general:
$$
n_i^{(t)} \ne n_i, \qquad z_j^{(t)} \ne z_j
$$
The common case is that $n_i^{(t)} < n_i$ and $z_j^{(t)} < z_j$. If we used $n_i^{(t)}$ and $z_j^{(t)}$ without any correction, we would underestimate both $E(N)$ and $E(Z)$, obtaining a biased estimate for $E(S)$.
To tackle these problems, what is usually done is to fit the models for $S_i$ with $n_i^{(t)}$ and $z_j^{(t)}$ and then apply a flat corrective coefficient $\alpha$ to $\widehat{E(S_i)}$, based on an aggregated estimate of $E(S)$ that takes the long settlement process into account.
An estimate for the expected total cost of claims for a generic policy in the portfolio $E(S)$ can be obtained with techniques based on runoff triangles, such as the _Chain Ladder_. These techniques are based on projecting the cost of claims already emerged to the final total cost of claims. We are not going to discuss these techniques in this thesis. For more details on them we refer to [@wuthrich-non-life-insurance-math-stats]. For our dissertation, we just have to know that these techniques provide us with an estimate for $E(S)$. Let's call it $\widehat{E(S)}^{CL}$. This estimate does not depend on explanatory variables; it is a sort of average total cost of claims for the policies in the portfolio.
Meanwhile, with the available data $n_i^{(t)}$ and $z_j^{(t)}$, the fitting of all the models needed in the decomposition of $S$ is performed and, for each policy $i\in\{1, 2, \dots, n\}$, an estimate of $E(S_i)$ is obtained. Let's call it $\widehat{E(S_i)}'$. As we used the data available in $t$, which comes from claims not fully settled, $\widehat{E(S_i)}'$ is a biased estimate of $E(S_i)$.
We can then balance the estimates $\widehat{E(S_i)}'$ with $\widehat{E(S)}^{CL}$ by computing:
$$
\alpha = \frac{n}{\sum_{i=1}^{n}{\widehat{E(S_i)}'}} \widehat{E(S)}^{CL}
$$
and applying it to the estimates as follows:
$$
\widehat{E(S_i)} = \alpha \ \widehat{E(S_i)}'