Construction of Empirical Models (20-25%)
1. Estimate failure time and loss distributions using:
a) Kaplan-Meier estimator, including approximations for large data sets
Two concrete situations:
Insurance: we observe the loss amount per policy. Left truncation occurs when the loss is below the deductible (it is never reported); right censoring occurs when the loss exceeds the policy limit (we only record the limit).
Mortality table: we observe the age at death for each person. Left truncation happens at the age the person is first observed; right censoring happens if the person is still alive at the last observation.
Symbols: For each observation {$i$},
{$d_i$} = truncation point for that observation (0 if no truncation);
{$x_i$} = the observed value, if it wasn't censored;
{$u_i$} = censored value for that observation.
Then group and relabel the {$x_i$}'s into distinct values {$y_j$}, each occurring {$s_j$} times. Divide up the data according to the {$y_j$}'s by defining the risk set to be
{$r_j = \left(\text{#} x_i \text{ and } u_i \geq y_j \right) - \left(\text{#} d_i \geq y_j \right)$}
which is the same as
{$r_j = \left(\text{#} d_i < y_j \right) - \left(\text{#} x_i \text{ and } u_i < y_j \right) $}
So in our situations, the risk set counts
the difference between a) the # of policies with observed or censored loss amount {$\geq y_j$} and b) the # of policies with deductible {$\geq y_j$};
the difference between a) the # of people entering the study before age {$y_j$} and b) the # of people who died or left the study before age {$y_j$} (i.e., the # of people under observation just before age {$y_j$}).
Recursively, given {$r_{j-1}$},
{$r_j = r_{j-1} + \left(\text{#} d_i \in [y_{j-1}, y_j) \right) - \left(\text{#} x_i =y_{j-1}\right) - \left(\text{#} u_i \in [y_{j-1}, y_j) \right) $}
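As a sanity check, here is a minimal Python sketch (the data set, field names, and function names are hypothetical, not from the source) that computes the risk sets with both counting formulas above and confirms they agree:

```python
# Hypothetical observations: d = truncation point (0 if none), x = observed
# value (None if censored), u = censoring point (None if uncensored).
data = [
    {"d": 0.0, "x": 1.5, "u": None},
    {"d": 0.0, "x": None, "u": 2.0},
    {"d": 1.0, "x": 2.5, "u": None},
    {"d": 0.0, "x": 2.5, "u": None},
]

# Distinct uncensored values y_j.
ys = sorted({obs["x"] for obs in data if obs["x"] is not None})

def risk_set(y):
    """r_j = (# d_i < y_j) - (# x_i and u_i < y_j)."""
    entered = sum(1 for obs in data if obs["d"] < y)
    left = sum(1 for obs in data
               if (obs["x"] is not None and obs["x"] < y)
               or (obs["u"] is not None and obs["u"] < y))
    return entered - left

def risk_set_alt(y):
    """r_j = (# x_i and u_i >= y_j) - (# d_i >= y_j); same count."""
    in_study = sum(1 for obs in data
                   if (obs["x"] is not None and obs["x"] >= y)
                   or (obs["u"] is not None and obs["u"] >= y))
    truncated = sum(1 for obs in data if obs["d"] >= y)
    return in_study - truncated

for y in ys:
    assert risk_set(y) == risk_set_alt(y)
    print(y, risk_set(y))   # r = 4 at y = 1.5, r = 2 at y = 2.5
```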
And so we define the Kaplan-Meier (product-limit) estimator as follows:
{$S(t)=1$}, for {$t\in[0, y_1)$},
{$S(t)=\frac{r_1-s_1}{r_1}$} (probability of surviving past {$y_1$}), for {$t\in[y_1, y_2)$},
{$S(t)=S(y_1)\frac{r_2-s_2}{r_2}$} (probability of surviving past {$y_2$}), for {$t\in[y_2, y_3)$},...
{$S(t)=\prod_{i=1}^k\frac{r_i-s_i}{r_i}$} (probability of surviving past {$y_k$}), for {$t\in[y_k,\infty)$}.
Or we can define the last line to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
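Continuing the hypothetical sketch above (reusing `data`, `ys`, and `risk_set`), the product-limit estimate is a running product over the death times up to {$t$}:

```python
# Continues the previous sketch: data, ys, and risk_set as defined above.
def s_j(y):
    """Number of uncensored observations equal to y (the s_j count)."""
    return sum(1 for obs in data if obs["x"] == y)

def kaplan_meier(t):
    """S(t) = product over y_j <= t of (r_j - s_j)/r_j; equals 1 before y_1."""
    prod = 1.0
    for y in ys:
        if y > t:
            break
        r = risk_set(y)
        prod *= (r - s_j(y)) / r
    return prod

# Step function: 1.0 on [0, 1.5), 0.75 on [1.5, 2.5), 0.0 afterwards.
print([kaplan_meier(t) for t in (1.0, 1.5, 2.0, 2.5)])
```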
b) Nelson-Aalen estimator
This estimates the cumulative hazard rate function as follows:
{$\hat{H}(t)=0$}, for {$t\in[0, y_1)$},
{$\hat{H}(t)=\frac{s_1}{r_1}$}, for {$t\in[y_1, y_2)$},
{$\hat{H}(t)=\hat{H}(y_1) + \frac{s_2}{r_2}$}, for {$t\in[y_2, y_3)$},...
{$\hat{H}(t)=\sum_{i=1}^k\frac{s_i}{r_i}$}, for {$t\in[y_k,\infty)$}.
Then {$\hat{S}(t) = e^{-\hat{H}(t)}$}.
For the end piece {$t\in[y_k,\infty)$}, we can define {$\hat{S}(t)$} to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
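The same hypothetical scaffolding gives the Nelson-Aalen estimate; note that {$e^{-\hat{H}(t)}$} always sits at or above the Kaplan-Meier value, since {$1-x \leq e^{-x}$} factor by factor:

```python
# Continues the previous sketches: data, ys, risk_set, and s_j as above.
import math

def nelson_aalen(t):
    """H(t) = sum over y_j <= t of s_j / r_j."""
    total = 0.0
    for y in ys:
        if y > t:
            break
        total += s_j(y) / risk_set(y)
    return total

def survival_from_hazard(t):
    """S(t) = exp(-H(t)); slightly above the Kaplan-Meier estimate."""
    return math.exp(-nelson_aalen(t))

# 1.0, ~0.7788, ~0.7788, ~0.2865 vs. 1.0, 0.75, 0.75, 0.0 for Kaplan-Meier.
print([round(survival_from_hazard(t), 4) for t in (1.0, 1.5, 2.0, 2.5)])
```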
c) Kernel density estimators
First we need to discuss empirical distributions:
Suppose the data is not grouped and there is no censoring or truncation. Out of a total of {$n$} observations, let {$y_1 < \cdots < y_k$} be the distinct data points, each {$y_i$} occurring {$s_i$} times. Then {$S(t)=1-F(t)$}, where
{$F(t)=0$}, for {$t\in(-\infty, y_1)$},
{$F(t)=\frac{s_1}{n}$}, for {$t\in[y_1, y_2)$},
{$F(t)=F(y_1) + \frac{s_2}{n}$}, for {$t\in[y_2, y_3)$},...
{$F(t)=\sum_{i=1}^k\frac{s_i}{n} = 1$}, for {$t\in[y_k,\infty)$}.
Note each {$y_i$} is assigned probability {${p(y_i)}= s_i/n$}.
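For example, a minimal sketch of the empirical cdf under hypothetical sample values:

```python
# A minimal sketch of the empirical cdf; the sample values are hypothetical.
from collections import Counter

values = [1, 2, 2, 3, 5, 5, 5]
n = len(values)
counts = Counter(values)            # maps each distinct y_i to its count s_i

def empirical_cdf(t):
    """F(t) = sum of s_i/n over all y_i <= t: a right-continuous step function."""
    return sum(s for y, s in counts.items() if y <= t) / n

print(empirical_cdf(2))    # 3/7: jumps of s_i/n at each distinct value
print(empirical_cdf(4.9))  # 4/7: constant between distinct values
```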
A kernel density estimator smooths the empirical distribution. The kernel-smoothed distribution function is
{$\hat{F}(t) = \sum_{i=1}^k p(y_i) K_{y_i}(t)$},
where {$K_{y_i}(t)$} is the cdf of a kernel: a chosen pdf {$k_{y_i}$} centered at {$y_i$} with a certain bandwidth {$b$}. The density estimate {$\hat{f}(t)$} uses the pdfs {$k_{y_i}$} in place of the {$K_{y_i}$}:
A uniform kernel has a rectangular pdf centered at each {$y_i$}, with width {$2b$} and height {$1/(2b)$};
a triangular kernel has a triangular pdf centered at each {$y_i$}, with base width {$2b$} and peak height {$1/b$}.
The book also mentions a gamma kernel with parameters {$\alpha$} and {$\theta = y_i/\alpha$} (so its mean is {$y_i$}). For the two kernels with bandwidths, the larger the bandwidth, the smoother the resulting kernel density estimator; for gamma kernels, the smaller {$\alpha$} is, the smoother the result.
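A minimal sketch of the density estimate under hypothetical data and bandwidths, using the rectangle and triangle pdfs just described:

```python
# A minimal sketch of kernel density estimation; data and bandwidths are
# hypothetical. Each kernel is a pdf centered at y with bandwidth b.
from collections import Counter

values = [1, 2, 2, 3, 5, 5, 5]
n = len(values)
counts = Counter(values)

def uniform_pdf(y, t, b):
    """Rectangle of width 2b and height 1/(2b) centered at y."""
    return 1 / (2 * b) if abs(t - y) <= b else 0.0

def triangular_pdf(y, t, b):
    """Triangle of base width 2b and peak height 1/b centered at y."""
    return max(0.0, (1 - abs(t - y) / b) / b)

def kde(t, b, kernel):
    """f_hat(t) = sum over distinct y_i of p(y_i) * k_{y_i}(t), p(y_i) = s_i/n."""
    return sum((s / n) * kernel(y, t, b) for y, s in counts.items())

print(kde(2.5, b=1.0, kernel=uniform_pdf))     # smooths the jumps at 2 and 3
print(kde(2.5, b=2.0, kernel=triangular_pdf))  # larger b, smoother estimate
```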
For grouped data, i.e., if the data is recorded as counts over ranges of values instead of single, precise values, we use ogives and histograms. Instead of {$y_i$}, let {$c_0, \ldots, c_k$} be the boundary values of the intervals of observations, and {$s_1, \ldots, s_k$} be the number of observations in {$(c_0, c_1), \ldots, (c_{k-1}, c_k)$}. Then {$S(t)=1-F(t)$}, where {$F(t)$} is the ogive,
{$F(c_0)=0$},
{$F(c_1)=\frac{s_1}{n}$},
{$F(c_2)=\frac{s_1+s_2}{n}$},...
{$F(c_k)=\sum_{i=1}^k\frac{s_i}{n} = 1$},
and the values of {$F$} between the {$c_i$}'s are connected linearly. Take the derivative to get the histogram {$f(t)$}:
{$f(t)=0$}, for {$t\in(-\infty, c_0)$},
{$f(t)=\frac{s_1}{n(c_1-c_0)}$}, for {$t\in[c_0, c_1)$},
{$f(t)=\frac{s_2}{n(c_2-c_1)}$}, for {$t\in[c_1, c_2)$},...
{$f(t)=\frac{s_k}{n(c_k-c_{k-1})}$}, for {$t\in[c_{k-1},c_k)$},
{$f(t)=0$} (or undefined), for {$t\in[c_k,\infty)$}.
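A minimal sketch of the ogive and histogram under hypothetical boundaries and counts:

```python
# A minimal sketch of the ogive and histogram for grouped data; the
# boundaries and counts below are hypothetical.
boundaries = [0, 10, 25, 50]        # c_0, ..., c_k
counts = [30, 50, 20]               # s_1, ..., s_k
n = sum(counts)

def ogive(t):
    """Piecewise-linear F(t): cumulative s_i/n at each c_i, interpolated."""
    if t <= boundaries[0]:
        return 0.0
    if t >= boundaries[-1]:
        return 1.0
    cum = 0.0
    for c_lo, c_hi, s in zip(boundaries, boundaries[1:], counts):
        if t < c_hi:
            return cum + (s / n) * (t - c_lo) / (c_hi - c_lo)
        cum += s / n
    return 1.0

def histogram(t):
    """f(t) = s_j / (n * (c_j - c_{j-1})) on the interval containing t."""
    for c_lo, c_hi, s in zip(boundaries, boundaries[1:], counts):
        if c_lo <= t < c_hi:
            return s / (n * (c_hi - c_lo))
    return 0.0

print(ogive(10), histogram(15))   # 0.3 and 50/(100*15)
```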
2. Estimate the variance of estimators and confidence intervals for failure time and loss distributions.
3. Apply the following concepts in estimating failure time and loss distributions: