Construction of Empirical Models (20-25%)
1. Estimate failure time and loss distributions using:
a) Kaplan-Meier estimator, including approximations for large data sets
Two concrete situations:
Insurance: we observe the loss amount per policy. Left truncation occurs when the loss is below the deductible (it is never reported); right censoring occurs when the loss exceeds the policy limit (we only record the limit).
Mortality table: we observe the age at death for each person. Left truncation happens at the age the person is first observed; right censoring happens if the person is still alive at the last observation.
Symbols: For each observation {$i$},
{$d_i$} = truncation point for that observation (0 if no truncation);
{$x_i$} = the observed value, if it wasn't censored;
{$u_i$} = censored value for that observation.
Then group and relabel the {$x_i$}'s into distinct values {$y_j$}, each occurring {$s_j$} times. Divide up the data according to the {$y_j$}'s by defining the risk set to be
{$r_j = \left(\text{#} x_i \text{ and } u_i \geq y_j \right) - \left(\text{#} d_i \geq y_j \right)$}
which is the same as
{$r_j = \left(\text{#} d_i < y_j \right) - \left(\text{#} x_i \text{ and } u_i < y_j \right) $}
So in our situations, the risk set counts
the difference between a) the # of policies with observed or censored loss amount {$\geq y_j$} and b) the # of policies with deductible {$\geq y_j$};
the difference between a) the # of people entering the study before age {$y_j$} and b) the # of people who died or left the study before age {$y_j$} (i.e., the # of people under observation just before age {$y_j$}).
Recursively, given {$r_{j-1}$},
{$r_j = r_{j-1} + \left(\text{#} d_i \in [y_{j-1}, y_j) \right) - \left(\text{#} x_i =y_{j-1}\right) - \left(\text{#} u_i \in [y_{j-1}, y_j) \right) $}
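As a sanity check, here is a minimal Python sketch (the data set, field names, and function names are hypothetical, not from the source) that computes the risk sets with both counting formulas above and confirms they agree:

```python
# Hypothetical observations: d = truncation point (0 if none), x = observed
# value (None if censored), u = censoring point (None if uncensored).
data = [
    {"d": 0.0, "x": 1.5, "u": None},
    {"d": 0.0, "x": None, "u": 2.0},
    {"d": 1.0, "x": 2.5, "u": None},
    {"d": 0.0, "x": 2.5, "u": None},
]

# Distinct uncensored values y_j.
ys = sorted({obs["x"] for obs in data if obs["x"] is not None})

def risk_set(y):
    """r_j = (# d_i < y_j) - (# x_i and u_i < y_j)."""
    entered = sum(1 for obs in data if obs["d"] < y)
    left = sum(1 for obs in data
               if (obs["x"] is not None and obs["x"] < y)
               or (obs["u"] is not None and obs["u"] < y))
    return entered - left

def risk_set_alt(y):
    """r_j = (# x_i and u_i >= y_j) - (# d_i >= y_j); same count."""
    in_study = sum(1 for obs in data
                   if (obs["x"] is not None and obs["x"] >= y)
                   or (obs["u"] is not None and obs["u"] >= y))
    truncated = sum(1 for obs in data if obs["d"] >= y)
    return in_study - truncated

for y in ys:
    assert risk_set(y) == risk_set_alt(y)
    print(y, risk_set(y))   # r = 4 at y = 1.5, r = 2 at y = 2.5
```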
And so we define the Kaplan-Meier (product-limit) estimator as follows:
{$S(t)=1$}, for {$t\in[0, y_1)$},
{$S(t)=\frac{r_1-s_1}{r_1}$} (probability of surviving past {$y_1$}), for {$t\in[y_1, y_2)$},
{$S(t)=S(y_1)\frac{r_2-s_2}{r_2}$} (probability of surviving past {$y_2$}), for {$t\in[y_2, y_3)$},...
{$S(t)=\prod_{i=1}^k\frac{r_i-s_i}{r_i}$} (probability of surviving past {$y_k$}), for {$t\in[y_k,\infty)$}.
Or we can define the last line to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
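Continuing the hypothetical sketch above (reusing `data`, `ys`, and `risk_set`), the product-limit estimate is a running product over the death times up to {$t$}:

```python
# Continues the previous sketch: data, ys, and risk_set as defined above.
def s_j(y):
    """Number of uncensored observations equal to y (the s_j count)."""
    return sum(1 for obs in data if obs["x"] == y)

def kaplan_meier(t):
    """S(t) = product over y_j <= t of (r_j - s_j)/r_j; equals 1 before y_1."""
    prod = 1.0
    for y in ys:
        if y > t:
            break
        r = risk_set(y)
        prod *= (r - s_j(y)) / r
    return prod

# Step function: 1.0 on [0, 1.5), 0.75 on [1.5, 2.5), 0.0 afterwards.
print([kaplan_meier(t) for t in (1.0, 1.5, 2.0, 2.5)])
```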
b) Nelson-Aalen estimator
This estimates the cumulative hazard rate function as follows:
{$\hat{H}(t)=0$}, for {$t\in[0, y_1)$},
{$\hat{H}(t)=\frac{s_1}{r_1}$}, for {$t\in[y_1, y_2)$},
{$\hat{H}(t)=\hat{H}(y_1) + \frac{s_2}{r_2}$}, for {$t\in[y_2, y_3)$},...
{$\hat{H}(t)=\sum_{i=1}^k\frac{s_i}{r_i}$}, for {$t\in[y_k,\infty)$}.
Then {$\hat{S}(t) = e^{-\hat{H}(t)}$}.
For the end piece {$t\in[y_k,\infty)$}, we can define {$\hat{S}(t)$} to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
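The same hypothetical scaffolding gives the Nelson-Aalen estimate; note that {$e^{-\hat{H}(t)}$} always sits at or above the Kaplan-Meier value, since {$1-x \leq e^{-x}$} factor by factor:

```python
# Continues the previous sketches: data, ys, risk_set, and s_j as above.
import math

def nelson_aalen(t):
    """H(t) = sum over y_j <= t of s_j / r_j."""
    total = 0.0
    for y in ys:
        if y > t:
            break
        total += s_j(y) / risk_set(y)
    return total

def survival_from_hazard(t):
    """S(t) = exp(-H(t)); slightly above the Kaplan-Meier estimate."""
    return math.exp(-nelson_aalen(t))

# 1.0, ~0.7788, ~0.7788, ~0.2865 vs. 1.0, 0.75, 0.75, 0.0 for Kaplan-Meier.
print([round(survival_from_hazard(t), 4) for t in (1.0, 1.5, 2.0, 2.5)])
```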
c) Kernel density estimators
First we need to discuss empirical distributions:
Suppose the data is not grouped and there is no censoring or truncation. Out of a total of {$n$} observations, let {$y_1 < \cdots < y_k$} be the distinct data points, each {$y_i$} occurring {$s_i$} times. Then {$S(t)=1-F(t)$}, where
{$F(t)=0$}, for {$t\in(-\infty, y_1)$},
{$F(t)=\frac{s_1}{n}$}, for {$t\in[y_1, y_2)$},
{$F(t)=F(y_1) + \frac{s_2}{n}$}, for {$t\in[y_2, y_3)$},...
{$F(t)=\sum_{i=1}^k\frac{s_i}{n} = 1$}, for {$t\in[y_k,\infty)$}.
Note each {$y_i$} is assigned probability {${p(y_i)}= s_i/n$}.
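For example, a minimal sketch of the empirical cdf under hypothetical sample values:

```python
# A minimal sketch of the empirical cdf; the sample values are hypothetical.
from collections import Counter

values = [1, 2, 2, 3, 5, 5, 5]
n = len(values)
counts = Counter(values)            # maps each distinct y_i to its count s_i

def empirical_cdf(t):
    """F(t) = sum of s_i/n over all y_i <= t: a right-continuous step function."""
    return sum(s for y, s in counts.items() if y <= t) / n

print(empirical_cdf(2))    # 3/7: jumps of s_i/n at each distinct value
print(empirical_cdf(4.9))  # 4/7: constant between distinct values
```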
A kernel density estimator smooths the empirical distribution. The kernel-smoothed distribution function is
{$\hat{F}(t) = \sum_{i=1}^k p(y_i) K_{y_i}(t)$},
where {$K_{y_i}(t)$} is the cdf of a kernel: a chosen pdf {$k_{y_i}$} centered at {$y_i$} with a certain bandwidth {$b$}. The density estimate {$\hat{f}(t)$} uses the pdfs {$k_{y_i}$} in place of the {$K_{y_i}$}:
A uniform kernel has a rectangular pdf centered at each {$y_i$}, with width {$2b$} and height {$1/(2b)$};
a triangular kernel has a triangular pdf centered at each {$y_i$}, with base width {$2b$} and peak height {$1/b$}.
The book also mentions a gamma kernel with parameters {$\alpha$} and {$\theta = y_i/\alpha$} (so its mean is {$y_i$}). For the two kernels with bandwidths, the larger the bandwidth, the smoother the resulting kernel density estimator; for gamma kernels, the smaller {$\alpha$} is, the smoother the result.
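A minimal sketch of the density estimate under hypothetical data and bandwidths, using the rectangle and triangle pdfs just described:

```python
# A minimal sketch of kernel density estimation; data and bandwidths are
# hypothetical. Each kernel is a pdf centered at y with bandwidth b.
from collections import Counter

values = [1, 2, 2, 3, 5, 5, 5]
n = len(values)
counts = Counter(values)

def uniform_pdf(y, t, b):
    """Rectangle of width 2b and height 1/(2b) centered at y."""
    return 1 / (2 * b) if abs(t - y) <= b else 0.0

def triangular_pdf(y, t, b):
    """Triangle of base width 2b and peak height 1/b centered at y."""
    return max(0.0, (1 - abs(t - y) / b) / b)

def kde(t, b, kernel):
    """f_hat(t) = sum over distinct y_i of p(y_i) * k_{y_i}(t), p(y_i) = s_i/n."""
    return sum((s / n) * kernel(y, t, b) for y, s in counts.items())

print(kde(2.5, b=1.0, kernel=uniform_pdf))     # smooths the jumps at 2 and 3
print(kde(2.5, b=2.0, kernel=triangular_pdf))  # larger b, smoother estimate
```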
For grouped data, i.e., if the data is recorded as counts over ranges of values instead of single, precise values, we use ogives and histograms. Instead of {$y_i$}, let {$c_0, \ldots, c_k$} be the boundary values of the intervals of observations, and {$s_1, \ldots, s_k$} be the number of observations in {$(c_0, c_1), \ldots, (c_{k-1}, c_k)$}. Then {$S(t)=1-F(t)$}, where {$F(t)$} is the ogive,
{$F(c_0)=0$},
{$F(c_1)=\frac{s_1}{n}$},
{$F(c_2)=\frac{s_1+s_2}{n}$},...
{$F(c_k)=\sum_{i=1}^k\frac{s_i}{n} = 1$},
and the values of {$F$} between the {$c_i$}'s are connected linearly. Take the derivative to get the histogram {$f(t)$}:
{$f(t)=0$}, for {$t\in(-\infty, c_0)$},
{$f(t)=\frac{s_1}{n(c_1-c_0)}$}, for {$t\in[c_0, c_1)$},
{$f(t)=\frac{s_2}{n(c_2-c_1)}$}, for {$t\in[c_1, c_2)$},...
{$f(t)=\frac{s_k}{n(c_k-c_{k-1})}$}, for {$t\in[c_{k-1},c_k)$},
{$f(t)=0$} (or undefined), for {$t\in[c_k,\infty)$}.
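A minimal sketch of the ogive and histogram under hypothetical boundaries and counts:

```python
# A minimal sketch of the ogive and histogram for grouped data; the
# boundaries and counts below are hypothetical.
boundaries = [0, 10, 25, 50]        # c_0, ..., c_k
counts = [30, 50, 20]               # s_1, ..., s_k
n = sum(counts)

def ogive(t):
    """Piecewise-linear F(t): cumulative s_i/n at each c_i, interpolated."""
    if t <= boundaries[0]:
        return 0.0
    if t >= boundaries[-1]:
        return 1.0
    cum = 0.0
    for c_lo, c_hi, s in zip(boundaries, boundaries[1:], counts):
        if t < c_hi:
            return cum + (s / n) * (t - c_lo) / (c_hi - c_lo)
        cum += s / n
    return 1.0

def histogram(t):
    """f(t) = s_j / (n * (c_j - c_{j-1})) on the interval containing t."""
    for c_lo, c_hi, s in zip(boundaries, boundaries[1:], counts):
        if c_lo <= t < c_hi:
            return s / (n * (c_hi - c_lo))
    return 0.0

print(ogive(10), histogram(15))   # 0.3 and 50/(100*15)
```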
2. Estimate the variance of estimators and confidence intervals for failure time and loss distributions.
3. Apply the following concepts in estimating failure time and loss distributions: