Sample analysis of risk factors¶
This is an example of quantitative modeling of risk factors using Jupyter notebooks with seaborn for visualization and PyMC3 for Bayesian statistical modeling. Click on the "Open in Colab" badge to get your own copy of the notebook that you can edit and experiment with without needing to install anything locally!
Written by Will Maier.
import scipy.stats as stats
import pymc3 as mc
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
seed = 14211
np.random.seed(seed)
Model¶
We will develop a hierarchical model composed of many factors, defining each factor iteratively. In some cases, the factors will depend on other factors, so we'll work our way up from the most granular factors to risk itself. This approach follows the FAIR ontology (Open Group standard O-RT).
We will express the model using PyMC3, which provides a concise language to describe probabilistic models. Once defined, PyMC3 can also draw samples from the modelled distribution through Monte Carlo simulation.
In the end, we will be able to analyze collections of these samples as traces from which we can estimate characteristics of the distribution including its mean and quantiles.
To begin, we define model as a context manager. As we develop the model, we will do so in this context
model = mc.Model()
Elicitation¶
Risk models relate various factors, where each factor is itself modeled by a probability distribution. These distributions are defined using parameters that are hard to relate to the underlying concrete phenomena. For example, an expert rarely thinks about a random process like the magnitude of a loss event in terms of its mean or variance. Instead, it is often easier to estimate specific plausible values from the distribution. These can be fit to a distribution and the distribution can then be used to estimate the parameters directly.
This process of estimating distribution parameters is called elicitation. Here, we simply fit a known distribution to whatever plausible data an expert might provide. SHELF offers a more robust method for facilitated elicitation among a group of experts.
The elicit
function returns a variable following the beta distribution with its mu (mean) and sigma (standard deviation) parameters inferred from an array of estimated values in data
. These values may be drawn from actual data or -- where such data is not available -- estimated by experts. Providing more values should usually result in an elicited distribution that more closely matches the experts' intent.
This is most useful for continuous variables with upper and lower bounds, where the bounds are the first and last items in data
. The beta distribution has these characteristics and so is used here. elicit
could be modified to support other continuous distributions as well.
The return value follows the beta distribution deterministically scaled to the minimum and maximum values from data
. The scipy.stats.rv_continuous variable used for the fit is stored as the var
attribute of the return value. This may be used to retrieve characteristics of data
, including its mean or interval.
def elicit(name, data):
y = np.sort(data)
width = y.max()-y.min()
par = stats.beta.fit(y[1:-1], floc=y.min(), fscale=y.max())
var = stats.beta(*par)
scaled_mu = var.mean()/width
scaled_sd = var.std()/width
scaled = mc.Beta(f"{name}_scaled__", mu=scaled_mu, sd=scaled_sd)
dist = mc.Deterministic(name, y.min() + (scaled * width))
dist.var = var
return dist
For convenience, we also define a function to quickly summarize a single variable. The resulting plot includes a histogram, KDE, and indicators for the 50th, 90th, 95th, and 99th percentile values.
def sumplot(data):
sns.distplot(data)
colors = iter(sns.color_palette())
m = np.mean(data)
plt.axvline(m, linestyle=":", color=next(colors), label=f"mean = {m:0.2f}")
ps = [50, 90, 95, 99]
for p in ps:
x = np.percentile(data, p)
plt.axvline(x, linestyle=":", color=next(colors), label=f"{p:0.2f} = {x:0,.2f}")
plt.legend()
Primary loss¶
with model:
pl = elicit("primary_loss", [0, 100, 1005, 1100, 10000])
sumplot(pl.var.rvs(1000))
Secondary loss¶
with model:
sl = elicit("secondary_loss", [0, 10, 105, 50000, 10000])
sumplot(sl.var.rvs(1000))
Risk¶
with model:
risk = mc.Deterministic("risk", pl + sl)
Simulation¶
with model:
trace = mc.sample(seed=seed)
Analysis¶
df = mc.trace_to_dataframe(trace)
sumplot(df.risk)