From e4c2eae0ab52db1517d85b775f8106e9e49e5dab Mon Sep 17 00:00:00 2001
From: Elizaveta Semenova
Date: Sat, 6 Apr 2024 20:28:19 +0200
Subject: [PATCH] update lgcp

---
 21_LGCP.ipynb | 137 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

diff --git a/21_LGCP.ipynb b/21_LGCP.ipynb
index ac861fe..a0bc51a 100644
--- a/21_LGCP.ipynb
+++ b/21_LGCP.ipynb
@@ -79,6 +79,143 @@
     "\n",
     "the second product is taken over the cells with non-zero counts. As the volume of each cell becomes smaller $\vert c_i \vert \to 0$, the counts $y_i$ within cells become 0 or 1 and we obtain the same likelihood as $L_{\text{po}}.$ Thus, this fitting approach represents an alternative to the one through direct likelihood: instead of using only the points of the observed events and having to evaluate the integral over the observation region, we create artificial counts on a fine grid and use them to discover the intensity. \n"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Discretization of the Gaussian Process\n",
+    "\n",
+    "When the context allows for covariates, the intensity takes the form\n",
+    "\n",
+    "$$\n",
+    "\lambda(s) = \exp(\textbf{X}^T(s) \boldsymbol{\beta} + f(s)),\n",
+    "$$\n",
+    "\n",
+    "where $f$ stands for the Gaussian process. This case of the Cox process is important for practical use because the intensity can be linked to meaningful explanatory factors. The log-likelihood, for instance in the case of the Poisson count distribution, takes the form\n",
+    "\n",
+    "$$\n",
+    "\log L = - {\int_D \exp\{ \textbf{X}^T(s) \boldsymbol{\beta} + f(s)\} ds} + \sum\limits_{i=1}^{n} \{ \textbf{X}^T(s_i) \boldsymbol{\beta} + f(s_i)\}.\n",
+    "$$\n",
+    "\n",
+    "Here the integral is intractable because of the random term under the integral sign. Furthermore, it creates a high computational burden, as it is taken over the entire domain $D.$ An integration quadrature formula can be used, for instance,\n",
+    "\n",
+    "$$\n",
+    "\int_D \exp\{ \textbf{X}^T(s) \boldsymbol{\beta} + f(s)\} ds \approx \frac{|D|}{K} \sum_{k=1}^{K} \exp\{ \textbf{X}^T(g_k) \boldsymbol{\beta} + f(g_k)\},\n",
+    "$$\n",
+    "\n",
+    "where $g_1, \dots, g_K$ are quadrature nodes covering the domain $D.$ Using this quadrature leads to the same computational issues as when using the computational grid (Diggle (2013)) and fitting cell counts.\n",
+    "\n",
+    "For practical applications one would like to keep the highest resolution available for the spatial covariates while performing inference. As a consequence, the computational burden increases due to calculations on the large matrices required to describe the stochastic effect. The traditional mechanism for fitting latent Gaussian models and numerically obtaining posterior estimates of all the random terms comprising the latent field is the simulation-based and computationally intensive Markov chain Monte Carlo. Below we present a computational strategy that reduces the complexity and memory usage of fitting the LGCP and makes it more feasible.\n",
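+    "\n",
+    "A minimal NumPy sketch of this quadrature approximation is given below; the array names (`X_grid`, `f_grid`, `X_obs`, `f_obs`, `area`) are hypothetical placeholders for the covariates and GP values evaluated at the quadrature nodes $g_k$ and at the observed points $s_i$:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "def approx_log_lik(beta, X_grid, f_grid, X_obs, f_obs, area):\n",
+    "    # Quadrature approximation of the LGCP log-likelihood on K grid cells.\n",
+    "    eta_grid = X_grid @ beta + f_grid   # X^T(g_k) beta + f(g_k), shape (K,)\n",
+    "    eta_obs = X_obs @ beta + f_obs      # X^T(s_i) beta + f(s_i), shape (n,)\n",
+    "    integral = area / eta_grid.size * np.exp(eta_grid).sum()   # |D|/K * sum_k exp{...}\n",
+    "    return -integral + eta_obs.sum()\n",
+    "```\n",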
\n", + "\n", + "Finite realizations of the stationary Gaussian process $f$ compose a vector distributed as $ \\text{MVN}(m,\\Sigma)$ with $\\Sigma$ being a spatially-structured covariance matrix with elements \n", + "\n", + "$$\n", + "\\Sigma_{ij} = \\text{R}^{\\theta}(d_{ij}).\n", + "$$\n", + "\n", + "Here $\\text{R}^{\\theta}$ is an isomorphic spatial covariance function depending on the parameters $\\theta$ and the distance between locations $d_{ij}.$ We opt for the Gaussian as the functional form of $\\text{R}^{\\theta}$ and this is crucial for our further considerations:\n", + "\n", + "$$\n", + "R_{ij} = \\sigma^2 \\exp ( - 1/(2l^2) \\sum_{d=1}^{D} (x_{i}^d-x_{j}^d)^2 ).\n", + "$$\n", + "\n", + "The sum in the above formula is taken over all dimensions. Parameter $l$ denotes the length scale and can be chosen differently for different dimensions. For Bayesian inference of the vector-parameter $ \\theta = ( \\boldsymbol{\\beta}, \\sigma^2, l)$ and computation of the posterior distribution\n", + "$\\pi (\\theta | S ) \\propto L(S| \\theta ) \\cdot \\pi(\\theta)$\n", + "the priors $\\pi(\\theta)$ need to be selected. \n", + "\n", + "Simulation-based inference requires realizations of the multivariate normal distribution $f \\sim \\text{MVN}(m,\\Sigma),$ generated from the density $(2 \\pi)^{-d/2} |\\Sigma|^{-1/2} \\exp( - \\frac{1}{2} (x-m)^T \\Sigma^{-1} (x-m))$ . The covariance matrix $\\Sigma$ spans $N =n_y n_x$ and $N = n_y n_x n_t$ cells for spatial and spatiotemporal models, respectively. Hence, the size of the covariance matrix is $N^2$, and constitutes a computational bottleneck referring both to the run-time and required storage. To draw from the multivariate normal distribution $ \\text{MVN}(0, \\Sigma)$ and avoid matrix inversion, as well as the determinant calculation, an improvement can be achieved through the Cholesky factorization, i.e. a matrix $L,$ such that \n", + "\n", + "$$\n", + "\\Sigma= L L^T.\n", + "$$\n", + "\n", + "In this case, if $z$ is an $N$-vector generated as $\\text{MVN}(0, I_n),$ then \n", + "\n", + "$$\n", + "Lz \\sim \\text{MVN}(0, \\Sigma).\n", + "$$\n", + "\n", + "This algorithm requires storage for the matrix $L$ of size $N \\times N$. To mitigate this issue we exploit the *Kronecker (tensor) algebra*, as well as matrix-normal and array-normal distributions as described below.\n", + "\n", + "## Kronecker algebra\n", + "\n", + "If $A$ is an $m \\times n$ matrix, and $B$ is an $p \\times q$ matrix, then the *Kronecker-product* (or *tensor product*) $A \\otimes B$ is an $m p \\times n q$ matrix with elements\n", + "\n", + "$$\n", + "A \\otimes B = \n", + "\\begin{bmatrix} a_{11}B & .... & a_{1n}B \\\\\n", + " ... & ... & ... \\\\ \n", + " a_{m1}B & ... & a_{mn}B \\end{bmatrix}.\n", + "$$\n", + "\n", + "All further explanations will be made for the 2d case.\n", + "\n", + "The covariance function, evaluated over a Cartesian grid, gives rise to a covariance matrix that\n", + "can be written as a Kronecker product of D smaller marginal covariance matrices. The side length of each such square matrix equals the number of cells along the given axis. This result follows from the definition of the Kronecker product and the chosen structure of the correlation matrix, i.e. \n", + "\n", + "$$\n", + "\\Sigma = V \\otimes U.\n", + "$$\n", + "\n", + "Instead of storing and inverting an $N \\times N$ matrix, it suffices to store one $n_y \\times n_y$ and one $n_x \\times n_x$ matrices. 
+    "\n",
+    "Further gains can be made by avoiding $N \times N$ matrices in all intermediate computations and using the *matrix-normal distribution* instead. An $n_y \times n_x$ matrix $f_{n_y,n_x}$ is said to follow the matrix-normal distribution\n",
+    "\n",
+    "$$\n",
+    "f_{n_y,n_x} \sim \text{MN}_{n_y,n_x}(\mu_{n_y,n_x}, U, V)\n",
+    "$$\n",
+    "\n",
+    "with the $n_y \times n_x$ matrix-mean $\mu_{n_y,n_x}$, $n_y \times n_y$ between-rows covariance $U$ and $n_x \times n_x$ between-columns covariance $V$ if and only if\n",
+    "\n",
+    "$$\n",
+    "vec(f_{n_y,n_x}) \sim \text{MVN}( vec(\mu_{n_y,n_x}), V \otimes U),\n",
+    "$$\n",
+    "\n",
+    "where $vec(X)$ denotes the vectorization of the matrix $X,$ i.e. the vector of length $n_y n_x$ obtained by stacking the columns of $X$ one under another. The vectorization operator has an important property which connects it to the Kronecker product. Let $A$ be an $n_y \times n_y$ matrix, $X$ an $n_y \times n_x$ matrix and $B$ an $n_x \times n_x$ matrix. Then it holds that\n",
+    "\n",
+    "$$\n",
+    "(B \otimes A) vec(X) = vec(AXB^T).\n",
+    "$$\n",
+    "\n",
+    "Let $L$ be the Cholesky factor of the matrix $\Sigma$, which admits the Kronecker decomposition $\Sigma= V \otimes U,$ and let $L_V$ and $L_U$ be the Cholesky factors of $V$ and $U$, respectively. Then $V\otimes U = (L_V L_V^T) \otimes (L_U L_U^T) = (L_V \otimes L_U) (L_V \otimes L_U)^T$ due to the properties of the Kronecker product, and it follows that\n",
+    "\n",
+    "$$\n",
+    "L= L_V \otimes L_U.\n",
+    "$$\n",
+    "\n",
+    "By $\text{vec}^{-1}_{n_y, n_x}$ we denote the operation of unstacking a vector into a matrix with $n_y$ rows and $n_x$ columns in column-major order. An important implication of the two properties listed above is a method for sampling from the matrix-normal distribution: if\n",
+    "\n",
+    "$$\n",
+    "z_{n_y, n_x} \sim \text{MN}_{n_y, n_x}(0, I, I)\n",
+    "$$\n",
+    "\n",
+    "is an $n_y \times n_x$ matrix of independent draws from the standard normal distribution, then\n",
+    "\n",
+    "$$\n",
+    "L_U z_{n_y, n_x} L_V^T = vec_{n_y,n_x}^{-1} \left( (L_V \otimes L_U) vec(z_{n_y,n_x}) \right) \sim vec_{n_y,n_x}^{-1}(\text{MVN}(0, L L^T)) = \text{MN}_{n_y,n_x}(0, U, V).\n",
+    "$$\n",
+    "\n",
+    "Generalizing this fact to 3d (the array-normal distribution), we obtain\n",
+    "\n",
+    "$$\n",
+    "vec_{n_y,n_x,n_t}^{-1}((L_W \otimes L_V \otimes L_U) vec(z_{n_y,n_x, n_t})) \sim \text{MN}_{n_y,n_x,n_t}(0, U, V, W).\n",
+    "$$\n",
+    "\n",
+    "It follows that, instead of sampling from $\text{MVN}(0, \Sigma)$ via $f_N=Lz_N,$ one can sample from the matrix-normal distribution $f_{n_y,n_x} \sim \text{MN}_{n_y,n_x}(0, U, V)$ via $f_{n_y,n_x} = L_U z_{n_y,n_x} L_V^T.$\n",
+    "\n",
+    "This approach requires the computation and storage of the Cholesky factors of $U$ and $V$, of dimensions $n_y \times n_y$ and $n_x \times n_x,$ respectively, which is a noticeable saving compared to sampling through $L$: we need to store $n_y^2+n_x^2$ instead of $n_y^2 n_x^2$ numbers and perform $O(n_y^3+n_x^3)$ instead of $O(n_y^3 n_x^3)$ operations. The latter follows from the fact that, in general, the computation of the Cholesky factor of an $n \times n$ matrix requires $O(n^3)$ operations. Hence, the computational complexity and storage requirements grow additively rather than multiplicatively with each added dimension."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {