GENetLib's documentation

GENetLib is a Python library for gene–environment interaction analysis via deep learning.

This Page

The FuncGE Model

The FuncGE model is utilized to capture G-E interaction effects between functional genetic variables and scalar environmental variables.

Notation 1

Consider a dataset with \(n\) independent individuals.

Parameter

Notation

Outcome variables

\(\boldsymbol{y} = (y_1, \ldots, y_n)^{\top}\)

Functional genetic variables

\(X_i(t)\) where \(t \in \mathcal{T}\), with \(\mathcal{T} = [0,T]\)

Scalar environmental variables

\(\boldsymbol{Z}_i = \left(z_{i1}, \ldots, z_{iq} \right)^{\top}\)

Genetic effect function

\(\beta_0(t)\)

Interaction effect functions

\(\beta_k(t)\) for \(k = 1, \ldots, q\)

Model

For continuous outcomes, we consider the following model from a functional data analysis perspective:

\[y_i=g\left(\sum_{k=1}^qz_{ik}\gamma_k+\int_0^TX_i(t)\beta_0(t)dt+\sum_{k=1}^qz_{ik}\int_0^TX_i(t)\beta_k(t)dt\right)+\varepsilon_i\]

where \(\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_q)^{\top}\) is the \(q\)-dimensional vector of coefficients. The error term \(\epsilon_i\) follows a normal distribution \(N(0, \sigma_e^2)\) and is assumed to be independent of both \(\boldsymbol{Z}_i\) and \(X_i(t)\). \(g\) is the functional form.

Spline regression

Based on spline regression techniques, we approximate \(\beta_k(t)\) for \(k = 0, 1, \ldots, q\) using a set of B-spline basis functions \(\boldsymbol{\psi}(t) = \left( \psi_1(t), \ldots, \psi_{L_\beta}(t) \right)^\top\).

These basis functions are:

  • Knots: \(M_n\) knots that are uniformly distributed.

  • Interval: \(\mathcal{T}\).

  • Degree \(d\).

  • \(L_\beta = M_n + d\).

Thus, \(\beta_k(t)\) can be approximated as:

\[\beta_k(t)=\boldsymbol{\psi}(t)^\top \boldsymbol{\theta}_k, \; k=0,1,\ldots,q\]

where \(\boldsymbol{\theta}_{k} = (\theta_{k1}, \ldots, \theta_{kL_\beta})^{\top}\) are the coefficients for the B-spline basis functions.

Using this approximation, we can rewrite the model above:

\[y_{i} = g\left\{\sum_{k=1}^q z_{ik} \gamma_k + \left[\int_0^T X_i(t) \boldsymbol{\psi}(t)^\top \mathrm{d}t\right] \boldsymbol{\theta_0} + \sum_{k=1}^q z_{ik} \left[\left(\int_0^T X_i(t) \boldsymbol{\psi}(t)^\top \mathrm{d}t\right) \boldsymbol{\theta_k}\right]\right\} + \varepsilon_i\]

Notation 2

Parameter

Notation

Matrix of basis coefficients for the functional genetic variables

\(\boldsymbol{U}=(\boldsymbol{U}_1, \ldots, \boldsymbol{U}_n)^{\top}\) where \(\boldsymbol{U}_i=\left(U_{i1}, \ldots, U_{iL_\beta} \right)^{\top}\) is computed as the integral \(U_{ij} = \int_0^T X_i(t)\psi_j(t)dt\)

G-E interaction variables

\(\boldsymbol{Q}=(\boldsymbol{Q}_1, \ldots, \boldsymbol{Q}_n)^\top\) where \(\boldsymbol{Q}_i = \boldsymbol{Z}_i \otimes \boldsymbol{U}_i\)

Combination of G and G-E variables

\(\tilde{\boldsymbol{H}} = (\boldsymbol{U}, \boldsymbol{Q})\)

Thus, we define \(\tilde{\boldsymbol{W}} = \left(\boldsymbol{U}, \boldsymbol{Q}, \boldsymbol{Z} \right)\) where \(\tilde{\boldsymbol{W}}_i\) represents the variable set for the \(i\)-th individual.

Loss functions

With the definitions of variable set \(\tilde{\boldsymbol{W}}\) in place, the loss function for continuous outcomes can be derived in a form similar to that described in ScalarGE.

Continuous layer

\[\tilde{l}_{\text{continuous}}(\boldsymbol{\theta})=\frac1n\sum_{i=1}^n[y_i-g(\tilde{\boldsymbol{W}}_i\boldsymbol{\theta})]^2,\]

where \(\boldsymbol{\theta}\) represents the neural network weights.

Binary layer

\[\tilde{l}_{\text{binary}}(\boldsymbol{\theta}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i\log(p_i) + (1 - y_i) \log (1 - p_i)-\tilde{\boldsymbol{W}}_i \boldsymbol{\theta} \right],\]

where \(p_i = (1 + e^{-\tilde{\boldsymbol{W}}_i\boldsymbol{\theta}})^{-1}\).

Cox layer

\[\tilde{l}_{\text{Cox}}(\boldsymbol{\theta})=-\sum_{i:\delta_{i}=1 }\biggl[h_{\boldsymbol{\theta}}(\tilde{\boldsymbol{W}}_i)-\log{\sum_{j\in R(T_{i})}e^{h_{\boldsymbol{\theta}}(\tilde{\boldsymbol{W}}_j)}}\biggr],\]

where \(y_i\) and \(c_i\) represent the survival time and censoring time, \(T_i\) is the minimum value of \(y_i\) and \(c_i\), \(\delta_i\) is an indicator variable that \(I(y_i \leq c_i)\). The at-risk set at time \(T_{i}\) is \(R(T_i) = \{i' : T_{i'} \geq T_i\}\), and \(h_{\boldsymbol{\theta}}\) is the prognostic index.

Penalty approximation

Define \(\tilde{\gamma}_{j}\) as the main effect and \(\tilde{\beta}_{1j}, \ldots, \tilde{\beta}_{qj}\) as the interaction effects. Then, the coefficient vector related to the corresponding G and G-E variables in \(\tilde{\boldsymbol{H}}\) is denoted by \(\tilde{\boldsymbol{b}}_j = (\tilde{\gamma}_{j}, \tilde{\beta}_{1j}, \ldots, \tilde{\beta}_{qj})^{\top} = (\tilde{b}_{j1}, \ldots, \tilde{b}_{j(q+1)})^{\top}\), where \(j = 1, \ldots, p\).

Let \(K\) represent the total number of fully connected layers within the network architecture, with \(\tilde{\omega}_k\) indicating the weight matrix allocated to the \(k\)-th layer. The objective function can then be formulated as:

\[\tilde{L}(\boldsymbol{\theta}) = \tilde{l}_{\cdot}(\boldsymbol{\theta}) + \sum_{j=1}^p\rho(|| \tilde{\boldsymbol{b}}_j||;\lambda_1,s) + \sum_{j=1}^p \sum_{k=1}^q \rho(|\tilde{\beta}_{kj}|; \lambda_2, s) + \lambda \biggl( \sum_{k=1}^K||\tilde{\omega}_k||_F^2 \biggr),\]

where \(\tilde{l}_{\cdot}(\boldsymbol{\theta})\) is the corresponding loss function for various types of outcomes.

Similarly, we utilize the same method as ScalarGE to approximate penalties. We minimize the objective function, iterating the estimation process until the training converges and parameters stabilize folloing the algorithm below.

Algorithm: training of FuncGE

Input:

  • Functional genetic variables \(\boldsymbol{X}(t)\) or discrete sequence \(\check{\boldsymbol{X}}\);

  • Environmental variables \(\boldsymbol{Z}\);

  • Survival output \((T,\delta)\) or continuous outputs \(y\) or binary output \(y\);

  • Learning rates of the sparse layer and the fully connected layers \(\{\alpha_1,\alpha_2\}\);

  • Tuning and regularization parameters of the MCP penalty \(\{\lambda_1, \lambda_2, s\}\);

  • Tuning parameter of the fully connected layers \(\lambda\).

Data pre-processing:

  • For functional genetic input \(\boldsymbol{X}\), format \(\tilde{W} = (\boldsymbol{U}, \boldsymbol{Q}, \boldsymbol{Z})\);

  • For sequence genetic input \(\check{\boldsymbol{X}}\), format \(\tilde{\boldsymbol{W}} = (\tilde{\boldsymbol{U}}, \tilde{\boldsymbol{Q}}, \boldsymbol{Z})\).

Initialize:

  • Sparse layer \(\tilde{\boldsymbol{b}}^{(0)}\), \(k\)-th fully connected layer \(\tilde{\omega}_k^{(0)}\), \(m = 0\).

Repeat:

  • Update the approximated MCP penalties with the current estimate \(\tilde{\boldsymbol{b}}^{(m)}\);

  • Update Loss \(= \tilde{l}(\boldsymbol{\theta}) + \text{approximated MCP penalties} + \lambda \sum_{k=1}^{K} \|\tilde{\omega}_{k}^{(m)}\|_F^2\);

  • Conduct back propagation, and obtain the gradients \(\frac{\partial \text{Loss}}{\partial \tilde{\boldsymbol{b}}_j^{(m)}}\) and \(\frac{\partial \text{Loss}}{\partial \tilde{\omega}_k^{(m)}}\);

  • For \(j = 1\) to \(p\) do

    • Update estimates \(\tilde{\gamma}_j^{(m+1)} = \tilde{\gamma}_j^{(m)} - \alpha_1 \frac{\partial \text{Loss}}{\partial \tilde{\gamma}_j^{(m)}}\);

    • For \(k = 1\) to \(q\) do

      • Update estimates \(\tilde{\beta}_{kj}^{(m+1)} = \tilde{\beta}_{kj}^{(m)} - \alpha_1 \frac{\partial \text{Loss}}{\partial \tilde{\beta}_{kj}^{(m)}}\);

    • End for;

  • End for;

  • For \(k = 1\) to \(K\) do

    • Update \(\tilde{\omega}_k^{(m+1)} = \tilde{\omega}_k^{(m)} - \alpha_2 \frac{\partial \text{Loss}}{\partial \tilde{\omega}_k^{(m)}}\);

  • End for;

  • Update \(m = m + 1\);

  • Until convergence or :math:m reaches its maximum.

Extra: Sequence Data Processing

We extend our model to accommodate densely sampled discrete data.

For the \(i\)-th individual, suppose we obtained the densely measured observations \(\boldsymbol{\check{X}}_{i} = (X_{i}(t_{i1}), \ldots, X_{i}(t_{m_{i}}) )^{\top}\) at different physical positions \(\{ t_{i1}, \ldots, t_{im_i} \}\). Here, \(\boldsymbol{\check{X}}_i\) is considered a discrete realization of a smooth genetic function \(X_i(t)\), where \(t \in [0, T]\).

Using functional data analysis, we employ least squares-based smoothing techniques to estimate the function \(X_i(t)\). The function \(X_i(t)\) can be approximated as:

\[\hat{X}_i(t) = \check{\boldsymbol{X}}_i^\top \boldsymbol{\Omega}_i (\boldsymbol{\Omega}_i^\top \boldsymbol{\Omega}_i)^{-1} \boldsymbol{\phi}(t),\]

where \(\boldsymbol{\phi}(t) = (\phi_1(t), \ldots, \phi_{L_X}(t))^\top\) is a set of basis functions, such as B-splines, Fourier series, or wavelets. \(\Omega_{i}\) is an matrix where \(\Omega_{i}\) in the \(j\)-th row and \(l\)-th column is the value of \(\phi_{l}(t_{ij})\).

Then we can use this expansion and re-execute the process of the FuncGE model to obtain the final modeling results.

Previous: The ScalarGE Model | Next: Main Functions