The FuncGE Model ========================= .. _funcgemodel-label: The FuncGE model is utilized to capture G-E interaction effects between functional genetic variables and scalar environmental variables. Notation 1 ----------- Consider a dataset with :math:`n` independent individuals. .. list-table:: :widths: 30 70 :header-rows: 1 :align: center * - Parameter - Notation * - Outcome variables - :math:`\boldsymbol{y} = (y_1, \ldots, y_n)^{\top}` * - Functional genetic variables - :math:`X_i(t)` where :math:`t \in \mathcal{T}`, with :math:`\mathcal{T} = [0,T]` * - Scalar environmental variables - :math:`\boldsymbol{Z}_i = \left(z_{i1}, \ldots, z_{iq} \right)^{\top}` * - Genetic effect function - :math:`\beta_0(t)` * - Interaction effect functions - :math:`\beta_k(t)` for :math:`k = 1, \ldots, q` Model --------- For continuous outcomes, we consider the following model from a functional data analysis perspective: .. math:: y_i=g\left(\sum_{k=1}^qz_{ik}\gamma_k+\int_0^TX_i(t)\beta_0(t)dt+\sum_{k=1}^qz_{ik}\int_0^TX_i(t)\beta_k(t)dt\right)+\varepsilon_i where :math:`\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_q)^{\top}` is the :math:`q`-dimensional vector of coefficients. The error term :math:`\epsilon_i` follows a normal distribution :math:`N(0, \sigma_e^2)` and is assumed to be independent of both :math:`\boldsymbol{Z}_i` and :math:`X_i(t)`. :math:`g` is the functional form. Spline regression --------------------- Based on spline regression techniques, we approximate :math:`\beta_k(t)` for :math:`k = 0, 1, \ldots, q` using a set of B-spline basis functions :math:`\boldsymbol{\psi}(t) = \left( \psi_1(t), \ldots, \psi_{L_\beta}(t) \right)^\top`. These basis functions are: - Knots: :math:`M_n` knots that are uniformly distributed. - Interval: :math:`\mathcal{T}`. - Degree :math:`d`. - :math:`L_\beta = M_n + d`. Thus, :math:`\beta_k(t)` can be approximated as: .. math:: \beta_k(t)=\boldsymbol{\psi}(t)^\top \boldsymbol{\theta}_k, \; k=0,1,\ldots,q where :math:`\boldsymbol{\theta}_{k} = (\theta_{k1}, \ldots, \theta_{kL_\beta})^{\top}` are the coefficients for the B-spline basis functions. Using this approximation, we can rewrite the model above: .. math:: y_{i} = g\left\{\sum_{k=1}^q z_{ik} \gamma_k + \left[\int_0^T X_i(t) \boldsymbol{\psi}(t)^\top \mathrm{d}t\right] \boldsymbol{\theta_0} + \sum_{k=1}^q z_{ik} \left[\left(\int_0^T X_i(t) \boldsymbol{\psi}(t)^\top \mathrm{d}t\right) \boldsymbol{\theta_k}\right]\right\} + \varepsilon_i Notation 2 ------------- .. list-table:: :widths: 30 70 :header-rows: 1 :align: center * - Parameter - Notation * - Matrix of basis coefficients for the functional genetic variables - :math:`\boldsymbol{U}=(\boldsymbol{U}_1, \ldots, \boldsymbol{U}_n)^{\top}` where :math:`\boldsymbol{U}_i=\left(U_{i1}, \ldots, U_{iL_\beta} \right)^{\top}` is computed as the integral :math:`U_{ij} = \int_0^T X_i(t)\psi_j(t)dt` * - G-E interaction variables - :math:`\boldsymbol{Q}=(\boldsymbol{Q}_1, \ldots, \boldsymbol{Q}_n)^\top` where :math:`\boldsymbol{Q}_i = \boldsymbol{Z}_i \otimes \boldsymbol{U}_i` * - Combination of G and G-E variables - :math:`\tilde{\boldsymbol{H}} = (\boldsymbol{U}, \boldsymbol{Q})` Thus, we define :math:`\tilde{\boldsymbol{W}} = \left(\boldsymbol{U}, \boldsymbol{Q}, \boldsymbol{Z} \right)` where :math:`\tilde{\boldsymbol{W}}_i` represents the variable set for the :math:`i`-th individual. Loss functions ----------------- With the definitions of variable set :math:`\tilde{\boldsymbol{W}}` in place, the loss function for continuous outcomes can be derived in a form similar to that described in :ref:`ScalarGE `. **Continuous layer** .. math:: \tilde{l}_{\text{continuous}}(\boldsymbol{\theta})=\frac1n\sum_{i=1}^n[y_i-g(\tilde{\boldsymbol{W}}_i\boldsymbol{\theta})]^2, where :math:`\boldsymbol{\theta}` represents the neural network weights. **Binary layer** .. math:: \tilde{l}_{\text{binary}}(\boldsymbol{\theta}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i\log(p_i) + (1 - y_i) \log (1 - p_i)-\tilde{\boldsymbol{W}}_i \boldsymbol{\theta} \right], where :math:`p_i = (1 + e^{-\tilde{\boldsymbol{W}}_i\boldsymbol{\theta}})^{-1}`. **Cox layer** .. math:: \tilde{l}_{\text{Cox}}(\boldsymbol{\theta})=-\sum_{i:\delta_{i}=1 }\biggl[h_{\boldsymbol{\theta}}(\tilde{\boldsymbol{W}}_i)-\log{\sum_{j\in R(T_{i})}e^{h_{\boldsymbol{\theta}}(\tilde{\boldsymbol{W}}_j)}}\biggr], where :math:`y_i` and :math:`c_i` represent the survival time and censoring time, :math:`T_i` is the minimum value of :math:`y_i` and :math:`c_i`, :math:`\delta_i` is an indicator variable that :math:`I(y_i \leq c_i)`. The at-risk set at time :math:`T_{i}` is :math:`R(T_i) = \{i' : T_{i'} \geq T_i\}`, and :math:`h_{\boldsymbol{\theta}}` is the prognostic index. Penalty approximation ------------------------- Define :math:`\tilde{\gamma}_{j}` as the main effect and :math:`\tilde{\beta}_{1j}, \ldots, \tilde{\beta}_{qj}` as the interaction effects. Then, the coefficient vector related to the corresponding G and G-E variables in :math:`\tilde{\boldsymbol{H}}` is denoted by :math:`\tilde{\boldsymbol{b}}_j = (\tilde{\gamma}_{j}, \tilde{\beta}_{1j}, \ldots, \tilde{\beta}_{qj})^{\top} = (\tilde{b}_{j1}, \ldots, \tilde{b}_{j(q+1)})^{\top}`, where :math:`j = 1, \ldots, p`. Let :math:`K` represent the total number of fully connected layers within the network architecture, with :math:`\tilde{\omega}_k` indicating the weight matrix allocated to the :math:`k`-th layer. The objective function can then be formulated as: .. math:: \tilde{L}(\boldsymbol{\theta}) = \tilde{l}_{\cdot}(\boldsymbol{\theta}) + \sum_{j=1}^p\rho(|| \tilde{\boldsymbol{b}}_j||;\lambda_1,s) + \sum_{j=1}^p \sum_{k=1}^q \rho(|\tilde{\beta}_{kj}|; \lambda_2, s) + \lambda \biggl( \sum_{k=1}^K||\tilde{\omega}_k||_F^2 \biggr), where :math:`\tilde{l}_{\cdot}(\boldsymbol{\theta})` is the corresponding loss function for various types of outcomes. Similarly, we utilize the same method as :ref:`ScalarGE ` to approximate penalties. We minimize the objective function, iterating the estimation process until the training converges and parameters stabilize folloing the algorithm below. Algorithm: training of FuncGE ------------------------------- **Input**: - Functional genetic variables :math:`\boldsymbol{X}(t)` or discrete sequence :math:`\check{\boldsymbol{X}}`; - Environmental variables :math:`\boldsymbol{Z}`; - Survival output :math:`(T,\delta)` or continuous outputs :math:`y` or binary output :math:`y`; - Learning rates of the sparse layer and the fully connected layers :math:`\{\alpha_1,\alpha_2\}`; - Tuning and regularization parameters of the MCP penalty :math:`\{\lambda_1, \lambda_2, s\}`; - Tuning parameter of the fully connected layers :math:`\lambda`. **Data pre-processing**: - For functional genetic input :math:`\boldsymbol{X}`, format :math:`\tilde{W} = (\boldsymbol{U}, \boldsymbol{Q}, \boldsymbol{Z})`; - For sequence genetic input :math:`\check{\boldsymbol{X}}`, format :math:`\tilde{\boldsymbol{W}} = (\tilde{\boldsymbol{U}}, \tilde{\boldsymbol{Q}}, \boldsymbol{Z})`. **Initialize**: - Sparse layer :math:`\tilde{\boldsymbol{b}}^{(0)}`, :math:`k`-th fully connected layer :math:`\tilde{\omega}_k^{(0)}`, :math:`m = 0`. **Repeat**: - Update the approximated MCP penalties with the current estimate :math:`\tilde{\boldsymbol{b}}^{(m)}`; - Update Loss :math:`= \tilde{l}(\boldsymbol{\theta}) + \text{approximated MCP penalties} + \lambda \sum_{k=1}^{K} \|\tilde{\omega}_{k}^{(m)}\|_F^2`; - Conduct back propagation, and obtain the gradients :math:`\frac{\partial \text{Loss}}{\partial \tilde{\boldsymbol{b}}_j^{(m)}}` and :math:`\frac{\partial \text{Loss}}{\partial \tilde{\omega}_k^{(m)}}`; - For :math:`j = 1` to :math:`p` do - Update estimates :math:`\tilde{\gamma}_j^{(m+1)} = \tilde{\gamma}_j^{(m)} - \alpha_1 \frac{\partial \text{Loss}}{\partial \tilde{\gamma}_j^{(m)}}`; - For :math:`k = 1` to :math:`q` do - Update estimates :math:`\tilde{\beta}_{kj}^{(m+1)} = \tilde{\beta}_{kj}^{(m)} - \alpha_1 \frac{\partial \text{Loss}}{\partial \tilde{\beta}_{kj}^{(m)}}`; - End for; - End for; - For :math:`k = 1` to :math:`K` do - Update :math:`\tilde{\omega}_k^{(m+1)} = \tilde{\omega}_k^{(m)} - \alpha_2 \frac{\partial \text{Loss}}{\partial \tilde{\omega}_k^{(m)}}`; - End for; - Update :math:`m = m + 1`; - Until convergence or :math:m reaches its maximum. Extra: Sequence Data Processing ------------------------------- We extend our model to accommodate densely sampled discrete data. For the :math:`i`-th individual, suppose we obtained the densely measured observations :math:`\boldsymbol{\check{X}}_{i} = (X_{i}(t_{i1}), \ldots, X_{i}(t_{m_{i}}) )^{\top}` at different physical positions :math:`\{ t_{i1}, \ldots, t_{im_i} \}`. Here, :math:`\boldsymbol{\check{X}}_i` is considered a discrete realization of a smooth genetic function :math:`X_i(t)`, where :math:`t \in [0, T]`. Using functional data analysis, we employ least squares-based smoothing techniques to estimate the function :math:`X_i(t)`. The function :math:`X_i(t)` can be approximated as: .. math:: \hat{X}_i(t) = \check{\boldsymbol{X}}_i^\top \boldsymbol{\Omega}_i (\boldsymbol{\Omega}_i^\top \boldsymbol{\Omega}_i)^{-1} \boldsymbol{\phi}(t), where :math:`\boldsymbol{\phi}(t) = (\phi_1(t), \ldots, \phi_{L_X}(t))^\top` is a set of basis functions, such as B-splines, Fourier series, or wavelets. :math:`\Omega_{i}` is an matrix where :math:`\Omega_{i}` in the :math:`j`-th row and :math:`l`-th column is the value of :math:`\phi_{l}(t_{ij})`. Then we can use this expansion and re-execute the process of the FuncGE model to obtain the final modeling results.