State Space Model

We have \(T, G, N, M \in \mathbb N\), which are respectively the number of time points, the number of time series to forecast, the number of environment signals, and the number of hidden state dimensions. The target is a provided set of time series \(Y \in \mathbb{R}^{T\times G}\). The environment is a provided set of time series \(X \in \mathbb R^{T \times N}\). The hidden state is a set of calculated time series \(H \in \mathbb R^{T \times M}\). We will use \(\tilde{Y}\) and \(\hat{Y}\) to denote predictions, and lowercase versions of a time series variable to denote an element of that time series.

A general state space model with a lookback of 1 is a pair of functions \(f, g\) such that

\[H_t = g(X_t, Y_{t-1}, H_{t-1}), \qquad \hat{Y}_t = f(X_t, H_t).\]

We will note that you could define \(g\) such that \(H_t\) stores some fixed number of previous \(Y\) and \(X\) values, but since \(H_t\) is of fixed size, that number is finite and fixed in the definition. By adjusting which \(Y\) values \(H_t\) gets access to through \(g\), you can define different kinds of lookback. We will introduce two specific classes of state space model and define a combined model based on a relation between them. Note that different models don't necessarily have the same number of hidden state dimensions, even if they have the same target and environment time series.

First we have a class of state space models we will call Backwards-Forwards Networks, or BFNets. Writing \(f(x)\) for the partially applied function \(h \mapsto f(x, h)\), a BFNet has its \(f, g\) defined by

\[H_t = g(X_t)(Y_t), \qquad \hat{Y}_t = f(X_t)(H_t),\]

with the final property being that \(g(x)\) is a left inverse of \(f(x)\), i.e. \(g(x) \circ f(x) = \mathrm{id}\).

The second class, called Temporal Networks or TemporalNets, have \(f, g\) as follows:

\[H_t = g(X_t)(H_{t-1}), \qquad \hat{Y}_t = f(X_t)(H_t).\]

Note that this definition allows defining a \(g^*(\{X_0,X_1,\dots,X_t\}, H_0) = H_t\) which inductively calculates \(H_t\) by iterating the original \(g\). Now let us denote \(g^*(X, H_0) = H\), so if we treat \(f\) as being able to handle batches we get \(\hat{Y} = f(X, g^*(X, H_0))\).

Now let us consider a combined model that shares weights between the two sub-models. Call this an ....
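The two rollouts above can be sketched concretely. This is a minimal illustration, assuming numpy; the function names `rollout_temporalnet` and `rollout_bfnet` are our own, and `f`, `g` are passed in uncurried as `f(x, h)` and `g(x, ...)`:

```python
import numpy as np

def rollout_temporalnet(f, g, X, h0):
    """TemporalNet: iterate g over the environment signals to build the
    hidden states (this is the g* of the text), then apply f at each step.

    f(x, h) -> prediction in R^G; g(x, h) -> next hidden state in R^M.
    """
    H, h = [], h0
    for x in X:                  # X has shape (T, N)
        h = g(x, h)              # H_t = g(X_t)(H_{t-1})
        H.append(h)
    H = np.stack(H)              # shape (T, M)
    Y_hat = np.stack([f(x, h) for x, h in zip(X, H)])
    return Y_hat, H

def rollout_bfnet(f, g, X, Y):
    """BFNet: run g backwards from the observed targets to get hidden
    states, then run f forwards to predict. g(x, .) is assumed to be a
    left inverse of f(x, .)."""
    H = np.stack([g(x, y) for x, y in zip(X, Y)])      # H_t = g(X_t)(Y_t)
    Y_hat = np.stack([f(x, h) for x, h in zip(X, H)])  # Y_hat_t = f(X_t)(H_t)
    return Y_hat, H
```

With a linear `g` and an identity-like `f`, the TemporalNet reduces to an exponential moving average of the environment, which makes the `g*` iteration easy to check by hand.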
It starts with the defining functions of the models \(f: \mathbb R^N \to \mathbb R^M \to \mathbb R^G\), \(\tilde{f}: \mathbb R^N \to \mathbb R^{\tilde N} \to \mathbb R^G\), \(g: \mathbb R^N \to \mathbb R^M \to \mathbb R^M\), \(\tilde{g}: \mathbb R^N \to \mathbb R^G \to \mathbb R^{\tilde N}\) (environment applied first, so \(f(x)\) denotes partial application), with the following properties:

\[H_t = g(X_t)(H_{t-1}), \qquad \hat{Y}_t = f(X_t)(H_t)\]

from the definition of a TemporalNet, and

\[\tilde{H}_t = \tilde{g}(X_t)(Y_t), \qquad \tilde{g}(x) \circ \tilde{f}(x) = \mathrm{id}\]

from the definition of a BFNet. Then we introduce a property to tie the two models together. We define a new function \(u: \mathbb R^N \to \mathbb R^M \to \mathbb R^{\tilde N}\) that transforms the hidden state of the TemporalNet to that of the BFNet, such that \(f(x) = \tilde{f}(x) \circ u(x)\). With a final abuse of notation to vectorize our functions we get

\[\hat{Y} = \tilde{f}(X, u(X, g^*(X, H_0)))\]

and

\[\tilde{Y} = \tilde{f}(X, \tilde{g}(X, Y)).\]

Let our hidden state \(\tilde{h} \in \mathbb R^G\) be the baseline for every granularity.

\(g\) is simple to implement, but \(u(\cdot) \circ g^*\) requires a little more thought. We'll start by defining an "inner" network. This network will run all the layers to get impacts, but will not have a baseline. Behavior then differs between BFNet and TemporalNet. BFNet will apply the inverse_impact of each layer in reverse order to the actual sales to get its hidden state. TemporalNet will create a baseline layer and use that. It will then calculate betagamma decay, calculate the effective change to the baseline due to that decay, and apply the change to its baseline. Once that is done, both will apply the impacts in order to their baseline to get a final prediction.

Now, to calculate the impact of decay on the baseline, we first need to work out the impact before the ROI curve is applied.
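The shared forward pass described above can be sketched as follows. This is a hypothetical sketch, not the actual implementation: `ImpactLayer`, `bfnet_hidden_state`, and `temporalnet_predict` are illustrative names, the layers are reduced to plain multiplicative impacts, and numpy is assumed:

```python
import numpy as np

class ImpactLayer:
    """Hypothetical layer of the "inner" network: impact() multiplies the
    running baseline, inverse_impact() undoes that multiplication."""
    def __init__(self, mult):
        self.mult = np.asarray(mult, dtype=float)
    def impact(self, base):
        return base * self.mult
    def inverse_impact(self, value):
        return value / self.mult

def bfnet_hidden_state(layers, sales):
    """BFNet: apply each layer's inverse_impact in reverse order to the
    actual sales, recovering the baseline it uses as hidden state."""
    base = np.asarray(sales, dtype=float)
    for layer in reversed(layers):
        base = layer.inverse_impact(base)
    return base

def temporalnet_predict(layers, baseline, decay_delta):
    """TemporalNet: start from its own baseline layer, apply the effective
    additive change due to betagamma decay, then apply the impacts in
    order (the same final step BFNet performs on its baseline)."""
    base = np.asarray(baseline, dtype=float) + decay_delta
    for layer in layers:
        base = layer.impact(base)
    return base
```

Because each `impact` is invertible, running `bfnet_hidden_state` on a prediction produced by `temporalnet_predict` recovers the decayed baseline, which is the left-inverse property restated in implementation terms.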
\(P\) is the pricing impact, \(B\) is the brand mixed effect impact, \(G\) is the global mixed effect impact, \(L\) is the baseline from the baseline layer, \(D^\dagger\) is the effect of decay on the baseline, \(M\) calculates the ROI mult, \(S\) is the profit or sales track, \(W\) is the rolling window mult, \(R\) is the impact from the ROI curve, and \(D\) is the betagamma impact (when it is non-zero, \(D^\dagger\) is zero). Setting the prediction that applies the betagamma impact multiplicatively equal to the prediction that applies the decay additively to the baseline gives

\[P \cdot B \cdot G \cdot (L + D^\dagger) + R = P \cdot B \cdot G \cdot (L \cdot D) + R.\]

We can subtract the identical ROI terms on each side to get

\[P \cdot B \cdot G \cdot (L + D^\dagger) = P \cdot B \cdot G \cdot (L \cdot D).\]

Now we can divide by \(P \cdot B \cdot G\) to get

\[L + D^\dagger = L \cdot D.\]

Subtract off the \(L\) from both sides and we end up with

\[D^\dagger = L \cdot (D - 1),\]

the additive impact of the betagamma decay on the baseline.
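Under the assumption that the betagamma impact enters multiplicatively as \(L \cdot D\) while the decay effect enters additively as \(L + D^\dagger\), the two formulations agree exactly when \(D^\dagger = L \cdot (D - 1)\). A quick numeric sanity check (all values arbitrary):

```python
# P, B, G: pricing / brand / global mixed effect impacts
# L: baseline, D: betagamma mult, R: additive ROI-curve term
P, B, G = 1.2, 0.9, 1.1
L, D, R = 100.0, 0.8, 5.0

D_dag = L * (D - 1)  # additive impact of betagamma decay on the baseline

lhs = P * B * G * (L + D_dag) + R   # decay applied additively to baseline
rhs = P * B * G * (L * D) + R       # betagamma impact applied multiplicatively
assert abs(lhs - rhs) < 1e-9
```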