Hierarchical Regularization
The first prior we embed in our system is that "similar products behave in similar ways," where similar is defined as sharing elements of a hierarchy. This prior is especially helpful in a low data regime: if we model the latent structure of our system as a sum over hierarchical components, we effectively get to share data across products. We may never have spent less than $1M on Product A, but we did on Product B, and that can inform the learned parameters for Product A, especially if there is any hierarchical overlap between the two.
Let’s say we have \(m\) products, and we want to learn an \(n\)-dimensional vector for each product. One way to do this would be to define an \(m \times n\) matrix \(A\) such that \(A_{i,j}\) represents the \(j^{\text{th}}\) element of the vector for product \(i\).
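Laid out explicitly, \(A\) simply stacks the \(m\) product vectors as rows (this display just restates the definition above):

\[
A =
\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
A_{m,1} & A_{m,2} & \cdots & A_{m,n}
\end{bmatrix},
\]

where row \(i\) is the \(n\)-dimensional vector learned for product \(i\).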
However, we want to introduce a prior that a hierarchical structure influences the value of these vectors, such that products with a similar hierarchy are more likely to be represented in similar ways. Let \(H\) be our hierarchy with depth \(d\), such that \(|H[k]|\) denotes the number of unique elements at the \(k^{\text{th}}\) level of the hierarchy (\(|H[1]| = 1\) and \(|H[d]| = m\) is always true). An example of such a structure where \(d = 5\) and \(m = 9\) can be seen below.
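In code, such a hierarchy can be captured by recording, for each product, its group index at every level. The sizes and groupings below are hypothetical, chosen only to satisfy \(d = 5\), \(m = 9\), \(|H[1]| = 1\), and \(|H[5]| = 9\):

```python
import numpy as np

# Hypothetical hierarchy H with depth d = 5 over m = 9 products.
# H[k][i] is the (0-based) group index of product i at level k,
# so the number of unique values in H[k] is |H[k]|.
m, d = 9, 5
H = {
    1: np.zeros(m, dtype=int),                  # |H[1]| = 1 (single root)
    2: np.array([0, 0, 0, 0, 0, 1, 1, 1, 1]),   # |H[2]| = 2
    3: np.array([0, 0, 1, 1, 1, 2, 2, 3, 3]),   # |H[3]| = 4
    4: np.array([0, 1, 2, 2, 3, 4, 5, 6, 6]),   # |H[4]| = 7
    5: np.arange(m),                            # |H[5]| = m = 9 (one leaf per product)
}
```

Each level refines the one above it, so two products that share a group at level \(k\) also share groups at every level above \(k\).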
Let us refer to \(i_k \in [1, |H[k]|]\) as the index of the hierarchical element for product \(i\) at depth \(k\) of our hierarchy. We then create a collection of vectors \(V\) such that \(V[k] = \langle v_1, v_2, \cdots, v_{|H[k]|} \rangle\), where \(v_{i_k}\) corresponds to the value for product \(i\) at the \(k^{\text{th}}\) depth of the hierarchy. Now we can redefine \(A\) as a composition of hierarchical variables by creating \(n\) such collections of vectors \(V\).
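Making that composition explicit (the additive form follows the "sum over hierarchical components" framing above, and the subscript \(j\) labeling the \(n\) collections is introduced here for clarity), each entry of \(A\) is the sum of one component per level of the hierarchy:

\[
A_{i,j} \;=\; \sum_{k=1}^{d} \big( V_j[k] \big)_{i_k},
\]

where \(\big( V_j[k] \big)_{i_k}\) denotes the \(i_k^{\text{th}}\) element of \(V_j[k]\). Because shallower levels are shared by many products, their components are estimated from pooled data, which is exactly the data-sharing benefit described earlier.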
This is how we represent many of our learned parameters in our model. Additionally, we regularize each collection of vectors \(V\) to control the variance of our hierarchical representations.
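One natural choice, used in the sketch below, is a per-level L2 (Gaussian-prior) penalty; the exact penalty form and the level weights \(\lambda_k\) are assumptions made here for illustration:

\[
\mathcal{L}_{\text{reg}} \;=\; \sum_{j=1}^{n} \sum_{k=1}^{d} \lambda_k \, \lVert V_j[k] \rVert_2^2.
\]

A minimal NumPy sketch of the composition and this penalty, assuming the additive form and the hypothetical `H` encoding from earlier:

```python
import numpy as np

def build_A(V, H, m, n, d):
    """Compose the m x n parameter matrix A from hierarchical vectors.

    V[j][k] is a length-|H[k]| array of level-k components for dimension j;
    H[k][i] is product i's (0-based) group index at depth k.
    """
    A = np.zeros((m, n))
    for j in range(n):
        for k in range(1, d + 1):
            # Add each product's level-k component to its j-th entry.
            A[:, j] += V[j][k][H[k]]
    return A

def l2_penalty(V, lam, d):
    """Per-level L2 regularization: sum over j and k of lam[k] * ||V[j][k]||^2.

    Larger lam[k] at deeper levels shrinks product-specific components
    toward zero, keeping products close to their shared ancestors unless
    the data provides evidence otherwise.
    """
    return sum(
        lam[k] * np.sum(V_j[k] ** 2)
        for V_j in V
        for k in range(1, d + 1)
    )
```

The full training objective would then be the data-fitting loss plus `l2_penalty(V, lam, d)`, with each `V[j][k]` initialized as an array of length `len(np.unique(H[k]))`.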