Hierarchical Regularization
The first prior we embed in our system is that "similar products behave in similar ways," where similar is defined as sharing elements of a hierarchy. This prior is especially helpful in a low data regime: if we model the latent structure of our system as a sum over hierarchical components, we effectively get to share data across products. We may never have spent less than $1M on Product A, but we did on Product B, and that can inform the learned parameters for Product A, especially if there is any hierarchical overlap between the two.
Let’s say we have \(m\) products, and we want to learn an \(n\)-dimensional vector for each product. One way to do this would be to define an \(m \times n\) matrix \(A\) such that \(A_{i,j}\) represents the \(j^{\text{th}}\) element of the vector for product \(i\).
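Laid out explicitly, \(A\) simply stacks the \(m\) product vectors as rows (this display just restates the definition above):

\[
A =
\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
A_{m,1} & A_{m,2} & \cdots & A_{m,n}
\end{bmatrix},
\]

where row \(i\) is the \(n\)-dimensional vector learned for product \(i\).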
However, we want to introduce a prior that a hierarchical structure influences the value of these vectors, such that products with a similar hierarchy are more likely to be represented in similar ways. Let \(H\) be our hierarchy with depth \(d\), such that \(|H[k]|\) denotes the number of unique elements at the \(k^{\text{th}}\) level of the hierarchy (\(|H[1]| = 1\) and \(|H[d]| = m\) is always true). An example of such a structure where \(d = 5\) and \(m = 9\) can be seen below.
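In code, such a hierarchy can be captured by recording, for each product, its group index at every level. The sizes and groupings below are hypothetical, chosen only to satisfy \(d = 5\), \(m = 9\), \(|H[1]| = 1\), and \(|H[5]| = 9\):

```python
import numpy as np

# Hypothetical hierarchy H with depth d = 5 over m = 9 products.
# H[k][i] is the (0-based) group index of product i at level k,
# so the number of unique values in H[k] is |H[k]|.
m, d = 9, 5
H = {
    1: np.zeros(m, dtype=int),                  # |H[1]| = 1 (single root)
    2: np.array([0, 0, 0, 0, 0, 1, 1, 1, 1]),   # |H[2]| = 2
    3: np.array([0, 0, 1, 1, 1, 2, 2, 3, 3]),   # |H[3]| = 4
    4: np.array([0, 1, 2, 2, 3, 4, 5, 6, 6]),   # |H[4]| = 7
    5: np.arange(m),                            # |H[5]| = m = 9 (one leaf per product)
}
```

Each level refines the one above it, so two products that share a group at level \(k\) also share groups at every level above \(k\).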
Let us refer to \(i_k \in [1, |H[k]|]\) as the index of the hierarchical element for product \(i\) at depth \(k\) of our hierarchy. We then create a collection of vectors \(V\) such that \(V[k] = \langle v_1, v_2, \cdots, v_{|H[k]|} \rangle\), where \(v_{i_k}\) corresponds to the value for product \(i\) at the \(k^{\text{th}}\) depth of the hierarchy. Now we can redefine \(A\) as a composition of hierarchical variables by creating \(n\) such collections of vectors \(V\).
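Making that composition explicit (the additive form follows the "sum over hierarchical components" framing above, and the subscript \(j\) labeling the \(n\) collections is introduced here for clarity), each entry of \(A\) is the sum of one component per level of the hierarchy:

\[
A_{i,j} \;=\; \sum_{k=1}^{d} \big( V_j[k] \big)_{i_k},
\]

where \(\big( V_j[k] \big)_{i_k}\) denotes the \(i_k^{\text{th}}\) element of \(V_j[k]\). Because shallower levels are shared by many products, their components are estimated from pooled data, which is exactly the data-sharing benefit described earlier.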
This is how we represent many of our learned parameters in our model. Additionally, we regularize each collection of vectors \(V\) to control the variance of our hierarchical representations.
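One natural choice, used in the sketch below, is a per-level L2 (Gaussian-prior) penalty; the exact penalty form and the level weights \(\lambda_k\) are assumptions made here for illustration:

\[
\mathcal{L}_{\text{reg}} \;=\; \sum_{j=1}^{n} \sum_{k=1}^{d} \lambda_k \, \lVert V_j[k] \rVert_2^2.
\]

A minimal NumPy sketch of the composition and this penalty, assuming the additive form and the hypothetical `H` encoding from earlier:

```python
import numpy as np

def build_A(V, H, m, n, d):
    """Compose the m x n parameter matrix A from hierarchical vectors.

    V[j][k] is a length-|H[k]| array of level-k components for dimension j;
    H[k][i] is product i's (0-based) group index at depth k.
    """
    A = np.zeros((m, n))
    for j in range(n):
        for k in range(1, d + 1):
            # Add each product's level-k component to its j-th entry.
            A[:, j] += V[j][k][H[k]]
    return A

def l2_penalty(V, lam, d):
    """Per-level L2 regularization: sum over j and k of lam[k] * ||V[j][k]||^2.

    Larger lam[k] at deeper levels shrinks product-specific components
    toward zero, keeping products close to their shared ancestors unless
    the data provides evidence otherwise.
    """
    return sum(
        lam[k] * np.sum(V_j[k] ** 2)
        for V_j in V
        for k in range(1, d + 1)
    )
```

The full training objective would then be the data-fitting loss plus `l2_penalty(V, lam, d)`, with each `V[j][k]` initialized as an array of length `len(np.unique(H[k]))`.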