Hierarchical Embedding

Overview

Typical implementation of a Machine Learning approaches for this use case encounters a high risk of overfitting. Hence regularization is a necessity to avoid such model behavior. Hierarchical Regularization is one such implementation in the model. The major advantage is its flexibility to learn parameters based on the input hierarchy levels each having their own trainable hierarchical embeddings. The prior being embedded is that similar granularity values learn from their shared hierarchy levels. This is synonymous to say vehicles within a parent vehicle may learn from their shared hierarchy.

Example: Consider two rows in State-Brand-Vehicle granularity say, 1. Texas (State) - Budweiser (Brand) - Facebook (Vehicle) 2. Texas (State) - Michelob Ultra (Brand) - Instagram(Vehicle).

If the input hierarchy level is just State column, both granularities 1 & 2 share a common hierarchy level ( State - Texas ). Hence the learned embedding of "Texas" State is shared to both these granularities.

Hierarchy Levels

Input hierarchy levels can be of following types: * Categorical columns. Ex: ["State"] or ["Brand"] or ["Parent Vehicle"] * Multi-level hierarchies of categorical columns. Ex: ["State", "Brand"] or ["State","Parent Vhicle"] * Continuous data columns. Ex: ["Population Density"]

An example of input hierarchy is,

hier_cols = [
                ["Vehicle"],                # Categorical
                ["State","Parent Vehicle"], # Multi-level
                ["Density"],                # Continuous
                ["Unemployment"]            # Continuous
            ]

Note: Combination of continuous data columns is not allowed.

Note: Depth of an hierarchy level is defined by the number of possible unique values in it. For multi-level hierarchies, it is number of combinations of unique values. Continuous data columns are assumed to have a depth of one.

Embeddings

Let \(H\) be the set of input hierarchies(all categorical) of length \(d\),

\[H = \{H_1,\ H_2,\dots H_d\}\]

Example:

# Example
Vehicle = ["Facebook","Twitter","Youtube"]
Brand = ["Budweiser", "Corona"]
State = ["California", "Texas"]

# Input Hierarchies
H = {"Vehicle","Brand","State"}

If \(|H_i|\) represent the number of unique values in the hierarchy level \(H_i\) with the corresponding embeddings dictionary \({W_{i}}\) whose keys are unique values of \({H_i}\) and values are zero-mean centered weights. Let \(W\) be the nested dictionary with keys are hierarchies \({H_i}\) and values are \({W_{i}}\),

\[W = \{H_1 : W_1,\ H_2 : W_2,\dots H_d : W_d\}\]

# Example
W = {
    "Vehicle" : {
        "Facebook" : -0.2,
        "Twitter"  : +0.5,
        "Youtube"  : -0.3
    },
    "Brand" : {
        "Budweiser": -0.8,
        "Corona"   : +0.8
    },
    "State" : {
        "California" : -0.2,
        "Texas"      : +0.2
    }
}

Then \(p\) be the model parameter for a particular granularity row \(g\) with corresponding set of hierarchy values \(\{h_1,\ h_2,\dots h_d\}\) is given by,

\[p=\sum_i W[H_i][h_i]\]

where \(h_i\ \epsilon\ H_i\) and \(i\ \epsilon\ (1,d)\).

# Example

# Granularity row
g = {
    "Vehicle": "Twitter",
    "Brand"  : "Budweiser",
    "State"  : "California"
}

# Parameter
p = W["Vehicle"]["Twitter"] + W["Brand"]["Budweiser"] + W["State"]["California"]
p = 0.5-0.8-0.2 = -0.5

If \(H_i\) is a continuous data column, then \({W_{i}}\) is chosen to have a single embedding \(m_i\). The parameter for the granularity \(g\) is given by \(p_g = m_i x_i\) where \(x_i\) is the value of continuous hierarchy column in granularity row \(g\).

In case of a combination of categorical and continuous hierarchies, the parameter \(p\) for the granularity \(g\) is given by,

\[p_g = \sum_{i\ \epsilon\ H_{cont}} m_i x_i + \sum_{i\ \epsilon\ H_{cat}} W[H_i][h_i] \]

where \(\{m_i\}\) are embeddings of continuous hierarchies \(\{H_{cont}\}\) and \(\{x_i\}\) are corresponding values of each continuous hierarchies for granularity \(g\).

If bias is added, the updated parameter is given by,

\[p_g = bias + \sum_{i\ \epsilon\ H_{cont}} m_i x_i + \sum_{i\ \epsilon\ H_{cat}} W[H_i][h_i]\]

# Example with Categorical and Continuous Hierarchies
# Categorical - Vehicle, Brand, State
# Continuous - Density, Unemployment
H = {"Vehicle","Brand","State","Density","Unemployment"}
bias = 0.3

# Embeddings - Categorical
W = {
    "Vehicle" : {
        "Facebook" : -0.2,
        "Twitter"  : +0.5,
        "Youtube"  : -0.3
    },
    "Brand" : {
        "Budweiser": -0.8,
        "Corona"   : +0.8
    },
    "State" : {
        "California" : -0.2,
        "Texas"      : +0.2
    }
}

# Embeddings - Continuous
M = {
    "Density" : 0.001,
    "Unemployment" : -0.2
}

# Granularity row
g = {
    "Vehicle": "Twitter",
    "Brand"  : "Budweiser",
    "State"  : "California",
    "Density": 250,
    "Unemployment" : 4.5
}

# Parameter
p = bias
    + M["Density"] * g["Density"]
    + M["Unemployment"] * g["Unemployment"]
    + W["Vehicle"]["Twitter"]
    + W["Brand"]["Budweiser"]
    + W["State"]["California"]
p = 0.3 + (0.001)(250) + (-0.2)(4.5) + 0.5 + (-0.8) + (-0.2)
p = -0.85

Regularization

Regularization in hierarchical embedding layer is performed using a penalty that the model must accept for the deviations in the trained embeddings.

For each hierarchy \(H_i\) defined above, we define a desired L2 norm as,

\[ DesiredDev_{H_i} = C_1 * \bigg(\frac{1}{\sqrt{|H_i|}}\bigg) * \bigg(\frac{1}{d}\bigg)\]

\[DesiredL2_{H_i} = (|H_i|-1)*DesiredDev_{H_i}^2\]

where \(|H_i|,\ d\) are the cardinalities of \(H_i\) and \(H\) respectively and \(C_1\) is a constant.

The actual L2 norm is calculated as,

\[L2_{H_i} = \sum_{h\ \epsilon\ H_i}\bigg(W[H_i][h_j]\bigg)^2\]

The final penalty added as \(HierLoss\) is given by,

\[CurrentRatio_{H_i} = \frac{L2_{H_i}}{(DesiredL2_{H_i}+\varepsilon)}\]

\[HierLoss = \sum_P \sum_H CurrentRatio_{H_i,p}^2\]

where \(P\) are the parameters and \(\varepsilon\) is a very small value.

If bias term is added to parameter, penalty related to bias is given by,

\[HierBiasLoss = C_2 * \sum_P bias_p^2\]

where \(C_2\) is a constant and \(P\) are parameters.