Modelling Extreme Events
What is Matrix Factorization?
Matrix factorization is a machine learning approach that decomposes a matrix into two lower-dimensional rectangular matrices.
Commonly, if you are selling products to users, you could represent every affinity a user has for a product as a user X product matrix, which would be huge. Instead, via matrix factorization, you represent every user as a small vector and every product as a small vector, and model the affinity from a user to a product as the dot product between that user's vector and that product's vector.
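As a minimal sketch of that idea (the names and dimensions here are hypothetical, chosen just for illustration):

```python
import numpy as np

n_users, n_products, k = 10_000, 5_000, 32  # k-dimensional embeddings

# Instead of storing the full 10,000 x 5,000 affinity matrix...
user_vecs = np.random.randn(n_users, k)        # one small vector per user
product_vecs = np.random.randn(n_products, k)  # one small vector per product

# ...the affinity of user u for product p is just a dot product.
u, p = 42, 7
affinity = user_vecs[u] @ product_vecs[p]

# The full matrix, if you ever needed it, is the product of the two factors:
# affinities = user_vecs @ product_vecs.T   # shape (n_users, n_products)
```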
How do we model Extreme Events?
- We learn a custom factorization that represents brand x geo via two 1-D vectors, with a regularized additive full-matrix term to capture custom granular interactions. For the Bud Light crisis, we allow this granularity matrix to be positive or negative, representing the direction of the effect, and force the temporal axis to be non-negative, representing only the strength. For COVID we flip this: granularity is non-negative, representing only strength/sensitivity, over a signed (negative or non-negative) temporal representation that captures the different phases of COVID (see the first sketch after this list).
- To represent both the temporal uniqueness of the event and its decayed effect, we learn a temporal pattern over a number of months and then continue that pattern forward without further unique trends: COVID's impact decays towards 0, while the Bud Light impact just continues forward in perpetuity (at least that's how we model it for now). For the learned fluctuating period, we had two potential approaches and use only one of them. We could (but don't) allow the model to learn a single temporal behavior and just regularize its deviations and direction changes over time. Instead, we learn a bank of temporal patterns, and each granularity learns a softmax (a distributional selection) over which trends apply more to that granularity (see the second sketch after this list). This could also have been built directly as the matrix factorization representation of the output of compressing the brand and geo axes, but we haven't put in the work to implement it that way, and leave that representation as just a magnitude (and, for Bud Light, direction) learner.
- These temporal representations shouldn't overfit by being allowed to fluctuate too randomly. Overfitting can happen by the curve changing value too fast (so we regularize the 1st derivative) and by it changing direction up and down too fast (so we regularize the 2nd derivative). This allows the model to learn real changes, for instance a drop, but not overfit to noise like a drop that recovers immediately. Statistically, notice that a standard L2 regularization on the 1st derivative would take an instantaneous drop that stayed at the lower value and spread it out over a period instead of concentrating it in the week it occurred; we therefore L1-regularize the 1st derivative (but keep L2 on the 2nd derivative, due to the standard reasoning that a little directional change is fine but a ton of it needs much more data to support it). The third sketch below shows these penalties.
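To make the sign constraints concrete, here is a minimal sketch of one plausible parameterization (all names, shapes, and the use of softplus for non-negativity are illustrative assumptions, not the production model):

```python
import torch
import torch.nn.functional as F

n_brands, n_geos, n_weeks = 50, 200, 104  # hypothetical sizes

# Rank-1 brand x geo structure: two 1-D vectors.
brand_vec = torch.nn.Parameter(torch.randn(n_brands))
geo_vec = torch.nn.Parameter(torch.randn(n_geos))

# Regularized additive full-matrix term for granular interactions.
granularity_extra = torch.nn.Parameter(torch.zeros(n_brands, n_geos))

# Raw (unconstrained) temporal curve, one value per week.
temporal_raw = torch.nn.Parameter(torch.zeros(n_weeks))

def event_effect(event: str) -> torch.Tensor:
    granularity = brand_vec[:, None] * geo_vec[None, :] + granularity_extra
    if event == "bud_light":
        # Signed granularity carries the direction of the effect;
        # softplus forces the temporal axis to carry only strength.
        temporal = F.softplus(temporal_raw)
    else:  # "covid"
        # Non-negative granularity carries only strength/sensitivity;
        # the signed temporal curve carries the phases of the event.
        granularity = F.softplus(granularity)
        temporal = temporal_raw
    # Outer product -> one effect per (brand, geo, week).
    return granularity[:, :, None] * temporal[None, None, :]
```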
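Second, a sketch of the softmax selection over a bank of learned temporal patterns, plus the forward continuation (the geometric decay rate and window sizes are hypothetical):

```python
import torch

n_patterns, n_learned_weeks, n_future_weeks = 8, 52, 52
n_granularities = 50 * 200  # flattened brand x geo cells

# A small bank of temporal patterns shared across granularities.
patterns = torch.nn.Parameter(torch.randn(n_patterns, n_learned_weeks))

# Per-granularity logits over which patterns apply to that granularity.
mix_logits = torch.nn.Parameter(torch.zeros(n_granularities, n_patterns))

def temporal_curves(event: str) -> torch.Tensor:
    weights = torch.softmax(mix_logits, dim=-1)   # distributional selection
    learned = weights @ patterns                  # (grans, learned_weeks)
    last = learned[:, -1:]                        # level at end of the window
    if event == "covid":
        # Impact decays towards 0 after the learned window.
        steps = torch.arange(1, n_future_weeks + 1, dtype=torch.float32)
        future = last * 0.9 ** steps
    else:  # "bud_light": continues forward in perpetuity (for now).
        future = last.expand(-1, n_future_weeks)
    return torch.cat([learned, future], dim=-1)   # (grans, all weeks)
```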
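Finally, a sketch of the derivative penalties: L1 on the first difference, so a sustained drop is attributed to the single week it occurred rather than smeared across many, and L2 on the second difference, so large direction changes get quadratically expensive (the lambda weights are hypothetical):

```python
import torch

def smoothness_penalty(curve: torch.Tensor,
                       lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    """curve: (..., n_weeks) learned temporal pattern."""
    d1 = torch.diff(curve, n=1, dim=-1)  # week-to-week change (1st derivative)
    d2 = torch.diff(curve, n=2, dim=-1)  # change of change (2nd derivative)
    # L1 on the 1st derivative: sparse changes, so an instantaneous drop
    # lands in the single week it happened instead of being spread out.
    # L2 on the 2nd derivative: a little directional change is cheap,
    # a ton of it needs much more data to support.
    return lam1 * d1.abs().sum() + lam2 * d2.pow(2).sum()
```

In training, this penalty would simply be added to the data-fit loss for each learned temporal pattern.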