Matrices and Derivatives: Solving Some Problems

By Behrooz Vedadian on Jan 29th, 2019

NOTE: This article has been translated from Farsi using llama3-70b-8192 and groq

Derivative of Correlation Matrix Estimation and Neural Network Weights

The main use of taking derivatives is finding extremum points; the route from derivative to extremum, however, can be either an iterative optimization method or a closed-form solution of the resulting equations. To illustrate both, we will first estimate the correlation matrix of a Gaussian distribution and then compute the gradient of an MLP (Multi-Layer Perceptron) with respect to its weights.

Estimating Correlation Matrix

The multivariate Gaussian distribution is formulated as follows:

$$p(x|\mu,\Sigma) = \det(2\pi\Sigma)^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
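As a quick numerical sanity check (a sketch, not part of the original derivation), the snippet below evaluates this density straight from the formula for made-up values of $\mu$, $\Sigma$ and $x$, and compares it with `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters for a 3-dimensional Gaussian
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
x = np.array([0.8, -1.5, 0.0])

# Density evaluated straight from the formula above
d = x - mu
p_formula = (np.linalg.det(2 * np.pi * Sigma) ** -0.5
             * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d))

# Reference value from scipy
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

print(p_formula, p_scipy)          # the two values should agree
assert np.isclose(p_formula, p_scipy)
```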

Now, if we have samples $x_1$ to $x_n$ drawn from this distribution, which values of $\mu$ and $\Sigma$ maximize the likelihood of these samples?

Assuming independence among samples, the likelihood of all samples becomes:

$$p(x_1,x_2,\dots,x_n|\mu,\Sigma) = \prod_i p(x_i|\mu,\Sigma)$$

$$\log\left(p(x_1,x_2,\dots,x_n|\mu,\Sigma)\right) = \sum_i \log\left(p(x_i|\mu,\Sigma)\right)$$

Simplifying the equation, we get:

$$\log\left(p(x_1,x_2,\dots,x_n|\mu,\Sigma)\right) = -\frac{n}{2}\log\left(2^m\pi^m\right) - \frac{n}{2}\log\left(\det(\Sigma)\right) - \frac{1}{2}\sum_i (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)$$

Here $m$ denotes the dimension of the samples $x_i$.
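The same kind of check works for the full log-likelihood. In this sketch the parameters and samples are again made up, and `np.einsum` is simply one way to accumulate the quadratic terms:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, n = 3, 500                                   # dimension and sample count (arbitrary)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(mu, Sigma, size=n)  # rows are the samples x_1 ... x_n

# Log-likelihood from the simplified expression above
D = X - mu
Sinv = np.linalg.inv(Sigma)
loglik = (-n / 2 * m * np.log(2 * np.pi)
          - n / 2 * np.log(np.linalg.det(Sigma))
          - 0.5 * np.einsum('ij,jk,ik->', D, Sinv, D))

# Reference: sum of per-sample log-densities
loglik_ref = multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()
assert np.isclose(loglik, loglik_ref)
```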

To find the maximizing parameters, we denote this log-likelihood by $y$ and set its derivatives with respect to $\mu$ and $\Sigma$ to zero.

For the derivative with respect to $\mu$:

$$\frac{\partial y}{\partial \mu} = \frac{1}{2}\left(\Sigma^{-1}+\Sigma^{-T}\right)\left(\left(\sum_i x_i\right) - n\mu\right) = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_i x_i$$
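A small numerical illustration (with arbitrary data and an arbitrary fixed positive-definite $\Sigma$) shows that this gradient indeed vanishes at the sample mean and is clearly non-zero once $\mu$ is perturbed:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=200)
n = len(X)
Sigma = np.cov(X, rowvar=False)                # any fixed positive-definite Sigma works here
Sinv = np.linalg.inv(Sigma)

def grad_mu(mu):
    # 1/2 (Sigma^{-1} + Sigma^{-T}) (sum_i x_i - n mu), as derived above
    return 0.5 * (Sinv + Sinv.T) @ (X.sum(axis=0) - n * mu)

mu_hat = X.mean(axis=0)                        # the closed-form solution
print(grad_mu(mu_hat))                         # ~ zero vector
print(grad_mu(mu_hat + 0.1))                   # clearly non-zero away from the optimum
```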

Next, the derivative with respect to $\Sigma$, taken through $\Sigma^{-1}$:

$$\frac{\partial y}{\partial \Sigma} = \frac{\partial y}{\partial \Sigma^{-1}}\frac{\partial \Sigma^{-1}}{\partial \Sigma} = \left(\frac{n}{2}\mathrm{vec}^T\left(\Sigma^{T}\right) - \frac{1}{2}\sum_i\left((x_i-\mu)^T\otimes(x_i-\mu)^T\right)\right)\left(-\Sigma^{-T}\otimes\Sigma^{-1}\right) = 0$$

Simplifying, we get:

$$\mathrm{vec}^T\left(\Sigma^T\right) = \frac{1}{n}\sum_i\left((x_i-\mu)\otimes(x_i-\mu)\right)^T$$

Using the relationship $\mathrm{vec}(AXB) = (B^T\otimes A)\,\mathrm{vec}(X)$, which for the rank-one case gives $(x_i-\mu)\otimes(x_i-\mu) = \mathrm{vec}\left((x_i-\mu)(x_i-\mu)^T\right)$, we obtain:

$$\mathrm{vec}^T\left(\Sigma^T\right) = \frac{1}{n}\sum_i\mathrm{vec}^T\left((x_i-\mu)(x_i-\mu)^T\right) \;\Rightarrow\; \Sigma = \Sigma^T = \frac{1}{n}\sum_i(x_i-\mu)(x_i-\mu)^T$$
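Both ingredients of this last step can be checked numerically. The sketch below verifies the vec identity with random matrices (using the column-major vec that the identity assumes) and confirms that the $1/n$ sample covariance is not beaten by nearby perturbations of $\Sigma$; all data here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1) The identity vec(AXB) = (B^T kron A) vec(X), with column-major vec
A, X_, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 2))
vec = lambda M: M.flatten(order='F')
assert np.allclose(vec(A @ X_ @ B), np.kron(B.T, A) @ vec(X_))

# 2) The ML estimate of Sigma is the (1/n) sample covariance
X = rng.multivariate_normal([1.0, -2.0, 0.5],
                            [[2.0, 0.3, 0.0],
                             [0.3, 1.0, 0.2],
                             [0.0, 0.2, 0.5]], size=1000)
n = len(X)
mu_hat = X.mean(axis=0)
D = X - mu_hat
Sigma_hat = D.T @ D / n                        # (1/n) sum_i (x_i - mu)(x_i - mu)^T

def loglik(Sigma):
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->', D, np.linalg.inv(Sigma), D)
    return -n / 2 * logdet - 0.5 * quad        # constant term dropped

# Any positive-definite perturbation of Sigma_hat should not do better
assert loglik(Sigma_hat) >= loglik(Sigma_hat + 0.05 * np.eye(3))
assert loglik(Sigma_hat) >= loglik(1.2 * Sigma_hat)
```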

Derivative of MLP Weights

The layers of an MLP can be represented as:

$$y_{n+1} = \phi\left(W_n y_n + b_n\right)$$
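Read as code, this recursion is just a loop over layers. Here is a minimal numpy sketch with tanh standing in for $\phi$ and arbitrary layer sizes and weights:

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.tanh                                   # activation function

# Layer sizes 4 -> 5 -> 3 (arbitrary); W_n, b_n map y_n to y_{n+1}
sizes = [4, 5, 3]
Ws = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
bs = [rng.normal(size=sizes[k + 1]) for k in range(len(sizes) - 1)]

def forward(y0):
    ys = [y0]
    for W, b in zip(Ws, bs):
        ys.append(phi(W @ ys[-1] + b))          # y_{n+1} = phi(W_n y_n + b_n)
    return ys                                   # all layer outputs y_0 ... y_m

ys = forward(rng.normal(size=sizes[0]))
print(ys[-1])                                   # the network output y_m
```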

Assuming we have an MLP with $m$ such layers, training compares the output $y_m$ of the last layer with the desired output $y$ and computes an error. Let's assume we use the MSE (Mean Squared Error) criterion:

$$j(W_0,b_0,W_1,b_1,\dots,W_{m-1},b_{m-1}) = \frac{1}{2}\|y_m - y\|^2 = \frac{1}{2}(y_m - y)^T(y_m - y)$$

Using the machinery we built for derivatives, we calculate $\frac{\partial j}{\partial W_i}$:

$$\frac{\partial j}{\partial W_i} = \frac{\partial j}{\partial y_m}\left(\prod_{j=m-1}^{i+1}\frac{\partial y_{j+1}}{\partial y_j}\right)\frac{\partial y_{i+1}}{\partial W_i}$$

Breaking down each component:

$$\frac{\partial j}{\partial y_m} = (y_m - y)^T$$

$$\frac{\partial y_{j+1}}{\partial y_j} = \operatorname{diag}\left(\phi'(W_j y_j + b_j)\right)W_j$$

$$\frac{\partial y_{i+1}}{\partial W_i} = \operatorname{diag}\left(\phi'(W_i y_i + b_i)\right)\left(y_i^T\otimes I\right)$$
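The middle factor is the easiest to get wrong, so here is a finite-difference check of $\frac{\partial y_{j+1}}{\partial y_j}$ for a single made-up layer with $\phi = \tanh$ (so $\phi' = 1 - \tanh^2$):

```python
import numpy as np

rng = np.random.default_rng(4)
phi, dphi = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2   # tanh and its derivative

W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
y = rng.normal(size=4)

# Analytic Jacobian of y_{j+1} = phi(W y_j + b) with respect to y_j
J = np.diag(dphi(W @ y + b)) @ W

# Finite-difference approximation, column by column
eps = 1e-6
J_fd = np.empty_like(J)
for k in range(len(y)):
    e = np.zeros_like(y)
    e[k] = eps
    J_fd[:, k] = (phi(W @ (y + e) + b) - phi(W @ (y - e) + b)) / (2 * eps)

assert np.allclose(J, J_fd, atol=1e-6)
```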

Substituting these back in:

$$\frac{\partial j}{\partial W_i} = (y_m - y)^T\left(\prod_{j=m-1}^{i+1}\operatorname{diag}\left(\phi'(W_j y_j + b_j)\right)W_j\right)\operatorname{diag}\left(\phi'(W_i y_i + b_i)\right)\left(y_i^T\otimes I\right)$$

The row vector multiplying $\left(y_i^T\otimes I\right)$ can be built recursively. Writing each $\operatorname{diag}\left(\phi'(\cdot)\right)v$ product as a Hadamard (element-wise) product, define

$$e_{m-1} = \phi'(W_{m-1}y_{m-1}+b_{m-1})\odot(y_m - y), \qquad e_i = \phi'(W_i y_i + b_i)\odot\left(W_{i+1}^T e_{i+1}\right) \quad (i < m-1).$$

Then the whole expression collapses to

$$\frac{\partial j}{\partial W_i} = e_i^T\left(y_i^T\otimes I\right) = \mathrm{vec}^T\left(e_i y_i^T\right),$$

which is the familiar backpropagation form of the gradient.
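To close the loop, the sketch below implements this $e_i$ recursion for a made-up two-layer tanh network and checks the resulting gradient $e_i y_i^T$ against finite differences of $j$ (all names and sizes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
phi, dphi = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

sizes = [4, 5, 3]                               # a small two-layer MLP (m = 2)
Ws = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(2)]
bs = [rng.normal(size=sizes[k + 1]) for k in range(2)]
y0 = rng.normal(size=sizes[0])
y_target = rng.normal(size=sizes[-1])           # the desired output y

def forward(Ws):
    ys = [y0]
    for W, b in zip(Ws, bs):
        ys.append(phi(W @ ys[-1] + b))
    return ys

def cost(Ws):
    r = forward(Ws)[-1] - y_target
    return 0.5 * r @ r                          # j = 1/2 ||y_m - y||^2

# Backpropagated errors: e_{m-1} = phi' ⊙ (y_m - y), then e_i = phi' ⊙ (W_{i+1}^T e_{i+1})
ys = forward(Ws)
e = dphi(Ws[1] @ ys[1] + bs[1]) * (ys[2] - y_target)
grads = [None, np.outer(e, ys[1])]              # dj/dW_1 = e_1 y_1^T (matrix form)
e = dphi(Ws[0] @ ys[0] + bs[0]) * (Ws[1].T @ e)
grads[0] = np.outer(e, ys[0])                   # dj/dW_0 = e_0 y_0^T

# Finite-difference check of dj/dW_0
i, eps = 0, 1e-6
G_fd = np.empty_like(Ws[i])
for r in range(Ws[i].shape[0]):
    for c in range(Ws[i].shape[1]):
        Wp = [W.copy() for W in Ws]; Wp[i][r, c] += eps
        Wm = [W.copy() for W in Ws]; Wm[i][r, c] -= eps
        G_fd[r, c] = (cost(Wp) - cost(Wm)) / (2 * eps)
assert np.allclose(grads[i], G_fd, atol=1e-5)
```

The gradient is kept here as the matrix $e_i y_i^T$; flattening it gives exactly $\mathrm{vec}^T\left(e_i y_i^T\right)$ from the derivation above.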

In conclusion, both problems are handled by the same matrix-derivative machinery: the correlation matrix estimate follows from setting the derivative of the log-likelihood to zero in closed form, while the MLP weight gradients follow from the derivative of the error criterion and are used in iterative optimization.