Matrices and Derivatives: Solving Some Problems

By Behrooz Vedadian on Jan 29th, 2019

NOTE: This article has been translated from Farsi using llama3-70b-8192 and groq

Derivative of Correlation Matrix Estimation and Neural Network Weights

The main use of taking derivatives is finding extremum points; the route from derivative to extremum, however, can be either an iterative optimization method or a closed-form solution of the resulting equations. To illustrate both, we will first estimate the correlation matrix of a Gaussian distribution and then compute the gradient of an MLP (Multi-Layer Perceptron) with respect to its weights.

Estimating Correlation Matrix

The multivariate Gaussian distribution is formulated as follows:

$$p(x|\mu,\Sigma) = \det(2\pi\Sigma)^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
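As a quick numerical sanity check (a sketch, not part of the original derivation), the snippet below evaluates this density straight from the formula for made-up values of $\mu$, $\Sigma$ and $x$, and compares it with `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters for a 3-dimensional Gaussian
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
x = np.array([0.8, -1.5, 0.0])

# Density evaluated straight from the formula above
d = x - mu
p_formula = (np.linalg.det(2 * np.pi * Sigma) ** -0.5
             * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d))

# Reference value from scipy
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

print(p_formula, p_scipy)          # the two values should agree
assert np.isclose(p_formula, p_scipy)
```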

Now, if we have samples $x_1$ to $x_n$ drawn from this distribution, which values of $\mu$ and $\Sigma$ maximize the likelihood of these samples?

Assuming independence among samples, the likelihood of all samples becomes:

$$p(x_1,x_2,\dots,x_n|\mu,\Sigma) = \prod_i p(x_i|\mu,\Sigma)$$

$$\log\left(p(x_1,x_2,\dots,x_n|\mu,\Sigma)\right) = \sum_i \log\left(p(x_i|\mu,\Sigma)\right)$$

Simplifying the equation, we get:

$$\log\left(p(x_1,x_2,\dots,x_n|\mu,\Sigma)\right) = -\frac{n}{2}\log\left(2^m\pi^m\right) - \frac{n}{2}\log\left(\det(\Sigma)\right) - \frac{1}{2}\sum_i (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)$$

Here $m$ denotes the dimension of the samples $x_i$.
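The same kind of check works for the full log-likelihood. In this sketch the parameters and samples are again made up, and `np.einsum` is simply one way to accumulate the quadratic terms:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, n = 3, 500                                   # dimension and sample count (arbitrary)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(mu, Sigma, size=n)  # rows are the samples x_1 ... x_n

# Log-likelihood from the simplified expression above
D = X - mu
Sinv = np.linalg.inv(Sigma)
loglik = (-n / 2 * m * np.log(2 * np.pi)
          - n / 2 * np.log(np.linalg.det(Sigma))
          - 0.5 * np.einsum('ij,jk,ik->', D, Sinv, D))

# Reference: sum of per-sample log-densities
loglik_ref = multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()
assert np.isclose(loglik, loglik_ref)
```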

To find the maximizing parameters, we denote this log-likelihood by $y$ and set its derivatives with respect to $\mu$ and $\Sigma$ to zero.

For the derivative with respect to $\mu$:

$$\frac{\partial y}{\partial \mu} = \frac{1}{2}\left(\Sigma^{-1}+\Sigma^{-T}\right)\left(\left(\sum_i x_i\right) - n\mu\right) = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_i x_i$$
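A small numerical illustration (with arbitrary data and an arbitrary fixed positive-definite $\Sigma$) shows that this gradient indeed vanishes at the sample mean and is clearly non-zero once $\mu$ is perturbed:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=200)
n = len(X)
Sigma = np.cov(X, rowvar=False)                # any fixed positive-definite Sigma works here
Sinv = np.linalg.inv(Sigma)

def grad_mu(mu):
    # 1/2 (Sigma^{-1} + Sigma^{-T}) (sum_i x_i - n mu), as derived above
    return 0.5 * (Sinv + Sinv.T) @ (X.sum(axis=0) - n * mu)

mu_hat = X.mean(axis=0)                        # the closed-form solution
print(grad_mu(mu_hat))                         # ~ zero vector
print(grad_mu(mu_hat + 0.1))                   # clearly non-zero away from the optimum
```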

Next, the derivative with respect to $\Sigma$, taken through $\Sigma^{-1}$:

$$\frac{\partial y}{\partial \Sigma} = \frac{\partial y}{\partial \Sigma^{-1}}\frac{\partial \Sigma^{-1}}{\partial \Sigma} = \left(\frac{n}{2}\mathrm{vec}^T\left(\Sigma^{T}\right) - \frac{1}{2}\sum_i\left((x_i-\mu)^T\otimes(x_i-\mu)^T\right)\right)\left(-\Sigma^{-T}\otimes\Sigma^{-1}\right) = 0$$

Simplifying, we get:

$$\mathrm{vec}^T\left(\Sigma^T\right) = \frac{1}{n}\sum_i\left((x_i-\mu)\otimes(x_i-\mu)\right)^T$$

Using the relationship $\mathrm{vec}(AXB) = (B^T\otimes A)\,\mathrm{vec}(X)$, which for the rank-one case gives $(x_i-\mu)\otimes(x_i-\mu) = \mathrm{vec}\left((x_i-\mu)(x_i-\mu)^T\right)$, we obtain:

$$\mathrm{vec}^T\left(\Sigma^T\right) = \frac{1}{n}\sum_i\mathrm{vec}^T\left((x_i-\mu)(x_i-\mu)^T\right) \;\Rightarrow\; \Sigma = \Sigma^T = \frac{1}{n}\sum_i(x_i-\mu)(x_i-\mu)^T$$
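Both ingredients of this last step can be checked numerically. The sketch below verifies the vec identity with random matrices (using the column-major vec that the identity assumes) and confirms that the $1/n$ sample covariance is not beaten by nearby perturbations of $\Sigma$; all data here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1) The identity vec(AXB) = (B^T kron A) vec(X), with column-major vec
A, X_, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 2))
vec = lambda M: M.flatten(order='F')
assert np.allclose(vec(A @ X_ @ B), np.kron(B.T, A) @ vec(X_))

# 2) The ML estimate of Sigma is the (1/n) sample covariance
X = rng.multivariate_normal([1.0, -2.0, 0.5],
                            [[2.0, 0.3, 0.0],
                             [0.3, 1.0, 0.2],
                             [0.0, 0.2, 0.5]], size=1000)
n = len(X)
mu_hat = X.mean(axis=0)
D = X - mu_hat
Sigma_hat = D.T @ D / n                        # (1/n) sum_i (x_i - mu)(x_i - mu)^T

def loglik(Sigma):
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->', D, np.linalg.inv(Sigma), D)
    return -n / 2 * logdet - 0.5 * quad        # constant term dropped

# Any positive-definite perturbation of Sigma_hat should not do better
assert loglik(Sigma_hat) >= loglik(Sigma_hat + 0.05 * np.eye(3))
assert loglik(Sigma_hat) >= loglik(1.2 * Sigma_hat)
```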

Derivative of MLP Weights

The layers of an MLP can be represented as:

$$y_{n+1} = \phi\left(W_n y_n + b_n\right)$$
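Read as code, this recursion is just a loop over layers. Here is a minimal numpy sketch with tanh standing in for $\phi$ and arbitrary layer sizes and weights:

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.tanh                                   # activation function

# Layer sizes 4 -> 5 -> 3 (arbitrary); W_n, b_n map y_n to y_{n+1}
sizes = [4, 5, 3]
Ws = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
bs = [rng.normal(size=sizes[k + 1]) for k in range(len(sizes) - 1)]

def forward(y0):
    ys = [y0]
    for W, b in zip(Ws, bs):
        ys.append(phi(W @ ys[-1] + b))          # y_{n+1} = phi(W_n y_n + b_n)
    return ys                                   # all layer outputs y_0 ... y_m

ys = forward(rng.normal(size=sizes[0]))
print(ys[-1])                                   # the network output y_m
```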

Assuming we have an MLP with $m$ such layers, training compares the output $y_m$ of the last layer with the desired output $y$ and computes an error. Let's assume we use the MSE (Mean Squared Error) criterion:

$$j(W_0,b_0,W_1,b_1,\dots,W_{m-1},b_{m-1}) = \frac{1}{2}\|y_m - y\|^2 = \frac{1}{2}(y_m - y)^T(y_m - y)$$

Using the machinery we built for derivatives, we calculate $\frac{\partial j}{\partial W_i}$:

$$\frac{\partial j}{\partial W_i} = \frac{\partial j}{\partial y_m}\left(\prod_{j=m-1}^{i+1}\frac{\partial y_{j+1}}{\partial y_j}\right)\frac{\partial y_{i+1}}{\partial W_i}$$

Breaking down each component:

$$\frac{\partial j}{\partial y_m} = (y_m - y)^T$$

$$\frac{\partial y_{j+1}}{\partial y_j} = \operatorname{diag}\left(\phi'(W_j y_j + b_j)\right)W_j$$

$$\frac{\partial y_{i+1}}{\partial W_i} = \operatorname{diag}\left(\phi'(W_i y_i + b_i)\right)\left(y_i^T\otimes I\right)$$
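The middle factor is the easiest to get wrong, so here is a finite-difference check of $\frac{\partial y_{j+1}}{\partial y_j}$ for a single made-up layer with $\phi = \tanh$ (so $\phi' = 1 - \tanh^2$):

```python
import numpy as np

rng = np.random.default_rng(4)
phi, dphi = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2   # tanh and its derivative

W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
y = rng.normal(size=4)

# Analytic Jacobian of y_{j+1} = phi(W y_j + b) with respect to y_j
J = np.diag(dphi(W @ y + b)) @ W

# Finite-difference approximation, column by column
eps = 1e-6
J_fd = np.empty_like(J)
for k in range(len(y)):
    e = np.zeros_like(y)
    e[k] = eps
    J_fd[:, k] = (phi(W @ (y + e) + b) - phi(W @ (y - e) + b)) / (2 * eps)

assert np.allclose(J, J_fd, atol=1e-6)
```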

Substituting these back in:

$$\frac{\partial j}{\partial W_i} = (y_m - y)^T\left(\prod_{j=m-1}^{i+1}\operatorname{diag}\left(\phi'(W_j y_j + b_j)\right)W_j\right)\operatorname{diag}\left(\phi'(W_i y_i + b_i)\right)\left(y_i^T\otimes I\right)$$

The row vector multiplying $\left(y_i^T\otimes I\right)$ can be built recursively. Writing each $\operatorname{diag}\left(\phi'(\cdot)\right)v$ product as a Hadamard (element-wise) product, define

$$e_{m-1} = \phi'(W_{m-1}y_{m-1}+b_{m-1})\odot(y_m - y), \qquad e_i = \phi'(W_i y_i + b_i)\odot\left(W_{i+1}^T e_{i+1}\right) \quad (i < m-1).$$

Then the whole expression collapses to

$$\frac{\partial j}{\partial W_i} = e_i^T\left(y_i^T\otimes I\right) = \mathrm{vec}^T\left(e_i y_i^T\right),$$

which is the familiar backpropagation form of the gradient.
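To close the loop, the sketch below implements this $e_i$ recursion for a made-up two-layer tanh network and checks the resulting gradient $e_i y_i^T$ against finite differences of $j$ (all names and sizes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
phi, dphi = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

sizes = [4, 5, 3]                               # a small two-layer MLP (m = 2)
Ws = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(2)]
bs = [rng.normal(size=sizes[k + 1]) for k in range(2)]
y0 = rng.normal(size=sizes[0])
y_target = rng.normal(size=sizes[-1])           # the desired output y

def forward(Ws):
    ys = [y0]
    for W, b in zip(Ws, bs):
        ys.append(phi(W @ ys[-1] + b))
    return ys

def cost(Ws):
    r = forward(Ws)[-1] - y_target
    return 0.5 * r @ r                          # j = 1/2 ||y_m - y||^2

# Backpropagated errors: e_{m-1} = phi' ⊙ (y_m - y), then e_i = phi' ⊙ (W_{i+1}^T e_{i+1})
ys = forward(Ws)
e = dphi(Ws[1] @ ys[1] + bs[1]) * (ys[2] - y_target)
grads = [None, np.outer(e, ys[1])]              # dj/dW_1 = e_1 y_1^T (matrix form)
e = dphi(Ws[0] @ ys[0] + bs[0]) * (Ws[1].T @ e)
grads[0] = np.outer(e, ys[0])                   # dj/dW_0 = e_0 y_0^T

# Finite-difference check of dj/dW_0
i, eps = 0, 1e-6
G_fd = np.empty_like(Ws[i])
for r in range(Ws[i].shape[0]):
    for c in range(Ws[i].shape[1]):
        Wp = [W.copy() for W in Ws]; Wp[i][r, c] += eps
        Wm = [W.copy() for W in Ws]; Wm[i][r, c] -= eps
        G_fd[r, c] = (cost(Wp) - cost(Wm)) / (2 * eps)
assert np.allclose(grads[i], G_fd, atol=1e-5)
```

The gradient is kept here as the matrix $e_i y_i^T$; flattening it gives exactly $\mathrm{vec}^T\left(e_i y_i^T\right)$ from the derivation above.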

In conclusion, both problems are handled by the same matrix-derivative machinery: the correlation matrix estimate follows from setting the derivative of the log-likelihood to zero in closed form, while the MLP weight gradients follow from the derivative of the error criterion and are used in iterative optimization.