\chapter{Matrix derivative common cases}
\label{app:matrix_deriv}
What are some conventions for derivatives of matrices and vectors? It
will always work to explicitly write all indices and treat everything
as scalars, but we introduce here some shortcuts that are often faster
to use and helpful for understanding.

There are at least two consistent but different systems for describing
shapes and rules for matrix derivatives. In the end, they are all
correct, but it is important to pick one system and use it consistently.
\newcommand{\av}{{\bf a}}
\newcommand{\xv}{{\bf x}}
\newcommand{\yv}{{\bf y}}
\newcommand{\uv}{{\bf u}}
\newcommand{\vv}{{\bf v}}
\newcommand{\fv}{{\bf f}}
\newcommand{\gv}{{\bf g}}
\newcommand{\am}{{\bf A}}
\newcommand{\xm}{{\bf X}}
\newcommand{\ym}{{\bf Y}}
We will use what is often called the `Hessian' or denominator layout,
in which we say that
for
$\xv$ of size $n\times 1$ and $\yv$ of size $m\times 1$,
$\partial \yv/\partial \xv$ is a matrix of size $n\times m$ with the
$(i, j)$ entry $\partial y_j/\partial x_i$.
This denominator layout convention has been adopted by the field of machine
learning to ensure that the shape of the gradient is the same as the shape
of the variable in the denominator, i.e., the variable with respect to which we
differentiate. This convention is somewhat controversial at large,
but alas, we shall continue with denominator layout.
The discussion
below closely follows the Wikipedia article on matrix derivatives.
\section{The shapes of things}
Here are important special cases of the rule above:
\begin{itemize}
\item Scalar-by-scalar: For $x$ of size $1\times 1$ and $y$ of size $1\times 1$,
$\partial y/\partial x$ is the (scalar) partial derivative of $y$
with respect to $x$.
\item Scalar-by-vector: For $\xv$ of size $n\times 1$ and $y$ of size $1\times 1$,
$\partial y/\partial \xv$ (also written $\nabla_\xv y$, the gradient of $y$ with respect to $\xv$)
is a column vector of size $n\times 1$ with the $i^{\rm th}$ entry $\partial y/\partial
x_i$:
\begin{align*}
\partial y/\partial \xv=
\begin{bmatrix}
\partial y/\partial x_1 \\ \partial y/\partial x_2 \\ \vdots \\ \partial y/\partial x_n
\end{bmatrix}.
\end{align*}
\item Vector-by-scalar: For $x$ of size $1\times 1$ and $\yv$ of size $m\times 1$,
$\partial \yv/\partial x$ is a row vector of size $1 \times m$ with the
$j^{\rm th}$ entry $\partial y_j/\partial x$:
\begin{align*}
\partial \yv/\partial x=
\begin{bmatrix}
\partial y_1/\partial x & \partial y_2/\partial x & \cdots & \partial y_m/\partial x
\end{bmatrix}.
\end{align*}
\item Vector-by-vector: For $\xv$ of size $n\times 1$ and $\yv$ of size $m\times 1$,
$\partial \yv/\partial \xv$ is a matrix of size $n\times m$ with the
$(i, j)$ entry $\partial y_j/\partial x_i$:
\begin{align*}
\partial \yv/\partial \xv=
\begin{bmatrix}
\partial y_1/\partial x_1 & \partial y_2/\partial x_1 & \cdots & \partial y_m/\partial x_1 \\
\partial y_1/\partial x_2 & \partial y_2/\partial x_2 & \cdots & \partial y_m/\partial x_2 \\
\vdots & \vdots & \ddots & \vdots \\
\partial y_1/\partial x_n & \partial y_2/\partial x_n & \cdots & \partial y_m/\partial x_n \\
\end{bmatrix}.
\end{align*}
\item Scalar-by-matrix: For $\xm$ of size $n\times m$ and $y$ of size $1\times 1$,
$\partial y/\partial \xm$ (also written $\nabla_\xm y$, the gradient of $y$ with respect to $\xm$)
is a matrix of size $n\times m$ with the $(i, j)$ entry $\partial y/\partial X_{i, j}$:
\begin{align*}
\partial y/\partial \xm=
\begin{bmatrix}
\partial y/\partial X_{1,1} & \cdots & \partial y/\partial X_{1,m} \\
\vdots & \ddots & \vdots \\
\partial y/\partial X_{n,1} & \cdots & \partial y/\partial X_{n,m} \\
\end{bmatrix}.
\end{align*}
\end{itemize}
You may notice that in this list, we have not included matrix-by-matrix, matrix-by-vector, or vector-by-matrix derivatives.
This is because, generally, they cannot be expressed nicely in matrix form and require higher order objects (e.g., tensors)
to represent their derivatives. These cases are beyond the scope of this course.
Additionally, notice that for all cases, you can explicitly compute each element of the derivative object using (scalar) partial derivatives. You may find it useful to work through some of these by hand as you are reviewing matrix derivatives.
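
For concreteness, here is one small worked instance of the vector-by-vector case, with an arbitrarily chosen $\yv$: take $n = 2$, $m = 3$,
\begin{align*}
\xv = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad
\yv = \begin{bmatrix} x_1 x_2 \\ x_1 + x_2 \\ x_1^2 \end{bmatrix}.
\end{align*}
Then
\begin{align*}
\frac{\partial \yv}{\partial \xv} =
\begin{bmatrix}
\partial y_1/\partial x_1 & \partial y_2/\partial x_1 & \partial y_3/\partial x_1 \\
\partial y_1/\partial x_2 & \partial y_2/\partial x_2 & \partial y_3/\partial x_2
\end{bmatrix}
=
\begin{bmatrix}
x_2 & 1 & 2 x_1 \\
x_1 & 1 & 0
\end{bmatrix},
\end{align*}
which has size $n \times m = 2 \times 3$, as the convention requires.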
\section{Some vector-by-vector identities}
Here are some examples of $\partial \yv / \partial \xv$. In each case,
assume $\xv$ is $n \times 1$, $\yv$ is $m \times 1$, $a$ is a scalar
constant, $\av$ is a vector that does not depend on $\xv$ and $\am$ is a matrix that does
not depend on $\xv$, $u$ and $v$ are scalars that do depend on $\xv$, and
$\uv$ and $\vv$ are vectors that do depend on $\xv$. We also have
vector-valued functions $\fv$ and $\gv$.
\subsection{Some fundamental cases}
First, we will cover a couple of fundamental cases: suppose that $\av$ is an $m \times 1$ vector which is
not a function of $\xv$, an $n\times1$ vector. Then,
\begin{equation}\frac{\partial \av}{\partial \xv} = {\bf 0},
\label{eq:const}
\end{equation}
is an $n \times m$ matrix of 0s. This is similar to the scalar case of differentiating a constant. Next, we can consider the case of differentiating a vector with respect to itself:
\begin{equation}\frac{\partial \xv}{\partial \xv} = {\bf I}\end{equation}
This is the $n \times n$ identity matrix, with 1's along the
diagonal and 0's elsewhere. It makes sense, because $\partial x_j /
\partial x_i$ is 1 for $i = j$ and 0 otherwise. This identity is also similar to the scalar case.
\subsection{Derivatives involving a constant matrix}
Let the dimensions of $\am$ be $m \times n$. Then the object $\am \xv$ is an $m\times 1$ vector.
We can then compute the derivative of $\am \xv$ with respect to $\xv$ as:
\begin{equation}\frac{\partial \am \xv}{\partial \xv} = \begin{bmatrix}
\partial (\am \xv)_1/\partial x_1 & \partial (\am \xv)_2/\partial x_1 & \cdots & \partial (\am \xv)_m/\partial x_1 \\
\partial (\am \xv)_1/\partial x_2 & \partial (\am \xv)_2/\partial x_2 & \cdots & \partial (\am \xv)_m/\partial x_2 \\
\vdots & \vdots & \ddots & \vdots \\
\partial (\am \xv)_1/\partial x_n & \partial (\am \xv)_2/\partial x_n & \cdots & \partial (\am \xv)_m/\partial x_n \\
\end{bmatrix}
\end{equation}
\noindent Note that any element of the column vector $\am \xv$ can be written as, for $j = 1, \ldots, m$:
\[ (\am \xv)_j = \sum_{k=1}^n A_{j,k} x_k.\]
\noindent Thus, computing the $(i,j)$ entry of $\frac{\partial \am \xv}{\partial \xv}$ requires computing the partial derivative $\partial (\am \xv)_j/\partial x_i:$
\begin{align*}
\partial (\am \xv)_j/\partial x_i = \partial \left( \sum_{k=1}^n A_{j,k} x_k \right)/ \partial x_i = A_{j,i}
\end{align*}
\noindent Therefore, the $(i,j)$ entry of $\frac{\partial \am \xv}{\partial \xv}$ is the $(j,i)$ entry of $\am$:
\begin{equation}
\frac{\partial \am \xv}{\partial \xv} = \am^T \label{eq:Ax}
\end{equation}
\noindent Similarly, for a matrix $\am$ with $n$ rows that does not depend on $\xv$ (so that $\xv^T \am$ is defined), one can obtain
\begin{equation}
\frac{\partial \xv^T \am}{\partial \xv} = \am
\label{eq:xTA}
\end{equation}
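
As a small sanity check of identity \ref{eq:Ax}, with an arbitrarily chosen $\am$, take $m = 2$, $n = 3$, and
\begin{align*}
\am = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix},
\qquad
\am \xv = \begin{bmatrix} x_1 + 2 x_2 + 3 x_3 \\ 4 x_1 + 5 x_2 + 6 x_3 \end{bmatrix}.
\end{align*}
Differentiating each entry of $\am \xv$ with respect to each entry of $\xv$ gives
\begin{align*}
\frac{\partial \am \xv}{\partial \xv} =
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} = \am^T,
\end{align*}
an $n \times m = 3 \times 2$ matrix, as expected.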
\subsection{Linearity of derivatives}
Suppose that $\uv, \vv$ are both vectors of size $m \times 1$. Then,
\begin{equation}
\frac{\partial (\uv + \vv)}{\partial \xv} = \frac{\partial
\uv}{\partial \xv} + \frac{\partial \vv}{\partial \xv}
\label{eq:uplusv}
\end{equation}
\noindent Suppose that $a$ is a scalar constant and $\uv$ is an $m\times 1$ vector that is a function of $\xv$. Then,
\begin{equation}\frac{\partial a \uv}{\partial \xv} = a \frac{\partial \uv}{ \partial \xv}\end{equation}
One can extend the previous identity to vector- and matrix-valued constants. Suppose that $\av$ is a vector with shape $m \times 1$ and $v$ is a scalar which depends on $\xv$. Then,
\begin{equation}\frac{\partial v \av}{\partial \xv} = \frac{\partial
v}{\partial \xv}\av^T\end{equation}
First, checking dimensions, $\partial v/\partial \xv$ is $n \times
1$ and $\av$ is $m \times 1$, so $\av^T$ is $1 \times m$ and our
answer is $n \times m$, as it should be. Now, checking a value,
element $(i,j)$ of the answer is $\partial (v a_j) / \partial x_i =
(\partial v / \partial x_i)\, a_j$, which corresponds to element
$(i,j)$ of $(\partial v / \partial \xv)\av^T$.
Similarly, suppose that $\am$ is a matrix which does not depend on $\xv$ and $\uv$ is a column vector which does depend on $\xv$. Then,
\begin{equation}\frac{\partial \am \uv}{\partial \xv} = \frac{\partial
\uv}{\partial \xv}\am^T\end{equation}
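
As a quick consistency check of the last identity, take $\uv = \xv$: then $\partial \uv / \partial \xv = {\bf I}$, and the identity gives
\begin{align*}
\frac{\partial \am \xv}{\partial \xv} = \frac{\partial \xv}{\partial \xv}\,\am^T = {\bf I}\,\am^T = \am^T,
\end{align*}
which recovers identity \ref{eq:Ax}.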
\subsection{Product rule (vector-valued numerator)}
Suppose that $v$ is a scalar which depends on $\xv$, while $\uv$ is a column vector of shape $m\times 1$ and $\xv$ is a column vector of shape $n \times 1$. Then,
\begin{equation}\frac{\partial v \uv}{\partial \xv} = v\frac{\partial
\uv}{\partial \xv} + \frac{\partial v}{\partial \xv} \uv^T\end{equation}
One can see this relationship by expanding the derivative as follows:
\begin{align*}
\frac{\partial v \uv}{\partial \xv} =
\begin{bmatrix}
\partial (v u_1)/\partial x_1 & \partial (v u_2)/\partial x_1 & \cdots & \partial (v u_m)/\partial x_1 \\
\partial (v u_1)/\partial x_2 & \partial (v u_2)/\partial x_2 & \cdots & \partial (v u_m)/\partial x_2 \\
\vdots & \vdots & \ddots & \vdots \\
\partial (v u_1)/\partial x_n & \partial (v u_2)/\partial x_n & \cdots & \partial (v u_m)/\partial x_n \\
\end{bmatrix}.
\end{align*}
Then, one can use the product rule for scalar-valued functions,
\begin{align*}
\partial (v u_j)/\partial x_i = v (\partial u_j / \partial x_i) + (\partial v / \partial x_i) u_j,
\end{align*}
to obtain the desired result.
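
For a small worked instance of this product rule, with arbitrarily chosen $v$ and $\uv$, take $n = m = 2$, $v = x_1 x_2$, and $\uv = \xv$. Then
\begin{align*}
v \frac{\partial \uv}{\partial \xv} + \frac{\partial v}{\partial \xv} \uv^T
= x_1 x_2 \, {\bf I} +
\begin{bmatrix} x_2 \\ x_1 \end{bmatrix}
\begin{bmatrix} x_1 & x_2 \end{bmatrix}
=
\begin{bmatrix} 2 x_1 x_2 & x_2^2 \\ x_1^2 & 2 x_1 x_2 \end{bmatrix},
\end{align*}
which agrees with differentiating $v \uv = \begin{bmatrix} x_1^2 x_2 & x_1 x_2^2 \end{bmatrix}^T$ entry by entry.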
\subsection{Chain rule}
Suppose that $\gv$ is a vector-valued function with output vector of shape $m \times 1$, and the argument to $\gv$ is a column vector $\uv$ of shape $d\times 1$ which depends on $\xv$. Then, one can obtain the chain rule as,
\begin{equation}\frac{\partial \gv(\uv)}{\partial \xv} = \frac{\partial
\uv}{\partial \xv}\frac{\partial \gv(\uv)}{\partial
\uv}\end{equation}
Following ``the shapes of things,''
$\partial \uv / \partial \xv$ is $n \times d$ and $\partial \gv(\uv)
/ \partial \uv$ is $d \times m$, where element $(i,j)$ is $\partial
\gv(\uv)_j / \partial \uv_i$.
The same chain rule applies for further compositions of functions:
\begin{equation}\frac{\partial \fv(\gv(\uv))}{\partial \xv} = \frac{\partial
\uv}{\partial \xv}\frac{\partial \gv(\uv)}{\partial \uv}
\frac{\partial \fv(\gv)}{\partial \gv}\end{equation}
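
One way to sanity-check this chain rule is with linear maps; here we introduce a second constant matrix ${\bf B}$, used only in this example. Let $\uv = \am \xv$ with $\am$ of size $d \times n$, and let $\gv(\uv) = {\bf B} \uv$ with ${\bf B}$ of size $m \times d$. Identity \ref{eq:Ax} gives $\partial \uv / \partial \xv = \am^T$ and $\partial \gv(\uv) / \partial \uv = {\bf B}^T$, so the chain rule yields
\begin{align*}
\frac{\partial \gv(\uv)}{\partial \xv} = \am^T {\bf B}^T = ({\bf B} \am)^T,
\end{align*}
which is exactly what identity \ref{eq:Ax} gives when applied directly to $\gv(\uv) = ({\bf B} \am) \xv$.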
\section{Some other identities}
You can get many scalar-by-vector and vector-by-scalar cases as special
cases of the rules above, by making one of the relevant vectors just be
$1 \times 1$. Here are some other identities that are handy. For more, see the
Wikipedia article on matrix derivatives (for consistency, only use
the ones in {\em denominator layout}).
\begin{equation}
\frac{\partial \uv^T \vv}{\partial \xv} = \frac{\partial
\uv}{\partial \xv} \vv + \frac{\partial \vv}{\partial \xv} \uv
\label{eq:uTv}
\end{equation}
\begin{equation}
\frac{\partial \uv^T}{\partial x} = \big(\frac{\partial
\uv}{\partial x}\big)^T
\label{eq:uT}
\end{equation}
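
A familiar special case of identity \ref{eq:uTv} is $\uv = \vv = \xv$, which gives the gradient of the squared norm $\xv^T \xv$:
\begin{align*}
\frac{\partial \xv^T \xv}{\partial \xv}
= \frac{\partial \xv}{\partial \xv}\,\xv + \frac{\partial \xv}{\partial \xv}\,\xv
= {\bf I}\xv + {\bf I}\xv = 2\xv.
\end{align*}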
\section{Derivation of gradient for linear regression}
\label{app:matrix-gradient}
\newcommand{\xmt}{\tilde{\bf X}}
\newcommand{\ymt}{\tilde{\bf Y}}
Applying identities \ref{eq:xTA}, \ref{eq:uTv}, \ref{eq:uplusv}, \ref{eq:Ax},
and \ref{eq:const}:
\begin{align*}
\frac{\partial (\xmt \theta - \ymt)^T(\xmt \theta - \ymt)/n}{\partial
\theta} & = \frac{2}{n} \frac{\partial(\xmt \theta - \ymt)}{\partial
\theta}(\xmt \theta - \ymt) \\
& = \frac{2}{n} \big(\frac{\partial\xmt \theta}{\partial
\theta}-\frac{\partial \ymt}{\partial
\theta}\big)(\xmt \theta - \ymt) \\
& = \frac{2}{n} \big(\xmt^T -{\bf 0}\big)(\xmt \theta - \ymt) \\
& = \frac{2}{n} \xmt^T (\xmt \theta - \ymt).
\end{align*}
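To spell out the first step above: identity \ref{eq:uTv} is applied with $\uv = \vv = \xmt \theta - \ymt$ (with $\theta$ playing the role of $\xv$), so that
\begin{align*}
\frac{\partial \uv^T \uv}{\partial \theta}
= \frac{\partial \uv}{\partial \theta}\,\uv + \frac{\partial \uv}{\partial \theta}\,\uv
= 2\,\frac{\partial \uv}{\partial \theta}\,\uv.
\end{align*}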
\section{Matrix derivatives using Einstein summation}
\label{app:einstein}
{\em You do not have to read or learn this! But you might find it
interesting or helpful.}
\def\bea{\begin{eqnarray}}
\def\eea{\end{eqnarray}}
\def\Xt{\tilde{X}}
\def\Yt{\tilde{Y}}
Consider the objective function for linear regression, written out as products of matrices:
\bea
J(\theta) = \frac{1}{n} (\Xt\theta - \Yt)^T (\Xt\theta-\Yt)
\,,
\eea
where $\Xt = X^T$ is $n\times d$, $\Yt=Y^T$ is $n\times 1$, and $\theta$ is $d\times 1$. How does one show, with no shortcuts, that
\bea
\nabla_{\theta}J = \frac{2}{n} {\Xt^T} {(\Xt\theta - \Yt)} \;\;?
\eea
One neat way, which is very explicit, is to simply write all the
matrices as variables with row and column indices, e.g., $\Xt_{ab}$ is
the row $a$, column $b$ entry of the matrix $\Xt$. Furthermore, let
us use the convention that in any product, all indices which appear
more than once get summed over; this is a popular convention in
theoretical physics, and lets us suppress all the summation symbols
which would otherwise clutter the following expressions. For
example, $\Xt_{ab} \theta_b$ would be the implicit summation notation
giving the element at the $a^{\rm th}$ row of the matrix-vector
product $\Xt \theta$.
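For instance, taking $d = 2$ just for concreteness, $\Xt_{ab} \theta_b$ stands for
\bea
\Xt_{ab} \theta_b &=& \Xt_{a1} \theta_1 + \Xt_{a2} \theta_2
\,,
\eea
with the sum over the repeated index $b$ left implicit.
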
Using implicit summation notation with explicit indices,
we can rewrite $J(\theta)$ as
\bea
J(\theta) = \frac{1}{n} \left( \Xt_{ab} \theta_b - \Yt_a \right) \left( \Xt_{ac}\theta_c - \Yt_a \right) \,.
\eea
Note that we no longer need the transpose on the first term, because
all that transpose accomplished was to take a dot product between the
vector given by the left term, and the vector given by the right term.
With implicit summation, this is accomplished by the two terms sharing
the repeated index $a$.
Taking the derivative of $J$ with respect to the $d^{\rm th}$ element
of $\theta$ thus gives, using the product rule for (ordinary scalar)
multiplication:
\bea
\frac{dJ}{d\theta_d} &=& \frac{1}{n} \left[
\Xt_{ab} \delta_{bd} \left( \Xt_{ac}\theta_c-\Yt_a \right)
+ \left(\Xt_{ab}\theta_b - \Yt_a \right) \Xt_{ac} \delta_{cd}
\right]
\\
&=& \frac{1}{n} \left[
\Xt_{ad} \left( \Xt_{ac}\theta_c-\Yt_a \right)
+ \left(\Xt_{ab}\theta_b - \Yt_a \right) \Xt_{ad}
\right]
\\
&=& \frac{2}{n} \Xt_{ad} \left( \Xt_{ab}\theta_b-\Yt_a \right)
\,,
\eea
where the second line follows from the first, using the definition of the
Kronecker delta: $\delta_{bd} = 1$ when $b=d$ and $0$ otherwise (and similarly for $\delta_{cd}$).
And the third line follows from the second by recognizing that the two
terms in the second line are identical. Now note that in this
implicit summation notation, the $(a, c)$ element of the matrix product
of $A$ and $B$ is $(AB)_{ac} = A_{ab}B_{bc}$. That is, ordinary
matrix multiplication sums over indices which are adjacent to each
other, because a row of $A$ times a column of $B$ becomes a scalar
number. So the term in the above equation with $\Xt_{ad} \Xt_{ab}$ is
not a matrix product of $\Xt$ with $\Xt$. However, taking the
transpose $\Xt^T$ switches row and column indices, so $\Xt_{ad} = \Xt^T_{da}$.
And $\Xt^T_{da} \Xt_{ab}$ {\em is} a matrix product of $\Xt^T$ with $\Xt$!
Thus, we have that
\bea
\frac{dJ}{d\theta_d} &=& \frac{2}{n} \Xt^T_{da} \left( \Xt_{ab}\theta_b-\Yt_a \right)
\\
&=& \frac{2}{n} \left[ \Xt^T \left( \Xt\theta-\Yt \right) \right]_{d}
\,,
\eea
which is the desired result.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: