CS 224N: Assignment #1

Preface

Stanford CS 224n Assignment #1 solution. The starter code can be downloaded here. The starter code is written in Python 2.7. Since I am more familiar with Python 3, and the main differences between Python 2.7 and Python 3 here are the print function and xrange, I ported the original code to Python 3.

1. Softmax

(a) Prove that softmax is invariant to constant offsets in the input, that is, for any input vector $\bf{x}$ and any constant $c$
$$
\mathbf{softmax}(\mathbf{x}) = \mathbf{softmax}(\mathbf{x} + c)
$$
where $ \bf{x} + c$ means adding the constant $c$ to every dimension of $\bf{x}$. Remember that $$\mathbf{softmax}(\mathbf{x})_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$$

The proof is straightforward: we just multiply both the numerator and the denominator of the softmax by $e^c$. But this property matters a great deal when we actually implement softmax.
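Written out componentwise, the whole proof is the one-line computation
$$
\mathbf{softmax}(\mathbf{x} + c)_i = \dfrac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \dfrac{e^c \, e^{x_i}}{e^c \sum_j e^{x_j}} = \dfrac{e^{x_i}}{\sum_j e^{x_j}} = \mathbf{softmax}(\mathbf{x})_i
$$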

Consider the following situation: you need to compute the softmax of $\mathbf{x} = [1001, 1002]$, but $e^{1001}$ will without question overflow. Yet as far as softmax is concerned, what difference is there between $[1001, 1002]$ and $[1, 2]$? Therefore, in an actual implementation we usually take $c = - \mathbf{max}\space x_i$ to keep the computation numerically stable.

(b) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1_softmax.py. You may test by executing python q1_softmax.py.

The code is as follows:

# to be updated
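In the meantime, here is a minimal sketch of what such a function might look like: a row-wise, numerically stable softmax in NumPy using the $c = - \mathbf{max}\space x_i$ trick from part (a). The function name softmax and the N-by-D row-wise behaviour follow the problem statement, but the body below is my own illustration, not the official starter or solution code.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax.

    x can be an N x D matrix (one example per row) or a 1-D vector.
    Subtracting the (row-wise) maximum before exponentiating is exactly the
    constant-offset trick from part (a), so np.exp never overflows.
    """
    orig_shape = x.shape
    if len(x.shape) > 1:
        # Matrix case: shift each row by its own maximum, then normalize each row.
        x = x - np.max(x, axis=1, keepdims=True)
        exp_x = np.exp(x)
        x = exp_x / np.sum(exp_x, axis=1, keepdims=True)
    else:
        # Vector case.
        x = x - np.max(x)
        exp_x = np.exp(x)
        x = exp_x / np.sum(exp_x)
    assert x.shape == orig_shape
    return x

if __name__ == "__main__":
    # [1001, 1002] gives the same result as [1, 2], and nothing overflows.
    print(softmax(np.array([1001.0, 1002.0])))
    print(softmax(np.array([[1.0, 2.0], [3.0, 4.0]])))
```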

2. Neural Network Basics

(a) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question. Recall, the sigmoid function is $\sigma(x) = \dfrac{1} {1 + e^{−x}}$ .

This derivation is very simple: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
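For completeness, the intermediate step is
$$
\sigma'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2} = \dfrac{1}{1 + e^{-x}} \cdot \dfrac{e^{-x}}{1 + e^{-x}} = \sigma(x)\left(1 - \sigma(x)\right)
$$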

(b) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathbf{softmax}(\theta)$. Remember the cross entropy function is
$$
\mathbf{CE}(\mathbf{y}, \mathbf{\hat{y}}) = -\sum_i y_i\log \hat{y}_i
$$
where $\mathbf{y}$ is the one-hot label vector, and $\mathbf{\hat{y}}$ is the predicted probability vector for all classes.

We know that $\mathbf{y}$ is a one-hot label vector, which means exactly one of its entries is $1$ and all the others are $0$. Suppose the $i$-th entry of $\mathbf{y}$ is $1$.

For the $i$-th component: $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta_i}} = \dfrac{\partial{\mathbf{CE}}}{\partial{\hat{y_i}}} \dfrac{\partial{\hat{y_i}}}{\partial{\theta_i}} = -\dfrac{1}{\hat{y_i}}\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_i}$

Here $\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_i} = \dfrac{e^{\theta_i}(\sum_j e^{\theta_j}) - e^{\theta_i}e^{\theta_i}}{(\sum_j e^{\theta_j})^2}= \mathbf{softmax}(\theta)_i(1 - \mathbf{softmax}(\theta)_i) = \hat{y_i}(1-\hat{y_i})$

Hence $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}_i} = \hat{y_i} - 1$.

For the components $j \not = i$: $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta_j}} = \dfrac{\partial{\mathbf{CE}}}{\partial{\hat{y_i}}} \dfrac{\partial{\hat{y_i}}}{\partial{\theta_j}} = -\dfrac{1}{\hat{y_i}}\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_j}$

Here $\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_j} = \dfrac{0 \cdot (\sum_k e^{\theta_k}) - e^{\theta_j}e^{\theta_i}}{(\sum_k e^{\theta_k})^2} = -\hat{y_i}\hat{y_j}$

Hence $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}_j} = \hat{y_j} = \hat{y_j} - 0$.

We can see that both cases can be combined into the single form $\hat{y_i} - y_i$, so
$$
\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}} = \mathbf{\hat{y}} - \mathbf{y}
$$
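As a quick sanity check, the analytic gradient $\mathbf{\hat{y}} - \mathbf{y}$ can be compared against a numerical gradient of the cross-entropy loss; the short script below is my own sketch (with a hypothetical example vector `theta` and one-hot label `y`) rather than part of the assignment code.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax for a 1-D vector (see part 1).
    shifted = theta - np.max(theta)
    exp_t = np.exp(shifted)
    return exp_t / np.sum(exp_t)

def cross_entropy(theta, y):
    # CE(y, y_hat) with y_hat = softmax(theta) and a one-hot label y.
    return -np.sum(y * np.log(softmax(theta)))

# Hypothetical example values, just for the check.
theta = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(theta) - y  # the gradient derived above

# Central finite differences for the numerical gradient.
eps = 1e-6
numerical = np.zeros_like(theta)
for k in range(len(theta)):
    step = np.zeros_like(theta)
    step[k] = eps
    numerical[k] = (cross_entropy(theta + step, y) - cross_entropy(theta - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))  # should be on the order of 1e-9 or smaller
```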

(c) Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\dfrac{\partial{J}}{\partial{x}}$ where $J = \mathbf{CE}(y, \hat{y})$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)
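A sketch of the derivation, assuming the usual parametrization $\mathbf{h} = \sigma(\mathbf{z}_1)$ with $\mathbf{z}_1 = \mathbf{x}W_1 + \mathbf{b}_1$ and $\mathbf{\hat{y}} = \mathbf{softmax}(\mathbf{z}_2)$ with $\mathbf{z}_2 = \mathbf{h}W_2 + \mathbf{b}_2$ (the symbols $W_1, W_2, \mathbf{b}_1, \mathbf{b}_2, \mathbf{z}_1, \mathbf{z}_2$ are shorthand I am defining here, as the problem allows): by part (b), $\dfrac{\partial J}{\partial \mathbf{z}_2} = \mathbf{\hat{y}} - \mathbf{y}$, and applying the chain rule backwards through the two layers gives
$$
\dfrac{\partial J}{\partial \mathbf{z}_1} = \left((\mathbf{\hat{y}} - \mathbf{y})W_2^{\top}\right) \circ \sigma'(\mathbf{z}_1), \qquad \dfrac{\partial J}{\partial \mathbf{x}} = \dfrac{\partial J}{\partial \mathbf{z}_1} W_1^{\top}
$$
where $\circ$ denotes element-wise multiplication and $\mathbf{x}$, $\mathbf{h}$ are treated as row vectors.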

This blog is under a CC BY-NC-SA 3.0 Unported License
Link to this article: http://huangweiran.club/2018/08/30/CS-224N-Assignment-1/