CS 224N: Assignment #1 Weiran Huang Aug 30, 2018

## Preface

This post contains my solutions to Stanford CS 224n Assignment #1. The starter code can be downloaded here. It is written in Python 2.7; since I am more familiar with Python 3, and the main differences that matter here are the `print` function and `xrange`, I ported the original code to Python 3.

## 1. Softmax

(a) Prove that softmax is invariant to constant offsets in the input, that is, for any input vector $\bf{x}$ and any constant $c$
$$\mathbf{softmax}(\mathbf{x}) = \mathbf{softmax}(\mathbf{x} + c)$$
where $\bf{x} + c$ means adding the constant $c$ to every dimension of $\bf{x}$. Remember that $$\mathbf{softmax}(\mathbf{x})_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$$
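The proof is short: adding $c$ multiplies both the numerator and the denominator by the same factor $e^c$, which cancels:
$$\mathbf{softmax}(\mathbf{x} + c)_i = \dfrac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \dfrac{e^c \, e^{x_i}}{e^c \sum_j e^{x_j}} = \dfrac{e^{x_i}}{\sum_j e^{x_j}} = \mathbf{softmax}(\mathbf{x})_i$$
In practice this lets us choose $c = -\max_i x_i$, so every exponent is at most $0$ and the computation cannot overflow.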

(b) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1_softmax.py. You may test by executing python q1_softmax.py.

*(To be updated.)*
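In the meantime, here is a minimal sketch of a numerically stable row-wise softmax using the shift invariance from part (a). This is my own sketch, not the official solution code; the function name `softmax` matches the one `q1_softmax.py` asks for.

```python
import numpy as np

def softmax(x):
    """Compute softmax for each row of x (or for x itself if 1-D).

    Subtracting the row-wise maximum before exponentiating uses the
    invariance proved in part (a), so large inputs do not overflow.
    """
    orig_shape = x.shape
    if x.ndim > 1:
        # Matrix case: shift each row by its max, then normalize per row.
        x = x - np.max(x, axis=1, keepdims=True)
        x = np.exp(x)
        x = x / np.sum(x, axis=1, keepdims=True)
    else:
        # Vector case.
        x = x - np.max(x)
        x = np.exp(x)
        x = x / np.sum(x)
    assert x.shape == orig_shape
    return x
```

For example, `softmax(np.array([[1001, 1002], [3, 4]]))` returns the same rows as `softmax(np.array([[1, 2], [3, 4]]))`, since each row is only shifted by a constant.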


## 2. Neural Network Basics

(a) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question. Recall, the sigmoid function is $\sigma(x) = \dfrac{1} {1 + e^{−x}}$ .
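Differentiating directly and then factoring the result back into $\sigma(x)$:
$$\sigma'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2} = \dfrac{1}{1 + e^{-x}} \cdot \dfrac{e^{-x}}{1 + e^{-x}} = \sigma(x)\big(1 - \sigma(x)\big)$$
so the gradient depends only on the function value $\sigma(x)$, not on $x$ itself.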

(b) Derive the gradient with respect to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathbf{softmax}(\theta)$. Remember the cross entropy function is
$$\mathbf{CE}(\mathbf{y}, \mathbf{\hat{y}}) = -\sum_i y_i\log \hat{y}_i$$
where $\mathbf{y}$ is the one-hot label vector, and $\mathbf{\hat{y}}$ is the predicted probability vector for all classes.
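Writing out the loss with $\hat{y}_i = \mathbf{softmax}(\theta)_i$ and using $\log \hat{y}_i = \theta_i - \log \sum_j e^{\theta_j}$ together with $\sum_i y_i = 1$:
$$\mathbf{CE}(\mathbf{y}, \mathbf{\hat{y}}) = -\sum_i y_i \left( \theta_i - \log \sum_j e^{\theta_j} \right) = -\sum_i y_i \theta_i + \log \sum_j e^{\theta_j}$$
$$\dfrac{\partial{\mathbf{CE}}}{\partial{\theta_k}} = -y_k + \dfrac{e^{\theta_k}}{\sum_j e^{\theta_j}} = \hat{y}_k - y_k$$
which in vector form gives the result below.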

$$\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}} = \hat{y} - y$$

(c) Derive the gradients with respect to the inputs x to a one-hidden-layer neural network (that is, find $\dfrac{\partial{J}}{\partial{x}}$ where $J = \mathbf{CE}(y, \hat{y})$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is y, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for sigmoid gradient, and feel free to define any variables whenever you see fit.)
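Assuming the standard row-vector setup with $\mathbf{z}_1 = \mathbf{x}W_1 + \mathbf{b}_1$, $\mathbf{h} = \sigma(\mathbf{z}_1)$, $\mathbf{z}_2 = \mathbf{h}W_2 + \mathbf{b}_2$, and $\hat{\mathbf{y}} = \mathbf{softmax}(\mathbf{z}_2)$ (these parameter names are my own choice), we can chain the results from (a) and (b):
$$\dfrac{\partial J}{\partial \mathbf{z}_2} = \hat{\mathbf{y}} - \mathbf{y}$$
$$\dfrac{\partial J}{\partial \mathbf{h}} = (\hat{\mathbf{y}} - \mathbf{y}) W_2^\top$$
$$\dfrac{\partial J}{\partial \mathbf{z}_1} = \dfrac{\partial J}{\partial \mathbf{h}} \circ \sigma'(\mathbf{z}_1) = \dfrac{\partial J}{\partial \mathbf{h}} \circ \mathbf{h} \circ (1 - \mathbf{h})$$
$$\dfrac{\partial J}{\partial \mathbf{x}} = \dfrac{\partial J}{\partial \mathbf{z}_1} W_1^\top$$
where $\circ$ denotes element-wise multiplication.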

This blog is licensed under a CC BY-NC-SA 3.0 Unported License.
Link to this article: http://huangweiran.club/2018/08/30/CS-224N-Assignment-1/