CS 224N: Assignment #1

## Word2vec Summary

### Background

1. One-hot vectors capture no notion of word similarity in the representation, and they waste space.

2. SVD-based methods first build a co-occurrence matrix (co-occurrence may be restricted to some window), then apply SVD to the matrix and keep the first k singular values; the corresponding columns of $U$ give a k-dimensional representation for each word. The drawbacks of this approach are clear:

• New words and changes in the corpus force the matrix dimensions to change frequently.

• The matrix is extremely sparse, and its dimensions are usually very large.

• Performing the SVD has roughly quadratic cost in the matrix dimensions, and since we just said those dimensions are large, this cost is painful.

• Extra handling is needed for the huge imbalance in word frequencies.

There are targeted workarounds for these problems, but they are not good enough.
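As a concrete reference point, here is a minimal sketch of the SVD approach on a toy corpus; the corpus, window size, and variable names are made up for illustration:

```python
import numpy as np

# Toy corpus and vocabulary; everything here is illustrative only.
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Window-limited co-occurrence counts (window size 1).
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# Truncated SVD: the first k columns of U give a k-dimensional vector per word.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k]
print(dict(zip(vocab, word_vectors.round(2))))
```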

### Word2vec: iteration based

The parameters of word2vec are the word-vector matrices themselves, and they are trained iteratively. It comes in two algorithms: continuous bag-of-words (CBOW) and skip-gram. The former predicts the center word from its context; the latter does the reverse. There are also two training objectives: negative sampling and hierarchical softmax.

### CBOW

$U$ is the output matrix, here the embedding matrix of the center words; $V$ is the input matrix, here the embedding matrix of the context words.

1. Take the one-hot vectors of the words inside the window (window size $m$).
2. Look up their embeddings in $V$ (i.e., multiply each one-hot vector by the matrix), giving 2m word vectors.
3. Average these 2m vectors directly to get $\hat{v}$. (This is why it is called a continuous bag-of-words model: it still ignores word order, treating each dimension of the word vector like a bag that collects the information contributed by every context word.)
4. Multiply this vector by $U$, which amounts to taking an inner product with every candidate center word's output vector, then apply softmax to the resulting scores; these are the probabilities we output.

The loss function is the cross entropy between this probability vector and the actual result (a one-hot vector, i.e., a distribution that puts probability 1 on the correct word). Cross entropy is an information-theoretic quantity that measures how different two distributions are. Because the true distribution is nonzero at exactly one index, where it equals 1, the cross entropy reduces to the negative log of the predicted probability at the correct word's index (a minimal numpy sketch of these four steps appears after the loss below):

$$\begin{aligned} \text { minimize } J &=-\log P\left(w_{c} | w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\right) \\ &=-\log P\left(u_{c} | \hat{v}\right) \\ &=-\log \frac{\exp \left(u_{c}^{T} \hat{v}\right)}{\sum_{j=1}^{|V|} \exp \left(u_{j}^{T} \hat{v}\right)} \\ &=-u_{c}^{T} \hat{v}+\log \sum_{j=1}^{|V|} \exp \left(u_{j}^{T} \hat{v}\right) \end{aligned}$$
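To make the four steps concrete, here is a minimal numpy sketch of the CBOW forward pass and loss; all names and sizes are illustrative, not the assignment's code:

```python
import numpy as np

def cbow_forward(context_ids, center_id, V, U):
    """Toy CBOW forward pass.

    context_ids -- indices of the 2m context words
    center_id   -- index of the true center word
    V           -- |vocab| x d input (context) embedding matrix
    U           -- |vocab| x d output (center) embedding matrix
    """
    v_hat = V[context_ids].mean(axis=0)        # steps 1-3: lookup and average
    scores = U.dot(v_hat)                      # step 4: inner product with every center word
    scores -= scores.max()                     # constant offset for numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(y_hat[center_id])           # cross entropy against the one-hot label
    return loss, y_hat

# Illustrative usage with random embeddings.
rng = np.random.default_rng(0)
V, U = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
print(cbow_forward([1, 2, 4, 5], 3, V, U))
```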

### Skip-gram

$U, V$ still denote the output and input matrices respectively, but the word positions they represent are swapped relative to CBOW: $V$ now embeds the center word and $U$ embeds the context (output) words.

1. Use the center word's one-hot vector to look up its embedding $v_c$ in $V$.
2. Multiply it by the output matrix $U$, then apply softmax.
3. Take the cross entropy between this probability vector and each context word's one-hot label and sum over the 2m positions (equivalently, the cross entropy against a vector that is 1 at every position inside the window and 0 elsewhere); this sum is the loss, derived below and sketched in code after it.
$$\begin{aligned} \text { minimize } J &=-\log P\left(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} | w_{c}\right) \\ &=-\log \prod_{j=0, j \neq m}^{2 m} P\left(w_{c-m+j} | w_{c}\right) \\ &=-\log \prod_{j=0, j \neq m}^{2 m} P\left(u_{c-m+j} | v_{c}\right) \\ &=-\log \prod_{j=0, j \neq m}^{2 m} \frac{\exp \left(u_{c-m+j}^{T} v_{c}\right)}{\sum_{k=1}^{|V|} \exp \left(u_{k}^{T} v_{c}\right)} \\ &=-\sum_{j=0, j \neq m}^{2 m} u_{c-m+j}^{T} v_{c}+2 m \log \sum_{k=1}^{|V|} \exp \left(u_{k}^{T} v_{c}\right) \end{aligned}$$
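A matching sketch for skip-gram, summing the cross entropy over all 2m context positions; again the names and sizes are illustrative only:

```python
import numpy as np

def skipgram_loss(center_id, context_ids, V, U):
    """Toy skip-gram loss: one softmax over the vocabulary, reused for
    every context position, then the cross-entropy terms are summed."""
    v_c = V[center_id]                          # step 1: center word embedding
    scores = U.dot(v_c)                         # step 2: scores for every word
    scores -= scores.max()
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return -np.sum(np.log(y_hat[context_ids]))  # step 3: sum of 2m cross entropies

rng = np.random.default_rng(0)
V, U = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
print(skipgram_loss(3, [1, 2, 4, 5], V, U))
```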

### Negative Sampling

Instead of normalizing over the whole vocabulary, negative sampling trains a binary classifier to tell observed (word, context) pairs $D$ apart from sampled "negative" pairs $\tilde{D}$:

$$J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}-\sum_{(w, c) \in \tilde{D}} \log \left(\frac{1}{1+\exp \left(u_{w}^{T} v_{c}\right)}\right)$$

For skip-gram, the objective for one context word $c-m+j$ with $K$ negative samples $\tilde{u}_{k}$ is

$$-\log \sigma\left(u_{c-m+j}^{T} \cdot v_{c}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot v_{c}\right)$$

CBOW is analogous, with the averaged context vector $\hat{v}$ in place of $v_c$ and the center word's output vector as the positive example.

The $K$ negative samples are drawn from some distribution; empirically, the unigram distribution raised to the $\dfrac{3}{4}$ power works best. The intuition for the $\dfrac{3}{4}$ exponent: values close to 1 are barely changed, while very small values are boosted, making rare words easier to sample. A frequent word like "is" (0.9) barely moves ($0.9^{0.75} \approx 0.92$), whereas "bombastic" (0.01) gets roughly three times its original sampling weight ($0.01^{0.75} \approx 0.03$).
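A minimal sketch of the $\frac{3}{4}$-power sampling distribution; the counts below are toy numbers chosen only for illustration:

```python
import numpy as np

# Toy unigram counts; real word2vec uses corpus frequencies.
counts = np.array([900, 90, 9, 1], dtype=float)   # e.g. "is", ..., "bombastic"
unigram = counts / counts.sum()

weights = unigram ** 0.75
neg_sampling_dist = weights / weights.sum()

print(unigram.round(4))            # original unigram probabilities
print(neg_sampling_dist.round(4))  # rare words get a relatively larger share

# Draw K negative sample indices from the adjusted distribution.
K = 5
rng = np.random.default_rng(0)
print(rng.choice(len(counts), size=K, p=neg_sampling_dist))
```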

## 1. Softmax

(a) Prove that softmax is invariant to constant offsets in the input, that is, for any input vector $\bf{x}$ and any constant $c$
$$\mathbf{softmax}(\mathbf{x}) = \mathbf{softmax}(\mathbf{x} + c)$$
where $\bf{x} + c$ means adding the constant $c$ to every dimension of $\bf{x}$. Remember that $$\mathbf{softmax}(\mathbf{x})_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$$
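One way to see this, componentwise:

$$\mathbf{softmax}(\mathbf{x}+c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \frac{e^{c}\, e^{x_i}}{e^{c} \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}} = \mathbf{softmax}(\mathbf{x})_i$$

In practice one picks $c = -\max_i x_i$, which makes every exponent non-positive and prevents overflow; this is exactly the trick used in the code below.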

(b) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1_softmax.py. You may test by executing python q1_softmax.py.

```python
import numpy as np

def softmax(x):
    """Compute the softmax function for each row of the input x.

    You should also make sure that your code works for a single
    D-dimensional vector (treat the vector as a single row) and
    for N x D matrices. This may be useful for testing later. Also,
    make sure that the dimensions of the output match the input.

    Arguments:
    x -- A D dimensional vector or N x D dimensional numpy matrix.

    Return:
    x -- You are allowed to modify x in-place
    """
    orig_shape = x.shape

    if len(x.shape) > 1:
        # Matrix: subtract each row's max (the constant offset from part (a))
        # so that np.exp never overflows.
        x_max = np.max(x, axis=1).reshape(-1, 1)
        x -= x_max

        # Exponentiate and normalize each row.
        x = np.exp(x)
        exp_sum = np.sum(x, axis=1).reshape(-1, 1)
        x /= exp_sum
    else:
        # Vector: same trick with a single max.
        x_max = np.max(x)
        x -= x_max

        x = np.exp(x)
        x /= np.sum(x)

    assert x.shape == orig_shape
    return x
```
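A quick sanity check (inputs chosen only for illustration); the second call uses values shifted by a large constant, which would overflow without the max subtraction but gives the same result by the invariance shown in part (a):

```python
print(softmax(np.array([1.0, 2.0])))            # ~[0.269, 0.731]
print(softmax(np.array([[1001.0, 1002.0],
                        [3.0, 4.0]])))          # both rows ~[0.269, 0.731]
```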


## 2. Neural Network Basics

(a) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question. Recall, the sigmoid function is $\sigma(x) = \dfrac{1} {1 + e^{−x}}$ .
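The derivation is short:

$$\sigma'(x) = \frac{d}{dx}\left(1+e^{-x}\right)^{-1} = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\left(1-\sigma(x)\right)$$

so the gradient can be computed from the activation value alone.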

(b) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathbf{softmax}(\theta)$. Remember the cross entropy function is
$$\mathbf{CE}(\mathbf{y}, \mathbf{\hat{y}}) = -\sum_i y_i\log \hat{y}_i$$
where $\mathbf{y}$ is the one-hot label vector, and $\mathbf{\hat{y}}$ is the predicted probability vector for all classes.

$$\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}} = \hat{y} - y$$
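A brief derivation, writing $k$ for the index where $y_k = 1$ (so only the $\hat{y}_k$ term survives in the sum):

$$\frac{\partial \mathbf{CE}}{\partial \theta_{i}}=-\frac{\partial}{\partial \theta_{i}} \log \hat{y}_{k}=-\frac{\partial}{\partial \theta_{i}}\left(\theta_{k}-\log \sum_{j} e^{\theta_{j}}\right)=\frac{e^{\theta_{i}}}{\sum_{j} e^{\theta_{j}}}-y_{i}=\hat{y}_{i}-y_{i}$$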

(c) Derive the gradients with respect to the inputs x to a one-hidden-layer neural network (that is, find $\dfrac{\partial{J}}{\partial{x}}$ where $J = \mathbf{CE}(y, \hat{y})$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer and softmax for the output layer. Assume the one-hot label vector is y, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)
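A sketch of the answer, assuming the row-vector convention $z_{1}=x W_{1}+b_{1}$, $h=\sigma(z_{1})$, $z_{2}=h W_{2}+b_{2}$, $\hat{y}=\mathbf{softmax}(z_{2})$ and writing $\circ$ for the elementwise product:

$$\delta_{2}=\hat{y}-y, \qquad \delta_{1}=\left(\delta_{2} W_{2}^{\top}\right) \circ \sigma^{\prime}\left(z_{1}\right), \qquad \frac{\partial J}{\partial x}=\delta_{1} W_{1}^{\top}$$

The first equality is part (b) applied to $z_2$; the other two are the chain rule through the sigmoid layer and the affine input layer.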

```python
import random
import numpy as np

# First implement a gradient checker by filling in the following functions
def gradcheck_naive(f, x):
    """ Gradient check for a function f.

    Arguments:
    f -- a function that takes a single argument and outputs the
         cost and its gradients
    x -- the point (numpy array) to check the gradient at
    """

    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x)  # Evaluate function value at original point
    h = 1e-4         # Do not change this!

    # Iterate over all indexes ix in x to check the gradient.
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        # Central difference: evaluate f at x[ix] + h and x[ix] - h,
        # resetting the random state so f sees the same randomness each time.
        x[ix] += h
        random.setstate(rndstate)
        fx_forward, _ = f(x)

        x[ix] -= 2 * h
        random.setstate(rndstate)
        fx_backward, _ = f(x)

        x[ix] += h  # restore the original value

        numgrad = (fx_forward - fx_backward) / (2 * h)

        # Compare the numerical estimate with the analytic gradient.
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print("First gradient error found at index %s" % str(ix))
            return

        it.iternext()  # Step to next dimension

    print("Gradient check passed!")
```



```python
def forward_backward_prop(X, labels, params, dimensions):
    """
    Forward and backward propagation for a two-layer sigmoidal network

    Compute the forward propagation and the cross entropy cost, and the
    backward propagation for the gradients of all parameters.

    Notice the gradients computed here are different from the gradients in
    the assignment sheet: they are w.r.t. weights, not inputs.

    Arguments:
    X -- M x Dx matrix, where each row is a training example x.
    labels -- M x Dy matrix, where each row is a one-hot vector.
    params -- Model parameters, these are unpacked for you.
    dimensions -- A tuple of input dimension, number of hidden units
                  and output dimension
    """

    # Unpack network parameters
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # Note: compute cost based on sum not mean.
    # forward propagation (sigmoid from part (a), softmax from question 1)
    h = sigmoid(X.dot(W1) + b1)      # M x H
    yhat = softmax(h.dot(W2) + b2)   # M x Dy

    # Cross entropy: -log of the predicted probability at each true label.
    cost = -np.sum(np.log(yhat[np.where(labels)]))

    # backward propagation
    d2 = yhat - labels                           # M x Dy, gradient at the softmax input (part (b))
    gradW2 = h.T.dot(d2)                         # H x Dy
    gradb2 = np.sum(d2, axis=0, keepdims=True)   # 1 x Dy
    d1 = d2.dot(W2.T) * h * (1.0 - h)            # M x H, using sigmoid'(z) = h * (1 - h)
    gradW1 = X.T.dot(d1)                         # Dx x H
    gradb1 = np.sum(d1, axis=0, keepdims=True)   # 1 x H

    # Stack gradients into one flat vector, matching the layout of params.
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
                           gradW2.flatten(), gradb2.flatten()))

    return cost, grad
```
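A hypothetical smoke test that wires the two functions together; all dimensions and data below are made up for illustration, and the lambda is needed so gradcheck_naive varies only the flat parameter vector:

```python
np.random.seed(0)
dimensions = [10, 5, 3]   # Dx, H, Dy (arbitrary toy sizes)
N = 20
X = np.random.randn(N, dimensions[0])
labels = np.zeros((N, dimensions[2]))
labels[np.arange(N), np.random.randint(0, dimensions[2], N)] = 1
params = np.random.randn((dimensions[0] + 1) * dimensions[1] +
                         (dimensions[1] + 1) * dimensions[2])

gradcheck_naive(lambda p: forward_backward_prop(X, labels, p, dimensions),
                params)
```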



## 3. word2vec

(a) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skip-gram, and word prediction is made with the softmax function found in word2vec models
$$\hat{\boldsymbol{y}}_{o}=p(\boldsymbol{o} | \boldsymbol{c})=\frac{\exp \left(\boldsymbol{u}_{o}^{\top} \boldsymbol{v}_{c}\right)}{\sum_{w=1}^{W} \exp \left(\boldsymbol{u}_{w}^{\top} \boldsymbol{v}_{c}\right)}$$
where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \ldots, W$) are the “output” word vectors for all words in
the vocabulary. Assume cross entropy cost is applied to this prediction and word o is the expected word
(the $o$-th element of the one-hot label vector is one), derive the gradients with respect to $v_c$.

$$\frac{\partial J}{\partial \boldsymbol{v}_{c}}=U(\hat{\boldsymbol{y}}-\boldsymbol{y})$$ where $U=[\boldsymbol{u}_{1}, \ldots, \boldsymbol{u}_{W}]$ stacks the output vectors as columns; or, after distributing: $$\frac{\partial J}{\partial \boldsymbol{v}_{c}}=-\boldsymbol{u}_{o}+\sum_{w=1}^{W} \hat{y}_{w} \boldsymbol{u}_{w}$$

(b) As in the previous part, derive gradients for the “output” word vectors $u_w$’s (including $u_o$).

$$\frac{\partial J}{\partial \boldsymbol{U}}=\boldsymbol{v}_{c}(\hat{\boldsymbol{y}}-\boldsymbol{y})^{\top}$$

(c) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that K negative samples (words) are drawn, and
they are $1,\ldots, K$, respectively for simplicity of notation ($o \notin\{1, \ldots, K\}$ ). Again, for a given word, $o$,
denote its output vector as $u_o$. The negative sampling loss function in this case is
$$J_{n e g-s a m p l e}\left(\boldsymbol{o}, \boldsymbol{v}_{c}, \boldsymbol{U}\right)=-\log \left(\sigma\left(\boldsymbol{u}_{o}^{\top} \boldsymbol{v}_{c}\right)\right)-\sum_{k=1}^{K} \log \left(\sigma\left(-\boldsymbol{u}_{k}^{\top} \boldsymbol{v}_{c}\right)\right)$$

$$\begin{aligned} \frac{\partial J}{\partial \boldsymbol{v}_{c}} &=\left(\sigma\left(\boldsymbol{u}_{o}^{\top} v_{c}\right)-1\right) \boldsymbol{u}_{o}-\sum_{k=1}^{K}\left(\sigma\left(-\boldsymbol{u}_{k}^{\top} \boldsymbol{v}_{c}\right)-1\right) \boldsymbol{u}_{k} \\ \frac{\partial J}{\partial \boldsymbol{u}_{o}} &=\left(\sigma\left(\boldsymbol{u}_{o}^{\top} \boldsymbol{v}_{c}\right)-1\right) \boldsymbol{v}_{c} \\ \frac{\partial J}{\partial \boldsymbol{u}_{k}} &=-\left(\sigma\left(-\boldsymbol{u}_{k}^{\top} \boldsymbol{v}_{c}\right)-1\right) \boldsymbol{v}_{c}, \quad \text { for all } k=1,2, \ldots, K \end{aligned}$$

(d) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[word_{c-m}, \ldots, word_{c-1}; word_c; word_{c+1}, \ldots, word_{c+m}]$, where $m$ is the context size. Denote the “input” and “output” word vectors for $word_k$ as $v_k$ and $u_k$ respectively.
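A sketch of the answer. Write $F(w, v)$ for whichever per-prediction cost from parts (a)-(c) is used (softmax cross entropy or negative sampling), where $v$ is the vector fed to the prediction. For skip-gram the input vector is always $v_c$, so

$$J_{\text{skip-gram}}=\sum_{\substack{-m \leq j \leq m \\ j \neq 0}} F\left(w_{c+j}, v_{c}\right), \qquad \frac{\partial J}{\partial v_{c}}=\sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \frac{\partial F\left(w_{c+j}, v_{c}\right)}{\partial v_{c}}, \qquad \frac{\partial J}{\partial v_{j}}=0 \text{ for } j \neq c$$

For CBOW, using the averaging convention from the summary above, $\hat{v}=\frac{1}{2 m} \sum_{-m \leq j \leq m, j \neq 0} v_{c+j}$, so

$$J_{\mathrm{CBOW}}=F\left(w_{c}, \hat{v}\right), \qquad \frac{\partial J}{\partial v_{c+j}}=\frac{1}{2 m} \frac{\partial F\left(w_{c}, \hat{v}\right)}{\partial \hat{v}} \text{ for } j \neq 0, \qquad \frac{\partial J}{\partial v_{w}}=0 \text{ for all other words}$$

In both cases the gradients with respect to the output vectors are the corresponding sums of the $\partial F / \partial U$ terms from parts (a)-(c).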

This blog is under a CC BY-NC-SA 3.0 Unported License