CS 224N: Assignment #1

Word2vec Summary

Background

  1. One-hot vectors do not capture any notion of word similarity, and they waste space.

  2. SVD-based methods first build a co-occurrence matrix (co-occurrence may be restricted to some window), then apply SVD to it and keep the top k singular values; the resulting matrix $U$ gives each word a k-dimensional representation (a minimal sketch follows after this list). But the drawbacks of this approach are obvious:

    • New words and changes in the corpus cause the matrix dimensions to change frequently.

    • The matrix is very sparse, and its dimensions are usually huge.

    • SVD has quadratic cost, and since the matrix dimensions are huge, as just noted, this cost is painful.

    • The huge disparity in word frequencies requires extra handling.

    There are targeted fixes for these problems, but none of them is good enough.
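To make the SVD-based pipeline above concrete, here is a minimal numpy sketch on a tiny made-up corpus with a window of 1; the corpus, the variable names, and the choice of $k$ are all purely illustrative.

import numpy as np

# Toy corpus; in practice this would be a large collection of sentences.
corpus = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Build the window-1 co-occurrence matrix.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the first k columns of U as k-dimensional word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k]
print(word_vectors.shape)  # (|V|, k)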

Word2vec: Iteration Based

The parameters of word2vec are the word-vector matrices themselves, and they are trained iteratively. There are two algorithms: the continuous bag-of-words model (CBOW) and skip-gram. The former predicts the center word from its context; the latter does the reverse. There are also two training methods: negative sampling and hierarchical softmax.

CBOW

$U$ is the output matrix, here the embedding matrix of the center words; $V$ is the input matrix, here the embedding matrix of the context words.

The algorithm runs as follows:

  1. Take the one-hot vectors of the words inside the window (of size m).
  2. Look each of them up in the matrix $V$ (the embedding lookup is just multiplying each one-hot vector by the matrix), giving 2m word vectors.
  3. Average these 2m vectors. (This is why the model is called a continuous bag of words: it still ignores word order, collecting the information each word contributes as if every dimension of the word vector were a bag.)
  4. Multiply the averaged vector by $U$, which amounts to taking its inner product with every center-word vector, then apply softmax to the result to obtain the probabilities.

The loss function is the cross entropy between these probabilities and the ground truth (which is really a one-hot vector, i.e. a distribution that puts probability 1 on the correct word). Cross entropy is a concept from information theory that measures how different two distributions are. Since the true distribution has mass only at a single point, and that mass is 1, the cross entropy here reduces to the negative log of the predicted probability at the index of the correct class.

So our optimization objective is:

$$ \begin{aligned} \text { minimize } J &=-\log P\left(w_{c} | w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\right) \\ &=-\log P\left(u_{c} | \hat{v}\right) \\ &=-\log \frac{\exp \left(u_{c}^{T} \hat{v}\right)}{\sum_{j=1}^{|V|} \exp \left(u_{j}^{T} \hat{v}\right)} \\ &=-u_{c}^{T} \hat{v}+\log \sum_{j=1}^{|V|} \exp \left(u_{j}^{T} \hat{v}\right) \end{aligned} $$

The rest of the work is left to SGD.
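To make the four steps above concrete, here is a tiny numpy sketch of one CBOW forward pass and its loss. The matrices, the indices, and the names (V_in, U_out, and so on) are invented purely for illustration.

import numpy as np

vocab_size, dim, m = 10, 5, 2
rng = np.random.default_rng(0)
V_in = rng.normal(size=(vocab_size, dim))   # input matrix V: context embeddings
U_out = rng.normal(size=(vocab_size, dim))  # output matrix U: center embeddings

context_ids = [1, 2, 4, 5]                  # the 2m context word indices
center_id = 3                               # the center word index

v_hat = V_in[context_ids].mean(axis=0)      # steps 2-3: lookup, then average
scores = U_out @ v_hat                      # step 4: inner product with every u_j
probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # softmax

loss = -np.log(probs[center_id])            # cross entropy with the one-hot target
print(loss)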

Skip-gram

$U, V$ still denote the output and input matrices respectively; only the word positions they represent are swapped relative to CBOW.

The procedure is almost the same as above; briefly:

  1. Use the center word's one-hot vector to look up its embedding.
  2. Multiply it by the output matrix, then apply softmax.
  3. For each context word in the window, take the cross entropy between the resulting probability vector and the true output for that word (its one-hot vector), and sum these terms up as the loss.
$$ \begin{aligned} \text { minimize } J &=-\log P\left(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} | w_{c}\right) \\ &=-\log \prod_{j=0, j \neq m}^{2 m} P\left(w_{c-m+j} | w_{c}\right) \\ &=-\log \prod_{j=0, j \neq m}^{2 m} P\left(u_{c-m+j} | v_{c}\right) \\ &=-\log \prod_{j=0, j \neq m}^{2 m} \frac{\exp \left(u_{c-m+j}^{T} v_{c}\right)}{\sum_{k=1}^{|V|} \exp \left(u_{k}^{T} v_{c}\right)} \\ &=-\sum_{j=0, j \neq m}^{2 m} u_{c-m+j}^{T} v_{c}+2 m \log \sum_{k=1}^{|V|} \exp \left(u_{k}^{T} v_{c}\right) \end{aligned} $$
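A matching toy sketch for skip-gram, again with made-up matrices and indices: the center word is scored against every vocabulary word once, and the loss sums one cross-entropy term per context word, exactly as in the objective above.

import numpy as np

vocab_size, dim = 10, 5
rng = np.random.default_rng(0)
V_in = rng.normal(size=(vocab_size, dim))   # input matrix V: center embeddings
U_out = rng.normal(size=(vocab_size, dim))  # output matrix U: context embeddings

center_id, context_ids = 3, [1, 2, 4, 5]

v_c = V_in[center_id]                       # step 1: center word embedding
scores = U_out @ v_c                        # step 2: score against every word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # softmax over the vocabulary

loss = -sum(np.log(probs[o]) for o in context_ids)  # step 3: sum over the window
print(loss)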

Negative Sampling

Note that the softmax in the algorithms above is very expensive. To reduce this cost, we can approximate it instead of computing it exactly.

Intuitively, the softmax objective is to make the outputs for the context words inside the window large and the outputs for all other words small. Treat the words inside the window as positive samples and all other words as negative samples: if we sample some negatives and, in the objective, simultaneously push the probabilities of the positives up and those of the sampled negatives down, we approximately achieve the same effect.

Here we model the probability of a single (word, context) pair with a sigmoid and design the loss function by maximum likelihood, which gives:

$$ J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}-\sum_{(w, c) \in \tilde{D}} \log \left(\frac{1}{1+\exp \left(u_{w}^{T} v_{c}\right)}\right) $$

where $D$ is the set of observed (word, context) pairs and $\tilde{D}$ is the set of sampled negative pairs.

Concretely, for skip-gram, the objective for the context word at position $c-m+j$ is:

$$ -\log \sigma\left(u_{c-m+j}^{T} \cdot v_{c}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot v_{c}\right) $$

The CBOW case is analogous.

The K negative samples are drawn from some distribution. Empirically, the unigram distribution raised to the power $\dfrac{3}{4}$ works best. The reason for the $\dfrac{3}{4}$, roughly, is that numbers close to 1 barely change while very small numbers grow noticeably, which makes rare words easier to sample. A word like "is" (0.9) hardly moves ($0.9^{3/4} \approx 0.92$), whereas "bombastic" (0.01) gets roughly 3 times its previous sampling probability ($0.01^{3/4} \approx 0.032$).
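Below is a rough numpy sketch of the per-pair negative-sampling loss together with the $\frac{3}{4}$-power unigram sampling; the counts, the indices, and all names are invented for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, dim, K = 10, 5, 3
V_in = rng.normal(size=(vocab_size, dim))
U_out = rng.normal(size=(vocab_size, dim))

# Unigram counts raised to the 3/4 power, normalized into a sampling distribution.
counts = rng.integers(1, 100, size=vocab_size).astype(float)
p_neg = counts ** 0.75
p_neg /= p_neg.sum()

center_id, context_id = 3, 5
neg_ids = rng.choice(vocab_size, size=K, p=p_neg)        # K negative samples

v_c = V_in[center_id]
loss = -np.log(sigmoid(U_out[context_id] @ v_c))         # positive (window) pair
loss += -np.sum(np.log(sigmoid(-U_out[neg_ids] @ v_c)))  # K sampled negative pairs
print(loss)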

Hierarchical Softmax

Hierarchical softmax builds a binary tree so that computing the probability of a particular word no longer requires a pass over the whole vocabulary. Each leaf of the tree represents a word, and apart from the root and the leaves every internal node has an embedding of its own. Given a vector, computing a word's probability means walking the path from the root to that word's leaf and multiplying together the probability of each step along the way. Each step's probability is given by a sigmoid: moving to the left child uses $P(X=1)$ and moving to the right child uses $P(X=0) = 1 - P(X=1)$. This is itself a kind of normalization and guarantees that the probabilities over all words sum to 1.

The speed of this method depends on how the tree is built. In Mikolov et al.'s paper the tree is a Huffman tree, whose defining property is that frequent words get shorter paths to their leaves.
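To make the path-product description concrete, here is a toy sketch over a four-word vocabulary with a hand-built tree; the tree, the vectors, and the names are all made up, and for simplicity every branching node (including the root) carries a vector here. Each step multiplies in a sigmoid, with $\sigma(x)$ for the left branch and $\sigma(-x) = 1-\sigma(x)$ for the right branch, so the leaf probabilities sum to 1.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim = 5
rng = np.random.default_rng(0)
inner_vecs = rng.normal(size=(3, dim))   # one vector per branching node

# For each word: the (node index, direction) pairs on its root-to-leaf path,
# with direction +1 for the left branch and -1 for the right branch.
paths = {
    "cat": [(0, +1), (1, +1)],
    "dog": [(0, +1), (1, -1)],
    "the": [(0, -1), (2, +1)],
    "a":   [(0, -1), (2, -1)],
}

def hs_prob(word, v):
    p = 1.0
    for node, direction in paths[word]:
        p *= sigmoid(direction * (inner_vecs[node] @ v))
    return p

v = rng.normal(size=dim)
print(sum(hs_prob(w, v) for w in paths))  # ~1.0: a valid distribution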

Now for the assignment itself. After comparing the two, I find the pre-2019 first assignment more interesting than the 2019 version, so I chose to complete the first programming assignment from the older version.

1. Softmax

(a) Prove that softmax is invariant to constant offsets in the input, that is, for any input vector $\bf{x}$ and any constant $c$
$$
\mathbf{softmax}(\mathbf{x}) = \mathbf{softmax}(\mathbf{x} + c)
$$
where $ \bf{x} + c$ means adding the constant $c$ to every dimension of $\bf{x}$. Remember that $$\mathbf{softmax}(\mathbf{x})_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$$

The proof is immediate: multiplying both the numerator and the denominator of the softmax by $e^{c}$ turns $\mathbf{softmax}(\mathbf{x})$ into $\mathbf{softmax}(\mathbf{x} + c)$. This property matters a great deal when we actually implement softmax, though.

Consider having to compute the softmax of $\mathbf{x} = [1001, 1002]$: $e^{1001}$ will certainly overflow. But as far as softmax is concerned, what difference is there between $[1001, 1002]$ and $[1, 2]$? So in practice we usually take $c = -\mathbf{max}\space x_i$ to keep the computation numerically stable.

(b) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1_softmax.py. You may test by executing python q1_softmax.py.

The code is as follows:

import numpy as np


def softmax(x):
    """Compute the softmax function for each row of the input x.

    You should also make sure that your code works for a single
    D-dimensional vector (treat the vector as a single row) and
    for N x D matrices. This may be useful for testing later. Also,
    make sure that the dimensions of the output match the input.

    Arguments:
    x -- A D dimensional vector or N x D dimensional numpy matrix.

    Return:
    x -- You are allowed to modify x in-place
    """
    orig_shape = x.shape

    if len(x.shape) > 1:
        # Matrix

        # scale down
        x_max = np.max(x, axis=1).reshape(-1,1)
        x -= x_max

        # softmax process
        x = np.exp(x)
        exp_sum = np.sum(x, axis=1).reshape(-1,1)
        x /= exp_sum

    else:
        # Vector

        # scale down
        x_max = np.max(x)
        x -= x_max

        # softmax process
        x = np.exp(x)
        x /= np.sum(x)


    assert x.shape == orig_shape
    return x

2. Neural Network Basics

(a) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question. Recall, the sigmoid function is $\sigma(x) = \dfrac{1} {1 + e^{−x}}$ .

The derivation is straightforward: $(\sigma(x))' = \sigma(x)(1 - \sigma(x))$.
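A quick numerical check of this identity at one point, purely for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(abs(numeric - analytic) < 1e-6)  # True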

(b) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathbf{softmax}(\theta)$. Remember the cross entropy function is
$$
\mathbf{CE}(\mathbf{y}, \mathbf{\hat{y}}) = -\sum_i y_i\log \hat{y}_i
$$
where $\mathbf{y}$ is the one-hot label vector, and $\mathbf{\hat{y}}$ is the predicted probability vector for all classes.

We know that $\mathbf{y}$ is a one-hot label vector, i.e. exactly one of its components is $1$ and the rest are $0$. Suppose the $i$-th component of $\mathbf{y}$ is $1$.

For the $i$-th component: $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta_i}} = \dfrac{\partial{\mathbf{CE}}}{\partial{\hat{y_i}}} \dfrac{\partial{\hat{y_i}}}{\partial{\theta_i}} = -\dfrac{1}{\hat{y_i}}\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_i}$

Now $\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_i} = \dfrac{e^{\theta_i}(\sum_k e^{\theta_k}) - e^{\theta_i}e^{\theta_i}}{(\sum_k e^{\theta_k})^2}= \mathbf{softmax}(\theta)_i(1 - \mathbf{softmax}(\theta)_i) = \hat{y_i}(1-\hat{y_i})$

Hence $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}_i} = \hat{y_i} - 1$.

For a component $j \neq i$: $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta_j}} = \dfrac{\partial{\mathbf{CE}}}{\partial{\hat{y_i}}} \dfrac{\partial{\hat{y_i}}}{\partial{\theta_j}} = -\dfrac{1}{\hat{y_i}}\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_j}$

Now $\dfrac{\partial \space\mathbf{softmax}(\theta)_i} {\partial \theta_j} = \dfrac{0\cdot(\sum_k e^{\theta_k}) - e^{\theta_j}e^{\theta_i}}{(\sum_k e^{\theta_k})^2} = -\hat{y_i}\hat{y_j}$

Hence $\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}_j} = \hat{y_j} = \hat{y_j} - 0$.

The two cases combine into the single form $\hat{y_i} - y_i$, so
$$
\dfrac{\partial{\mathbf{CE}}}{\partial{\theta}} = \hat{y} - y
$$
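A quick numerical spot-check of this result on a random $\theta$; index 2 plays the role of the correct class here, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                                   # one-hot label

def ce(t):
    p = np.exp(t - t.max())
    p /= p.sum()
    return -np.log(p[2])                     # cross entropy against the label

yhat = np.exp(theta - theta.max())
yhat /= yhat.sum()
analytic = yhat - y                          # the gradient derived above

h = 1e-6
numeric = np.array([(ce(theta + h * np.eye(5)[i]) - ce(theta - h * np.eye(5)[i])) / (2 * h)
                    for i in range(5)])
print(np.allclose(analytic, numeric, atol=1e-6))  # True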

(c) Derive the gradients with respect to the inputs x to a one-hidden-layer neural network (that is, find $\dfrac{\partial{J}}{\partial{x}}$ where $J = \mathbf{CE}(y, \hat{y})$ is the cost function for the neural network). The neural network employs sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is y, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for sigmoid gradient, and feel free to define any variables whenever you see fit.)

As given in the problem, the network is: $$ \boldsymbol{h}=\operatorname{sigmoid}\left(\boldsymbol{x} \boldsymbol{W}_{1}+\boldsymbol{b}_{1}\right) \quad \hat{\boldsymbol{y}}=\operatorname{softmax}\left(\boldsymbol{h} \boldsymbol{W}_{2}+\boldsymbol{b}_{2}\right) $$ Write $z_{1} = \boldsymbol{xW}_{1} + \boldsymbol{b}_{1}$ and $z_{2} = \boldsymbol{hW}_{2} + \boldsymbol{b}_{2}$; then $$ \begin{aligned} \boldsymbol{\delta}_{1} &=\frac{\partial C E}{\partial \boldsymbol{z}_{2}}=\hat{\boldsymbol{y}}-\boldsymbol{y} \\ \boldsymbol{\delta}_{2} &=\frac{\partial C E}{\partial \boldsymbol{h}}=\boldsymbol{\delta}_{1} \frac{\partial \boldsymbol{z}_{2}}{\partial \boldsymbol{h}}=\boldsymbol{\delta}_{1} \boldsymbol{W}_{2}^{\top} \\ \boldsymbol{\delta}_{3} &=\frac{\partial C E}{\partial \boldsymbol{z}_{1}}=\boldsymbol{\delta}_{2} \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}_{1}}=\boldsymbol{\delta}_{2} \circ \sigma^{\prime}\left(\boldsymbol{z}_{1}\right) \\ \frac{\partial C E}{\partial \boldsymbol{x}} &=\boldsymbol{\delta}_{3} \frac{\partial \boldsymbol{z}_{1}}{\partial \boldsymbol{x}}=\boldsymbol{\delta}_{3} \boldsymbol{W}_{1}^{\top} \end{aligned} $$

This neural network has $(D_x + 1) \cdot H + (H + 1) \cdot D_y$ parameters.

Next, the code that implements this network. First, the gradient checker:

import random

import numpy as np


# First implement a gradient checker by filling in the following functions
def gradcheck_naive(f, x):
    """ Gradient check for a function f.

    Arguments:
    f -- a function that takes a single argument and outputs the
         cost and its gradients
    x -- the point (numpy array) to check the gradient at
    """

    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-4        # Do not change this!

    # Iterate over all indexes ix in x to check the gradient.
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        # Central difference: evaluate f at x + h and x - h along dimension ix.
        # Reset the random state before each call so that cost functions with
        # built-in randomness (e.g. negative sampling later) stay comparable.
        x[ix] += h
        random.setstate(rndstate)
        fx_forward, _ = f(x)

        x[ix] -= 2*h
        random.setstate(rndstate)
        fx_backward, _ = f(x)

        x[ix] += h  # restore the original value

        numgrad = (fx_forward - fx_backward) / (2 * h)

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print("Gradient check failed.")
            print("First gradient error found at index %s" % str(ix))
            print("Your gradient: %f \t Numerical gradient: %f" % (
                grad[ix], numgrad))
            return

        it.iternext() # Step to next dimension

    print("Gradient check passed!")

Then complete the function for the two-layer MLP itself:

import numpy as np

# Helpers implemented in the other assignment files (q1_softmax.py, q2_sigmoid.py).
from q1_softmax import softmax
from q2_sigmoid import sigmoid, sigmoid_grad


def forward_backward_prop(X, labels, params, dimensions):
    """
    Forward and backward propagation for a two-layer sigmoidal network

    Compute the forward propagation and for the cross entropy cost,
    the backward propagation for the gradients for all parameters.

    Notice the gradients computed here are different from the gradients in
    the assignment sheet: they are w.r.t. weights, not inputs.

    Arguments:
    X -- M x Dx matrix, where each row is a training example x.
    labels -- M x Dy matrix, where each row is a one-hot vector.
    params -- Model parameters, these are unpacked for you.
    dimensions -- A tuple of input dimension, number of hidden units
                  and output dimension
    """

    # Unpack network parameters
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    W1 = np.reshape(params[ofs:ofs+ Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # Note: compute cost based on `sum` not `mean`.
    # forward propagation
    h = sigmoid(X.dot(W1) + b1)  # N x H
    yhat = softmax(h.dot(W2) + b2)  # N x Dy

    cost = -np.sum(np.log(yhat[np.where(labels)]))

    # backward propagation
    gradz2 = (yhat - labels)
    gradW2 = (h.T).dot(gradz2)
    gradb2 = np.sum(gradz2, axis=0, keepdims=True)

    gradh = np.dot(gradz2, W2.T)
    gradz1 = sigmoid_grad(h) * gradh

    gradW1 = (X.T).dot(gradz1)
    gradb1 = np.sum(gradz1, axis=0, keepdims=True)


    # Stack gradients
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
        gradW2.flatten(), gradb2.flatten()))

    return cost, grad
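A quick way to exercise this function, modeled on the assignment's sanity check; it reuses gradcheck_naive and the imports from above, and the sizes are arbitrary.

N, dimensions = 20, [10, 5, 10]
data = np.random.randn(N, dimensions[0])          # random inputs
labels = np.zeros((N, dimensions[2]))
for i in range(N):                                # random one-hot labels
    labels[i, random.randint(0, dimensions[2] - 1)] = 1

params = np.random.randn((dimensions[0] + 1) * dimensions[1] +
                         (dimensions[1] + 1) * dimensions[2], )
gradcheck_naive(lambda p: forward_backward_prop(data, labels, p, dimensions),
                params)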

3. word2vec

(a) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skip-gram, and word prediction is made with the softmax function found in word2vec models
$$
\hat{\boldsymbol{y}}_{o}=p(\boldsymbol{o} | \boldsymbol{c})=\frac{\exp \left(\boldsymbol{u}_{o}^{\top} \boldsymbol{v}_{c}\right)}{\sum_{w=1}^{W} \exp \left(\boldsymbol{u}_{w}^{\top} \boldsymbol{v}_{c}\right)}
$$
where w denotes the $w$-th word and $u_w$ ($w = 1, \ldots, W $) are the “output” word vectors for all words in
the vocabulary. Assume cross entropy cost is applied to this prediction and word o is the expected word
(the $o$-th element of the one-hot label vector is one), derive the gradients with respect to $v_c$.

Let the columns of $U$ be the output vectors $u_w$, let $\hat{y} = \mathbf{softmax}(U^{\top}v_c)$ be the predicted probability vector, and let $y$ be the one-hot label vector (with a 1 at index $o$). Then:

$$ \frac{\partial J}{\partial \boldsymbol{v}_{c}}=U(\hat{\boldsymbol{y}}-\boldsymbol{y}) $$ or, expanding the product: $$ \frac{\partial J}{\partial \boldsymbol{v}_{c}}=-\boldsymbol{u}_{o}+\sum_{w=1}^{W} \hat{y}_{w} \boldsymbol{u}_{w} $$

(b) As in the previous part, derive gradients for the “output” word vectors $u_w$’s (including $u_o$).

The derivation is essentially the same as in (a):

$$ \frac{\partial J}{\partial \boldsymbol{U}}=\boldsymbol{v}_{c}(\hat{\boldsymbol{y}}-\boldsymbol{y})^{\top} $$
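For reference, here is a compact numpy sketch of what these two gradients look like in code, a simplified stand-in for the assignment's softmaxCostAndGradient. Here outputVectors stacks the $u_w$ as rows, so the expressions appear transposed relative to the column-convention formulas above.

import numpy as np

def softmax_cost_and_gradient(predicted, target, outputVectors):
    """predicted: v_c of shape (d,); target: index o of the expected word;
    outputVectors: matrix whose rows are the u_w, shape (W, d)."""
    scores = outputVectors @ predicted          # u_w . v_c for every w
    yhat = np.exp(scores - scores.max())
    yhat /= yhat.sum()                          # softmax over the vocabulary

    cost = -np.log(yhat[target])                # cross entropy with one-hot y

    delta = yhat.copy()
    delta[target] -= 1.0                        # yhat - y
    grad_pred = outputVectors.T @ delta         # dJ/dv_c
    grad_out = np.outer(delta, predicted)       # dJ/dU, row w is dJ/du_w
    return cost, grad_pred, grad_out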

(c) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that K negative samples (words) are drawn, and
they are $1,\ldots, K$, respectively for simplicity of notation ($o \notin\{1, \ldots, K\}$ ). Again, for a given word, $o$,
denote its output vector as $u_o$. The negative sampling loss function in this case is
$$
J_{n e g-s a m p l e}\left(\boldsymbol{o}, \boldsymbol{v}_{c}, \boldsymbol{U}\right)=-\log \left(\sigma\left(\boldsymbol{u}_{o}^{\top} \boldsymbol{v}_{c}\right)\right)-\sum_{k=1}^{K} \log \left(\sigma\left(-\boldsymbol{u}_{k}^{\top} \boldsymbol{v}_{c}\right)\right)
$$

$$ \begin{aligned} \frac{\partial J}{\partial \boldsymbol{v}_{c}} &=\left(\sigma\left(\boldsymbol{u}_{o}^{\top} v_{c}\right)-1\right) \boldsymbol{u}_{o}-\sum_{k=1}^{K}\left(\sigma\left(-\boldsymbol{u}_{k}^{\top} \boldsymbol{v}_{c}\right)-1\right) \boldsymbol{u}_{k} \\ \frac{\partial J}{\partial \boldsymbol{u}_{o}} &=\left(\sigma\left(\boldsymbol{u}_{o}^{\top} \boldsymbol{v}_{c}\right)-1\right) \boldsymbol{v}_{c} \\ \frac{\partial J}{\partial \boldsymbol{u}_{k}} &=-\left(\sigma\left(-\boldsymbol{u}_{k}^{\top} \boldsymbol{v}_{c}\right)-1\right) \boldsymbol{v}_{c}, \quad \text { for all } k=1,2, \ldots, K \end{aligned} $$
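A matching sketch for the negative-sampling gradients in part (c), a simplified stand-in for the assignment's negSamplingCostAndGradient; the K sampled indices are passed in rather than drawn here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_cost_and_gradient(predicted, target, neg_indices, outputVectors):
    """predicted: v_c of shape (d,); target: index o; neg_indices: the K sampled
    negative indices; outputVectors: matrix whose rows are the u_w, shape (W, d)."""
    grad_out = np.zeros_like(outputVectors)

    z = sigmoid(outputVectors[target] @ predicted)
    cost = -np.log(z)
    grad_pred = (z - 1.0) * outputVectors[target]      # dJ/dv_c, positive term
    grad_out[target] = (z - 1.0) * predicted           # dJ/du_o

    for k in neg_indices:
        zk = sigmoid(-outputVectors[k] @ predicted)
        cost -= np.log(zk)
        grad_pred -= (zk - 1.0) * outputVectors[k]     # negative terms of dJ/dv_c
        grad_out[k] -= (zk - 1.0) * predicted          # dJ/du_k (accumulates repeats)
    return cost, grad_pred, grad_out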

(d) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[word_{c-m}, \ldots, word_{c-1}; word_c; word_{c+1}, \ldots word_{c+m}] $, where m is the context size. Denote the “input” and “output” word vectors for $word_k$ as $v_k$ and $u_k$ respectively.

Here, let $F(o, v_c)$ stand for either $J_{softmax-CE}(o, v_c, \ldots)$ or $J_{neg-sample}(o, v_c, \ldots)$.

For skip-gram: $$ \begin{array}{l}{\dfrac{\partial J_{\text {skip-gram }}\left(\text { word }_{c-m \ldots c+m}\right)}{\partial U}=\sum_{-m \leq j \leq m, j \neq 0} \dfrac{\partial F\left(\boldsymbol{w}_{c+j}, \boldsymbol{v}_{c}\right)}{\partial \boldsymbol{U}}} \\ {\dfrac{\partial J_{\text {skip-gram }}\left(\text { word }_{c-m \ldots c+m}\right)}{\partial \boldsymbol{v}_{c}}=\sum_{-m \leq j \leq m, j \neq 0} \dfrac{\partial F\left(\boldsymbol{w}_{c+j}, \boldsymbol{v}_{c}\right)}{\partial \boldsymbol{v}_{c}}} \\ {\dfrac{\partial J_{\text {skip-gram }}\left(\text { word }_{c-m \ldots c+m}\right)}{\partial \boldsymbol{v}_{j}}=0, \text { for all } j \neq c}\end{array} $$ For CBOW: $$ \begin{array}{l}{\dfrac{\partial J_{\mathrm{CBOW}}\left(\text { word }_{c-m \ldots c+m}\right)}{\partial U}=\dfrac{\partial F\left(\boldsymbol{w}_{c}, \hat{\boldsymbol{v}}\right)}{\partial \boldsymbol{U}}} \\ {\dfrac{\partial J_{\mathrm{CBOW}}\left(\text { word }_{c-m, \ldots c+m}\right)}{\partial \boldsymbol{v}_{j}}=\dfrac{\partial F\left(\boldsymbol{w}_{c}, \hat{\boldsymbol{v}}\right)}{\partial \hat{\boldsymbol{v}}}, \quad \text { for all } j \in\{c-m, \ldots, c-1, c+1, \ldots, c+m\} \quad \text { (using the definition of } \hat{\boldsymbol{v}} \text { in the problem)}} \\ {\dfrac{\partial J_{\mathrm{CBOW}}\left(\text { word }_{c-m, \ldots c+m}\right)}{\partial \boldsymbol{v}_{j}}=\mathbf{0}, \quad \text { for all } j \notin\{c-m, \ldots, c-1, c+1, \ldots, c+m\}}\end{array} $$
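These sums translate almost directly into code. Here is a rough skip-gram wrapper over the per-word cost functions sketched earlier; the name and signature are my own, loosely modeled on the assignment's skipgram function, and cost_and_grad is expected to behave like the softmax sketch above.

import numpy as np

def skipgram(center_idx, context_indices, inputVectors, outputVectors, cost_and_grad):
    """cost_and_grad(v_c, target, outputVectors) -> (cost, dJ/dv_c, dJ/dU)."""
    cost = 0.0
    grad_in = np.zeros_like(inputVectors)
    grad_out = np.zeros_like(outputVectors)

    v_c = inputVectors[center_idx]
    for target in context_indices:
        c, g_pred, g_out = cost_and_grad(v_c, target, outputVectors)
        cost += c
        grad_in[center_idx] += g_pred    # only the center row of V gets gradient
        grad_out += g_out
    return cost, grad_in, grad_out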
