Word Embedding
One-Hot Encoder
Before word2vec appeared, words were usually converted into one-hot (unit) vectors.
For example, "北京" (Beijing) might correspond to index 3987 and "中国" (China) to 5178: the vector has a 1 at that position and 0 everywhere else.
For a document, the corresponding vector is the sum of the vectors of all the words it contains.
Drawbacks
- the encoding is arbitrary, so the vectors reveal nothing about the relationship between two words
- the sparse vectors make training inefficient
Vector Space Models
Words can be mapped to continuous-valued vectors.
The main underlying assumption is the Distributional Hypothesis: words that occur in the same contexts are mapped to nearby positions in the vector space.
There are two concrete families of methods:
- count-based models, which record how often neighboring words co-occur and then compress the counts into a small, dense matrix (e.g. Latent Semantic Analysis); see the short sketch after this list
- predictive models, which predict the word at a given position from its neighbors (e.g. Neural Probabilistic Language Models)
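Not from the original post, but a minimal count-based sketch under toy assumptions: build a word-word co-occurrence matrix and compress it with truncated SVD, which is roughly the LSA idea. The corpus, window size and output dimensionality are made up for illustration.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [['I', 'like', 'deep', 'learning'],
          ['I', 'like', 'NLP'],
          ['I', 'enjoy', 'flying']]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# count how often adjacent words co-occur (window of 1)
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for a, b in zip(sent, sent[1:]):
        cooc[idx[a], idx[b]] += 1
        cooc[idx[b], idx[a]] += 1

# compress the counts into small, dense word vectors
dense = TruncatedSVD(n_components=2).fit_transform(cooc)
print(dict(zip(vocab, dense.round(2))))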
From one-hot to dense vectors: suppose the embedding matrix is $E_{300\times 10k}$, i.e. the vocabulary contains 10k words. With one-hot encoding a word is written as $O_{10k\times 1}$, and $E$ compresses it down to 300 dimensions: the dense vector of the word is the matrix product $EO$.
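A tiny numpy illustration of this (my own, with toy sizes instead of 300 and 10k): multiplying $E$ by a one-hot vector simply selects the corresponding column of $E$, which is why frameworks implement the embedding as a table lookup rather than an actual matrix product.
import numpy as np

dim, vocab_size = 3, 5                      # toy sizes instead of 300 and 10k
E = np.random.randn(dim, vocab_size)        # embedding matrix E
word_id = 2
O = np.zeros(vocab_size)
O[word_id] = 1.0                            # one-hot vector O
vec = E @ O                                 # the dense vector EO
assert np.allclose(vec, E[:, word_id])      # identical to picking column word_id of E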
hierarchical softmax
Negative sampling
There are 3 points, sketched on scratch paper, to be organized later.
GloVe word vectors
CBOW and skip-gram
CBOW predicts a word from its context; skip-gram predicts the context from a word.
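A rough illustration (toy sentence and window of my own choosing) of the training pairs each variant generates:
sentence = ['the', 'quick', 'brown', 'fox', 'jumps']
window = 2
for i, target in enumerate(sentence):
    context = sentence[max(i - window, 0):i] + sentence[i + 1:i + window + 1]
    print('CBOW     ', context, '->', target)    # CBOW: the context predicts the word
    for c in context:
        print('skip-gram', target, '->', c)      # skip-gram: the word predicts each context word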
Word2Vec
Neural Probabilistic Language Models can usually be trained by maximum likelihood estimation (MLE), but the computation is enormous.
In Word2Vec's CBOW, the full probability does not need to be computed; instead a binary classifier is trained to distinguish the real target word from fabricated noise words.
This trick of training against fabricated noise is called negative sampling.
In practice we use the Noise-Contrastive Estimation (NCE) loss.
TensorFlow provides tf.nn.nce_loss(), which implements it directly; a small sketch of the underlying binary objective follows below.
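To make the binary-classification view concrete, here is a small numpy sketch of the negative-sampling objective for a single (center, context) pair with k noise words. This is my own illustration of the idea, not the internals of tf.nn.nce_loss, and all sizes are arbitrary.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, dim, k = 1000, 50, 5
in_vec = np.random.randn(vocab_size, dim) * 0.01    # "input" (center-word) embeddings
out_vec = np.random.randn(vocab_size, dim) * 0.01   # "output" (context-word) embeddings

center, true_context = 3, 17
noise = np.random.randint(vocab_size, size=k)        # k fabricated noise words

# the real pair is pushed towards label 1, the noise pairs towards label 0:
# loss = -log sigma(u_true . v_center) - sum_j log sigma(-u_noise_j . v_center)
loss = -np.log(sigmoid(out_vec[true_context] @ in_vec[center]))
loss -= np.log(sigmoid(-(out_vec[noise] @ in_vec[center]))).sum()
print(loss)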
Theoretical assumptions
- distributional hypothesis
  - words that appear in the same contexts have similar meanings
- drawbacks
  - the synonym problem, as the following example shows
For example, "美国总统特朗普决定在墨西哥边境修建隔离墙" and "听说美国那个川普总统要在墨西哥边上修个墙" say the same thing using the synonyms 特朗普 and 川普 (both referring to Trump), while in "同学你的川普挺带感啊" the same word 川普 instead means Sichuan-accented Mandarin.
Code implementation: gensim
Official tutorial: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
Usage
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
model = gensim.models.Word2Vec(sentences=sentences, min_count=1)  # min_count=1 so the toy words are not pruned
model.wv.word_vec('first')
Parameter details
gensim.models.Word2Vec(
sentences=None,
size=100,
alpha=0.025,
window=5,
min_count=5,
workers=3,
min_alpha=0.0001,
sg=0,
hs=0,
negative=5,
iter=5,
sorted_vocab=1,
compute_loss=False,
)
- sentences: the corpus, as a list of tokenized sentences or as a LineSentence; it can also be left out here and supplied later
- size=100: number of hidden units, i.e. the number of parameters and the dimensionality of the word vectors; a few tens to a few hundred is recommended
- window=5: maximum distance between the current and predicted word within a sentence
- min_count=5: words that occur fewer times than this are discarded
- workers=3: number of parallel threads; workers=multiprocessing.cpu_count() uses them all
- sg=0: 0 selects CBOW, 1 selects skip-gram
- hs=0: 0 uses negative sampling, 1 uses hierarchical softmax
- negative=5: number of noise words drawn per positive sample, usually 5~20
- alpha / min_alpha: initial and final learning rates
- iter=5: number of training iterations (epochs)
- sorted_vocab=1: 1 sorts the vocabulary by descending frequency, 0 leaves it unsorted
- compute_loss=False: if True, the loss is stored and can be read via gensim.models.word2vec.Word2Vec.get_latest_training_loss(); getting the loss at every step is more awkward
The corpus does not have to be specified up front:
import multiprocessing
model = gensim.models.Word2Vec(size=200, window=8, min_count=10, iter=10, workers=multiprocessing.cpu_count())
model.build_vocab(x)
model.train(x, total_examples=model.corpus_count, epochs=model.iter)
Reading training data from a file
- step 1: read in the data as a list in which each element is one sentence
- step 2: tokenize, filter stop words, join the tokens with spaces and save to a file (see another post for details; a minimal segmentation sketch appears at the end of this subsection)
- step 3: read the file back in
line_sentence = gensim.models.word2vec.LineSentence(source='中文语料已空格切分.txt')
- step 4: word2vec
model = gensim.models.Word2Vec(line_sentence)
- step 5: inspect the results
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}
The results are not pasted here, but they look decent. Because the dataset is small, iter can be turned up.
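A minimal sketch of step 2, assuming jieba is used for the Chinese word segmentation (the original defers to another post); the input file name and the stop-word list here are hypothetical, while the output name matches the file read in step 3.
import jieba

stop_words = {'的', '了', '是'}   # hypothetical stop-word list
# '中文语料.txt' is a hypothetical raw-corpus file, one sentence per line
with open('中文语料.txt', encoding='utf-8') as fin, \
        open('中文语料已空格切分.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        tokens = [w for w in jieba.cut(line.strip()) if w.strip() and w not in stop_words]
        fout.write(' '.join(tokens) + '\n')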
Analyzing the results
model.wv.index2word # vocabulary
model.wv.most_similar(word, topn=10)  # return the words most similar to word
This makes operations like the following possible:
- look at the nearest neighbors of some words
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word[300:1000]}
- the above keeps low-similarity neighbors as well; filter them out below
# drop neighbors whose similarity score is below 0.4
tmp = {key: [(i, j) for i, j in value if j > 0.4] for key, value in most_similars_precalc.items()}
# also drop words whose neighbor list is now empty (nothing above 0.4)
{key: value for key, value in tmp.items() if len(value) > 0}
There are also some other operations, for example
model.evaluate_word_pairs(*)
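For instance (my guess at typical usage, meaningful mainly for English vectors): gensim ships a word-similarity benchmark file with its test data, and the method returns Pearson / Spearman correlations against human judgements plus the out-of-vocabulary ratio.
from gensim.test.utils import datapath
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))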
Code implementation: TensorFlow
I gathered code from around the web and rewrote every line; the result should be fairly general-purpose.
step 1. Load the data
import tensorflow as tf
import numpy as np
batch_size = 100
embedding_size = 200
vocabulary_size = 10000
generations = 100000
print_loss_every = 2000
num_sampled = int(batch_size / 2) # Number of negative examples to sample.
window_size = 2  # context window of the skip-gram model
# %%
# load the data; the corpus can be downloaded from 'http://mattmahoney.net/dc/' + 'text8.zip'
import zipfile
def read_data(filename):
    """Extract the first file enclosed in a zip archive and return it as a single text string"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0]))  # add .split() here to get a list of words instead
    return data
filename = 'text8.zip'
words = read_data(filename)
texts = [words]
step 2. Clean the data
Remove punctuation and the like. The text8 data is already clean, so this step is not needed here, but other datasets may require it.
import re
regex = re.compile('[a-zA-Z]{1,}')  # runs of one or more letters; strips punctuation and digits
texts = [regex.findall(sentence.lower()) for sentence in texts]
texts = [' '.join(words) for words in texts]  # re-join into sentences
step 3. Convert to index sequences
tensorflow.contrib.learn.preprocessing.VocabularyProcessor can do this, but it is scheduled for removal and the implementations found online are not very Pythonic,
so I wrote my own class that turns word sequences into index sequences.
import collections
class VocabularyProcessor:
    '''
    author: guofei
    site: www.guofei.site
    Takes a list of sentences and replaces every word by its integer index, e.g.
    ```
    corpus = ['this is the first document',
              'this is the second second document',
              'and the third one',
              'is this the first document']
    wp = VocabularyProcessor(max_vocabulary_size=5)
    wp.fit(corpus)
    X = wp.transform(corpus)
    ```
    Low-frequency words are dropped from the vocabulary and mapped to the token 'RARE', whose index is fixed at 0
    '''
def __init__(self, max_vocabulary_size=10000):
self.max_vocabulary_size = max_vocabulary_size
self.word_count = None
self.dictionary = None
self.reverse_dictionary = None
    def fit(self, sentences):
        words = [word for sentence in sentences for word in sentence.split(' ')]
        word_count_ = collections.Counter(words)  # word frequencies
        word_count = word_count_.most_common(self.max_vocabulary_size - 1)  # the most frequent words, capped
        words_most = [word for word, num in word_count]  # highest-frequency words
        len_words_most = len(words_most)
        dictionary = dict(zip(words_most, range(1, len_words_most + 1)))
        reverse_dictionary = dict(zip(range(1, len_words_most + 1), words_most))
        # add 'RARE', i.e. the bucket for all low-frequency words
        rare_count = sum([word_count_[word] for word in word_count_ if word not in words_most])
        word_count = [('RARE', rare_count)] + word_count
        dictionary['RARE'] = 0
        reverse_dictionary[0] = 'RARE'
        self.word_count = word_count
        self.dictionary = dictionary
        self.reverse_dictionary = reverse_dictionary
def transform(self, sentences):
return [[self.dictionary.get(word, 0) for word in sentence.split(' ')] for sentence in sentences]
def fit_transform(self, sentences):
self.fit(sentences)
return self.transform(sentences)
vp = VocabularyProcessor(max_vocabulary_size=vocabulary_size)
vp.fit(texts)
text_data = vp.transform(texts)
word_dictionary = vp.dictionary
word_dictionary_rev = vp.reverse_dictionary
step 4. The skip-gram model
The batch generator is also rewritten from scratch, to cope with the following situations:
- the data can be a single line containing hundreds of thousands of words, so we cannot pre-generate all (input, label) pairs per sentence and then pick batch_size of them
- no global variables
- the selection must not be too random, otherwise the doc2vec algorithm cannot make full use of all the data
def generate_batch_data(sentences, batch_size, window_size):
batch_data, label_data, batch_size_current = [], [], 0
while True:
rand_sentence = sentences[np.random.randint(len(sentences))]
len_sentence = len(rand_sentence)
for ind, word in enumerate(rand_sentence):
x = word
y_list = rand_sentence[max(ind - window_size, 0):ind] \
+ rand_sentence[ind + 1:min(ind + window_size + 1, len_sentence)]
for y in y_list:
batch_data.append(x)
label_data.append([y])
batch_size_current += 1
if batch_size_current == batch_size:
yield batch_data, label_data
batch_data, label_data, batch_size_current = [], [], 0
batch_data_generater = generate_batch_data(sentences=text_data, batch_size=batch_size, window_size=2)
# example:
# batch_inputs, batch_labels = next(batch_data_generater)
# these are later fed to the model: feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
step 5. Build and train the model
# print the nearest words for the validation set every 5000 iterations
print_valid_every = 5000
valid_words = ['into', 'on', 'five', 'have', 'as']
valid_examples = [word_dictionary[x] for x in valid_words]  # indices of the validation words in the vocabulary
# %%
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, x_inputs)
# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Get loss from prediction
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=y_target,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
# Cosine similarity between words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
# %%
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
# Run the skip gram model.
loss_vec = []
for i in range(generations):
    batch_inputs, batch_labels = next(batch_data_generater)
feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
sess.run(optimizer, feed_dict=feed_dict)
# Return the loss
if (i + 1) % print_loss_every == 0:
loss_val = sess.run(loss, feed_dict=feed_dict)
loss_vec.append(loss_val)
print('Loss at step {} : {}'.format(i + 1, loss_val))
# Validation: Print some words and top 5 related words
if (i + 1) % print_valid_every == 0:
sim = sess.run(similarity, feed_dict=feed_dict)
for j in range(len(valid_words)):
valid_word = valid_words[j]
top_k = 5
nearest = (-sim[j, :]).argsort()[1:top_k + 1]
log_str = "Nearest to {}:".format(valid_word)+ \
','.join([word_dictionary_rev[nearest_k] for nearest_k in nearest])
print(log_str)
Plotting
final_embeddings=sess.run(embeddings)
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
plt.figure(figsize=(18, 18)) #in inches
for i, label in enumerate(labels):
x, y = low_dim_embs[i,:]
plt.scatter(x, y)
plt.annotate(label,xy=(x, y),xytext=(5, 2),textcoords='offset points',
ha='right',va='bottom')
plt.show()
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 200
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [word_dictionary_rev[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
Appendix: other data
Try a run on movie-review data; only the data-loading step needs to change.
import pandas as pd
url_neg = 'http://www.guofei.site/datasets_for_ml/rt-polaritydata/rt-polaritydata/rt-polarity.neg'
url_pos = 'http://www.guofei.site/datasets_for_ml/rt-polaritydata/rt-polaritydata/rt-polarity.pos'
df_neg = pd.read_csv(url_neg, sep='\t', header=None, names=['text'])
df_pos = pd.read_csv(url_pos, sep='\t', header=None, names=['text'])
texts = list(df_pos.text) + list(df_neg.text)
target = [1] * df_pos.shape[0] + [0] * df_neg.shape[0]
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']
doc2vec
Code
This code uses two models: doc2vec first, then logistic regression on the paragraph vectors to predict the label; a minimal sketch is given below.
(Personal take) doc2vec feeds in a paragraph id alongside the words and applies the same embedding operation, mapping each paragraph into the real vector space; in effect every paragraph is assigned a topic.
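A minimal sketch of that two-model pipeline with gensim's Doc2Vec plus scikit-learn's LogisticRegression. This is my own sketch, reusing the texts / target lists from the movie-review appendix above; the hyperparameters are illustrative and the parameter names (vector_size, epochs) assume a reasonably recent gensim.
import gensim
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# tag every paragraph with its paragraph id
tagged = [gensim.models.doc2vec.TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(texts)]
d2v = gensim.models.Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)
X = [d2v.docvecs[i] for i in range(len(texts))]   # one embedding vector per paragraph

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy of the label prediction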
References
doc2vec was proposed by Quoc Le and Tomas Mikolov in 2014; original paper: Distributed Representations of Sentences and Documents.
https://blog.csdn.net/mpk_no1/article/details/72458003
https://www.jianshu.com/p/fab82fa53e16
http://scikit-learn.org/stable/modules/feature_extraction.html
https://blog.csdn.net/pipisorry/article/details/41957763
Nick McClure: 《TensorFlow机器学习实战指南》, 机械工业出版社
Ian Goodfellow: 《深度学习》, 人民邮电出版社
王琛等: 《深度学习原理与TensorFlow实战》, 电子工业出版社
李嘉璇: 《TensorFlow技术解析与实战》, 人民邮电出版社
黄文坚: 《TensorFlow实战》, 电子工业出版社
郑泽宇等: 《TensorFlow实战Google深度学习框架》, 电子工业出版社
陈明: 《Matlab神经网络原理与实例精解》, 清华大学出版社
王小川: 《神经网络43个案例》, 北京航空航天大学出版社
马锐: 《人工神经网络原理》, 机械工业出版社
高扬: 《白话深度学习与TensorFlow》, 机械工业出版社