【tensorflow】word2vec

Word Embedding

One-Hot Encoder

Before word2vec, a word was usually encoded as a one-hot unit vector: for example, "北京" (Beijing) might map to index 3987 and "中国" (China) to index 5178, meaning the vector has a 1 at that position and 0 everywhere else.
A document is then represented as the sum of the vectors of all its words.

Drawbacks

  1. The encoding is arbitrary: the vectors reveal nothing about the relationship between two words.
  2. Training on such sparse matrices is inefficient.

Vector Space Models

Words can be mapped to continuous-valued vectors.
The main underlying assumption is the Distributional Hypothesis: words that occur in the same contexts are mapped to nearby positions in the vector space.

There are two families of concrete methods (a small sketch of the count-based route follows this list; the rest of these notes deal with the predictive route):

  1. Count-based models, which record how often neighboring words co-occur and then compress the counts into a small, dense matrix (e.g. Latent Semantic Analysis).
  2. Predictive models, which predict the word at a given position from its neighbors (e.g. Neural Probabilistic Language Models).
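To make the count-based route concrete, here is a tiny self-contained sketch (illustrative only, not part of the original notes): count word-word co-occurrences on a toy corpus, then compress the sparse counts with a truncated SVD, which is the LSA-style recipe.

import numpy as np

# toy corpus and vocabulary
corpus = [['i', 'like', 'deep', 'learning'],
          ['i', 'like', 'nlp'],
          ['i', 'enjoy', 'flying']]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# count-based model: symmetric co-occurrence counts within a window of 1
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for c in sent[max(i - 1, 0):i] + sent[i + 1:i + 2]:
            C[idx[w], idx[c]] += 1

# compress the sparse counts into small, dense vectors with a truncated SVD
U, S, Vt = np.linalg.svd(C)
word_vectors = U[:, :2] * S[:2]   # keep the two largest singular directions
print(dict(zip(vocab, word_vectors.round(2))))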

Converting between one-hot vectors and dense vectors: suppose the embedding matrix is $E_{300\times 10k}$, i.e. the vocabulary has 10k words. A word's one-hot encoding is a column vector $O_{10k\times 1}$, and $E$ compresses it down to 300 dimensions: the word's dense vector is the matrix product $EO$.
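A quick numerical check (a sketch with a randomly initialized $E$, purely for illustration): multiplying $E$ by a one-hot vector simply selects one column of $E$, which is why frameworks implement this as an embedding lookup rather than an actual matrix multiplication.

import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)   # embedding matrix E (300 x 10k)

word_index = 3987                          # e.g. a word's index in the vocabulary
O = np.zeros((vocab_size, 1))              # one-hot column vector O (10k x 1)
O[word_index] = 1.0

vec = E @ O                                # (300 x 1) dense word vector
assert np.allclose(vec[:, 0], E[:, word_index])  # E @ O is just column `word_index` of E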

hierarchical softmax

Negative sampling

There are three points here, currently only sketched on scratch paper; to be written up later.

GloVe word vectors

CBOW and skip-gram

CBOW predicts a word from its surrounding context; skip-gram predicts the surrounding context from a word. A toy example of the training pairs each variant produces is shown below.
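A tiny illustration (not from the original notes) of the (input, label) pairs the two variants generate, for a toy sentence and window_size = 1:

sentence = ['the', 'quick', 'brown', 'fox']
window_size = 1

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    # context words inside the window around position i, excluding the target itself
    context = sentence[max(i - window_size, 0):i] + sentence[i + 1:i + window_size + 1]
    cbow_pairs.append((context, target))              # CBOW: context -> target word
    skipgram_pairs += [(target, c) for c in context]  # skip-gram: target word -> each context word

print(cbow_pairs)
# [(['quick'], 'the'), (['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown'], 'fox')]
print(skipgram_pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]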

Word2Vec

Neural Probabilistic Language Models are usually trained by maximum likelihood, but the computation is enormous.
In Word2Vec's CBOW, the full probability is not computed; instead, a binary classifier is trained to distinguish the real target word from fabricated noise words.
This technique of training against fabricated noise words is called Negative Sampling.

In practice we use the Noise-Contrastive Estimation (NCE) loss.
TensorFlow provides tf.nn.nce_loss() to compute it directly.
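For reference (a detail not spelled out in the original notes), the negative-sampling objective from the original word2vec paper maximizes, for an input word $w_I$, its true context word $w_O$, and $k$ noise words $w_i$ drawn from a noise distribution $P_n(w)$:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

Negative sampling is a simplified variant of NCE; tf.nn.nce_loss() computes the NCE form.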

Theoretical assumptions

  • Distributional Hypothesis
    • words that appear in the same contexts have similar meanings
  • Drawback
    • the synonym problem, as in the (Chinese) examples below

US President 特朗普 (Trump) decides to build a wall along the Mexican border
I heard that American president 川普 (Trump) wants to build a wall next to Mexico
Classmate, your 川普 (Sichuan-accented Mandarin) is quite impressive

In the first two sentences 特朗普 and 川普 refer to the same person; in the third, 川普 means something entirely different.

Code implementation: gensim
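The original notes leave this section empty; as a placeholder, here is a minimal, hypothetical sketch of skip-gram training with gensim (assuming gensim 4.x, where the dimension parameter is named vector_size; older versions call it size):

from gensim.models import Word2Vec

# toy tokenized corpus; in practice pass the cleaned sentences from the steps below
sentences = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]

model = Word2Vec(sentences,
                 vector_size=200,   # embedding dimension
                 window=2,          # context window size
                 min_count=1,       # keep even rare words in this toy example
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 negative=5)        # number of negative samples

print(model.wv['fox'])               # the 200-dimensional vector for 'fox'
print(model.wv.most_similar('fox'))  # nearest words by cosine similarity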

Code implementation: TensorFlow

step1. Read the data

import tensorflow as tf
import numpy as np

batch_size = 100
embedding_size = 200
vocabulary_size = 10000
generations = 100000
print_loss_every = 2000

num_sampled = int(batch_size / 2)  # Number of negative examples to sample.
window_size = 2  # context window size for the skip-gram model

# %%
# Read in the data from http://mattmahoney.net/dc/text8.zip
import zipfile


def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0]))  # .split()
    return data


filename = 'text8.zip'
words = read_data(filename)
texts = [words]

step2. Clean the data

Remove punctuation and so on. The text8 data is already clean, so this step is not strictly necessary here, but other datasets may need it.

import re

regex = re.compile('[a-zA-Z]{1,}')  # runs of one or more letters: strips punctuation and digits
texts = [regex.findall(sentence.lower()) for sentence in texts]
texts = [' '.join(words) for words in texts]  # re-join the tokens into sentences

step3. Convert words to index sequences

tensorflow.contrib.learn.preprocessing.VocabularyProcessor can do this, but that function is scheduled for removal, and the implementations found online are not very Pythonic,
so I wrote my own class that converts word sequences into index sequences.

import collections


class VocabularyProcessor:
    '''
    author: guofei
    site: www.guofei.site
    Takes a list of sentences and replaces every word with an integer index, e.g.
    ```
    corpus = ['this is the first document',
              'this is the second second document',
              'and the third one',
              'is this the first document']
    wp = VocabularyProcessor(max_vocabulary_size=5)
    wp.fit(corpus)
    X = wp.transform(corpus)
    ```
    Low-frequency words are pruned and all mapped to the token 'RARE', whose index is fixed at 0.
    '''

    def __init__(self, max_vocabulary_size=10000):
        self.max_vocabulary_size = max_vocabulary_size
        self.word_count = None
        self.dictionary = None
        self.reverse_dictionary = None

    def fit(self, sentences):
        words = [word for sentence in sentences for word in sentence.split(' ')]
        word_count_ = collections.Counter(words)  # word frequencies
        word_count = word_count_.most_common(self.max_vocabulary_size - 1)  # most frequent words

        words_most = [word for word, num in word_count]  # the highest-frequency words
        len_words_most = len(words_most)
        dictionary = dict(zip(words_most, range(1, len_words_most + 1)))
        reverse_dictionary = dict(zip(range(1, len_words_most + 1), words_most))

        # add 'RARE', covering all the low-frequency words
        words_most_set = set(words_most)  # set for fast membership tests
        rare_count = sum(word_count_[word] for word in word_count_ if word not in words_most_set)
        word_count = [('RARE', rare_count)] + word_count
        dictionary['RARE'] = 0
        reverse_dictionary[0] = 'RARE'

        self.word_count = word_count
        self.dictionary = dictionary
        self.reverse_dictionary = reverse_dictionary

    def transform(self, sentences):
        return [[self.dictionary.get(word, 0) for word in sentence.split(' ')] for sentence in sentences]

    def fit_transform(self, sentences):
        self.fit(sentences)
        return self.transform(sentences)


vp = VocabularyProcessor(max_vocabulary_size=vocabulary_size)
vp.fit(texts)
text_data = vp.transform(texts)
word_dictionary = vp.dictionary
word_dictionary_rev = vp.reverse_dictionary

step4. The skip-gram model

This part is also rewritten from scratch, to handle the following situations:

  1. Data where a single line contains hundreds of thousands of words, so we cannot first generate all training pairs per sentence and then pick batch_size of them.
  2. No global variables may be used.
  3. Sampling must not be too random, otherwise the doc2vec algorithm cannot make full use of all the data.

def generate_batch_data(sentences, batch_size, window_size):
    """Endlessly yield (batch_data, label_data) lists of skip-gram training pairs."""
    batch_data, label_data, batch_size_current = [], [], 0
    while True:
        # pick a random sentence and walk through it word by word
        rand_sentence = sentences[np.random.randint(len(sentences))]
        len_sentence = len(rand_sentence)
        for ind, word in enumerate(rand_sentence):
            x = word  # center word
            # context words inside the window, excluding the center word itself
            y_list = rand_sentence[max(ind - window_size, 0):ind] \
                     + rand_sentence[ind + 1:min(ind + window_size + 1, len_sentence)]
            for y in y_list:
                batch_data.append(x)
                label_data.append([y])
                batch_size_current += 1
                if batch_size_current == batch_size:
                    yield batch_data, label_data
                    batch_data, label_data, batch_size_current = [], [], 0



batch_data_generater = generate_batch_data(sentences=text_data, batch_size=batch_size, window_size=2)
# Example usage:
# batch_inputs, batch_labels = next(batch_data_generater)
# feed_dict used later during training: feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}

step5. Build and train the model

# print the nearest neighbors of a few words every 5000 steps
print_valid_every = 5000
valid_words = ['into', 'on', 'five', 'have', 'as']
valid_examples = [word_dictionary[x] for x in valid_words]  # indices of the validation words in the vocabulary


# %%


x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)


embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, x_inputs)

# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))


# Get loss from prediction
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=y_target,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

# Cosine similarity between words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

# %%
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# Run the skip gram model.
loss_vec = []
for i in range(generations):
    batch_inputs, batch_labels = next(batch_data_generater)
    feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
    sess.run(optimizer, feed_dict=feed_dict)

    # Return the loss
    if (i + 1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        print('Loss at step {} : {}'.format(i + 1, loss_val))

    # Validation: Print some words and top 5 related words
    if (i + 1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = valid_words[j]
            top_k = 5
            nearest = (-sim[j, :]).argsort()[1:top_k + 1]
            log_str = "Nearest to {}:".format(valid_word)+ \
                    ','.join([word_dictionary_rev[nearest_k] for nearest_k in nearest])
            print(log_str)

Plotting

final_embeddings = sess.run(embeddings)

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                     ha='right', va='bottom')
    plt.savefig(filename)
    plt.show()


tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 200  # visualize only the 200 most frequent words
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [word_dictionary_rev[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)

Appendix: other datasets

Let's try it on movie-review data; only the data-loading step needs to change.

import pandas as pd

url_neg = 'https://www.guofei.site/datasets/nlp/rt-polaritydata/rt-polarity.neg.txt'
url_pos = 'https://www.guofei.site/datasets/nlp/rt-polaritydata/rt-polarity.pos.txt'
df_neg = pd.read_csv(url_neg, sep='\t', header=None, names=['text'])
df_pos = pd.read_csv(url_pos, sep='\t', header=None, names=['text'])
texts = list(df_pos.text) + list(df_neg.text)
target = [1] * df_pos.shape[0] + [0] * df_neg.shape[0]


valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']

doc2vec

Code
The code uses two models: first doc2vec to embed each document, then logistic regression to predict its label (doc2vec was proposed by Quoc Le and Tomas Mikolov in 2014).

(Personal view) doc2vec feeds in a Paragraph id and applies the same embedding operation to it, mapping each paragraph into a real vector space; in effect, every paragraph is assigned a topic. A minimal sketch of the pipeline follows.
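As a minimal sketch of this two-model pipeline (hypothetical code, assuming gensim 4.x and scikit-learn; variable names and hyperparameters are illustrative, not the author's; texts and target are the lists built in the movie-review appendix above):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# each document gets a tag (its position), which plays the role of the Paragraph id
tagged_docs = [TaggedDocument(words=text.split(), tags=[i])
               for i, text in enumerate(texts)]

# model 1: doc2vec embeds every paragraph into a 100-dimensional vector space
d2v = Doc2Vec(tagged_docs, vector_size=100, window=2, min_count=2, epochs=20)

# model 2: logistic regression on the paragraph vectors predicts the label
X = [d2v.dv[i] for i in range(len(tagged_docs))]
clf = LogisticRegression(max_iter=1000).fit(X, target)
print(clf.score(X, target))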



