Word Embedding
One-Hot Encoder
Before word2vec appeared, words were usually converted into one-hot (unit) vectors.
For example, "北京" (Beijing) might correspond to index 3987 and "中国" (China) to 5178: the vector has a 1 at that position and 0 everywhere else.
For a document, the corresponding vector is the sum of the vectors of all the words it contains.
Drawbacks
- the encoding is arbitrary, so the vectors reveal nothing about the relationship between two words
- the sparse vectors make training inefficient
Vector Space Models
Words can be mapped to continuous-valued vectors.
The main underlying assumption is the Distributional Hypothesis: words that occur in the same contexts are mapped to nearby positions in the vector space.
There are two concrete families of methods:
- count-based models, which record how often neighboring words co-occur and then compress the counts into a small, dense matrix (e.g. Latent Semantic Analysis); see the short sketch after this list
- predictive models, which predict the word at a given position from its neighbors (e.g. Neural Probabilistic Language Models)
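Not from the original post, but a minimal count-based sketch under toy assumptions: build a word-word co-occurrence matrix and compress it with truncated SVD, which is roughly the LSA idea. The corpus, window size and output dimensionality are made up for illustration.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [['I', 'like', 'deep', 'learning'],
          ['I', 'like', 'NLP'],
          ['I', 'enjoy', 'flying']]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# count how often adjacent words co-occur (window of 1)
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for a, b in zip(sent, sent[1:]):
        cooc[idx[a], idx[b]] += 1
        cooc[idx[b], idx[a]] += 1

# compress the counts into small, dense word vectors
dense = TruncatedSVD(n_components=2).fit_transform(cooc)
print(dict(zip(vocab, dense.round(2))))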
From one-hot to dense vectors: suppose the embedding matrix is $E_{300\times 10k}$, i.e. the vocabulary contains 10k words. With one-hot encoding a word is written as $O_{10k\times 1}$, and $E$ compresses it down to 300 dimensions: the dense vector of the word is the matrix product $EO$.
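A tiny numpy illustration of this (my own, with toy sizes instead of 300 and 10k): multiplying $E$ by a one-hot vector simply selects the corresponding column of $E$, which is why frameworks implement the embedding as a table lookup rather than an actual matrix product.
import numpy as np

dim, vocab_size = 3, 5                      # toy sizes instead of 300 and 10k
E = np.random.randn(dim, vocab_size)        # embedding matrix E
word_id = 2
O = np.zeros(vocab_size)
O[word_id] = 1.0                            # one-hot vector O
vec = E @ O                                 # the dense vector EO
assert np.allclose(vec, E[:, word_id])      # identical to picking column word_id of E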
hierarchical softmax
Negative sampling
There are 3 points, sketched on scratch paper, to be organized later.
GloVe word vectors
CBOW and skip-gram
CBOW predicts a word from its context; skip-gram predicts the context from a word.
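A rough illustration (toy sentence and window of my own choosing) of the training pairs each variant generates:
sentence = ['the', 'quick', 'brown', 'fox', 'jumps']
window = 2
for i, target in enumerate(sentence):
    context = sentence[max(i - window, 0):i] + sentence[i + 1:i + window + 1]
    print('CBOW     ', context, '->', target)    # CBOW: the context predicts the word
    for c in context:
        print('skip-gram', target, '->', c)      # skip-gram: the word predicts each context word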
Word2Vec
Neural Probabilistic Language Models can usually be trained by maximum likelihood estimation (MLE), but the computation is enormous.
In Word2Vec's CBOW, the full probability does not need to be computed; instead a binary classifier is trained to distinguish the real target word from fabricated noise words.
This trick of training against fabricated noise is called negative sampling.
In practice we use the Noise-Contrastive Estimation (NCE) loss.
TensorFlow provides tf.nn.nce_loss(), which implements it directly; a small sketch of the underlying binary objective follows below.
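To make the binary-classification view concrete, here is a small numpy sketch of the negative-sampling objective for a single (center, context) pair with k noise words. This is my own illustration of the idea, not the internals of tf.nn.nce_loss, and all sizes are arbitrary.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, dim, k = 1000, 50, 5
in_vec = np.random.randn(vocab_size, dim) * 0.01    # "input" (center-word) embeddings
out_vec = np.random.randn(vocab_size, dim) * 0.01   # "output" (context-word) embeddings

center, true_context = 3, 17
noise = np.random.randint(vocab_size, size=k)        # k fabricated noise words

# the real pair is pushed towards label 1, the noise pairs towards label 0:
# loss = -log sigma(u_true . v_center) - sum_j log sigma(-u_noise_j . v_center)
loss = -np.log(sigmoid(out_vec[true_context] @ in_vec[center]))
loss -= np.log(sigmoid(-(out_vec[noise] @ in_vec[center]))).sum()
print(loss)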
Theoretical assumptions
- distributional hypothesis
  - words that appear in the same contexts have similar meanings
- drawbacks
  - the synonym problem, as the following example shows
For example, "美国总统特朗普决定在墨西哥边境修建隔离墙" and "听说美国那个川普总统要在墨西哥边上修个墙" say the same thing using the synonyms 特朗普 and 川普 (both referring to Trump), while in "同学你的川普挺带感啊" the same word 川普 instead means Sichuan-accented Mandarin.
Code implementation: gensim
Official tutorial: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
Usage
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
model = gensim.models.Word2Vec(sentences=sentences, min_count=1)  # min_count=1 so the toy words are not pruned
model.wv.word_vec('first')
Parameter details
gensim.models.Word2Vec(
sentences=None,
size=100,
alpha=0.025,
window=5,
min_count=5,
workers=3,
min_alpha=0.0001,
sg=0,
hs=0,
negative=5,
iter=5,
sorted_vocab=1,
compute_loss=False,
)
- sentences: the corpus, as a list of tokenized sentences or as a LineSentence; it can also be left out here and supplied later
- size=100: number of hidden units, i.e. the number of parameters and the dimensionality of the word vectors; a few tens to a few hundred is recommended
- window=5: maximum distance between the current and predicted word within a sentence
- min_count=5: words that occur fewer times than this are discarded
- workers=3: number of parallel threads; workers=multiprocessing.cpu_count() uses them all
- sg=0: 0 selects CBOW, 1 selects skip-gram
- hs=0: 0 uses negative sampling, 1 uses hierarchical softmax
- negative=5: number of noise words drawn per positive sample, usually 5~20
- alpha / min_alpha: initial and final learning rates
- iter=5: number of training iterations (epochs)
- sorted_vocab=1: 1 sorts the vocabulary by descending frequency, 0 leaves it unsorted
- compute_loss=False: if True, the loss is stored and can be read via gensim.models.word2vec.Word2Vec.get_latest_training_loss(); getting the loss at every step is more awkward
The corpus does not have to be specified up front:
import multiprocessing
model = gensim.models.Word2Vec(size=200, window=8, min_count=10, iter=10, workers=multiprocessing.cpu_count())
model.build_vocab(x)
model.train(x, total_examples=model.corpus_count, epochs=model.iter)
Reading training data from a file
- step 1: read in the data as a list in which each element is one sentence
- step 2: tokenize, filter stop words, join the tokens with spaces and save to a file (see another post for details; a minimal segmentation sketch appears at the end of this subsection)
- step 3: read the file back in
line_sentence = gensim.models.word2vec.LineSentence(source='中文语料已空格切分.txt')
- step 4: word2vec
model = gensim.models.Word2Vec(line_sentence)
- step 5: inspect the results
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}
The results are not pasted here, but they look decent. Because the dataset is small, iter can be turned up.
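A minimal sketch of step 2, assuming jieba is used for the Chinese word segmentation (the original defers to another post); the input file name and the stop-word list here are hypothetical, while the output name matches the file read in step 3.
import jieba

stop_words = {'的', '了', '是'}   # hypothetical stop-word list
# '中文语料.txt' is a hypothetical raw-corpus file, one sentence per line
with open('中文语料.txt', encoding='utf-8') as fin, \
        open('中文语料已空格切分.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        tokens = [w for w in jieba.cut(line.strip()) if w.strip() and w not in stop_words]
        fout.write(' '.join(tokens) + '\n')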
Analyzing the results
model.wv.index2word # vocabulary
model.wv.most_similar(word, topn=10)  # return the words most similar to word
This makes operations like the following possible:
- look at the nearest neighbors of some words
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word[300:1000]}
- the above keeps low-similarity neighbors as well; filter them out below
# drop neighbors whose similarity score is below 0.4
tmp = {key: [(i, j) for i, j in value if j > 0.4] for key, value in most_similars_precalc.items()}
# also drop words whose neighbor list is now empty (nothing above 0.4)
{key: value for key, value in tmp.items() if len(value) > 0}
There are also some other operations, for example
model.evaluate_word_pairs(*)
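For instance (my guess at typical usage, meaningful mainly for English vectors): gensim ships a word-similarity benchmark file with its test data, and the method returns Pearson / Spearman correlations against human judgements plus the out-of-vocabulary ratio.
from gensim.test.utils import datapath
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))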
Code implementation: TensorFlow
I gathered code from around the web and rewrote every line; the result should be fairly general-purpose.
step 1. Load the data
import tensorflow as tf
import numpy as np
batch_size = 100
embedding_size = 200
vocabulary_size = 10000
generations = 100000
print_loss_every = 2000
num_sampled = int(batch_size / 2) # Number of negative examples to sample.
window_size = 2  # context window of the skip-gram model
# %%
# load the data; the corpus can be downloaded from 'http://mattmahoney.net/dc/' + 'text8.zip'
import zipfile
def read_data(filename):
    """Extract the first file enclosed in a zip archive and return it as a single text string"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0]))  # add .split() here to get a list of words instead
    return data
filename = 'text8.zip'
words = read_data(filename)
texts = [words]
step 2. Clean the data
Remove punctuation and the like. The text8 data is already clean, so this step is not needed here, but other datasets may require it.
import re
regex = re.compile('[a-zA-Z]{1,}')  # runs of one or more letters; strips punctuation and digits
texts = [regex.findall(sentence.lower()) for sentence in texts]
texts = [' '.join(words) for words in texts]  # re-join into sentences
step 3. Convert to index sequences
tensorflow.contrib.learn.preprocessing.VocabularyProcessor can do this, but it is scheduled for removal and the implementations found online are not very Pythonic,
so I wrote my own class that turns word sequences into index sequences.
import collections
class VocabularyProcessor:
    '''
    author: guofei
    site: www.guofei.site
    Takes a list of sentences and replaces every word by its integer index, e.g.
    ```
    corpus = ['this is the first document',
              'this is the second second document',
              'and the third one',
              'is this the first document']
    wp = VocabularyProcessor(max_vocabulary_size=5)
    wp.fit(corpus)
    X = wp.transform(corpus)
    ```
    Low-frequency words are dropped from the vocabulary and mapped to the token 'RARE', whose index is fixed at 0
    '''
def __init__(self, max_vocabulary_size=10000):
self.max_vocabulary_size = max_vocabulary_size
self.word_count = None
self.dictionary = None
self.reverse_dictionary = None
    def fit(self, sentences):
        words = [word for sentence in sentences for word in sentence.split(' ')]
        word_count_ = collections.Counter(words)  # word frequencies
        word_count = word_count_.most_common(self.max_vocabulary_size - 1)  # the most frequent words, capped
        words_most = [word for word, num in word_count]  # highest-frequency words
        len_words_most = len(words_most)
        dictionary = dict(zip(words_most, range(1, len_words_most + 1)))
        reverse_dictionary = dict(zip(range(1, len_words_most + 1), words_most))
        # add 'RARE', i.e. the bucket for all low-frequency words
        rare_count = sum([word_count_[word] for word in word_count_ if word not in words_most])
        word_count = [('RARE', rare_count)] + word_count
        dictionary['RARE'] = 0
        reverse_dictionary[0] = 'RARE'
        self.word_count = word_count
        self.dictionary = dictionary
        self.reverse_dictionary = reverse_dictionary
def transform(self, sentences):
return [[self.dictionary.get(word, 0) for word in sentence.split(' ')] for sentence in sentences]
def fit_transform(self, sentences):
self.fit(sentences)
return self.transform(sentences)
vp = VocabularyProcessor(max_vocabulary_size=vocabulary_size)
vp.fit(texts)
text_data = vp.transform(texts)
word_dictionary = vp.dictionary
word_dictionary_rev = vp.reverse_dictionary
step 4. The skip-gram model
The batch generator is also rewritten from scratch, to cope with the following situations:
- the data can be a single line containing hundreds of thousands of words, so we cannot pre-generate all (input, label) pairs per sentence and then pick batch_size of them
- no global variables
- the selection must not be too random, otherwise the doc2vec algorithm cannot make full use of all the data
def generate_batch_data(sentences, batch_size, window_size):
batch_data, label_data, batch_size_current = [], [], 0
while True:
rand_sentence = sentences[np.random.randint(len(sentences))]
len_sentence = len(rand_sentence)
for ind, word in enumerate(rand_sentence):
x = word
y_list = rand_sentence[max(ind - window_size, 0):ind] \
+ rand_sentence[ind + 1:min(ind + window_size + 1, len_sentence)]
for y in y_list:
batch_data.append(x)
label_data.append([y])
batch_size_current += 1
if batch_size_current == batch_size:
yield batch_data, label_data
batch_data, label_data, batch_size_current = [], [], 0
batch_data_generater = generate_batch_data(sentences=text_data, batch_size=batch_size, window_size=2)
# example:
# batch_inputs, batch_labels = next(batch_data_generater)
# these are later fed to the model: feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
step 5. Build and train the model
# print the nearest words for the validation set every 5000 iterations
print_valid_every = 5000
valid_words = ['into', 'on', 'five', 'have', 'as']
valid_examples = [word_dictionary[x] for x in valid_words]  # indices of the validation words in the vocabulary
# %%
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, x_inputs)
# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Get loss from prediction
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=y_target,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
# Cosine similarity between words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
# %%
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
# Run the skip gram model.
loss_vec = []
for i in range(generations):
    batch_inputs, batch_labels = next(batch_data_generater)
feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
sess.run(optimizer, feed_dict=feed_dict)
# Return the loss
if (i + 1) % print_loss_every == 0:
loss_val = sess.run(loss, feed_dict=feed_dict)
loss_vec.append(loss_val)
print('Loss at step {} : {}'.format(i + 1, loss_val))
# Validation: Print some words and top 5 related words
if (i + 1) % print_valid_every == 0:
sim = sess.run(similarity, feed_dict=feed_dict)
for j in range(len(valid_words)):
valid_word = valid_words[j]
top_k = 5
nearest = (-sim[j, :]).argsort()[1:top_k + 1]
log_str = "Nearest to {}:".format(valid_word)+ \
','.join([word_dictionary_rev[nearest_k] for nearest_k in nearest])
print(log_str)
Plotting
final_embeddings=sess.run(embeddings)
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
plt.figure(figsize=(18, 18)) #in inches
for i, label in enumerate(labels):
x, y = low_dim_embs[i,:]
plt.scatter(x, y)
plt.annotate(label,xy=(x, y),xytext=(5, 2),textcoords='offset points',
ha='right',va='bottom')
plt.show()
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 200
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [word_dictionary_rev[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
Appendix: other data
Try a run on movie-review data; only the data-loading step needs to change.
import pandas as pd
url_neg = 'http://www.guofei.site/datasets_for_ml/rt-polaritydata/rt-polaritydata/rt-polarity.neg'
url_pos = 'http://www.guofei.site/datasets_for_ml/rt-polaritydata/rt-polaritydata/rt-polarity.pos'
df_neg = pd.read_csv(url_neg, sep='\t', header=None, names=['text'])
df_pos = pd.read_csv(url_pos, sep='\t', header=None, names=['text'])
texts = list(df_pos.text) + list(df_neg.text)
target = [1] * df_pos.shape[0] + [0] * df_neg.shape[0]
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']
doc2vec
Code
This code uses two models: doc2vec first, then logistic regression on the paragraph vectors to predict the label; a minimal sketch is given below.
(Personal take) doc2vec feeds in a paragraph id alongside the words and applies the same embedding operation, mapping each paragraph into the real vector space; in effect every paragraph is assigned a topic.
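A minimal sketch of that two-model pipeline with gensim's Doc2Vec plus scikit-learn's LogisticRegression. This is my own sketch, reusing the texts / target lists from the movie-review appendix above; the hyperparameters are illustrative and the parameter names (vector_size, epochs) assume a reasonably recent gensim.
import gensim
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# tag every paragraph with its paragraph id
tagged = [gensim.models.doc2vec.TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(texts)]
d2v = gensim.models.Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)
X = [d2v.docvecs[i] for i in range(len(texts))]   # one embedding vector per paragraph

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy of the label prediction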
References
doc2vec was proposed by Quoc Le and Tomas Mikolov in 2014; original paper: Distributed Representations of Sentences and Documents.
https://blog.csdn.net/mpk_no1/article/details/72458003
https://www.jianshu.com/p/fab82fa53e16
http://scikit-learn.org/stable/modules/feature_extraction.html
https://blog.csdn.net/pipisorry/article/details/41957763
Nick McClure: 《TensorFlow机器学习实战指南》, 机械工业出版社
Ian Goodfellow: 《深度学习》, 人民邮电出版社
王琛等: 《深度学习原理与TensorFlow实战》, 电子工业出版社
李嘉璇: 《TensorFlow技术解析与实战》, 人民邮电出版社
黄文坚: 《TensorFlow实战》, 电子工业出版社
郑泽宇等: 《TensorFlow实战Google深度学习框架》, 电子工业出版社
陈明: 《Matlab神经网络原理与实例精解》, 清华大学出版社
王小川: 《神经网络43个案例》, 北京航空航天大学出版社
马锐: 《人工神经网络原理》, 机械工业出版社
高扬: 《白话深度学习与TensorFlow》, 机械工业出版社