Corpora and datasets
Types of corpora: isolated corpora (plain collections of text), categorized corpora (text plus category labels), and overlapping corpora (categorized, but with overlapping categories).
Basic data types: discrete, continuous, and so on; not discussed further here.
Common corpus file formats: txt, csv, xml, json, sparse matrices, etc.
Where to get corpora:
- nltk:
dir(nltk.corpus)
- https://github.com/awesomedata/awesome-public-datasets
- https://www.kaggle.com/datasets
- https://www.reddit.com/r/datasets/
- write your own crawler
The NLTK corpora
Downloading a corpus
import nltk
nltk.download('brown')
# the download above may fail for network reasons; in that case download manually from:
# https://github.com/nltk/nltk_data/tree/gh-pages
# and copy everything under the packages folder into the target directory
# some features additionally require:
# unzipping /nltk_data/tokenizers/punkt.zip
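If the data is unpacked by hand, NLTK still has to be able to find it. A minimal sketch using nltk.data.path (the appended path below is just a placeholder):
import nltk
print(nltk.data.path)  # directories NLTK searches for corpora and models
nltk.data.path.append('/path/to/nltk_data')  # add a custom location if needed
nltk.data.find('corpora/brown')  # raises LookupError if the corpus still cannot be found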
Using a corpus
from nltk.corpus import brown
brown.categories()  # categories in the corpus
brown.fileids()  # files in the corpus
brown.categories(fileids=['ca01'])  # categories of the given files
brown.fileids(categories=['news'])  # files belonging to the given categories
brown.raw()  # the raw text, as a str
brown.words()  # the words of the corpus
brown.sents()  # the sentences of the corpus
# raw, words and sents all accept fileids=[f1, f2, f3] or categories=[c1, c2], and the result can be indexed, e.g.:
brown.words(categories='news')[10:30]
brown.abspath('ca01')  # absolute path of a corpus file
brown.readme()  # the corpus README
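As a quick check that these accessors compose, the standard NLTK class nltk.ConditionalFreqDist can count words per category; a minimal sketch:
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in brown.categories()
    for word in brown.words(categories=genre))
cfd['news'].most_common(10)  # the 10 most frequent words in the news category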
Text preprocessing
Typical steps (a combined sketch follows this list):
- Sentence splitting. Looking for full stops is not enough, since periods also appear in abbreviations; use an existing tool or build an ML model for it.
- Word-form normalization (lemmatization), e.g. comparatives, past tense, present tense.
- Stop-word removal.
- Spelling correction.
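Each step is covered in its own subsection below; as a rough end-to-end sketch (all functions are standard NLTK, but the exact order and choices here are only one reasonable option):
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

def preprocess(raw_text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = []
    for sent in nltk.tokenize.sent_tokenize(raw_text):      # sentence splitting
        for word in nltk.tokenize.word_tokenize(sent):      # word tokenization
            word = word.lower()
            if word.isalpha() and word not in stop_words:   # stop-word (and punctuation) removal
                tokens.append(lemmatizer.lemmatize(word))   # lemmatization
    return tokens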
Word frequency counts
from bs4 import BeautifulSoup
import requests
import nltk
# first crawl a piece of text to work with
text = BeautifulSoup(requests.get('https://www.python.org/').text, 'lxml').get_text()
freq_dist = nltk.FreqDist(text.split())  # frequency counts, much like collections.Counter
freq_dist.plot(50, cumulative=False)  # plot the 50 most frequent tokens
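If no plot is needed, the counts can be read off directly (FreqDist behaves like collections.Counter):
freq_dist.most_common(10)  # the 10 most frequent tokens and their counts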
Sentence splitting
import nltk
text = "How long does it take to get a Ph.D. degree? I don't know."
nltk.tokenize.sent_tokenize(text)
# another, trainable, sentence tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
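The Punkt tokenizer is used through its tokenize method; note that this instance is untrained, so its output can differ from sent_tokenize, which loads a pre-trained English model:
tokenizer.tokenize(text)  # list of sentences
# PunktSentenceTokenizer(train_text) fits the abbreviation model to a corpus of your own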
Word tokenization
import nltk
text = 'Hello! Everyone!! Good morning!'
print(nltk.tokenize.word_tokenize(text=text))  # the most commonly used tokenizer
print(text.split())  # the naive approach
print(nltk.tokenize.regexp_tokenize(text=text, pattern=r'\w+'))  # lets you supply your own pattern
print(nltk.tokenize.wordpunct_tokenize(text=text))  # splits into alphabetic and punctuation runs
print(nltk.tokenize.blankline_tokenize(text=text))  # splits on blank lines
Stemming and lemmatization
If the part of speech is not known in advance, use stemming; if it is known, use lemmatization.
stemming:
import nltk
pst = nltk.stem.PorterStemmer()
lst = nltk.stem.LancasterStemmer()
sbt = nltk.stem.SnowballStemmer(language='english')  # supports many languages
pst.stem('eating')
lst.stem('shopping')
sbt.stem('asked')
lemmatization:
from nltk.stem.wordnet import WordNetLemmatizer
wordlemma = WordNetLemmatizer()
print(wordlemma.lemmatize('cars'))
print(wordlemma.lemmatize('walking', pos='v'))
print(wordlemma.lemmatize('meeting', pos='n'))
print(wordlemma.lemmatize('meeting', pos='v'))
print(wordlemma.lemmatize('better', pos='a'))
print(wordlemma.lemmatize('is', pos='v'))
print(wordlemma.lemmatize('funnier', pos='a'))
print(wordlemma.lemmatize('expected', pos='v'))
print(wordlemma.lemmatize('fantasized', pos='v'))
# the pos argument can also be omitted (it defaults to noun)
wordlemma.lemmatize('cooking')
# cooking
wordlemma.lemmatize('cooking', pos='v')
# cook
wordlemma.lemmatize('cookbooks')
# cookbook
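In practice the POS argument usually comes from a tagger; a small sketch, where the Treebank-to-WordNet mapping helper is ad hoc (not an NLTK function):
import nltk
from nltk.corpus import wordnet

def to_wordnet_pos(treebank_tag):
    # map Penn Treebank tags to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tagged = nltk.pos_tag(nltk.word_tokenize('The cats were running faster'))
print([wordlemma.lemmatize(word, pos=to_wordnet_pos(tag)) for word, tag in tagged])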
Stop-word removal
Removing stop words has almost no effect on the meaning of a text, and because stop words are so frequent they mostly add noise to a model.
from nltk.corpus import stopwords
stop_words = stopwords.words('english')  # stop-word lists for dozens of languages
stop_words = set(["hi", "bye"])  # a custom stop-word set also works
# then filter them out
tokens = [word for word in text.split() if word not in stop_words]
Rare-word removal
freq_dist = nltk.FreqDist(tokens)  # frequency counts, much like collections.Counter
freq_dist1 = sorted(freq_dist.items(), key=lambda x: x[1], reverse=True)
rare_words = [i[0] for i in freq_dist1[-10:]]  # the 10 least frequent words
# then filter them out
tokens = [word for word in text.split() if word not in rare_words]
Spelling correction
The standard measure is the minimum edit distance, i.e. the number of insertions, deletions and substitutions needed to turn one word into another, which can be computed with dynamic programming.
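For reference, a minimal dynamic-programming sketch of that distance (unit cost for insert, delete and substitute; illustration only, NLTK's own implementation appears further below):
def min_edit_distance(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

min_edit_distance('aple', 'apple')  # 1
The corrector below takes a different route: instead of computing distances exactly, it generates every string within one or two edits of the input and returns the most frequent known candidate (the approach popularized by Peter Norvig).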
from collections import Counter

# WORDS would normally be built from a large corpus; it is hard-coded here just to keep the demo short
WORDS = Counter(['apple', 'correction', 'statement', 'tutors'])

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

if __name__ == "__main__":
    print(correction('aple'))
    print(correction('correcton'))
    print(correction('statament'))
    print(correction('tutpore'))
NLTK also ships with an edit-distance implementation:
from nltk.metrics import edit_distance
edit_distance('aple', 'apple')
Feature engineering
Basic NLP features
- syntactic parsing
- POS tagging
- named-entity recognition (NER)
- n-grams (sketch below)
- bag-of-words
Statistical NLP features
- one-hot encoding, etc.
- TF-IDF (sketch below)
Advanced NLP features
- word2vec
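A minimal sketch of two of the features above: word n-grams via nltk.ngrams, and TF-IDF via scikit-learn (scikit-learn is an extra dependency not used elsewhere in these notes; get_feature_names_out needs a reasonably recent version):
import nltk
list(nltk.ngrams('I love natural language processing'.split(), 2))  # word bigrams

from sklearn.feature_extraction.text import TfidfVectorizer
docs = ['the cat sat on the mat', 'the dog sat on the log']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())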
POS tagging
The built-in POS tagger (older NLTK releases used a maximum-entropy tagger; recent ones ship an averaged-perceptron model):
import nltk
text = 'While in France, Christine Lagarde discussed short-term ' \
       'stimulus efforts in a recent interview at 5:00 P.M with the Wall Street Journal.'
nltk.pos_tag(nltk.word_tokenize(text=text))
The Stanford tagger:
# download the model from https://nlp.stanford.edu/software/tagger.shtml ; here it is assumed to be unpacked under Downloads
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize
jar = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
text = 'While in France, Christine Lagarde discussed short-term ' \
'stimulus efforts in a recent interview at 5:00 P.M with the Wall Street Journal.'
tokenized_text=word_tokenize(text)
classified_text = pos_tagger.tag(tokenized_text)
print(classified_text)
[('While', 'IN'), ('in', 'IN'), ('France', 'NNP'), (',', ','), ('Christine', 'NNP'), ('Lagarde', 'NNP'), ('discussed', 'VBD'), ('short-term', 'JJ'), ('stimulus', 'NN'), ('efforts', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('recent', 'JJ'), ('interview', 'NN'), ('at', 'IN'), ('5:00', 'CD'), ('P.M', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Wall', 'NNP'), ('Street', 'NNP'), ('Journal', 'NNP'), ('.', '.')]
Chinese is also supported:
from nltk.tag import StanfordPOSTagger
import jieba
jar = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
# model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/chinese-nodistsim.tagger'  # alternative model
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/chinese-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
text='张伟在北京上班,李雷在南京逛街'
tokenized_text = list(jieba.cut(text))
classified_text = pos_tagger.tag(tokenized_text)
N-gram taggers, several of which can be chained via backoff:
from nltk.corpus import brown
import nltk
brown_tagged_sents = brown.tagged_sents()
train_data = brown_tagged_sents[:int(0.9 * len(brown_tagged_sents))]
test_data = brown_tagged_sents[int(0.9 * len(brown_tagged_sents)):]
default_tagger = nltk.DefaultTagger('NN')  # NN is the most common tag, so always guessing NN already reaches about 13% accuracy
default_tagger.evaluate(brown_tagged_sents)
unigram_tagger = nltk.tag.UnigramTagger(train=train_data, backoff=default_tagger)
bigram_tagger = nltk.tag.BigramTagger(train=train_data, backoff=unigram_tagger)
trigram_tagger = nltk.tag.TrigramTagger(train=train_data, backoff=bigram_tagger)
print(unigram_tagger.evaluate(test_data)) # 89%
print(bigram_tagger.evaluate(test_data)) # 91%
print(trigram_tagger.evaluate(test_data)) # 91%
NER tagger
import nltk
text = 'London is a big city in the United Kingdom.'
nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text=text)), binary=False)
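ne_chunk returns an nltk.Tree; the recognised entities can be pulled out by walking its subtrees, for example:
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text=text)), binary=False)
for subtree in tree:
    if hasattr(subtree, 'label'):  # entity chunks are subtrees; ordinary tokens are (word, tag) tuples
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))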
POS tagging with regular expressions
patterns = [
    (r'\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ')
]
# CD = cardinal number, VBG = gerund / present-participle verb, NN = noun, JJ = adjective
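These patterns plug straight into nltk.RegexpTagger, which tries them in order (words that match none of them are tagged None unless a catch-all pattern is added):
import nltk
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag('I was shopping for 3 wonderful gifts'.split())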
Other algorithms (NLTK has ready-made implementations)
Brill tagging: predicts a word's tag from the tags of the surrounding words, using transformation rules.
TnT tagger: a tagger based on a second-order (trigram) Markov model.
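Both ship with NLTK; for example the TnT tagger can be trained on the same Brown split used in the n-gram tagger example above (training on 90% of Brown takes a while, and accuracy suffers on unknown words unless an unknown-word tagger is supplied):
from nltk.tag import tnt
tnt_tagger = tnt.TnT()
tnt_tagger.train(train_data)  # train_data from the n-gram tagger example above
print(tnt_tagger.evaluate(test_data))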
WordNet
WordNet is a powerful lexical database of English.
Synonyms
from nltk.corpus import wordnet
wordnet.synsets('motorcar')  # returns a list of synsets; here it equals [wordnet.synset('car.n.01')], i.e. the first noun sense of car
wordnet.synset('car.n.01').lemma_names()  # synonyms: ['car', 'auto', 'automobile', 'machine', 'motorcar']
# some words are polysemous, e.g. car
wordnet.synsets('car')  # returns 5 synsets
# print the synonyms in each sense the same way
for i in wordnet.synsets('car'):
    print(i.lemma_names())
# output:
# ['car', 'auto', 'automobile', 'machine', 'motorcar']
# ['car', 'railcar', 'railway_car', 'railroad_car']
# ['car', 'gondola']
# ['car', 'elevator_car']
# ['cable_car', 'car']
Verb entailments
wordnet.synset('walk.v.01').entailments() # [Synset('step.v.01')]
wordnet.synset('eat.v.01').entailments() # [Synset('chew.v.01'), Synset('swallow.v.01')]
wordnet.synset('tease.v.03').entailments() #[Synset('arouse.v.07'), Synset('disappoint.v.01')]
Antonyms
wordnet.lemma('supply.n.02.supply').antonyms() # [Lemma('demand.n.02.demand')]
Word similarity
from nltk.corpus import wordnet
word1 = wordnet.synset('cookbook.n.01')
word2 = wordnet.synset('instruction_book.n.01')
# based on paths between the synsets in the WordNet hypernym hierarchy
word1.wup_similarity(word2)
# another similarity measure
word1.path_similarity(word2)
Parsing text structure
NER
# not run successfully here, because english.all.3class.distsim.crf.ser.gz could not be located
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
st.tag('John goes to NY'.split())
Rule-based NLP
When is it the right choice?
- the expert knowledge is easy to express as rules
- the amount of data is small
- high precision is required
- only a narrow range of cases needs to be covered
References
Natural Language Processing: Python and NLTK, Nitin Hardeniya (Chinese edition 《Python和NLTK自然语言处理》, Posts & Telecom Press 人民邮电出版社)