【NLP】NLTK



2021-05-02    Author: Guofei

Category: 0x24_NLP    Article No.: 341

Copyright notice: this article was written by Guofei. Feel free to repost, but please cite the original link and notify the author.
Original link: https://www.guofei.site/2021/05/02/nltk.html


Corpora and datasets

Types of corpora: isolated corpora (plain collections of texts), categorized corpora (text plus category labels), and overlapping corpora (categorized, but with overlapping categories).

Basic data types: discrete, continuous, and so on; nothing special here.

Common corpus file formats: txt, csv, xml, json, sparse matrices, etc.

Where to get corpora:

  • NLTK's built-in corpora: dir(nltk.corpus) (see the snippet below)
  • https://github.com/awesomedata/awesome-public-datasets
  • https://www.kaggle.com/datasets
  • https://www.reddit.com/r/datasets/
  • write your own crawler
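
A quick way to inspect the first option (a minimal sketch; the corpora themselves still have to be downloaded, see the next section):

import nltk.corpus

# names exposed under nltk.corpus, including the lazy loaders for each bundled corpus
print([name for name in dir(nltk.corpus) if not name.startswith('_')])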

NLTK corpora

Downloading a corpus

import nltk

nltk.download('brown')
# The download may fail for network reasons; in that case download the data manually from
# https://github.com/nltk/nltk_data/tree/gh-pages
# and copy everything under the packages folder into your nltk_data directory.
# Some features also require unpacking archives, e.g.
# unzip /nltk_data/tokenizers/punkt.zip

Using a corpus

from nltk.corpus import brown

brown.categories()  # categories in the corpus
brown.fileids()  # files in the corpus
brown.categories(fileids=['ca01'])  # categories of the given files
brown.fileids(categories=['news'])  # files belonging to the given categories



brown.raw()  # the raw corpus, as a str
brown.words()  # the corpus as a list of words
brown.sents()  # the corpus as a list of sentences
# raw, words and sents all accept fileids=[f1, f2, f3] and categories=[c1, c2],
# and the results can be sliced by index, e.g.:
brown.words(categories='news')[10:30]



brown.abspath(fileid='ca01')  # absolute path of a corpus file

brown.readme()  # the corpus README

Text preprocessing

Typical steps

  • Sentence splitting. Not just a matter of finding full stops, since periods also appear in abbreviations; use an existing tool or build an ML model.
  • Word-form normalization, e.g. comparatives, past tense, present tense.
  • Stop-word removal.
  • Spelling correction.

Word-frequency counting

from bs4 import BeautifulSoup
import requests

# first crawl a bit of text to work with
text = BeautifulSoup(requests.get('https://www.python.org/').text, 'lxml').get_text()


import nltk

freq_dist = nltk.FreqDist(text.split())  # word-frequency counts, similar to collections.Counter
freq_dist.plot(50, cumulative=False)  # plot the 50 most frequent words

Sentence splitting

import nltk

text = "How long does it take to get a Ph.D. degree? I don't know."
nltk.tokenize.sent_tokenize(text)
# another, trainable, sentence splitter:
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

Tokenization

import nltk

text = 'Hello! Everyone!! Good morning!'
print(nltk.tokenize.word_tokenize(text=text))  # the most common choice
print(text.split())  # the naive approach
print(nltk.tokenize.regexp_tokenize(text=text, pattern=r'\w+'))  # user-defined pattern
print(nltk.tokenize.wordpunct_tokenize(text=text))
print(nltk.tokenize.blankline_tokenize(text=text))

Stemming and lemmatization

If the part of speech is not known in advance, use stemming; if it is known, use lemmatization.

stemming:

import nltk

pst = nltk.stem.PorterStemmer()
lst = nltk.stem.LancasterStemmer()
sbt = nltk.stem.SnowballStemmer(language='english')  # supports many languages

pst.stem('eating')
lst.stem('shopping')
sbt.stem('asked')

lemmatization:

from nltk.stem.wordnet import WordNetLemmatizer

wordlemma = WordNetLemmatizer()
print(wordlemma.lemmatize('cars'))
print(wordlemma.lemmatize('walking', pos='v'))
print(wordlemma.lemmatize('meeting', pos='n'))
print(wordlemma.lemmatize('meeting', pos='v'))
print(wordlemma.lemmatize('better', pos='a'))
print(wordlemma.lemmatize('is', pos='v'))
print(wordlemma.lemmatize('funnier', pos='a'))
print(wordlemma.lemmatize('expected', pos='v'))
print(wordlemma.lemmatize('fantasized', pos='v'))

# the part of speech can also be omitted (it defaults to noun)
wordlemma.lemmatize('cooking')
# cooking
wordlemma.lemmatize('cooking', pos='v')
# cook
wordlemma.lemmatize('cookbooks')
# cookbook

Stop-word removal

Removing stop words barely changes the meaning, and because stop words are so frequent they mostly add noise to a model.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # stop-word lists for dozens of languages
# stop_words = set(["hi", "bye"])  # or define your own stop-word set

# then filter them out
tokens = [word for word in text.split() if word not in stop_words]

Rare-word removal

freq_dist = nltk.FreqDist(tokens)  # word-frequency counts, similar to collections.Counter
freq_dist1 = sorted(freq_dist.items(), key=lambda x: x[1], reverse=True)
rare_words = [i[0] for i in freq_dist1[-10:]]
# then filter them out
tokens = [word for word in text.split() if word not in rare_words]

Spelling correction

The usual criterion is minimum edit distance, i.e. the number of insertions, deletions and substitutions needed to turn one word into another, which can be computed with dynamic programming (a minimal sketch follows). The longer example after it takes a different route, in the style of Peter Norvig's well-known corrector: it enumerates every candidate within one or two edits of the input word and returns the most probable one.
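
A minimal sketch of the dynamic-programming edit-distance computation (the helper name edit_distance_dp is mine):

def edit_distance_dp(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]


print(edit_distance_dp('aple', 'apple'))  # 1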

import re
from collections import Counter

# WORDS would normally be built from a real corpus; a tiny hard-coded vocabulary is used here just to demonstrate the code
WORDS = Counter(['apple', 'correction', 'statement', 'tutors'])


def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N


def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)


def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])


def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)


def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


if __name__ == "__main__":
    print(correction('aple'))
    print(correction('correcton'))
    print(correction('statament'))
    print(correction('tutpore'))

NLTK also ships with an edit-distance function:

from nltk.metrics import edit_distance
edit_distance('aple', 'apple')

Feature engineering

Basic NLP features

  • syntactic parsing
  • POS tagging
  • named-entity recognition (NER)
  • n-grams
  • bag-of-words (a sketch of n-grams and bag-of-words follows this list)
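
A minimal sketch of the last two items, using nltk.ngrams and collections.Counter:

import nltk
from collections import Counter

tokens = nltk.word_tokenize('the quick brown fox jumps over the lazy dog')

print(list(nltk.ngrams(tokens, 2)))  # bigrams: pairs of adjacent tokens
print(Counter(tokens))               # bag-of-words: token counts, word order discarded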

Statistical NLP features

  • one-hot encoding, etc.
  • TF-IDF (see the sketch after this list)
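
A minimal TF-IDF sketch; using scikit-learn's TfidfVectorizer here is my own choice, not something the original relies on:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the cat sat on the mat', 'the dog sat on the log']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())  # vocabulary (get_feature_names() in older scikit-learn)
print(X.toarray())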

Advanced NLP features

  • word2vec (see the sketch after this list)
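
A minimal word2vec sketch using gensim, which is my own choice of library; parameter names follow gensim 4.x:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [['i', 'like', 'nlp'], ['i', 'like', 'python'], ['nlp', 'is', 'fun']]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv['nlp'])               # the 50-dimensional vector learned for 'nlp'
print(model.wv.most_similar('nlp'))  # nearest neighbours in the embedding space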

POS tagging

NLTK's built-in POS tagger (older releases used a maximum-entropy model; recent releases use an averaged perceptron):

import nltk

text = 'While in France, Christine Lagarde discussed short-term ' \
       'stimulus efforts in a recent interview at 5:00 P.M with the Wall Street Journal.'
nltk.pos_tag(nltk.word_tokenize(text=text))

The Stanford tagger:

# Download the model from https://nlp.stanford.edu/software/tagger.shtml ; here it was unpacked into Downloads

from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

jar = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger'

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')


text = 'While in France, Christine Lagarde discussed short-term ' \
       'stimulus efforts in a recent interview at 5:00 P.M with the Wall Street Journal.'

tokenized_text=word_tokenize(text)
classified_text = pos_tagger.tag(tokenized_text)

print(classified_text)

[('While', 'IN'), ('in', 'IN'), ('France', 'NNP'), (',', ','), ('Christine', 'NNP'), ('Lagarde', 'NNP'), ('discussed', 'VBD'), ('short-term', 'JJ'), ('stimulus', 'NN'), ('efforts', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('recent', 'JJ'), ('interview', 'NN'), ('at', 'IN'), ('5:00', 'CD'), ('P.M', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Wall', 'NNP'), ('Street', 'NNP'), ('Journal', 'NNP'), ('.', '.')]

Chinese is also supported:

from nltk.tag import StanfordPOSTagger
import jieba
jar = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
# two Chinese models are available; the second assignment is the one actually used below
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/chinese-nodistsim.tagger'
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/chinese-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
text='张伟在北京上班,李雷在南京逛街'
tokenized_text = list(jieba.cut(text))
classified_text = pos_tagger.tag(tokenized_text)

N-gram taggers come in several flavours, each one backing off to a simpler tagger:

from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents()
train_data = brown_tagged_sents[:int(0.9 * len(brown_tagged_sents))]
test_data = brown_tagged_sents[int(0.9 * len(brown_tagged_sents)):]

default_tagger = nltk.DefaultTagger('NN')  # 'NN' is the most frequent tag; tagging everything as NN already reaches about 13% accuracy
default_tagger.evaluate(brown_tagged_sents)

unigram_tagger = nltk.tag.UnigramTagger(train=train_data, backoff=default_tagger)
bigram_tagger = nltk.tag.BigramTagger(train=train_data, backoff=unigram_tagger)
trigram_tagger = nltk.tag.TrigramTagger(train=train_data, backoff=bigram_tagger)


print(unigram_tagger.evaluate(test_data))  # 89%
print(bigram_tagger.evaluate(test_data))  # 91%
print(trigram_tagger.evaluate(test_data))  # 91%

NER tagger

import nltk

text = 'London is a big city in the United Kingdom.'

nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text=text)), binary=False)
# returns an nltk.Tree in which named entities are grouped into chunks (e.g. GPE for 'London')

POS tagging with regular expressions

import nltk

patterns = [
    (r'\d+$', 'CD'),     # CD: cardinal number
    (r'.*ing$', 'VBG'),  # VBG: gerund / present participle
    (r'.*ment$', 'NN'),  # NN: noun
    (r'.*ful$', 'JJ'),   # JJ: adjective
]

regexp_tagger = nltk.tag.RegexpTagger(patterns)
regexp_tagger.tag('a delightful payment of 100'.split())

Other tagging algorithms (NLTK ships with implementations of both; a TnT sketch follows). The Brill tagger learns transformation rules that re-tag a word based on the tags of nearby words.
The TnT tagger is based on a second-order (trigram) Markov model.
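
A minimal TnT sketch, reusing the train_data / test_data split from the n-gram example above; small slices keep the run time reasonable:

from nltk.tag import tnt

tnt_tagger = tnt.TnT()
tnt_tagger.train(train_data[:3000])          # train on a slice of the Brown data
print(tnt_tagger.evaluate(test_data[:100]))  # accuracy on part of the held-out data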

WordNet

WordNet is a large lexical database of English.

Synonyms

from nltk.corpus import wordnet

wordnet.synsets('motorcar')  # returns a list of Synset objects; here it is [Synset('car.n.01')], i.e. the first noun sense of 'car'

wordnet.synset('car.n.01').lemma_names()  # synonyms within that sense: ['car', 'auto', 'automobile', 'machine', 'motorcar']


# some words have several senses, e.g. 'car'
wordnet.synsets('car')  # returns 5 synsets
# print the synonyms of each sense in the same way
for i in wordnet.synsets('car'):
    print(i.lemma_names())


# output:
# ['car', 'auto', 'automobile', 'machine', 'motorcar']
# ['car', 'railcar', 'railway_car', 'railroad_car']
# ['car', 'gondola']
# ['car', 'elevator_car']
# ['cable_car', 'car']

Verb entailment

wordnet.synset('walk.v.01').entailments()  # [Synset('step.v.01')]
wordnet.synset('eat.v.01').entailments()  # [Synset('chew.v.01'), Synset('swallow.v.01')]
wordnet.synset('tease.v.03').entailments()  # [Synset('arouse.v.07'), Synset('disappoint.v.01')]

Antonyms

wordnet.lemma('supply.n.02.supply').antonyms()  # [Lemma('demand.n.02.demand')]

Word similarity

from nltk.corpus import wordnet

word1 = wordnet.synset('cookbook.n.01')
word2 = wordnet.synset('instruction_book.n.01')

# Wu-Palmer similarity: based on how close the two synsets are to their common hypernym in the WordNet hierarchy
word1.wup_similarity(word2)

# another measure, based on the shortest path between the two synsets
word1.path_similarity(word2)

Parsing text structure

NER

# I did not get this to run, because I could not find english.all.3class.distsim.crf.ser.gz
from nltk.tag import StanfordNERTagger

ner_tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

ner_tagger.tag('John goes to NY'.split())

Rule-based NLP

When should it be used?

  • expert knowledge is easy to turn into rules
  • the data set is small
  • high precision is required
  • only a narrow range of scenarios needs to be covered (see the sketch after this list)
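
A minimal sketch of what a rule-based component looks like in NLTK: a hand-written regular-expression chunk grammar that extracts noun phrases from POS-tagged text (the grammar is a toy example):

import nltk

sentence = 'The little yellow dog barked at the cat'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# a single hand-written rule: a noun phrase is an optional determiner,
# any number of adjectives, then a noun
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>}')
print(chunker.parse(tagged))  # an nltk.Tree containing the matched NP chunks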

References

Python Natural Language Processing, Jalaj Thanaki

Natural Language Processing: Python and NLTK, Nitin Hardeniya et al. (Chinese edition published by Posts & Telecom Press)

