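Modifying the dictionary

Adding and removing words dynamically

A program can update the in-memory dictionary at run time, for example in response to segmentation results.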

add_word(word, freq=None, tag=None)

word: the new word
freq = None: word frequency
tag = None: part-of-speech tag

del_word(word)

jieba.add_word("哀牢山三十六剑")  # update the dictionary dynamically
'/'.join(jieba.cut(tmpstr))

'郭靖/和/哀牢山三十六剑/。'

When many words need to be added, adding them one at a time is inefficient, so a word list is usually prepared in advance.

jieba.del_word("哀牢山三十六剑")
'/'.join(jieba.cut(tmpstr))

'郭靖/和/哀牢山/三十六/剑/。'
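
To see why editing the in-memory dictionary changes the segmentation, here is a toy sketch. It uses forward maximum matching over a plain Python set, which is a deliberate simplification and not jieba's actual algorithm. The ToySegmenter class and its starting word list are made up for illustration.

```python
class ToySegmenter:
    """Forward maximum matching over an in-memory word set (illustration only)."""

    def __init__(self, words):
        self.words = set(words)

    def add_word(self, word):
        self.words.add(word)

    def del_word(self, word):
        self.words.discard(word)

    def cut(self, text):
        tokens = []
        i = 0
        max_len = max((len(w) for w in self.words), default=1)
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.words:
                    tokens.append(piece)
                    i += size
                    break
        return tokens


seg = ToySegmenter(["郭靖", "哀牢山", "三十六"])  # hypothetical starting dictionary
text = "郭靖和哀牢山三十六剑。"

before = seg.cut(text)
seg.add_word("哀牢山三十六剑")
after_add = seg.cut(text)
seg.del_word("哀牢山三十六剑")
after_del = seg.cut(text)

print("/".join(before))     # 郭靖/和/哀牢山/三十六/剑/。
print("/".join(after_add))  # 郭靖/和/哀牢山三十六剑/。
print("/".join(after_del))  # 郭靖/和/哀牢山/三十六/剑/。
```

In both the sketch and the jieba calls above, the change lives only in memory and lasts for the running process.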

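Using a custom dictionary

load_userdict(file_name)

file_name: a file-like object or the path to a custom dictionary

Basic dictionary format: one word per line, consisting of the word, its frequency (optional), and its part-of-speech tag (optional), in that order and separated by spaces. The dictionary file must be UTF-8 encoded; if necessary, convert it with an editor such as UltraEdit.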

云计算 5

李小福 2 nr

easy_install 3 eng

台中
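
When the word list comes from another source, a file in this format can be generated from Python rather than typed by hand. The following is a minimal sketch using only the standard library; the helper names format_entry and write_userdict and the temporary file name are assumptions made for the example.

```python
import os
import tempfile


def format_entry(word, freq=None, tag=None):
    """Build one dictionary line: word, then optional frequency, then optional tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)


def write_userdict(path, entries):
    """Write (word, freq, tag) tuples to a UTF-8 file, one entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            f.write(format_entry(word, freq, tag) + "\n")


entries = [
    ("云计算", 5, None),
    ("李小福", 2, "nr"),
    ("easy_install", 3, "eng"),
    ("台中", None, None),
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "userdict.txt")  # hypothetical file name
    write_userdict(path, entries)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

A file written this way is what would be passed to jieba.load_userdict.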

dict_path = 'C:/Users/Wang/Desktop/PythonData/金庸小说词库.txt'  # path to the custom dictionary
jieba.load_userdict(dict_path)
'/'.join(jieba.cut(tmpstr))

'郭靖/和/哀牢山三十六剑/。'

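Removing stop words

Common types of stop words

Very high-frequency common words that carry little useful information or are too ambiguous to be worth analyzing
- 的、地、得
Function words such as prepositions and conjunctions
- 只、条、件
- 当、从、同
High-frequency words in a specialized domain, which carry little useful information
Stop words that depend on the task at hand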

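Removing stop words after segmentation

Basic steps

Read in the stop word list file
Segment the text as usual
Remove the stop words from the segmentation result

new_list = [word for word in original_list if word not in stopword_list]

Limitation of this method: a stop word can only be removed if segmentation has split it out correctly as its own token.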

newlist = [w for w in jieba.cut(tmpstr) if w not in ['和', '。']]
print(newlist)

['郭靖', '哀牢山三十六剑']
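
The filtering step and its limitation can be illustrated without running a segmenter at all. In the sketch below the token lists are written out by hand, and the second one is a hypothetical bad segmentation invented for the example; remove_stopwords is likewise an assumed helper name.

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop word collection."""
    stopword_set = set(stopwords)
    return [t for t in tokens if t not in stopword_set]


stopwords = ["和", "。"]

# Tokens as produced when every word is split out correctly.
good_tokens = ["郭靖", "和", "哀牢山三十六剑", "。"]
# Hypothetical bad segmentation where "和" is glued to the previous word.
bad_tokens = ["郭靖和", "哀牢山三十六剑", "。"]

clean_good = remove_stopwords(good_tokens, stopwords)
clean_bad = remove_stopwords(bad_tokens, stopwords)

print(clean_good)  # ['郭靖', '哀牢山三十六剑']
print(clean_bad)   # ['郭靖和', '哀牢山三十六剑']
```

In the second case the stop word survives inside a longer token, because the filter only compares whole tokens.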

When there are many stop words, they can first be read from an external file and converted to a list.

import pandas as pd
# 'aaa' is assumed not to occur in the file, so each line is read as a single value.
# A multi-character separator is treated as a regex, which the default 'c' engine does not
# support; passing engine='python' avoids the resulting ParserWarning.
tmpdf = pd.read_csv('C:/Users/Wang/Desktop/PythonData/停用词.txt', names=['w'], sep='aaa', encoding='utf-8', engine='python')
tmpdf.head()

	w
0	,
1	，
2	、
3	；
4	:

stopwords = set(tmpdf.w)  # build the lookup set once
[w for w in jieba.cut(tmpstr) if w not in stopwords]

['郭靖', '哀牢山三十六剑']
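
If pandas is not otherwise needed, the stop word file can also be read with plain Python, which avoids the separator workaround. This is a sketch under the assumption that the file holds one stop word per line; load_stopwords and the small temporary demo file are made up for the example.

```python
import os
import tempfile


def load_stopwords(path, encoding="utf-8"):
    """Read one stop word per line into a set, skipping empty lines."""
    with open(path, encoding=encoding) as f:
        return {line.rstrip("\n") for line in f if line.rstrip("\n")}


with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "stopwords.txt")  # hypothetical file
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("和\n。\n，\n\n")
    stopwords = load_stopwords(demo_path)

tokens = ["郭靖", "和", "哀牢山三十六剑", "。"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['郭靖', '哀牢山三十六剑']
```

A set is used because membership tests on it stay fast even when the stop word list is long.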

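Removing stop words with the extract_tags function

Characteristics of this method:

Extracts keywords with the TF-IDF algorithm, dropping stop words before extraction
A custom stop word file can be specified with jieba.analyse.set_stop_words()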

import jieba.analyse as ana
ana.set_stop_words('C:/Users/Wang/Desktop/PythonData/停用词.txt')
jieba.lcut(tmpstr)  # the stop word list loaded here has no effect on plain segmentation

['郭靖', '和', '哀牢山三十六剑', '。']

ana.extract_tags(tmpstr, topK=20)  # extract keywords with TF-IDF, dropping stop words at the same time

['郭靖', '哀牢山三十六剑']
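
The idea of dropping stop words before TF-IDF scoring can be sketched in a few lines. This is a rough illustration and not jieba's implementation: the function tfidf_keywords, the IDF value, and the default IDF for unknown words are all hypothetical.

```python
from collections import Counter


def tfidf_keywords(tokens, idf, stopwords, default_idf, top_k=20):
    """Rank tokens by term frequency times IDF after dropping stop words."""
    kept = [t for t in tokens if t not in stopwords]
    if not kept:
        return []
    counts = Counter(kept)
    total = len(kept)
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest score first; ties are broken by the word itself for a stable order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


tokens = ["郭靖", "和", "哀牢山三十六剑", "。"]
stopwords = {"和", "。"}
idf = {"郭靖": 10.0}  # hypothetical IDF value, not taken from jieba
keywords = tfidf_keywords(tokens, idf, stopwords, default_idf=9.0)
print(keywords)  # ['郭靖', '哀牢山三十六剑']
```

Because the stop words are removed before counting, they never receive a score and cannot appear among the keywords.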