Casual Talk on Machine Learning: The TF-IDF Algorithm (Principles and Code Implementation)


The Concept of TF-IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is made up of two parts, TF and IDF.

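I used the TF strategy in my earlier article on extracting high-frequency words. TF stands for term frequency, the total number of times a word appears in a document:

TF = number of times the word appears in the document

Documents differ in length, however, so we normalize this count:

TF = number of times the word appears in the document / total number of words in the document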
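
As a minimal sketch of the normalized definition above, assuming words are separated by whitespace (the helper name `term_frequency` and the sample sentence are illustrative, not from any library):

```python
def term_frequency(word, document):
    """Return the occurrences of `word` divided by the total number of words."""
    tokens = document.split()  # assumption: whitespace tokenization
    if not tokens:
        return 0.0  # an empty document has no term frequencies
    return tokens.count(word) / len(tokens)


doc = "the cat sat on the mat"
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 words
tf_dog = term_frequency("dog", doc)  # the word is absent, so TF is 0.0
print(tf_the, tf_dog)
```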

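IDF stands for inverse document frequency. It reflects how common a word is across all documents: the more documents a word appears in, the lower its IDF should be, and the fewer documents it appears in, the higher its IDF. As a formula:

IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))

The +1 keeps the denominator from being 0 when no document in the corpus contains the word.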
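
The formula can be tried out with a short sketch. It assumes the natural logarithm and whitespace tokenization, and the helper name `inverse_document_frequency` is illustrative:

```python
import math


def inverse_document_frequency(word, documents):
    """IDF = log(N / (df + 1)), where df is the number of documents containing `word`."""
    containing = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / (containing + 1))


docs = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]
idf_weather = inverse_document_frequency("weather", docs)  # in 1 of 4 documents: log(4 / 2)
idf_is = inverse_document_frequency("is", docs)            # in 4 of 4 documents: log(4 / 5)
idf_snow = inverse_document_frequency("snow", docs)        # in 0 of 4 documents: log(4 / 1)
print(idf_weather, idf_is, idf_snow)
```

One side effect of the +1 is worth noticing: a word that appears in every document gets log(4 / 5), which is slightly below zero.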

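Now that we know what TF and IDF each represent, we can obtain TF-IDF:

TF-IDF = TF * IDF

From the properties above, TF-IDF increases with the number of times a word appears in a document and decreases as the number of documents in the corpus containing that word grows.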
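
To make the trade-off concrete, here is a small worked example. The counts are made up for illustration, the natural logarithm is assumed, and `tf_idf` is a hypothetical helper:

```python
import math


def tf_idf(count_in_doc, doc_length, total_docs, docs_with_word):
    """Combine the normalized TF with the +1-smoothed IDF."""
    tf = count_in_doc / doc_length
    idf = math.log(total_docs / (docs_with_word + 1))
    return tf * idf


# Both words appear 3 times in a 100-word document, in a corpus of 1000 documents.
rare = tf_idf(3, 100, 1000, 9)      # found in 9 documents:   0.03 * log(100)
common = tf_idf(3, 100, 1000, 999)  # found in 999 documents: 0.03 * log(1) = 0
print(rare, common)
```

The two words have the same TF, so the difference in their scores comes entirely from how many documents contain them.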

Implementing TF-IDF

Now that we know what TF-IDF represents, let's implement the algorithm in several different ways.

1. Computing TF-IDF with gensim

First, we define a corpus and tokenize it:
from gensim import corpora, models

# Build a corpus
corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]
# Tokenize each document on spaces
words = []
for i in corpus:
    words.append(i.split(" "))
print(words)
This prints one list of words per document.

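Next, we count how many times each word appears in its document:
# Assign an ID to every word and count each word's occurrences in its document
dic = corpora.Dictionary(words)
new_corpus = [dic.doc2bow(text) for text in words]
print(new_corpus)
print(dic.token2id)
This prints the bag-of-words form of each document, followed by the word-to-ID mapping.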

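doc2bow uses the dictionary dic to convert each tokenized document into a bag-of-words (BoW) representation. In each pair, the first number is the word's ID and the second is the number of times the word appears in the current document. (The example is perhaps poorly chosen: every word here appears exactly once in its document.)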
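
Since every count in this corpus is 1, a sketch with a repeated word may make the output format clearer. This is not gensim's implementation, only a plain-Python imitation of the shape of its output; `doc_to_bow` and the IDs in `token2id` are hypothetical:

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens that have no ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"is": 0, "today": 1, "weather": 2}  # hypothetical ID assignment
bow = doc_to_bow("the weather is is nice today".split(), token2id)
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```

Here "is" occurs twice, so its pair is (0, 2), while "the" and "nice" have no ID and are left out.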

token2id is a dictionary-type mapping of the form {word: word ID}.

id2token is the reverse mapping, {word ID: word}. Either form works here.
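
The relationship between the two mappings can be shown with plain dictionaries. The IDs below are made up, and the snippet does not touch gensim itself:

```python
token2id = {"what": 0, "is": 1, "weather": 2}  # hypothetical {word: word ID} mapping

# Swap keys and values to get the {word ID: word} form
id2token = {token_id: token for token, token_id in token2id.items()}

# The reverse mapping makes a bag-of-words vector readable again
bow = [(1, 1), (2, 1)]
readable = [(id2token[token_id], count) for token_id, count in bow]
print(readable)  # [('is', 1), ('weather', 1)]
```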

Then we train the gensim model, save it, and test it:
# Train the model and save it
tfidf = models.TfidfModel(new_corpus)
tfidf.save("my_model.tfidf")

# Load the model
tfidf = models.TfidfModel.load("my_model.tfidf")

# Use the trained model to compute TF-IDF values
string = "i like the weather today"
string_bow = dic.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_tfidf)
Running this prints the TF-IDF values for the test sentence.

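In each pair of the result, the left value is the word's ID and the right value is its TF-IDF value. Words that were not seen when the model was trained, such as "i" here, do not appear in the result at all.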

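2. Computing TF-IDF with sklearn

sklearn is much more convenient to use than gensim here. The main tool is its TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]
tfidf_vec = TfidfVectorizer()
# Use fit_transform to get the TF-IDF matrix
tfidf_matrix = tfidf_vec.fit_transform(corpus)
# Use get_feature_names_out to get the unique words
# (older scikit-learn versions call this get_feature_names, which was removed in 1.2)
print(tfidf_vec.get_feature_names_out())
# Get the ID assigned to each word
print(tfidf_vec.vocabulary_)
# Print the TF-IDF matrix
print(tfidf_matrix)
This prints the vocabulary, the word-to-ID mapping, and the nonzero entries of the sparse TF-IDF matrix.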
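
The values sklearn reports will not match the formula given earlier in this article. As far as I understand its documented defaults (smooth_idf=True, norm="l2", lowercasing, and tokens of at least two word characters), TfidfVectorizer uses the raw count as TF, computes IDF as ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit length. The sketch below imitates that in plain Python; `sklearn_style_tfidf` is a hypothetical helper, not part of sklearn:

```python
import math
import re
from collections import Counter


def sklearn_style_tfidf(corpus):
    """Imitate TfidfVectorizer's assumed defaults; returns one {word: weight} dict per document."""
    # Assumed tokenization: lowercase, keep tokens of two or more word characters
    docs = [Counter(re.findall(r"\b\w\w+\b", text.lower())) for text in corpus]
    n_docs = len(docs)
    # Document frequency: each document contributes each of its distinct words once
    df = Counter(term for doc in docs for term in doc)
    rows = []
    for doc in docs:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in doc.items()
        }
        # L2 normalization; fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]
rows = sklearn_style_tfidf(corpus)
for term, weight in sorted(rows[0].items()):
    print(term, round(weight, 4))
```

Two consequences of these defaults are visible in the sketch: the one-letter word "a" never enters the vocabulary, and a word found in every document, such as "is", keeps a small positive weight instead of dropping to zero or below.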

3. Implementing the TF-IDF algorithm by hand in Python

Above, we used two libraries to compute the TF-IDF value of every word in a custom corpus. Now let's implement TF-IDF by hand:
import math

corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]

# Tokenize the corpus
words = []
for i in corpus:
    words.append(i.split())

# With a custom stop-word list, tokenize and drop the stop words like this instead:
# f = ["is", "the"]
# for i in corpus:
#     all_words = i.split()
#     new_words = []
#     for j in all_words:
#         if j not in f:
#             new_words.append(j)
#     words.append(new_words)
# print(words)

# Count word frequencies: one {word: count} dict per document
# (named count_words to avoid confusion with collections.Counter)
def count_words(word_list):
    wordcount = []
    for i in word_list:
        count = {}
        for j in i:
            count[j] = count.get(j, 0) + 1
        wordcount.append(count)
    return wordcount

wordcount = count_words(words)

# Compute TF (word is the word being scored; word_list is the {word: count}
# dict of the document being scored)
def tf(word, word_list):
    return word_list.get(word, 0) / sum(word_list.values())

# Count the documents that contain the word
def count_sentence(word, wordcount):
    return sum(1 for i in wordcount if i.get(word))

# Compute IDF
def idf(word, wordcount):
    return math.log(len(wordcount) / (count_sentence(word, wordcount) + 1))

# Compute TF-IDF
def tfidf(word, word_list, wordcount):
    return tf(word, word_list) * idf(word, wordcount)

p = 1
for i in wordcount:
    print("part:{}".format(p))
    p = p + 1
    for j, k in i.items():
        print("word: {} ---- TF-IDF:{}".format(j, tfidf(j, i, wordcount)))
Running this prints the TF-IDF value of every word, grouped by document. Because of the +1 in the IDF denominator, a word that appears in every document, such as "is", gets a slightly negative value.

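Summary

TF-IDF is mainly used to extract keywords from documents. It can also be used to find similar documents, to generate document summaries, and for feature selection (extracting important features).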
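
Keyword extraction, the main use named above, amounts to ranking a document's words by score and keeping the top few. A minimal sketch, with made-up scores and a hypothetical helper `top_keywords`:

```python
def top_keywords(scores, k=3):
    """Return the k words with the highest TF-IDF score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up TF-IDF scores for the words of one document
scores = {"weather": 0.12, "is": -0.04, "today": 0.05, "like": 0.09}
keywords = top_keywords(scores, k=2)
print(keywords)  # ['weather', 'like']
```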

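The strengths of the TF-IDF algorithm are that it is simple and fast, and that its results match reality fairly well. Its weakness is that measuring a word's importance purely by "term frequency" is not comprehensive enough: sometimes an important word does not appear many times. In addition, the algorithm cannot reflect positional information. A word that appears early in a document and one that appears late are treated as equally important, which is not accurate. (One remedy is to give greater weight to the first paragraph of the full text and to the first sentence of each paragraph.)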
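
One possible way to sketch the position-based remedy is to weight occurrences before computing TF. The version below is deliberately simplified: it boosts only the first paragraph and does not handle the first sentence of each paragraph. The boost factor of 2.0 and the helper name `position_weighted_counts` are assumptions chosen for illustration:

```python
from collections import defaultdict


def position_weighted_counts(paragraphs, first_paragraph_boost=2.0):
    """Count words, with each occurrence in the first paragraph counted `first_paragraph_boost` times."""
    counts = defaultdict(float)
    for index, paragraph in enumerate(paragraphs):
        weight = first_paragraph_boost if index == 0 else 1.0
        for word in paragraph.split():
            counts[word] += weight
    return dict(counts)


paragraphs = ["tfidf ranks words", "words appear in documents"]
weighted = position_weighted_counts(paragraphs)
print(weighted["tfidf"], weighted["words"], weighted["documents"])  # 2.0 3.0 1.0
```

The weighted counts can then stand in for the raw counts in the TF formula, so that a word from the opening paragraph outranks an equally frequent word that appears later.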
