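In this section, we work through a few related problems.
Category prediction
In a set of documents, not only the words but also the categories of the words matter, that is, which category of text a particular word falls into. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computers, and so on. In the following example, we use tf-idf to build a feature vector and find the category of a document, with data from the 20 newsgroups dataset available through sklearn.
Import the necessary packages: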
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Define the category map. We use five categories: religion, autos, hockey, electronics, and space.
category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
    'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics', 'sci.space': 'Space'}
Create the training set:
training_data = fetch_20newsgroups(subset = 'train',
categories = category_map.keys(), shuffle = True, random_state = 5)
Build a count vectorizer and extract the term counts:
vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
Create the tf-idf transformer and fit it on the term counts:
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
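To make the weighting concrete, here is a minimal pure-Python sketch of what the tf-idf step computes under scikit-learn's documented defaults (smoothed idf, raw term counts, L2 normalization). The toy corpus and the helper name `tfidf_vector` are hypothetical and only serve the illustration; this is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is already tokenized.
docs = [
    ["space", "shuttle", "space"],
    ["space", "car"],
    ["car", "engine"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Weight each term by count * idf, then scale the vector to unit length."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the third document, "engine" appears in fewer documents than "car", so it receives the larger weight. That is the effect tf-idf is meant to produce: terms that are rare across the corpus count for more.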
Now define the test data:
input_data = [
'Discovery was a space shuttle',
'Hindu, Christian, Sikh all are religions',
'We must have to drive safely',
'Puck is a disk made of rubber',
'Television, Microwave, Refrigrated all uses electricity'
]
Train a Multinomial Naive Bayes classifier on the tf-idf features of the training data:
classifier = MultinomialNB().fit(train_tfidf, training_data.target)
Transform the input data with the count vectorizer:
input_tc = vectorizer_count.transform(input_data)
Now transform the vectorized data with the tf-idf transformer:
input_tfidf = tfidf.transform(input_tc)
Predict the output categories:
predictions = classifier.predict(input_tfidf)
Print the results:
for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:',
          category_map[training_data.target_names[category]])
The category predictor generates the following output:
Dimensions of training data: (2755, 39297)
Input Data: Discovery was a space shuttle
Category: Space
Input Data: Hindu, Christian, Sikh all are religions
Category: Religion
Input Data: We must have to drive safely
Category: Autos
Input Data: Puck is a disk made of rubber
Category: Hockey
Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics
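The count, tf-idf, and classifier steps above can also be chained into one object, so that new text goes through the same transformations automatically. The sketch below is one possible arrangement, not the tutorial's own method: it uses `TfidfVectorizer` (which combines `CountVectorizer` and `TfidfTransformer`) with `make_pipeline`, and trains on a tiny hypothetical corpus (`toy_docs`, `toy_labels`) instead of the 20 newsgroups data so that it runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the newsgroup posts and their labels.
toy_docs = [
    "rocket orbit launch space",
    "shuttle orbit astronaut space",
    "engine car drive road",
    "car wheel drive brake",
]
toy_labels = ["Space", "Space", "Autos", "Autos"]

# One estimator that vectorizes, applies tf-idf, and classifies.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_docs, toy_labels)

toy_predictions = model.predict(["astronaut in orbit", "drive the car"])
for sentence, label in zip(["astronaut in orbit", "drive the car"], toy_predictions):
    print(sentence, "->", label)
```

With a pipeline there is no need to call `transform` on the vectorizer and the tf-idf transformer separately at prediction time, which removes one common source of mistakes.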
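Gender finder
In this problem, we train a classifier to find the gender (male or female) from a given name. We use a heuristic to construct the feature vector and then train the classifier. The labeled data comes from the names corpus in the NLTK package (it may need to be downloaded first with nltk.download('names')). The Python code for the gender finder follows.
Import the necessary packages: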
import random
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names
Next, extract the last N letters from the input word. These letters serve as the feature:
def extract_features(word, N = 2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}
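As a quick check of what the feature extractor returns, the following sketch repeats the function and calls it with a few values of N. The names passed in are only illustrative.

```python
def extract_features(word, N=2):
    """Return the last N letters of `word`, lowercased, as a feature dict."""
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}

for n in (1, 2, 3):
    print(n, extract_features('Swati', n))

# A name shorter than N is returned whole, because slicing does not raise.
short_name_features = extract_features('Al', 5)
print(short_name_features)
```

Each name is reduced to a single categorical feature, so the classifier only learns how often each suffix occurs with each gender.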
if __name__=='__main__':
Create the training data from the labeled names (male and female) available in NLTK:
    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = (male_list + female_list)
    random.seed(5)
    random.shuffle(data)
Now create the test data as follows:
    namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']
Define the number of samples used for training (80% of the data); the remaining samples are used for testing:
    train_sample = int(0.8 * len(data))
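The slice-based split depends on the earlier shuffle, because the data list is built as all male names followed by all female names. The sketch below illustrates this with a small hypothetical list (`toy_data`) in place of the NLTK corpus: without shuffling, the last 20% contains only one label.

```python
import random

# Hypothetical stand-in for the labeled name list: males first, then females.
toy_data = [(f"m{i}", "male") for i in range(8)] + \
           [(f"f{i}", "female") for i in range(8)]
cut = int(0.8 * len(toy_data))  # index where the test portion starts

# Without shuffling, the held-out slice holds a single class.
unshuffled_test = toy_data[cut:]
print({label for _, label in unshuffled_test})  # only 'female'

# Shuffling a copy with a fixed seed mixes the labels before slicing.
shuffled = toy_data[:]
random.seed(5)
random.shuffle(shuffled)
train_part, test_part = shuffled[:cut], shuffled[cut:]
print(len(train_part), len(test_part))
```

Fixing the seed keeps the split reproducible between runs, which makes accuracies for different suffix lengths comparable.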
Now iterate over different suffix lengths so that the accuracies can be compared:
    for i in range(1, 6):
        print('\nNumber of end letters:', i)
        features = [(extract_features(n, i), gender) for (n, gender) in data]
        train_data, test_data = features[:train_sample], features[train_sample:]
        classifier = NaiveBayesClassifier.train(train_data)
The accuracy of the classifier can be computed as follows:
        accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Accuracy = ' + str(accuracy_classifier) + '%')
Now predict the output for the test names:
        for name in namesInput:
            print(name, '->', classifier.classify(extract_features(name, i)))
The above program generates the following output:
Number of end letters: 1
Accuracy = 74.7%
Rajesh -> female
Gaurav -> male
Swati -> female
Shubha -> female
Number of end letters: 2
Accuracy = 78.79%
Rajesh -> male
Gaurav -> male
Swati -> female
Shubha -> female
Number of end letters: 3
Accuracy = 77.22%
Rajesh -> male
Gaurav -> female
Swati -> female
Shubha -> female
Number of end letters: 4
Accuracy = 69.98%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female
Number of end letters: 5
Accuracy = 64.63%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female
The output above shows that accuracy is highest when two end letters are used and falls as the number of end letters grows beyond two.
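The comparison can also be made programmatically. The short sketch below copies the accuracies reported in the output above into a dictionary (the name `accuracy_by_n` is hypothetical) and picks the suffix length with the highest value.

```python
# Accuracies (in percent) copied from the output above, keyed by the
# number of end letters used as the feature.
accuracy_by_n = {1: 74.7, 2: 78.79, 3: 77.22, 4: 69.98, 5: 64.63}

# Pick the suffix length with the highest held-out accuracy.
best_n = max(accuracy_by_n, key=accuracy_by_n.get)
print('Best number of end letters:', best_n)
print('Accuracy:', accuracy_by_n[best_n])
```

In a real run, the dictionary would be filled inside the loop instead of being typed in by hand.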