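AI (Artificial Intelligence): Solving Problems

Updated 2020-09-24 13:46

Solving Problems

This section works through two related problems: predicting the category of a text and predicting gender from a first name.

Category Prediction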

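In a collection of documents, the category of the text in which a word occurs matters as much as the word itself. For example, we may want to predict whether a given sentence belongs to the email, news, sports, or computers category. In the following example, we use tf-idf to build feature vectors and use them to find the category of a document. The data comes from the 20 newsgroups dataset available through sklearn.

Import the necessary packages: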

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the category map. We use five categories: religion, autos, hockey, electronics, and space.

category_map = {'talk.religion.misc':'Religion','rec.autos':'Autos',
   'rec.sport.hockey':'Hockey','sci.electronics':'Electronics', 'sci.space': 'Space'}

Create the training set:

training_data = fetch_20newsgroups(subset = 'train',
   categories = category_map.keys(), shuffle = True, random_state = 5)

Build a count vectorizer and extract the term counts:

vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
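To make the term-count step concrete, here is a minimal pure-Python sketch of what a count vectorizer computes. The toy corpus and the helper name `count_terms` are assumptions made for illustration; the real `CountVectorizer` uses its own tokenizer and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

def count_terms(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = ["the puck hit the net", "the shuttle reached orbit"]  # toy corpus
vocabulary, rows = count_terms(docs)
print(vocabulary)
print(rows)
```

Each row has one column per distinct term, which is why the training matrix has the shape (number of documents, number of distinct terms).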

Create the tf-idf transformer and apply it to the term counts:

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
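The weighting itself can be sketched in a few lines of plain Python. This sketch assumes the default settings of `TfidfTransformer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the function name `tfidf_rows` and the toy counts are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """Weight raw term counts by smoothed idf, then L2-normalise each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    weighted = []
    for row in count_rows:
        raw = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        weighted.append([v / norm for v in raw])
    return weighted

# Toy counts: column 0 appears in every document, column 1 in only one.
counts = [[2, 1], [1, 0], [3, 0]]
weighted = tfidf_rows(counts)
for row in weighted:
    print([round(v, 3) for v in row])
```

The term that occurs in only one document receives a larger idf, so it carries more weight relative to its raw count than the term that occurs everywhere.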

Now define the test data:

input_data = [
   'Discovery was a space shuttle',
   'Hindu, Christian, Sikh all are religions',
   'We must have to drive safely',
   'Puck is a disk made of rubber',
   'Television, Microwave, Refrigrated all uses electricity'
]

Train a Multinomial Naive Bayes classifier on the tf-idf features of the training set:

classifier = MultinomialNB().fit(train_tfidf, training_data.target)
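For readers who want to see what such a classifier estimates, the following is a rough pure-Python sketch of multinomial Naive Bayes with additive smoothing. It is not scikit-learn's implementation; the function names, the two-term vocabulary, and the toy rows are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def train_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log feature probabilities."""
    n_features = len(rows[0])
    class_docs = defaultdict(int)
    class_totals = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(rows, labels):
        class_docs[label] += 1
        for j, value in enumerate(row):
            class_totals[label][j] += value
    model = {}
    for label, totals in class_totals.items():
        denom = sum(totals) + alpha * n_features
        log_prior = math.log(class_docs[label] / len(rows))
        log_probs = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_probs)
    return model

def predict(model, row):
    """Return the class with the highest log prior plus weighted log likelihood."""
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(v * lp for v, lp in zip(row, log_probs))
    return max(model, key=score)

# Toy feature rows over the hypothetical vocabulary ["orbit", "puck"].
rows = [[3, 0], [2, 1], [0, 4], [1, 3]]
labels = ["Space", "Space", "Hockey", "Hockey"]
model = train_multinomial_nb(rows, labels)
print(predict(model, [2, 0]))
print(predict(model, [0, 1]))
```

The smoothing term `alpha` keeps a feature that never occurred in a class from forcing that class's probability to zero.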

Transform the input data using the count vectorizer:

input_tc = vectorizer_count.transform(input_data)

Now transform the vectorized data using the tf-idf transformer:

input_tfidf = tfidf.transform(input_tc)

Predict the output categories:

predictions = classifier.predict(input_tfidf)

Print the results:

for sent, category in zip(input_data, predictions):
   print('\nInput Data:', sent, '\n Category:', \
      category_map[training_data.target_names[category]])
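The print statement performs a two-step lookup: each prediction is an integer index into `training_data.target_names`, and the resulting newsgroup name is the key into `category_map`. The sketch below reproduces that lookup without downloading the dataset; it assumes `target_names` lists the selected newsgroups in alphabetical order, and the prediction values are hypothetical.

```python
category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
                'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
                'sci.space': 'Space'}

# Stand-in for training_data.target_names (assumed alphabetical order).
target_names = sorted(category_map)

# Hypothetical integer predictions, in the form classifier.predict returns.
predictions = [3, 4, 0]

readable = [category_map[target_names[index]] for index in predictions]
print(readable)
```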

The category predictor generates the following output:

Dimensions of training data: (2755, 39297)


Input Data: Discovery was a space shuttle
Category: Space


Input Data: Hindu, Christian, Sikh all are religions
Category: Religion


Input Data: We must have to drive safely
Category: Autos


Input Data: Puck is a disk made of rubber
Category: Hockey


Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics

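Gender Finder

In this problem, we train a classifier to predict gender (male or female) from a first name. We use a heuristic to construct the feature vector and then train the classifier. The labeled data comes from the names corpus in the NLTK package; if the corpus is not installed yet, download it once with nltk.download('names'). The following Python code builds the gender finder.

Import the necessary packages: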

import random


from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

Now extract the last N letters from the input word. These letters serve as the features:

def extract_features(word, N = 2):
   last_n_letters = word[-N:]
   return {'feature': last_n_letters.lower()}
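A quick way to see what the classifier will receive is to call the function on one name with several values of N. The block repeats the function so that it runs on its own; the sample names are arbitrary.

```python
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}

results = [extract_features('Rajesh', n) for n in (1, 2, 3)]
for n, features in zip((1, 2, 3), results):
    print(n, features)

# A name shorter than N is returned whole, because slicing never raises.
short = extract_features('Al', 5)
print(short)
```

Every name is reduced to a single feature, its lowercased suffix, so two names with the same ending look identical to the classifier.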



Create the training data from the labeled names (male and female) available in NLTK:

male_list = [(name, 'male') for name in names.words('male.txt')]
female_list = [(name, 'female') for name in names.words('female.txt')]
data = (male_list + female_list)


random.seed(5)
random.shuffle(data)

Now create the test data:

namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']

Define the number of samples used for training; the remainder is used for testing:

train_sample = int(0.8 * len(data))
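The shuffle and the 80/20 cut work together: the data list is built as all male names followed by all female names, so slicing without shuffling would leave only female names in the test portion. The sketch below shows the same steps on a small hypothetical list standing in for the NLTK corpus.

```python
import random

# Hypothetical labeled pairs standing in for the NLTK names corpus.
data = [('Adam', 'male'), ('Brian', 'male'), ('David', 'male'),
        ('Frank', 'male'), ('Henry', 'male'), ('Anna', 'female'),
        ('Clara', 'female'), ('Emma', 'female'), ('Grace', 'female'),
        ('Irene', 'female')]

random.seed(5)          # fixed seed so the shuffle is reproducible
random.shuffle(data)    # mix the two classes before slicing

train_sample = int(0.8 * len(data))
train_data, test_data = data[:train_sample], data[train_sample:]
print(len(train_data), len(test_data))
```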

Now iterate over different suffix lengths so that the accuracies can be compared:

for i in range(1, 6):
   print('\nNumber of end letters:', i)
   features = [(extract_features(n, i), gender) for (n, gender) in data]
   train_data, test_data = features[:train_sample], features[train_sample:]
   classifier = NaiveBayesClassifier.train(train_data)

The accuracy of the classifier is computed inside the same loop, once per suffix length:

   accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
   print('Accuracy = ' + str(accuracy_classifier) + '%')
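Accuracy here is the fraction of held-out items whose predicted label matches the true label. The following sketch computes it by hand; the `accuracy` helper and the rule-based `toy_classify` function are hypothetical stand-ins, not NLTK code.

```python
def accuracy(classify, labeled_features):
    """Fraction of (features, label) pairs that `classify` labels correctly."""
    correct = sum(1 for features, label in labeled_features
                  if classify(features) == label)
    return correct / len(labeled_features)

# Hypothetical rule standing in for a trained classifier.
def toy_classify(features):
    return 'female' if features['feature'] == 'a' else 'male'

test_data = [({'feature': 'a'}, 'female'), ({'feature': 'n'}, 'male'),
             ({'feature': 'e'}, 'female'), ({'feature': 'k'}, 'male')]
score = accuracy(toy_classify, test_data)
print('Accuracy = ' + str(round(100 * score, 2)) + '%')
```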

Still inside the loop, predict the gender of each test name:

   for name in namesInput:
      print(name, '->', classifier.classify(extract_features(name, i)))

The program above generates the following output:

Number of end letters: 1
Accuracy = 74.7%
Rajesh -> female
Gaurav -> male
Swati -> female
Shubha -> female


Number of end letters: 2
Accuracy = 78.79%
Rajesh -> male
Gaurav -> male
Swati -> female
Shubha -> female


Number of end letters: 3
Accuracy = 77.22%
Rajesh -> male
Gaurav -> female
Swati -> female
Shubha -> female


Number of end letters: 4
Accuracy = 69.98%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female


Number of end letters: 5
Accuracy = 64.63%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female

As the output above shows, accuracy peaks when the last two letters are used (78.79%) and decreases as more end letters are included.
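One plausible reason for the drop, offered as an interpretation rather than something the output proves, is sparsity: longer suffixes are more specific, so each one is seen fewer times during training. The sketch below counts distinct suffixes on a hypothetical list of names to illustrate the trend.

```python
# Hypothetical sample of names; not taken from the NLTK corpus.
names = ['Anna', 'Hanna', 'Joanna', 'Brian', 'Adrian', 'Julian',
         'Maria', 'Victoria', 'Robert', 'Albert']

distinct = []
for n in range(1, 6):
    suffixes = {name[-n:].lower() for name in names}
    distinct.append(len(suffixes))
    print(n, 'end letters ->', len(suffixes), 'distinct suffixes')
```

With five letters, every name in this sample has its own suffix, so a classifier trained on such features would have nothing to generalise from.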
