Kwangmin Kim - Naive Bayes Classification

1 Naive Bayes

2 Naive Bayes Applications

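Face recognition: a classifier identifies facial features such as the face, nose, mouth, and eyes
Weather prediction: predicts whether the weather will be good or bad
Medical diagnosis: medical professionals use Naive Bayes to determine whether a patient is at high risk for certain diseases and conditions, such as heart disease and cancer
News classification: Google News classifies articles by news type
Shopping: a Naive Bayes classifier predicts whether a person will buy a product from a particular combination of day of the week, discount, and free delivery. It records whether the shopping day was a weekday, a weekend, or a holiday, and whether a discount and free delivery were offered on that day.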

3 Advantages

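Simple and easy to implement
Does not need much training data
Handles both continuous and discrete data
Scales well with the number of predictors and data points
Fast enough to use for real-time prediction
Not sensitive to irrelevant features
Text classification is the most popular application of the Naive Bayes classifier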
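
To make the point about continuous and discrete data concrete, here is a minimal sketch using scikit-learn. The feature values and labels are hypothetical toy data, not taken from this post: `GaussianNB` is one common choice for continuous features, and `BernoulliNB` for binary ones.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Hypothetical labels shared by both toy datasets: three samples per class.
y = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical continuous features: two well-separated clusters.
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.0],
                   [5.0, 6.2], [5.3, 5.9], [4.8, 6.1]])
gnb = GaussianNB().fit(X_cont, y)
cont_pred = gnb.predict([[1.1, 2.0], [5.1, 6.0]])

# Hypothetical binary (discrete) features, e.g. presence/absence flags.
X_bin = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1],
                  [0, 1, 1], [0, 0, 1], [0, 1, 1]])
bnb = BernoulliNB().fit(X_bin, y)
bin_pred = bnb.predict([[1, 0, 0], [0, 1, 1]])

print(cont_pred, bin_pred)
```

The two estimators share the same `fit`/`predict` interface; only the assumed per-feature distribution differs.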

4 Principle

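Naive Bayes predicts whether an event occurs given a set of conditions. A probability model for the event is built from the given data and is then used to make predictions when new data arrives. The model outputs the probability that the event occurs and the probability that it does not, and the outcome with the higher probability is chosen as the prediction. In other words, the given data are treated as the conditions, and the conditional probability of the event is estimated.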

The conditional probability given the conditions is computed using Bayes' theorem.
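
As a rough illustration of this principle, the sketch below applies Bayes' theorem, P(y | x) = P(x | y) P(y) / P(x), together with the "naive" assumption that the features are conditionally independent given the class, so that P(x | y) is a product of per-feature probabilities. The purchase records are hypothetical (loosely modeled on the shopping example above), the `posterior` helper is an illustrative name, and the Laplace smoothing parameter `alpha` is an added assumption to avoid zero probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (day type, discount, free delivery) -> outcome.
records = [
    (("weekday", "yes", "yes"), "buy"),
    (("weekday", "no", "no"), "no_buy"),
    (("weekend", "yes", "no"), "buy"),
    (("holiday", "yes", "yes"), "buy"),
    (("weekday", "no", "yes"), "no_buy"),
    (("weekend", "no", "no"), "no_buy"),
    (("holiday", "no", "yes"), "buy"),
    (("weekend", "yes", "yes"), "buy"),
]

def posterior(records, query, alpha=1.0):
    """Return P(class | query) for each class under the naive assumption."""
    class_counts = Counter(label for _, label in records)
    # feature_counts[label][i][value] = how often feature i took `value` in that class
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature index
    for features, label in records:
        for i, v in enumerate(features):
            feature_counts[label][i][v] += 1
            values[i].add(v)

    scores = {}
    for label, n in class_counts.items():
        score = n / len(records)  # prior P(y)
        for i, v in enumerate(query):
            # Smoothed likelihood P(x_i | y)
            score *= (feature_counts[label][i][v] + alpha) / (n + alpha * len(values[i]))
        scores[label] = score

    z = sum(scores.values())  # plays the role of the evidence P(x)
    return {label: s / z for label, s in scores.items()}

probs = posterior(records, ("weekday", "yes", "yes"))
prediction = max(probs, key=probs.get)  # choose the more probable outcome
print(probs, prediction)
```

The function returns one probability per outcome, and the prediction is simply the outcome with the larger value, as described above.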


# Load the required libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
print(data.target_names)

# Define all categories
categories =['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 
             'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 
             'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 
             'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 
             'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 
             'talk.politics.misc', 'talk.religion.misc']
# Load the training split for all categories
train = fetch_20newsgroups(subset='train', categories=categories)
# Load the test split for all categories
test = fetch_20newsgroups(subset='test', categories=categories)

# Look at one sample document (taken from the test split here)
print(test.data[5])
# print(train.data[5])
# print(len(train.data))

# Import the required libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Build a model based on Multinomial Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# Train the model on the training data
model.fit(train.data, train.target)
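
To show roughly what the pipeline does internally, the two steps can be run by hand. The sketch below uses a hypothetical four-document corpus with made-up labels (0 = space, 1 = autos) instead of the 20 Newsgroups data, so it runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for train.data / train.target.
docs = ["the rocket reached orbit",
        "the engine of the car is loud",
        "astronauts aboard the space station",
        "the car needs new tires"]
targets = [0, 1, 0, 1]  # 0 = space, 1 = autos (illustrative labels)

# Step 1: turn raw text into a sparse TF-IDF matrix (documents x vocabulary).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Step 2: fit Multinomial Naive Bayes on the TF-IDF weights.
clf = MultinomialNB().fit(X, targets)

# New text must go through the same fitted vectorizer before prediction.
X_new = vectorizer.transform(["a car with a loud engine"])
pred = clf.predict(X_new)
print(X.shape, pred)
```

`make_pipeline` chains these same two steps so that `fit` and `predict` can be called directly on raw strings.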

# Predict labels for the test data
labels = model.predict(test.data)

# Build the confusion matrix and draw it as a heat map
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)

# Label the axes: mat is transposed, so columns (x) are true labels and rows (y) are predictions
plt.xlabel('true label')
plt.ylabel('predicted label')
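
The orientation of the matrix is easy to get backwards, so here is a small sketch with hypothetical labels. In scikit-learn, `confusion_matrix(y_true, y_pred)` puts true labels on the rows and predictions on the columns; transposing it, as the heat map call does, puts predictions on the rows (y axis) and true labels on the columns (x axis). The accuracy computation is an added illustration, not something the original code reports.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# cm[i, j] = number of samples whose true label is i and predicted label is j
cm_t = cm.T
# cm_t[i, j] = number of samples predicted as i whose true label is j

# Correct predictions sit on the diagonal, so accuracy is trace / total.
acc = accuracy_score(y_true, y_pred)
acc_from_cm = np.trace(cm) / cm.sum()
print(cm, cm_t, acc, sep="\n")
```

Off-diagonal cells show which pairs of categories the classifier confuses with each other.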


# Predict the category of new text with the trained model
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

predict_category('Jesus Christ')
predict_category('Sending load to International Space Station')
predict_category('Audio is better than BMW')
predict_category('President of India')
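
`predict_category` returns only the single most likely category. For ambiguous inputs it can help to look at the class probabilities as well. The sketch below is one possible way to do that with `predict_proba`; the `top_categories` helper, the toy corpus, and the two category names used for it are hypothetical stand-ins so the block runs without downloading the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; with the real data, pass `model` and train.target_names instead.
toy_docs = ["the car engine needs an oil change",
            "new tires improve braking on the road",
            "the rocket carried a satellite into orbit",
            "astronauts docked with the space station"]
toy_targets = [0, 0, 1, 1]
toy_names = ["rec.autos", "sci.space"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_docs, toy_targets)

def top_categories(s, model, target_names, k=2):
    """Return the k most probable (category, probability) pairs for string s."""
    probs = model.predict_proba([s])[0]      # one probability per class
    order = np.argsort(probs)[::-1][:k]      # column indices, most probable first
    # Columns follow model.classes_, which holds the integer target ids.
    return [(target_names[model.classes_[i]], float(probs[i])) for i in order]

ranked = top_categories("launching a satellite into orbit", toy_model, toy_names)
print(ranked)
```

A small gap between the top two probabilities suggests the input does not clearly belong to either category.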