A simple example
Let's take a look at how this works in practice. Suppose we are building a classifier that tells whether a text is about sports or not. Our training set has 5 sentences:

Text | Category
A great game | Sports
The election was over | Not sports
Very clean match | Sports
A clean but forgettable game | Sports
It was a close election | Not sports

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence "A very close game" is Sports, and the probability that it isn't. Written mathematically, what we want is P(Sports | a very close game), the probability that the category of the sentence is Sports given that the sentence is "A very close game".

But how do we calculate these probabilities?

Feature Engineering

When creating a machine learning model, the first thing we need to do is decide what to use as features. For example, if we were classifying health data, the features might be a person's height, weight, gender, and so on. We would exclude things that are useless to the model, such as a person's name or favorite color.

In this case, we don't even have numeric features. We only have words. We need to somehow convert this text into numbers that we can calculate with.

So what do we do? We typically use word frequencies. In other words, we ignore word order and sentence structure and treat every document as a bag of words. Our features will be the counts of these words. Even though this may seem too simplistic an approach, it works surprisingly well.
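
As a minimal sketch (assuming we simply lowercase the text and split on whitespace), turning a sentence into word-count features could look like this in Python:

```python
from collections import Counter

def bag_of_words(text):
    # Ignore order and structure; just count how often each word appears.
    return Counter(text.lower().split())

print(bag_of_words("A very close game"))
# Counter({'a': 1, 'very': 1, 'close': 1, 'game': 1})
```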

Bayes' theorem

Bayes' theorem is useful when working with conditional probabilities (as we are here), because it gives us a way to reverse them: P(A|B) = P(B|A) × P(A) / P(B). In our case, we want P(Sports | a very close game), so using this theorem we can reverse the conditional probability: P(Sports | a very close game) = P(a very close game | Sports) × P(Sports) / P(a very close game).

Since for our classifier we only need to find out which category has the larger probability, we can discard the divisor, which is the same for both categories, and simply compare P(a very close game | Sports) × P(Sports) with P(a very close game | Not Sports) × P(Not Sports).

This is better, because we could actually calculate these probabilities! Just count how many times the sentence "A very close game" appears in the Sports category of the training set, divide it by the total, and you get P(a very close game | Sports).

There is a problem, though: "A very close game" doesn't appear anywhere in our training set, so this probability is zero. Unless every sentence we want to classify appears in our training set, the model won't be very useful.

Being Naive

We assume that every word in a sentence is independent of the other words. This means we no longer look at entire sentences, but at individual words. We write P(a very close game) as: P(a very close game) = P(a) × P(very) × P(close) × P(game)
This assumption is very strong, but it is extremely useful. It is what makes this model work well with little data, or with data that may be mislabeled. Next, we apply it to what we had before:

P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)

Now, all of these individual words actually do appear several times in our training set, so we can compute them!

Calculating probabilities

Calculating these probabilities is really just counting in our training set.

First, we calculate the prior probability of each category: for a given sentence in our training set, P(Sports) is 3/5 and P(Not Sports) is 2/5. Then, to calculate P(game | Sports), we count how many times "game" appears in the Sports samples (2) and divide it by the total number of words in the Sports samples (11). Therefore, P(game | Sports) = 2/11.

However, we run into a problem: "close" doesn't appear in any Sports sample! That means P(close | Sports) = 0. This is rather inconvenient, since we are going to multiply it with the other probabilities, so we would end up with P(a | Sports) × P(very | Sports) × 0 × P(game | Sports), which equals 0. That tells us nothing at all, so we have to find a way around it.

What do we do? We use something called Laplace smoothing: we add 1 to every count so that it is never zero. To balance this, we add the number of possible words to the divisor, so the result can never be greater than 1. In our case, the possible words are ["a", "great", "very", "over", "it", "but", "game", "election", "close", "clean", "the", "was", "forgettable", "match"].

Since the number of possible words is 14, applying Laplace smoothing gives the following results:

P(a | Sports) = (2 + 1) / (11 + 14) = 3/25
P(very | Sports) = (1 + 1) / (11 + 14) = 2/25
P(close | Sports) = (0 + 1) / (11 + 14) = 1/25
P(game | Sports) = (2 + 1) / (11 + 14) = 3/25
P(a | Not Sports) = (1 + 1) / (9 + 14) = 2/23
P(very | Not Sports) = (0 + 1) / (9 + 14) = 1/23
P(close | Not Sports) = (1 + 1) / (9 + 14) = 2/23
P(game | Not Sports) = (0 + 1) / (9 + 14) = 1/23

Now we just multiply all the probabilities and see which is bigger:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) × P(Sports) ≈ 2.76 × 10^-5
P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports) × P(Not Sports) ≈ 0.572 × 10^-5

Excellent! Our classifier gives "A very close game" the Sports category.
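
Putting the whole calculation together, here is a minimal sketch in plain Python (assuming lowercasing and whitespace tokenization) that reproduces the numbers above with Laplace smoothing:

```python
from collections import Counter

# Toy training set from the table above.
train = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

# Word counts and word totals per category, plus the shared vocabulary.
counts = {"Sports": Counter(), "Not sports": Counter()}
totals = {"Sports": 0, "Not sports": 0}
vocab = set()
for text, label in train:
    words = text.lower().split()
    counts[label].update(words)
    totals[label] += len(words)
    vocab.update(words)

# Prior probabilities: fraction of training sentences in each category.
priors = {label: sum(1 for _, c in train if c == label) / len(train) for label in counts}

def score(sentence, label):
    # Prior times the Laplace-smoothed likelihood of each word.
    p = priors[label]
    for word in sentence.lower().split():
        p *= (counts[label][word] + 1) / (totals[label] + len(vocab))
    return p

for label in counts:
    print(label, score("A very close game", label))
# Sports (~2.76e-05) beats Not sports (~5.72e-06), so the prediction is Sports.
```
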
Advanced techniques
There are many things that can be done to improve this basic model. The techniques below can make Naive Bayes competitive with more advanced methods; a short scikit-learn sketch combining several of them follows the list.

* Removing stopwords. These are common words that don't really add anything to the categorization, such as a, the, it, was, and so on. For our purposes, "The election was over" would become "election over" and "a very close game" would become "very close game".

* Lemmatizing words. This means grouping together different inflected forms of the same word, so election, elections, elected, and so on are counted as appearances of the same word.

* Using n-grams. Instead of counting single words, we can count sequences of words, such as "clean match" and "close election".

* Using TF-IDF. Instead of just counting frequency, we can use a more advanced weighting scheme such as TF-IDF.
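
As a hedged sketch of how these techniques might be combined in practice (assuming scikit-learn is installed), TfidfVectorizer can handle stop-word removal, n-grams, and TF-IDF weighting, with MultinomialNB as the Naive Bayes classifier; lemmatization would need an extra tool such as NLTK or spaCy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "A great game",
    "The election was over",
    "Very clean match",
    "A clean but forgettable game",
    "It was a close election",
]
labels = ["Sports", "Not sports", "Sports", "Sports", "Not sports"]

# Stop-word removal, unigrams + bigrams, and TF-IDF weighting in one step,
# followed by a multinomial Naive Bayes classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["A very close game"]))  # should predict 'Sports' on this toy data
```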