Bayes classifier


In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.[1]

Definition

Suppose a pair <math>(X,Y)</math> takes values in <math>\mathbb{R}^d \times \{1,2,\dots,K\}</math>, where <math>Y</math> is the class label of an element whose features are given by <math>X</math>. Assume that the conditional distribution of <math>X</math>, given that the label <math>Y</math> takes the value <math>r</math>, is given by

<math>(X\mid Y=r) \sim P_r</math> for <math>r=1,2,\dots,K</math>

where "<math>\sim</math>" means "is distributed as", and where <math>P_r</math> denotes a probability distribution.

A classifier is a rule that assigns to an observation <math>X=x</math> a guess or estimate of what the unobserved label <math>Y=r</math> actually was. In theoretical terms, a classifier is a measurable function <math>C: \mathbb{R}^d \to \{1,2,\dots,K\}</math>, with the interpretation that <math>C</math> classifies the point <math>x</math> to the class <math>C(x)</math>. The probability of misclassification, or risk, of a classifier <math>C</math> is defined as

<math>\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.</math>

The Bayes classifier is

<math>C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x).</math>

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, <math>\operatorname{P}(Y=r \mid X=x)</math>. The Bayes classifier is a useful benchmark in statistical classification.
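
As a concrete illustration, the following minimal Python sketch evaluates this rule via Bayes' theorem for a hypothetical one-dimensional, two-class problem. The priors, the Gaussian class-conditional densities, and all identifiers (<code>PRIORS</code>, <code>PARAMS</code>, <code>bayes_classifier</code>) are illustrative assumptions, not anything prescribed by the article.

<syntaxhighlight lang="python">
# Minimal sketch of a Bayes classifier for a hypothetical 1-D, two-class model.
# The priors and Gaussian class-conditional densities below are assumed for
# illustration only; in practice P(Y = r | X = x) is rarely known exactly.
from math import exp, pi, sqrt

PRIORS = {1: 0.6, 2: 0.4}                 # P(Y = r), assumed known
PARAMS = {1: (0.0, 1.0), 2: (2.0, 1.0)}   # (mean, std) of P_r, assumed Gaussian

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

def bayes_classifier(x):
    """Return argmax_r P(Y = r | X = x) = argmax_r P(Y = r) * p_r(x)."""
    # The normalising constant p(x) is the same for every class, so it is dropped.
    scores = {r: PRIORS[r] * gaussian_pdf(x, *PARAMS[r]) for r in PRIORS}
    return max(scores, key=scores.get)

print(bayes_classifier(0.3))  # class 1: x is much closer to the mean of P_1
print(bayes_classifier(1.8))  # class 2
</syntaxhighlight>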

The excess risk of a general classifier <math>C</math> (possibly depending on some training data) is defined as <math>\mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}).</math> This non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if its excess risk converges to zero as the size of the training data set tends to infinity.[2]
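
For a hedged numerical illustration of the excess risk, the sketch below assumes a simple two-class model that is not taken from the article: equal priors, <math>X\mid Y=0 \sim N(-1,1)</math> and <math>X\mid Y=1 \sim N(1,1)</math>, in which the Bayes rule thresholds at <math>0</math>. It compares the exact risk of that rule with the risk of a deliberately mis-set threshold rule, using only the standard normal CDF.

<syntaxhighlight lang="python">
# Exact excess risk of a mis-set threshold rule in an assumed 1-D model:
# P(Y=0)=P(Y=1)=1/2, X|Y=0 ~ N(-1,1), X|Y=1 ~ N(+1,1).
# By symmetry the Bayes rule predicts 1 iff x >= 0; we compare it with a
# rule that thresholds at 0.5 instead.
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def risk(threshold):
    """P{C(X) != Y} for the rule 'predict 1 iff x >= threshold'."""
    err_given_0 = 1.0 - norm_cdf(threshold + 1.0)  # X >= threshold while Y = 0
    err_given_1 = norm_cdf(threshold - 1.0)         # X <  threshold while Y = 1
    return 0.5 * err_given_0 + 0.5 * err_given_1

bayes_risk = risk(0.0)
excess = risk(0.5) - bayes_risk
print(f"Bayes risk  = {bayes_risk:.4f}")   # about 0.159
print(f"Excess risk = {excess:.4f}")       # non-negative, about 0.029
</syntaxhighlight>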

Considering the components <math>x_i</math> of <math>x</math> to be mutually independent given the class, we get the naive Bayes classifier, where <math>C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i).</math>
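
Under this conditional-independence assumption the posterior factorises over the features, which the following sketch implements in log space to avoid underflow. The two-feature Gaussian parameters and identifiers are again purely illustrative assumptions.

<syntaxhighlight lang="python">
# Sketch of the naive Bayes rule: argmax_r P(Y=r) * prod_i p_r(x_i),
# computed in log space. The per-class, per-feature Gaussian parameters
# below are illustrative assumptions, not estimates from real data.
from math import log, pi

PRIORS = {1: 0.5, 2: 0.5}
# PARAMS[r][i] = (mean, variance) of feature i under class r (assumed Gaussian).
PARAMS = {
    1: [(0.0, 1.0), (1.0, 2.0)],
    2: [(2.0, 1.0), (0.0, 0.5)],
}

def log_gaussian(x, mean, var):
    return -0.5 * (log(2 * pi * var) + (x - mean) ** 2 / var)

def naive_bayes(x):
    """x is a feature vector; features are treated as conditionally independent."""
    scores = {}
    for r, prior in PRIORS.items():
        scores[r] = log(prior) + sum(
            log_gaussian(xi, m, v) for xi, (m, v) in zip(x, PARAMS[r])
        )
    return max(scores, key=scores.get)

print(naive_bayes([0.2, 1.1]))  # expected: class 1
</syntaxhighlight>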

Properties

The proof that the Bayes classifier is optimal, and that the Bayes error rate is therefore minimal, proceeds as follows.

Define the variables: risk <math>R(h)</math>, Bayes risk <math>R^*</math>, and the set of possible class labels <math>\{0,1\}</math>. Let the posterior probability of a point belonging to class 1 be <math>\eta(x)=\operatorname{P}(Y=1\mid X=x)</math>. Define the classifier <math>h^*</math> as

<math>h^*(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\\ 0&\text{otherwise.}\end{cases}</math>

Then we have the following results:

(a) <math>R(h^*)=R^*</math>, i.e. <math>h^*</math> is a Bayes classifier,

(b) For any classifier <math>h</math>, the excess risk satisfies <math>R(h)-R^*=2\mathbb{E}_X\left[|\eta(X)-0.5|\cdot \mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}\right]</math>

(c) <math>R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right]</math>

(d) <math>R^* = \frac{1}{2} - \frac{1}{2}\mathbb E [|2\eta(X) - 1|]</math>
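
Before turning to the proofs, a quick sanity check of these identities on an assumed example: if <math>\eta(x) = 0.8</math> for every <math>x</math>, then (c) gives <math>R^* = \min(0.8, 0.2) = 0.2</math> and (d) gives <math>\tfrac{1}{2} - \tfrac{1}{2}|2(0.8)-1| = 0.2</math>, in agreement. For the constant classifier <math>h \equiv 0</math>, which disagrees with <math>h^*</math> everywhere, (b) gives an excess risk of <math>2|0.8-0.5| = 0.6</math>, so <math>R(h) = 0.8</math>, which is indeed the probability that <math>Y=1</math>.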


Proof of (a): For any classifier <math>h</math>, we have

<math>R(h) = \mathbb{E}_{XY}\left[ \mathbb{I}_{ \left\{h(X)\ne Y \right\}} \right] </math>

<math>=\mathbb{E}_X\mathbb{E}_{Y|X}[\mathbb{I}_{ \left\{h(X)\ne Y \right\}} ]</math> (due to Fubini's theorem)

<math>= \mathbb{E}_X[\eta(X)\mathbb{I}_{ \left\{h(X)=0\right\}} +(1-\eta(X))\mathbb{I}_{ \left\{h(X)=1 \right\}} ] </math>

Notice that <math>R(h)</math> is minimised by taking, for every <math>x\in \mathbb{R}^d</math>,

<math>h(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 1-\eta(x),\\ 0&\text{otherwise.}\end{cases}</math>

Therefore the minimum possible risk is the Bayes risk, <math>R^*= R(h^*)</math>.


Proof of (b):

<math>\begin{aligned} R(h)-R^* &= R(h)-R(h^*)\\
   &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h(X)=1\right\}}-\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}-(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\
   &=\mathbb{E}_X[|2\eta(X)-1|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]\\
   &= 2\mathbb{E}_X[|\eta(X)-0.5|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]
\end{aligned}</math>

Here the third equality uses the fact that the bracketed expression vanishes on the event <math>\{h(X)= h^*(X)\}</math>, while on its complement it equals either <math>\eta(X)-(1-\eta(X))</math> or <math>(1-\eta(X))-\eta(X)</math>, whichever is non-negative, i.e. <math>|2\eta(X)-1|</math>.


Proof of (c):

<math>\begin{aligned} R(h^*) &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\ &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \end{aligned}</math>

Proof of (d):

<math> \begin{aligned} R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \\ &= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\\ &=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|] \end{aligned}</math>
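
These closed-form expressions can also be checked by simulation. The sketch below assumes an illustrative two-Gaussian model (equal priors, <math>X\mid Y \sim N(2Y-1,\,1)</math>), not one given in the article, for which <math>\eta(x)</math> and <math>h^*</math> are available in closed form, and compares a Monte Carlo estimate of <math>R(h^*)</math> with the expressions in (c) and (d).

<syntaxhighlight lang="python">
# Monte Carlo check of R(h*) = E[min(eta, 1 - eta)] = 1/2 - (1/2) E|2 eta - 1|
# in an assumed model: P(Y=0)=P(Y=1)=1/2, X|Y=0 ~ N(-1,1), X|Y=1 ~ N(+1,1),
# for which eta(x) = phi(x-1) / (phi(x-1) + phi(x+1)) and h*(x) = 1{x >= 0}.
import random
from math import exp

def phi(x):
    return exp(-0.5 * x * x)  # unnormalised N(0,1) density; constants cancel in eta

def eta(x):
    return phi(x - 1.0) / (phi(x - 1.0) + phi(x + 1.0))

random.seed(0)
n = 200_000
errors = min_eta = abs_term = 0.0
for _ in range(n):
    y = random.randint(0, 1)
    x = random.gauss(2.0 * y - 1.0, 1.0)       # sample X given Y
    errors += ((x >= 0.0) != bool(y))          # h*(x) = 1{x >= 0} misclassifies
    e = eta(x)
    min_eta += min(e, 1.0 - e)
    abs_term += abs(2.0 * e - 1.0)

print(errors / n)                  # empirical R(h*), about 0.159
print(min_eta / n)                 # estimate of E[min(eta, 1 - eta)], identity (c)
print(0.5 - 0.5 * abs_term / n)    # estimate via identity (d)
</syntaxhighlight>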

General case

In the general case, where each element can belong to any of <math>n</math> categories, the proof that the Bayes classifier minimises the classification error proceeds by the tower property of conditional expectation, as follows.

<math>\begin{align} \operatorname{P}(Y\ne \hat{y}(X)) = \mathbb{E}\left[\mathbb{I}_{\{Y\ne \hat{y}(X)\}}\right] &= \mathbb{E}_X\,\mathbb{E}_{Y|X}\left[\mathbb{I}_{\{Y\ne \hat{y}(x)\}}\mid X=x\right]\\ &= \mathbb{E}_X\left[\operatorname{P}(Y=1\mid X=x)\,\mathbb{I}_{\{\hat{y}(x)\ne 1\}}+\operatorname{P}(Y=2\mid X=x)\,\mathbb{I}_{\{\hat{y}(x)\ne 2\}}+\dots+\operatorname{P}(Y=n\mid X=x)\,\mathbb{I}_{\{\hat{y}(x)\ne n\}}\right] \end{align}</math>

This is minimised by minimising each term of the expectation separately, i.e. by using the classifier

<math>h(x)=\underset{k}{\operatorname{argmax}}\ \operatorname{P}(Y=k\mid X=x)</math>

for each observation <math>x</math>.
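
For instance, if at some point <math>x</math> the posterior probabilities were (hypothetically) <math>0.5</math>, <math>0.3</math> and <math>0.2</math> for classes <math>1</math>, <math>2</math> and <math>3</math>, the conditional probabilities of error for the three possible predictions <math>\hat{y}=1,2,3</math> would be <math>0.5</math>, <math>0.7</math> and <math>0.8</math> respectively, so predicting the arg max class <math>\hat{y}=1</math> minimises the error at that point.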

See also

References
