カーネル密度推定~グラフィカルにまとめてみる~ - 医療職からデータサイエンティストへ

前回はR関数を実行して、抽出されたサンプルから母集団の確率密度を推定しました。今回は、Rの関数を使わずにカーネル密度推定を行いたいと思います。

medi-data.hatenablog.com

出来るだけグラフィカルにまとめてみます。

ヒストグラム密度推定

カーネル密度推定をする前に、まずはヒストグラム密度推定という簡単な方法で確率密度を推定してみましょう。

> set.seed(12)
> rand <- round(rnorm(50,0,1),1)
> estimated <- density(rand)
>
> histd <- hist(rand,freq = F,main="",
     xlab = "",xlim = c(-3,3),ylim = c(0,0.5))
> text(histd$mids,histd$density,histd$density,
      pos = 3,cex = 1)
> points(rand,rep(0,50),pch = "l")
> lines( estimated,col = "red")
> legend(x = "topright",lty = 1 , 
        col = "red",legend = "Estimated Density",cex = 0.8)

カーネル密度推定

Rで実行したデフォルトのヒストグラムはこんな感じでした。density関数で推定した密度とデフォルトヒストグラムの密度も描画してあります。

今度はヒストグラムの幅を変えてやってみます。

カーネル密度推定

幅が小さすぎるとギザギザになりすぎて、一つ一つの値に影響されすぎています。逆に幅が大きすぎると、密度をうまく推定できていないことが分かります。

さらにヒストグラムでは、値が離散的になるので断続的な推定値になります。この課題を解決するのがカーネル密度推定になります！

カーネル密度推定（ガウスカーネル）

それでは、カーネル密度推定をやってみましょう。実はカーネル密度推定には種類があって、離散的な密度推定を行う方法もあるのですが、今回は正規分布を使った連続的な密度推定を行います。

まずカーネル密度推定の数学的な式はこのような感じになります。

$p(x) = \dfrac{1}{Nh} \sum_{i = 1}^{N}k \left(\dfrac{x - x_{i}}{h}\right)$

このシグマの後ろにあるkがカーネル関数といって、このkを変化させることで色々な種類の推定ができるようになります。

今回はこのkに以下のようなガウスカーネルを設定します。

$k\left(\dfrac{x - x_{i}}{h}\right) = \dfrac{1}{\sqrt{2\pi}}\exp\left\{-\dfrac{(x - x_{i})^{2}}{2h^{2}}\right\}$

以上をまとめると

$\large p(x) = \dfrac{1}{Nh} \sum_{i = 1}^{N}\dfrac{1}{\sqrt{2\pi}}\exp\left\{-\dfrac{(x - x_{i})^{2}}{2h^{2}}\right\}$

この式の意味は、 $x$ の確率密度 $p(x)$ は、平均 $x$ 、標準偏差 $h$ の正規分布から求めたサンプル数N個の確率密度の平均値という意味になります。

確率密度をN分の1にしたN個の正規分布の足し合わせとも解釈できますね。

グラフのイメージはこんな感じです。

カーネル密度推定

実際に値 $x$ が0の時の確率密度を求めてみましょう！Rの関数densityを使って求めた値も出しておきます。

> #値が0の時の確率密度
> estimated$y[round(estimated$x,2)==0]
[1] 0.4311749
>#ガウスカーネルを定義
>gaus <- function(xi,x,h){
  (1/(2*pi)^(1/2))*exp(-((x-xi)^2/(2*h^2)))
}
> #値が0の時、平均rand、分散0.3のガウス分布確率密度を
> #50分の1にした確率密度ベクトル
> densevec <- gaus(0,rand,0.3)/(length(rand)*0.3)
> #それを合計
> sum(densevec)
[1] 0.4339481

出ました！ほとんど同じ値になってます！これを-4~4まで0.1刻みにした全ての $x$ について推定してみましょう！

>#-4,4までのx
> seqx <- seq(-4,4,0.1)

> #seqxについて推定
> myestimated <- sapply(seqx, function(x,h=0.3){
   densevec <- gaus(x,rand,h)/(length(rand)*h)
   sum(densevec)
 })
> plot(estimated$x,estimated$y , type = "l"
      ,col = "red",ylab = "",xlab = "")
> points(seqx,myestimated,type = "l",col = "blue")
> legend(x = "topright",lty = 1 , 
        col = c("red","blue")
        ,legend = c("Estimated Density","Myestimated Density")
        ,cex = 0.8)

カーネル密度推定

おお！ほとんどぴったり重なっています。

次に、標準偏差hを今回は勝手に0.3としましたが、これを変えるとどうなるかみて見ましょう。

カーネル密度推定

どうやら、hが小さすぎると一つ一つの値に影響されすぎて、逆にhが大きすぎると、なだらかになりすぎています。これはヒストグラム推定の時と似ていますね。

確率密度がうまく推定されるhを見つける必要がありそうです。

ちなみにdensity関数は、どれぐらいの大きさにhを設定しているかというと

> #density関数にアクセス
> estimated$bw
[1] 0.353217

0.35ぐらいになっているようです。しかし、どうやって求めているのでしょう...

サンプルサイズを大きくしてみる

次にサンプルサイズを大きくしてみるとどうなるか確認して見ましょう。1000個ぐらいのサンプルにしてみます。

カーネル密度推定

サンプルサイズが大きい場合はhが小さくても、ある程度正確な推定ができるようです。

分布を変えてみる

次は分布を変えてやって見ましょう。今回は異なる平均をもつ二つの正規分布を足し合わせた分布から100個の乱数を生成します。

> #二つの正規分布を6:4で足し合わせた確率密度から乱数を生成する
> 
gene <- function(n){
   prob <- (runif(n) <= 0.6)  # 確率0.6で片方の分布を採択
   rnorm(n,-1,1) *prob+rnorm(n,3,1)*(1-prob) 
 }
> 
set.seed(12)
rand3 <- gene(100)
estimated3 <- density(rand3)
 
plot(estimated3$x,estimated3$y , type = "l"
      ,col = rainbow(5)[1],ylab = "",xlab = "",
      ylim = c(0,0.3))
hist(rand3,add = T,freq = F)
legend(x = "topright",col = "red",lty = 1,
       legend = "Estimated Density",cex = 0.8)

カーネル密度推定

こんな二峰性の確率密度です。同じようにhを変えてグラフ化してみると...

カーネル密度推定

標準偏差hが２と大きくなると二峰性の形をうまく推定できていないことが分かります。

まとめ

今回はカーネル密度推定について、出来るだけグラフィカルにまとめて見ました。サンプルから母集団の確率密度をノンパラメトリックに推定するには、パラメータhの設定とサンプルの大きさがキーワードになりそうです！

最後までお読みいただきありがとうございました！

参考

カーネル密度推定法（第12章） - TOKYO TECH OCW

パターン認識と機械学習上

作者: C.M.ビショップ,元田浩,栗田多喜夫,樋口知之,松本裕治,村田昇
出版社/メーカー: 丸善出版
発売日: 2012/04/05
メディア: 単行本（ソフトカバー）
購入: 6人クリック: 33回
この商品を含むブログ (20件) を見る

今回のRソースコード

#密度推定
set.seed(12)
rand <- round(rnorm(50,0,1),1)
estimated <- density(rand)
histd <- hist(rand,freq = F,main="",
     xlab = "",xlim = c(-3,3),ylim = c(0,0.5))
text(histd$mids,histd$density,histd$density,
     pos = 3,cex = 1)
points(rand,rep(0,50),pch = "l")
lines(density(rand),col = "red")
legend(x = "topright",lty = 1 , 
       col = "red",legend = "Estimated Density",cex = 0.8)



#幅を変えて
for(i in c(0.1,0.5,1,2)){
  histd <- hist(rand,freq = F,main="",
                xlab = "",ylab = "",
                xlim = c(-3,3),ylim = c(0,0.5),
                breaks = seq(-3,3,i))
  lines(estimated,col = "red")
  text(2,0.4,paste0("h = ",i),
       cex = 1)
}



#グラフイメージ
#分かりやすくするために、確率密度の値を大きめにしてあります。
estim <- sapply(seqx, function(x,h=0.3){
  gaus(x,rand,h)/(20*h)
})

plot(1,1,xlim= c(-2,2),ylim = c(0,0.4),
     type = "n",xlab = "",ylab ="")
lines(estimated,col = "red")
for(i in 1:10){
  points(rand[i],0,cex = 2 , pch = "*",col = "red")
  points(seqx,estim[i,],type = "l")
}

#値が0の時の確率密度
estimated$y[round(estimated$x,2)==0]

#ガウスカーネルを定義
gaus <- function(xi,x,h){
  (1/(2*pi)^(1/2))*exp(-((x-xi)^2/(2*h^2)))
}

#値が0の時、平均rand、分散0.3のガウス分布確率密度を
#50分の1にした確率密度ベクトル
densevec <- gaus(0,rand,0.3)/(length(rand)*0.3)
#それを合計
sum(densevec)

#-4,4までのx
seqx <- seq(-4,4,0.1)

myestimated <- sapply(seqx, function(x,h=0.3){
  densevec <- gaus(x,rand,h)/(length(rand)*h)
  sum(densevec)
})
plot(estimated$x,estimated$y , type = "l"
     ,col = "red",ylab = "",xlab = "")
points(seqx,myestimated,type = "l",col = "blue")
legend(x = "topright",lty = 1 , 
       col = c("red","blue")
       ,legend = c("Estimated Density","Myestimated Density")
       ,cex = 0.8)

#hを変えて

i <- 2
for(j in c(0.1,0.5,1,2)){
  plot(estimated$x,estimated$y , type = "l"
       ,col = rainbow(5)[1],ylab = "",xlab = "")
  myestimated <- sapply(seqx, function(x,h=j){
    densevec <- gaus(x,rand,h)/(length(rand)*h)
    sum(densevec)
  })
  points(seqx,myestimated,type = "l",col = rainbow(5)[i])
  text(2,0.4,paste0("h = ",j),
       cex = 1)
  i <- i+1
}

#density関数にアクセス
estimated$bw

# サンプルサイズを大きくしてみる
set.seed(12)
rand2 <- round(rnorm(1000,0,1),1)
estimated2 <- density(rand2)

i <- 2
for(j in c(0.1,0.5,1,2)){
  plot(estimated2$x,estimated2$y , type = "l"
       ,col = rainbow(5)[1],ylab = "",xlab = "")
  myestimated <- sapply(seqx, function(x,h=j){
    densevec <- gaus(x,rand2,h)/(length(rand2)*h)
    sum(densevec)
  })
  points(seqx,myestimated,type = "l",col = rainbow(5)[i])
  text(2,0.4,paste0("h = ",j),
       cex = 1)
  i <- i+1
}
#分布を変えてみる
#二つの正規分布を6:4で足し合わせた確率密度から乱数を生成する
gene <- function(n){
  prob <- (runif(n) <= 0.6)  # 確率0.6で片方の分布を採択
  rnorm(n,-1,1) *prob+rnorm(n,3,1)*(1-prob) 
}

set.seed(12)
rand3 <- gene(100)
estimated3 <- density(rand3)

plot(estimated3$x,estimated3$y , type = "l"
           ,col = rainbow(5)[1],ylab = "",xlab = "",
           ylim = c(0,0.3))
hist(rand3,add = T,freq = F)
legend(x = "topright",col = "red",lty = 1,
        legend = "Estimated Density",cex = 0.8)

i <- 2
for(j in c(0.1,0.5,1,2)){
  plot(estimated3$x,estimated3$y , type = "l"
       ,col = rainbow(5)[1],ylab = "",xlab = "")
  myestimated <- sapply(seqx, function(x,h=j){
    densevec <- gaus(x,rand3,h)/(length(rand3)*h)
    sum(densevec)
  })
  points(seqx,myestimated,type = "l",col = rainbow(5)[i])
  text(5,0.15,paste0("h = ",j),
       cex = 1)
  i <- i+1
}