Pseudo-random number sampling
Formula for Monte Carlo generation
- In case of continuous distribution
- In case of discrete distribution
runif()를 이용하여 다른 분포 만들어 보기
이항분포
다항 분포
정규분포
카이제곱 분포
Student’s t 분포
F분포
지수분포
포아송 분포
감마분포
베타분포

Pseudo-random number sampling

Monte-Carlo simulations를 하기 위해서는 알려진 확률분포를 따르는 random number를 만들 수 있어야 한다. 그래서 1950년도부터 pseudo-random number sampling algorithm이 개발되었다.

GNU Scientific Library에 다양한 algorithm이 나와있다고 하나, C언어로 작성되어 있기에 확인해 볼 수 없었다. 그래서 비교적 계산속도는 늦어지겠지만, R함수를 이용해 다양한 distribution에서 random number를 만드는 함수를 만들어 보았다. 그리고 만들어 본 것들을 R에 내장된 stats 패키지와 비교해 보았다.

Formula for Monte Carlo generation

먼저 hogg책 4.8장에 나와있는 공식을 정리하였다. discret distribution도 만들고자 수업시간에 넘어간 Accept-Reject Algorithm까지 정리하였다.

In case of continuous distribution

Theorem 4.8.1

U~(0, 1) 이고 F가 continuous distirbution fucntion이라면, random variable X = F^− 1(U)는 F의 distribution function을 가진다.

(F가 strictly monotone아닌 경우에도 성립)

In case of discrete distribution

Accept-Reject Algorithm 4.8.1

먼저 우리는 pdf가 f(x)를 따르는 X를 구하고자 한다.

Y~pdfg(y)이고 U~unif(0, 1)일때, Y와 U가 독립이고 f(x) ≤ M g(x)일 경우 (단 M은 상수)

Y와 U를 만들고
$U \le \frac{f(Y)}{M\ g(Y)}$인경우 X = Y라고 하면

X는 pdf f(x)를 가진다.

(pdf를 normalizing하는 constants를 무시하고 알고리즘 사용 가능)

runif()를 이용하여 다른 분포 만들어 보기

사실 runif()도 만들어 사용하고 싶었다. 사실 이 부분은 통계학이라기 보다는 컴퓨터 과학이나 처음부터 만들려고 찾아봤다. R에서는 “Mersenne-Twister” 방법을 사용한다고 한다. 무려 2¹⁹⁹³⁷ − 1만큼 반복되어 유의미하나 624만의 반복된 수를 알면 seed수를 알 수 있다고 한다. 그 코드(C)가 영어 위키백과에도 나와있으나, 이를 R로 구현시 크기가 커 문제가 있다고 하고, 이를 다시 unif 분포로 만들기 어려워, 상대적으로 간단한 Linear congruential generator 방식을 이용해 runif()를 만들어 봤다. 이 방식은 초기값에 따라 난수의 질이 달라질 수 있고 마지막 난수만 알면 그 다음의 sequence를 예측 할 수 있기에 암호학 적으로 안전하지 않다고 한다. 먼저 방식은 다음과 같다.

양수 m과 0 < a < m, 0 ≤ c < m, 0 ≤ X₀ < m 를 만족하는 a, c, X₀에 대해서 난수는 아래와 같다.

X_n + 1 = (aX_n + c) mod m

코드로 구현하면 다음과 같다.

lcp_unif = function(n, m,a,c,seed){
  result = rep(0,n)
  result[1] = (a*seed+c) %% m
  for(i in 2:n){
    result[i] = (a*result[i-1]+c) %% m
  }
  return( result / m )
}

모수 m, a, c 를 정하는 다양한 방식이 있으나, POSIX [de]rand48에서 사용하는 값을 사용하였다.

random = lcp_unif(1000, m=2**48,a=  25214903917,c=11,seed=2020)

tb = tibble(r=runif(1000),
            lcp=lcp_unif(1000, m=2**48,a=   25214903917,c=11,seed=2020))  %>% gather(key="func",value="y",c("r","lcp"))
tb %>% ggplot(aes(x=y,col=func)) + geom_freqpoly()

하지만 이 방식은 앞서 말한것 처럼 연속된 난수들간에 상관관계가 존재하기에 좋은품질의 유사난수라고 하기는 어렵다. 따라서 우리가 사용하려고 하는 몬테 카를로 시뮬레이션에 적합하지 않다. 따라서 “Mersenne-Twister”기반인 runif(1)함수를 바탕으로 아래의 함수를 만들어 보았다. (물론 Mersenne-Twister도 동일한 문제에서 자유로울 수는 없으나, 상대적인 관점에서 사용하게 되었다.)

이항분포

lcp_binom = function(n, size, prob){
  result=rep(0,n)
  for(i in 1:n){
    x=0
    for(j in 1:size){
      u = runif(1)
      if( u < prob){ result[i]=result[i]+1 }
    }
  }
  return(result)
}

n=10000인 B(10,0.4)인 분포이다.

set.seed(2020)
tb = tibble(r=rbinom(10000,10,0.4),
            lcp=lcp_binom(10000,10,0.4))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram(breaks=0:10-0.5,aes(y=..density..))+facet_wrap(~func)

다항 분포

lcp_multinom = function(n, size, prob){
  result=matrix(0,ncol=n,nrow=length(prob))
  prob = c(prob,1)
  for(i in 1:n){
    x=0
    for(j in 1:size){
      u = runif(1)
      if(  u <=prob[1] ){ result[1,i]=result[1,i]+1 }
      for(k in 2:length(prob)){
      if( prob[k-1]< u & u <= prob[k] ){ result[k-1,i]=result[k-1,i]+1 }
      }
    }
  }
  return(result)
}

n=10인 multinormial(0.1,0.2,0.8)인 분포이다.

set.seed(2020)
rmultinom(10,5,c(0.1,0.2,0.8))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    0    0    0    0    1    1    0    2     0
## [2,]    0    1    0    1    1    1    1    1    1     0
## [3,]    4    4    5    4    4    3    3    4    2     5

lcp_multinom(10,5,c(0.1,0.2,0.8))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    1    2    1    0    1    0    1    0     1
## [2,]    0    4    0    3    2    4    3    4    4     3
## [3,]    3    0    3    1    3    0    2    0    1     1

정규분포

정규분포의 경우 2개의 모수 μ와 σ²가 필요하므로 2개의 unif가 필요하다. 방법으로는 책에 나와 있는 Box–Muller transform 방법과, 이를 조금 더 심화한 Marsaglia polar method, 그리고 더 심화하여 가장 최신의 방법인 Ziggurat algorithm이 있다.

먼저 책에 나와 있는 Box–Muller방법을 이용해 만들어 보았다.

lcp_norm1 = function(n, mean = 0, sd = 1){
  result=rep(0,n)
  for(i in 1:n){
    u1 = runif(1)
    u2 = runif(1)
    t = sqrt(-2*log(u1))*cos(2*pi*u2)
    result[i]= t*sd + mean
  }
  return(result)
}

그리고 Box–Muller transform 방법을 심화시킨 Marsaglia polar method이다. 이는 unif를 따르는 2개의 U를 -1과 1사이의 좌표에 포함되는 x,y로 만들고 이들중 단위원에 포함되는 x,y를 바탕으로 한다. 앞에서 cos(2πU)를 $x\over {\sqrt s}$형태로 바꾸어 코사인 함수를 간접적으로 사용하게 된 것이다. 코드를 만들어 보면 아래와 같다.

lcp_norm2 = function(n, mean = 0, sd = 1){
  result=rep(0,n)
  for(i in 1:n){
    s=1
    while(s<=0 | s>=1){
    x = runif(1)*2-1 # runif(1)만을 사용하기 위해 runif(1,-1,1)을 일부로 사용하지 않음
    y = runif(1)*2-1
    s = x**2 + y**2
    }
    t = sqrt(-2*log(s)/s) * x
    result[i]= t*sd + mean
  }
  return(result)
}

n=10000인 N(1, 0.5)인 분포이다.

set.seed(2020)
tb = tibble(r=rnorm(10000,1,0.5), lcp_BoxMuller=lcp_norm1(10000,1,0.5), lcp_Marsaglia=lcp_norm2(10000,1,0.5)) %>% gather(key="func",value="y",c("r","lcp_BoxMuller","lcp_Marsaglia"))
tb %>% ggplot(aes(x=y,col=func))+geom_freqpoly(aes(y=..density..))+facet_wrap(~func)+geom_vline(aes(xintercept = 1) , linetype = "dashed")

위의 두 방법에서 알 수 있는 것과 같이 log와 sqrt 그리고 삼각함수 등이 사용되기에 계산에 시간이 걸린다는 단점이 있다. 이를 보완한 방법이 Ziggurat algorithm이다. 정규분포의 pdf를 같은 면적으로 분할한 후 그 layer 중 하나를 바탕으로 random generating을 한 후 합치는 방식이다. 코드는 생략하였다.

카이제곱 분포

iid인 표준정규분포를 제곱하고 더하면 쉽게 만들 수 있다. Marsaglia polar method 방식의 정규분포를 이용하였다

lcp_chisq = function(n, df){
  result=rep(0,n)
  for(i in 1:n){
    result[i]= sum(lcp_norm2(df)**2)
  }
  return(result)
}

n=10000인 χ²(3)인 분포이다.

set.seed(2020)
tb = tibble(r=rchisq(10000,3),
            lcp=lcp_chisq(10000,3))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram(breaks=0:10-0.5, aes(y=..density..))+facet_wrap(~func)

Student’s t 분포

t분포도 정규분포와 카이제곱 분포를 통해 쉽게 만들 수 있다. $Z\over \sqrt {V/\nu}$이므로 lcp_norm2와 lcp_chisq를 사용하여 만들었다.

lcp_t = function(n, df){
  result = lcp_norm2(n) / sapply(lcp_chisq(n, df),sqrt) *
sqrt(df)
  return(result)
}

n=10000인 t(2)인 분포이다.

set.seed(2020)
tb = tibble(r=rt(10000,2),
            lcp=lcp_t(10000,2))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram(breaks=-10:10/2, aes(y=..density..))+facet_wrap(~func)

F분포

F분포도 정의 $V_1/k_1\over V_2/k_2$에 의해 lcp_chisq를 통해 간단히 만들 수 있다.

lcp_f = function(n, df1, df2){
  result = (lcp_chisq(n,df1)/df1) / (lcp_chisq(n, df2)/df2)
  return(result)
}

n=10000인 F(10,2)인 분포이다.

set.seed(2020)
tb = tibble(r=rf(10000,10,2),
            lcp=lcp_f(10000,10,2))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram(breaks=-0:50/10, aes(y=..density..))+facet_wrap(~func)

지수분포

continuous 분포로 간단하게 만들 수 있다. (뒤에 만들 감마 분포에 alpha에 1을 대입해도 가능하다.)

lcp_exp = function(n,rate=1){
  result=rep(0,n)
  for(i in 1:n){
    result[i]= -(1/rate)*log(1-runif(1))
    }
  return(result)
}

n=10000인 exp(1.2)인 분포이다.

set.seed(2020)
tb = tibble(r=rexp(10000,1.2),
            lcp=lcp_exp(10000,1.2))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram(breaks=0:40/10-0.1, aes(y=..density..))+facet_wrap(~func)

포아송 분포

Hogg책에 나와있는, 지수분포를 활용한 방식으로 알고리즘을 구성해 보았다. 즉 위에 지수분포를 활용해 rejection을 이용하는 방식이다

lcp_pois = function(n,lambda){
  result=rep(-1,n)
  for(i in 1:n){
    t=0
    while(t<=1){
      u = runif(1)
      y = -(1/lambda)*log(1-u)
      t = t+y
    result[i]= result[i]+1
    }
  }
  return(result)
}

n=10000인 pois(3)인 분포이다.

set.seed(2020)
tb = tibble(r=rpois(10000,3),
            lcp=lcp_pois(10000,3))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram(breaks=0:10-0.5, aes(y=..density..))+facet_wrap(~func)

감마분포

Computer Methods for Sampling from Gamma, Beta, Poisson and Binomial Distributions (1973) J. H. Ahrens, Halifax, and U. Dieter, Graz에서 사용한 방식을 참고한다. 먼저 Gamma(1,1)분포의 pdf는 e^− x형태로 간단한 지수함수형태이다. 따라서 iid한 Gamma(1,1)를 alpha개 더하고 beta(rate parameter)를 곱하면 Gamma(alpha,beta)가 될 것이다.

lcp_gamma1 = function(n, shape, rate = 1, scale = 1/rate){
      if (!missing(rate) && !missing(scale)) {
        if (abs(rate * scale - 1) < 1e-15) 
            warning("specify 'rate' or 'scale' but not both")
        else stop("specify 'rate' or 'scale' but not both")
      }
  result=rep(0,n)
  for(j in 1:n){
  i=1
  p=1
  while(i!=shape+1){
    p=p*runif(1)
    i=i+1
  }
  result[j]=-log(p)*scale
  }
  return(result)
}

n=10000인 Gamma(4, 2)인 분포이다.

set.seed(2020)
tb = tibble(r=rgamma(10000,4,rate=2),
            lcp=lcp_gamma1(10000,4,rate=2))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram( aes(y=..density..))+facet_wrap(~func)

그러나 shape parameter alpha가 정수인 경우에만 가능한 방법이다. 따라서 0 < a ≤ 1인 경우에는 다른 방식이 필요하다.

lcp_gamma_sub = function(shape, rate = 1, scale = 1/rate){
  i=1
  p=1
  result=NA
  while(is.na(result)){
    u=runif(1)
    b=(exp(1)+shape)/exp(1)
    p=b*u
    if(p>1){
      x=-log((b-p)/shape)
      if(runif(1)<=x**(shape-1)){result=x}
    }else{
      x=p**(1/shape)
      if(runif(1)<=exp(-x)){result=x}
    }
  }
  return(result)
}

lcp_gamma_sub(0.3) ; lcp_gamma_sub(0.3); lcp_gamma_sub(0.3) ; lcp_gamma_sub(0.3)

## [1] 1.512437

## [1] 0.0005884823

## [1] 0.04528588

## [1] 0.6667582

이제 이 둘을 합쳐 모든 shape parameter alpha에 대해 성립하도록 만들어 보면 다음과 같다.

lcp_gamma2 = function(n, shape, rate = 1, scale = 1/rate){
  result=rep(0,n)
  m = shape %/% 1
  f = shape - m
  for(j in 1:n){
    if(m==0){
      y=0
    }else{
      y =lcp_gamma1(1,m)
    }
    if(f==0){
      z=0
    }else{
      z = lcp_gamma_sub(f)
    }
  result[j]=(y+z)*scale
  }
  return(result)
}

n=10000인 Gamma(3.3, 3.3)인 분포이다.

set.seed(2020)
tb = tibble(r=rgamma(10000,3.3,rate=2.2),
            lcp=lcp_gamma2(10000,3.3,rate=2.2))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram( aes(y=..density..))+facet_wrap(~func)

추가적으로 위의 방식은 rejection이 많아 계산이 오래걸릴 수있다. A Convenient Way of Generating Gamma Random Variables Using Generalized Exponential Distribution (2007) Debasis Kundu & Rameshwar D. Gupta에 따르면나온 동일하게 기본적으로 Accept-Reject Algorithm방식을 활용하지만 rejection을 줄이는, 즉 좀더 빠르게 만들 수 있는 알고리즘을 알 수 있다.

베타분포

베타분포는 2개의 감마분포를 통해 계산할 수 있다. 즉 각각의 shape parameter가 α β인 두 표준감마분포 X_α X_β에 대해 $X_\alpha\over X_\alpha +X_\beta$는 Beta(α, β)를 따르게 된다. 따라서 lcp_gamma2를 사용하여 코드는 다음과 같다.

lcp_beta = function(n, shape1, shape2){
  x1 = lcp_gamma2(n,shape1)
  x2 = lcp_gamma2(n,shape2)
  return(x1/(x1+x2))
}

n=10000인 Beta(0.5, 0.5)인 분포이다.

set.seed(2020)
tb = tibble(r=rbeta(10000,0.5, 0.5),
            lcp=lcp_beta(10000,0.5, 0.5))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram( aes(y=..density..))+facet_wrap(~func)

n=10000인 Beta(2, 4)인 분포이다.

set.seed(2020)
tb = tibble(r=rbeta(10000,2,4),
            lcp=lcp_beta(10000,2,4))  %>% gather(key="func",value="y",c("r","lcp"))

tb %>% ggplot(aes(x=y,fill=func))+geom_histogram( aes(y=..density..))+facet_wrap(~func)

Monte Carlo generation