LDA Text Analysis
ID3498263 2021-05-06

This post walks through an LDA analysis of text data in R. Comments and discussion are welcome!

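1. The first step in an LDA analysis is to decide on the number of topics, yet there is no single accepted method for doing so. This post uses two criteria, perplexity and log-likelihood, which point in opposite directions: the lower the perplexity the better the model, whereas the higher the log-likelihood the better. To choose the number of topics for a corpus, compute both quantities for a range of topic numbers and look at how they change. The topic number at the inflection point (elbow) of the two curves can be taken as the working number of topics; beyond that point, both perplexity and log-likelihood level off. Spotting the inflection point and the trend requires visualizing the data, so perplexity and log-likelihood are each plotted against the number of topics. [1]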
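To make the two criteria concrete, here is a minimal sketch in base R. It assumes the usual definition of perplexity as exp(-log-likelihood / number of tokens), which is why the two measures move in opposite directions. The helper names perplexity_from_loglik and find_elbow, the 5% tolerance, and the toy numbers are all hypothetical and serve only as illustration; they are not taken from reference [1] or from the results of this post.

```r
# Perplexity from a held-out log-likelihood and the number of tokens scored.
# Higher log-likelihood -> lower perplexity.
perplexity_from_loglik <- function(loglik, n_tokens) {
  exp(-loglik / n_tokens)
}

# Hypothetical elbow rule: return the first k after which the relative drop
# in perplexity falls below `tol` (5% by default, an arbitrary choice).
find_elbow <- function(k, perplexity, tol = 0.05) {
  stopifnot(length(k) == length(perplexity), length(k) >= 2)
  rel_drop <- -diff(perplexity) / head(perplexity, -1)  # gain from k[i] to k[i+1]
  flat <- which(rel_drop < tol)
  if (length(flat) == 0) return(k[length(k)])  # still improving at the largest k
  k[flat[1]]
}

# Made-up values, for illustration only
k_grid <- c(5, 10, 20, 30, 40, 50)
toy_perplexity <- c(3000, 2400, 2000, 1950, 1930, 1920)
find_elbow(k_grid, toy_perplexity)
```

A rule like this is only a rough aid; the plots remain the main tool for judging where the curves flatten.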

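The code in reference [1] does in fact run; the key is getting the source data into the right format. Below, it is tested on the AssociatedPress dataset that ships with topicmodels; the code and results follow.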
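On the data-format point: topicmodels expects a DocumentTermMatrix with plain term-frequency weighting in which every document contains at least one term. The sketch below shows one way to prepare your own texts, assuming the tm package is installed (slam comes with it as a dependency). The post itself does not use tm, and the names texts and dtm_own and the three sample sentences are hypothetical.

```r
library(tm)   # assumption: tm is installed; it is not used in the original post

texts <- c("Topic models summarise large text collections.",
           "Perplexity on held-out documents measures model fit.",
           "")   # an empty document, to show why filtering is needed

corpus <- VCorpus(VectorSource(texts))
dtm_own <- DocumentTermMatrix(corpus,
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             stopwords = TRUE))

# Drop documents that are left with no terms; a topic model cannot be fitted to them
dtm_own <- dtm_own[slam::row_sums(dtm_own) > 0, ]
dim(dtm_own)
```

A matrix prepared this way can take the place of AssociatedPress as dtm in the code of this post.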

Code:

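library(topicmodels)               # provides LDA(), CTM(), perplexity() and the AssociatedPress data
data("AssociatedPress")
dtm <- AssociatedPress             # a DocumentTermMatrix
fold_num <- 10                     # number of cross-validation folds
kv_num <- c(5, 10 * c(1:5, 10))    # candidate topic numbers: 5, 10, 20, 30, 40, 50, 100
seed_num <- 2003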

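# Randomly assign the n documents to `cross` folds; returns a list of index vectors.
smp <- function(cross = fold_num, n, seed) {
  set.seed(seed)
  # repeat the fold labels 1..cross until there are n of them, then shuffle
  fold_id <- sample(rep(1:cross, ceiling(n / cross))[1:n], n)
  lapply(1:cross, function(i) which(fold_id == i))
}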

# For each candidate k, fit a model on the training documents and score it on the held-out fold.
# Returns two matrices (rows = values of k, columns = replications).
selectK <- function(dtm, kv = kv_num, SEED = seed_num, cross = fold_num, sp) {
  per_ctm <- NULL
  log_ctm <- NULL
  for (k in kv) {
    per <- NULL
    loglik <- NULL
    for (i in 1:3) {  # only run 3 replications (the first 3 folds)
      cat("R is running for", "topic", k, "fold", i,
          as.character(as.POSIXlt(Sys.time(), "Asia/Shanghai")), "\n")
      te <- sp[[i]]                      # held-out documents
      tr <- setdiff(1:nrow(dtm), te)     # training documents

      # LDA alternatives, left commented out; the run below fits a CTM instead
      # VEM = LDA(dtm[tr, ], k = k, control = list(seed = SEED))
      # VEM_fixed = LDA(dtm[tr, ], k = k, control = list(estimate.alpha = FALSE, seed = SEED))
      # Gibbs = LDA(dtm[tr, ], k = k, method = "Gibbs",
      #             control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000))

      # named `fit` so that the object does not shadow the CTM() function
      fit <- CTM(dtm[tr, ], k = k,
                 control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))

      per <- c(per, perplexity(fit, newdata = dtm[te, ]))
      loglik <- c(loglik, logLik(fit, newdata = dtm[te, ]))
    }
    per_ctm <- rbind(per_ctm, per)
    log_ctm <- rbind(log_ctm, loglik)
  }
  return(list(perplex = per_ctm, loglik = log_ctm))
}

sp <- smp(n = nrow(dtm), seed = seed_num)
system.time((ctmK <- selectK(dtm = dtm, kv = kv_num, SEED = seed_num, cross = fold_num, sp = sp)))

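## plot the perplexity
m_per <- apply(ctmK[[1]], 1, mean)   # mean perplexity for each k
m_log <- apply(ctmK[[2]], 1, mean)   # mean log-likelihood for each k
k <- c(kv_num)
df <- ctmK[[1]]                      # perplexity matrix: rows = k, columns = folds
n_rep <- ncol(df)                    # number of folds actually run (3 in selectK)
matplot(k, df, type = "b", xlab = "Number of topics",
        ylab = "Perplexity", pch = 1:n_rep, col = 1, main = "")
legend("bottomright", legend = paste("fold", 1:n_rep), col = 1, pch = 1:n_rep)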
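The plot above shows perplexity only, fold by fold. Since the method calls for trend plots of both perplexity and log-likelihood, the following sketch draws the fold means of both against the number of topics. The function plot_selection is a hypothetical helper, not part of topicmodels or of reference [1]; it only assumes two matrices shaped like the output of selectK (one row per topic number, one column per fold).

```r
# Hypothetical helper: plot mean perplexity and mean log-likelihood against k.
# `perplex` and `loglik` are matrices with one row per value of k.
plot_selection <- function(k, perplex, loglik) {
  stopifnot(nrow(perplex) == length(k), nrow(loglik) == length(k))
  m_per <- unname(rowMeans(perplex))   # average over folds
  m_log <- unname(rowMeans(loglik))
  op <- par(mfrow = c(1, 2))           # two panels side by side
  on.exit(par(op))                     # restore the previous layout
  plot(k, m_per, type = "b", xlab = "Number of topics", ylab = "Mean perplexity")
  plot(k, m_log, type = "b", xlab = "Number of topics", ylab = "Mean log-likelihood")
  invisible(data.frame(k = k, perplexity = m_per, loglik = m_log))
}

# Usage with the objects defined in this post:
# plot_selection(kv_num, ctmK$perplex, ctmK$loglik)
```

The returned data frame holds the averaged values, which makes it easy to inspect the numbers behind the curves when looking for the inflection point.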

Results (the log covers 5, 10 and 20 topics):

R is running for topic 5 fold 1 2021-05-06 13:07:50
R is running for topic 5 fold 2 2021-05-06 13:08:28
R is running for topic 5 fold 3 2021-05-06 13:09:13
R is running for topic 10 fold 1 2021-05-06 13:09:51
R is running for topic 10 fold 2 2021-05-06 13:12:11
R is running for topic 10 fold 3 2021-05-06 13:14:17
R is running for topic 20 fold 1 2021-05-06 13:16:34
R is running for topic 20 fold 2 2021-05-06 13:23:22
R is running for topic 20 fold 3 2021-05-06 13:30:06
   user  system elapsed
1775.39    2.68 1781.62


To be continued...





————————————————


[1] Reference link: https://blog.csdn.net/sinat_26917383/article/details/51547298

