This post walks through an LDA analysis of text data in R. Comments and discussion are welcome!
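1. The first step in an LDA analysis is choosing the number of topics, and there is currently no single agreed-upon method for doing so. This post uses two criteria: perplexity, where a smaller value indicates a better model, and log-likelihood, where a larger value is better. To judge the number of topics in a corpus from these two measures, compute both over a range of candidate topic numbers and look at how they change. The topic number at the inflection point (the "elbow") of the perplexity and log-likelihood curves can be taken as the reference number of topics; past that point both curves flatten out. Spotting the elbow and the trend requires visualizing the data, so we plot perplexity and log-likelihood against the number of topics. [1]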
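The elbow rule described above can also be checked numerically. The following is only a minimal sketch: the perplexity values are made-up placeholders (not results from this post), and the 5% threshold and the names `rel_drop`, `threshold` and `elbow_k` are assumptions chosen for illustration.

```r
# Illustrative numbers only (NOT results from this post).
k_values   <- c(5, 10, 20, 30, 40, 50)
perplexity <- c(3200, 2700, 2400, 2330, 2300, 2290)

# Relative drop in perplexity when moving from one candidate k to the next.
rel_drop <- -diff(perplexity) / head(perplexity, -1)

# Hypothetical rule: the elbow is the first k after which the
# relative drop falls below 5%.
threshold <- 0.05
elbow_idx <- which(rel_drop < threshold)[1]  # NA if the curve never flattens
elbow_k   <- k_values[elbow_idx]
print(elbow_k)
```

A rule like this is only a rough aid; looking at the plotted curve is still the more reliable check, since the threshold is arbitrary.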
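The code in reference [1] can in fact be made to run; the key is getting the source data into the right format. Below it is tested on the AssociatedPress dataset that ships with topicmodels. Note that the version run here fits a correlated topic model with CTM(); the LDA() calls are left commented out. Code and results follow: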
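On the data-format point: the model functions in topicmodels expect a document-term matrix of raw term counts, with no empty documents. A minimal sketch of preparing one from raw text, assuming the tm and slam packages are installed; the three texts and the names `dtm_raw` and `dtm_own` are placeholders for illustration, not data from this post.

```r
library(tm)    # VCorpus(), VectorSource(), DocumentTermMatrix()
library(slam)  # row_sums() for sparse matrices

# Placeholder documents (hypothetical); the second one is empty on purpose.
texts <- c("stocks fell sharply today",
           "",
           "the team won the match")

corpus  <- VCorpus(VectorSource(texts))
dtm_raw <- DocumentTermMatrix(corpus)  # default weighting is raw term frequency

# Drop documents that contain no terms, since the topic model
# fitting functions reject rows that are entirely zero.
dtm_own <- dtm_raw[row_sums(dtm_raw) > 0, ]
dim(dtm_own)
```

A matrix built this way could then take the place of AssociatedPress in the listing that follows.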
Code:
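library(topicmodels)                # LDA(), CTM(), perplexity() and the AssociatedPress data
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress
fold_num <- 10                      # number of cross-validation folds
kv_num <- c(5, 10 * c(1:5, 10))     # candidate topic counts: 5, 10, 20, 30, 40, 50, 100
seed_num <- 2003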
# Randomly split the document indices 1..n into `cross` folds of near-equal size.
smp <- function(cross = fold_num, n, seed) {
  set.seed(seed)
  fold_id <- sample(rep(1:cross, ceiling(n / cross))[1:n], n)
  lapply(1:cross, function(i) which(fold_id == i))
}
# For each candidate k, fit a CTM on the training documents and score the held-out fold.
selectK <- function(dtm, kv = kv_num, SEED = seed_num, cross = fold_num, sp) {
  per_ctm <- NULL
  log_ctm <- NULL
  for (k in kv) {
    per <- NULL
    loglik <- NULL
    for (i in 1:3) {  # only run for 3 replications (the first 3 folds)
      cat("R is running for", "topic", k, "fold", i,
          as.character(as.POSIXlt(Sys.time(), "Asia/Shanghai")), "\n")
      te <- sp[[i]]                    # held-out documents
      tr <- setdiff(1:nrow(dtm), te)   # training documents
      # Alternative models, left disabled:
      # fit <- LDA(dtm[tr, ], k = k, control = list(seed = SEED))
      # fit <- LDA(dtm[tr, ], k = k, control = list(estimate.alpha = FALSE, seed = SEED))
      # fit <- LDA(dtm[tr, ], k = k, method = "Gibbs",
      #            control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000))
      # The fitted model is stored in `fit` so that it does not shadow the CTM() function.
      fit <- CTM(dtm[tr, ], k = k,
                 control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
      per <- c(per, perplexity(fit, newdata = dtm[te, ]))
      loglik <- c(loglik, logLik(fit, newdata = dtm[te, ]))
    }
    per_ctm <- rbind(per_ctm, per)
    log_ctm <- rbind(log_ctm, loglik)
  }
  # Rows correspond to topic counts, columns to folds.
  list(perplex = per_ctm, loglik = log_ctm)
}
sp <- smp(n = nrow(dtm), seed = seed_num)
system.time(ctmK <- selectK(dtm = dtm, kv = kv_num, SEED = seed_num, cross = fold_num, sp = sp))
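## plot the perplexity
m_per <- apply(ctmK[[1]], 1, mean)   # mean perplexity per topic count
m_log <- apply(ctmK[[2]], 1, mean)   # mean log-likelihood per topic count
k <- kv_num
df <- ctmK[[1]]                      # perplexity matrix: one column per fold
n_fold <- ncol(df)                   # number of folds actually run (3 in selectK)
matplot(k, df, type = "b", xlab = "Number of topics",
        ylab = "Perplexity", pch = 1:n_fold, col = 1, main = "")
legend("bottomright", legend = paste("fold", 1:n_fold), col = 1, pch = 1:n_fold)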
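The listing above plots only the per-fold perplexity, while the method calls for trend plots of both perplexity and log-likelihood. A minimal sketch of plotting the fold averages side by side follows. It assumes a result list shaped like the return value of selectK() (elements `perplex` and `loglik`, one row per topic count); the helper names `summarise_k` and `plot_k_trends` and the `demo` values are hypothetical placeholders, not results from this post.

```r
# Average each measure over the folds: one row per candidate topic count.
summarise_k <- function(k, res) {
  data.frame(k          = k,
             perplexity = unname(rowMeans(res$perplex)),
             loglik     = unname(rowMeans(res$loglik)))
}

# Draw the two trend curves next to each other.
plot_k_trends <- function(s) {
  old <- par(mfrow = c(1, 2))
  on.exit(par(old))
  plot(s$k, s$perplexity, type = "b",
       xlab = "Number of topics", ylab = "Mean perplexity")
  plot(s$k, s$loglik, type = "b",
       xlab = "Number of topics", ylab = "Mean log-likelihood")
}

# Placeholder input (made-up numbers): 3 topic counts x 2 folds.
demo <- list(
  perplex = matrix(c(3200, 2700, 2400, 3150, 2650, 2450), nrow = 3),
  loglik  = matrix(c(-9.0e5, -8.6e5, -8.4e5, -9.1e5, -8.7e5, -8.3e5), nrow = 3)
)
s <- summarise_k(c(5, 10, 20), demo)
plot_k_trends(s)
```

With real output, `summarise_k(kv_num, ctmK)` would be the corresponding call, provided every topic count in `kv_num` was actually run.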
Results (the captured log covers topic counts 5, 10 and 20):
R is running for topic 5 fold 1 2021-05-06 13:07:50
R is running for topic 5 fold 2 2021-05-06 13:08:28
R is running for topic 5 fold 3 2021-05-06 13:09:13
R is running for topic 10 fold 1 2021-05-06 13:09:51
R is running for topic 10 fold 2 2021-05-06 13:12:11
R is running for topic 10 fold 3 2021-05-06 13:14:17
R is running for topic 20 fold 1 2021-05-06 13:16:34
R is running for topic 20 fold 2 2021-05-06 13:23:22
R is running for topic 20 fold 3 2021-05-06 13:30:06
   user  system elapsed
1775.39    2.68 1781.62
To be continued...
----------------
[1] Reference: https://blog.csdn.net/sinat_26917383/article/details/51547298