How to Use HMMER

jiafeng11 2019-07-15

Reposted from https://www.360doc.com/content/17/0823/08/33204118_681408029.shtml

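HMMER includes the following main programs:

phmmer: similar to blastp; searches a protein sequence database with a single protein sequence;

jackhmmer: similar to PSI-BLAST; iteratively searches a protein sequence database with a protein sequence;

Commonly used:

hmmbuild: builds a profile HMM from a multiple sequence alignment;

hmmsearch: searches a sequence database with a profile HMM;

hmmpress: formats (indexes) an HMM database so that hmmscan can search it;

hmmscan: searches an HMM database with a sequence;

hmmalign: uses a profile HMM as a guide to build a multiple sequence alignment;

hmmconvert: converts between HMM file formats;

hmmemit: emits sequences from a profile HMM (for example a consensus sequence);

hmmfetch: retrieves a profile HMM from an HMM database by name or accession;

hmmstat: displays summary statistics for an HMM database.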

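II. Two commonly used functions

(1) Searching a whole-genome protein (nucleotide) sequence database with an HMM

1. hmmbuild trains on a given multiple sequence alignment and builds a profile HMM. For example, in a gene family analysis, align all known members of the gene family, build an HMM from that alignment with the command below, and then use hmmsearch to scan the database of all gene sequences of the species you are studying with that HMM. The hits are the candidate members of the gene family in that species.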

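Example command: hmmbuild [-options] <hmmfile_out> <msafile>

The input file is a multiple sequence alignment in a format such as CLUSTALW, SELEX, or GCG MSF. The output file is usually given the suffix .hmm and contains the resulting HMM database.

2. hmmsearch finds similar sequences: hmmsearch [options] <hmmfile> <seqdb>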
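The two steps above (build a profile, then search with it) can be sketched as a small Python wrapper. This is only an illustration under stated assumptions: the file names `family.sto`, `family.hmm`, `family.tbl` and `proteome.fa` are hypothetical, the E-value cutoff and thread count are arbitrary choices, and by default the commands are printed rather than executed.

```python
import shlex
import shutil
import subprocess


def hmmbuild_cmd(hmm_out, msa_file):
    """Argument list for: hmmbuild <hmmfile_out> <msafile>."""
    return ["hmmbuild", hmm_out, msa_file]


def hmmsearch_cmd(hmm_file, seq_db, tblout, evalue=1e-5, cpu=2):
    """Argument list for: hmmsearch [options] <hmmfile> <seqdb>."""
    return [
        "hmmsearch",
        "-E", str(evalue),    # per-sequence E-value reporting cutoff (assumed value)
        "--cpu", str(cpu),    # number of worker threads (assumed value)
        "--tblout", tblout,   # per-sequence tabular output file
        hmm_file,
        seq_db,
    ]


def run_pipeline(msa_file, seq_db, hmm_out="family.hmm",
                 tblout="family.tbl", dry_run=True):
    """Build a profile from an alignment, then search a sequence set with it.

    With dry_run=True, or when the program is not found on PATH, each
    command is only printed so that it can be inspected first.
    """
    commands = [
        hmmbuild_cmd(hmm_out, msa_file),
        hmmsearch_cmd(hmm_out, seq_db, tblout),
    ]
    for cmd in commands:
        if dry_run or shutil.which(cmd[0]) is None:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline("family.sto", "proteome.fa")` prints the two command lines in order; passing `dry_run=False` runs them if HMMER is installed.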

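(2) Searching a pre-built HMM database with protein (nucleotide) sequences

This is a commonly used approach to functional annotation. First build the HMM database: starting from a multiple sequence alignment file, the hmmbuild command described above is all that is needed. Ready-made HMMs can also be downloaded from sites such as Pfam and SMART. For example, if I have a batch of protein sequences and want to annotate them against Pfam to see which domains they contain, I can download the following file from Pfam: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/Pfam-A.hmm.gz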

Search the HMM database with hmmscan using the following command:

hmmscan -E 0.00001 --domE 0.00001 --cpu 2 --noali --acc --notextw --domtblout pfam.tab Pfam-A.hmm test.pep.fa
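Note that hmmscan expects a database that has been decompressed and then indexed with hmmpress (see the program list at the top). A minimal Python sketch of that preparation step followed by the search is shown below. It assumes that hmmpress writes four index files next to the database with the suffixes `.h3f`, `.h3i`, `.h3m` and `.h3p`, and it only assembles the command lists; the helper names are hypothetical.

```python
from pathlib import Path

# Index files that hmmpress is assumed to write next to the HMM database.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def needs_hmmpress(hmm_db):
    """True if any of the hmmpress index files is missing."""
    return any(not Path(f"{hmm_db}{suffix}").exists()
               for suffix in PRESSED_SUFFIXES)


def hmmscan_cmds(hmm_db, seq_file, domtblout, evalue=1e-5, cpu=2):
    """Commands to run, in order: hmmpress (only if needed), then hmmscan."""
    cmds = []
    if needs_hmmpress(hmm_db):
        cmds.append(["hmmpress", hmm_db])
    cmds.append([
        "hmmscan",
        "-E", str(evalue),          # per-sequence E-value cutoff
        "--domE", str(evalue),      # per-domain E-value cutoff
        "--cpu", str(cpu),
        "--noali",                  # omit alignments from the main output
        "--acc",                    # prefer accessions over names
        "--notextw",                # do not limit output line width
        "--domtblout", domtblout,   # per-domain tabular output file
        hmm_db,
        seq_file,
    ])
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. For the example in this article the call would be `hmmscan_cmds("Pfam-A.hmm", "test.pep.fa", "pfam.tab")`.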

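Two tabular output formats are available:

--domtblout

--tblout

The output falls into two categories: one is per sequence (full sequence) and the other is per domain (needed mainly because a single sequence can contain several domains). The columns that appear in these two formats are explained below (some entries are left in the original English, which may be easier to follow).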

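(1) target name: name of the target sequence or profile

(2) accession: accession of the target sequence

(3) query name: name of the query sequence or profile

(4) accession: accession of the query sequence

(5) hmmfrom: start position of the alignment on the HMM.

(6) hmm to: end position of the alignment on the HMM.

(7) alifrom: start position of the alignment on the target sequence.

(8) ali to: end position of the alignment on the target sequence.

(9) envfrom: start position of the domain envelope.

(10) env to: end position of the domain envelope.

(11) sq len: length of the target sequence.

(12) strand: strand on which the hit was found (+ or -).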

(13) Evalue: The expectation value (statistical significance) of the target, as above.

(14) score (full sequence): The score (in bits) for this hit. It includes the biased-composition correction.

(15) Bias (full sequence): The biased-composition correction, as above

(16) description of target: The remainder of the line is the target’s description line, as free text.

(17) c-Evalue: The “conditional E-value”, a permissive measure of how reliable this particular domain may be. The conditional E-value is calculated on a smaller search space than the independent E-value. The conditional E-value uses the number of targets that pass the reporting thresholds. The null hypothesis test posed by the conditional E-value is as follows. Suppose that we believe that there is already sufficient evidence (from other domains) to identify the set of reported sequences as homologs of our query; now, how many additional domains would we expect to find with at least this particular domain’s bit score, if the rest of those reported sequences were random nonhomologous sequence (i.e. outside the other domain(s) that were sufficient to identify them as homologs in the first place)?

(18) i-Evalue: The “independent E-value”, the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a stringent measure of how reliable this particular domain may be. The independent E-value uses the total number of targets in the target database.
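To work with the per-domain columns programmatically, the --domtblout file can be read line by line. The sketch below assumes the usual hmmscan/hmmsearch --domtblout layout of 22 whitespace-separated fields followed by a free-text description (target name, accession, tlen, query name, accession, qlen, full-sequence E-value, score, bias, #, of, c-Evalue, i-Evalue, domain score, bias, hmm from, hmm to, ali from, ali to, env from, env to, acc, description). The class and function names are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    target: str
    target_acc: str
    query: str
    full_evalue: float
    full_score: float
    c_evalue: float
    i_evalue: float
    dom_score: float
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    env_from: int
    env_to: int
    description: str


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data line, skipping '#' comments and blanks."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split off the first 22 fields; whatever is left is the description.
        f = line.split(None, 22)
        if len(f) < 22:
            raise ValueError(f"expected at least 22 fields, got {len(f)}")
        yield DomainHit(
            target=f[0], target_acc=f[1], query=f[3],
            full_evalue=float(f[6]), full_score=float(f[7]),
            c_evalue=float(f[11]), i_evalue=float(f[12]),
            dom_score=float(f[13]),
            hmm_from=int(f[15]), hmm_to=int(f[16]),
            ali_from=int(f[17]), ali_to=int(f[18]),
            env_from=int(f[19]), env_to=int(f[20]),
            description=f[22].strip() if len(f) > 22 else "",
        )
```

A typical use would be `with open("pfam.tab") as fh: hits = [h for h in parse_domtblout(fh) if h.i_evalue <= 1e-5]`, which keeps only domains whose independent E-value passes a chosen cutoff.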

Envelope definition: The envelope defines a subsequence for which there is substantial probability mass supporting a homologous domain, whether or not a single discrete alignment can be identified. The envelope may extend beyond the endpoints of the MEA (maximum expected accuracy) alignment, and in fact often does, for weakly scoring domains.

Envelope identification: Now, within each region, we will attempt to identify envelopes. An envelope is a subsequence of the target sequence that appears to contain alignment probability mass for a likely domain (one local alignment to the profile).

When the region contains ≃1 (approximately one) expected domain, envelope identification is already done: the region’s start and end points are converted directly to the envelope coordinates of a putative domain.

There are a few cases where the region appears to contain more than one expected domain, where more than one domain is closely spaced on the target sequence and/or the domain scores are weak and the probability masses are ill-resolved from each other. These “multidomain regions”, when they occur, are passed off to an even more ad hoc resolution algorithm called stochastic traceback clustering. In stochastic traceback clustering, we sample many alignments from the posterior alignment ensemble, cluster those alignments according to their overlap in start/end coordinates, and pick clusters that sum up to sufficiently high probability. Consensus start and end points are chosen for each cluster of sampled alignments. These start/end points define envelopes.

These envelopes identified by stochastic traceback clustering are not guaranteed to be nonoverlapping. It’s possible that there are alternative “solutions” for parsing the sequence into domains, when the correct parsing is ambiguous. HMMER will report all high-likelihood solutions, not just a single nonoverlapping parse. It’s also possible (though rare) for stochastic clustering to identify no envelopes in the region.

In a tabular output (--tblout) file, the number of regions that had to be subjected to stochastic traceback clustering is given in the column labeled clu. This ought to be a small number (often it’s zero). The number of envelopes identified by stochastic traceback clustering that overlap with other envelopes is in the column labeled ov. If this number is non-zero, you need to be careful when you interpret the details of alignments in the output, because HMMER is going to be showing overlapping alternative solutions.

The total number of domain envelopes identified (either by the simple method or by stochastic traceback clustering) is in the column labeled env. It ought to be almost the same as the expectation and the number of regions.
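When the ov column is non-zero, it can help to see which reported envelopes actually overlap. The following sketch is a simple post-processing check on the reported env from / env to coordinates of one target sequence; it is not HMMER's stochastic traceback clustering. It assumes coordinates are 1-based and inclusive, and the function name is hypothetical.

```python
from itertools import combinations


def overlapping_envelopes(envelopes):
    """Return index pairs (i, j), i < j, of envelopes sharing a residue.

    envelopes: list of (env_from, env_to) tuples for ONE target sequence,
    1-based inclusive coordinates.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(envelopes), 2):
        # Two closed intervals overlap when the larger start does not
        # exceed the smaller end.
        if max(a[0], b[0]) <= min(a[1], b[1]):
            pairs.append((i, j))
    return pairs
```

For instance, with the made-up envelopes `[(10, 120), (100, 200), (250, 300)]` only the first two share residues, so those are the alternative parses that would need manual inspection.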
