Profiles:
- Profile1:
Model with junk domains and no deletions
Pretty file
Model with deletions and without junk domains
pretty file
This model is obtained by doing the following:
- 9 training examples, corresponding to 11 segments, were picked.
- A profile hidden Markov model (one with deletion states) is
constructed. Each match state corresponds to a strictly
hydrophobic column. The remaining columns are treated as insertions.
A Laplace prior is used to alleviate overfitting. This
model is final.profile1. For the "match" states, the prior
is applied only to the hydrophobic symbols. For the insertion
states, the prior is applied to every symbol.
- This profile hidden Markov model is then converted to one that
has no deletion states by replacing deletion transitions with
"skipping transitions". As a result, the transition matrix takes a
triangular shape (lower or upper, depending on the structure of
the matrix itself). To maintain numerical stability, the transition
matrix and the prior vector are expressed in the log domain. The
emission matrix is expressed in the linear domain. Two junk-modeling
states are then constructed and appended to the ends of this model,
and the necessary changes were made. The junk domains are
self-looping states that use the emission probabilities from
null.model. This model is profile1.model.
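The conversion from deletion states to skipping transitions can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build profile1.model: insertion states are left out for brevity, the function name `collapse_deletions` and the four probability arrays are hypothetical, and the virtual end state is treated as the final column.

```python
import numpy as np

def collapse_deletions(m2m, m2d, d2m, d2d):
    """Fold delete-state paths into direct skip transitions.

    Assumed layout (hypothetical): for match state i,
      m2m[i] = P(M_i -> M_{i+1}),  m2d[i] = P(M_i -> D_{i+1}),
      d2m[k] = P(D_k -> M_{k+1}),  d2d[k] = P(D_k -> D_{k+1}).
    The state after the last match state is the virtual end state.
    Returns the log-domain transition matrix over match states + end.
    """
    n = len(m2m)
    T = np.zeros((n + 1, n + 1))
    for i in range(n):
        T[i, i + 1] = m2m[i]            # ordinary step to the next match state
        path = m2d[i]                   # probability of sitting in D_{i+1}
        for j in range(i + 2, n + 1):
            T[i, j] = path * d2m[j - 1]  # leave the delete chain at M_j
            path *= d2d[j - 1]           # or stay in it one step longer
    T[n, n] = 1.0                        # virtual end state is absorbing
    with np.errstate(divide="ignore"):   # log(0) = -inf marks impossible moves
        return np.log(T)

# Toy three-position model; the numbers are made up for illustration.
log_T = collapse_deletions(
    m2m=[0.9, 0.8, 1.0], m2d=[0.1, 0.2, 0.0],
    d2m=[0.7, 0.6, 1.0], d2d=[0.3, 0.4, 0.0],
)
```

Every entry below the diagonal stays at log(0), which is the triangular shape described above.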
- Profile2:
Model with junk domains and no deletions
Pretty file
Model with deletions and without junk domains
pretty file
- initial.model:
raw file
pretty file
This is the initial file used to train the hidden Markov models with 11
fragments from our family. This initial model was estimated using a very
rough technique. First, we obtained an output (a multiple alignment)
of our family from SuperFamily. Then we marked a few columns using the
sheet Prof. Kister provided. Next, a few segments with hydrophobic amino
acid residues ONLY in these marked columns were isolated. The transition
matrix of the initial hidden Markov model was obtained by eyeballing
the alignment of THESE SEGMENTS ONLY while assuming an exponential
distribution. The emission matrix was estimated as follows: for the
"gap" positions, the distribution is the background distribution, and
for the "critical" positions, the distribution is uniform over the
HYDROPHOBIC AMINO ACIDS ONLY (nine: A, C, F, I, L, M, V, W, Y).
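A minimal sketch of how such an initial emission matrix could be assembled is given here. The function name `initial_emissions`, the residue ordering, and the uniform placeholder background are assumptions made for illustration; the real background values live in null.model.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order, 20 residues
HYDROPHOBIC = "ACFILMVWY"             # the nine residues listed above

def initial_emissions(is_critical, background):
    """One row per state: uniform over hydrophobic residues for critical
    positions, the background distribution for gap positions."""
    background = np.asarray(background, dtype=float)
    uniform = np.array(
        [1.0 / len(HYDROPHOBIC) if aa in HYDROPHOBIC else 0.0
         for aa in AMINO_ACIDS]
    )
    return np.array([uniform if c else background for c in is_critical])

# Placeholder background (uniform); substitute the values from null.model.
background = np.full(len(AMINO_ACIDS), 1.0 / len(AMINO_ACIDS))
E = initial_emissions([False, True, False], background)
```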
- initial-Skip.model:
raw file
pretty file
This model is a slight modification of initial.model, in which a
very small probability is added to most of the gap states so that
critical states can be escaped.
- final.model:
raw file
pretty file
The outcome of the EM algorithm, where the training set is the 9
proteins in SWISS-PROT. Here are the 9 proteins: O18835 O77695 O97524
P00722 P00723 P06760 P08236 P12265 P23989. Some of them have multiple
domains belonging to our family. The model identified 9 critical
positions, which turned out to be TOO SPECIFIC, i.e. a problem of
overfitting. The initial model was hand-crafted. The model has 32 states.
This model by itself identifies the domain. Nevertheless, two states need
to be added -- one at the very beginning and one at the very end (just
before the virtual end state). These two newly added states should
emit symbols with the background probability. This model will later be
identified as the identified.model but will do a little more.
(The initial model is initial.model).
- final-Skip.model:
raw file
pretty file
The outcome of the EM algorithm using initial-Skip.model as the
initial model. 32 states. No pseudocount is added during or after the
training process. It is a little surprising that this model and
final.model are almost identical, except for a 10^-6 difference in
the transition vector of the seventh state, even though we
started from a model in which skipping is allowed. However, this can
be explained by the fact that no training fragment skips a
critical state, and by the low, discouraging probability of the
skipping transition.
- final-SkipPseudo.model:
raw file
pretty file
This model is obtained from initial-Skip.model with the EM algorithm.
However, during each training iteration, two things are done:
- Background is added to each gap state.
- Normalized hydrophobic background is added to each critical state.
After these additions, normalization is performed. Then 2 states
are added: one at the very beginning and one just before the virtual
end state of the model. Therefore, this model has 34 states.
A transition from the last state looping back to the second state was
also added.
Note again that the 'structure', i.e. the transition matrix,
of this model, of final.model, and of final-Skip.model
is almost identical, apart from differences in the numbers. The
impossible transitions remain impossible.
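The per-iteration pseudocount step described for this model might look roughly like the sketch below. The function name `add_pseudocounts` and the `weight` parameter are hypothetical; the text only says that the background is "added", so the relative weight of the pseudocount is an assumption.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order
HYDROPHOBIC = "ACFILMVWY"

def add_pseudocounts(emissions, is_critical, background, weight=1.0):
    """Add background to gap rows and renormalized hydrophobic background
    to critical rows, then renormalize every row to sum to 1."""
    emissions = np.array(emissions, dtype=float)   # copy, do not modify input
    background = np.asarray(background, dtype=float)
    mask = np.array([aa in HYDROPHOBIC for aa in AMINO_ACIDS])
    hydro = np.where(mask, background, 0.0)
    hydro = hydro / hydro.sum()                    # "normalized hydrophobic background"
    for s, critical in enumerate(is_critical):
        emissions[s] += weight * (hydro if critical else background)
    return emissions / emissions.sum(axis=1, keepdims=True)

# Toy example: one gap state that only ever saw 'D', one critical state
# that only ever saw 'L'. The uniform background is a placeholder.
background = np.full(20, 0.05)
raw = np.zeros((2, 20))
raw[0, AMINO_ACIDS.index("D")] = 1.0
raw[1, AMINO_ACIDS.index("L")] = 1.0
smoothed = add_pseudocounts(raw, [False, True], background)
```

Note that a critical row still assigns zero probability to non-hydrophobic residues after this step, which matches the remark that impossible events remain impossible.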
- final-SkipPseudoSkip.model:
raw file
pretty file
Obtained by merely adding a skipping probability (0.01) to
final-SkipPseudo.model.
- null.model:
raw file
pretty file
It is just a single-state model (without the virtual end state) that
emits symbols according to the background probability.
- critical model:
raw file
pretty file
This is a modification of final.model, also with 32 states.
The emission probabilities of the first state and the last state have
been changed to the background probability, which is intrinsically wrong.
The correct way of doing this is outlined in modified.model.
- modified.model:
raw file
pretty file
This model is based on final.model. Two states were added, as
specified in the section explaining final.model. The first state
is added at the very beginning and emits with the background
probability. The second state is added at the very end, just before the
virtual end state, and also emits with the background probability. These
two states model the other domains that we are not interested in.
The last state (the one that was newly added) has a slight
probability of transitioning back to the second state,
which encourages the model to identify more than one domain of the
family we are interested in.
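One way to picture the two added states is as an enlargement of the transition matrix, sketched below. This is not the procedure actually used to produce modified.model: the function name `add_flanking_states` and the values of `p_stay` and `p_back` are placeholders, and the sketch assumes the last row and column of the original matrix belong to the virtual end state.

```python
import numpy as np

def add_flanking_states(core, p_stay=0.9, p_back=0.01):
    """Wrap a domain model with a leading and a trailing background state.

    core: (n x n) transition matrix whose last state is the virtual end.
    Returns an (n+2) x (n+2) matrix laid out as
    [leading junk, domain states..., trailing junk, virtual end].
    """
    core = np.asarray(core, dtype=float)
    d = core.shape[0] - 1                 # number of domain states
    lead, trail, end = 0, d + 1, d + 2
    out = np.zeros((d + 3, d + 3))
    out[lead, lead] = p_stay              # keep absorbing uninteresting residues
    out[lead, 1] = 1.0 - p_stay           # enter the first domain state
    out[1:d + 1, 1:d + 1] = core[:d, :d]  # domain-internal transitions, shifted by 1
    out[1:d + 1, trail] = core[:d, d]     # old "go to end" now enters trailing junk
    out[trail, trail] = p_stay
    out[trail, 1] = p_back                # loop back to the second state overall
    out[trail, end] = 1.0 - p_stay - p_back
    out[end, end] = 1.0                   # virtual end state is absorbing
    return out

# Toy core model: two domain states plus the virtual end state.
core = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.6, 0.4],
                 [0.0, 0.0, 1.0]])
wrapped = add_flanking_states(core)
```

The loop-back entry is what lets a single sequence pass through the domain more than once.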
- mutation.model:
raw file
pretty file
This model is based on modified.model. The only difference is
that, among the 9 critical positions that were identified, a
slight probability is added to the emission matrix for the hydrophobic
amino acid residues. This is an attempt to model mutation. Note that
the program needs to normalize the emission matrix after reading the
model in.
- skipMut.model:
raw file
pretty file
Based on mutation.model. However, skipping transitions were added
so that, if desirable, critical positions can be skipped. Nevertheless,
a severe penalty is incurred when a critical position is skipped, which
discourages skipping. Note that BOTH the transition and the emission
probabilities need to be normalized after the model file is read.
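The normalization required after reading such a model file amounts to rescaling each row so that it sums to 1. A minimal sketch, assuming both matrices are stored with one row per state (the function name `normalize_rows` is hypothetical):

```python
import numpy as np

def normalize_rows(matrix):
    """Rescale every row to sum to 1; all-zero rows are left untouched."""
    matrix = np.asarray(matrix, dtype=float)
    totals = matrix.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0   # avoid dividing by zero
    return matrix / totals

# A row that summed to 1 before a small skip probability (0.01) was added.
transitions = normalize_rows([[0.9, 0.1, 0.01],
                              [0.0, 1.0, 0.0]])
```

The same helper would be applied to the emission matrix.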
- all-no-pseudo.model:
raw file
pretty file
Obtained by training on all available instances. No pseudocount is
added. Then 2 states are inserted after training. This model has
34 states. The initial model is initial.model. 43 instances are
used for training.
- all-plus-pseudo.model:
raw file
pretty file
Obtained by training on all available instances. Pseudocounts are added
during each iteration of EM (background to gap states, and normalized
hydrophobic background to critical states). Then 2 states are inserted
after training. This model has 34 states. The initial model is
initial.model. 43 instances are used for training.
- null2.model:
raw file
pretty file
Made simply by observing the frequencies of all the amino acids from all
43 fragments.
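Estimating a background distribution from observed residue frequencies can be sketched as below. The function name `background_frequencies` and the toy fragments are illustrative only; the actual 43 fragments are not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def background_frequencies(fragments, alphabet=AMINO_ACIDS):
    """Relative frequency of each residue across all fragments."""
    counts = Counter(ch for frag in fragments for ch in frag if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues from the alphabet were observed")
    return {aa: counts[aa] / total for aa in alphabet}

# Toy input, not real data.
freqs = background_frequencies(["ACA", "C"])
```

Residues that never occur get frequency 0 in this sketch, so a pseudocount may be needed if the result is used as an emission distribution.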
- all-plus-pseudoFB.model:
raw file
pretty file
Obtained using the same method as all-plus-pseudo.model, except
that the background probability is now the one in null2.model.
- null3.model:
raw file
pretty file
Obtained by observing the frequencies of all the amino acids from 11
fragments only.
- part-plus-pseudoFB.model:
raw file
pretty file
Obtained by training on 11 fragments only, with the background
probabilities corresponding to those in null3.model.
- flank.model:
raw file
pretty file
This model is a slight modification of
all-plus-pseudo.model, which has 34 states. States 0 and 32 are
the junk-modeling domains. State 33 is the virtual end state. The
modifications are as follows:
- First of all, a sequence has the option of directly entering the
domain or entering the junk-modelling domain first. The prior
vector is set to have 0.5 probability of entering the junk domain
and 0.5 probability of entering the first residue in the domain
that we are modeling.
- Originally, state 1 had 2 outgoing transitions -- a
high-probability self-looping transition (0.999999) and a
low-probability transition to the first residue of the
domain we are modeling. Now, let alpha be 0.001. The
self-looping probability is replaced by (1-alpha) and the
low-probability transition is replaced by (alpha/2). (N-3) more
outgoing transitions are added to state 1 -- each goes to one of
the states that model the domain, and each has probability
(alpha/(2*(N-3))), so the outgoing probabilities still sum to 1.
- Also, each state from state 1 to state 30 has an outgoing transition
to state 32, the junk-modeling domain.
The rationale behind changes 2 and 3 is that in the previous experiments
we were forcing all proteins to fit our model. In this representation,
if a protein does not fit the model well, it should
leave the domain as soon as possible.
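The second change can be checked numerically with a short sketch. The function name `leading_junk_row` is hypothetical, and which states receive the (N-3) extra transitions is an assumption: here N is taken to be the 34 states of the model, the junk state is index 0, and the extra transitions go to the 31 domain states.

```python
import numpy as np

def leading_junk_row(n_states, junk, first, extra_targets, alpha=0.001):
    """Outgoing transition row of the leading junk state.

    Self-loop gets (1 - alpha), the first domain state gets alpha / 2,
    and the remaining alpha / 2 is shared equally among extra_targets.
    """
    row = np.zeros(n_states)
    row[junk] = 1.0 - alpha
    row[first] += alpha / 2.0
    share = alpha / (2.0 * len(extra_targets))
    for target in extra_targets:
        row[target] += share
    return row

# Assumed indexing: 0 = leading junk, 1..31 = domain, 32 = trailing junk,
# 33 = virtual end, so N - 3 = 31 extra targets.
row = leading_junk_row(n_states=34, junk=0, first=1, extra_targets=range(1, 32))
```

Because the (N-3) shares add up to alpha/2, the row sums to 1 regardless of the value chosen for alpha.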