Jason Rennie's Research Log
Older
 9/15/05  Read the Blei/Lafferty paper on Correlated
Topic Models. Basic idea is that instead of drawing topics from a
Dirichlet, they will be drawn from (what I would call) a natural
parameter Gaussian with full(?) covariance matrix. Blei and Lafferty
call this distributoin a logistic normal (apparently same term as used
in a 1982 book by Aitchison). i.e. here's the process: draw
parameters from a Gaussian; use those parameters as natural parameters
for a multinomial. The Gaussian specifies the distribution of topics
for a document (separately drawn per topic). My (biased) guess is
that the benefit they get with this is the ability to more smoothly
transition between ranks. i.e. it might help to make the landscape
smoother and easier to find a point closer to the global optimum.
 9/15/05  Gave a quick read to the Lebanon/Lafferty
paper on the multinomial manifold. Didn't understand large chunks
of the paper. Maybe a paper I could present at group meeting? Basic
idea of the paper is that linear classification in the mean parameter
space of the multinomial isn't the right thing to do. They derive an
alternative way of constructing a decision boundary that has to do
with projecting points from the manifold onto the sphere. Seems like
this might be related to finding linear boundaries in the natural
parameter space... Should try to understand this better...
 9/15/05  Read the Gordon GGLLM
paper. Generalization of PCA allowing for nonlinear transition
between U,V and X. Discussion of how to choose link and loss
functions so that some aspect of the optimization problem can be
called convex. Equation 2 looks like the loglikelihood of an
exponential distribution with natural parameterization.
 9/15/05  Read the Collins/Schapire/Singer
paper. Premise seems kind of sillybetter algorithms for
optimizing adaboost and logistic regression. Why not just use
gradient descent? Okay. I know I sound like a broken record. I'll
shut up now. Discussion of Bregman distances, the "natural
optimization problem" associated with them.
 9/15/05  Read the Collins/Dasgupta/Schapire
paper. Describes a general natural parameter lowrank model for
exponential family distributions. Also shows that negative log
likelihood is always a Bregeman distance for exponential
distributions. Thus, optimization is optimization of sum of Bregman
distances. Some useful examples of how PCA changes when natural
parameterization is useddifferences are in terms of natural
parameter value, not that value exponentiated. Nice intro to Bregman
distances.
 9/13/05  Things I need to do:
 Create visualization of NPLT solutions: rank 1 with/without
norm constraint (fixed V, Fro norm on U, elipsoid in NP space); rank 2
with norm constraint
 Answer this question: what is the strength of the natural
parameter representation? Tommi says that a value of PLSA is the
topics that result; solution is not unique, but interpretation is.
NPLT gives convex optimization, but what else? Here's one answer: the
eigenvectors corresponding to the singular values can be interpreted
as topics. Is that true? How exactly?
 What's a good hierarchical model for NPLT? Is the one that
Tommi suggested reasonable? Where there is a rank constraint between
parents and children controling the "error" between them? Note that
his suggestion deals with the regularization term. In fact, it
explodes that term into many rank/trace norm terms; how to find all
the regularization constants? Alternate hierarchical model is the one
I discussed w/ John. Important question: how do you intepret X?
Especially the rows of X that correspond to higherlevel statistics?
 9/13/05  Good chat w/ Tommi today. He was trying to get me to
come up with some way that the natural parameter low trace norm (NPLT)
model is better than PLSA. In the end, I think he answered his own
question :) PLSA topics are unique and interpretable in the sense that
they are the only underlying representation for which convex
combinations result in the parameter matrix. Only permutations are
generally possible (without violating the nonnegative & sumtoone
constraints). You can visualize the subspace to which the data has
been projectedit's a hypertetrahedron. There isn't an exact
analog in NPLT. NPLT simply yields a lowerdimensional subspace.
However, we can come up with natural "topic" vectors: the eigenvectors
corresponding to the singular values of the matrix. Though these can
be rotated, they do yield a natural representation. For constructing
a hierarchical model, Tommi suggested projecting sentences from a
single paragraph onto a line (in natural space, curve in mean space);
paragraphs from a single post; posts from a single thread; threads
from the whole collection. Each level is grounded by the above level;
that is, the lowerlevel line must pass through the projected
higherlevel point. Procedure might be iterative. First find line
that is best fit for threads. Then, requiring that line pass through
projected thread point, find line that is best fit for posts from that
thread. And, so on. 'course, you might want a slightly larger space
for projection (2 rank, 3 rank, or a trace norm restriction). But,
then, what we're really doing is finding a low trace norm projection
at each level of the hierarchy. i.e. the interpretation here (finding
variability at each level) is very similar to the interpretation of
the nonhierarchical problem.
 9/9/05  Gave the Sajama & Orlitsky PCA
paper a(nother) read. They use natural parameters like I am
proposing to do. Some useful observations, such as how natural
parameters are defined (linear product with data in logprobability
space), and that two point close in natural param. space will be close
in mean parameter space (due to continuity of transform). I need to
read the Collins/Dasgupta/Schapire
paper on exponentialfamily PCA. Reasons why mean
parameterization is bad (for multinomial): parameters are tied, cannot
change individual parameters, overparameterization (represents n1
dimensional space). I didn't totally follow the semiparametric
aspect of the paper... But, one thing that was clear to me was that
they use a rankconstrained approach (limit dimensionality of
underlying representation).
 9/8/05  Nice chat w/ Charles Elkan this
morning. Turns out that the Dirichlet
Compound Multinomial (1) was originally introduced by Thomas
Minka, and (2) it is closely realted to "Bayesian Naive Bayes" in
my master's
thesis. In fact, the main difference between my "Bayesian Naive
Bayes" and the DCM is that for the DCM, the Dirichlet parameters are
learned, whereas in my model the parameters are set to the training
data counts (plus "prior" counts). This is important since as Charles
& his coauthors have found, the best parameters for the DCM tend to
(1) sum to the length of an (average) document, and (2) be
proportional to the document frequency (not the term frequency).
 8/30/05  Phew! Just spent 2 hours talking w/ Tommi about
aspect model stuff. I now (mostly) understand why the trace norm can
be written as the minimum Frobenius norm of factorizations. I also
understand that we can't really speak in terms of "topics" with this
model. It constrains the space of the underlying representation.
But, that space can be represented as the hull of many different sets
of vectors. The mixture models (PLSA, LDA) also have this problem.
One thing Tommi made me realize is that a mixture of multinomials is a
multinomial. The only reason that a mixture of Gaussians is not a
Gaussian is b/c the Gaussian does not fully parameterize all possible
distributions on the space of possibilities. There are only N
possibilities in a multinomial & there are N parameters. Two
multinomials can't give you any more expressiveness than you already
have with one! Need to do a write up on this. Need to do a writeup
proving the trace norm equivalencies. We also talked about
hierarchical representations and where this model might apply to my
thesis work.
 8/30/05  Hmm... so my guess is that if I were to work out all
the math, I would find that the proper (description length) penalty
for encoding the parameters is a constant multiple of the prior (with
maybe a constant factor added in). i.e. I think I'm 99% of the way
there just by multiplying in the prior and then using a constant to
adjust the influence of the prior.
 8/29/05  Okay, I think I have this worked out for the matrix
factorization model, but what about a simple multinomial model?
Assume each row of X is drawn from a uniform Dirichlet prior. But,
the uniform Dirichlet gives a value of (n1)! for any value of
theta!!!! That's not right! Okay, here's the issue: the Dirichlet
operates on the simplex, which is (much) smaller than the unit
hypercube. So, the normalizer is correct. What needs fixing is my
approach. Let's return to the "bitsback" argument... I can't
believe it, but I never properly derived Logistic Regression using the
bitsback argument. My development uses very fuzzy reasoning.
What I should be using for any distribution on a continuous space is
the KLDivergence.
 8/29/05  How do I specify the encoding length of a term
frequency model? Need to encode (1) the parameters, and (2) the data
given the parameters. So, for the matrix factorization model, the
likelihood of the data, P(YU,V), is multinomial and UV^T provides the
natural parameters. There are Gaussian priors on each row of U and V,
P(U\mu_U) and P(V\mu_V), with unit variance and a shared mean across
rows of U (\mu_U) and rows of V (\mu_V). There is a standard Normal
prior on each mean.
 8/26/05  Something I didn't realize about trace norm. It is
(effectively) the L1 norm of a matrix. The L2 norm can be calculated
by taking the trace of X^TX, which works for any matrix, not just PSD
ones. However, these are not the traditional sort of matrix norm
(which would, I think, only look at the maximum eigenvalue). Here's
some information about matrix
norms.
 8/25/05  Unlike LDA, our natural parameter model
simultaneously generates topic vectors and mixture vectors.
 8/24/05  Chatted with Tommi about a normconstrained version
of the aspect model. The necessary changes would be a bit larger than
I originally envisioned. First, we would need to change to the
natural parameter space so that we are not using mixtures of
distributions, but rather geometric averages of parameters. This
change allows us to attain the desired convexity benefit. But, the
interpretation as a generative process changes. We can no longer
follow a simple generate process to produce words in a document.
 8/24/05  Gave the Joachims
paper on optimizing nonlinear error measures a read. Sounds like
it's a nice extension of his work on structured output learning.
Instead of optimizing for individual labels, he performs a joint
optimization on all the labels so that instead of measuring simple
error (which can be separated into sum of individual label errors), he
measures an error which cannot be decomposed as a sum of errors on
individual examples. He ends up optimizing a bound on the given
criterion.
 8/23/05  Revisited the Taskar/et
al. "Relational Data" work. The only insight seems to be the idea
of using clique templates. I.e. instead of placing cliques on all
pairs or triples of variables, place cliques based on some structure
of the data, e.g. a clique for each hyperlink and pair of web pages
such that page1 points at page2. I don't think this is useful for my
coreference work...
 8/23/05  A talk by Zoubin Ghahramani gave me a better
appreciation for the usefulness of the models Zhu/Lafferty/Ghahramani
use for active learning. An interesting aspect of their model is that
it values "nearness" to other unlabeled examples as well as
uncertainty in the label.
 8/23/05  While looking at the Lowd/Domingos paper, I noticed a
paper on MultiRelational
Record Linkage by Parag/Domingos. Sounds like it's based on the
McCallum/Wellner work, but I lost track of the development of their
model on a quick read through the paper. Should come back and study
this...
 8/23/05  Slightly interesting ICML paper on
learning graphical models by Lowd and Domingos. Idea is to
introduce hidden variables, each of which has Naive Bayeslike effect
on the observed variables. Only links in graph are from hidden
variables to observed variables. Dimensionality reduction is
achieved by limiting number of hidden variables (and number of states
for the hidden variables). Speed comparison is only against full
networks. I think they should have compared vs. tree networks. My
guess is that tree networks would have achieved comparable
speed/accuracy performance. As with almost all work of this sort
where there is a rank parameter, a better formulation might have been
to allow the maximum # of hidden variables, but to penalize entropy of
the distributions.
 8/23/05  Anand recommended that I take a look at a paper by
Mikheev, Moens and Grover, which looks at NED with
gazetteers/dictionaries and discusses the MUC conferences. Someone at
SIGIR suggested that MUC looked at (what I call) informal textual
domains, but this does not seem to be the case. The data I got from
McCallum/Wellner is a collection of newspapertype articles and the
example in this paper reads as though it was taken from a
newspapertype source (at the very least, syntax isn't noisy).
 8/23/05  Nice paper by
Madsen, Kauchak & Elkan at ICML on a heavytailed term frequency
distribution. Instead of learning a multinomial, they learn a
Dirichlet, which generates a multinomial. Seems to be closely tied to
LDA. And, I think what happens is that the alpha's (params of
Dirichlet) are updated as the document is processed.
 7/29/05  I've fixed the bestpreviousmention formulation so
that training is only done using prior mentions up to the first
coreferent mention. Oh, except for this: the null mention should be
compared against for all cases, not just the first mention in
each cluster. Otherwise, the weight for the feature corresponding to
the null mention will be too large.
 7/28/05  Okay. Here's what I need to do. First, determine
the reference chains on the entire data set. Next, remove nonproper
nouns, connecting any "broken" links. Then, we have to produce
training data in matlab format. Now, for training, there are
effectively multiple right answers. Any previous mention qualifies as
correct. If we consider all correct mentions, we will find it hard
not to run into the singlelinkthreshold algorithm, which we don't
want to do. So, instead, we only train on mentions up to and
including the nearest previous coreferent mention. I recall this
being a somewhat common training method in earlier works on
coreference resolution...
 7/28/05  Realized a bug in my current formulation &
implementation. Currently, I first eliminate nonpropernouns, then I
determine references. Need to reverse that since currently a
reference chain can be cut off. Also, when dealing with proper nouns
only, it doesn't make any sense for there to be exactly one correct
answer. Any previous mention referring to the same thing should be
correct. i.e. I need to resolve the reference chains to determine
clusters of reference. For bestpreviousmatch, it's probably best to
train only using examples up to & including the nearest mention
referring to the same thing. When we do the thresholdbased
resolution, we'll train on all examples jointly...
 7/26/05  I need to rewrite the code that determines the "left
of head" word... It's obvious what to do when there is no MIN
attribute. I think when there is a MIN attribute, I should use the
2nd to last word in the MIN attribute.
 7/26/05  Okay, I have written up the maxmargin
bestpreviousmatch algorithm. Now, I have to implement it. One
bothersome question is: how do you define a proper noun? My first
thought was: it's a proper noun if the head is a NNP or NNPS. But,
the head (as defined by MIN) may span multiple words. In fact, some
heads are downright odd, such as "William E. Royster, 33, of Kansas
City, Mo., and the bombardiernavigator, Lt. Keith A. Douglas." How
do you categorize that?!?!? I think what we should do is consider it
a proper noun if (1) there is no MIN and the last word in the COREF is
NNP(S), or (2) there is a MIN and the first head word is NNP(S). When
there is a MIN, I'll need to search through the LEX texts to find the
head.
 7/21/05  Met with Tommi. He had some ideas on how to show
that we converge to global optimum. We can think of the joint
derivative over U and V as (1) a derivative over X, and (2) a
derivative over the space of invertible matrices. Note that the space
of invertible matrices gives us the space of all U,V s.t. UV'=X;
UAA^{1}V'=X if A is invertible and UV'=X. Another thing that peaked
Tommi's interest is calculating the gradient of X. This involves
taking the derivative of a trace norm. We can write this as minimum
of Frobenius norms of U,V and then use Lagrangian multipliers to
account for the UV'=X constraint. The gradient of X is the maximized
Lagrangian multiplier. Recall that gradient=0 at maximum.
 7/21/05  Due to the way entities are labeled in the data set
(as a chain of references), we cannot simply throw away nonNNP noun
phrases. To get the labels, we have to follow all the chains to the
root. Looks like there are a decent # of NNP's. First file has 75
out of 210. Issue to think about: nonproper noun phrases can anchor
a reference chain, e.g. "the deer." Question is, where do we draw the
line? Clearly, pronouns reference in a chainlike manner.
 7/20/05  I think my first goal for coreference resolution
work should be to replicate the McCallum/Wellner results. My hybrid
model uses theirs for proper nouns and sortof a global
bestpreviousmatch algorithm for nonproper nouns (maxantecedent)
once the McCallum/Wellner algorithm is run on the propers.
McCallum/Wellner use 5fold CV for evaluation. One thing they don't
do which might improve performance is to use a regularization penalty.
Though, reg. penalty may not be useful since there are so few
features. One issue is that during training, we would learn
parameters for the entire data set together, whereas for testing, I
think we'd be best off first doing inference on proper nouns, then
nonpropers.
 7/20/05  Finally finished the loglog writeup. Also did my
practice talk for ICML on Monday. Nati doesn't like the idea of
starting from ordinal regression, but I think it clarifies the
process, allows us to show off allthresholds and lets us show how
MMMF is a generalization of the SVM.
 6/28/05  ~/project/loglog/classificationSandbox.m
 6/28/05  Smoothing is an issue. When I train unigram model on
a class, then evaluate on the test data for that (same) class, I get
nearly all infinities (due to words not observed in the training data;
I get finite objectives for all training documents). Seems important
to use a type of smoothing that does not yield a bias for one model or
the other (though maybe I'm trying too hard?). Though, maybe the best
thing to do is to simply use crossvalidation to set minimum parameter
values (theta and axponent).
 6/28/05  Negative alpha is often indication that my initial
parameter vector is bad.
 6/28/05  We know that the loglog model provides a better fit
for topicoriented words. We want to show that this translates to
reduced classification error. Hmm... for classification, it's not
absolute fit that's important, it's relative fit. First thing I think
we should do is a basic classification comparison.
 6/28/05  Okay, shouldn't be hard (in priciple) to run
classification experiments w/ loglog. Need to figure out how to
restrict vocabulary size. Though, it might be easier to do
axponentperword model (w/ PCCG), which would make learning faster.
What to do with zero occurrence words? Assume each class has extra
document with one of each of vocab words?
 6/20/05  I think I should do the 'parameterization' section of
the paper before plotting graphs of the loglog distribution. What
would be nice would be to introduce biasperword frequency model,
show graphs comparing that to empirical and unigram, then introduce
the other models.
 6/15/05  Okay, what do I need to do? I need to plot a
histogram of all frequency counts vs. the maximumlikelihood unigram
distribution. In estimating the unigram, we're really estimating a binomial.
 6/14/05  The Unigram model conditions on document length.
i.e. it's not outrageous for our loglog models to condition on
document length. To get the joint (unconditional) probability, we can
introduce a Poisson prior on document length.
 6/10/05  I need a plan of how to finish this loglog stuff. I
need to look at six different models:
 Unigram
 Mixture of Unigrams
 LogLog Frequency axponentperword
 LogLog Frequency biasperword
 LogLog Rate axponentperword
 LogLog Rate biasperword
For the loglog models, I should look at how well they model each
other's data. In particular, I know the axponentperword models will
probably do poorly. It would be nice if I could show quasicovexity
of one or more of the loglog models (this might also be something to
think about for CG learning of U,V for MMMF). This might lead us to a
proof that as long as we don't start at a critical point, we can find
the global minimum. Definitions of quasi and pseudoconvexity are
found in 6.252 lecture notes dated 2/24 (pgs. 1517). I want to look
at model fit of these models on real data. I also want to look at
application performance using these models (e.g. text classification
accuracy). I do need to discuss the length issue. This will be an
explanation of why the axponentperword model does poorly. Need to
talk about the alternative models, but not go into them in too much
detail. There is something interesting there in terms of modeling
textthere being a limit on how often a word can occur. This is
something a sentencelevel model of language would naturally capture
(due to partofspeech restrictions).
 6/9/05  I think all this length stuff is hogwash. The model
needs to apply probability mass to all possible frequency values.
And, what's important is how the model discriminates. Futzing with
length can lead to higher training loglikelihood, but will have
minimal impact on our ability to discriminate or to determine similarity.
 6/9/05  Tommi & I talked at length about the loglog model.
Big question was how to deal with the length issue. Some ideas we
discussed: Set a maximum frequency rate value, gamma. For the model,
l=n gamma, where n is the empirical document length. Clearly, gamma
<= 1. We have one gamma per word; for most words, gamma will be
much less than 1. But, there's the problem that a test document may
have a frequency rate > gamma. One option would be to create a
distribution over values of gamma. But, then we'd need to integrate
the product of a beta and our loglog distribution (or something like
that). We'd also need a distribution over document lengths to make it
truly proper. Another option would be to make our distribution a
mixture between a uniform model (for values > n*gamma) and the loglog
model. Then, again, we could consider all sorts of mixturesa
multinomial and loglog, a mixture of experts (two loglogs, one
assigned to low frequency values, one assigned to high values). A
completely separate option is to tune gamma on the test datado
transduction. For classification, we'd learn the a,b parameters on
the training data, using gamma equal to the max frequency rate in the
training data. Then, to classify a document, we'd set gamma to the
max rate over training+test documents (no change of a,b). Note that
for information retrieval, we could construct two models, one
irrelevant model using only the document database, and one relevant
model using a mixture of the document database and the query.
 6/8/05  This suggests that "length" should be learned.
i.e. we'd be learning one parameter per word and one parameter per
document (it's "length"). Note that the model length need not match
up with the document's actual length. But, there's a problem here.
Length is a discrete quantity. 84.5 doesn't make any sense as the
length quantity in the model. Unless maybe we were to assign (only)
that last fraction to the probability of the "count equals
ceil(document length) event".
 6/8/05  Okay, finally figured out what was going on. I was
using the emprical document length rather than the "length" I used in
generating the data. Though, it is important to know that the
axponentperword model deals badly when length is taken from the
empirical document length.
 6/6/05  Okay, tried learning with only three values of a, tons
of observations. Now, "right" model gets lower NLL, but the
parameters that are learned are weird. True a's are 2, 3, 4. It
basically learns these values shifted by 2: 0, 1, 2. The b it
learns is basically correct (though slightly larger than it should
be). There's something funny going on here...
 6/6/05  Tried looking at generalization (what is negative
loglikelihood of test data generated from same dist as training
distribution). "Wrong" model won(!) 3 out of 3. "Right" model got
better firt to training data 2 out of 3 times strangely enough. Also
noticed that nearly half of the words yield zero occurrences. Me
thinks I need to draw more than 100 documents... This partially
explains the shifting and shrinking of the range of parameters learned
by the "right" model. I should choose parameters so that very few
words are all zero occurrences.
 6/6/05  Hmm... so I tried three different generations of
parameters. Each time, the "wrong" model gave the lower negative
loglikelihood. One thing I don't understand. The "right" model
consistently chooses parameters that aren't in the right range. b is
always too large (1e3, where true value is 6e4) and range of a's is
too small ([1,5,3.7] where true range is [1,5]). When I fix
b=6e4, fit is even worse and range of a's is still very wrong
[1.4,3.1]. The "wrong" model, OTOH, chooses b's in a range
([1e4,5e3]) that stradles the true b. When I generate data
according to the biasperword model, the "right" model gets the
lowest negative loglikelihood. I wonder if the axponentperword
model can at least generalized better? Also, what happens if I
generate a ton of data for a small number of words? If the objgrad
code is written correctly, the model should learn the underlying
parameter values. These are things I need to try...
 6/6/05  All the papers are done! Yeah! Need to finish up the
loglog work. I've implemented PRCG. Makes learning of
axponentperword model very fast, but it still achieves higher
negative loglikelihood than biasperword model on data generated
from the axponentperword model!!! Maybe I need to try multiple
different random iniitializations? Vary parameters?
 5/20/05  To do: (1) calculate significance and ZOE for 1M
MovieLens (basically, rerun experiments & save test and holdout
outputs), (2) revise Figure 2 in IJCAI paperadd vertical lines at
thresholds, make interval sizes vary.
 5/19/05  Appears that the problems with the axponentperword
model are numerical. I think preconditioned CG would fix the problem.
An issue is that one component of the gradient is much larger than the
rest. Another is that dobj varies a lot between iterations (by up to
8 orders of magnitude). The optimization appears to stop early, but
there's no way it could continue on (I tried letting it do so). The
change is so small that matlab recognizes it as zero! I think I need
to do some work on the camera ready papers... First task is to do
holdout runs for the ICML/MMMF paper.
 5/18/05  Think I've found the problem: perword a_i is just
harder to optimize. Had to do some reworking of my linesearch code to
get it to do a good job of optimizing. Still very slow. Seems that
the gradient magnitude for b is much, much larger than for all the
a_i. Preconditioning would help...
 5/18/05  Welp, I can now generate data from loglogRateA and
learn parameters for both RateA and RateB. RateB wins!?!?
 5/18/05  Got the matlab mixture code working. Implemented a
generator. Need to generate data from the axponent model, see if it
fits better than the bias model...
 5/16/05  Also, if we want to compare count vs. ratebased
models, we should normalize over only feasible counts (not penalize
countbased model for assigning weight to counts > document
length). No reason for doing infinite sum!!!
 5/16/05  Chat w/ Tommi. He had a good suggestion to make sure
my optimizaiton is doing what it's supposed to: generate data
according to each of the models. Clearly, the generating model should
provide the best fit! :)
 5/15/05  Implemented loglog model with reversed
parameterization. When I fix b, optimization is fairly fast and
converges welltermination is due to sufficient decrease in gradient
magnitude. However, negative loglikelihood is larger than original
(highly nonconvex) parameterization! i.e. perword b_i models the
text better than perword a_i!
 5/13/05  Showed that anything of the form log E[exp(a^Tx)] is
convex in a. The Hessian is a covariance matrix, which is PSD. Did a
writeup. Next: do writeup for loglog using a_i as perword
parameters.
 5/12/05  Talked w/ Nati. He'd like me to select
regularization parameter via a hold out set. Something as simple as
splitting the training data into two (use some small percentage as
holdout, e.g. 10%). Would do this completely separately from other experiments
 5/12/05  Okay, we can't solve for the a_i's analytically.
But, we can show that it is convex. Anything of the form E[exp(a^Tx)]
(where x is the variable) is convex in a. I should sit down and prove
this. Tommi and I talked about how to show that our distribution is
better. We want to show that we fit discriminative words
wellfitting of common words (such as articles) doesn't matter for
most tasks. One way to do this is to look at a histogram of words as
a function of information gain, where IG is H(y)H(yx), where y is
the classification label and x is the wordi.e. how much do we
reduce entropy by including x in the classification
model/distribution.
 5/10/05  Tommi suggested using perword a_i's. Big advantage
is that the problem should be convex and we should be able to solve
the a_i's analytically since they are exponential parameters.
 5/10/05  I've finally been able to achieve lower
loglikelihood than I was with the count model. I had to learn the
betas with a fixed alpha, then switch over to learning alpha and betas
jointly. It seems that when we're far from the solution, optimization
diverges. But, when we're close, we get convergence. However, the
convergence seems quite slow. Gradient looks like it's on a roller
coaster. Alpha has finally settled around 0.07 (a=2.93). I
expected a much more negative value of a!
 5/9/05  I think I've completely debugged the loglog code, but
I still can't get it to learn alpha & beta simultaneouslyit tries
to put alpha too large. I wonder if I should unreparameterize 'a'
and simply have obj/gradient return infinity for outofbounds values.
Maybe for b/beta too?
 5/8/05  Should try constraint that L2 norm of rate vector is
unity. Loglikelihoods for basic rate model are not better than raw
frequency LL's.
 5/8/05  I have running everything for EM. At some point,
would be good to look into Bookcrossing, but that's not a priority. I
have SIGIR final submission coming up May 26th. There are a number of
SIGIR workshops with deadlines May 15th. One to consider is ELECTRA.
In particular, the workshop focuses on going beyond the unigram/term
independence model. My idea is somewhat off topic b/c it tackles the
occurrence independence assumption, not the term independence
assumption (which seems what they are most interested in,
e.g. multiword units, phrases). However, the fact that the model can
help in finding multiword named entities might be of interest to the
audience. There is also a workshop (due date May 20), which is on
Mathematical/Formal Methods. The description is not very specific,
but it seems a new model for term frequency would fit.
 5/6/05  Matlab is down... to do when it comes back up: (1)
generate weak generalization plot for EM, (2) polish script and start
runs for strong generalization, (3) produce rank plot showing
performance as rank changes, (4) debug iteration plot codeI keep
getting outofmemory errors...
 5/5/05  Met w/ Tommi. He likes my idea for a loglog tf rate
model. He would like me to submit something for NIPS, but thinks the
loglog stuff would need to be part of something larger; coreference
would be good, but I told him I doubt I'll have something ready in
time, especially with all the stuff to wrapup for ICML and IJCAI
papers...
 5/4/05  Wrote up a loglogbased term frequency rate model.
Also wrote objective/derivative code, but there seem to be lingering
bugs. Time to take a break and work on MMMF stuff for a little while.
 5/1/05  I produced two scores that are based on the loglog
model: (1) LLSO, or the logodds of the loglog and unigram
likelihoods, and (2) LLPROD, the product of LLSO and IDF. I realize
now that LLPROD may not make much sense since LLSO may be negative
 5/1/05  Trained loglog model using chow data, wrote python
code to calculate logodds ratio of loglog and unigram models.
Appears to be a bit less effective than mixture score at rating
informative words on chow data. I think I need to use loglog
dist. to model term frequency rate (instead of raw frequency) so that
the model takes advantage of document length information as does the
unigram.
 4/28/05  Chat w/ Tommi: lots to do: (1) use heavytailed model
in place of mixture for NEE experiments; (2) implement CRF, plus
marginal objective CRF and semiMarkov CRF; (3) try joinly extracting
locations and restaurant names.
 4/28/05  Need to finish up MMMF work: update ICML paper w/
"big" results, do comparison vs csdp.
 4/26/05  What's next? (1) Write good CG code for the
alternating optimization, (2) interface w/ Python and/or write
equivalent python code.
 4/26/05  It works. Yeah! What makes the optimization tricky
is that the 'a' parameter needs to be close to the optimal value in
order for the optimization to go smoothly. a=5.11 is optimal for the
restaurant data.
 4/26/05  Okay, so, it turns out that if I fix a=3.0 and try
to learn b/beta for each word individually, for most words I get nice,
fast convergence. For some words (particularly highfrequency and
words w/ empirical distributions that are far from loglog), CG
doesn't converge and it gets to the point that it can't find a
decrease in the objective by following the negative gradient. Best I
can tell is that this is due to our approximate calculation of the
infinite sum. But, CG finds a very good solutionvery close to the
true solution, plenty close for practical purposes. There's still the
issue of the axponent. I see very weird behavior when I try to learn
a and b/beta simultaneously. Maybe the best thing to do is to first
learn 'a' using a single b/beta (using the full empirical frequency
distribution as data), then use that fixed 'a' while learning b/beta's
for individual words.
 4/22/05  Okayhere's the deal, I need to try moment
matching. Fix a, compute the derivative for b_i as the difference
between the empirical and expected means.
 4/21/05  Met w/ Tommi, had him check over the math, everything
looked good to him. Only quips: (1) I should combine expectations for
each word before summing, (2) I should add a small value to
x+e^{\beta_i} to ensure that it is always > 0.
 4/20/05  Another observation: the expected value we calculate
for the partial derivative wrt a does not intermingle a in the
expectation (a only has an effect on the probability values). This
isn't the case for beta_i. Changes to beta_i affect both the real
expectation and the empirical expectation. For the real expectation,
both the probabilities and values are affected; for the empirical
expectation, only the values are affected. But, I imagine this would
make it difficult to do gradient descent b/c values can change in such
nonlinear fashion!
 4/20/05  Ack! I need to reparameterize b as e^\beta (or
something like that) so that the optimization can't try b <= 0.
 4/20/05  loglog obj/grad code is behaving badly. It's not
converging quickly, it fails checkgrad2 at a semioptimized point, and
it once tripped the bad obj/grad code warning in conjgrad! But, I
don't see what the problem is. I think I need to step through the
calculations for a small number of words, make sure I have a very good
understanding of what is being calculated. The hard thing is that for
small # of words, the code converges fine. It's only when I use all
the words that I run into trouble! Argh! Oh, I'm also seeing
imaginary numbers. That can't be good! Where do I take the square
root of a negative number?!? Ah, if x+b is negative, we would get
imaginary numbers! Good things to think about for the trip home...
Are we sure the model would never want to choose b <=0 ?
 4/19/05  Got approximateExpectedValue code working (renamed to
infiniteSum). Checked against Riemann Zeta Function values for
n={2,3,4,5,6}. Correct to 10 digits.
 4/19/05  My writeup on learning the loglog model has the
correct derivatives. Just checked them on the white board. I added
some additional detail to the writeup. I need to take a stepbystep
approach to debugging the code: (1) make sure the approximate EV code
works by checking with known values of the Riemann Zeta Function, (2)
test derivative code for a single word: (a) make sure signs are in
right direction, (b) then do a full CG and make sure learned dist
matches up with empirical dist pretty well (plot), (3) finally debug
on full data.
 4/19/05  Just chatted w/ Tommi about no need for normalizing
by the marginal probability when learning the loglog model. He's
skeptical, but basically agrees with me. Got him to understand that
the case is very different from the multinomial/unigram (since with
the multinomial multiples # of numbers together proportional to
docment length; loglog product is constant number elements (number of
features/size of vocabulary)). He still thinks there are document
length issues, but doesn't think they are killer. I agree. The model
isin some senselearning how long documents should be. E.g. you
can calculate an expected document length for the model.
 4/16/05  Do we really need the conditional probability? To
train the distribution, why can't we just maximize the likelihood of
the (joint) data? Even with the multinomial, we don't calculate the
conditional. It's not immediately obvious to me that we could easily
calculate the marginal probability of a document being a certain
length with the multinomial.
 4/15/05  We can think of this as a variational approach.
Define P(x;\alpha,\beta) = 1/(x+\alpha)^\beta. Assume \beta > 2.
Since P(x) is convex, we have that $\frac{1}{ba} \int_a^b P(x) >
P(\frac{a+b}{2})$. We bound P(x_idlength=N) above by N
\int_{x_i1/(2N)}^{x_i+1/(2N)} 1/(x+\alpha)^\beta$. As N \to \infty,
this bound tends to the actual value. Note that we can easily
calculate the correct normalizing constant.
 4/15/05  I think my best line of attack is to model frequency
rates using a continuous distribution. One thing I should do is
consider the domain to be from 1/(2N) to 1+1/(2N) to accout for the
discreteness (i.e. integrals should have those limits).
 4/15/05  Chat w/ Johna few good ideas. To calculate
normalization constant: (1) use a multinomial approximation (w/ same
mean), (2) use continuous approximation. In both cases, it's easy to
calculate the marginal probability. Another idea he suggested is to
use a distribution like P(\vec x)=(\vec w^T vec x + a)^b; though, I
think it would be hard to calculate single word probabilities from
that distribution...
 4/15/05  Tommi has a simple, workable idea for how to deal
with document length: divide frequencies by document length! It's not
perfect since there isn't proper normalization (e.g. the distribution
will assign nonzero density to frequency rates > 1), but the only way
to get rid of that issue would be to calculate the marginal
probability, which we've already convinced ourselves that we can't do!
 4/15/05  Okay, I'm giving up on the idea of calculating the
marginal probability of P(\sum x_i=N). What next? Well, I think we
should use an extra additive constant to account for document length.
That is, P(x_id_j) = (x_i+a_i+c_j)^{b}. The probability of word i
occurring x_i times in document d_j depends on c_j, which is a
constant specific to that document that effectively accounts for
document length. The c_j constant has no dependence on individual
words, only on the document. Learn the c_j parameters alongside the
a_i and b parameters. Note that allows the model to account for
document lengths. It does not allow the model to account for
variability in correlation; it may be that some documents have higher
correlation in word occurrence rates.
 4/14/05  What exactly do I want to plot? The normalizer:
P(\sum x_i=N). I want to get a real empirical distribution for a set
of words, the calculate the normalizer based on those empirical
distributions.
 4/14/05  What about: model the normalizers dist. as a function
of N with a heavytailed dist, like exp(abs(x+c)) or something like
that? Guess this is why it would be useful to look at a sample so we
can judge whether this idea is semirealistic...
 4/14/05  Big problem with the Gaussian idea: 2nd moment of our
term freq. dist is infinite!!! (for b<=3). It would be nice to
compute at least a miniature version of the distribution of masses for
different document lengths...
 4/14/05  One offthecuff idea: ignore probability values for
words with tf=0 (they're going to be extremely close to unity anyway;
well, except maybe for extremely frequent words, like "the"). Oh, but
that assumes the dist. is normalized (not easy to do).
 4/14/05  Just met with Tommi. Threw around some ideas
concerning the normalization constant, but couldn't come up with
anything that could avoid the combinatorial explosion. I'm wondering
if we might be able to approximate the distribution of normalizers (N
is the variable) with a Gaussian. Mean is trivial to calculate :)
Variance might not be so easy. And who knows how good of an
approximation we'll get... Though, with a reasonablesize document
set, I can't imagine it's too awful. Recall, the reason we are doing
this is to handle different document lengths in data set, so all we
need is that ratios of values are correct, not absolute values.
Though, it is also important that as we change params a & b, the
ratios of loglikelihoods are correct...
 4/14/05  Need to think about how to deal with nasty
normalization constant in term freq. model. One possibility is
sampling, though I imagine this would be extremely slow.
 4/14/05  Finished draft of CSAIL abstract
 4/12/05  Hochreiter/Obermayer paper:
 Introduction
 Summary: We model data generation as causes, z, being
generated from a prior distribution, P_z. These causes induce
observations through a transfer function, x=f(z;w). We see
observations, x, and would like to recover the sources, z.
Effectively, we are trying to determine f^{1}. ML is often applied
to this problem, but ML requires that we impose a specific P_z
(e.g. single Gaussian) and assume a linear trasnfer function. The
authors propose a sample and kernelbased method that learns f^{1}
 Q for summary: do the authors randomly generate causes, z,
or are they learned along with the f? I guess what happens is that
they learn parameters of f or f^{1}. They learn parameters for f if
observations, x, are given. They learn parameters for f^{1} if
reference sources, z, are given.
 Observations, x, are generated from causes, z, through a
transfer function, f.
 Eqn. 2: subst. z=f^{1}(x) into eqn. 1. Note that this is
the only value for which \delta(xf) is nonzero.
 Cost Functions/Optimization
 Summary: the goal is to find parameters for f that minimize
the difference between observed and generated points. A kernel is
placed on each observed/generated point as a relaxation method. This
objective is the integrated squared difference, which is equivalent to
another expression that isolates the continuous integration. This is
capsulated as a kernel function. If we are given k_d, we can
construct k (as long as the integral converges). If we are given k,
it must be able to be written as in equation 5 (must have condiitons
of Thm. 1). Result: we don't need to evaluate the nasty integral. Equation 7 is chain rule, first part is \pd{x}{w}, second part is \pd{F}{x}.
 \tilde F and F can be shown equal via simple algebra
(subst. defn. of \Phi into \tilde F, expand, push in integral. Only a
few technical issues to deal with.
 Progression assumes k_d given; if we are given k, we must ensure that it can be represented as in eqn. 5.
 Empirical distribution is defined by k_d.
 Optimal Kernels and Convergence Properties
 Summary: Idea here is to determine what is the right kernel
to use. We want a kernel that will give a single local minimum (which
is the global min), so that learning is computationally
straightforward. Without motivation, they move to a scenario where
the number of samples goes to infinity. They make new definitions of
\Phi and F; F clearly corresponds to the previous definition (think of
variance of k_d to zeroempirical weight for each samples is 1/N);
unclear how \Phi corresponds to old definition. They want to find a
kernel that assures uniform convergence. That is, they want the
disparity between observed and model distributions to be bounded by a
decreasing function that goes to zero. I don't understand how the
Poisson equation comes about, but 11 is the condition that results
from the uniform convergence constraint. Is uniform convergence good?
They don't provide an argument.
 It appears these defintions are not in some way equivalent to
the earlier definitions of \Phi and F. Q: where do they come from?
 With continuous distribution, don't need k_d to smooth
things (dist is already smooth). Q: why isn't \Phi(a)=\rho(a) ?
 Uniform convergence. Recall that \rho is part of what we
are optimizing (\rho changes as we change w). We get uniform
convergence if we can upper bound the maximum value of \rho during
steps of the optimization so that the upper bound goes to zero.
 Poisson Equation: I don't understand this. Right side of
(11) seems to indicate that gradient is nonzero when a and b are
within the same partition.
 This statement confuses me: "All local optima of the cost
function F wrt the model parameters w are then only a property of the
model class f(.;w)"
 Plummer kernels?
 Experiments: ICA
 Summary: Authors apply their method according to bottom of
figure 2. ICA methods use independence of source signals as the
objective; this is a poor objective. ICA methods posit certain
properties of the source distribution. The authors' method is
distributionindependent.
 How is f defined?
 Generative ICA methods can only deal with linear mixing;
the author's method can deal with nonlinear mixing (see last example)
 Generative ICA methods require a certain source
distributionsthey can't handle data generated from very different
distributions. E.g. if Gaussians are assumed, performance with
mixtures of Gaussians is terrible; the author's method can deal with
arbitrary distributions (see first example)
 f is defined differently for the different problems: linear
for the first two problems, nonlinear neural network for the third
problem.
 Experiment 1 shows their method works well for mixtures of
Gaussians
 Experiment 2 shows they can handle different source
distributions in the same data
 Experiment 3 shows they can deal with a nonlinear transfer
function
 4/11/05  Need to prepare Hochreiter/Obermayer paper for
tomorrow's reading group.
 4/11/05  My SIGIR paper got accepted! Woo hoo!
 4/11/05  Wrt the normalizer for P(x) = exp(a log(b+x)), Tommi
suggested looking at partial sums, S_n = \sum_{x=0}^n exp(a log(b+x)),
and plotting S_{1/n} as 1/n goes to zero. Cubic fit of a few points
should give a very good approximation of the true result.
 4/11/05  Chat w/ Tommi. My idea for length normalization is
one possibility (scaling the 'b' parameter according to document
length). Another possibility is direct normalization of the
distribution. The chance of seeing a certain term freq. dist is
$P(x_1,x_2,...,x_n) = \prod_i P(x_i)$. That probability is implicitly
a joint probabilityjoint with the event that the document length is
$\sum_i x_i=N$. So, to make it conditional on the length, we must
divide by the marginal probability of generating a document of that
length. i.e. we must divide by the sum of probability mass associated
with empirical distributions (x_1,x_2,...,x_n) of length N (\sum_i
x_i=N). Trouble is, for N of any size, this may be infeasible to
compute exactly and computationally very difficult to compute
approximately.
 4/10/05  Finished JMLR review.
 4/6/05  One option for lenghtnormalization would be to
separateout the probability of x=0. But, that makes it possible for
p(x=0) << p(x=1), which isn't realistic according to the data I've
looked at. Note, model is p(x) = \exp(a \log(b+x))/Z. I wonder if we
should get rid of 'a' parameter. Additive 'b' constant seems like it
might be enough for the power we need. Can use 'b' to adjust for
different desired rates of occurrence.
 4/5/05  Need to write up lengthnormalization for loglog
model.
 4/3/05  To do:

ordistic experiments (and comparable shc allthresh experiments)

experiment plots, plots, plots (avg, std. dev. code)

margin cost & loss function plots (including zeroone and
heavaside at z=1)

write section 3 (immthresh loss vs. allthresh loss)
 binomial significance test
 4/2/05  Finally! Worked out the objective/gradient function
for multiclass classification (where loss is incurred for all other
classes, not just the maximal other class).
 4/2/05  I'm doing L1 regression w/ an L2 regularization norm.
Maybe not the most sensible thing, but it's easier to optimize!!! :)
 4/2/05  L1 Regression doesn't work. I don't think the
gradient is changing at all and CG keeps resetting (since gradient
never changes), so the search doesn't get anywhere. Gets closer and
closer to a "V", but not making any real progress. How do we make
real progress? I'm not sure. Randomly zero parts of the gradient?
That would help to some degree. Minimize an upper bound such that one
of the parameters controls how close the bound is? That's a better
solution.
 4/1/05  Fixed bugs in regression code, started runs. Both L1
& L2 appear to be running fine. L1 is very slow, as expected. I'm
doing 100 iterations. Might need to do more. Also, I've noticed that
the conjgrad for the first lambda often ends too quickly. I've upped
the tol for the first to make sure we get reasonably close to the
solution. Next: code up multiclass classification; use sub2ind.
 4/1/05  Coded up regression. Don't think a bias term will be
necessary since the feature vectors have been centered (zero mean).
 4/1/05  To do:
 Allthreshold:
shc, mls, lsc, logistic
 Immediatethreshold:
shc, mls, lsc, logistic[karhu]
 Regression w/
L2, L1[orava]
 Multiclass classification: shc, mls, lsc, logistic
 Chu/Ghahramani
 Hinge loss (allthresh, immthresh, multiclass)
 Ordistic (recode with matrix mentality)
 4/1/05  Problem w/ Nati's ordistic code is that it is
specifically meant for the one user casecan't deal with empty (0)
labels.
 3/31/05  Chat w/ Nati: do immediatethreshold experiments, do
experiments on regular multiclass classification.
 3/30/05  Found a bug in my least squares FixedV MMMF code
(wasn't handling z=0 correctly). Fixed. Also, wrote FixedV MMMF code
for modified least squares, (1z)^2 for z <= 1. Need to write code
for the logistic loss.
 3/29/05  Loglog word frequency model wrt document length.
Tommi says: just condition on document length, and everything will
work out. It'll be a bitch computationally, but the math is
straightforward. I don't completely get this... need to sit down &
work out the math...
 3/28/05  100 iteration EachMovie experiments ran out of
memory. I think I should put them on hold for now. The 10 iteration
runs completed and gave better results than Marlingood enough for
ICML paper. More important is to finish strong generalization. Still
need to do strong generalization on EachMovie. Strong generalization
for MovieLens is currently running. Once those finish, I should do
10 iteration strong generalization for EM. After that, do 30x weak
gen. for EM. Then, maybe 30x strong gen. for EM. Also, need to do
a training & test error plot as width of U/V increases and as the
number of iterations of CG increases. To do:

100x iteration strong generalization MovieLens

30x iteration weak generalization EachMovie
 10x iteration strong generalization EachMovie
 30x iteration strong generalization EachMovie
 Plot train/test error as function of iterations EachMovie
(p=100 x={1,...,100} i=1)
 Plot train/test error as function of components EachMovie
(p={1,...,100} x=100 i=1)

Plot train/test error as function of components MovieLens
(p={1,...,100} x=100 i=1)
For EachMovie runs, do only one process per machine, only use karhu/orava.
 3/28/05  Thoughts on rating experiments: use top 100 users'
ratings as V matrix, try to predict ratings of rest of users.
 3/25/05  To do

Plot train and test error for one run of 1M MovieLens

Write code for learning U for fixed V

Do strong generalization runs on 1M MovieLens

Run experiments on EachMovie once preparation script finishes

weak generalization

strong generalization

Plot train & test error for one run of EachMovie
 Write code for rating prediction (not MMMF); actually, it's
already written, we just have to figure out what to put in V; I think
the best is to take the ratings of the 100 (or so) users with the most
ratings. Try to predict ratings of all other users.
 Formalize loglog paper a bit, do more analysis about how
model should change when document length changes.
 Literature search for MovieLens and EachMovie results
 Make plots for ICML paper of MovieLens and EachMovie results.
 3/24/05  Another thing to do for MMMF: plot training and
testing error as a function of the number of iterations/line searches.
Also need to do a literature search for MovieLens collaborative
filtering results.
 3/24/05  What I need for my term distribution model is a
logGaussian model! That way even frequently occurring words may be
modeled well. Just need to get a sense of how document length should
playin. Should it be a constant multiplier of the mean parameter?
The "mean" parameter might need to be shifted "left" of the true
distribution mean...
 3/23/05  Term distribution model: it's easy enough to make the
model loglog if I don't mind a monotone distribution, but what if the
word has an expected rate of occurrence > 1?
 3/23/05  MovieLens data preparation is in motion. Wow does it
take a long time to holdout one rating per user. Doesn't help
that I can't think of a matrixy way to do the process!
Jason Rennie
Last modified: Fri Sep 16 12:03:05 EDT 2005