Jason Rennie's Research Log
Older
- 9/15/05 - Read the Blei/Lafferty paper on Correlated
Topic Models. Basic idea is that instead of drawing topics from a
Dirichlet, they will be drawn from (what I would call) a natural
parameter Gaussian with full(?) covariance matrix. Blei and Lafferty
call this distribution a logistic normal (apparently the same term as used
in a 1982 book by Aitchison). i.e. here's the process: draw
parameters from a Gaussian; use those parameters as natural parameters
for a multinomial. The Gaussian specifies the distribution of topics
for a document (separately drawn per topic). My (biased) guess is
that the benefit they get with this is the ability to more smoothly
transition between ranks. i.e. it might help to make the landscape
smoother and easier to find a point closer to the global optimum.
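Quick sketch of the generative step as I understand it (all names and numbers here are mine, not from the paper):

    import numpy as np

    def draw_topic_proportions(mu, Sigma, rng=np.random.default_rng()):
        # Draw natural parameters from a Gaussian with full covariance,
        # then softmax onto the simplex to get the topic proportions.
        eta = rng.multivariate_normal(mu, Sigma)
        theta = np.exp(eta - eta.max())   # subtract max for numerical stability
        return theta / theta.sum()

    # e.g. K=3 topics with a made-up full covariance
    mu = np.zeros(3)
    Sigma = np.array([[1.0, 0.5, 0.0],
                      [0.5, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
    print(draw_topic_proportions(mu, Sigma))

Given theta, per-word topic assignments and words are drawn just as in LDA; only the prior on theta changes.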
- 9/15/05 - Gave a quick read to the Lebanon/Lafferty
paper on the multinomial manifold. Didn't understand large chunks
of the paper. Maybe a paper I could present at group meeting? Basic
idea of the paper is that linear classification in the mean parameter
space of the multinomial isn't the right thing to do. They derive an
alternative way of constructing a decision boundary that has to do
with projecting points from the manifold onto the sphere. Seems like
this might be related to finding linear boundaries in the natural
parameter space... Should try to understand this better...
- 9/15/05 - Read the Gordon GGLLM
paper. Generalization of PCA allowing for non-linear transition
between U,V and X. Discussion of how to choose link and loss
functions so that some aspect of the optimization problem can be
called convex. Equation 2 looks like the log-likelihood of an
exponential distribution with natural parameterization.
- 9/15/05 - Read the Collins/Schapire/Singer
paper. Premise seems kind of silly---better algorithms for
optimizing adaboost and logistic regression. Why not just use
gradient descent? Okay. I know I sound like a broken record. I'll
shut up now. Discussion of Bregman distances, the "natural
optimization problem" associated with them.
- 9/15/05 - Read the Collins/Dasgupta/Schapire
paper. Describes a general natural parameter low-rank model for
exponential family distributions. Also shows that negative log
likelihood is always a Bregman distance for exponential
distributions. Thus, optimization is optimization of sum of Bregman
distances. Some useful examples of how PCA changes when natural
parameterization is used---differences are in terms of natural
parameter value, not that value exponentiated. Nice intro to Bregman
distances.
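To keep the definitions straight (standard definitions, written from memory rather than copied from the paper): for a convex F,

    D_F(p||q) = F(p) - F(q) - \nabla F(q)' (p - q),

and for a one-parameter exponential family p(x|\theta) = \exp(x\theta - G(\theta) + h(x)) with mean \mu = G'(\theta) and conjugate G^*,

    -\log p(x|\theta) = G(\theta) - x\theta - h(x) = D_{G^*}(x||\mu) - G^*(x) - h(x),

so minimizing negative log likelihood over \theta is the same as minimizing a Bregman distance between the data and the mean parameter.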
- 9/13/05 - Things I need to do:
- Create visualization of NPLT solutions: rank 1 with/without
norm constraint (fixed V, Fro norm on U, ellipsoid in NP space); rank 2
with norm constraint
- Answer this question: what is the strength of the natural
parameter representation? Tommi says that a value of PLSA is the
topics that result; solution is not unique, but interpretation is.
NPLT gives convex optimization, but what else? Here's one answer: the
singular vectors corresponding to the singular values can be interpreted
as topics. Is that true? How exactly?
- What's a good hierarchical model for NPLT? Is the one that
Tommi suggested reasonable? Where there is a rank constraint between
parents and children controlling the "error" between them? Note that
his suggestion deals with the regularization term. In fact, it
explodes that term into many rank/trace norm terms; how to find all
the regularization constants? Alternate hierarchical model is the one
I discussed w/ John. Important question: how do you interpret X?
Especially the rows of X that correspond to higher-level statistics?
- 9/13/05 - Good chat w/ Tommi today. He was trying to get me to
come up with some way that the natural parameter low trace norm (NPLT)
model is better than PLSA. In the end, I think he answered his own
question :) PLSA topics are unique and interpretable in the sense that
they are the only underlying representation for which convex
combinations result in the parameter matrix. Only permutations are
generally possible (without violating the non-negative & sum-to-one
constraints). You can visualize the subspace to which the data has
been projected---it's a hyper-tetrahedron. There isn't an exact
analog in NPLT. NPLT simply yields a lower-dimensional sub-space.
However, we can come up with natural "topic" vectors: the singular vectors
corresponding to the singular values of the matrix. Though these can
be rotated, they do yield a natural representation. For constructing
a hierarchical model, Tommi suggested projecting sentences from a
single paragraph onto a line (in natural space, curve in mean space);
paragraphs from a single post; posts from a single thread; threads
from the whole collection. Each level is grounded by the above level;
that is, the lower-level line must pass through the projected
higher-level point. Procedure might be iterative. First find line
that is best fit for threads. Then, requiring that line pass through
projected thread point, find line that is best fit for posts from that
thread. And, so on. 'course, you might want a slightly larger space
for projection (2 rank, 3 rank, or a trace norm restriction). But,
then, what we're really doing is finding a low trace norm projection
at each level of the hierarchy. i.e. the interpretation here (finding
variability at each level) is very similar to the interpretation of
the non-hierarchical problem.
- 9/9/05 - Gave the Sajama & Orlitsky PCA
paper a(nother) read. They use natural parameters like I am
proposing to do. Some useful observations, such as how natural
parameters are defined (linear product with data in log-probability
space), and that two points close in natural param. space will be close
in mean parameter space (due to continuity of transform). I need to
read the Collins/Dasgupta/Schapire
paper on exponential-family PCA. Reasons why mean
parameterization is bad (for multinomial): parameters are tied, cannot
change individual parameters, overparameterization (represents n-1
dimensional space). I didn't totally follow the semi-parametric
aspect of the paper... But, one thing that was clear to me was that
they use a rank-constrained approach (limit dimensionality of
underlying representation).
- 9/8/05 - Nice chat w/ Charles Elkan this
morning. Turns out that the Dirichlet
Compound Multinomial (1) was originally introduced by Thomas
Minka, and (2) it is closely related to "Bayesian Naive Bayes" in
my master's
thesis. In fact, the main difference between my "Bayesian Naive
Bayes" and the DCM is that for the DCM, the Dirichlet parameters are
learned, whereas in my model the parameters are set to the training
data counts (plus "prior" counts). This is important since as Charles
& his co-authors have found, the best parameters for the DCM tend to
(1) sum to the length of an (average) document, and (2) be
proportional to the document frequency (not the term frequency).
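For reference, the DCM/Polya likelihood (standard form, written from memory rather than from their paper):

    P(x|\alpha) = \frac{n!}{\prod_i x_i!} \frac{\Gamma(\sum_i \alpha_i)}{\Gamma(n + \sum_i \alpha_i)} \prod_i \frac{\Gamma(x_i + \alpha_i)}{\Gamma(\alpha_i)},  where n = \sum_i x_i.

The alpha's are what get fit; fixing them to training counts plus "prior" counts is what my thesis model did.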
- 8/30/05 - Phew! Just spent 2 hours talking w/ Tommi about
aspect model stuff. I now (mostly) understand why the trace norm can
be written as the minimum Frobenius norm of factorizations. I also
understand that we can't really speak in terms of "topics" with this
model. It constrains the space of the underlying representation.
But, that space can be represented as the hull of many different sets
of vectors. The mixture models (PLSA, LDA) also have this problem.
One thing Tommi made me realize is that a mixture of multinomials is a
multinomial. The only reason that a mixture of Gaussians is not a
Gaussian is b/c the Gaussian does not fully parameterize all possible
distributions on the space of possibilities. There are only N
possibilities in a multinomial & there are N parameters. Two
multinomials can't give you any more expressiveness than you already
have with one! Need to do a write up on this. Need to do a write-up
proving the trace norm equivalencies. We also talked about
hierarchical representations and where this model might apply to my
thesis work.
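Here's the quick numerical check I should fold into the write-up (just a sketch; the minimizing factorization below is read off the SVD):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 4))

    # Trace norm: sum of singular values.
    trace_norm = np.linalg.svd(X, compute_uv=False).sum()

    # A factorization achieving the minimum: U = A sqrt(S), V = B sqrt(S),
    # where X = A diag(S) B' is the (thin) SVD.
    A, S, Bt = np.linalg.svd(X, full_matrices=False)
    U = A * np.sqrt(S)
    V = Bt.T * np.sqrt(S)

    print(trace_norm, 0.5 * (np.sum(U**2) + np.sum(V**2)))   # should agree
    print(np.allclose(U @ V.T, X))                           # and UV' = X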
- 8/30/05 - Hmm... so my guess is that if I were to work out all
the math, I would find that the proper (description length) penalty
for encoding the parameters is a constant multiple of the prior (with
maybe a constant factor added in). i.e. I think I'm 99% of the way
there just by multiplying in the prior and then using a constant to
adjust the influence of the prior.
- 8/29/05 - Okay, I think I have this worked out for the matrix
factorization model, but what about a simple multinomial model?
Assume each row of X is drawn from a uniform Dirichlet prior. But,
the uniform Dirichlet gives a value of (n-1)! for any value of
theta!!!! That's not right! Okay, here's the issue: the Dirichlet
operates on the simplex, which is (much) smaller than the unit
hypercube. So, the normalizer is correct. What needs fixing is my
approach. Let's return to the "bits-back" argument... I can't
believe it, but I never properly derived Logistic Regression using the
bits-back argument. My development uses very fuzzy reasoning.
What I should be using for any distribution on a continuous space is
the KL-Divergence.
- 8/29/05 - How do I specify the encoding length of a term
frequency model? Need to encode (1) the parameters, and (2) the data
given the parameters. So, for the matrix factorization model, the
likelihood of the data, P(Y|U,V), is multinomial and UV^T provides the
natural parameters. There are Gaussian priors on each row of U and V,
P(U|\mu_U) and P(V|\mu_V), with unit variance and a shared mean across
rows of U (\mu_U) and rows of V (\mu_V). There is a standard Normal
prior on each mean.
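As a bookkeeping check, here is roughly the encoding cost/objective I have in mind (constants dropped; the function and variable names are mine):

    import numpy as np
    from scipy.special import logsumexp

    def neg_log_posterior(Y, U, V, mu_U, mu_V):
        # Rows of Y are word-count vectors; UV' gives per-row multinomial
        # natural parameters (softmax).  Unit-variance Gaussian priors on
        # rows of U and V with shared means; standard Normal priors on the means.
        theta = U @ V.T
        log_probs = theta - logsumexp(theta, axis=1, keepdims=True)
        nll_data = -np.sum(Y * log_probs)
        nll_U = 0.5 * np.sum((U - mu_U) ** 2)
        nll_V = 0.5 * np.sum((V - mu_V) ** 2)
        nll_means = 0.5 * (np.sum(mu_U ** 2) + np.sum(mu_V ** 2))
        return nll_data + nll_U + nll_V + nll_means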
- 8/26/05 - Something I didn't realize about trace norm. It is
(effectively) the L1 norm of a matrix's singular values. The L2
(Frobenius) norm is the square root of the trace of X^TX, which works for
any matrix, not just PSD ones. However, these are not the traditional
operator norm (which, I think, only looks at the maximum singular value). Here's
some information about matrix
norms.
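Concretely (small sketch):

    import numpy as np

    X = np.random.default_rng(1).standard_normal((5, 3))
    s = np.linalg.svd(X, compute_uv=False)

    trace_norm = s.sum()                    # L1 of the singular values
    frobenius = np.sqrt(np.trace(X.T @ X))  # equals np.sqrt((s**2).sum())
    spectral = s.max()                      # the usual operator 2-norm
    print(trace_norm, frobenius, spectral)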
- 8/25/05 - Unlike LDA, our natural parameter model
simultaneously generates topic vectors and mixture vectors.
- 8/24/05 - Chatted with Tommi about a norm-constrained version
of the aspect model. The necessary changes would be a bit larger than
I originally envisioned. First, we would need to change to the
natural parameter space so that we are not using mixtures of
distributions, but rather geometric averages of parameters. This
change allows us to attain the desired convexity benefit. But, the
interpretation as a generative process changes. We can no longer
follow a simple generative process to produce words in a document.
- 8/24/05 - Gave the Joachims
paper on optimizing non-linear error measures a read. Sounds like
it's a nice extension of his work on structured output learning.
Instead of optimizing for individual labels, he performs a joint
optimization on all the labels so that instead of measuring simple
error (which can be separated into sum of individual label errors), he
measures an error which cannot be decomposed as a sum of errors on
individual examples. He ends up optimizing a bound on the given
criterion.
- 8/23/05 - Revisited the Taskar/et
al. "Relational Data" work. The only insight seems to be the idea
of using clique templates. I.e. instead of placing cliques on all
pairs or triples of variables, place cliques based on some structure
of the data, e.g. a clique for each hyperlink and pair of web pages
such that page1 points at page2. I don't think this is useful for my
co-reference work...
- 8/23/05 - A talk by Zoubin Ghahramani gave me a better
appreciation for the usefulness of the models Zhu/Lafferty/Ghahramani
use for active learning. An interesting aspect of their model is that
it values "nearness" to other unlabeled examples as well as
uncertainty in the label.
- 8/23/05 - While looking at the Lowd/Domingos paper, I noticed a
paper on Multi-Relational
Record Linkage by Parag/Domingos. Sounds like it's based on the
McCallum/Wellner work, but I lost track of the development of their
model on a quick read through the paper. Should come back and study
this...
- 8/23/05 - Slightly interesting ICML paper on
learning graphical models by Lowd and Domingos. Idea is to
introduce hidden variables, each of which has Naive Bayes-like effect
on the observed variables. Only links in graph are from hidden
variables to observed variables. Dimensionality reduction is
achieved by limiting number of hidden variables (and number of states
for the hidden variables). Speed comparison is only against full
networks. I think they should have compared vs. tree networks. My
guess is that tree networks would have achieved comparable
speed/accuracy performance. As with almost all work of this sort
where there is a rank parameter, a better formulation might have been
to allow the maximum # of hidden variables, but to penalize entropy of
the distributions.
- 8/23/05 - Anand recommended that I take a look at a paper by
Mikheev, Moens and Grover, which looks at NED with
gazetteers/dictionaries and discusses the MUC conferences. Someone at
SIGIR suggested that MUC looked at (what I call) informal textual
domains, but this does not seem to be the case. The data I got from
McCallum/Wellner is a collection of newspaper-type articles and the
example in this paper reads as though it was taken from a
newspaper-type source (at the very least, syntax isn't noisy).
- 8/23/05 - Nice paper by
Madsen, Kauchak & Elkan at ICML on a heavy-tailed term frequency
distribution. Instead of learning a multinomial, they learn a
Dirichlet, which generates a multinomial. Seems to be closely tied to
LDA. And, I think what happens is that the alpha's (params of
Dirichlet) are updated as the document is processed.
- 7/29/05 - I've fixed the best-previous-mention formulation so
that training is only done using prior mentions up to the first
coreferent mention. Oh, except for this: the null mention should be
compared against for all cases, not just the first mention in
each cluster. Otherwise, the weight for the feature corresponding to
the null mention will be too large.
- 7/28/05 - Okay. Here's what I need to do. First, determine
the reference chains on the entire data set. Next, remove non-proper
nouns, connecting any "broken" links. Then, we have to produce
training data in matlab format. Now, for training, there are
effectively multiple right answers. Any previous mention qualifies as
correct. If we consider all correct mentions, we will find it hard
not to run into the single-link-threshold algorithm, which we don't
want to do. So, instead, we only train on mentions up to and
including the nearest previous co-referent mention. I recall this
being a somewhat common training method in earlier works on
coreference resolution...
- 7/28/05 - Realized a bug in my current formulation &
implementation. Currently, I first eliminate non-proper-nouns, then I
determine references. Need to reverse that since currently a
reference chain can be cut off. Also, when dealing with proper nouns
only, it doesn't make any sense for there to be exactly one correct
answer. Any previous mention referring to the same thing should be
correct. i.e. I need to resolve the reference chains to determine
clusters of reference. For best-previous-match, it's probably best to
train only using examples up to & including the nearest mention
referring to the same thing. When we do the threshold-based
resolution, we'll train on all examples jointly...
- 7/26/05 - I need to rewrite the code that determines the "left
of head" word... It's obvious what to do when there is no MIN
attribute. I think when there is a MIN attribute, I should use the
2nd to last word in the MIN attribute.
- 7/26/05 - Okay, I have written up the max-margin
best-previous-match algorithm. Now, I have to implement it. One
bothersome question is: how do you define a proper noun? My first
thought was: it's a proper noun if the head is a NNP or NNPS. But,
the head (as defined by MIN) may span multiple words. In fact, some
heads are downright odd, such as "William E. Royster, 33, of Kansas
City, Mo., and the bombardier-navigator, Lt. Keith A. Douglas." How
do you categorize that?!?!? I think what we should do is consider it
a proper noun if (1) there is no MIN and the last word in the COREF is
NNP(S), or (2) there is a MIN and the first head word is NNP(S). When
there is a MIN, I'll need to search through the LEX texts to find the
head.
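In pseudocode-ish Python, the rule I have in mind (the pos_of lookup is a placeholder I still need to write against the actual LEX/parse data):

    def is_proper_noun(coref_tokens, min_tokens, pos_of):
        # coref_tokens: tokens of the COREF element
        # min_tokens:   tokens of the MIN attribute, or None
        # pos_of:       token -> POS tag (placeholder)
        if min_tokens is None:
            # (1) no MIN: check the last word of the COREF span
            return pos_of(coref_tokens[-1]) in ("NNP", "NNPS")
        # (2) MIN present: check the first head word of MIN
        return pos_of(min_tokens[0]) in ("NNP", "NNPS")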
- 7/21/05 - Met with Tommi. He had some ideas on how to show
that we converge to global optimum. We can think of the joint
derivative over U and V as (1) a derivative over X, and (2) a
derivative over the space of invertible matrices. Note that the space
of invertible matrices gives us the space of all U,V s.t. UV'=X;
UAA^{-1}V'=X if A is invertible and UV'=X. Another thing that piqued
Tommi's interest is calculating the gradient of X. This involves
taking the derivative of a trace norm. We can write this as minimum
of Frobenius norms of U,V and then use Lagrangian multipliers to
account for the UV'=X constraint. The gradient of X is the maximized
Lagrangian multiplier. Recall that gradient=0 at maximum.
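For later reference, the closed form I'd expect that derivation to recover (standard result on the trace norm subgradient, not something we worked out): writing the thin SVD X = A \Sigma B',

    \partial ||X||_tr = { A B' + W : A'W = 0, WB = 0, ||W||_2 <= 1 },

so when X has full row or column rank the subgradient is unique and is just AB'.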
- 7/21/05 - Due to the way entities are labeled in the data set
(as a chain of references), we cannot simply throw away non-NNP noun
phrases. To get the labels, we have to follow all the chains to the
root. Looks like there are a decent # of NNP's. First file has 75
out of 210. Issue to think about: non-proper noun phrases can anchor
a reference chain, e.g. "the deer." Question is, where do we draw the
line? Clearly, pronouns reference in a chain-like manner.
- 7/20/05 - I think my first goal for co-reference resolution
work should be to replicate the McCallum/Wellner results. My hybrid
model uses theirs for proper nouns and sort-of a global
best-previous-match algorithm for non-proper nouns (max-antecedent)
once the McCallum/Wellner algorithm is run on the propers.
McCallum/Wellner use 5-fold CV for evaluation. One thing they don't
do which might improve performance is to use a regularization penalty.
Though, reg. penalty may not be useful since there are so few
features. One issue is that during training, we would learn
parameters for the entire data set together, whereas for testing, I
think we'd be best off first doing inference on proper nouns, then
non-propers.
- 7/20/05 - Finally finished the log-log write-up. Also did my
practice talk for ICML on Monday. Nati doesn't like the idea of
starting from ordinal regression, but I think it clarifies the
process, allows us to show off all-thresholds and lets us show how
MMMF is a generalization of the SVM.
- 6/28/05 - ~/project/loglog/classificationSandbox.m
- 6/28/05 - Smoothing is an issue. When I train unigram model on
a class, then evaluate on the test data for that (same) class, I get
nearly all infinities (due to words not observed in the training data;
I get finite objectives for all training documents). Seems important
to use a type of smoothing that does not yield a bias for one model or
the other (though maybe I'm trying too hard?). Though, maybe the best
thing to do is to simply use cross-validation to set minimum parameter
values (theta and axponent).
- 6/28/05 - Negative alpha is often indication that my initial
parameter vector is bad.
- 6/28/05 - We know that the log-log model provides a better fit
for topic-oriented words. We want to show that this translates to
reduced classification error. Hmm... for classification, it's not
absolute fit that's important, it's relative fit. First thing I think
we should do is a basic classification comparison.
- 6/28/05 - Okay, shouldn't be hard (in principle) to run
classification experiments w/ log-log. Need to figure out how to
restrict vocabulary size. Though, it might be easier to do
axponent-per-word model (w/ PCCG), which would make learning faster.
What to do with zero-occurrence words? Assume each class has an extra
document containing one of each vocabulary word?
- 6/20/05 - I think I should do the 'parameterization' section of
the paper before plotting graphs of the log-log distribution. What
would be nice would be to introduce bias-per-word frequency model,
show graphs comparing that to empirical and unigram, then introduce
the other models.
- 6/15/05 - Okay, what do I need to do? I need to plot a
histogram of all frequency counts vs. the maximum-likelihood unigram
distribution. In estimating the unigram, we're really estimating a binomial.
- 6/14/05 - The Unigram model conditions on document length.
i.e. it's not outrageous for our log-log models to condition on
document length. To get the joint (unconditional) probability, we can
introduce a Poisson prior on document length.
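In symbols (my rendering of the idea):

    P(\vec{x}) = P(N) P(\vec{x}|N) = \frac{\lambda^N e^{-\lambda}}{N!} P(\vec{x}|N),  where N = \sum_i x_i.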
- 6/10/05 - I need a plan of how to finish this log-log stuff. I
need to look at six different models:
- Unigram
- Mixture of Unigrams
- Log-Log Frequency axponent-per-word
- Log-Log Frequency bias-per-word
- Log-Log Rate axponent-per-word
- Log-Log Rate bias-per-word
For the log-log models, I should look at how well they model each
other's data. In particular, I know the axponent-per-word models will
probably do poorly. It would be nice if I could show quasi-convexity
of one or more of the log-log models (this might also be something to
think about for CG learning of U,V for MMMF). This might lead us to a
proof that as long as we don't start at a critical point, we can find
the global minimum. Definitions of quasi- and pseudo-convexity are
found in 6.252 lecture notes dated 2/24 (pgs. 15-17). I want to look
at model fit of these models on real data. I also want to look at
application performance using these models (e.g. text classification
accuracy). I do need to discuss the length issue. This will be an
explanation of why the axponent-per-word model does poorly. Need to
talk about the alternative models, but not go into them in too much
detail. There is something interesting there in terms of modeling
text---there being a limit on how often a word can occur. This is
something a sentence-level model of language would naturally capture
(due to part-of-speech restrictions).
- 6/9/05 - I think all this length stuff is hogwash. The model
needs to apply probability mass to all possible frequency values.
And, what's important is how the model discriminates. Futzing with
length can lead to higher training log-likelihood, but will have
minimal impact on our ability to discriminate or to determine similarity.
- 6/9/05 - Tommi & I talked at length about the log-log model.
Big question was how to deal with the length issue. Some ideas we
discussed: Set a maximum frequency rate value, gamma. For the model,
l=n gamma, where n is the empirical document length. Clearly, gamma
<= 1. We have one gamma per word; for most words, gamma will be
much less than 1. But, there's the problem that a test document may
have a frequency rate > gamma. One option would be to create a
distribution over values of gamma. But, then we'd need to integrate
the product of a beta and our log-log distribution (or something like
that). We'd also need a distribution over document lengths to make it
truly proper. Another option would be to make our distribution a
mixture between a uniform model (for values > n*gamma) and the log-log
model. Then, again, we could consider all sorts of mixtures--a
multinomial and log-log, a mixture of experts (two log-logs, one
assigned to low frequency values, one assigned to high values). A
completely separate option is to tune gamma on the test data---do
transduction. For classification, we'd learn the a,b parameters on
the training data, using gamma equal to the max frequency rate in the
training data. Then, to classify a document, we'd set gamma to the
max rate over training+test documents (no change of a,b). Note that
for information retrieval, we could construct two models, one
irrelevant model using only the document database, and one relevant
model using a mixture of the document database and the query.
- 6/8/05 - This suggests that "length" should be learned.
i.e. we'd be learning one parameter per word and one parameter per
document (it's "length"). Note that the model length need not match
up with the document's actual length. But, there's a problem here.
Length is a discrete quantity. 84.5 doesn't make any sense as the
length quantity in the model. Unless maybe we were to assign (only)
that last fraction to the probability of the "count equals
ceil(document length) event".
- 6/8/05 - Okay, finally figured out what was going on. I was
using the empirical document length rather than the "length" I used in
generating the data. Though, it is important to know that the
axponent-per-word model deals badly when length is taken from the
empirical document length.
- 6/6/05 - Okay, tried learning with only three values of a, tons
of observations. Now, "right" model gets lower NLL, but the
parameters that are learned are weird. True a's are -2, -3, -4. It
basically learns these values shifted by 2: 0, -1, -2. The b it
learns is basically correct (though slightly larger than it should
be). There's something funny going on here...
- 6/6/05 - Tried looking at generalization (what is negative
log-likelihood of test data generated from same dist as training
distribution). "Wrong" model won(!) 3 out of 3. "Right" model got
better fit to the training data 2 out of 3 times, strangely enough. Also
noticed that nearly half of the words yield zero occurrences. Methinks
I need to draw more than 100 documents... This partially
explains the shifting and shrinking of the range of parameters learned
by the "right" model. I should choose parameters so that very few
words are all zero occurrences.
- 6/6/05 - Hmm... so I tried three different generations of
parameters. Each time, the "wrong" model gave the lower negative
log-likelihood. One thing I don't understand. The "right" model
consistently chooses parameters that aren't in the right range. b is
always too large (1e-3, where true value is 6e-4) and range of a's is
too small ([-1.5,-3.7] where the true range is [-1,-5]). When I fix
b=6e-4, fit is even worse and range of a's is still very wrong
[-1.4,-3.1]. The "wrong" model, OTOH, chooses b's in a range
([1e-4,5e-3]) that straddles the true b. When I generate data
according to the bias-per-word model, the "right" model gets the
lowest negative log-likelihood. I wonder if the axponent-per-word
model can at least generalize better? Also, what happens if I
generate a ton of data for a small number of words? If the obj-grad
code is written correctly, the model should learn the underlying
parameter values. These are things I need to try...
- 6/6/05 - All the papers are done! Yeah! Need to finish up the
log-log work. I've implemented PR-CG. Makes learning of
axponent-per-word model very fast, but it still achieves higher
negative log-likelihood than bias-per-word model on data generated
from the axponent-per-word model!!! Maybe I need to try multiple
different random initializations? Vary parameters?
- 5/20/05 - To do: (1) calculate significance and ZOE for 1M
MovieLens (basically, rerun experiments & save test and hold-out
outputs), (2) revise Figure 2 in IJCAI paper---add vertical lines at
thresholds, make interval sizes vary.
- 5/19/05 - Appears that the problems with the axponent-per-word
model are numerical. I think preconditioned CG would fix the problem.
An issue is that one component of the gradient is much larger than the
rest. Another is that dobj varies a lot between iterations (by up to
8 orders of magnitude). The optimization appears to stop early, but
there's no way it could continue on (I tried letting it do so). The
change is so small that matlab recognizes it as zero! I think I need
to do some work on the camera ready papers... First task is to do
hold-out runs for the ICML/MMMF paper.
- 5/18/05 - Think I've found the problem: per-word a_i is just
harder to optimize. Had to do some reworking of my linesearch code to
get it to do a good job of optimizing. Still very slow. Seems that
the gradient magnitude for b is much, much larger than for all the
a_i. Preconditioning would help...
- 5/18/05 - Welp, I can now generate data from loglogRateA and
learn parameters for both RateA and RateB. RateB wins!?!?
- 5/18/05 - Got the matlab mixture code working. Implemented a
generator. Need to generate data from the axponent model, see if it
fits better than the bias model...
- 5/16/05 - Also, if we want to compare count- vs. rate-based
models, we should normalize over only feasible counts (not penalize
count-based model for assigning weight to counts > document
length). No reason for doing infinite sum!!!
- 5/16/05 - Chat w/ Tommi. He had a good suggestion to make sure
my optimization is doing what it's supposed to: generate data
according to each of the models. Clearly, the generating model should
provide the best fit! :)
- 5/15/05 - Implemented log-log model with reversed
parameterization. When I fix b, optimization is fairly fast and
converges well---termination is due to sufficient decrease in gradient
magnitude. However, negative log-likelihood is larger than original
(highly non-convex) parameterization! i.e. per-word b_i models the
text better than per-word a_i!
- 5/13/05 - Showed that anything of the form log E[exp(a^Tx)] is
convex in a. The Hessian is a covariance matrix, which is PSD. Did a
write-up. Next: do write-up for log-log using a_i as per-word
parameters.
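Sketch of the statement, for the write-up (standard exponential-family computation): with the tilted distribution q_a(x) \propto p(x) e^{a'x},

    f(a) = \log E[e^{a'x}],  \nabla f(a) = E_{q_a}[x],  \nabla^2 f(a) = Cov_{q_a}(x) \succeq 0,

and a covariance matrix is PSD, so f is convex.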
- 5/12/05 - Talked w/ Nati. He'd like me to select
regularization parameter via a hold out set. Something as simple as
splitting the training data into two (use some small percentage as
hold-out, e.g. 10%). Would do this completely separately from other experiments
- 5/12/05 - Okay, we can't solve for the a_i's analytically.
But, we can show that it is convex. Anything of the form E[exp(a^Tx)]
(where x is the variable) is convex in a. I should sit down and prove
this. Tommi and I talked about how to show that our distribution is
better. We want to show that we fit discriminative words
well---fitting of common words (such as articles) doesn't matter for
most tasks. One way to do this is to look at a histogram of words as
a function of information gain, where IG is H(y)-H(y|x), where y is
the classification label and x is the word---i.e. how much do we
reduce entropy by including x in the classification
model/distribution.
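Minimal sketch of the IG computation (treating x as a binary word-presence indicator and y as integer class labels; names are mine):

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def information_gain(y, x):
        # IG = H(y) - H(y|x) for class labels y and a binary word indicator x.
        y, x = np.asarray(y), np.asarray(x)
        h_y = entropy(np.bincount(y) / len(y))
        h_y_given_x = 0.0
        for v in (0, 1):
            mask = (x == v)
            if mask.any():
                h_y_given_x += mask.mean() * entropy(np.bincount(y[mask]) / mask.sum())
        return h_y - h_y_given_x

    # toy example
    y = np.array([0, 0, 1, 1, 1, 0])
    x = np.array([1, 0, 1, 1, 0, 0])
    print(information_gain(y, x))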
- 5/10/05 - Tommi suggested using per-word a_i's. Big advantage
is that the problem should be convex and we should be able to solve
the a_i's analytically since they are exponential parameters.
- 5/10/05 - I've finally been able to achieve lower
log-likelihood than I was with the count model. I had to learn the
betas with a fixed alpha, then switch over to learning alpha and betas
jointly. It seems that when we're far from the solution, optimization
diverges. But, when we're close, we get convergence. However, the
convergence seems quite slow. Gradient looks like it's on a roller
coaster. Alpha has finally settled around -0.07 (a=-2.93). I
expected a much more negative value of a!
- 5/9/05 - I think I've completely debugged the log-log code, but
I still can't get it to learn alpha & beta simultaneously---it tries
to put alpha too large. I wonder if I should un-re-parameterize 'a'
and simply have obj/gradient return infinity for out-of-bounds values.
Maybe for b/beta too?
- 5/8/05 - Should try constraint that L2 norm of rate vector is
unity. Log-likelihoods for basic rate model are not better than raw
frequency LL's.
- 5/8/05 - I have everything running for EM. At some point,
would be good to look into Bookcrossing, but that's not a priority. I
have SIGIR final submission coming up May 26th. There are a number of
SIGIR workshops with deadlines May 15th. One to consider is ELECTRA.
In particular, the workshop focuses on going beyond the unigram/term
independence model. My idea is somewhat off topic b/c it tackles the
occurrence independence assumption, not the term independence
assumption (which seems what they are most interested in,
e.g. multi-word units, phrases). However, the fact that the model can
help in finding multi-word named entities might be of interest to the
audience. There is also a workshop (due date May 20), which is on
Mathematical/Formal Methods. The description is not very specific,
but it seems a new model for term frequency would fit.
- 5/6/05 - Matlab is down... to do when it comes back up: (1)
generate weak generalization plot for EM, (2) polish script and start
runs for strong generalization, (3) produce rank plot showing
performance as rank changes, (4) debug iteration plot code---I keep
getting out-of-memory errors...
- 5/5/05 - Met w/ Tommi. He likes my idea for a log-log tf rate
model. He would like me to submit something for NIPS, but thinks the
log-log stuff would need to be part of something larger; co-reference
would be good, but I told him I doubt I'll have something ready in
time, especially with all the stuff to wrap-up for ICML and IJCAI
papers...
- 5/4/05 - Wrote up a log-log-based term frequency rate model.
Also wrote objective/derivative code, but there seem to be lingering
bugs. Time to take a break and work on MMMF stuff for a little while.
- 5/1/05 - I produced two scores that are based on the log-log
model: (1) LLSO, or the log-odds of the log-log and unigram
likelihoods, and (2) LLPROD, the product of LLSO and IDF. I realize
now that LLPROD may not make much sense since LLSO may be negative
- 5/1/05 - Trained log-log model using chow data, wrote python
code to calculate log-odds ratio of log-log and unigram models.
Appears to be a bit less effective than mixture score at rating
informative words on chow data. I think I need to use log-log
dist. to model term frequency rate (instead of raw frequency) so that
the model takes advantage of document length information as does the
unigram.
- 4/28/05 - Chat w/ Tommi: lots to do: (1) use heavy-tailed model
in place of mixture for NEE experiments; (2) implement CRF, plus
marginal objective CRF and semi-Markov CRF; (3) try jointly extracting
locations and restaurant names.
- 4/28/05 - Need to finish up MMMF work: update ICML paper w/
"big" results, do comparison vs csdp.
- 4/26/05 - What's next? (1) Write good CG code for the
alternating optimization, (2) interface w/ Python and/or write
equivalent python code.
- 4/26/05 - It works. Yeah! What makes the optimization tricky
is that the 'a' parameter needs to be close to the optimal value in
order for the optimization to go smoothly. a=-5.11 is optimal for the
restaurant data.
- 4/26/05 - Okay, so, it turns out that if I fix a=-3.0 and try
to learn b/beta for each word individually, for most words I get nice,
fast convergence. For some words (particularly high-frequency and
words w/ empirical distributions that are far from log-log), CG
doesn't converge and it gets to the point that it can't find a
decrease in the objective by following the negative gradient. Best I
can tell is that this is due to our approximate calculation of the
infinite sum. But, CG finds a very good solution---very close to the
true solution, plenty close for practical purposes. There's still the
issue of the axponent. I see very weird behavior when I try to learn
a and b/beta simultaneously. Maybe the best thing to do is to first
learn 'a' using a single b/beta (using the full empirical frequency
distribution as data), then use that fixed 'a' while learning b/beta's
for individual words.
- 4/22/05 - Okay---here's the deal, I need to try moment
matching. Fix a, compute the derivative for b_i as the difference
between the empirical and expected means.
- 4/21/05 - Met w/ Tommi, had him check over the math, everything
looked good to him. Only quips: (1) I should combine expectations for
each word before summing, (2) I should add a small value to
x+e^{\beta_i} to ensure that it is always > 0.
- 4/20/05 - Another observation: the expected value we calculate
for the partial derivative wrt a does not intermingle a in the
expectation (a only has an effect on the probability values). This
isn't the case for beta_i. Changes to beta_i affect both the real
expectation and the empirical expectation. For the real expectation,
both the probabilities and values are affected; for the empirical
expectation, only the values are affected. But, I imagine this would
make it difficult to do gradient descent b/c values can change in such
non-linear fashion!
- 4/20/05 - Ack! I need to reparameterize b as e^\beta (or
something like that) so that the optimization can't try b <= 0.
- 4/20/05 - loglog obj/grad code is behaving badly. It's not
converging quickly, it fails checkgrad2 at a semi-optimized point, and
it once tripped the bad obj/grad code warning in conjgrad! But, I
don't see what the problem is. I think I need to step through the
calculations for a small number of words, make sure I have a very good
understanding of what is being calculated. The hard thing is that for
small # of words, the code converges fine. It's only when I use all
the words that I run into trouble! Argh! Oh, I'm also seeing
imaginary numbers. That can't be good! Where do I take the square
root of a negative number?!? Ah, if x+b is negative, we would get
imaginary numbers! Good things to think about for the trip home...
Are we sure the model would never want to choose b <=0 ?
- 4/19/05 - Got approximateExpectedValue code working (renamed to
infiniteSum). Checked against Riemann Zeta Function values for
n={2,3,4,5,6}. Correct to 10 digits.
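The check is essentially this (a rough re-creation in Python, not the actual infiniteSum code; a crude partial sum plus a midpoint integral tail correction already gets very close):

    import numpy as np
    from scipy.special import zeta

    def partial_sum_with_tail(s, K=10000):
        # sum_{k=1}^inf k^{-s}: partial sum plus integral approximation of the tail
        k = np.arange(1, K + 1)
        head = np.sum(k ** (-float(s)))
        tail = (K + 0.5) ** (1 - s) / (s - 1)
        return head + tail

    for n in (2, 3, 4, 5, 6):
        print(n, partial_sum_with_tail(n), zeta(n, 1))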
- 4/19/05 - My write-up on learning the log-log model has the
correct derivatives. Just checked them on the white board. I added
some additional detail to the write-up. I need to take a step-by-step
approach to debugging the code: (1) make sure the approximate EV code
works by checking with known values of the Riemann Zeta Function, (2)
test derivative code for a single word: (a) make sure signs are in
right direction, (b) then do a full CG and make sure learned dist
matches up with empirical dist pretty well (plot), (3) finally debug
on full data.
- 4/19/05 - Just chatted w/ Tommi about no need for normalizing
by the marginal probability when learning the log-log model. He's
skeptical, but basically agrees with me. Got him to understand that
the case is very different from the multinomial/unigram (since the
multinomial multiplies together a number of terms proportional to document
length, whereas the log-log product has a constant number of elements: the
number of features/size of the vocabulary). He still thinks there are document
length issues, but doesn't think they are killer. I agree. The model
is---in some sense---learning how long documents should be. E.g. you
can calculate an expected document length for the model.
- 4/16/05 - Do we really need the conditional probability? To
train the distribution, why can't we just maximize the likelihood of
the (joint) data? Even with the multinomial, we don't calculate the
conditional. It's not immediately obvious to me that we could easily
calculate the marginal probability of a document being a certain
length with the multinomial.
- 4/15/05 - We can think of this as a variational approach.
Define P(x;\alpha,\beta) = 1/(x+\alpha)^\beta. Assume \beta > 2.
Since P(x) is convex, we have that $\frac{1}{b-a} \int_a^b P(x)\,dx >
P(\frac{a+b}{2})$. We bound P(x_i|dlength=N) above by $N
\int_{x_i-1/(2N)}^{x_i+1/(2N)} (x+\alpha)^{-\beta}\,dx$. As N \to \infty,
this bound tends to the actual value. Note that we can easily
calculate the correct normalizing constant.
- 4/15/05 - I think my best line of attack is to model frequency
rates using a continuous distribution. One thing I should do is
consider the domain to be from -1/(2N) to 1+1/(2N) to account for the
discreteness (i.e. integrals should have those limits).
- 4/15/05 - Chat w/ John---a few good ideas. To calculate
normalization constant: (1) use a multinomial approximation (w/ same
mean), (2) use continuous approximation. In both cases, it's easy to
calculate the marginal probability. Another idea he suggested is to
use a distribution like P(\vec x)=(\vec w^T \vec x + a)^b; though, I
think it would be hard to calculate single word probabilities from
that distribution...
- 4/15/05 - Tommi has a simple, workable idea for how to deal
with document length: divide frequencies by document length! It's not
perfect since there isn't proper normalization (e.g. the distribution
will assign non-zero density to frequency rates > 1), but the only way
to get rid of that issue would be to calculate the marginal
probability, which we've already convinced ourselves that we can't do!
- 4/15/05 - Okay, I'm giving up on the idea of calculating the
marginal probability of P(\sum x_i=N). What next? Well, I think we
should use an extra additive constant to account for document length.
That is, P(x_i|d_j) = (x_i+a_i+c_j)^{b}. The probability of word i
occurring x_i times in document d_j depends on c_j, which is a
constant specific to that document that effectively accounts for
document length. The c_j constant has no dependence on individual
words, only on the document. Learn the c_j parameters along-side the
a_i and b parameters. Note that this allows the model to account for
document lengths. It does not allow the model to account for
variability in correlation; it may be that some documents have higher
correlation in word occurrence rates.
- 4/14/05 - What exactly do I want to plot? The normalizer:
P(\sum x_i=N). I want to get a real empirical distribution for a set
of words, then calculate the normalizer based on those empirical
distributions.
- 4/14/05 - What about: model the normalizers dist. as a function
of N with a heavy-tailed dist, like exp(abs(x+c)) or something like
that? Guess this is why it would be useful to look at a sample so we
can judge whether this idea is semi-realistic...
- 4/14/05 - Big problem with the Gaussian idea: 2nd moment of our
term freq. dist is infinite!!! (for b<=3). It would be nice to
compute at least a miniature version of the distribution of masses for
different document lengths...
- 4/14/05 - One off-the-cuff idea: ignore probability values for
words with tf=0 (they're going to be extremely close to unity anyway;
well, except maybe for extremely frequent words, like "the"). Oh, but
that assumes the dist. is normalized (not easy to do).
- 4/14/05 - Just met with Tommi. Threw around some ideas
concerning the normalization constant, but couldn't come up with
anything that could avoid the combinatorial explosion. I'm wondering
if we might be able to approximate the distribution of normalizers (N
is the variable) with a Gaussian. Mean is trivial to calculate :)
Variance might not be so easy. And who knows how good of an
approximation we'll get... Though, with a reasonable-size document
set, I can't imagine it's too awful. Recall, the reason we are doing
this is to handle different document lengths in data set, so all we
need is that ratios of values are correct, not absolute values.
Though, it is also important that as we change params a & b, the
ratios of log-likelihoods are correct...
- 4/14/05 - Need to think about how to deal with nasty
normalization constant in term freq. model. One possibility is
sampling, though I imagine this would be extremely slow.
- 4/14/05 - Finished draft of CSAIL abstract
- 4/12/05 - Hochreiter/Obermayer paper:
- Introduction
- Summary: We model data generation as causes, z, being
generated from a prior distribution, P_z. These causes induce
observations through a transfer function, x=f(z;w). We see
observations, x, and would like to recover the sources, z.
Effectively, we are trying to determine f^{-1}. ML is often applied
to this problem, but ML requires that we impose a specific P_z
(e.g. single Gaussian) and assume a linear transfer function. The
authors propose a sample- and kernel-based method that learns f^{-1}
- Q for summary: do the authors randomly generate causes, z,
or are they learned along with the f? I guess what happens is that
they learn parameters of f or f^{-1}. They learn parameters for f if
observations, x, are given. They learn parameters for f^{-1} if
reference sources, z, are given.
- Observations, x, are generated from causes, z, through a
transfer function, f.
- Eqn. 2: subst. z=f^{-1}(x) into eqn. 1. Note that this is
the only value for which \delta(x-f) is non-zero.
- Cost Functions/Optimization
- Summary: the goal is to find parameters for f that minimize
the difference between observed and generated points. A kernel is
placed on each observed/generated point as a relaxation method. This
objective is the integrated squared difference, which is equivalent to
another expression that isolates the continuous integration. This is
encapsulated as a kernel function. If we are given k_d, we can
construct k (as long as the integral converges). If we are given k,
it must be able to be written as in equation 5 (must satisfy the conditions
of Thm. 1). Result: we don't need to evaluate the nasty integral.
Equation 7 is the chain rule: the first part is \pd{x}{w}, the second part is \pd{F}{x}.
- \tilde F and F can be shown equal via simple algebra
(subst. defn. of \Phi into \tilde F, expand, push in the integral). Only a
few technical issues to deal with.
- Progression assumes k_d given; if we are given k, we must ensure that it can be represented as in eqn. 5.
- Empirical distribution is defined by k_d.
- Optimal Kernels and Convergence Properties
- Summary: Idea here is to determine what is the right kernel
to use. We want a kernel that will give a single local minimum (which
is the global min), so that learning is computationally
straightforward. Without motivation, they move to a scenario where
the number of samples goes to infinity. They make new definitions of
\Phi and F; F clearly corresponds to the previous definition (think of
variance of k_d going to zero---the empirical weight for each sample is 1/N);
unclear how \Phi corresponds to old definition. They want to find a
kernel that assures uniform convergence. That is, they want the
disparity between observed and model distributions to be bounded by a
decreasing function that goes to zero. I don't understand how the
Poisson equation comes about, but 11 is the condition that results
from the uniform convergence constraint. Is uniform convergence good?
They don't provide an argument.
- It appears these definitions are not obviously equivalent to
the earlier definitions of \Phi and F. Q: where do they come from?
- With continuous distribution, don't need k_d to smooth
things (dist is already smooth). Q: why isn't \Phi(a)=\rho(a) ?
- Uniform convergence. Recall that \rho is part of what we
are optimizing (\rho changes as we change w). We get uniform
convergence if we can upper bound the maximum value of \rho during
steps of the optimization so that the upper bound goes to zero.
- Poisson Equation: I don't understand this. Right side of
(11) seems to indicate that gradient is non-zero when a and b are
within the same partition.
- This statement confuses me: "All local optima of the cost
function F wrt the model parameters w are then only a property of the
model class f(.;w)"
- Plummer kernels?
- Experiments: ICA
- Summary: Authors apply their method according to bottom of
figure 2. ICA methods use independence of source signals as the
objective; this is a poor objective. ICA methods posit certain
properties of the source distribution. The authors' method is
distribution-independent.
- How is f defined?
- Generative ICA methods can only deal with linear mixing;
the authors' method can deal with non-linear mixing (see last example)
- Generative ICA methods require certain source
distributions---they can't handle data generated from very different
distributions. E.g. if Gaussians are assumed, performance with
mixtures of Gaussians is terrible; the authors' method can deal with
arbitrary distributions (see first example)
- f is defined differently for the different problems: linear
for the first two problems, non-linear neural network for the third
problem.
- Experiment 1 shows their method works well for mixtures of
Gaussians
- Experiment 2 shows they can handle different source
distributions in the same data
- Experiment 3 shows they can deal with a non-linear transfer
function
- 4/11/05 - Need to prepare Hochreiter/Obermayer paper for
tomorrow's reading group.
- 4/11/05 - My SIGIR paper got accepted! Woo hoo!
- 4/11/05 - Wrt the normalizer for P(x) = exp(a log(b+x)), Tommi
suggested looking at partial sums, S_n = \sum_{x=0}^n exp(a log(b+x)),
and plotting S_{1/n} as 1/n goes to zero. Cubic fit of a few points
should give a very good approximation of the true result.
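Something like this (parameters made up; the cubic in 1/n is just the suggested heuristic, so I'd want to check how good it really is):

    import numpy as np

    a, b = -2.5, 0.7                      # made-up parameters; a < -1 so the sum converges

    def S(n):
        x = np.arange(0, n + 1)
        return np.sum((b + x) ** a)       # exp(a*log(b+x)) = (b+x)^a

    ns = np.array([200, 400, 800, 1600, 3200])
    coeffs = np.polyfit(1.0 / ns, [S(n) for n in ns], 3)   # cubic in 1/n
    print(np.polyval(coeffs, 0.0))        # extrapolation to 1/n -> 0
    print(S(10**6))                       # brute-force reference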
- 4/11/05 - Chat w/ Tommi. My idea for length normalization is
one possibility (scaling the 'b' parameter according to document
length). Another possibility is direct normalization of the
distribution. The chance of seeing a certain term freq. dist is
$P(x_1,x_2,...,x_n) = \prod_i P(x_i)$. That probability is implicitly
a joint probability---joint with the event that the document length is
$\sum_i x_i=N$. So, to make it conditional on the length, we must
divide by the marginal probability of generating a document of that
length. i.e. we must divide by the sum of probability mass associated
with empirical distributions (x_1,x_2,...,x_n) of length N (\sum_i
x_i=N). Trouble is, for N of any size, this may be infeasible to
compute exactly and computationally very difficult to compute
approximately.
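In symbols, what I'm stuck on computing is the denominator in

    P(x_1,...,x_n | \sum_i x_i = N) = \frac{\prod_i P(x_i)}{\sum_{x' : \sum_i x'_i = N} \prod_i P(x'_i)},

i.e. the sum over all count vectors with total length N.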
- 4/10/05 - Finished JMLR review.
- 4/6/05 - One option for length-normalization would be to
separate-out the probability of x=0. But, that makes it possible for
p(x=0) << p(x=1), which isn't realistic according to the data I've
looked at. Note, model is p(x) = \exp(a \log(b+x))/Z. I wonder if we
should get rid of 'a' parameter. Additive 'b' constant seems like it
might be enough for the power we need. Can use 'b' to adjust for
different desired rates of occurrence.
- 4/5/05 - Need to write up length-normalization for log-log
model.
- 4/3/05 - To do:
- ordistic experiments (and comparable shc all-thresh experiments)
- experiment plots, plots, plots (avg, std. dev. code)
- margin cost & loss function plots (including zero-one and Heaviside at z=1)
- write section 3 (imm-thresh loss vs. all-thresh loss)
- binomial significance test
- 4/2/05 - Finally! Worked out the objective/gradient function
for multiclass classification (where loss is incurred for all other
classes, not just the maximal other class).
- 4/2/05 - I'm doing L1 regression w/ an L2 regularization norm.
Maybe not the most sensible thing, but it's easier to optimize!!! :)
- 4/2/05 - L1 Regression doesn't work. I don't think the
gradient is changing at all and CG keeps resetting (since gradient
never changes), so the search doesn't get anywhere. Gets closer and
closer to a "V", but not making any real progress. How do we make
real progress? I'm not sure. Randomly zero parts of the gradient?
That would help to some degree. Minimize an upper bound such that one
of the parameters controls how close the bound is? That's a better
solution.
- 4/1/05 - Fixed bugs in regression code, started runs. Both L1
& L2 appear to be running fine. L1 is very slow, as expected. I'm
doing 100 iterations. Might need to do more. Also, I've noticed that
the conjgrad for the first lambda often ends too quickly. I've upped
the tol for the first to make sure we get reasonably close to the
solution. Next: code up multi-class classification; use sub2ind.
- 4/1/05 - Coded up regression. Don't think a bias term will be
necessary since the feature vectors have been centered (zero mean).
- 4/1/05 - To do:
- All-threshold: shc, mls, lsc, logistic
- Immediate-threshold: shc, mls, lsc, logistic[karhu]
- Regression w/ L2, L1[orava]
- Multi-class classification: shc, mls, lsc, logistic
- Chu/Ghahramani
- Hinge loss (all-thresh, imm-thresh, multi-class)
- Ordistic (re-code with matrix mentality)
- 4/1/05 - Problem w/ Nati's ordistic code is that it is
specifically meant for the one user case---can't deal with empty (0)
labels.
- 3/31/05 - Chat w/ Nati: do immediate-threshold experiments, do
experiments on regular multiclass classification.
- 3/30/05 - Found a bug in my least squares FixedV MMMF code
(wasn't handling z=0 correctly). Fixed. Also, wrote FixedV MMMF code
for modified least squares, (1-z)^2 for z <= 1. Need to write code
for the logistic loss.
- 3/29/05 - Log-log word frequency model wrt document length.
Tommi says: just condition on document length, and everything will
work out. It'll be a bitch computationally, but the math is
straightforward. I don't completely get this... need to sit down &
work out the math...
- 3/28/05 - 100 iteration EachMovie experiments ran out of
memory. I think I should put them on hold for now. The 10 iteration
runs completed and gave better results than Marlin--good enough for
ICML paper. More important is to finish strong generalization. Still
need to do strong generalization on EachMovie. Strong generalization
for MovieLens is currently running. Once those finish, I should do
10 iteration strong generalization for EM. After that, do 30x weak
gen. for EM. Then, maybe 30x strong gen. for EM. Also, need to do
a training & test error plot as width of U/V increases and as the
number of iterations of CG increases. To do:
- 100x iteration strong generalization MovieLens
- 30x iteration weak generalization EachMovie
- 10x iteration strong generalization EachMovie
- 30x iteration strong generalization EachMovie
- Plot train/test error as function of iterations EachMovie
(p=100 x={1,...,100} i=1)
- Plot train/test error as function of components EachMovie
(p={1,...,100} x=100 i=1)
- Plot train/test error as function of components MovieLens
(p={1,...,100} x=100 i=1)
For EachMovie runs, do only one process per machine, only use karhu/orava.
- 3/28/05 - Thoughts on rating experiments: use top 100 users'
ratings as V matrix, try to predict ratings of rest of users.
- 3/25/05 - To do
- Plot train and test error for one run of 1M MovieLens
- Write code for learning U for fixed V
- Do strong generalization runs on 1M MovieLens
- Run experiments on EachMovie once preparation script finishes
- weak generalization
- strong generalization
- Plot train & test error for one run of EachMovie
- Write code for rating prediction (not MMMF); actually, it's
already written, we just have to figure out what to put in V; I think
the best is to take the ratings of the 100 (or so) users with the most
ratings. Try to predict ratings of all other users.
- Formalize log-log paper a bit, do more analysis about how
model should change when document length changes.
- Literature search for MovieLens and EachMovie results
- Make plots for ICML paper of MovieLens and EachMovie results.
- 3/24/05 - Another thing to do for MMMF: plot training and
testing error as a function of the number of iterations/line searches.
Also need to do a literature search for MovieLens collaborative
filtering results.
- 3/24/05 - What I need for my term distribution model is a
log-Gaussian model! That way even frequently occurring words may be
modeled well. Just need to get a sense of how document length should
play-in. Should it be a constant multiplier of the mean parameter?
The "mean" parameter might need to be shifted "left" of the true
distribution mean...
- 3/23/05 - Term distribution model: it's easy enough to make the
model log-log if I don't mind a monotone distribution, but what if the
word has an expected rate of occurrence > 1?
- 3/23/05 - MovieLens data preparation is in motion. Wow does it
take a l-o-n-g time to hold-out one rating per user. Doesn't help
that I can't think of a matrix-y way to do the process!
Jason Rennie
Last modified: Fri Sep 16 12:03:05 EDT 2005