label\prediction | pos. | neg. |
---|---|---|
pos. | tp | fn |
neg. | fp | tn |
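For reference, a tiny sketch of how these counts might be tallied from binary labels and predictions (the function name and the 1/0 encoding of pos./neg. are my own):

```python
import numpy as np

def confusion_counts(labels, preds):
    """Tally tp/fn/fp/tn for binary labels/predictions encoded as 1 (pos.) / 0 (neg.)."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    tp = int(np.sum((labels == 1) & (preds == 1)))
    fn = int(np.sum((labels == 1) & (preds == 0)))
    fp = int(np.sum((labels == 0) & (preds == 1)))
    tn = int(np.sum((labels == 0) & (preds == 0)))
    return tp, fn, fp, tn
```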
David told me about another interesting problem: discovering an alignment between proteins in different species. Each species has a different set of proteins, but proteins can often be linked because the species are relatively close in the evolutionary tree and their functions have remained somewhat similar. Such proteins are called "homologous". We have PPI information for each species. The question is: can we use this information to determine which proteins are homologous? Assume that interactions are symmetric. Then if we learn a matrix factorization of some underlying interaction parameter, the two factors are identical, e.g. X = UUᵀ. Imagine that we put all the interactions into a single matrix, where the first block of rows/columns corresponds to the proteins of species #1 and the second block to the proteins of species #2. We have no data concerning inter-species interactions. But a trace norm penalty on the underlying parameter matrix would encourage the model to find similarities in the protein representations. We could then run a bipartite graph matching algorithm using Euclidean distances between the representations (or some such). But this doesn't seem that much better than the algorithms David told me about... That's not completely true. At least the similarity measure is not completely heuristic, but rather learned in a semi-intelligent way.
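A minimal sketch of this idea (all names, the rank, and the plain gradient-descent optimizer are my own choices, not anything David or I have implemented). It relies on the fact that for a symmetric PSD factorization X = UUᵀ the trace norm of X equals ‖U‖_F², so a Frobenius penalty on U plays the role of the trace-norm penalty; the matching step runs the Hungarian algorithm on Euclidean distances between the learned rows of U.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fit_shared_factors(A1, A2, rank=10, lam=0.1, lr=0.01, n_iters=2000, seed=0):
    """Fit X ≈ U Uᵀ to the combined PPI matrix, observing only the two
    intra-species blocks A1, A2 (symmetric 0/1 arrays)."""
    rng = np.random.default_rng(seed)
    n1, n2 = A1.shape[0], A2.shape[0]
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, :n1] = A1
    A[n1:, n1:] = A2
    mask = np.zeros_like(A, dtype=bool)       # which entries are observed
    mask[:n1, :n1] = True
    mask[n1:, n1:] = True                     # no inter-species data
    U = 0.1 * rng.standard_normal((n1 + n2, rank))
    for _ in range(n_iters):
        R = (U @ U.T - A) * mask              # residual on observed entries only
        # Gradient of the masked squared error plus (lam/2)*||U||_F^2,
        # the latter standing in for the trace-norm penalty on U Uᵀ.
        U -= lr * (4 * R @ U + lam * U)
    return U[:n1], U[n1:]

def match_proteins(U1, U2):
    """Bipartite matching of species-1 vs. species-2 proteins by Euclidean
    distance between their learned representations."""
    dists = np.linalg.norm(U1[:, None, :] - U2[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dists)  # Hungarian-style assignment
    return list(zip(rows, cols))
```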
f function). It is interesting to note that they are effectively producing a ranking by a weighted combination of expert ratings (since S is a (discrete) totally ordered set). There is a separate f for each expert, which defines his/her binning/rating of the examples. Downsides of their approach are that they do exponential weight updates, effectively trying to focus on the single best expert, and that their loss function is classification error, which may not generalize well to new data.
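To illustrate the criticism, here is a generic Hedge-style exponential weight update (not their exact algorithm; the learning rate eta and the 0/1 loss encoding are placeholders). With such updates, the weight mass quickly concentrates on the single best expert:

```python
import numpy as np

def hedge_update(weights, expert_losses, eta=1.0):
    """Multiplicative update: experts with lower loss (e.g. 0/1 classification
    error on the latest example) receive exponentially more weight."""
    w = weights * np.exp(-eta * np.asarray(expert_losses, dtype=float))
    return w / w.sum()

# Example: expert index 1 is consistently right, so its weight quickly dominates.
w = np.ones(3) / 3
for losses in ([1, 0, 1], [1, 0, 0], [0, 0, 1]):
    w = hedge_update(w, losses)
print(w)   # weight mass concentrates on expert index 1
```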
kernel. The idea: θ ∼ N(0, I); the observation is f(x_k) = θᵀφ(x_k) → y_k; then Cov[f(x_i), f(x_j)] = φ(x_i)ᵀφ(x_j), which is a kernel: K(x_i, x_j). Their model is similar to an immediate-thresholds model, but the loss function is not defined as a sum of two MPFs. In fact, their way of constructing the loss function is somewhat reversed from the method I use. They define an "ideal" likelihood which is unity in the correct segment (defined by the thresholds), but they assume a Gaussian distribution over the mapping of the example to the real number line. The result is something very similar to immediate thresholds with the modified least squares loss function (note the middle plot of Figure 1: the derivative is linear). Tommi thinks that if you replace the Gaussian distribution over the example's mapping to the real line (presumably with a logistic distribution), you get the Proportional Odds model. Chu/Ghahramani achieve regularization via the Gaussian prior on f. Both of their learning methods try to maximize the posterior, which is comparable to how we learn parameters. Tommi notes that immediate thresholds plus the Logistic Regression MPF does not correspond to a probability distribution: when we sum the LR MPFs, we would need to renormalize.
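A small sketch of the likelihood construction described above, assuming the ideal segment indicator is smoothed by Gaussian noise on f(x) (the threshold values, noise scale, and function name are placeholders): the probability of rating r is the Gaussian measure of the segment (b_{r-1}, b_r].

```python
import numpy as np
from scipy.stats import norm

def ordinal_likelihood(f, r, thresholds, sigma=1.0):
    """P(y = r | f): ideal indicator of the segment (b_{r-1}, b_r] convolved
    with N(0, sigma^2) noise on the latent value f(x).
    thresholds = [b_1, ..., b_{R-1}]; b_0 = -inf and b_R = +inf are implicit."""
    b = np.concatenate(([-np.inf], thresholds, [np.inf]))
    return norm.cdf((b[r] - f) / sigma) - norm.cdf((b[r - 1] - f) / sigma)

# Example with 3 ratings and thresholds at -1 and 1:
print(ordinal_likelihood(f=0.3, r=2, thresholds=[-1.0, 1.0]))
```

Taking the negative log of this quantity gives a loss that grows roughly quadratically once f(x) leaves the correct segment, so its derivative is asymptotically linear, which matches the "derivative is linear" observation about the middle plot of Figure 1 above.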