Jason Rennie's Research Log

Older

3/14/06 - There are a few levels of complication we'll want to test:
1. Covariance structure (always use zero mean):
  1. Perpendicular manifolds: Gaussian is block-diagonal w/ each cluster corresponding to one block.
  2. Interleaved manifolds: generate each covariance matrix w/o restriction. Note that points should be generated from a low-dimensional manifold.
2. Cluster skew:
  1. Even distribution: all clusters generate the same number of points
  2. Skewed distribution: uneven distribution of points (e.g. (9,3), (8,2,2), etc.).
We'll want to try different numbers of clusters, e.g. 2, 3, 4, and different distributions of points, e.g. (6,6), (4,4,4), (3,3,3,3), (9,3), (6,4,2), (6,3,2,1).
3/14/06 - Okay, our abstract is accepted and we have something resembling a paper now! Woo hoo! Next step is for me to do some simulated experiments. We'll do these with Gaussians. There are various things we want to evaluate
1. whether clustering is correct for some value of λ,
2. whether clustering is hierarchical (more fine/coarse clustering agrees with underlying structure of data), and
3. whether we get the correct clustering when we learn λ.
4. whether we can deal with a skewed distribution (e.g. 1 point from cluster A, 9 points from cluster B).
These are so closely tied that I don't think its worth attacking them separately. Need a set up that will allow us to evaluate all three at once. Let's establish a fixed dimensionality of the data, say 100. We'll probably only be able to deal with a very small number of data points, say 12. Need a prior on the choice of cluster. To generate a point: (1) choose a cluster, then (2) generate a point from that cluster. Hence, multinomial prior on cluster selection. We could assume the uninformative/flat prior, or we could try to learn the prior... For now, assume uninformative/flat prior. Oh, no! A mixture model! Ack! John suggest that we assume that # of points per cluster is pre-specified. Fine with me.
3/11/06 - I think I can show that if X=UΣV^T where V^TV=I, Σ=diag(σ) and U has unit columns, then ||X||_Σ ≤ ∑_i σ_i. I.e. the trace norm of X (the sum of its singular values is less than the trace of Σ in the above decomposition. Note that the trace norm of X is equal to the trace norm of UΣ since the trace norm is invariant to orthogonal transforms (V is orthogonal). The trace norm can be defined as
||X||_Σ = max_{Y: ||Y|| ≤ 1} tr(Y^TX),
where ||Y|| is the spectral norm (maximum singular value) of Y and tr() denotes the trace. The maximization is over matrices with maximal eigen value of one or less. However, it is equivalent to maximize over orthogonal matrices, which have all singular values equal to one. Thus, the trace norm is less than or equal to the sum of the lengths of the columns (or rows) of X.
3/10/06 - John helped me w/ Hessians a bit. A Hessian is second derivative at a particular point. Hitting it with a vector on both sides gives you the curvature in that direction at the point corresponding to the Hessian. In order for a function to have a local maximum, the Hessin must be negative semi-definite somewhere. NSD requires that all determinants of all sub-matrices are non-positive. The fact that our Hessian is block PSD means that it can't be negative definite and would have to be all zero to be negative-semi definite. Hence, it's not NSD and there are no local max.
3/9/06 - Assertion: a function with no local maxima has no local minima (but it may have saddle points). This is the case for our J'(A,B). Note that J' is convex separately for A and B, but not jointly. However, since it is separately convex, the Hessian of J' is block-diagonal-PSD. Hence, Hessian of J' cannot be negative semi definite (except all zero, which it is not) and hence J' cannot have local maxima. By our assertion, J' cannot have local minima. What remains is to characterize the saddle points.
3/9/06 - John can show that J(A) = ||A||_Fro² + ||A^-1X||_Fro² is convex in A if A is invertible. Proof calculates second derivative and shows that each term is non-negative. This is more difficult to do when A is not square b/c we have to use the pseudo-inverse, which has a more complicated derivative. The definition of the pseudo-inverse is A⁺ = lim_{ε → 0} (A'A + εI)A'. The derivative of the psuedo-inverse is
dA⁺ = A⁺A⁺'(dA' - (dA'A + A'dA)A⁺).
We haven't been able to show convexity for arbitrary A. Some terms in the second derivative are negative and do not appear to cancel w/ other terms. Useful in the square A case is what is known as a commutation matrix, which is defined on page 16 of Thomas Minka's Matrix Algebra for Statistics paper. Note that the commutation matrix has orthonormal columns.
3/9/06 - Even if we assume that A and B must use X's SVD, we're still a ways from being able to say anything useful. Consider: X=U_XΣ_XV_X^T, A=U_XΣ_AV_A^T, and B=V_XΣ_BU_B^T so that AB^T=X. Only saving grace would be if we could show that V_A=U_B. Need orthogonal matrix Q such that AQ=U_XΣ_A and BQ=V_XΣ_B.
3/9/06 - Continuing from below, let X^* = arg min_X J(X). Note that we can follow a decreasing path (wrt J) from X to X^*. Let X=UΣV^T and X^*=U^*Σ^*V^*T be SVD's. U^αU^*(1-α)Σ^αΣ^*(1-α)(V^αV^*(1-α))^T. Yes, this decreases trace norm, but what about loss(X)?
3/9/06 - Let X ∈ ℜ^{m × n}. Let X=UΣV^T be a SVD of X. Define the objectives J(X) = loss(X) + λ||X||_Σ, and J'(A,B) = loss(AB^T) + λ/2*(||A||²_Fro + ||B||²_Fro). Let A=UΣ^α and B=VΣ^1-α where α ∈ [0,1]. Note that X=AB^T. And, note that ||A||²_Fro + ||B||²_Fro ≥ ||U√Σ||²_Fro + ||V√Σ||²_Fro. I.e. starting from A,B, we can follow a downward path to the minimum of J' subject to the constraint X=AB^T. But, we would like to be able to say that there is a downward path starting from any AB^T=X, not only decompositions that can be written as part of the SVD for X.
3/8/06 - Notes on pseudoconvexity (copied from Robert Freund's 2/24/04 6.252 course notes). Let X be a convex set in ℜⁿ. The differentiable function f(x) : X → ℜⁿ is pseudoconvex if ∀ x, y ∈ X,
∇f(x)^T(y-x) ≥ 0 ⇔ f(y) ≥ f(x)
Every differentiable convex function is pseudoconvex. A pseudoconvex function is quasiconvex.
3/8/06 - Notes on quasiconvexity (copied from Robert Freund's 2/24/04 6.252 course notes). Let X be a convex set in ℜⁿ. The function f(x) : X → ℜⁿ is quasiconvex if ∀ x, y ∈ X and ∀ λ ∈ [0,1],
f(λx + (1-λ)y) ≤ max{f(x),f(y)}.
If f(x) is convex, then f(x) is quasiconvex. For each α ∈ ℜ, define the α-level set of f(x) as
S_α = {X ∈ X | f(x) ≤ &alpha}.
A function f(x) is quasiconvex iff S_α is a convex set ∀ α ∈ ℜ.
3/7/06 - Okay, I'm working on the most basic clustering algorithm---iterate over all possible clusterings. The basic idea is, for each index, iterate over each value from 1 to max+1, where max is the largest value to the left of the current location. But, how do we do this? No, I don't want a for-loop for each index. What about a "next" function? Similar to iterating over numbers, but the rules are slightly different. Next function would calculate the cumulative maximum, find the first position where cummax does not change, add one to that position, set all right positions to one. I.e. difference between this and iteration over numbers is that iter over numbers has fixed ceiling. Here, ceiling depends on cummax.
3/7/06 - No reason that we can't calculate SVD every time for simple experiments. So, how do we determine a clustering? Well, we certainly need to calcualte the normalization constant for each possible cluster size, 1, ..., n. Note that we would assume that one dimension of the matrix is fixed (the dimension of each data point). Let d be the dimension of each data point; we would need to calculate the norm. constant for 1×d, 2×d, 3×d, ..., n×d. In principle, we need to evaluate all possible clusterings. I.e. all arrangements of the points into clusters. That's a lot!
3/6/06 - When we add a row to a matrix, we may need to find a new V matrix. As Todd Will notes, finding a V that yields an orthogonal U matrix is the major difficulting of computing an SVD.
3/6/06 - Todd Will gives a very nice explanation of how to project a point onto a line. It involves hitting the point with aligner, then stretcher, then hanger. The aligner puts the point into the perpframe. The stretcher removes components corresponding to other vectors in the perpframe. Finally, the hanger projects the point back into the Euclidean perpframe, I. One very nice aspect of his explanation is that it easily transfers to other types of projections (point to hyper-plane).
3/6/06 - I'm giving a careful read to Todd Will's Singular Value Decomposition Tutorial. It's well done and has already provided me with some intuition I didn't have. One thing it has helped me realize is how the "stretcher matrix" Σ affects the result. Without Σ, we'd simply have two rotations. If we think of a unit circle, V rotates that circle, Σ stretches it, then U provides another rotation(/embedding). Without the Σ step, it would still look like a circle.
3/5/06 - Horn & Johnson define a "unitarily invariant norm" on a matrix as one that is not affected by orthogonal transformations. I.e. ||A||=||UAV|| for U, V orthogonal. I.e. a unitarily invariant norm is only a function of singular values. Horn & Johnson note that zero-padding does not affect the singular values, and hence does not affect a unitarily invariant norm.
3/5/06 - Another Horn & Johnson observation: given any A,B ∈ ℜ^{n × m}, ∃ U ∈ ℜ^{n
× n}, V ∈ ℜ^{m × m} such that B=UAV iff A and B have identical singular values.
3/5/06 - Here's a (seemingly obvious, yet possibly) useful observation from Horn & Johnson: the singular values of U₁AU₂ are the same as those of A whenever U₁ and U₂ are orthogonal matrices of the appropriate sizes. Horn/Johnson uses the term "unitary," which is equivalent to "orthogonal" for real matrices. According to Horn/Johnson, this expresses the "unitary invariance" of the singular values.
3/2/06 - Pat & Bobby also suggested a specific algorithm that might be better than calculating singular values. Also, they noted that it may be more efficient to not get U,V from svd() unless actually necessary. Let A be original matrix, let b be new row, A'=[A;b]. Let A=UΣV' be the SVD of A. Then AA'+bb' =UΣ²U'+bb' = U(Σ²+U'bb'U)U'. Ack! This isn't right! UU' is not the identity matrix. Oh, unless they're assuming more columns than rows, then it is! Continuing. Let a=U'b. Then from above = U(Σ²+aa')U'. Let T=Σ²+aa'. Then sum of singular values of updated matrix is sum(sqrt(eig(T))).
3/2/06 - Chatted with some Mathworks math guys. Mostly talked to Pat (heavy-set, red hair, beard) and Bobby (asian descent). I should get the "Matrix Computations" book. They all had a copy---seems like a standard reference for anyone dealing w/ matrices. They had some good ideas and, importantly, gave me some keywords for typing into Google :) This process of adding a row/column to a matrix is known as "updating." So, I would search for "updating the singular value decomposition." Appears there's an old paper by Bunch and Nielsen on the subject. I also found a newer paper, "A stable and fast algorithm for updating the SVD" by Ming Gu and Stanley Eisenstat that claims O((n+m)min(n,m)) time to compute the update. Also, my idea that the updated singular value is Σ'_ii ≈ (Σ_ii + u_i)^1/2, where u is the row vector transformed into the V space.
3/2/06 - Two things to do today: (1) write back to John and write-up our discussions on proving that we can find global optimum, (2) develop and write-up algorithm(s) for trace norm clustering.
3/1/06 - Tommi had some thoughts on how to implement clustering/segmentation. He thinks it would make sense to calculate probabilities for greedy clusters starting from a single point. We'd do this for each point, so we'd end up with n² clusters. We'd then find the best non-overlapping set of these. Though, this step might be quite time-consuming. To find the best next point, he thinks we might be able to simply take an SVD of the current cluster, then look at how the next point fits with the SVD. How does it decompose wrt U? If it fits within large singular value components of U, it will add little to trace norm. If it fits within small or zero singular value components of U, it will greatly increase trace norm. A method I like is to consider all possible partition sizes and find the best clustering within each size. I think we might be able to do the within-size clustering efficiently by computing SVD on X and comparing rows of U. Need to think more on this.
3/1/06 - Problem solved, I think. Assume we are given supervised data Y, a partitioning ρ(Y). Learn λ with Y and ρ(Y). Then, given data X, find the best partitioning by maximing likelihood of X using λ learned from Y. Tommi agrees that this is an okay way to do things. He thinks it's a bit sketchy that we assume flat prior over partitionings and improper (flat) prior over λ, but okay for version 1. We can talk about improving this in later versions.
2/28/06 - We have a problem. Yes, we can figure out optimal model for a fixed data set, but this is much like PLSA---we need additional framework for learning. Consider the case that we are given some supervised (classified) data. We can use this to find an optimal partitioning on that particular data set, but we can't apply it to new data. Also, note that if we could select λ, we still have the problem that it would be unnatural to select a structure for that data. Need a prior over structures and we should integrate out structures or some such. Tommi noted that this is starting to get drawn out---need to bring it to a close soon.
2/28/06 - Proved trace norm inequality using the fact that the trace norm is convex.
2/27/06 - Might be nice to write-up the practical issues of doing a bottom-up or top-down clustering/segmentation. E.g. define two partitionings to be of the same type if sets within the partitionings can be matched up by size. Two partitionings of the same type have the same sum of log-normalization constants, so only one partitioning of each type might be selected---the one with lowest sum of trace norms. It seems that each type must be evaluated separately, but the number of types is O(n²). So, if we can evaluate efficiently, O(1), within each type (and we can calculate trace norm dist. norm. const. efficiently, O(1)), then our overall running time is O(n²).
2/27/06 - Did a write-up of the derivation of the Jacobian of the SVD, with application to the trace norm dist. norm. const. Nice to finally have that written up.
2/27/06 - Turns out it's trivial to show that log-normalization constant is smaller when trace norm is larger. I did a short write-up. I also did a write-up showing that the model score is a linear function of λ (and hence each model is preferred only for a contiguous region (&lambda₁,&lambda₂) ∈ ℜ). What's next? Would be nice to write-up the Jacobian derivation...
2/24/06 - Presentation on structure learning with trace norm regularization:
- Trace norm
  - Sum of singular values
  - Convex: if |X|_Σ=a and |Y|_Σ=b then |λX+(1-λY)|_Σ≤λa+(1-λ)b
  - If singular values restricted to ≤ 1, trace norm constraint traces out convex hull of rank constraint
- Low-Rank Learning
  - High-level goal is generalization---given a data set, want to find a representation that includes all of the structure and eliminates the noise.
  - Can use trace norm regularization to find a low-rank representation/solution.
  - Nati & I worked on this for collaborative prediction (MMMF).
  - Simple method is to minimize sum of loss and trace norm regularization.
- MMMF for text
  - Want to set up a generative model for text using trace norm as part of prior.
  - P(Θ) = exp(-λ|Θ|_Σ)/Z_λ
  - Use natural parameter multinomial: P(y|θ) ∝ ∏_i (exp(θ_i)/(∑_i' exp(θ_i')))^y_i
  - Want to maximize joint/posterior: P(Θ)P(Y|Θ), where we take product of individual document likelihoods to get collection likelihood
  - Easy since we don't need to calculate partition function
- Structure Learning
  - What if you want to learn a segmentation where each segment is modeled as a multinomial/trace norm? We want to find segment boundaries and number of segments.
  - Many other problems fall into model selection framework---clustering, co-reference resolution, learning gene expression pathways, etc.
  - But, now we need to calculate partition function in order to compare models. John will explain this.
  - What's good: select clusterings that minimize sum of trace norms. Maximize overlap within clusters and minimize overlap between clusters. Will put together vectors that can be encoded w/ same basis.
  - What's bad: highly sensitive to scale. For a particular value of λ, changing scale of data (multiplying by constant) may affect decision. But, we aren't interested in exact relation between λ and clustering, but rather the trajectory.
- Example
  - Consider two structures: (1) single model, and (2) a partition into two models. We only consider simple data generated from trace norm model (you can think of multinomial parameters being matched to data). We calculate f(λ) = log P(X) - log P(X₁)P(X₂). [show what we wrote on board yesterday in Tommi's office] Choice of structure is monotonic as a function of λ
2/23/06 - John, Tommi & I had a great meeting today that cleared up the issues that weren't clear to me concerning trace-norm structure learning. So, it happens that if we hold lambda fixed, then scaling of the data can affect the decision of which structure to select. This (at first glance) seems like a horribly, terrible thing. And, it took quite some time to convince Tommi that we could salvage trace norm clustering/model selection. But, as we scale λ, we move between all possible clusterings. And, it is the way that the algorithm clusters that is important. Also, we can show that the clustering is monotonic as function of λ. I.e. the exact scaling of λ varies by data set, but we can move between the different clusterings by scaling λ. Note that the way this algorithm clusters is interesting---it looks to group together data that minimially increases the summed trace norm. I.e. it tries to separate data that is perpendicularly and group data that is highly overlapping.
2/22/06 - In optimizing the product of probabilities, we may prefer a fine-grained clustering over a coarse-grained one! Yeah! The trace norm always increases when you split, but the normalization constant may decrease. Consider splitting a 3x1 matrix. Vol(V_3,1)=4π; Vol(V_2,1)=2π; Vol(V_1,1)=2. The Σ integral is 2 for 3x1, and 1 for 2x1, 1x1. So, the normalization constants are Z_3,1 = 2³π, Z_3,1 = 2π, Z_3,1 = 2. We're looking at maximizing the product of probabilities or minimizing the sum of negative log-probabilities. The objective is |X|_tr + log Z_n,m. The trace norm increases with a split, the norm constant sum decreases with a split (at least in above example), so the optimization will not necessarily select the trivial split.
2/22/06 - It is also intructive to look at derivatives for the simple z=xy example. Let J(z)=(z-z₀)²+λz. This function has the same minimum: J'(x,y)=(xy-z₀)²+λ/2(x²+y²). ∂J/∂z=2(z-z₀)+λz. ∂J/∂x=2y(xy-z₀)+λx. ∂J/∂y=2x(xy-z₀)+λy. Note that at x=y=0, ∂J/∂x=∂J/∂y=0, whereas at z=0, ∂J/∂z is generally not equal to zero. This is why we had trouble with optimization when we set the initial parameter vector to zero.
2/22/06 - In our discussion, John and I looked at the simple example of z=xy. It is very instructive. It seems that it is very easy to find a continuous f: X -> U,V function at U₀V₀^T for any (U₀,V₀). For example, when x≠0, we can use f(z)=(z,1). When y≠0, we can use f(z)=(1,z). When x=y=0, we need a slightly less trivial function, such as f(z)=(sign(z)sqrt(|z|),sqrt(|z|)), where sqrt(z) is the positive square root of z. Similarly, for larger examples, it is easy to come up with a continuous function when U and V are of full rank. Let rows > cols; note that V is square; U = XV^-1. Trouble arises when U and/or V is not of full rank.
2/22/06 - Wrote-up obj/grad for natural parameter multionimial MMMF, including matlab code. Had a discussion with John about convexity of the transformed problem (X=UV'; what happens when we optimize over U,V?). We made a useful observation. If the objective does not depend on the representation, then in order to prove that we will find the optimum using the alternate representation, we must show that for any starting point, (U₀,V₀), for any continuous path X_p starting from U₀V₀^T, there is a corresponding path in U,V-space. This follows as a result of the following (which we have not proven yet): for any point, (U₀,V₀), in U,V-space, there exists a function f: X -> U,V that is continuous at U₀V₀^T. However, this does not solve our problem, as the U,V-objective depends on the U,V representation of X (i.e. we can increase the objective w/o changing UV^T).
2/18/06 - Wrote obj/grad code for MMMF/least squares, used it to write a simple demo script showing that MMMF prefers to zero-out singular values. Dennis noted that netlab's line search code may be better than mine b/c it is very selective about using the gradient calculation. Should check it out. I downloaded the latest netlab to ~matlab/netlab.
2/15/06 - John made a nice observation that summarizes something I had noticed. The minimum over partitions [A B]=X or [A;B]=X of ||A||_Σ+||B||_Σ is A=X, B=[] (or vice versa). My question is thus: what happens when you minimize P(A)P(B) where P is the trace norm distribution? You still get the sum of trace norms, but you also get log-normalization-constant terms. How do they affect the minimum?
2/15/06 - Hmm.... I was thinking of the unnormalized version. What happens when you normalize? Any chance that the more discrete model would improve a posterior or likelihood?
2/15/06 - There's a problem in model selection with MMMF---simpler models will always be preferred. Any additional structure imposes limitations which will decrease the posterior and/or likelihood. I think we can even prove this. However, I can see cases where structure is useful. Consider two levels of trace norm distributions---one top-level "parent" node and many second-level "children" class nodes; the parent houses one "topic" per class; the children house the actual document parameters. If two classes have similarities (consider a "general English" topic), the model will find it most efficient to put that topic at the parent level. If it is specific to the class, the model will want to place it at the "child" level. Here's the trouble: increasing the number of children will always lead to a lower joint probability. Is there a rationalization for creating such a structure? Why not always just have one child, which is equivalent to the simplest MMMF-type model.
2/7/06 - Need to do a write-up of the partition function work, including introduction and data on how good the approximation is. No need to motivate trace norm regularization; that's already been done in earlier works. Our contribution is twofold: (1) application of trace norm regularization (MMMF) to text modeling, (2) calculation of the normalization constant for trace norm distribution.
2/7/06 - John made a good point---there is really no need to keep track of values much smaller than about MAX*10^-22. Double machine precision records about 16 (base 10) significant digits (52 bits). 10^6*10^-23 + 1 = 1 to machine precision (since 10^6*10^-23 is too small to affect the mantissa of 1.
2/7/06 - We can get a very good approximate upper bound on integral using sampling methodology. Current problem is that in the same batch, log_10 values will be more than 600 apart. I.e. they won't be representable in IEEE double format. My approximate solution is to round up all values less than MAX*10^-600 where MAX is the largest sample value.
2/7/06 - Some good news is that John got the variational approximation working. The results seem to be tighter than simply applying the log-convexity bound. Though, he did run into trouble with large m since the number of variables becomes quite large.
2/7/06 - We've decied to skip ICML and focus on preparing a submission for UAI. I thought it might be okay to submit the normalization constant stuff to ICML and submit the text segmentation application to ICML, but Tommi says that would weaken the UAI submission significantly since we couldn't claim the normalization constant calculation as a new contribution. And, we both doubt that the ICML submission would get accepted because ICML is fairly application-focused. So, we might end up with a portion of the work in limbo.
2/4/06 - The more exhaustive test makes it look like the scaling has no effect whatsoever on the difference of log-values (max-min). I tested from r=-3 to r=3 in 0.1 increments using 10k sample (sets). Max-Min of log-values ranged from about 3200-2600, but there was no clear trend. All of the variability seems to be due to sample noise. I had a lingering suspicion that this might be the case, since we are simply dividing by relatively stable values. And, the goal is to reduce the ratio between max and min values---dividing by a constant doesn't affect that. When I chatted with Tommi, he pointed out that the difficult term is (σ_i-σ_j). The other factor of the difference-of-squares is relatively stable, (σ_i+σ_j). The difference throws things out-of-whack. And, (σ_iσ_j)^r isn't sufficiently well related to the difference to be important.
2/4/06 - Well, one of the ideas Tommi and I talked about helped marginally, the other simply made things worse. What worked okay was to divide difference by the product (taken to a power). r=0.77 turned out to be a good power. This yielded numbers (comparatively) close to 1, but for m=100, the ratio between max and min were still way too large. Bzzzt! Actually, this is wrong. I had the impression that r=0.77 was good b/c it lead to values closest to 1 (log-values closest to 0). I did not look at the difference between min and max log-values, which is the most important quantity (if this is < 600, we we are set). In fact, in preliminary tests, it looks like r=0 gives the smallest max-min difference! I.e. the idea of dividing by the product only makes things worse. I'm currently running more detailed tests to confirm/deny this conjecture.
2/3/06 - The mean of the gamma distribution, where P(x;n,θ)=x^n-1e^-θx/Γ(n)/θⁿ, is θn.
2/3/06 - Tommi has a nice idea. Scale the difference of square σ terms by (σ_iσ_j)^r (where r ∈ (1,2)).
2/3/06 - I think the only solution here is to use the bound on the log-normalization constant. I don't see a way to use sampling to estimate the real normalization constant unless there's a package that can handle numbers of the order 10^50000. One thing that might make sense is to take the median log value rather than the average log value. This might provide a better estimate, especially since the weights on the values are all the same.
2/1/06 - I have Python sampling code based on the gamma distribution that seems to work quite well. Samples are close to all correct values we have computed with maxima. And, variance isn't horrible at the high-end of values that aren't too large for IEEE double---for 30x10 w/ theta=1.0, the range of values (100,000 samples) is 3 orders of magnitude, and most values are in (1e+291,1e+290).
2/1/06 - The Python Numeric package uses a different parameterization for the gamma distribution than I am used to. In fact, it does not even include a parameter I need (what I call 'lambda'). However, the effect of this parameter can be added post-hoc for sampling. Key is to multiply the sample by 1/lambda. Let P(x)=e^-x. Let Q(x)=e^-λx. We can sample directly from P, but not Q. To sample from Q, sample x from P, then multiply by 1/λ. To show that this is correct, look at cdf's. ∫₀^xe^-tdt = 1-e^-x. ∫₀^λxλe^-λtdt = 1-e^-x. Using the θ parameterization, Q(x)=e^-x/θ, we would multiply by θ.
1/27/06 - John has started exploring the idea of bounding the integral. Some of the bounds we're looking at:
- ∏_{i < j} (σ_i²-σ_j²) ≤ ∏_{i < j} σ_i²
- ∫₀^x f(t)e^-t dt ≤ ∫₀^∞ f(t)e^-t dt
- log-concavity: log(E_p[f(x)]) ≥ E_p[log(f(x))], where integrals serve as the expectations
1/26/06 - John thinks a good approach to calculating norm. const. might be to use variational techniques to approximate the integral itself. I wonder: can we very roughly approximate norm. constant and divide through by value for sampling?
1/26/06 - GOOD NEWS: we can approximate the norm. constant via sampling (John's code is sampling.R in papers/06-icml). Results of this are quire promising. It's fast, and the computation doesn't take exponentially longer as we increase matrix size (yeah!). Also, the numbers jive with the exact calculation of the norm. constant for small values of n & m (which means that we know how to write down the integral). BAD NEWS: normalization constant is too big for IEEE double for a 41x11 matrix. Fiddling with n & m, we were able to observe norm. constants ~10^300.
1/26/06 - We've found that a fairly successful way to approximate the normalization constant is to do importance sampling where the proposal distribution is exp(sum(-A(:))). For 2x2 matrix, the approximate value is close to the real value for as little as 1000 samples.
1/26/06 - Problem with doing A=UV decomposition is that we're dealing with two different spaces. Would need to sum over invertible matrices to convert between p(A) and p(U,V).
1/26/06 - Calculating the normalization constant (exactly) is officially impossible. John declared it. But, we think we might be able to do a very good job w/ sampling. One promising line of reasoning is to use the ||A||_tr = (||U||_F² + ||V||_F²)/2 identity.
1/26/06 - John got a copy of Muirhead out of the library. Looks like this is the reference. Looks like Edelman's notes are largely based on this book. Everyone references it. Edelman also pointed me to Ioana Dumitriu's thesis. Need to read these at some point.
1/26/06 - John and I are going to work on an ICML paper together. Idea is to put together as much of the trace norm distribution stuff we can. No hope of doing real experiments, so we'll just try to do some "proof of concept" experiments ("look ma! clustered sentences!")
1/20/06 - John has an idea for how to expand ∏_i<j (σ_i²-σ_j²), but it seems like it will still be computationlly difficult/impossible for semi-large values of m (like 100). Consider the integral over the σ_m. Let x ≡ σ_m. Only relevant terms are ∏_i=1^m (σ_i-x) (squares are discarded for simplicity's sake). Expanding this gives a polynomial in x where the coefficients are very regular. For m=4, we'll get -x³ + (σ₁+σ₂+σ₃)x² - (σ₁σ₂+σ₁σ₃+σ₂σ₃)x + σ₁σ₂σ₃. We end up with an integral that is a (lower) incomplete gamma function. Only problems is that each integral makes things more complicated. Each of these terms must be incorporated into a ∏_i=1^m-1 (σ_i-x) product where now x ≡ σ_m-1.
1/17/06 - I think I understand why dr^(rdq) = r^n-1dr(dq). q only has n-1 independent parameters, and r is a scalar that is applied to all elements of dq. So, when it is pulled out, it becomes r^n-1.
1/16/06 - I need to understand handout #4 as best as possible for tomorrow.
1/10/06 - The Householder reflector is a nice way to obtain an orthogonal matrix with first column equal to a given vector q. Let v=e₁-q, where e₁ is the first column of the identity matrix. Then the Householder reflector is H(q) = I - 2 vv^T/v^Tv.
1/9/06 - Our integral is ∫ exp(-||X||_tr) dX. Theorem 3 in Edelman's Handout #3 gives us the translation of dX when we do a change of variables to the singular value decomposition. The change of variables allows us to express exp(-||X||_tr) as exp(-∑ σ_i). Edelman's Handout #4 gives us an explicit formulas for (H^TdU)^{^} and (V^TdV)^{^}, and those terms are separate from the rest of the integral, so we are left with a integral of a simple exponential times a large polynomial. Note that the first product in the Jacobian is for i < j ≤ p, so it scales according to the smaller dimension. One difficulty is that the singular values are ordered, σ₁ ≥ σ₂ ≥ σ₃ ≥ ..., so the limits of integration depend on the singular values.
1/6/06 - John and I were able to work out our first integral. Note that UΣV^T decomposition fixes singular values to be ordered (from largest to smallest). Hence, limits of integration are not going to be 0 to infinity for all variables, only the last variable. Edelman gives us the formula for the volume of the U and V terms. We need to cancel-out for the fact that we can change the sign of columns of U and V and see no change in X (or do we?). For 2-by-2, John calcualted Vol(V_m,n) = 4π. Maple can evaluate the integral ∫₀^∞ ∫₀^σ₁ (σ₁²-σ₂²) exp(-σ₁-σ₂) dσ₁ dσ₂ = 3/2
1/6/06 - ∫₀^∞ x^n-1e^-x dx = Γ(n)
1/4/06 - Finally met with Tommi after all these weeks of going without meetings. I think the last meeting was about 7 weeks ago (11/15/05). Anyway, we had a nice chat. Appears that all this work that has gone into trying to understand Edelman's course notes has been useful. Tommi looked at the SVD Jacobian and thinks it is the what we need to work with. Also, he noted that handout #4 gives the calculation of the U and V quantities (volume calculation).
1/4/06 - Wow. There's a collaborative filtering data set sitting right in front of me: www.metacritic.com. They compile game reviews, converting scores to a 100-point scale (when the article itself doesn't include a score). Seems like a natural fit into the framework Nati and I have use (game=movie,each magazine is a user). Plus, the article text allows one to do predictions without the scores---similar to what I would like to do for restaurant discussions.
12/27/05 - dRR^-1 is an upper triangular (R is upper triangular). dR is obviously upper triangular, so all we have to show is that R^-1 is upper triangular. For this, we will use Cramer's Rule. Cramer's Rule involves cofactors. The cofactor of an entry of a matrix is the determinant of the sub-matrix excluding the row and column of the entry, with sign corresponding to the sum of indices. C_ij = (-1)^i+j(det M_ij), where M_ij represents the matrix with row i and column j excluded. In a 2-by-2 matrix M=[a b; c d], the cofactor matrix is C=[d -c; -b a]. Cramer's Rules says that (M^-1)_ij = C_ji/(det M).
12/27/05 - Q^TdQ is an antisymmetric matrix (Q is orthogonal). To see this, start with the identity, I=Q^TQ. Take the derivative, 0 = dQ^TQ + Q^TdQ. Thus, -dQ^TQ = Q^TdQ. dQ^TQ = (Q^TdQ)^T, so Q^TdQ must be antisymmetric.
12/27/05 - Here's how to apply a scalar function, f, to a diagonalizable matrix using power series. Represent f as its power series, f(x) = ∑_i=1^∞ a_i xⁱ, for some sequence of constants, {a_i}. Let X be our matrix; since it is diagonalizable, it can be written X=QΛQ^T where Q is orthogonal (i.e. QQ^T = I). Note that powers of X are simple to compute, Xⁿ = QΛⁿQ^T. Applying the power series expansion to f(X), we get f(X) = ∑_i=1^∞ a_i QΛⁱQ^T = Q(∑_i=1^∞ a_i Λⁱ)Q^T. Note that the inner sum can be written as the diagonal matrix with ∑_i=1^∞ a_i λⁱ as the j^th diagonal element.
12/27/05 - Another nice observation by Edelman that I didn't realize: If X=ZΛZ^-1 (X can be diagonalized, which is similar to, but not equivalent to X being invertible), then f(X) = Zf(Λ)Z^-1, where f(Λ) = diag(f(&lambda₁),...,f(&lambda_n)). I.e. if f is a scalar (ℜ to ℜ) function, we can calculate f(X) by applying f to X's eigenvalues individually.
12/27/05 - There's a typo in section 3.0.1---it should be
dY = (I⊗X + X^T⊗I)dX
Note that in the calculation of the determinant, ∏_i (λ_i + λ_i) = 2ⁿ(det X). Similar logic is used to obtain the expression for general powers, though neither John nor I completely understand the manipulations. And, neither of us have a good handle on the formula for general functions...
12/27/05 - Edelman: "The Jacobian of a linear map is just its determinant." This is obvious once you observe that a linear map is its own derivative.
12/27/05 - Comment #5 says that for ⊗_sym and B=A,
det J = ∏_i≤j λ_iλ_j = (det A)ⁿ⁺¹
Note that in the product, λ_i appears n+1 times. One way to see this is that there are n(n+1)/2 λ_iλ_j terms, each of which has 2 lambdas. There are n different λ_i, so each λ_i must occur n+1 times. Another way to see this is to consider λ₁. It appears in the following terms: λ₁λ₁, λ₁λ₂, λ₁λ₃, ... It appears in n terms, but appears twice in one of those terms, for a total of n+1 occurrences.
12/27/05 - Note that comment #2 under Linear Subspace Kronecker Products in Edelman's #2 handout provides the correct product. Intuitively, one expects the second term to be subtracted from the first. In fact, it is since (BXA^T)^T = -AX^TB^T = AXB^T. The last equality is due to the anti-symmetry of X.
12/27/05 - Edelman states the following theorem: det(A⊗B) = (det A)^m(det B)ⁿ where A ∈ ℜⁿ and B ∈ ℜ^m. Here's the proof (mostly copied from Edelman's handout). Assume that A and B are diagonalizable, with Au_j = λ_ju_j, and Bv_i = μ_iv_i. Let M_ij = v_iu_j^T. These mn matrices form a basis for ℜ^mn. Why? Because the v_i form a basis for ℜⁿ and the u_j form a basis for ℜ^m. The M_ij are all possible outer-products of these vectors, yielding a basis for ℜ^mn. These M_ij are eigenvectors for A⊗B. Note that A⊗B is used as a function to which an argument is applied. A⊗B applied to X is BXA^T (by definition). The M_ij are eigenvectors for A⊗B because BM_ijA^T = μ_iλ_jM_ij. The determinant is
∏_{1≤i≤n, 1≤j≤m} μ_iλ_j = (det A)^m(det B)ⁿ
Note that the number of instances of λ_j in the product is equal to the number of distinct μ_i's. Hence, the reason that (det A) is taken to the power of the dimesion of B and vice versa.
An alternate proof is to note that A⊗B = (A⊗I)(I⊗B) and that the determinant of a product is the product of determinants. Note that the first I is m-by-m; the second is n-by-n.
12/27/05 - Note that the Kronecker product has m²+n² parameters (viewed as a function or mapping). It is a mapping from &real^mn to &real^mn, so it is very limited (an unrestricted mapping between those spaces would require m²n² parameters). John and I compared this to another (less complex) mapping. The name escapes me. Need to ask John when he comes in... Ah, here it is, the Kronecker product is like a matrix outer product, going from two rank 2 things to a rank 4 things (similar to how outer product takes two rank 1 things (vectors) and turns them into a rank 2 thing).
12/27/05 - Something I didn't know before: the determinant of a matrix is the product of its eigenvalues.
12/27/05 - According to Edelman: "...if a little box of n dimensional volume ε surrounds x, then it is transformed by y into a parallelopiped of volume |det J|ε around y(x). Therefore, the Jacobian |det J|, is the magnification factor for volumes."
12/27/05 - An antisymmetric matrix is one with A=-A^T. Note that an antisymmetric matrix has zeros along the diagonal.
12/21/05 - Edelman taught a course on random matrices this fall, 18.338. Wish I would have taken it!
12/21/05 - I now undertstand the beauty of the Kronecker product, denoted as ⊗. An important identity is (Matlab notation): if Y=B*X*A', then Y(:)=kron(A,B)*X(:), or Y(:)=A⊗B*X(:). Some useful properties:
- (A⊗B)^T = A^T⊗B^T
- (A⊗B)^-1 = A^-1⊗B^-1
- det(A⊗B) = det(A)^mdet(B)ⁿ, for A ∈ ℜ^n,n, B ∈ ℜ^m,m
- tr(A⊗B) = tr(A)tr(B)
- A⊗B is orthogonal if A and B are othogonal
- (A⊗B)(C⊗D) = (AC)⊗(BD)
- if Au=λu, and Bv=λv, then if X=vu^T, then BXA^T=λμX, and also AX^TB^T = λμX^T. Therefore A⊗B and B⊗A have the same eigenvalues, and trasposed eigenvectors.
These properties are taken from Alan Edelman's 18.325 course notes.
12/20/05 - Take the extension of this example to 4-dimensions. f(w,x,y,z) = (-w+x+y+z,w-x+y+z,w+x-y+z,w+x+y-z). Note that \|f(w,x,y,z)\| = 2\|(w,x,y,z)\| (i.e. the output vector is twice the length of the input vector). The Jacobian is 1-2I (matrix of all one's minus twice the identity matrix---Jacobian is +1 everywhere except the diagonal, where it is -1). The determinant of the Jacobian is -2^4. I.e. it is the the degree to which the vector is stretched (2), to the power of the dimensionality of the space (4). The minus sign simply tells us that there is a reversal of direction. Note that if we change all the signs in the function, we would get the same determinant.
12/20/05 - Here's a simple Jacobian example that provides some intuition. Consider a function from two variables to two variables, f: R^2->R^2, f(x,y) = (x-y,x+y). df = (dx-dy,dx+dy) and J=[1 -1; 1 1]. Determinant of the Jacobian is 2. If we calculate the length of f, we find that it is sqrt(2) times the length of (x,y). I.e. in this case the determinant of the Jacobian is the square of the change in length (I think because there are two variables---if there were three variables, it'd be the third power).
12/20/05 - Edelman suggests to read Loan's paper on the Kronecker product to better one's understanding.
12/20/05 - Starting to have some understanding of the determinant of the Jacobian. I've gone through Edelman's initial Jacobian examples. Discovered that symbolic Matlab is great for this stuff. Maple also works, but the matrix tools are clumsy. It's worth noting that the identity function, Y=X yields a Jacobian determinant of 1. The Jacobian is the identity (since it is the derivatives of the variables).
12/16/05 - Prof. Edelman makes a very insightful point in comparing the derivative of the function to the Jacobian. He says that they encode the same information. In fact, the Jacobian stores the coefficients for the differential terms.
12/16/05 - I was trying to compute X^2 as X^TX. That's not right. X^2=XX.
12/16/05 - John pointed me to some lecture notes on random matrices. Dunno if it'll solve my trace norm problem, but they'll certainly give me some better understanding of how to do calculus with matrices. Handout 2 steps through what it means to take the derivative of a function with respect to a matrix. One valuable tidbit from the handouts (written by Prof. Alan Edelman) is that you should think of matrices as scalars that don't commute. I.e. d(x^3)=3x^2=x*x*dx+x*dx*x+dx*x*x. So, d(X^3)=X*X*dX+X*dX*X+dX*X*X.
12/13/05 - Thinking about the alternate MMMF optimization again. Note that all zero parameters will yield zero derivative for U and V, but not \theta. However, \theta will adjust and find a zero derivative for the entire set of parameters (\theta will be at a local min. conditioned on U & V).
12/9/05 - There is a special way to do numerical integration in Maple. The help page is "?int[numerical]". Roughly, the evalf and int calls must be made on the same line (you'd can't assign the int result to a name, then try to evalf it).
12/8/05 - Google has a nice information page for webmasters.
12/6/05 - In fact, Maple can simply solve my integral! As many digits as I want. It takes care of all the numerical stuff for me. :)
12/6/05 - Maple can do analytic derivatives quickly and easily! Perfect for doing the Taylor expansion of exp(sin(x)).
11/28/05 - Note that the integral is the same for all orientations of x_1. The sum of the singular values is the length of x_1, plus $\sqrt{r^2-c^2} \cos \theta$, where $c$ is the length of x_1 and $\theta$ is the angle between x_1 and x_2.
11/28/05 - The HTML spec has a nice section on helping search engines index your web site.
11/28/05 - Given a vector, finding a perpendicular vector is a simple linear programming problem. In particular, given a 2-d vector, x=(x_1,x_2), a perpendicular vector is y=(x_2,-x_1). For the 2-by-2 problem, we are interested in integrating over (1) all radius (r) values, (2) all lengths of x_1 (<= r), (3) all directions of x_1, (4) all directions of x_2.
11/23/05 - Okay, I now understand how to calculate the volume of a sphere. Back to the trace norm distribution. Simplified trace norm normalizer is \exp(-r) integrated over sphere surface area from r=0 to r=\infty.
11/22/05 - Note that equations 5 & 6 are simply calculating the n-dimensional Gaussian normalization constant.
11/22/05 - Consider a small square of surface "area" on a unit n-sphere, measuring x on each side (total area is x^{n-1}. Now project that same square to the surface of a n-sphere with radius r. Each side is now rx, so the area is (rx)^{n-1}, or r^{n-1} times the original area.
11/22/05 - Okay, I think I understand it. The important question is: how does the surface area change as a function of r? For a 1-sphere, the surface area is a constant 2. For a 2-sphere, the surface area is r times the unit sphere surface area. For a 3-sphere, it's r^2. We're integrating over one dimension. While that is happening, the area across the other dimensions are expanding---hence the reason for the r^{n-1} term.
11/22/05 - To know the best way to calculate the volume of an n-sphere, it would help to understand the volume-surface relation of which Mathworld speaks, V_n = \int_0^R S_n r^{n-1} dr, where V_n is the volume of a radius R n-sphere, and S_n is the surface area of a unit radius n-sphere. Why the r^{n-1} term?
11/18/05 - Fixed the sum limits. Should have been n/2. My formula is correct for n=2 (3-sphere).
11/18/05 - Okay, my power of R is correct. Note that in calculating the volume of an (n+1)-sphere, you take the integral of (R^2-x^2)^{n/2}, which yields an R^{n+1} term. But.... somehow, we end up with a sum from i=0 to i=(n+1)/2 with n even. We can't have that!!!!
11/15/05 - Need to do calculation of volume of n-sphere (for myself). Need to do calculation of trace norm distribution for a two-row matrix (using set-up Tommi & I talked about).
11/15/05 - Talked about the thesis outline. One significant comment that Tommi had is that I need to provide arguments for the statements in my motivation, such as the fact that existing techniques perform poorly on informal communication. My NEE experiments are evidence of this. Would also be good to provide other evidence. What kind of evidence? Might be useful to provide some specific examples. Could we do experiments to look at how well certain small sets of features perform on NEE? Maybe do some empirical studies? I.e. what % of newspaper NE's are capitalized; what % of restaurant bboard NE's are capitalized?
11/15/05 - This is a very useful note on the sphere in Euclidean n-space. There is also a version of the note in HTML converted from Latex. Unrelated to research, but good to know, the math FAQ also has an explanation of how to compute compound interest.
11/15/05 - Here's a good question: how do I calculate the volume of a sphere? Or, how do I calculate the surface area of a sphere?
11/15/05 - What if we were to do the integral simply ignoring one direction's contribution to length? Is that a good simplified version to try?
11/14/05 - So, we do have the basic distribution derived. Now, how do we deal with the conditioning? Before, we were integrating over shells multiplied by a function of the radius. Now, direction is important. We will need to do a double integral. One over the "old" radius, one over the "new" radius. Old radius only extends out to the length of vectors 1 through k-1. Radius of shell is \sqrt{r_n^2+r_o^2}. Note that "old" radius is not the sum of lengths of previous vectors, it's the length of maximum extents in perpendicular directions. Hmm... and I can't integrate over shells like I did before b/c the function is not constant over a shell (well, at least it'd be hard to do it that way...)
11/14/05 - Ack! I haven't calculated the log-norm eliptical distribution, where the regularization constants differ by axis/dimension. Wait, should we be looking at this distribution? No! We're only going to have one regularization constant. Only dependence on location is the radius (axis/dimension neutral).
11/14/05 - We can treat the trace norm distribution as a product of distributions. Think: product of conditionals. The first is a simple log-norm distribution. The second ignores length corresponding that overlaps with the first. The third ignores length that overlaps with the first & second. Etc.
11/14/05 - We can think of the trace norm distribution as a cascade. The first row of the matrix is handled just like the log-norm distribution. For the second, we separate the vector into two parts: the part that can be constructed using the first vector, and the new part. Only the "new" part contributes length for the distribution. The old part is ignored. Note that the first vector can only account for the old part of the second vector up to its length. Length beyond that must be accounted for and tagged "new".
11/14/05 - The distribution whose negative-log-likelihood is the sum of lengths of the row vectors of a matrix is simply a sum of log-norm distributions.
11/14/05 - Spent some time last week putting together a thesis outline. I think it's a good start. And, it'll be a good way to get the ball rolling and start to really make progress with Tommi in terms of getting a thesis fleshed out.
11/8/05 - At Tommi's suggestion, I need to spend some time this week not working on the trace norm integral. I need to spend some time putting together a detailed thesis outline with a story-line; it should have descriptions of motivations for each part and how the various parts flow together.
11/8/05 - Tommi took a look over my log-norm (not log-normal) distribution write-up. Looked good to him. We had a short conversation about how to calculate the integral. We came back to the issue of finding the volume of matrices with a particular set of singular values. I still can't get my head wrapped around the problem. Some thoughts: we're integrating over a n-by-k space. This is very important---I've been causing myself confusion b/c I've been thinking in an n-dimensional space. We can think of it as a set of axis-parallel vectors (the singular values) that we're rotating.
11/7/05 - Mathworld and Planet Math have useful discussions of the hypersphere.
11/6/05 - Started to look at the distribution such that the unnormalized log density is the negative vector length (L2-norm). This does not decompose like the Normal, so it does not seem to be as easy to generalize to large n as was the case for a Normal-like distribution. For n=1, it's a simple exponential. For n=2, we can change to polar coordinates then use integration by parts. This matches Ali's result. But, I'm not sure how to generalize... Polar coords are specific to 2-D, as is the change of variables formula (eqn. 17.14 in Swokowski).
11/6/05 - Successfully worked out the Gaussian(-like) normalization constant (finally!). Since an n-D standard Normal is simply a product of 1-D standard Normals, it's simple to generalize.
11/4/05 - John also described to me an idea for how we might calculate the trace norm integration constant via a change of variables and the Jacobian calculation. Let X' = [X;U^TU-I;V^TV-I]. Now, integrate X over +- infinity and the other parts of X' over [0,\lambda]. Use change of variables to do the integration over U,S,V. Then, to get the integral over only X, take the partial wrt \lambda and evaluate at \lambda=0. This forces the constraints to hold. One issue I see is that U^TU-I and V^TV-I aren't variables. How do you integrate over them?
11/4/05 - John showed me how to calculate the normalization constant of a Gaussian. It requires a "trick." I knew the trick at some point, but had forgotten it until John showed it to me again. Here it is. Don't try do do it in 1-D, do it in 2-D. I.e. integrate the product of two Gaussians. And, convert to polar coordinates. dx dy converts to r dr d\theta. Then, we have re^{-r^2}, which is easy to integrate. This page also describes the procedure.
11/3/05 - WikiPedia has a nice table of integrals (like the kind you might find in the back of an undergrad Calculus textbook). They also have a nice discussion of the exponential function.
11/3/05 - Sam Roweis has some very useful notes pages, including one on matrix identities and one on Gaussian identities.
11/2/05 - John pointed me toward the field of "Geometric Algebra," in particular, a paper by Gull, Lasenby and Doran, which describes the basics of the topic. Lots of overlap with differential geometry.
11/2/05 - We cannot simply integrate over hyper-ellipsoids! The set of matrix corresponding to a vector of singular values cannot be represented as the surface of an ellipsoid. A matrix is a set of (orthogonal) points on an ellipsoid.
11/1/05 - Do this for thesis: show that exp(-c\|X\|_tr) is max. ent. distribution w/ constraint that mean sv is c.
11/1/05 - We can use the USV^T decomposition. Multiple USV decompositions yield the same X, but the duplication is measure zero, so it's unique as a far as integration is concerned. Note that if two singular values are equal to each other, then the corresponding rows/columns of U,V can be swapped and yield the same X. So, we can try to do a change of variables where we calculate the determinant of the Jacobian of the X->USV transform. Note that we probably need to add bogus dimensions of X so that the Jacobian is square. Note that the derivative of USV with respect to S has determinant 1. And determinant of USV wrt to U or V has determinant something like product of singular values. Note that the volume with a particular value of trace norm does not increase exponentially, hence the integral is finite (would be good to show this).
10/25/05 - A reason to want a separate regularizer for the hierarchical terms and the sequential terms is that the trace norm scales differently than the L2-norm. L2-norm terms scale as the sum of the individual values. Trace norm scales more like the max (in particular, the sum of max-es of each dimension). So, the nature of the data will dictate different values for the regularization constants.
10/25/05 - Before realizing that the normalization constant was sufficient for directing model choice, Tommi suggested weighting the trace norm values. I pointed out that if the weights are 1, the flat model will always be chosen. If the weights are 1/n, the hierarchical model will always be chosen. It would be good to prove this (or at least more carefully state what I mean).
10/25/05 - Met w/ Tommi. Talked about some applications of this framework. One interesting thing we could do would be to learn the structure of the data. I.e. don't fix the structure (hierarchy, sequence); find the structure that maximizes marginal likelihood of the data. Tommi convinced himself that the model would do the right thing---if the data is clearly hierarchical, marginal likelihood would prefer the hierarchy. One important point is that we marginalize out additional variables that are introduced. Otherwise, the MAP will decrease as the number of variables increases. Also, we will need to calculate normalization constants for the trace norm. That seems hard. Tommi couldn't come up with anything right away. Need to give it some thought... Also, how hard is it to marginalize in this model?
10/14/05 - Tommi likes this framework. Some notes: the trace norm for replies to a post and that post's paragraphs are separate. We can easily add sequential structure by adding L2 penalty between sentence models (just include in potential).
10/14/05 - Aha! The solution! Use a factor graph. I.e. the distribution is just a product of potentials where the potentials are exponentiated, negated trace norms. The normalization constant is constant (wrt the parameters of the model), so we can ignore it for optimiation.
10/14/05 - Tommi pointed out that we cannot use the inverted hierarchy model that John suggested because the joint distrubution over a set of children nodes is not just a product of their marginals---the joint distribution depends on the model differences (but not their absolute positions).
10/14/05 - The odd thing about using the trace norm is that only integration over singular vectors will sum to one. We really have a prior on singular vectors...
10/14/05 - Okay, we should use the trace norm separately for each parent (and its set of children), and use the sum of trace norms for the complexity penalty. Note that this trace norm sum is strictly larger than if we had used a single trace norm per parent "type" or level of the hierarchy. However, once the normalization constant is factored-in, this "separate" application of trace norms will give lower complexity to models with close similarity among siblings without similarity among cousins.
10/12/05 - The trace norm of a matrix composed of a single vector is the L2 length of that vector.
10/12/05 - Prove that L1 norm and trace norm are equivalent for a 1-by-n matrix. THEY AREN'T!!!!! Trace norm is L1 norm of the vector of singular values. The (first) singular value of a vector (treated as a matrix) is its L2 norm.
10/12/05 - More thoughts: L1 vector norm is equivalent to trace norm (of a 1-by-n matrix). No! No! No! It's not!
10/12/05 - Complexity of model should be sum of (1) sum over all parents of trace norm of the set of parent-children difference vectors, and (2) sum over all neighbor pairs of L2 norm of difference vector.
10/12/05 - Finally, categories and top-level posts. They should be similar to a "root"; no extra "neighborhood" similarity.
10/12/05 - Next, the document/post. If there is categorical structure, document should be similar to its category; no extra "neighborhood" similarity. If there is thread structure, post should be similar to the post to which it is replying; no extra "neighborhood" similarity.
10/12/05 - Next, the sentence. Each paragraph should be similar to neighboring paragraphs. Also, each sentence should be similar to the document as a whole.
10/12/05 - From the ground up. First, the sentence. Each sentence should be similar to neighboring sentences. Also, each sentence should be similar to paragraph as a whole.
10/12/05 - Realized that trace norm is only appropriate if the set of vectors eminate from a common origin. So, it is appropriate to use the trace norm on the set difference vectors for a set of children with the same parent. A stretch is to use a single trace norm on difference vector sets corresponding to parents of the same type (should we do this?).
10/5/05 - Nice chat w/ Tommi. He likes the sequential idea, but he made an important point---that it may not be appropriate for there to be a specific number of levels. The structure of the data may warrant otherwise. For example, within a thread, posts should not be simply ordered by date. Rather, the hierarchy structure should follow the structure of the data (reply should be child of the original post). Tommi also suggested that the regularization be done on a per-level or per-type basis. I.e. the change between the root and thread is very different in nature than the change between a paragraph and a sentence. Also, he questioned the sensibility of allowing "free" paragraph/sentence changes simply because they occurred higher-up in the hierarchy (since the trace norm only penalizes the largest change in a certain direction). I.e. there should be one trace norm penalty for each level of the matrix (in a simple hierarchy where there is one level per type of data) or one trace norm penalty per pair of types (e.g. thread/post, post/post, post/paragraph, etc.). One thing I observed during the discussion: when determining which vectors could be pruned, it is essential to look at the derivative, not the actual size. Some vectors will touch the boundary of the hyper-ellipsoid, but would not grow longer if their likelihood were weighted heavier; these can be safely pruned. Others would grow longer; these are the important ones.
9/29/05 - Finally started to look at the papers Erik Sudderth pointed me to: Malioutov/Cetin/Fisher/Willsky paper on superresolution, Malioutov/Cetin/Willsky paper on Laplacian prior, Cetin/Malioutov/Willsky paper on variational technique. Hmm... it looks like they're using the L1 vector norm (not trace norm). Some high-level similarities, but not sufficiently related to warrant further investigation, I think.
9/29/05 - Saw Alan give a talk on a Bayesian version of the CRF. Here's the Qi/Szummer/Minka paper. I asked Alan why he thought the Bayesian version is better. He argued that model averaging is the right thing to do. I retorted that you couldn't find a model for MAP that would yield the same distribution over the data. I think the Bayesian version gives you a within-document heavy-tailed distribution, due to the posterior updating that goes on. I.e. the Bayesian version is to be preferred if data is more similar within document than between documents. Couldn't convince Alan of this, though...
9/22/05 - Just finished building a new machine for home. The parts:
- Motherboard: ECS 755-A2
- Memory: 512M Rosewill PC3200
- CPU: AMD Athlon 64 3700+ (754 pin Clawhammer, 1M cache)
- Video card: ATI Radeon 9600 128M Atlantis
- Hard Drive: Seagate Barracuda 80G PATA
- Optical Media: Sony DVD/RW, CD/RW
- Case: Apex SK337-A 350W PS
What's amazing is that in my numerical computation tests (matlab gradient descent, matlab matrix inversion, python), it outperforms the dual Xeon 3.6Ghz (2M cache) server machine that was purchased for our group just two months ago. Identical software (OS is AMD64 Debian Sarge). My guess? Intel has lost its crown to AMD. Just imagine if I had used top-of-the-line motherboard, cpu & memory!
9/22/05 - Tea talk for today
1. Matrix norms
  1. Define singular values
    1. Most general definition: $X \vec v = \sigma \vec u$ or $X^T \vec u = \sigma v$, \sigma is a singular value, u, v are the singular vectors for the matrix $X$
    2. More common definition: singular values are diagonal entries of S in singular value decomposition, $USV^T=X$. U, V are orthogonal matrices with transpose equal to inverse, UU^T=I, VV^T=I. Here, all of the singular vectors are perpendicular and ordered corresponding to largest-to-smallest singular values.
    3. Singular values are always positive
  2. Rank is number of non-zero singular values (L_0 norm of vector of singular values)
  3. Trace norm is sum of singular values (L_1 norm)
  4. Frobenius norm is square root of sum of squared singular values (L_2 norm); also: square root of sum of squared entries of matrix
2. Convex Hull proof
  1. Space of rank constrained matrices is not convex. Take two rank one matrices, for example, two matrices each with a single non-zero row, and take a convex combination (weighted sum) of the matrices. Generally speaking, you get a rank two matrix. In fact, the singular values of a sum of two matrices is the sum of their singular values.
  2. Now, if we take the space of matrices with spectral norm less than or equal to 1 (which means that the max singular value is less or equal 1), then the trace norm is a lower bound on rank (with equality when all non-zero singular values are 1). And, since the singular values of a convex combination of two matrices are simply a convex combination of the singular values, the space of matrices with bounded trace norm is convex. And, specifically, it is the convex hull of the space of rank-constrained matrices. Furthermore, since the trace norm is an L_1 norm on the singular values, when it is used as a penalty, it has the tendency of encouraging zero singular values.
3. Review topic models
  1. Latent Semantic Analysis: do singular value decomposition on term frequency matrix, zero many of the smaller singular values. Result: low-rank least squares approximation to the original term frequency matrix.
  2. Probabilistic LSA: Assume each document generated from a multinomial, place rank constraint on matrix of multinomial parameters. Also, entries of U,V constrained to be positive and sum to one. Columns of V correspond to "topics." Each is the mean parameter vector for a multinomial. I.e. it specifies the rate of occurrence of each word in the "topic." Param vector for each document is a convex combination of topics.
  3. Latent Dirichlet Allocation: Bayesian extension of PLSA; add prior over topics, integrate out topics.
  4. Other work: how to figure out rank, or correct # of topics.
4. Natural parameter trace constrained topic model
  1. Like PLSA, we assume a multinomial model on each document & maximize likelihood subject to a constraint on the parameter matrix.
  2. But, the constraint we use is the trace norm. This effectively limits the topics (or basis vectors) to lie within a sphere such that the such of axis-parallel radii
9/17/05 - Mathworks web-publishes a book entitled Numerical Computing with Matlab, which includes a chapter on Eigenvalues and singular values, which serves as a nice introduction to the topic.
9/16/05 - Okay, now I think I'll have an easier time reading the Lafferty/Lebanon paper on Diffusion Kernels. They define the Fisher kernel/information matrix and note that it is invariant under reparameterization. Though, shortly after that, I'm lost! I don't even follow the multinomial example... (which seems to be the section I might most easily understand since I'm so familiar with the distribution) The example (figure 1) appears to use a hyperplane as a decision boundary once the points are projected onto the sphere... though I somewhat doubt that that is what's really going on...
9/16/05 - This Jaakkola/Haussler paper is a nice introduction to kernels. Defines "natural gradient." Defines the Fisher score and Fisher kernel. I still get a bit lost in the development. Why is the natural gradient " the gradient in the direction that maximizes log-likelihood while traversing the minimum distance in the manifold?" Actually, the answer to this question is almost certainly a result of the definition of "distance in the manifold," which uses the Fisher information matrix. Paper contains an important theorem with the following observation: "[the Fisher kernel] is invariant to any invertible (and differentiable) transformation of the parameters." E.g., mean and natural parameterizations of exponential models yield the same Fisher kernel. The FK captures aspects of the model itself, not the parameterization.
9/16/05 - Boyd and Vandenberghe's recently published book on Convex Optimization is available on line.
9/16/05 - Discovered lecture notes on Fisher Kernels and Semidefinite Programming from one of Michael I. Jordan's classes. The score vector is defined as the derivative (with respect to the parameters) of the log-likelihood (which is a function of the data and the parameters). The notes say that "the score vector captures the change to the fixed parameter to accommodate the new data point." My explanation would be that the score vector captures the change in log-likelihood of the data (for a small positive change in each of the parameter values). Maybe those are equivalent statements, but I don't immediately see how... Here's what I think is a useful statement about the Fisher kernel: "The Fisher kernel measures the similarity between data points by measuring the similarity of their imact on the parameters with respect to the score function." i.e. it's a natural measure of similarity with respect to a model.
9/16/05 - Read the Kondor/Lafferty paper on Diffusion Kernels. In short, a diffusion kernel is a kernel on a graph structure. In particular, the diffusion kernel is the covariance (limit as \Delta t \to 0) of a random field where each node sends a fraction of their value to their neighbors. i.e. it tells you how nodes interact over time as their values are diffused through the graph. Close relation to random walks. In fact, it's probably accurately thought of as a continuous-time version of a random walk. My guess is that these are closely related to the Gaussian Fields/Harmonic Functions that Zhu/Ghahramani/Lafferty have worked on. Here's a useful note in the conclusion section: "...the kernels correspond to standard Gaussian kernels in a continuous limit."

Jason Rennie
Last modified: Mon Mar 20 14:22:01 EST 2006