Multi-site adaptation algorithm based on low-rank representation: criticizing a paper and developing my own algorithm, which is also full of flaws.
You may want to read my note on low-rank representation and the paper on multi-site adaptation based on low-rank representation first.
Part 1: Problems with this paper
In the paper, the basic setup is: data come from different sites, and each site may have some systematic bias which makes data from different sites somewhat "heterogeneous". From these sites, one is chosen as the "Target Site" $T$, and it is assumed that there is a projection matrix $P$ which maps data from $T$ to a common latent space. Then for each of the other sites $S_i$, called "Source Sites", there is a projection matrix $P_i$, obtained by perturbing the target projection ($P_i = P + E_{P_i}$), which maps data from $S_i$ to the latent space. The mapped data from the target site is then used to linearly represent the mapped data from the source sites, that is, $P_i X_{S_i} = P X_T Z_i + E_{S_i}$, where $X_{S_i}$ is the data from source site $S_i$ and $X_T$ is the data from target site $T$. The setup is illustrated by the picture below.
It is assumed that the latent space is low-rank, and the following optimization problem is given.
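As a sketch of the formulation, reconstructed from the variable names in the algorithm below and the discussion in Part 1 (the exact norms and the weights $\lambda_1, \lambda_2$ are my guesses, not the paper's notation):

$$
\min_{P,\,\{P_i, Z_i, E_{S_i}, E_{P_i}\}} \ \|P\|_* + \sum_i \Bigl( \|Z_i\|_* + \lambda_1 \|E_{S_i}\|_1 + \lambda_2 \|E_{P_i}\|_1 \Bigr)
\quad \text{s.t.}\quad P_i X_{S_i} = P X_T Z_i + E_{S_i},\ \ P_i = P + E_{P_i},\ \ P P^T = I.
$$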
This problem can be solved by solving an equivalent problem, in which auxiliary variables (the $J$ and $F_i$ appearing in the algorithm below) are introduced.
Then the paper solves this problem with the Augmented Lagrangian Multiplier (ALM) method. First, the augmented Lagrangian is formed as if the orthogonality constraint $PP^T = I$ did not exist.
Then the optimization algorithm is given
initialize all variables
while not converged {
Fix other variables, minimize A_c over J
Fix other variables, minimize A_c over F_i
Fix other variables, minimize A_c over E_{S_i}
Fix other variables, minimize A_c over E_{P_i}
Fix other variables, minimize A_c over Z_i
Fix other variables, minimize A_c over P_i
Fix other variables, minimize A_c over P
Update dual variable Y_{1,i} += mu * (Z_i - F_i)
Update dual variables Y_{2,i}, Y_{3,i}, Y_1, each by mu times the residual of its own constraint
P = Orth(P) to satisfy PP^T = I
}
I found 3 fatal mistakes in this article.
1. About the objective
In this article, one term in the objective of the optimization problem is $\|P\|_*$, the sum of the singular values of $P$. But when the equality constraint $PP^T = I$ is satisfied, we can perform a singular value decomposition of $P$,
$$P = U \Sigma V^T .$$
Multiplying both sides on the right by $P^T = V \Sigma U^T$, we get
$$PP^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T = I .$$
Since $\Sigma$ is a diagonal matrix with nonnegative entries, the only solution is $\Sigma = I$. If so, $P$ has nuclear norm equal to its number of rows, i.e. the dimension of the latent space, so the term $\|P\|_*$ is completely determined by the constraint. Hence the term $\|P\|_*$ in the objective is redundant.
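A quick numerical sanity check of this point, assuming (as reconstructed above) that the redundant term is the nuclear norm of a row-orthonormal $P$; the sizes below are arbitrary:

```python
import numpy as np

# A row-orthonormal P (P @ P.T == I) has all singular values equal to 1,
# so its nuclear norm equals its number of rows -- a constant.
rng = np.random.default_rng(0)
d, D = 5, 20                                        # latent dim x data dim (arbitrary)
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))    # D x d with orthonormal columns
P = Q.T                                             # d x D with orthonormal rows
assert np.allclose(P @ P.T, np.eye(d))
print(np.linalg.svd(P, compute_uv=False))           # -> all ones
print(np.linalg.norm(P, ord="nuc"))                 # -> 5.0 == d
```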
2. About the optimization problem
The optimization algorithm given in this article is not quite justified.
They constructed the augmented Lagrangian as if the equality constraint $PP^T = I$, which is a quadratic constraint, did not exist. After each iteration in which the primal and dual variables are updated, the constraint $PP^T = I$ is enforced by brute-force orthogonalization of $P$. This step changes $P$, so $P$ no longer minimizes the augmented Lagrangian when the other variables are fixed, and the convergence guarantee of the augmented Lagrangian method no longer holds.
If this algorithm does converge thanks to some magic going on underneath, they should at least give a proof, or say something like "we cannot prove convergence, but the algorithm converged in every experiment".
3. About the convexity
To quote the paper, "the optimization of Eq.(5) is convex and can be solved by iteratively updating each variable separately".
Even if we set aside the equality constraint $PP^T = I$, which is not even included when forming the augmented Lagrangian, the equality constraint $P_i X_{S_i} = P X_T Z_i + E_{S_i}$ is a quadratic equality constraint, because the variables $P$ and $Z_i$ appear in the same product, and it alone makes the problem nonconvex.
Thus the optimization problem is in no sense convex. There are convergence proofs for augmented Lagrangian methods on some nonconvex problems, but the statement "the optimization of Eq.(5) is convex" is definitely wrong.
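A toy example, not from the paper, of why a bilinear equality constraint destroys convexity: take scalar variables $p$ and $z$ with the constraint $pz = 1$. Then

$$
(p, z) = (1, 1) \quad\text{and}\quad (p, z) = (-1, -1)
$$

are both feasible, but their midpoint $(0, 0)$ gives $pz = 0 \neq 1$, so the feasible set is not convex, and no choice of objective can make the overall problem convex.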
So, there are 2 possibilities
- I do not fully understand this paper. If you find where I went wrong, please email diqiudaozhezhuan@gmail.com; it really bothers me.
- This article is full of flaws.
While thinking through this article, I came up with my own idea, which also turns out to be a piece of garbage. But I still want to write it down.
Part 2: My own idea.
Actually, my idea is very simple. Suppose the data from different sites come from a common space: for each site $i$, there is a projection matrix $P_i$ which projects that site's data $X_i$ into the common space, and we then perform low-rank representation in the common space. We can arrange the $P_i$'s into a block-diagonal matrix $P$ and horizontally stack the $X_i$'s into a matrix $X$. The optimization problem can then be written as
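a sketch of what this formulation plausibly is, assuming the standard LRR objective (nuclear norm plus an $\ell_1$ error term, consistent with the appendix) applied to the projected data $PX$:

$$
\min_{P,\,Z,\,E}\ \|Z\|_* + \lambda\|E\|_1 \quad\text{s.t.}\quad PX = PXZ + E.
$$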
This is identical to basic low-rank representation, except that here we perform low-rank representation on the projected data $PX$ instead of on $X$ directly. The structure of the matrix multiplication is shown in the picture below.
However, this optimization has a trivial solution, $P = 0$. This trivial solution exists because, when mapping the data into a common space, we can throw away all the information in $X_i$ by setting $P_i = 0$, and the resulting common space is then trivially low-rank. We therefore want to project $X_i$ into the common space while keeping some of the information in $X_i$. Consider the following optimization problem
The extra term in the objective is quite heuristic, that is, put another way, not fully justified. The motivation is: if this term is small, then the singular values of the matrix it penalizes tend to 0, so the singular values of $P_i$ tend to 1, and we avoid the situation where $P_i$ throws away all the information in $X_i$.
This problem is nonconvex due to the bilinear equality constraint (as in Part 1, $P$ and $Z$ appear in the same product), but we can still use the augmented Lagrangian method to reach a local minimum.
To solve the optimization problem, we can solve an equivalent problem with auxiliary variables. First, we build the augmented Lagrangian; then we alternately minimize it over each variable.
1. Minimizing over the first (nuclear-norm) auxiliary variable, with the other variables fixed, gives a nuclear-norm-plus-Frobenius subproblem whose closed-form solution is given by singular value shrinkage (a sketch of this shrinkage operator is given after this list).
2. Minimizing over the second (nuclear-norm) auxiliary variable gives the same kind of subproblem, again solved in closed form by singular value shrinkage.
3. Minimizing over the next variable gives a quadratic subproblem that splits into several independent problems, one per row block, each with a closed-form solution obtained by solving a linear system.
4. Minimizing over the next variable again gives a quadratic subproblem with a closed-form solution, obtained by solving a linear system.
Implementation details: in (3) and (4), the "diagonal-plus-low-rank" structure is crying out to be used; make sure to exploit it by using block elimination (equivalently, the matrix inversion lemma) to solve the linear systems instead of factoring the full matrices. A sketch follows.
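The specific linear systems are not spelled out here, so the following is only a generic sketch of the kind of solve this refers to: a system $(\mathrm{diag}(d) + UV^T)x = b$ with a diagonal-plus-low-rank coefficient matrix, solved by block elimination (the Woodbury identity) instead of factoring the full matrix. All matrices and sizes below are made up for illustration.

```python
import numpy as np

def solve_diag_plus_lowrank(d, U, V, b):
    """Solve (diag(d) + U @ V.T) x = b by block elimination (Woodbury identity).

    Costs O(n k^2 + k^3) instead of the O(n^3) of a dense factorization.
    """
    Dinv_b = b / d                                # diag(d)^{-1} b
    Dinv_U = U / d[:, None]                       # diag(d)^{-1} U
    S = np.eye(U.shape[1]) + V.T @ Dinv_U         # small k x k Schur complement
    y = np.linalg.solve(S, V.T @ Dinv_b)
    return Dinv_b - Dinv_U @ y

# Check against a dense solve (symmetric case, so the system is safely invertible).
rng = np.random.default_rng(0)
n, k = 500, 8
d = rng.uniform(1.0, 2.0, size=n)
U = rng.standard_normal((n, k))
b = rng.standard_normal(n)
x = solve_diag_plus_lowrank(d, U, U, b)
assert np.allclose((np.diag(d) + U @ U.T) @ x, b)
```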
5. Minimizing over the error term gives an $\ell_1$-plus-quadratic subproblem. It can be dissected into several smaller problems, one per column block, each involving the corresponding columns of the dual variable. Each of these is a "one-norm-plus-Frobenius-norm-square" problem, which can be solved efficiently with a primal-dual interior point method; for the details, please refer to my note on the primal-dual interior point method and the appendix.
We alternately optimize over each variable and then update the dual variables accordingly in each iteration of the augmented Lagrangian method.
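For the singular value shrinkage in steps (1) and (2), the standard singular value thresholding operator is sketched below; it exactly solves $\min_X \tau\|X\|_* + \tfrac{1}{2}\|X - M\|_F^2$, which is the generic form such subproblems take. The particular $M$ and $\tau$ depend on the displays above, so they are placeholders here.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: argmin_X  tau*||X||_* + 0.5*||X - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Example: shrinkage reduces every singular value by tau (floored at 0).
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 6))
X = svt(M, tau=1.0)
print(np.linalg.svd(M, compute_uv=False))
print(np.linalg.svd(X, compute_uv=False))
```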
This method also has some problems.
1. About the objective
The heuristic term pushes the singular values of $P_i$ towards 1. However, we have no reason to believe that the "right" solution for $P_i$ has all singular values equal to 1; all this term does is prevent $P_i$ from throwing away all the information in $X_i$.
2. About the convexity
Due to the bilinear equality constraint, the problem is nonconvex. Though we can reach a local minimum with the augmented Lagrangian method, and obtain a "fairly good" solution by running multiple trials from different initial points, it is not as elegant as a convex problem.
3. About the complexity
In each iteration, assuming there are $m$ sites and each data point has dimension $D$, we need to compute on the order of $m$ SVDs for the singular value shrinkage steps, each of which has complexity roughly cubic in the matrix dimension. When the data is high-dimensional, the problem becomes intractable quickly.
Part 3: Results of my own idea.
Following is the result on the Yale face dataset for one particular choice of the parameters.
These images have lower quality after low-rank representation, but remember that the purpose of low-rank representation here is to map the data into a common space. In these results, the shadow at the eyes is eliminated in the first image and the shadow near the nose is eliminated in the second image, making those two images pretty "homogeneous".
Only one experiment was done, since solving the optimization problem is pretty time-consuming. I think the result could be better if the parameters were chosen properly.
Part 4: Appendix: Solving one-norm-plus-Frobenius-norm-square
Consider the optimization problem
This can be reformulated into an equivalent problem which is differentiable.
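Assuming the problem is $\min_x\ \lambda\|x\|_1 + \tfrac{1}{2}\|Ax - b\|_2^2$ (my reading of "one-norm-plus-Frobenius-norm-square"), the standard reformulation introduces a bound variable $t$ and replaces the $\ell_1$ term with linear inequality constraints:

$$
\min_{x,\,t}\ \lambda\,\mathbf{1}^T t + \tfrac{1}{2}\|Ax - b\|_2^2 \quad\text{s.t.}\quad -t \le x \le t,
$$

which has a smooth objective and linear inequality constraints, so a primal-dual interior point method applies directly.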
This can then be written in a standard form with a change of notation.
Form the Lagrangian, write down the modified KKT conditions for the log barrier, and collect the residuals. Then form the KKT system, where $r_{\mathrm{dual}}$ and $r_{\mathrm{cent}}$ stand for the dual residual and the centrality residual, respectively, and solve it by block elimination.
For more details about the primal-dual interior point method, refer to my note.
Implementation details: notice the structure of the matrices involved and make sure to exploit their sparsity when solving the KKT system.