Multisite adaptation algorithm based on low-rank representation: Criticizing a paper and developing my own algorithm, which is also full of flaws.

You may want to read my note on low-rank representation and the paper on multi-site adaptation based on low-rank representation.

Part 1: Problems with this paper

In the paper, the basic setup is: data come from $T$ different sites, and each site may have some systematic bias which makes data from different sites somewhat "heterogeneous". From these $T$ sites, choose one as the "Target Site" $S_T$, and assume there is a projection matrix $P$ which maps data from $S_T$ to the common latent space. Then for each of the other sites $S_i$, called a "Source Site", there is a projection matrix $P_i$, obtained as $P_i = P + E_{P_i}$, which maps data from $S_i$ to the latent space. The mapped data from the target site is then used to linearly represent the mapped data from the source sites, that is, $P_i X_{S_i} = P X_T Z_i + E_{S_i}$, where $X_{S_i}$ is the data from source site $S_i$ and $X_T$ is the data from target site $S_T$. The setup is illustrated by the picture below.

[Figure: multisite.png — illustration of the multi-site setup]

It is assumed that the latent space is low-rank, and the following optimization problem is given:

$$\text{min}_{P,P_i,Z_i,E_{S_i},E_{P_i}}~~~ ||P||_* + \sum_{i=1}^K\bigg( ||Z_i||_* + \alpha ||E_{S_i}||_1 + \beta ||E_{P_i}||_1 \bigg)$$
$$\text{s.t.}~~~ P_i X_{S_i} = P X_T Z_i + E_{S_i},\qquad P_i = P + E_{P_i},\qquad PP^T = I$$

This problem can be solved by solving the equivalent problem

$$\text{min}_{P,J,P_i,F_i,Z_i,E_{S_i},E_{P_i}}~~~ ||J||_* + \sum_{i=1}^K\bigg( ||F_i||_* + \alpha ||E_{S_i}||_1 + \beta ||E_{P_i}||_1 \bigg)$$
$$\text{s.t.}~~~ P_i X_{S_i} = P X_T Z_i + E_{S_i},\qquad P_i = P + E_{P_i},\qquad P = J,\qquad Z_i = F_i,\qquad PP^T = I$$

The paper then solves this problem with the Augmented Lagrange Multiplier (ALM) method. First, form the augmented Lagrangian as if the constraint $PP^T = I$ did not exist.

$$\begin{aligned}
A_c(P,J,P_i,F_i,Z_i,E_{S_i},E_{P_i},Y_{1,i},Y_{2,i},Y_{3,i},Y_4) =\ & ||J||_* + \sum_{i=1}^K\bigg( ||F_i||_* + \alpha ||E_{S_i}||_1 + \beta ||E_{P_i}||_1 \bigg)\\
& + <Y_4, P-J> + \sum_{i=1}^K\bigg( <Y_{1,i}, Z_i - F_i> + <Y_{2,i}, P_i X_{S_i} - P X_T Z_i - E_{S_i}> + <Y_{3,i}, P_i - P - E_{P_i}> \bigg)\\
& + \frac{c}{2}||P-J||_F^2 + \frac{c}{2}\sum_{i=1}^K\bigg( ||Z_i - F_i||_F^2 + ||P_i X_{S_i} - P X_T Z_i - E_{S_i}||_F^2 + ||P_i - P - E_{P_i}||_F^2 \bigg)
\end{aligned}$$

Then the optimization algorithm is given

initialize all variables
while not converged {
    Fix other variables, minimize A_c over J
    Fix other variables, minimize A_c over F_i
    Fix other variables, minimize A_c over E_{S_i}
    Fix other variables, minimize A_c over E_{P_i}
    Fix other variables, minimize A_c over Z_i
    Fix other variables, minimize A_c over P_i
    Fix other variables, minimize A_c over P
    Update dual variable Y_{1,i} += c * (Z_i - F_i)
    Update dual variable Y_{2,i} += c * (P_i X_{S_i} - P X_T Z_i - E_{S_i})
    Update dual variable Y_{3,i} += c * (P_i - P - E_{P_i})
    Update dual variable Y_4     += c * (P - J)

    P = Orth(P) to satisfy PP^T = I
}

I found 3 fatal mistakes in this article.

1. About the objective

In this article, one term in the objective of the optimization problem is $||P||_*$, the sum of the singular values of $P$. But when the equality constraint $PP^T = I$ is satisfied, performing a singular value decomposition $P = U\Sigma V^T$ gives

$$PP^T = U\Sigma V^T (U\Sigma V^T)^T = U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T = I$$

Multiplying both sides on the right by $U$, we get $U\Sigma^2 = U$. Since $\Sigma$ is a diagonal matrix with nonnegative entries, the only solution is $\Sigma = I$. In that case $P = U\Sigma V^T$ has nuclear norm equal to its number of rows, so the term $||P||_*$ is a constant whenever $PP^T = I$ holds. The term $||P||_*$ in the objective is therefore redundant.
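Here is a quick numerical sanity check of this point (a throwaway NumPy snippet of my own, not from the paper): build a $P$ with orthonormal rows and verify that its nuclear norm equals its number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 12                                      # d <= m so P can have orthonormal rows

# Random P with orthonormal rows (P P^T = I), via QR of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((m, d)))  # Q: m x d, orthonormal columns
P = Q.T                                           # P: d x m, orthonormal rows

nuclear_norm = np.linalg.svd(P, compute_uv=False).sum()
print(np.allclose(P @ P.T, np.eye(d)))            # True
print(nuclear_norm)                               # 5.0 = number of rows of P
```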

2. About the optimization problem

The optimization algorithm given in this article is not quite justified.

They constructed the augmented Lagrangian as if the quadratic equality constraint $PP^T = I$ did not exist. After each iteration, in which the primal variables are minimized over and the dual variables updated, the constraint $PP^T = I$ is enforced by brute-force orthogonalization of $P$. This step changes $P$, so $P$ no longer minimizes the Lagrangian with the other variables fixed, and the convergence guarantees of the augmented Lagrange multiplier method no longer hold.

If this algorithm does converge thanks to some magic going on underneath, they should at least give a proof, or say something like "we cannot prove convergence, but the algorithm converged in every experiment".

3. About the convexity

Quoting the paper: "the optimization of Eq.(5) is convex and can be solved by iteratively updating each variable separately".

Even if we set aside the equality constraint $PP^T = I$, which is not included when forming the augmented Lagrangian, the equality constraint $P_i X_{S_i} = P X_T Z_i + E_{S_i}$ makes the problem nonconvex: it is a quadratic equality constraint because the variables $P$ and $Z_i$ appear in the same term. (As a scalar analogy, the set $\{(p, z) : pz = 1\}$ contains $(1,1)$ and $(-1,-1)$ but not their midpoint $(0,0)$, so it is not convex.)

Thus the optimization problem is in no sense convex, due to this quadratic equality constraint. There are convergence proofs for the augmented Lagrangian method on certain nonconvex problems, but the statement "the optimization of Eq.(5) is convex" is definitely wrong.


So there are two possibilities:

  1. I do not fully understand this paper. If you find where I went wrong, please email diqiudaozhezhuan@gmail.com; it really bothers me.
  2. This article is full of flaws.

When thinking through this article, I came up with my own idea, which also turns out to be a piece of garbage. But I still want to write it down.

Part 2: My own idea.

Actually, my idea is very simple. Suppose data from different sites come from a common space. For each site $S_i$, there is a projection matrix $P_i$ which projects the data into the common space, and we then perform low-rank representation in the common space. We can arrange the $X_i$'s into a block-diagonal matrix $X$ and horizontally stack the $P_i$'s into a matrix $P$. The optimization problem can then be written as

$$\text{min}_{P,Z,E}~~~ ||Z||_* + \lambda ||E||_1 \qquad \text{s.t.}~~~ PX = PXZ + E$$

This is identical to the basic low-rank representation except that here we perform low-rank representation with $PX$. The structure of the matrix multiplication is shown in the picture below.

[Figure: block structure of the product $PX$]
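To make the block structure concrete, here is a small NumPy/SciPy check of my own (shapes are made up, and I assume each $P_i$ is square, which the $||P_i - I||_*$ term below requires anyway): stacking the $P_i$ horizontally and the $X_i$ block-diagonally makes $PX$ exactly the per-site projections laid side by side.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
d = 4                          # data / common-space dimension (each P_i assumed d x d)
n = [5, 7, 6]                  # number of samples at each site (made-up sizes)

Xs = [rng.standard_normal((d, ni)) for ni in n]   # per-site data X_i
Ps = [rng.standard_normal((d, d)) for _ in n]     # per-site projections P_i

X = block_diag(*Xs)            # (K*d) x (sum of n_i): block-diagonal stack of the X_i
P = np.hstack(Ps)              # d x (K*d): horizontal stack of the P_i

# PX is just the per-site projections laid side by side.
assert np.allclose(P @ X, np.hstack([Pi @ Xi for Pi, Xi in zip(Ps, Xs)]))
```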

However, this optimization has a trivial solution $P = 0$, $Z = 0$, $E = 0$. This trivial solution exists because, when mapping the data to a common space, we can throw away all the information in $X$ by setting $P = 0$, and then the common space is trivially low-rank. So we want to project $X$ into the common space while keeping some of the information in $X$. Consider the following optimization problem

$$\text{min}_{P,Z,E}~~~ ||Z||_* + \sum_i \bigg( \alpha ||P_i - I||_* + \beta ||E_i||_1 \bigg) \qquad \text{s.t.}~~~ PX = PXZ + E$$

The term $||P_i - I||_*$ in the objective is quite heuristic, or, put another way, not fully justified. The motivation is: if $||P_i - I||_*$ is small, then the singular values of $P_i - I$ tend to 0, so the singular values of $P_i$ tend to 1, and we avoid the situation where $P_i$ throws away all the information in $X$.

This problem is nonconvex due to the quadratic equality constraint $PX = PXZ + E$, but we can still use the augmented Lagrangian method to reach a local minimum.

To solve the optimization problem, we can solve the equivalent problem

$$\text{min}_{P_i,Z,E_i,Q_i,J}~~~ ||J||_* + \sum_i \bigg( \alpha ||Q_i||_* + \beta ||E_i||_1 \bigg)$$
$$\text{s.t.}~~~ PX = PXZ + E : Y_1,\qquad J = Z : Y_2,\qquad Q_i = P_i - I : Y_{3,i}$$

where the symbol after each colon denotes the corresponding dual variable.

First, we form the augmented Lagrangian

$$\begin{aligned}
A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i}) =\ & ||J||_* + \sum_i \bigg( \alpha ||Q_i||_* + \beta ||E_i||_1 \bigg)\\
& + <Y_1, PX - PXZ - E> + <Y_2, J - Z> + \sum_i <Y_{3,i}, Q_i - P_i + I>\\
& + \frac{c}{2}||PX - PXZ - E||_F^2 + \frac{c}{2}||J - Z||_F^2 + \frac{c}{2}\sum_i ||Q_i - P_i + I||_F^2
\end{aligned}$$

We alternately minimize the Lagrangian over each variable.

1. minimize over $J$

$$\text{min}_{J}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_{J}~~~ ||J||_* + <Y_2, J - Z> + \frac{c}{2}||J - Z||_F^2$$

The closed-form solution is given by singular value shrinkage

$$J = \mathcal{D}_{1/c}\Big(Z - \frac{Y_2}{c}\Big) \qquad (1)$$
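As a sketch, the shrinkage operator $\mathcal{D}_\tau$ is a few lines of NumPy (the helper name `svt` is mine); update (1) is then a single call.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding D_tau(M): shrink every singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Update (1): J = D_{1/c}(Z - Y2 / c)
# J = svt(Z - Y2 / c, 1.0 / c)
```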

2. minimize over $Q_i$

$$\text{min}_{Q_i}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_{Q_i}~~~ \alpha ||Q_i||_* + <Y_{3,i}, Q_i - P_i + I> + \frac{c}{2}||Q_i - P_i + I||_F^2$$

The closed-form solution is given by singular value shrinkage

$$Q_i = \mathcal{D}_{\alpha/c}\Big(P_i - I - \frac{Y_{3,i}}{c}\Big) \qquad (2)$$
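Update (2) reuses the same hypothetical `svt` helper from the sketch above, only with threshold $\alpha/c$ and a different argument.

```python
import numpy as np

def update_Qi(Pi, Y3i, alpha, c):
    """Q_i = D_{alpha/c}(P_i - I - Y_{3,i}/c); P_i is assumed square, svt as sketched above."""
    return svt(Pi - np.eye(Pi.shape[0]) - Y3i / c, alpha / c)
```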

3. minimize over $P$

$$\text{min}_{P}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_P~~~ \frac{c}{2}||PX - PXZ - E||_F^2 + \frac{c}{2}\sum_i ||Q_i - P_i + I||_F^2 + <Y_1, PX - PXZ - E> + \sum_i <Y_{3,i}, Q_i - P_i + I>$$

This problem can be split into separate problems, one for each $P_i$:

$$\text{min}_{P_i}~~~ \frac{c}{2}\Big|\Big|P_iX_iF_i + \sum_{j \neq i} P_jX_jF_j - E\Big|\Big|_F^2 + \frac{c}{2}||Q_i - P_i + I||_F^2 + <Y_1, P_iX_iF_i> + <Y_{3,i}, Q_i - P_i + I>$$

where $F_i$ is the $i$th row block of $I - Z$.

The closed-form solution is obtained by solving the linear system

$$P_i \Big[(X_iF_i)(X_iF_i)^T + I\Big] = \Big(Q_i + I + \frac{Y_{3,i}}{c}\Big) - \Big(G + \frac{Y_1}{c}\Big)(X_iF_i)^T \qquad (3)$$

where $G = \sum_{j \neq i} P_jX_jF_j - E$.
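A sketch of update (3) in NumPy (function and variable names are mine): the coefficient matrix $(X_iF_i)(X_iF_i)^T + I$ is symmetric positive definite, so one linear solve per site suffices.

```python
import numpy as np

def update_Pi(Xi, Fi, Qi, Y3i, G, Y1, c):
    """Solve P_i [(X_i F_i)(X_i F_i)^T + I] = (Q_i + I + Y3i/c) - (G + Y1/c)(X_i F_i)^T,
    where G = sum_{j != i} P_j X_j F_j - E is computed by the caller."""
    d = Qi.shape[0]
    B = Xi @ Fi                              # d x N
    M = B @ B.T + np.eye(d)                  # symmetric positive definite, d x d
    rhs = Qi + np.eye(d) + Y3i / c - (G + Y1 / c) @ B.T
    # P_i M = rhs with M symmetric  =>  P_i = solve(M, rhs^T)^T
    return np.linalg.solve(M, rhs.T).T
```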

4. minimize over $Z$

$$\text{min}_{Z}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_Z~~~ \frac{c}{2}||PX - PXZ - E||_F^2 + \frac{c}{2}||J - Z||_F^2 + <Y_1, PX - PXZ - E> + <Y_2, J - Z>$$

The closed-form solution is obtained by solving the linear system

$$\Big[(PX)^T(PX) + I\Big]Z = (PX)^T\Big[PX - E + \frac{Y_1}{c}\Big] + J + \frac{Y_2}{c} \qquad (4)$$

Implementation details: in (3) and (4) the "identity plus low-rank" structure of the coefficient matrices is crying out to be exploited; use the matrix inversion lemma or block elimination to solve these linear systems instead of factoring the full matrices.
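For (4), for example, $(PX)^T(PX) + I$ has the size of the total sample count $N$, while $PX$ has only $d$ rows, so the matrix inversion lemma reduces the work to a single $d \times d$ solve. A sketch under these assumptions (names are mine):

```python
import numpy as np

def update_Z(PX, E, J, Y1, Y2, c):
    """Solve [(PX)^T (PX) + I] Z = (PX)^T [PX - E + Y1/c] + J + Y2/c for Z without
    forming the N x N coefficient matrix, using the matrix inversion lemma
    (I + M^T M)^{-1} = I - M^T (I + M M^T)^{-1} M with M = PX (d x N, d << N)."""
    d = PX.shape[0]
    rhs = PX.T @ (PX - E + Y1 / c) + J + Y2 / c      # N x N right-hand side
    small = np.eye(d) + PX @ PX.T                    # d x d system instead of N x N
    return rhs - PX.T @ np.linalg.solve(small, PX @ rhs)
```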

5. minimize over $E$

$$\text{min}_{E}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_{E}~~~ \beta ||E||_1 + \frac{c}{2}||PX - PXZ - E||_F^2 + <Y_1, PX - PXZ - E>$$

This problem can be split into separate problems, one for each $E_i$:

$$\text{min}_{E_i}~~~ \beta ||E_i||_1 + \frac{c}{2}||PXF_i - E_i||_F^2 + <Y_{1,i}, -E_i>$$

where $F_i$ is the $i$th column block of $I - Z$ and $Y_{1,i}$ is the $i$th column block of the dual variable $Y_1$.

This problem can be rewritten as

$$\text{min}_{E_i}~~~ \frac{\beta}{c}||E_i||_1 + \frac{1}{2}\Big|\Big|E_i - \Big(PXF_i + \frac{Y_{1,i}}{c}\Big)\Big|\Big|_F^2$$

which can be solved with a primal-dual interior point method. For the details, please refer to the appendix.
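As a bookkeeping note (my own sketch, with a placeholder `solve_l1_ls(k, v)` standing in for the interior point solver derived in the appendix): the columns of $E_i$ decouple, and multiplying the per-column objective by 2 matches the appendix form with $k = 2\beta/c$.

```python
import numpy as np

def update_Ei(PXFi, Y1i, beta, c, solve_l1_ls):
    """Column-wise E_i update. Each column solves (beta/c)*||e||_1 + 0.5*||e - v||_2^2,
    i.e. the appendix problem min_x k*||x||_1 + ||x - v||_2^2 with k = 2*beta/c.
    solve_l1_ls(k, v) is a placeholder for the solver from the appendix."""
    V = PXFi + Y1i / c
    return np.column_stack([solve_l1_ls(2.0 * beta / c, V[:, j]) for j in range(V.shape[1])])
```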

We alternately optimize over each variable and update the dual variables accordingly in each iteration of the augmented Lagrangian method.


This method also has some problems.

1. About the objective

The term $||P_i - I||_*$ pushes the singular values of $P_i$ toward 1. However, we have no reason to believe that the "right" $P$ has all singular values equal to 1; all we can do is prevent $P$ from throwing away all the information in $X$.

2. About the convexity

Due to the equality constraint $PX = PXZ + E$, the problem is nonconvex. Although we can reach a local minimum with the augmented Lagrangian method and obtain a "fairly good" solution by running multiple trials from different initial points, it is not as elegant as a convex problem.

3. About the complexity

In each iteration, assuming there are $K$ sites and each data point has dimension $n$, we need to compute $K$ SVDs of $n \times n$ matrices when evaluating (2), each of which costs $O(n^3)$. When the data are high-dimensional, the problem quickly becomes intractable.

Part 3: Results of my own idea.

Below are the results on the Yale face dataset when I set $\alpha = 0.5$ and $\beta = 0.05$.

[Result images from the Yale face dataset]

These images have lower quality after the low-rank representation, but remember that the purpose of the low-rank representation is to map data into a common space. In these results, the shadow at the eyes is removed in the first image and the shadow near the nose is removed in the second image, making those two images fairly "homogeneous".

Only one experiment was done, since solving the optimization problem is quite time-consuming. I think the results could be better if proper values of $\alpha$ and $\beta$ were chosen.

Part 4: Appendix: Solving the one-norm plus squared two-norm problem

Consider the optimization problem

$$\text{min}_x~~~ k||x||_1 + ||x - v||_2^2$$

This can be reformulated into an equivalent problem

$$\text{min}_{x,y}~~~ k\,\mathbf{1}^Ty + ||x - v||_2^2 \qquad \text{s.t.}~~~ -y \leq x \leq y$$

whose objective is differentiable (and whose constraints are linear).

Write this in a standard form:

$$\text{min}_{(x,y)}~~~ k\begin{pmatrix} \mathbf{0}^T & \mathbf{1}^T \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \bigg|\bigg|\begin{pmatrix} I & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} - v\bigg|\bigg|_2^2$$
$$\text{s.t.}~~~ \begin{pmatrix} -I & -I \\ I & -I \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \leq \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

With a change of notation, we can write the problem as

$$\text{min}_x~~~ c^Tx + ||Ax - b||_2^2 \qquad \text{s.t.}~~~ Dx \leq 0 : \lambda$$

Form the Lagrangian

$$\mathcal{L}(x, \lambda) = c^Tx + x^T(A^TA)x - 2(A^Tb)^Tx + b^Tb + \lambda^TDx$$

The modified KKT conditions for the log barrier are

$$\begin{cases} \nabla_x \mathcal{L}(x,\lambda) = c + 2A^TAx - 2A^Tb + D^T\lambda = 0 \\ -\text{diag}(\lambda)Dx = \frac{1}{t}\mathbf{1} \end{cases}$$

The residual is given by

$$r(x,\lambda) = \begin{pmatrix} c + 2A^TAx - 2A^Tb + D^T\lambda \\ -\text{diag}(\lambda)Dx - \frac{1}{t}\mathbf{1} \end{pmatrix}$$

Then form the KKT system

$$\begin{pmatrix} 2A^TA & D^T \\ -\text{diag}(\lambda)D & -\text{diag}(Dx) \end{pmatrix}\begin{pmatrix} dx \\ d\lambda \end{pmatrix} = -\begin{pmatrix} c + 2A^TAx - 2A^Tb + D^T\lambda \\ -\text{diag}(\lambda)Dx - \frac{1}{t}\mathbf{1} \end{pmatrix} = \begin{pmatrix} -r_d \\ -r_c \end{pmatrix}$$

where $r_d$ and $r_c$ stand for the dual residual and the centrality residual, respectively. Then solve the KKT system by block elimination:

$$\big(2A^TA - D^T\text{diag}(Dx)^{-1}\text{diag}(\lambda)D\big)\,dx = -r_d - D^T\text{diag}(Dx)^{-1}r_c$$
$$\text{diag}(\lambda)D\,dx + \text{diag}(Dx)\,d\lambda = r_c$$

For more details about the primal-dual interior point method, refer to my note.

Implementation details: notice the structure of the matrices and make sure you exploit the sparsity when solving the KKT system.
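To make the block-elimination step concrete, here is a minimal NumPy sketch of a single primal-dual Newton step in the notation above (function and variable names are mine; the backtracking line search, the update of $t$, and the stopping test are omitted).

```python
import numpy as np

def newton_step(x, lam, t, cvec, A, b, D):
    """One primal-dual Newton step for min c^T x + ||Ax - b||_2^2 s.t. Dx <= 0,
    using the residuals and block elimination derived above.
    Requires a strictly feasible point: D @ x < 0 elementwise and lam > 0."""
    Dx = D @ x
    r_dual = cvec + 2 * A.T @ (A @ x) - 2 * A.T @ b + D.T @ lam
    r_cent = -lam * Dx - 1.0 / t

    # (2 A^T A - D^T diag(Dx)^{-1} diag(lam) D) dx = -r_dual - D^T diag(Dx)^{-1} r_cent
    H = 2 * A.T @ A - D.T @ ((lam / Dx)[:, None] * D)
    dx = np.linalg.solve(H, -r_dual - D.T @ (r_cent / Dx))

    # diag(lam) D dx + diag(Dx) dlam = r_cent
    dlam = (r_cent - lam * (D @ dx)) / Dx
    return dx, dlam

# For min_x k*||x||_1 + ||x - v||_2^2 via the (x, y) reformulation above, the data are:
#   n    = len(v)
#   cvec = np.concatenate([np.zeros(n), k * np.ones(n)])
#   A    = np.hstack([np.eye(n), np.zeros((n, n))])
#   b    = v
#   D    = np.block([[-np.eye(n), -np.eye(n)], [np.eye(n), -np.eye(n)]])
```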