The EM algorithm is typically used to find the maximum likelihood estimate of the parameters $\theta$ of a latent-variable model, i.e. to maximize

$$\prod_{i=1}^m p(x^{(i)};\theta) = \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta)$$

where the $z^{(i)}$ are the latent variables. In the Bayesian view, the MAP estimate of $\theta$ instead maximizes

$$\left(\prod_{i=1}^m p(x^{(i)};\theta)\right)p(\theta) = \left(\prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta)\right)p(\theta)$$

The goal is to generalize the EM algorithm to MAP estimation. We may assume that $\log p(x,z|\theta)$ and $\log p(\theta)$ are both concave in $\theta$, so that the maximization over $\theta$ in the M-step remains tractable.
Given a training dataset

$$S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$$

Maximum Likelihood Estimation chooses

$$\theta_{ML} = \operatorname*{argmax}_{\theta}~P(S|\theta)$$

The intuition is: which $\theta$ makes the observed data most likely?

But Maximum A Posteriori Estimation chooses

$$\theta_{MAP} = \operatorname*{argmax}_{\theta}~P(\theta|S)$$

The intuition is: given the training dataset, which $\theta$ is most likely?

In MAP, $P(\theta|S)$ can be expanded by Bayes' Rule:
$$P(\theta|S) = \frac{P(S|\theta)~P(\theta)}{P(S)}
= \frac{P(\theta)\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)}
       {\int_{\theta} P(\theta)\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\,d\theta}$$

Actually, the denominator is just $P(S)$, which does not depend on $\theta$, so it can be dropped when maximizing over $\theta$:
$$\begin{aligned}
\theta_{MAP} &= \operatorname*{argmax}_{\theta} P(\theta|S) \\
&= \operatorname*{argmax}_{\theta} P(S|\theta)P(\theta) \\
&= \operatorname*{argmax}_{\theta} P(\theta)\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)
\end{aligned}$$

Note that what we are maximizing is almost the same as what we maximize in Maximum Likelihood Estimation; the only difference is the extra prior factor $P(\theta)$.
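As a concrete illustration (my own toy example, not part of the original derivation), consider estimating a coin's heads probability from a few flips. Both estimates have closed forms here: ML is the empirical frequency, and MAP with a hypothetical $\mathrm{Beta}(a,b)$ prior shifts it toward the prior mean:

```python
import numpy as np

# Toy example: estimate a coin's heads probability theta from m flips,
# comparing ML and MAP under a Beta(a, b) prior (both have closed forms).
flips = np.array([1, 1, 1, 0, 1])    # 4 heads, 1 tail
m, heads = len(flips), int(flips.sum())

# ML: argmax_theta P(S|theta)  ->  heads / m
theta_ml = heads / m

# MAP: argmax_theta P(S|theta) P(theta), theta ~ Beta(a, b)
#      ->  (heads + a - 1) / (m + a + b - 2)
a, b = 2.0, 2.0                      # prior pulls theta toward 0.5
theta_map = (heads + a - 1) / (m + a + b - 2)

print(theta_ml)    # 0.8
print(theta_map)   # 5/7 ≈ 0.714
```

With more data the prior's influence shrinks and the two estimates converge; with little data the prior regularizes the estimate.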
When we introduce latent variables into MAP Estimation, the objective over $\theta$ becomes

$$P(\theta)\prod_{i=1}^m P(x^{(i)}|\theta)
= P(\theta)\prod_{i=1}^m \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta)$$

where the $z^{(i)}$ are latent variables. We can now derive the EM Algorithm for this objective; the only new ingredient compared to the maximum likelihood case is the prior term $P(\theta)$.
Taking logs, the objective to maximize is

$$\begin{aligned}
\log l(\theta) &= \sum_{i=1}^m \log P(x^{(i)}|\theta) + \log P(\theta) \\
&= \sum_{i=1}^m \log\Big[\sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta)\Big] + \log P(\theta) \\
&= \sum_{i=1}^m \log\Big[\sum_{z^{(i)}} Q_i(z^{(i)})\,\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})}\Big] + \log P(\theta) \\
&= \sum_{i=1}^m \log E_{z^{(i)}}\Big[\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})}\Big] + \log P(\theta)
\end{aligned}$$

where each $Q_i$ is some distribution over $z^{(i)}$ and the expectation is taken with respect to $z^{(i)} \sim Q_i$. Since $\log$ is concave, by Jensen's Inequality:
$$\begin{aligned}
\sum_{i=1}^m \log E_{z^{(i)}}\Big[\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})}\Big] + \log P(\theta)
&\ge \sum_{i=1}^m E_{z^{(i)}}\Big[\log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})}\Big] + \log P(\theta) \\
&= \sum_{i=1}^m \Big[\sum_{z^{(i)}} Q_i(z^{(i)})\log\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})}\Big] + \log P(\theta)
\end{aligned}$$

This gives a lower bound on $\log l(\theta)$ for any choice of the distributions $Q_i$.
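The inequality $\log E[X] \ge E[\log X]$ used here is easy to check numerically. A minimal sketch, with an arbitrary made-up distribution $Q$ and positive values $X$ standing in for the ratio $P(x,z|\theta)/Q(z)$:

```python
import numpy as np

# Sanity check of Jensen's Inequality for the concave log:
# log E[X] >= E[log X] for a positive random variable X ~ Q.
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5))        # an arbitrary distribution Q(z)
x = rng.uniform(0.1, 2.0, size=5)    # arbitrary positive values of X(z)

log_of_mean = np.log(np.sum(q * x))  # log E_{z~Q}[X]
mean_of_log = np.sum(q * np.log(x))  # E_{z~Q}[log X]
assert log_of_mean >= mean_of_log    # the bound above

# Equality holds exactly when X is constant:
const = np.full(5, 1.3)
assert np.isclose(np.log(np.sum(q * const)), np.sum(q * np.log(const)))
```

The equality case in the last two lines is exactly the condition exploited next to make the bound tight.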
To make the bound tight, i.e. to hold equality in Jensen's Inequality at the current $\theta$, the ratio inside the expectation must be a constant:

$$\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} = C$$

Hence

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

Summing both sides over $z^{(i)}$,

$$\sum_{z^{(i)}} Q_i(z^{(i)}) = \sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

The left side is $1$, since $Q_i(z^{(i)})$ is a probability distribution over $z^{(i)}$. For the right side,
$$\sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C} = \frac{P(x^{(i)}|\theta)}{C}$$

And so

$$C = P(x^{(i)}|\theta)$$

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{P(x^{(i)}|\theta)} = P(z^{(i)}|x^{(i)},\theta)$$

The **E-Step** is therefore just the same as in the general (maximum likelihood) case:

$$Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)},\theta)$$

And the **M-Step** is to find the maximum of the lower bound:
$$\theta := \operatorname*{argmax}_{\theta}\;
\sum_{i=1}^m \Big[\sum_{z^{(i)}} Q_i(z^{(i)})\log\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})}\Big] + \log P(\theta)$$
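To make the two steps concrete, here is a minimal sketch (my own illustration, with made-up data, not part of the original derivation) of EM for MAP on a 1-D mixture of two Gaussians, with a hypothetical $\mathrm{Dirichlet}(\alpha)$ prior on the mixing weights $\phi$ and flat priors on the means and variances. The only change from plain EM is the $\alpha-1$ pseudo-count in the M-step update for $\phi$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data from two well-separated Gaussian clusters.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
m = len(x)

K, alpha = 2, 5.0                    # Dirichlet(alpha) prior on weights phi
phi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: Q_i(z) = P(z | x, theta), the posterior responsibilities.
    joint = phi * gauss(x[:, None], mu, sigma)       # (m, K): P(x, z | theta)
    Q = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximize the lower bound plus log P(theta). The Dirichlet
    # prior simply adds (alpha - 1) pseudo-counts to each weight.
    Nk = Q.sum(axis=0)
    phi = (Nk + alpha - 1) / (m + K * (alpha - 1))
    mu = (Q * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(np.sort(mu))     # should land near the true means -2 and 3
```

Setting `alpha = 1` recovers plain maximum likelihood EM, since the pseudo-counts vanish, which mirrors the derivation above: drop the $\log P(\theta)$ term and the M-step reduces to the usual one.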