Conjugate Gradient Method

Abstract:

The conjugate gradient method is one of the improved gradient algorithms for optimization.
It is used to solve linear systems that meet some strict requirements.
Though it is not a general-purpose optimization solver, it is quite efficient and memory-saving for the problems it fits.

Content:

Theory:

The conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix $A$ is symmetric and positive-definite. Symmetric means that $A$ satisfies Equation \eqref{symmetric}.

\begin{equation}
\label{symmetric}
A=A^T
\end{equation}

Positive-definite means that for any nonzero real vector $x$, Equation \eqref{positive-def} holds.

\begin{equation}
\label{positive-def}
x^T A x > 0
\end{equation}
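The two preconditions can be checked numerically in NumPy (a minimal sketch, not part of the post's code), here on the same matrix $A$ that the example program at the end of the post uses:

```python
import numpy as np

# The same matrix as in the example program: A = [4 -1; -1 2].
A = np.array([[4.0, -1.0],
              [-1.0, 2.0]])

# Symmetric: A equals its transpose.
is_symmetric = np.allclose(A, A.T)

# Positive-definite: x^T A x > 0 for every nonzero real x; for a symmetric
# matrix this is equivalent to all eigenvalues being positive.
is_positive_definite = bool(np.all(np.linalg.eigvalsh(A) > 0))

print(is_symmetric, is_positive_definite)  # True True
```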

We can see there are quite a few preconditions for the conjugate gradient method, but there are advantages in return. It requires less memory than explicitly computing the inverse $A^{-1}$. And it is guaranteed to converge in at most $n$ iterations ($n$ being the order of $A$, ignoring rounding error), which is much faster and more reliable than ordinary gradient descent.

The problems that fit the conjugate gradient method can be expressed as Equation \eqref{cg-problem},

\begin{equation}
\label{cg-problem}
Ax=b
\end{equation}

where $A$ is a symmetric and positive-definite matrix, $x$ is the vector to be solved for, and $b$ is a constant vector. The computational steps for obtaining $x$ follow. Since it is a system of linear equations, the initial guess $x_0$ can be any value you like.

First, initialize the variables as in Equations \eqref{algo-init1}, \eqref{algo-init2} and \eqref{algo-init3}.

\begin{equation}
\label{algo-init1}
r_0=b-Ax_0
\end{equation}

\begin{equation}
\label{algo-init2}
p_0=r_0
\end{equation}

\begin{equation}
\label{algo-init3}
k=0
\end{equation}

Then repeat Equations \eqref{algo-update1}, \eqref{algo-update2} and \eqref{algo-update3} to update $x$ and $r$.

\begin{equation}
\label{algo-update1}
\alpha_k=\frac{r_k^T r_k}{p_k^T A p_k}
\end{equation}

\begin{equation}
\label{algo-update2}
x_{k+1}=x_k+\alpha_k p_k
\end{equation}

\begin{equation}
\label{algo-update3}
r_{k+1}=r_k-\alpha_k A p_k
\end{equation}

If $r_{k+1}$ is sufficiently small, say $r_{k+1}^T r_{k+1}<e$ where $e$ is a given small number, then exit the loop; otherwise update $p$ by Equations \eqref{algo-update4} and \eqref{algo-update5}, and increase $k$ as in Equation \eqref{algo-update6}.

\begin{equation}
\label{algo-update4}
\beta_k=\frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}
\end{equation}

\begin{equation}
\label{algo-update5}
p_{k+1}=r_{k+1}+\beta_k p_k
\end{equation}

\begin{equation}
\label{algo-update6}
k=k+1
\end{equation}

Then go back to the loop of Equations \eqref{algo-update1}, \eqref{algo-update2} and \eqref{algo-update3}.

The final result is $x_{k+1}$. The theory above is derived from the Wikipedia article on the conjugate gradient method.
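The whole iteration above can be sketched in NumPy as a cross-check of the steps (a minimal sketch, not the post's Fortran code; the right-hand side $b$ used here is a made-up example):

```python
import numpy as np

def conjugate_gradient(A, b, x0, e=1e-6):
    """Solve Ax = b for a symmetric positive-definite A, following the
    initialization and update equations given in the text."""
    x = x0.astype(float)
    r = b - A @ x                    # r0 = b - A x0
    p = r.copy()                     # p0 = r0
    k = 0
    while r @ r >= e:                # stop when r^T r is sufficiently small
        w = A @ p                    # w = A p, reused in two updates
        alpha = (r @ r) / (p @ w)    # alpha_k = r_k^T r_k / (p_k^T A p_k)
        x = x + alpha * p            # x_{k+1} = x_k + alpha_k p_k
        r_new = r - alpha * w        # r_{k+1} = r_k - alpha_k A p_k
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p         # p_{k+1} = r_{k+1} + beta_k p_k
        r = r_new
        k += 1                       # k = k + 1
    return x, k

A = np.array([[4.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 2.0])             # made-up right-hand side
x, k = conjugate_gradient(A, b, np.zeros(2))
print(x, k)
```

For an $n \times n$ system the loop exits after at most $n$ iterations up to rounding error, which matches the convergence guarantee mentioned earlier.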

Numerical implementation in Fortran:

Fortran subroutine for conjugate gradient:

module conjugate_gradient_method
contains
!The following subroutine gets x from Ax=b by the conjugate gradient method.
!N is the degree of A. The method comes from "矩阵计算的理论与方法" (Theory and
!Methods of Matrix Computations) by Xu Shufang, Peking University, p.154,
!with a little change.
subroutine cg(A,b,x,N)
!variable meanings
!A  matrix, from the main program, the coefficient matrix of Ax=b
!b  vector, from the main program, the right-hand side of Ax=b
!x  vector, the answer of Ax=b; on entry it holds the initial guess x0
!r  vector, minus the gradient of 0.5*x*A*x-b*x at x, i.e. b-Ax
!p  vector, the search direction, better than r alone
!w  vector, equal to A*p, stored to simplify the updates
!q1 number, r*r of the previous iteration
!q2 number, r*r of the current iteration
!ba,ar numbers, beta and alpha, named by their pronunciation
!e  number, convergence tolerance on r*r
!pw number, equal to p*w
!i  number, iteration counter
!N  number, the degree of A
integer i,N
real*8 A(N,N)
real*8 b(N),x(N),r(N),p(N),w(N)
real*8 q1,q2,ba,ar,e,pw
e=0.01d0
r=b-matmul(A,x)
call onedimenmul(r,r,N,q2)
p=r
i=1
!first printed column is the objective 0.5*x*A*x-b*x of the example problem
write(*,"(5f13.6,i13)")2*x(1)**2+x(2)**2-x(1)*x(2),x,r,i
do while(q2>=e)
  q1=q2
  w=matmul(A,p)
  call onedimenmul(p,w,N,pw) !pw = p*A*p
  ar=q1/pw                   !step length alpha
  x=x+ar*p
  r=r-ar*w
  call onedimenmul(r,r,N,q2)
  ba=q2/q1                   !correction factor beta
  p=r+ba*p
  i=i+1
  write(*,"(5f13.6,i13)")2*x(1)**2+x(2)**2-x(1)*x(2),x,r,i
end do
end subroutine cg

!This subroutine computes the dot product of two one-dimensional vectors.
subroutine onedimenmul(m1,m2,n,ans)
integer i,n
real*8 m1(n),m2(n),ans
ans=0d0
do i=1,n
  ans=ans+m1(i)*m2(i)
end do
end subroutine onedimenmul
end module conjugate_gradient_method

Conjugate gradient application:

program conjugate_gradient_example
use conjugate_gradient_method
real*8 x(2),A(2,2),b(2)
integer n
write(*,"(6A13)")"f","x1","x2","g1","g2","i"
data a/4.0d0,-1.0d0,-1.0d0,2.0d0/ !column-major: A = [4 -1; -1 2], symmetric positive-definite
data b/0.0d0,0.0d0/               !right-hand side
data x/1.0d0,1.0d0/               !initial guess x0
n=2
call cg(A,b,x,n)
end program conjugate_gradient_example
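A side note on the output: the first column printed by cg, 2*x(1)**2+x(2)**2-x(1)*x(2), is the quadratic objective $\frac{1}{2}x^TAx - b^Tx$ that the conjugate gradient method minimizes for this particular $A$ (with $b=0$). A quick NumPy check of that identity (a sketch, not part of the post's code):

```python
import numpy as np

A = np.array([[4.0, -1.0], [-1.0, 2.0]])          # same A as the example program

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal(2)
    f_printed = 2*x[0]**2 + x[1]**2 - x[0]*x[1]   # expression in the write statement
    f_quadratic = 0.5 * x @ A @ x                 # 0.5 x^T A x (b = 0 here)
    assert np.isclose(f_printed, f_quadratic)
print("identity verified")
```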

Questions:

Is it adapted to deep learning problems?

No. As we addressed above, it is adapted to linear systems with extra requirements (symmetric and positive-definite) on the coefficient matrix $A$.

Since deep learning models are non-linear, let alone meeting the extra requirements, the original conjugate gradient method is not suitable for deep learning.

There are, however, variants that try to apply conjugate gradient ideas to non-linear problems, but they may need more adjustment and testing to check whether they can help deep learning training.

Is the error scale that decides convergence sensitive?

Yes. The error variable $e$ in the code, which decides when the loop is considered converged, is quite sensitive. My own test case shows that changing $e=1.0\times10^{-3}$ to $e=1.0\times10^{-6}$ improves the precision of the solution very much.
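That sensitivity can be reproduced with a small experiment (a NumPy sketch on a made-up symmetric positive-definite system, not the author's test case): run the same iteration with the loop stopped at $e=10^{-3}$ and at $e=10^{-6}$, and compare the final error against the exact solution.

```python
import numpy as np

def cg_error(A, b, e):
    """Run conjugate gradient until r^T r < e and return ||x - x_exact||."""
    x = np.zeros(len(b))
    r = b - A @ x
    p = r.copy()
    while r @ r >= e:
        w = A @ p
        alpha = (r @ r) / (p @ w)
        x = x + alpha * p
        r_new = r - alpha * w
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return np.linalg.norm(x - np.linalg.solve(A, b))

# A made-up symmetric positive-definite system, larger than n = 2 so the
# tolerance actually matters.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)        # SPD by construction
b = rng.standard_normal(20)

err_loose = cg_error(A, b, 1e-3)
err_tight = cg_error(A, b, 1e-6)
print(err_loose, err_tight)          # the tighter tolerance gives the smaller error
```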

History:

  • 2018-04-10: create post for demonstration of conjugate gradient method
  • 2018-04-26: reformat this page with LaTeX-style equations and refs.