A multilevel stochastic gradient algorithm for PDE-constrained optimal control problems under uncertainty
Fabio Nobile
CSQI – Institute of Mathematics, EPFL, Switzerland
Joint work with: M. Martin (Criteo, Grenoble), S. Krumscheid (RWTH Aachen), P. Tsilifis (EPFL)
RICAM Special Semester on Optimization
Workshop 3 "Optimization and Inversion under Uncertainty"
November 11-15, 2019, Linz, Austria
Outline
1 Problem setting – quadratic optimal control problem
2 Discretization by finite elements + Monte Carlo
3 Deterministic (CG) iterative solvers versus Stochastic Gradient
4 Multilevel stochastic gradient algorithms
5 Conclusions
Problem setting – quadratic optimal control problem

Problem setting

(Ω, F, P): complete probability space
D ⊂ R^d: physical domain. Throughout, ‖·‖ = ‖·‖_{L²(D)}.

Forward problem
\[
\begin{cases}
-\operatorname{div}(a(x,\omega)\nabla y(x,\omega)) = g(x) + u(x), & \text{for a.e. } x \in D,\ \omega \in \Omega\\
y(x,\omega) = 0, & \text{for a.e. } x \in \partial D,\ \omega \in \Omega
\end{cases}
\tag{$*$}
\]
with a(·, ω) a random field s.t. 0 < a_min ≤ a(x, ω) ≤ a_max for all (x, ω) ∈ D × Ω.
⟹ random solution ω ↦ y(·, ω) ∈ H¹₀(D). In particular, y ∈ L²_P(Ω; H¹₀(D)).
u ∈ L²(D): control function

Optimal control problem
\[
\min_{u \in L^2(D),\ y \in L^2_P(\Omega; H^1_0(D))} \tilde J(u, y) := \frac{1}{2}\,\mathbb{E}_\omega\big[\|y(\cdot,\omega) - y_{\mathrm{target}}\|^2\big] + \frac{\beta}{2}\|u\|^2, \quad \text{subject to } (*)
\]
Problem setting – quadratic optimal control problem

Reduced functional

(Stochastic) affine solution operator: y_ω : L²(D) → H¹₀(D); for all ω ∈ Ω, u ↦ y_ω(u), solution of
\[
-\operatorname{div}(a(\cdot,\omega)\nabla y_\omega(u)) = g + u \ \text{in } D, \qquad y_\omega(u) = 0 \ \text{on } \partial D.
\]

Reduced functional: min_{u ∈ L²(D)} J(u), with
\[
J(u) = \mathbb{E}_\omega[f(u,\omega)], \qquad f(u,\omega) = \frac{1}{2}\|y_\omega(u) - y_{\mathrm{target}}\|^2 + \frac{\beta}{2}\|u\|^2
\]

Adjoint-based gradient computation:
\[
\nabla_u f(u,\omega) = \beta u + p_\omega(u), \qquad \nabla_u J(u) = \beta u + \mathbb{E}_\omega[p_\omega(u)]
\]
where p_ω(u) solves the adjoint problem, for all ω ∈ Ω:
\[
-\operatorname{div}(a(\cdot,\omega)\nabla p_\omega(u)) = y_\omega(u) - y_{\mathrm{target}} \ \text{in } D, \qquad p_\omega(u) = 0 \ \text{on } \partial D.
\]

Lipschitz continuity and strong convexity of ∇_u f: for all u₁, u₂ ∈ L²(D), ω ∈ Ω,
\[
\|\nabla_u f(u_1,\omega) - \nabla_u f(u_2,\omega)\| \le L\,\|u_1 - u_2\|, \qquad L = \beta + \frac{C_P^4}{a_{\min}^2}
\]
\[
\langle \nabla_u f(u_1,\omega) - \nabla_u f(u_2,\omega),\, u_1 - u_2\rangle_{L^2(D)} \ge \frac{\ell}{2}\,\|u_1 - u_2\|^2, \qquad \ell = 2\beta
\]
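The adjoint-based gradient can be checked on a toy discretization. The following is a minimal 1D finite-difference sketch, not the finite element code behind the talk: the grid, coefficient values, and the nodal load convention are illustrative assumptions. It validates the adjoint formula ∇_u f = βu + p_ω(u) against a central difference quotient of f.

```python
import numpy as np

def fd_matrix(a_mid, h):
    """FD matrix for -(a y')' on (0,1) with y = 0 at both ends.

    a_mid: coefficient at the n+1 cell midpoints, n = number of interior nodes.
    """
    n = len(a_mid) - 1
    A = np.zeros((n, n))
    for k in range(n):
        A[k, k] = (a_mid[k] + a_mid[k + 1]) / h**2
        if k > 0:
            A[k, k - 1] = -a_mid[k] / h**2
        if k < n - 1:
            A[k, k + 1] = -a_mid[k + 1] / h**2
    return A

def f_val(u, a_mid, g, y_t, beta, h):
    # f(u, omega) = 1/2 ||y - y_target||^2 + beta/2 ||u||^2
    # (rectangle-rule L2 norms; load vector taken nodally for simplicity)
    A = fd_matrix(a_mid, h)
    y = np.linalg.solve(A, g + u)
    return 0.5 * h * np.sum((y - y_t) ** 2) + 0.5 * beta * h * np.sum(u ** 2)

def grad_f(u, a_mid, g, y_t, beta, h):
    # adjoint-based gradient: grad f = beta*u + p, with A^T p = y - y_target
    A = fd_matrix(a_mid, h)
    y = np.linalg.solve(A, g + u)
    p = np.linalg.solve(A.T, y - y_t)
    return beta * u + p
```

One primal and one adjoint solve per sample ω yield the full gradient, regardless of the dimension of u.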
Discretization by finite elements + Monte Carlo

Finite dimensional approximation

Finite element approximation of the PDE: ∀u ∈ L²(D), ω ∈ Ω, u ↦ y^h_ω(u) solves
\[
\int_D a(\cdot,\omega)\,\nabla y^h_\omega(u)\cdot\nabla v_h = \int_D (g+u)\,v_h, \qquad \forall v_h \in Y^r_h
\]
Y^r_h: space of continuous P_r finite element functions vanishing on the boundary.

Monte Carlo approximation of the expectation: ω_i iid ∼ P, i = 1, …, N,
\[
J(u) = \mathbb{E}_\omega[f(u,\omega)] \approx \frac{1}{N}\sum_{i=1}^N f(u,\omega_i)
\]

Discrete optimal control problem:
\[
\min_{u \in L^2(D)} J_{h,N}(u) := \frac{1}{N}\sum_{i=1}^N f_h(u,\omega_i) = \frac{1}{N}\sum_{i=1}^N \Big(\frac{1}{2}\|y^h_{\omega_i}(u) - y_{\mathrm{target}}\|^2 + \frac{\beta}{2}\|u\|^2\Big)
\]

Remark: the unique minimizer u^*_{h,N} belongs to Y^r_h.
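The sample-average approximation of J can be illustrated on a 0-D surrogate of the random PDE, a(ω)·y = g + u, so that y_ω(u) = (g+u)/a(ω). This toy model and all constants are assumptions for illustration; for a ~ Uniform(1,2) one has E[1/a] = ln 2 and E[1/a²] = 1/2, so J(u) is available in closed form and the Monte Carlo average J_{h,N}(u) can be checked against it.

```python
import numpy as np

beta, g, y_t = 1e-2, 1.0, 0.5   # illustrative problem constants

def f(u, a):
    # f(u, omega) for the 0-D surrogate a * y = g + u
    y = (g + u) / a
    return 0.5 * (y - y_t) ** 2 + 0.5 * beta * u ** 2

def J_N(u, samples):
    # sample-average (Monte Carlo) approximation of J(u) = E[f(u, .)]
    return np.mean(f(u, samples))

def J_exact(u):
    # closed form for a ~ Uniform(1,2): E[1/a] = ln 2, E[1/a^2] = 1/2
    E_inv, E_inv2 = np.log(2.0), 0.5
    return 0.5 * ((g + u) ** 2 * E_inv2 - 2 * (g + u) * y_t * E_inv
                  + y_t ** 2) + 0.5 * beta * u ** 2
```

With N ≈ 2·10⁵ samples the sample average matches J(u) to about three digits, consistent with the O(N^{-1/2}) Monte Carlo error.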
Discretization by finite elements + Monte Carlo

Optimality conditions
\[
\text{primal pbs:}\quad \int_D a(\cdot,\omega_i)\,\nabla y^h_{\omega_i}\cdot\nabla v_h = \int_D (g + u_{h,N})\,v_h, \qquad \forall v_h \in Y^r_h,\ i = 1,\dots,N,
\]
\[
\text{adjoint pbs:}\quad \int_D a(\cdot,\omega_i)\,\nabla v_h\cdot\nabla p^h_{\omega_i} = \int_D (y^h_{\omega_i} - y_{\mathrm{tar}})\,v_h, \qquad \forall v_h \in Y^r_h,\ i = 1,\dots,N,
\]
\[
\text{sensitivity:}\quad \int_D \Big(\beta u_{h,N} + \frac{1}{N}\sum_{i=1}^N p^h_{\omega_i}\Big)\, v_h = 0, \qquad \forall v_h \in Y^r_h.
\]

Algebraic system
\[
\begin{pmatrix}
A_1 & & & & & & -M\\
& \ddots & & & & & \vdots\\
& & A_N & & & & -M\\
-M & & & A_1^T & & &\\
& \ddots & & & \ddots & &\\
& & -M & & & A_N^T &\\
& & & M & \cdots & M & \beta N M
\end{pmatrix}
\begin{pmatrix} y_1\\ \vdots\\ y_N\\ p_1\\ \vdots\\ p_N\\ u \end{pmatrix}
=
\begin{pmatrix} g\\ \vdots\\ g\\ -y_{\mathrm{tar}}\\ \vdots\\ -y_{\mathrm{tar}}\\ 0 \end{pmatrix}
\]
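The block optimality system can be assembled and checked on a small example. The sketch below is an illustration, not the talk's solver: the matrix sizes, the lumped mass matrix, random SPD stand-ins for the stiffness matrices, and the convention that loads enter as M·g are all assumptions. It verifies that the u-block of the coupled solve minimizes the sample-average objective J_{h,N}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, beta, h = 6, 3, 1e-2, 1.0 / 7

M = h * np.eye(n)                  # lumped mass matrix (illustrative)
As = []
for _ in range(N):                 # one SPD "stiffness" matrix per sample
    B = rng.standard_normal((n, n))
    As.append(B @ B.T + n * np.eye(n))
g_nod = np.ones(n)
y_tar = 0.5 * np.ones(n)

# Assemble the (2N+1)-by-(2N+1) block system of the optimality conditions
dim = (2 * N + 1) * n
K = np.zeros((dim, dim))
rhs = np.zeros(dim)
iu = 2 * N * n                     # index of the u block
for i, A in enumerate(As):
    iy, ip = i * n, (N + i) * n
    K[iy:iy+n, iy:iy+n] = A        # primal:  A_i y_i - M u = M g
    K[iy:iy+n, iu:iu+n] = -M
    rhs[iy:iy+n] = M @ g_nod
    K[ip:ip+n, iy:iy+n] = -M       # adjoint: -M y_i + A_i^T p_i = -M y_tar
    K[ip:ip+n, ip:ip+n] = A.T
    rhs[ip:ip+n] = -M @ y_tar
    K[iu:iu+n, ip:ip+n] = M        # sensitivity: sum_i M p_i + beta*N*M u = 0
K[iu:iu+n, iu:iu+n] = beta * N * M

u_star = np.linalg.solve(K, rhs)[iu:iu+n]

def J_hN(u):
    # sample-average objective, states eliminated by primal solves
    val = 0.5 * beta * u @ M @ u
    for A in As:
        y = np.linalg.solve(A, M @ (g_nod + u))
        val += 0.5 * (y - y_tar) @ M @ (y - y_tar) / N
    return val
```

By strict convexity, `u_star` should beat any perturbed control, which gives a genuine optimality check on the assembly.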
Deterministic (CG) iterative solvers versus Stochastic Gradient

Reduced algebraic system

Several approaches can be used to solve this coupled system
[Kouri-Heinkenschloss et al. 2013], [Van Barel-Vandewalle 2018], [Borzì-von Winckel 2011]

Eliminating (y_1, …, y_N) and (p_1, …, p_N) and introducing the block matrices
\[
\mathbf{A} = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_N \end{pmatrix}, \quad
\mathbf{M} = \begin{pmatrix} M & & \\ & \ddots & \\ & & M \end{pmatrix}, \quad
\mathbf{1} = \begin{pmatrix} \mathrm{Id}\\ \vdots\\ \mathrm{Id} \end{pmatrix}
\]
leads to a reduced system Gu = ξ with matrix
\[
G = \beta M + \frac{1}{N}\,\mathbf{1}^T \mathbf{M}\mathbf{A}^{-T}\mathbf{M}\mathbf{A}^{-1}\mathbf{M}\,\mathbf{1}
\]
The matrix G is spd and Cond(G) = O(β^{-1}), independently of h and N.
The reduced system can be solved efficiently by e.g. conjugate gradient. Denoting u^j_{h,N} the j-th iterate,
\[
\|u^*_{h,N} - u^j_{h,N}\| \le C\rho^j, \qquad \rho = \frac{\sqrt{\mathrm{Cond}(G)} - 1}{\sqrt{\mathrm{Cond}(G)} + 1}
\]
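The reduced matrix and its CG solve can be sketched with small dense matrices standing in for the FE matrices (sizes and values are illustrative; a hand-rolled CG replaces whatever production solver one would use):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, beta = 10, 4, 1e-2
M = np.eye(n)                      # mass matrix (identity for simplicity)
As = []
for _ in range(N):
    B = rng.standard_normal((n, n))
    As.append(B @ B.T + n * np.eye(n))   # SPD sample "stiffness" matrices

# Reduced matrix G = beta*M + (1/N) * 1^T M A^{-T} M A^{-1} M 1
G = beta * M.copy()
for A in As:
    X = np.linalg.solve(A, M)          # A_i^{-1} M
    Y = np.linalg.solve(A.T, M @ X)    # A_i^{-T} M A_i^{-1} M
    G += (M @ Y) / N

def cg(G, b, tol=1e-12, maxit=1000):
    # plain conjugate gradient for an spd matrix G
    u = np.zeros_like(b)
    r = b - G @ u
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Gp = G @ p
        alpha = rs / (p @ Gp)
        u += alpha * p
        r -= alpha * Gp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u

xi = rng.standard_normal(n)
u_cg = cg(G, xi)
```

Each application of G costs one primal and one adjoint solve per sample, which is exactly why the per-iteration cost of CG scales like N PDE solves.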
Deterministic (CG) iterative solvers versus Stochastic Gradient

Deterministic approach

Use a standard (deterministic) iterative method (e.g. CG) to solve the fully discrete system.

Error splitting, assuming smooth solutions y_ω(u*), p_ω(u*) ∈ H^{r+1}(D):
\[
\mathbb{E}\big[\|u^* - u^j_{h,N}\|^2\big] \le \underbrace{C_1\rho^{2j}}_{\text{CG error}} + \underbrace{\frac{C_2}{N}}_{\text{MC error}} + \underbrace{C_3 h^{2r+2}}_{\text{FE error}}
\]

Cost to compute u^j_{h,N}: assume that the cost of solving one PDE is O(h^{-γd}) (with γ ∈ (1,3])
⟹ Work to compute u^j_{h,N}: Work ∼ j N h^{-γd}

Complexity analysis. Balancing the error contributions: ρ^j ∼ N^{-1/2} ∼ h^{r+1} ∼ tol
\[
\mathrm{Work}(tol) \lesssim \underbrace{tol^{-2}}_{\text{MC}}\ \underbrace{tol^{-\frac{\gamma d}{r+1}}}_{\text{FE}}\ \underbrace{\log tol^{-1}}_{\text{solver}}
\]
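The balancing argument can be made concrete with a small work model. The function below is a sketch under assumed default parameters (r = 1, d = 2, γ = 1, CG contraction ρ = 0.9); for these values the predicted rate is Work ≈ tol^{-3} · log tol^{-1}.

```python
import math

def balanced_work(tol, r=1, d=2, gamma=1.0, rho=0.9):
    """Work model for the deterministic CG approach, balancing the three errors.

    Each error contribution is set ~ tol: rho^j ~ tol (CG), N^{-1/2} ~ tol (MC),
    h^{r+1} ~ tol (FE); the total work is then j * N * h^{-gamma*d}.
    All parameter values are illustrative.
    """
    N = math.ceil(tol ** -2)                                  # Monte Carlo samples
    h = tol ** (1.0 / (r + 1))                                # FE mesh size
    j = math.ceil(math.log(1.0 / tol) / math.log(1.0 / rho))  # CG iterations
    return j * N * h ** (-gamma * d)
```

Decreasing tol from 10⁻² to 10⁻³ should multiply the work by roughly 10³ times a logarithmic factor, matching tol^{-(2 + γd/(r+1))} · log tol^{-1}.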
Deterministic (CG) iterative solvers versus Stochastic Gradient

Stochastic gradient (Robbins-Monro)

Instead of introducing the Monte Carlo approximation upfront and then solving the discrete problem by a deterministic iterative solver, we can apply a stochastic gradient method to the problem that is continuous (non-discrete) in probability:
\[
u^{j+1}_h = u^j_h - \tau_j \nabla_u f_h(u^j_h, \omega_j) = (1 - \tau_j\beta)\,u^j_h - \tau_j\, p^h_{\omega_j}(u^j_h), \qquad \omega_j \overset{iid}{\sim} P
\]
Learning rate: \(\tau_j = \frac{\tau_0}{j+\alpha}\)

Proposition [Martin-Nobile-Krumscheid 2018]
Assuming y_ω(u*), p_ω(u*) ∈ H^{r+1}(D), for any α ∈ R₊ and τ₀ > 1/(2β) there exist D₁, D₂ > 0 independent of j and h s.t.
SG convergence: E[‖u*_h − u^j_h‖²] ≤ D₁ j^{-1}
Error splitting: E[‖u* − u^j_h‖²] ≤ 2D₁ j^{-1} + D₂ h^{2r+2}
Complexity: Work(tol) ≲ tol^{-2} tol^{-γd/(r+1)} (no log terms!)
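The Robbins-Monro iteration can be sketched on a 0-D surrogate of the random PDE, a(ω)·y = g + u with a ~ Uniform(1,2), whose stochastic gradient is βu + ((g+u)/a − y_target)/a and whose exact minimizer is available in closed form. All constants (β, τ₀, α, iteration count) are illustrative; note the surrogate is strongly convex with constant ≈ 0.26, so a much smaller τ₀ than the 1/(2β) threshold of the Proposition suffices here.

```python
import numpy as np

rng = np.random.default_rng(4)
beta, g, y_t = 1e-2, 1.0, 0.5       # illustrative problem constants
tau0, alpha = 5.0, 10.0             # learning-rate schedule tau_j = tau0/(j+alpha)

def stoch_grad(u, a):
    # grad_u f(u, omega) = beta*u + (y_omega(u) - y_target) * dy/du, with y = (g+u)/a
    y = (g + u) / a
    return beta * u + (y - y_t) / a

u = 0.0
for j in range(20_000):
    a = rng.uniform(1.0, 2.0)                     # one fresh sample per iteration
    u -= tau0 / (j + alpha) * stoch_grad(u, a)    # u_{j+1} = u_j - tau_j grad f(u_j, w_j)

# closed-form minimizer of J: beta*u + (g+u)*E[1/a^2] - y_t*E[1/a] = 0,
# with E[1/a] = ln 2 and E[1/a^2] = 1/2 for a ~ Uniform(1,2)
u_opt = (y_t * np.log(2.0) - g * 0.5) / (beta + 0.5)
```

One sample, hence one primal/adjoint solve, per iteration: this is the cost structure that removes the log factor from the complexity bound.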
Deterministic (CG) iterative solvers versus Stochastic Gradient

Numerical results

Optimal control problem:
\[
\min_{u \in L^2(D)} \frac{1}{2}\mathbb{E}_\omega\big[\|y_\omega(u) - y_{\mathrm{target}}\|^2\big] + \frac{\beta}{2}\|u\|^2
\quad \text{subject to} \quad
\begin{cases}
-\operatorname{div}(a(\cdot,\omega)\nabla y_\omega(u)) = g + u & \text{in } D\\
y_\omega(u) = 0 & \text{on } \partial D
\end{cases}
\]

Problem parameters
D = (0,1)², g = 1, y_target(x₁,x₂) = sin(πx₁) sin(πx₂), β = 10⁻⁴
a(x₁,x₂,ξ) = 1 + exp{θ(ξ₁cos(1.1πx₁) + ξ₂cos(1.2πx₁) + ξ₃sin(1.3πx₂) + ξ₄sin(1.4πx₂))}

[Figure: three realizations of the random diffusion coefficient a]
Deterministic (CG) iterative solvers versus Stochastic Gradient

Numerical results – SG convergence

[Figure: mean L² error E[‖u^j_h − u*_h‖] versus iteration counter j (log-log), with fitted rate error ≈ 10^{0.16547} j^{-0.48555}]
Mean L² error as a function of the iteration counter, estimated by sample average over 100 independent realizations.
Fixed mesh size h = 2⁻⁴, P1 finite elements.
Deterministic (CG) iterative solvers versus Stochastic Gradient

Numerical results – complexity of CG versus SG

[Figure: complexity plot, error E[‖u − u*‖] versus work (work model with γ = 1), for CG and SG with mean ± standard deviation, averaged over 20 repetitions; reference complexity error ≈ W^{-1/3}]
Multilevel stochastic gradient algorithms

Multilevel stochastic gradient

Let Y^r_{h_0} ⊂ Y^r_{h_1} ⊂ … ⊂ Y^r_{h_L} be a sequence of finer and finer FE spaces.
Idea: in the stochastic gradient algorithm, replace the single evaluation ∇_u f_h(u_j, ω_j) by a multilevel approximation of the expectation [Heinrich 1998], [Giles 2008]:
\[
E^{MLMC}_{L,\vec N}[\nabla_u f(u_j,\cdot)] = \sum_{\ell=0}^{L} \frac{1}{N_\ell}\sum_{i=1}^{N_\ell} \Big[\nabla_u f_{h_\ell}(u_j, \omega^{(i,\ell)}_j) - \nabla_u f_{h_{\ell-1}}(u_j, \omega^{(i,\ell)}_j)\Big]
\]
(with the convention ∇_u f_{h_{-1}} = 0), with ω^{(i,ℓ)}_j iid ∼ P (drawn independently between levels and at each iteration).

L controls the bias of the estimator (FE error on level h_L):
\[
\mathbb{E}\Big[E^{MLMC}_{L,\vec N}[\nabla_u f(u_j,\cdot)] - \mathbb{E}[\nabla_u f(u_j,\cdot)]\Big] = \mathbb{E}\big[\nabla_u f_{h_L}(u_j,\cdot) - \nabla_u f(u_j,\cdot)\big]
\]
\(\vec N = (N_0, \dots, N_L)\) controls the variance of the estimator (MC error):
\[
\mathrm{Var}\Big[E^{MLMC}_{L,\vec N}[\nabla_u f(u_j,\cdot)]\Big] = \sum_{\ell=0}^{L} \frac{\mathrm{Var}\big[\nabla_u f_{h_\ell}(u_j,\cdot) - \nabla_u f_{h_{\ell-1}}(u_j,\cdot)\big]}{N_\ell}
\]
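The telescoping estimator can be sketched on a synthetic hierarchy. Below, the "level-ℓ gradient" adds an assumed discretization-bias term c·4^{-ℓ}·a to the exact gradient of a 0-D surrogate (a(ω)·y = g + u, a ~ Uniform(1,2)); the 4^{-ℓ} decay mimics an O(h_ℓ²) FE bias with h_ℓ = 2^{-ℓ}. All constants are illustrative.

```python
import numpy as np

beta, g, y_t, c = 1e-2, 1.0, 0.5, 0.3   # illustrative constants

def grad_level(u, a, l):
    # "discretized" gradient on level l; level -1 is 0 by convention
    if l < 0:
        return np.zeros_like(a)
    exact = beta * u + ((g + u) / a - y_t) / a
    return exact + c * 4.0 ** (-l) * a   # synthetic level-l bias ~ h_l^2

def mlmc_grad(u, L, Ns, rng):
    # telescoping sum: fresh independent samples on each level,
    # reused within the (fine - coarse) pair to get small variance
    est = 0.0
    for l in range(L + 1):
        a = rng.uniform(1.0, 2.0, Ns[l])
        est += np.mean(grad_level(u, a, l) - grad_level(u, a, l - 1))
    return est
```

By construction the estimator is unbiased for the level-L gradient: the correction terms have rapidly decaying variance, so most samples can sit on the cheap coarse levels.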
Multilevel stochastic gradient algorithms

Multilevel stochastic gradient algorithm – first version
\[
u_{j+1} = u_j - \tau_j E^{MLMC}_{L_j,\vec N_j}[\nabla_u f(u_j,\cdot)]
= (1 - \tau_j\beta)\,u_j - \tau_j \sum_{\ell=0}^{L_j} \frac{1}{N_{\ell,j}}\sum_{i=1}^{N_{\ell,j}} \Big[p_{h_\ell}(u_j, \omega^{(i,\ell)}_j) - p_{h_{\ell-1}}(u_j, \omega^{(i,\ell)}_j)\Big]
\]
Learning rate \(\tau_j = \frac{\tau_0}{j+\alpha}\). Notice that ∀j, u_{j+1} ∈ Y^r_{h_{L_j}}.

We allow L and \(\vec N\) to depend on the iteration counter j. How to choose them optimally?
To recover the optimal control u* in the limit, we need L_j → ∞ as j → ∞. On the other hand, \(\vec N\) does not need to go to ∞.
A similar approach was proposed in [Dereich-MuellerGronbach 2015] for abstract optimization problems, under different working assumptions but with similar results.
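The full multilevel SG loop can be sketched on a synthetic hierarchy: a 0-D surrogate (a(ω)·y = g + u, a ~ Uniform(1,2)) whose level-ℓ gradient carries an assumed bias c·4^{-ℓ}·a mimicking an O(h_ℓ²) FE error. The schedules for L_j and N_{ℓ,j} below are illustrative choices, not the tuned ones of the talk: L_j grows slowly (driving the bias to zero) while the per-level sample sizes stay bounded.

```python
import numpy as np

rng = np.random.default_rng(6)
beta, g, y_t, c = 1e-2, 1.0, 0.5, 0.3   # illustrative constants
tau0, alpha = 5.0, 10.0                 # learning rate tau_j = tau0/(j+alpha)

def grad_level(u, a, l):
    # "discretized" gradient on level l; level -1 is 0 by convention
    if l < 0:
        return np.zeros_like(a)
    return beta * u + ((g + u) / a - y_t) / a + c * 4.0 ** (-l) * a

u = 0.0
for j in range(5_000):
    L_j = min(5, 1 + j // 500)                    # slowly increasing level
    est = 0.0
    for l in range(L_j + 1):
        N_l = max(1, 16 // 2 ** l)                # bounded per-level samples
        a = rng.uniform(1.0, 2.0, N_l)
        est += np.mean(grad_level(u, a, l) - grad_level(u, a, l - 1))
    u -= tau0 / (j + alpha) * est                 # u_{j+1} = u_j - tau_j E^MLMC[...]

# limit minimizer (bias -> 0 as L_j -> inf):
# beta*u + (g+u)*E[1/a^2] - y_t*E[1/a] = 0, with E[1/a]=ln 2, E[1/a^2]=1/2
u_opt = (y_t * np.log(2.0) - g * 0.5) / (beta + 0.5)
```

Because N_{ℓ,j} stays bounded and only L_j grows, the per-iteration cost remains dominated by a few coarse-level solves, which is the mechanism behind the complexity gain of the multilevel variant.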