deeplearningbook_054-1

AI · Posted 4 days ago by beixibaobao

==================【DeepLearningBook_054.txt】==================

While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged. Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm (for ease of hyperparameter tuning).

CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS

Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Require: Global learning rate $\epsilon$, decay rate $\rho$, momentum coefficient $\alpha$.
Require: Initial parameter $\theta$, initial velocity $v$.
Initialize accumulation variable $r = 0$
while stopping criterion not met do
  Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \dots, x^{(m)}\}$ with corresponding targets $y^{(i)}$.
  Compute interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
  Compute gradient: $g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_i L(f(x^{(i)}; \tilde{\theta}), y^{(i)})$
  Accumulate gradient: $r \leftarrow \rho r + (1 - \rho)\, g \odot g$
  Compute velocity update: $v \leftarrow \alpha v - \frac{\epsilon}{\sqrt{r}} \odot g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
  Apply update: $\theta \leftarrow \theta + v$
end while

Algorithm 8.7 The Adam algorithm
Require: Step size $\epsilon$ (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, $\rho_1$ and $\rho_2$ in $[0, 1)$. (Suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant $\delta$ used for numerical stabilization. (Suggested default: $10^{-8}$)
Require: Initial parameters $\theta$
Initialize 1st and 2nd moment variables $s = 0$, $r = 0$
Initialize time step $t = 0$
while stopping criterion not met do
  Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \dots, x^{(m)}\}$ with corresponding targets $y^{(i)}$.
  Compute gradient: $g \leftarrow \frac{1}{m} \nabla_{\theta} \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  $t \leftarrow t + 1$
  Update biased first moment estimate: $s \leftarrow \rho_1 s + (1 - \rho_1)\, g$
  Update biased second moment estimate: $r \leftarrow \rho_2 r + (1 - \rho_2)\, g \odot g$
  Correct bias in first moment: $\hat{s} \leftarrow \frac{s}{1 - \rho_1^t}$
  Correct bias in second moment: $\hat{r} \leftarrow \frac{r}{1 - \rho_2^t}$
  Compute update: $\Delta\theta = -\epsilon \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}$ (operations applied element-wise)
  Apply update: $\theta \leftarrow \theta + \Delta\theta$
end while

8.6 Approximate Second-Order Methods

In this section we discuss the application of second-order methods to the training of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject. For simplicity of exposition, the only objective function we examine is the empirical risk:

$$J(\theta) = \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}(x, y)} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}). \qquad (8.25)$$

However the methods we discuss here extend readily to more general objective functions that, for instance, include parameter regularization terms such as those discussed in Chapter 7.

8.6.1 Newton's Method

In Sec. 4.3, we introduced second-order gradient methods. In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization. The most widely used second-order method is Newton's method. We now describe Newton's method in more detail, with emphasis on its application to neural network training.

Newton's method is an optimization scheme based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, ignoring derivatives of higher order:

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0), \qquad (8.26)$$

where $H$ is the Hessian of $J$ with respect to $\theta$ evaluated at $\theta_0$. If we then solve for the critical point of this function, we obtain the Newton parameter update rule:

$$\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0) \qquad (8.27)$$

Thus for a locally quadratic function (with positive definite $H$), by rescaling the gradient by $H^{-1}$, Newton's method jumps directly to the minimum. If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton's method, given in Algorithm 8.8.

Algorithm 8.8 Newton's method with objective $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$.
Require: Initial parameter $\theta_0$
Require: Training set of $m$ examples
while stopping criterion not met do
  Compute gradient: $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  Compute Hessian: $H \leftarrow \frac{1}{m} \nabla_\theta^2 \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  Compute Hessian inverse: $H^{-1}$
  Compute update: $\Delta\theta = -H^{-1} g$
  Apply update: $\theta = \theta + \Delta\theta$
end while

For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton's method can be applied iteratively. This implies a two-step iterative procedure. First, update or compute the inverse Hessian (i.e. by updating the quadratic approximation). Second, update the parameters according to Eq. 8.27.

In Sec. 8.2.3, we discussed how Newton's method is appropriate only when the Hessian is positive definite. In deep learning, the surface of the objective function is typically non-convex with many features, such as saddle points, that are problematic for Newton's method. If the eigenvalues of the Hessian are not all positive, for example, near a saddle point, then Newton's method can actually cause updates to move in the wrong direction. This situation can be avoided by regularizing the Hessian. Common regularization strategies include adding a constant, $\alpha$, along the diagonal of the Hessian. The regularized update becomes

$$\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0). \qquad (8.28)$$

This regularization strategy is used in approximations to Newton's method, such as the Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), and works fairly well as long as the negative eigenvalues of the Hessian are still relatively close to zero. In cases where there are more extreme directions of curvature, the value of $\alpha$ would have to be sufficiently large to offset the negative eigenvalues. However, as $\alpha$ increases in size, the Hessian becomes dominated by the $\alpha I$ diagonal and the direction chosen by Newton's method converges to the standard gradient divided by $\alpha$. When strong negative curvature is present, $\alpha$ may need to be so large that Newton's method would make smaller steps than gradient descent with a properly chosen learning rate.

Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton's method for training large neural networks is limited by the significant computational burden it imposes. The number of elements in the Hessian is squared in the number of parameters, so with $k$ parameters (and for even very small neural networks the number of parameters $k$ can be in the millions), Newton's method would require the inversion of a $k \times k$ matrix, with computational complexity of $O(k^3)$. Also, since the parameters will change with every update, the inverse Hessian has to be computed at every training iteration. As a consequence, only networks with a very small number of parameters can be practically trained via Newton's method. In the remainder of this section, we will discuss alternatives that attempt to gain some of the advantages of Newton's method while side-stepping the computational hurdles.

8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions. The inspiration for this approach follows from a careful study of the weakness of the method of steepest descent (see Sec. 4.3 for details), where line searches are applied iteratively in the direction associated with the gradient. Fig. 8.6 illustrates how the method of steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective back-and-forth, zig-zag pattern. This happens because each line search direction, when given by the gradient, is guaranteed to be orthogonal to the previous line search direction. Let the previous search direction be $d_{t-1}$.
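The orthogonality of successive steepest-descent directions can be checked numerically. The following is a minimal sketch in plain Python, assuming a quadratic objective $\frac{1}{2} x^\top A x - b^\top x$ with an exact line search; the function names and the matrix values are hypothetical choices made for illustration and are not taken from the book.

```python
def matvec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def steepest_descent(A, b, x0, steps):
    """Minimize 0.5 * x^T A x - b^T x with an exact line search per step.

    Returns the final point and the list of search directions used.
    """
    x = list(x0)
    directions = []
    for _ in range(steps):
        # Search direction = negative gradient = b - A x.
        d = [b_i - ax_i for b_i, ax_i in zip(b, matvec(A, x))]
        curvature = dot(d, matvec(A, d))
        if curvature == 0.0:  # zero gradient: already at the minimum
            break
        step = dot(d, d) / curvature  # exact minimizer along d
        x = [x_i + step * d_i for x_i, d_i in zip(x, d)]
        directions.append(d)
    return x, directions


# Elongated quadratic bowl (hypothetical example values).
A = [[10.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x_final, dirs = steepest_descent(A, b, [1.0, 10.0], steps=5)
for prev, cur in zip(dirs, dirs[1:]):
    print(dot(prev, cur))  # each value is zero up to rounding error
```

Each new direction undoes none of the curvature information from the previous one, which is why the iterates zig-zag across the bowl instead of heading straight for the minimum.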

While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.
- Collocations: "adaptive learning rates" (自适应学习率); "represented by" (由……代表).
- Sentence analysis: A complex sentence. "While" introduces a concessive clause, inside which "suggest" takes an object clause; the main clause contrasts with the subordinate clause.
- Translation (Chinese): 虽然结果表明，具有自适应学习率的算法家族（以RMSProp和AdaDelta为代表）表现相当稳健，但尚未出现单一的最佳算法。
- Word analysis:
  - adaptive: adjective, from Latin "adaptare" (to fit, to adjust); meaning: 自适应的；适应的.
    - Mnemonic: "adapt" (适应) + "-ive" (adjective suffix) → adaptive.
    - Related words: adaptable (可适应的), adaptation (适应).
    - Pronunciation:
      - Syllables: ad + ap + tive /əˈdæptɪv/, stress on the second syllable.
      - Rule: ad → /əd/; "a" is the unstressed vowel /ə/, "d" is /d/.
      - Rule: ap → /æp/; "a" is the short vowel /æ/, "p" is /p/.
      - Rule: tive → /tɪv/; "t" is /t/, "i" is the short vowel /ɪ/, "v" is /v/.
  - robustly: adverb, from Latin "robustus" (strong); meaning: 稳健地；强劲地.
    - Mnemonic: "robust" (强壮的) + "-ly" (adverb suffix) → robustly.
    - Related words: robust (强壮的), robustness (稳健性).
    - Pronunciation:
      - Syllables: ro + bust + ly /rəˈbʌstli/, stress on the second syllable.
      - Rule: ro → /rə/; "r" is /r/, "o" is reduced to /ə/.
      - Rule: bust → /bʌst/; "u" is the short vowel /ʌ/.
      - Rule: ly → /li/; "l" is /l/, "y" is /i/.
  - emerged: past participle (in "has emerged"), from Latin "emergere" (to rise out, to appear); meaning: 出现；浮现.
    - Mnemonic: "e-" (outward) + "merge" → to come out, to appear.
    - Related words: emerge (出现), emergency (紧急情况).
    - Pronunciation:
      - Syllables: e + merged /ɪˈmɜːrdʒd/, stress on the second syllable.
      - Rule: e → /ɪ/; the short vowel /ɪ/.
      - Rule: merge → /mɜːrdʒ/; "er" is the long vowel /ɜːr/, "g" is /dʒ/, the final "e" is silent.
      - Rule: d → /d/.

Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam.
- Collocations: "in use" (在使用中).
- Sentence analysis: A simple sentence with subject-verb-object structure.
- Translation (Chinese): 目前，积极使用的最流行的优化算法包括随机梯度下降（SGD）、带动量的SGD、RMSProp、带动量的RMSProp、AdaDelta和Adam。
- Word analysis:
  - optimization: noun, from Latin "optimus" (best); meaning: 优化；最佳化.
    - Mnemonic: "optimize" (优化) + "-ation" (noun suffix) → optimization.
    - Related words: optimize (优化), optimal (最佳的).
    - Pronunciation:
      - Syllables: op + ti + mi + za + tion /ˌɑːptɪmaɪˈzeɪʃn/, stress on the fourth syllable.
      - Rule: op → /ɑːp/; "o" is /ɑː/, "p" is /p/.
      - Rule: ti → /tɪ/; "i" is the short vowel /ɪ/.
      - Rule: mi → /maɪ/; "i" is the long vowel /aɪ/.
      - Rule: za → /zeɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/; "ti" is /ʃ/ and the vowel is reduced before /n/.
  - momentum: noun, from Latin "momentum" (movement, moving power); meaning: 动量；动力.
    - Mnemonic: think of "moment" (时刻); momentum carries things forward from one moment to the next.
    - Related words: moment (时刻), momentary (短暂的).
    - Pronunciation:
      - Syllables: mo + men + tum /moʊˈmentəm/, stress on the second syllable.
      - Rule: mo → /moʊ/; "o" is the long vowel /oʊ/.
      - Rule: men → /men/; "e" is the short vowel /e/.
      - Rule: tum → /təm/; "u" is reduced to /ə/.

The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm (for ease of hyperparameter tuning).
- Collocations: "depend on" (取决于；依赖); "at this point" (目前；就现阶段而言).
- Sentence analysis: A simple sentence. "The choice of which algorithm to use" is the subject and "seems to depend on" is the predicate. Here "at this point" refers to time (the present state of the field), not to a point in an argument.
- Translation (Chinese): 目前，选择使用哪种算法似乎很大程度上取决于用户对该算法的熟悉程度（以便于超参数调整）。
- Word analysis:
  - familiarity: noun, from Latin "familiaris" (of the household, intimate); meaning: 熟悉；通晓.
    - Mnemonic: "familiar" (熟悉的) + "-ity" (noun suffix) → familiarity.
    - Related words: familiar (熟悉的), unfamiliar (不熟悉的).
    - Pronunciation:
      - Syllables: fa + mil + i + ar + i + ty /fəˌmɪliˈærəti/, primary stress on the fourth syllable.
      - Rule: fa → /fə/; "a" is reduced to /ə/.
      - Rule: mil → /mɪl/; "i" is the short vowel /ɪ/ (secondary stress).
      - Rule: i → /i/.
      - Rule: ar → /ær/; "a" is the short vowel /æ/.
      - Rule: i → /ə/; reduced vowel.
      - Rule: ty → /ti/; "y" is /i/.
  - hyperparameter: noun, formed from "hyper-" (over, beyond) and "parameter" (参数); meaning: 超参数.
    - Mnemonic: "hyper-" (beyond) + "parameter" → a parameter that sits above the ordinary parameters.
    - Related words: parameter (参数), hypersensitive (过敏的).
    - Pronunciation:
      - Syllables: hy + per + pa + ram + e + ter /ˌhaɪpərpəˈræmɪtər/, primary stress on the fourth syllable.
      - Rule: hy → /haɪ/; "y" is the long vowel /aɪ/.
      - Rule: per → /pər/; "e" is reduced to /ə/.
      - Rule: pa → /pə/; "a" is reduced to /ə/.
      - Rule: ram → /ræm/; "a" is the short vowel /æ/.
      - Rule: e → /ɪ/; the short vowel /ɪ/.
      - Rule: ter → /tər/; "e" is reduced to /ə/.

Algorithm 8.6 RMSProp algorithm with Nesterov momentum. Require: Global learning rate $\epsilon$, decay rate $\rho$, momentum coefficient $\alpha$.
- Collocations: none.
- Sentence analysis: An algorithm heading followed by its required hyperparameters.
- Translation (Chinese): 算法8.6 带Nesterov动量的RMSProp算法。要求：全局学习率 ε、衰减率 ρ、动量系数 α。
- Word analysis:
  - coefficient: noun, from Latin "co-" (together) + "efficere" (to bring about); meaning: 系数；率.
    - Mnemonic: "co-" (together) + "efficient" (有效率的); a coefficient is a number that acts together with a variable.
    - Related words: efficient (有效率的), inefficient (无效率的).
    - Pronunciation:
      - Syllables: co + ef + fi + cient /ˌkoʊɪˈfɪʃnt/, stress on the third syllable.
      - Rule: co → /koʊ/; "c" is /k/, "o" is the long vowel /oʊ/.
      - Rule: ef → /ɪ/; "e" is the short vowel /ɪ/ (the doubled "f" is pronounced once, in the next syllable).
      - Rule: fi → /fɪ/; "i" is the short vowel /ɪ/.
      - Rule: cient → /ʃnt/; "ci" is /ʃ/ and the vowel is reduced before /nt/.

Require: Initial parameter $\theta$, initial velocity $v$.
- Collocations: none.
- Sentence analysis: A pseudocode requirement listing the initial values the algorithm needs.
- Translation (Chinese): 要求：初始参数 θ、初始速度 v。
- Word analysis:
  - velocity: noun, from Latin "velocitas" (speed); meaning: 速度；速率.
    - Mnemonic: "vel" resembles the start of "vehicle" (车辆), and vehicles have speed.
    - Related words: vehicle (车辆), accelerate (加速).
    - Pronunciation:
      - Syllables: ve + lo + ci + ty /vəˈlɑːsəti/, stress on the second syllable.
      - Rule: ve → /və/; "e" is reduced to /ə/.
      - Rule: lo → /lɑː/; "o" is /ɑː/.
      - Rule: ci → /sə/; "c" before "i" is /s/, and "i" is reduced to /ə/.
      - Rule: ty → /ti/; "y" is /i/.

Initialize accumulation variable $r = 0$
while stopping criterion not met do
- Collocations: "stopping criterion" (停止准则).
- Sentence analysis: Two consecutive pseudocode lines. The first initializes the accumulator $r$ once, before the loop. The second opens a "while ... do" loop whose body repeats until the stopping criterion is met; the initialization is not part of the loop.
- Translation (Chinese): 初始化累积变量 r = 0。当停止准则未满足时，循环执行：
- Word analysis:
  - accumulation: noun, from Latin "accumulare" (to heap up); meaning: 积累；累积.
    - Mnemonic: "accumulate" (积累) + "-ion" (noun suffix) → accumulation.
    - Related words: accumulate (积累), accumulative (累积的).
    - Pronunciation:
      - Syllables: ac + cu + mu + la + tion /əˌkjuːmjəˈleɪʃn/, primary stress on the fourth syllable.
      - Rule: ac → /ə/; "a" is reduced to /ə/ (the doubled "c" is pronounced once, as /k/, in the next syllable).
      - Rule: cu → /kjuː/; "u" is the long vowel /juː/ (secondary stress).
      - Rule: mu → /mjə/; "u" is reduced.
      - Rule: la → /leɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/; "ti" is /ʃ/ and the vowel is reduced before /n/.
  - criterion: noun, from Greek "kriterion" (standard, means of judging); meaning: 标准；准则.
    - Mnemonic: "criteria" is the plural; "criterion" is the singular.
    - Related words: criteria (标准, plural), critic (批评家).
    - Pronunciation:
      - Syllables: cri + te + ri + on /kraɪˈtɪriən/, stress on the second syllable.
      - Rule: cri → /kraɪ/; "i" is the long vowel /aɪ/.
      - Rule: te → /tɪ/; "e" is the short vowel /ɪ/.
      - Rule: ri → /ri/.
      - Rule: on → /ən/; "o" is reduced to /ə/.

Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \dots, x^{(m)}\}$ with corresponding targets $y^{(i)}$.
- Collocations: "a minibatch of" (一小批).
- Sentence analysis: A simple imperative sentence describing the sampling step.
- Translation (Chinese): 从训练集 {x^(1), …, x^(m)} 中抽取一个包含 m 个样本的小批量，其对应的目标为 y^(i)。
- Word analysis:
  - minibatch: noun, formed from "mini-" (small) and "batch" (一批); meaning: 小批量.
    - Mnemonic: "mini" (small) + "batch" → a small batch.
    - Related words: batch (一批), minibus (小型公共汽车).
    - Pronunciation:
      - Syllables: mi + ni + batch /ˈmɪnibætʃ/, stress on the first syllable.
      - Rule: mi → /mɪ/; "i" is the short vowel /ɪ/.
      - Rule: ni → /ni/.
      - Rule: batch → /bætʃ/; "a" is the short vowel /æ/, "tch" is /tʃ/.
  - corresponding: adjective, from Medieval Latin "correspondere" (to match, to answer to); meaning: 相应的；对应的.
    - Mnemonic: "correspond" (相应；符合) + "-ing" → corresponding.
    - Related words: correspond (相应；符合), correspondence (通信；符合).
    - Pronunciation:
      - Syllables: cor + re + spond + ing /ˌkɔːrəˈspɑːndɪŋ/, primary stress on the third syllable.
      - Rule: cor → /kɔːr/; "o" is the long vowel /ɔː/.
      - Rule: re → /rə/; "e" is reduced to /ə/.
      - Rule: spond → /spɑːnd/; "o" is /ɑː/.
      - Rule: ing → /ɪŋ/; "ng" is the single sound /ŋ/.

Compute interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
- Collocations: none.
- Sentence analysis: A simple imperative sentence describing the interim (look-ahead) update of Nesterov momentum.
- Translation (Chinese): 计算临时更新：θ̃ ← θ + αv
- Word analysis:
  - interim: adjective, from Latin "interim" (meanwhile); meaning: 临时的；中间的.
    - Mnemonic: "inter-" (between) suggests something in between, hence temporary.
    - Related words: intermediate (中间的), interrupt (打断).
    - Pronunciation:
      - Syllables: in + ter + im /ˈɪntərɪm/, stress on the first syllable.
      - Rule: in → /ɪn/; "i" is the short vowel /ɪ/.
      - Rule: ter → /tər/; "e" is reduced to /ə/.
      - Rule: im → /ɪm/; "i" is the short vowel /ɪ/.

Compute gradient: $g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_i L(f(x^{(i)}; \tilde{\theta}), y^{(i)})$
- Collocations: none.
- Sentence analysis: A simple imperative sentence. The gradient is taken at the interim point $\tilde{\theta}$, not at $\theta$.
- Translation (Chinese): 计算梯度：g ← (1/m) ∇_θ̃ ∑_i L(f(x^(i); θ̃), y^(i))
- Word analysis:
  - gradient: noun, from Latin "gradus" (step, degree); meaning: 梯度；斜率.
    - Mnemonic: think of "grade" (等级); a gradient measures how steeply the grade changes.
    - Related words: grade (等级), gradual (逐渐的).
    - Pronunciation:
      - Syllables: gra + di + ent /ˈɡreɪdiənt/, stress on the first syllable.
      - Rule: gra → /ɡreɪ/; "a" is the long vowel /eɪ/.
      - Rule: di → /di/.
      - Rule: ent → /ənt/; "e" is reduced to /ə/.

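Accumulate gradient: $r \leftarrow \rho r + (1 - \rho)\, g \odot g$
- Collocations: none.
- Sentence analysis: A simple imperative sentence describing the running average of squared gradients; $\odot$ is the element-wise product.
- Translation (Chinese): 累积梯度：r ← ρr + (1 − ρ) g ⊙ g
- Word analysis:
  - accumulate: verb, from Latin "accumulare" (to heap up); meaning: 积累；累积.
    - Mnemonic: "ac-" (intensifier) + "cumulate" (堆积) → to heap up, to accumulate.
    - Related words: accumulation (积累), cumulate (堆积).
    - Pronunciation:
      - Syllables: ac + cu + mu + late /əˈkjuːmjəleɪt/, stress on the second syllable.
      - Rule: ac → /ək/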
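
The steps of Algorithm 8.6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the gradient is supplied by the caller instead of being computed from a minibatch, the small constant `delta` is an addition (the algorithm as quoted divides by $\sqrt{r}$ directly) that guards against division by zero, and the quadratic `loss` is a hypothetical test objective.

```python
import math


def rmsprop_nesterov(grad, theta, lr=0.01, rho=0.9, alpha=0.9,
                     steps=500, delta=1e-8):
    """Sketch of RMSProp with Nesterov momentum on a list of floats.

    `delta` is an added assumption (not in the algorithm as quoted) that
    avoids division by zero when an accumulated squared gradient is 0.
    """
    theta = list(theta)
    v = [0.0] * len(theta)  # initial velocity
    r = [0.0] * len(theta)  # accumulated squared gradients
    for _ in range(steps):
        # Interim update: look ahead along the current velocity.
        theta_tilde = [t + alpha * v_i for t, v_i in zip(theta, v)]
        g = grad(theta_tilde)
        # Accumulate gradient: r <- rho * r + (1 - rho) * g (.) g
        r = [rho * r_i + (1.0 - rho) * g_i * g_i for r_i, g_i in zip(r, g)]
        # Velocity update: v <- alpha * v - lr / sqrt(r) (.) g
        v = [alpha * v_i - lr / (math.sqrt(r_i) + delta) * g_i
             for v_i, r_i, g_i in zip(v, r, g)]
        # Apply update.
        theta = [t + v_i for t, v_i in zip(theta, v)]
    return theta


def loss(theta):
    """Hypothetical objective: an elongated quadratic bowl."""
    return 0.5 * (10.0 * theta[0] ** 2 + theta[1] ** 2)


def loss_grad(theta):
    return [10.0 * theta[0], theta[1]]


start = [5.0, -3.0]
end = rmsprop_nesterov(loss_grad, start)
print(loss(start), loss(end))  # the loss after training is much lower
```

Because the gradient is divided by the root of its own running average, the very first step has the same magnitude, `lr / sqrt(1 - rho)`, in every coordinate regardless of how large the raw gradient is.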

In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization.
- Collocations: "in contrast to" (与……相比); "make use of" (利用).
- Sentence analysis: A simple sentence. "In contrast to first-order methods" is an adverbial; the main clause is "second-order methods make use of second derivatives to improve optimization". (The stray "4.3" in the extracted text belongs to the preceding sentence, "In Sec. 4.3, ...".)
- Translation (Chinese): 与一阶方法相比，二阶方法利用二阶导数来改进优化。
- Word analysis:
  - derivatives: noun (plural), from Latin "derivare" (to draw off, to derive); meaning: 导数.
    - Mnemonic: "de-" (away) + "riv" (think of "river"): a branch drawn off from a river, hence something derived.
    - Related words: derive (verb, 派生), derivation (noun, 派生).
    - Pronunciation:
      - Syllables: de + riv + a + tives /dɪˈrɪvətɪvz/, stress on the second syllable.
      - Rule: de → /dɪ/; "e" is the short vowel /ɪ/.
      - Rule: riv → /rɪv/; "i" is the short vowel /ɪ/.
      - Rule: a → /ə/; reduced vowel.
      - Rule: tives → /tɪvz/; "i" is the short vowel /ɪ/, the plural ending is /z/.
  - optimization: noun, from "optimize" (优化); meaning: 优化.
    - Mnemonic: "optim" recalls "optimum" (最优的); "-ization" forms the noun.
    - Related words: optimize (verb, 优化), optimal (adjective, 最优的).
    - Pronunciation:
      - Syllables: op + ti + mi + za + tion /ˌɒptɪmaɪˈzeɪʃn/, primary stress on the penultimate syllable.
      - Rule: op → /ɒp/; "o" is the short vowel /ɒ/.
      - Rule: ti → /tɪ/; "i" is the short vowel /ɪ/.
      - Rule: mi → /maɪ/; "i" is the long vowel /aɪ/.
      - Rule: za → /zeɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/; "ti" is /ʃ/ and the vowel is reduced before /n/.

The most widely used second-order method is Newton's method.
- Sentence analysis: A simple sentence with subject, linking verb and complement.
- Translation (Chinese): 最广泛使用的二阶方法是牛顿法。

We now describe Newton's method in more detail, with emphasis on its application to neural network training.
- Collocations: "in more detail" (更详细地); "with emphasis on" (着重于；强调).
- Sentence analysis: A simple sentence; "with emphasis on its application to neural network training" is an accompanying adverbial.
- Translation (Chinese): 我们现在更详细地描述牛顿法，着重于它在神经网络训练中的应用。
- Word analysis:
  - emphasis: noun, from Greek "emphasis" (significance, stress); meaning: 强调；重点.
    - Mnemonic: "em-" (to make) + "phas" (think of "phase"): to make one phase stand out, i.e. to stress it.
    - Related words: emphasize (verb, 强调).
    - Pronunciation:
      - Syllables: em + pha + sis /ˈemfəsɪs/, stress on the first syllable.
      - Rule: em → /em/; "e" is the short vowel /e/.
      - Rule: pha → /fə/; "ph" is /f/, "a" is reduced to /ə/.
      - Rule: sis → /sɪs/; "i" is the short vowel /ɪ/.
  - application: noun, from "apply" (应用); meaning: 应用；申请.
    - Mnemonic: "apply" + "-ation" forms the noun.
    - Related words: apply (verb, 应用), applicant (noun, 申请人).
    - Pronunciation:
      - Syllables: ap + pli + ca + tion /ˌæplɪˈkeɪʃn/, primary stress on the penultimate syllable.
      - Rule: ap → /æp/; "a" is the short vowel /æ/.
      - Rule: pli → /plɪ/; "i" is the short vowel /ɪ/.
      - Rule: ca → /keɪ/; "c" is /k/, "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/; "ti" is /ʃ/ and the vowel is reduced before /n/.

Newton's method is an optimization scheme based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, ignoring derivatives of higher order:
- Collocations: "be based on" (基于).
- Sentence analysis: Subject, linking verb and complement. "based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$" is a postmodifier of "scheme"; "ignoring derivatives of higher order" is an accompanying adverbial. (The "310" fused into the extracted text is a page number, not an order of differentiation.)
- Translation (Chinese): 牛顿法是一种优化方案，它基于二阶泰勒级数展开在某点 θ0 附近近似 J(θ)，并忽略更高阶的导数：
- Word analysis:
  - scheme: noun, from Greek "skhema" (form, figure); meaning: 方案；计划.
    - Mnemonic: "scheme" and "screen" sound somewhat alike.
    - Related words: schism (分裂).
    - Pronunciation:
      - Syllables: scheme /skiːm/, a single syllable.
      - Rule: sch → /sk/; "ch" is /k/.
      - Rule: eme → /iːm/; "e" is the long vowel /iː/, the final "e" is silent.
  - approximate: verb, from Latin "approximare" (to come near); meaning: 近似；接近.
    - Mnemonic: "ap-" (intensifier) + "proxim" (think of "proximity", 接近) → to come close.
    - Related words: approximation (noun, 近似值).
    - Pronunciation:
      - Syllables: ap + prox + i + mate /əˈprɒksɪmeɪt/, stress on the second syllable.
      - Rule: ap → /ə/; "a" is reduced to /ə/ (the doubled "p" is pronounced once, in the next syllable).
      - Rule: prox → /prɒks/; "o" is the short vowel /ɒ/, "x" is /ks/.
      - Rule: i → /ɪ/; the short vowel /ɪ/.
      - Rule: mate → /meɪt/; "a" is the long vowel /eɪ/ in the verb (the adjective ends in /mət/).

Algorithm 8.7 The Adam algorithm
- Translation (Chinese): 算法8.7 Adam算法

Require: Step size $\epsilon$ (Suggested default: 0.001)
- Translation (Chinese): 要求：步长 ε（建议默认值：0.001）。

Require: Exponential decay rates for moment estimates, $\rho_1$ and $\rho_2$ in $[0, 1)$. (Suggested defaults: 0.9 and 0.999 respectively)
- Collocations: "exponential decay" (指数衰减).
- Sentence analysis: A pseudocode requirement stating the exponential decay rates used for the moment estimates.
- Translation (Chinese): 要求：矩估计的指数衰减率，ρ1 和 ρ2 在 [0, 1) 内。（建议默认值分别为 0.9 和 0.999）
- Word analysis:
  - exponential: adjective, from Latin "exponere" (to set forth); meaning: 指数的；迅速增长的.
    - Mnemonic: "ex-" (outward) + "pon" (to put): to put outward, to expand rapidly.
    - Related words: exponent (noun, 指数).
    - Pronunciation:
      - Syllables: ex + po + nen + tial /ˌekspəˈnenʃl/, primary stress on the penultimate syllable.
      - Rule: ex → /eks/; "e" is the short vowel /e/, "x" is /ks/.
      - Rule: po → /pə/; "o" is reduced to /ə/.
      - Rule: nen → /nen/; "e" is the short vowel /e/.
      - Rule: tial → /ʃl/; "ti" is /ʃ/ and the vowel is reduced before /l/.
  - decay: verb and noun, from Vulgar Latin "decadere" (to fall away); meaning: 衰减；衰败.
    - Mnemonic: "de-" (down) + "cay" (think of "cascade", 瀑布): falling down like a waterfall.
    - Related words: decadence (noun, 颓废).
    - Pronunciation:
      - Syllables: de + cay /dɪˈkeɪ/, stress on the second syllable.
      - Rule: de → /dɪ/; "e" is the short vowel /ɪ/.
      - Rule: cay → /keɪ/; "c" is /k/, "ay" is the long vowel /eɪ/.

Require: Small constant $\delta$ used for numerical stabilization. (Suggested default: $10^{-8}$)
- Collocations: "numerical stabilization" (数值稳定化).
- Sentence analysis: A pseudocode requirement for the small constant $\delta$ that keeps the division numerically stable.
- Translation (Chinese): 要求：用于数值稳定化的小常数 δ。（建议默认值：10^−8）
- Word analysis:
  - stabilization: noun, from "stabilize" (使稳定); meaning: 稳定化.
    - Mnemonic: "stabilize" + "-ation" forms the noun.
    - Related words: stabilize (verb, 使稳定), stable (adjective, 稳定的).
    - Pronunciation:
      - Syllables: sta + bi + li + za + tion /ˌsteɪbəlaɪˈzeɪʃn/, primary stress on the penultimate syllable.
      - Rule: sta → /steɪ/; "a" is the long vowel /eɪ/.
      - Rule: bi → /bə/; "i" is reduced to /ə/.
      - Rule: li → /laɪ/; "i" is the long vowel /aɪ/.
      - Rule: za → /zeɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/; "ti" is /ʃ/ and the vowel is reduced before /n/.

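Require: Initial parameters $\theta$
Initialize 1st and 2nd moment variables $s = 0$, $r = 0$
- Sentence analysis: Two pseudocode lines. The first names the required initial parameters $\theta$; the second is an imperative that initializes the first and second moment variables.
- Translation (Chinese): 要求：初始参数 θ。初始化一阶和二阶矩变量，s = 0，r = 0

Initialize time step $t = 0$
- Translation (Chinese): 初始化时间步 t = 0

while stopping criterion not met do
- Sentence analysis: The header of a pseudocode "while ... do" loop; written out in full it reads "while the stopping criterion is not met, do the following".
- Translation (Chinese): 当停止准则未满足时，循环执行：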

Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \dots, x^{(m)}\}$ with corresponding targets $y^{(i)}$.
- Collocations: "a minibatch of" (一小批).
- Sentence analysis: An imperative sentence describing the sampling of a minibatch of $m$ examples from the training set.
- Translation (Chinese): 从训练集 {x^(1), …, x^(m)} 中采样一个包含 m 个样本的小批量，其对应的目标为 y^(i)。
- Word analysis:
  - minibatch: noun, formed from "mini-" (small) and "batch" (一批); meaning: 小批量.
    - Mnemonic: "mini" means small and "batch" means a group, so together they mean a small batch.
    - Related words: batch (noun, 一批).
    - Pronunciation:
      - Syllables: mini + batch /ˈmɪnibætʃ/, stress on the first syllable.
      - Rule: mini → /ˈmɪni/; the first "i" is the short vowel /ɪ/, the second is /i/.
      - Rule: batch → /bætʃ/; "a" is the short vowel /æ/, "tch" is /tʃ/.

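Compute gradient: $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
- Sentence analysis: An imperative sentence describing the gradient computation.
- Translation (Chinese): 计算梯度：g ← (1/m) ∇_θ ∑_i L(f(x^(i); θ), y^(i))

$t \leftarrow t + 1$
- Translation (Chinese): t ← t + 1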

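Update biased first moment estimate: $s \leftarrow \rho_1 s + (1 - \rho_1)\, g$
- Sentence analysis: An imperative sentence describing the update of the biased first moment estimate.
- Translation (Chinese): 更新有偏一阶矩估计：s ← ρ1 s + (1 − ρ1) g
- Word analysis:
  - biased: adjective, from "bias" (偏见); meaning: 有偏见的；有偏的.
    - Mnemonic: "bias" + "-ed" forms the adjective.
    - Related words: bias (noun, 偏见).
    - Pronunciation:
      - Syllables: bi + ased /ˈbaɪəst/, stress on the first syllable.
      - Rule: bi → /baɪ/; "i" is the long vowel /aɪ/.
      - Rule: as → /əs/; "a" is reduced to /ə/, "s" is /s/.
      - Rule: ed → /t/; the "e" is silent and "d" is pronounced /t/ after the voiceless /s/.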

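Update biased second moment estimate: $r \leftarrow \rho_2 r + (1 - \rho_2)\, g \odot g$
- Sentence analysis: An imperative sentence describing the update of the biased second moment estimate; $\odot$ is the element-wise product.
- Translation (Chinese): 更新有偏二阶矩估计：r ← ρ2 r + (1 − ρ2) g ⊙ g

Correct bias in first moment: $\hat{s} \leftarrow \frac{s}{1 - \rho_1^t}$
- Sentence analysis: An imperative sentence describing the bias correction of the first moment.
- Translation (Chinese): 纠正一阶矩偏差：ŝ ← s / (1 − ρ1^t)

Correct bias in second moment: $\hat{r} \leftarrow \frac{r}{1 - \rho_2^t}$
- Sentence analysis: An imperative sentence describing the bias correction of the second moment.
- Translation (Chinese): 纠正二阶矩偏差：r̂ ← r / (1 − ρ2^t)

Compute update: $\Delta\theta = -\epsilon \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}$ (operations applied element-wise)
- Sentence analysis: An imperative sentence describing the computation of the parameter update.
- Translation (Chinese): 计算更新量：Δθ = −ε ŝ / (√r̂ + δ)（按元素进行操作）

Apply update: $\theta \leftarrow \theta + \Delta\theta$
- Sentence analysis: An imperative sentence describing the application of the update.
- Translation (Chinese): 应用更新：θ ← θ + Δθ

end while
- Translation (Chinese): 结束循环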
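
Taken together, the lines of Algorithm 8.7 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the gradient function is supplied by the caller instead of being computed from a minibatch, parameters are a plain list of floats, and `quad_grad` is a hypothetical test gradient.

```python
import math


def adam(grad, theta, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, steps=1000):
    """Sketch of the Adam update loop on a list of floats."""
    theta = list(theta)
    s = [0.0] * len(theta)  # biased first moment estimate
    r = [0.0] * len(theta)  # biased second moment estimate
    t = 0                   # time step
    for _ in range(steps):
        g = grad(theta)
        t += 1
        s = [rho1 * s_i + (1.0 - rho1) * g_i for s_i, g_i in zip(s, g)]
        r = [rho2 * r_i + (1.0 - rho2) * g_i * g_i for r_i, g_i in zip(r, g)]
        # Bias correction: both estimates start at zero, so early values
        # are too small by the factors (1 - rho1**t) and (1 - rho2**t).
        s_hat = [s_i / (1.0 - rho1 ** t) for s_i in s]
        r_hat = [r_i / (1.0 - rho2 ** t) for r_i in r]
        # Element-wise update: theta <- theta - lr * s_hat / (sqrt(r_hat) + delta)
        theta = [th - lr * sh / (math.sqrt(rh) + delta)
                 for th, sh, rh in zip(theta, s_hat, r_hat)]
    return theta


def quad_grad(theta):
    """Gradient of the hypothetical objective 0.5 * (10 * x**2 + y**2)."""
    return [10.0 * theta[0], theta[1]]


first = adam(quad_grad, [5.0, -3.0], steps=1)
print(first)  # each coordinate has moved by about lr = 0.001
```

After bias correction the first step reduces to `lr * g / (|g| + delta)`, so every coordinate with a nonzero gradient moves by roughly the step size, whatever the gradient's scale.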

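... of higher order: $J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0)$, (8.26)
- Sentence analysis: The tail of the earlier sentence about Newton's method ("ignoring derivatives of higher order"), followed by the second-order Taylor approximation of $J$ around $\theta_0$.
- Translation (Chinese): （忽略）更高阶的（导数）：J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θ J(θ0) + ½ (θ − θ0)ᵀ H (θ − θ0)，(8.26)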

where $H$ is the Hessian of $J$ with respect to $\theta$ evaluated at $\theta_0$.
- Sentence analysis: A relative clause introduced by "where", explaining the symbol $H$ in the preceding formula.
- Translation (Chinese): 其中 H 是 J 关于 θ 在 θ0 处求值的海森矩阵。
- Word analysis:
  - Hessian: noun, named after the German mathematician Otto Hesse; meaning: 海森矩阵.
    - Mnemonic: remember the mathematician's name.
    - Related words: none.
    - Pronunciation:
      - Syllables: Hes + si + an /ˈhesiən/, stress on the first syllable.
      - Rule: Hes → /hes/; "e" is the short vowel /e/.
      - Rule: si → /si/.
      - Rule: an → /ən/; "a" is reduced to /ə/.

If we then solve for the critical point of this function, we obtain the Newton parameter update rule: $\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$ (8.27)
- Collocations: "solve for" (求解).
- Sentence analysis: A complex sentence. "If we then solve for the critical point of this function" is a conditional clause; "we obtain the Newton parameter update rule ..." is the main clause.
- Translation (Chinese): 如果我们接着求解这个函数的临界点，就能得到牛顿参数更新规则：θ* = θ0 − H^−1 ∇θ J(θ0) (8.27)
- Word analysis:
  - critical: adjective, from Greek "kritikos" (able to judge); meaning: 关键的；临界的.
    - Mnemonic: think of "critic" (批评家), whose judgment is often decisive.
    - Related words: criticize (批评), criticism (批评).
    - Pronunciation:
      - Syllables: cri + ti + cal /ˈkrɪtɪkl/, stress on the first syllable.
      - Rule: cri → /krɪ/; "i" is the short vowel /ɪ/.
      - Rule: ti → /tɪ/; "i" is the short vowel /ɪ/.
      - Rule: cal → /kl/; "c" is /k/ and the vowel is reduced before /l/.
  - parameter: noun, from Greek "para-" (beside) + "metron" (measure); meaning: 参数；参量.
    - Mnemonic: "para" (beside) + "meter" (measure): a quantity measured alongside.
    - Related words: perimeter (周长).
    - Pronunciation:
      - Syllables: pa + ram + e + ter /pəˈræmɪtə(r)/, stress on the second syllable.
      - Rule: pa → /pə/; "a" is reduced to /ə/.
      - Rule: ram → /ræm/; "a" is the short vowel /æ/.
      - Rule: e → /ɪ/; the short vowel /ɪ/.
      - Rule: ter → /tə(r)/; "e" is reduced to /ə/, and the "r" is sounded only in rhotic accents.
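
The step from the Taylor approximation (8.26) to the update rule (8.27) can be written out explicitly. In the sketch below, $\hat{J}$ is a label introduced here for the quadratic approximation (it does not appear in the source), and the derivation relies on the Hessian being symmetric and invertible.

```latex
\begin{align*}
\hat{J}(\theta) &= J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0)
                 + \tfrac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0) \\
\nabla_\theta \hat{J}(\theta) &= \nabla_\theta J(\theta_0) + H (\theta - \theta_0)
  \qquad \text{(using } H = H^\top \text{)} \\
0 &= \nabla_\theta J(\theta_0) + H (\theta^* - \theta_0)
  \qquad \text{(critical point)} \\
\theta^* &= \theta_0 - H^{-1} \nabla_\theta J(\theta_0)
\end{align*}
```

The critical point is a minimum of $\hat{J}$ only when $H$ is positive definite; otherwise it may be a saddle point or a maximum.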

Thus for a locally quadratic function (with positive definite $H$), by rescaling the gradient by $H^{-1}$, Newton's method jumps directly to the minimum.
- Collocations: "positive definite" (正定的); "jump to" (直接到达).
- Sentence analysis: A simple sentence; "by rescaling the gradient by $H^{-1}$" is an adverbial of manner. For a locally quadratic function, rescaling the gradient by $H^{-1}$ takes Newton's method straight to the minimum.
- Translation (Chinese): 因此，对于一个局部二次函数（具有正定的 H），通过用 H^−1 对梯度进行重新缩放，牛顿法直接跳到最小值。
- Word analysis:
  - quadratic: adjective, from Latin "quadratus" (squared); meaning: 二次的.
    - Mnemonic: "quad" means four; a square has four sides, and a quadratic function involves squaring.
    - Related words: quadrangle (四边形).
    - Pronunciation:
      - Syllables: qua + dra + tic /kwɒˈdrætɪk/, stress on the second syllable.
      - Rule: qua → /kwɒ/; "qu" is /kw/, "a" is the short vowel /ɒ/.
      - Rule: dra → /dræ/; "a" is the short vowel /æ/.
      - Rule: tic → /tɪk/; "i" is the short vowel /ɪ/, "c" is /k/.
  - rescaling: present participle, formed from "re-" (again) + "scale" (缩放); meaning: 重新缩放.
    - Mnemonic: "re-" (again) + "scale" → to scale again.
    - Related words: scale (规模；缩放).
    - Pronunciation:
      - Syllables: re + scal + ing /riːˈskeɪlɪŋ/, stress on the second syllable.
      - Rule: re → /riː/; "e" is the long vowel /iː/.
      - Rule: scal → /skeɪl/; "c" is /k/, "a" is the long vowel /eɪ/.
      - Rule: ing → /ɪŋ/; "ng" is the single nasal sound /ŋ/.
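
The claim that a single Newton step reaches the minimum of a quadratic can be checked on a small example. The sketch below is plain Python for a two-parameter problem; the matrix `A`, the vector `B` and the function names are hypothetical choices made for illustration.

```python
def newton_step(grad, hess, theta):
    """One Newton update theta - H^{-1} g for a two-parameter problem."""
    g0, g1 = grad(theta)
    (a, b), (c, d) = hess(theta)
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("Hessian is singular")
    # Closed-form product H^{-1} g for a 2x2 matrix.
    step0 = (d * g0 - b * g1) / det
    step1 = (a * g1 - c * g0) / det
    return [theta[0] - step0, theta[1] - step1]


# Hypothetical quadratic J(theta) = 0.5 * theta^T A theta - B^T theta,
# with A positive definite so that a unique minimum exists.
A = [[3.0, 1.0], [1.0, 2.0]]
B = [1.0, 1.0]


def quad_grad(theta):
    return [A[0][0] * theta[0] + A[0][1] * theta[1] - B[0],
            A[1][0] * theta[0] + A[1][1] * theta[1] - B[1]]


def quad_hess(theta):
    return A  # the Hessian of a quadratic is constant


theta_star = newton_step(quad_grad, quad_hess, [10.0, -7.0])
print(theta_star)  # close to [0.2, 0.4], the solution of A theta = B
```

The starting point does not matter here: for an exactly quadratic objective the second-order Taylor expansion is the function itself, so its critical point is the true minimum.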

If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton's method, given in Algorithm 8.8.
- Collocations: "associated with" (与……相关联).
- Sentence analysis: A complex sentence. "If the objective function is convex but not quadratic ..." is a conditional clause; the main clause is "this update can be iterated"; "yielding ..." is a participial phrase expressing the result.
- Translation (Chinese): 如果目标函数是凸的但不是二次的（存在高阶项），这个更新可以迭代，从而产生与牛顿法相关的训练算法，如算法8.8所示。
- Word analysis:
  - objective: adjective here (in "objective function"), from Medieval Latin "objectivus"; meaning: 客观的；目标的.
    - Mnemonic: think of "object" (物体；目标); what relates to an object or goal is objective.
    - Related words: object (物体；目标).
    - Pronunciation:
      - Syllables: ob + jec + tive /əbˈdʒektɪv/, stress on the second syllable.
      - Rule: ob → /əb/; "o" is reduced to /ə/.
      - Rule: jec → /dʒek/; "j" is /dʒ/, "e" is the short vowel /e/, "c" is /k/.
      - Rule: tive → /tɪv/; "i" is the short vowel /ɪ/.
  - convex: adjective, from Latin "convexus" (vaulted, arched); meaning: 凸的.
    - Mnemonic: "con-" (together) + "vex" (bent): bent together into a bulging shape.
    - Related words: concave (凹的).
    - Pronunciation:
      - Syllables: con + vex /ˈkɒnveks/, stress on the first syllable.
      - Rule: con → /kɒn/; "o" is the short vowel /ɒ/.
      - Rule: vex → /veks/; "e" is the short vowel /e/, "x" is /ks/.
  - iterated: past participle (in "can be iterated"), from Latin "iterare" (to repeat); meaning: 迭代.
    - Mnemonic: read "it" as "once" and "erate" as "do repeatedly": doing it once and again, i.e. iterating.
    - Related words: iterate (迭代), iteration (迭代).
    - Pronunciation:
      - Syllables: it + er + a + ted /ˈɪtəreɪtɪd/, stress on the first syllable.
      - Rule: it → /ɪt/; "i" is the short vowel /ɪ/.
      - Rule: er → /ər/; "e" is reduced to /ə/.
      - Rule: a → /eɪ/; the long vowel /eɪ/.
      - Rule: ted → /tɪd/; "-ed" is /ɪd/ after /t/.
  - algorithm: noun, from the name of the mathematician al-Khwarizmi (花拉子米), via Medieval Latin; meaning: 算法.
    - Mnemonic: "al" (like "all") + "go" + "rithm" (like "rhythm"): everything goes to a set rhythm.
    - Related words: logarithm (对数).
    - Pronunciation:
      - Syllables: al + go + rithm /ˈælɡərɪðəm/, stress on the first syllable.
      - Rule: al → /æl/; "a" is the short vowel /æ/.
      - Rule: go → /ɡə/; "o" is reduced to /ə/.
      - Rule: rithm → /rɪðəm/; "i" is the short vowel /ɪ/, "th" is /ð/, and a reduced vowel is inserted before /m/.

For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton's method can be applied iteratively.
- Collocations: "as long as" (只要).
- Sentence analysis: A complex sentence. "For surfaces that are not quadratic" is an adverbial; "as long as the Hessian remains positive definite" is a conditional clause; the main clause is "Newton's method can be applied iteratively".
- Translation (Chinese): 对于非二次的曲面，只要海森矩阵保持正定，牛顿法就可以迭代应用。
- Word analysis:
  - Hessian: noun, named after the mathematician Hesse; meaning: 海森矩阵.
    - Mnemonic: memorize it as a technical term derived from a person's name.
    - Related words: none.
    - Pronunciation:
      - Syllables: Hes + si + an /ˈhesiən/, stress on the first syllable.
      - Rule: Hes → /hes/; "e" is the short vowel /e/.
      - Rule: si → /si/.
      - Rule: an → /ən/; "a" is reduced to /ə/.

Algorithm 8.8 Newton’s method with objective $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$.
Require: Initial parameter $\theta_0$
Require: Training set of $m$ examples
while stopping criterion not met do
  Compute gradient: $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  Compute Hessian: $H \leftarrow \frac{1}{m} \nabla^2_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  Compute Hessian inverse: $H^{-1}$
  Compute update: $\Delta\theta = -H^{-1} g$
  Apply update: $\theta = \theta + \Delta\theta$
end while
- Sentence analysis: this is algorithm pseudocode with a loop. While the stopping criterion is not met, it computes the gradient, the Hessian, the inverse Hessian and the update, and then applies the update.
- Chinese translation: 算法8.8 目标函数为 $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$ 的牛顿法。要求：初始参数 $\theta_0$；包含 $m$ 个样本的训练集。当停止准则未满足时：计算梯度 $g$；计算海森矩阵 $H$；计算海森矩阵的逆 $H^{-1}$；计算更新量 $\Delta\theta = -H^{-1} g$；应用更新 $\theta = \theta + \Delta\theta$；结束循环。

This implies a two-step iterative procedure.
- Sentence analysis: a simple sentence. In the source page the algorithm box splits it in two; it states that applying Newton’s method iteratively amounts to a procedure with two steps.
- Chinese translation: 这意味着一个两步迭代过程。
- Word analysis:
  - iterative: adjective, from Latin "iterare" (to repeat); meaning: iterative (迭代的).
    - Mnemonic: as with "iterated", read "it" as "one time" and "-erate" as "do repeatedly".
    - Similar words: iterate (迭代), iteration (迭代).
    - Pronunciation:
      - Syllables: it + er + a + tive /ˈɪtərətɪv/, stress on the first syllable.
      - Rule: it → /ɪt/; "i" is short /ɪ/, "t" is /t/.
      - Rule: er → /ər/; "e" is the weak vowel /ə/, and "r" is pronounced because a vowel follows.
      - Rule: a → /ə/; unstressed "a" is the weak vowel /ə/.
      - Rule: tive → /tɪv/; "t" is /t/, "i" is short /ɪ/, "v" is /v/.

First, update or compute the inverse Hessian (i.e. by updating the quadratic approximation).
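- Sentence analysis: a simple sentence stating that the first step is to update or compute the inverse Hessian.
- Chinese translation: 首先，更新或计算海森矩阵的逆（即通过更新二次近似）。
- Word analysis:
  - inverse: adjective, from Latin "inversus" (turned over); meaning: inverse; opposite (逆的；相反的).
    - Mnemonic: "in-" (opposite) + "verse" (turn): turned the other way round.
    - Similar words: convert (转变), diverse (多样的).
    - Pronunciation:
      - Syllables: in + verse /ˈɪnvɜːs/, stress on the first syllable.
      - Rule: in → /ɪn/; "i" is short /ɪ/, "n" is nasal /n/.
      - Rule: verse → /vɜːs/; "v" is /v/, "er" is the long vowel /ɜː/, "s" is /s/, final "e" is silent.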

Second, update the parameters according to Eq. 8.27.
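- Fixed expression: "according to", meaning "in line with" (根据).
- Sentence analysis: a simple sentence stating that the second step is to update the parameters using Eq. 8.27.
- Chinese translation: 其次，根据公式8.27更新参数。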
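
The two-step loop of Algorithm 8.8 (compute the inverse Hessian, then apply the update $\Delta\theta = -H^{-1} g$) can be sketched in NumPy. This is a minimal illustration rather than code from the book. The toy objective $\sum_i (e^{z_i} - z_i)$ with $z = M\theta$, the matrix `M`, and all function names are assumptions chosen so that the objective is convex but not quadratic and has its minimum at $\theta = 0$. The sketch solves the linear system $H \Delta\theta = -g$ instead of forming $H^{-1}$ explicitly, which gives the same update.

```python
import numpy as np

# Hypothetical toy problem: f(theta) = sum(exp(z) - z) with z = M @ theta.
# It is convex, not quadratic, and minimized at theta = 0.
M = np.array([[1.0, 2.0], [-1.0, 1.0]])

def gradient(theta):
    z = M @ theta
    return M.T @ (np.exp(z) - 1.0)

def hessian(theta):
    z = M @ theta
    # M^T diag(exp(z)) M, positive definite because M is invertible
    return M.T @ (np.exp(z)[:, None] * M)

def newton(theta0, grad_fn, hess_fn, tol=1e-10, max_iter=50):
    """Iterate Newton updates until the gradient norm falls below tol."""
    theta = np.asarray(theta0, dtype=float)
    n_updates = 0
    for _ in range(max_iter):
        g = grad_fn(theta)                  # compute gradient
        if np.linalg.norm(g) < tol:         # stopping criterion
            break
        H = hess_fn(theta)                  # compute Hessian
        delta = -np.linalg.solve(H, g)      # update: delta = -H^{-1} g
        theta = theta + delta               # apply update
        n_updates += 1
    return theta, n_updates

theta_star, n_updates = newton([1.0, -1.0], gradient, hessian)
print(theta_star, n_updates)
```

Because this objective is not quadratic, a single Newton step does not land on the minimum, which is why the update has to be repeated.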

In Sec. 8.2.3, we discussed how Newton’s method is appropriate only when the Hessian is positive definite.
- Sentence analysis: a complex sentence. "In Sec. 8.2.3" is an adverbial, and "how Newton's method is appropriate only when the Hessian is positive definite" is the object clause of "discussed".
- Chinese translation: 在8.2.3节中，我们讨论了只有当海森矩阵正定时牛顿法才适用。

In deep learning, the surface of the objective function is typically non-convex with many features, such as saddle points, that are problematic for Newton’s method.
- Fixed expression: "such as", meaning "for example" (例如).
- Sentence analysis: a subject + linking verb + complement sentence. "In deep learning" is an adverbial, "the surface of the objective function" is the subject, "is" is the linking verb and "typically non-convex" is the complement. "with many features" is a prepositional phrase, "such as saddle points" gives an example of "features", and "that are problematic for Newton's method" is a relative clause modifying "features".
- Chinese translation: 在深度学习中，目标函数的表面通常是非凸的，具有许多特征，如鞍点，这些特征对牛顿法来说是有问题的。
- Word analysis:
  - non-convex: adjective; "non-" is a negating prefix and "convex" comes from Latin "convexus"; meaning: non-convex (非凸的).
    - Mnemonic: "non-" (not) + "convex" (凸的) gives "not convex".
    - Similar words: convex (凸的).
    - Pronunciation:
      - Syllables: non + con + vex /ˌnɒnˈkɒnveks/, stress on the second syllable.
      - Rule: non → /nɒn/; "o" is short /ɒ/.
      - Rule: con → /kɒn/; "o" is short /ɒ/.
      - Rule: vex → /veks/; "e" is short /e/.
  - saddle: noun, from Old English "sadol"; meaning: saddle (马鞍); "saddle point" is 鞍点.
    - Mnemonic: picture a riding saddle: it curves up in one direction and down in the other, like the function near a saddle point.
    - Similar words: sad (悲伤的).
    - Pronunciation:
      - Syllables: sad + dle /ˈsædl/, stress on the first syllable.
      - Rule: sad → /sæd/; "a" is short /æ/.
      - Rule: dle → /dl/; the "e" is silent.

If the eigenvalues of the Hessian are not all positive, for example, near a saddle point, then Newton’s method can actually cause updates to move in the wrong direction.
- Fixed expression: "for example" (例如).
- Sentence analysis: a complex sentence. "If the eigenvalues of the Hessian are not all positive" is a conditional clause, "for example, near a saddle point" is a parenthetical, and "then Newton's method can actually cause updates to move in the wrong direction" is the main clause.
- Chinese translation: 如果海森矩阵的特征值并非全为正，例如在鞍点附近，那么牛顿法实际上可能会导致更新朝着错误的方向移动。
- Word analysis:
  - eigenvalues: plural noun, from German "Eigenwert"; meaning: eigenvalues (特征值).
    - Mnemonic: "eigen-" means "own", "value" means 值: a matrix's "own values".
    - Similar words: value (价值).
    - Pronunciation:
      - Syllables: ei + gen + val + ues /ˈaɪɡənvæljuːz/, stress on the first syllable.
      - Rule: ei → /aɪ/.
      - Rule: gen → /ɡən/; "g" is hard /ɡ/ (as in German), "e" is the weak vowel /ə/.
      - Rule: val → /væl/; "a" is short /æ/.
      - Rule: ues → /juːz/.
  - Hessian: noun, named after the German mathematician Ludwig Otto Hesse; meaning: Hessian matrix (海森矩阵).
    - Mnemonic: remember it as the matrix associated with the mathematician Hesse.
    - Similar words: none.
    - Pronunciation:
      - Syllables: Hes + si + an /ˈhesiən/, stress on the first syllable.
      - Rule: Hes → /hes/; "e" is short /e/.
      - Rule: si → /si/.
      - Rule: an → /ən/.
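
A small numerical check can make the "wrong direction" concrete. The sketch below is an illustration under stated assumptions: the function $f(x, y) = x^2 - y^2$, the starting point, and the learning rate 0.1 are hypothetical choices, not taken from the book. Its Hessian has eigenvalues $+2$ and $-2$, so the origin is a saddle point.

```python
import numpy as np

def f(theta):
    x, y = theta
    return x**2 - y**2

def grad(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

# Constant Hessian of f, with eigenvalues +2 and -2 (indefinite)
H = np.array([[2.0, 0.0], [0.0, -2.0]])

theta0 = np.array([0.5, 1.0])
g = grad(theta0)

newton_step = -np.linalg.solve(H, g)   # -H^{-1} g
grad_step = -0.1 * g                   # plain gradient step, hypothetical learning rate

theta_newton = theta0 + newton_step    # lands exactly on the saddle point (0, 0)
theta_gd = theta0 + grad_step

print(f(theta0), f(theta_newton), f(theta_gd))
print(g @ newton_step)                 # positive: the Newton step points uphill
```

Here the Newton step jumps straight to the saddle point and increases the objective, while the gradient step decreases it. The positive inner product between the gradient and the Newton step confirms that the step is an ascent direction.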

This situation can be avoided by regularizing the Hessian.
- Sentence analysis: a passive sentence with a modal verb. "This situation" is the subject, "can be avoided" is the predicate, and "by regularizing the Hessian" is an adverbial of manner.
- Chinese translation: 这种情况可以通过对海森矩阵进行正则化来避免。
- Word analysis:
  - regularizing: present participle / gerund, from "regular" + "-ize" (to make); meaning: regularizing (正则化).
    - Mnemonic: "regular" (规则的) + "-ize" (使……化): to make regular.
    - Similar words: regular (规则的).
    - Pronunciation:
      - Syllables: reg + u + lar + iz + ing /ˈreɡjələraɪzɪŋ/, stress on the first syllable.
      - Rule: reg → /reɡ/; "e" is short /e/.
      - Rule: u → /jə/; unstressed "u" reduces to /jə/.
      - Rule: lar → /lər/; "r" is pronounced because a vowel follows.
      - Rule: iz → /aɪz/.
      - Rule: ing → /ɪŋ/.

Common regularization strategies include adding a constant, $\alpha$, along the diagonal of the Hessian.
- Fixed expression: "along the diagonal" (沿着对角线).
- Sentence analysis: a subject + verb + object sentence. "Common regularization strategies" is the subject, "include" is the verb, and "adding a constant, α, along the diagonal of the Hessian" is the object.
- Chinese translation: 常见的正则化策略包括沿着海森矩阵的对角线添加一个常数 $\alpha$。
- Word analysis:
  - regularization: noun, from "regular" + "-ization" (noun suffix); meaning: regularization (正则化).
    - Mnemonic: "regular" plus the suffix "-ization" gives the noun for "making regular".
    - Similar words: regular (规则的).
    - Pronunciation:
      - Syllables: reg + u + lar + i + za + tion /ˌreɡjələraɪˈzeɪʃn/, primary stress on the fifth syllable ("za").
      - Rule: reg → /reɡ/; "e" is short /e/.
      - Rule: u → /jə/; unstressed "u" reduces to /jə/.
      - Rule: lar → /lər/.
      - Rule: i → /aɪ/.
      - Rule: za → /zeɪ/.
      - Rule: tion → /ʃn/.
  - diagonal: adjective and noun, from Greek "diagonios"; meaning: diagonal (对角线的；对角线).
    - Mnemonic: "dia-" (across) + "gon" (angle): a line across the angles.
    - Similar words: dialogue (对话).
    - Pronunciation:
      - Syllables: di + ag + o + nal /daɪˈæɡənl/, stress on the second syllable.
      - Rule: di → /daɪ/.
      - Rule: ag → /æɡ/.
      - Rule: o → /ə/; unstressed "o" is the weak vowel /ə/.
      - Rule: nal → /nl/.

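The regularized update becomes $\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0)$. (8.28)
- Sentence analysis: a simple declarative sentence giving the formula for the regularized update.
- Chinese translation: 正则化更新变为 $\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0)$。（8.28）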
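
The effect of Eq. 8.28 can be checked on a small indefinite Hessian. The sketch below is illustrative only: the helper name `regularized_newton_step`, the quadratic $f(x, y) = x^2 - y^2$, the point `theta0`, and the value $\alpha = 3$ are assumptions. Since the most negative eigenvalue is $-2$, any $\alpha > 2$ makes $H + \alpha I$ positive definite.

```python
import numpy as np

def regularized_newton_step(H, g, alpha):
    """Return -(H + alpha * I)^{-1} g, the step used in Eq. 8.28."""
    return -np.linalg.solve(H + alpha * np.eye(H.shape[0]), g)

# Hessian of f(x, y) = x^2 - y^2 (eigenvalues +2 and -2)
H = np.array([[2.0, 0.0], [0.0, -2.0]])
theta0 = np.array([0.5, 1.0])
g = H @ theta0                      # gradient of 0.5 * theta^T H theta at theta0

alpha = 3.0                         # hypothetical; must exceed 2 here
step = regularized_newton_step(H, g, alpha)
eigs = np.linalg.eigvalsh(H + alpha * np.eye(2))

print(eigs)                         # all positive after regularization
print(step, g @ step)               # negative inner product: a descent direction
```

With the regularized Hessian the step has a negative inner product with the gradient, so it is a descent direction even at a point where the unregularized Newton step would head toward the saddle.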

This regularization strategy is used in approximations to Newton’s method, such as the Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), and works fairly well as long as the negative eigenvalues of the Hessian are still relatively close to zero.
- Fixed expressions: "such as" (例如); "as long as" (只要).
- Sentence analysis: one subject with two predicates joined by "and". The first is "is used in approximations to Newton's method"; the second is "works fairly well", followed by a conditional clause introduced by "as long as".
- Chinese translation: 这种正则化策略用于牛顿法的近似方法中，例如列文伯格–马夸尔特算法（Levenberg, 1944; Marquardt, 1963），并且只要海森矩阵的负特征值仍然相对接近零，效果就相当好。
- Word analysis:
  - approximations: plural noun, from "approximate" + "-ion" (noun suffix); meaning: approximations; approximate methods (近似值；近似方法).
    - Mnemonic: "approximate" plus the suffix "-ion" gives the noun.
    - Similar words: approximate (近似的).
    - Pronunciation:
      - Syllables: ap + prox + i + ma + tions /əˌprɒksɪˈmeɪʃnz/, primary stress on the fourth syllable ("ma").
      - Rule: ap → /ə/; the double "p" is pronounced once, at the start of the next syllable.
      - Rule: prox → /prɒks/; "o" is short /ɒ/.
      - Rule: i → /ɪ/.
      - Rule: ma → /meɪ/.
      - Rule: tions → /ʃnz/.
  - algorithm: noun, from the name of the mathematician al-Khwarizmi (花拉子米); meaning: algorithm (算法).
    - Mnemonic: link the word to al-Khwarizmi and his systematic calculation methods.
    - Similar words: algebra (代数).
    - Pronunciation:
      - Syllables: al + go + rithm /ˈælɡərɪðəm/, stress on the first syllable.
      - Rule: al → /æl/; "a" is short /æ/.
      - Rule: go → /ɡə/; unstressed "o" is the weak vowel /ə/.
      - Rule: rithm → /rɪðəm/.

In cases where there are more extreme directions of curvature, the value of α would have to be sufficiently large to offset the negative eigenvalues.
- Sentence analysis: a complex sentence with a relative clause. "where there are more extreme directions of curvature" modifies "cases"; the main clause is "the value of α would have to be sufficiently large to offset the negative eigenvalues".
- Chinese translation: 在存在更极端曲率方向的情况下，α 的值必须足够大以抵消负特征值。
- Word analysis:
  - curvature: noun, from "curve" + "-ature" (noun suffix); meaning: curvature (曲率).
    - Mnemonic: "curve" plus "-ature" names the property of being curved.
    - Similar words: curve (曲线).
    - Pronunciation:
      - Syllables: cur + va + ture /ˈkɜːvətʃə(r)/, stress on the first syllable.
      - Rule: cur → /kɜː/; "ur" is the long vowel /ɜː/.
      - Rule: va → /və/; unstressed "a" is the weak vowel /ə/.
      - Rule: ture → /tʃə(r)/.
  - offset: verb and noun, from "off" + "set"; meaning: to offset; to compensate for (抵消；补偿).
    - Mnemonic: "off" (away) + "set" (put): putting back what was taken away cancels it out.
    - Similar words: set off (出发, a phrasal verb).
    - Pronunciation:
      - Syllables: off + set /ˈɒfset/, stress on the first syllable.
      - Rule: off → /ɒf/; "o" is short /ɒ/.
      - Rule: set → /set/.

However, as α increases in size, the Hessian becomes dominated by the α I diagonal and the direction chosen by Newton’s method converges to the standard gradient divided by α.
- Sentence analysis: a complex sentence. "as α increases in size" is an adverbial clause of time; the main part joins two clauses with "and": "the Hessian becomes dominated by the αI diagonal" and "the direction chosen by Newton's method converges to the standard gradient divided by α".
- Chinese translation: 然而，随着 α 的值增大，海森矩阵由 αI 对角线主导，并且牛顿法选择的方向收敛到标准梯度除以 α。
- Word analysis:
  - dominated: past participle of "dominate"; meaning: dominated (被支配；占主导地位).
    - Mnemonic: "dominate" (支配) + "-ed" (passive): being controlled.
    - Similar words: dominate (支配).
    - Pronunciation:
      - Syllables: dom + i + na + ted /ˈdɒmɪneɪtɪd/, stress on the first syllable.
      - Rule: dom → /dɒm/; "o" is short /ɒ/.
      - Rule: i → /ɪ/.
      - Rule: na → /neɪ/.
      - Rule: ted → /tɪd/.
  - converges: verb, third person singular, from Latin "con-" (together) + "vergere" (to incline); meaning: converges (收敛).
    - Mnemonic: "con-" (together) + "verge": heading together toward one point.
    - Similar words: diverge (发散).
    - Pronunciation:
      - Syllables: con + ver + ges /kənˈvɜːdʒɪz/, stress on the second syllable.
      - Rule: con → /kən/.
      - Rule: ver → /vɜː/; "er" is the long vowel /ɜː/.
      - Rule: ges → /dʒɪz/.
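
The limiting behaviour described above can be checked numerically. In this sketch the matrix `H` and vector `g` are arbitrary hypothetical values; it compares the regularized Newton direction $(H + \alpha I)^{-1} g$ with the scaled gradient $g / \alpha$ as $\alpha$ grows.

```python
import numpy as np

# Hypothetical indefinite Hessian and gradient
H = np.array([[2.0, 1.0], [1.0, -1.0]])
g = np.array([1.0, 2.0])

def relative_gap(alpha):
    """Relative distance between (H + alpha*I)^{-1} g and g / alpha."""
    newton_dir = np.linalg.solve(H + alpha * np.eye(2), g)
    scaled_grad = g / alpha
    return np.linalg.norm(newton_dir - scaled_grad) / np.linalg.norm(scaled_grad)

gaps = [relative_gap(alpha) for alpha in (10.0, 100.0, 1000.0)]
print(gaps)   # shrinks roughly in proportion to 1 / alpha
```

As the gap shrinks, the curvature information in `H` stops influencing the step, and the method behaves like gradient descent with learning rate $1/\alpha$.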

When strong negative curvature is present, α may need to be so large that Newton’s method would make smaller steps than gradient descent with a properly chosen learning rate.
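- Fixed expression: "so ... that ..." (如此……以至于……).
- Sentence analysis: a complex sentence. "When strong negative curvature is present" is an adverbial clause of time, and in the main clause "so ... that ..." introduces a clause of result.
- Chinese translation: 当存在强烈的负曲率时，α 可能需要如此之大，以至于牛顿法的步长会比使用适当选择的学习率的梯度下降法更小。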

Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes.
- Fixed expressions: "beyond", here meaning "in addition to" (除了); "such as" (例如).
- Sentence analysis: "Beyond the challenges created by certain features of the objective function" is an adverbial, "the application of Newton's method for training large neural networks" is the subject, "is limited" is the passive predicate, and "by the significant computational burden it imposes" names the agent.
- Chinese translation: 除了目标函数的某些特征（如鞍点）所带来的挑战之外，牛顿法在训练大型神经网络方面的应用受到其带来的巨大计算负担的限制。
- Word analysis:
  - computational: adjective, from "compute" + "-ation" + "-al"; meaning: computational (计算的).
    - Mnemonic: "compute" plus suffixes gives the adjective for "related to computing".
    - Similar words: compute (计算).
    - Pronunciation:
      - Syllables: com + pu + ta + tion + al /ˌkɒmpjuˈteɪʃənl/, stress on the third syllable.
      - Rule: com → /kɒm/; "o" is short /ɒ/.
      - Rule: pu → /pju/.
      - Rule: ta → /teɪ/.
      - Rule: tion → /ʃən/.
      - Rule: al → /l/.
  - burden: noun and verb, from Old English "byrðen"; meaning: burden; to burden (负担；使负担).
    - Mnemonic: read "bur" as "bear" (to carry): what you have to bear is a burden.
    - Similar words: burger (汉堡包).
    - Pronunciation:
      - Syllables: bur + den /ˈbɜːdn/, stress on the first syllable.
      - Rule: bur → /bɜː/; "ur" is the long vowel /ɜː/.
      - Rule: den → /dn/.

The number of elements in the Hessian is squared in the number of parameters, so with $k$ parameters (and for even very small neural networks the number of parameters $k$ can be in the millions), Newton’s method would require the inversion of a $k \times k$ matrix, with computational complexity of $O(k^3)$.
- Fixed expressions: "be squared in" means "grow as the square of" (在……方面呈平方关系); "require the inversion of ..." means "need ... to be inverted" (要求对……进行求逆).
- Sentence analysis: a compound sentence in which "so" marks cause and effect. The first clause states that the number of Hessian elements is the square of the number of parameters; the second states that Newton's method must therefore invert a $k \times k$ matrix, and gives the cost.
- Chinese translation: 海森矩阵中元素的数量与参数数量呈平方关系，所以对于 $k$ 个参数（即使是非常小的神经网络，参数数量 $k$ 也可能达到数百万），牛顿法需要对一个 $k \times k$ 的矩阵进行求逆，其计算复杂度为 $O(k^3)$。
- Word analysis:
  - Hessian: noun, named after the German mathematician Ludwig Otto Hesse; meaning: Hessian matrix (海森矩阵).
    - Mnemonic: link it to the mathematician's name.
    - Similar words: hessite (碲银矿).
    - Pronunciation:
      - Syllables: Hes + si + an /ˈhesiən/, stress on the first syllable.
      - Rule: Hes → /hes/; "H" is /h/, "e" is short /e/, "s" is /s/.
      - Rule: si → /si/.
      - Rule: an → /ən/; "a" is the weak vowel /ə/, "n" is nasal /n/.
  - inversion: noun, from Latin "inversio", derived from "invertere" (to turn over); meaning: inversion; computing an inverse (反转；求逆).
    - Mnemonic: "in-" (opposite) + "vers" (turn) + "-ion" (noun suffix): turning something over, hence inverting.
    - Similar words: invert (使反转), conversation (对话).
    - Pronunciation:
      - Syllables: in + ver + sion /ɪnˈvɜːʃn/, stress on the second syllable.
      - Rule: in → /ɪn/; "i" is short /ɪ/, "n" is nasal /n/.
      - Rule: ver → /vɜː/; "er" is the long vowel /ɜː/.
      - Rule: sion → /ʃn/.
  - computational: adjective, from "compute" + "-ation" + "-al"; meaning: computational (计算的).
    - Mnemonic: "compute" plus suffixes means "related to computing".
    - Similar words: computer (计算机), computation (计算).
    - Pronunciation:
      - Syllables: com + pu + ta + tion + al /ˌkɒmpjuˈteɪʃənl/, stress on the third syllable.
      - Rule: com → /kɒm/; "c" is /k/, "o" is short /ɒ/.
      - Rule: pu → /pju/.
      - Rule: ta → /teɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃən/.
      - Rule: al → /l/.
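
A quick back-of-the-envelope calculation shows how fast these quantities grow. The function name `newton_cost` and the 4 bytes per entry (32-bit floats) are assumptions for illustration, and the operation count is an order-of-magnitude figure from $O(k^3)$ with the constant factor ignored.

```python
def newton_cost(k, bytes_per_entry=4):
    """Rough storage and inversion cost for a dense k x k Hessian.

    bytes_per_entry=4 assumes 32-bit floats (an assumption, not from the text).
    """
    entries = k * k                       # number of Hessian elements
    memory_bytes = entries * bytes_per_entry
    inversion_ops = k ** 3                # O(k^3), constant factor ignored
    return entries, memory_bytes, inversion_ops

entries, memory_bytes, inversion_ops = newton_cost(10**6)
print(entries, memory_bytes / 1e12, inversion_ops)
```

For one million parameters this gives $10^{12}$ Hessian entries, about 4 TB of storage under the stated assumption, and on the order of $10^{18}$ operations for a single inversion.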

Also, since the parameters will change with every update, the inverse Hessian has to be computed at every training iteration.
- Fixed expressions: "change with" (随……而变化); "have to be done" (不得不被做).
- Sentence analysis: a complex sentence. "since" introduces a clause of reason (the parameters change with every update), and the main clause states that the inverse Hessian must therefore be computed at every training iteration.
- Chinese translation: 此外，由于参数每次更新都会发生变化，所以在每次训练迭代时都必须计算逆海森矩阵。
- Word analysis:
  - inverse: adjective and noun, from Latin "inversus", past participle of "invertere" (to turn over); meaning: opposite; inverse (相反的；逆的；倒数).
    - Mnemonic: "in-" (opposite) + "vers" (turn): turned the other way.
    - Similar words: invert (使反转), diverse (多样的).
    - Pronunciation:
      - Syllables: in + verse /ˈɪnvɜːs/, stress on the first syllable.
      - Rule: in → /ɪn/; "i" is short /ɪ/, "n" is nasal /n/.
      - Rule: verse → /vɜːs/; "er" is the long vowel /ɜː/, "s" is /s/, final "e" is silent.
  - iteration: noun, from Latin "iterare" (to repeat); meaning: iteration; repetition (迭代；反复).
    - Mnemonic: "iter" (repeat) + "-ation" (noun suffix): a repeated process.
    - Similar words: iterate (迭代), radiation (辐射).
    - Pronunciation:
      - Syllables: it + e + ra + tion /ˌɪtəˈreɪʃn/, primary stress on the third syllable ("ra").
      - Rule: it → /ɪt/; "i" is short /ɪ/, "t" is /t/.
      - Rule: e → /ə/; unstressed "e" is the weak vowel /ə/.
      - Rule: ra → /reɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/.

As a consequence, only networks with a very small number of parameters can be practically trained via Newton’s method.
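- Fixed expressions: "as a consequence" (因此；结果); "a very small number of" means "very few" (极少数的), so here "number" does not carry the "many" sense of "a number of"; "via" (通过；经由).
- Sentence analysis: a simple sentence stating the result of the preceding reasons: only networks with very few parameters can be practically trained with Newton's method.
- Chinese translation: 因此，只有参数数量非常少的网络才能通过牛顿法进行实际训练。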

In the remainder of this section, we will discuss alternatives that attempt to gain some of the advantages of Newton’s method while side-stepping the computational hurdles.
- Fixed expressions: "in the remainder of" (在……的剩余部分); "attempt to do sth." (试图做某事); "side-step" (避开；回避).
- Sentence analysis: a complex sentence. "that" introduces a relative clause modifying "alternatives": these alternatives try to gain some advantages of Newton's method while avoiding its computational obstacles.
- Chinese translation: 在本节的剩余部分，我们将讨论一些替代方法，这些方法试图在避开计算障碍的同时，获得牛顿法的一些优点。
- Word analysis:
  - remainder: noun, from Old French "remaindre" (to remain); meaning: the remaining part; remainder (剩余部分；余数).
    - Mnemonic: think of "remain" (留下) + "-der": the part that remains.
    - Similar words: remain (留下).
    - Pronunciation:
      - Syllables: re + main + der /rɪˈmeɪndə(r)/, stress on the second syllable.
      - Rule: re → /rɪ/; "e" is short /ɪ/.
      - Rule: main → /meɪn/; "ai" is the long vowel /eɪ/.
      - Rule: der → /də(r)/; "e" is the weak vowel /ə/.
  - alternatives: plural noun, from Latin "alternus" (alternating); meaning: alternatives; options (替代方案；选择).
    - Mnemonic: think of "alter" (改变): something that can change the current situation is an alternative.
    - Similar words: alternative (替代的), alter (改变).
    - Pronunciation:
      - Syllables: al + ter + na + tives /ɔːlˈtɜːnətɪvz/, stress on the second syllable.
      - Rule: al → /ɔːl/; "a" is the long vowel /ɔː/.
      - Rule: ter → /tɜː/; "er" is the long vowel /ɜː/.
      - Rule: na → /nə/; "a" is the weak vowel /ə/.
      - Rule: tives → /tɪvz/; "i" is short /ɪ/, "es" is /z/.
  - hurdles: plural noun, from Old English "hyrdel" (a movable frame or barrier); meaning: obstacles; difficulties (障碍；难关).
    - Mnemonic: picture the hurdles in a track race.
    - Similar words: hurdle (跨栏；障碍), curdle (使凝结).
    - Pronunciation:
      - Syllables: hur + dles /ˈhɜːdlz/, stress on the first syllable.
      - Rule: hur → /hɜː/; "ur" is the long vowel /ɜː/.
      - Rule: dles → /dlz/; "e" is silent, "s" is /z/.

8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions.
- Fixed expressions: "conjugate gradients" (共轭梯度); "avoid the calculation of" (避免对……的计算); "by doing sth." (通过做某事).
- Sentence analysis: a simple sentence introducing conjugate gradients as a method that avoids computing the inverse Hessian by descending along conjugate directions iteratively.
- Chinese translation: 8.6.2 共轭梯度：共轭梯度是一种通过迭代地沿共轭方向下降来有效避免计算逆海森矩阵的方法。
- Word analysis:
  - conjugate: adjective and verb, from Latin "conjugare" (to join together); meaning: conjugate; to conjugate (共轭的；使共轭).
    - Mnemonic: "con-" (together) + "jug" (join) + "-ate": joined together.
    - Similar words: jugate (成对的), conjunction (连接词；结合).
    - Pronunciation:
      - Syllables: con + ju + gate, stress on the first syllable; the adjective (as in "conjugate gradients") is /ˈkɒndʒəɡət/, the verb is /ˈkɒndʒəɡeɪt/.
      - Rule: con → /kɒn/; "c" is /k/, "o" is short /ɒ/.
      - Rule: ju → /dʒə/; "j" is /dʒ/, unstressed "u" is the weak vowel /ə/.
      - Rule: gate → /ɡət/ in the adjective, /ɡeɪt/ in the verb.
  - iteratively: adverb, from "iterative" + "-ly"; meaning: iteratively (迭代地).
    - Mnemonic: think of "iterate" (迭代); the adverb ending means "in an iterative way".
    - Similar words: iterate (迭代), formatively (形成地).
    - Pronunciation:
      - Syllables: it + er + a + tive + ly /ˈɪtərətɪvli/, stress on the first syllable.
      - Rule: it → /ɪt/; "i" is short /ɪ/, "t" is /t/.
      - Rule: er → /ər/; "e" is the weak vowel /ə/.
      - Rule: a → /ə/; unstressed "a" is the weak vowel /ə/.
      - Rule: tive + ly → /tɪvli/; "i" is short /ɪ/, "ly" is /li/.

The inspiration for this approach follows from a careful study of the weakness of the method of steepest descent (see Sec. 4.3 for details), where line searches are applied iteratively in the direction associated with the gradient.
- Fixed expressions: "follow from" (由……得出；是……的结果); "steepest descent" (最速下降法); "be associated with" (与……相关联).
- Sentence analysis: a complex sentence. "where" introduces a relative clause modifying "the method of steepest descent": in that method, line searches are applied iteratively along the direction given by the gradient.
- Chinese translation: 这种方法的灵感来自对最速下降法弱点的仔细研究（详情见4.3节），在最速下降法中，线搜索是沿着与梯度相关的方向迭代进行的。
- Word analysis:
  - inspiration: noun, from Latin "inspirare" (to breathe in; to inspire); meaning: inspiration (灵感；启发).
    - Mnemonic: "in-" (into) + "spir" (breathe) + "-ation": breathing in, hence receiving an idea.
    - Similar words: inspire (激发；鼓舞), aspiration (抱负；渴望).
    - Pronunciation:
      - Syllables: in + spi + ra + tion /ˌɪnspəˈreɪʃn/, primary stress on the third syllable ("ra").
      - Rule: in → /ɪn/; "i" is short /ɪ/.
      - Rule: spi → /spə/; unstressed "i" is the weak vowel /ə/.
      - Rule: ra → /reɪ/; "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/.
  - steepest: superlative adjective, "steep" + "-est"; meaning: steepest (最陡峭的).
    - Mnemonic: add the superlative ending to "steep".
    - Similar words: steep (陡峭的).
    - Pronunciation:
      - Syllables: steep + est /ˈstiːpɪst/, stress on the first syllable.
      - Rule: steep → /stiːp/; "ee" is the long vowel /iː/.
      - Rule: est → /ɪst/; "e" is short /ɪ/.
  - gradient: noun, from Latin "gradus" (step; degree); meaning: gradient; slope (梯度；坡度).
    - Mnemonic: "grad" (step): a quantity that changes step by step.
    - Similar words: gradual (逐渐的), gratitude (感激).
    - Pronunciation:
      - Syllables: gra + di + ent /ˈɡreɪdiənt/, stress on the first syllable.
      - Rule: gra → /ɡreɪ/; "a" is the long vowel /eɪ/.
      - Rule: di → /di/.
      - Rule: ent → /ənt/; "e" is the weak vowel /ə/.

Fig. 8.6 illustrates how the method of steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective back-and-forth, zig-zag pattern.
- Fixed expressions: "back-and-forth" (来回地；反复地); "zig-zag" (之字形；曲折地).
- Sentence analysis: a complex sentence. "how" introduces the object clause describing what Fig. 8.6 shows; "when applied in a quadratic bowl" is a reduced adverbial clause of time.
- Chinese translation: 图8.6展示了最速下降法在二次碗状区域应用时，是如何以一种相当低效的来回曲折模式进行的。
- Word analysis:
  - illustrates: verb, third person singular, from Latin "illustrare" (to light up; to make clear); meaning: illustrates; explains (说明；阐明；举例说明).
    - Mnemonic: "il-" (intensifier) + "lustr" (light) + "-ate": to light something up, hence to make it clear.
    - Similar words: illustrate (说明), lustrous (有光泽的).
    - Pronunciation:
      - Syllables: il + lus + trates /ˈɪləstreɪts/, stress on the first syllable.
      - Rule: il → /ɪl/; "i" is short /ɪ/.
      - Rule: lus → /ləs/; unstressed "u" is the weak vowel /ə/.
      - Rule: trates → /treɪts/; "a" is the long vowel /eɪ/, "tes" is /ts/.
  - quadratic: adjective, from Latin "quadratus" (squared; square); meaning: quadratic; of the second degree (二次的；平方的).
    - Mnemonic: "quad" (four, square): related to squaring.
    - Similar words: quadrangle (四边形), quadruple (四倍的).
    - Pronunciation:
      - Syllables: qua + dra + tic /kwɒˈdrætɪk/, stress on the second syllable.
      - Rule: qua → /kwɒ/; "qu" is /kw/, "a" is short /ɒ/.
      - Rule: dra → /dræ/; stressed "a" is short /æ/.
      - Rule: tic → /tɪk/; "i" is short /ɪ/, "c" is /k/.

This happens because each line search direction, when given by the gradient, is guaranteed to be orthogonal to the previous line search direction.
- Fixed expressions: "be guaranteed to do sth." (肯定会做某事；保证会做某事); "be orthogonal to" (与……正交).
- Sentence analysis: a complex sentence. "because" introduces a clause of reason explaining the inefficient pattern of steepest descent: each line search direction is orthogonal to the previous one. "when given by the gradient" is a reduced adverbial clause.
- Chinese translation: 这是因为每次线搜索方向（由梯度给出时）都保证与前一个线搜索方向正交。
- Word analysis:
  - guaranteed: past participle (also used as an adjective), from Old French "garantir" (to guarantee); meaning: guaranteed; assured (有保证的；确保).
    - Mnemonic: "guar" sounds like 瓜儿: picture a melon seller guaranteeing the melons are sweet.
    - Similar words: guarantee (保证), guardian (监护人).
    - Pronunciation:
      - Syllables: guar + an + teed /ˌɡærənˈtiːd/, primary stress on the last syllable.
      - Rule: guar → /ɡær/; "g" is /ɡ/, "ua" is short /æ/.
      - Rule: an → /ən/; "a" is the weak vowel /ə/.
      - Rule: teed → /tiːd/; "ee" is the long vowel /iː/.
  - orthogonal: adjective, from Greek "orthogonios" (right-angled); meaning: orthogonal; perpendicular (正交的；垂直的).
    - Mnemonic: "ortho-" (straight, right) + "gon" (angle) + "-al": right-angled.
    - Similar words: orthodoxy (正统观念), diagonal (对角线的).
    - Pronunciation:
      - Syllables: or + thog + o + nal /ɔːˈθɒɡənl/, stress on the second syllable.
      - Rule: or → /ɔː/; "or" is the long vowel /ɔː/.
      - Rule: thog → /θɒɡ/; "th" is /θ/, "o" is short /ɒ/.
      - Rule: o → /ə/; unstressed "o" is the weak vowel /ə/.
      - Rule: nal → /nl/.
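
The orthogonality of successive line search directions can be verified on a small quadratic bowl. The sketch below is an illustration under stated assumptions: the matrix `A`, the starting point, and the function name `steepest_descent` are hypothetical. For $f(x) = \frac{1}{2} x^\top A x - b^\top x$ the exact line search step along $-g$ is $(g^\top g) / (g^\top A g)$.

```python
import numpy as np

# Hypothetical quadratic bowl f(x) = 0.5 * x^T A x - b^T x with condition number 10
A = np.array([[1.0, 0.0], [0.0, 10.0]])
b = np.zeros(2)

def steepest_descent(x0, n_steps):
    """Steepest descent with exact line search; returns final x and all directions."""
    x = np.asarray(x0, dtype=float)
    directions = []
    for _ in range(n_steps):
        g = A @ x - b                      # gradient at the current point
        d = -g                             # search direction
        step = (g @ g) / (g @ A @ g)       # exact minimizer along d for a quadratic
        x = x + step * d
        directions.append(d)
    return x, directions

x_final, dirs = steepest_descent([10.0, 1.0], n_steps=10)

# Cosine of the angle between each pair of consecutive search directions
cosines = [
    (d0 @ d1) / (np.linalg.norm(d0) * np.linalg.norm(d1))
    for d0, d1 in zip(dirs[:-1], dirs[1:])
]
print(np.max(np.abs(cosines)))   # zero up to rounding: each turn is 90 degrees
print(x_final)                   # still away from the minimum at the origin
```

Every consecutive pair of directions meets at a right angle, which is the zig-zag pattern described above: after ten line searches the iterate has still not reached the minimum of this two-dimensional bowl.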