<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file. Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other -->
<!-- code is quite ugly! Later versions should be better. -->
<head>
<meta charset="utf-8">
<meta name="citation_title" content="Neural Networks and Deep Learning">
<meta name="citation_author" content="Nielsen, Michael A.">
<meta name="citation_publication_date" content="2015">
<meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
<meta name="citation_publisher" content="Determination Press">
<link rel="icon" href="nnadl_favicon.ICO" />
<title>Neural networks and deep learning</title>
<script src="assets/jquery.min.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$']]}, "HTML-CSS": {scale: 92}, TeX: { equationNumbers: { autoNumber: "AMS" }}});
</script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<link href="assets/style.css" rel="stylesheet">
<link href="assets/pygments.css" rel="stylesheet">
<link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">
<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */
@font-face {
font-family: 'MJX_Math';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}
@font-face {
font-family: 'MJX_Main';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>
</head>
<body>
<div class="nonumber_header">
<h2><a href="index.html">Neural networks and deep learning</a></h2>
</div>
<div class="section">
<div id="toc">
<p class="toc_title">
<a href="index.html">Neural networks and deep learning</a>
</p>
<p class="toc_not_mainchapter">
<a href="about.html">What this book is about</a>
</p>
<p class="toc_not_mainchapter">
<a href="exercises_and_problems.html">On the exercises and problems</a>
</p>
<p class='toc_mainchapter'>
<a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a>
<a href="chap1.html">Using neural nets to recognize handwritten digits</a>
<div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap1.html#perceptrons">
<li>Perceptrons</li>
</a>
<a href="chap1.html#sigmoid_neurons">
<li>Sigmoid neurons</li>
</a>
<a href="chap1.html#the_architecture_of_neural_networks">
<li>The architecture of neural networks</li>
</a>
<a href="chap1.html#a_simple_network_to_classify_handwritten_digits">
<li>A simple network to classify handwritten digits</li>
</a>
<a href="chap1.html#learning_with_gradient_descent">
<li>Learning with gradient descent</li>
</a>
<a href="chap1.html#implementing_our_network_to_classify_digits">
<li>Implementing our network to classify digits</li>
</a>
<a href="chap1.html#toward_deep_learning">
<li>Toward deep learning</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
};
$('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a>
<a href="chap2.html">How the backpropagation algorithm works</a>
<div id="toc_how_the_backpropagation_algorithm_works" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output
_from_a_neural_network">
<li>Warm up: a fast matrix-based approach to computing the output from a neural network</li>
</a>
<a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function">
<li>The two assumptions we need about the cost function</li>
</a>
<a href="chap2.html#the_hadamard_product_$s_\odot_t$">
<li>The Hadamard product, $s \odot t$</li>
</a>
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">
<li>The four fundamental equations behind backpropagation</li>
</a>
<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">
<li>Proof of the four fundamental equations (optional)</li>
</a>
<a href="chap2.html#the_backpropagation_algorithm">
<li>The backpropagation algorithm</li>
</a>
<a href="chap2.html#the_code_for_backpropagation">
<li>The code for backpropagation</li>
</a>
<a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm">
<li>In what sense is backpropagation a fast algorithm?</li>
</a>
<a href="chap2.html#backpropagation_the_big_picture">
<li>Backpropagation: the big picture</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
};
$('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a>
<a href="chap3.html">Improving the way neural networks learn</a>
<div id="toc_improving_the_way_neural_networks_learn" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap3.html#the_cross-entropy_cost_function">
<li>The cross-entropy cost function</li>
</a>
<a href="chap3.html#overfitting_and_regularization">
<li>Overfitting and regularization</li>
</a>
<a href="chap3.html#weight_initialization">
<li>Weight initialization</li>
</a>
<a href="chap3.html#handwriting_recognition_revisited_the_code">
<li>Handwriting recognition revisited: the code</li>
</a>
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">
<li>How to choose a neural network's hyper-parameters?</li>
</a>
<a href="chap3.html#other_techniques">
<li>Other techniques</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
};
$('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a>
<a href="chap4.html">A visual proof that neural nets can compute any function</a>
<div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap4.html#two_caveats">
<li>Two caveats</li>
</a>
<a href="chap4.html#universality_with_one_input_and_one_output">
<li>Universality with one input and one output</li>
</a>
<a href="chap4.html#many_input_variables">
<li>Many input variables</li>
</a>
<a href="chap4.html#extension_beyond_sigmoid_neurons">
<li>Extension beyond sigmoid neurons</li>
</a>
<a href="chap4.html#fixing_up_the_step_functions">
<li>Fixing up the step functions</li>
</a>
<a href="chap4.html#conclusion">
<li>Conclusion</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
};
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a>
<a href="chap5.html">Why are deep neural networks hard to train?</a>
<div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap5.html#the_vanishing_gradient_problem">
<li>The vanishing gradient problem</li>
</a>
<a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">
<li>What's causing the vanishing gradient problem? Unstable gradients in deep neural nets</li>
</a>
<a href="chap5.html#unstable_gradients_in_more_complex_networks">
<li>Unstable gradients in more complex networks</li>
</a>
<a href="chap5.html#other_obstacles_to_deep_learning">
<li>Other obstacles to deep learning</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
};
$('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a>
<a href="chap6.html">Deep learning</a>
<div id="toc_deep_learning" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap6.html#introducing_convolutional_networks">
<li>Introducing convolutional networks</li>
</a>
<a href="chap6.html#convolutional_neural_networks_in_practice">
<li>Convolutional neural networks in practice</li>
</a>
<a href="chap6.html#the_code_for_our_convolutional_networks">
<li>The code for our convolutional networks</li>
</a>
<a href="chap6.html#recent_progress_in_image_recognition">
<li>Recent progress in image recognition</li>
</a>
<a href="chap6.html#other_approaches_to_deep_neural_nets">
<li>Other approaches to deep neural nets</li>
</a>
<a href="chap6.html#on_the_future_of_neural_networks">
<li>On the future of neural networks</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_deep_learning_reveal').click(function() {
var src = $('#toc_img_deep_learning').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_deep_learning").attr('src', 'images/arrow.png');
};
$('#toc_deep_learning').toggle('fast', function() {});
});
</script>
<p class="toc_not_mainchapter">
<a href="sai.html">Appendix: Is there a <em>simple</em> algorithm for intelligence?</a>
</p>
<p class="toc_not_mainchapter">
<a href="acknowledgements.html">Acknowledgements</a>
</p>
<p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a>
</p>
<hr>
<span class="sidebar_title">Sponsors</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>
<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>
<p class="sidebar">Thanks to all the <a href="supporters.html">supporters</a> who helped make the book a reality, with special thanks to Pavel Dudrenov. Thanks also to everyone who contributed to the
<a href="bugfinder.html">Bugfinder Hall of Fame</a>. </p>
<hr>
<span class="sidebar_title">Resources</span>
<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Michael Nielsen on Twitter</a></p>
<p class="sidebar"><a href="faq.html">Frequently asked questions about the book</a></p>
<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>
<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Michael Nielsen's project announcement mailing list</a>
</p>
<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Deep Learning</a>, book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville</p>
<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>
<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>
<p class="sidebar">
<a href="http://michaelnielsen.org">Michael Nielsen</a>, January 2017
</p>
</div>
<p>Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out
<CODE>AND</CODE> gates,
<CODE>OR</CODE> gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:</p>
<p>
<center><img src="images/shallow_circuit.png" width="500px"></center>
</p>
<p>You're dumbfounded, and tell your boss: "The customer is crazy!"</p>
<p>Your boss replies: "I think they're crazy, too. But what the customer wants, they get."</p>
<p>In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you
<CODE>AND</CODE> together as many inputs as you want. And you're also allowed a many-input
<CODE>NAND</CODE> gate, that is, a gate which can
<CODE>AND</CODE> multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.</p>
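<p>To make the two-layer claim concrete, here is a minimal Python sketch. It is one possible construction, not one taken from the text. It assumes each input is available both directly and in negated form, and it uses only many-input <CODE>NAND</CODE> gates: one first-layer gate for each input pattern on which the function is 1, and a single second-layer gate that combines them. The names <tt>nand</tt> and <tt>two_layer_circuit</tt> are illustrative.</p>

```python
from itertools import product


def nand(bits):
    """Many-input NAND: the negation of the AND of all inputs."""
    return not all(bits)


def two_layer_circuit(truth_table, n):
    """Build a two-layer NAND-of-NANDs circuit for an n-input Boolean function.

    truth_table maps every tuple of n bits (0 or 1) to 0 or 1.
    Assumption: each input is available both directly and negated.
    Returns the circuit (a function of a bit tuple) and its gate count.
    """
    # One first-layer gate per input pattern on which the function is 1.
    patterns = [x for x in product((0, 1), repeat=n) if truth_table[x]]

    def circuit(x):
        # Layer 1: the gate for pattern m outputs 0 only when x equals m,
        # because only then are all of its chosen literals equal to 1.
        layer1 = [nand([xi if mi else 1 - xi for xi, mi in zip(x, m)])
                  for m in patterns]
        # Layer 2: a single NAND outputs 1 when some layer-1 gate output 0.
        return int(nand(layer1))

    return circuit, len(patterns) + 1


# Example: the parity of three bits.
parity_table = {x: sum(x) % 2 for x in product((0, 1), repeat=3)}
circuit, gate_count = two_layer_circuit(parity_table, 3)
```

<p>For 3-bit parity this sketch needs four first-layer gates plus the output gate. In general the number of first-layer gates equals the number of input patterns on which the function is 1, and that number can grow exponentially with the number of inputs.</p>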
<p>But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate
the solutions. In other words, we build up to a solution through multiple layers of abstraction. </p>
<p>For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out
of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:</p>
<p>
<center><img src="images/circuit_multiplication.png" width="500px"></center>
</p>
<p>That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the sub-tasks down into smaller units than I've described. But you get the general idea.</p>
<p>So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than
do deep circuits. For instance, a famous series of papers in the early 1980s*
<span class="marginnote">
*The history is somewhat complex, so I won't give
detailed references. See Johan Håstad's 2012 paper
<a href="http://eccc.hpi-web.de/report/2012/137/">On the correlation of
parity and small-depth circuits</a> for an account of the early
history and references.</span> showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit:
you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.</p>
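<p>The pairwise scheme can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: a two-input <CODE>XOR</CODE> gate (the parity of a pair of bits) is counted as a single circuit element, and the input list is non-empty. The function name <tt>tree_parity</tt> is hypothetical.</p>

```python
def tree_parity(bits):
    """Parity of a non-empty list of bits, computed layer by layer.

    Each layer XORs neighbouring pairs from the previous layer.
    Returns a tuple (parity, depth, gate_count).
    """
    layer = list(bits)
    depth = 0
    gates = 0
    while len(layer) > 1:
        # XOR each neighbouring pair; one gate per pair.
        next_layer = [layer[i] ^ layer[i + 1]
                      for i in range(0, len(layer) - 1, 2)]
        gates += len(next_layer)
        if len(layer) % 2:
            # An unpaired last bit passes through to the next layer unchanged.
            next_layer.append(layer[-1])
        layer = next_layer
        depth += 1
    return layer[0], depth, gates


parity, depth, gates = tree_parity([1, 0, 1, 1, 0, 1, 0, 0])
```

<p>For $n$ input bits this arrangement uses $n-1$ gates, arranged in roughly $\log_2 n$ layers. The eight-bit example above uses seven gates in three layers.</p>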
<p>Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):</p>
<p>
<center>
<img src="images/tikz35.png" />
</center>
</p>
<p>These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be
more powerful:</p>
<p>
<center>
<img src="images/tikz36.png" />
</center>
</p>
<p>Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges,
the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely
to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow
networks*<span class="marginnote">
*For certain problems and network
architectures this is proved in
<a href="http://arxiv.org/pdf/1312.6098.pdf">On the number of response
regions of deep feed forward networks with piece-wise linear
activations</a>, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio
(2014). See also the more informal discussion in section 2 of
<a href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf">Learning
deep architectures for AI</a>, by Yoshua Bengio (2009).</span>.</p>
<p>How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm -
<a href="chap1.html#learning_with_gradient_descent">stochastic
gradient descent</a> by <a href="chap2.html">backpropagation</a>. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.</p>
<p>That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers
in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad
luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.</p>
<p>As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated with learning
by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.
</p>
<p>This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep
learning to attack image recognition problems.</p>
<p>
<h3><a name="the_vanishing_gradient_problem"></a><a href="#the_vanishing_gradient_problem">The vanishing gradient problem</a></h3></p>
<p>So, what goes wrong when we try to train a deep network?</p>
<p>To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*
<span class="marginnote">
*I introduced the MNIST problem and data
<a href="chap1.html#learning_with_gradient_descent">here</a> and
<a href="chap1.html#implementing_our_network_to_classify_digits">here</a>.</span>.</p>
<p>If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, NumPy, and a copy of the code, which you can get by cloning the relevant
repository from the command line:
<div class="highlight"><pre><span></span>git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
</pre></div>
If you don't use <tt>git</tt> then you can download the data and code
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip">here</a>. You'll need to change into the <tt>src</tt> subdirectory.</p>
<p>Then, from a Python shell we load the MNIST data:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">>>></span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
</pre></div>
</p>
<p>We set up our network:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
</pre></div>
</p>
<p>This network has 784 neurons in the input layer, corresponding to the $28 \times 28 = 784$ pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0',
'1', '2', $\ldots$, '9').
</p>
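<p>As a rough sense of scale, the following sketch counts the adjustable parameters in such a network. It assumes a fully connected feedforward network with one weight per connection between adjacent layers and one bias for every neuron outside the input layer. <tt>count_parameters</tt> is an illustrative helper, not part of <tt>network2</tt>.</p>

```python
def count_parameters(sizes):
    """Count weights and biases in a fully connected feedforward network.

    sizes lists the number of neurons in each layer, input layer first.
    Returns a tuple (weights, biases).
    """
    # Every neuron in a layer connects to every neuron in the next layer.
    weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))
    # Assumption: input neurons carry no bias; all other neurons have one.
    biases = sum(sizes[1:])
    return weights, biases


weights, biases = count_parameters([784, 30, 10])
```

<p>For <tt>[784, 30, 10]</tt> this gives 23,820 weights and 40 biases. Almost all of the weights sit between the input layer and the hidden layer.</p>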
<p>Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate $\eta = 0.1$, and regularization parameter $\lambda = 5.0$. As we train we'll monitor the classification accuracy on the
<tt>validation_data</tt>*<span class="marginnote">
*Note that the network is likely to
take some minutes to train, depending on the speed of your machine.
So if you're running the code you may wish to continue reading and
return later, not wait for the code to finish executing.</span>:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p>
<p>We get a classification accuracy of 96.48 percent (or thereabouts - it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.</p>
<p>Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p>
<p>This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p>
<p>That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p>
<p>The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.</p>
<p>This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can,
in the worst case, simply do nothing*<span class="marginnote">
*See <a href="#identity_neuron">this later
problem</a> to understand how to build a hidden layer that does
nothing.</span>. But that's not what's going on.</p>
<p>So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm,
and how to do better.
</p>
<p>To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a $[784, 30, 30, 10]$ network, i.e., a network with two hidden layers, each containing $30$ hidden neurons. Each neuron in the diagram
has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely,
the bars denote the gradient $\partial C / \partial b$ for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter
2</a> we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind
is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.</p>
<p>To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons,
and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*<span class="marginnote">
*The data plotted is
generated using the program
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/generate_gradient.py">generate_gradient.py</a>.
The same program is also used to generate the results quoted later
in this section.</span>:</p>
<p><canvas id="initial_gradient" width=470 height=620></canvas></p>
<p>The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in
the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster
than neurons in the first hidden layer in general?</p>
<p>To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as $\delta^l_j = \partial C / \partial b^l_j$, i.e., the gradient
for the $j$th neuron in the $l$th layer*
<span class="marginnote">
*Back in
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter
2</a> we referred to this as the error, but here we'll adopt the
informal term "gradient". I say "informal" because of course
this doesn't explicitly include the partial derivatives of the cost
with respect to the weights, $\partial C / \partial w$.</span>. We can think of the gradient $\delta^1$ as a vector whose entries determine how quickly the first hidden layer learns, and $\delta^2$ as a vector whose entries determine how quickly
the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length $\| \delta^1 \|$ measures the speed at which the first hidden layer
is learning, while the length $\| \delta^2 \|$ measures the speed at which the second hidden layer is learning.</p>
<p>With these definitions, and in the same configuration as was plotted above, we find $\| \delta^1 \| = 0.07\ldots$ and $\| \delta^2 \| = 0.31\ldots$. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning
much faster than the neurons in the first hidden layer.</p>
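<p>For readers who want to experiment, here is a minimal sketch (Python 3, standard library only) of how the lengths $\| \delta^l \|$ can be computed at initialization. It is not the <tt>generate_gradient.py</tt> program: the helper name <tt>layer_speeds</tt> is hypothetical, and the sketch assumes a quadratic cost, Gaussian weights scaled by $1/\sqrt{n_{\rm in}}$, and a single random input rather than MNIST data, so its numbers will not reproduce the values quoted above.</p>

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def layer_speeds(sizes, x, y, rng):
    """Return [||delta^1||, ||delta^2||, ...] for the hidden layers."""
    # Assumption: weights ~ N(0, 1/n_in), biases ~ N(0, 1).
    weights = [[[rng.gauss(0, 1) / math.sqrt(n_in) for _ in range(n_in)]
                for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[rng.gauss(0, 1) for _ in range(n)] for n in sizes[1:]]

    # Forward pass, storing the weighted inputs z^l of every layer.
    a, zs = x, []
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, a)) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        a = [sigmoid(zj) for zj in z]

    # Assumption: quadratic cost, so delta^L = (a^L - y) * sigma'(z^L).
    delta = [(aj - yj) * sigmoid_prime(zj)
             for aj, yj, zj in zip(a, y, zs[-1])]

    # Backward pass: delta^l = sigma'(z^l) * ((w^{l+1})^T delta^{l+1}).
    norms = []
    for l in range(len(weights) - 2, -1, -1):
        w_next = weights[l + 1]
        delta = [sigmoid_prime(zs[l][k]) *
                 sum(w_next[j][k] * delta[j] for j in range(len(delta)))
                 for k in range(len(zs[l]))]
        norms.append(math.sqrt(sum(d * d for d in delta)))
    return norms[::-1]  # ordered from first to last hidden layer

rng = random.Random(0)
sizes = [784, 30, 30, 10]
x = [rng.random() for _ in range(sizes[0])]                 # stand-in for an image
y = [1.0 if j == 3 else 0.0 for j in range(sizes[-1])]      # arbitrary target
speeds = layer_speeds(sizes, x, y, rng)
print(speeds)
```

<p>Changing <tt>sizes</tt> to include more hidden layers gives one length per hidden layer, which makes it easy to compare early and late layers for different random seeds.</p>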
<p>What happens if we add more hidden layers? If we have three hidden layers, in a $[784, 30, 30, 30, 10]$ network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than
later hidden layers. Suppose we add yet another layer with $30$ hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.</p>
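<p>It can help to turn the numbers just quoted into ratios. The short sketch below uses only the speeds stated above; the helper <tt>adjacent_ratios</tt> is a name introduced here for illustration.</p>

```python
# Speeds of learning quoted in the text, ordered from first to last hidden layer.
speeds_3_hidden = [0.012, 0.060, 0.283]
speeds_4_hidden = [0.003, 0.017, 0.070, 0.285]

def adjacent_ratios(speeds):
    """Ratio of each layer's speed to the speed of the layer before it."""
    return [later / earlier for earlier, later in zip(speeds, speeds[1:])]

for speeds in (speeds_3_hidden, speeds_4_hidden):
    ratios = adjacent_ratios(speeds)
    overall = speeds[-1] / speeds[0]   # last hidden layer versus first
    print([round(r, 1) for r in ratios], round(overall, 1))
```

<p>In both networks each hidden layer starts out roughly four to six times faster than the one before it, and in the network with four hidden layers the last hidden layer starts out about 95 times faster than the first.</p>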
<p>We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers.
The speed of learning changes as follows:</p>
<p>
<center><img src="images/training_speed_2_layers.png" width="500px"></center>
</p>
<p>To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train - I've used no mini-batches, and just 1,000 training images, rather than the full
50,000 image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results.
Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.
</p>
<p>In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than
the second hidden layer.</p>
<p>What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a $[784, 30, 30, 30, 10]$ network):</p>
<p>
<center><img src="images/training_speed_3_layers.png" width="500px"></center>
</p>
<p>Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a $[784, 30, 30, 30, 30, 10]$ network), and see what happens when we train:</p>
<p>
<center><img src="images/training_speed_4_layers.png" width="500px"></center>
</p>
<p>Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!</p>
<p>We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later
layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the <em>vanishing gradient problem</em>*<span class="marginnote">
*See
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.7321">Gradient
flow in recurrent nets: the difficulty of learning long-term
dependencies</a>, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi,
and Jürgen Schmidhuber (2001). This paper studied recurrent
neural nets, but the essential phenomenon is the same as in the
feedforward networks we are studying. See also Sepp Hochreiter's
earlier Diploma Thesis,
<a href="http://www.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf">Untersuchungen
zu dynamischen neuronalen Netzen</a> (1991, in German).</span>.</p>
<p>Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive,
either: sometimes the gradient gets much larger in earlier layers! This is the
<em>exploding gradient problem</em>, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is <em>unstable</em>, tending to either explode or vanish in earlier layers.
This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.</p>
<p>One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function $f(x)$ of a single variable. Wouldn't it be good
news if the derivative $f'(x)$ was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?</p>
<p>Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider
the first layer of weights in a $[784, 30, 30, 30, 10]$ network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will
still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks,
we need to figure out how to address the vanishing gradient problem.</p>
<p>
<h3><a name="what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"></a><a href="#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">What's causing the vanishing gradient problem? Unstable gradients in deep neural nets</a></h3></p>
<p>To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:
<center>
<img src="images/tikz37.png" />
</center>
</p>
<p>Here, $w_1, w_2, \ldots$ are the weights, $b_1, b_2, \ldots$ are the biases, and $C$ is some cost function. Just to remind you how this works, the output $a_j$ from the $j$th neuron is $\sigma(z_j)$, where $\sigma$ is the usual <a href="chap1.html#sigmoid_neurons">sigmoid
activation function</a>, and $z_j = w_{j} a_{j-1}+b_j$ is the weighted input to the neuron. I've drawn the cost $C$ at the end to emphasize that the cost is a function of the network's output, $a_4$: if the actual output from the network is close to
the desired output, then the cost will be low, while if it's far away, the cost will be high.</p>
<p>We're going to study the gradient $\partial C / \partial b_1$ associated to the first hidden neuron. We'll figure out an expression for $\partial C / \partial b_1$, and by studying that expression we'll understand why the vanishing gradient problem
occurs.</p>
<p>I'll start by simply showing you the expression for $\partial C / \partial b_1$. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that $\sigma'$
is just the derivative of the $\sigma$ function):</p>
<p>
<center>
<img src="images/tikz38.png" />
</center>
</p>
<p>The structure in the expression is as follows: there is a $\sigma'(z_j)$ term in the product for each neuron in the network; a weight $w_j$ term for each weight in the network; and a final $\partial C / \partial a_4$ term, corresponding to the cost
function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.</p>
<p>You're welcome to take this expression for granted, and skip to the
<a href="#discussion_why">discussion of how it relates to the vanishing
gradient problem</a>. There's no harm in doing this, since the expression is a special case of our
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">earlier
discussion of backpropagation</a>. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.</p>
<p>Imagine we make a small change $\Delta b_1$ in the bias $b_1$. That will set off a cascading series of changes in the rest of the network. First, it causes a change $\Delta a_1$ in the output from the first hidden neuron. That, in turn, will cause
a change $\Delta z_2$ in the weighted input to the second hidden neuron. Then a change $\Delta a_2$ in the output from the second hidden neuron. And so on, all the way through to a change $\Delta C$ in the cost at the output. We have
<a class="displaced_anchor" name="eqtn114"></a>\begin{eqnarray} \frac{\partial C}{\partial b_1} \approx \frac{\Delta C}{\Delta b_1}. \tag{114}\end{eqnarray} This suggests that we can figure out an expression for the gradient $\partial C / \partial
b_1$ by carefully tracking the effect of each step in this cascade.</p>
<p>To do this, let's think about how $\Delta b_1$ causes the output $a_1$ from the first hidden neuron to change. We have $a_1 = \sigma(z_1) = \sigma(w_1 a_0 + b_1)$, so
<a class="displaced_anchor" name="eqtn115"></a><a class="displaced_anchor" name="eqtn116"></a>\begin{eqnarray} \Delta a_1 & \approx & \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \tag{115}\\ & = & \sigma'(z_1) \Delta b_1. \tag{116}\end{eqnarray}
That $\sigma'(z_1)$ term should look familiar: it's the first term in our claimed expression for the gradient $\partial C / \partial b_1$. Intuitively, this term converts a change $\Delta b_1$ in the bias into a change $\Delta a_1$ in the output
activation. That change $\Delta a_1$ in turn causes a change in the weighted input $z_2 = w_2 a_1 + b_2$ to the second hidden neuron:
<a class="displaced_anchor" name="eqtn117"></a><a class="displaced_anchor" name="eqtn118"></a>\begin{eqnarray} \Delta z_2 & \approx & \frac{\partial z_2}{\partial a_1} \Delta a_1 \tag{117}\\ & = & w_2 \Delta a_1. \tag{118}\end{eqnarray} Combining
our expressions for $\Delta z_2$ and $\Delta a_1$, we see how the change in the bias $b_1$ propagates along the network to affect $z_2$:
<a class="displaced_anchor" name="eqtn119"></a>\begin{eqnarray} \Delta z_2 & \approx & \sigma'(z_1) w_2 \Delta b_1. \tag{119}\end{eqnarray} Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient
$\partial C / \partial b_1$.</p>
<p>We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a $\sigma'(z_j)$ term, and through each weight we pick up a $w_j$ term. The end result is an expression relating the final
change $\Delta C$ in cost to the initial change $\Delta b_1$ in the bias:
<a class="displaced_anchor" name="eqtn120"></a>\begin{eqnarray} \Delta C & \approx & \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4} \Delta b_1. \tag{120}\end{eqnarray} Dividing by $\Delta b_1$ we do indeed get
the desired expression for the gradient:
<a class="displaced_anchor" name="eqtn121"></a>\begin{eqnarray} \frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4}. \tag{121}\end{eqnarray}
</p>
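<p>As a sanity check on Equation (121), the sketch below compares the product formula with a numerical estimate of $\Delta C / \Delta b_1$ for the four-neuron chain. The weights, biases, input and target are arbitrary illustrative values, the quadratic cost $C = (a_4 - y)^2 / 2$ is an assumption (the text leaves $C$ unspecified), and a central difference is used in place of the one-sided difference of Equation (114) because it is more accurate.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(a0, weights, biases):
    """Return the weighted inputs z_1..z_4 and the final activation a_4."""
    a, zs = a0, []
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
    return zs, a

def cost(a0, y, weights, biases):
    # Assumption: quadratic cost, so dC/da_4 = a_4 - y.
    _, a4 = forward(a0, weights, biases)
    return 0.5 * (a4 - y) ** 2

# Arbitrary illustrative values (not taken from the text).
a0, y = 0.3, 1.0
weights = [0.8, -1.2, 0.5, 1.5]   # w_1, w_2, w_3, w_4
biases = [0.1, -0.4, 0.2, 0.3]    # b_1, b_2, b_3, b_4

# Analytic gradient: sigma'(z_1) w_2 sigma'(z_2) w_3 sigma'(z_3) w_4 sigma'(z_4) dC/da_4.
zs, a4 = forward(a0, weights, biases)
analytic = sigmoid_prime(zs[0])
for w, z in zip(weights[1:], zs[1:]):
    analytic *= w * sigmoid_prime(z)
analytic *= (a4 - y)

# Numerical estimate of dC/db_1 using a small change in b_1 in each direction.
eps = 1e-6
plus = cost(a0, y, weights, [biases[0] + eps] + biases[1:])
minus = cost(a0, y, weights, [biases[0] - eps] + biases[1:])
numeric = (plus - minus) / (2 * eps)

print(analytic, numeric)
```

<p>The two printed values should agree to many decimal places, which is what the cascade argument predicts.</p>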
<p><a id="discussion_why"></a></p>
<p><strong>Why the vanishing gradient problem occurs:</strong> To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:
<a class="displaced_anchor" name="eqtn122"></a>\begin{eqnarray} \frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}. \tag{122}\end{eqnarray} Excepting the
very last term, this expression is a product of terms of the form $w_j \sigma'(z_j)$. To understand how each of those terms behaves, let's look at a plot of the function $\sigma'$:
<div id="sigmoid_prime_graph"><a name="sigmoid_prime_graph"></a></div>
<script type="text/javascript" src="js/d3.v3.min.js"></script>
</p>
<p>The derivative reaches a maximum at $\sigma'(0) = 1/4$. Now, if we use our <a href="chap3.html#weight_initialization">standard approach</a> to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean $0$ and
standard deviation $1$. So the weights will usually satisfy $|w_j| < 1$. Putting these observations together, we see that the terms $w_j \sigma'(z_j)$ will usually satisfy $|w_j \sigma'(z_j)| < 1/4$. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the
vanishing gradient problem.</p>
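<p>A rough numerical illustration of this argument: the sketch below draws weights from a Gaussian with mean $0$ and standard deviation $1$, as in the text, and counts how often $|w \sigma'(z)| < 1/4$. The distribution of the weighted inputs $z$ is not specified in the text, so drawing them from the same Gaussian is purely an assumption made for illustration.</p>

```python
import math
import random

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

rng = random.Random(0)
n_samples = 20000

small_weight = 0   # samples with |w| < 1
small_term = 0     # samples with |w * sigma'(z)| < 1/4
for _ in range(n_samples):
    w = rng.gauss(0, 1)   # weight drawn as described in the text
    z = rng.gauss(0, 1)   # assumption: weighted input drawn from N(0, 1)
    small_weight += abs(w) < 1
    small_term += abs(w * sigmoid_prime(z)) < 0.25

print(small_weight / n_samples, small_term / n_samples)

# Upper bound (1/4)^n on the magnitude of a product of n such terms.
for n in (1, 2, 4, 8):
    print(n, 0.25 ** n)
```

<p>About two thirds of the sampled weights satisfy $|w| < 1$, and every one of those gives a term below $1/4$; some larger weights do too, because $\sigma'(z)$ is usually below its maximum. The second loop shows how quickly the bound $(1/4)^n$ shrinks as terms are multiplied together.</p>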
<p>To make this all a bit more explicit, let's compare the expression for $\partial C / \partial b_1$ to an expression for the gradient with respect to a later bias, say $\partial C / \partial b_3$. Of course, we haven't explicitly worked out an
expression for $\partial C / \partial b_3$, but it follows the same pattern described above for $\partial C / \partial b_1$. Here's the comparison of the two expressions:
<center>
<img src="images/tikz39.png" />
</center>
The two expressions share many terms. But the gradient $\partial C / \partial b_1$ includes two extra terms each of the form $w_j \sigma'(z_j)$. As we've seen, such terms are typically less than $1/4$ in magnitude. And so the gradient $\partial C / \partial
b_1$ will usually be a factor of $16$ (or more) smaller than $\partial C / \partial b_3$. This is the essential origin of the vanishing gradient problem.</p>
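<p>To spell the comparison out (a short derivation that assumes only the pattern of Equation (122), a non-zero $\partial C / \partial b_3$, and the bounds $|w_j| < 1$ and $\sigma'(z_j) \leq 1/4$ used above), divide one gradient by the other. All the terms from $\sigma'(z_3)$ onward are shared and cancel: \begin{eqnarray} \frac{\partial C / \partial b_1}{\partial C / \partial b_3} = \frac{\sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \partial C / \partial a_4}{\sigma'(z_3) \, w_4 \sigma'(z_4) \, \partial C / \partial a_4} = \sigma'(z_1) \, w_2 \, \sigma'(z_2) \, w_3. \nonumber\end{eqnarray} With $|w_2|, |w_3| < 1$ and $\sigma'(z_1), \sigma'(z_2) \leq 1/4$, the magnitude of this ratio is below $\frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}$, which is where the factor of $16$ comes from.</p>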
<p>Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights $w_j$ could grow during training. If they
do, it's possible the terms $w_j \sigma'(z_j)$ in the product will no longer satisfy $|w_j \sigma'(z_j)| < 1/4$. Indeed, if the terms get large enough - greater than $1$ - then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.</p>
<p><strong>The exploding gradient problem:</strong> Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.</p>
<p>There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say $w_1 = w_2 = w_3 = w_4 = 100$. Second, we'll choose the biases so that the $\sigma'(z_j)$ terms are not too small. That's
actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is $z_j = 0$ (and so $\sigma'(z_j) = 1/4$). So, for instance, we want $z_1 = w_1 a_0 + b_1 = 0$. We can achieve this by setting
$b_1 = -100 * a_0$. We can use the same idea to select the other biases. When we do this, we see that all the terms $w_j \sigma'(z_j)$ are equal to $100 * \frac{1}{4} = 25$. With these choices we get an exploding gradient.</p>
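<p>The construction can be checked directly. The sketch below builds the four-neuron chain with all weights equal to $100$ and picks each bias so that $z_j = 0$. The input activation $a_0 = 0.5$ is an arbitrary choice made here; the text does not fix it.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

a0 = 0.5                  # assumption: any input activation would do
weights = [100.0] * 4     # w_1 = w_2 = w_3 = w_4 = 100, as in the text

# Choose each bias so the weighted input is z_j = w_j * a_{j-1} + b_j = 0.
a, biases, zs = a0, [], []
for w in weights:
    b = -w * a
    z = w * a + b
    biases.append(b)
    zs.append(z)
    a = sigmoid(z)

terms = [w * sigmoid_prime(z) for w, z in zip(weights, zs)]
print(terms)        # [25.0, 25.0, 25.0, 25.0]

# Factor multiplying dC/da_4 in the expression for dC/db_1, Equation (122).
factor_b1 = sigmoid_prime(zs[0])
for w, z in zip(weights[1:], zs[1:]):
    factor_b1 *= w * sigmoid_prime(z)
print(factor_b1)    # 3906.25
```

<p>Each extra layer multiplies the gradient by $25$, so $\partial C / \partial b_1$ is $\frac{1}{4} \cdot 25^3 = 3906.25$ times $\partial C / \partial a_4$.</p>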
<p><strong>The unstable gradient problem:</strong> The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later
layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying
reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an <em>unstable gradient problem</em>. As a result, if we use standard gradient-based
learning techniques, different layers in the network will tend to learn at wildly different speeds.
</p>
<p>
<h4><a name="exercise_255808"></a><a href="#exercise_255808">Exercise</a></h4>
<ul>
<li> In our discussion of the vanishing gradient problem, we made use of the fact that $|\sigma'(z)| \leq 1/4$. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient
problem? </ul>
</p>
<p><strong>The prevalence of the vanishing gradient problem:</strong> We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see
why, consider again the expression $|w \sigma'(z)|$. To avoid the vanishing gradient problem we need $|w \sigma'(z)| \geq 1$. You might think this could happen easily if $w$ is very large. However, it's more difficult than it looks. The
reason is that the $\sigma'(z)$ term also depends on $w$: $\sigma'(z) = \sigma'(wa +b)$, where $a$ is the input activation. So when we make $w$ large, we need to be careful that we're not simultaneously making $\sigma'(wa+b)$ small. That
turns out to be a considerable constraint. The reason is that when we make $w$ large we tend to make $wa+b$ very large. Looking at the graph of $\sigma'$ you can see that this puts us off in the "wings" of the $\sigma'$ function, where it
takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen.
More often, though, it does not happen. And so in the generic case we have vanishing gradients.
</p>
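<p>To get a feel for how narrow that range is, the following sketch (Python 3) scans input activations $a \in [0, 1]$ for one large weight and records where $|w \sigma'(wa+b)| \geq 1$. The values $w = 100$ and $b = -50$ are illustrative choices made here, picked so that $wa + b = 0$ at $a = 0.5$.</p>

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

w, b = 100.0, -50.0   # illustrative values; z = w*a + b is 0 at a = 0.5
steps = 10000

# Scan a over [0, 1] and keep the values where |w * sigma'(w*a + b)| >= 1.
good = [k / steps for k in range(steps + 1)
        if abs(w * sigmoid_prime(w * (k / steps) + b)) >= 1]

width = max(good) - min(good)
print(min(good), max(good), width)
```

<p>Even with the bias chosen to centre the neuron, the activations that avoid a vanishing term occupy less than a tenth of the unit interval; everywhere else the neuron sits in the wings of $\sigma'$.</p>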
<p>
<h4><a name="problems_778071"></a><a href="#problems_778071">Problems</a></h4>
<ul>
<li> Consider the product $|w \sigma'(wa+b)|$. Suppose $|w \sigma'(wa+b)| \geq 1$. (1) Argue that this can only ever occur if $|w| \geq 4$. (2) Supposing that $|w| \geq 4$, consider the set of input activations $a$ for which $|w \sigma'(wa+b)|
\geq 1$. Show that the set of $a$ satisfying that constraint can range over an interval no greater in width than
<a class="displaced_anchor" name="eqtn123"></a>\begin{eqnarray} \frac{2}{|w|} \ln\left( \frac{|w|(1+\sqrt{1-4/|w|})}{2}-1\right). \tag{123}\end{eqnarray} (3) Show numerically that the above expression bounding the width of the range
is greatest at $|w| \approx 6.9$, where it takes a value $\approx 0.45$. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.</p>
<p>
<li><strong>Identity neuron:</strong> <a id="identity_neuron"></a> Consider a neuron with a single input, $x$, a corresponding weight, $w_1$, a bias $b$, and a weight $w_2$ on the output. Show that by choosing the weights and bias appropriately,
we can ensure $w_2 \sigma(w_1 x+b) \approx x$ for $x \in [0, 1]$. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. <em>Hint:
It helps to rewrite $x = 1/2+\Delta$, to assume $w_1$ is small,
and to use a Taylor series expansion in $w_1 \Delta$.</em>
</ul>
</p>
<p>
<h3><a name="unstable_gradients_in_more_complex_networks"></a><a href="#unstable_gradients_in_more_complex_networks">Unstable gradients in more complex networks</a></h3></p>
<p>We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?</p>
<p>
<center>
<img src="images/tikz40.png" />
</center>
</p>
<p>In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the $l$th layer of an $L$ layer network
<a href="chap2.html#alternative_backprop">is given by</a>:</p>
<p><a class="displaced_anchor" name="eqtn124"></a>\begin{eqnarray} \delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots \Sigma'(z^L) \nabla_a C \tag{124}\end{eqnarray}
</p>
<p>Here, $\Sigma'(z^l)$ is a diagonal matrix whose entries are the $\sigma'(z)$ values for the weighted inputs to the $l$th layer. The $w^l$ are the weight matrices for the different layers. And $\nabla_a C$ is the vector of partial derivatives
of $C$ with respect to the output activations.</p>
<p>This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form $(w^j)^T \Sigma'(z^j)$. What's more, the matrices $\Sigma'(z^j)$ have
small entries on the diagonal, none larger than $\frac{1}{4}$. Provided the weight matrices $w^j$ aren't too large, each additional term $(w^j)^T \Sigma'(z^j)$ tends to make the gradient vector smaller, leading to a vanishing gradient. More
generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. Empirically, it is typically found in sigmoid networks that gradients vanish exponentially quickly in
earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.</p>
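<p>The role of the diagonal matrix $\Sigma'(z)$ can be seen in a single backward step. The sketch below applies $(w^{l+1})^T$ and then $\Sigma'(z^l)$ to a vector and prints the length after each stage. The function <tt>backward_step</tt> is a hypothetical helper, and the random weights (Gaussian, scaled by $1/\sqrt{n}$), weighted inputs and starting vector are all assumptions chosen for illustration.</p>

```python
import math
import random

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def backward_step(w_next, z, delta_next):
    """Return (w^{l+1})^T delta^{l+1} and Sigma'(z^l) (w^{l+1})^T delta^{l+1}."""
    wt_delta = [sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
                for k in range(len(z))]
    return wt_delta, [sigmoid_prime(zk) * u for zk, u in zip(z, wt_delta)]

rng = random.Random(1)
n = 30
# Assumptions: weights ~ N(0, 1/n), weighted inputs ~ N(0, 1), random delta^{l+1}.
w_next = [[rng.gauss(0, 1) / math.sqrt(n) for _ in range(n)] for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
delta_next = [rng.gauss(0, 1) for _ in range(n)]

wt_delta, delta = backward_step(w_next, z, delta_next)
print(norm(delta_next), norm(wt_delta), norm(delta))
```

<p>Because every diagonal entry of $\Sigma'(z)$ is at most $\frac{1}{4}$, the last length is never more than a quarter of the middle one. Whether the full step shrinks the vector therefore depends on how much the weight matrix stretches it, which is the "provided the weight matrices aren't too large" condition above.</p>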
<p>
<h3><a name="other_obstacles_to_deep_learning"></a><a href="#other_obstacles_to_deep_learning">Other obstacles to deep learning</a></h3></p>
<p>In this chapter we've focused on vanishing gradients - and, more generally, unstable gradients - as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle.
Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some
of the questions people are asking.</p>
<p>As a first example, in 2010 Glorot and Bengio*
<span class="marginnote">
*<a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding
the difficulty of training deep feedforward neural networks</a>, by
Xavier Glorot and Yoshua Bengio (2010). See also the earlier
discussion of the use of sigmoids in
<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient
BackProp</a>, by Yann LeCun, Léon Bottou,
Genevieve Orr and Klaus-Robert Müller (1998).</span> found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause
the activations in the final hidden layer to saturate near $0$ early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.</p>
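<p>To see why saturation near $0$ slows learning, recall that for a sigmoid neuron with activation $a = \sigma(z)$ we have $\sigma'(z) = a(1-a)$, which is largest at $a = \frac{1}{2}$ and tends to $0$ as $a$ approaches $0$ or $1$. The short sketch below just tabulates this relationship; it is not a reproduction of Glorot and Bengio's experiments, and the helper names <tt>sigmoid_prime_from_activation</tt> and <tt>slowdown_factor</tt> are made up for this illustration.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_activation(a):
    """sigma'(z) written in terms of the activation a = sigma(z)."""
    return a * (1.0 - a)


def slowdown_factor(a):
    """How many times smaller sigma'(z) is at activation a than at its
    maximum value of 1/4, which is attained at a = 1/2."""
    return 0.25 / sigmoid_prime_from_activation(a)


if __name__ == "__main__":
    # More negative weighted inputs push the activation toward 0.
    for z in (0.0, -2.0, -4.0, -6.0):
        a = sigmoid(z)
        print("z = {0:5.1f}  a = {1:.4f}  sigma' = {2:.4f}  slowdown = {3:.1f}x"
              .format(z, a, sigmoid_prime_from_activation(a), slowdown_factor(a)))
```

<p>Since the gradient with respect to a neuron's incoming weights carries a factor of $\sigma'(z)$, a neuron stuck near $a = 0$ learns more slowly by roughly the factor shown in the last column.</p>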
<p>As a second example, in 2013 Sutskever, Martens, Dahl and Hinton*
<span class="marginnote">
*<a href="http://www.cs.toronto.edu/~hinton/absps/momentum.pdf">On
the importance of initialization and momentum in deep learning</a>,
by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton
(2013).</span> studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the
ability to train deep networks.</p>
<p>These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated with gradient-based learning in deep networks. The results in the last two paragraphs
suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other
hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing.
But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.</p>
<p>
<script src="js/misc.js" type="text/javascript"></script>
<script src="js/canvas.js" type="text/javascript"></script>
<script src="js/neuron.js" type="text/javascript"></script>
<script src="js/chap5.js" type="text/javascript"></script>
</p>
<p>
</div>
<div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2015
<br/>
<br/>
This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>. This means you're free to copy, share, and
build on this book, but not to sell it. If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Thu Jan 19 06:09:48 2017
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
(function(i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r;
i[r] = i[r] || function() {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date();
a = s.createElement(o),
m = s.getElementsByTagName(o)[0];
a.async = 1;
a.src = g;
m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
ga('send', 'pageview');
</script>
</body>
</html>