<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file. Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other -->
<!-- code is quite ugly! Later versions should be better. -->
<head>
<meta charset="utf-8">
<meta name="citation_title" content="Neural Networks and Deep Learning">
<meta name="citation_author" content="Nielsen, Michael A.">
<meta name="citation_publication_date" content="2015">
<meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
<meta name="citation_publisher" content="Determination Press">
<link rel="icon" href="nnadl_favicon.ICO" />
<title>Neural networks and deep learning</title>
<script src="assets/jquery.min.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$']]}, "HTML-CSS": {scale: 92}, TeX: { equationNumbers: { autoNumber: "AMS" }}});
</script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<link href="assets/style.css" rel="stylesheet">
<link href="assets/pygments.css" rel="stylesheet">
<link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">
<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */
@font-face {
font-family: 'MJX_Math';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}
@font-face {
font-family: 'MJX_Main';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>
</head>
<body>
<div class="nonumber_header">
<h2><a href="index.html">Նեյրոնային ցանցեր և խորը ուսուցում</a></h2>
</div>
<div class="section">
<div id="toc">
<p class="toc_title">
<a href="index.html">Նեյրոնային ցանցեր և խորը ուսուցում</a>
</p>
<p class="toc_not_mainchapter">
<a href="about.html">Ինչի՞ մասին է գիրքը</a>
</p>
<p class="toc_not_mainchapter">
<a href="exercises_and_problems.html">Խնդիրների և վարժությունների մասին</a>
</p>
<p class='toc_mainchapter'>
<a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a>
<a href="chap1.html">Ձեռագիր թվանշանների ճանաչում՝ օգտագործելով նեյրոնային ցանցեր</a>
<div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap1.html#perceptrons">
<li>Պերսեպտրոններ</li>
</a>
<a href="chap1.html#sigmoid_neurons">
<li>Սիգմոիդ նեյրոններ</li>
</a>
<a href="chap1.html#the_architecture_of_neural_networks">
<li>Նեյրոնային ցանցերի կառուցվածքը</li>
</a>
<a href="chap1.html#a_simple_network_to_classify_handwritten_digits">
<li>Պարզ ցանց ձեռագիր թվանշանների ճանաչման համար</li>
</a>
<a href="chap1.html#learning_with_gradient_descent">
<li>Ուսուցում գրադիենտային վայրէջքի միջոցով</li>
</a>
<a href="chap1.html#implementing_our_network_to_classify_digits">
<li>Թվանշանները ճանաչող ցանցի իրականացումը</li>
</a>
<a href="chap1.html#toward_deep_learning">
<li>Խորը ուսուցմանն ընդառաջ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
};
$('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a>
<a href="chap2.html">Ինչպե՞ս է աշխատում հետադարձ տարածումը</a>
<div id="toc_how_the_backpropagation_algorithm_works" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output
_from_a_neural_network">
<li>Մարզանք. նեյրոնային ցանցի ելքային արժեքների հաշվման արագագործ, մատրիցային մոտեցում</li>
</a>
<a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function">
<li>Երկու ենթադրություն գնային ֆունկցիայի վերաբերյալ</li>
</a>
<a href="chap2.html#the_hadamard_product_$s_\odot_t$">
<li>Հադամարի արտադրյալը՝ $s \odot t$</li>
</a>
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">
<li>Հետադարձ տարածման հիմքում ընկած չորս հիմնական հավասարումները</li>
</a>
<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">
<li>Չորս հիմնական հավասարումների ապացույցները (ընտրովի)</li>
</a>
<a href="chap2.html#the_backpropagation_algorithm">
<li>Հետադարձ տարածման ալգորիթմը</li>
</a>
<a href="chap2.html#the_code_for_backpropagation">
<li>Հետադարձ տարածման իրականացման կոդը</li>
</a>
<a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm">
<li>Ի՞նչ իմաստով է հետադարձ տարածումն արագագործ ալգորիթմ</li>
</a>
<a href="chap2.html#backpropagation_the_big_picture">
<li>Հետադարձ տարածում. ամբողջական պատկերը</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
};
$('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a>
<a href="chap3.html">Նեյրոնային ցանցերի ուսուցման բարելավումը</a>
<div id="toc_improving_the_way_neural_networks_learn" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap3.html#the_cross-entropy_cost_function">
<li>Գնային ֆունկցիան՝ միջէնտրոպիայով</li>
</a>
<a href="chap3.html#overfitting_and_regularization">
<li>Գերմարզում և ռեգուլյարացում</li>
</a>
<a href="chap3.html#weight_initialization">
<li>Կշիռների սկզբնարժեքավորումը</li>
</a>
<a href="chap3.html#handwriting_recognition_revisited_the_code">
<li>Ձեռագրերի ճամաչման կոդի վերանայում</li>
</a>
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">
<li>Ինչպե՞ս ընտրել նեյրոնային ցանցերի հիպեր-պարամետրերը</li>
</a>
<a href="chap3.html#other_techniques">
<li>Այլ տեխնիկաներ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
};
$('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a>
<a href="chap4.html">Տեսողական ապացույց այն մասին, որ նեյրոնային ֆունկցիաները կարող են մոտարկել կամայական ֆունկցիա</a>
<div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap4.html#two_caveats">
<li>Երկու զգուշացում</li>
</a>
<a href="chap4.html#universality_with_one_input_and_one_output">
<li>Ունիվերսալություն մեկ մուտքով և մեկ ելքով</li>
</a>
<a href="chap4.html#many_input_variables">
<li>Մեկից ավել մուտքային փոփոխականներ</li>
</a>
<a href="chap4.html#extension_beyond_sigmoid_neurons">
<li>Ընդլայնումը Սիգմոիդ նեյրոններից դուրս </li>
</a>
<a href="chap4.html#fixing_up_the_step_functions">
<li>Քայլի ֆունկցիայի ուղղումը</li>
</a>
<a href="chap4.html#conclusion">
<li>Եզրակացություն</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
};
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a>
<a href="chap5.html">Ինչու՞մն է կայանում նեյրոնային ցանցերի մարզման բարդությունը</a>
<div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap5.html#the_vanishing_gradient_problem">
<li>Անհետացող գրադիենտի խնդիրը</li>
</a>
<a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">
<li>Ի՞նչն է անհետացող գրադիենտի խնդրի պատճառը։ Խորը նեյրոնային ցանցերի անկայուն գրադիենտները</li>
</a>
<a href="chap5.html#unstable_gradients_in_more_complex_networks">
<li>Անկայուն գրադիենտներն ավելի կոմպլեքս ցանցերում</li>
</a>
<a href="chap5.html#other_obstacles_to_deep_learning">
<li>Այլ խոչընդոտներ խորը ուսուցման մեջ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
};
$('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a>
<a href="chap6.html">Խորը ուսուցում</a>
<div id="toc_deep_learning" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap6.html#introducing_convolutional_networks">
<li>Փաթույթային ցանցեր</li>
</a>
<a href="chap6.html#convolutional_neural_networks_in_practice">
<li>Փաթույթային ցանցերը կիրառության մեջ</li>
</a>
<a href="chap6.html#the_code_for_our_convolutional_networks">
<li>Փաթույթային ցանցերի կոդը</li>
</a>
<a href="chap6.html#recent_progress_in_image_recognition">
<li>Առաջխաղացումները պատկերների ճանաչման ասպարեզում</li>
</a>
<a href="chap6.html#other_approaches_to_deep_neural_nets">
<li>Այլ մոտեցումներ խորը նեյրոնային ցանցերի համար</li>
</a>
<a href="chap6.html#on_the_future_of_neural_networks">
<li>Նեյրոնային ցանցերի ապագայի մասին</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_deep_learning_reveal').click(function() {
var src = $('#toc_img_deep_learning').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_deep_learning").attr('src', 'images/arrow.png');
};
$('#toc_deep_learning').toggle('fast', function() {});
});
</script>
<p class="toc_not_mainchapter">
<a href="sai.html">Հավելված: Արդյո՞ք գոյություն ունի ինտելեկտի <em>պարզ</em> ալգորիթմ</a>
</p>
<p class="toc_not_mainchapter">
<a href="acknowledgements.html">Երախտագիտություն</a>
</p>
<p class="toc_not_mainchapter"><a href="faq.html">Հաճախ տրվող հարցեր</a>
</p>
<hr>
<span class="sidebar_title">Հովանավորներ</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>
<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>
<p class="sidebar">Շնորհակալություն եմ հայտնում բոլոր <a href="supporters.html">աջակցողներին</a>, ովքեր օգնել են գիրքն իրականություն դարձնել: Հատուկ շնորհակալություններ Պավել Դուդրենովին. Շնորհակալություն եմ հայտնում նաև նրանց, ովքեր ներդրում են ունեցել
<a href="bugfinder.html">Սխալների որոնման հուշատախտակում</a>. </p>
<hr>
<span class="sidebar_title">Ռեսուրսներ</span>
<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Մայքլ Նիլսենը թվիթերում</a></p>
<p class="sidebar"><a href="faq.html">Գրքի մասին հաճախակի տրբող հարցեր</a></p>
<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Կոդի պահոցը</a></p>
<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Մայքլ Նիլսենի նախագծերի հայտարարման էլ հասցեների ցուցակը</a>
</p>
<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Խորը Ուսուցում</a>, գրքի հեղինակներ` Յան Գուդֆելլո, Յոշուա Բենջիո և Ահարոն Կուրվիլ</p>
<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>
<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>
<p class="sidebar">
<a href="http://michaelnielsen.org">Մայքլ Նիլսեն</a>, Հունվար 2017
</p>
</div>
<p>
In the <a href="chap5.html">last chapter</a> we learned that deep neural networks are often much harder to train than shallow neural networks. That's unfortunate, since we have good reason to believe that
<em>if</em> we could train deep nets they'd be much more powerful than shallow nets. But while the news from the last chapter is discouraging, we won't let it stop us. In this chapter, we'll develop techniques which can be used to train deep networks,
and apply them in practice. We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications. And we'll take a brief, speculative look at what the future
may hold for neural nets, and for artificial intelligence.</p>
<p>The chapter is a long one. To help you navigate, let's take a tour. The sections are only loosely coupled, so provided you have some basic familiarity with neural nets, you can jump to whatever most interests you.
</p>
<p>The <a href="#convolutional_networks">main part of the chapter</a> is an introduction to one of the most widely used types of deep network: deep convolutional networks. We'll work through a detailed example - code and all - of using convolutional
nets to solve the problem of classifying handwritten digits from the MNIST data set:</p>
<p>
<center><img src="images/digits.png" width="160px"></center>
</p>
<p>We'll start our account of convolutional networks with the shallow networks used to attack this problem earlier in the book. Through many iterations we'll build up more and more powerful networks. As we go we'll explore many powerful techniques: convolutions,
pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of
networks, and others. The result will be a system that offers near-human performance. Of the 10,000 MNIST test images - images not seen during training! - our system will classify 9,967 correctly. Here's a peek at the 33 images which are misclassified.
Note that the correct classification is in the top right; our program's classification is in the bottom right:</p>
<p>
<center><img src="images/ensemble_errors.png" width="580px"></center>
</p>
<p>Many of these are tough even for a human to classify. Consider, for example, the third image in the top row. To me it looks more like a "9" than an "8", which is the official classification. Our network also thinks it's a "9". This kind of "error"
is at the very least understandable, and perhaps even commendable. We conclude our discussion of image recognition with a
<a href="#recent_progress_in_image_recognition">survey of some of the
spectacular recent progress</a> using networks (particularly convolutional nets) to do image recognition.</p>
<p>The remainder of the chapter discusses deep learning from a broader and less detailed perspective. We'll
<a href="#things_we_didn't_cover_but_which_you'll_eventually_want_to_know">briefly
survey other models of neural networks</a>, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas. And we'll
<a href="#on_the_future_of_neural_networks">speculate about the
future of neural networks and deep learning</a>, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.</p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p>The chapter builds on the earlier chapters in the book, making use of and integrating ideas such as backpropagation, regularization, the softmax function, and so on. However, to read the chapter you don't need to have worked in detail through all
the earlier chapters. It will, however, help to have read <a href="chap1.html">Chapter 1</a>, on the basics of neural networks. When I use concepts from Chapters 2 to 5, I provide links so you can familiarize yourself, if necessary.</p>
<p>It's worth noting what the chapter is not. It's not a tutorial on the latest and greatest neural networks libraries. Nor are we going to be training deep networks with dozens of layers to solve problems at the very leading edge. Rather, the focus
is on understanding some of the core principles behind deep neural networks, and applying them in the simple, easy-to-understand context of the MNIST problem. Put another way: the chapter is not going to bring you right up to the frontier. Rather,
the intent of this and earlier chapters is to focus on fundamentals, and so to prepare you to understand a wide range of current work.</p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p>
<h3><a name="introducing_convolutional_networks"></a><a href="#introducing_convolutional_networks">Introducing convolutional networks</a></h3></p>
<p>
In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits:</p>
<p>
<center><img src="images/digits.png" width="160px"></center>
</p>
<p>We did this using networks in which adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers:</p>
<p>
<center>
<img src="images/tikz41.png" />
</center>
</p>
<p>In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer. For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$)
input neurons. We then trained the network's weights and biases so that the network's output would - we hope! - correctly identify the input image: '0', '1', '2', ..., '8', or '9'.</p>
<p>Our earlier networks work pretty well: we've
<a href="chap3.html#98percent">obtained a classification accuracy better
than 98 percent</a>, using training and test data from the
<a href="chap1.html#learning_with_gradient_descent">MNIST handwritten
digit data set</a>. But upon reflection, it's strange to use networks with fully-connected layers to classify images. The reason is that such a network architecture does not take into account the spatial structure of the images. For instance,
it treats input pixels which are far apart and close together on exactly the same footing. Such concepts of spatial structure must instead be inferred from the training data. But what if, instead of starting with a network architecture which is
<em>tabula rasa</em>, we used an architecture which tries to take advantage of the spatial structure? In this section I describe <em>convolutional neural networks</em>*<span class="marginnote">
*The
origins of convolutional neural networks go back to the 1970s. But
the seminal paper establishing the modern subject of convolutional
networks was a 1998 paper,
<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">"Gradient-based
learning applied to document recognition"</a>, by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner.
LeCun has since made an interesting
<a href="https://www.facebook.com/yann.lecun/posts/10152348155137143">remark</a>
on the terminology for convolutional nets: "The [biological] neural
inspiration in models like convolutional nets is very
tenuous. That's why I call them 'convolutional nets' not
'convolutional neural nets', and why we call the nodes 'units' and
not 'neurons' ". Despite this remark, convolutional nets use many
of the same ideas as the neural networks we've studied up to now:
ideas such as backpropagation, gradient descent, regularization,
non-linear activation functions, and so on. And so we will follow
common practice, and consider them a type of neural network. I will
use the terms "convolutional neural network" and "convolutional
net(work)" interchangeably. I will also use the terms
"[artificial] neuron" and "unit" interchangeably.</span>. These networks use a special architecture which is particularly well-adapted to classify images. Using this architecture makes convolutional networks fast to train. This, in turn, helps us train
deep, many-layer networks, which are very good at classifying images. Today, deep convolutional networks or some close variant are used in most neural networks for image recognition.</p>
<p>Convolutional neural networks use three basic ideas: <em>local
receptive fields</em>, <em>shared weights</em>, and <em>pooling</em>. Let's look at each of these ideas in turn.</p>
<p><strong>Local receptive fields:</strong> In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons. In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons,
whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:</p>
<p>
<center>
<img src="images/tikz42.png" />
</center>
</p>
<p>As per usual, we'll connect the input pixels to a layer of hidden neurons. But we won't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.</p>
<p>To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels. So, for a particular hidden neuron, we might have connections
that look like this:
<center>
<img src="images/tikz43.png" />
</center>
</p>
<p>That region in the input image is called the <em>local receptive
field</em> for the hidden neuron. It's a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can think of that particular hidden neuron as learning to analyze its particular local
receptive field.
</p>
<p>We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer. To illustrate this concretely, let's start with a local receptive field in the top-left
corner:
<center>
<img src="images/tikz44.png" />
</center>
</p>
<p>Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:</p>
<p>
<center>
<img src="images/tikz45.png" />
</center>
</p>
<p>And so on, building up the first hidden layer. Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer. This is because we can only move the local receptive
field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.</p>
<p>I've shown the local receptive field being moved by one pixel at a time. In fact, sometimes a different <em>stride length</em> is used. For instance, we might move the local receptive field $2$ pixels to the right (or down), in which case we'd say
a stride length of $2$ is used. In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths*<span class="marginnote">
*As was done in earlier chapters, if we're
interested in trying different stride lengths then we can use
validation data to pick out the stride length which gives the best
performance. For more details, see the
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">earlier
discussion</a> of how to choose hyper-parameters in a neural network.
The same approach may also be used to choose the size of the local
receptive field - there is, of course, nothing special about using
a $5 \times 5$ local receptive field. In general, larger local
receptive fields tend to be helpful when the input images are
significantly larger than the $28 \times 28$ pixel MNIST images.</span>.</p>
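<p>As a small sanity check on this arithmetic, here is a throwaway helper (not part of any of the book's programs) which computes the number of hidden neurons along one dimension, given the input size, the size of the local receptive field, and the stride length:</p>
<p>
<div class="highlight"><pre>
def hidden_size(input_size, field_size, stride=1):
    # The receptive field can start at positions 0, stride, 2*stride, ...
    # and must still fit entirely inside the input image.
    return (input_size - field_size) // stride + 1

print(hidden_size(28, 5))            # 24, as in the text
print(hidden_size(28, 5, stride=2))  # 12, with a stride length of 2
</pre></div>
</p>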
<p><strong>Shared weights and biases:</strong> I've said that each hidden neuron has a bias and $5 \times 5$ weights connected to its local receptive field. What I did not yet mention is that we're going to use the
<em>same</em> weights and bias for each of the $24 \times 24$ hidden neurons. In other words, for the $j, k$th hidden neuron, the output is:
<a class="displaced_anchor" name="eqtn125"></a>\begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right). \tag{125}\end{eqnarray} Here, $\sigma$ is the neural activation function - perhaps the
<a href="chap1.html#sigmoid_neurons">sigmoid function</a> we used in earlier chapters. $b$ is the shared value for the bias. $w_{l,m}$ is a $5 \times 5$ array of shared weights. And, finally, we use $a_{x, y}$ to denote the input activation at position
$x, y$.</p>
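<p>If it helps to see Equation (125) as code, here is a minimal sketch in plain Numpy. The names are illustrative only - this is not how <tt>network3.py</tt> computes feature maps internally:</p>
<p>
<div class="highlight"><pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """Evaluate Equation (125) for every hidden neuron of one feature map.

    a is the 28x28 array of input activations, w the 5x5 array of shared
    weights, and b the single shared bias.  Returns the 24x24 array of
    hidden activations."""
    out = np.zeros((24, 24))
    for j in range(24):
        for k in range(24):
            out[j, k] = sigmoid(b + np.sum(w * a[j:j+5, k:k+5]))
    return out
</pre></div>
</p>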
<p>This means that all the neurons in the first hidden layer detect exactly the same feature*<span class="marginnote">
*I haven't precisely defined the
notion of a feature. Informally, think of the feature detected by a
hidden neuron as the kind of input pattern that will cause the
neuron to activate: it might be an edge in the image, for instance,
or maybe some other type of shape. </span>, just at different locations in the input image. To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive
field. That ability is also likely to be useful at other places in the image. And so it is useful to apply the same feature detector everywhere in the image. To put it in slightly more abstract terms, convolutional networks are well adapted to the
translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat*<span class="marginnote">
*In
fact, for the MNIST digit classification problem we've been
studying, the images are centered and size-normalized. So MNIST has
less translation invariance than images found "in the wild", so to
speak. Still, features like edges and corners are likely to be
useful across much of the input space. </span>.</p>
<p>For this reason, we sometimes call the map from the input layer to the hidden layer a <em>feature map</em>. We call the weights defining the feature map the <em>shared weights</em>. And we call the bias defining the feature map in this way the <em>shared bias</em>.
The shared weights and bias are often said to define a <em>kernel</em> or
<em>filter</em>. In the literature, people sometimes use these terms in slightly different ways, and for that reason I'm not going to be more precise; rather, in a moment, we'll look at some concrete examples.</p>
<p></p>
<p>The network structure I've described so far can detect just a single kind of localized feature. To do image recognition we'll need more than one feature map. And so a complete convolutional layer consists of several different feature maps:</p>
<p>
<center>
<img src="images/tikz46.png" />
</center>
In the example shown, there are $3$ feature maps. Each feature map is defined by a set of $5 \times 5$ shared weights, and a single shared bias. The result is that the network can detect $3$ different kinds of features, with each feature being detectable
across the entire image.
</p>
<p></p>
<p>I've shown just $3$ feature maps, to keep the diagram above simple. However, in practice convolutional networks may use more (and perhaps many more) feature maps. One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated
to a $5 \times 5$ local receptive field, to recognize MNIST digits. So the example illustrated above is actually pretty close to LeNet-5. In the examples we develop later in the chapter we'll use convolutional layers with $20$ and $40$ feature maps.
Let's take a quick peek at some of the features which are learned*<span class="marginnote">
*The feature maps
illustrated come from the final convolutional network we train, see
<a href="#final_conv">here</a>.</span>:</p>
<p>
<center><img src="images/net_full_layer_0.png" width="400px"></center>
</p>
<p>The $20$ images correspond to $20$ different feature maps (or filters, or kernels). Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field. Whiter blocks mean a smaller (typically,
more negative) weight, so the feature map responds less to corresponding input pixels. Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels. Very roughly speaking, the images above show the type
of features the convolutional layer responds to.</p>
<p>So what can we conclude from these feature maps? It's clear there is spatial structure here beyond what we'd expect at random: many of the features have clear sub-regions of light and dark. That shows our network really is learning things related
to the spatial structure. However, beyond that, it's difficult to see what these feature detectors are learning. Certainly, we're not learning (say) the
<a href="http://en.wikipedia.org/wiki/Gabor_filter">Gabor filters</a> which have been used in many traditional approaches to image recognition. In fact, there's now a lot of work on better understanding the features learnt by convolutional networks.
If you're interested in following up on that work, I suggest starting with the paper
<a href="http://arxiv.org/abs/1311.2901">Visualizing and Understanding
Convolutional Networks</a> by Matthew Zeiler and Rob Fergus (2013).</p>
<p></p>
<p>A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network. For each feature map we need $25 = 5 \times 5$ shared weights, plus a single shared bias. So each feature map requires
$26$ parameters. If we have $20$ feature maps that's a total of $20 \times 26 = 520$ parameters defining the convolutional layer. By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively
modest $30$ hidden neurons, as we used in many of the examples earlier in the book. That's a total of $784 \times 30$ weights, plus an extra $30$ biases, for a total of $23,550$ parameters. In other words, the fully-connected layer would have more
than $40$ times as many parameters as the convolutional layer.</p>
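<p>Spelling the parameter counts out as (trivial) code:</p>
<p>
<div class="highlight"><pre>
conv_params = 20 * (5 * 5 + 1)   # 20 feature maps, each with 5x5 shared weights and 1 shared bias
fc_params = 784 * 30 + 30        # fully-connected layer: weights plus biases
print(conv_params)               # 520
print(fc_params)                 # 23550
print(fc_params / float(conv_params))   # about 45, i.e. more than 40 times as many
</pre></div>
</p>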
<p>Of course, we can't really do a direct comparison between the number of parameters, since the two models are different in essential ways. But, intuitively, it seems likely that the use of translation invariance by the convolutional layer will reduce
the number of parameters it needs to get the same performance as the fully-connected model. That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.</p>
<p></p>
<p>Incidentally, the name <em>convolutional</em> comes from the fact that the operation in Equation <span id="margin_903716135329_reveal" class="equation_link">(125)</span><span id="margin_903716135329" class="marginequation" style="display: none;"><a href="chap6.html#eqtn125" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_903716135329_reveal').click(function() {
$('#margin_903716135329').toggle('slow', function() {});
});
</script> is sometimes known as a
<em>convolution</em>. A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called
a convolution operation. We're not going to make any deep use of the mathematics of convolutions, so you don't need to worry too much about this connection. But it's worth at least knowing where the name comes from.</p>
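<p>For readers who like to check such identities numerically: Equation (125), as written, slides the weights across the input without flipping them, which signal-processing libraries call a cross-correlation. So, assuming Scipy is installed, one feature map can also be computed as follows (again, purely for illustration, not how <tt>network3.py</tt> works internally):</p>
<p>
<div class="highlight"><pre>
import numpy as np
from scipy.signal import correlate2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a0 = np.random.rand(28, 28)   # input activations
w = np.random.randn(5, 5)     # shared weights
b = np.random.randn()         # shared bias

# 'valid' keeps only the fully-overlapping positions, giving a 24x24 output,
# and agrees with the explicit double loop written out earlier.
a1 = sigmoid(b + correlate2d(a0, w, mode='valid'))
print(a1.shape)               # (24, 24)
</pre></div>
</p>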
<p></p>
<p></p>
<p></p>
<p></p>
<p><strong>Pooling layers:</strong> In addition to the convolutional layers just described, convolutional neural networks also contain <em>pooling
layers</em>. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.</p>
<p></p>
<p>In detail, a pooling layer takes each feature map*<span class="marginnote">
*The
nomenclature is being used loosely here. In particular, I'm using
"feature map" to mean not the function computed by the
convolutional layer, but rather the activation of the hidden neurons
output from the layer. This kind of mild abuse of nomenclature is
pretty common in the research literature.</span> output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) $2 \times 2$ neurons in the previous layer. As a
concrete example, one common procedure for pooling is known as
<em>max-pooling</em>. In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:</p>
<p>
<center>
<img src="images/tikz47.png" />
</center>
</p>
<p>Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.</p>
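<p>In case a code sketch is helpful, here is one simple way to max-pool a single $24 \times 24$ feature map in Numpy, using a reshape to group the activations into $2 \times 2$ blocks (illustrative only):</p>
<p>
<div class="highlight"><pre>
import numpy as np

def max_pool_2x2(feature_map):
    """Condense a 24x24 feature map to 12x12 by taking the maximum
    activation in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

print(max_pool_2x2(np.random.rand(24, 24)).shape)   # (12, 12)
</pre></div>
</p>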
<p>As mentioned above, the convolutional layer usually involves more than a single feature map. We apply max-pooling to each feature map separately. So if there were three feature maps, the combined convolutional and max-pooling layers would look like:</p>
<p>
<center>
<img src="images/tikz48.png" />
</center>
</p>
<p>We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location
isn't as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.</p>
<p></p>
<p>Max-pooling isn't the only technique used for pooling. Another common approach is known as <em>L2 pooling</em>. Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of
the activations in the $2 \times 2$ region. While the details are different, the intuition is similar to max-pooling: L2 pooling is a way of condensing information from the convolutional layer. In practice, both techniques have been widely used.
And sometimes people use other types of pooling operation. If you're really trying to optimize performance, you may use validation data to compare several different approaches to pooling, and choose the approach which works best. But we're not going
to worry about that kind of detailed optimization.</p>
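<p>The same reshape trick as in the max-pooling sketch above gives L2 pooling - only the reduction over each $2 \times 2$ block changes:</p>
<p>
<div class="highlight"><pre>
import numpy as np

def l2_pool_2x2(feature_map):
    """L2 pooling: the square root of the sum of squares over each
    non-overlapping 2x2 block, again condensing 24x24 down to 12x12."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
</pre></div>
</p>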
<p></p>
<p><strong>Putting it all together:</strong> We can now put all these ideas together to form a complete convolutional neural network. It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding
to the $10$ possible values for MNIST digits ('0', '1', '2', <em>etc</em>):</p>
<p>
<center>
<img src="images/tikz49.png" />
</center>
</p>
<p>The network begins with $28 \times 28$ input neurons, which are used to encode the pixel intensities for the MNIST image. This is then followed by a convolutional layer using a $5 \times 5$ local receptive field and $3$ feature maps. The result is
a layer of $3 \times 24 \times 24$ hidden feature neurons. The next step is a max-pooling layer, applied to $2 \times 2$ regions, across each of the $3$ feature maps. The result is a layer of $3 \times 12 \times 12$ hidden feature neurons.
</p>
<p>The final layer of connections in the network is a fully-connected layer. That is, this layer connects <em>every</em> neuron from the max-pooled layer to every one of the $10$ output neurons. This fully-connected architecture is the same as we used
in earlier chapters. Note, however, that in the diagram above, I've used a single arrow, for simplicity, rather than showing all the connections. Of course, you can easily imagine the connections.</p>
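<p>To keep the sizes straight, here is the arithmetic for the network in the diagram, written out as a few lines of Python. The final layer's parameter count is just the usual weights-plus-biases bookkeeping, not a number quoted in the text:</p>
<p>
<div class="highlight"><pre>
input_neurons = 28 * 28        # 784 input neurons encoding pixel intensities
conv_neurons = 3 * 24 * 24     # 1728: three 24x24 feature maps from the convolutional layer
pooled_neurons = 3 * 12 * 12   # 432: three 12x12 maps after 2x2 max-pooling
output_neurons = 10            # one output neuron per digit
fc_params = pooled_neurons * output_neurons + output_neurons   # 4330 weights plus biases
</pre></div>
</p>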
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p>This convolutional architecture is quite different to the architectures used in earlier chapters. But the overall picture is similar: a network made of many simple units, whose behaviors are determined by their weights and biases. And the overall
goal is still the same: to use training data to train the network's weights and biases so that the network does a good job classifying input digits.</p>
<p>In particular, just as earlier in the book, we will train our network using stochastic gradient descent and backpropagation. This mostly proceeds in exactly the same way as in earlier chapters. However, we do need to make a few modifications to the
backpropagation procedure. The reason is that our earlier <a href="chap2.html">derivation of
backpropagation</a> was for networks with fully-connected layers. Fortunately, it's straightforward to modify the derivation for convolutional and max-pooling layers. If you'd like to understand the details, then I invite you to work through the following
problem. Be warned that the problem will take some time to work through, unless you've really internalized the <a href="chap2.html">earlier derivation of
backpropagation</a> (in which case it's easy).</p>
<p>
<h4><a name="problem_214396"></a><a href="#problem_214396">Problem</a></h4>
<ul>
<li><strong>Backpropagation in a convolutional network</strong> The core equations of backpropagation in a network with fully-connected layers are <span id="margin_709574921443_reveal" class="equation_link">(BP1)</span><span id="margin_709574921443"
class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_709574921443_reveal').click(function() {
$('#margin_709574921443').toggle('slow', function() {});
});
</script>-<span id="margin_220452626963_reveal" class="equation_link">(BP4)</span><span id="margin_220452626963" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_220452626963_reveal').click(function() {
$('#margin_220452626963').toggle('slow', function() {});
});
</script>
(<a href="chap2.html#backpropsummary">link</a>). Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above. How are the equations of backpropagation
modified?
</ul>
</p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p>
<h3><a name="convolutional_neural_networks_in_practice"></a><a href="#convolutional_neural_networks_in_practice">Convolutional neural networks in practice</a></h3></p>
<p>We've now seen the core ideas behind convolutional neural networks. Let's look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem. The program we'll use to do this
is called
<tt>network3.py</tt>, and it's an improved version of the programs
<tt>network.py</tt> and <tt>network2.py</tt> developed in earlier chapters*
<span class="marginnote">
*Note also that <tt>network3.py</tt> incorporates ideas
from the Theano library's documentation on convolutional neural nets
(notably the implementation of
<a href="http://deeplearning.net/tutorial/lenet.html">LeNet-5</a>), from
Misha Denil's
<a href="https://github.com/mdenil/dropout">implementation of dropout</a>,
and from <a href="http://colah.github.io">Chris Olah</a>.</span>. If you wish to follow along, the code is available
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network3.py">on
GitHub</a>. Note that we'll work through the code for
<tt>network3.py</tt> itself in the next section. In this section, we'll use <tt>network3.py</tt> as a library to build convolutional networks.</p>
<p></p>
<p>The programs <tt>network.py</tt> and <tt>network2.py</tt> were implemented using Python and the matrix library Numpy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent,
and so on. But now that we understand those details, for <tt>network3.py</tt> we're going to use a machine learning library known as
<a href="http://deeplearning.net/software/theano/">Theano</a>*<span class="marginnote">
*See
<a href="http://www.iro.umontreal.ca/~lisa/pointeurs/theano_scipy2010.pdf">Theano:
A CPU and GPU Math Expression Compiler in Python</a>, by James
Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan
Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley,
and Yoshua Bengio (2010). Theano is also the basis for the popular
<a href="http://deeplearning.net/software/pylearn2/">Pylearn2</a> and
<a href="http://keras.io/">Keras</a> neural networks libraries. Other
popular neural nets libraries at the time of this writing include
<a href="http://caffe.berkeleyvision.org">Caffe</a> and
<a href="http://torch.ch">Torch</a>. </span>. Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code
(which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU. Running on a GPU provides
a substantial speedup and, again, helps make it practical to train more complex networks.</p>
<p></p>
<p>If you wish to follow along, then you'll need to get Theano running on your system. To install Theano, follow the instructions at the project's <a href="http://deeplearning.net/software/theano/">homepage</a>. The examples which follow were run using
Theano 0.6*<span class="marginnote">
*As I
release this chapter, the current version of Theano has changed to
version 0.7. I've actually rerun the examples under Theano 0.7 and
get extremely similar results to those reported in the text.</span>. Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get <tt>network3.py</tt> running you'll need to set the
<tt>GPU</tt> flag to either <tt>True</tt> or <tt>False</tt> (as appropriate) in the <tt>network3.py</tt> source. Beyond that, to get Theano up and running on a GPU you may find
<a href="http://deeplearning.net/software/theano/tutorial/using_gpu.html">the
instructions here</a> helpful. There are also tutorials on the web, easily found using Google, which can help you get things working. If you don't have a GPU available locally, then you may wish to look into
<a href="http://aws.amazon.com/ec2/instance-types/">Amazon Web Services</a> EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take
days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you're using a CPU, you may wish to reduce the number
of training epochs for the more complex experiments, or perhaps omit them entirely.</p>
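<p>If you want to confirm which device and float type Theano is actually going to use before kicking off a long run, its configuration can be inspected directly. This is standard Theano, not anything specific to <tt>network3.py</tt>:</p>
<p>
<div class="highlight"><pre>
import theano

print(theano.config.device)    # e.g. 'cpu' or 'gpu'
print(theano.config.floatX)    # 'float32' is the usual choice for GPU runs
</pre></div>
</p>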
<p>To get a baseline, we'll start with a shallow architecture using just a single hidden layer, containing $100$ hidden neurons. We'll train for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch size of $10$, and no regularization. Here
we go*<span class="marginnote">
*Code for the
experiments in this section may be found
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/conv.py">in
this script</a>. Note that the code in the script simply duplicates
and parallels the discussion in this section.<br><br>Note also that
throughout the section I've explicitly specified the number of
training epochs. I've done this for clarity about how we're
training. In practice, it's worth using
<a href="chap3.html#early_stopping">early stopping</a>, that is,
tracking accuracy on the validation set, and stopping training when
we are confident the validation accuracy has stopped improving.</span>:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">network3</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">network3</span> <span class="kn">import</span> <span class="n">Network</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">network3</span> <span class="kn">import</span> <span class="n">ConvPoolLayer</span><span class="p">,</span> <span class="n">FullyConnectedLayer</span><span class="p">,</span> <span class="n">SoftmaxLayer</span>
<span class="o">>>></span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> <span class="n">network3</span><span class="o">.</span><span class="n">load_data_shared</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">mini_batch_size</span> <span class="o">=</span> <span class="mi">10</span>
<span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
<span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">784</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
<span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
</pre></div>
</p>
<p>
I obtained a best classification accuracy of $97.80$ percent. This is the classification accuracy on the <tt>test_data</tt>, evaluated at the training epoch where we get the best classification accuracy on the
<tt>validation_data</tt>. Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data (see this <a href="chap3.html#validation_explanation">earlier
discussion</a> of the use of validation data). We will follow this practice below. Your results may vary slightly, since the network's weights and biases are randomly initialized*<span class="marginnote">
*In fact, in this
experiment I actually did three separate runs training a network
with this architecture. I then reported the test accuracy which
corresponded to the best validation accuracy from any of the three
runs. Using multiple runs helps reduce variation in results, which
is useful when comparing many architectures, as we are doing. I've
followed this procedure below, except where noted. In practice, it
made little difference to the results obtained.</span>.</p>
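<p>To make this reporting procedure concrete, here's a small sketch in Python. The accuracy numbers are invented purely for illustration, and <tt>network3.py</tt> prints per-epoch accuracies during training rather than returning them, so treat this as a description of the bookkeeping rather than code that plugs directly into the network classes:</p>
<p>
<div class="highlight"><pre><span></span>
# For each run, record per-epoch validation and test accuracies, take the test
# accuracy at the epoch with the best validation accuracy, and report the figure
# from whichever run had the best validation accuracy overall.
runs = [
    # (per-epoch validation accuracies, per-epoch test accuracies) - made-up numbers
    ([0.961, 0.972, 0.975], [0.960, 0.971, 0.976]),
    ([0.963, 0.974, 0.973], [0.962, 0.975, 0.972]),
    ([0.960, 0.971, 0.976], [0.959, 0.970, 0.978]),
]

best_validation, reported_test = 0.0, 0.0
for validation_accuracies, test_accuracies in runs:
    best_epoch = max(range(len(validation_accuracies)),
                     key=lambda j: validation_accuracies[j])
    if validation_accuracies[best_epoch] > best_validation:
        best_validation = validation_accuracies[best_epoch]
        reported_test = test_accuracies[best_epoch]

print(reported_test)   # the accuracy we would report
</pre></div>
</p>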
<p>This $97.80$ percent accuracy is close to the $98.04$ percent accuracy obtained back in <a href="chap3.html#chap3_98_04_percent">Chapter 3</a>, using a similar network architecture and learning hyper-parameters. In particular, both examples used a
shallow network, with a single hidden layer containing $100$ hidden neurons. Both also trained for $60$ epochs, used a mini-batch size of $10$, and a learning rate of $\eta = 0.1$.</p>
<p>There were, however, two differences from the earlier network. First, we <a href="chap3.html#overfitting_and_regularization">regularized</a> the earlier network, to help reduce the effects of overfitting. Regularizing the current network does improve
the accuracies, but the gain is only small, and so we'll hold off worrying about regularization until later. Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses
a softmax final layer, and the log-likelihood cost function. As
<a href="chap3.html#softmax">explained</a> in Chapter 3 this isn't a big change. I haven't made this switch for any particularly deep reason - mostly, I've done it because softmax plus log-likelihood cost is more common in modern image classification
networks.</p>
<p>Can we do better than these results using a deeper network architecture?
</p>
<p>Let's begin by inserting a convolutional layer, right at the beginning of the network. We'll use $5$ by $5$ local receptive fields, a stride length of $1$, and $20$ feature maps. We'll also insert a max-pooling layer, which combines the features using
$2$ by $2$ pooling windows. So the overall network architecture looks much like the architecture discussed in the last section, but with an extra fully-connected layer:
</p>
<p>
<center><img src="images/simple_conv.png" width="550px"></center>
</p>
<p>In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image, while the later, fully-connected layer learns at a more abstract level, integrating global information
from across the entire image. This is a common pattern in convolutional neural networks.</p>
<p>Let's train such a network, and see how it performs*<span class="marginnote">
*I've
continued to use a mini-batch size of $10$ here. In fact, as we
<a href="chap3.html#mini_batch_size">discussed earlier</a> it may be
possible to speed up training using larger mini-batches. I've
continued to use the same mini-batch size mostly for consistency
with the experiments in earlier chapters.</span>:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
<span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">20</span><span class="o">*</span><span class="mi">12</span><span class="o">*</span><span class="mi">12</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
<span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
</pre></div>
</p>
<p>That gets us to $98.78$ percent accuracy, a considerable improvement over any of our previous results. Indeed, we've cut the error rate from $2.20$ percent to $1.22$ percent, a reduction of better than a third.</p>
<p>In specifying the network structure, I've treated the convolutional and pooling layers as a single layer. Whether they're regarded as separate layers or as a single layer is to some extent a matter of taste. <tt>network3.py</tt> treats them as a single
layer because it makes the code for <tt>network3.py</tt> a little more compact. However, it is easy to modify <tt>network3.py</tt> so the layers can be specified separately, if desired.</p>
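<p>As a quick sanity check, it's worth seeing where the <tt>n_in=20*12*12</tt> in the code above comes from. The arithmetic below isn't part of <tt>network3.py</tt>; it just tracks the layer sizes:</p>
<p>
<div class="highlight"><pre><span></span>
# Tracking the layer sizes in the network above (plain arithmetic only):
image_size  = 28                       # MNIST images are 28x28
conv_output = image_size - 5 + 1       # 5x5 receptive fields, stride 1 -> 24x24 feature maps
pool_output = conv_output // 2         # 2x2 max-pooling -> 12x12
print(20 * pool_output * pool_output)  # 2880 = 20*12*12 inputs to the fully-connected layer
</pre></div>
</p>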
<p>
<h4><a name="exercise_683491"></a><a href="#exercise_683491">Exercise</a></h4>
<ul>
<li> What classification accuracy do you get if you omit the fully-connected layer, and just use the convolutional-pooling layer and softmax layer? Does the inclusion of the fully-connected layer help?
</ul>
</p>
<p>Can we improve on the $98.78$ percent classification accuracy?</p>
<p>Let's try inserting a second convolutional-pooling layer. We'll make the insertion between the existing convolutional-pooling layer and the fully-connected hidden layer. Again, we'll use a $5 \times 5$ local receptive field, and pool over $2 \times
2$ regions. Let's see what happens when we train using similar hyper-parameters to before:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
<span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
<span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
</pre></div>
</p>
<p>Once again, we get an improvement: we're now at $99.06$ percent classification accuracy!</p>
<p>There are two natural questions to ask at this point. The first question is: what does it even mean to apply a second convolutional-pooling layer? In fact, you can think of the second convolutional-pooling layer as having as input $12 \times 12$ "images",
whose "pixels" represent the presence (or absence) of particular localized features in the original input image. So you can think of this layer as having as input a version of the original input image. That version is abstracted and condensed, but
still has a lot of spatial structure, and so it makes sense to use a second convolutional-pooling layer.</p>
<p>That's a satisfying point of view, but gives rise to a second question. The output from the previous layer involves $20$ separate feature maps, and so there are $20 \times 12 \times 12$ inputs to the second convolutional-pooling layer. It's as though
we've got $20$ separate images input to the convolutional-pooling layer, not a single image, as was the case for the first convolutional-pooling layer. How should neurons in the second convolutional-pooling layer respond to these multiple input
images? In fact, we'll allow each neuron in this layer to learn from <em>all</em> $20 \times 5 \times 5$ input neurons in its local receptive field. More informally: the feature detectors in the second convolutional-pooling layer have access to
<em>all</em> the features from the previous layer, but only within their particular local receptive field*<span class="marginnote">
*This issue would have arisen in the
first layer if the input images were in color. In that case we'd
have 3 input features for each pixel, corresponding to red, green
and blue channels in the input image. So we'd allow the feature
detectors to have access to all color information, but only within a
given local receptive field.</span>.</p>
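<p>The same bookkeeping explains the <tt>n_in=40*4*4</tt> in the code above. Again, this is just arithmetic, not part of <tt>network3.py</tt>:</p>
<p>
<div class="highlight"><pre><span></span>
# Sizes for the second convolutional-pooling layer:
conv_output = 12 - 5 + 1               # 5x5 receptive fields on 12x12 inputs -> 8x8 feature maps
pool_output = conv_output // 2         # 2x2 max-pooling -> 4x4
print(40 * pool_output * pool_output)  # 640 = 40*4*4 inputs to the fully-connected layer
</pre></div>
</p>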
<p>
<h4><a name="problem_834310"></a><a href="#problem_834310">Problem</a></h4>
<ul>
<li><strong>Using the tanh activation function</strong> Several times earlier in the book I've mentioned arguments that the
<a href="chap3.html#other_models_of_artificial_neuron">tanh
function</a> may be a better activation function than the sigmoid function. We've never acted on those suggestions, since we were already making plenty of progress with the sigmoid. But now let's try some experiments with tanh as our activation
function. Try training the network with tanh activations in the convolutional and fully-connected layers*<span class="marginnote">
*Note that you can pass
<tt>activation_fn=tanh</tt> as a parameter to the
<tt>ConvPoolLayer</tt> and <tt>FullyConnectedLayer</tt> classes; a starting-point sketch appears just after this problem.</span>. Begin with the same hyper-parameters as for the sigmoid network, but train for $20$ epochs instead of $60$. How well does your network perform? What if you continue out to $60$
epochs? Try plotting the per-epoch validation accuracies for both tanh- and sigmoid-based networks, all the way out to $60$ epochs. If your results are similar to mine, you'll find the tanh networks train a little faster, but the final accuracies
are very similar. Can you explain why the tanh network might train faster? Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate, or doing some rescaling*<span class="marginnote">
*You may perhaps find
inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.</span>? Try a half-dozen iterations on the learning hyper-parameters or network architecture, searching for ways that tanh may be superior to the sigmoid. <em>Note: This is an open-ended problem.
Personally, I did not find much advantage in switching to tanh,
although I haven't experimented exhaustively, and perhaps you may
find a way. In any case, in a moment we will find an advantage in
switching to the rectified linear activation function, and so we
won't go any deeper into the use of tanh.</em>
</ul>
</p>
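<p>For reference, here's the starting-point setup for the tanh experiment in the problem above. It simply swaps the activation function in the architecture we've been using, keeping the sigmoid network's hyper-parameters but training for $20$ epochs. <tt>network3.py</tt> imports <tt>tanh</tt> from Theano, so it can be passed in directly; if your copy doesn't, <tt>from theano.tensor import tanh</tt> does the same job:</p>
<p>
<div class="highlight"><pre><span></span>
>>> from network3 import tanh
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=tanh),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=tanh),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=tanh),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 20, mini_batch_size, 0.1,
            validation_data, test_data)
</pre></div>
</p>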
<p>
<strong>Using rectified linear units:</strong> The network we've developed at this point is actually a variant of one of the networks used in the seminal 1998 paper*
<span class="marginnote">
*<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">"Gradient-based
learning applied to document recognition"</a>, by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner
(1998). There are many differences of detail, but broadly speaking
our network is quite similar to the networks described in the
paper.</span> introducing the MNIST problem, a network known as LeNet-5. It's a good foundation for further experimentation, and for building up understanding and intuition. In particular, there are many ways we can vary the network in an attempt
to improve our results.</p>
<p>As a beginning, let's change our neurons so that instead of using a sigmoid activation function, we use
<a href="chap3.html#other_models_of_artificial_neuron">rectified
linear units</a>. That is, we'll use the activation function $f(z) \equiv \max(0, z)$. We'll train for $60$ epochs, with a learning rate of $\eta = 0.03$. I also found that it helps a little to use some
<a href="chap3.html#overfitting_and_regularization">l2
regularization</a>, with regularization parameter $\lambda = 0.1$:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">network3</span> <span class="kn">import</span> <span class="n">ReLU</span>
<span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
<span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
<span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span>
<span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
</pre></div>
</p>
<p>I obtained a classification accuracy of $99.23$ percent. It's a modest improvement over the sigmoid results ($99.06$ percent). However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based
on sigmoid activation functions. There appears to be a real gain in moving to rectified linear units for this problem.</p>
<p>What makes the rectified linear activation function better than the sigmoid or tanh functions? At present, we have a poor understanding of the answer to this question. Indeed, rectified linear units have only begun to be widely used in the past few
years. The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments*<span class="marginnote">
*A
common justification is that $\max(0, z)$ doesn't saturate in the
limit of large $z$, unlike sigmoid neurons, and this helps rectified
linear units continue learning. The argument is fine, as far as it
goes, but it's hardly a detailed justification, more of a just-so
story. Note that we discussed the problems with saturation back in
<a href="chap2.html#saturation">Chapter 2</a>.</span>. They got good results classifying benchmark data sets, and the practice has spread. In an ideal world we'd have a theory telling us which activation function to pick for which application. But at
present we're a long way from such a world. I should not be at all surprised if further major improvements can be obtained by an even better choice of activation function. And I also expect that in coming decades a powerful theory of activation
functions will be developed. Today, we still have to rely on poorly understood rules of thumb and experience.</p>
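<p>To give the saturation argument some numerical flavour, here's a small illustration in plain NumPy, independent of <tt>network3.py</tt>. It compares the gradients of the two activation functions as the input grows:</p>
<p>
<div class="highlight"><pre><span></span>
import numpy as np

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    # gradient of max(0, z): 1 for positive z, 0 otherwise
    return 1.0 if z > 0 else 0.0

for z in [1.0, 5.0, 10.0]:
    print(z, sigmoid_prime(z), relu_prime(z))
# The sigmoid gradient shrinks towards zero as z grows (the neuron saturates),
# while the rectified linear unit's gradient stays at 1 for any positive input.
</pre></div>
</p>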
<p><strong>Expanding the training data:</strong> Another way we may hope to improve our results is by algorithmically expanding the training data. A simple way of expanding the training data is to displace each training image by a single pixel, either
up one pixel, down one pixel, left one pixel, or right one pixel. We can do this by running the program <tt>expand_mnist.py</tt> from the shell prompt*<span class="marginnote">
*The code
for <tt>expand_mnist.py</tt> is available
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/expand_mnist.py">here</a>.</span>:</p>
<p>
<div class="highlight"><pre><span></span>
$ python expand_mnist.py
</pre></div>
</p>
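<p>The essential operation in this expansion is nothing more than a one-pixel shift of each image. The code below is only a sketch of that idea in NumPy; the real <tt>expand_mnist.py</tt> script takes more care with the details, including the labels, what happens at the image border, and writing out the expanded data file:</p>
<p>
<div class="highlight"><pre><span></span>
import numpy as np

def shifted_copies(image):
    # image is a flattened 28x28 MNIST training image
    pixels = image.reshape(28, 28)
    return [np.roll(pixels,  1, axis=0).reshape(784),   # shifted down one pixel
            np.roll(pixels, -1, axis=0).reshape(784),   # shifted up one pixel
            np.roll(pixels,  1, axis=1).reshape(784),   # shifted right one pixel
            np.roll(pixels, -1, axis=1).reshape(784)]   # shifted left one pixel
</pre></div>
</p>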
<p>Running this program takes the $50,000$ MNIST training images, and prepares an expanded training set, with $250,000$ training images. We can then use those training images to train our network. We'll use the same network as above, with rectified linear
units. In my initial experiments I reduced the number of training epochs - this made sense, since we're training with $5$ times as much data. But, in fact, expanding the data turned out to considerably reduce the effect of overfitting. And so, after
some experimentation, I eventually went back to training for $60$ epochs. In any case, let's train:</p>
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">expanded_training_data</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">network3</span><span class="o">.</span><span class="n">load_data_shared</span><span class="p">(</span>
<span class="s2">"../data/mnist_expanded.pkl.gz"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
<span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
<span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
<span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">expanded_training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span>
<span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
</pre></div>
</p>
<p>Using the expanded training data I obtained a classification accuracy of $99.37$ percent. So this almost trivial change gives a substantial improvement. Indeed, as we
<a href="chap3.html#other_techniques_for_regularization">discussed
earlier</a> this idea of algorithmically expanding the data can be taken further. Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt*
<span class="marginnote">