<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file. Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other -->
<!-- code is quite ugly! Later versions should be better. -->
<head>
<meta charset="utf-8">
<meta name="citation_title" content="Neural Networks and Deep Learning">
<meta name="citation_author" content="Nielsen, Michael A.">
<meta name="citation_publication_date" content="2015">
<meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
<meta name="citation_publisher" content="Determination Press">
<link rel="icon" href="nnadl_favicon.ICO" />
<title>Neural networks and deep learning</title>
<script src="assets/jquery.min.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$']]}, "HTML-CSS": {scale: 92}, TeX: { equationNumbers: { autoNumber: "AMS" }}});
</script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<link href="assets/style.css" rel="stylesheet">
<link href="assets/pygments.css" rel="stylesheet">
<link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">
<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */
@font-face {
font-family: 'MJX_Math';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}
@font-face {
font-family: 'MJX_Main';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>
</head>
<body>
<div class="nonumber_header">
<h2><a href="index.html">Neural Networks and Deep Learning</a></h2>
</div>
<div class="section">
<div id="toc">
<p class="toc_title">
<a href="index.html">Նեյրոնային ցանցեր և խորը ուսուցում</a>
</p>
<p class="toc_not_mainchapter">
<a href="about.html">Ինչի՞ մասին է գիրքը</a>
</p>
<p class="toc_not_mainchapter">
<a href="exercises_and_problems.html">Խնդիրների և վարժությունների մասին</a>
</p>
<p class='toc_mainchapter'>
<a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a>
<a href="chap1.html">Ձեռագիր թվանշանների ճանաչում՝ օգտագործելով նեյրոնային ցանցեր</a>
<div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap1.html#perceptrons">
<li>Պերսեպտրոններ</li>
</a>
<a href="chap1.html#sigmoid_neurons">
<li>Սիգմոիդ նեյրոններ</li>
</a>
<a href="chap1.html#the_architecture_of_neural_networks">
<li>Նեյրոնային ցանցերի կառուցվածքը</li>
</a>
<a href="chap1.html#a_simple_network_to_classify_handwritten_digits">
<li>Պարզ ցանց ձեռագիր թվանշանների ճանաչման համար</li>
</a>
<a href="chap1.html#learning_with_gradient_descent">
<li>Ուսուցում գրադիենտային վայրէջքի միջոցով</li>
</a>
<a href="chap1.html#implementing_our_network_to_classify_digits">
<li>Թվանշանները ճանաչող ցանցի իրականացումը</li>
</a>
<a href="chap1.html#toward_deep_learning">
<li>Խորը ուսուցմանն ընդառաջ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
};
$('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a>
<a href="chap2.html">Ինչպե՞ս է աշխատում հետադարձ տարածումը</a>
<div id="toc_how_the_backpropagation_algorithm_works" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output
_from_a_neural_network">
<li>Մարզանք. նեյրոնային ցանցի ելքային արժեքների հաշվման արագագործ, մատրիցային մոտեցում</li>
</a>
<a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function">
<li>Երկու ենթադրություն գնային ֆունկցիայի վերաբերյալ</li>
</a>
<a href="chap2.html#the_hadamard_product_$s_\odot_t$">
<li>Հադամարի արտադրյալը՝ $s \odot t$</li>
</a>
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">
<li>Հետադարձ տարածման հիմքում ընկած չորս հիմնական հավասարումները</li>
</a>
<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">
<li>Չորս հիմնական հավասարումների ապացույցները (ընտրովի)</li>
</a>
<a href="chap2.html#the_backpropagation_algorithm">
<li>Հետադարձ տարածման ալգորիթմը</li>
</a>
<a href="chap2.html#the_code_for_backpropagation">
<li>Հետադարձ տարածման իրականացման կոդը</li>
</a>
<a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm">
<li>Ի՞նչ իմաստով է հետադարձ տարածումն արագագործ ալգորիթմ</li>
</a>
<a href="chap2.html#backpropagation_the_big_picture">
<li>Հետադարձ տարածում. ամբողջական պատկերը</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
};
$('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a>
<a href="chap3.html">Նեյրոնային ցանցերի ուսուցման բարելավումը</a>
<div id="toc_improving_the_way_neural_networks_learn" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap3.html#the_cross-entropy_cost_function">
<li>Գնային ֆունկցիան՝ միջէնտրոպիայով</li>
</a>
<a href="chap3.html#overfitting_and_regularization">
<li>Գերմարզում և ռեգուլյարացում</li>
</a>
<a href="chap3.html#weight_initialization">
<li>Կշիռների սկզբնարժեքավորումը</li>
</a>
<a href="chap3.html#handwriting_recognition_revisited_the_code">
<li>Ձեռագրերի ճամաչման կոդի վերանայում</li>
</a>
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">
<li>Ինչպե՞ս ընտրել նեյրոնային ցանցերի հիպեր-պարամետրերը</li>
</a>
<a href="chap3.html#other_techniques">
<li>Այլ տեխնիկաներ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
};
$('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a>
<a href="chap4.html">Տեսողական ապացույց այն մասին, որ նեյրոնային ֆունկցիաները կարող են մոտարկել կամայական ֆունկցիա</a>
<div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap4.html#two_caveats">
<li>Երկու զգուշացում</li>
</a>
<a href="chap4.html#universality_with_one_input_and_one_output">
<li>Ունիվերսալություն մեկ մուտքով և մեկ ելքով</li>
</a>
<a href="chap4.html#many_input_variables">
<li>Մեկից ավել մուտքային փոփոխականներ</li>
</a>
<a href="chap4.html#extension_beyond_sigmoid_neurons">
<li>Ընդլայնումը Սիգմոիդ նեյրոններից դուրս </li>
</a>
<a href="chap4.html#fixing_up_the_step_functions">
<li>Քայլի ֆունկցիայի ուղղումը</li>
</a>
<a href="chap4.html#conclusion">
<li>Եզրակացություն</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
};
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a>
<a href="chap5.html">Ինչու՞մն է կայանում նեյրոնային ցանցերի մարզման բարդությունը</a>
<div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap5.html#the_vanishing_gradient_problem">
<li>Անհետացող գրադիենտի խնդիրը</li>
</a>
<a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">
<li>Ի՞նչն է անհետացող գրադիենտի խնդրի պատճառը։ Խորը նեյրոնային ցանցերի անկայուն գրադիենտները</li>
</a>
<a href="chap5.html#unstable_gradients_in_more_complex_networks">
<li>Անկայուն գրադիենտներն ավելի կոմպլեքս ցանցերում</li>
</a>
<a href="chap5.html#other_obstacles_to_deep_learning">
<li>Այլ խոչընդոտներ խորը ուսուցման մեջ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
};
$('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a>
<a href="chap6.html">Խորը ուսուցում</a>
<div id="toc_deep_learning" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap6.html#introducing_convolutional_networks">
<li>Փաթույթային ցանցեր</li>
</a>
<a href="chap6.html#convolutional_neural_networks_in_practice">
<li>Փաթույթային ցանցերը կիրառության մեջ</li>
</a>
<a href="chap6.html#the_code_for_our_convolutional_networks">
<li>Փաթույթային ցանցերի կոդը</li>
</a>
<a href="chap6.html#recent_progress_in_image_recognition">
<li>Առաջխաղացումները պատկերների ճանաչման ասպարեզում</li>
</a>
<a href="chap6.html#other_approaches_to_deep_neural_nets">
<li>Այլ մոտեցումներ խորը նեյրոնային ցանցերի համար</li>
</a>
<a href="chap6.html#on_the_future_of_neural_networks">
<li>Նեյրոնային ցանցերի ապագայի մասին</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_deep_learning_reveal').click(function() {
var src = $('#toc_img_deep_learning').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_deep_learning").attr('src', 'images/arrow.png');
};
$('#toc_deep_learning').toggle('fast', function() {});
});
</script>
<p class="toc_not_mainchapter">
<a href="sai.html">Հավելված: Արդյո՞ք գոյություն ունի ինտելեկտի <em>պարզ</em> ալգորիթմ</a>
</p>
<p class="toc_not_mainchapter">
<a href="acknowledgements.html">Երախտագիտություն</a>
</p>
<p class="toc_not_mainchapter"><a href="faq.html">Հաճախ տրվող հարցեր</a>
</p>
<hr>
<span class="sidebar_title">Հովանավորներ</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>
<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>
<p class="sidebar">Շնորհակալություն եմ հայտնում բոլոր <a href="supporters.html">աջակցողներին</a>, ովքեր օգնել են գիրքն իրականություն դարձնել: Հատուկ շնորհակալություններ Պավել Դուդրենովին. Շնորհակալություն եմ հայտնում նաև նրանց, ովքեր ներդրում են ունեցել
<a href="bugfinder.html">Սխալների որոնման հուշատախտակում</a>. </p>
<hr>
<span class="sidebar_title">Ռեսուրսներ</span>
<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Մայքլ Նիլսենը թվիթերում</a></p>
<p class="sidebar"><a href="faq.html">Գրքի մասին հաճախակի տրբող հարցեր</a></p>
<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Կոդի պահոցը</a></p>
<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Մայքլ Նիլսենի նախագծերի հայտարարման էլ հասցեների ցուցակը</a>
</p>
<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Խորը Ուսուցում</a>, գրքի հեղինակներ` Յան Գուդֆելլո, Յոշուա Բենջիո և Ահարոն Կուրվիլ</p>
<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>
<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>
<p class="sidebar">
<a href="http://michaelnielsen.org">Մայքլ Նիլսեն</a>, Հունվար 2017
</p>
</div>
<p>When a golf player is first learning to play golf, they usually spend most of their time developing a basic swing. Only gradually do they develop other shots, learning to chip, draw and fade the ball, building on and modifying their basic swing. In
a similar way, up to now we've focused on understanding the backpropagation algorithm. It's our "basic swing", the foundation for learning in most work on neural networks. In this chapter I explain a suite of techniques which can be used to improve
on our vanilla implementation of backpropagation, and so improve the way our networks learn.</p>
<p>The techniques we'll develop in this chapter include: a better choice of cost function, known as
<a href="chap3.html#the_cross-entropy_cost_function">the
cross-entropy</a> cost function; four so-called
<a href="chap3.html#overfitting_and_regularization">"regularization"
methods</a> (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a
<a href="chap3.html#weight_initialization">better method for
initializing the weights</a> in the network; and a
<a href="#how_to_choose_a_neural_network's_hyper-parameters">set
of heuristics to help choose good hyper-parameters</a> for the network. I'll also overview <a href="chap3.html#other_techniques">several other
techniques</a> in less depth. The discussions are largely independent of one another, and so you may jump ahead if you wish. We'll also
<a href="#handwriting_recognition_revisited_the_code">implement</a> many of the techniques in running code, and use them to improve the results obtained on the handwriting classification problem studied in
<a href="chap1.html">Chapter 1</a>.</p>
<p>Of course, we're only covering a few of the many, many techniques which have been developed for use in neural nets. The philosophy is that the best entree to the plethora of available techniques is in-depth study of a few of the most important. Mastering
those important techniques is not just useful in its own right, but will also deepen your understanding of what problems can arise when you use neural networks. That will leave you well prepared to quickly pick up other techniques, as you need them.</p>
<p></p>
<p></p>
<p></p>
<p>
<h3><a name="the_cross-entropy_cost_function"></a><a href="#the_cross-entropy_cost_function">The cross-entropy cost function</a></h3></p>
<p>Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed
out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our
errors are less well-defined.</p>
<p>Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input:</p>
<p>
<center>
<img src="images/tikz28.png" />
</center>
</p>
<p>We'll train this neuron to do something ridiculously easy: take the input $1$ to the output $0$. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However,
it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let's take a look at how the neuron learns.</p>
<p>To make things definite, I'll pick the initial weight to be $0.6$ and the initial bias to be $0.9$. These are generic choices used as a place to begin learning, I wasn't picking them to be special in any way. The initial output from the neuron is
$0.82$, so quite a bit of learning will be needed before our neuron gets near the desired output, $0.0$. Click on "Run" in the bottom right corner below to see how the neuron learns an output much closer to $0.0$. Note that this isn't a pre-recorded
animation, your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is $\eta = 0.15$, which turns out to be slow enough that we can follow what's happening,
but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, $C$, introduced back in Chapter 1. I'll remind you of the exact form of the cost function shortly, so there's no need to go and
dig up the definition. Note that you can run the animation multiple times by clicking on "Run" again.</p>
<p>
<script type="text/javascript" src="js/paper.js"></script>
<script type="text/paperscript" src="js/saturation1.js" canvas="saturation1">
</script>
<center>
<canvas id="saturation1" width="520" height="300"></canvas>
</center>
</p>
<p>As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about $0.09$. That's not quite the desired output, $0.0$, but it is pretty good. Suppose, however, that we instead choose
both the starting weight and the starting bias to be $2.0$. In this case the initial output is $0.98$, which is very badly wrong. Let's look at how the neuron learns to output $0$ in this case. Click on "Run" again:</p>
<p>
<script type="text/paperscript" src="js/saturation2.js" canvas="saturation2">
</script>
<a id="saturation2_anchor"></a>
<center>
<canvas id="saturation2" width="520" height="300"></canvas>
</center>
</p>
<p>Although this example uses the same learning rate ($\eta = 0.15$), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don't change much at all. Then the learning kicks in and,
much as in our first example, the neuron's output rapidly moves closer to $0.0$.</p>
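<p>If you'd like to poke at these two runs outside the browser, here's a minimal Python sketch of what the animations compute. It isn't part of the book's code, and the variable names and the 300-epoch loop are my own choices, but the update step is just gradient descent on the quadratic cost, as described above. Starting from $w = 0.6, b = 0.9$ the printed output falls steadily, while starting from $w = b = 2.0$ it barely moves for the first hundred or so epochs, matching what we saw in the animations.</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y, eta = 1.0, 0.0, 0.15   # training pair and learning rate from the text

for w0, b0 in [(0.6, 0.9), (2.0, 2.0)]:
    w, b = w0, b0
    print("starting from w = %.1f, b = %.1f" % (w0, b0))
    for epoch in range(1, 301):
        a = sigmoid(w * x + b)
        # Quadratic cost C = (y - a)**2 / 2, so
        # dC/dw = (a - y) * sigma'(z) * x and dC/db = (a - y) * sigma'(z).
        grad = (a - y) * a * (1.0 - a)
        w -= eta * grad * x
        b -= eta * grad
        if epoch % 100 == 0:
            print("  epoch %3d: output %.3f" % (epoch, sigmoid(w * x + b)))
</pre></div>
</p>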
<p>This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning
when it's badly wrong - far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding
this slowdown?</p>
<p>To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C / \partial b$. So saying "learning
is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which,
from Equation <span id="margin_661558065379_reveal" class="equation_link">(6)</span><span id="margin_661558065379" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray} C(w,b) \equiv
\frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_661558065379_reveal').click(function() {
$('#margin_661558065379').toggle('slow', function() {});
});
</script>, is given by
<a class="displaced_anchor" name="eqtn54"></a>\begin{eqnarray} C = \frac{(y-a)^2}{2}, \tag{54}\end{eqnarray} where $a$ is the neuron's output when the training input $x = 1$ is used, and $y = 0$ is the corresponding desired output. To write this
more explicitly in terms of the weight and bias, recall that $a = \sigma(z)$, where $z = wx+b$. Using the chain rule to differentiate with respect to the weight and bias we get
<a class="displaced_anchor" name="eqtn55"></a><a class="displaced_anchor" name="eqtn56"></a>\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\ \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a
\sigma'(z), \tag{56}\end{eqnarray} where I have substituted $x = 1$ and $y = 0$. To understand the behaviour of these expressions, let's look more closely at the $\sigma'(z)$ term on the right-hand side. Recall the shape of the $\sigma$ function:</p>
<p>
<div id="sigmoid_graph"><a name="sigmoid_graph"></a></div>
<script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
<script>
function s(x) {
return 1 / (1 + Math.exp(-x));
}
var m = [40, 120, 50, 120];
var height = 290 - m[0] - m[2];
var width = 600 - m[1] - m[3];
var xmin = -5;
var xmax = 5;
var sample = 400;
var x1 = d3.scale.linear().domain([0, sample]).range([xmin, xmax]);
var data = d3.range(sample).map(function(d) {
return {
x: x1(d),
y: s(x1(d))
};
});
var x = d3.scale.linear().domain([xmin, xmax]).range([0, width]);
var y = d3.scale.linear()
.domain([0, 1])
.range([height, 0]);
var line = d3.svg.line()
.x(function(d) {
return x(d.x);
})
.y(function(d) {
return y(d.y);
})
var graph = d3.select("#sigmoid_graph")
.append("svg")
.attr("width", width + m[1] + m[3])
.attr("height", height + m[0] + m[2])
.append("g")
.attr("transform", "translate(" + m[3] + "," + m[0] + ")");
var xAxis = d3.svg.axis()
.scale(x)
.tickValues(d3.range(-4, 5, 1))
.orient("bottom")
graph.append("g")
.attr("class", "x axis")
.attr("transform", "translate(0, " + height + ")")
.call(xAxis);
var yAxis = d3.svg.axis()
.scale(y)
.tickValues(d3.range(0, 1.01, 0.2))
.orient("left")
.ticks(5)
graph.append("g")
.attr("class", "y axis")
.call(yAxis);
graph.append("path").attr("d", line(data));
graph.append("text")
.attr("class", "x label")
.attr("text-anchor", "end")
.attr("x", width / 2)
.attr("y", height + 35)
.text("z");
graph.append("text")
.attr("x", (width / 2))
.attr("y", -10)
.attr("text-anchor", "middle")
.style("font-size", "16px")
.text("sigmoid function");
</script>
</p>
<p>We can see from this graph that when the neuron's output is close to $1$, the curve gets very flat, and so $\sigma'(z)$ gets very small. Equations <span id="margin_725661222068_reveal" class="equation_link">(55)</span><span id="margin_725661222068"
class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_725661222068_reveal').click(function() {
$('#margin_725661222068').toggle('slow', function() {});
});
</script> and <span id="margin_544739045303_reveal" class="equation_link">(56)</span><span id="margin_544739045303" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_544739045303_reveal').click(function() {
$('#margin_544739045303').toggle('slow', function() {});
});
</script> then tell us that $\partial C / \partial w$ and $\partial C / \partial b$ get very small. This is the origin of the learning slowdown. What's more, as we shall see a little later, the learning slowdown occurs for essentially the same reason
in more general neural networks, not just the toy example we've been playing with.</p>
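<p>It's easy to put numbers to this. The little sketch below (my own illustration, not part of the book's code) evaluates $\sigma(z)$ and $\sigma'(z) = \sigma(z)(1-\sigma(z))$ at the two starting points used above. With $w = 0.6, b = 0.9$ we have $z = 1.5$ and $\sigma'(z)$ is a healthy size; with $w = b = 2.0$ we have $z = 4.0$ and $\sigma'(z)$ is roughly eight times smaller, which is why the gradients in Equations (55) and (56) start out so tiny in the second run.</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x = 1.0
for w, b in [(0.6, 0.9), (2.0, 2.0)]:
    z = w * x + b
    print("w=%.1f b=%.1f  output=%.3f  sigma'(z)=%.3f"
          % (w, b, sigmoid(z), sigmoid_prime(z)))
# Prints roughly: output 0.818, sigma'(z) 0.149 for the first starting point,
# and output 0.982, sigma'(z) 0.018 for the second.
</pre></div>
</p>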
<p>
<h4><a name="introducing_the_cross-entropy_cost_function"></a><a href="#introducing_the_cross-entropy_cost_function">Introducing the cross-entropy cost function</a></h4></p>
<p>How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple
toy model. We'll suppose instead that we're trying to train a neuron with several input variables, $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias, $b$:
<center>
<img src="images/tikz29.png" />
</center>
The output from the neuron is, of course, $a = \sigma(z)$, where $z = \sum_j w_j x_j+b$ is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by
<a class="displaced_anchor" name="eqtn57"></a>\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right], \tag{57}\end{eqnarray} where $n$ is the total number of items of training data, the sum is over all training inputs,
$x$, and $y$ is the corresponding desired output.
</p>
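<p>To pin down the notation, here's how the definition (57) might look as straightforward code for a single neuron. This is only a sketch (the function and variable names are mine, and it ignores the numerical care needed if $a$ hits exactly $0$ or $1$); the program we use later in the chapter, <tt>network2.py</tt>, works with whole layers of neurons at once.</p>
<p>
<div class="highlight"><pre>import math

def cross_entropy(outputs, desired):
    """Cross-entropy cost, Equation (57), for a single sigmoid neuron.

    `outputs` and `desired` list the neuron's output a and the desired
    output y for each training input x."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, desired):
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / n

print(cross_entropy([0.05, 0.9], [0.0, 1.0]))  # nearly right: roughly 0.08
print(cross_entropy([0.98], [0.0]))            # badly wrong: roughly 3.9
</pre></div>
</p>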
<p>It's not obvious that the expression <span id="margin_94747236429_reveal" class="equation_link">(57)</span><span id="margin_94747236429" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_94747236429_reveal').click(function() {
$('#margin_94747236429').toggle('slow', function() {});
});
</script>
fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.</p>
<p>Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, $C > 0$. To see this, notice that: (a) all the individual terms in the sum in <span id="margin_459510912475_reveal"
class="equation_link">(57)</span><span id="margin_459510912475" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_459510912475_reveal').click(function() {
$('#margin_459510912475').toggle('slow', function() {});
});
</script> are negative, since both logarithms are of numbers in the range $0$ to $1$; and (b) there is a minus sign out the front of the sum.</p>
<p>Second, if the neuron's actual output is close to the desired output for all training inputs, $x$, then the cross-entropy will be close to zero*
<span class="marginnote">
*To prove this I will need to assume that the desired
outputs $y$ are all either $0$ or $1$. This is usually the case
when solving classification problems, for example, or when computing
Boolean functions. To understand what happens when we don't make
this assumption, see the exercises at the end of this section.</span>. To see this, suppose for example that $y = 0$ and $a \approx 0$ for some input $x$. This is a case when the neuron is doing a good job on that input. We see that the first
term in the expression <span id="margin_983582689293_reveal" class="equation_link">(57)</span><span id="margin_983582689293" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_983582689293_reveal').click(function() {
$('#margin_983582689293').toggle('slow', function() {});
});
</script> for the cost vanishes, since $y = 0$, while the second term is just $-\ln (1-a) \approx 0$. A similar analysis holds when $y = 1$ and $a \approx 1$. And so the contribution to the cost will be low provided the actual output is close to
the desired output.</p>
<p>Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, $y$, for all training inputs, $x$. These are both properties we'd intuitively expect for a cost function. Indeed, both properties
are also satisfied by the quadratic cost. So that's good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute
the partial derivative of the cross-entropy cost with respect to the weights. We substitute $a = \sigma(z)$ into <span id="margin_457283770982_reveal" class="equation_link">(57)</span><span id="margin_457283770982" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_457283770982_reveal').click(function() {
$('#margin_457283770982').toggle('slow', function() {});
});
</script>, and apply the chain rule twice, obtaining:
<a class="displaced_anchor" name="eqtn58"></a><a class="displaced_anchor" name="eqtn59"></a>\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left( \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right) \frac{\partial
\sigma}{\partial w_j} \tag{58}\\ & = & -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j. \tag{59}\end{eqnarray} Putting everything over a common denominator and simplifying this becomes:
<a class="displaced_anchor" name="eqtn60"></a>\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \tag{60}\end{eqnarray} Using the definition of the sigmoid function,
$\sigma(z) = 1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) = \sigma(z)(1-\sigma(z))$. I'll ask you to verify this in an exercise below, but for now let's accept it as given. We see that the $\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$
terms cancel in the equation just above, and it simplifies to become:
<a class="displaced_anchor" name="eqtn61"></a>\begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \tag{61}\end{eqnarray} This is a beautiful expression. It tells us that the rate at which the weight learns is
controlled by $\sigma(z)-y$, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the $\sigma'(z)$ term in the
analogous equation for the quadratic cost, Equation <span id="margin_801939868559_reveal" class="equation_link">(55)</span><span id="margin_801939868559" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_801939868559_reveal').click(function() {
$('#margin_801939868559').toggle('slow', function() {});
});
</script>. When we use the cross-entropy, the $\sigma'(z)$ term gets canceled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a
miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.</p>
<p>In a similar way, we can compute the partial derivative for the bias. I won't go through all the details again, but you can easily verify that
<a class="displaced_anchor" name="eqtn62"></a>\begin{eqnarray} \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y). \tag{62}\end{eqnarray} Again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation
for the quadratic cost, Equation <span id="margin_986509902470_reveal" class="equation_link">(56)</span><span id="margin_986509902470" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_986509902470_reveal').click(function() {
$('#margin_986509902470').toggle('slow', function() {});
});
</script>.</p>
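<p>Equations (61) and (62) make a concrete prediction about our toy example: at the bad starting point $w = b = 2.0$ the cross-entropy gradients should dwarf the quadratic-cost gradients of Equations (55) and (56). A quick sketch (my own, using the single training pair $x = 1, y = 0$, so the sums reduce to a single term) bears this out:</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.0
w, b = 2.0, 2.0                     # the starting point where learning stalled
a = sigmoid(w * x + b)              # about 0.98

# Quadratic cost, Equations (55) and (56): the sigma'(z) factor is present.
quad_dw = (a - y) * a * (1.0 - a) * x
quad_db = (a - y) * a * (1.0 - a)

# Cross-entropy, Equations (61) and (62): the sigma'(z) factor has cancelled.
xent_dw = (a - y) * x
xent_db = a - y

print(quad_dw, quad_db)   # roughly 0.017 each
print(xent_dw, xent_db)   # roughly 0.98 each, over fifty times larger
</pre></div>
</p>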
<p>
<h4><a name="exercise_35813"></a><a href="#exercise_35813">Exercise</a></h4>
<ul>
<li> Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.</p>
<p>
</ul>
</p>
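<p>If you'd also like a machine to confirm the identity, here's a throwaway <tt>sympy</tt> check (my own, not part of the book's code); it doesn't replace working the algebra yourself:</p>
<p>
<div class="highlight"><pre>import sympy as sp

z = sp.symbols('z')
sigma = 1 / (1 + sp.exp(-z))

# The difference between sigma'(z) and sigma(z)*(1 - sigma(z)) simplifies to 0.
print(sp.simplify(sp.diff(sigma, z) - sigma * (1 - sigma)))   # prints 0
</pre></div>
</p>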
<p>Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight
$0.6$ and starting bias $0.9$. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:</p>
<p>
<script type="text/paperscript" src="js/saturation3.js" canvas="saturation3"></script>
<center><canvas id="saturation3" width="520" height="300"></canvas></center>
</p>
<p>Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before (<a href="#saturation2_anchor">link</a>, for comparison), with the weight and bias both starting
at $2.0$:</p>
<p>
<script type="text/paperscript" src="js/saturation4.js" canvas="saturation4"></script>
<center><canvas id="saturation4" width="520" height="300"></canvas></center>
</p>
<p>Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It's that
steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.</p>
<p>I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used $\eta = 0.15$. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not
possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still
curious, despite my disavowal, here's the lowdown: I used $\eta = 0.005$ in the examples just given.</p>
<p>You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't
about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is <em>slower</em> when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to
the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set. </p>
<p>We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the
neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by
<a class="displaced_anchor" name="eqtn63"></a>\begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \tag{63}\end{eqnarray} This is the same as our earlier expression, Equation <span id="margin_233878627463_reveal"
class="equation_link">(57)</span><span id="margin_233878627463" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_233878627463_reveal').click(function() {
$('#margin_233878627463').toggle('slow', function() {});
});
</script>, except now we've got the $\sum_j$ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression <span id="margin_787051846171_reveal" class="equation_link">(63)</span>
<span id="margin_787051846171" class="marginequation" style="display: none;"><a href="chap3.html#eqtn63" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray} C = -\frac{1}{n} \sum_x
\sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_787051846171_reveal').click(function() {
$('#margin_787051846171').toggle('slow', function() {});
});
</script> avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below. </p>
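<p>In code, the many-neuron form (63) is most naturally written with vectorized operations over whole output layers. The sketch below is an illustration rather than the book's official implementation, though it follows the same pattern as the cost class we'll meet in <tt>network2.py</tt>; the call to <tt>np.nan_to_num</tt> guards against the $0 \ln 0$ terms that appear when an output activation saturates at exactly the right value.</p>
<p>
<div class="highlight"><pre>import numpy as np

def cross_entropy_cost(output_activations, desired_outputs):
    """Cross-entropy cost, Equation (63), averaged over training inputs.

    Both arguments are lists of numpy column vectors, one pair (a^L, y)
    per training input x."""
    n = len(output_activations)
    total = 0.0
    for a, y in zip(output_activations, desired_outputs):
        # nan_to_num turns the nan from 0*log(0) into 0.
        total += np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))
    return total / n

# A two-neuron output layer, one training input, nearly correct outputs:
a = np.array([[0.9], [0.1]])
y = np.array([[1.0], [0.0]])
print(cross_entropy_cost([a], [y]))   # small: roughly 0.21
</pre></div>
</p>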
<p>When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually
initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near $1$, when it should
be $0$, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.</p>
<p>
<h4><a name="exercises_824189"></a><a href="#exercises_824189">Exercises</a></h4>
<ul>
<li> One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the $y$s and the $a$s. It's easy to get confused about whether the right form is $-[y \ln a + (1-y) \ln (1-a)]$ or $-[a \ln y + (1-a) \ln
(1-y)]$. What happens to the second of these expressions when $y = 0$ or $1$? Does this problem afflict the first expression? Why or why not? </p>
<p>
<li> In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if $\sigma(z) \approx y$ for all training inputs. The argument relied on $y$ being equal to either $0$ or $1$. This is usually true in classification
problems, but for other problems (e.g., regression problems) $y$ can sometimes take values intermediate between $0$ and $1$. Show that the cross-entropy is still minimized when $\sigma(z) = y$ for all training inputs. When this is the case the
cross-entropy has the value:
<a class="displaced_anchor" name="eqtn64"></a>\begin{eqnarray} C = -\frac{1}{n} \sum_x [y \ln y+(1-y) \ln(1-y)]. \tag{64}\end{eqnarray} The quantity $-[y \ln y+(1-y)\ln(1-y)]$ is sometimes known as the
<a href="http://en.wikipedia.org/wiki/Binary_entropy_function">binary
entropy</a>.</p>
<p>
</ul>
</p>
<p>
<h4><a name="problems_382219"></a><a href="#problems_382219">Problems</a></h4>
<ul>
<li><strong>Many-layer multi-neuron networks</strong> In the notation introduced in the <a href="chap2.html">last chapter</a>, show that for the quadratic cost the partial derivative with respect to weights in the output layer is
<a class="displaced_anchor" name="eqtn65"></a>\begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \sigma'(z^L_j). \tag{65}\end{eqnarray} The term $\sigma'(z^L_j)$ causes a learning slowdown whenever
an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error $\delta^L$ for a single training example $x$ is given by
<a class="displaced_anchor" name="eqtn66"></a>\begin{eqnarray} \delta^L = a^L-y. \tag{66}\end{eqnarray} Use this expression to show that the partial derivative with respect to the weights in the output layer is given by
<a class="displaced_anchor" name="eqtn67"></a>\begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j). \tag{67}\end{eqnarray} The $\sigma'(z^L_j)$ term has vanished, and so the cross-entropy avoids
the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you
should work through that analysis as well.</p>
<p>
<li><strong>Using the quadratic cost when we have linear neurons in the
output layer</strong> Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are
<em>linear neurons</em>, meaning that the sigmoid activation function is not applied, and the outputs are simply $a^L_j = z^L_j$. Show that if we use the quadratic cost function then the output error $\delta^L$ for a single training example $x$
is given by
<a class="displaced_anchor" name="eqtn68"></a>\begin{eqnarray} \delta^L = a^L-y. \tag{68}\end{eqnarray} Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output
layer are given by
<a class="displaced_anchor" name="eqtn69"></a><a class="displaced_anchor" name="eqtn70"></a>\begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \tag{69}\\ \frac{\partial C}{\partial b^L_{j}} & =
& \frac{1}{n} \sum_x (a^L_j-y_j). \tag{70}\end{eqnarray} This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact,
an appropriate cost function to use.
</ul>
</p>
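<p>Before (or after) working through these problems, you may find a numerical sanity check reassuring. The sketch below (my own setup: one training input and a three-neuron sigmoid output layer) compares the prediction $\delta^L = a^L - y$ of Equation (66) against a finite-difference estimate of $\partial C/\partial z^L_j$ for the cross-entropy cost; the two agree to several decimal places.</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z, y):
    """Cross-entropy for one training input, output activations sigmoid(z)."""
    a = sigmoid(z)
    return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

z = np.array([2.0, -1.0, 0.5])    # weighted inputs z^L to the output layer
y = np.array([1.0, 0.0, 1.0])     # desired outputs

analytic = sigmoid(z) - y         # Equation (66): delta^L = a^L - y

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric[j] = (cost(z + dz, y) - cost(z - dz, y)) / (2 * eps)

print(analytic)
print(numeric)   # matches the analytic values
</pre></div>
</p>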
<p>
<h4><a name="using_the_cross-entropy_to_classify_mnist_digits"></a><a href="#using_the_cross-entropy_to_classify_mnist_digits">Using the cross-entropy to classify MNIST digits</a></h4></p>
<p></p>
<p>The cross-entropy is easy to implement as part of a program which learns using gradient descent and backpropagation. We'll do that
<a href="#handwriting_recognition_revisited_the_code">later in the
chapter</a>, developing an improved version of our
<a href="chap1.html#implementing_our_network_to_classify_digits">earlier
program</a> for classifying the MNIST handwritten digits,
<tt>network.py</tt>. The new program is called <tt>network2.py</tt>, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter*<span class="marginnote">
*The code is available
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network2.py">on
GitHub</a>.</span>. For now, let's look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we'll use a network with $30$ hidden neurons, and we'll use a mini-batch size of $10$. We set the learning rate to $\eta = 0.5$*
<span class="marginnote">
*In Chapter 1 we used the quadratic cost and a learning rate of $\eta = 3.0$. As discussed above, it's not possible to say precisely what it means to use the "same" learning rate when the cost function is changed. For both cost functions I experimented
to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices. <br/> <br/> There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic
cost. As we saw earlier, the gradient terms for the quadratic cost have an extra $\sigma' = \sigma(1-\sigma)$ term in them. Suppose we average this over values for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$. We see that (very roughly)
the quadratic cost learns an average of $6$ times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by $6$. Of course, this argument is far from rigorous, and
shouldn't be taken too seriously. Still, it can sometimes be a useful starting point.</span> and we train for $30$ epochs. The interface to <tt>network2.py</tt> is slightly different than
<tt>network.py</tt>, but it should still be clear what is going on. You can, by the way, get documentation about <tt>network2.py</tt>'s interface by using commands such as <tt>help(network2.Network.SGD)</tt> in a Python shell.</p>
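<p>Incidentally, the average quoted in the side note above is quick to verify:
\begin{eqnarray} \int_0^1 \sigma(1-\sigma)\, d\sigma = \left[ \frac{\sigma^2}{2} - \frac{\sigma^3}{3} \right]_0^1 = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}. \nonumber\end{eqnarray}
</p>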
<p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">>>></span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">>>></span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p>
<p>Note, by the way, that the <tt>net.large_weight_initializer()</tt> command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight
initialization in our networks. The result from running the above sequence of commands is a network with $95.49$ percent accuracy. This is pretty close to the result we obtained in Chapter 1, $95.42$ percent, using the quadratic cost.
</p>
<p>Let's look also at the case where we use $100$ hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of $96.82$ percent. That's a substantial improvement over the results from Chapter 1,
where we obtained a classification accuracy of $96.59$ percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from $3.41$ percent to $3.18$ percent. That is, we've eliminated about one
in fourteen of the original errors. That's quite a handy improvement.</p>
<p>It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into
choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier
theoretical argument that the cross-entropy is a better choice than the quadratic cost.</p>
<p>This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we
see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal
of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short
of definitive proof, and remain alert to signs that the arguments are breaking down.</p>
<p>By now, we've discussed the cross-entropy at great length. Why go to so much effort when it gives only a small improvement to our MNIST results? Later in the chapter we'll see other techniques - notably,
<a href="#overfitting_and_regularization">regularization</a> - which give much bigger improvements. So why so much focus on cross-entropy? Part of the reason is that the cross-entropy is a widely-used cost function, and so is worth understanding
well. But the more important reason is that neuron saturation is an important problem in neural nets, a problem we'll return to repeatedly throughout the book. And so I've discussed the cross-entropy at length because it's a good laboratory to begin
understanding neuron saturation and how it may be addressed.
</p>
<p>
<h4><a name="what_does_the_cross-entropy_mean_where_does_it_come_from"></a><a href="#what_does_the_cross-entropy_mean_where_does_it_come_from">What does the cross-entropy mean? Where does it come from?</a></h4></p>
<p>Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That's useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking
about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?</p>
<p>Let's begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the $\sigma'(z)$ terms
in Equations <span id="margin_560366758901_reveal" class="equation_link">(55)</span><span id="margin_560366758901" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_560366758901_reveal').click(function() {
$('#margin_560366758901').toggle('slow', function() {});
});
</script> and <span id="margin_583079464766_reveal" class="equation_link">(56)</span><span id="margin_583079464766" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_583079464766_reveal').click(function() {
$('#margin_583079464766').toggle('slow', function() {});
});
</script>. After staring at those equations for a bit, we might wonder if it's possible to choose a cost function so that the $\sigma'(z)$ term disappeared. In that case, the cost $C = C_x$ for a single training example $x$ would satisfy
<a class="displaced_anchor" name="eqtn71"></a><a class="displaced_anchor" name="eqtn72"></a>\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\ \frac{\partial C}{\partial b } & = & (a-y). \tag{72}\end{eqnarray} If we could
choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They'd also eliminate the problem of a learning slowdown. In fact, starting
from these equations we'll now show that it's possible to derive the form of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have
<a class="displaced_anchor" name="eqtn73"></a>\begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z). \tag{73}\end{eqnarray} Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation becomes
<a class="displaced_anchor" name="eqtn74"></a>\begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a). \tag{74}\end{eqnarray} Comparing to Equation <span id="margin_639536640662_reveal" class="equation_link">(72)</span>
<span id="margin_639536640662" class="marginequation" style="display: none;"><a href="chap3.html#eqtn72" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_639536640662_reveal').click(function() {
$('#margin_639536640662').toggle('slow', function() {});
});
</script> we obtain
<a class="displaced_anchor" name="eqtn75"></a>\begin{eqnarray} \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}. \tag{75}\end{eqnarray} Integrating this expression with respect to $a$ gives
<a class="displaced_anchor" name="eqtn76"></a>\begin{eqnarray} C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant}, \tag{76}\end{eqnarray} for some constant of integration. This is the contribution to the cost from a single training example, $x$. To
get the full cost function we must average over training examples, obtaining
<a class="displaced_anchor" name="eqtn77"></a>\begin{eqnarray} C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}, \tag{77}\end{eqnarray} where the constant here is the average of the individual constants for each training example.
And so we see that Equations <span id="margin_119016391726_reveal" class="equation_link">(71)</span><span id="margin_119016391726" class="marginequation" style="display: none;"><a href="chap3.html#eqtn71" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & x_j(a-y) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_119016391726_reveal').click(function() {
$('#margin_119016391726').toggle('slow', function() {});
});
</script>
and <span id="margin_792073230645_reveal" class="equation_link">(72)</span><span id="margin_792073230645" class="marginequation" style="display: none;"><a href="chap3.html#eqtn72" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_792073230645_reveal').click(function() {
$('#margin_792073230645').toggle('slow', function() {});
});
</script> uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn't something that was miraculously pulled out of thin air. Rather, it's something that we could have discovered in a simple and natural
way.
</p>
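<p>If you'd like to check this little derivation numerically rather than by hand, here's a short standalone sketch in NumPy (it isn't part of <tt>network2.py</tt>, and the function names are purely for illustration). It compares the analytic derivative $\partial C/\partial a = (a-y)/(a(1-a))$ from Equation (75) against a finite-difference estimate of the slope of Equation (76):</p>
<p>
<div class="highlight"><pre>
import numpy as np

def cross_entropy(a, y):
    # Per-example cross-entropy for a single sigmoid neuron,
    # as in Equation (76), dropping the constant.
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def analytic_derivative(a, y):
    # The derivative we integrated to obtain the cross-entropy,
    # Equation (75).
    return (a - y) / (a * (1 - a))

eps = 1e-6
for a in [0.1, 0.5, 0.9]:
    for y in [0.0, 1.0]:
        numeric = (cross_entropy(a + eps, y) - cross_entropy(a - eps, y)) / (2 * eps)
        print(a, y, analytic_derivative(a, y), numeric)
</pre></div>
</p>
<p>The last two values printed on each line should agree to several decimal places, a quick sanity check that the integration above was done correctly.</p>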
<p>What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy
that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function $x \rightarrow y = y(x)$. But instead it computes the function
$x \rightarrow a = a(x)$. Suppose we think of $a$ as our neuron's estimated probability that $y$ is $1$, and $1-a$ is the estimated probability that the right value for $y$ is $0$. Then the cross-entropy measures how "surprised" we are, on average,
when we learn the true value for $y$. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven't said exactly what "surprise" means, and so this perhaps seems like empty verbiage. But
in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don't know of a good, short, self-contained discussion of this subject that's available online. But if you want to dig deeper, then Wikipedia
contains a
<a href="http://en.wikipedia.org/wiki/Cross_entropy#Motivation">brief
summary</a> that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by
<a href="http://books.google.ca/books?id=VWq5GG6ycxMC">Cover and Thomas</a>.</p>
<p>
<h4><a name="problem_337461"></a><a href="#problem_337461">Problem</a></h4>
<ul>
<li> We've discussed at length the learning slowdown that can occur when output neurons saturate, in networks using the quadratic cost to train. Another factor that may inhibit learning is the presence of the $x_j$ term in Equation <span id="margin_802738986988_reveal"
class="equation_link">(61)</span><span id="margin_802738986988" class="marginequation" style="display: none;"><a href="chap3.html#eqtn61" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_802738986988_reveal').click(function() {
$('#margin_802738986988').toggle('slow', function() {});
});
</script>. Because of this term, when an input $x_j$ is near to zero, the corresponding weight $w_j$ will learn slowly. Explain why it is not possible to eliminate the $x_j$ term through a clever choice of cost function.
</ul>
</p>
<p>
<h4><a name="softmax"></a><a href="#softmax">Softmax</a></h4></p>
<p>In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called
<em>softmax</em> layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's
intrinsically interesting, and in part because we'll use softmax layers in
<a href="chap6.html">Chapter 6</a>, in our discussion of deep neural networks.
</p>
<p>The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs*<span class="marginnote">
*In describing the softmax
we'll make frequent use of notation introduced in the
<a href="chap2.html">last chapter</a>. You may wish to revisit that
chapter if you need to refresh your memory about the meaning of the
notation.</span> $z^L_j = \sum_{k} w^L_{jk} a^{L-1}_k + b^L_j$. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called <em>softmax function</em> to the $z^L_j$. According to this function,
the activation $a^L_j$ of the $j$th output neuron is
<a class="displaced_anchor" name="eqtn78"></a>\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}\end{eqnarray} where in the denominator we sum over all the output neurons.</p>
<p>If you're not familiar with the softmax function, Equation <span id="margin_843513390435_reveal" class="equation_link">(78)</span><span id="margin_843513390435" class="marginequation" style="display: none;"><a href="chap3.html#eqtn78" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_843513390435_reveal').click(function() {
$('#margin_843513390435').toggle('slow', function() {});
});
</script> may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation <span id="margin_798644145159_reveal"
class="equation_link">(78)</span><span id="margin_798644145159" class="marginequation" style="display: none;"><a href="chap3.html#eqtn78" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_798644145159_reveal').click(function() {
$('#margin_798644145159').toggle('slow', function() {});
});
</script>, suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote $z^L_1, z^L_2, z^L_3$, and $z^L_4$. Shown below are adjustable sliders showing possible values for the weighted inputs, and
a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase $z^L_4$:
</p>
<p>
<script type="text/javascript" src="js/jquery-ui.min.js"></script>
<link rel="stylesheet" href="js/jquery-ui.css">
<script src="js/softmax.js"></script>
<script src="js/canvas.js"></script>
<style>
.softmaxTable {
height: 64px;
width: 260px;
}
.canvasSoftmax {
position: absolute;
margin-top: -22px;
padding-bottom: 5px;
}
.softmaxInput {
border: 0;
font: 18px Arial;
}
.softmaxA {
padding-top: 7px;
padding-left: 5px;
}
</style>
<p>
<table>
<tr>
<td class="softmaxTable">
<div id="slider1" style="width: 200px;"></div>
$z^L_1 = $ <input type="text" id="amount1" readonly class="softmaxInput">
</td>
<td class="softmaxTable">
<canvas id="smG1" width="300" height="40" class="canvasSoftmax"></canvas>
<div class="softmaxA">$a^L_1 = $ <input type="text" id="activation1" readonly class="softmaxInput"></div>
</td>
</tr>
<tr>
<td class="softmaxTable">
<div id="slider2" style="width: 200px;"></div>
$z^L_2 = $ <input type="text" id="amount2" readonly class="softmaxInput">
</td>
<td class="softmaxTable">
<canvas id="smG2" width="300" height="40" class="canvasSoftmax"></canvas>
<div class="softmaxA">$a^L_2 = $ <input type="text" id="activation2" readonly class="softmaxInput"></div>
</td>
</tr>
<tr>
<td class="softmaxTable">
<div id="slider3" style="width: 200px;"></div>
$z^L_3 = $ <input type="text" id="amount3" readonly class="softmaxInput">
</td>
<td class="softmaxTable">
<canvas id="smG3" width="300" height="40" class="canvasSoftmax"></canvas>
<div class="softmaxA">$a^L_3 = $ <input type="text" id="activation3" readonly class="softmaxInput"></div>
</td>
</tr>
<tr>
<td class="softmaxTable">
<div id="slider4" style="width: 200px;"></div>
$z^L_4 = $ <input type="text" id="amount4" readonly class="softmaxInput">
</td>
<td class="softmaxTable">
<canvas id="smG4" width="300" height="40" class="canvasSoftmax"></canvas>
<div class="softmaxA">$a^L_4 = $ <input type="text" id="activation4" readonly class="softmaxInput"></div>
</td>
</tr>
</table>
</p>
<p>As you increase $z^L_4$, you'll see an increase in the corresponding output activation, $a^L_4$, and a decrease in the other output activations. Similarly, if you decrease $z^L_4$ then $a^L_4$ will decrease, and all the other output activations
will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in $a^L_4$. The reason is that the output activations are guaranteed to always sum up to $1$,
as we can prove using Equation <span id="margin_798491963651_reveal" class="equation_link">(78)</span><span id="margin_798491963651" class="marginequation" style="display: none;"><a href="chap3.html#eqtn78" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_798491963651_reveal').click(function() {
$('#margin_798491963651').toggle('slow', function() {});
});
</script> and a little algebra:
<a class="displaced_anchor" name="eqtn79"></a>\begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} As a result, if $a^L_4$ increases, then the other output activations must decrease by the same
total amount, to ensure the sum over all activations remains $1$. And, of course, similar statements hold for all the other activations.</p>
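<p>You can reproduce the behaviour of the sliders with a few lines of code. The sketch below (standalone, repeating the illustrative <tt>softmax</tt> function from above so it runs on its own) increases $z^L_4$ while holding the other weighted inputs fixed: $a^L_4$ grows, the other activations shrink, and the sum stays at $1$:</p>
<p>
<div class="highlight"><pre>
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, 2.0, 3.0, 4.0])
for z4 in [4.0, 6.0, 8.0]:
    z[3] = z4
    a = softmax(z)
    # a[3] grows with z4, the other activations shrink,
    # and the total is always 1 (up to floating-point rounding).
    print(z4, a, a.sum())
</pre></div>
</p>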
<p>Equation <span id="margin_908544036249_reveal" class="equation_link">(78)</span><span id="margin_908544036249" class="marginequation" style="display: none;"><a href="chap3.html#eqtn78" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_908544036249_reveal').click(function() {
$('#margin_908544036249').toggle('slow', function() {});
});
</script> also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers
which sum up to $1$. In other words, the output from the softmax layer can be thought of as a probability distribution.</p>
<p>The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$.
So, for instance, in the MNIST classification problem, we can interpret $a^L_j$ as the network's estimated probability that the correct digit classification is $j$.</p>
<p>By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't
in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.</p>
<p>
<h4><a name="exercise_332838"></a><a href="#exercise_332838">Exercise</a></h4>
<ul>
<li> Construct an example showing explicitly that in a network with a sigmoid output layer, the output activations $a^L_j$ won't always sum to $1$.
</ul>
</p>
<p>We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation <span id="margin_189739971255_reveal" class="equation_link">(78)</span><span id="margin_189739971255"
class="marginequation" style="display: none;"><a href="chap3.html#eqtn78" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_189739971255_reveal').click(function() {
$('#margin_189739971255').toggle('slow', function() {});
});
</script> ensure that all the output activations are positive. And the sum in the denominator of Equation <span id="margin_996666328163_reveal" class="equation_link">(78)</span><span id="margin_996666328163" class="marginequation" style="display: none;"><a href="chap3.html#eqtn78" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_996666328163_reveal').click(function() {
$('#margin_996666328163').toggle('slow', function() {});
});
</script> ensures that the softmax outputs sum to $1$. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way
of rescaling the $z^L_j$, and then squishing them together to form a probability distribution.</p>
<p>
<h4><a name="exercises_193619"></a><a href="#exercises_193619">Exercises</a></h4>
<ul>
<li><strong>Monotonicity of softmax</strong> Show that $\partial a^L_j / \partial z^L_k$ is positive if $j = k$ and negative if $j \neq k$. As a consequence, increasing $z^L_j$ is guaranteed to increase the corresponding output activation, $a^L_j$,
and will decrease all the other output activations. We already saw this empirically with the sliders, but this exercise turns that observation into a rigorous proof.</p>
<p>
<li><strong>Non-locality of softmax</strong> A nice thing about sigmoid layers is that the output $a^L_j$ is a function of the corresponding weighted input, $a^L_j = \sigma(z^L_j)$. Explain why this is not the case for a softmax layer: any particular
output activation $a^L_j$ depends on <em>all</em> the weighted inputs.
</ul>
</p>
<p>
<h4><a name="problem_905066"></a><a href="#problem_905066">Problem</a></h4>
<ul>
<li><strong>Inverting the softmax layer</strong> Suppose we have a neural network with a softmax output layer, and the activations $a^L_j$ are known. Show that the corresponding weighted inputs have the form $z^L_j = \ln a^L_j + C$, for some constant
$C$ that is independent of $j$.
</ul>
</p>
<p><strong>The learning slowdown problem:</strong> We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define
the
<em>log-likelihood</em> cost function. We'll use $x$ to denote a training input to the network, and $y$ to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is
<a class="displaced_anchor" name="eqtn80"></a>\begin{eqnarray} C \equiv -\ln a^L_y. \tag{80}\end{eqnarray} So, for instance, if we're training with MNIST images, and input an image of a $7$, then the log-likelihood cost is $-\ln a^L_7$. To see
that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a $7$. In that case it will estimate a value for the corresponding probability $a^L_7$ which is close to $1$, and so
the cost $-\ln a^L_7$ will be small. By contrast, when the network isn't doing such a good job, the probability $a^L_7$ will be smaller, and the cost $-\ln a^L_7$ will be larger. So the log-likelihood cost behaves as we'd expect a cost function
to behave.</p>
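<p>As a quick illustration of Equation (80), here's a small sketch with made-up output activations (not produced by any real network) for an image of a $7$. A confident, correct network pays a small cost; a less confident one pays a larger cost:</p>
<p>
<div class="highlight"><pre>
import numpy as np

def log_likelihood_cost(a, y):
    # a is the vector of softmax output activations a^L_j,
    # y is the index of the desired digit.  Equation (80): C = -ln a^L_y.
    return -np.log(a[y])

confident = np.array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.91, 0.01, 0.01])
unsure    = np.array([0.10, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10, 0.20, 0.10, 0.05])

print(log_likelihood_cost(confident, 7))  # about 0.09
print(log_likelihood_cost(unsure, 7))     # about 1.61
</pre></div>
</p>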
<p>What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities $\partial C / \partial w^L_{jk}$ and $\partial C / \partial b^L_j$. I won't go through the derivation explicitly
- I'll ask you to do it in the problems, below - but with a little algebra you can show that*<span class="marginnote">
*Note that I'm abusing notation here, using $y$ in a
slightly different way than in the last paragraph. In the last paragraph we
used $y$ to denote the desired output from the network - e.g.,
output a "$7$" if an image of a $7$ was input. But in the
equations which follow I'm using $y$ to denote the vector of output
activations which corresponds to $7$, that is, a vector which is all
$0$s, except for a $1$ in the $7$th location.</span>
<a class="displaced_anchor" name="eqtn81"></a><a class="displaced_anchor" name="eqtn82"></a>\begin{eqnarray} \frac{\partial C}{\partial b^L_j} & = & a^L_j-y_j \tag{81}\\ \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j) \tag{82}\end{eqnarray}
These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation <span id="margin_954684416067_reveal" class="equation_link">(82)</span><span id="margin_954684416067"
class="marginequation" style="display: none;"><a href="chap3.html#eqtn82" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j) \nonumber\end{eqnarray}</a></span>
<script>
$('#margin_954684416067_reveal').click(function() {
$('#margin_954684416067').toggle('slow', function() {});
});
</script> to Equation <span id="margin_975454356091_reveal" class="equation_link">(67)</span><span id="margin_975454356091" class="marginequation" style="display: none;"><a href="chap3.html#eqtn67" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}