<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file. Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other -->
<!-- code is quite ugly! Later versions should be better. -->
<head>
<meta charset="utf-8">
<meta name="citation_title" content="Neural Networks and Deep Learning">
<meta name="citation_author" content="Nielsen, Michael A.">
<meta name="citation_publication_date" content="2015">
<meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
<meta name="citation_publisher" content="Determination Press">
<link rel="icon" href="nnadl_favicon.ICO" />
<title>Neural networks and deep learning</title>
<script src="assets/jquery.min.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$']]}, "HTML-CSS": {scale: 92}, TeX: { equationNumbers: { autoNumber: "AMS" }}});
</script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<link href="assets/style.css" rel="stylesheet">
<link href="assets/pygments.css" rel="stylesheet">
<link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">
<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */
@font-face {
font-family: 'MJX_Math';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}
@font-face {
font-family: 'MJX_Main';
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot');
/* IE9 Compat Modes */
src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype'),
url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>
</head>
<body>
<div class="nonumber_header">
<h2><a href="index.html">Նեյրոնային ցանցեր և խորը ուսուցում</a></h2>
</div>
<div class="section">
<div id="toc">
<p class="toc_title">
<a href="index.html">Նեյրոնային ցանցեր և խորը ուսուցում</a>
</p>
<p class="toc_not_mainchapter">
<a href="about.html">Ինչի՞ մասին է գիրքը</a>
</p>
<p class="toc_not_mainchapter">
<a href="exercises_and_problems.html">Խնդիրների և վարժությունների մասին</a>
</p>
<p class='toc_mainchapter'>
<a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a>
<a href="chap1.html">Ձեռագիր թվանշանների ճանաչում՝ օգտագործելով նեյրոնային ցանցեր</a>
<div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap1.html#perceptrons">
<li>Պերսեպտրոններ</li>
</a>
<a href="chap1.html#sigmoid_neurons">
<li>Սիգմոիդ նեյրոններ</li>
</a>
<a href="chap1.html#the_architecture_of_neural_networks">
<li>Նեյրոնային ցանցերի կառուցվածքը</li>
</a>
<a href="chap1.html#a_simple_network_to_classify_handwritten_digits">
<li>Պարզ ցանց ձեռագիր թվանշանների ճանաչման համար</li>
</a>
<a href="chap1.html#learning_with_gradient_descent">
<li>Ուսուցում գրադիենտային վայրէջքի միջոցով</li>
</a>
<a href="chap1.html#implementing_our_network_to_classify_digits">
<li>Թվանշանները ճանաչող ցանցի իրականացումը</li>
</a>
<a href="chap1.html#toward_deep_learning">
<li>Խորը ուսուցմանն ընդառաջ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
};
$('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a>
<a href="chap2.html">Ինչպե՞ս է աշխատում հետադարձ տարածումը</a>
<div id="toc_how_the_backpropagation_algorithm_works" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output
_from_a_neural_network">
<li>Մարզանք. նեյրոնային ցանցի ելքային արժեքների հաշվման արագագործ, մատրիցային մոտեցում</li>
</a>
<a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function">
<li>Երկու ենթադրություն գնային ֆունկցիայի վերաբերյալ</li>
</a>
<a href="chap2.html#the_hadamard_product_$s_\odot_t$">
<li>Հադամարի արտադրյալը՝ $s \odot t$</li>
</a>
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">
<li>Հետադարձ տարածման հիմքում ընկած չորս հիմնական հավասարումները</li>
</a>
<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">
<li>Չորս հիմնական հավասարումների ապացույցները (ընտրովի)</li>
</a>
<a href="chap2.html#the_backpropagation_algorithm">
<li>Հետադարձ տարածման ալգորիթմը</li>
</a>
<a href="chap2.html#the_code_for_backpropagation">
<li>Հետադարձ տարածման իրականացման կոդը</li>
</a>
<a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm">
<li>Ի՞նչ իմաստով է հետադարձ տարածումն արագագործ ալգորիթմ</li>
</a>
<a href="chap2.html#backpropagation_the_big_picture">
<li>Հետադարձ տարածում. ամբողջական պատկերը</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
};
$('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a>
<a href="chap3.html">Նեյրոնային ցանցերի ուսուցման բարելավումը</a>
<div id="toc_improving_the_way_neural_networks_learn" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap3.html#the_cross-entropy_cost_function">
<li>Գնային ֆունկցիան՝ միջէնտրոպիայով</li>
</a>
<a href="chap3.html#overfitting_and_regularization">
<li>Գերմարզում և ռեգուլյարացում</li>
</a>
<a href="chap3.html#weight_initialization">
<li>Կշիռների սկզբնարժեքավորումը</li>
</a>
<a href="chap3.html#handwriting_recognition_revisited_the_code">
<li>Ձեռագրերի ճամաչման կոդի վերանայում</li>
</a>
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">
<li>Ինչպե՞ս ընտրել նեյրոնային ցանցերի հիպեր-պարամետրերը</li>
</a>
<a href="chap3.html#other_techniques">
<li>Այլ տեխնիկաներ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
};
$('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a>
<a href="chap4.html">Տեսողական ապացույց այն մասին, որ նեյրոնային ֆունկցիաները կարող են մոտարկել կամայական ֆունկցիա</a>
<div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap4.html#two_caveats">
<li>Երկու զգուշացում</li>
</a>
<a href="chap4.html#universality_with_one_input_and_one_output">
<li>Ունիվերսալություն մեկ մուտքով և մեկ ելքով</li>
</a>
<a href="chap4.html#many_input_variables">
<li>Մեկից ավել մուտքային փոփոխականներ</li>
</a>
<a href="chap4.html#extension_beyond_sigmoid_neurons">
<li>Ընդլայնումը Սիգմոիդ նեյրոններից դուրս </li>
</a>
<a href="chap4.html#fixing_up_the_step_functions">
<li>Քայլի ֆունկցիայի ուղղումը</li>
</a>
<a href="chap4.html#conclusion">
<li>Եզրակացություն</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
};
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a>
<a href="chap5.html">Ինչու՞մն է կայանում նեյրոնային ցանցերի մարզման բարդությունը</a>
<div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap5.html#the_vanishing_gradient_problem">
<li>Անհետացող գրադիենտի խնդիրը</li>
</a>
<a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">
<li>Ի՞նչն է անհետացող գրադիենտի խնդրի պատճառը։ Խորը նեյրոնային ցանցերի անկայուն գրադիենտները</li>
</a>
<a href="chap5.html#unstable_gradients_in_more_complex_networks">
<li>Անկայուն գրադիենտներն ավելի կոմպլեքս ցանցերում</li>
</a>
<a href="chap5.html#other_obstacles_to_deep_learning">
<li>Այլ խոչընդոտներ խորը ուսուցման մեջ</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
};
$('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});
</script>
<p class='toc_mainchapter'>
<a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a>
<a href="chap6.html">Խորը ուսուցում</a>
<div id="toc_deep_learning" style="display: none;">
<p class="toc_section">
<ul>
<a href="chap6.html#introducing_convolutional_networks">
<li>Փաթույթային ցանցեր</li>
</a>
<a href="chap6.html#convolutional_neural_networks_in_practice">
<li>Փաթույթային ցանցերը կիրառության մեջ</li>
</a>
<a href="chap6.html#the_code_for_our_convolutional_networks">
<li>Փաթույթային ցանցերի կոդը</li>
</a>
<a href="chap6.html#recent_progress_in_image_recognition">
<li>Առաջխաղացումները պատկերների ճանաչման ասպարեզում</li>
</a>
<a href="chap6.html#other_approaches_to_deep_neural_nets">
<li>Այլ մոտեցումներ խորը նեյրոնային ցանցերի համար</li>
</a>
<a href="chap6.html#on_the_future_of_neural_networks">
<li>Նեյրոնային ցանցերի ապագայի մասին</li>
</a>
</ul>
</p>
</div>
<script>
$('#toc_deep_learning_reveal').click(function() {
var src = $('#toc_img_deep_learning').attr('src');
if (src == 'images/arrow.png') {
$("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
} else {
$("#toc_img_deep_learning").attr('src', 'images/arrow.png');
};
$('#toc_deep_learning').toggle('fast', function() {});
});
</script>
<p class="toc_not_mainchapter">
<a href="sai.html">Հավելված: Արդյո՞ք գոյություն ունի ինտելեկտի <em>պարզ</em> ալգորիթմ</a>
</p>
<p class="toc_not_mainchapter">
<a href="acknowledgements.html">Երախտագիտություն</a>
</p>
<p class="toc_not_mainchapter"><a href="faq.html">Հաճախ տրվող հարցեր</a>
</p>
<hr>
<span class="sidebar_title">Հովանավորներ</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>
<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>
<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>
<p class="sidebar">Շնորհակալություն եմ հայտնում բոլոր <a href="supporters.html">աջակցողներին</a>, ովքեր օգնել են գիրքն իրականություն դարձնել: Հատուկ շնորհակալություններ Պավել Դուդրենովին. Շնորհակալություն եմ հայտնում նաև նրանց, ովքեր ներդրում են ունեցել
<a href="bugfinder.html">Սխալների որոնման հուշատախտակում</a>. </p>
<hr>
<span class="sidebar_title">Ռեսուրսներ</span>
<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Մայքլ Նիլսենը թվիթերում</a></p>
<p class="sidebar"><a href="faq.html">Գրքի մասին հաճախակի տրբող հարցեր</a></p>
<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Կոդի պահոցը</a></p>
<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Մայքլ Նիլսենի նախագծերի հայտարարման էլ հասցեների ցուցակը</a>
</p>
<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Խորը Ուսուցում</a>, գրքի հեղինակներ` Յան Գուդֆելլո, Յոշուա Բենջիո և Ահարոն Կուրվիլ</p>
<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>
<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>
<p class="sidebar">
<a href="http://michaelnielsen.org">Մայքլ Նիլսեն</a>, Հունվար 2017
</p>
</div>
<p>One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, $f(x)$:</p>
<p>
<center><canvas id="function" width="300" height="300"></canvas></center>
</p>
<p><a id="basic_network_precursor"></a> No matter what the function, there is guaranteed to be a neural network so that for every possible input, $x$, the value $f(x)$ (or some close approximation) is output from the network, e.g.:</p>
<p>
<center><canvas id="basic_network" width="350" height="220"></canvas></center>
</p>
<p>This result holds even if the function has many inputs, $f = f(x_1, \ldots, x_m)$, and many outputs. For instance, here's a network computing a function with $m = 3$ inputs and $n = 2$ outputs:</p>
<p>
<center><canvas id="vector_valued_network" width="450" height="370"></canvas></center>
</p>
<p>This result tells us that neural networks have a kind of
<em>universality</em>. No matter what function we want to compute, we know that there is a neural network which can do the job.</p>
<p>What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer. So even very simple network architectures can be extremely
powerful.</p>
<p>The universality theorem is well known by people who use neural networks. But why it's true is not so widely understood. Most of the explanations available are quite technical. For instance, one of the original papers proving the result*
<span class="marginnote">
*<a href="http://www.dartmouth.edu/~gvc/Cybenko_MCSS.pdf">Approximation
by superpositions of a sigmoidal function</a>, by George Cybenko
(1989). The result was very much in the air at the time, and
several groups proved closely related results. Cybenko's paper
contains a useful discussion of much of that work. Another
important early paper is
<a href="http://www.sciencedirect.com/science/article/pii/0893608089900208">Multilayer
feedforward networks are universal approximators</a>, by Kurt Hornik,
Maxwell Stinchcombe, and Halbert White (1989). This paper uses the
Stone-Weierstrass theorem to arrive at similar results.</span> did so using the Hahn-Banach theorem, the Riesz Representation theorem, and some Fourier analysis. If you're a mathematician the argument is not difficult to follow, but it's not so
easy for most people. That's a pity, since the underlying reasons for universality are simple and beautiful.
</p>
<p>In this chapter I give a simple and mostly visual explanation of the universality theorem. We'll go step by step through the underlying ideas. You'll understand why it's true that neural networks can compute any function. You'll understand some of
the limitations of the result. And you'll understand how the result relates to deep neural networks.</p>
<p>To follow the material in the chapter, you do not need to have read earlier chapters in this book. Instead, the chapter is structured to be enjoyable as a self-contained essay. Provided you have just a little basic familiarity with neural networks,
you should be able to follow the explanation. I will, however, provide occasional links to earlier material, to help fill in any gaps in your knowledge.</p>
<p></p>
<p></p>
<p></p>
<p>Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it's worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you
can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into
English. Again, that can be thought of as computing a function*<span class="marginnote">
*Actually, computing one of many functions, since
there are often many acceptable translations of a given piece of
text.</span>. Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation*<span class="marginnote">
*Ditto the remark about translation and
there being many possible functions.</span>. Universality means that, in principle, neural networks can do all these things and many more.</p>
<p>Of course, just because we know a neural network exists that can (say) translate Chinese text into English, that doesn't mean we have good techniques for constructing or even recognizing such a network. This limitation applies also to traditional
universality theorems for models such as Boolean circuits. But, as we've seen earlier in the book, neural networks have powerful algorithms for learning functions. That combination of learning algorithms + universality is an attractive mix. Up to
now, the book has focused on the learning algorithms. In this chapter, we focus on universality, and what it means.
</p>
<p>
<h3><a name="two_caveats"></a><a href="#two_caveats">Two caveats</a></h3></p>
<p>Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement "a neural network can compute any function".</p>
<p>First, this doesn't mean that a network can be used to <em>exactly</em> compute any function. Rather, we can get an <em>approximation</em> that is as good as we want. By increasing the number of hidden neurons we can improve the approximation. For
instance,
<a href="#basic_network_precursor">earlier</a> I illustrated a network computing some function $f(x)$ using three hidden neurons. For most functions only a low-quality approximation will be possible using three hidden neurons. By increasing the
number of hidden neurons (say, to five) we can typically get a better approximation:</p>
<p>
<center><canvas id="bigger_network" width="350" height="380"></canvas></center>
</p>
<p>And we can do still better by further increasing the number of hidden neurons. </p>
<p>To make this statement more precise, suppose we're given a function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$. The guarantee is that by using enough hidden neurons we can always find a neural network whose output
$g(x)$ satisfies $|g(x) - f(x)|
< \epsilon$, for all inputs $x$. In other words, the approximation will be good to within the desired accuracy for every possible input.</p>
<p>The second caveat is that the class of functions which can be approximated in the way described is the class of <em>continuous</em> functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate
  it using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation
  is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation.
</p>
<p>Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we'll actually prove a slightly weaker
version of this result, using two hidden layers instead of one. In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.
<h3><a name="universality_with_one_input_and_one_output"></a><a href="#universality_with_one_input_and_one_output">Universality with one input and one output</a></h3></p>
<p>To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output:</p>
<p>
<center><canvas id="function_2" width="300" height="300"></canvas></center>
</p>
<p>It turns out that this is the core of the problem of universality. Once we've understood this special case it's actually pretty easy to extend to functions with many inputs and many outputs.</p>
<p>To build insight into how to construct a network to compute $f$, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:</p>
<p>
<center><canvas id="two_hidden_neurons" width="350" height="220"></canvas></center>
</p>
<p>To get a feel for how components in the network work, let's focus on the top hidden neuron. In the diagram below, click on the weight, $w$, and drag the mouse a little ways to the right to increase $w$. You can immediately see how the function
computed by the top hidden neuron changes:</p>
<p>
<center><canvas id="basic_manipulation" width="600" height="285"></canvas></center>
</p>
<p>As we learnt <a href="chap1.html#sigmoid_neurons">earlier in the book</a>, what's being computed by the hidden neuron is $\sigma(wx + b)$, where $\sigma(z) \equiv 1/(1+e^{-z})$ is the sigmoid function. Up to now, we've made frequent use of this
algebraic form. But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph. This won't just give us a better feel for what's going on, it
will also give us a proof*<span class="marginnote">
*Strictly speaking, the visual
approach I'm taking isn't what's traditionally thought of as a
proof. But I believe the visual approach gives more insight into
why the result is true than a traditional proof. And, of course,
that kind of insight is the real purpose behind a proof.
Occasionally, there will be small gaps in the reasoning I present:
places where I make a visual argument that is plausible, but not
quite rigorous. If this bothers you, then consider it a challenge
to fill in the missing steps. But don't lose sight of the real
purpose: to understand why the universality theorem is true.</span> of universality that applies to activation functions other than the sigmoid function.</p>
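<p>If you prefer to see this numerically rather than by dragging the sliders, here is a minimal sketch in plain Python/NumPy (it is not part of the book's code) of the quantity $\sigma(w x + b)$ computed by the hidden neuron, showing how a large weight sharpens the sigmoid into an approximate step:</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(x, w, b):
    """Output sigma(w*x + b) of a single sigmoid hidden neuron."""
    return sigmoid(w * x + b)

x = np.linspace(0, 1, 11)
print(np.round(hidden_output(x, w=8, b=-4), 3))      # a gentle sigmoid
print(np.round(hidden_output(x, w=999, b=-300), 3))  # nearly a step, located near x = 0.3
</pre></div>
</p>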
<p>To get started on this proof, try clicking on the bias, $b$, in the diagram above, and dragging to the right to increase it. You'll see that as the bias increases the graph moves to the left, but its shape doesn't change.</p>
<p>Next, click and drag to the left in order to decrease the bias. You'll see that as the bias decreases the graph moves to the right, but, again, its shape doesn't change.</p>
<p>Next, decrease the weight to around $2$ or $3$. You'll see that as you decrease the weight, the curve broadens out. You might need to change the bias as well, in order to keep the curve in-frame.</p>
<p>Finally, increase the weight up past $w = 100$. As you do, the curve gets steeper, until eventually it begins to look like a step function. Try to adjust the bias so the step occurs near $x = 0.3$. The following short clip shows what your result
should look like. Click on the play button to play (or replay) the video:</p>
<p>
<!-- Based on http://worrydream.com/ScrubbingCalculator/, with minor changes -->
<script type="text/javascript">
function playVideo(name) {
var div = $("#" + name)[0];
div.style.backgroundColor = "transparent";
div.style.cursor = "default";
div.getElementsByTagName("img")[0].style.display = "none";
var video = $("#v" + name)[0];
video.play();
}
function videoEnded(name) {
var div = document.getElementById(name);
div.getElementsByTagName("img")[0].style.display = "block";
div.style.backgroundColor = "white";
div.style.opacity = 0.6;
div.style.cursor = "pointer";
}
</script>
<div>
<div id="a" class="videoOverlay" style="width: 560px; height: 280px; opacity: 0.8" onclick="playVideo('a');">
<img style="left: 210px; top: 75px;" src="images/play.png" width="128px">
</div>
<video id="va" width="560" height="280" preload onended="videoEnded('a');">
<source type="video/webm"
src="movies/create_step_function.webm">
<source type="video/mp4"
src="movies/create_step_function.mp4">
</video>
</div>
</p>
<p>We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation. Below I've plotted the output from the top hidden neuron when the weight is $w = 999$. Note that
this plot is static, and you can't change parameters such as the weight.</p>
<p><img src="images/high_weight_function.jpg"></p>
<p>It's actually quite a bit easier to work with step functions than general sigmoid functions. The reason is that in the output layer we add up contributions from all the hidden neurons. It's easy to analyze the sum of a bunch of step functions,
but rather more difficult to reason about what happens when you add up a bunch of sigmoid shaped curves. And so it makes things much easier to assume that our hidden neurons are outputting step functions. More concretely, we do this by fixing
the weight $w$ to be some very large value, and then setting the position of the step by modifying the bias. Of course, treating the output as a step function is an approximation, but it's a very good approximation, and for now we'll treat it
as exact. I'll come back later to discuss the impact of deviations from this approximation.</p>
<p>At what value of $x$ does the step occur? Put another way, how does the position of the step depend upon the weight and bias?</p>
<p>To answer this question, try modifying the weight and bias in the diagram above (you may need to scroll back a bit). Can you figure out how the position of the step depends on $w$ and $b$? With a little work you should be able to convince yourself
that the position of the step is <em>proportional</em> to $b$, and <em>inversely proportional</em> to $w$.</p>
<p>In fact, the step is at position $s = -b/w$, as you can see by modifying the weight and bias in the following diagram:</p>
<p><canvas id="step" width="600" height="285"></canvas></p>
<p>It will greatly simplify our lives to describe hidden neurons using just a single parameter, $s$, which is the step position, $s = -b/w$. Try modifying $s$ in the following diagram, in order to get used to the new parameterization:</p>
<p><canvas id="step_parameterization" width="600" height="285"></canvas></p>
<p>As noted above, we've implicitly set the weight $w$ on the input to be some large value - big enough that the step function is a very good approximation. We can easily convert a neuron parameterized in this way back into the conventional model,
by choosing the bias $b = -w s$.</p>
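<p>In code, that conversion is a one-liner. Here's a tiny sketch (the helper name is hypothetical, and the large fixed weight is just treated as a constant):</p>
<p>
<div class="highlight"><pre>W_BIG = 1000.0   # "large enough" first-layer weight; any big value will do

def neuron_from_step_point(s, w=W_BIG):
    """Return the conventional (weight, bias) pair for a step at x = s."""
    return w, -w * s

print(neuron_from_step_point(0.3))   # (1000.0, -300.0), i.e. s = -b/w = 0.3
</pre></div>
</p>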
<p>Up to now we've been focusing on the output from just the top hidden neuron. Let's take a look at the behavior of the entire network. In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points $s_1$
(top neuron) and $s_2$ (bottom neuron). And they'll have respective output weights $w_1$ and $w_2$. Here's the network:</p>
<p><canvas id="two_hn_network" width="600" height="285"></canvas></p>
<p>What's being plotted on the right is the <em>weighted output</em> $w_1 a_1 + w_2 a_2$ from the hidden layer. Here, $a_1$ and $a_2$ are the outputs from the top and bottom hidden neurons, respectively*
<span class="marginnote">
*Note, by the way, that the output from the whole
network is $\sigma(w_1 a_1+w_2 a_2 + b)$, where $b$ is the bias on
the output neuron. Obviously, this isn't the same as the weighted
output from the hidden layer, which is what we're plotting here.
We're going to focus on the weighted output from the hidden layer
right now, and only later will we think about how that relates to
the output from the whole network.</span>. These outputs are denoted with $a$s because they're often known as the neurons' <em>activations</em>.</p>
<p>Try increasing and decreasing the step point $s_1$ of the top hidden neuron. Get a feel for how this changes the weighted output from the hidden layer. It's particularly worth understanding what happens when $s_1$ goes past $s_2$. You'll see that
the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.</p>
<p>Similarly, try manipulating the step point $s_2$ of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.</p>
<p>Try increasing and decreasing each of the output weights. Notice how this rescales the contribution from the respective hidden neurons. What happens when one of the weights is zero?</p>
<p>Finally, try setting $w_1$ to be $0.8$ and $w_2$ to be $-0.8$. You get a "bump" function, which starts at point $s_1$, ends at point $s_2$, and has height $0.8$. For instance, the weighted output might look like this:</p>
<p><img src="images/bump_function.jpg"></p>
<p>Of course, we can rescale the bump to have any height at all. Let's use a single parameter, $h$, to denote the height. To reduce clutter I'll also remove the "$s_1 = \ldots$" and "$w_1 = \ldots$" notations.</p>
<p><canvas id="bump_fn" width="600" height="285"></canvas></p>
<p>Try changing the value of $h$ up and down, to see how the height of the bump changes. Try changing the height so it's negative, and observe what happens. And try changing the step points to see how that changes the shape of the bump.</p>
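<p>For readers who like to check such things in code, here's a small sketch (same assumptions as above: sigmoid neurons with a very large input weight; the helper names are mine, not the book's) of the bump built from one pair of hidden neurons with output weights $h$ and $-h$:</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, s, w=1000.0):
    # sigmoid neuron with weight w and bias -w*s: an approximate step at x = s
    return sigmoid(w * (x - s))

def bump(x, s1, s2, h):
    """Weighted output h*a1 - h*a2 from a pair of step neurons."""
    return h * step(x, s1) - h * step(x, s2)

x = np.linspace(0, 1, 11)
# roughly 0.8 between s1 and s2, roughly 0 elsewhere
print(np.round(bump(x, s1=0.4, s2=0.6, h=0.8), 3))
</pre></div>
</p>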
<p>You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of <tt>if-then-else</tt> statement, e.g.:</p>
<p>
<div class="highlight"><pre><span></span> <span class="k">if</span> input ><span class="o">=</span> step point:
add <span class="m">1</span> to the weighted output
<span class="k">else</span>:
add <span class="m">0</span> to the weighted output
</pre></div>
</p>
<p>For the most part I'm going to stick with the graphical point of view. But in what follows you may sometimes find it helpful to switch points of view, and think about things in terms of <tt>if-then-else</tt>.</p>
<p>We can use our bump-making trick to get two bumps, by gluing two pairs of hidden neurons together into the same network:</p>
<p><canvas id="double_bump" width="600" height="280"></canvas></p>
<p>I've suppressed the weights here, simply writing the $h$ values for each pair of hidden neurons. Try increasing and decreasing both $h$ values, and observe how it changes the graph. Move the bumps around by changing the step points.</p>
<p>More generally, we can use this idea to get as many peaks as we want, of any height. In particular, we can divide the interval $[0, 1]$ up into a large number, $N$, of subintervals, and use $N$ pairs of hidden neurons to set up peaks of any desired
height. Let's see how this works for $N = 5$. That's quite a few neurons, so I'm going to pack things in a bit. Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it's worth putting
up with a little complexity, for the sake of getting a more concrete feel for how these networks work.</p>
<p><canvas id="five_bumps" width="600" height="620"></canvas></p>
<p>You can see that there are five pairs of hidden neurons. The step points for the respective pairs of neurons are $0, 1/5$, then $1/5, 2/5$, and so on, out to $4/5, 5/5$. These values are fixed - they make it so we get five evenly spaced bumps
on the graph.</p>
<p>Each pair of neurons has a value of $h$ associated to it. Remember, the connections output from the neurons have weights $h$ and $-h$ (not marked). Click on one of the $h$ values, and drag the mouse to the right or left to change the value. As
you do so, watch the function change. By changing the output weights we're actually
<em>designing</em> the function!</p>
<p>Contrariwise, try clicking on the graph, and dragging up or down to change the height of any of the bump functions. As you change the heights, you can see the corresponding change in $h$ values. And, although it's not shown, there is also a change
in the corresponding output weights, which are $+h$ and $-h$.</p>
<p>In other words, we can directly manipulate the function appearing in the graph on the right, and see that reflected in the $h$ values on the left. A fun thing to do is to hold the mouse button down and drag the mouse from one side of the graph
to the other. As you do this you draw out a function, and get to watch the parameters in the neural network adapt.</p>
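<p>Here is the same five-pair construction written out as a sketch (plain NumPy again, with hypothetical names), so you can see that the weighted output from the hidden layer really is just a sum of bumps whose heights are the $h$ values:</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, s, w=1000.0):
    return sigmoid(w * (x - s))

def hidden_layer_output(x, heights):
    """Sum of len(heights) bumps on equal subintervals of [0, 1]."""
    N = len(heights)
    total = np.zeros_like(x, dtype=float)
    for j, h in enumerate(heights):
        total += h * step(x, j / N) - h * step(x, (j + 1) / N)
    return total

x = np.linspace(0, 1, 101)
print(np.round(hidden_layer_output(x, [0.2, 0.9, -0.3, 0.5, 0.1])[::20], 3))
</pre></div>
</p>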
<p>Time for a challenge.</p>
<p>Let's think back to the function I plotted at the beginning of the chapter:
</p>
<p>
<center> <canvas id="function_3" width="300" height="300"></canvas>
</center>
</p>
<p>I didn't say it at the time, but what I plotted is actually the function
<a class="displaced_anchor" name="eqtn113"></a>\begin{eqnarray} f(x) = 0.2+0.4 x^2+0.3x \sin(15 x) + 0.05 \cos(50 x), \tag{113}\end{eqnarray} plotted over $x$ from $0$ to $1$, and with the $y$ axis taking values from $0$ to $1$.</p>
<p>That's obviously not a trivial function.</p>
<p>You're going to figure out how to compute it using a neural network.</p>
<p>In our networks above we've been analyzing the weighted combination $\sum_j w_j a_j$ output from the hidden neurons. We now know how to get a lot of control over this quantity. But, as I noted earlier, this quantity is not what's output from the
network. What's output from the network is $\sigma(\sum_j w_j a_j + b)$ where $b$ is the bias on the output neuron. Is there some way we can achieve control over the actual output from the network?</p>
<p>The solution is to design a neural network whose hidden layer has a weighted output given by $\sigma^{-1} \circ f(x)$, where $\sigma^{-1}$ is just the inverse of the $\sigma$ function. That is, we want the weighted output from the hidden layer
to be:</p>
<p>
<center> <canvas id="inverted_function" width="340" height="300"></canvas> </center>
</p>
<p>If we can do this, then the output from the network as a whole will be a good approximation to $f(x)$*<span class="marginnote">
*Note that I have set the bias
on the output neuron to $0$.</span>.</p>
<p>Your challenge, then, is to design a neural network to approximate the goal function shown just above. To learn as much as possible, I want you to solve the problem twice. The first time, please click on the graph, directly adjusting the heights
of the different bump functions. You should find it fairly easy to get a good match to the goal function. How well you're doing is measured by the <em>average
deviation</em> between the goal function and the function the network is actually computing. Your challenge is to drive the average deviation as <em>low</em> as possible. You complete the challenge when you drive the average deviation to $0.40$ or below.</p>
<p>Once you've done that, click on "Reset" to randomly re-initialize the bumps. The second time you solve the problem, resist the urge to click on the graph. Instead, modify the $h$ values on the left-hand side, and again attempt to drive the average
deviation to $0.40$ or below.
</p>
<p><canvas id="design_function" width="600" height="620"></canvas></p>
<p>You've now figured out all the elements necessary for the network to approximately compute the function $f(x)$! It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing
more bumps.</p>
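<p>One simple, programmatic way to pick the heights is to aim the weighted output directly at $\sigma^{-1} \circ f$, as described above: on each subinterval, use the average value of $\sigma^{-1}(f(x))$ as that bump's height. The sketch below (hypothetical helper names, and just one of many possible height rules) does this, and lets you watch the approximation improve as the number of bumps grows:</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inverse(y):
    return np.log(y / (1.0 - y))

def f(x):  # the goal function, equation (113)
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

def step(x, s, w=1000.0):
    return sigmoid(w * (x - s))

def approximate(N, x):
    bins = np.minimum((x * N).astype(int), N - 1)   # subinterval index of each sample
    target = sigmoid_inverse(f(x))
    heights = [target[bins == j].mean() for j in range(N)]
    weighted = sum(h * (step(x, j / N) - step(x, (j + 1) / N))
                   for j, h in enumerate(heights))
    return sigmoid(weighted)            # bias on the output neuron is 0

x = np.linspace(0.001, 0.999, 1000)
for N in (5, 10, 40):
    deviation = float(np.mean(np.abs(approximate(N, x) - f(x))))
    print(N, "bumps: average deviation", round(deviation, 4))
</pre></div>
</p>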
<p>In particular, it's easy to convert all the data we have found back into the standard parameterization used for neural networks. Let me just recap quickly how that works.</p>
<p>The first layer of weights all have some large, constant value, say $w = 1000$.</p>
<p>The biases on the hidden neurons are just $b = -w s$. So, for instance, for the second hidden neuron $s = 0.2$ becomes $b = -1000 \times 0.2 = -200$.</p>
<p>The final layer of weights are determined by the $h$ values. So, for instance, the value you've chosen above for the first $h$, $h = $
<span id="h" style="font-family: MJX_Main;"></span>, means that the output weights from the top two hidden neurons are <span id="w1" style="font-family: MJX_Main;"></span> and <span id="w2" style="font-family: MJX_Main;"></span>, respectively.
And so on, for the entire layer of output weights.</p>
<p>Finally, the bias on the output neuron is $0$.</p>
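<p>Put as code, the recap above amounts to nothing more than filling in a few lists. The helper below is a sketch of my own (not code from the book's repository), with made-up $h$ values, but it produces exactly the parameters just described:</p>
<p>
<div class="highlight"><pre>def network_parameters(step_points, heights, w=1000.0):
    """step_points: the 2*N step points s_1, s_2, ... for the hidden neurons.
    heights: the N values h_1, ..., h_N, one per pair of hidden neurons."""
    hidden_weights = [w] * len(step_points)          # large, constant first-layer weights
    hidden_biases = [-w * s for s in step_points]    # b = -w * s
    output_weights = []
    for h in heights:
        output_weights += [h, -h]                    # +h and -h for each pair
    output_bias = 0.0                                # bias on the output neuron
    return hidden_weights, hidden_biases, output_weights, output_bias

params = network_parameters([0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0],
                            [1.3, -0.6, 0.9, 0.4, -1.1])
print(params[1][1])   # bias of the second hidden neuron: -1000 * 0.2 = -200.0
</pre></div>
</p>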
<p>That's everything: we now have a complete description of a neural network which does a pretty good job computing our original goal function. And we understand how to improve the quality of the approximation by increasing the number of hidden neurons.</p>
<p>What's more, there was nothing special about our original goal function, $f(x) = 0.2+0.4 x^2+0.3 x \sin(15 x) + 0.05 \cos(50 x)$. We could have used this procedure for any continuous function from $[0, 1]$ to $[0, 1]$. In essence, we're using our
single-layer neural networks to build a lookup table for the function. And we'll be able to build on this idea to provide a general proof of universality.</p>
<p>
<h3><a name="many_input_variables"></a><a href="#many_input_variables">Many input variables</a></h3></p>
<p>Let's extend our results to the case of many input variables. This sounds complicated, but all the ideas we need can be understood in the case of just two inputs. So let's address the two-input case.</p>
<p>We'll start by considering what happens when we have two inputs to a neuron:
</p>
<p>
<center> <canvas id="two_inputs" width="350" height="220"></canvas>
</center>
</p>
<p>Here, we have inputs $x$ and $y$, with corresponding weights $w_1$ and $w_2$, and a bias $b$ on the neuron. Let's set the weight $w_2$ to $0$, and then play around with the first weight, $w_1$, and the bias, $b$, to see how they affect the output
from the neuron:</p>
<p>
<script src="js/three.min.js"></script>
</p>
<p><canvas id="ti_graph" width="200" height="220"></canvas>
<span id="ti_graph_3d" style="position: absolute; left: 260px;"></span></p>
<p>As you can see, with $w_2 = 0$ the input $y$ makes no difference to the output from the neuron. It's as though $x$ is the only input.</p>
<p>Given this, what do you think happens when we increase the weight $w_1$ to $w_1 = 100$, with $w_2$ remaining $0$? If you don't immediately see the answer, ponder the question for a bit, and see if you can figure out what happens. Then try it out
and see if you're right. I've shown what happens in the following movie:</p>
<p>
<div>
<div id="b" class="videoOverlay" style="width: 460px; height: 252px; opacity: 0.8" onclick="playVideo('b');">
<img style="left: 160px; top: 70px;" src="images/play.png" width="128px">
</div>
<video id="vb" width="460" height="252" preload onended="videoEnded('b');">
<source type="video/mp4"
src="movies/step_3d.mp4">
<source type="video/webm"
src="movies/step_3d.webm"></p><p> </video>
</div>
</p>
<p>Just as in our earlier discussion, as the input weight gets larger the output approaches a step function. The difference is that now the step function is in three dimensions. Also as before, we can move the location of the step point around by
modifying the bias. The actual location of the step point is $s_x \equiv -b / w_1$.</p>
<p>Let's redo the above using the position of the step as the parameter:</p>
<p><canvas id="ti_graph_redux" width="200" height="220"></canvas> <span id="ti_graph_redux_3d" style="position: absolute; left:
260px;"></span></p>
<p>Here, we assume the weight on the $x$ input has some large value - I've used $w_1 = 1000$ - and the weight $w_2 = 0$. The number on the neuron is the step point, and the little $x$ above the number reminds us that the step is in the $x$ direction.
  Of course, it's also possible to get a step function in the $y$ direction, by making the weight on the $y$ input very large (say, $w_2 = 1000$), and the weight on the $x$ input equal to $0$, i.e., $w_1 = 0$:</p>
<p><canvas id="y_step" width="200" height="220"></canvas> <span id="y_step_3d" style="position: absolute; left: 260px;"></span></p>
<p>The number on the neuron is again the step point, and in this case the little $y$ above the number reminds us that the step is in the $y$ direction. I could have explicitly marked the weights on the $x$ and $y$ inputs, but decided not to, since
it would make the diagram rather cluttered. But do keep in mind that the little $y$ marker implicitly tells us that the $y$ weight is large, and the $x$ weight is $0$.</p>
<p>We can use the step functions we've just constructed to compute a three-dimensional bump function. To do this, we use two neurons, each computing a step function in the $x$ direction. Then we combine those step functions with weight $h$ and $-h$,
respectively, where $h$ is the desired height of the bump. It's all illustrated in the following diagram:
</p>
<p><canvas id="bump_3d" width="300" height="220"></canvas> <span id="bump_3d_graph" style="position: absolute; left: 360px;"></span></p>
<p>Try changing the value of the height, $h$. Observe how it relates to the weights in the network. And see how it changes the height of the bump function on the right.</p>
<p>Also, try changing the step point $0.30$ associated to the top hidden neuron. Witness how it changes the shape of the bump. What happens when you move it past the step point $0.70$ associated to the bottom hidden neuron?</p>
<p>We've figured out how to make a bump function in the $x$ direction. Of course, we can easily make a bump function in the $y$ direction, by using two step functions in the $y$ direction. Recall that we do this by making the weight large on the
$y$ input, and the weight $0$ on the $x$ input. Here's the result:</p>
<p><canvas id="bump_3d_y" width="300" height="220"></canvas> <span id="bump_3d_y_graph" style="position: absolute; left: 360px;"></span></p>
<p>This looks nearly identical to the earlier network! The only thing explicitly shown as changing is that there's now little $y$ markers on our hidden neurons. That reminds us that they're producing $y$ step functions, not $x$ step functions, and
so the weight is very large on the $y$ input, and zero on the $x$ input, not vice versa. As before, I decided not to show this explicitly, in order to avoid clutter.</p>
<p>Let's consider what happens when we add up two bump functions, one in the $x$ direction, the other in the $y$ direction, both of height $h$:</p>
<p><canvas id="xy_bump" width="300" height="270"></canvas> <span id="xy_bump_3d" style="position: absolute; left: 360px;"></span></p>
<p>To simplify the diagram I've dropped the connections with zero weight. For now, I've left in the little $x$ and $y$ markers on the hidden neurons, to remind you in what directions the bump functions are being computed. We'll drop even those markers
later, since they're implied by the input variable.</p>
<p>Try varying the parameter $h$. As you can see, this causes the output weights to change, and also the heights of both the $x$ and $y$ bump functions.
</p>
<p>What we've built looks a little like a <em>tower</em> function:</p>
<p>
<center> <span id="tower" style="position: absolute;"></span>
<div style="height: 230px"></div>
</center>
</p>
<p>If we could build such tower functions, then we could use them to approximate arbitrary functions, just by adding up many towers of different heights, and in different locations:</p>
<p>
<center> <span id="many_towers" style="position: absolute;"></span>
<div style="height: 230px"></div>
</center>
</p>
<p>Of course, we haven't yet figured out how to build a tower function. What we have constructed looks like a central tower, of height $2h$, with a surrounding plateau, of height $h$.</p>
<p>But we can make a tower function. Remember that earlier we saw neurons can be used to implement a type of <tt>if-then-else</tt> statement:
</p>
<p>
<div class="highlight"><pre><span></span> <span class="k">if</span> input ><span class="o">=</span> threshold:
output 1
<span class="k">else</span>:
output 0
</pre></div>
</p>
<p>That was for a neuron with just a single input. What we want is to apply a similar idea to the combined output from the hidden neurons:</p>
<p>
<div class="highlight"><pre><span></span> <span class="k">if</span> combined output from hidden neurons ><span class="o">=</span> threshold:
output 1
<span class="k">else</span>:
output 0
</pre></div>
</p>
<p>If we choose the <tt>threshold</tt> appropriately - say, a value of $3h/2$, which is sandwiched between the height of the plateau and the height of the central tower - we could squash the plateau down to zero, and leave just the tower standing.</p>
<p>Can you see how to do this? Try experimenting with the following network to figure it out. Note that we're now plotting the output from the entire network, not just the weighted output from the hidden layer. This means we add a bias term to the
weighted output from the hidden layer, and apply the sigma function. Can you find values for $h$ and $b$ which produce a tower? This is a bit tricky, so if you think about this for a while and remain stuck, here's two hints: (1) To get the output
neuron to show the right kind of <tt>if-then-else</tt> behaviour, we need the input weights (all $h$ or $-h$) to be large; and (2) the value of $b$ determines the scale of the
<tt>if-then-else</tt> threshold.</p>
<p><canvas id="tower_construction" width="300" height="270"></canvas>
<span id="tower_construction_3d" style="position: absolute; left:
350px;"></span></p>
<p>With our initial parameters, the output looks like a flattened version of the earlier diagram, with its tower and plateau. To get the desired behaviour, we increase the parameter $h$ until it becomes large. That gives the <tt>if-then-else</tt> thresholding behaviour. Second, to get the threshold right, we'll choose $b \approx -3h/2$. Try it, and see how it works!</p>
<p>Here's what it looks like, when we use $h = 10$:</p>
<p>
<div>
<div id="c" class="videoOverlay" style="width: 556px; height: 284px; opacity: 0.8" onclick="playVideo('c');">
<img style="left: 210px; top: 80px;" src="images/play.png" width="128px">
</div>
<video id="vc" width="556" height="284" preload onended="videoEnded('c');">
<source type="video/mp4"
src="movies/tower_construction.mp4">
<source type="video/webm"
src="movies/tower_construction.webm">
</video></div>
</p>
<p>Even for this relatively modest value of $h$, we get a pretty good tower function. And, of course, we can make it as good as we want by increasing $h$ still further, and keeping the bias as $b = -3h/2$.</p>
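<p>Here is the same two-dimensional tower written out as a sketch (same conventions as the one-dimensional code earlier: large first-layer weights, hypothetical helper names). Four hidden neurons form an $x$ bump and a $y$ bump of height $h$, and the output neuron thresholds their sum using the bias $b = -3h/2$:</p>
<p>
<div class="highlight"><pre>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(u, s, w=1000.0):
    return sigmoid(w * (u - s))

def tower(x, y, sx1, sx2, sy1, sy2, h=10.0):
    """Output of the whole network: a tower over the rectangle sx1..sx2 by sy1..sy2."""
    hidden = (h * step(x, sx1) - h * step(x, sx2)
              + h * step(y, sy1) - h * step(y, sy2))
    return sigmoid(hidden - 1.5 * h)    # output bias b = -3h/2

print(round(tower(0.5, 0.5, 0.3, 0.7, 0.4, 0.6), 3))  # centre of the tower -> close to 1
print(round(tower(0.5, 0.9, 0.3, 0.7, 0.4, 0.6), 3))  # on the plateau      -> close to 0
print(round(tower(0.1, 0.9, 0.3, 0.7, 0.4, 0.6), 3))  # far outside         -> close to 0
</pre></div>
</p>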
<p>Let's try gluing two such networks together, in order to compute two different tower functions. To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique
described above. The graph on the right shows the weighted output from the <em>second</em> hidden layer, that is, it's a weighted combination of tower functions.</p>
<p><canvas id="the_two_towers" width="320" height="580"></canvas> <span id="the_two_towers_3d" style="position: absolute; left: 370px;
margin-top: 180px;"></span></p>
<p>In particular, you can see that by modifying the weights in the final layer you can change the height of the output towers.</p>
<p>The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function
of two variables:</p>
<p>
<center> <span id="many_towers_2" style="position: absolute;"></span>
<div style="height: 230px"></div>
</center>
</p>
<p>In particular, by making the weighted output from the second hidden layer a good approximation to $\sigma^{-1} \circ f$, we ensure the output from our network will be a good approximation to any desired function, $f$.</p>
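<p>To make this recipe a little more concrete, here's a rough sketch of the same idea, this time using idealized (exact) tower functions in place of the step-neuron construction above. The target function $f$, the $10 \times 10$ grid of towers, and the sample points are all arbitrary illustrative choices, and the helper names are mine, not from the book's code:</p>
<p>
<div class="highlight"><pre>
# Illustrative sketch: f, the 10 x 10 grid, and the sample points are arbitrary.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_inv(y):
    """Inverse of the sigmoid; assumes 0 < y < 1."""
    return np.log(y / (1.0 - y))

def tower(x, y, x0, x1, y0, y1):
    """Idealized tower: 1 on the cell [x0,x1) x [y0,y1), 0 elsewhere.
    In the real network each such tower is built from four step neurons
    and an output neuron, as in the construction above."""
    return float(x0 <= x < x1 and y0 <= y < y1)

def f(x, y):
    """An arbitrary target function taking values strictly between 0 and 1."""
    return 0.2 + 0.6 * x * y

N = 10                                # a 10 x 10 grid of towers
edges = np.linspace(0.0, 1.0, N + 1)

def network(x, y):
    """The weighted output from the second hidden layer is a sum of towers
    whose heights (the final-layer weights) are sigma^{-1}(f) at the cell
    centres; applying sigma at the output then recovers (approximately) f."""
    weighted = 0.0
    for i in range(N):
        for j in range(N):
            cx = 0.5 * (edges[i] + edges[i + 1])
            cy = 0.5 * (edges[j] + edges[j + 1])
            weighted += sigma_inv(f(cx, cy)) * \
                tower(x, y, edges[i], edges[i + 1], edges[j], edges[j + 1])
    return sigma(weighted)

for (x, y) in [(0.17, 0.23), (0.58, 0.81)]:
    print(f(x, y), network(x, y))
# The network reproduces f at the centre of the grid cell containing
# (x, y), so the two printed values agree to within the cell size.
</pre></div>
</p>
<p>Making the grid finer improves the approximation, just as using thinner towers does in the construction above.</p>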
<p>What about functions of more than two variables?</p>
<p>Let's try three variables $x_1, x_2, x_3$. The following network can be used to compute a tower function in four dimensions:</p>
<p><canvas id="tower_n_dim" width="300" height="410"></canvas></p>
<p>Here, the $x_1, x_2, x_3$ denote inputs to the network. The $s_1, t_1$ and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points $s_1, t_1, s_2, \ldots$. The weights
in the second layer alternate $+h, -h$, where $h$ is some very large number. And the output bias is $-5h/2$.</p>
<p>This network computes a function which is $1$ provided three conditions are met: $x_1$ is between $s_1$ and $t_1$; $x_2$ is between $s_2$ and $t_2$; and $x_3$ is between $s_3$ and $t_3$. The network is $0$ everywhere else. That is, it's a kind
of tower which is $1$ in a little region of input space, and $0$ everywhere else.</p>
<p>By gluing together many such networks we can get as many towers as we want, and so approximate an arbitrary function of three variables. Exactly the same idea works in $m$ dimensions. The only change needed is to make the output bias $(-m+1/2)h$,
in order to get the right kind of sandwiching behavior to level the plateau.</p>
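<p>Here's a minimal sketch of the $m$-dimensional tower, again using sigmoid step approximations; the particular box below and the values $h = 10$, $w = 50$ are arbitrary illustrative choices:</p>
<p>
<div class="highlight"><pre>
# Illustrative sketch: the box and the values h, w are arbitrary choices.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, s, w=50.0):
    """Approximate step function with step point s."""
    return sigma(w * (x - s))

def tower_m_dim(xs, boxes, h=10.0):
    """xs: the m inputs x_1, ..., x_m.  boxes: the m pairs (s_j, t_j).
    Second-layer weights alternate +h, -h, and the output bias is
    (-m + 1/2)h, so the output is ~1 only when every x_j lies between
    s_j and t_j, and ~0 everywhere else."""
    m = len(xs)
    weighted = sum(h * step(x, s) - h * step(x, t)
                   for x, (s, t) in zip(xs, boxes))
    return sigma(weighted + (-m + 0.5) * h)

# A box in three dimensions: x_1 in (0.2, 0.5), x_2 in (0.1, 0.4),
# x_3 in (0.6, 0.9).
boxes = [(0.2, 0.5), (0.1, 0.4), (0.6, 0.9)]
print(tower_m_dim([0.3, 0.2, 0.7], boxes))   # inside the box: close to 1
print(tower_m_dim([0.3, 0.2, 0.1], boxes))   # x_3 outside: close to 0
</pre></div>
</p>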
<p>Okay, so we now know how to use neural networks to approximate a real-valued function of many variables. What about vector-valued functions $f(x_1, \ldots, x_m) \in R^n$? Of course, such a function can be regarded as just $n$ separate real-valued
functions, $f^1(x_1, \ldots, x_m), f^2(x_1, \ldots, x_m)$, and so on. So we create a network approximating $f^1$, another network for $f^2$, and so on. And then we simply glue all the networks together. So that's also easy to cope with.</p>
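<p>As a tiny sketch of the gluing step, suppose we already have single-output approximators <tt>net1</tt>, <tt>net2</tt>, and so on (hypothetical stand-ins for networks built as above), one per component $f^j$. The combined network simply runs them side by side on the same inputs:</p>
<p>
<div class="highlight"><pre>
def glue(nets):
    """Combine n single-output approximators into one vector-valued network.
    The sub-networks share the same inputs and simply sit side by side,
    each contributing one component of the output."""
    def combined(*xs):
        return [net(*xs) for net in nets]
    return combined

# Usage (net1 and net2 are assumed to approximate f^1 and f^2):
# vector_net = glue([net1, net2])
# vector_net(x1, x2, x3)  returns [approximation to f^1, approximation to f^2]
</pre></div>
</p>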
<p>
<h4><a name="problem_863961"></a><a href="#problem_863961">Problem</a></h4>
<ul>
</p>
<p>
<li> We've seen how to use networks with two hidden layers to approximate an arbitrary function. Can you find a proof showing that it's possible with just a single hidden layer? As a hint, try working in the case of just two input variables, and
showing that: (a) it's possible to get step functions not just in the $x$ or $y$ directions, but in an arbitrary direction; (b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular
in shape, rather than rectangular; (c) using these circular towers, it's possible to approximate an arbitrary function. To do part (c) it may help to use ideas from a
<a href="#fixing_up_the_step_functions">bit later in this
chapter</a>.</p>
<p>
</ul>
</p>
<p>
<h3><a name="extension_beyond_sigmoid_neurons"></a><a href="#extension_beyond_sigmoid_neurons">Extension beyond sigmoid neurons</a></h3></p>
<p>We've proved that networks made up of sigmoid neurons can compute any function. Recall that in a sigmoid neuron the inputs $x_1, x_2, \ldots$ result in the output $\sigma(\sum_j w_j x_j + b)$, where $w_j$ are the weights, $b$ is the bias, and
$\sigma$ is the sigmoid function:
</p>
<p>
<center> <canvas id="sigmoid" width="500" height="200"></canvas>
</center>
</p>
<p>What if we consider a different type of neuron, one using some other activation function, $s(z)$:</p>
<p>
<center> <canvas id="sigmoid_like" width="500" height="200"></canvas>
</center>
</p>
<p>That is, we'll assume that if our neuron has inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$ and bias $b$, then the output is $s(\sum_j w_j x_j + b)$.</p>
<p>We can use this activation function to get a step function, just as we did with the sigmoid. Try ramping up the weight in the following, say to $w = 100$:</p>
<p><canvas id="ramping" width="600" height="200"></canvas></p>
<p>Just as with the sigmoid, this causes the activation function to contract, and ultimately it becomes a very good approximation to a step function. Try changing the bias, and you'll see that we can set the position of the step to be wherever we
choose. And so we can use all the same tricks as before to compute any desired function.</p>
<p>What properties does $s(z)$ need to satisfy in order for this to work? We do need to assume that $s(z)$ is well-defined as $z \rightarrow -\infty$ and $z \rightarrow \infty$. These two limits are the two values taken on by our step function. We
also need to assume that these limits are different from one another. If they weren't, there'd be no step, simply a flat graph! But provided the activation function $s(z)$ satisfies these properties, neurons based on such an activation function
are universal for computation.</p>
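<p>Here's a small sketch of those conditions in action, using $\tanh$ purely as an example of an activation $s(z)$ whose limits as $z \rightarrow -\infty$ and $z \rightarrow \infty$ exist and differ (they are $-1$ and $+1$); the step point $0.4$ and the weights tried are arbitrary illustrative choices:</p>
<p>
<div class="highlight"><pre>
# Illustrative sketch: tanh stands in for a generic s(z); the step point
# and the weights tried are arbitrary choices.
import numpy as np

def s(z):
    """An example activation whose limits as z -> -infinity and
    z -> +infinity exist and are different (here, -1 and +1)."""
    return np.tanh(z)

def approx_step(x, step_point, w):
    """A single neuron, output s(w*x + b), with bias b = -w*step_point
    chosen so the transition sits at x = step_point."""
    return s(w * x - w * step_point)

xs = np.array([0.0, 0.2, 0.39, 0.41, 0.6, 1.0])
for w in [1.0, 10.0, 100.0]:
    print(w, np.round(approx_step(xs, 0.4, w), 3))
# As w grows, the outputs approach the two limiting values of s
# (-1 to the left of the step point, +1 to the right), except inside
# a narrowing window around the step point - the "window of failure"
# discussed in the next section.
</pre></div>
</p>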
<p>
<h4><a name="problems_963556"></a><a href="#problems_963556">Problems</a></h4>
<ul>
<li> Earlier in the book we met another type of neuron known as a <a href="chap3.html#other_models_of_artificial_neuron">rectified linear
unit</a>. Explain why such neurons don't satisfy the conditions just given for universality. Then find a proof showing that rectified linear units are nonetheless universal for computation.</p>
<p>
<li> Suppose we consider linear neurons, i.e., neurons with the activation function $s(z) = z$. Explain why linear neurons don't satisfy the conditions just given for universality. Show that such neurons can't be used to do universal computation.
</ul>
</p>
<p>
<h3><a name="fixing_up_the_step_functions"></a><a href="#fixing_up_the_step_functions">Fixing up the step functions</a></h3></p>
<p>Up to now, we've been assuming that our neurons can produce step functions exactly. That's a pretty good approximation, but it is only an approximation. In fact, there will be a narrow window of failure, illustrated in the following graph, in
which the function behaves very differently from a step function:</p>
<p><canvas id="failure" width="220" height="200"></canvas></p>
<p>In these windows of failure the explanation I've given for universality will fail.</p>
<p>Now, it's not a terrible failure. By making the input weights to the neurons big enough we can make these windows of failure as small as we like. Certainly, we can make the window much narrower than I've shown above - narrower, indeed, than our
eye could see. So perhaps we might not worry too much about this problem.</p>
<p>Nonetheless, it'd be nice to have some way of addressing the problem.</p>
<p>In fact, the problem turns out to be easy to fix. Let's look at the fix for neural networks computing functions with just one input and one output. The same ideas work also to address the problem when there are more inputs and outputs.</p>
<p>In particular, suppose we want our network to compute some function, $f$. As before, we do this by trying to design our network so that the weighted output from our hidden layer of neurons is $\sigma^{-1} \circ f(x)$:</p>
<p>
<center> <canvas id="inverted_function_2" width="340" height="300"></canvas> </center>
</p>
<p>If we were to do this using the technique described earlier, we'd use the hidden neurons to produce a sequence of bump functions:</p>
<p>
<center> <canvas id="series_of_bumps" width="340" height="300"></canvas> </center>
</p>
<p>Again, I've exaggerated the size of the windows of failure, in order to make them easier to see. It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to $\sigma^{-1} \circ f(x)$, except
within the windows of failure.</p>
<p>Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to <em>half</em> our original goal function, i.e., to $\sigma^{-1} \circ f(x) / 2$. Of course, this looks just like a scaled
down version of the last graph:</p>
<p>
<center> <canvas id="half_bumps" width="340" height="300"></canvas>
</center>
</p>
<p>And suppose we use another set of hidden neurons to compute an approximation to $\sigma^{-1} \circ f(x)/ 2$, but with the bases of the bumps <em>shifted</em> by half the width of a bump:</p>
<p>
<center> <canvas id="shifted_bumps" width="340" height="300"></canvas>
</center>
</p>
<p>Now we have two different approximations to $\sigma^{-1} \circ f(x) / 2$. If we add up the two approximations we'll get an overall approximation to $\sigma^{-1} \circ f(x)$. That overall approximation will still have failures in small windows.
But the problem will be much less severe than before. The reason is that points in a failure window for one approximation won't be in a failure window for the other. And so the approximation will be roughly a factor of $2$ better in those windows.
</p>
<p>We could do even better by adding up a large number, $M$, of overlapping approximations to the function $\sigma^{-1} \circ f(x) / M$. Provided the windows of failure are narrow enough, a point will only ever be in one window of failure. And provided
we're using a large enough number $M$ of overlapping approximations, the result will be an excellent overall approximation.</p>
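<p>Here's a rough one-dimensional sketch of this shift-and-add idea, with $M = 2$. Everything in it is an arbitrary illustrative choice of mine: the target $g$ (standing in for $\sigma^{-1} \circ f$), the number of bumps, the weight $w$ controlling how imperfect the steps are, and the helper names. The printed errors mix the windows of failure with the ordinary discretization error, so treat the improvement you see as indicative rather than as an exact factor of $2$:</p>
<p>
<div class="highlight"><pre>
# Illustrative sketch: g, the number of bumps, w, and the shift are arbitrary.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_step(x, s, w=400.0):
    """An imperfect step: it deviates from a true step function inside a
    narrow window of width ~1/w around the step point s."""
    return sigma(w * (x - s))

def bump(x, a, b, height):
    """A bump of the given height on [a, b], built from two soft steps."""
    return height * (soft_step(x, a) - soft_step(x, b))

def g(x):
    """The target, standing in for sigma^{-1} o f."""
    return np.sin(3.0 * x)

def approx(x, offset, scale, n=20):
    """A bump-by-bump approximation to scale*g on [0, 1], with all the
    bump edges shifted by `offset`."""
    edges = offset + np.linspace(0.0, 1.0, n + 1)
    total = np.zeros_like(x)
    for a, b in zip(edges[:-1], edges[1:]):
        total += bump(x, a, b, scale * g(0.5 * (a + b)))
    return total

x = np.linspace(0.05, 0.95, 2000)
single = approx(x, offset=0.0, scale=1.0)          # one full-sized approximation
double = (approx(x, offset=0.0, scale=0.5)         # two half-sized approximations,
          + approx(x, offset=0.025, scale=0.5))    # shifted by half a bump width
print("max error, single approximation:", np.max(np.abs(single - g(x))))
print("max error, shifted sum:         ", np.max(np.abs(double - g(x))))
# The shifted sum has a noticeably smaller maximum error.
</pre></div>
</p>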
<p>
<h3><a name="conclusion"></a><a href="#conclusion">Conclusion</a></h3></p>
<p>The explanation for universality we've discussed is certainly not a practical prescription for how to compute using neural networks! In this, it's much like proofs of universality for <tt>NAND</tt> gates and the like. For this reason, I've focused
mostly on trying to make the construction clear and easy to follow, and not on optimizing the details of the construction. However, you may find it a fun and instructive exercise to see if you can improve the construction.</p>
<p>Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network. The answer to that question is always "yes".
So the right question to ask is not whether any particular function is computable, but rather what's a <em>good</em> way to compute the function.</p>
<p>The universality construction we've developed uses just two hidden layers to compute an arbitrary function. Furthermore, as we've discussed, it's possible to get the same result with just a single hidden layer. Given this, you might wonder why
we would ever be interested in deep networks, i.e., networks with many hidden layers. Can't we simply replace those networks with shallow, single hidden layer networks?</p>
<p><span class="marginnote"><strong>Chapter
acknowledgments:</strong> Thanks to <a
href="http://jendodd.com">Jen Dodd</a> and <a
href="http://colah.github.io/about.html">Chris Olah</a> for many
discussions about universality in neural networks. My thanks, in
particular, to Chris for suggesting the use of a lookup table to
prove universality. The interactive visual form of the chapter is
inspired by the work of people such as <a
href="http://bost.ocks.org/mike/algorithms/">Mike Bostock</a>, <a
href="http://www-cs-students.stanford.edu/~amitp/">Amit
Patel</a>, <a href="http://worrydream.com">Bret Victor</a>, and <a
href="http://acko.net/">Steven Wittens</a>. </span></p>
<p>While in principle that's possible, there are good practical reasons to use deep networks. As argued in
<a href="chap1.html#toward_deep_learning">Chapter 1</a>, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems. Put
more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through
complex, multi-object scenes. In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge. To sum up: universality tells us that neural networks can compute
any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.
</p>
<p>
<!-- Seems to be necessary to ensure the font loads --><span style="font-family: MJX_Math; color: #fff;">.</span> <span style="font-family: MJX_Main; color: #fff;">.</span></p>
<p>
<script src="js/chap4.js"></script>
<script src="js/canvas.js"></script>
<script src="js/neuron.js"></script>
<script src="js/scrubbable.js"></script>
<script src="js/button.js"></script>
</p>
<p>
</div>
<div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2015
<br/>
<br/>
This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>. This means you're free to copy, share, and
build on this book, but not to sell it. If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Thu Jan 19 06:09:48 2017
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
</body>
</html>