<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>raindu's home</title>
<subtitle>A lifelong learner!!!</subtitle>
<link href="/atom.xml" rel="self"/>
<link href="http://www.raindu.com/"/>
<updated>2017-09-16T04:50:07.641Z</updated>
<id>http://www.raindu.com/</id>
<author>
<name>RainDu</name>
<email>578708965@qq.com</email>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Making business charts in R: take your charts to a new level of polish</title>
<link href="http://www.raindu.com/2017/09/16/%E7%94%A8R%E8%AF%AD%E8%A8%80%E5%88%B6%E4%BD%9C%E5%95%86%E5%8A%A1%E5%9B%BE%E8%A1%A8%EF%BC%8C%E8%AE%A9%E4%BD%A0%E7%9A%84%E5%9B%BE%E8%A1%A8%E7%BE%8E%E5%87%BA%E6%96%B0%E9%AB%98%E5%BA%A6/"/>
<id>http://www.raindu.com/2017/09/16/用R语言制作商务图表,让你的图表美出新高度/</id>
<published>2017-09-16T04:38:05.000Z</published>
<updated>2017-09-16T04:50:07.641Z</updated>
<content type="html"><![CDATA[<p><img src="http://or7gx50mg.bkt.clouddn.com/cube_bar/%E5%BE%AE%E4%BF%A1%E9%A6%96%E5%9B%BE.jpg" alt=""></p>
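<p>The example chart in this post is a square-area comparison chart that this WeChat official account published earlier. Liu Wanxiang has also shared a very good approach to building it on his own official account and blog.<br><a id="more"></a></p>
<p>At the time I wondered whether the chart could be written in R, but I had no idea how to go about it. Only recently, as my understanding of the ggplot system deepened, did I find a workable approach.</p>
<p>Today I will show how to reproduce the chart with R's ggplot function:</p>
<p>Here is the original chart:<br><img src="http://or7gx50mg.bkt.clouddn.com/cube_bar/image1.jpg" alt=""></p>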
<p>To make the finished chart directly comparable with the original, I used the same source data:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"></div><div class="line"><span class="keyword">library</span>(ggplot2)</div><div class="line"><span class="keyword">library</span>(ggmap) </div><div class="line"><span class="comment">#ggmap is loaded only for its theme_nothing() function, which strips the default theme elements</span></div></pre></td></tr></table></figure></p>
<p><strong>Build the source data:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mydata<-data.frame(X=c(<span class="number">3</span>,<span class="number">7</span>,<span class="number">11</span>,<span class="number">15</span>,<span class="number">19</span>),A=c(<span class="number">2471</span>,<span class="number">1893</span>,<span class="number">1248</span>,<span class="number">1078</span>,<span class="number">556</span>),B=c(<span class="number">1385</span>,<span class="number">951</span>,<span class="number">869</span>,<span class="number">784</span>,<span class="number">366</span>),C=c(<span class="number">56</span>,<span class="number">7</span>,<span class="number">19</span>,<span class="number">13</span>,<span class="number">40</span>))</div></pre></td></tr></table></figure></p>
<p>The following steps build the rectangle bounds (x-axis start and end, y-axis start and end) for the three series:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line">mydata$Axmin<-mydata$X-sqrt(mydata$A)/<span class="number">30</span></div><div class="line">mydata$Axmax<-mydata$X+sqrt(mydata$A)/<span class="number">30</span></div><div class="line">mydata$Aymin<-<span class="number">0</span></div><div class="line">mydata$Aymax<-sqrt(mydata$A)/<span class="number">15</span></div><div class="line"></div><div class="line"></div><div class="line">mydata$Bxmin<-mydata$X+sqrt(mydata$A)/<span class="number">30</span>-sqrt(mydata$B)/<span class="number">15</span></div><div class="line">mydata$Bxmax<-mydata$X+sqrt(mydata$A)/<span class="number">30</span></div><div class="line">mydata$Bymin<-<span class="number">0</span></div><div class="line">mydata$Bymax<-sqrt(mydata$B)/<span class="number">15</span></div><div class="line"></div><div class="line"></div><div class="line">mydata$Cxmin<-mydata$X+sqrt(mydata$A)/<span class="number">30</span>-sqrt(mydata$C)/<span class="number">10</span></div><div class="line">mydata$Cxmax<-mydata$X+sqrt(mydata$A)/<span class="number">30</span></div><div class="line">mydata$Cymin<-<span class="number">0</span></div><div class="line">mydata$Cymax<-sqrt(mydata$C)/<span class="number">10</span></div></pre></td></tr></table></figure></p>
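<p>Note the reasoning behind how the bounds above are set:</p>
<ul>
<li>The first series sits on the bottom layer, so it can be centred on X.</li>
<li>The second and third series must be aligned with the right edge of the first series.</li>
<li>The square roots of the raw values are still large, so the scale is compressed as needed (divided by 15 here; the code above divides series C by 10).</li>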
</ul>
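<p>The alignment rules above can be captured in a small helper. This is a minimal sketch rather than the code used for the chart: the function name square_bounds and its arguments are hypothetical, and it assumes every series is divided by the same scale so that areas stay comparable.</p>

```r
# Hypothetical helper: bounds of a square whose area is proportional to `value`.
# The side is sqrt(value) / scale. If `right_edge` is NULL the square is centred
# on `center`; otherwise its right side is pinned to `right_edge`.
square_bounds <- function(center, value, scale = 15, right_edge = NULL) {
  side <- sqrt(value) / scale
  xmax <- if (is.null(right_edge)) center + side / 2 else right_edge
  data.frame(xmin = xmax - side, xmax = xmax, ymin = 0, ymax = side)
}

# Values taken from the source data above
X <- c(3, 7, 11, 15, 19)
A <- c(2471, 1893, 1248, 1078, 556)
B <- c(1385, 951, 869, 784, 366)

a_box <- square_bounds(X, A)                           # bottom layer, centred on X
b_box <- square_bounds(X, B, right_edge = a_box$xmax)  # right-aligned with A
```

<p>With the default scale of 15, a_box and b_box give the same bounds as the Axmin to Aymax and Bxmin to Bymax columns computed above.</p>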
<p><strong>Value labels and text labels</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">mydata$text<-c(<span class="string">"University of\n Pennsylvania"</span>,<span class="string">"University of\n Notre Dame"</span>,<span class="string">"Princeton\n University"</span>,<span class="string">"Stanford\n University"</span>,<span class="string">"California Institute\n of Technology"</span>)</div><div class="line">mydata$full<-c(<span class="string">"31663"</span>,<span class="string">"16548"</span>,<span class="string">"27189"</span>,<span class="string">"34348"</span>,<span class="string">"5225"</span>)</div></pre></td></tr></table></figure></p>
<p><strong>Reshaping the data</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">mydata1<-mydata[,<span class="number">5</span>:<span class="number">8</span>]</div><div class="line">names(mydata1)<-c(<span class="string">"xmin"</span>,<span class="string">"xmax"</span>,<span class="string">"ymin"</span>,<span class="string">"ymax"</span>)</div><div class="line">mydata1$Group<-<span class="string">"A"</span></div><div class="line"></div><div class="line">mydata2<-mydata[,<span class="number">9</span>:<span class="number">12</span>]</div><div class="line">names(mydata2)<-c(<span class="string">"xmin"</span>,<span class="string">"xmax"</span>,<span class="string">"ymin"</span>,<span class="string">"ymax"</span>)</div><div class="line">mydata2$Group<-<span class="string">"B"</span></div><div class="line"></div><div class="line">mydata3<-mydata[,<span class="number">13</span>:<span class="number">16</span>]</div><div class="line">names(mydata3)<-c(<span class="string">"xmin"</span>,<span class="string">"xmax"</span>,<span class="string">"ymin"</span>,<span class="string">"ymax"</span>)</div><div class="line">mydata3$Group<-<span class="string">"C"</span></div><div class="line"></div><div class="line">mynewdata<-rbind(mydata1,mydata2,mydata3)</div><div class="line">mynewdata$Group<-factor(mynewdata$Group,ordered=<span class="literal">TRUE</span>)</div></pre></td></tr></table></figure></p>
<p><strong>Set the font to Arial (windowsFonts() is available on Windows only)</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">windowsFonts(myFont = windowsFont(<span class="string">"arial"</span>))</div></pre></td></tr></table></figure></p>
<p><strong>Run the following plotting code (layers that use columns of mydata are given data=mydata explicitly, because the base plot is built on mynewdata):</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div></pre></td><td class="code"><pre><div class="line">ggplot(mynewdata)+</div><div class="line">geom_rect(aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax,fill=Group))+</div><div class="line">scale_fill_manual(values=c(<span class="string">"#59AF8A"</span>,<span class="string">"#0074A3"</span>,<span class="string">"#C72733"</span>))+</div><div class="line">geom_linerange(data=mydata,aes(x=X+<span class="number">2</span>,ymin=<span class="number">0</span>,ymax=<span class="number">4.8</span>),col=<span class="string">"grey"</span>,linetype=<span class="number">2</span>)+</div><div class="line">ylim(-<span class="number">.5</span>,<span class="number">6</span>)+</div><div class="line">labs(x=<span class="string">""</span>,y=<span class="string">""</span>)+</div><div class="line">geom_text(data=mydata,aes(x=X,y=<span class="number">4.5</span>,label=text),size=<span class="number">4</span>,fontface=<span class="string">"bold"</span>,family=<span class="string">"myFont"</span>)+</div><div class="line">geom_label(data=mydata,aes(x=X,y=<span class="number">3.7</span>,label=full),fill=<span class="string">"#EFE5CA"</span>,colour=<span class="string">"black"</span>,fontface=<span class="string">"bold"</span>,size=<span class="number">3.5</span>,label.r=unit(<span class="number">0.15</span>,<span class="string">"lines"</span>),family=<span class="string">"myFont"</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Axmin,y=Aymax,label=A),hjust=-<span class="number">.2</span>,vjust=<span class="number">1</span>,size=<span class="number">3.5</span>,col=<span class="string">"white"</span>,family=<span class="string">"myFont"</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Bxmin,y=Bymax,label=B),hjust=-<span class="number">.2</span>,vjust=<span class="number">1</span>,size=<span class="number">3.5</span>,col=<span class="string">"white"</span>,family=<span class="string">"myFont"</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Cxmin,y=Cymax,label=C),hjust=-<span class="number">.2</span>,vjust=<span class="number">1</span>,size=<span class="number">3</span>,col=<span class="string">"white"</span>,family=<span class="string">"myFont"</span>)+</div><div class="line">annotate(<span class="string">"text"</span>,x=<span class="number">2.5</span>,y=<span class="number">5.7</span>,label=<span class="string">"Class Struggle"</span>,col=<span class="string">"black"</span>, size=<span class="number">6</span>,family=<span class="string">"myFont"</span>)+ </div><div class="line">annotate(<span class="string">"text"</span>,x=<span class="number">8.85</span>,y=<span class="number">5.2</span>,label=<span class="string">"A spot on a university or college's waitlist rarely translates into admission. 
A look at the numbers for several institutions"</span>, size=<span class="number">4</span>,family=<span class="string">"myFont"</span>)+ </div><div class="line">annotate(<span class="string">"text"</span>,x=<span class="number">3.9</span>,y=-<span class="number">.32</span>,label=<span class="string">"Source:The universities and 2011-2012 Common Data Set"</span>,col=<span class="string">"black"</span>,size=<span class="number">3</span>,family=<span class="string">"myFont"</span>)+ </div><div class="line">annotate(<span class="string">"text"</span>,x=<span class="number">19.8</span>,y=-<span class="number">.32</span>,label=<span class="string">"The Wall Street Journal"</span>,col=<span class="string">"black"</span>,size=<span class="number">3</span>,family=<span class="string">"myFont"</span>)+ </div><div class="line">theme_nothing()+</div><div class="line">theme(panel.background=element_rect(fill=<span class="string">"#F5F2E1"</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://or7gx50mg.bkt.clouddn.com/cube_bar/image.png" alt=""></p>
<p>Suggested export size: 1035 × 330 pixels.</p>
<p>I recommend using the Cairo package to save the image.</p>
<p>Usage is as follows:</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">font.add(<span class="string">"myfont"</span>, <span class="string">"arial.ttf"</span>)</div><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/image.png"</span>,width=<span class="number">1035</span>,height=<span class="number">330</span>)</div><div class="line">showtext.begin()</div><div class="line">……</div><div class="line">showtext.end()</div><div class="line">dev.off()</div></pre></td></tr></table></figure>
<hr>
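<p><strong>Contact:</strong><br>wechat:ljty1991<br>Mail:578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team WeChat official account: EasyCharts<br>QQ discussion group: [魔方学院] 298236508</p>
<p><strong>About the author:</strong><br><strong><em>Du Yu (杜雨)</em></strong><br>Graduate student in finance and economics;<br>self-styled data visualization enthusiast;<br>programming novice with a liberal arts background;<br>enjoys working on business charts and geographic data visualization, and tinkering with data visualization tools such as PowerBI, SAP DashBoard, Tableau, R ggplot2 and Think-cell chart. Founder and operator of the WeChat official account “数据小魔方”.<br>Mail:578708965@qq.com </p>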
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a></p>
]]></content>
<summary type="html">
<p><img src="http://or7gx50mg.bkt.clouddn.com/cube_bar/%E5%BE%AE%E4%BF%A1%E9%A6%96%E5%9B%BE.jpg" alt=""></p>
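<p>The example chart in this post is a square-area comparison chart that this WeChat official account published earlier. Liu Wanxiang has also shared a very good approach to building it on his own official account and blog.<br>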
</summary>
<category term="数据可视化" scheme="http://www.raindu.com/categories/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>Visualizing Dalian's 2016 air quality data</title>
<link href="http://www.raindu.com/2017/08/30/%E5%A4%A7%E8%BF%9E%E5%B8%822016%E5%B9%B4%E7%A9%BA%E6%B0%94%E8%B4%A8%E9%87%8F%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<id>http://www.raindu.com/2017/08/30/大连市2016年空气质量数据可视化/</id>
<published>2017-08-30T11:15:45.000Z</published>
<updated>2017-09-19T01:03:46.568Z</updated>
<content type="html"><![CDATA[<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image2.png" alt=""></p>
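<p>A few days ago I came across an interesting package, openair, which can render a year-long time series as a calendar heatmap. The format seemed well suited to presenting a year of air quality, so I found time to scrape Dalian's 2016 air quality data to play with. The target site has a simple page structure and very regular pages, so scraping was easy, and the code should work as a template. Give it a try if you are interested!<br><a id="more"></a></p>
<p><strong>Data preparation</strong></p>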
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(RCurl)</div><div class="line"><span class="keyword">library</span>(XML)</div><div class="line"><span class="keyword">library</span>(dplyr)</div><div class="line"><span class="keyword">library</span>(ggplot2)</div><div class="line"><span class="keyword">library</span>(stringr)</div><div class="line"><span class="keyword">library</span>(rvest)</div><div class="line"><span class="keyword">library</span>(lubridate)</div><div class="line"><span class="keyword">library</span>(<span class="string">"DT"</span>)</div><div class="line"><span class="keyword">library</span>(openair)</div><div class="line"><span class="keyword">library</span>(plyr) <span class="comment">#laply(), ddply() and count() used below come from plyr</span></div></pre></td></tr></table></figure>
<p><strong>Scraping the data:</strong></p>
<p>Build the monthly URLs (the site stores its data by month, so it has to be scraped month by month):<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">urlbase<-<span class="string">"https://www.aqistudy.cn/historydata/"</span></div><div class="line">url<-<span class="string">"https://www.aqistudy.cn/historydata/monthdata.php?city=大连"</span></div><div class="line">rd <- getURL(url,.encoding=<span class="string">"UTF-8"</span>)</div><div class="line">rdhtml <- htmlParse(rd,encoding=<span class="string">"UTF-8"</span>)</div><div class="line">otherpage<-getNodeSet(rdhtml,<span class="string">"//a"</span>)</div><div class="line">allurl<-laply(otherpage,xmlGetAttr,name=<span class="string">'href'</span>)%>%grep(<span class="string">"2016"</span>,.,value=<span class="literal">T</span>)%>%sub(<span class="string">"麓贸脕卢"</span>,<span class="string">"大连"</span>,.)</div><div class="line"><span class="comment">#The encoding still goes wrong above; the garbled string is presumably a mis-decoded "大连", so it is simply replaced by force</span></div><div class="line">allurl<-paste0(urlbase,allurl)</div></pre></td></tr></table></figure></p>
<p>Alternatively, inspect the pattern of Dalian's monthly air quality URLs and generate them directly with the paste0 function.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">allurl<-paste0(<span class="string">"https://www.aqistudy.cn/historydata/daydata.php?city=大连&month="</span>,<span class="number">201601</span>:<span class="number">201612</span>)</div></pre></td></tr></table></figure>
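<p>This is the quick-and-dirty approach, and there is no guarantee that it works for every site.</p>
<p>Fetch a single page first to see what the data looks like:</p>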
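<p>One limitation of an integer range such as 201601:201612 is that it only works inside a single year. The sketch below builds the month keys from a date sequence instead. The helper name month_keys is hypothetical, and the URL pattern is the one quoted above and is assumed to still be valid.</p>

```r
# Build "YYYYMM" keys from a real date sequence so that ranges crossing a
# year boundary stay valid (201611:201702 would also yield 201613, 201614, ...).
month_keys <- function(from, to) {
  format(seq(as.Date(from), as.Date(to), by = "month"), "%Y%m")
}

# URL pattern copied from the post; assumed, not re-checked against the site
base_url <- "https://www.aqistudy.cn/historydata/daydata.php?city=大连&month="

keys <- month_keys("2016-01-01", "2016-12-01")
allurl_sketch <- paste0(base_url, keys)
```

<p>For 2016 this yields the same twelve addresses as the one-liner above; a call such as month_keys("2016-11-01", "2017-02-01") would cover a winter that spans two years.</p>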
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">tbls<-read_html(allurl[<span class="number">1</span>],encoding=<span class="string">"utf-8"</span>)%>%html_table(.,header=<span class="literal">TRUE</span>,trim=<span class="literal">TRUE</span>);tbls<-tbls[[<span class="number">1</span>]]</div></pre></td></tr></table></figure>
<p>Write a function that scrapes a single page, then loop over the URLs with a for loop to collect the data (forgive me for using a for loop yet again):<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">mytable<-data.frame()</div><div class="line">fun<-<span class="keyword">function</span>(m){</div><div class="line">table<-read_html(m,encoding=<span class="string">"utf-8"</span>)%>%html_table(.,header=<span class="literal">TRUE</span>,trim=<span class="literal">TRUE</span>)</div><div class="line">table[[<span class="number">1</span>]]</div><div class="line">}</div><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> allurl){</div><div class="line">Sys.sleep(sample(<span class="number">1</span>:<span class="number">5</span>,<span class="number">1</span>)) <span class="comment">#random pause of 1 to 5 seconds between requests</span></div><div class="line">mytable<-rbind(mytable,fun(i))</div><div class="line">}</div></pre></td></tr></table></figure></p>
<p>Inspect the data in an interactive table:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">datatable(mytable)</div></pre></td></tr></table></figure></p>
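<p>The same loop can be written without growing a data frame inside for. The following is only a sketch under stated assumptions: fetch_all and fake_fetch are hypothetical names, and fake_fetch is an offline stand-in so the example runs without network access. In real use the per-page scraping function from the loop above would be passed in instead.</p>

```r
# Hypothetical wrapper: `fetch_one` is any function that takes a URL and
# returns a data frame (for the site above, the rvest pipeline).
fetch_all <- function(urls, fetch_one, pause = 0) {
  pages <- lapply(urls, function(u) {
    if (pause > 0) Sys.sleep(pause)
    # A failed page becomes NULL instead of aborting the whole run
    tryCatch(fetch_one(u), error = function(e) NULL)
  })
  # Drop failed pages, then bind the remaining data frames row-wise
  pages <- Filter(Negate(is.null), pages)
  do.call(rbind, pages)
}

# Offline stand-in for a real scraper, for illustration only
fake_fetch <- function(u) {
  if (grepl("bad", u)) stop("page not available")
  data.frame(url = u, n_char = nchar(u))
}

result <- fetch_all(c("page1", "bad", "page22"), fake_fetch)
```

<p>Collecting the pages in a list and binding once avoids repeated copying, and the tryCatch wrapper means a single failed month does not discard the months already downloaded.</p>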
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image1.jpg" alt=""></p>
<blockquote>
<p>Keep a backup copy in case the original data gets damaged:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mytableb<-mytable</div></pre></td></tr></table></figure></p>
</blockquote>
<p><strong>Adjusting the date variable</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mytable$日期<-as.Date(mytable$日期);names(mytable)[<span class="number">1</span>]<-<span class="string">"date"</span></div></pre></td></tr></table></figure></p>
<p><strong>Calendar plot of the AQI over the year</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">calendarPlot(mytable,pollutant=<span class="string">"AQI"</span>,year=<span class="number">2016</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image2.png" alt=""></p>
<p><strong>Calendar plot of PM2.5 over the year</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">calendarPlot(mytable,pollutant=<span class="string">"PM2.5"</span>,year=<span class="number">2016</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image3.png" alt=""></p>
<p><strong>Building the same calendar plot with ggplot</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">dat <- mytable</div></pre></td></tr></table></figure></p>
<p>The date variables are derived here with base R's as.POSIXlt() and format() together with plyr's ddply(); the lubridate package (extremely handy) is used further down for the quarter, month and week breakdown.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">dat$month<-as.numeric(as.POSIXlt(dat$date)$mon+<span class="number">1</span>)</div><div class="line">dat$monthf<-factor(dat$month,levels=as.character(<span class="number">1</span>:<span class="number">12</span>),labels=c(<span class="string">"Jan"</span>,<span class="string">"Feb"</span>,<span class="string">"Mar"</span>,<span class="string">"Apr"</span>,<span class="string">"May"</span>,<span class="string">"Jun"</span>,<span class="string">"Jul"</span>,<span class="string">"Aug"</span>,<span class="string">"Sep"</span>,<span class="string">"Oct"</span>,<span class="string">"Nov"</span>,<span class="string">"Dec"</span>),ordered=<span class="literal">TRUE</span>)</div><div class="line">dat$weekday<-as.POSIXlt(dat$date)$wday</div><div class="line">dat$weekdayf<-factor(dat$weekday,levels=rev(<span class="number">0</span>:<span class="number">6</span>),labels=rev(c(<span class="string">"Sun"</span>,<span class="string">"Mon"</span>,<span class="string">"Tue"</span>,<span class="string">"Wed"</span>,<span class="string">"Thu"</span>,<span class="string">"Fri"</span>,<span class="string">"Sat"</span>)),ordered=<span class="literal">TRUE</span>)</div><div class="line">dat$week <- as.numeric(format(dat$date,<span class="string">"%W"</span>))</div><div class="line">dat<-ddply(dat,.(monthf),transform,monthweek=<span class="number">1</span>+week-min(week))</div></pre></td></tr></table></figure></p>
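<p>The week-of-month index is the least obvious part of the step above, so here is a base R sketch of the same idea without plyr. It is not the author's code: the function name calendar_coords is hypothetical, it assumes all dates fall in a single year, and it deliberately swaps "%W" (weeks counted from Monday) for "%U" (weeks counted from Sunday) so that the week column agrees with wday, where Sunday is 0.</p>

```r
# Base-R sketch of the calendar coordinates. "%U" numbers weeks from Sunday,
# which matches the Sunday = 0 convention of POSIXlt's wday.
calendar_coords <- function(date) {
  lt <- as.POSIXlt(date)
  month <- lt$mon + 1
  week <- as.numeric(format(date, "%U"))
  data.frame(
    date = date,
    month = month,
    weekday = lt$wday,
    # week index within each month, starting at 1
    monthweek = 1 + week - ave(week, month, FUN = min)
  )
}

days <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
cal <- calendar_coords(days)
```

<p>Each month then occupies between one and six columns, and every Sunday opens a new column within its month, which is the layout a calendar tile plot with Sunday in the top row expects.</p>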
<p><strong>Distribution of days with AQI at or above the pollution level</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">windowsFonts(myFont = windowsFont(<span class="string">"微软雅黑"</span>))</div><div class="line">ggplot(dat,aes(monthweek,weekdayf,fill=AQI))+</div><div class="line"> geom_tile(colour=<span class="string">'white'</span>) +</div><div class="line"> facet_wrap(~monthf ,nrow=<span class="number">3</span>) +</div><div class="line"> scale_fill_gradient(space=<span class="string">"Lab"</span>,limits=c(<span class="number">100</span>, max(dat$AQI)),low=<span class="string">"yellow"</span>, high=<span class="string">"red"</span>) +</div><div class="line"> labs(title=<span class="string">"大连市2016年空气日历热图"</span>,subtitle=<span class="string">"AQI指数为污染级别以上的天数分布(AQI>=100)"</span>,x=<span class="string">"Week of Month"</span>,y=<span class="string">""</span>)+</div><div class="line"> theme(title=element_text(family=<span class="string">"myFont"</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image4.png" alt=""></p>
<p><strong>Distribution of days with PM2.5 at or above the pollution level</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">ggplot(dat,aes(monthweek,weekdayf,fill=PM2.5))+</div><div class="line"> geom_tile(colour=<span class="string">'white'</span>) +</div><div class="line"> facet_wrap(~monthf ,nrow=<span class="number">3</span>) +</div><div class="line"> scale_fill_gradient(space=<span class="string">"Lab"</span>,limits=c(<span class="number">75</span>, max(dat$PM2.5)),low=<span class="string">"yellow"</span>, high=<span class="string">"red"</span>) +</div><div class="line"> labs(title=<span class="string">"大连市2016年空气日历热图"</span>,subtitle=<span class="string">"PM2.5指数为污染级别以上的天数分布(PM2.5>=75)"</span>,x=<span class="string">"Week of Month"</span>,y=<span class="string">""</span>)+</div><div class="line"> theme(title=element_text(family=<span class="string">"myFont"</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image5.png" alt=""></p>
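<p>Overall, Dalian's air pollution in 2016 was fairly mild and far better than Beijing's. The days with AQI or PM2.5 at or above the pollution level add up to less than two months.</p>
<p><strong>Breaking it down by time</strong></p>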
<p>Next comes some more detailed statistics: the average AQI and PM2.5 values broken down by quarter, month and week.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">data3<-mytable[,c(<span class="number">1</span>,<span class="number">2</span>,<span class="number">4</span>,<span class="number">5</span>)]</div><div class="line">data3<-transform(data3,Quarter=quarter(date),Month=month(date),Week=week(date))</div><div class="line">data3$质量等级<-factor(data3$质量等级,levels=c(<span class="string">"重度污染"</span>,<span class="string">"中度污染"</span>,<span class="string">"轻度污染"</span>,<span class="string">"良"</span>,<span class="string">"优"</span>),labels=c(<span class="string">"重度污染"</span>,<span class="string">"中度污染"</span>,<span class="string">"轻度污染"</span>,<span class="string">"良"</span>,<span class="string">"优"</span>),ordered=<span class="literal">TRUE</span>)</div></pre></td></tr></table></figure></p>
<p><strong>First, look at how often each of the five pollution levels occurred in 2016:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div></pre></td><td class="code"><pre><div class="line">countd<-count(data3$质量等级)%>%arrange(.,-freq)</div><div class="line">ggplot(countd,aes(reorder(x,freq),freq))+</div><div class="line">geom_bar(fill=<span class="string">"#0C8DC4"</span>,stat=<span class="string">"identity"</span>)+</div><div class="line">coord_flip()+</div><div class="line">labs(title=<span class="string">"大连市2016年度空气质量分布"</span>,subtitle=<span class="string">"污染级别频率分布图"</span>,caption=<span class="string">"https://www.aqistudy.cn/"</span>)+</div><div class="line">geom_text(aes(label=freq),hjust=<span class="number">1</span>,colour=<span class="string">"white"</span>,size=<span class="number">8</span>)+</div><div class="line">theme_bw()+</div><div class="line">theme(</div><div class="line"> text=element_text(family=<span class="string">"myFont"</span>),</div><div class="line"> panel.border=element_blank(),</div><div class="line"> panel.grid.major=element_line(linetype=<span class="string">"dashed"</span>),</div><div class="line"> panel.grid.minor=element_blank(),</div><div class="line"> plot.caption=element_text(hjust=<span class="number">0</span>,size=<span class="number">10</span>),</div><div class="line"> axis.title=element_blank()</div><div class="line"> )</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image7.png" alt=""></p>
<p><strong>Average air quality by quarter:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">Quarter<-aggregate(AQI~Quarter,data=data3,FUN=mean)</div><div class="line">ggplot(Quarter,aes(reorder(Quarter,-AQI),AQI))+</div><div class="line">geom_bar(fill=<span class="string">"#0C8DC4"</span>,stat=<span class="string">"identity"</span>)+</div><div class="line">labs(title=<span class="string">"大连市2016年度空气质量分布"</span>,subtitle=<span class="string">"AQI污染指数季度指标平均分布图"</span>,caption=<span class="string">"https://www.aqistudy.cn/"</span>)+</div><div class="line">geom_text(aes(label=round(AQI)),vjust=<span class="number">1.5</span>,colour=<span class="string">"white"</span>,size=<span class="number">8</span>)+</div><div class="line">theme_bw()+</div><div class="line">theme(</div><div class="line"> text=element_text(family=<span class="string">"myFont"</span>),</div><div class="line"> panel.border=element_blank(),</div><div class="line"> panel.grid.major=element_line(linetype=<span class="string">"dashed"</span>),</div><div class="line"> panel.grid.minor=element_blank(),</div><div class="line"> plot.caption=element_text(hjust=<span class="number">0</span>,size=<span class="number">10</span>),</div><div class="line"> axis.title=element_blank()</div><div class="line"> )</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image8.png" alt=""></p>
<p><strong>Average air quality by month:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">Month<-aggregate(AQI~Month,data=data3,FUN=mean)</div><div class="line">ggplot(Month,aes(reorder(Month,-AQI),AQI))+</div><div class="line">geom_bar(fill=<span class="string">"#0C8DC4"</span>,stat=<span class="string">"identity"</span>)+</div><div class="line">labs(title=<span class="string">"大连市2016年度空气质量分布"</span>,subtitle=<span class="string">"AQI污染指数月度指标平均分布图"</span>,caption=<span class="string">"https://www.aqistudy.cn/"</span>)+</div><div class="line">geom_text(aes(label=round(AQI)),vjust=<span class="number">1.5</span>,colour=<span class="string">"white"</span>,size=<span class="number">6</span>)+</div><div class="line">theme_bw()+</div><div class="line">theme(</div><div class="line"> text=element_text(family=<span class="string">"myFont"</span>),</div><div class="line"> panel.border=element_blank(),</div><div class="line"> panel.grid.major=element_line(linetype=<span class="string">"dashed"</span>),</div><div class="line"> panel.grid.minor=element_blank(),</div><div class="line"> plot.caption=element_text(hjust=<span class="number">0</span>,size=<span class="number">10</span>),</div><div class="line"> axis.title=element_blank()</div><div class="line"> )</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image9.png" alt=""></p>
<p><strong>Average air quality by week:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">Week<-aggregate(AQI~Week,data=data3,FUN=mean)</div><div class="line">ggplot(Week,aes(factor(Week,ordered=<span class="literal">TRUE</span>),AQI,group=<span class="number">1</span>))+</div><div class="line">geom_line(col=<span class="string">"#0C8DC4"</span>)+</div><div class="line">labs(title=<span class="string">"大连市2016年度空气质量分布"</span>,subtitle=<span class="string">"AQI污染指数周度指标平均分布图"</span>,caption=<span class="string">"https://www.aqistudy.cn/"</span>)+</div><div class="line">geom_text(aes(label=ifelse(AQI><span class="number">100</span>,round(AQI),<span class="string">""</span>)),vjust=<span class="number">1.5</span>,colour=<span class="string">"black"</span>,size=<span class="number">6</span>)+</div><div class="line">theme_bw()+</div><div class="line">theme(</div><div class="line"> text=element_text(family=<span class="string">"myFont"</span>),</div><div class="line"> panel.border=element_blank(),</div><div class="line"> panel.grid.major=element_line(linetype=<span class="string">"dashed"</span>),</div><div class="line"> panel.grid.minor=element_blank(),</div><div class="line"> plot.caption=element_text(hjust=<span class="number">0</span>,size=<span class="number">10</span>),</div><div class="line"> axis.title=element_blank()</div><div class="line"> )</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image10.png" alt=""></p>
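<p>Judging from the weekly averages above, Dalian's weekly mean AQI stayed below 100 in most weeks of 2016. Seeing this, life in Dalian feels pretty good. Watching friends in Beijing and Shanghai post their "fairyland" smog photos on WeChat Moments every few days is also rather amusing, haha!</p>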
<hr>
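<p><strong>Contact:</strong><br>wechat:ljty1991<br>Mail:578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team WeChat official account: EasyCharts<br>QQ group: [魔方学院] 298236508</p>
<p><strong>About the author:</strong><br><strong><em>杜雨</em></strong><br>Graduate student in finance and economics;<br>would-be data visualization enthusiast;<br>programming novice with a liberal arts background;<br>I enjoy studying business charts and geographic data visualization, and love tinkering with data visualization tools such as PowerBI, SAP DashBoard, Tableau, R ggplot2, and Think-cell chart. I created and run the WeChat official account "数据小魔方".<br>Mail:578708965@qq.com </p>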
<hr>
<p><strong>License:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://oqdvmreg2.bkt.clouddn.com/dalianweather/image2.png" alt=""></p>
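<p>A few days ago I came across a fun package, openair, which can render a year-long time series as a calendar heatmap. That format seemed a great fit for visualizing a year of air quality, so I found some time to scrape Dalian's 2016 air quality data to play with. The target site's page structure is simple, the scraping went smoothly, and the pages are very regular, so this code could serve as a template. Give it a try if you're interested!<br>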
</summary>
<category term="数据可视化" scheme="http://www.raindu.com/categories/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>Demystifying Office color themes in one article</title>
<link href="http://www.raindu.com/2017/08/29/%E4%B8%80%E7%AF%87%E6%96%87%E7%AB%A0%E6%8F%AD%E5%BC%80office%E9%85%8D%E8%89%B2%E6%A8%A1%E6%9D%BF%E7%9A%84%E7%9A%84%E7%A5%9E%E7%A7%98%E9%9D%A2%E7%BA%B1/"/>
<id>http://www.raindu.com/2017/08/29/一篇文章揭开office配色模板的的神秘面纱/</id>
<published>2017-08-29T11:48:15.000Z</published>
<updated>2017-08-29T12:23:20.076Z</updated>
<content type="html"><![CDATA[<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image.jpg" alt=""></p>
<p>Today I'll show you how to do something fun with R. What exactly? That's a secret for now; read on carefully to find out!</p>
<a id="more"></a>
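<h3 id="背景介绍:"><a href="#背景介绍:" class="headerlink" title="背景介绍:"></a>Background:</h3><p>If you use Office a lot (whether Word, PowerPoint, or Excel), you have probably run into this: a chart left with its default styling looks terrible. Do you know why?</p>
<p>Let me tell you: it's the default color theme Office applies. Frankly, I don't think it looks good (at least not compared with mainstream design palettes). But there isn't much choice, since Microsoft is the only company on this whole planet to have built a full-fledged office suite (even though some components were acquired and Microsoft merely integrated them).</p>
<p>Keep in mind that Microsoft is a technology-driven company (at least compared with a design-driven company like Apple). Expecting programmers who write code all day to design attractive palettes is a bit unrealistic. (It really does come down to founders and company culture: think of how Steve Jobs pushed back against IE and Windows, even building Keynote, a presentation app exclusive to the Mac.)</p>
<p>This clashing, unattractive, tasteless coloring was at its worst in Office 2003 and earlier (though we shouldn't be too harsh; society's overall aesthetic sense has also been improving gradually).</p>
<p>With that said, here is what the official default color scheme looks like on my own setup:</p>
<p>Because the Office components share color themes, here are the entry points in Word, PowerPoint, and Excel for comparison:</p>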
<ul>
<li>Word: Design → Colors → drop-down menu</li>
<li>Excel: Page Layout → Themes → Colors</li>
<li>PowerPoint: View → Slide Master → Background → Colors</li>
</ul>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image1.png" alt=""></p>
<blockquote>
<p>My Office version and system:<br> Windows 10 Pro (64-bit), Office 2016 (64-bit) </p>
</blockquote>
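<h3 id="office配色系统:"><a href="#office配色系统:" class="headerlink" title="office配色系统:"></a>The Office color system:</h3><p>You may now be wondering whether you can define your own color theme. Yes, of course!</p>
<p>The method is simple. As I stressed above, all Office components share many modules, color themes included. So whichever component you define a palette in, it ends up in the suite-wide custom color theme folder. (This holds whether you customize in Word, Excel, PowerPoint, or another Office component: the theme is always saved to the same shared folder, although the application never shows you the path.)</p>
<p>So there is no need to agonize over where to set up custom colors. Let's try it in Excel.</p>
<p>In Excel, open Page Layout → Colors and click Customize Colors at the bottom of the drop-down. The following dialog appears:</p>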
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image2.png" alt=""></p>
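<p>A theme has 12 color slots. The first four are two dark/light pairs of text/background colors. The middle six are named Accent 1 through Accent 6 (typically used for series colors, such as chart series and line series). The last two are the hyperlink colors (the default color and the followed-link color).</p>
<p>Customizing a theme usually means setting colors 3 through 10, since these eight are the ones shown in the first row of the color menus in the application (the font color, shape fill, and line color menus all behave this way). Why colors 1 and 2 are not shown puzzles me too. The two hyperlink colors are rarely used and can be ignored.</p>
<p>Open the drop-down for any slot and you will see a familiar color picker, identical to the font, line, and shape fill color menus in the main window. From here, just pick your colors.</p>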
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image3.png" alt=""></p>
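<p>After defining the colors, click Save, and a new color theme appears in the shared theme folder. It is stored in .xml format. Yes, XML. Here's a secret: an Office document is really a ZIP archive made up of .xml files. If you don't believe it, take any docx, pptx, or xlsx file, change its extension to .zip, and open it with an archive tool to see what is inside. There are surprises in there!</p>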
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image4.png" alt=""></p>
<p>I saved a custom color theme named balalala. And then...</p>
<p>See, I wasn't kidding: define it once and every application can use it.</p>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image5.png" alt=""></p>
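<p>Now for something more challenging. Since Office color themes are .xml files, can we skip defining them by hand in the application and directly generate a theme file that follows the XML syntax?</p>
<p>Of course we can, but you need to know two things:</p>
<p><strong>Where Office stores color themes:</strong></p>
<p>There are two kinds of themes. System themes are installed with the software; their colors are generally not very attractive, and they appear in the second section of the Colors drop-down. Custom themes give you room to use your imagination, and they appear in the first section of the drop-down.</p>
<p>The paths are as follows (taken from my own PC; adapt them for other platforms and setups):</p>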
<ul>
<li>System: C:\Program Files\Microsoft Office\root\Document Themes 16\Theme Colors</li>
<li>Custom: C:\Users\Administrator\AppData\Roaming\Microsoft\Templates\Document Themes\Theme Colors</li>
</ul>
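<p><strong>A basic understanding of XML structure, enough to edit color values.</strong> (Just an understanding; you don't need to be able to write it. In fact it's fine if you can't read it, and I can't write it either. I'll walk you through taking the XML apart step by step.)</p>
<p>Now let's take it apart.</p>
<h3 id="使用R语言的XML包处理office配色模板:"><a href="#使用R语言的XML包处理office配色模板:" class="headerlink" title="使用R语言的XML包处理office配色模板:"></a>Processing Office color themes with R's XML package:</h3><p>First define any custom theme (in the application), then go to the custom theme folder, open the xml file in Notepad (Notepad++ works too), and skim the structure of the XML document:</p>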
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image6.png" alt=""></p>
<p>See, there is nothing mysterious about a theme: the XML tree contains the hexadecimal values of all 12 colors we just defined. (Count how many val attributes there are.)</p>
<p>However, many readers have no XML background, and Notepad cannot highlight XML structure. The text is packed together with no line breaks, so it is hard to make anything out.</p>
<p>So I import it into R and use a proper XML tree tool to take the document apart.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(XML)</div><div class="line"><span class="keyword">library</span>(plyr)</div><div class="line"><span class="keyword">library</span>(dplyr)</div><div class="line"><span class="keyword">library</span>(scales)</div><div class="line">setwd(<span class="string">"C:/Users/Administrator/AppData/Roaming/Microsoft/Templates/Document Themes/Theme Colors"</span>)</div><div class="line">color2<-xmlTreeParse(<span class="string">"balalala.xml"</span>,useInternalNodes=<span class="literal">TRUE</span>)</div><div class="line">color2</div></pre></td></tr></table></figure></p>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image7.png" alt=""></p>
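<p>Clearer now, right? The function above restores the file to its XML tree structure. This is a syntactically complete XML document: the first line is the XML declaration, which states the XML version, the encoding, and whether the document is standalone (that is, whether it depends on an external DTD). The tree has three levels of elements:</p>
<p>The root is the namespaced element a:clrScheme.<br>Second level: a:dk1 ... a:folHlink, exactly 12 child elements, matching the 12 colors defined earlier;<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">colornames<-xmlRoot(color2)%>%xmlChildren()%>%names()</div><div class="line"># [<span class="number">1</span>] <span class="string">"dk1"</span> <span class="string">"lt1"</span> <span class="string">"dk2"</span> <span class="string">"lt2"</span> <span class="string">"accent1"</span> <span class="string">"accent2"</span> <span class="string">"accent3"</span> <span class="string">"accent4"</span> </div><div class="line"># [<span class="number">9</span>] <span class="string">"accent5"</span> <span class="string">"accent6"</span> <span class="string">"hlink"</span> <span class="string">"folHlink"</span></div></pre></td></tr></table></figure></p>
<p>Third level: a:sysClr or a:srgbClr. Each second-level element holds one empty third-level element, and the color we defined is stored in that element's val attribute. In this file dk1 and lt1 use a:sysClr (system colors) while the other ten slots use a:srgbClr, which is why the query below returns ten values rather than twelve.</p>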
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">colorgroup<-getNodeSet(color2,<span class="string">'//a:srgbClr'</span>) <span class="comment"># select all third-level a:srgbClr nodes</span></div><div class="line">colorindex<-laply(colorgroup,xmlGetAttr,name=<span class="string">'val'</span>) <span class="comment"># extract the val attribute of each node</span></div><div class="line">colorindex </div><div class="line"># [<span class="number">1</span>] <span class="string">"44546A"</span> <span class="string">"E7E6E6"</span> <span class="string">"4472C4"</span> <span class="string">"ED7D31"</span> <span class="string">"A5A5A5"</span> <span class="string">"FFC000"</span> <span class="string">"5B9BD5"</span> <span class="string">"70AD47"</span> <span class="string">"0563C1"</span> <span class="string">"954F72"</span></div></pre></td></tr></table></figure>
<p>R only recognizes hexadecimal color values that carry a # prefix, so add the prefix and we can happily view the colors in R.</p>
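<p>To make the three-level structure a little more concrete, here is a minimal sketch that maps each slot name to its hex value. It is not the code used in this post: it swaps in the xml2 package instead of XML, and it parses a trimmed, hypothetical theme string (only four of the twelve slots) so that it runs without access to the Office folders. It also assumes that for a:sysClr elements the usable hex value sits in the lastClr attribute.</p>

```r
library(xml2)

# Trimmed, hypothetical stand-in for a theme-colors file (four of the twelve slots)
theme_txt <- '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a:clrScheme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" name="demo">
  <a:dk1><a:sysClr val="windowText" lastClr="000000"/></a:dk1>
  <a:lt1><a:sysClr val="window" lastClr="FFFFFF"/></a:lt1>
  <a:dk2><a:srgbClr val="44546A"/></a:dk2>
  <a:accent1><a:srgbClr val="4472C4"/></a:accent1>
</a:clrScheme>'

doc   <- read_xml(theme_txt)
slots <- xml_children(doc)   # second level: one element per color slot

# Third level: a:srgbClr keeps the hex value in val, a:sysClr keeps it in lastClr
slot_hex <- character(length(slots))
for (i in seq_along(slots)) {
  clr <- xml_child(slots[[i]], 1)
  slot_hex[i] <- if (xml_name(clr) == "sysClr") xml_attr(clr, "lastClr") else xml_attr(clr, "val")
}
names(slot_hex) <- xml_name(slots)

print(paste0("#", slot_hex))
```

<p>Keeping the slot names attached to the values makes it harder to mix up which color belongs to which slot when several theme files are combined later.</p>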
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">colorindex<-paste0(<span class="string">"#"</span>,colorindex)</div><div class="line">show_col(colorindex,labels=<span class="literal">F</span>)</div></pre></td></tr></table></figure>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image8.png" alt=""></p>
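<p>So tell me, is this palette ugly or not? I didn't change a single color when customizing, so this is simply the Office 2016 default palette. No wonder it's hard to produce a good-looking chart with it.</p>
<p>So could we simply paste some nicer color values into the text file ourselves? It would be pretty neat if that worked. Let's do it.</p>
<h3 id="自制调色板:"><a href="#自制调色板:" class="headerlink" title="自制调色板:"></a>Building your own palette:</h3><p>First find a good online color tool.</p>
<p>This is the one: Adobe Color CC<br><img src="http://or3sddq9r.bkt.clouddn.com/office/image9.png" alt=""></p>
<p>Looks pretty good.</p>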
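<p>The post edits the color values by hand in a text editor. For readers who would rather script that step, the following is a minimal sketch under stated assumptions: it uses the xml2 package (a swap, not the XML package used elsewhere in this post), a trimmed hypothetical theme string with three slots, and three arbitrary example colors. It overwrites the val attribute of each a:srgbClr node and writes the result to a temporary file.</p>

```r
library(xml2)

# Trimmed, hypothetical stand-in for a custom theme file (three slots only)
theme_txt <- '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a:clrScheme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" name="demo">
  <a:dk2><a:srgbClr val="44546A"/></a:dk2>
  <a:lt2><a:srgbClr val="E7E6E6"/></a:lt2>
  <a:accent1><a:srgbClr val="4472C4"/></a:accent1>
</a:clrScheme>'

doc   <- read_xml(theme_txt)
nodes <- xml_find_all(doc, "//a:srgbClr")

# Arbitrary example colors; the theme file stores them without the leading "#"
new_cols <- c("#1B9E77", "#D95F02", "#7570B3")
stopifnot(length(nodes) == length(new_cols))
new_vals <- toupper(sub("^#", "", new_cols))

# xml2 nodes point into the document, so setting the attribute edits doc in place
for (i in seq_along(nodes)) {
  xml_set_attr(nodes[[i]], "val", new_vals[i])
}

out <- file.path(tempdir(), "demo_theme.xml")
write_xml(doc, out)

# Read the file back to confirm the new values were saved
saved <- xml_attr(xml_find_all(read_xml(out), "//a:srgbClr"), "val")
print(saved)
```

<p>With a real theme file, the same idea would apply to all ten a:srgbClr nodes, and the output would go into the custom Theme Colors folder instead of a temporary directory.</p>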
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">setwd(<span class="string">"C:/Users/Administrator/AppData/Roaming/Microsoft/Templates/Document Themes/Theme Colors"</span>)</div><div class="line">color2<-xmlTreeParse(<span class="string">"balalala.xml"</span>,useInternalNodes=<span class="literal">TRUE</span>)</div><div class="line">colorgroup<-getNodeSet(color2,<span class="string">'//a:srgbClr'</span>) </div><div class="line">colorindex<-laply(colorgroup,xmlAttrs,name=<span class="string">'val'</span>) </div><div class="line">colorindex<-paste0(<span class="string">"#"</span>,colorindex)</div><div class="line">show_col(colorindex,labels=<span class="literal">F</span>)</div></pre></td></tr></table></figure>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image13.png" alt=""></p>
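<p>Haha, isn't this palette a bit better? At least it is much better than before!</p>
<p>Next I have a crazy idea: export all of Office's built-in color themes into R and turn them into a kaleidoscope, so everyone can take a good look at the colorful palettes Microsoft's engineers have given us.</p>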
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">setwd(<span class="string">"C:/Program Files/Microsoft Office/root/Document Themes 16/Theme Colors"</span>)</div><div class="line">a<-list.files(<span class="string">"."</span>,pattern = <span class="string">"\\.xml$"</span>) </div><div class="line">a</div><div class="line"># [<span class="number">1</span>] <span class="string">"Aspect.xml"</span> <span class="string">"Blue Green.xml"</span> <span class="string">"Blue II.xml"</span> </div><div class="line"># [<span class="number">4</span>] <span class="string">"Blue Warm.xml"</span> <span class="string">"Blue.xml"</span> <span class="string">"Grayscale.xml"</span> </div><div class="line"># [<span class="number">7</span>] <span class="string">"Green Yellow.xml"</span> <span class="string">"Green.xml"</span> <span class="string">"Marquee.xml"</span> </div><div class="line"># [<span class="number">10</span>] <span class="string">"Median.xml"</span> <span class="string">"Office 2007 - 2010.xml"</span> <span class="string">"Orange Red.xml"</span> </div><div class="line"># [<span class="number">13</span>] <span class="string">"Orange.xml"</span> <span class="string">"Paper.xml"</span> <span class="string">"Red Orange.xml"</span> </div><div class="line"># [<span class="number">16</span>] <span class="string">"Red Violet.xml"</span> <span class="string">"Red.xml"</span> <span class="string">"Slipstream.xml"</span> </div><div class="line"># [<span class="number">19</span>] <span class="string">"Violet II.xml"</span> <span class="string">"Violet.xml"</span> <span class="string">"Yellow Orange.xml"</span> </div><div class="line"># [<span class="number">22</span>] <span class="string">"Yellow.xml"</span></div></pre></td></tr></table></figure>
<p>Yes, these are all the default color themes Microsoft provides (22 in total).<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line">mycolorfile<-c()</div><div class="line">fun<-<span class="keyword">function</span>(i){ </div><div class="line">color2<-xmlTreeParse(a[i],useInternalNodes=<span class="literal">TRUE</span>)</div><div class="line">colornames<-xmlRoot(color2)%>%xmlChildren()%>%names()</div><div class="line">colorgroup<-getNodeSet(color2,<span class="string">'//a:srgbClr'</span>) </div><div class="line">colorindex<-laply(colorgroup,xmlAttrs,name=<span class="string">'val'</span>)</div><div class="line">colorindex<-paste0(<span class="string">"#"</span>,colorindex)</div><div class="line">} </div><div class="line"><span class="keyword">for</span> ( i <span class="keyword">in</span> <span class="number">1</span>:length(a)){</div><div class="line">mycolorfile<-c(mycolorfile,fun(i))</div><div class="line">}</div><div class="line"></div><div class="line">show_col(mycolorfile,labels=<span class="literal">F</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image10.png" alt=""></p>
<p>Well, these are all the default colors Microsoft Office provides. I'll just give you a look and let you judge for yourself ~_~<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">colordata<-mycolorfile[c(-<span class="number">221</span>,-<span class="number">222</span>)]</div><div class="line">dim(colordata)<-c(<span class="number">10</span>,length(a))</div><div class="line">rownames(colordata)<-colornames[c(-<span class="number">1</span>,-<span class="number">2</span>)]</div><div class="line">colnames(colordata)<-sub(<span class="string">"\\.xml$"</span>,<span class="string">""</span>,a)</div><div class="line">colordata<-t(colordata)</div><div class="line">write.table(colordata,<span class="string">"F:/colordata.txt"</span>,sep=<span class="string">" "</span>,row.names=<span class="literal">FALSE</span>,col.names=<span class="literal">TRUE</span>,quote=<span class="literal">FALSE</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image11.jpg" alt=""><br><img src="http://or3sddq9r.bkt.clouddn.com/office/image12.png" alt=""></p>
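<p>The reshaping step above depends on every theme contributing exactly ten values in the same order. As a hedged alternative, the sketch below keeps one named vector per theme and binds them row by row, so the theme and slot names travel with the data. The two palettes are stand-ins generated with <code>gray.colors()</code> and <code>heat.colors()</code>, not real Office themes, and the output path is a temporary file chosen for illustration.</p>

```r
# Slot names for the ten a:srgbClr entries, in file order
slots <- c("dk2", "lt2", paste0("accent", 1:6), "hlink", "folHlink")

# Hypothetical stand-in palettes, one character vector of ten colors per theme
palettes <- list(
  Grayscale = gray.colors(10),
  Heat      = heat.colors(10)
)

# Fail early if any theme does not supply exactly one value per slot
stopifnot(all(lengths(palettes) == length(slots)))

# One row per theme, one column per slot; row names come from the list names
colordata <- do.call(rbind, palettes)
colnames(colordata) <- slots

# Tab-separated output that keeps the theme names as the first column
out <- file.path(tempdir(), "colordata.txt")
write.table(colordata, out, sep = "\t", quote = FALSE, col.names = NA)

# Hex values start with "#", so comment handling must be off when reading back
check <- read.delim(out, row.names = 1, comment.char = "", stringsAsFactors = FALSE)
print(dim(check))
```

<p>The length check is the important part of the design: a theme file with a different number of a:srgbClr nodes stops the script instead of silently shifting every later color into the wrong slot.</p>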
<p>And that's it. Do Office color themes still seem mysterious? Haha, hopefully you learned a lot today. No need to thank me!</p>
<hr>
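<p><strong>Contact:</strong><br>wechat:ljty1991<br>Mail:578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team WeChat official account: EasyCharts<br>QQ group: [魔方学院] 298236508</p>
<p><strong>About the author:</strong><br><strong><em>杜雨</em></strong><br>Graduate student in finance and economics;<br>would-be data visualization enthusiast;<br>programming novice with a liberal arts background;<br>I enjoy studying business charts and geographic data visualization, and love tinkering with data visualization tools such as PowerBI, SAP DashBoard, Tableau, R ggplot2, and Think-cell chart. I created and run the WeChat official account "数据小魔方".</p>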
<p>Mail:578708965@qq.com</p><hr><p><strong>License:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://or3sddq9r.bkt.clouddn.com/office/image.jpg" alt=""></p>
<p>Today I'll show you how to do something fun with R. What exactly? That's a secret for now; read on carefully to find out!</p>
</summary>
<category term="数据可视化" scheme="http://www.raindu.com/categories/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="office" scheme="http://www.raindu.com/tags/office/"/>
</entry>
<entry>
<title>One article to master JSON map data and say goodbye to the SHP era</title>
<link href="http://www.raindu.com/2017/08/28/%E4%B8%80%E7%AF%87%E6%96%87%E7%AB%A0%E6%95%99%E4%BD%A0%E6%90%9E%E5%AE%9AJSON%E7%B4%A0%E6%9D%90%EF%BC%8C%E4%BB%8E%E6%AD%A4%E5%91%8A%E5%88%ABSHP%E6%97%B6%E4%BB%A3/"/>
<id>http://www.raindu.com/2017/08/28/一篇文章教你搞定JSON素材,从此告别SHP时代/</id>
<published>2017-08-28T03:59:58.000Z</published>
<updated>2017-08-29T12:27:24.629Z</updated>
<content type="html"><![CDATA[<p><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image.jpg" alt=""></p>
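<p>I've been posting less often these past few days, not out of laziness but because I was wrestling with a hard problem.</p>
<p>Remember the previous post on visualizing Shandong's fiscal data? Because I had no accurate, up-to-date boundary data for Shandong's county-level cities, I wasted a lot of effort hunting for map data and hit wall after wall, and the data I ended up with was still unsatisfactory.</p>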
<a id="more"></a>
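<p>Compared with JSON, shp data has fallen out of favor overall, whether in production cost, file size, or how quickly it tracks changes to real administrative divisions. JSON map data is lightweight, current, and easy to obtain, and many websites offer it.</p>
<p>But JSON follows JavaScript syntax. Once imported into R, everything is coerced into a mix of nested lists, data.frames, and arrays. Without a solid grasp of R data structures, one look is enough to make you despair.</p>
<h3 id="json数据预览:"><a href="#json数据预览:" class="headerlink" title="json数据预览:"></a>Previewing the JSON data:</h3><p><strong>JSON data opened in Notepad</strong><br><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image2.jpg" alt=""></p>
<p><strong>JSON data opened in R</strong><br><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image3.jpg" alt=""></p>
<p><strong>JSON data rendered in a web page</strong><br><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image4.jpg" alt=""></p>
<p>It is hard to understand, but we have to use it, so we'll crack it however hard it is.</p>
<h3 id="json数据类型:"><a href="#json数据类型:" class="headerlink" title="json数据类型:"></a>JSON data types:</h3><p>A note first: JSON map data comes in two flavors. GeoJSON has a top-level type of FeatureCollection and stores plain longitude/latitude coordinates directly. TopoJSON (top-level type Topology) stores shared arcs, usually as quantized, delta-encoded integers, so its points are not real coordinates and must be decoded before use. Check carefully which one you are downloading.</p>
<p>Here are three sites:<br><a href="http://geojson.io/#map=7/32.064/117.268" target="_blank" rel="external">geojson.io</a></p>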
<blockquote>
<p>Pick whichever of these sites you like. You can also import shp data and convert between formats (including TopoJSON to GeoJSON).<br><a href="http://mapshaper.org/" target="_blank" rel="external">mapshaper</a><br><a href="http://datav.aliyun.com/static/tools/atlas/#&lat=36.416862115300304&lng=117.5701904296875&zoom=7" target="_blank" rel="external">datav</a></p>
</blockquote>
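<p>One quick way to tell the two formats apart before doing any real work is to read the top-level <code>type</code> member. The helper below is a small sketch of that check. The function name <code>json_map_kind</code> and the two tiny JSON strings are made up for illustration; with a downloaded file you would pass the file path instead of a string.</p>

```r
library(jsonlite)

# Classify a JSON map document by its top-level "type" member
json_map_kind <- function(txt) {
  top  <- fromJSON(txt, simplifyVector = FALSE)
  type <- top$type
  if (is.null(type)) {
    return("other")
  }
  if (identical(type, "Topology")) {
    "topojson"
  } else if (type %in% c("FeatureCollection", "Feature")) {
    "geojson"
  } else {
    "other"
  }
}

# Two minimal, hypothetical documents
geo_kind  <- json_map_kind('{"type":"FeatureCollection","features":[]}')
topo_kind <- json_map_kind('{"type":"Topology","objects":{},"arcs":[]}')
print(c(geo_kind, topo_kind))
```

<p>If the result is "topojson", convert the file to GeoJSON first (mapshaper can do this) before applying the extraction code in the rest of this post.</p>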
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><p>Below I use two examples to show how to extract JSON map data:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(<span class="string">"jsonlite"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"ggplot2"</span>)</div><div class="line"><span class="keyword">library</span>(plyr)</div><div class="line"><span class="keyword">library</span>(dplyr)</div><div class="line">setwd(<span class="string">"D:/R/mapdata/City/"</span>)</div></pre></td></tr></table></figure></p>
<p><strong>Extracting the Jinan JSON map data:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">json_data <- fromJSON(<span class="string">"370100.json"</span>)</div><div class="line">city<-json_data$features$properties</div><div class="line">names(city)[<span class="number">2</span>]<-<span class="string">"code"</span></div><div class="line">city$id<-<span class="number">1</span>:nrow(city)</div><div class="line">city$sale<-round(rnorm(nrow(city),<span class="number">100</span>,<span class="number">20</span>),<span class="number">0</span>)</div></pre></td></tr></table></figure></p>
<p>This extracts the names and codes of Jinan's districts and generates a dummy indicator.</p>
<p><strong>Boundary coordinates of Jinan's districts:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">citydata<-json_data$features$geometry$coordinates</div><div class="line">mapdata<-data.frame()</div><div class="line"><span class="keyword">for</span>( i <span class="keyword">in</span> <span class="number">1</span>:length(citydata)){</div><div class="line">citymapdata<-citydata[[i]]</div><div class="line">dim(citymapdata)=c(length(citymapdata)/<span class="number">2</span>,<span class="number">2</span>)</div><div class="line">citymapdata<-data.frame(citymapdata);names(citymapdata)[<span class="number">1</span>:<span class="number">2</span>]<-c(<span class="string">"lon"</span>,<span class="string">"lat"</span>)</div><div class="line">citymapdata$id<-i</div><div class="line">citymapdata$group<-as.numeric(paste0(i,<span class="string">"."</span>,<span class="number">1</span>))</div><div class="line">citymapdata$order<-<span class="number">1</span>:dim(citymapdata)[<span class="number">1</span>]</div><div class="line">mapdata<-rbind(mapdata,citymapdata)</div><div class="line">}</div></pre></td></tr></table></figure></p>
<p>The loop above extracts the boundary longitude/latitude points of each district in Jinan, creates the grouping variable group, records the order of the boundary points within each district, and adds an id variable for merging with the district table.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mymapdata<-merge(mapdata,city)</div></pre></td></tr></table></figure></p>
<p>This merges the boundary points with the district names and the grouping variable (used mainly as the group aesthetic in ggplot).</p>
<p>Because the coordinates of each district's administrative center are unknown, I temporarily use the center of each polygon's bounding range as a stand-in.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">midpos <- <span class="keyword">function</span>(x) mean(range(x,na.rm=<span class="literal">TRUE</span>))</div><div class="line">centres <- ddply(mymapdata,.(name),colwise(midpos,.(lon,lat))) <span class="comment"># assumes the merged data has a "name" column; adjust to your file</span></div></pre></td></tr></table></figure>
<p>The steps above show how to extract from a JSON file the variables needed for a data map (there are three core ones: lon, lat, and group). On their own they are only enough to draw a single-color map, because no indicator has been specified. The reason for extracting each district's code and id first is to merge them with the boundary coordinates later. That way every indicator can be merged into the full boundary-point table, and indicators (continuous or categorical) can drive aesthetic mappings (size, color, shape).</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">ggplot(mymapdata,aes(lon,lat)) + </div><div class="line"> geom_polygon(aes(group=group,fill=sale),colour=<span class="string">"grey95"</span>) +</div><div class="line"> scale_fill_gradient(low=<span class="string">"white"</span>,high=<span class="string">"steelblue"</span>) +</div><div class="line"> geom_text(aes(label=name),data=centres) +</div><div class="line"> theme( </div><div class="line"> panel.grid = element_blank(),</div><div class="line"> panel.background = element_blank(),</div><div class="line"> axis.text = element_blank(),</div><div class="line"> axis.ticks = element_blank(),</div><div class="line"> axis.title = element_blank()</div><div class="line"> )</div></pre></td></tr></table></figure>
<p>Since this post is only about data extraction, I won't show the final figure here.</p>
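<p>Province-level boundary JSON files are considerably more complicated, because many provinces contain cities whose territory is geographically split (Langfang in Hebei and Tongling in Anhui, for example). When R draws polygons, each separate polygon is defined on its own (via the group variable above), and polygons that belong to the same administrative unit are then given the same id. (All our indicator data is keyed to id and has nothing to do with group. Only when no administrative unit in the region is geographically split do the unit-based id and the polygon-based group correspond one to one; otherwise there is no strict correspondence.)</p>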
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">setwd(<span class="string">"D:/R/mapdata/Province/"</span>)</div><div class="line">anhui_data <- fromJSON(<span class="string">"anhui.json"</span>)</div></pre></td></tr></table></figure>
<p>Let's take Anhui's JSON structure as an example:</p>
<p><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image5.jpg" alt=""></p>
<p>We can see that the attribute data, including each unit's center longitude/latitude, sits in the sub-list named properties. First extract the attributes of Anhui's prefecture-level units (code, name).<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">anhui_city_data1<-anhui_data$features$properties[,<span class="number">1</span>:<span class="number">2</span>]</div><div class="line">anhui_city_data2<-anhui_data$features$properties$center</div><div class="line">anhui_city<-cbind(anhui_city_data1,anhui_city_data2)</div><div class="line">names(anhui_city)[<span class="number">2</span>]<-<span class="string">"code"</span></div><div class="line">anhui_city$id<-<span class="number">1</span>:nrow(anhui_city)</div><div class="line">anhui_city$sale<-round(rnorm(nrow(anhui_city),<span class="number">100</span>,<span class="number">20</span>),<span class="number">0</span>)</div></pre></td></tr></table></figure></p>
<p>Here comes the problem: the coordinate data of Anhui's prefecture-level units is not all at the same list depth. Some cities are a single list, while others are a list with several nested sub-lists (this is what was described above: some cities consist of non-adjacent areas, each of which needs its own polygon).<br><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image6.jpg" alt=""></p>
<p>这里写了个自定义函数,具体示意呢,不太好讲,全凭感觉写的,这个还真的看具体情况来分析,如果作为模板使用,换一个省份可能不一定还能用,但是可以作为参考,修修改改也能省不少事儿!<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div></pre></td><td class="code"><pre><div class="line">anhui_map_data<-anhui_data $features$geometry$coordinates</div><div class="line">mapdata1<-data.frame()</div><div class="line">mapdata2<-data.frame()</div><div class="line"><span class="keyword">for</span>( i <span class="keyword">in</span> <span class="number">1</span>:length(anhui_map_data)){</div><div class="line">citymapdata<-anhui_map_data[[i]]</div><div class="line"> <span class="keyword">if</span> (length(citymapdata)<<span class="number">50</span>){</div><div class="line"> <span class="keyword">for</span>(m <span class="keyword">in</span> <span class="number">1</span>:length(citymapdata)){</div><div class="line"> citymapdata1<-data.frame(citymapdata[[m]]);names(citymapdata1)<-c(<span class="string">"lon"</span>,<span class="string">"lat"</span>)</div><div class="line"> citymapdata1$id<-i</div><div class="line"> citymapdata1$group<-as.numeric(paste0(i,<span class="string">"."</span>,m,<span 
class="number">1</span>))</div><div class="line"> citymapdata1$order<-<span class="number">1</span>:dim(citymapdata1)[<span class="number">1</span>]</div><div class="line"> mapdata1<-rbind(mapdata1,citymapdata1)</div><div class="line"> }</div><div class="line"> }<span class="keyword">else</span>{</div><div class="line">dim(citymapdata)=c(length(citymapdata)/<span class="number">2</span>,<span class="number">2</span>)</div><div class="line">citymapdata2<-data.frame(citymapdata);names(citymapdata2)<-c(<span class="string">"lon"</span>,<span class="string">"lat"</span>)</div><div class="line">citymapdata2$id<-i</div><div class="line">citymapdata2$group<-as.numeric(paste0(i,<span class="string">"."</span>,<span class="number">1</span>))</div><div class="line">citymapdata2$order<-<span class="number">1</span>:dim(citymapdata2)[<span class="number">1</span>]</div><div class="line">mapdata2<-rbind(mapdata2,citymapdata2)</div><div class="line"> }</div><div class="line">mydatanew<-rbind(mapdata1,mapdata2)</div><div class="line">}</div><div class="line"></div><div class="line">mydatanew<-arrange(mydatanew,id,order)</div><div class="line">mydatanew_map_data<-merge(mydatanew,anhui_city[,c(-<span class="number">3</span>,-<span class="number">4</span>)],by=<span class="string">"id"</span>)</div><div class="line">ggplot(mydatanew_map_data,aes(lon,lat,group=group,fill=sale))+geom_polygon(col=<span class="string">"white"</span>)+</div><div class="line"> theme( </div><div class="line"> panel.grid = element_blank(),</div><div class="line"> panel.background = element_blank(),</div><div class="line"> axis.text = element_blank(),</div><div class="line"> axis.ticks = element_blank(),</div><div class="line"> axis.title = element_blank()</div><div class="line"> )</div></pre></td></tr></table></figure></p>
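<p>As a rough sketch of the same idea, the helper below flattens one city's coordinates whether they arrive as a single polygon or as a list of polygons. The name flatten_city and the toy coordinates are made up for illustration and are not from the original post; the sketch assumes each polygon is a numeric matrix or array whose first half holds longitudes and whose second half holds latitudes, which is what the dim() reshaping above also relies on.</p>

```r
# Hypothetical helper (not from the original post): flatten one city's
# coordinates into a long data frame that geom_polygon() can consume.
# Assumption: `coords` is either one numeric matrix/array holding a single
# polygon, or a list of such objects (one per disjoint polygon).
flatten_city <- function(coords, id) {
  rings <- if (is.list(coords)) coords else list(coords)
  pieces <- lapply(seq_along(rings), function(m) {
    # First half of the values = longitudes, second half = latitudes
    ring <- matrix(as.numeric(rings[[m]]), ncol = 2)
    data.frame(
      lon   = ring[, 1],
      lat   = ring[, 2],
      id    = id,
      group = paste(id, m, sep = "."),  # one group per polygon
      order = seq_len(nrow(ring)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Made-up toy data: city 1 has one polygon, city 2 has two.
toy <- list(
  matrix(c(117.0, 117.5, 117.2, 31.8, 31.9, 32.2), ncol = 2),
  list(
    matrix(c(118.0, 118.4, 118.1, 30.5, 30.6, 30.9), ncol = 2),
    matrix(c(118.6, 118.9, 118.7, 30.2, 30.3, 30.5), ncol = 2)
  )
)
map_df <- do.call(rbind, lapply(seq_along(toy), function(i) flatten_city(toy[[i]], i)))
print(map_df)
```

<p>Keeping group as a character string sidesteps numeric collisions such as 1.1 and 1.10, so no extra trailing digit is needed to keep polygon ids distinct.</p>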
<p><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image1.jpg" alt=""></p>
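<p>And there it is: the JSON data is fully handled. You will not be able to tell this map apart from one drawn from shp-imported data, because there is no difference (leaving aside any differences in how the two sources compute coordinates). We never used any of the map formatting attributes declared in the shp or JSON files; we only extracted the longitude and latitude values.</p>
<p>In the next post I will go through the variable mapping rules and pitfalls of making data maps with ggplot in detail.</p>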
<hr>
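<p><strong>Contact:</strong><br>WeChat: ljty1991<br>Mail: 578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team official account: EasyCharts<br>QQ group: [魔方学院] 298236508</p>
<p><strong>About the author:</strong><br><strong><em>杜雨</em></strong><br>Graduate student in finance and economics;<br>self-styled data visualization dabbler;<br>programming novice with a liberal-arts background;<br>enjoys business charts and geographic data visualization, and likes tinkering with PowerBI, SAP DashBoard, Tableau, R ggplot2, Think-cell chart and similar visualization tools; founder and operator of the WeChat official account "数据小魔方".<br>Mail: 578708965@qq.com </p>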
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://or3sddq9r.bkt.clouddn.com/json_shp/image.jpg" alt=""></p>
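<p>Posts have been less frequent over the last few days, not out of laziness but because I was working through a hard problem~</p>
<p>Remember the previous post on visualizing Shandong Province fiscal data? Because I had no accurate, up-to-date boundary data for Shandong's county-level cities, I wasted a lot of effort searching for map material, kept hitting walls, and the map data I ended up with was not satisfactory.</p>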
</summary>
<category term="数据可视化" scheme="http://www.raindu.com/categories/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>After tasting despair, I went to Zhihu and scraped a few pictures~</title>
<link href="http://www.raindu.com/2017/08/08/%E7%BB%8F%E5%8E%86%E8%BF%87%E7%BB%9D%E6%9C%9B%E4%B9%8B%E5%90%8E%EF%BC%8C%E9%80%89%E6%8B%A9%E5%8E%BB%E7%9F%A5%E4%B9%8E%E7%88%AC%E4%BA%86%E5%87%A0%E5%BC%A0%E5%9B%BE/"/>
<id>http://www.raindu.com/2017/08/08/经历过绝望之后,选择去知乎爬了几张图/</id>
<published>2017-08-08T15:22:48.000Z</published>
<updated>2017-08-08T15:30:13.617Z</updated>
<content type="html"><![CDATA[<p><img src="http://oqdvmreg2.bkt.clouddn.com/Spyderphoto/image.jpg" alt=""></p>
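<p>Today I meant to show how to batch-scrape the 2016 annual reports of listed companies, but barely past the first lines of code I found that annual reports are really not easy to scrape. I assumed my approach was wrong and tried several different sites.</p>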
<a id="more"></a>
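<p>I could see the PDF documents lined up neatly in the page source and still could not fetch them. Clearly my skills are not there yet. I was about to give up, but doing nothing felt too unrewarding, so I went to Zhihu and scraped a few pictures instead.</p>
<p>I shared Zhihu image-scraping code before, written with rvest; today I switch to the RCurl and XML packages, which counts as something new.</p>
<p>Earlier post: "Scraping web images with R: save pictures efficiently and say goodbye to doing it by hand"</p>
<p>I was afraid that scraping too much would get my IP banned (I read Zhihu every day, so a ban would hurt), so I deliberately picked an outdoor photography thread with only a few pictures.</p>
<p><a href="https://www.zhihu.com/question/31785374/answer/150310292" target="_blank" rel="external">Link to the photography thread</a></p>
<p>The code is only a few lines long and easy to follow, so it works as a learning example:</p>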
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Load packages (plyr before dplyr, so plyr does not mask dplyr functions):</span></div><div class="line"><span class="keyword">library</span>(<span class="string">"RCurl"</span>)</div><div class="line"><span class="keyword">library</span>(XML)</div><div class="line"><span class="keyword">library</span>(stringr)</div><div class="line"><span class="keyword">library</span>(plyr)</div><div class="line"><span class="keyword">library</span>(dplyr)</div></pre></td></tr></table></figure>
<h3 id="爬取过程:"><a href="#爬取过程:" class="headerlink" title="爬取过程:"></a>Scraping process:</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">url<-<span class="string">"https://www.zhihu.com/question/31785374/answer/150310292"</span></div><div class="line"><span class="comment"># Fetch the target page (check the page encoding first)</span></div><div class="line">rd <-getURL(url,.encoding=<span class="string">"UTF-8"</span>)</div><div class="line"><span class="comment"># Parse the page into a tree with the XML package</span></div><div class="line">rdhtml <- htmlParse(rd,encoding=<span class="string">"UTF-8"</span>) </div><div class="line"><span class="comment"># Get the root node</span></div><div class="line">root <- xmlRoot(rdhtml) </div><div class="line"><span class="comment"># Get all img tags in the answer (they hold the image URLs) </span></div><div class="line">Name<-getNodeSet(root,<span class="string">"//div[@class='zm-editable-content clearfix']/img"</span>)</div></pre></td></tr></table></figure>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/Spyderphoto/image1.png" alt=""></p>
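<p>Looking at the contents of the Name list, each img node carries three attributes that hold an image URL. The first, src, is what you see when you open the thread; the other two, data-original and data-actualsrc, are the image's original address, that is, the URL of the large image shown after clicking the picture.</p>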
<p>Here I pick the data-original URL and use the laply function to extract the list of URLs stored in that attribute.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">Name1 <-laply(Name,xmlGetAttr,name=<span class="string">'data-original'</span>)</div><div class="line"><span class="comment"># For naming, keep only the tail of each image URL as the file name</span></div><div class="line">Name2<-sub(<span class="string">"https://pic\\d.zhimg.com/v2-"</span>,<span class="string">""</span>,Name1)</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/Spyderphoto/image2.png" alt=""></p>
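<p>To make the naming step concrete, here is a small sketch on two made-up URLs that merely have the same shape as the ones in the post. It shows the sub() approach with the dots escaped, plus basename() as a possible alternative that does not depend on the host name.</p>

```r
# Illustrative URLs (made up; only their shape follows the post)
img_urls <- c(
  "https://pic1.zhimg.com/v2-aaa111_r.jpg",
  "https://pic4.zhimg.com/v2-bbb222_r.png"
)

# Same idea as the sub() call above, anchored and with the dots escaped
name_sub <- sub("^https://pic\\d\\.zhimg\\.com/v2-", "", img_urls)

# Alternative: keep everything after the last slash, whatever the host is
name_base <- basename(img_urls)

print(name_sub)
print(name_base)
```

<p>basename() keeps the v2- prefix, which is harmless for file names and keeps working if the image host changes.</p>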
<p><strong>Download process</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Create the output folder (including missing parent folders)</span></div><div class="line">dir.create(<span class="string">"D:/R/Image/zhihu/image"</span>, recursive = <span class="literal">TRUE</span>)</div><div class="line"><span class="comment"># Batch download with a for loop:</span></div><div class="line"><span class="keyword">for</span>(i <span class="keyword">in</span> seq_along(Name1)){</div><div class="line"> download.file(Name1[i],paste0(<span class="string">"D:/R/Image/zhihu/image/"</span>,Name2[i]), mode = <span class="string">"wb"</span>)</div><div class="line">}</div></pre></td></tr></table></figure></p>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/Spyderphoto/image3.png" alt=""></p>
<h3 id="最终收获:"><a href="#最终收获:" class="headerlink" title="最终收获:"></a>Final haul:</h3><p><img src="http://oqdvmreg2.bkt.clouddn.com/Spyderphoto/image4.png" alt=""></p>
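<h3 id="爬图的核心要点:"><a href="#爬图的核心要点:" class="headerlink" title="爬图的核心要点:"></a>Key points for scraping images:</h3><ul>
<li>Grab the image URLs under img. You need to locate the right spot in the HTML structure quickly; whether you use CSS selectors or XPath, be steady, accurate and decisive. This is the first task, and it determines how the whole process goes.</li>
<li>Set up the batch download: a for loop or any other vectorized function will do; with many images, try the apply family or the upgraded apply-style functions in the plyr package.</li>
<li>If the page structure is complex and there are many images, consider disguising the request headers and adding random pauses to avoid an IP ban.</li>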
</ul>
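<p>The random-pause advice can be sketched roughly as follows. The function polite_download and its arguments are hypothetical names introduced for this illustration; the fetch argument defaults to download.file() but can be swapped for a stand-in, which is how the dry run below avoids any network access. Header disguising is not covered by this sketch.</p>

```r
# Hypothetical helper: download URLs one by one, pausing a random interval
# between requests and skipping files that fail instead of aborting.
# `fetch` defaults to download.file(); it is a parameter so the loop can be
# tried with a stand-in function and no network access.
polite_download <- function(urls, dest_dir, fetch = utils::download.file,
                            min_wait = 1, max_wait = 3) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, basename(urls[i]))
    ok[i] <- tryCatch({
      fetch(urls[i], dest, mode = "wb")
      TRUE
    }, error = function(e) {
      message("skipped ", urls[i], ": ", conditionMessage(e))
      FALSE
    })
    Sys.sleep(runif(1, min_wait, max_wait))  # random pause between requests
  }
  ok
}

# Dry run with a stand-in fetcher that only creates an empty file
fake_fetch <- function(url, destfile, mode) {
  if (grepl("bad", url)) stop("simulated failure")
  file.create(destfile)
}
out_dir <- file.path(tempdir(), "zhihu_demo")
res <- polite_download(
  c("https://example.com/a.jpg", "https://example.com/bad.jpg"),
  out_dir, fetch = fake_fetch, min_wait = 0, max_wait = 0
)
print(res)
```

<p>Returning a logical vector makes it easy to retry only the failed URLs afterwards.</p>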
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://oqdvmreg2.bkt.clouddn.com/Spyderphoto/image.jpg" alt=""></p>
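<p>Today I meant to show how to batch-scrape the 2016 annual reports of listed companies, but barely past the first lines of code I found that annual reports are really not easy to scrape. I assumed my approach was wrong and tried several different sites.</p>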
</summary>
<category term="网络爬虫" scheme="http://www.raindu.com/categories/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="网络爬虫" scheme="http://www.raindu.com/tags/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
</entry>
<entry>
<title>Two ridiculously cool chart fonts for you~</title>
<link href="http://www.raindu.com/2017/08/05/%E9%80%81%E4%BD%A0%E4%B8%A4%E6%AC%BE%E7%82%AB%E9%85%B7%E5%88%B0%E6%B2%A1%E6%9C%8B%E5%8F%8B%E7%9A%84%E7%A5%9E%E5%A5%87%E5%AD%97%E4%BD%93/"/>
<id>http://www.raindu.com/2017/08/05/送你两款炫酷到没朋友的神奇字体/</id>
<published>2017-08-05T10:46:38.000Z</published>
<updated>2017-08-05T11:07:00.159Z</updated>
<content type="html"><![CDATA[<p><img src="http://oqdvmreg2.bkt.clouddn.com/creativefont/image.png" alt=""></p>
<a id="more"></a>
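<p>Today I introduce two fonts, developed by a design-loving senior schoolmate specifically as mini chart fonts.</p>
<p>They work on almost any device that supports font rendering, and of course in Excel too. Here I use R to show how to build mini infographics with chart fonts and enrich the way data is presented.</p>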
<blockquote>
<p>PieChart # mini percentage pie chart<br><img src="http://oqdvmreg2.bkt.clouddn.com/creativefont/image1.jpg" alt=""></p>
<p>BlockChart # mini stacked-block percentage chart<br><img src="http://oqdvmreg2.bkt.clouddn.com/creativefont/image2.jpg" alt=""></p>
</blockquote>
<p>Project home page: <a href="http://9ishare.cc/" target="_blank" rel="external">9ishare</a></p>
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><p><strong>Make sure both fonts are installed on the system before running the code below:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(<span class="string">"ggplot2"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"showtext"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"Cairo"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"ggthemes"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"dplyr"</span>)</div></pre></td></tr></table></figure></p>
<h3 id="导入这两款字体:"><a href="#导入这两款字体:" class="headerlink" title="导入这两款字体:"></a>Importing the two fonts:</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">font.add(<span class="string">"BlockCharts"</span>,<span class="string">"BlockCharts.ttf"</span>)</div><div class="line">font.add(<span class="string">"PieChart"</span>,<span class="string">"PieCharts.ttf"</span>)</div></pre></td></tr></table></figure>
<h3 id="构造数据:"><a href="#构造数据:" class="headerlink" title="构造数据:"></a>Building the data:</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Data for the mini pie charts:</span></div><div class="line">PieChart<-data.frame(x=rep(<span class="number">1</span>:<span class="number">5</span>,<span class="number">2</span>),y=rep(<span class="number">2</span>:<span class="number">3</span>,each=<span class="number">5</span>),value=round(runif(<span class="number">10</span>,<span class="number">0</span>,<span class="number">1</span>),<span class="number">2</span>),class=rep(c(<span class="string">"A"</span>,<span class="string">"B"</span>),each=<span class="number">5</span>))</div></pre></td></tr></table></figure>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Map each value to a glyph of the mini pie font:</span></div><div class="line">char1<-<span class="string">"A0F1K2P3U4Z5e6j7o8t9y"</span>%>%strsplit(<span class="string">""</span>)%>%unlist</div><div class="line">char2<-<span class="string">"BCDEGHIJLMNOQRSTVWXYabcdfghiklmnpqrsuvwx"</span>%>%strsplit(<span class="string">""</span>)%>%unlist</div><div class="line">PieChart$label<-ifelse((<span class="number">100</span>*PieChart$value)%%<span class="number">5</span>==<span class="number">0</span>,char1[PieChart$value*<span class="number">20</span>+<span class="number">1</span>],char2[PieChart$value*<span class="number">40</span>+<span class="number">1</span>])</div></pre></td></tr></table></figure>
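<p>Because the values are decimal fractions, floating-point arithmetic can leave a test such as (100*value) %% 5 == 0, or a fractional index such as value*20+1, a hair off the intended whole number. A possible variant of the lookup, sketched below, converts to whole-number percentages first. The helper name pie_glyph is made up for this illustration; the two glyph strings are the ones from the code above, and what each glyph looks like is determined by the font, not by this code.</p>

```r
char1 <- unlist(strsplit("A0F1K2P3U4Z5e6j7o8t9y", ""))
char2 <- unlist(strsplit("BCDEGHIJLMNOQRSTVWXYabcdfghiklmnpqrsuvwx", ""))

# Hypothetical helper: same lookup as above, done on integer percentages
pie_glyph <- function(value) {
  pct <- as.integer(round(100 * value))      # e.g. 0.35 -> 35L
  ifelse(pct %% 5L == 0L,
         char1[pct %/% 5L + 1L],             # multiples of 5%: 21 glyphs
         char2[(pct * 2L) %/% 5L + 1L])      # other values: 40 glyphs
}

glyphs <- pie_glyph(c(0, 0.05, 0.03, 0.99, 1))
print(glyphs)
```

<p>The integer index (pct * 2) %/% 5 + 1 reproduces the truncation that value*40+1 performs implicitly when used as a subscript.</p>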
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Draw the mini pie charts (value is a fraction, so the label shows 100*value):</span></div><div class="line">setwd(<span class="string">"E:/微信公众号/公众号——数据小魔方/2017年8月/20170805/"</span>)</div><div class="line">CairoPNG(file=<span class="string">"PieChart.png"</span>,width=<span class="number">1000</span>,height=<span class="number">750</span>)</div><div class="line">showtext.begin()</div><div class="line">ggplot(PieChart,aes(x,y))+</div><div class="line">geom_text(aes(label=label,colour=class),hjust=<span class="number">1</span>,family=<span class="string">"PieChart"</span>,size=<span class="number">45</span>)+</div><div class="line">geom_text(aes(y=y+<span class="number">.35</span>,label=paste0(<span class="number">100</span>*value,<span class="string">"%"</span>)),hjust=<span class="number">.5</span>,size=<span class="number">7</span>,colour=<span class="string">"#C10000"</span>)+</div><div class="line">scale_colour_manual(values=c(<span class="string">"#92D24F"</span>,<span class="string">"#FFC000"</span>),guide=<span class="literal">FALSE</span>)+</div><div class="line">ylim(<span class="number">1.5</span>,<span class="number">3.5</span>)+</div><div class="line">xlim(<span class="number">.5</span>,<span class="number">5.5</span>)+</div><div class="line">theme_void()</div><div class="line">showtext.end()</div><div class="line">dev.off()</div></pre></td></tr></table></figure>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/creativefont/image3.png" alt=""></p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Map each value to a glyph of the mini stacked-block font (round() guards against floating-point error in the index):</span></div><div class="line">char3<-<span class="string">"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz①②③④⑤⑥⑦⑧⑨七三上下九二八六十千口土大天太女子山工干平开心才文方无日木四"</span>%>%strsplit(<span class="string">""</span>)%>%unlist</div><div class="line">PieChart$label2<-char3[round(PieChart$value*<span class="number">100</span>)+<span class="number">1</span>]</div></pre></td></tr></table></figure>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Draw the stacked-block percentage charts (value is a fraction, so the label shows 100*value):</span></div><div class="line">CairoPNG(file=<span class="string">"BlockCharts.png"</span>,width=<span class="number">1000</span>,height=<span class="number">750</span>)</div><div class="line">showtext.begin()</div><div class="line">ggplot(PieChart,aes(x,y))+</div><div class="line">geom_text(aes(label=label2,colour=class),hjust=<span class="number">.5</span>,family=<span class="string">"BlockCharts"</span>,size=<span class="number">45</span>)+</div><div class="line">geom_text(aes(y=y+<span class="number">.35</span>,label=paste0(<span class="number">100</span>*value,<span class="string">"%"</span>)),hjust=<span class="number">.5</span>,size=<span class="number">7</span>,colour=<span class="string">"#C10000"</span>)+</div><div class="line">scale_colour_manual(values=c(<span class="string">"#92D24F"</span>,<span class="string">"#FFC000"</span>),guide=<span class="literal">FALSE</span>)+</div><div class="line">ylim(<span class="number">1.5</span>,<span class="number">3.5</span>)+</div><div class="line">xlim(<span class="number">0.5</span>,<span class="number">5.5</span>)+</div><div class="line">theme_void()</div><div class="line">showtext.end()</div><div class="line">dev.off()</div></pre></td></tr></table></figure>
<p><img src="http://oqdvmreg2.bkt.clouddn.com/creativefont/image4.png" alt=""></p>
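<p>Feels like magic, doesn't it? Fonts really can be used this way in R. Without exaggeration, any font registered with the system can be played with like this in R. Remember the earlier post with the font-based map of China? It was made the same way!</p>
<p>Earlier post: "Challenging the impossible: a ring-shaped font map in ggplot"</p>
<p>I look forward to seeing the creative charts you make with these fonts!</p>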
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://oqdvmreg2.bkt.clouddn.com/creativefont/image.png" alt=""></p>
</summary>
<category term="信息图" scheme="http://www.raindu.com/categories/%E4%BF%A1%E6%81%AF%E5%9B%BE/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
<category term="信息图" scheme="http://www.raindu.com/tags/%E4%BF%A1%E6%81%AF%E5%9B%BE/"/>
</entry>
<entry>
<title>When everyone is talking about 金刚狼3 (Logan), what are they actually saying?~</title>
<link href="http://www.raindu.com/2017/07/30/%E5%BD%93%E5%A4%A7%E5%AE%B6%E9%83%BD%E5%9C%A8%E8%AE%A8%E8%AE%BA%E9%87%91%E5%88%9A%E7%8B%BC3%E7%9A%84%E6%97%B6%E5%80%99%EF%BC%8C%E4%BB%96%E4%BB%AC%E5%88%B0%E5%BA%95%E5%9C%A8%E8%AF%B4%E4%BA%9B%E4%BB%80%E4%B9%88/"/>
<id>http://www.raindu.com/2017/07/30/当大家都在讨论金刚狼3的时候,他们到底在说些什么/</id>
<published>2017-07-30T04:01:52.000Z</published>
<updated>2017-07-30T04:30:55.374Z</updated>
<content type="html"><![CDATA[<p><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/%E5%BE%AE%E4%BF%A1%E9%A6%96%E5%9B%BE.png" alt=""></p>
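<p>The recent 金刚狼3 (Logan) has really given fans their fill. It is a sad film about a hero past his prime, yet alongside the regret, people are full of respect and affection for Wolverine's farewell performance. It holds a Douban rating of 8.5 and has around 40,000 short comments. Here I take the better-quality reviews and present them as a word cloud, to see what labels people attach to Wolverine.</p>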
<a id="more"></a>
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><p><strong>All packages used for data collection in this post:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(XML)</div><div class="line"><span class="keyword">library</span>(Rwordseg)</div><div class="line"><span class="keyword">library</span>(wordcloud2)</div><div class="line"><span class="keyword">library</span>(RCurl)</div><div class="line"><span class="keyword">library</span>(stringr)</div><div class="line"><span class="keyword">library</span>(plyr)</div><div class="line"><span class="keyword">library</span>(dplyr)</div></pre></td></tr></table></figure></p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">url<-<span class="string">"https://movie.douban.com/subject/25765735/reviews?rating=&start=0"</span></div><div class="line">baseurl<-<span class="string">"https://movie.douban.com/subject/25765735/reviews"</span></div></pre></td></tr></table></figure>
<p><strong>Build a function that collects review URLs</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Build the URLs of the Douban review list pages (start = 0, 20, ..., 1460)</span></div><div class="line">urln<-paste0(baseurl,<span class="string">"?rating=&start="</span>,<span class="number">20</span>*(<span class="number">0</span>:<span class="number">73</span>)) </div><div class="line">fun<-<span class="keyword">function</span>(urln){ </div><div class="line">rd <- getURL(urln,.encoding=<span class="string">"UTF-8"</span>) <span class="comment"># fetch the page</span></div><div class="line">rdhtml <- htmlParse(rd,encoding=<span class="string">"UTF-8"</span>) <span class="comment"># parse the page</span></div><div class="line">root <- xmlRoot(rdhtml) <span class="comment"># get the root node</span></div><div class="line">page<-getNodeSet(root,<span class="string">"//div[@class='review-short']/div[@class='short-content']/a"</span>)</div><div class="line"><span class="comment"># select the target nodes</span></div><div class="line">pagevalue<-unique(laply(page,xmlGetAttr,name=<span class="string">'href'</span>)) </div><div class="line"> <span class="comment"># extract the attribute value of each node (here, the review URL)</span></div><div class="line">}</div></pre></td></tr></table></figure></p>
<p><strong>Loop over the pages with a vectorized function</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">pagefull<-sapply(urln,fun) </div><div class="line"><span class="comment"># Flatten the list into a vector</span></div><div class="line">pagefullnew<-unlist(pagefull,use.names =<span class="literal">F</span>)</div></pre></td></tr></table></figure></p>
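<p>The paging and flattening steps can be tried offline with a stand-in for the scraping function. In the sketch below, fake_fun is a made-up replacement that returns invented links, so only the URL construction and the list-to-vector step are illustrated; nothing is fetched from Douban.</p>

```r
baseurl <- "https://movie.douban.com/subject/25765735/reviews"

# 74 list pages, 20 reviews per page: start = 0, 20, ..., 1460
starts <- seq(0, by = 20, length.out = 74)
urln <- paste0(baseurl, "?rating=&start=", starts)

# Stand-in for the scraping function: returns a made-up number of links
# per page so the flattening step can be seen without network access.
fake_fun <- function(u) {
  n <- if (grepl("start=0$", u)) 2 else 1
  paste0(u, "#review", seq_len(n))
}

pagefull <- lapply(urln[1:3], fake_fun)            # always returns a list
pagefullnew <- unlist(pagefull, use.names = FALSE)  # one flat character vector
print(pagefullnew)
```

<p>lapply() is used here because sapply() simplifies its result to a matrix whenever every page returns the same number of links, whereas lapply() always yields a list for unlist() to flatten.</p>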
<p><strong>Build a function that fetches the review text</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">func<-<span class="keyword">function</span>(page){</div><div class="line">rd <- getURL(page,.encoding=<span class="string">"UTF-8"</span>) <span class="comment"># fetch the page</span></div><div class="line">rdhtml<-htmlParse(rd,encoding=<span class="string">"UTF-8"</span>) <span class="comment"># parse the page</span></div><div class="line">root<-xmlRoot(rdhtml) <span class="comment"># get the root node</span></div><div class="line">ly<-getNodeSet(root,<span class="string">"//div[@class='main-bd']/div[@id='link-report']/div[@property='v:description']/p"</span>)</div><div class="line"><span class="comment"># select the target nodes</span></div><div class="line">value<- laply(ly,xmlValue,trim=<span class="literal">T</span>)</div><div class="line"> <span class="comment"># extract the text of each node (here, the review text)</span></div><div class="line">}</div></pre></td></tr></table></figure></p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Loop over the review pages with a vectorized function</span></div><div class="line">valuefull<-sapply(pagefullnew,func)</div><div class="line"><span class="comment"># Flatten the list into a vector</span></div><div class="line">valuefullnew<-unlist(valuefull,use.names =<span class="literal">F</span>)</div></pre></td></tr></table></figure>
<h3 id="文本分词处理过程"><a href="#文本分词处理过程" class="headerlink" title="文本分词处理过程"></a>Word segmentation</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">myrevieww<-valuefullnew</div><div class="line">thewords <- segmentCN(myrevieww,nature=<span class="literal">T</span>)%>%unlist()</div><div class="line">thewords <- gsub(<span class="string">"[a-z]|\\."</span>, <span class="string">""</span>, thewords)</div><div class="line">thewords<-thewords[nchar(thewords)><span class="number">1</span>]</div></pre></td></tr></table></figure>
<p><strong>Define stop words for film reviews</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">invalid.words <- c(<span class="string">"电影"</span>, <span class="string">"演员"</span>, <span class="string">"导演"</span>, <span class="string">"我们"</span>, <span class="string">"他们"</span>, <span class="string">"一个"</span>, <span class="string">"没有"</span>,</div><div class="line"> <span class="string">"所以"</span>, <span class="string">"可以"</span>, <span class="string">"影片"</span>, <span class="string">"但是"</span>, <span class="string">"因为"</span>, <span class="string">"什么"</span>, <span class="string">"自己"</span>,</div><div class="line"> <span class="string">"这个"</span>, <span class="string">"故事"</span>, <span class="string">"最后"</span>, <span class="string">"这样"</span>, <span class="string">"觉得"</span>, <span class="string">"为了"</span>, <span class="string">"一部"</span>,</div><div class="line"> <span class="string">"这部"</span>, <span class="string">"片子"</span>, <span class="string">"其实"</span>, <span class="string">"当然"</span>, <span class="string">"时候"</span>, <span class="string">"看到"</span>, <span class="string">"已经"</span>,</div><div class="line"> <span class="string">"这种"</span>, <span class="string">"知道"</span>, <span class="string">"这些"</span>, <span class="string">"一样"</span>, <span class="string">"如果"</span>, <span class="string">"观众"</span>, <span class="string">"人物"</span>,</div><div class="line"> <span class="string">"开始"</span>, <span class="string">"那么"</span>, <span class="string">"那个"</span>, <span class="string">"可能"</span>, <span
class="string">"情节"</span>, <span class="string">"结局"</span>, <span class="string">"结尾"</span>,</div><div class="line"> <span class="string">"风格"</span>, <span class="string">"节奏"</span>, <span class="string">"剧情"</span>, <span class="string">"有点"</span>, <span class="string">"终于"</span>, <span class="string">"之后"</span>, <span class="string">"怎么"</span>,</div><div class="line"> <span class="string">"一种"</span>, <span class="string">"出现"</span>, <span class="string">"作品"</span>, <span class="string">"地方"</span>, <span class="string">"本片"</span>, <span class="string">"一些"</span>, <span class="string">"一定"</span>,</div><div class="line"> <span class="string">"之前"</span>, <span class="string">"还是"</span>, <span class="string">"虽然"</span>, <span class="string">"这么"</span>, <span class="string">"角色"</span>, <span class="string">"这么"</span>, <span class="string">"不过"</span>,</div><div class="line"> <span class="string">"类型"</span>, <span class="string">"以为"</span>, <span class="string">"显得"</span>, <span class="string">"还是"</span>, <span class="string">"算是"</span>, <span class="string">"东西"</span>, <span class="string">"有些"</span>)</div><div class="line">theflags <- thewords %<span class="keyword">in</span>% invalid.words</div><div class="line">thewords<-thewords[!theflags]</div><div class="line">reviewdata<-table(thewords)%>%as.data.frame(stringsAsFactors = <span class="literal">FALSE</span>)%>% arrange(desc(Freq))</div><div class="line">reviewdata$thewords[<span class="number">1</span>]<-<span class="string">"金刚狼"</span></div></pre></td></tr></table></figure></p>
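<p>The cleaning, stop-word filtering and counting steps can be followed on a handful of tokens. The sketch below uses made-up tokens in place of the real segmenter output and a two-word stop list, and it builds the frequency table with base R only rather than with the dplyr pipeline used above.</p>

```r
# Made-up tokens standing in for the output of the word segmenter
thewords <- c("金刚狼", "电影", "狼叔", "a", "金刚狼", "我们",
              "英雄", "的", "狼叔", "金刚狼")
invalid.words <- c("电影", "我们")   # shortened stop list for the demo

thewords <- gsub("[a-z]|\\.", "", thewords)          # strip lowercase Latin letters and dots
thewords <- thewords[nchar(thewords) > 1]            # drop empty and one-character tokens
thewords <- thewords[!thewords %in% invalid.words]   # drop stop words

# Frequency table, most frequent word first
freq <- sort(table(thewords), decreasing = TRUE)
reviewdata <- data.frame(thewords = names(freq), Freq = as.integer(freq),
                         stringsAsFactors = FALSE)
print(reviewdata)
```

<p>Extending invalid.words with common adverbs and conjunctions is the simplest way to keep filler words out of the final cloud.</p>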
<h3 id="词云可视化:"><a href="#词云可视化:" class="headerlink" title="词云可视化:"></a>Word cloud visualization:</h3><p><strong>Using the word cloud package</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">wordcloud2(reviewdata[<span class="number">1</span>:<span class="number">1000</span>,],color = <span class="string">"random-light"</span>,minSize=<span class="number">.5</span>,size=<span class="number">1</span>,backgroundColor = <span class="string">"dark"</span>,minRotation = -pi/<span class="number">6</span>, maxRotation = -pi/<span class="number">6</span>,fontFamily =<span class="string">"微软雅黑"</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/image1.png" alt=""><br><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/image2.png" alt=""></p>
<p><strong>Export the word frequencies</strong></p>
<p>write.table(reviewdata, file = "D:/R/File/reviewdata.csv", sep = ",", row.names = FALSE)</p>
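<p>A portable variant of the export is sketched below. It writes to a temporary folder, which is an assumption made only so the example runs anywhere, and it states the file encoding explicitly so that the Chinese words survive the round trip; the two rows are made up.</p>

```r
# Made-up rows standing in for the real frequency table
reviewdata <- data.frame(thewords = c("金刚狼", "狼叔"), Freq = c(3L, 2L),
                         stringsAsFactors = FALSE)

# Assumed output location: a temporary folder instead of a fixed drive path
out_file <- file.path(tempdir(), "reviewdata.csv")

write.csv(reviewdata, file = out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to check what was written
back <- read.csv(out_file, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
print(back)
```

<p>file.path() with forward slashes avoids the backslash escaping that Windows paths otherwise need inside R strings.</p>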
<p><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/image3.png" alt=""></p>
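<p>To present fans' labels for Wolverine's farewell film even better, I use the well-known online word cloud platform tagul (<a href="https://tagul.com/cloud/1" target="_blank" rel="external">https://tagul.com/cloud/1</a>) to make two word clouds.</p>
<p>Because the platform does not support Chinese, the words have to be translated into English first. Since my earlier results from calling the Youdao dictionary from R were not good enough, I used Excel's built-in online translation function to translate them in Excel and then imported the result into the online platform.</p>
<p><strong>Final word clouds:</strong></p>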
<p><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/image4.png" alt=""><br><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/image5.png" alt=""></p>
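<p>Because I do not know much about choosing stop words for film reviews, many adverbs were left in, so the final result is not quite perfect. Still, this is a first attempt and I will improve it over time!</p>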
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://orqx7zr30.bkt.clouddn.com/goldwolf/%E5%BE%AE%E4%BF%A1%E9%A6%96%E5%9B%BE.png" alt=""></p>
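<p>The recent Wolverine 3 (Logan) has been a real treat. It is a sad film about an aging hero past his prime, yet alongside the regret, audiences are full of respect and affection for Wolverine's final outing. It holds a Douban rating of 8.5 and has roughly 40,000 short reviews. Here I take a sample of the better-quality ones and present them as word clouds, to see which labels people attach to Wolverine.</p>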
</summary>
<category term="网络爬虫" scheme="http://www.raindu.com/categories/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="网络爬虫" scheme="http://www.raindu.com/tags/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
</entry>
<entry>
<title>How to call Youdao Translate elegantly from R</title>
<link href="http://www.raindu.com/2017/07/28/%E6%95%99%E4%BD%A0%E5%A6%82%E4%BD%95%E4%BC%98%E9%9B%85%E7%9A%84%E7%94%A8R%E8%AF%AD%E8%A8%80%E8%B0%83%E7%94%A8%E6%9C%89%E9%81%93%E7%BF%BB%E8%AF%91/"/>
<id>http://www.raindu.com/2017/07/28/教你如何优雅的用R语言调用有道翻译/</id>
<published>2017-07-28T09:09:45.000Z</published>
<updated>2017-07-28T14:48:14.479Z</updated>
<content type="html"><![CDATA[<p><img src="http://orm967kgl.bkt.clouddn.com/youdao/image5.png" alt=""></p>
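<p>I recently came across an interesting package: an R enthusiast has written an interface between R and Youdao's online translation service. Presumably the author had also had enough of opening a web page and hammering the keyboard to look up words every day, and decided to build a tool instead.</p>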
<a id="more"></a>
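<h3 id="函数简介:"><a href="#函数简介:" class="headerlink" title="函数简介:"></a>Function overview:</h3><p>That attitude is a lot like mine, haha, but I don't yet have the confidence to just go and do it, and I'm not at the level of developing packages. Once I learn how, I'll be sure to contribute a few time-saving tools of my own!</p>
<p>Below is the code. I show two approaches. The first combines the package's lookup function with a for loop, which is the clumsy way. The second makes small adjustments to the source of the package's wrapper functions so that the output is tidier.</p>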
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(<span class="string">"RYoudaoTranslate"</span>)</div></pre></td></tr></table></figure>
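<p>Calling the Youdao translation service online requires first registering, free of charge, as a developer on the Youdao Dictionary open platform to obtain an API key and account name with a limited quota of 6000 calls per day. The registration page is:</p>
<p><a href="http://fanyi.youdao.com/openapi?path=data-mode" target="_blank" rel="external">Registration page</a></p>
<p>The package's official documentation provides the free credentials below. I tried them and they still work. The author is generous: not only a handy tool, but a shared free account as well.</p>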
<blockquote>
<p>apikey = "498375134"<br>keyfrom = "JustForTestYouDao"</p>
</blockquote>
<p>Five arbitrary words to test with:</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">word<-c(<span class="string">"weather"</span>,<span class="string">"father"</span>,<span class="string">"apple"</span>,<span class="string">"like"</span>,<span class="string">"hate"</span>)</div></pre></td></tr></table></figure>
<h3 id="调用过程:"><a href="#调用过程:" class="headerlink" title="调用过程:"></a>Calling the API:</h3><p><strong>Method 1:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">url<-paste(<span class="string">"http://fanyi.youdao.com/openapi.do?keyfrom=JustForTestYouDao&key=498375134&type=data&doctype=json&version=1.1&q="</span>,word,sep=<span class="string">""</span>)</div><div class="line">url<-youdaoUrl(word=word,api=<span class="string">"498375134"</span>,keyfrom=<span class="string">"JustForTestYouDao"</span>)</div><div class="line"><span class="comment">#The two statements above are equivalent.</span></div><div class="line">Res<-c()</div><div class="line"><span class="keyword">for</span>( i <span class="keyword">in</span> word){</div><div class="line"> Res[i] = youdaoLookUp(i,api=<span class="string">"282671603"</span>,keyfrom=<span class="string">"fy1991--421fy"</span>)</div><div class="line"> }</div></pre></td></tr></table></figure></p>
<p>The for loop above, combined with the package's lookup function, yields a vector of lookup results.</p>
<p><img src="http://orm967kgl.bkt.clouddn.com/youdao/image1.jpg" alt=""></p>
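<p>I find this unfiltered output not very friendly, so I looked at the wrapper functions in the package source and made a few small changes. That is Method 2 below.</p>
<p><strong>Method 2:</strong></p>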
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">youdaoUrl = <span class="keyword">function</span>(word){</div><div class="line">paste(<span class="string">"http://fanyi.youdao.com/openapi.do?keyfrom=fy1991--421fy&key=282671603&type=data&doctype=json&version=1.1&q="</span>,word,sep=<span class="string">""</span>)</div><div class="line">}</div><div class="line">youdaoTranslate<-<span class="keyword">function</span>(word){</div><div class="line"> url = getURL(youdaoUrl(word))</div><div class="line"> obj = fromJSON(url) </div><div class="line"> result=paste0(obj$web[[<span class="number">1</span>]]$value,collapse=<span class="string">";"</span>)</div><div class="line"> <span class="keyword">return</span>(result)</div><div class="line">}</div></pre></td></tr></table></figure>
<p>The code above defines two functions: one builds the query URL for a word, and the other returns the lookup result.</p>
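<p>The URL helper above pastes the word into the query string as it is, which is fine for single English words. For phrases with spaces or other reserved characters, percent-encoding the query first is probably safer. A minimal sketch using only base R follows; build_query_url is a hypothetical name of mine, not a function from the package, and nothing here contacts the service.</p>

```r
# Sketch of a query-URL builder; build_query_url is a hypothetical helper name
build_query_url <- function(word, keyfrom, key) {
  # Percent-encode the query so spaces and reserved characters survive in the URL
  encoded <- utils::URLencode(enc2utf8(word), reserved = TRUE)
  paste0(
    "http://fanyi.youdao.com/openapi.do?keyfrom=", keyfrom,
    "&key=", key,
    "&type=data&doctype=json&version=1.1&q=", encoded
  )
}

# Build one URL per input; vapply guarantees exactly one string for each element
phrases <- c("weather", "good morning")
urls <- vapply(phrases, build_query_url, character(1),
               keyfrom = "JustForTestYouDao", key = "498375134")

print(unname(urls))
```

<p>Only the URL construction is shown; whether the service handles an encoded phrase well is a separate question that I have not tested.</p>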
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">sapply(word,youdaoTranslate,simplify=<span class="literal">TRUE</span>)</div></pre></td></tr></table></figure>
<p><img src="http://orm967kgl.bkt.clouddn.com/youdao/image2.jpg" alt=""></p>
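<p>Here the for loop is dropped in favor of sapply, from the built-in apply family: it returns the results as a vector named by the input words, with no result container to create and fill by hand.</p>
<p>After my extra filtering, the result is more concise and more practical.</p>
<p>For large batches of translations this approach can save a lot of time. I haven't tested how well it translates Chinese words, though; feel free to try it yourself.</p>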
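<p>To make the loop-versus-apply comparison concrete without touching the network, here is a small sketch. fake_translate is a hypothetical stand-in that looks words up in a three-entry dictionary; it is not the package's lookup function, and I use vapply rather than sapply only because it also checks the type of every result.</p>

```r
# Hypothetical offline stand-in for a lookup function (no network call is made)
fake_translate <- function(word) {
  dict <- c(weather = "天气", father = "父亲", apple = "苹果")
  if (word %in% names(dict)) dict[[word]] else NA_character_
}

word <- c("weather", "father", "apple", "like", "hate")

# Loop version: the result vector is grown and named by hand
res_loop <- character(0)
for (w in word) {
  res_loop[w] <- fake_translate(w)
}

# vapply version: one call, and every result must be a single string
res_apply <- vapply(word, fake_translate, character(1))

# Compare the two named character vectors
print(identical(res_loop, res_apply))
```

<p>Words missing from the toy dictionary come back as NA, which makes failed lookups easy to spot in a large batch.</p>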
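<p>Think that's the end? No!</p>
<p>There's more to come!</p>
<p>In fact, Microsoft Excel 2013 and later can also call Youdao's online translation service, and it is fairly simple to use. The formula is below. Don't be put off by the long list of parameters: the only thing you need to change is the cell reference (A2, in the middle of this example). Keep it a relative reference, otherwise filling the formula down will only ever translate the first cell.</p>
<p>I have tested the results, and Chinese-English translation in both directions works very well, with one limitation:<br>single words translate well, but sentences do not, and the output is worse than what I could manage myself. Even short phrases containing spaces, place names and personal names come out poorly.</p>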
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">=FILTERXML(WEBSERVICE("http://fanyi.youdao.com/translate?&i="&A2&"&doctype=xml&version"),"//translation")</div></pre></td></tr></table></figure>
<p><img src="http://orm967kgl.bkt.clouddn.com/youdao/image3.jpg" alt=""></p>
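<p>Once this post has gone out, I will upload a test file containing the formula to the QQ group's shared files. After downloading it, you can simply copy the formula from that cell.</p>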
<hr>
<p><strong>License:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://orm967kgl.bkt.clouddn.com/youdao/image5.png" alt=""></p>
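<p>I recently came across an interesting package: an R enthusiast has written an interface between R and Youdao's online translation service. Presumably the author had also had enough of opening a web page and hammering the keyboard to look up words every day, and decided to build a tool instead.</p>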
</summary>
<category term="网络爬虫" scheme="http://www.raindu.com/categories/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="网络爬虫" scheme="http://www.raindu.com/tags/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
</entry>
<entry>
<title>Wild Ideas, Part 5: building a chain of pie charts with ggplot</title>
<link href="http://www.raindu.com/2017/07/27/%E8%B6%85%E5%BC%BA%E8%84%91%E6%B4%9E%E7%AC%AC%E4%BA%94%E5%BC%B9%E2%80%94%E2%80%94ggplot-%E6%9E%84%E9%80%A0%E8%BF%9E%E7%8E%AF%E9%A5%BC%E5%9B%BE/"/>
<id>http://www.raindu.com/2017/07/27/超强脑洞第五弹——ggplot-构造连环饼图/</id>
<published>2017-07-27T15:05:46.000Z</published>
<updated>2017-07-27T15:13:41.405Z</updated>
<content type="html"><![CDATA[<p><img src="http://ortu7ddty.bkt.clouddn.com/line_pie/image2.png" alt=""></p>
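<p>I have touched on today's topic before: using scatterpie, a helper extension for ggplot, to draw pie charts positioned like the bubbles of a bubble chart. I previously demonstrated a similar chart on a map layer; this time I combine it with a line chart. The example comes from 陈荣兴's well-known book 《Excel图表拒绝平庸》.</p>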
<a id="more"></a>
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(<span class="string">"ggplot2"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"scatterpie"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"Cairo"</span>)</div></pre></td></tr></table></figure>
<p><strong>Building the data set:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">mydata<-c(<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">5</span>,<span class="number">5</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">2</span>,<span class="number">4</span>,<span class="number">5</span>,<span class="number">1</span>,<span class="number">3</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">5</span>,<span class="number">5</span>,<span class="number">4</span>,<span class="number">2</span>,<span class="number">4</span>,<span class="number">2</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">1</span>,<span class="number">1</span>,<span class="number">0.5</span>,<span class="number">0.5</span>)</div><div class="line">Dummy<-<span class="number">5</span>*seq(<span class="number">1</span>:<span class="number">8</span>)</div><div class="line">mynewdata<-matrix(mydata,nrow=<span class="number">8</span>,ncol=<span class="number">5</span>,byrow=<span class="literal">T</span>)</div><div class="line">colnames(mynewdata)<-c(<span class="string">"S1"</span>,<span class="string">"S2"</span>,<span class="string">"S3"</span>,<span class="string">"S4"</span>,<span class="string">"S5"</span>)</div><div class="line">mynewdata<-as.data.frame(mynewdata)</div><div class="line">Year<-<span class="number">2004</span>:<span class="number">2011</span></div><div class="line"><span class="comment">#Data (the eight y-values of the line) is not defined in this post; define it before the next line.</span></div><div class="line">mynewdata1<-cbind(Year,Dummy,Data,mynewdata)</div></pre></td></tr></table></figure></p>
<p><strong>Color palettes:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">color1<-c(<span class="string">"#FF2D2D"</span>,<span class="string">"#F79646"</span>,<span class="string">"#4BACC6"</span>,<span class="string">"#FFC000"</span>,<span class="string">"#92D050"</span>)</div><div class="line">color2<-c(<span class="string">"#17375E"</span>,<span class="string">"#23538D"</span>,<span class="string">"#558ED5"</span>,<span class="string">"#8EB4E3"</span>,<span class="string">"#C6D9F1"</span>)</div></pre></td></tr></table></figure></p>
<h3 id="图形可视化:"><a href="#图形可视化:" class="headerlink" title="图形可视化:"></a>Plotting:</h3><p><strong>Chart output with palette 1:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/scatterpie1.png"</span>,width=<span class="number">500</span>,height=<span class="number">330</span>)</div><div class="line">ggplot()+</div><div class="line">geom_line(data=mynewdata1,aes(x=Dummy,y=Data,group=<span class="number">1</span>),col=<span class="string">"#085264"</span>,size=<span class="number">.8</span>)+</div><div class="line">geom_scatterpie(data=mynewdata1,aes(x=Dummy,y=Data,r=<span class="number">2</span>),cols=colnames(mynewdata1)[<span class="number">4</span>:<span class="number">8</span>],color=<span class="literal">NA</span>)+</div><div class="line">ylim(<span class="number">0</span>,<span class="number">25</span>)+</div><div class="line">scale_fill_manual(values=color1)+</div><div class="line">scale_x_continuous(breaks=mynewdata1$Dummy,labels=c(<span class="number">2004</span>:<span class="number">2011</span>))+</div><div class="line">guides( fill=guide_legend(label.position =<span class="string">"top"</span>))+</div><div class="line">theme(</div><div class="line">axis.title=element_blank(),</div><div class="line">legend.title=element_blank(),</div><div class="line">panel.background=element_blank(),</div><div class="line">axis.line=element_line(),</div><div class="line">axis.ticks=element_line(),</div><div class="line">legend.direction=<span class="string">"horizontal"</span>,</div><div class="line">legend.position=c(<span class="number">0.15</span>,<span class="number">0.9</span>)</div><div class="line">)</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://ortu7ddty.bkt.clouddn.com/line_pie/image1.png" alt=""></p>
<p><strong>Chart output with palette 2:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/scatterpie2.png"</span>,width=<span class="number">500</span>,height=<span class="number">330</span>)</div><div class="line">ggplot()+</div><div class="line">geom_line(data=mynewdata1,aes(x=Dummy,y=Data,group=<span class="number">1</span>),col=<span class="string">"#085264"</span>,size=<span class="number">.8</span>)+</div><div class="line">geom_scatterpie(data=mynewdata1,aes(x=Dummy,y=Data,r=<span class="number">2</span>),cols=colnames(mynewdata1)[<span class="number">4</span>:<span class="number">8</span>],color=<span class="literal">NA</span>)+</div><div class="line">ylim(<span class="number">0</span>,<span class="number">25</span>)+</div><div class="line">scale_fill_manual(values=color2)+</div><div class="line">scale_x_continuous(breaks=mynewdata1$Dummy,labels=c(<span class="number">2004</span>:<span class="number">2011</span>))+</div><div class="line">guides( fill=guide_legend(label.position =<span class="string">"top"</span>))+</div><div class="line">theme(</div><div class="line">axis.title=element_blank(),</div><div class="line">legend.title=element_blank(),</div><div class="line">panel.background=element_blank(),</div><div class="line">axis.line=element_line(),</div><div class="line">axis.ticks=element_line(),</div><div class="line">legend.direction=<span class="string">"horizontal"</span>,</div><div class="line">legend.position=c(<span class="number">0.15</span>,<span class="number">0.9</span>)</div><div class="line">)</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://ortu7ddty.bkt.clouddn.com/line_pie/image2.png" alt=""></p>
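<p>In the book, the original chart is built with VBA. The idea is to force eight pie-chart objects onto the line chart, one at each of the eight data points. The idea is sound, but VBA's clumsy syntax makes it hard going, and the amount of code is huge.</p>
<p><strong>Screenshot of the original example's code:</strong></p>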
<p><img src="http://ortu7ddty.bkt.clouddn.com/line_pie/image.png" alt=""></p>
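<p>In R, leaving aside the theme tweaks, the core code is only six lines, which shows how convenient R is for manipulating graphics.</p>
<h3 id="核心要点总结:"><a href="#核心要点总结:" class="headerlink" title="核心要点总结:"></a>Key takeaways:</h3><p><strong>Where this chart fits:</strong></p>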
<ul>
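<li>Breaking down the composition of indicators along a time dimension (for example, the composition of annual GDP);</li>
<li>Breaking down the composition of an indicator along a regional dimension (for example, product sales volume or revenue by region).</li>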
</ul>
<p><strong>Key points:</strong></p>
<ul>
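<li>Understand the geom_scatterpie layer function (in practice, get to know the parameters of the scatterpie package);</li>
<li>Keep the horizontal and vertical axis scales on the same order of magnitude. Attentive readers may have noticed that I did not map the x axis directly to the Year variable. Instead I went to the trouble of using the values 5, 10, ..., 40, spaced 5 apart, as the x axis, and only afterwards replaced the tick labels with the years 2004 to 2011 (the values that actually mean something). The reason is to keep the pies from being distorted when the x and y scales differ in magnitude. (You could call this a shortcoming of scatterpie: it does not adjust the pie radius automatically.)</li>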
</ul>
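<p>The axis trick in the last point can be written as a small helper. This is only a sketch in base R: rescale_to is a hypothetical name of mine, not something from scatterpie, and the target interval of 5 to 40 is simply the one used in this post.</p>

```r
# Hypothetical helper: map x linearly onto the interval [to[1], to[2]]
rescale_to <- function(x, to) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

years <- 2004:2011

# Put the years on a dummy axis whose span is comparable to the y range (0 to 25)
dummy_x <- rescale_to(years, to = c(5, 40))

# Breaks and labels to hand to the x scale so the axis still reads as years
axis_map <- data.frame(breaks = dummy_x, labels = years)
print(axis_map)
```

<p>The breaks column plays the role of Dummy and the labels column supplies the tick labels, so the same idea should carry over to other x variables, such as longitudes or large counts.</p>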
<hr>
<p><strong>License:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://ortu7ddty.bkt.clouddn.com/line_pie/image2.png" alt=""></p>
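<p>I have touched on today's topic before: using scatterpie, a helper extension for ggplot, to draw pie charts positioned like the bubbles of a bubble chart. I previously demonstrated a similar chart on a map layer; this time I combine it with a line chart. The example comes from 陈荣兴's well-known book 《Excel图表拒绝平庸》.</p>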
</summary>
<category term="R语言" scheme="http://www.raindu.com/categories/R%E8%AF%AD%E8%A8%80/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>Wild Ideas, Part 4: building a Gantt chart with ggplot</title>
<link href="http://www.raindu.com/2017/07/26/%E8%B6%85%E5%BC%BA%E8%84%91%E6%B4%9E%E7%AC%AC%E5%9B%9B%E5%BC%B9%E2%80%94%E2%80%94ggplot%E6%9E%84%E9%80%A0%E7%94%98%E7%89%B9%E5%9B%BE/"/>
<id>http://www.raindu.com/2017/07/26/超强脑洞第四弹——ggplot构造甘特图/</id>
<published>2017-07-26T10:24:56.000Z</published>
<updated>2017-07-26T11:01:26.647Z</updated>
<content type="html"><![CDATA[<p><img src="http://ortu7ddty.bkt.clouddn.com/gant/%E7%94%98%E7%89%B9%E5%9B%BE.png" alt=""></p>
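<p>Even in Excel, a Gantt chart takes some effort, and ggplot has no built-in layer function for one. But hardly any chart can stump Hadley Wickham's masterpiece: ggplot produces a Gantt chart with little effort, and a respectable-looking one at that. The code follows.</p>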
<a id="more"></a>
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(<span class="string">"lubridate"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"ggplot2"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"ggmap"</span>)</div><div class="line"><span class="keyword">library</span>(showtext)</div><div class="line"><span class="keyword">library</span>(grid)</div><div class="line"><span class="keyword">library</span>(scales)</div><div class="line"><span class="keyword">library</span>(Cairo)</div></pre></td></tr></table></figure>
<p><strong>Building the data set:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">Item<-paste(<span class="string">"Step"</span>,<span class="string">" "</span>,<span class="number">1</span>:<span class="number">8</span>,sep=<span class="string">""</span>)</div><div class="line">Planned_Start_Date<-c(<span class="string">"2016/03/03"</span>,<span class="string">"2016/03/16"</span>,<span class="string">"2016/03/28"</span>,<span class="string">"2016/04/02"</span>,<span class="string">"2016/04/12"</span>,<span class="string">"2016/04/22"</span>,<span class="string">"2016/05/16"</span>,<span class="string">"2016/05/22"</span>)</div><div class="line">Planned_Finish_Date<-c(<span class="string">"2016/03/15"</span>,<span class="string">"2016/03/31"</span>,<span class="string">"2016/04/04"</span>,<span class="string">"2016/04/15"</span>,<span class="string">"2016/04/26"</span>,<span class="string">"2016/05/20"</span>,<span class="string">"2016/05/28"</span>,<span class="string">"2016/06/12"</span>)</div><div class="line">Actual_Start_Date<-c(<span class="string">"2016/03/03"</span>,<span class="string">"2016/03/16"</span>,<span class="string">"2016/03/27"</span>,<span class="string">"2016/04/05"</span>,<span class="string">"2016/04/13"</span>,<span class="string">"2016/04/22"</span>,<span class="string">"2016/05/16"</span>,<span class="string">"2016/05/22"</span>)</div><div class="line">Actual_Finish_Date<-c(<span class="string">"2016/03/18"</span>,<span class="string">"2016/03/28"</span>,<span class="string">"2016/04/05"</span>,<span class="string">"2016/04/16"</span>,<span class="string">"2016/04/27"</span>,<span class="string">"2016/05/15"</span>,<span class="string">"2016/05/16"</span>,<span class="string">"2016/05/22"</span>)</div><div 
class="line">mydata<-data.frame(Item,Planned_Start_Date,Planned_Finish_Date,Actual_Start_Date,Actual_Finish_Date,stringsAsFactors = <span class="literal">FALSE</span>)</div></pre></td></tr></table></figure></p>
<p><strong>Converting the date variables:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">mydata$Planned_Start_Date<-ymd(mydata$Planned_Start_Date)</div><div class="line">mydata$Planned_Finish_Date<-ymd(mydata$Planned_Finish_Date)</div><div class="line">mydata$Actual_Start_Date<-ymd(mydata$Actual_Start_Date)</div><div class="line">mydata$Actual_Finish_Date<-ymd(mydata$Actual_Finish_Date)</div></pre></td></tr></table></figure></p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">datebreaks<-seq(as.Date(<span class="string">"2016-03-01"</span>),as.Date(<span class="string">"2016-06-01"</span>),by=<span class="string">"1 month"</span>)</div><div class="line">time<-as.Date(<span class="string">"2016-05-15"</span>)</div></pre></td></tr></table></figure>
<h3 id="可视化过程:"><a href="#可视化过程:" class="headerlink" title="可视化过程:"></a>Plotting:</h3><p><strong>Rendering with ggsave:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div></pre></td><td class="code"><pre><div class="line">windowsFonts(myFont = windowsFont(<span class="string">"微软雅黑"</span>))</div><div class="line">p<-ggplot()+</div><div class="line">geom_linerange(data=mydata,aes(x=Item,ymin=Planned_Start_Date,ymax=Planned_Finish_Date),size=<span class="number">10</span>,color=<span class="string">"#BFBFBF"</span>,alpha=<span class="number">0.8</span>)+</div><div class="line">geom_linerange(data=mydata,aes(x=Item,ymin=Actual_Start_Date,ymax=Actual_Finish_Date),size=<span class="number">7</span>,color=<span class="string">"#085264"</span>,alpha=<span class="number">0.8</span>)+</div><div class="line">scale_x_discrete(limits=sort(Item,decreasing=<span class="literal">T</span>))+</div><div class="line">scale_y_date(position =<span class="string">"top"</span>)+</div><div class="line"><span class="comment">#scale_y_date(breaks=datebreaks,labels=date_format("%Y %b"))+</span></div><div class="line"><span class="comment">#geom_hline(data=NULL,aes(yintercept=time))+</span></div><div class="line">coord_flip()+</div><div class="line">theme(</div><div class="line">axis.title=element_blank(),</div><div class="line">axis.text.x=element_text(margin=margin(<span class="number">5</span>,<span class="number">0</span>,<span 
class="number">0</span>,<span class="number">0</span>,<span class="string">"pt"</span>)),</div><div class="line">axis.text.y=element_text(margin=margin(<span class="number">0</span>,<span class="number">10</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="string">"pt"</span>)),</div><div class="line">axis.ticks.y=element_blank(),</div><div class="line">panel.grid.major.y=element_line(color=<span class="string">"#FFB666"</span>,linetype=<span class="number">5</span>),</div><div class="line">panel.background=element_rect(fill=<span class="string">"white"</span>),</div><div class="line">axis.text=element_text(colour =<span class="string">"black"</span>,size=<span class="number">10</span>,face=<span class="string">"italic"</span>,family=<span class="string">"myFont"</span>),</div><div class="line">axis.line.x=element_line(),</div><div class="line">panel.spacing=unit(-<span class="number">0.3</span>,<span class="string">"cm"</span>)</div><div class="line">)</div><div class="line">ggsave(<span class="string">"C:/Users/Administrator/Desktop/甘特图.png"</span>,p,width=<span class="number">140</span>,height=<span class="number">75</span>,unit=<span class="string">"mm"</span>,dpi=<span class="number">100</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://ortu7ddty.bkt.clouddn.com/gant/Gante.png" alt=""></p>
<p><strong>High-resolution rendering with Cairo:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div></pre></td><td class="code"><pre><div class="line">font.add(<span class="string">"myfont"</span>,<span class="string">"msyhl.ttc"</span>)</div><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/Gante.png"</span>,width=<span class="number">600</span>,height=<span class="number">300</span>)</div><div class="line">showtext.begin()</div><div class="line">ggplot()+</div><div class="line">geom_linerange(data=mydata,aes(x=Item,ymin=Planned_Start_Date,ymax=Planned_Finish_Date),size=<span class="number">10</span>,color=<span class="string">"#BFBFBF"</span>,alpha=<span class="number">0.8</span>)+</div><div class="line">geom_linerange(data=mydata,aes(x=Item,ymin=Actual_Start_Date,ymax=Actual_Finish_Date),size=<span class="number">7</span>,color=<span class="string">"#085264"</span>,alpha=<span class="number">0.8</span>)+</div><div class="line">scale_x_discrete(limits=sort(Item,decreasing=<span class="literal">T</span>))+</div><div class="line">scale_y_date(position =<span class="string">"top"</span>)+</div><div class="line"><span class="comment">#scale_y_date(breaks=datebreaks,labels=date_format("%Y %b"))+</span></div><div class="line"><span class="comment">#geom_hline(data=NULL,aes(yintercept=time))+</span></div><div 
class="line">coord_flip()+</div><div class="line">theme(</div><div class="line">axis.title=element_blank(),</div><div class="line">axis.text.x=element_text(margin=margin(<span class="number">5</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="string">"pt"</span>)),</div><div class="line">axis.text.y=element_text(margin=margin(<span class="number">0</span>,<span class="number">10</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="string">"pt"</span>)),</div><div class="line">axis.ticks.y=element_blank(),</div><div class="line">panel.grid.major.y=element_line(color=<span class="string">"#FFB666"</span>,linetype=<span class="number">5</span>),</div><div class="line">panel.background=element_rect(fill=<span class="string">"white"</span>),</div><div class="line">axis.text=element_text(colour =<span class="string">"black"</span>,size=<span class="number">10</span>,face=<span class="string">"italic"</span>,family=<span class="string">"myfont"</span>),</div><div class="line">axis.line.x=element_line(),</div><div class="line">panel.spacing=unit(-<span class="number">0.3</span>,<span class="string">"cm"</span>)</div><div class="line">)</div><div class="line">showtext.end()</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://ortu7ddty.bkt.clouddn.com/gant/%E7%94%98%E7%89%B9%E5%9B%BE.png" alt=""></p>
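<p><strong>Key takeaways:</strong></p>
<p>Data construction: this example imitates a Gantt chart with the range-line layer (linerange) from the geom layer system. The layer takes a range defined by two variables (a start and an end). Because there are both planned and actual dates, two linerange layers are used, with four variables in total (planned start, planned finish, actual start, actual finish). Building the data mainly comes down to converting the date variables.<br>Layer mapping: get comfortable adjusting the elements of the linerange layer (color, line width, line type and so on).</p>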
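<p>Once the four columns are real dates, the gap between plan and execution can also be computed directly rather than just read off the chart. A minimal sketch in base R, using as.Date instead of lubridate so that it runs without extra packages; the variable names planned_days and finish_slip are my own, and only three of the four date columns from this post are needed.</p>

```r
# Planned and actual dates copied from the data set in this post
plan_start <- as.Date(c("2016/03/03", "2016/03/16", "2016/03/28", "2016/04/02",
                        "2016/04/12", "2016/04/22", "2016/05/16", "2016/05/22"),
                      format = "%Y/%m/%d")
plan_finish <- as.Date(c("2016/03/15", "2016/03/31", "2016/04/04", "2016/04/15",
                         "2016/04/26", "2016/05/20", "2016/05/28", "2016/06/12"),
                       format = "%Y/%m/%d")
actual_finish <- as.Date(c("2016/03/18", "2016/03/28", "2016/04/05", "2016/04/16",
                           "2016/04/27", "2016/05/15", "2016/05/16", "2016/05/22"),
                         format = "%Y/%m/%d")

# Subtracting Date vectors gives a difference in days
planned_days <- as.integer(plan_finish - plan_start)

# Positive values finished later than planned, negative values earlier
finish_slip <- as.integer(actual_finish - plan_finish)

print(data.frame(Item = paste("Step", 1:8), planned_days, finish_slip))
```

<p>This only works because the columns are Date objects rather than strings, which is the reason the conversion step matters.</p>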
<hr>
<p><strong>License:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://ortu7ddty.bkt.clouddn.com/gant/%E7%94%98%E7%89%B9%E5%9B%BE.png" alt=""></p>
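<p>Even in Excel, a Gantt chart takes some effort, and ggplot has no built-in layer function for one. But hardly any chart can stump Hadley Wickham's masterpiece: ggplot produces a Gantt chart with little effort, and a respectable-looking one at that. The code follows.</p>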
</summary>
<category term="R语言" scheme="http://www.raindu.com/categories/R%E8%AF%AD%E8%A8%80/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>Wild Ideas, Part 3: building a waterfall chart with ggplot</title>
<link href="http://www.raindu.com/2017/07/19/%E8%B6%85%E5%BC%BA%E8%84%91%E6%B4%9E%E7%AC%AC%E4%B8%89%E5%BC%B9%E4%B9%8B%E2%80%94%E2%80%94ggplot%E6%9E%84%E9%80%A0%E7%80%91%E5%B8%83%E5%9B%BE/"/>
<id>http://www.raindu.com/2017/07/19/超强脑洞第三弹之——ggplot构造瀑布图/</id>
<published>2017-07-19T10:27:25.000Z</published>
<updated>2017-07-19T10:33:25.686Z</updated>
<content type="html"><![CDATA[<p><img src="http://orpzs13ft.bkt.clouddn.com/waterfall/%E7%80%91%E5%B8%83%E5%9B%BE2.png" alt=""></p>
<p>Yes, a waterfall chart, you read that right. It is built by stacking existing ggplot layers, without any ggplot extension packages.</p>
<a id="more"></a>
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>数据准备:</h3><p>作图理念是在数据源的构造上,方法与《Excel图表之道》《Excel图表拒绝平庸》中的方法一致,我只是加入了自己的技巧。<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(<span class="string">"reshape2"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"ggplot2"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"ggmap"</span>)</div><div class="line"><span class="keyword">library</span>(<span class="string">"Cairo"</span>)</div></pre></td></tr></table></figure></p>
<p><strong>Building the waterfall data source:</strong><br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">Item<-c(<span class="string">"Before"</span>,<span class="string">"Factor A"</span>,<span class="string">"Factor B"</span>,<span class="string">"Factor C"</span>,<span class="string">"Factor D"</span>,<span class="string">"Factor E"</span>,<span class="string">"Factor F"</span>,<span class="string">"Factor G"</span>,<span class="string">"After"</span>)</div><div class="line">Data<-c(<span class="number">325</span>,-<span class="number">32</span>,-<span class="number">105</span>,<span class="number">38</span>,<span class="number">86</span>,<span class="number">97</span>,<span class="number">232</span>,<span class="number">389</span>,<span class="number">1030</span>)</div><div class="line">mydata<-data.frame(Item,Data,stringsAsFactors =<span class="literal">FALSE</span>)</div><div class="line"><span class="comment"># BA: start/end totals, Dummy: invisible spacer, add: increases, Reduc: decreases</span></div><div class="line">mydata$BA<-mydata$Data</div><div class="line">mydata$Dummy<-<span class="number">0</span></div><div class="line">mydata$add<-<span class="number">0</span></div><div class="line">mydata$Reduc<-<span class="number">0</span></div><div class="line"><span class="comment"># only the first and last rows keep a value in BA</span></div><div class="line">mydata$BA[<span class="number">2</span>:<span class="number">8</span>]<-<span class="number">0</span></div></pre></td></tr></table></figure></p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">2</span>:<span class="number">8</span>){</div><div class="line"><span class="comment"># spacer height: running total after a decrease, running total before an increase</span></div><div class="line">mydata$Dummy[i]<-<span class="keyword">if</span> (mydata$Data[i]<<span class="number">0</span>) sum(mydata$Data[<span class="number">1</span>:i]) <span class="keyword">else</span> sum(mydata$Data[<span class="number">1</span>:(i-<span class="number">1</span>)])</div><div class="line"><span class="comment"># increases and decreases go into separate columns, both as positive heights</span></div><div class="line">mydata$add[i]<-max(mydata$Data[i],<span class="number">0</span>)</div><div class="line">mydata$Reduc[i]<-max(-mydata$Data[i],<span class="number">0</span>)</div><div class="line">}</div></pre></td></tr></table></figure>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line">mydata1<-mydata[,-<span class="number">2</span>] <span class="comment"># drop the raw Data column</span></div><div class="line">mydataA<- melt(mydata1,id.vars =<span class="string">"Item"</span>,variable.name = <span class="string">"class"</span>, value.name = <span class="string">"scope"</span>)</div><div class="line"><span class="comment"># the level order controls both the stacking order and the colour assignment</span></div><div class="line">mydataA$class<-factor(mydataA$class,levels=c(<span class="string">"Reduc"</span>,<span class="string">"add"</span>,<span class="string">"Dummy"</span>,<span class="string">"BA"</span>),ordered=<span class="literal">TRUE</span>)</div></pre></td></tr></table></figure>
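<p>The helper columns can also be computed without loops. The following is only a minimal sketch in base R, assuming the same Item and Data vectors as above; the names wf, is_step, running and before are illustrative and do not appear in the original post. It computes the Dummy, add and Reduc columns that the loops above are meant to produce.</p>

```r
Item <- c("Before", paste("Factor", LETTERS[1:7]), "After")
Data <- c(325, -32, -105, 38, 86, 97, 232, 389, 1030)

n <- length(Data)
is_step <- seq_len(n) %in% 2:(n - 1)   # TRUE for the intermediate factors
running <- cumsum(Data)                # running total after each row
before  <- c(0, running[-n])           # running total before each row

wf <- data.frame(Item, stringsAsFactors = FALSE)
wf$BA    <- ifelse(is_step, 0, Data)               # first and last bars keep their value
wf$add   <- ifelse(is_step & Data > 0,  Data, 0)   # increases
wf$Reduc <- ifelse(is_step & Data < 0, -Data, 0)   # decreases, stored as positive heights
# The invisible spacer is the lower edge of each floating bar:
# the smaller of the running totals before and after the step.
wf$Dummy <- ifelse(is_step, pmin(before, running), 0)
print(wf)
```

<p>Because every column is derived from cumsum(), the same code should work for any number of intermediate factors, as long as the first row is the starting total and the last row is the final total.</p>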
<h3 id="图形可视化:"><a href="#图形可视化:" class="headerlink" title="图形可视化:"></a>Visualization:</h3><p><strong>Palette:</strong><br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># one colour per factor level, in level order: Reduc, add, Dummy (white), BA</span></div><div class="line">Color<-c(<span class="string">"#A6442A"</span>,<span class="string">"#015313"</span>,<span class="string">"#FFFFFF"</span>,<span class="string">"#131F37"</span>)</div></pre></td></tr></table></figure></p>
<p><strong>Plotting code:</strong><br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/瀑布图1.png"</span>,width=<span class="number">650</span>,height=<span class="number">360</span>)</div><div class="line">ggplot()+</div><div class="line"><span class="comment"># stacked bars; the white Dummy segment lifts each step to its running total</span></div><div class="line">geom_bar(data=mydataA,aes(x=Item,y=scope,fill=class),stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>,width=<span class="number">1</span>)+</div><div class="line">scale_x_discrete(limits=Item)+</div><div class="line">scale_fill_manual(values=Color)+</div><div class="line">guides(fill=<span class="literal">FALSE</span>)+</div><div class="line"><span class="comment"># labels: totals inside the end bars, signed changes next to each step</span></div><div class="line">geom_text(data=mydata1,aes(x=Item,y=BA/<span class="number">2</span>),label=ifelse(mydata1$BA!=<span class="number">0</span>,mydata1$BA,<span class="string">""</span>),col=<span class="string">"white"</span>)+</div><div class="line">geom_text(data=mydata1,aes(x=Item,y=Dummy+add),label=ifelse(mydata1$add!=<span class="number">0</span>,paste(<span class="string">"+"</span>,mydata1$add,sep=<span class="string">""</span>),<span class="string">""</span>),col=<span class="string">"#015313"</span>,vjust=-<span class="number">.5</span>)+</div><div class="line">geom_text(data=mydata1,aes(x=Item,y=Dummy),label=ifelse(mydata1$Reduc!=<span class="number">0</span>,paste(<span class="string">"-"</span>,mydata1$Reduc,sep=<span class="string">""</span>),<span class="string">""</span>),col=<span class="string">"#A6442A"</span>,vjust=<span class="number">1.2</span>)+</div><div class="line">theme(</div><div class="line">panel.background=element_blank(),</div><div class="line">axis.title=element_blank(),</div><div class="line">axis.text = element_text(colour =<span class="string">"black"</span>,size=<span class="number">12</span>,face=<span class="string">"italic"</span>),</div><div class="line">axis.text.y=element_blank(),</div><div class="line">axis.ticks=element_blank()</div><div class="line">)</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://orpzs13ft.bkt.clouddn.com/waterfall/%E7%80%91%E5%B8%83%E5%9B%BE1.png" alt=""></p>
<p><strong>Flipping the coordinates gives a horizontal waterfall chart:</strong><br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/瀑布图2.png"</span>,width=<span class="number">650</span>,height=<span class="number">360</span>)</div><div class="line">ggplot()+</div><div class="line">geom_bar(data=mydataA,aes(x=Item,y=scope,fill=class),stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>,width=<span class="number">1</span>)+</div><div class="line">scale_x_discrete(limits=Item)+</div><div class="line">scale_fill_manual(values=Color)+</div><div class="line"><span class="comment"># swap the axes; labels now need hjust instead of vjust</span></div><div class="line">coord_flip()+</div><div class="line">guides(fill=<span class="literal">FALSE</span>)+</div><div class="line">geom_text(data=mydata1,aes(x=Item,y=BA/<span class="number">2</span>),label=ifelse(mydata1$BA!=<span class="number">0</span>,mydata1$BA,<span class="string">""</span>),col=<span class="string">"white"</span>)+</div><div class="line">geom_text(data=mydata1,aes(x=Item,y=Dummy+add),label=ifelse(mydata1$add!=<span class="number">0</span>,paste(<span class="string">"+"</span>,mydata1$add,sep=<span class="string">""</span>),<span class="string">""</span>),col=<span class="string">"#015313"</span>,hjust=-<span class="number">.20</span>)+</div><div class="line">geom_text(data=mydata1,aes(x=Item,y=Dummy),label=ifelse(mydata1$Reduc!=<span class="number">0</span>,paste(<span class="string">"-"</span>,mydata1$Reduc,sep=<span class="string">""</span>),<span class="string">""</span>),col=<span class="string">"#A6442A"</span>,hjust=<span class="number">1.2</span>)+</div><div class="line">theme(</div><div class="line">panel.background=element_blank(),</div><div class="line">axis.title=element_blank(),</div><div class="line">axis.text = element_text(colour =<span class="string">"black"</span>,size=<span class="number">12</span>,face=<span class="string">"italic"</span>),</div><div class="line">axis.text.x=element_blank(),</div><div class="line">axis.ticks=element_blank()</div><div class="line">)</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://orpzs13ft.bkt.clouddn.com/waterfall/%E7%80%91%E5%B8%83%E5%9B%BE2.png" alt=""></p>
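<h3 id="核心要点总结"><a href="#核心要点总结" class="headerlink" title="核心要点总结:"></a>Key points:</h3><ul>
<li>Data organization: a waterfall chart depends heavily on how the source data is laid out. If you are not yet comfortable with data manipulation in R, you can build the helper columns with Excel formulas, import the result into R, and reshape it to long format for plotting.</li>
<li>Wide-to-long conversion: pay close attention to the order of the four levels of the resulting factor: decrease &lt; increase &lt; spacer &lt; start/end value. This order must not be changed.</li>
<li>Palette order: it must follow the factor level order. The third colour is white, which hides the spacer; the other colours correspond to the remaining levels.</li>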
</ul>
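<p>One way to make the palette less sensitive to ordering mistakes is to name each colour after its factor level. This is only a sketch: the named vector is an addition for illustration, and the original post uses an unnamed vector. The lookup below runs in base R and shows which colour each level receives.</p>

```r
# Level order used for the stacked bars in the post
lv <- c("Reduc", "add", "Dummy", "BA")
# Naming each colour ties it to a level, so its position in the vector no longer matters
Color <- c(BA = "#131F37", Dummy = "#FFFFFF", add = "#015313", Reduc = "#A6442A")
cls <- factor(c("BA", "Dummy", "add", "Reduc"), levels = lv)
picked <- unname(Color[as.character(cls)])
print(data.frame(class = cls, colour = picked))
```

<p>When a named vector is passed to scale_fill_manual(values = Color), ggplot2 matches the colours to the levels by name rather than by position.</p>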
<hr>
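<p><strong>Contact:</strong><br>WeChat: ljty1991<br>Mail: 578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team WeChat official account: EasyCharts<br>QQ group: [魔方学院] 298236508</p>
<p><strong>About the author:</strong><br><strong><em>杜雨</em></strong><br>Graduate student in finance and economics;<br>would-be data visualization expert;<br>programming novice with a liberal-arts background;<br>enjoys business charts and geographic data visualization, likes tinkering with PowerBI, SAP DashBoard, Tableau, R ggplot2, Think-cell chart and similar data visualization software, and founded and runs the WeChat official account “数据小魔方”.<br>Mail: 578708965@qq.com </p>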
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://orpzs13ft.bkt.clouddn.com/waterfall/%E7%80%91%E5%B8%83%E5%9B%BE2.png" alt=""></p>
<p>Yes, a waterfall chart; you read that right. It is built purely by stacking existing ggplot layers, without any ggplot extension package.</p>
</summary>
<category term="R语言" scheme="http://www.raindu.com/categories/R%E8%AF%AD%E8%A8%80/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>A Few Suggestions for R Beginners</title>
<link href="http://www.raindu.com/2017/07/14/%E7%BB%99R%E8%AF%AD%E8%A8%80%E5%88%9D%E5%AD%A6%E8%80%85%E7%9A%84%E5%87%A0%E4%B8%AA%E5%BB%BA%E8%AE%AE/"/>
<id>http://www.raindu.com/2017/07/14/给R语言初学者的几个建议/</id>
<published>2017-07-14T11:38:19.000Z</published>
<updated>2017-07-14T12:14:46.549Z</updated>
<content type="html"><![CDATA[<p><img src="http://oqdvmreg2.bkt.clouddn.com/guide/image.jpg" alt=""></p>
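<p>Lately many people have been asking me how to get started with R.</p>
<a id="more"></a>
<p>Some leave comments on my official-account articles, some reply in the account backend, some add me on QQ or WeChat to talk directly, some send private messages or comments on Zhihu, and some simply @ me in WeChat groups.</p>
<p>Honestly, the answer would be more convincing coming from a veteran who has worked in data science for years, has rich project experience, and writes code fluently.</p>
<p><strong>I am not the right person to answer this question, for the following reasons:</strong></p>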
<ul>
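<li>First, my learning period has been short. I formally started in September 2016, only about 10 months ago, so it has been something of a crash course;</li>
<li>Second, I had no programming background before learning R (unless you count the SQL course I took at university and my superficial knowledge of VBA);</li>
<li>Third, I am a liberal-arts student without a strong background in mathematics or statistics.</li>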
</ul>
<p><strong>Seen from another angle, though, I feel well qualified to answer it, for the following reasons:</strong></p>
<ul>
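<li>As a liberal-arts programming novice who is weak at math, I understand the confusion and struggle of beginners like me, with no programming background and poor math, when they first meet R;</li>
<li>Judging by my learning period and its results, what I learned has stood the test of practice and earned the approval of many readers;</li>
<li>I learned R on campus rather than under workplace pressure, so I was not driven by utilitarian goals or a rush for quick results. I also learned on demand, focusing on the genuinely useful parts with the highest return on investment, so my experience with the fundamentals and with pacing may be worth borrowing.</li>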
</ul>
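<p>Now let me begin my answer (showtime!!!)</p>
<p><strong>On your motivation for learning:</strong></p>
<p>First of all, before you set out to learn R, ask yourself one question: what is your goal in learning R?</p>
<p>Is it required by a university course? Are you building up data-analysis skills in advance? Are you recharging passively under workplace pressure? Is it a passing whim, joining the crowd because big data is booming? Or is it purely out of interest, to realize some ideas of your own?</p>
<p>Different goals mean different amounts of time you are willing to spend, different effort, different learning paths, different modules to study, and different results.</p>
<p>Be sure to fix your goal and learn on demand; otherwise you will be lost before you even begin. Besides the handful of built-in base packages, CRAN hosts more than ten thousand extension packages, and counting the niche packages that individuals host on GitHub there may be tens of thousands. That is enough to study for a lifetime.</p>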
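<p><strong>On understanding R:</strong></p>
<p>Here I will share my own view of R. I do not want to repeat the concept explanations, history, and feature overviews that have been circulated to death.</p>
<p>R was developed by statisticians, so from birth its mission has been statistical computing and data visualization, the two main directions of its core functionality.</p>
<p>Of the two, learning statistical computing rests on classroom theory and domain background. Frankly, R only provides a platform for implementation; it does not change or create theories or models.</p>
<p>Most of the formulas, models, and algorithms used in statistical computing are already wrapped in extension packages. After loading a package, you only call the relevant function and set its arguments. These functions are no different in nature from functions in Excel, so there is nothing to fear.</p>
<p>As for parameter tuning and model validation and optimization, the knowledge they rely on also comes mostly from coursework and domain background and has little to do with R itself. Even when you need to write an algorithm yourself, you are tuning and computing on top of existing functions according to established theory. Apart from basic syntax, this has nothing to do with the software and everything to do with domain background and industry experience.</p>
<p>In short, for statistical learning what matters is theory and business experience; what actually needs R is only the package functions and the basic syntax.</p>
<p>Compare learning SPSS: someone who does not understand statistics can hardly learn SPSS well, even if they know every module and menu (like me). Likewise, someone who does not understand statistics and mathematics can hardly learn R's statistical computing well, even if they are familiar with R's basic syntax and with what many extension packages can do (like me).</p>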
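<p>The data-visualization side of R is somewhat different. Visualization does not depend much on mathematics (apart from plots dedicated to presenting algorithms, few require heavy computation), but it depends heavily on a graphics grammar and on ideas about visual presentation.</p>
<p>R is widely said to have four plotting grammars (the basic graphics grammar, the advanced graphics grammar, lattice, and ggplot2). Regrettably, I know only one of them: ggplot2.</p>
<p>My motive for learning is simple: to do a thing well, not just well but with results that are a pleasure to look at and impress everyone, and above all that make the boss full of praise (don't you want a promotion and a raise?).</p>
<p>That means learning one grammar that is elegant, efficient, broadly compatible, and close to visualization thinking. My time and energy do not allow me to spread my effort across four lines of work; my multitasking ability is very poor.</p>
<p>If I got greedy, I might end up knowing a little of each grammar and being mediocre at all of them, which I cannot tolerate. For me, ggplot2 is the ideal choice.</p>
<p>Even so, is knowing the syntax by heart enough? Of course not. Knowing it by heart does not guarantee that you can realize your ideas with ease, because data visualization depends not only on the tool and its syntax but even more on understanding the data source, understanding visualization, and mastering design ideas (how to choose colours, lay out a figure, pair fonts, and so on).</p>
<p>If learning software follows the 80/20 rule, I think learning R does too.</p>
<p>Eighty percent of the effort should go into what lies outside the software: statistical theory and business knowledge (which you can teach yourself). The part that must be implemented in R should not be studied dryly (though R's basic syntax must be solid); once the theory is understood, much of the rest falls into place. This is especially true for statistics and data analysis.</p>
<p>Data visualization requires a firm grasp of the basics (basic syntax, data-cleaning skills) and fluency in one graphics grammar (ggplot2 is recommended). After that, do not focus too much on the tool and the code; build up visualization literacy and design taste instead. (For visualization I would adjust the 80/20 rule to 50/50, because ggplot2 is not easy to master.)</p>
<p>Soft skills such as design, aesthetics, and creativity are hard to acquire from one or two books or courses. They are internalized from everyday life and accumulated bit by bit, although deliberately cultivating them through courses and books will pay off over time.</p>
<p><strong>On the R skill path:</strong></p>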
<blockquote>
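<p>General skills:</p>
<ul>
<li>Basics: data structures, variable types, data import/export, merging and appending data, long/wide conversion, indexing, slicing, aggregation.</li>
<li>Advanced: regular expressions, combining and splitting columns, matching and replacing, missing-value imputation, deduplication and sorting, control flow (loops and conditionals).</li>
</ul>
<p>Specialized skills:</p>
<ul>
<li>Statistics and analysis: go study the textbooks</li>
<li>Data visualization: ggplot2 syntax + design + aesthetics + creativity</li>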
</ul>
</blockquote>
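<p>To make the list of general skills a little more concrete, here is a minimal base-R sketch that touches merging, long/wide conversion, indexing, and aggregation. The two small tables (sales and region) and all column names are invented for illustration and are not taken from the original post.</p>

```r
# Two made-up tables sharing the key column "id"
sales  <- data.frame(id = 1:3, q1 = c(10, 20, 30), q2 = c(15, 25, 35))
region <- data.frame(id = 1:3, area = c("north", "south", "north"))

# Merging: join the tables on the key
merged <- merge(sales, region, by = "id")

# Wide to long: one row per id and quarter
long <- reshape(merged, direction = "long",
                varying = c("q1", "q2"), v.names = "value",
                timevar = "quarter", times = c("q1", "q2"),
                idvar = "id")

# Indexing: keep only the rows for one area
north <- long[long$area == "north", ]

# Aggregation: total value per area
totals <- aggregate(value ~ area, data = long, FUN = sum)
print(totals)
```

<p>Packages such as reshape2 or the tidyverse offer more convenient versions of the same operations; base R is used here only so that the sketch runs without installing anything.</p>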
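<p>Once your general skills are reasonably solid, there is no need to keep circling within that small area. Find data and work through cases yourself. Cases are the best way to learn; most progress comes from solving unfamiliar problems in them.</p>
<p>I have not read many R books, so I will not recommend any here. If you really want to learn, do you need someone else's recommendation? A look at the Douban book rankings will do.</p>
<p>Use search engines for ad hoc problems. For almost any problem you meet, someone has already posted a detailed answer online.</p>
<p><strong>Answers to some beginners' questions:</strong></p>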
<ol>
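<li><p>Does R require a strong programming background? Mine is basically zero. Is R unsuitable for me?<br>My programming background was also zero before I learned R. People with a programming background are called programmers, and programmers pick up R without blinking.</p>
</li>
<li><p>Does learning R require a strong math background? I am a liberal-arts student and terrible at math. Will I be unable to learn it?<br>Shake hands: I am in the same situation, a liberal-arts student who is terrible at math. If you plan to move toward data mining, you may need to catch up on calculus, linear algebra, probability and statistics, and algorithms. If you only need R as a tool for business analysis and visualization, your math is probably already above the threshold.</p>
</li>
<li><p>I have studied R for a long time, about a year, and read many books. I know all the basic syntax and understand ggplot2, but I cannot write code on my own and get stuck when plotting.<br>Have you only been reading textbooks, with even your practice code copied from them? How many real cases have you done, how much real business data have you analyzed, and how much new knowledge did you gain by solving problems outside the textbook? Practice beats reading.</p>
</li>
<li><p>Map templates, please!!!<br>Sorry, I do not provide templates, only code and sample data. (R code is hard to turn into templates.)</p>
</li>
<li><p>Hi, are you there? Could you draw a chart for me?<br>…… (I would like to say I am not here.)</p>
</li>
<li><p>Can you recommend an introductory book?<br>I do not think getting started with R needs an introductory book, since I did not follow one myself. But since people ask, here are some suggestions. If you are a student with plenty of time, I recommend 《R语言实战》 (R in Action), but read selectively rather than cover to cover: study the first few chapters on data structures, variable types, and data cleaning carefully (skipping purely conceptual and explanatory content), read the statistics chapters in the middle as needed, and treat the final part on document and report output with caution (you may never need LaTeX or HTML).<br>For data visualization I recommend two books: 《R语言可视化手册》 (R Graphics Cookbook) and 《ggplot2:数据分析与图形艺术》 (ggplot2: Elegant Graphics for Data Analysis). The first is my top choice because it is down to earth; the second, though written by the author of ggplot2 himself, takes a distinctive, lofty approach and is not very beginner friendly.<br>If you work full time, I do not recommend relying mainly on textbooks, because work leaves too little time for long stretches of practice. Use spare moments to follow online courses instead. You can start with free ones: 天善智能 is a good platform for free courses (I teach a course there too, and it offers several free courses on big data), and 网易云课堂 also has many good courses. Use free courses to get started, then take every chance to apply R to front-line business data at work, and you will progress faster.</p>
</li>
<li><p>小魔方, how did you learn R? Can you share some experience?<br>I am almost too embarrassed to answer, but I will brazen it out. I am a hands-on learner. For practice I scrape data from the web; during my internship I used R wherever possible instead of Excel, forcing myself to find use cases for R; and I publish regularly through my WeChat official account, Zhihu column, and personal blog (forcing myself to keep practicing).<br>Of course, a solid foundation matters. Otherwise you will need your notes beside you every time you write code, looking up whatever you do not know, which wastes a lot of time.</p>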
</li>
</ol>
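<p>Make good use of the help documentation. R has a powerful help system: you can go straight to the documentation index of an extension package, or type ? followed by a function name (for example ?paste) to look up its detailed usage and argument rules.</p>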
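<p>As a small illustration of the help system, the sketch below uses only functions that ship with base R. The functions looked up (paste and read.csv) are arbitrary examples chosen here, not ones named in the original post.</p>

```r
# Help pages open in a viewer, so these calls are only made in an interactive session
if (interactive()) {
  help("paste")                    # same as ?paste
  help(package = "stats")          # documentation index of a package
  help.search("regular expression")  # keyword search, same as ??"regular expression"
}

# These also work in scripts:
hits <- apropos("^read\\.csv")   # find functions by name pattern
print(hits)
print(args(read.csv))            # show a function's arguments and defaults
```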
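<p>Practice regularly. Set aside a fixed time every day; the exact schedule depends on your own situation.</p>
<p>One last piece of advice: a programming language for data analysis shows its worth only in real data analysis, just as a tiger has the wildness of the king of beasts only in the forest. So once you feel you have the basics, the final way to advance is real-world practice.</p>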
]]></content>
<summary type="html">
<p><img src="http://oqdvmreg2.bkt.clouddn.com/guide/image.jpg" alt=""></p>
<p>Lately many people have been asking me how to get started with R.</p>
</summary>
<category term="R语言" scheme="http://www.raindu.com/categories/R%E8%AF%AD%E8%A8%80/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>Wild Ideas, Part 2: Building a Funnel Chart with ggplot</title>
<link href="http://www.raindu.com/2017/07/13/%E8%B6%85%E5%BC%BA%E8%84%91%E6%B4%9E%E7%AC%AC%E4%BA%8C%E5%BC%B9%E4%B9%8B%E2%80%94%E2%80%94ggplot%E6%9E%84%E9%80%A0%E6%BC%8F%E6%96%97%E5%9B%BE/"/>
<id>http://www.raindu.com/2017/07/13/超强脑洞第二弹之——ggplot构造漏斗图/</id>
<published>2017-07-13T01:01:32.000Z</published>
<updated>2017-07-13T16:11:58.047Z</updated>
<content type="html"><![CDATA[<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image.jpg" alt=""></p>
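<p>This post builds a funnel chart with ggplot. ggplot's built-in layer functions include no funnel chart, bullet chart, or similar more complex chart types, but its existing layer functions and scale settings can handle them perfectly well. Below is the code for imitating a funnel chart in ggplot.</p>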
<a id="more"></a>
<h3 id="数据准备:"><a href="#数据准备:" class="headerlink" title="数据准备:"></a>Data preparation:</h3><figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="keyword">library</span>(reshape2)</div><div class="line"><span class="keyword">library</span>(plyr)</div><div class="line"><span class="keyword">library</span>(ggplot2)</div><div class="line"><span class="keyword">library</span>(ggmap) <span class="comment"># theme_nothing(), used in the plots below</span></div><div class="line"><span class="keyword">library</span>(Cairo) <span class="comment"># CairoPNG(), used in the plots below</span></div></pre></td></tr></table></figure>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line">scope<-c(<span class="number">0.9</span>,<span class="number">0.8</span>,<span class="number">0.6</span>,<span class="number">0.4</span>,<span class="number">0.2</span>)</div><div class="line">Part<-paste(<span class="string">"part"</span>,<span class="number">1</span>:<span class="number">5</span>,sep=<span class="string">""</span>)</div><div class="line">Order<-<span class="number">1</span>:<span class="number">5</span></div><div class="line"><span class="comment"># helper value: the padding that centres each bar</span></div><div class="line">help<-(<span class="number">1</span>-scope)/<span class="number">2</span></div><div class="line">mydata<-data.frame(Order,Part,help,scope)</div><div class="line">mydata1<-melt(mydata,id.vars=c(<span class="string">"Order"</span>,<span class="string">"Part"</span>),variable.name=<span class="string">"perform"</span>,value.name=<span class="string">"scope"</span>)</div><div class="line"><span class="comment"># ordered factor: scope (actual value) ranks below help (helper value)</span></div><div class="line">mydata1$perform<-factor(mydata1$perform,levels=c(<span class="string">"scope"</span>,<span class="string">"help"</span>),ordered=<span class="literal">TRUE</span>)</div></pre></td></tr></table></figure>
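<p>An important step is to build an ordered factor with two levels, the actual indicator value and the helper value, making sure that the helper level ranks higher than the actual value. Stacked bars are drawn by factor level from high to low (the highest level at the bottom, the lowest at the top), so the helper segment props up the indicator bar and every bar ends up centred along the value axis.</p>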
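<p>The centring can be checked with a little arithmetic. The sketch below reuses the scope values from the post; the names help_pad, left, right and mid are illustrative only (help_pad plays the role of the help column).</p>

```r
scope <- c(0.9, 0.8, 0.6, 0.4, 0.2)
help_pad <- (1 - scope) / 2      # padding stacked underneath each visible bar

left  <- help_pad                # where each visible bar starts on the value axis
right <- help_pad + scope        # where it ends
mid   <- (left + right) / 2      # midpoint of each visible bar

print(data.frame(scope, help_pad, left, right, mid))
```

<p>Every midpoint is 0.5, which is why the visible bars line up symmetrically and form a funnel once the padding is coloured white.</p>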
<h3 id="可视化过程:"><a href="#可视化过程:" class="headerlink" title="可视化过程:"></a>Visualization:</h3><figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># first look: stacked bars before the helper segment is hidden</span></div><div class="line">ggplot(mydata1,aes(Order,scope,order=desc(scope),fill=perform))+geom_bar(stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>)</div></pre></td></tr></table></figure>
<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image1.png" alt=""></p>
<p><strong>Building the palette:</strong><br>(A white value hides the helper column, the same idea as building a funnel chart in Excel. What matters is that the colour order matches the factor level order. The palette is written with white first, and sort(Color) in the code below then puts the green value first, so green goes to the first level, scope, and white goes to the helper level, help.) This point is very important and is the core trick for imitating a funnel chart in ggplot.<br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">Color<-c(<span class="string">"#FFFFFF"</span>,<span class="string">"#088158"</span>)</div><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/漏斗图1.png"</span>,width=<span class="number">330</span>,height=<span class="number">400</span>)</div><div class="line">ggplot()+</div><div class="line">geom_bar(data=mydata1,aes(x=Order,y=scope,fill=perform),stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>)+</div><div class="line">scale_fill_manual(values=sort(Color))+ </div><div class="line"><span class="comment"># part name and percentage, placed just below and above the centre of each bar</span></div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>-<span class="number">.025</span>,label=Part),col=<span class="string">"white"</span>,size=<span class="number">4</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>+<span class="number">.035</span>,label=paste(<span class="number">100</span>*mydata$scope,<span class="string">"%"</span>,sep=<span class="string">""</span>)),col=<span class="string">"white"</span>,size=<span class="number">5.5</span>)+</div><div class="line">theme_nothing()</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image2.png" alt=""></p>
<p><strong>Changing direction</strong> (reversing the axis)<br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/漏斗图2.png"</span>,width=<span class="number">330</span>,height=<span class="number">400</span>)</div><div class="line">ggplot()+</div><div class="line">geom_bar(data=mydata1,aes(x=Order,y=scope,fill=perform),stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>)+</div><div class="line">scale_fill_manual(values=sort(Color))+ </div><div class="line"><span class="comment"># reverse the Order axis so that part1 comes last</span></div><div class="line">scale_x_reverse()+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>-<span class="number">.025</span>,label=Part),col=<span class="string">"white"</span>,size=<span class="number">4</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>+<span class="number">.035</span>,label=paste(<span class="number">100</span>*mydata$scope,<span class="string">"%"</span>,sep=<span class="string">""</span>)),col=<span class="string">"white"</span>,size=<span class="number">5.5</span>)+</div><div class="line">theme_nothing()</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image3.png" alt=""></p>
<p><strong>Horizontal-bar funnel chart:</strong><br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/漏斗图3.png"</span>,width=<span class="number">330</span>,height=<span class="number">400</span>)</div><div class="line">ggplot()+</div><div class="line">geom_bar(data=mydata1,aes(x=Order,y=scope,fill=perform),stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>)+</div><div class="line">scale_fill_manual(values=sort(Color))+ </div><div class="line"><span class="comment"># swap the axes so that the bars run horizontally</span></div><div class="line">coord_flip()+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>-<span class="number">.04</span>,label=Part),col=<span class="string">"white"</span>,size=<span class="number">4</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>+<span class="number">.04</span>,label=paste(<span class="number">100</span>*mydata$scope,<span class="string">"%"</span>,sep=<span class="string">""</span>)),col=<span class="string">"white"</span>,size=<span class="number">5.5</span>)+</div><div class="line">theme_nothing()</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image4.png" alt=""></p>
<p><strong>Reversing the axis:</strong><br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">CairoPNG(file=<span class="string">"C:/Users/Administrator/Desktop/漏斗图4.png"</span>,width=<span class="number">330</span>,height=<span class="number">400</span>)</div><div class="line">ggplot()+</div><div class="line">geom_bar(data=mydata1,aes(x=Order,y=scope,fill=perform),stat=<span class="string">"identity"</span>,position=<span class="string">"stack"</span>)+</div><div class="line">scale_fill_manual(values=sort(Color))+ </div><div class="line"><span class="comment"># horizontal bars with the Order axis reversed</span></div><div class="line">coord_flip()+</div><div class="line">scale_x_reverse()+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>-<span class="number">.05</span>,label=Part),col=<span class="string">"white"</span>,size=<span class="number">4</span>)+</div><div class="line">geom_text(data=mydata,aes(x=Order,y=help+scope/<span class="number">2</span>+<span class="number">.05</span>,label=paste(<span class="number">100</span>*mydata$scope,<span class="string">"%"</span>,sep=<span class="string">""</span>)),col=<span class="string">"white"</span>,size=<span class="number">5.5</span>)+</div><div class="line">theme_nothing()</div><div class="line">dev.off()</div></pre></td></tr></table></figure></p>
<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image5.png" alt=""></p>
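<h3 id="本文小结:"><a href="#本文小结:" class="headerlink" title="本文小结:"></a>Summary:</h3><p><strong>The key tricks for imitating a funnel chart in ggplot are:</strong></p>
<ul>
<li>The indicator and helper values must be set up as an ordered factor with indicator level &lt; helper level. Stacked bars are drawn from the highest level at the bottom to the lowest at the top, so the helper segment sits at the bottom and props up the data segment, centring it. That is how the funnel is simulated.</li>
<li>Palette: the colours must line up with the factor levels, with the value colour on the indicator level and white on the helper level; scale_fill_manual assigns unnamed colours in factor level order. (With only two colours, a palette written in the wrong order can simply be reordered, which is what sort() does in the code above.)</li>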
</ul>
<p>That is all. Look forward to the next ggplot brainstorm; it may be a Gantt chart, a waterfall chart, or some other chart without a name.</p>
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://oro3igf2g.bkt.clouddn.com/funnel/image.jpg" alt=""></p>
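<p>This post builds a funnel chart with ggplot. ggplot's built-in layer functions include no funnel chart, bullet chart, or similar more complex chart types, but its existing layer functions and scale settings can handle them perfectly well. Below is the code for imitating a funnel chart in ggplot.</p>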
</summary>
<category term="R语言" scheme="http://www.raindu.com/categories/R%E8%AF%AD%E8%A8%80/"/>
<category term="数据可视化" scheme="http://www.raindu.com/tags/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="ggplot2" scheme="http://www.raindu.com/tags/ggplot2/"/>
</entry>
<entry>
<title>The Scrape-at-Will Series: Scraping a Singer's Miaopai MVs</title>
<link href="http://www.raindu.com/2017/07/12/%E4%B8%80%E8%A8%80%E4%B8%8D%E5%90%88%E5%B0%B1%E7%88%AC%E8%99%AB%E7%B3%BB%E5%88%97%E4%B9%8B%E2%80%94%E2%80%94%E7%88%AC%E5%8F%96%E5%B0%8F%E5%A7%90%E5%A7%90%E7%9A%84%E7%A7%92%E6%8B%8DMV/"/>
<id>http://www.raindu.com/2017/07/12/一言不合就爬虫系列之——爬取小姐姐的秒拍MV/</id>
<published>2017-07-12T14:30:30.000Z</published>
<updated>2017-07-12T14:54:07.216Z</updated>
<content type="html"><![CDATA[<p><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image.jpg" alt=""></p>
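<p>Midsummer in Dalian is annoyingly hot (for someone like me who can stand neither heat nor cold, there is nowhere left to hide).</p>
<a id="more"></a>
<p>Add a headache-inducing graduation thesis, and days like these call for some MVs to cool off.</p>
<p>If I am going to listen, why stop at one song? And having learned web scraping, why let the skill go to waste?</p>
<p>So: an irritable mood + the urge to watch MVs + scraping skills. Today 小魔方 shows you how to batch-download a singer's MV clips from Miaopai with R.</p>
<p><a href="http://www.miaopai.com/u/paike_wgleqt8r08" target="_blank" rel="external">Her Miaopai homepage</a></p>
<p>Today's target is the homepage of a singer named 陶心瑶. While browsing Weibo I happened to hear her cover of 薛之谦's 《方圆万里》, liked it, and looked up her Miaopai homepage.</p>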
<p><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image1.jpg" alt=""></p>
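<p>The homepage has only 5 works, but that is enough for a scraping exercise (it is just a loop, and the steps would be the same for 1,000 videos, except that paginated requests might be needed).<br><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image2.jpg" alt=""></p>
<p>MVs are long and take up a lot of storage, so I will not demonstrate large-scale MV scraping here (my newly bought laptop is already almost full).</p>
<h3 id="爬虫三步走:"><a href="#爬虫三步走:" class="headerlink" title="爬虫三步走:"></a>Scraping in three steps:</h3><h4 id="第一步:分析网页:"><a href="#第一步:分析网页:" class="headerlink" title="第一步:分析网页:"></a>Step 1: analyze the page:</h4><p><strong>First, open the homepage and study its structure:</strong></p>
<p>The homepage lists only 5 MVs. Point the mouse at any one of them (I chose the first), right-click, and open the developer tools.<br><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image3.jpg" alt=""><br><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image4.jpg" alt=""></p>
<p>The video address of this MV is stored at:</p>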
<figure class="highlight html"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">div.videoCont></div><div class="line">div.videoList></div><div class="line">div.video></div><div class="line">div.MIAOPAI_player></div><div class="line">div.video-player></div><div class="line">video</div><div class="line">src</div></pre></td></tr></table></figure>
<p><a href="//gslb.miaopai.com/stream/AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa.mp4">//gslb.miaopai.com/stream/AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa.mp4</a></p>
<p><strong>Try opening this address in the browser:</strong></p>
<p><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image5.jpg" alt=""></p>
<p>OK, everything works, so this address is good!</p>
<p>Although the complete video address is stored only in the src attribute of the video child node, a closer look shows that the data-scid and data-img attributes of the parent node MIAOPAI_player, as well as the src and poster attributes of the video node, all carry the video's identifier in their names or image links (they share part of the video link).<br><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image6.jpg" alt=""></p>
<p>In fact, a page that displays a video exposes at least three usable pieces of information: the video name, the cover image, and the source address.</p>
<p>(I spell this out so you know not to be stubborn when scraping. The link to the source address is not the only route; if that node cannot be scraped, you are not out of options.)</p>
<h4 id="第二部:抓取网页:"><a href="#第二部:抓取网页:" class="headerlink" title="第二部:抓取网页:"></a>Step 2: fetch the page:</h4><p>What next? Grab the video addresses, of course (using the rvest package).<br><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">setwd(<span class="string">"E:/CloudMusic"</span>)</div><div class="line"><span class="keyword">library</span>(tidyverse)</div><div class="line"><span class="keyword">library</span>(rvest)</div><div class="line"><span class="keyword">library</span>(stringr)</div><div class="line"><span class="comment"># the homepage linked at the top of this post</span></div><div class="line">url<-<span class="string">"http://www.miaopai.com/u/paike_wgleqt8r08"</span></div><div class="line">(read_html(url,encoding=<span class="string">"utf-8"</span>)%>%html_nodes(<span class="string">"div.videoCont>div.videoList>div.video>div.MIAOPAI_player>div.video-player>video"</span>))</div></pre></td></tr></table></figure></p>
<blockquote>
<p>{xml_nodeset (0)}</p>
</blockquote>
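<p>Uh-oh. An empty node set means the selector matched nothing in the HTML that read_html downloaded (picture the software giving you a dismissive look).</p>
<p>Failing to get the URLs is frustrating (and copying them by hand from the page would be far too crude).</p>
<p>So what now?</p>
<p>As noted earlier, the video URL is not the only way in. The video id appears in several attribute values, so we only need to scrape any one of them and then assemble the link by following the pattern of the original video URL.</p>
<p>To avoid complicated string processing, scrape the id in its rawest form. (This time the target is the data-scid attribute of the parent node MIAOPAI_player.)</p>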
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mylinks<-read_html(url,encoding=<span class="string">"utf-8"</span>)%>%html_nodes(<span class="string">"div.videoCont>div.videoList>div.video>div.MIAOPAI_player"</span>)%>%html_attr(<span class="string">"data-scid"</span>)</div></pre></td></tr></table></figure>
<blockquote>
<p>[1] "AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa" "ugJzN6LvH3emoPlSU2b52Cu-SbIQ5LFa" "wJ4AsVMgek6jp6lXDxIpXExCig9cVXo~" "I-J6u~qy7V5CpRIq-FoFA3pYtc6Yr0Sz"<br>[5] "pCLMPKezqWVWHyhjNHaRyKrX16APCeuw"</p>
</blockquote>
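<p>OK, that worked. What next? (There are only a few videos and they are not spread over several pages, so there is no need to construct paging requests, which keeps the whole script much simpler.)</p>
<p>Next we construct usable video URLs, because what we just scraped is only the video id, not a complete address that can be passed straight to the video source.</p>
<p>Now compare the source URL copied by hand earlier with the video id scraped this time and look for the pattern.</p>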
<blockquote>
<p>AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa<br><a href="http://gslb.miaopai.com/stream/AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa.mp4" target="_blank" rel="external">http://gslb.miaopai.com/stream/AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa.mp4</a></p>
</blockquote>
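<p>The pattern is easy to see this time: the source URL is the video id with the address of Miaopai's video streaming server prepended and the .mp4 extension appended. Our next task is to construct usable download URLs.</p>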
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">baseurl<-<span class="string">"http://gslb.miaopai.com/stream/"</span></div><div class="line">mymvlinks<-paste0(baseurl,mylinks,<span class="string">".mp4"</span>)</div></pre></td></tr></table></figure>
<p>Done in two steps. Can you believe your eyes?</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="comment">#[1] "http://gslb.miaopai.com/stream/AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa.mp4" "http://gslb.miaopai.com/stream/ugJzN6LvH3emoPlSU2b52Cu-SbIQ5LFa.mp4"</span></div><div class="line"><span class="comment">#[3] "http://gslb.miaopai.com/stream/wJ4AsVMgek6jp6lXDxIpXExCig9cVXo~.mp4" "http://gslb.miaopai.com/stream/I-J6u~qy7V5CpRIq-FoFA3pYtc6Yr0Sz.mp4"</span></div><div class="line"><span class="comment">#[5] "http://gslb.miaopai.com/stream/pCLMPKezqWVWHyhjNHaRyKrX16APCeuw.mp4"</span></div></pre></td></tr></table></figure>
<p>If you are not convinced, open one of these addresses in the browser and check that the video plays. (Don't worry, I have already tried them for you.)</p>
<p>So far we only have the download URLs, not the song titles of the MVs (if the files are named 1, 2, 3, you may have to play each one after downloading to find out which song it is, and naming them by id leaves you with equally unhelpful strings of letters and digits).</p>
<p><strong>So let's scrape the titles as well:</strong></p>
<p>Below each MV there is a sentence that contains the song title, so that sentence is all we need to scrape.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">myinfo<-read_html(url,encoding=<span class="string">"utf-8"</span>)%>%html_nodes(<span class="string">"div.viedoAbout"</span>)%>%html_text(trim = <span class="literal">TRUE</span>)</div><div class="line"></div><div class="line"><span class="comment">#[1] "温暖女声陶心瑶翻唱薛之谦《方圆几里》 \n #陶心瑶第二自我##纪念青春的那些歌#"</span></div><div class="line"><span class="comment">#[2] "陶心瑶首张实体专辑《第二自我》众筹宣传片 \n #陶心瑶第二自我##纪念青春的那些歌#"</span></div><div class="line"><span class="comment">#[3] "上课中《丑八怪》"</span></div><div class="line"><span class="comment">#[4] "陶心瑶暖心翻唱JJ《她说》 \n #陶心瑶##林俊杰的第36页#"</span></div><div class="line"><span class="comment">#[5] "这个《双截棍》也太柔了吧!唱的心都醉啦"</span></div></pre></td></tr></table></figure>
<p><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image7.jpg" alt=""></p>
<p>After scraping, it turns out that the song title in every sentence is wrapped in Chinese title marks 《》 (how do you match the Chinese text inside Chinese title marks? I could not work out the regular expression ~_~).</p>
<p>Fine, what I lack in skill I make up for in diligence: use the string position functions to extract the titles one by one.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">mymvname<-c()</div><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:length(myinfo) ){</div><div class="line">mymvname[i]<-substr(myinfo[i],regexpr(<span class="string">"《"</span>,myinfo[i])[<span class="number">1</span>]+<span class="number">1</span>,regexpr(<span class="string">"》"</span>,myinfo[i])[<span class="number">1</span>]-<span class="number">1</span>)</div><div class="line">}</div></pre></td></tr></table></figure></p>
<p>Once the titles are extracted, append the .mp4 extension.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">mymvname<-paste0(mymvname,<span class="string">".mp4"</span>)</div><div class="line"><span class="comment">#[1] "方圆几里.mp4" "第二自我.mp4" "丑八怪.mp4" "她说.mp4" "双截棍.mp4"</span></div></pre></td></tr></table></figure></p>
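<p>For the regular expression the text above gives up on, a non-greedy capture group between the two title marks is enough. The sketch below is in Python rather than the R used in this post, and the function name <code>extract_title</code> and its <code>fallback</code> argument are illustrative assumptions, not part of the original script. The captions are copied from the scraped output above.</p>

```python
import re

# Captions copied from the scraped output shown above.
captions = [
    "温暖女声陶心瑶翻唱薛之谦《方圆几里》 \n #陶心瑶第二自我##纪念青春的那些歌#",
    "陶心瑶首张实体专辑《第二自我》众筹宣传片 \n #陶心瑶第二自我##纪念青春的那些歌#",
    "上课中《丑八怪》",
    "陶心瑶暖心翻唱JJ《她说》 \n #陶心瑶##林俊杰的第36页#",
    "这个《双截棍》也太柔了吧!唱的心都醉啦",
]

# Non-greedy group: capture everything between the first 《 and the next 》.
TITLE = re.compile(r"《(.+?)》")


def extract_title(caption, fallback="untitled"):
    """Return the text inside the first pair of title marks (hypothetical helper)."""
    match = TITLE.search(caption)
    return match.group(1) if match else fallback


# Append the extension, as the R code does with paste0.
names = [extract_title(c) + ".mp4" for c in captions]
print(names)
```

<p>Unlike the position-based substr approach, this returns a fallback name when a caption has no title marks instead of producing an empty or malformed string.</p>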
<p>What more could you want at this point? Well, I just want to download her videos.</p>
<h4 id="爬虫第三部:构建下载函数:"><a href="#爬虫第三部:构建下载函数:" class="headerlink" title="爬虫第三部:构建下载函数:"></a>Scraper step 3: Build the download loop:</h4><p>There are five video files to download, so we need a download loop:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:length(myinfo)){</div><div class="line">download.file(mymvlinks[i],mymvname[i],mode=<span class="string">"wb"</span>)</div><div class="line">}</div></pre></td></tr></table></figure></p>
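<p>The link construction and the download loop can be sketched together as follows. This is Python rather than the R used in this post. The helper names <code>build_links</code> and <code>download_all</code>, and the injectable <code>fetch</code> argument, are assumptions made for illustration; only the base URL and the .mp4 suffix come from the text above.</p>

```python
from urllib.request import urlretrieve

# Stream server prefix observed in the hand-copied video URL above.
BASE_URL = "http://gslb.miaopai.com/stream/"


def build_links(video_ids, base_url=BASE_URL, suffix=".mp4"):
    """Prefix each id with the stream server and append the extension."""
    return [f"{base_url}{vid}{suffix}" for vid in video_ids]


def download_all(links, names, fetch=urlretrieve):
    """Download every link to its paired file name (hypothetical helper).

    `fetch` defaults to urllib's urlretrieve and can be swapped for a
    stand-in, for example to do a dry run without network access.
    """
    if len(links) != len(names):
        raise ValueError("links and names must have the same length")
    saved = []
    for link, name in zip(links, names):
        fetch(link, name)
        saved.append(name)
    return saved


# Dry run: record what would be downloaded instead of fetching it.
planned = []
download_all(
    build_links(["AUTy2nx4l-T~BhG-zX60wSDwwqoWfwpa"]),
    ["方圆几里.mp4"],
    fetch=lambda link, name: planned.append((link, name)),
)
print(planned)
```

<p>Pairing links and names with zip, plus the explicit length check, guards against the case where the number of scraped captions differs from the number of scraped ids, which the index-based loop would not catch.</p>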
<p><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image8.jpg" alt=""><br><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image9.jpg" alt=""><br><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image10.jpg" alt=""></p>
<p>OK, scraping done. Simple, isn't it? Go find a video site and try it yourself. This summer, let the music play a little louder!</p>
<h3 id="接下来做一个完整的代码汇总:"><a href="#接下来做一个完整的代码汇总:" class="headerlink" title="接下来做一个完整的代码汇总:"></a>Complete code summary:</h3><p><strong>Step 1: Analyze the page:</strong></p>
<p><strong>Step 2: Scrape the page:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line">setwd(<span class="string">"E:/CloudMusic"</span>)</div><div class="line"><span class="keyword">library</span>(tidyverse)</div><div class="line"><span class="keyword">library</span>(rvest)</div><div class="line"><span class="keyword">library</span>(stringr)</div><div class="line">mylinks<-read_html(url,encoding=<span class="string">"utf-8"</span>)%>%html_nodes(<span class="string">"div.videoCont>div.videoList>div.video>div.MIAOPAI_player"</span>)%>%html_attr(<span class="string">"data-scid"</span>)<span class="comment">#scrape the video ids</span></div><div class="line">baseurl<-<span class="string">"http://gslb.miaopai.com/stream/"</span></div><div class="line">mymvlinks<-paste0(baseurl,mylinks,<span class="string">".mp4"</span>) <span class="comment">#build the video links</span></div><div class="line">myinfo<-read_html(url,encoding=<span class="string">"utf-8"</span>)%>%html_nodes(<span class="string">"div.viedoAbout"</span>)%>%html_text(trim = <span class="literal">TRUE</span>)<span class="comment">#scrape the caption text</span></div><div class="line">mymvname<-c()</div><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:length(myinfo) ){</div><div class="line">mymvname[i]<-substr(myinfo[i],regexpr(<span class="string">"《"</span>,myinfo[i])[<span class="number">1</span>]+<span class="number">1</span>,regexpr(<span class="string">"》"</span>,myinfo[i])[<span class="number">1</span>]-<span class="number">1</span>)</div><div class="line">}<span 
class="comment">#extract the video titles</span></div><div class="line">mymvname<-paste0(mymvname,<span class="string">".mp4"</span>)<span class="comment">#build the file names (with extension)</span></div></pre></td></tr></table></figure></p>
<p><strong>Step 3: Build the download loop:</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:length(myinfo)){</div><div class="line">download.file(mymvlinks[i],mymvname[i],mode=<span class="string">"wb"</span>)</div><div class="line">}</div></pre></td></tr></table></figure></p>
<hr>
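<p><strong>Contact:</strong><br>wechat:ljty1991<br>Mail:578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team WeChat official account: EasyCharts<br>QQ group: [魔方学院] 298236508</p>
<p><strong>About the author:</strong><br><strong><em>杜雨</em></strong><br>Graduate student in finance;<br>self-styled data visualization enthusiast;<br>programming beginner with a liberal arts background;<br>enjoys studying business charts and geographic data visualization, and tinkering with PowerBI, SAP DashBoard, Tableau, R ggplot2, Think-cell chart and similar data visualization software; created and runs the WeChat official account "数据小魔方".<br>Mail:578708965@qq.com </p>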
<hr>
<p><strong>Notes:</strong><br><a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png"></a><br>This work is licensed under a <a rel="external" href="http://creativecommons.org/licenses/by-nc/4.0/" target="_blank">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
]]></content>
<summary type="html">
<p><img src="http://orz60j4aw.bkt.clouddn.com/taoxinyao/image.jpg" alt=""></p>
<p>Midsummer in Dalian is irritatingly hot (for someone like me who can stand neither heat nor cold, there is really nowhere to hide).</p>
</summary>
<category term="网络爬虫" scheme="http://www.raindu.com/categories/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
<category term="R语言" scheme="http://www.raindu.com/tags/R%E8%AF%AD%E8%A8%80/"/>
<category term="网络爬虫" scheme="http://www.raindu.com/tags/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/"/>
</entry>
<entry>
<title>R in the Left Hand, Python in the Right Series: Factor Variables and Categorical Recoding</title>
<link href="http://www.raindu.com/2017/07/10/%E5%B7%A6%E6%89%8B%E7%94%A8R%E5%8F%B3%E6%89%8BPython%E7%B3%BB%E5%88%97%E2%80%94%E2%80%94%E5%9B%A0%E5%AD%90%E5%8F%98%E9%87%8F%E4%B8%8E%E5%88%86%E7%B1%BB%E9%87%8D%E7%BC%96%E7%A0%81/"/>
<id>http://www.raindu.com/2017/07/10/左手用R右手Python系列——因子变量与分类重编码/</id>
<published>2017-07-10T04:44:32.000Z</published>
<updated>2017-07-10T04:59:57.017Z</updated>
<content type="html"><![CDATA[<p><img src="http://orssvamao.bkt.clouddn.com/factor/image.jpg" alt=""></p>
<p>This post covers how factor variables, one of the basic data types, are handled in R and in Python.</p>
<a id="more"></a>
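<p>Factor variables are an important kind of variable for describing categorical things. They correspond to a large number of meaningful real-world categories.</p>
<p>Examples include age group, gender, job title, hobby, and zodiac sign.</p>
<p>They get a post of their own not only because of their special place among data structures, but also because in data visualization, analysis and modeling a factor variable often describes an important dimension of the thing being studied. They should not be overlooked either during data processing or in later analysis and modeling.</p>
<p>Depending on the meaning of the dimension they describe, factor variables are usually subdivided into unordered factors (the categories have no particular order and all levels are on an equal footing) and ordered factors (the categories follow some conventional order, such as age group, professional title, education level, or weight class).</p>
<p>Statistics distinguishes four types of variables: nominal, ordinal, interval, and ratio. Nominal and ordinal variables correspond to the factor variables discussed here (unordered factors and ordered factors, respectively).</p>
<p>In terms of information content, a factor variable carries somewhat more descriptive information than a purely qualitative (text) variable, but less than a numeric variable (interval or ratio).</p>
<p>In principle, therefore, a numeric variable can be converted to a factor variable and a factor variable can be converted to a text variable, but this order cannot be reversed (a variable with more information can discard some and become a type with less, but a variable with less information cannot gain information).</p>
<p>The sections below explain, for R and Python in turn, how to create a factor variable, how to convert a numeric variable to a factor variable, and how to recode a factor variable.</p>
<h3 id="R语言因子变量处理函数:"><a href="#R语言因子变量处理函数:" class="headerlink" title="R语言因子变量处理函数:"></a>R functions for handling factor variables:</h3><p>In R, factor variables are usually created directly with factor. All it needs is a vector (in principle either text or numeric, although in practice the vector being converted is normally a categorical text variable with several categories).</p>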
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">factor(x, levels,labels=levels,ordered=)</div></pre></td></tr></table></figure>
<p>In the arguments above, x is the variable to convert; levels sets the factor levels (optional; if omitted, the unique values in the vector are used as the levels); labels sets the factor labels (optional, matched one-to-one with the levels; if set, the labels are what is printed; if omitted, the levels themselves, i.e. the unique values or categories in the vector, are used as labels); ordered is a logical argument that sets whether the levels are treated as ordered, that is, whether an ordered factor is created.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">vector<-rep(LETTERS[<span class="number">1</span>:<span class="number">5</span>],<span class="number">6</span>);print(vector);plyr::count(vector)</div><div class="line">(myfactor<-factor(vector,levels=c(<span class="string">"E"</span>,<span class="string">"D"</span>,<span class="string">"C"</span>,<span class="string">"B"</span>,<span class="string">"A"</span>),labels=c(<span class="string">"EEE"</span>,<span class="string">"DDD"</span>,<span class="string">"CCC"</span>,<span class="string">"BBB"</span>,<span class="string">"AAA"</span>),ordered=<span class="literal">TRUE</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image1.jpg" alt=""></p>
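<p>Usually levels does not need to be set in factor, because the function works out by itself which levels the vector contains. When creating an ordered factor, however, the levels are sorted alphabetically by default; if that natural order differs from the intended order, levels must be specified. Whether to set labels depends on the need; if the values are already text categories, labels are generally unnecessary.</p>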
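<p>For comparison, a rough pandas counterpart of an ordered factor with explicit levels and labels is sketched below. Treating <code>categories</code> as the analogue of levels and <code>rename_categories</code> as the analogue of labels is an assumption made here for illustration; the two libraries do not map onto each other one-to-one.</p>

```python
import pandas as pd

# Same data as the R example: A..E repeated six times.
values = list("ABCDE") * 6

# categories plays the role of levels: E is the lowest level, A the highest.
cat = pd.Categorical(values, categories=["E", "D", "C", "B", "A"], ordered=True)

# rename_categories plays the role of labels: it changes the displayed names only.
labelled = cat.rename_categories(
    {"E": "EEE", "D": "DDD", "C": "CCC", "B": "BBB", "A": "AAA"}
)

print(list(labelled.categories))     # category order after relabelling
print(labelled.ordered)              # whether the categorical is ordered
print(labelled.min(), labelled.max())
```

<p>Because the order comes from the categories list rather than from alphabetical sorting, the minimum is the relabelled E and the maximum is the relabelled A.</p>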
<p>For questionnaire data whose answers are coded as numbers, always set labels so that the real meaning of each code is restored.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">factor(vector,labels=c(<span class="string">"AAA"</span>,<span class="string">"BBB"</span>,<span class="string">"CCC"</span>,<span class="string">"DDD"</span>,<span class="string">"EEE"</span>),ordered=<span class="literal">TRUE</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image2.jpg" alt=""></p>
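<p>Conversion from a factor variable to a text or numeric variable is done with as.character() or as.numeric(). Note that as.numeric() applied directly to a factor returns the internal integer codes of the levels; to recover the original numbers, use as.numeric(as.character()).</p>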
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">library</span>(dplyr)</div><div class="line">as.character(as.factor(<span class="number">1</span>:<span class="number">10</span>))%>%str()</div><div class="line">as.numeric(as.factor(<span class="number">1</span>:<span class="number">10</span>))%>%str()</div></pre></td></tr></table></figure>
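<p>The difference between level codes and original values is easy to miss when the values happen to be 1 to 10, because the codes then coincide with the values. A small pandas sketch of the same distinction, using values chosen here for illustration:</p>

```python
import pandas as pd

# A categorical built from numbers that do not coincide with their codes.
s = pd.Series([10, 20, 30, 20], dtype="category")

codes = s.cat.codes.tolist()      # positions of each value in the category index
values = s.astype(int).tolist()   # the original numbers
as_text = s.astype(str).tolist()  # the original numbers as text

print(codes)
print(values)
print(as_text)
```

<p>The codes are the analogue of what as.numeric() returns for an R factor, while converting through the stored values is the analogue of as.numeric(as.character()).</p>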
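<p><strong>Recoding factor variables in R</strong></p>
<p>If you have a numeric measure that needs to be converted into a binned factor variable, the cut function does the conversion.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">scale<-runif(<span class="number">100</span>,<span class="number">0</span>,<span class="number">100</span>)</div><div class="line">cut(x,breaks,labels=<span class="literal">NULL</span>,include.lowest=<span class="literal">FALSE</span>,right=<span class="literal">TRUE</span>,ordered_result=<span class="literal">FALSE</span>)</div></pre></td></tr></table></figure></p>
<ul>
<li>The arguments of cut are shown above. x is a numeric vector; breaks is either a numeric vector (the cut points) or a single number (the number of intervals).</li>
<li>right is a logical argument that sets whether the intervals are left-open and right-closed or left-closed and right-open (the default is left-open, right-closed).</li>
<li>include.lowest decides, depending on right, whether the outermost boundary value is included (if right is TRUE, i.e. left-open and right-closed intervals, the smallest break value is included; if right is FALSE, i.e. left-closed and right-open intervals, the largest break value is included). The default is FALSE.</li>
<li>ordered_result sets whether the result is an ordered factor (it can be abbreviated to ordered in a call, through R's partial argument matching).</li>
</ul>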
<figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">(factor1<-cut(scale,breaks=c(<span class="number">0</span>,<span class="number">20</span>,<span class="number">40</span>,<span class="number">60</span>,<span class="number">80</span>,<span class="number">100</span>),labels=c(<span class="string">"0~20"</span>,<span class="string">"20~40"</span>,<span class="string">"40~60"</span>,<span class="string">"60~80"</span>,<span class="string">"80~100"</span>),include.lowest=<span class="literal">TRUE</span>,ordered=<span class="literal">TRUE</span>))</div></pre></td></tr></table></figure>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image3.jpg" alt=""></p>
<p><strong>Another way to split is by quantiles</strong><br><figure class="highlight r"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">qa <- quantile(scale, c(<span class="number">0</span>,<span class="number">0.2</span>,<span class="number">0.4</span>,<span class="number">0.6</span>,<span class="number">0.8</span>,<span class="number">1.0</span>))</div><div class="line">(cut(scale,breaks=qa,labels=c(<span class="string">"0%~20%"</span>,<span class="string">"20%~40%"</span>,<span class="string">"40%~60%"</span>,<span class="string">"60%~80%"</span>,<span class="string">"80%~100%"</span>),include.lowest=<span class="literal">TRUE</span>,ordered=<span class="literal">TRUE</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image4.jpg" alt=""></p>
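<p>In pandas the quantile split is available as a single call, <code>pd.qcut</code>, which computes the quantile edges and bins the values in one step. The sketch below mirrors the R example; the seeded random data is an assumption made here so that the run is reproducible.</p>

```python
import numpy as np
import pandas as pd

# 100 uniform random values between 0 and 100, seeded for reproducibility.
rng = np.random.default_rng(0)
scale = pd.Series(rng.uniform(0, 100, 100))

labels = ["0%~20%", "20%~40%", "40%~60%", "60%~80%", "80%~100%"]

# q lists the quantile edges, like the probabilities passed to quantile() in R.
groups = pd.qcut(scale, q=[0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=labels)

counts = groups.value_counts()
print(counts)
```

<p>With 100 distinct values, each of the five quantile groups receives the same number of observations, which is the point of splitting by quantiles rather than by fixed cut points.</p>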
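<p>The splitting methods above are the most commonly used ways of converting to a factor variable. You could of course do a similar split with if statements, but cut is far more efficient.</p>
<h3 id="Python因子编码处理函数:"><a href="#Python因子编码处理函数:" class="headerlink" title="Python因子编码处理函数:"></a>Python functions for handling categorical variables:</h3><p>In Python, the pandas library provides a complete set of functions for working with factor (categorical) variables.</p>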
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</div><div class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</div><div class="line"><span class="keyword">import</span> string</div></pre></td></tr></table></figure>
<p>The official pandas online documentation discusses categorical variables in detail and compares them with R where appropriate.</p>
<p><a href="http://pandas.pydata.org/pandas-docs/stable/categorical.html#working-with-categories" target="_blank" rel="external">http://pandas.pydata.org/pandas-docs/stable/categorical.html#working-with-categories</a></p>
<p>When creating a Series with pandas, the categorical type can be set through the dtype argument of the Series constructor.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">s = pd.Series([<span class="string">"A"</span>,<span class="string">"B"</span>,<span class="string">"C"</span>,<span class="string">"D"</span>,<span class="string">"E"</span>], dtype=<span class="string">"category"</span>)</div></pre></td></tr></table></figure></p>
<p>A categorical column can also be created while building a data frame, by converting an existing column with astype.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">df = pd.DataFrame({<span class="string">"A"</span>:[<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>,<span class="string">"a"</span>]})</div><div class="line">df[<span class="string">"B"</span>] = df[<span class="string">"A"</span>].astype(<span class="string">'category'</span>)</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image5.jpg" alt=""></p>
<p>Besides creating them directly when building a Series or data frame, categorical variables can also be created in a Series or data frame through the dedicated constructor pd.Categorical.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">s = pd.Series(pd.Categorical([<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>,<span class="string">"a"</span>], categories=[<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>],ordered=<span class="keyword">False</span>))</div><div class="line">df = pd.DataFrame({<span class="string">"A"</span>:[<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>,<span class="string">"a"</span>]})</div><div class="line">df[<span class="string">"B"</span>] =pd.Series(pd.Categorical([<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>,<span class="string">"a"</span>], categories=[<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>],ordered=<span class="keyword">False</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image6.jpg" alt=""></p>
<p>An order can be added to the categories through .astype on the Series or data frame column. Recent pandas versions no longer accept categories= and ordered= as keyword arguments to astype, so the category list and the ordering are passed in a pd.CategoricalDtype.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">s = pd.Series([<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>,<span class="string">"a"</span>])</div><div class="line">s_cat = s.astype(pd.CategoricalDtype(categories=[<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>], ordered=<span class="keyword">True</span>))</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image7.jpg" alt=""></p>
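<p>Once a categorical variable has been created, whether in a Series or in a data frame, its type, its categories, and whether it is ordered can be inspected with the following attributes.</p>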
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">s_cat.dtypes</div><div class="line">s_cat.cat.categories</div><div class="line">s_cat.cat.ordered</div></pre></td></tr></table></figure>
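<p>Whether a categorical is ordered matters in practice, because sorting and comparisons then follow the category order instead of alphabetical order. A minimal sketch, with size labels chosen here purely for illustration:</p>

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# The category list defines the order: small < medium < large.
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
s_cat = sizes.astype(size_dtype)

sorted_values = s_cat.sort_values().tolist()    # sorted by category order
at_least_medium = (s_cat >= "medium").tolist()  # comparison uses category order
largest = s_cat.max()

print(sorted_values)
print(at_least_medium)
print(largest)
```

<p>Alphabetical sorting would have put "large" first; here it comes last because the category order says so.</p>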
<p>A more roundabout approach is to create an ordinary Series first and then convert it to a categorical by setting its type. To drop the categorical type and go back to an ordinary text Series, again just set the type with astype.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">s = pd.Series([<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>,<span class="string">"a"</span>])</div><div class="line">s2 = s.astype(pd.CategoricalDtype(categories=[<span class="string">"a"</span>,<span class="string">"b"</span>,<span class="string">"c"</span>],ordered=<span class="keyword">True</span>))</div><div class="line">s2.astype(str)</div></pre></td></tr></table></figure>
<p>Finally, how to split a numeric column of a data frame into a categorical variable: pandas provides a function with the same name as the R one, pd.cut.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">df = pd.DataFrame({<span class="string">'value'</span>: np.random.randint(<span class="number">0</span>, <span class="number">100</span>, <span class="number">20</span>)})</div><div class="line">labels = [ <span class="string">"{0} - {1}"</span>.format(i, i + <span class="number">9</span>) <span class="keyword">for</span> i <span class="keyword">in</span> range(<span class="number">0</span>,<span class="number">100</span>,<span class="number">10</span>) ]</div><div class="line">df[<span class="string">'group'</span>] = pd.cut(df.value, range(<span class="number">0</span>, <span class="number">105</span>, <span class="number">10</span>), right=<span class="keyword">False</span>, labels=labels)</div></pre></td></tr></table></figure></p>
<p><img src="http://orssvamao.bkt.clouddn.com/factor/image8.jpg" alt=""></p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">pd.cut(x, bins, right=<span class="keyword">True</span>, labels=<span class="keyword">None</span>, include_lowest=<span class="keyword">False</span>)</div><div class="line"><span class="comment">#x (df.value above) is the variable to split; bins is either a list (the cut points) or an integer (the number of bins),</span></div><div class="line"><span class="comment">#right controls whether the intervals are left-open/right-closed or left-closed/right-open,</span></div><div class="line"><span class="comment">#labels sets the labels shown in the output,</span></div><div class="line"><span class="comment">#include_lowest controls whether the first interval includes its left edge (compare the arguments of cut in R).</span></div></pre></td></tr></table></figure>
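<p>The effect of <code>right</code> and <code>include_lowest</code> is easiest to see on values that sit exactly on the outer bin edges. The following is a small sketch with hand-picked values (an assumption for illustration, not data from this post); values that fall outside every interval come back as missing.</p>

```python
import pandas as pd

values = pd.Series([0, 20, 40, 100])
edges = [0, 20, 40, 60, 80, 100]

# Default: left-open, right-closed intervals (0, 20], (20, 40], ...
right_closed = pd.cut(values, edges)

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(values, edges, include_lowest=True)

# right=False: left-closed, right-open intervals [0, 20), [20, 40), ...
left_closed = pd.cut(values, edges, right=False)

print(right_closed.isna().tolist())  # which values fell outside every interval
print(with_lowest.isna().tolist())
print(left_closed.isna().tolist())
```

<p>With the default settings the value 0 is left unassigned, and with right=False the value 100 is, so the bin edges should be chosen with the data's minimum and maximum in mind.</p>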
<h3 id="最后做一个小总结:"><a href="#最后做一个小总结:" class="headerlink" title="最后做一个小总结:"></a>A short summary:</h3><p>The functions for working with factor variables in R and Python covered in this post:</p>
<p><strong>R:</strong></p>
<p>Create a factor variable:<br>factor<br>Convert a factor variable:<br>as.factor<br>as.numeric(as.character)<br>Split into a factor variable:<br>cut</p>
<p><strong>Python:</strong></p>
<p>Create a categorical variable:<br>pd.Categorical(categories=,ordered=)<br>pd.Series(dtype="category")<br>Convert a categorical variable:<br>s.astype(pd.CategoricalDtype(categories=,ordered=))<br>Split into a categorical variable:<br>pd.cut(df.value,bins=,right=,labels=)</p>
<hr>
<p><strong>Contact:</strong><br>wechat:ljty1991<br>Mail:578708965@qq.com<br>Personal WeChat official account: 数据小魔方 (datamofang)<br>Team WeChat official account: EasyCharts<br>QQ group: [魔方学院] 298236508</p>