Commit fd6cbb0

make versions and add lists without comments
1 parent bd5a06d commit fd6cbb0

8 files changed: +11024 -154 lines changed

make_lists.R

Lines changed: 8 additions & 6 deletions
@@ -8,7 +8,7 @@ source("~/Documents/github/r-dev/helpers.R")
 # GREEK
 
 # Set version number
-version_greek <- "2.4"
+version_greek <- "2.5"
 
 # Convert current JSON list to TXT with Markdown headings
 greek_json <- read_file("stopwords_greek.json")
@@ -54,17 +54,18 @@ greek_metadata <- paste0(
 stopwords_greek <- paste0(greek_metadata, greek_raw)
 stopwords_greek <- utf8::utf8_normalize(stopwords_greek)
 write_file(stopwords_greek, "stopwords_greek.txt")
+write_file(stopwords_greek, paste("./versions/stopwords_greek_v", str_replace(version_greek, "\\.", "_"), ".txt", sep = ""))
 
 # Make file without categories as comments
 greek_raw %>%
-str_replace_all("#.+\n", "\n") %>%
-str_replace_all("\n\n", "\n") %>%
+str_replace_all("#.+\n", "") %>%
+str_replace_all("\n+", "\n") %>%
 write_file("./test/test_json_txt/stopwords_greek_no_comments.txt")
 
 # LATIN
 
 # Set version number
-version_latin <- "2.3"
+version_latin <- "2.4"
 
 # Convert current JSON list to TXT with Markdown headings
 latin_json <- read_file("stopwords_latin.json")
@@ -109,9 +110,10 @@ latin_metadata <- paste0(
 )
 stopwords_latin <- paste0(latin_metadata, latin_raw)
 write_file(stopwords_latin, "stopwords_latin.txt")
+write_file(stopwords_latin, paste("./versions/stopwords_latin_v", str_replace(version_latin, "\\.", "_"), ".txt", sep = ""))
 
 # Make file without categories as comments
 latin_raw %>%
-str_replace_all("#.+\n", "\n") %>%
-str_replace_all("\n\n", "\n") %>%
+str_replace_all("#.+\n", "") %>%
+str_replace_all("\n+", "\n") %>%
 write_file("./test/test_json_txt/stopwords_latin_no_comments.txt")
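
Note on the make_lists.R changes: the sketch below reproduces, on a small made-up list body, how the versioned file name is built and how the old and new comment-stripping patterns differ. It assumes the stringr package is installed. `sample_raw` and the other variable names are hypothetical and not taken from the repository. It is an illustration of the two regex pairs in the diff, not the script itself.

```r
library(stringr)

# Versioned file name, built the same way as in the diff ("2.5" -> "2_5").
version_greek <- "2.5"
versioned_path <- paste(
  "./versions/stopwords_greek_v",
  str_replace(version_greek, "\\.", "_"),
  ".txt",
  sep = ""
)

# Hypothetical list body: two category headings with two items each.
sample_raw <- "# Prepositions\nab\nad\n\n# Conjunctions\net\nsed\n"

# Old patterns: each heading becomes a newline, then one pass collapses
# pairs of newlines. A run of three newlines still leaves a blank line,
# and the output starts with a newline.
old_step <- str_replace_all(sample_raw, "#.+\n", "\n")
old_result <- str_replace_all(old_step, "\n\n", "\n")

# New patterns: headings are dropped entirely, then any run of newlines
# collapses to a single newline.
new_step <- str_replace_all(sample_raw, "#.+\n", "")
new_result <- str_replace_all(new_step, "\n+", "\n")

print(versioned_path)
cat("old:", old_result, sep = "\n")
cat("new:", new_result, sep = "\n")
```

On this sample the old patterns leave a leading newline and one blank line between the two groups, while the new patterns give one item per line with no blank lines.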

rationale.md

Lines changed: 13 additions & 7 deletions
@@ -17,16 +17,18 @@ For version 2 (January-February 2018) I rebased the lists on corpus statistics a
 Total number of items (tokens or symbols):
 
 * Latin
-* 4009 items in [stopwords_latin_v2_3.txt](versions/stopwords_latin_v2_3.txt)
-* 4008 items in [stopwords_latin_v2_2.txt](versions/stopwords_latin_v2_2.txt)
-* 3844 items in [stopwords_latin_v2_1.txt](versions/stopwords_latin_v2_1.txt)
+* 4010 items in [stopwords_latin_v2_4.txt](versions/stopwords_latin_v2_4.txt)
+* 4010 items in [stopwords_latin_v2_3.txt](versions/stopwords_latin_v2_3.txt)
+* 4009 items in [stopwords_latin_v2_2.txt](versions/stopwords_latin_v2_2.txt)
+* 3845 items in [stopwords_latin_v2_1.txt](versions/stopwords_latin_v2_1.txt)
 * 3839 items in [stopwords_latin_v2_0.txt](versions/stopwords_latin_v2_0.txt)
 * 0144 items in [stopwords_latin_v1_0.txt](versions/stopwords_latin_v1_0.txt)
 * Greek
-* 6694 items in [stopwords_greek_v2_4.txt](versions/stopwords_greek_v2_4.txt)
-* 6693 items in [stopwords_greek_v2_3.txt](versions/stopwords_greek_v2_3.txt)
-* 6529 items in [stopwords_greek_v2_2.txt](versions/stopwords_greek_v2_2.txt)
-* 6517 items in [stopwords_greek_v2_1.txt](versions/stopwords_greek_v2_1.txt)
+* 6695 items in [stopwords_greek_v2_5.txt](versions/stopwords_greek_v2_5.txt)
+* 6695 items in [stopwords_greek_v2_4.txt](versions/stopwords_greek_v2_4.txt)
+* 6694 items in [stopwords_greek_v2_3.txt](versions/stopwords_greek_v2_3.txt)
+* 6530 items in [stopwords_greek_v2_2.txt](versions/stopwords_greek_v2_2.txt)
+* 6518 items in [stopwords_greek_v2_1.txt](versions/stopwords_greek_v2_1.txt)
 * 7573 items in [stopwords_greek_v2_0.txt](versions/stopwords_greek_v2_0.txt)
 * 0262 items in [stopwords_greek_v1_0.txt](versions/stopwords_greek_v1_0.txt)
 
@@ -136,6 +138,10 @@ I added abbreviations commonly found in critical apparatus and notes. They don'
 
 I sorted both lists alphabetically (except for numerals) with James Tauber's [Pyuca](https://github.com/jtauber/pyuca/) software ("Python Unicode Collation Algorithm implementation"). This was mostly useful for Greek as polytonic Unicode is not handled correctly by default. I also reinserted a number ("79") missing in Roman numerals.
 
+### Latin version 2.4 and Greek version 2.5: Fixing counts
+
+I only corrected the item counts. The function I used was counting the octothorpe sign as a comment in the list of typographical symbols.
+
 ## Feedback?
 
 Questions, comments and advice are most welcome.
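
Note on the count fix: the counting function itself is not part of this diff, so the following is only a guess at the kind of off-by-one the rationale describes. It assumes stringr is installed; `lines`, `naive_count` and `strict_count` are hypothetical names, and both filters are illustrative assumptions rather than the repository's code.

```r
library(stringr)

# Hypothetical excerpt of a list body: one category heading, then three
# typographical symbols, one of which is the octothorpe itself.
lines <- c("# Symbols", "!", "#", "$")

# Naive filter: every line starting with "#" is treated as a comment,
# so the "#" item is dropped along with the heading.
naive_count <- sum(!str_detect(lines, "^#"))

# Stricter filter: a comment is "#", a space, then some text,
# so a lone "#" is still counted as an item.
strict_count <- sum(!str_detect(lines, "^# .+"))

print(c(naive = naive_count, strict = strict_count))
```

Here the naive filter counts 2 items and the stricter one counts 3, matching the one-item difference between the old and new counts. The file header shown in the diffs contains a bare `#` line of its own, so a filter like the stricter one would only be safe on the list body, not on the full file with its metadata.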

stopwords_greek.txt

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 # Ancient Greek stopwords
-# version 2.4
-# 2018-02-10
+# version 2.5
+# 2018-02-13
 # Aurélien Berra
 #
 # Ancient Greek stopwords for textual analysis
 # language: Ancient Greek (grc)
 # type: dataset
-# items count: 6694
+# items count: 6695
 # https://github.com/aurelberra/stopwords
 # rights: CC-BY-NC-SA

stopwords_latin.txt

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 # Ancient Latin stopwords
-# version 2.3
-# 2018-02-10
+# version 2.4
+# 2018-02-13
 # Aurélien Berra
 #
 # Ancient Latin stopwords for textual analysis
 # language: Latin (la, lat)
 # type: dataset
-# items count: 4009
+# items count: 4010
 # https://github.com/aurelberra/stopwords
 # rights: CC-BY-NC-SA
