Commit

inscr types
petrifiedvoices committed Feb 17, 2022
1 parent 32cf06a commit b4f798d
Showing 2 changed files with 12 additions and 35 deletions.
29 changes: 0 additions & 29 deletions scripts/1_3_r_EDCS_exploration.Rmd
@@ -393,35 +393,6 @@ First 20 Comments to see the nature of the contents:
unique(EDCS$Comment)[1:20]
```

# Text of inscription

## How many inscriptions contain a text of an inscription

```{r}
length(na.omit(EDCS$clean_text_interpretive_word))
```
In percent:
```{r}
length(na.omit(EDCS$clean_text_interpretive_word))/(nrow(EDCS)/100)
```

## How many words there are:

Original text before cleaning:
```{r}
sum(lengths(gregexpr("\\w+", EDCS$inscription)) + 1)
#different counting method
sum(na.omit(str_count(EDCS$inscription, '\\w+')))
```

Text after cleaning:
```{r}
sum(lengths(gregexpr("\\w+", EDCS$clean_text_interpretive_word)) + 1)
#different counting method
sum(na.omit(str_count(EDCS$clean_text_interpretive_word, '\\w+')))
```
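The two counting methods in the chunks above are not equivalent. `lengths(gregexpr(...)) + 1` is the usual idiom for counting separators, so with a `\\w+` pattern it adds one extra word per row, and rows with no match or with `NA` still contribute. A minimal base R sketch on a toy vector (the `texts` values and the `count_words()` helper are hypothetical, not part of the repository) shows the difference:

```r
# Toy input: one real text, one empty string, one missing value.
texts <- c("dis manibus sacrum", "", NA)

# Idiom from the chunk above: number of matches plus one, summed over rows.
# gregexpr() returns a single -1 (or NA) for rows without a match, so
# lengths() is 1 for those rows and every row is overcounted.
idiom_total <- sum(lengths(gregexpr("\\w+", texts)) + 1)

# Hypothetical helper: count only real match positions (> 0) per row.
count_words <- function(x) {
  matches <- gregexpr("\\w+", x)
  vapply(matches, function(m) sum(m > 0, na.rm = TRUE), integer(1))
}

word_total <- sum(count_words(texts))

print(c(idiom = idiom_total, words = word_total))
```

On this input the helper agrees with what `sum(na.omit(str_count(x, "\\w+")))` is meant to compute, which suggests the `str_count` variant is the one to trust for the totals.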



18 changes: 12 additions & 6 deletions scripts/1_4_r_EDCS_text_exploration.Rmd
@@ -14,7 +14,7 @@ output:

```{r setup, include=FALSE, echo=FALSE}
require("knitr")
opts_knit$set(root.dir = "/home/petra/Github/EDCS_ETL/")
#opts_knit$set(root.dir = "/home/petra/Github/EDCS_ETL/")
library(tidyverse)
library(jsonlite)
@@ -30,6 +30,7 @@ Make a list and tibble from the downloaded dataset
EDCS <- jsonlite::fromJSON("../output/EDCS_text_cleaned_2022-02-15.json")
```


# Text of inscription

## How many inscriptions contain a text of an inscription
@@ -116,17 +117,22 @@ head(int_wordcounts, 10)
EDCS_inscrtype<- EDCS %>%
filter(EDCS$inscr_type != "list()")
random100 <- sample_n(EDCS_inscrtype, 100)
EDCS_inscrtype_export <- EDCS_inscrtype %>%
select(`EDCS-ID`, publication, province, province_list , place, place_list, Links, language, `dating from`, `dating to`, start_yr, end_yr_list, end_yr_1, notes_dating, status, status_list, inscr_type, status_notation, inscr_process, Latitude, Longitude, photo, Material, Comment, notes_references, notes_comments, inscription, inscription_stripped_final, clean_text_interpretive_word, clean_text_conservative, notes_dating, notes_references, notes_comments)
select(`EDCS-ID`, Material, inscription, clean_text_interpretive_word, clean_text_conservative)
random100 <- sample_n(EDCS_inscrtype_export, 100)
write_csv(x=EDCS_inscrtype_export, path="output/EDCS_random100_inscrtype.csv")
write_csv(x=random100, path="../output/EDCS_random100_inscrtype.csv")
```
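The export chunk above draws 100 random rows, so unless a seed is set elsewhere in the notebook, each knit writes a different sample. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this export. It uses base R and a toy data frame to stay dependency-free; `inscriptions`, `sample_rows()` and the seed value are hypothetical names and values, not taken from the repository.

```r
# Toy stand-in for the filtered table; column names are illustrative only.
inscriptions <- data.frame(
  id = sprintf("ID-%03d", 1:250),
  text = paste("text", 1:250),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw n rows reproducibly (the seed value is arbitrary).
sample_rows <- function(df, n, seed = 2022) {
  set.seed(seed)
  df[sample(nrow(df), n), , drop = FALSE]
}

random100 <- sample_rows(inscriptions, 100)

# Write to a temporary file so the sketch never touches the project output.
out_file <- tempfile(fileext = ".csv")
write.csv(random100, out_file, row.names = FALSE)
```

If the chunk keeps using readr, note that `write_csv()` deprecated its `path` argument in favour of `file` in readr 1.4.0, so `write_csv(x = random100, file = "...")` avoids the deprecation warning on current versions.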


### Extracting labels for inscription types
```{r}
EDCS_types<- as.data.frame(unique(unlist(EDCS$inscr_type)))
write_csv(x=EDCS_types, path="../output/EDCS_types_inscr.csv", col_names = FALSE)
```
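The label extraction above relies on `inscr_type` being a list column in which each element holds zero or more labels. As a small illustration under that assumption, the sketch below flattens a toy list column into its unique labels and flags the empty elements with `lengths()`. The values `type_a` and `type_b` are placeholders, not real EDCS labels.

```r
# Toy list column: one label, no label, and two labels per inscription.
inscr_type <- list("type_a", list(), c("type_b", "type_a"))

# TRUE where an inscription carries at least one type label.
has_type <- lengths(inscr_type) > 0

# Flatten all elements and keep each label once, sorted for stable output.
type_labels <- sort(unique(unlist(inscr_type)))

print(has_type)
print(type_labels)
```

Testing for empty elements with `lengths(x) > 0` does not depend on how an empty list is printed, which may make it a sturdier filter than comparing against the string `"list()"`.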


