diff --git a/README.md b/README.md
index 363a066..cf5591a 100644
--- a/README.md
+++ b/README.md
@@ -2,22 +2,28 @@
This repository has two main components:
-1. Indicator framework for quickly generating aggregate indicators.
-2. Reporting framework for generating reports from those indicators.
+1. Aggregate table framework for generating aggregate tables.
+2. Reporting framework for generating reports from those aggregate tables.
-## Indicator framework
+## Aggregate table framework
-To build the monthly and lifetime indicator tables issue the following
-command:
+To build the aggregate tables, issue the following
+commands:
- make indicators
+ make clean
+ make aggregate_tables
-This calls [indicators.R](indicators.R), which creates
-tables described in [indicators.json](indicators.json)
-and [indicator_functions.R](indicator_functions.R).
+Alternatively, `make debug` builds the aggregate tables from a subset of rows from each data source. This is useful for data sources that involve long-running queries. Specify debug mode as follows:
+
+ make clean
+ make debug
+
+Both forms call [aggregate_tables.R](aggregate_tables.R), which creates
+tables described in [aggregate_tables.json](aggregate_tables.json)
+using the indicator calculation functions in [indicator_functions.R](indicator_functions.R).
How does one add an indicator?
-[indicators.json](indicators.json) contains a list of
+[aggregate_tables.json](aggregate_tables.json) contains a list of
tables to be created. Each element in the list is a dictionary with
four pairs:
@@ -28,23 +34,19 @@ four pairs:
4. A list of components that will be merged together to build the
final table.
-Each component has three key-value pairs:
+Each component has two key-value pairs:
-1. The name of the database where the data for this component is
- pulled.
-2. The name of the table where the data for this component is pulled.
-3. The columns to be created.
+1. The name of the table where the data for this component is pulled.
+2. The columns to be created.
Here is an example configuration:
[
{
- "database": "dimagi_data_platform",
- "table": "monthly",
+ "table": "aggregate_monthly_interactions",
"by": ["domain", "user_id", "month.index"],
"components": [
{
- "database": "dimagi_data_platform",
"table": "interactions",
"columns": {
"column1": "date_first_visit",
@@ -52,7 +54,6 @@ Here is an example configuration:
}
},
{
- "database": "dimagi_data_platform",
"table": "device_logs",
"columns": {
"column3": "nrow"
@@ -62,12 +63,11 @@ Here is an example configuration:
}
]
-This will create a new table called `monthly` in the
-`dimagi_data_platform` with three columns. `column1` will be the date
-of the first visit a user performed in the month. `column2` will be
-the date of the last visit a user performed in the month. `column3`
-will count the number of device logs recorded in a month for a given
-user. Notice that the key for each column is the name of the column
+This will create a new table called `aggregate_monthly_interactions` with three columns.
+`column1` will be the date of the first visit a user performed in the month.
+`column2` will be the date of the last visit a user performed in the month.
+`column3` will count the number of device logs recorded in a month for a given user.
+Notice that the key for each column is the name of the column
where that indicator will be stored, the value for each column is the
R function to be called on each block of data. To modularize the code
a bit, we have moved custom indicator functions into
@@ -78,9 +78,18 @@ the definition of `date_first_visit`:
This function takes a data.frame `x` and returns the minimum value of
the `visit_date` column of `x`. So, adding an indicator always
-requires modifying [indicators.json](indicators.json)
+requires modifying [aggregate_tables.json](aggregate_tables.json)
and if the necessary R function to calculate the indicator does not
exist then it must be added to
[indicator_functions.R](indicator_functions.R).
## Reporting framework
+Report modules produce one or more PDF reports. The report modules for a run are specified in `config_run.json`, along with options such as split-by variables and report dates.
+
+`test_report.R` is an example report module.
+
+The script `reports.R` reads the config file, sets up variables (including filtering domains and returning the list of domains to include in the report run), and runs the report modules specified in the config. Each report module should return a list of file names (full paths) for the individual PDFs it has written to `tmp_report_pdf_dir`. Once all report modules have run, `reports.R` collates these into a single PDF using [pdftk](http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/).
+
+Report modules should provide a `render` function that returns a list of PDF files (full paths) generated by the module. The `render` function should be defined as follows:
+
+    render <- function(db, domains_for_run, report_options, tmp_report_pdf_dir)
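For reference, here is a minimal sketch of a report module satisfying that contract. Only the `render` signature comes from this repository; the plotting body and file name are purely illustrative:

```r
# Minimal example report module. The render() signature matches the
# contract documented above; the body is an illustrative placeholder.
render <- function(db, domains_for_run, report_options, tmp_report_pdf_dir) {
  out_file <- file.path(tmp_report_pdf_dir, "example_report.pdf")
  pdf(out_file)
  plot(seq_along(domains_for_run), type = "h",
       main = sprintf("%d domains in this run", length(domains_for_run)))
  dev.off()
  # Return the full paths of all PDFs written, for collation by reports.R
  return(list(out_file))
}
```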
diff --git a/aggregate_tables.R b/aggregate_tables.R
index ddcc8bb..bed2d0c 100644
--- a/aggregate_tables.R
+++ b/aggregate_tables.R
@@ -29,26 +29,28 @@ drop_tables <- function(file) {
write_tables <- function(file, debug) {
config <- fromJSON(file=file)
-
+ db <- get_db_connection()
for (table.info in config) {
- print(paste('Writing', table.info$table, 'indicator table.'))
- df <- compute_indicators(table.info, debug)
- db <- get_db_connection()
+    print(paste('Computing indicators for', table.info$table, 'table.'))
+    df <- compute_indicators(table.info, db, debug)
+    print(paste('Writing', table.info$table, 'indicator table.'))
dbRemoveTable(db$con, name=table.info$table)
copy_to(db, df=df, name=table.info$table, temporary=FALSE)
}
}
-compute_indicators <- function(info, debug) {
+compute_indicators <- function(info, db, debug) {
debug <- as.logical(debug)
if (debug == T) {limit = 5000} else {limit = -1}
dfs <- lapply(info$components, function(component) {
- db <- get_db_connection()
+    print(paste('Getting data source', component$table))
source.data <- get_data_source(db, component$table, limit)
group.by.str <- paste(info$by, collapse=', ')
+ print(paste('Grouping and aggregating', component$table))
df <- source.data %.% s_group_by(group.by.str) %.% aggregate(component$columns)
return(df)
})
+ print('merging...')
merged <- Reduce(function(...) merge(..., all.x=TRUE, all.y=TRUE, by=info$by), dfs)
return(merged)
}
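To make the merge step concrete, here is a self-contained sketch of what the `Reduce(merge, ...)` call in `compute_indicators` does, using two toy component data frames (hypothetical data):

```r
# Two toy component tables keyed by user_id (hypothetical data).
visits <- data.frame(user_id = c("a", "b"), nvisits = c(3, 5))
forms  <- data.frame(user_id = c("b", "c"), nforms  = c(7, 2))

# Same pattern as compute_indicators: a full outer join over the
# grouping columns, so a user present in any component is kept.
merged <- Reduce(function(...) merge(..., all.x = TRUE, all.y = TRUE, by = "user_id"),
                 list(visits, forms))
# merged has one row each for users a, b, c; cells with no source row are NA.
```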
diff --git a/aggregate_tables.json b/aggregate_tables.json
index da5555e..be295b4 100644
--- a/aggregate_tables.json
+++ b/aggregate_tables.json
@@ -3,41 +3,58 @@
"table":"aggregate_lifetime_interactions",
"by":[
"domain",
- "user_id"
+ "user_id",
+ "user_pk"
],
"components":[
{
- "table":"interactions",
+ "table":"visit_detail",
"columns":{
"date_first_visit":"date_first_visit",
"date_last_visit":"date_last_visit",
"nvisits":"nvisits",
- "days_on_cc":"days_on_cc",
- "days_visit_last":"days_visit_last",
- "active_user":"active_user",
"calendar_month_on_cc":"calendar_month_on_cc",
- "active_months":"active_months",
- "active_month_percent":"active_month_percent",
- "active_days":"active_days",
- "active_day_percent":"active_day_percent",
"nforms":"nforms",
"median_visit_duration":"median_visit_duration",
"median_visits_per_day":"median_visits_per_day",
"median_visits_per_month":"median_visits_per_month",
- "median_time_elapsed_btw_visits":"median_time_elapsed_btw_visits",
- "batch_entry_visit":"batch_entry_visit",
- "batch_entry_percent":"batch_entry_percent",
- "ncases_registered":"ncases_registered",
- "register_followup":"register_followup",
- "case_register_followup_rate":"case_register_followup_rate",
"morning":"morning",
"afternoon":"afternoon",
+ "evening":"evening",
"night":"night",
- "after_midnight":"after_midnight",
- "ncases_opened":"ncases_opened",
+ "time_using_cc":"time_using_cc",
+ "nvisits_travel":"nvisits_travel",
+ "nvisits_travel_batch":"nvisits_travel_batch"
+ }
+ },
+ {
+ "table":"interactions",
+ "columns":{
+ "ninteractions":"ninteractions",
+ "ncases_registered":"ncases_registered",
+ "register_followup":"register_followup",
+ "case_register_followup_rate":"case_register_followup_rate",
"ncases_touched":"ncases_touched",
"nunique_followups":"nunique_followups"
}
+ },
+ {
+ "table":"device_type",
+ "columns":{
+ "summary_device_type":"summary_device_type",
+ "nforms": "nforms",
+ "active_days":"active_days",
+ "active_months":"active_months"
+
+ }
+ },
+ {
+ "table":"device_log_types_by_user",
+ "columns":{
+ "total_logs":"total_logs",
+ "audio_plays":"audio_plays",
+ "network_warnings":"network_warnings"
+ }
}
]
},
@@ -46,32 +63,35 @@
"by":[
"domain",
"user_id",
+ "user_pk",
"month.index"
],
"components":[
{
- "table":"interactions",
+ "table":"visit_detail",
"columns":{
"date_first_visit":"date_first_visit",
"date_last_visit":"date_last_visit",
"nvisits":"nvisits",
- "days_on_cc":"days_on_cc",
- "active_days":"active_days",
- "active_day_percent":"active_day_percent",
- "nforms":"nforms",
"median_visit_duration":"median_visit_duration",
"median_visits_per_day":"median_visits_per_day",
- "median_time_elapsed_btw_visits":"median_time_elapsed_btw_visits",
- "batch_entry_visit":"batch_entry_visit",
- "batch_entry_percent":"batch_entry_percent",
- "ncases_registered":"ncases_registered",
- "register_followup":"register_followup",
- "case_register_followup_rate":"case_register_followup_rate",
"morning":"morning",
"afternoon":"afternoon",
+ "evening":"evening",
"night":"night",
- "after_midnight":"after_midnight",
- "ncases_opened":"ncases_opened",
+ "numeric_index":"numeric_index",
+ "time_using_cc":"time_using_cc",
+ "nvisits_travel":"nvisits_travel",
+ "nvisits_travel_batch":"nvisits_travel_batch"
+ }
+ },
+ {
+ "table":"interactions",
+ "columns":{
+ "ninteractions":"ninteractions",
+ "ncases_registered":"ncases_registered",
+ "register_followup":"register_followup",
+ "case_register_followup_rate":"case_register_followup_rate",
"ncases_touched":"ncases_touched",
"nunique_followups":"nunique_followups"
}
@@ -79,7 +99,17 @@
{
"table":"device_type",
"columns":{
- "summary_device_type":"summary_device_type"
+ "summary_device_type":"summary_device_type",
+ "nforms": "nforms",
+ "active_days":"active_days"
+ }
+ },
+ {
+ "table":"device_log_types_by_user",
+ "columns":{
+ "total_logs":"total_logs",
+ "audio_plays":"audio_plays",
+ "network_warnings":"network_warnings"
}
}
]
diff --git a/aggregate_tables/indicator_functions.R b/aggregate_tables/indicator_functions.R
index d5f40b1..b15edfe 100644
--- a/aggregate_tables/indicator_functions.R
+++ b/aggregate_tables/indicator_functions.R
@@ -4,85 +4,75 @@
library(lubridate)
library(zoo)
+# VISIT TABLE INDICATORS:
date_first_visit <- function(x) min(x$visit_date, na.rm=TRUE)
date_last_visit <- function(x) max(x$visit_date, na.rm=TRUE)
-nvisits <- function(x) nrow(x)
-days_on_cc <- function(x) as.numeric(date_last_visit(x) - date_first_visit(x)) + 1
+nvisits <- function(x) NROW(x)
-## The next five indicators only make sense for the lifetime table.
-days_visit_last <- function(x) as.numeric(Sys.Date() - date_last_visit(x))
-active_user <- function(x) ifelse(days_visit_last(x) <= 30, 1, 0)
-calendar_month_on_cc <- function(x) {
- first.month <- as.yearmon(date_first_visit(x))
- last.month <- as.yearmon(date_last_visit(x))
- nmonths <- 12 * as.numeric(last.month - first.month) + 1
- return(nmonths)
+get_visit_duration <- function(start, end) {
+ dur <- as.numeric(end - start, units="mins")
+ if (is.na(dur)) return(NA)
+ if (dur < 0) return(0)
+ if (dur > 30) return(30)
+ return(dur)
}
-active_months <- function(x) length(unique(as.yearmon(x$visit_date)))
-active_month_percent <- function(x) active_months(x) / calendar_month_on_cc(x)
-
-active_days <- function(x) length(unique(x$visit_date))
-active_day_percent <- function(x) active_days(x) / days_on_cc(x)
-nforms <- function(x) sum(x$total_forms, na.rm=TRUE)
-median_visit_duration <- function(x) median(x$form_duration / 60, na.rm=TRUE)
+durations <- function(x) na.omit(mapply(get_visit_duration, x$time_start, x$time_end))
+median_visit_duration <- function(x) round(as.numeric(median(durations(x))), digits = 1)
+time_using_cc <- function(x) sum(durations(x), na.rm = T)
-## TODO: Right now this only considers days with at least on visit.
median_visits_per_day <- function(x) median(as.numeric(table(x$visit_date)), na.rm=TRUE)
+nvisits_travel <- function(x) sum(x$home_visit, na.rm=T)
+nvisits_travel_batch <- function(x) sum(x$time_since_previous_hv/60<10, na.rm = T)
-## This only makes sense in the lifetime table.
-median_visits_per_month <- function(x) median(as.numeric(table(as.yearmon(x$visit_date))), na.rm=TRUE)
-
-median_time_elapsed_btw_visits <- function(x) median(x$time_since_previous, na.rm=TRUE)
+# Proportion of visits by time of day
+morning <- function(x) mean(x$visit_time == 'morning')*100
+afternoon <- function(x) mean(x$visit_time == 'afternoon')*100
+evening <- function(x) mean(x$visit_time == 'night')*100
+night <- function(x) mean(x$visit_time == 'after midnight')*100
-## To get reliable follow-up rates we'll need the full lifetime
-## data. Additionally, can the case be followed-up by a different FLW?
-## I'm not sure how to interpret this indicator description: "median
-## time (in mins) elapsed between two followup visits conducted by a
-## mobile user"
-## I took these two indicators out of the config file because they
-## were running too slowly.
-median_time_btw_followup <- function(x) {
- f <- function(block) {
- sorted.times <- sort(ymd_hms(block$time_start))
- value <- ifelse(
- nrow(block) <= 1,
- NA,
- difftime(sorted.times[2], sorted.times[1], units='mins')
- )
- return(data.frame(followup_time=value))
- }
- times <- x %.% group_by(case_id) %.% do(f(.))
- return(median(times$followup_time, na.rm=TRUE))
+# User's first, second, third... etc. month on CC
+numeric_index <- function (x) {
+ first_possible_visit_date <- as.POSIXct(strptime("2010-01-01 00:00:00", "%Y-%m-%d %H:%M:%S"))
+
+ this_month <- as.POSIXct(format(min(x$time_start),"%Y-%m-01"), tz = "UTC")
+ if (this_month < first_possible_visit_date) { return (1) }
+
+ start_month <- as.POSIXct(format(min(x$user_start_date),"%Y-%m-01"), tz = "UTC")
+ if (start_month < first_possible_visit_date) {start_month <- first_possible_visit_date}
+
+ total_months <- length(seq(from=start_month, to=this_month, by='month'))
+ return (total_months)
}
-median_days_btw_followup <- function(x) median_time_btw_followup(x) / 60 / 24
-
-batch_entry_visit <- function(x) sum(x$batch_entry, na.rm=TRUE)
-batch_entry_percent <- function(x) mean(x$batch_entry, na.rm=TRUE)
-
-ncases_registered <- function(x) sum(x$new_case, na.rm=TRUE)
-register_followup <- function(x) sum(x$follow_up)
-case_register_followup_rate <- function(x) mean(x$follow_up)
-morning <- function(x) mean(x$visit_time == 'morning')
-afternoon <- function(x) mean(x$visit_time == 'afternoon')
-night <- function(x) mean(x$visit_time == 'night')
-after_midnight <- function(x) mean(x$visit_time == 'after midnight')
-
-ncases_opened <- function(x) sum(x$new_case)
+## The next indicators are only applicable for the lifetime table.
+calendar_month_on_cc <- function(x) {
+ first.month <- as.yearmon(date_first_visit(x))
+ last.month <- as.yearmon(date_last_visit(x))
+ nmonths <- 12 * as.numeric(last.month - first.month) + 1
+ return(nmonths)
+}
+median_visits_per_month <- function(x) median(as.numeric(table(as.yearmon(x$visit_date))), na.rm=TRUE)
-## TODO: Some indicators do not fit into our current framework. For
-## instance, the total number of cases that are not opened at some
-## time point before this month and not closed yet until this month.
-## cum_open_cases: total number of cases that are not opened at some
-## time point before this month and not closed yet until this month
+# INTERACTION TABLE INDICATORS:
+ninteractions <- function(x) NROW(x)
+ncases_registered <- function(x) sum(x$created, na.rm=TRUE)
+register_followup <- function(x) sum(!x$created)
+case_register_followup_rate <- function(x) mean(!x$created)*100
ncases_touched <- function(x) length(unique(x$case_id))
+n_followups <- function(x) {
+ stopifnot(!any(is.na(x$created)))
+ stopifnot(all(x$created == 0 | x$created == 1))
+ return(length(x$case_id[x$created == 0]))
+}
nunique_followups <- function(x) {
- stopifnot(!any(is.na(x$follow_up)))
- stopifnot(all(x$follow_up == 0 | x$follow_up == 1))
- return(length(x$case_id[x$follow_up == 1]))
+ stopifnot(!any(is.na(x$created)))
+ stopifnot(all(x$created == 0 | x$created == 1))
+ return(length(unique(x$case_id[x$created == 0])))
}
+median_days_btw_followup <- function(x) median(x$days_elapsed_case)
+# DEVICE TYPE TABLE INDICATORS:
summary_device_type <- function (x) {
if (length(unique(x$device)) == 1) {
s <- paste(toupper(substring(x$device[1], 1,1)), substring(x$device[1], 2),
@@ -92,4 +82,11 @@ summary_device_type <- function (x) {
return ('Multi')
}
}
+nforms <- function(x) NROW(x)
+active_days <- function(x) length(unique(as.Date(x$time_start)))
+active_months <- function(x) length(unique(x$month.index))
+# DEVICE LOG TABLE INDICATORS:
+total_logs <- function(x) sum(x$num_logs)
+audio_plays <- function(x) sum(x$num_logs[x$log_type == 'audio'], na.rm = TRUE)
+network_warnings <- function(x) sum(x$num_logs[x$log_type == 'warning-network'], na.rm = TRUE)
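A quick check of the duration clamping used by the visit-table indicators. The function is restated here from the diff so the snippet is self-contained:

```r
# Restated from indicator_functions.R: visit durations are floored at 0
# and capped at 30 minutes; unknown durations stay NA.
get_visit_duration <- function(start, end) {
  dur <- as.numeric(end - start, units="mins")
  if (is.na(dur)) return(NA)
  if (dur < 0) return(0)
  if (dur > 30) return(30)
  return(dur)
}

t0 <- as.POSIXct("2014-01-01 10:00:00", tz = "UTC")
get_visit_duration(t0, t0 + 10 * 60)   # 10-minute visit -> 10
get_visit_duration(t0, t0 + 90 * 60)   # 90-minute visit capped -> 30
get_visit_duration(t0, t0 - 5 * 60)    # negative duration floored -> 0
```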
diff --git a/aggregate_tables/monthly_table_functions.R b/aggregate_tables/monthly_table_functions.R
new file mode 100644
index 0000000..df30537
--- /dev/null
+++ b/aggregate_tables/monthly_table_functions.R
@@ -0,0 +1,33 @@
+library(data.table)
+library(lubridate)
+library(dplyr)  # arrange() below comes from dplyr
+
+# Calculate day gaps between consecutive active months to derive the
+# next_three_months_active and previous_three_months_active variables
+add_next_previous_active <- function(all_monthly) {
+ all_monthly$domain_numeric = as.numeric(as.factor(all_monthly$domain))
+ all_monthly$calendar_month <- all_monthly$month_start
+ all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+ all_monthly <- arrange(all_monthly, domain_numeric, user_pk, calendar_month)
+ df <- data.table(all_monthly)
+ setkey(df,user_pk)
+ df[,diff_days:=c(NA,diff(calendar_month)),by=user_pk]
+ all_monthly <- as.data.frame(df)
+ all_monthly$previous_three_months_active <- all_monthly$diff_days <= 93
+
+ users <- unique(all_monthly$user_pk)
+
+ next_three_months_active <- c()
+ for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_three_months_active[-1], F)
+ next_three_months_active <- append(next_three_months_active, next_active)
+ }
+ all_monthly$next_three_months_active <- next_three_months_active
+
+  # For calendar months on or after 2014-08-01, next_three_months_active is NA
+  # because we don't yet know whether the user will be active in the following months
+ is.na(all_monthly$next_three_months_active) <- all_monthly$calendar_month >= "2014-08-01"
+ return(all_monthly)
+}
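The per-user gap computation above can be illustrated in isolation with toy data (assuming the same column names as the aggregate table):

```r
library(data.table)

# Toy monthly rows for two users (hypothetical data).
df <- data.table(user_pk = c(1, 1, 2),
                 calendar_month = as.Date(c("2014-01-01", "2014-03-01", "2014-02-01")))
setkey(df, user_pk)

# Same idiom as add_next_previous_active: days since the user's
# previous active month, NA for each user's first row.
df[, diff_days := c(NA, diff(calendar_month)), by = user_pk]

df$previous_three_months_active <- df$diff_days <= 93
# User 1's second month is 59 days after the first, so it is within the
# 93-day window; each user's first row has no previous month (NA).
```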
diff --git a/analysis_scripts/blog_posts/device_type_trends.R b/analysis_scripts/blog_posts/device_type_trends.R
new file mode 100644
index 0000000..4eccef3
--- /dev/null
+++ b/analysis_scripts/blog_posts/device_type_trends.R
@@ -0,0 +1,223 @@
+library(dplyr)
+library(googleVis)
+library(rCharts)
+
+# load config files
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+system_conf <- get_system_config(file.path("config_system.json"))
+source(file.path("config_setup.R", fsep = .Platform$file.sep))
+run_conf <-get_run_config(config_run_path)
+
+db <- get_db_connection(system_conf)
+
+source(file.path("function_libraries","db_queries.R", fsep = .Platform$file.sep))
+domain_table <- get_domain_table(db)
+
+# get the domains to run on based on the filters/names in run conf
+domains_for_run <- get_domains_for_run(domain_table,run_conf)
+
+source(file.path("function_libraries","report_utils.R", fsep = .Platform$file.sep))
+monthly_table <- get_aggregate_table (db, "aggregate_monthly_interactions", domains_for_run)
+
+# remove demo users and NA/NONE users
+monthly_table <- monthly_table[!(monthly_table$user_id %in% c("demo_user", "NONE", "none")) &
+                                 !is.na(monthly_table$user_id),]
+
+# use date range from config_run
+start_date <- as.Date(run_conf$reports$start_date)
+end_date <- as.Date(run_conf$reports$end_date)
+monthly_table <- subset(monthly_table, date_first_visit >= start_date & date_last_visit <= end_date)
+
+monthly_table$month_start<-as.Date(paste0('1 ',monthly_table$month.index), '%d %b %Y')
+monthly_table$month_start_numeric <- as.numeric(as.POSIXct(monthly_table$month_start))
+source(file.path("aggregate_tables","monthly_table_functions.R", fsep = .Platform$file.sep))
+monthly_table <- add_next_previous_active(monthly_table)
+
+# summary by device by month
+summary_table_by_device <- monthly_table %>%
+ group_by(month_start_numeric, summary_device_type) %>%
+ summarise(total_forms = sum(nforms), total_visits=sum(nvisits), total_users=length(unique(user_id)))
+
+# Rickshaw chart by user
+all_months <- unique(summary_table_by_device$month_start_numeric)
+all_dev_types <- unique(summary_table_by_device$summary_device_type)
+full_set <- merge(all_months,all_dev_types, all=T)
+names(full_set) <- c('month_start_numeric','summary_device_type')
+full_set <- full_set[order(full_set$month_start_numeric,full_set$summary_device_type),]
+full_set$value <- 0
+full_set$variable <- 'total_users'
+dev_long = reshape2::melt(summary_table_by_device[c('summary_device_type','month_start_numeric','total_users')],
+ id.vars=c('summary_device_type','month_start_numeric'))
+dev_long<-replace.df(full_set, dev_long,by=c('month_start_numeric','summary_device_type','variable'))
+
+chart <- Rickshaw$new()
+
+chart$layer(value ~ month_start_numeric, group = "summary_device_type",
+ data = dev_long, type = "area", width = 400)
+chart$set(slider = TRUE)
+chart$save('rickshaw_chart.html', standalone = TRUE)
+
+# 6-month comparison between nokia and android users.
+six_month_comparison_data <- monthly_table %>% filter(numeric_index == 6,
+ summary_device_type %in% c('Android','Nokia'))
+six_month_summary <- six_month_comparison_data %>% group_by(summary_device_type) %>%
+ summarise(total_users = length(unique(user_id)),
+ total_domains = length(unique(domain)),
+ mean_forms = mean(nforms),
+ median_forms = median(nforms),
+ mean_visits = mean(nvisits),
+ median_visits = median(nvisits),
+ mean_active_days = mean(active_days),
+ median_active_days = median(active_days),
+ mean_time_using_cc = mean(time_using_cc),
+ median_time_using_cc = median(time_using_cc) / (60*60),
+ number_active_next_3 = sum(next_three_months_active, na.rm=T))
+six_month_summary <- six_month_summary %>% mutate (percent_active_next_three = (number_active_next_3/total_users) * 100,
+ mean_time_using_cc_hours = mean_time_using_cc / (60*60))
+
+
+# k-w tests
+kruskal.test(nforms~summary_device_type,data=six_month_comparison_data)
+kruskal.test(nvisits~summary_device_type,data=six_month_comparison_data)
+kruskal.test(active_days~summary_device_type,data=six_month_comparison_data)
+kruskal.test(time_using_cc~summary_device_type,data=six_month_comparison_data)
+kruskal.test(next_three_months_active~summary_device_type,data=six_month_comparison_data)
+
+# summary by domain
+six_month_by_domain <- six_month_comparison_data %>% group_by(domain) %>%
+ summarise(total_users = length(unique(user_id)),
+ total_forms = sum(nforms),
+ mean_forms_per_user = mean(nforms),
+ median_forms_per_user = median(nforms),
+ total_visits = sum(nvisits),
+ mean_visits_per_user = mean(nvisits),
+ median_visits_per_user = median(nvisits),
+ mean_active_days = mean(active_days),
+ median_active_days = median(active_days),
+ mean_time_using_cc = mean(time_using_cc) / (60*60),
+ median_time_using_cc = median(time_using_cc) / (60*60),
+ percent_active_next_3 = (sum(next_three_months_active) / length(unique(user_id)))*100,
+ most_common_device = names(sort(table(summary_device_type),decreasing=TRUE)[1]))
+six_month_by_domain <- six_month_by_domain %>% filter(total_users > 5)
+
+# kruskal-wallis non-parametric test for domains TODO clean this up
+six_month_by_domain$most_common_device <- as.factor (six_month_by_domain$most_common_device)
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Android',]$median_visits_per_user)
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Nokia',]$median_visits_per_user)
+kruskal.test(median_visits_per_user~most_common_device,data=six_month_by_domain)
+
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Android',]$median_forms_per_user)
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Nokia',]$median_forms_per_user)
+kruskal.test(median_forms_per_user~most_common_device,data=six_month_by_domain)
+
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Android',]$median_active_days)
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Nokia',]$median_active_days)
+kruskal.test(median_active_days~most_common_device,data=six_month_by_domain)
+
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Android',]$median_time_using_cc)
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Nokia',]$median_time_using_cc)
+kruskal.test(median_time_using_cc~most_common_device,data=six_month_by_domain)
+
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Android',]$percent_active_next_3, na.rm=T)
+median(six_month_by_domain[six_month_by_domain$most_common_device == 'Nokia',]$percent_active_next_3, na.rm=T)
+kruskal.test(percent_active_next_3~most_common_device,data=six_month_by_domain)
+
+
+# summary by domain
+monthly_table$month_start<-as.Date(paste0('1 ',monthly_table$month.index), '%d %b %Y')
+summary_by_domain <- as.data.frame (monthly_table %>%
+ group_by(domain, month_start, month_start_numeric) %>%
+ summarise(total_forms = sum(nforms), total_visits=sum(nvisits), total_users=length(unique(user_id)),
+ most_common_device=names(sort(table(summary_device_type),decreasing=TRUE)[1]),
+ median_active_days = median(active_days), median_time_using_cc = median(time_using_cc),
+ android_users = length(unique(user_id[summary_device_type=="Android"])),
+ nokia_users = length(unique(user_id[summary_device_type=="Nokia"]))))
+summary_by_domain$percent_android =
+ summary_by_domain$android_users / (summary_by_domain$total_users) * 100
+summary_by_domain$percent_nokia =
+ summary_by_domain$nokia_users / (summary_by_domain$total_users) * 100
+
+mixed_domains <- subset(summary_by_domain, percent_android > 25 & percent_nokia > 25 & total_users > 5)
+domains_by_month <- summary_by_domain %>%
+ group_by (month_start,month_start_numeric, most_common_device) %>%
+ summarise(domains_total=length(domain))
+
+
+# Rickshaw chart by domain
+all_months <- unique(domains_by_month$month_start_numeric)
+all_dev_types <- unique(domains_by_month$most_common_device)
+full_set <- merge(all_months,all_dev_types, all=T)
+names(full_set) <- c('month_start_numeric','most_common_device')
+full_set <- full_set[order(full_set$month_start_numeric,full_set$most_common_device),]
+full_set$value <- 0
+full_set$variable <- 'domains_total'
+dev_long = reshape2::melt(domains_by_month,
+ id.vars=c('most_common_device','month_start_numeric'))
+dev_long<-replace.df(full_set, dev_long,by=c('month_start_numeric','most_common_device','variable'))
+
+chart_by_domain <- Rickshaw$new()
+
+chart_by_domain$layer(value ~ month_start_numeric, group = "most_common_device",
+ data = dev_long, type = "area", width = 560)
+chart_by_domain$set(slider = TRUE)
+chart_by_domain$save('rickshaw_chart_by_domain.html', standalone = TRUE)
+
+# self-starters
+monthly_selfstart <- merge(monthly_table[monthly_table$numeric_index==3,],
+ domain_table[c('name','internal.self_started')], by.x='domain', by.y='name')
+monthly_selfstart$internal.self_started <- as.character (monthly_selfstart$internal.self_started )
+monthly_selfstart$internal.self_started[is.na(monthly_selfstart$internal.self_started)] <- "Unknown"
+monthly_selfstart_domain <- monthly_selfstart %>% group_by (domain, internal.self_started) %>%
+ summarise(most_common_device=names(sort(table(summary_device_type),decreasing=TRUE)[1]),
+ total_users = length(unique(user_id)))
+selfstart_summary <- monthly_selfstart_domain %>%
+ group_by(internal.self_started, most_common_device) %>%
+ summarise(projects = length(domain))
+total_self_started = selfstart_summary %>% filter(internal.self_started == 'True') %>%
+ summarize(total = sum(projects),
+ android = sum (projects[most_common_device=='Android']),
+ nokia = sum (projects[most_common_device=='Nokia'])) %>% mutate(android_percent = (android/total)*100,
+ nokia_percent = (nokia/total)*100)
+total_non_self_started = selfstart_summary %>% filter(!(internal.self_started == 'True')) %>%
+ summarize(total = sum(projects),
+ android = sum (projects[most_common_device=='Android']),
+ nokia = sum (projects[most_common_device=='Nokia'])) %>% mutate(android_percent = (android/total)*100,
+ nokia_percent = (nokia/total)*100)
+
+
+# % Android users by country
+monthly_country <- merge(monthly_table, domain_table[c('name','deployment.country')], by.x='domain', by.y='name')
+users_by_country_by_month <- monthly_country %>%
+ group_by(month_start, month_start_numeric, deployment.country) %>%
+ summarise(total_projects = length(unique(domain)),
+ total_forms = sum(nforms), total_visits=sum(nvisits), total_users=length(unique(user_id)),
+ android_users =length(unique(user_id[summary_device_type=="Android"])))
+users_by_country_by_month$percent_android =
+ users_by_country_by_month$android_users / (users_by_country_by_month$total_users) * 100
+
+cleaned <- users_by_country_by_month[!(users_by_country_by_month$deployment.country == 'None'),]
+cleaned$hovervar <- sprintf('%s: %d projects, %d users (%d Android)',cleaned$deployment.country,
+ cleaned$total_projects, cleaned$total_users, cleaned$android_users)
+Sept_2012 <- cleaned[cleaned$month_start==as.Date('2012-09-01'),]
+Sept_2013 <- cleaned[cleaned$month_start==as.Date('2013-09-01'),]
+Sept_2014 <- cleaned[cleaned$month_start==as.Date('2014-09-01'),]
+
+
+Sept_2012_chart <- gvisGeoChart(Sept_2012,
+ locationvar='deployment.country',
+ colorvar='percent_android', hovervar="hovervar",
+ options=list(width=600, height=400,
+ colorAxis="{colors: ['#e7711c', '#4374e0']}",
+                             title = 'Percent Android Users, Sept 2012'))
+plot(Sept_2012_chart)
+cat(Sept_2012_chart$html$chart, file="Sept_2012_chart.html")
+
+Sept_2014_chart <- gvisGeoChart(Sept_2014,
+ locationvar='deployment.country',
+ colorvar='percent_android', hovervar="hovervar",
+ options=list(width=600, height=400,
+ colorAxis="{colors: ['#e7711c', '#4374e0']}"))
+plot(Sept_2014_chart)
+cat(Sept_2014_chart$html$chart, file="Sept_2014_chart.html")
diff --git a/analysis_scripts/device_log_investigation.R b/analysis_scripts/device_log_investigation.R
new file mode 100644
index 0000000..2e2c114
--- /dev/null
+++ b/analysis_scripts/device_log_investigation.R
@@ -0,0 +1,20 @@
+library(dplyr)
+
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+source(file.path("data_sources.R"))
+source(file.path("function_libraries/db_queries.R"))
+system_conf <- get_system_config(file.path("config_system.json"))
+
+# Get db connection
+db <- get_db_connection(system_conf)
+
+
+logs <- get_data_source(db, 'device_log_types_by_user', 1000)
+
+logs_by_type <- logs %.%
+ group_by (log_type) %.%
+ summarise (total_logs = count(id), with_user_id = count(user_id))
+
+logs_by_type_by_domain <- logs %.%
+ group_by (log_type, name) %.%
+ summarise (total_logs = count(id), with_user_id = count(user_id))
diff --git a/analysis_scripts/mchen/blog_visualization/cluster_analysis.R b/analysis_scripts/mchen/blog_visualization/cluster_analysis.R
new file mode 100644
index 0000000..ce47eb0
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/cluster_analysis.R
@@ -0,0 +1,20 @@
+# cluster analysis
+# get the kmeans
+
+# add scripts to compute kmeans here
+
+# results: 4 clusters: x = 20, 40, 95, 270
+
+# removing all x = y = 1
+test3 <- test2[-which(test2$x == 1 & test2$y == 1),]
+ca_sub1 <- test3[which(test3$x <= 30 & test3$y <= 30),]
+
+crs <- test3[which(test3$domain == "crs-remind"),]
+crs_sub <- crs[which(crs$x >= 10 & crs$x <= 40),]
+smoothScatter(crs_sub$x, crs_sub$y)
+
+koraro <- test3[which(test3$domain == "mvp-koraro"),]
+koraro_sub <- koraro[which(koraro$x >= 40 & koraro$x <= 100),]
+smoothScatter(koraro_sub$x, koraro_sub$y)
+
+intensityPlotOut(crs_sub, koraro_sub, 100, 100)
\ No newline at end of file
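The kmeans step itself is only stubbed out above ("add scripts to compute kmeans here"). One way to recover centers like the x = 20, 40, 95, 270 noted in the comment is `stats::kmeans` on the x coordinate; a sketch on synthetic data (`test2` is assumed to be loaded by a sibling script, so made-up values stand in for it here):

```r
set.seed(42)
# synthetic stand-in for test2$x: four well-separated groups
x <- c(rnorm(50, mean = 20, sd = 3),
       rnorm(50, mean = 40, sd = 5),
       rnorm(30, mean = 95, sd = 8),
       rnorm(10, mean = 270, sd = 15))
km <- kmeans(x, centers = 4, nstart = 25)
sort(km$centers)  # roughly 20, 40, 95, 270 on this synthetic draw
```

`nstart = 25` restarts the algorithm from several random initializations, which avoids the poor local optima that a single 1-D run can fall into.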
diff --git a/analysis_scripts/mchen/blog_visualization/distribution_data_prep.R b/analysis_scripts/mchen/blog_visualization/distribution_data_prep.R
new file mode 100644
index 0000000..12587e4
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/distribution_data_prep.R
@@ -0,0 +1,111 @@
+library(plyr)
+library(dplyr)
+
+# all data
+data = tbl_df(read.csv("blog_data_2_13_15.csv", stringsAsFactors=FALSE))
+data$calendar_month = as.Date(data$calendar_month, "%m/%d/%y") # TODO: change the format in which timeDate is stored in DP
+
+# columns of interest
+all = select(data, domain_numeric, user_pk, active_days, ncases_touched, calendar_month)
+all = filter(all, calendar_month >= as.Date("2010-01-01")) %>% filter(., calendar_month <= as.Date("2014-12-01"))
+
+# active months per user
+m = all %>%
+ group_by(domain_numeric, user_pk) %>%
+ summarise(nmonths = n_distinct(calendar_month))
+
+# active months per domain
+totalMonths = function(x) {
+ y = tbl_df(as.data.frame(table(x$domain_numeric)))
+ y = arrange(y, desc(Freq))
+ names(y) = c("domain_numeric", "months")
+ y$pct = y$months/sum(y$months)
+ return(y)
+}
+
+# min months to be dropped (or added) on each domain to get to 10% contribution
+drop = function(x, N) {
+ x$bal = (x$months - sum(x$months)*N)/(1-N)
+ return(x)
+}
+
+# excluding first N months from the data
+excdMonth = function(data, n) {
+ data = data %>%
+ group_by(domain_numeric, user_pk) %>%
+    filter(., row_number() > n) # NOTE: users active for fewer than n months are filtered out entirely
+ return(data)
+}
+
+# Rebalancing data (nested function)
+rebalanceData = function(data, N) {
+ ov_month_data = totalMonths(data)
+ K = drop(ov_month_data, N)
+ dms = filter(K, bal > 0)
+ keepVal = dms$months - ceiling(dms$bal)
+ rebal_dm = as.numeric(as.character(dms$domain_numeric))
+ sdata = filter(data, domain_numeric %in% rebal_dm)
+ ndata = filter(data, !(domain_numeric %in% rebal_dm))
+ sp = split(sdata, sdata$domain_numeric)
+ for (i in seq_len(length(sp))) {
+ sp[[i]] = sp[[i]][sample(1:nrow(sp[[i]]), keepVal[i], FALSE),]
+ }
+ out = tbl_df(do.call(rbind, sp))
+ rebalanced_data = tbl_df(rbind(out, ndata))
+ return(rebalanced_data)
+}
+
+# subset 1: all data rebalanced
+rbl_all = rebalanceData(all, 0.10)
+
+# subset 2: all data excluding each user's first 3 (p1) or 6 (p2) months
+p1 = excdMonth(all, 3)
+p2 = excdMonth(all, 6)
+rbl_p2 = rebalanceData(p2, 0.10)
+
+# subset 3: all data including only users that have at least 18 months of usage
+# rebalanced
+u = select(filter(m, nmonths >= 18), user_pk)
+qdata = filter(all, user_pk %in% u$user_pk) # exclude users active for less than 18 months
+
+# indexing each user-month
+qdata = arrange(qdata, domain_numeric, user_pk, calendar_month)
+qdata = qdata %>%
+ group_by(domain_numeric, user_pk) %>%
+ mutate(user_month_index = seq(n()))
+
+# Bin user months to custom quarters
+bins = c(1+(3*(0:ceiling(max(qdata$user_month_index)/3))))
+labs = paste("Month ", 3*seq(length(bins)-1)-2, "-", 3*seq(length(bins)-1), sep = "")
+qdata = qdata %>%
+ group_by(domain_numeric, user_pk) %>%
+ mutate(user_quarter_index = cut(user_month_index,
+ breaks = bins,
+ labels = labs,
+ right = FALSE))
+# Rebalance data from quarterly data (only for Q1-Q6)
+N = 6
+sub = levels(qdata$user_quarter_index)[1:N]
+qdata_sub = filter(qdata, user_quarter_index %in% sub)
+qdata_split = split(qdata_sub, qdata_sub$user_quarter_index, drop=TRUE)
+rbl_qdata_split = lapply(qdata_split, function(x) rebalanceData(x, 0.10))
+rbl_qdata = do.call(rbind, rbl_qdata_split)
+
+# subset 4: 12 big domains in terms of total user-months
+bb = m %>%
+ group_by(domain_numeric) %>%
+ summarise(um = sum(nmonths)) %>%
+ arrange(., desc(um)) %>%
+  top_n(12, um)
+
+bb_data = filter(rbl_all, domain_numeric %in% bb$domain_numeric)
+bb_list = split(bb_data, bb_data$domain_numeric)
+
+
+# summary stats
+print(paste("Distinct users: ",
+ n_distinct(m$user_pk), ". Distinct domains: ",
+ n_distinct(m$domain_numeric), sep=""))
+print(paste("Distinct users active for at least 18 months: ",
+ n_distinct(u$user_pk), ". Distinct domains: ",
+ n_distinct(u$domain_numeric), sep=""))
\ No newline at end of file
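The `drop()` helper above encodes a small identity worth spelling out: `bal = (months - sum(months)*N) / (1 - N)` is exactly the number of user-months a domain must shed so that its share of the remaining total equals N. A numeric check with toy counts:

```r
# toy user-month counts per domain (made-up numbers)
months <- c(big = 500, mid = 300, small = 200)
N <- 0.10
bal <- (months - sum(months) * N) / (1 - N)

# after removing bal["big"] months from the big domain,
# its share of the new total is exactly N
share_after <- (months["big"] - bal["big"]) / (sum(months) - bal["big"])
round(share_after, 10)  # 0.1
```

This is why `rebalanceData()` only samples down domains with `bal > 0`: those are the ones currently contributing more than the 10% cap.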
diff --git a/analysis_scripts/mchen/blog_visualization/distribution_plot_funcs.R b/analysis_scripts/mchen/blog_visualization/distribution_plot_funcs.R
new file mode 100644
index 0000000..d3f065c
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/distribution_plot_funcs.R
@@ -0,0 +1,163 @@
+# multiple base density plots
+library(KernSmooth)
+
+# get quartile information to-be-plotted
+getQuantiles = function(x) {
+ dens = bkde(x$active_days,bandwidth=0.25)
+ lower.v = quantile(x$active_days,probs=0.25)
+ median.v = median(x$active_days)
+ upper.v = quantile(x$active_days,probs=0.75)
+ mean.v = mean(x$active_days)
+ min.v = min(dens$x)
+ max.v = max(dens$x)
+ dPlot = c(dens, lower.v, median.v, upper.v, mean.v, min.v, max.v)
+ names(dPlot) = c("x", "y", # x and y coordinates from bkde() function
+ "lower.v", "median.v", "upper.v", "mean.v", "min.v", "max.v")
+ return(dPlot)
+}
+
+
+##########################################################################
+# Density Distribution Plot
+# Elements
+# smooth.density.kernel.estimate(user-month distribution)
+# vertical line(median, mean of active days)
+# shaded area (25% quartile, 75% quartile of user-month distribution)
+##########################################################################
+getDensityPlot = function(data) {
+ with(data, {
+ d_data = getQuantiles(data)
+ coord = d_data[c("x","y")]
+ plot(coord, type = "l", lwd = 2,
+ axes = FALSE,
+ xlab = "", ylab = "",
+ col = "#004865",
+ xaxt = "n",
+ xaxs = "i",
+ xlim = c(1,30))
+ axis(1, xaxp = c(1,25,8), tck = -0.01)
+ axis(2, yaxp = c(0.00, round(max(coord$y),digits=2), 5), tck = -0.01)
+
+ abline(v =d_data[["median.v"]], col = "#004865", lty = 1, lwd = 2)
+ abline(v =d_data[["mean.v"]], col = "#004865", lty = 2, lwd = 2)
+
+ polygon(c(d_data[["upper.v"]],
+ coord$x[coord$x > d_data[["upper.v"]] &
+ coord$x < d_data[["max.v"]]],
+ d_data[["max.v"]]),
+ c(0, coord$y[coord$x> d_data[["upper.v"]] &
+ coord$x < d_data[["max.v"]]], 0),
+ col="#004865")
+
+ polygon(c(d_data[["min.v"]],
+ coord$x[coord$x > d_data[["min.v"]] &
+ coord$x < d_data[["lower.v"]]], d_data[["lower.v"]]),
+ c(0, coord$y[coord$x > d_data[["min.v"]] &
+ coord$x < d_data[["lower.v"]]], 0),
+ col="#004865")
+ })
+}
+
+# parameter setting
+ps = function(N1, N2, background){
+ x = par(mfrow = c(N1,N2),
+ bg = background,
+ # mar = c(2,2,2,2),
+ oma = c(1,1,3,1),
+ bty = "n")
+ return(x)
+}
+
+# tabulate data
+getDiscreteTbl = function(x) {
+ y = tbl_df(as.data.frame(table(x$active_days)))
+ names(y) = c("active_days", "count")
+ y$pct = y$count/sum(y$count)
+ y$active_days = as.numeric(y$active_days)
+ return(y)
+}
+
+# this returns the exact x-coordinate for 25 percentile, 75 percentile (to be used in geom_vline)
+getPercentiles = function(x) {
+ smr = summary(x$active_days) # get summary stats
+ percentiles = smr[2:5] # exclude min/max. we do not plot those two vertical lines
+ names(percentiles) = c("p25", "median", "mean", "p75")
+ return(percentiles)
+}
+
+
+############################################################
+# final distribution plot
+# Elements
+# point(position(active days * percent of user-month))
+# a smoother
+# vertical lines (mean, median)
+############################################################
+
+# quarterly final plot
+q = tbl_df(as.data.frame(table(rbl_qdata$active_days, rbl_qdata$user_quarter_index)))
+names(q) = c("active_days", "quarters", "count")
+q = filter(q, count > 0)
+q$active_days = as.numeric(q$active_days)
+q$pct = q$count/sum(q$count)
+
+
+###############
+# single plot #
+###############
+# Note: ggplot evaluates in the calling environment; tbl is assumed to be a
+# discrete table produced by getDiscreteTbl()
+single = ggplot(tbl, aes(x=active_days, y=pct)) +
+ geom_point() +
+ labs(x="", y="") +
+ geom_smooth(se=FALSE, col = "#882608", size = 2) +
+ scale_x_continuous(breaks = seq(1, 30, 2),name="") +
+ scale_y_continuous(limits = c(0, 0.25)) +
+ guides(fill=FALSE) +
+ theme(plot.title = element_text(size=rel(1.5),colour="black"),
+ panel.background = element_rect(fill="#d5e4eb"),
+ panel.grid.minor.y = element_blank(),
+ panel.grid.major.x = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ axis.ticks.y = element_blank(),
+ legend.position = "none")
+
+###################
+# multiple facets #
+###################
+# use domain_numeric as a facetting factor for big domain viz
+bb_tbl_ls = lapply(bb_list, function(x) {
+ y = getDiscreteTbl(x)
+})
+bb_tbl_ls = lapply(seq_along(bb_tbl_ls), function(i) {
+ bb_tbl_ls[[i]]$domain_numeric = rep(names(bb_tbl_ls)[[i]], nrow(bb_tbl_ls[[i]]))
+ return(bb_tbl_ls[[i]])
+})
+bb_tbl = do.call(rbind, bb_tbl_ls)
+bb_tbl$domain_numeric = as.factor(bb_tbl$domain_numeric)
+
+# put vline data in a separate data frame
+bb_vline = lapply(bb_list, function(x) getPercentiles(x))
+bb_vline = do.call(rbind, bb_vline)
+bb_vline = as.data.frame(bb_vline)
+bb_vline$domain_numeric = factor(dimnames(bb_vline)[[1]])
+
+multiple = ggplot(bb_tbl, aes(x=active_days, y=pct)) +
+ geom_point() +
+ labs(x="", y="") +
+ geom_smooth(se=FALSE, col = "#882608", size = 2) +
+ scale_x_continuous(breaks = seq(1, 30, 2)) +
+ scale_y_continuous(limits = c(0, 0.25)) +
+ guides(fill=FALSE) +
+ facet_wrap(~domain_numeric) +
+ theme(plot.title = element_text(size=rel(1.5),colour="black"),
+ panel.background = element_rect(fill="#d5e4eb"),
+ panel.grid.minor.y = element_blank(),
+ panel.grid.major.x = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ axis.ticks.y = element_blank(),
+ legend.position = "none") +
+ # add different vertical lines to each facet
+ geom_vline(aes(xintercept = p25, colour = "#ed7d31"), bb_vline) +
+ geom_vline(aes(xintercept = p75, colour = "#5d9cd6"), bb_vline) +
+ geom_vline(aes(xintercept = median, colour = "#ffbf01"), bb_vline) +
+ geom_vline(aes(xintercept = mean, colour = "#ed7d31"), bb_vline)
diff --git a/analysis_scripts/mchen/blog_visualization/distribution_plot_out.R b/analysis_scripts/mchen/blog_visualization/distribution_plot_out.R
new file mode 100644
index 0000000..fc08e38
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/distribution_plot_out.R
@@ -0,0 +1,205 @@
+# density distribution
+source("distribution_data_prep.R")
+source("distribution_plot_funcs.R")
+
+# density distribution, excluding each user's first 6 months (rbl_p2)
+png("rbl_rm_6.png")
+getDensityPlot(rbl_p2)
+dev.off()
+
+# subset 4 (12 big projects)
+pdf("big_bros_density.pdf",
+ width=18,
+ height=15)
+
+op = par(mfrow = c(3,4),
+ bg = "#d5e4eb",
+ las = 1,
+ mar = c(2,2,2,2),
+ oma = c(2,2,4,2),
+ bty = "n",
+ cex = 1.2)
+
+lapply(big_bros_ls, function(x) {
+ getDensityPlot(x)
+})
+legend("topright",
+ legend = c("Median", "Mean"),
+ lty = 1:2,
+       lwd=c(2.5, 2.5), col="#005266")
+
+dev.off()
+par(op)
+
+# density distribution of crs
+crs = filter(rbl_all, domain_numeric == 40)
+pdf("crs_density.pdf")
+op = par(bg = "#d5e4eb", bty="n", las=1)
+getDensityPlot(crs)
+legend("topright",
+ legend = c("Median", "Mean"),
+ lty = 1:2,
+       lwd=c(2.5, 2.5), col="#005266")
+dev.off()
+par(op)
+
+# density distribution of all data (rebalanced)
+pdf("rbl_all_data_density.pdf")
+op = par(bg = "#d5e4eb", bty="n")
+getDensityPlot(rbl_all)
+legend("topright",
+ legend = c("Median", "Mean"),
+ lty = 1:2,
+       lwd=c(2.5, 2.5), col="#005266")
+dev.off()
+par(op)
+
+
+###################################################################
+# Discrete Distribution
+# Elements
+# point(position(active days * active month), size(constant))
+# vertical line(length(percentage of user-month))
+###################################################################
+
+# 12 big domains
+pdf("big_bros_discrete.pdf",
+ width=18,
+ height=15)
+op = ps(3,4,"#c2d6ef")
+lapply(big_bros_ls, function(x){
+ tbl = getDiscreteTbl(x)
+ getDiscretePlot(tbl, max(tbl$pct))
+})
+dev.off()
+par(op)
+
+# quarterly final plot
+q = tbl_df(as.data.frame(table(rbl_qdata$active_days, rbl_qdata$user_quarter_index)))
+names(q) = c("active_days", "quarters", "count")
+q = filter(q, count > 0)
+q$active_days = as.numeric(q$active_days)
+q$pct = q$count/sum(q$count)
+
+quarterly = ggplot(q, aes(x=active_days, y=pct))+
+ geom_point()+
+ labs(x="", y="")+
+ geom_smooth(se=FALSE, col = "#882608", size = 2)+
+ scale_x_continuous(breaks = seq(1, 30, 2),name="") +
+ facet_grid(quarters~.) +
+ guides(fill=FALSE) +
+ theme(plot.title = element_text(size=rel(1.5),colour="black"),
+ panel.background = element_rect(fill="#d5e4eb"),
+ panel.grid.minor.y = element_blank(),
+ panel.grid.major.x = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ axis.ticks.y = element_blank(),
+ legend.position = "none")
+ggsave("quarterly_3.pdf", quarterly, width=7,height=14,dpi=200)
+
+# get 25%, 75% numbers for each quarter (post-processing in AI)
+q_tbl_split = split(q, q$quarters)
+q_tbl_lines = lapply(q_tbl_split, function(x) percentileLines(x))
+q_lines = q_tbl_lines[1:6]
+
+
+# big brother final
+bb_tbl_list = lapply(bb_list, function(x){
+ y = tbl_df(as.data.frame(table(x$active_days, x$domain_numeric)))
+ names(y) = c("active_days", "domain_numeric", "count")
+ y = filter(y, count > 0)
+ y$active_days = as.numeric(y$active_days)
+ y$pct = y$count/sum(y$count)
+ return(y)
+})
+
+bb_tbl = do.call(rbind, bb_tbl_list)
+max_pct = bb_tbl %>%
+ group_by(domain_numeric) %>%
+ summarise(mpct = max(pct)) %>%
+ arrange(., desc(mpct))
+max_pct$dm_reorder = seq_len(nrow(max_pct))
+reordered_dm = select(max_pct, domain_numeric, dm_reorder)
+bb_tbl = left_join(bb_tbl, reordered_dm)
+
+bbly = ggplot(bb_tbl, aes(x=active_days, y=pct))+
+ geom_point()+
+ labs(x="",
+ y="")+
+ geom_smooth(se=FALSE, col = "#005266", size = 1.5)+
+ scale_x_continuous(breaks = seq(1, 30, 2)) +
+ facet_wrap(~dm_reorder, ncol=3, as.table = TRUE) +
+ guides(fill=FALSE) +
+ theme(strip.background = element_rect(fill="#b4d8e7"),
+ strip.text = element_text(face="bold",colour="black",hjust=0),
+ panel.background = element_rect(fill="#d5e4eb"),
+ plot.background = element_rect(fill="#d5e4eb"),
+ panel.grid.major.y = element_line(size=1.0),
+ panel.grid.minor.y = element_blank(),
+ panel.grid.major.x = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ axis.ticks.y = element_blank(),
+ legend.position = "none")
+
+ggsave("big_domain_final.pdf", bbly, width=12,height=12,dpi=200)
+
+# get 25%, 75%, median, mean for post-processing in AI
+bb_tbl_lines = lapply(bb_list, function(x) summary(x$active_days))
+q_tbl_lines = lapply(qdata_split, function(x) summary(x$active_days))
+
+# single plot: all data (balanced)
+all_tbl = getDiscreteTbl(rbl_all)
+all_lines = percentileLines(all_tbl)
+all = ggplot(all_tbl, aes(x=active_days, y=pct))+
+ geom_point()+
+ labs(x="",
+ y="")+
+ geom_smooth(se=FALSE, col = "#882608", size = 1.5)+
+ scale_x_continuous(breaks = seq(1, 30, 2)) +
+ guides(fill=FALSE) +
+ theme(panel.background = element_rect(fill="#d5e4eb"),
+ plot.background = element_rect(fill="#d5e4eb"),
+ panel.grid.major.y = element_line(size=1.0),
+ panel.grid.minor.y = element_blank(),
+ panel.grid.major.x = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ axis.ticks.y = element_blank(),
+ legend.position = "none")
+ggsave("all_final.pdf", all, width=7,height=7)
+
+# single plot: crs
+p2_tbl = getDiscreteTbl(rbl_p2)
+p2_lines = summary(rbl_p2$active_days)
+crs = ggplot(p2_tbl, aes(x=active_days, y=pct))+
+ geom_point()+
+ labs(x="",
+ y="")+
+ geom_smooth(se=FALSE, col = "#882608", size = 1.5)+
+ scale_x_continuous(breaks = seq(1, 30, 2)) +
+ scale_y_continuous(limits=c(0,0.25))+
+ guides(fill=FALSE) +
+ theme(panel.background = element_rect(fill="#d5e4eb"),
+ plot.background = element_rect(fill="#d5e4eb"),
+ panel.grid.major.y = element_line(size=1.0),
+ panel.grid.minor.y = element_blank(),
+ panel.grid.major.x = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ axis.ticks.y = element_blank(),
+ legend.position = "none")
+ggsave("crs_final.pdf", crs, width=7,height=7)
+
+# quarterly distribution data export
+qtr_data = lapply(rbl_qdata_split, function(x){
+ y = getDiscreteTbl(x)
+ return(y)
+})
+# merge all quarterly aggregate data
+colNames = names(qtr_data)
+for (i in seq_len(length(qtr_data))) {
+ names(qtr_data[[i]]) = c("active_days",
+ paste("count",colNames[i],sep="_"),
+ paste("pct",colNames[i],sep="_"))
+}
+qtr_data_merged = Reduce(function(...) merge(..., all=TRUE), qtr_data)
+qtr_data_merged[is.na(qtr_data_merged)] <- 0
+write.csv(qtr_data_merged, "data_table_qtr.csv")
diff --git a/analysis_scripts/mchen/blog_visualization/hq_proj_activity.R b/analysis_scripts/mchen/blog_visualization/hq_proj_activity.R
new file mode 100644
index 0000000..b89ec58
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/hq_proj_activity.R
@@ -0,0 +1,77 @@
+# IMPORT DATA #
+library(plyr)
+library(dplyr)
+data = read.csv("blog.csv", stringsAsFactors=FALSE)
+data = tbl_df(data)
+
+# data format conversion
+library(zoo)
+data$calendar_month = as.Date(data$calendar_month)
+
+# keep columns of interest
+hq = filter(data, typical_flw == "TRUE") %>% # keeping typical FLW only
+ select(., domain_numeric, domain, user_id, calendar_month)
+hq = arrange(hq, domain_numeric, user_id, calendar_month)
+
+agg = hq %>%
+ group_by(domain_numeric) %>%
+ summarise(nusers=n_distinct(user_id),
+ nmonths = n_distinct(calendar_month))
+
+# test subset for viz
+t = filter(hq, calendar_month <= as.Date("2015-04-01"))
+t = filter(t, calendar_month >= as.Date("2010-01-01"))
+
+# X dimension: calendar month
+getMonthlyUser = function(t) {
+  t_agg_1 = t %>%
+    group_by(domain, calendar_month) %>%
+    summarise(nusers = n_distinct(user_id))
+  # expanding to every domain x month combination proved unnecessary here
+  return(t_agg_1)
+}
+
+t_agg_1 = getMonthlyUser(t)
+t_agg_1 = arrange(t_agg_1, domain, calendar_month)
+
+# Y dimension: number of user blocks in each domain
+getUserBlock = function(t_agg_1) {
+  bins = c(5*(0:ceiling(max(t_agg_1$nusers)/5)))
+  labs = c(1:ceiling(max(t_agg_1$nusers)/5))
+  t_agg_1$blocks = cut(t_agg_1$nusers,
+                       breaks = bins,
+                       labels = labs)
+  t_agg_1$blocks = as.numeric(t_agg_1$blocks)
+  # gap months are not added back for now
+  return(t_agg_1)
+}
+t_agg_1 = getUserBlock(t_agg_1)
+
+# Replicate each row, with the number of replications equal to its block count
+expanded = as.numeric(rep(row.names(t_agg_1), t_agg_1$blocks))
+texp = t_agg_1[expanded,]
+
+texp = texp %>%
+ group_by(domain, calendar_month) %>%
+ mutate(cum_blocks = seq(n()))
+
+# for every single value of cum_blocks, return the sum of cum_blocks and the max value of the domain below
+piles = texp %>%
+ group_by(domain) %>%
+ summarise(max_blocks = max(cum_blocks))
+piles$max_blocks = c(0, cumsum(piles$max_blocks[-nrow(piles)]))
+
+texp = arrange(texp, domain)
+texp = left_join(texp, piles)
+texp$piled_blocks = texp$cum_blocks + texp$max_blocks
+
+# using geom_tile to plot it
+library(ggplot2)
+ggplot(texp, aes(x=calendar_month,y=piled_blocks,fill=domain)) + geom_tile()
+
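The row-replication trick above (`rep(row.names(...), blocks)` followed by subsetting) expands each domain-month row into one row per 5-user block so that `geom_tile` can stack them. The indexing behaves like this on a toy frame (column names mirror `t_agg_1`):

```r
# toy version of t_agg_1: blocks = number of 5-user blocks per domain-month
t_agg_1 <- data.frame(
  domain = c("a", "a", "b"),
  calendar_month = as.Date(c("2014-01-01", "2014-02-01", "2014-01-01")),
  blocks = c(2, 3, 1)
)
expanded <- as.numeric(rep(row.names(t_agg_1), t_agg_1$blocks))
texp <- t_agg_1[expanded, ]
nrow(texp)  # 6: one row per block (2 + 3 + 1)
```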
diff --git a/analysis_scripts/mchen/blog_visualization/plot_funcs.R b/analysis_scripts/mchen/blog_visualization/plot_funcs.R
new file mode 100644
index 0000000..be08ff4
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/plot_funcs.R
@@ -0,0 +1,29 @@
+# plot functions
+plotOut <- function(data, medianColor, palette1, palette2, xmax, ymax, rLabel, rSquareX, rSquareY) {
+ p <- regression_plot(ncases_touched ~ prev_ncases_touched, data,
+ shade=FALSE,
+ spag=TRUE,
+ median.col=medianColor, # median line (not regression line)
+ palette=colorRampPalette(c(palette1,palette2))(4))
+
+ p <- p +
+ scale_x_continuous("Cases visited in month N-1", limits=c(0,xmax)) +
+ scale_y_continuous("Cases visited in month N", limits=c(0,ymax)) # modify the axis limits
+
+ # p <- p + labs(x="Cases visited in month N-1", y="Cases visited in month N") # modify axis labels
+
+ # adding annotation layer
+ p <- p + annotate("text", x=rSquareX, y=rSquareY, size=7, label=rLabel, colour="red", parse=T)
+ # modify theme settings:
+ p <- p +
+ theme(axis.title.x=element_text(vjust = -1.5,size=rel(2)),
+ axis.title.y=element_text(vjust = 1.5,size=rel(2)),
+ plot.margin=unit(c(1,1,1,1),"cm"),
+ panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank(),
+ panel.background = element_blank()) +
+ theme(aspect.ratio = 1) +
+ theme(legend.position="none")
+ return(p)
+}
+
diff --git a/analysis_scripts/mchen/blog_visualization/user_consistency.R b/analysis_scripts/mchen/blog_visualization/user_consistency.R
new file mode 100644
index 0000000..0e8e975
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/user_consistency.R
@@ -0,0 +1,210 @@
+#####################################################################
+# visualization code developed for Dimagi's data blog series
+# Dec 17, 2014
+#####################################################################
+
+library(grid)
+library(gridExtra)
+library(ggplot2)
+library(sorvi)
+library(dplyr)
+library(RColorBrewer)
+
+
+
+
+
+############################################
+# domain consistency comparison plot
+############################################
+
+# side-by-side plot of two demo domains: domain 18 and 264
+# data source: domain_consistency_comparison.csv
+data <- read.table("domain_consistency_comparison.csv",header=T,sep=",")
+dm1 <- filter(data, domain_numeric == 18)
+dm2 <- filter(data, domain_numeric == 264)
+
+source("plot_funcs.R")
+plot1 <- plotOut(dm1,"red","#67a9cf","#ef8a62",50,50,paste("r^2==",0.75),5,45)
+plot2 <- plotOut(dm2,"red","#67a9cf","#ef8a62",50,50,paste("r^2==",0.36),5,45)
+
+# output two plots side-by-side
+# format: pdf
+pdf(file="domain_comparison.pdf",
+ width=14,
+ height=10)
+# png("domain_comparison.png", width = 1600, height = 1600, units = "px", pointsize = 16,bg = "transparent", res=150)
+grid.arrange(plot1, plot2, ncol = 2)
+dev.off()
+
+
+
+
+
+############################################
+# top10 vs. bottom10 comparison plot
+############################################
+top_10p <- read.table("top_10p.csv",header=T,sep=",")
+bottom_10p <- read.table("bottom_10p.csv",header=T,sep=",")
+plot3 <- plotOut(top_10p, "red", "#67a9cf","#ef8a62",100,100,paste("r^2==",0.69),10,90)
+plot4 <- plotOut(bottom_10p, "red", "#67a9cf","#ef8a62",100,100,paste("r^2==",0.34),10,90)
+pdf(file="top_bottom_comparison.pdf",
+ width=14,
+ height=10)
+grid.arrange(plot3, plot4, ncol = 2)
+dev.off()
+
+
+
+
+
+######################
+# density plot
+######################
+# main plot: March only data
+# exclude users who visit more than 50 cases
+# exclude users who visit only 1 case that month
+
+test2 <- read.table("test2.csv",header=T,sep=",")
+data <- filter(test2, x <= 50 & x > 1)
+mainPlotData <- filter(data, as.character(calendar_month) == "2014-03-01")
+p <- ggplot(mainPlotData, aes(x = x, y = y)) +
+ geom_point(position=position_jitter(w=0.4,h=0.4),
+ colour="black",
+ alpha=0.7) +
+ xlim(1,50) + ylim(1,50)
+p1 <- p +
+ stat_density2d(aes(fill = rev(..level..)), geom = "polygon")
+p2 <- p1 +
+ labs(x="Cases visited in March 2014", y = "Cases visited in April 2014") +
+ theme_bw() +
+ theme(legend.position="none") +
+ theme(aspect.ratio = 1)
+# increase label size, space between axis and labels
+p3 <- p2 + theme(plot.margin = unit(c(1.5,1.5,1.5,1.5), "cm"),
+ axis.title.x = element_text(size=rel(1.5), vjust = -1.5),
+ axis.title.y = element_text(size=rel(1.5), vjust = 1.5)
+)
+
+ggsave("mainPlot.pdf", p3, dpi = 600, scale = 1.5, width = 4, height = 4)
+
+
+
+
+
+######################
+# Line graph
+######################
+attrition <- read.table("leadup.csv", header = T, sep = ",")
+month_levels <- rev(levels(attrition$X1))
+
+l1 <- ggplot(attrition, aes(x = X1, y = X2, colour = att_duration, group = att_duration, linetype = att_duration)) +
+ geom_point(shape = 15, size = 4.0, colour="peachpuff4") +
+ geom_line(size = 1.5) +
+ scale_x_discrete("Month 'X' before attrition event",limits = month_levels)
+
+l2 <- l1 +
+ scale_y_continuous("Number of cases visited",limits=c(0,12),breaks=c(0,4,8,12)) +
+ theme(aspect.ratio=0.5,
+# panel.grid.minor=element_blank(),
+ panel.grid.major=element_line(colour = "white",size=1),
+ panel.background=element_blank())
+
+l3 <- l2 + theme(axis.title.x = element_text(size=rel(1.5), vjust = -1.5),
+ axis.title.y = element_text(size=rel(1.5), vjust = 1.5),
+ legend.title = element_blank())
+
+
+
+
+
+
+##########################################
+# user experience: stacked barplot
+##########################################
+
+# use colclasses to load data faster (http://www.r-bloggers.com/using-colclasses-to-load-data-more-quickly-in-r/)
+ux_data <- read.csv("user_experience.csv",
+ stringsAsFactors = FALSE,
+ colClasses=c("integer","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric",
+ "factor","factor","factor","factor"))
+colnames(ux_data)[14:17] <- c("Q1 to Q2","Q2 to Q3", "Q3 to Q4", "Q1 to Q4")
+
+# resize bins:
+# <= -50%, 1
+# (-50%, -20%], 2
+# (-20%, 20%], 3
+# (20%, 50%], 4
+# > 50%, 5
+
+binCut <- function(var, n1, n2, n3, n4){
+ sapply(var, function(x)
+ if(x <= n1) {return("1")
+ } else if(x > n1 & x <= n2) {return("2")
+ } else if (x > n2 & x <= n3) {return("3")
+ } else if (x > n3 & x <= n4) {return("4")
+ } else {return("5")
+ })
+}
+
+ux_data$Q1_to_Q2 <- binCut(ux_data$percent_change_q1_2, -50, -20, 20, 50)
+ux_data$Q2_to_Q3 <- binCut(ux_data$percent_change_q2_3, -50, -20, 20, 50)
+ux_data$Q3_to_Q4 <- binCut(ux_data$percent_change_q3_4, -50, -20, 20, 50)
+ux_data$Q1_to_Q4 <- binCut(ux_data$percent_change_q1_4, -50, -20, 20, 50)
+
+# reshape data: wide to long
+l <- reshape(ux_data,
+ varying = c("Q1_to_Q2","Q2_to_Q3","Q3_to_Q4","Q1_to_Q4"),
+ v.names = "percentage_change",
+ timevar = "quarter",
+ times = c("Q1_to_Q2","Q2_to_Q3","Q3_to_Q4","Q1_to_Q4"),
+ direction = "long")
+
+# change the order of levels of factor quarter for visualization
+l$quarter <- factor(l$quarter, levels = c("Q1_to_Q2","Q2_to_Q3","Q3_to_Q4","Q1_to_Q4"))
+
+p0 <- ggplot(l, aes(x=factor(quarter), fill=factor(percentage_change), y=4*100*(..count..)/sum(..count..)))
+p1 <- p0 + geom_bar(width=0.7) +
+ scale_fill_manual(values = c("#e34a33","#fdbb84","#e5e5ab","#91bfdb","#67a9cf"),
+ breaks = c(1:5),
+ labels = c("Substantial drop (>50%)",
+ "Moderate drop (20-50%)",
+ "Stable (+/- 20%)",
+ "Moderate increase (20-50%)",
+ "Substantial increase (>50%)"))
+
+##########################################################################
+### control style (http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/)###
+##########################################################################
+
+# change title appearance
+p2 <- p1 +
+ labs(x="Interval", y="Percent of Users") +
+ scale_x_discrete(labels=c("Q1 to Q2",
+ "Q2 to Q3",
+ "Q3 to Q4",
+ "Q1 to Q4")) +
+ theme(plot.title = element_text(face = "bold"))
+
+# set title to twice the base font size
+p3 <- p2 + theme(plot.title = element_text(size = rel(1.8)))
+
+# change panel and plot attributes
+p4 <- p3 + theme(panel.background = element_blank())
+
+# change legend attributes
+p5 <- p4 + theme(legend.text = element_text(size = 10),
+ legend.title = element_blank(),
+ legend.background = element_rect(colour = "black"),
+ legend.key.size = unit(0.75, "cm"),
+ axis.title.x = element_text(vjust = - 1.5, size = rel(1.35)),
+ axis.title.y = element_text(vjust = 1.5, size = rel(1.35)),
+ plot.margin = unit(c(1,1,1,1),"cm"))
+
+# reverse the legend order
+p6 <- p5 + guides(fill = guide_legend(reverse=TRUE))
+
+print(p6)
+ggsave("version6.pdf", p6, dpi = 600)
+
+
+
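The `binCut()` helper above assigns each percent change to one of five ordered bins with half-open intervals: x ≤ n1 maps to "1", n1 < x ≤ n2 to "2", and so on, with x > n4 mapping to "5" (so boundary values land in the lower of their two adjacent bins). Reproducing the function as defined above, the boundary behaviour checks out:

```r
# binCut as defined in user_consistency.R above
binCut <- function(var, n1, n2, n3, n4){
  sapply(var, function(x)
    if (x <= n1) { "1"
    } else if (x > n1 & x <= n2) { "2"
    } else if (x > n2 & x <= n3) { "3"
    } else if (x > n3 & x <= n4) { "4"
    } else { "5" })
}
binCut(c(-60, -50, 0, 50, 51), -50, -20, 20, 50)
# "1" "1" "3" "4" "5": -50 and 50 fall in the lower bin at each boundary
```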
diff --git a/analysis_scripts/mchen/blog_visualization/user_perf_distribution.R b/analysis_scripts/mchen/blog_visualization/user_perf_distribution.R
new file mode 100644
index 0000000..93f1ba9
--- /dev/null
+++ b/analysis_scripts/mchen/blog_visualization/user_perf_distribution.R
@@ -0,0 +1,303 @@
+library(plyr)
+library(dplyr)
+library(lattice)
+library(latticeExtra)
+
+data = tbl_df(read.csv("blog.csv", stringsAsFactors=FALSE))
+
+# SUBSET COLUMNS OF INTEREST
+all = select(data, domain_numeric, user_pk, user_id, active_days, ncases_touched, calendar_month)
+all$calendar_month = as.Date(all$calendar_month)
+all = filter(all, calendar_month >= as.Date("2010-01-01"))
+all = filter(all, calendar_month <= as.Date("2015-04-01"))
+n_distinct(all$domain_numeric); n_distinct(all$user_pk); n_distinct(all$calendar_month); min(all$calendar_month); max(all$calendar_month)
+
+# CREATE CUSTOM QUARTERS BY USER (METHOD#2: EXCLUDING INACTIVE MONTHS)
+all = arrange(all, domain_numeric, user_pk, calendar_month)
+all = all %>%
+ group_by (domain_numeric, user_pk) %>%
+ mutate(user_month_index = seq(n())) # indexing each user-month
+
+# CUT USER MONTHS INTO CUSTOM QUARTERS (IMPLEMENT METHOD 2)
+bins = c(1+(3*(0:ceiling(max(all$user_month_index)/3))))
+labs = paste("Q", seq(length(bins)-1), sep = "")
+all = all %>%
+ group_by(domain_numeric, user_pk) %>%
+ mutate(user_quarter_index = cut(user_month_index,
+ breaks = bins,
+ labels = labs,
+ right = FALSE))
+
+# ALL DATA
+all_naOmit = na.omit(all)
+pdf(file="allData.pdf", width=7, height=14, useDingbats=FALSE)
+# png(filename="allData.png")  # alternative device; open only one at a time
+
+p_all = densityplot(~active_days, data=all_naOmit,
+ panel=function(x,...){
+ panel.densityplot(x,...)
+ lower.v <- quantile(x,probs=0.25)
+ median.v <- median(x)
+ upper.v <- quantile(x,probs=0.75)
+ mean.v <- mean(x)
+
+ panel.abline(v=median.v, col.line="#003d71", lwd=1.2) # 50 percentile
+ panel.abline(v=mean.v, col.line="#003d71", lwd=1.2, lty=2) # mean
+
+ dens <- density(x)
+ min.v <- min(dens$x)
+ max.v <- max(dens$x)
+ panel.polygon(c(upper.v, dens$x[dens$x > upper.v & dens$x < max.v], max.v),
+ c(0, dens$y[dens$x> upper.v & dens$x < max.v], 0),
+ col="#f58220")
+ panel.polygon(c(min.v, dens$x[dens$x > min.v & dens$x < lower.v], lower.v),
+ c(0, dens$y[dens$x > min.v & dens$x < lower.v], 0),
+ col="#f58220")
+ },
+ scales=list(x=list(at=c(1,2,3,4,5,6,7,8,9,10,20,30)),
+ alternating = 1),
+ xlim=range(all_naOmit$active_days),
+ ylim=c(0,0.15),
+ xlab="Active days per month",
+ ylab="Density",
+ col="#413f3f",
+ plot.points=FALSE)
+print(p_all)
+dev.off()
+
+
+# SUBSETS OF INTEREST
+# flagship project in India
+crs = filter(all, domain_numeric == 40)
+crs_qtr_1 = filter(crs, user_quarter_index == "Q1")
+crs_qtr_1 = na.omit(crs_qtr_1)
+
+pdf(file="crs_remind_quarter_1.pdf", width=7, height=14, useDingbats=FALSE)
+# png(filename="crs_remind_quarter_1.png")  # alternative device; open only one at a time
+
+p_crs_q1 = densityplot(~active_days, data=crs_qtr_1,
+ panel=function(x,...){
+ panel.densityplot(x,...)
+ lower.v <- quantile(x,probs=0.25)
+ median.v <- median(x)
+ upper.v <- quantile(x,probs=0.75)
+ mean.v <- mean(x)
+
+ panel.abline(v=median.v, col.line="#003d71", lwd=1.2) # 50 percentile
+ panel.abline(v=mean.v, col.line="#003d71", lwd=1.2, lty=2) # mean
+
+ dens <- density(x)
+ min.v <- min(dens$x)
+ max.v <- max(dens$x)
+ panel.polygon(c(upper.v, dens$x[dens$x > upper.v & dens$x < max.v], max.v),
+ c(0, dens$y[dens$x> upper.v & dens$x < max.v], 0),
+ col="#f58220")
+ panel.polygon(c(min.v, dens$x[dens$x > min.v & dens$x < lower.v], lower.v),
+ c(0, dens$y[dens$x > min.v & dens$x < lower.v], 0),
+ col="#f58220")
+ },
+ scales=list(x=list(at=c(1,2,3,4,5,6,7,8,9,10,20,30)),
+ alternating = 1),
+ xlim=range(crs_qtr_1$active_days),
+ ylim=c(0,0.25),
+ xlab="Active days per month",
+ ylab="Density",
+ col="#413f3f",
+ plot.points=FALSE)
+
+print(p_crs_q1)
+dev.off() # closes the png; the pdf opened above becomes current
+print(p_crs_q1)
+dev.off() # closes the pdf
+
+# "big bromance": the 12 projects with the most users in their peak month
+bigBros = function(data, N) {
+ data %>%
+ group_by(domain_numeric, calendar_month) %>%
+ summarise(tot_users = n_distinct(user_pk)) %>%
+ summarise(max_users = max(tot_users)) %>%
+ arrange(., desc(max_users)) %>%
+    top_n(N, max_users) # rank explicitly by max_users
+}
+
+big_12 = bigBros(all_naOmit, 12) # may include more than 12 if there are ties
+if(nrow(big_12) > 12) big_12 = big_12[1:12,]
+big_12$rank = c(1:12)
+
+# subset data of these 12 domains
+big_12_domains = big_12$domain_numeric
+big_bros = left_join(big_12, all)
+
+png(filename = "big_12_domains.png",
+ width = 1080,
+ height = 1080,
+ bg = "transparent")
+
+big_bros$user_quarter_index = factor(big_bros$user_quarter_index)
+big_bros$rank = factor(big_bros$rank)
+big_bros_naOmit = na.omit(big_bros) # the densityplot below expects an NA-free copy
+
+pdf(14, 14, file = "p_big_bros.pdf")
+myUsers = big_12$max_users
+myMonths = big_bros %>%
+ group_by(rank) %>%
+ summarise(nmonths = n_distinct(calendar_month)) %>%
+ select(., nmonths)
+
+p_big_bros = densityplot(~active_days|rank, data=big_bros_naOmit,
+ panel=function(x,...){
+ panel.densityplot(x,...)
+ lower.v <- quantile(x,probs=0.25)
+ median.v <- median(x)
+ upper.v <- quantile(x,probs=0.75)
+ mean.v <- mean(x)
+
+ panel.abline(v=median.v, col.line="#003d71", lwd=1.2) # 50 percentile
+ # panel.abline(v=lower.v, col.line="#f58220") # 25 percentile
+ # panel.abline(v=upper.v, col.line="#f58220") # 75 percentile
+ panel.abline(v=mean.v, col.line="#003d71", lwd=1.2, lty=2) # mean
+
+ dens <- density(x)
+ min.v <- min(dens$x)
+ max.v <- max(dens$x)
+ panel.polygon(c(upper.v, dens$x[dens$x > upper.v & dens$x < max.v], max.v),
+ c(0, dens$y[dens$x> upper.v & dens$x < max.v], 0),
+ col="#f58220")
+ panel.polygon(c(min.v, dens$x[dens$x > min.v & dens$x < lower.v], lower.v),
+ c(0, dens$y[dens$x > min.v & dens$x < lower.v], 0),
+ col="#f58220")
+ # panel.text(25,0.20,labels=paste("Users: ", myUsers[which.packet(),], sep = ""))
+ # panel.text(25,0.17,labels=paste("Months: ", myMonths[which.packet(),], sep = ""))
+ },
+ # par.settings = theEconomist.theme(),
+ # scales = "free",
+ scales=list(x=list(at=c(1,2,3,4,5,6,7,8,9,10,20,30)),
+ alternating = 1),
+ xlim=range(big_bros_naOmit$active_days),
+ ylim=c(0,0.25),
+ xlab="Active days per month",
+ ylab="Density",
+ # aspect=0.1,
+ layout=c(3,4),
+                        as.table=TRUE, # draw panels left to right, top to bottom
+ strip=strip.custom(bg = "#ffffff",
+ par.strip.text = list(font=0.5,col="#413f3f"),
+ factor.levels = levels(big_bros_naOmit$rank)),
+ col="#413f3f",
+ plot.points=FALSE)
+print(p_big_bros)
+dev.off() # closes the pdf; the png opened above becomes current
+print(p_big_bros)
+dev.off() # closes the png
+
+
+# from each project, select users that have been active for at least N quarters
+N = 6
+qtr_n = all_naOmit %>%
+ group_by(domain_numeric, user_pk) %>%
+ filter(., max(user_month_index) >= N*3) %>%
+ filter(., user_month_index <= N*3)
+qtr_n$user_quarter_index = factor(qtr_n$user_quarter_index)
+# only visualize N quarters (no more than N, so that every quarter on the graph is contributed by the same set of users)
+viz_filename = paste("quarters_", N, sep = "")
+pdf(7,14, file = paste(viz_filename, "pdf",sep = "."))
+png(filename = paste(viz_filename, "png",sep = "."),
+ width = 720,
+ height = 840,
+ bg = "transparent")
+# distributionViz() is defined further down in this script; run that definition before this call
+qtr_n_viz = distributionViz(qtr_n$active_days, qtr_n$user_quarter_index, qtr_n)
+print(qtr_n_viz)
+dev.off() # closes the png; the pdf opened above becomes current
+print(qtr_n_viz)
+dev.off() # closes the pdf
+
+
+# ADD VERTICAL LINES TO DENSITY PLOTS
+addLine <- function(a=NULL, b=NULL, v=NULL, h=NULL, ..., once=FALSE) {
+ tcL<- trellis.currentLayout()
+ k<-0
+ for(i in 1:nrow(tcL))
+ for(j in 1:ncol(tcL))
+ if (tcL[i,j]> 0) {
+ k<-k+1
+ trellis.focus("panel", j, i, highlight = FALSE)
+ if (once) panel.abline(a=a[k], b=b[k], v=v[k], h=h[k], ...) else
+ panel.abline(a=a, b=b, v=v, h=h, ...)
+ trellis.unfocus()
+ }
+}
+
+
+# DISTRIBUTION FUNCTION
+distributionViz = function(var, groups, data){
+ densityplot(~var|groups, data=data,
+ panel=function(x,...){
+ panel.densityplot(x,...)
+ lower.v <- quantile(x,probs=0.25)
+ median.v <- median(x)
+ upper.v <- quantile(x,probs=0.75)
+ mean.v <- mean(x)
+
+ panel.abline(v=median.v, col.line="#003d71", lwd=1.2) # 50 percentile
+ # panel.abline(v=lower.v, col.line="#f58220") # 25 percentile
+ # panel.abline(v=upper.v, col.line="#f58220") # 75 percentile
+ panel.abline(v=mean.v, col.line="#003d71", lwd=1.2, lty=2) # mean
+
+ dens <- density(x)
+ min.v <- min(dens$x)
+ max.v <- max(dens$x)
+ panel.polygon(c(upper.v, dens$x[dens$x > upper.v & dens$x < max.v], max.v),
+ c(0, dens$y[dens$x> upper.v & dens$x < max.v], 0),
+ col="#f58220")
+ panel.polygon(c(min.v, dens$x[dens$x > min.v & dens$x < lower.v], lower.v),
+ c(0, dens$y[dens$x > min.v & dens$x < lower.v], 0),
+ col="#f58220")
+ # panel.text(0,0, labels = myText[panel.number()])
+ },
+ scales=list(x=list(at=c(1,2,3,4,5,6,7,8,9,10,20,30))),
+              xlim=range(var), # x-axis limits span the observed data
+ ylim=c(0,0.15),
+ xlab="Active days per month",
+ ylab="Density",
+ # aspect=0.1,
+ layout=c(1,length(levels(groups))),
+ index.cond=list(c(length(levels(groups)):1)),
+ strip=strip.custom(bg="#ffffff", par.strip.text=list(font=0.5,col="#413f3f")),
+ col="#413f3f",
+ plot.points=FALSE)
+}
+
+
+# DISTRIBUTION VIZ: ALL DATA
+lower.v <- quantile(all$active_days,probs=0.25)
+median.v <- median(all$active_days)
+upper.v <- quantile(all$active_days,probs=0.75)
+mean.v <- mean(all$active_days)
+dens <- density(all$active_days)
+min.v <- min(dens$x)
+max.v <- max(dens$x)
+
+#png(filename = "distAllData.png",
+# bg = "#fff0b4")
+distAll = densityplot(~active_days, data=all,
+ scales=list(x=list(at=c(1,2,3,4,5,6,7,8,9,10,20,30))),
+                      xlim=range(all$active_days), # x-axis limits span the observed data
+ ylim=c(0,0.25),
+ xlab="Active days per month",
+ ylab="Density",
+ # col="#f58220",
+ # lwd=2.5,
+ par.settings = theEconomist.theme(with.bg = TRUE, box = "transparent"),
+ plot.points=FALSE
+)
+
+distAll = distAll +
+ layer(
+ panel.polygon(c(upper.v, dens$x[dens$x > upper.v & dens$x < max.v], max.v),
+ c(0, dens$y[dens$x> upper.v & dens$x < max.v], 0),
+ col="#f58220")) +
+ layer(
+ panel.polygon(c(min.v, dens$x[dens$x > min.v & dens$x < lower.v], lower.v),
+ c(0, dens$y[dens$x > min.v & dens$x < lower.v], 0),
+ col="#f58220")) +
+ layer(
+ panel.abline(v=median.v, col.line="#003d71", lwd=1.2)) +
+ layer(
+ panel.abline(v=mean.v, col.line="#003d71", lwd=1.2, lty=2))
+print(distAll)
+dev.copy(pdf, "distAll_2.pdf", bg = "#fff0b4") # copy the graph already on the screen graphics device to a PDF
+dev.off()
\ No newline at end of file
diff --git a/analysis_scripts/mchen/case_activity_calc/case_activity.R b/analysis_scripts/mchen/case_activity_calc/case_activity.R
new file mode 100644
index 0000000..aa598a3
--- /dev/null
+++ b/analysis_scripts/mchen/case_activity_calc/case_activity.R
@@ -0,0 +1,67 @@
+# correlation between case activity rate and number of active days
+library(plyr)
+library(dplyr)
+
+data = tbl_df(read.csv("blog.csv", stringsAsFactors = FALSE))
+jharks = filter(data, domain == "jharkhand-mch")
+jharks$calendar_month = as.Date(jharks$calendar_month)
+
+# select time period
+jks = filter(jharks, calendar_month >= as.Date("2014-09-01"))
+jks = filter(jks, calendar_month < as.Date("2015-01-01"))
+jks = filter(jks, !(user_id %in% "demo_user"))
+
+# remove test user_id
+users = tbl_df(read.csv("users.csv", stringsAsFactors=FALSE))
+users = select(users, user_id, username)
+jks_user_indice = grep("@jharkhand-mch.commcarehq.org", users$username)
+jks_user = users[jks_user_indice,]
+testPos = grep("test", jks_user$username) # indices of test users
+if (length(testPos) > 0) jks_user = jks_user[-testPos,] # guard: x[-integer(0),] would drop every row
+
+library(stringr)
+jks_user$user = str_split_fixed(jks_user$username, "@", 2)[,1]
+user_indice_2 = grep("s[1-9]", jks_user$user)
+jks_user = jks_user[user_indice_2,] # 222 distinct users
+
+# Import case activity data
+ca = tbl_df(read.csv("ca_data.csv", stringsAsFactors = FALSE))
+names(ca)[1] = c("user")
+ca2 = left_join(ca, jks_user)
+dimnames(ca2)[[1]] = ca2$user_id
+ca3 = select(ca2, Sep, Oct, Nov, Dec) # user_id were added to each row as rownames
+ca3 = as.data.frame(ca3) # without this step reshape() gives a misleading error about row.names() having unequal lengths
+
+long_ca3 = reshape(ca3, varying = list(names(ca3)),
+ idvar = "user_id", ids = row.names(ca3),
+ times = names(ca3), timevar = "month",
+ direction = "long")
+
+names(long_ca3) = c("month.index", "case_activity_rate", "user_id")
+long_ca3$month.index = ifelse(long_ca3$month.index == "Sep", "2014-09-01",
+ ifelse(long_ca3$month.index == "Oct", "2014-10-01",
+ ifelse(long_ca3$month.index == "Nov", "2014-11-01", "2014-12-01")))
+long_ca3$month.index = as.Date(long_ca3$month.index)
+long_ca3 = tbl_df(long_ca3)
+
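The wide-to-long reshape above is dense; a minimal self-contained sketch with toy data (hypothetical users u1/u2 and two month columns) shows the shape of the result:

```r
# One row per user, one column per month; melt to one row per (user, month).
# reshape() names the value column after the first varying column ("Sep").
w <- data.frame(Sep = c(10, 20), Oct = c(30, 40), row.names = c("u1", "u2"))
l <- reshape(w, varying = list(names(w)),
             idvar = "user_id", ids = row.names(w),
             times = names(w), timevar = "month",
             direction = "long")
# l has 4 rows and the columns month, Sep (the value column), user_id
```

This is why the script renames the columns immediately afterwards: the value column carries a misleading month name.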
+# Join case activity data from HQ and active days data from DP
+jks_2 = select(jks, user_id, active_days, calendar_month)
+long_ca3 = rename(long_ca3, calendar_month = month.index)
+jks_3 = left_join(jks_2, long_ca3, by = c("user_id", "calendar_month"))
+jks_3[is.na(jks_3)] <- 0
+
+# Correlation 1: case_activity_rate vs. active days
+library(Hmisc)
+c1 = rcorr(jks_3$active_days, jks_3$case_activity_rate, type="pearson")
+
+# Correlation 2: rank_car vs. rank_ad
+# rank by case activity rate
+jk_4 = jks_3 %>%
+ group_by(user_id) %>%
+ summarise(a=median(case_activity_rate), b=median(active_days))
+jk_4 = arrange(jk_4, desc(a)) # order by case activity rate
+jk_4$car_rank = seq_along(jk_4$a)
+jk_4 = arrange(jk_4, desc(b)) # order by active days per month
+jk_4$ad_rank = seq_along(jk_4$b)
+
+c2 = rcorr(jk_4$ad_rank, jk_4$car_rank, type="pearson")
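The second correlation is effectively a rank correlation: Pearson applied to ranks equals Spearman. A quick base-R check with toy vectors (no Hmisc needed) confirms the equivalence, using descending ranks to mirror the arrange(desc(...)) ordering above:

```r
a <- c(5, 3, 9, 1)
b <- c(4, 2, 8, 3)
r1 <- cor(rank(-a), rank(-b), method = "pearson")  # Pearson on rank positions
r2 <- cor(a, b, method = "spearman")               # Spearman on the raw values
all.equal(r1, r2)  # TRUE
```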
diff --git a/analysis_scripts/mchen/case_activity_calc/case_activity_utils.R b/analysis_scripts/mchen/case_activity_calc/case_activity_utils.R
new file mode 100644
index 0000000..a0d957d
--- /dev/null
+++ b/analysis_scripts/mchen/case_activity_calc/case_activity_utils.R
@@ -0,0 +1,146 @@
+# Utility functions to generate the case table and the same metrics as in the WAR and HQ Admin reports
+
+
+# date of data export
+export_date <- c("2014-10-12")
+get_inactive_line <- function(d1, d2) { # d1 format: YYYY-MM-DD
+ as.Date(d1) - d2
+}
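get_inactive_line() simply shifts the export date back by d2 days; copied here with a quick check:

```r
get_inactive_line <- function(d1, d2) { # d1 format: YYYY-MM-DD
  as.Date(d1) - d2
}
get_inactive_line("2014-10-12", 120)  # Date: "2014-06-14"
```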
+
+
+# get data by case types
+get_case_data <- function(dt, case_type) {
+ case_dt <- dt[which(dt$case_type == case_type), ]
+ return(case_dt)
+}
+
+
+# total cases that are 'currently open' since deployment
+get_open_close <- function(dt) {
+ closed <- which(dt$closed == "TRUE") # total closed cases since deployment
+ closed_case <- length(unique(dt[closed,]$case_id))
+ open_case <- length(unique(dt$case_id)) - closed_case
+ return(open_case)
+}
+
+
+# visits to open/closed case
+get_visits_closed_case <- function(dt) {
+ closed_case <- unique(dt[which(dt$closed == "TRUE"),]$case_id) # this returns a vector of cases that have been closed
+ return(closed_case)
+}
+
+
+# open case data
+get_open_case <- function(dt) {
+ case_open <- dt[which(dt$closed == "FALSE"),]
+ return(case_open)
+}
+
+
+# this function selects the first row within a grouping
+get_first_visit <- function(dt) {
+ dt1 = dt %>%
+ group_by(case_id) %>%
+ do(head(., n=1)) %>%
+    plyr::rename(c('visit_date' = 'first_visit')) # plyr-style rename; dplyr's rename() would mask it
+ return(dt1)
+}
+
+
+# this function selects the last row within a grouping
+get_last_visit <- function(dt) {
+ dt2 = dt %>%
+ group_by(case_id) %>%
+ do(tail(., n=1)) %>%
+    plyr::rename(c('visit_date' = 'last_visit')) # plyr-style rename; dplyr's rename() would mask it
+ return(dt2)
+}
+
+
+# total visits for each case
+get_total_visits <- function(dt) {
+ dt3 = dt %>%
+ group_by(case_id) %>%
+ summarise(total_visits = n())
+ return(dt3)
+}
+
+
+# number of active days per user
+get_active_days <- function(dt) {
+ dt4 = dt %>%
+ group_by(user_pk, month.index) %>%
+ summarise(active_days = n_distinct(visit_date))
+ return(dt4)
+}
+
+
+# get cases created after the beginning of the range
+get_num_cases_created <-function(dt, d1, d2) {
+ cases_created <- which(as.Date(dt$time_start) > get_inactive_line(d1, d2) & is.na(dt$prev_visit_start))
+ return(length(cases_created))
+}
+
+# get cases closed before the beginning of the date range
+get_num_cases_closed_before_range <- function(dt, d1, d2) {
+ cases_closed_before_range <- which(as.Date(dt$time_start) < get_inactive_line(d1, d2) & dt$closed == "TRUE")
+ return(length(cases_closed_before_range))
+}
+
+
+# avg. days between each visit
+get_avg_days_elapsed <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$case_id, dt$time_start),]
+ dt4 <- ddply(dt, .(case_id), function(x) mean(as.numeric(diff(as.Date(x$time_start))))) # unit is days
+ colnames(dt4)[2] <- c("avg_days_elapsed_btw_visits")
+ return(dt4)
+}
+
+# match vector x, y and find element in vector x that do not match y
+"%w/o%" <- function(x, y) x[!x %in% y]
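The %w/o% ("without") operator reads more clearly with a quick example (operator copied from above):

```r
"%w/o%" <- function(x, y) x[!x %in% y]
c(1, 2, 3, 4) %w/o% c(2, 4)     # 1 3
letters[1:5] %w/o% c("b", "e")  # "a" "c" "d"
```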
+
+# get active users for each type of cases
+get_last_interaction <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$user_id, dt$time_start),]
+ dt5 <- ddply(dt, .(user_id), function(x) x[nrow(x),])
+ colnames(dt5)[9] <- c("last_interaction")
+ return(dt5)
+}
+
+
+# return active mobile users
+get_active_users <- function(dt) {
+ user_interaction <- get_last_interaction(dt)
+ inactive_line <- get_inactive_line(export_date, 30)
+  user_interaction$active <- ifelse(as.Date(user_interaction$last_interaction) >= as.Date(inactive_line), "yes", "no")
+ print(table(user_interaction$active))
+}
+
+
+# get total days a case has been on CC: from deployment to the day of data export
+get_life_length <- function(dt, export_date) {
+ dt$life_length <- as.numeric(as.Date(export_date) - as.Date(dt$first_visit))
+ return(dt)
+}
+
+
+# convert numeric days on CC to age categories for cases
+get_age_range <- function(dt, x1, x2, x3){
+ dt$age_range <- ifelse(dt$life_length < x1, "1",
+ ifelse(dt$life_length >= x1 & dt$life_length < x2, "2",
+ ifelse(dt$life_length >= x2 & dt$life_length < x3, "3", "4")))
+ return(dt)
+}
+
+
+# get inactive/active subset of case table
+get_subset_120 <- function(dt, boolean) {
+ dt_subset <- dt[which(dt$touched_120 == boolean),]
+ return(dt_subset)
+}
+
+
+
diff --git a/analysis_scripts/mchen/case_followup_investigation/breakdown.R b/analysis_scripts/mchen/case_followup_investigation/breakdown.R
new file mode 100644
index 0000000..244803b
--- /dev/null
+++ b/analysis_scripts/mchen/case_followup_investigation/breakdown.R
@@ -0,0 +1,64 @@
+# case table breakdown by case type and metrics computation
+# (expects `x`, a per-case-type data frame from get_case_data(), with the helper functions already loaded)
+
+x_first <- get_first_visit(x)
+x_last <- get_last_visit(x)
+x_total_visits <- get_total_visits(x)
+
+x_first <- x_first[order(x_first$case_id),]
+x_last <- x_last[order(x_last$case_id),]
+x_total_visits <- x_total_visits[order(x_total_visits$case_id),]
+x_last$first_visit <- x_first$first_visit
+x_last$total_visits <- x_total_visits$total_visits
+
+colnames(x_last)[5:7] <- c("last_visit_created", "last_visit_updated", "last_visit_closed")
+
+x_last$touched_120 <- ifelse(as.Date(x_last$last_visit) > get_inactive_line(export_date, 120), "yes", "no")
+x_last$touched_60 <- ifelse(as.Date(x_last$last_visit) > get_inactive_line(export_date, 60), "yes", "no")
+
+x_last$total_days <- as.numeric(as.Date(x_last$last_visit) - as.Date(x_last$first_visit))
+avg_days_between_visits <- get_avg_days_elapsed(x)
+x_last$avg_days_between_visits <- round(avg_days_between_visits$avg_days_elapsed_btw_visits, digits = 1)
+
+
+
+# WAR METRICS
+# cases created in last 120 (first visit happened in last 120 days)
+cases_created_120 <- which(as.Date(x_last$first_visit) > get_inactive_line(export_date, 120))
+
+# cases closed in last 120
+cases_closed_120 <- which(x_last$touched_120 == "yes" & x_last$last_visit_closed == "TRUE")
+
+# active cases: cases touched within the date range
+cases_touched <- which(x_last$touched_120 == "yes")
+
+# total cases: cases that are open at some point during the date range
+total_cases <- length(unique(x$case_id)) - get_num_cases_closed_before_range(x, export_date, 120)
+
+# % active case
+fu_war <- round(length(cases_touched)/total_cases, digits = 2)
+
+print(c(length(cases_created_120), length(cases_closed_120), length(cases_touched), total_cases, fu_war))
+
+
+# PMP METRICS
+# cases active in last 60: cases that are created or updated but not closed in last 60 days
+cases_created_updated_60 <- length(which(x_last$touched_60 == "yes" & x_last$last_visit_closed == "FALSE"))
+
+# active cases: cases that are created or updated but not closed in last 120
+cases_created_updated_120 <- length(which(x_last$touched_120 == "yes" & x_last$last_visit_closed == "FALSE"))
+
+# inactive cases: open cases that are untouched in last 120
+cases_inactive <- length(which(x_last$touched_120 == "no" & x_last$last_visit_closed == "FALSE"))
+
+# cases
+length(unique(x$case_id))
+
+# fu_rate
+fu_pmp <- cases_created_updated_60/(cases_inactive + cases_created_updated_120)
+
+print(c(cases_created_updated_60, cases_created_updated_120, cases_inactive, fu_pmp))
+
+# cases that have lived longer than they are supposed to
+cases_suspicious <- length(which(x_last$total_days > 270 & x_last$last_visit_closed == "FALSE"))
+
+
diff --git a/analysis_scripts/mchen/case_followup_investigation/crs_func.R b/analysis_scripts/mchen/case_followup_investigation/crs_func.R
new file mode 100644
index 0000000..dbffabc
--- /dev/null
+++ b/analysis_scripts/mchen/case_followup_investigation/crs_func.R
@@ -0,0 +1,144 @@
+# Utility functions to generate the case table and the same metrics as in the WAR and HQ Admin reports
+
+
+# date of data export
+export_date <- c("2014-11-02") # this should be the date of the most recent data pull
+get_inactive_line <- function(d1, d2) { # d1 format: YYYY-MM-DD
+ as.Date(d1) - d2
+}
+
+
+# get data by case types
+get_case_data <- function(dt, case_type) {
+ case_dt <- dt[which(dt$case_type == case_type), ]
+ return(case_dt)
+}
+
+
+# total cases that are 'currently open' since deployment
+get_open_close <- function(dt) {
+ closed <- which(dt$closed == "TRUE") # total closed cases since deployment
+ closed_case <- length(unique(dt[closed,]$case_id))
+ open_case <- length(unique(dt$case_id)) - closed_case
+ return(open_case)
+}
+
+
+# visits to open/closed case
+get_visits_closed_case <- function(dt) {
+ closed_case <- unique(dt[which(dt$closed == "TRUE"),]$case_id) # this returns a vector of cases that have been closed
+ return(closed_case)
+}
+
+
+# open case data
+get_open_case <- function(dt) {
+ case_open <- dt[which(dt$closed == "FALSE"),]
+ return(case_open)
+}
+
+
+# this function selects the first row within a grouping
+get_first_visit <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$case_id, dt$time_start),]
+ dt1 <- ddply(dt, .(case_id), function(x) x[1,])
+ dt1 <- rename(dt1, c("visit_date" = "first_visit"))
+ return(dt1)
+}
+
+
+# this function selects the last row within a grouping
+get_last_visit <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$case_id, dt$time_start),]
+ dt2 <- ddply(dt, .(case_id), function(x) x[nrow(x),])
+ dt2 <- rename(dt2, c("visit_date" = "last_visit"))
+ return(dt2)
+}
+
+
+# total visits for each case
+get_total_visits <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$case_id, dt$time_start),]
+ dt3 <- ddply(dt, .(case_id), function(x) nrow(x))
+ colnames(dt3)[2] <- c("total_visits")
+ return(dt3)
+}
+
+
+# get cases created after the beginning of the range
+get_num_cases_created <-function(dt, d1, d2) {
+ cases_created <- which(as.Date(dt$time_start) > get_inactive_line(d1, d2) & is.na(dt$prev_visit_start))
+ return(length(cases_created))
+}
+
+# get cases closed before the beginning of the date range
+get_num_cases_closed_before_range <- function(dt, d1, d2) {
+ cases_closed_before_range <- which(as.Date(dt$time_start) < get_inactive_line(d1, d2) & dt$closed == "TRUE")
+ return(length(cases_closed_before_range))
+}
+
+
+# avg. days between each visit
+get_avg_days_elapsed <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$case_id, dt$time_start),]
+ dt4 <- ddply(dt, .(case_id), function(x) mean(as.numeric(diff(as.Date(x$time_start))))) # unit is days
+ colnames(dt4)[2] <- c("avg_days_elapsed_btw_visits")
+ return(dt4)
+}
+
+# match vector x, y and find element in vector x that do not match y
+"%w/o%" <- function(x, y) x[!x %in% y]
+
+# get active users for each type of cases
+get_last_interaction <- function(dt) {
+ library(plyr)
+ dt <- dt[order(dt$user_id, dt$time_start),]
+ dt5 <- ddply(dt, .(user_id), function(x) x[nrow(x),])
+ colnames(dt5)[9] <- c("last_interaction")
+ return(dt5)
+}
+
+
+# return active mobile users
+get_active_users <- function(dt) {
+ user_interaction <- get_last_interaction(dt)
+ inactive_line <- get_inactive_line(export_date, 30)
+  user_interaction$active <- ifelse(as.Date(user_interaction$last_interaction) >= as.Date(inactive_line), "yes", "no")
+ print(table(user_interaction$active))
+}
+
+
+# get total days a case has been on CC: from deployment to the day of data export
+get_life_length <- function(dt, export_date) {
+ dt$life_length <- as.numeric(as.Date(export_date) - as.Date(dt$first_visit))
+ return(dt)
+}
+
+
+# convert numeric days on CC to age categories for cases
+get_age_range <- function(dt, x1, x2, x3){
+ dt$age_range <- ifelse(dt$life_length < x1, "1",
+ ifelse(dt$life_length >= x1 & dt$life_length < x2, "2",
+ ifelse(dt$life_length >= x2 & dt$life_length < x3, "3", "4")))
+ return(dt)
+}
+
+
+# get inactive/active subset of case table
+get_subset_120 <- function(dt, boolean) {
+ dt_subset <- dt[which(dt$touched_120 == boolean),]
+ return(dt_subset)
+}
+
+
+# unique cases created by demo_user
+demo_case <- function(data) {
+ demo_data <- data[which(data$user_id == "demo_user"),]
+ demo_ncase <- length(unique(demo_data$case_id))
+ return(demo_ncase)
+}
+
diff --git a/analysis_scripts/mchen/case_followup_investigation/crs_merged.R b/analysis_scripts/mchen/case_followup_investigation/crs_merged.R
new file mode 100644
index 0000000..e78dc45
--- /dev/null
+++ b/analysis_scripts/mchen/case_followup_investigation/crs_merged.R
@@ -0,0 +1,77 @@
+# case table computation with the same metrics as
+# in Worker Activity Report and HQ Admin Report
+# for the whole domain (expects `merged`, the combined visit table, with the helpers in crs_func.R loaded)
+
+# remove visits by demo_user
+demo_rows <- which(merged$user_id == "demo_user")
+if (length(demo_rows) > 0) merged <- merged[-demo_rows,]
+
+# total closed cases
+get_open_close(merged)
+
+
+
+##### to be refactored into functions #####
+
+# get case table with the following fields:
+ # visit_first, visit_last,
+ # visit_last_created, visit_last_updated, visit_last_closed,
+ # days elapsed between creation and last visit
+ # average days elapsed between visits for a given case
+ # touched_120, touched_60
+merged_first <- get_first_visit(merged)
+merged_last <- get_last_visit(merged)
+merged_total_visits <- get_total_visits(merged)
+
+merged_first <- merged_first[order(merged_first$case_id),]
+merged_last <- merged_last[order(merged_last$case_id),]
+merged_total_visits <- merged_total_visits[order(merged_total_visits$case_id),]
+merged_last$first_visit <- as.Date(merged_first$first_visit)
+merged_last$total_visits <- merged_total_visits$total_visits
+
+merged_last <- rename(merged_last, c("created" = "last_visit_created", "updated" = "last_visit_updated", "closed" = "last_visit_closed"))
+
+merged_last$total_days <- as.numeric(as.Date(merged_last$last_visit) - as.Date(merged_last$first_visit))
+
+merged_last$touched_120 <- ifelse(as.Date(merged_last$last_visit) > get_inactive_line(export_date, 120), "yes", "no")
+merged_last$touched_60 <- ifelse(as.Date(merged_last$last_visit) > get_inactive_line(export_date, 60), "yes", "no")
+
+avg_days_between_visits <- get_avg_days_elapsed(merged)
+merged_last$avg_days_between_visits <- round(avg_days_between_visits$avg_days_elapsed_btw_visits, digits = 1)
+
+
+# WAR METRICS
+# cases created in last 120 (first visit happened in last 120 days)
+cases_created_120 <- which(as.Date(merged_last$first_visit) > get_inactive_line(export_date, 120))
+
+# cases closed in last 120
+cases_closed_120 <- which(merged_last$touched_120 == "yes" & merged_last$last_visit_closed == "TRUE")
+
+# active cases: cases touched within the date range
+cases_touched <- which(merged_last$touched_120 == "yes")
+
+# total cases: cases that are open at some point during the date range
+total_cases <- length(unique(merged$case_id)) - get_num_cases_closed_before_range(merged, export_date, 120)
+
+# % active case
+fu_war <- round(length(cases_touched)/total_cases, digits = 2)
+
+print(c(length(cases_created_120), length(cases_closed_120), length(cases_touched), total_cases, fu_war))
+
+
+# PMP METRICS
+# cases active in last 60: cases that are created or updated but not closed in last 60 days
+cases_created_updated_60 <- length(which(merged_last$touched_60 == "yes" & merged_last$last_visit_closed == "FALSE"))
+
+# active cases: cases that are created or updated but not closed in last 120
+cases_created_updated_120 <- length(which(merged_last$touched_120 == "yes" & merged_last$last_visit_closed == "FALSE"))
+
+# inactive cases: open cases that are untouched in last 120
+cases_inactive <- length(which(merged_last$touched_120 == "no" & merged_last$last_visit_closed == "FALSE"))
+
+# cases
+length(unique(merged$case_id))
+
+# fu_rate
+fu_pmp <- cases_created_updated_120/(cases_inactive + cases_created_updated_120)
+
+
diff --git a/analysis_scripts/mchen/case_followup_investigation/crs_untouched_baby.R b/analysis_scripts/mchen/case_followup_investigation/crs_untouched_baby.R
new file mode 100644
index 0000000..dcd43b0
--- /dev/null
+++ b/analysis_scripts/mchen/case_followup_investigation/crs_untouched_baby.R
@@ -0,0 +1,23 @@
+# a breakdown of inactive baby cases
+# domain: crs-remind
+
+# FU rate: cases touched in last 120 / total open cases
+
+# calc months that a given case has been open for
+x_temp <- get_life_length(x_last, export_date)
+x_temp <- get_age_range(x_temp, 90, 180, 360)
+table(x_temp$age_range); nrow(x_temp)
+
+# cases touched in last 120
+x_touched_120 <- get_subset_120(x_temp, "yes")
+table(x_touched_120$age_range)
+
+# open cases that are untouched in last 120
+x_untouched_120 <- get_subset_120(x_temp, "no")
+x_open_untouched_120 <- x_untouched_120[which(x_untouched_120$last_visit_closed == FALSE),]
+table(x_open_untouched_120$age_range)
+
+
+# average visits to cases created in last 3/6/9/12 months
+test <- ddply(x_temp, .(age_range), summarize, mean(total_visits))
+
diff --git a/analysis_scripts/mchen/case_followup_investigation/fu_rate_by_domain.R b/analysis_scripts/mchen/case_followup_investigation/fu_rate_by_domain.R
new file mode 100644
index 0000000..e7727db
--- /dev/null
+++ b/analysis_scripts/mchen/case_followup_investigation/fu_rate_by_domain.R
@@ -0,0 +1,94 @@
+
+# to be written into functions; for now there are 6 domains, so the split and export are done manually
+care <- interactions_subset[which(interactions_subset$domain == "care-ecd"),]
+aaharbaseline <- interactions_subset[which(interactions_subset$domain == "aaharbaseline"),]
+myrada <- interactions_subset[which(interactions_subset$domain == "myrada"),]
+spandan <- interactions_subset[which(interactions_subset$domain == "spandan"),]
+rdi_hiht <- interactions_subset[which(interactions_subset$domain == "rdi-hiht"),]
+crs_remind <- interactions_subset[which(interactions_subset$domain == "crs-remind"),]
+
+write.csv(care, "care.csv")
+write.csv(aaharbaseline, "aaharbaseline.csv")
+write.csv(myrada, "myrada.csv")
+write.csv(spandan, "spandan.csv")
+write.csv(rdi_hiht, "rdi_hiht.csv")
+write.csv(crs_remind, "crs_remind.csv")
+
+
+# remove demo_user
+demo_rows <- which(spandan$user_id == "demo_user")
+merged <- if (length(demo_rows) > 0) spandan[-demo_rows,] else spandan # avoid x[-integer(0),] dropping every row
+
+# nrow(merged)
+
+merged_first <- get_first_visit(merged)
+merged_last <- get_last_visit(merged)
+merged_total_visits <- get_total_visits(merged)
+
+merged_first <- merged_first[order(merged_first$case_id),]
+merged_last <- merged_last[order(merged_last$case_id),]
+merged_total_visits <- merged_total_visits[order(merged_total_visits$case_id),]
+merged_last$first_visit <- as.Date(merged_first$first_visit)
+merged_last$total_visits <- merged_total_visits$total_visits
+
+merged_last <- rename(merged_last, c("created" = "last_visit_created", "updated" = "last_visit_updated", "closed" = "last_visit_closed"))
+
+merged_last$total_days <- as.numeric(as.Date(merged_last$last_visit) - as.Date(merged_last$first_visit))
+
+merged_last$touched_120 <- ifelse(as.Date(merged_last$last_visit) > get_inactive_line(export_date, 120), "yes", "no")
+merged_last$touched_60 <- ifelse(as.Date(merged_last$last_visit) > get_inactive_line(export_date, 60), "yes", "no")
+
+avg_days_between_visits <- get_avg_days_elapsed(merged)
+merged_last$avg_days_between_visits <- round(avg_days_between_visits$avg_days_elapsed_btw_visits, digits = 1)
+
+
+# WAR METRICS
+# cases created in last 120 (first visit happened in last 120 days)
+cases_created_120 <- which(as.Date(merged_last$first_visit) > get_inactive_line(export_date, 120))
+
+# cases closed in last 120
+cases_closed_120 <- which(merged_last$touched_120 == "yes" & merged_last$last_visit_closed == "TRUE")
+
+# active cases: cases touched within the date range
+cases_touched <- which(merged_last$touched_120 == "yes")
+
+# total cases: cases that are open at some point during the date range
+total_cases <- length(unique(merged$case_id)) - get_num_cases_closed_before_range(merged, export_date, 120)
+
+
+# HQ FU-rate
+# active cases: cases that are created or updated but not closed in last 120
+cases_created_updated_120 <- length(which(merged_last$touched_120 == "yes" & merged_last$last_visit_closed == "FALSE"))
+
+# inactive cases: open cases that are untouched in last 120
+cases_inactive <- length(which(merged_last$touched_120 == "no" & merged_last$last_visit_closed == "FALSE"))
+
+hq_fu_rate <- cases_created_updated_120/(cases_inactive + cases_created_updated_120)
+
+# DP FU-rate
+# cases touched / (cases touched + inactive cases)
+dp_fu_rate <- length(cases_touched)/(length(cases_touched) + cases_inactive)
+
+
+
+
+# breakdown by case type and their time on CommCare
+merged_last_temp <- get_life_length(merged_last, export_date)
+merged_last_temp <- get_age_range(merged_last_temp, 90, 180, 360)
+table(merged_last_temp$age_range); nrow(merged_last_temp)
+
+cases_touched_by_type_age <- ddply(merged_last_temp, .(case_type, age_range), function(x)length(which(x$touched_120 == "yes")))
+cases_inactive_by_type_age <- ddply(merged_last_temp, .(case_type, age_range), function(x)length(which(x$touched_120 == "no" & x$last_visit_closed == "FALSE")))
+cases_closed_by_type_age <- ddply(merged_last_temp, .(case_type, age_range), function(x)length(which(x$touched_120 == "yes" & x$last_visit_closed == "TRUE")))
+cases_created_by_type_age <- ddply(merged_last_temp, .(case_type, age_range), function(x)length(which(as.Date(x$first_visit) > get_inactive_line(export_date, 120))))
+
+cases_touched_by_type_age <- rename(cases_touched_by_type_age, c("V1" = "cases_touched"))
+cases_inactive_by_type_age <- rename(cases_inactive_by_type_age, c("V1" = "cases_inactive"))
+cases_closed_by_type_age <- rename(cases_closed_by_type_age, c("V1" = "cases_closed"))
+cases_created_by_type_age <- rename(cases_created_by_type_age, c("V1" = "cases_created"))
+
+
+dp_fu <- merge(cases_touched_by_type_age, cases_inactive_by_type_age, by = c("case_type", "age_range"))
+dp_fu$fu_rate <- round(dp_fu$cases_touched/(dp_fu$cases_touched + dp_fu$cases_inactive), digits = 3)
+
+dp_fu <- merge(dp_fu, cases_closed_by_type_age, by = c("case_type", "age_range"))
+dp_fu <- merge(dp_fu, cases_created_by_type_age, by = c("case_type", "age_range")) # to be written into a function merging multiple data frames
diff --git a/analysis_scripts/mchen/div2/stage_1.R b/analysis_scripts/mchen/div2/stage_1.R
new file mode 100644
index 0000000..108148c
--- /dev/null
+++ b/analysis_scripts/mchen/div2/stage_1.R
@@ -0,0 +1,461 @@
+## IMPORT / LOADING PACKAGES
+# if(!require(installr)) {
+# install.packages("installr"); require(installr)}
+
+library(plyr)
+library(dplyr)
+library(reshape) # loading this package for the purpose of renaming variables
+library(zoo) # converting month.index format
+library(lattice) # barchart() used in the graphic display section
+library(gridExtra) # grid.arrange() used in the graphic display section
+
+data = tbl_df(read.csv("blog_data_2_2_15.csv", stringsAsFactors = FALSE))
+d_1 = data %>%
+  filter(., calendar_month <= as.Date("2014-12-01")) # the rating surveys were done before the end of 2014
+
+d_1_domain = unique(d_1$domain_numeric)
+
+# read in multiple flw rating sheets
+main_dir = getwd()
+flw_rating_dir = file.path(getwd(), "flw_rating")
+filenames = list.files(path = flw_rating_dir)
+flw_rating_data = list()
+for (i in 1:length(filenames)){
+ flw_rating_data[[i]] = read.csv(paste(flw_rating_dir, "/", filenames[i], sep = ""),
+ stringsAsFactors = FALSE,
+ nrows = 21)
+}
+
+# data cleaning
+for(i in 1:length(filenames)){
+ flw_rating_data[[i]] = tbl_df(flw_rating_data[[i]])
+ flw_rating_data[[i]] = select(flw_rating_data[[i]], -starts_with("X"))
+ flw_rating_data[[i]]$domain_numeric = as.numeric(gsub(".csv","",filenames[[i]]))
+ # flw_rating_data[[i]] = filter(flw_rating_data[[i]], as.numeric(Username %in% "") == 0)
+  flw_rating_data[[i]] = filter(flw_rating_data[[i]], !(Education.background %in% ""))
+ flw_rating_data[[i]] = select(flw_rating_data[[i]],
+ Username, Gender, Age.range, Education.background,
+ Ability.to.use.CommCare,Overall.performance.as.an.FLW,Previous.work.experience.in.maternal.and.child.health,domain_numeric)
+
+}
+
+merged_rating = rbind(flw_rating_data[[2]], flw_rating_data[[1]])
+for (i in 3:length(filenames)) {
+ merged_rating = rbind(merged_rating, flw_rating_data[[i]])
+}
+
+# clean up values for each variable (believe it or not, I did it manually)
+# truncate all good* to good
+merged_rating$Overall.performance.as.an.FLW = gsub("Good - meets or exceeds expectations", "Good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Good, meets or exceeds expectations ", "Good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Very good, exceeds most expectations", "Very good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Very good - exceeds most expectations", "Very good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("NA", "Unknown", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Needs improvement, does not meet expectations", "Needs improvement", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Newly involved not yet performed ", "Unknown", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("1. Very good", "Very good",merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("2. Good", "Good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("3. Needs improvement - does not meet expectations", "Needs improvement", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("4. Unknown", "Unknown",merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Average", "Needs improvement", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Excellent", "Very good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Satisfactory", "Good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Very good,survey was done throughly, all the forms filled up completely, ", "Very good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Very, good, sincerely done the survey, all the forms filled up meticulosly", "Very good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW = gsub("Very Good", "Very good", merged_rating$Overall.performance.as.an.FLW)
+merged_rating$Overall.performance.as.an.FLW[is.na(merged_rating$Overall.performance.as.an.FLW)] <- "Unknown"
+merged_rating$Overall.performance.as.an.FLW[merged_rating$Overall.performance.as.an.FLW == ""] <- "Unknown" # set all NA and blank cells to "Unknown"
+
+
+merged_rating$Ability.to.use.CommCare = gsub("Good - meets or exceeds expectations", "Good", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Good, meets or exceeds expectations ", "Good", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Very good, exceeds most expectations", "Very good", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Very good - exceeds most expectations", "Very good", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Needs improvement - does not meet expectations", "Needs improvement", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Needs improvement, does not meet expectations", "Needs improvement", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Newly involved not yet performed ", "Unknown", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("1. Very good", "Very good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("2. Good", "Good", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("3. Needs improvement - does not meet expectations", "Needs improvement", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("3. Needs improvement", "Needs improvement", merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("4. Unknown", "Unknown",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Able to use effectively", "Very good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Could not use effectively", "Needs improvement",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Able to manage", "Good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("She is able to use CommCare", "Good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Yes", "Good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Has ability to use ", "Good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Able to use CommCare", "Good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("Very effectively use", "Very good",merged_rating$Ability.to.use.CommCare)
+merged_rating$Ability.to.use.CommCare = gsub("YES ,Very efficiently", "Very good",merged_rating$Ability.to.use.CommCare)
+
+# data check
+unique(merged_rating$Overall.performance.as.an.FLW)
+unique(merged_rating$Ability.to.use.CommCare)
+
+# filter domains from monthly table
+monthly = tbl_df(read.csv("monthly.csv"))
+users = tbl_df(read.csv("users.csv"))
+
+# import domain table from live db connection and match domain_numeric in merged_rating with domain names in domain table
+colnames(domains)[2] <- c("domain_numeric")
+merged_rating = left_join(merged_rating, domains)
+colnames(merged_rating)[1] <- c("username") # this adds domain names to the merged_rating data table
+merged_rating = select(merged_rating,
+ username, domain_numeric, name,
+ Ability.to.use.CommCare, Overall.performance.as.an.FLW,
+ Gender, Age.range, Education.background,
+ Previous.work.experience.in.maternal.and.child.health)
+merged_rating = rename(merged_rating, c(name = "domain"))
+
+# match username with user_id
+domain_list = unique(merged_rating$domain)
+monthly_sub = filter(monthly, domain %in% domain_list) # lata-medical is missing
+
+userNameSplit = function(x){
+ unlist(strsplit(x,"@"))[1]
+}
+domainNameSplit = function(x){
+ temp = unlist(strsplit(x, "@"))[2]
+ unlist(strsplit(temp, "[.]"))[1]
+}
+
+users$uname = sapply(as.character(users$username), userNameSplit)
+users$domain = sapply(as.character(users$username), domainNameSplit)
+merged_rating = rename(merged_rating, c(username = "uname"))
+merged_rating_2 = inner_join(merged_rating, users, by = c("uname", "domain"))
+
+# for each user_id, retrieve monthly data
+merged_rating_3 = left_join(merged_rating_2, monthly, by = c("user_id", "domain"))
+merged_data = select(merged_rating_3, uname, domain_numeric, domain,
+ Ability.to.use.CommCare, Overall.performance.as.an.FLW, user_id, username, user_pk, month.index,
+ date_first_visit, date_last_visit, nvisits, active_days, active_day_percent, nforms, ncases_touched,
+ Gender, Age.range, Education.background, Previous.work.experience.in.maternal.and.child.health)
+
+merged_data = rename(merged_data, c(Ability.to.use.CommCare = "ability_to_use_commcare",
+ Overall.performance.as.an.FLW = "overall_performance_as_an_flw",
+ Gender = "gender",
+ Age.range = "age_range",
+ Education.background = "edu_background",
+ Previous.work.experience.in.maternal.and.child.health = "prev_experience"
+))
+
+# order data by domain
+merged_data = arrange(merged_data, domain_numeric, user_pk, month.index)
+n_distinct(merged_data$user_pk)
+
+# drop users who have user_id but have no user_pk (lata-medical was removed)
+to_drop = which(is.na(merged_data$user_pk))
+# this would be the first working data set: One row for every active FLW month who was in the FLW rating during the time range
+merged_data_2 = tbl_df(merged_data[-to_drop,])
+merged_data_2 = filter(merged_data_2, as.Date(month.index) <= as.Date("2015-02-01")) # there are odd data points with a month index in 2023
+merged_data_2$month.index = factor(as.Date(as.yearmon(merged_data_2$month.index)))
+
+working_data_1 = select(merged_data_2, domain_numeric, domain, ability_to_use_commcare, overall_performance_as_an_flw,
+ gender, age_range, edu_background, prev_experience,
+ user_pk, month.index, active_days, active_day_percent, ncases_touched) # keeping activity measures of interest only
+working_data_1 = arrange(working_data_1, domain_numeric, user_pk, month.index)
+working_data_1$month.index = as.Date(as.yearmon(working_data_1$month.index))
+#working_data_1 = working_data_1 %>%
+# filter(., month.index >= as.Date("2014-04-01")) %>%
+# filter(., month.index <= as.Date("2014-11-01"))
+
+# working_data_1 = na.omit(working_data_1) # For now i will keep NA as a value
+
+
+# working dataset 2
+# One row for every month (active or not) in the time period for each FLW in the FLW ratings.
+# This is similar to working_data_1, but the activity measures should have zero for inactive months
+
+# create a sequence of months
+# get the first and the last active month for each user
+working_data_1 = arrange(working_data_1, domain_numeric, user_pk, month.index)
+user_first_last = working_data_1 %>%
+  group_by(domain_numeric, user_pk) %>%
+  summarise(first_month = min(month.index),
+            last_month = max(month.index))
+
+# for each user, create a sequence of months
+user_month_seq = list()
+user_month_full = list() # adding inactive months back for each user
+for(i in 1:nrow(user_first_last)){
+  user_month_seq[[i]] = seq(as.Date(user_first_last$first_month[i]),
+                            as.Date(user_first_last$last_month[i]),
+                            by="month")
+  user_month_full[[i]] = data.frame(user_pk = user_first_last$user_pk[i],
+                                    month_seq = user_month_seq[[i]]) # [[i]], not [i]: single brackets would wrap the month vector in a list
+}
+
+user_month_full = tbl_df(do.call(rbind, user_month_full))
+colnames(user_month_full)[2] = c("month.index")
+# join the full-month table with working_data_1
+working_data_2 = left_join(user_month_full, working_data_1, by = c("user_pk","month.index"))
+# NA activity measures should be replaced with 0 (identifier columns such as domain_numeric should instead be filled from the user's other rows)
+
+
+# constructing user-month table
+by_user_1 = group_by(working_data_1, domain_numeric, user_pk)
+user_months_1 = summarise(by_user_1, um_1 = n_distinct(month.index)) # total active months per user
+by_user_2 = group_by(working_data_2, user_pk) # note: no need to also group by domain_numeric here
+user_months_2 = summarise(by_user_2, um_2 = n_distinct(month.index)) # total months (active + inactive) per user
+user_months = merge(user_months_1, user_months_2)
+user_months$gap_rate = round((user_months$um_2 - user_months$um_1)/user_months$um_2, digits = 2) # the lower the gap rate, the more consistent the CommCare usage
+
+
+
+# working dataset 3
+# one row per FLW
+# create aggregate usage statistics for: active_days, ncases_touched
+wd1_rm = which(is.na(working_data_1$active_days))
+working_data_1_sub = tbl_df(working_data_1[-wd1_rm,])
+n_distinct(working_data_1_sub$user_pk) # seems a few users are dropped
+working_data_3 = working_data_1_sub %>%
+  group_by(domain_numeric, user_pk) %>%
+  summarise(median_ad = round(median(active_days)),
+            median_nt = round(median(ncases_touched)),
+            sum_ad = sum(active_days),
+            sum_nt = sum(ncases_touched),
+            average_ad = round(mean(active_days)),
+            average_nt = round(mean(ncases_touched)))
+
+working_data_2$active_days = ifelse(is.na(working_data_2$active_days), 0, working_data_2$active_days)
+working_data_2$ncases_touched = ifelse(is.na(working_data_2$ncases_touched), 0, working_data_2$ncases_touched)
+working_data_4 = working_data_2 %>%
+  group_by(user_pk) %>%
+  summarise(median_ad_including_inactive_months = round(median(active_days)),
+            median_nt_including_inactive_months = round(median(ncases_touched)))
+
+# join working_data_3 and working_data_4 into one FLW table
+working_data_4 = inner_join(working_data_3, working_data_4, by = "user_pk")
+
+# join performance measure with rating measure
+user_rating = unique(select(merged_data_2, uname, user_pk, ability_to_use_commcare, overall_performance_as_an_flw))
+working_data_5 = tbl_df(inner_join(working_data_4, user_rating, by = c("user_pk")))
+
+# add gender, age_range, edu, prev_experience back to the FLW table
+working_data_2$gender = gsub("1. Female", "Female", working_data_2$gender)
+working_data_2$gender = gsub("2. Male", "Male", working_data_2$gender)
+edu_levels = c("1. No schooling completed",
+ "2. Nursery to 8th class",
+ "3. Some high school with no degree",
+ "4. Class 10",
+ "5. CLass 12",
+ "6. Some college with no degree",
+ "7. Vocational training",
+ "8. Bachelor's degree",
+ "9. Master's degree or more",
+ "10. Unknown")
+prev_experience_levels = c("1. Very experienced",
+ "2. Some experience",
+ "3. Little experience",
+ "4. Unknown")
+age_levels = c("Below 20", # left closed, right open
+ "20-25",
+ "25-30",
+ "30-35",
+ "35-40",
+ "40-45",
+ "45-50",
+ "Older than 50")
+
+working_data_2$age_range = factor(working_data_2$age_range,
+ levels = age_levels)
+working_data_2$prev_experience = factor(working_data_2$prev_experience,
+ levels = prev_experience_levels)
+
+working_data_2$prev_experience = gsub("No experience, has work experience in disability field", prev_experience_levels[3], working_data_2$prev_experience)
+working_data_2$prev_experience = gsub("Work directly with mother and child but not through CommCare", prev_experience_levels[2], working_data_2$prev_experience)
+working_data_2$prev_experience = gsub("Mahila Sanghatana Adolsecnt Health specially in schools for 1 year, Family survey special emphasis on maternal aspects about ante natal care, health education , immunization among children for last 4 years", prev_experience_levels[1], working_data_2$prev_experience)
+working_data_2$prev_experience = gsub("ASHA (frontline Health worker)", prev_experience_levels[1], working_data_2$prev_experience, fixed = TRUE) # fixed = TRUE: the parentheses would otherwise be treated as regex metacharacters
+working_data_2$prev_experience = factor(working_data_2$prev_experience)
+
+working_data_2$age_range = gsub("37 years", age_levels[5], working_data_2$age_range)
+working_data_2$age_range = gsub("54 years", age_levels[8], working_data_2$age_range)
+working_data_2$age_range = gsub("45 years", age_levels[7], working_data_2$age_range)
+working_data_2$age_range = gsub("44 years", age_levels[6], working_data_2$age_range)
+working_data_2$age_range = gsub("30 years", age_levels[4], working_data_2$age_range)
+working_data_2$age_range = gsub("35 years", age_levels[5], working_data_2$age_range)
+working_data_2$age_range = gsub("40 years", age_levels[6], working_data_2$age_range)
+working_data_2$age_range = gsub("46 years", age_levels[7], working_data_2$age_range)
+working_data_2$age_range = gsub("55 years", age_levels[8], working_data_2$age_range)
+working_data_2$age_range = gsub("31 -32years", age_levels[4], working_data_2$age_range)
+working_data_2$age_range = gsub("1. Below 20", age_levels[1], working_data_2$age_range)
+working_data_2$age_range = gsub("2. 20-25", age_levels[2], working_data_2$age_range)
+working_data_2$age_range = gsub("3. 25-30", age_levels[3], working_data_2$age_range)
+working_data_2$age_range = gsub("4. 30-35", age_levels[4], working_data_2$age_range)
+working_data_2$age_range = gsub("5. 35-40", age_levels[5], working_data_2$age_range)
+working_data_2$age_range = gsub("6. 40-45", age_levels[6], working_data_2$age_range)
+working_data_2$age_range = gsub("7. 45-50", age_levels[7], working_data_2$age_range)
+working_data_2$age_range = gsub("8. Older than 50", age_levels[8], working_data_2$age_range)
+
+# use exact matches here: a bare gsub("25", ...) would also rewrite the "25"
+# inside ranges such as "20-25" produced above
+working_data_2$age_range[working_data_2$age_range == "25"] = age_levels[3]
+working_data_2$age_range[working_data_2$age_range == "26"] = age_levels[3]
+working_data_2$age_range[working_data_2$age_range == "27"] = age_levels[3]
+working_data_2$age_range[working_data_2$age_range == "28"] = age_levels[3]
+working_data_2$age_range[working_data_2$age_range == "29"] = age_levels[3]
+working_data_2$age_range[working_data_2$age_range == "30"] = age_levels[4]
+working_data_2$age_range[working_data_2$age_range == "32"] = age_levels[4]
+
+
+# DOMAIN TABLE
+# columns: domain_numeric; overall_perf_score per level, total users per level
+wd_5 = working_data_5
+wd_5$overall_performance_as_an_flw = factor(wd_5$overall_performance_as_an_flw,
+ levels=c("Unknown", "Needs improvement", "Good", "Very good"))
+wd_5$overall_performance_as_an_flw_n = factor(wd_5$overall_performance_as_an_flw,
+ labels=(1:length(levels(factor(wd_5$overall_performance_as_an_flw)))))
+wd_5$ability_to_use_commcare = factor(wd_5$ability_to_use_commcare,
+ levels=c("Unknown", "Needs improvement", "Good", "Very good"))
+wd_5$ability_to_use_commcare_n = factor(wd_5$ability_to_use_commcare,
+ labels=(1:length(levels(factor(wd_5$ability_to_use_commcare)))))
+
+# for each domain, sum up the overall score at each performance level
+perf_score_long = wd_5 %>%
+ group_by(domain_numeric, overall_performance_as_an_flw) %>%
+ summarise(nusers = n_distinct(user_pk),
+ perf_score = sum(overall_performance_as_an_flw_n))
+
+# for each domain, get the median active days at each level of overall performance level
+mad_by_perf_domain = wd_5 %>%
+ group_by(domain_numeric, overall_performance_as_an_flw) %>%
+ summarise(mad = median(median_ad))
+tbl_1 = inner_join(perf_score_long, mad_by_perf_domain)
+# reshape: long - wide format
+tbl_2 = reshape(tbl_1, timevar = "overall_performance_as_an_flw", idvar = "domain_numeric", direction = "wide")
+tbl_2[is.na(tbl_2)] <- 0 # replace all NA values with 0
+tbl_3 = select(tbl_2, domain_numeric, contains("nusers"))
+names(tbl_3) = c("domain_numeric", "Needs Improvement", "Good", "Very Good", "Unknown")
+tbl_4 = select(tbl_2, domain_numeric, contains("mad"))
+names(tbl_4) = names(tbl_3)
+tbl_5 = select(tbl_2, domain_numeric, contains("score"))
+names(tbl_5) = names(tbl_3)
+# for each domain, calculate the total perf score at all performance level
+perf_score_domain_sum = perf_score_long %>%
+ group_by(domain_numeric) %>%
+ summarise(perf_score_sum = sum(perf_score)) %>%
+ arrange(., desc(perf_score_sum))
+
+# split the data into multiple data frames by performance level
+perf_score_levels = split(perf_score_long, as.numeric(perf_score_long$overall_performance_as_an_flw))
+
+# FLW TABLE
+# add relative activity level to each user with their domain
+names(wd_5)[3:6] = c("median_ad", "median_nt", "sum_ad", "sum_nt")
+wd_5 = wd_5 %>%
+ group_by(domain_numeric) %>%
+ mutate(ntile_mad = ntile(median_ad, 4)) # the higher the number, the higher the median active day value is
+domain_size = perf_score_long %>%
+ group_by(domain_numeric) %>%
+ summarise(tot_users = sum(nusers))
+domain_size$dsize = ifelse(domain_size$tot_users >= 10, "big", "small")
+wd_6 = left_join(wd_5, domain_size)
+
+# CONTINGENCY TABLE 1
+c1 = select(wd_6, domain_numeric, overall_performance_as_an_flw, ability_to_use_commcare)
+c1 = table(c1$overall_performance_as_an_flw, c1$ability_to_use_commcare,
+           dnn = c("Overall performance as an FLW", "Ability to use CommCare"))
+c1 = ftable(c1)
+
+# CONTINGENCY TABLE 2
+c2 = select(wd_6, domain_numeric, overall_performance_as_an_flw, ability_to_use_commcare, dsize)
+c2 = ftable(c2$overall_performance_as_an_flw, c2$ability_to_use_commcare, c2$dsize)
+names(attr(c2, "row.vars")) = c("Overall performance", "Ability to use CommCare")
+c2
+
+# DETECTION OF CONFOUNDING / EFFECT MODIFICATION
+
+
+# REGRESSION
+# Explanatory factor: monthly active days
+# Outcome factor: overall performance as an flw
+# Confounders: domain_numeric
+# Effect modifiers: domain size, ability to use commcare, within-domain relative activity level as an flw
+
+# converting categorical var domain_numeric to 13 dummy factors. same for domain size
+ologit_dat = select(wd_6, domain_numeric, user_pk, median_ad, average_ad, overall_performance_as_an_flw_n, ability_to_use_commcare, ntile_mad, dsize) # mean active days was renamed to average_ad above
+ologit_dat$domain_numeric = as.factor(ologit_dat$domain_numeric)
+ologit_dat$overall_performance_as_an_flw_n = as.numeric(ologit_dat$overall_performance_as_an_flw_n)
+domain_dummy = model.matrix(~domain_numeric, data = ologit_dat)
+ologit_dat = cbind(ologit_dat, domain_dummy)
+perf_glm_1 = glm(overall_performance_as_an_flw_n~median_ad, data = ologit_dat)
+perf_glm_2 = glm(overall_performance_as_an_flw_n~median_ad + domain_numeric, data = ologit_dat)
+perf_glm_3 = glm(overall_performance_as_an_flw_n~median_ad + domain_numeric + ability_to_use_commcare, data = ologit_dat)
+perf_glm_4 = glm(overall_performance_as_an_flw_n~median_ad + domain_numeric + ability_to_use_commcare + domain_numeric*ability_to_use_commcare, data = ologit_dat)
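+
+# Note: ologit_dat is named for an ordered logit, but the glm() calls above fit
+# Gaussian linear models. A hedged sketch of the ordinal alternative using
+# MASS::polr (not run here; polr requires an ordered-factor outcome):
+# library(MASS)
+# perf_polr = polr(factor(overall_performance_as_an_flw_n, ordered = TRUE) ~
+#                    median_ad + domain_numeric,
+#                  data = ologit_dat, Hess = TRUE)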
+
+
+
+ologit_dat = select(wd_6, user_pk, domain_numeric, overall_performance_as_an_flw, ability_to_use_commcare, dsize, average_ad, median_ad, ntile_mad)
+names(ologit_dat)[6] = c("mean_ad")
+ologit_ft = ftable(ologit_dat$overall_performance_as_an_flw, ologit_dat$ability_to_use_commcare, ologit_dat$dsize)
+names(attr(ologit_ft, "row.vars")) = c("Overall performance", "Ability to use CommCare")
+domain_dummy = model.matrix(~domain_numeric, data = ologit_dat)
+ologit_dat_2 = cbind(ologit_dat, domain_dummy)
+
+
+
+
+
+# LOG-LINEAR ANALYSIS
+odat = select(wd_6, domain_numeric, user_pk, median_ad, dsize,
+ overall_performance_as_an_flw,
+ overall_performance_as_an_flw_n,
+ ability_to_use_commcare,
+ ability_to_use_commcare_n)
+odat$perf = ifelse(as.numeric(odat$overall_performance_as_an_flw_n) <= 2, "Not good enough", "Good enough")
+odat$ability = ifelse(as.numeric(odat$ability_to_use_commcare_n) <= 2, "Not good enough", "Good enough" )
+odat$active = ifelse(odat$median_ad >= 5, "More active", "Less active")
+
+# frequency tables
+f1 = table(odat$overall_performance_as_an_flw, dnn = c("Overall Performance"))
+grp_f1 = table(odat$dsize, odat$overall_performance_as_an_flw, dnn = c("Domain size", "Overall Performance"))
+
+
+# 2-way crosstabs
+c1 = table(odat$median_ad,
+ odat$overall_performance_as_an_flw, dnn = c("Median active days per month", "Overall Performance"))
+tbl_odat = table(odat$perf, odat$ability, odat$active)
+op_oa = margin.table(tbl_odat, c(3,1)) # 2-way contingency table
+chisq.test(op_oa) # as a building block for the log-linear analysis, a chi-square test checks whether performance and activity level are independent of each other
+likelihood_test = op_oa[2]/margin.table(op_oa, 1)[2] / (op_oa[1]/margin.table(op_oa, 1)[1])
+
+c2 = table(odat$ability_to_use_commcare,
+ odat$overall_performance_as_an_flw, dnn = c("Ability", "Performance"))
+
+op_ab = margin.table(tbl_odat, c(2,1))
+chisq.test(op_ab)
+likelihood_test_2 = op_ab[1]/margin.table(op_ab, 1)[1] / (op_ab[2]/margin.table(op_ab, 1)[2])
+
+# GRAPHIC DISPLAY
+# Freq distribution of overall performance
+b_f1 = barchart(f1, xlab="Number of FLW", ylab="Overall Performance")
+f2 = table(odat$median_ad)
+b_f2 = barchart(f2, xlab="Number of FLW", ylab="Active days per month")
+grid.arrange(b_f1, b_f2)
+
+# barplot
+b1 = barplot(t(c1), beside=TRUE,
+ horiz=TRUE,
+ legend=rownames(t(c1)),
+ col=c("lightblue","lightcyan","lavender","mistyrose"),
+ xlim=c(0,max(c1)))
+# 10x4 mosaicplot for 2-way contingency table
+m1 = mosaicplot(t(c1), shade=TRUE)
+
+# 2x2x2 mosaicplot
+m2 = mosaicplot(tbl_odat, shade=TRUE)
+
diff --git a/analysis_scripts/mchen/div2/stage_2.Rmd b/analysis_scripts/mchen/div2/stage_2.Rmd
new file mode 100644
index 0000000..5dbca3c
--- /dev/null
+++ b/analysis_scripts/mchen/div2/stage_2.Rmd
@@ -0,0 +1,231 @@
+---
+output: html_document
+---
+# DIV2 USER RATING DATA ANALYSIS
+
+***
+
+In this analysis we study the extent to which supervisor-perceived overall performance and ability to use CommCare as an FLW are associated with the activity measures we have defined and computed from HQ log data. In the process we test various hypotheses of complete and partial independence.
+
+Of the roughly 50 FLW rating sheets we sent out to our DIV2 partners, we received 13 responses rating 120 users. However, 2 of the responses had no data in DP, so this analysis includes the remaining users.
+
+Time period of the analysis: 2012-12-01 to 2015-01-01
+
+***
+
+```{r results='hide', message=FALSE, warning=FALSE, echo=FALSE}
+library(MASS) # load MASS before dplyr so that MASS::select does not mask dplyr::select
+library(dplyr)
+library(lattice)
+library(gridExtra)
+library(ggplot2)
+library(reshape)
+library(zoo)
+library(corrplot)
+library(vcd)
+library(ca)
+library(corrgram)
+library(mvtsplot)
+suppressPackageStartupMessages(library(googleVis))
+```
+
+```{r dat, message=FALSE, echo=FALSE}
+domain_size = perf_score_long %>%
+ group_by(domain_numeric) %>%
+ summarise(tot_users = sum(nusers))
+domain_size$dsize = ifelse(domain_size$tot_users >= 10, "big", "small")
+wd_6 = left_join(wd_5, domain_size)
+odat = select(wd_6, domain_numeric, user_pk, median_ad, dsize,
+ overall_performance_as_an_flw,
+ overall_performance_as_an_flw_n,
+ ability_to_use_commcare,
+ ability_to_use_commcare_n)
+odat$perf = ifelse(as.numeric(odat$overall_performance_as_an_flw_n) <= 2, "Not good enough", "Good enough")
+odat$ability = ifelse(as.numeric(odat$ability_to_use_commcare_n) <= 2, "Not good enough", "Good enough" )
+odat$active = ifelse(odat$median_ad >= 3, "More active", "Less active")
+```
+
+> **Time series plot of monthly active days of all users**
+
+```{r mvts, message=FALSE, echo=FALSE}
+# multivariate time series plot
+ # create a matrix of users and calendar months
+d = select(working_data_1, user_pk, month.index, active_days)
+dlist = split(d, d$user_pk)
+m = c(min(d$month.index), max(d$month.index))
+d2 = data.frame(seq(m[1],m[2],by="month"))
+names(d2) = c("month.index")
+d2 = tbl_df(d2)
+d_user = expand.grid(user_pk=unique(d$user_pk),month.index=d2$month.index)
+d_full = tbl_df(left_join(d_user,d,by=c("month.index","user_pk"))) # left join onto the full user-month grid; already ordered
+d_full = arrange(d_full, month.index, user_pk)
+
+ # d_full should be used to construct the matrix for mtvs plot
+dmat = matrix(d_full$active_days,
+ nrow=nrow(d2),ncol=n_distinct(d$user_pk),
+ byrow=TRUE,
+ dimnames=list(as.character(unique(d_full$month.index)),
+ unique(d_full$user_pk)))
+dmat_ts = mvtsplot(dmat, norm="internal")
+```
+
+* ##### **Notes on the time series plot**
+  + Each line represents a distinct user. Blank cells indicate inactive gap months.
+  + Each color represents the user's activity level (normalized internally per user, not at a global level). Green is assigned to high values, grey to medium values, and purple to low values. Each user's data is divided into tertiles with roughly an equal number of points in each.
+  + The right-hand panel displays boxplots of each user's monthly active days.
+  + The bottom panel shows the median value across all users for each month.
+
+> **Data Tables**
+
+* ##### **Frequency Table**
+```{r freq1, message=FALSE, echo=FALSE}
+f1 = table(odat$overall_performance_as_an_flw, dnn = c("Overall Performance"))
+f2 = table(odat$ability_to_use_commcare, dnn = c("Ability to Use CommCare"))
+df1 = as.data.frame(f1)
+df2 = as.data.frame(f2)
+names(df1) = c("categories", "overall performance")
+names(df2) = c("categories", "user ability")
+df3 = tbl_df(merge(df1, df2))
+df3 = arrange(df3, categories)
+df3
+```
+***
+
+> **Motion Chart: Domain-level activity**
+
+```{r setOptions, message=FALSE}
+library(googleVis)
+op <- options(gvis.plot.tag='chart')
+```
+
+```{r, mchart1Data, message=FALSE, echo=FALSE}
+tsData = select(working_data_1,
+ user_pk,
+ domain_numeric,
+ ncases_touched,
+ active_days,
+ month.index)
+tsData = na.omit(tsData)
+tsData_1 = tsData %>%
+ group_by(domain_numeric, month.index) %>%
+ summarise_each(funs(sum)) %>%
+ select(., domain_numeric, month.index, ncases_touched, active_days)
+tsData_2 = tsData %>%
+ group_by(domain_numeric, month.index) %>%
+ summarise(nusers_active = n_distinct(user_pk))
+tsData_1 = inner_join(tsData_1, tsData_2, by=c("domain_numeric", "month.index"))
+```
+
+```{r results='asis'}
+M = gvisMotionChart(tsData_1, "domain_numeric", "month.index")
+#str(M)
+print(M, 'chart')
+```
+
+
+> **Overall Performance vs. Domain Size**
+
+* ##### **Two-way contingency table of Overall Performance and Domain Size**
+```{r test, message=FALSE, echo=FALSE}
+temp1 = select(odat, overall_performance_as_an_flw, dsize)
+ftemp1 = ftable(temp1)
+ftemp1
+```
+***
+
+* ##### **Same information displayed as a mosaic plot**
+  + The widths of the rectangles represent the proportion of FLWs at each level of overall performance, and their heights represent the proportion of FLWs at each performance level within big (10 or more users) and small domains.
+  + The area of each rectangle is proportional to the frequency of the corresponding overall-performance-by-domain-size cell. In other words, the areas represent the numbers in the body of the contingency table.
+  + Aside from FLWs rated "Unknown" in overall performance, the perceived overall performance level appears to be independent of project size.
+
+```{r m_dsize, message=FALSE, echo=FALSE}
+mosaicplot(ftemp1, main=NULL, shade=TRUE)
+```
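+
+The area claim above can be checked numerically: cell proportions of the same table give each rectangle's share of the total area (a quick sketch using the `ftemp1` table built in the chunk above).
+
+```{r m_dsize_prop, message=FALSE, echo=FALSE}
+# each cell is the share of all FLWs in one performance-by-domain-size cell,
+# i.e. the relative area of the corresponding mosaic rectangle
+round(prop.table(ftemp1), digits = 2)
+```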
+***
+
+> **Overall Performance vs. Median Active Days**
+
+* ##### **Two-way contingency table of Overall Performance and Median Active Days**
+```{r tab1, echo=FALSE}
+c1 = table(odat$median_ad,
+ odat$overall_performance_as_an_flw, dnn = c("Median active days", "Overall Performance"))
+c1
+
+tbl_odat = table(odat$perf, odat$ability, odat$active)
+op_oa = margin.table(tbl_odat, c(3,1))
+test = chisq.test(op_oa, correct=F)
+likelihood_test = op_oa[2]/margin.table(op_oa, 1)[2] / (op_oa[1]/margin.table(op_oa, 1)[1])
+
+cat(paste("Users who have been active for more than 3 days in a month are ", round(likelihood_test, digits=2), " times as likely to be rated Good/Very Good. A Pearson Chi-square test on this contingency table, with a p-value of ", round(test$p.value, digits=2), ", indicates that supervisor-perceived FLW overall performance is independent of median active days.", sep=""))
+```
+
+* ##### **Same information displayed as a correlation**
+```{r cor, echo=FALSE}
+corr_dat = select(odat, median_ad, overall_performance_as_an_flw, ability_to_use_commcare)
+corr_dat$overall_performance_as_an_flw = as.numeric(corr_dat$overall_performance_as_an_flw)
+corr_dat$ability_to_use_commcare = as.numeric(corr_dat$ability_to_use_commcare)
+cor = corrgram(corr_dat, order = TRUE,
+ lower.panel = panel.ellipse,
+ upper.panel = panel.pts,
+ text.panel = panel.txt,
+ diag.panel = panel.minmax)
+cor
+```
+
+* ##### **Same information displayed as a barplot**
+ + We are interested in whether overall performance level is related to, or can be predicted by, median active days. At each value of median active days, more FLWs are rated Good than any of the other three levels.
+ + There does not seem to be a clear, strong relationship between median active days and overall performance level.
+
+```{r g1, echo=FALSE}
+c1
+barplot(t(c1), beside=TRUE,
+ horiz=TRUE,
+ legend=rownames(t(c1)),
+ col=c("blue","magenta","yellow","gray"),
+ main=NULL,
+ xlab="Number of FLW",
+ ylab="Median active days",
+ xlim=c(0,max(c1)))
+
+fit_1 = ca(c1)
+plot(fit_1)
+```
+***
+
+
+> **Overall Performance vs. User Ability**
+
+* ##### **Two-way contingency table of Overall Performance and User Ability**
+```{r tab2, echo=FALSE}
+c2 = table(odat$ability_to_use_commcare,
+ odat$overall_performance_as_an_flw, dnn = c("Ability to use CommCare", "Overall Performance"))
+c2
+
+op_ab = margin.table(tbl_odat, c(2,1))
+test_2 = round(chisq.test(op_ab)$p.value, digits=2)
+likelihood_test_2 = op_ab[1]/margin.table(op_ab, 1)[1] / (op_ab[2]/margin.table(op_ab, 1)[2])
+
+cat(paste("Users who are rated Good/Very Good in their ability to use CommCare are ", round(likelihood_test_2, digits=2), " times as likely to be rated Good/Very Good in their overall performance. A Pearson Chi-square test on this contingency table, with a p-value of ", test_2, ", indicates a strong positive association between these two variables.", sep=""))
+```
+***
+
+* ##### **Same information displayed as a mosaic plot**
+ + The widths of the rectangles represent the proportion of FLWs at each level of overall performance, and their heights represent the proportion of FLWs at each performance level within each user ability group.
+ + Overall performance and user ability appear to be strongly associated with each other.
+
+```{r g0, fig.height=8, fig.width=8, echo=FALSE}
+mosaicplot(t(c2),main=NULL,shade=TRUE)
+fit_2 = ca(c2)
+plot(fit_2)
+```
+
+
+
+```{r dist, message=FALSE, echo=FALSE}
+#mad_tab = table(odat$median_ad)
+#mad_fit = goodfit(mad_tab, type = "poisson", method = "ML")
+#summary(mad_fit)
+#plot(mad_fit) # cannot change the color of expected frequencies
+#cat(paste("This hanging rootogram shows that the distribution of median active days per month differs systematically from a Poisson."))
+```
+
diff --git a/analysis_scripts/mchen/div2/time_series.R b/analysis_scripts/mchen/div2/time_series.R
new file mode 100644
index 0000000..6de1780
--- /dev/null
+++ b/analysis_scripts/mchen/div2/time_series.R
@@ -0,0 +1,59 @@
+# time series visualization of all data points included in round 2 analysis
+
+# multivariate time series plot
+# create a matrix of users and calendar months
+dm = select(working_data_1, user_pk, domain_numeric)
+d = select(working_data_1, user_pk, month.index, active_days)
+dlist = split(d, d$user_pk)
+m = c(min(d$month.index), max(d$month.index))
+d2 = data.frame(seq(m[1],m[2],by="month"))
+names(d2) = c("month.index")
+d2 = tbl_df(d2)
+d_user = expand.grid(user_pk=unique(d$user_pk),month.index=d2$month.index)
+d_full = tbl_df(left_join(d_user, d, by=c("month.index","user_pk"))) # left join keeps all user-month combinations; already ordered
+d_full = arrange(d_full, month.index, user_pk)
+
+# d_full should be used to construct the matrix for mtvs plot
+dmat = matrix(d_full$active_days,
+ nrow=nrow(d2),ncol=n_distinct(d$user_pk),
+ byrow=TRUE,
+ dimnames=list(as.character(unique(d_full$month.index)),
+ unique(d_full$user_pk)))
+png(filename="mvts_div2.png",
+ width=720,
+ height=1080,
+ res=144,
+ bg="transparent")
+dmat_ts = mvtsplot(dmat, norm="global")
+dev.off()
+
+# Motion Chart
+# each bubble is a unique user (or domain?)
+# bubble size: number of unique users in this domain
+# x-axis: calendar month
+# y-axis: median active days
+
+suppressPackageStartupMessages(library(googleVis))
+library(dplyr)
+tsData = select(working_data_1,
+ user_pk,
+ domain_numeric,
+ ncases_touched,
+ active_days,
+ month.index)
+
+tsData = na.omit(tsData)
+
+tsData_1 = tsData %>%
+ group_by(domain_numeric, month.index) %>%
+ summarise_each(funs(sum)) %>%
+ select(., domain_numeric, month.index, ncases_touched, active_days)
+
+tsData_2 = tsData %>%
+ group_by(domain_numeric, month.index) %>%
+ summarise(nusers_active = n_distinct(user_pk))
+
+tsData_1 = inner_join(tsData_1, tsData_2, by=c("domain_numeric", "month.index"))
+
+plot(gvisMotionChart(tsData_1, idvar='domain_numeric', timevar='month.index'))
+
diff --git a/misc_scripts/attrition_report_old.R b/analysis_scripts/misc_scripts/attrition_report_old.R
similarity index 100%
rename from misc_scripts/attrition_report_old.R
rename to analysis_scripts/misc_scripts/attrition_report_old.R
diff --git a/misc_scripts/device_types_venn.R b/analysis_scripts/misc_scripts/device_types_venn.R
similarity index 100%
rename from misc_scripts/device_types_venn.R
rename to analysis_scripts/misc_scripts/device_types_venn.R
diff --git a/misc_scripts/example_split_by.R b/analysis_scripts/misc_scripts/example_split_by.R
similarity index 100%
rename from misc_scripts/example_split_by.R
rename to analysis_scripts/misc_scripts/example_split_by.R
diff --git a/analysis_scripts/misc_scripts/followup_by_case_type.R b/analysis_scripts/misc_scripts/followup_by_case_type.R
new file mode 100644
index 0000000..0540334
--- /dev/null
+++ b/analysis_scripts/misc_scripts/followup_by_case_type.R
@@ -0,0 +1,24 @@
+library(dplyr)
+library(lubridate)
+
+# load system conf
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+system_conf <- get_system_config(file.path("config_system.json"))
+
+# get db connection
+db <- get_db_connection(system_conf)
+
+# get interactions data source
+source(file.path("data_sources.R", fsep = .Platform$file.sep))
+interactions <- get_data_source(db, "interactions", limit = -1) # gets the whole table and can take a while; set limit to e.g. 1000 to get just 1000 rows
+
+# head(interactions) # check what columns we have
+
+domains <- c('crs-remind','aaharbaseline','myrada', 'spandan', 'rdi-hiht', 'care-ecd')
+interactions_subset <- interactions[interactions$domain %in% domains,] # subset of the interactions table - just the domains we're analysing
+
diff --git a/misc_scripts/monthly_usage_overall.R b/analysis_scripts/misc_scripts/monthly_usage_overall.R
similarity index 100%
rename from misc_scripts/monthly_usage_overall.R
rename to analysis_scripts/misc_scripts/monthly_usage_overall.R
diff --git a/misc_scripts/monthly_usage_report_old.R b/analysis_scripts/misc_scripts/monthly_usage_report_old.R
similarity index 100%
rename from misc_scripts/monthly_usage_report_old.R
rename to analysis_scripts/misc_scripts/monthly_usage_report_old.R
diff --git a/misc_scripts/output.R b/analysis_scripts/misc_scripts/output.R
similarity index 100%
rename from misc_scripts/output.R
rename to analysis_scripts/misc_scripts/output.R
diff --git a/misc_scripts/real_time_usage_report_old.R b/analysis_scripts/misc_scripts/real_time_usage_report_old.R
similarity index 100%
rename from misc_scripts/real_time_usage_report_old.R
rename to analysis_scripts/misc_scripts/real_time_usage_report_old.R
diff --git a/raw_data/data_import.R b/analysis_scripts/raw_data/data_import.R
similarity index 69%
rename from raw_data/data_import.R
rename to analysis_scripts/raw_data/data_import.R
index e248d2b..7681ad4 100644
--- a/raw_data/data_import.R
+++ b/analysis_scripts/raw_data/data_import.R
@@ -1,5 +1,5 @@
-#The purpose of this code is import the all_monthly dataset as defined in
-#the config_run file that is referenced by config_setup.R
+#The purpose of this code is to import the all_monthly dataset
+#as defined in config_run.R, which is referenced by config_setup.R
library(dplyr)
@@ -10,12 +10,13 @@ library(dplyr)
# load config files
source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
system_conf <- get_system_config(file.path("config_system.json"))
-
-source(file.path("config_setup.R", fsep = .Platform$file.sep)) # sets the path to the run config to use
+source(file.path("config_setup.R", fsep = .Platform$file.sep))
run_conf <-get_run_config(config_run_path)
-# get domain table from db
+#Get db connection
db <- get_db_connection(system_conf)
+
+# get domain table from db
source(file.path("function_libraries","db_queries.R", fsep = .Platform$file.sep))
domain_table <- get_domain_table(db)
@@ -24,7 +25,10 @@ domains_for_run <- get_domains_for_run(domain_table,run_conf)
# get the monthly table domains to run on
source(file.path("function_libraries","report_utils.R", fsep = .Platform$file.sep))
-monthly_table <- get_aggregate_table (db, "aggregate_monthly_interactions", domains_for_run)
+monthly_table <- get_aggregate_table(db, "aggregate_monthly_interactions", domains_for_run)
+#if (run_conf$permitted_data_only != FALSE) {
+# monthly_table <- monthly_table[monthly_table$domain %in% get_permitted_domains(domain_table),]
+#}
# write to csv
output_directory <- system_conf$directories$output
diff --git a/analysis_scripts/raw_data/db_table_import.R b/analysis_scripts/raw_data/db_table_import.R
new file mode 100644
index 0000000..867b3f4
--- /dev/null
+++ b/analysis_scripts/raw_data/db_table_import.R
@@ -0,0 +1,57 @@
+#------------------------------------------------------------------------#
+# ACCESS ANY DATABASE TABLE
+#------------------------------------------------------------------------#
+
+library(dplyr)
+
+# Load system config file
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+source(file.path("data_sources.R"))
+system_conf <- get_system_config(file.path("config_system.json"))
+
+# Get db connection
+db <- get_db_connection(system_conf)
+
+#------------------------------------------------------------------------#
+
+#Get the interactions table or the device_type table (or another db data source)
+#Limit the number of rows returned; to fetch the entire interactions table, pass -1
+#device_type is also a data source
+inter <- get_data_source(db, "interactions", 1000)
+inter <- get_data_source(db, "interactions", -1)
+
+#Get the form table (or another plain dplyr db table - a list of these tables is available
+#when you print db)
+form_table <- tbl(db, "form")
+form_table <- get_data_source(db, "form", 1000) #limited number of forms
+form_table <- collect(form_table)
+
+app <- tbl(db, "application")
+device_log <- tbl(db, "device_log")
+domain <- tbl(db, "domain")
+users <- tbl(db, "users")
+
+#Get table for user_type (mobile, web, superuser, etc.)
+#This function has been defined in data_sources.R
+user_type <- get_user_type_table(db)
+
+#Get visit_detail data source
+#First run functions in data_sources.R
+visit_detail <- get_visit_detail(db, 1000)
+#------------------------------------------------------------------------#
+#Calculations on DB tables
+#------------------------------------------------------------------------#
+
+#Use collect() to bring these results into a dataframe
+# Count forms per app_id
+sum_forms <-
+ form_table %.%
+ group_by(app_id) %.%
+ summarise(form_total = n_distinct(form_id))
+
+summary(sum_forms)
+dim(sum_forms)
+head(sum_forms)
+
+
+
diff --git a/analysis_scripts/raw_data/indicator_development.R b/analysis_scripts/raw_data/indicator_development.R
new file mode 100644
index 0000000..078a414
--- /dev/null
+++ b/analysis_scripts/raw_data/indicator_development.R
@@ -0,0 +1,20 @@
+#Indicator development
+#First load a sample of the interactions table (first 1K rows or so)
+#Then pass this interactions table as x to test the indicator functions you write below
+
+#For the interactions table, "id" is the visit id. Two rows share the same id when two
+#related cases were updated in the same visit; only the case_id differs. time_end and
+#time_start are also the same because the whole interaction applies to both cases.
+
+nvisits <- function(x) length(unique(x$id))
+
+#Median visit duration: if two cases are visited in the same visit, we don't want to count
+#that interaction twice in our calculation of median visit duration, so we keep one row
+#per "id" before computing durations
+median_visit_duration <- function(x) {
+  visits <- x[!duplicated(x$id), ]
+  as.numeric(median((visits$time_end - visits$time_start) / 60, na.rm = TRUE))
+}
+time_using_cc <- function(x) sum(x$form_duration, na.rm = T)
+
+nvisits_travel <- function(x) sum(x$home_visit, na.rm=T)
+nvisits_travel_batch <- function(x) sum(x$time_since_previous_hv/60<10, na.rm = T)
+travel_batch_percent <- function(x) (nvisits_travel_batch(x) / nvisits_travel(x))*100
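+
+#A quick sanity check of nvisits() on toy data (hypothetical rows; two related cases
+#share visit "v1", so it appears twice with identical time_start/time_end):
+toy_interactions <- data.frame(id = c("v1", "v1", "v2"),
+                               time_start = c(0, 0, 1200),
+                               time_end = c(600, 600, 1500))
+nvisits(toy_interactions) # 2 - the duplicated visit id is counted once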
+
diff --git a/analysis_scripts/rdayalu/Dag Holmboe 10.13.14.R b/analysis_scripts/rdayalu/Dag Holmboe 10.13.14.R
new file mode 100644
index 0000000..1713ec3
--- /dev/null
+++ b/analysis_scripts/rdayalu/Dag Holmboe 10.13.14.R
@@ -0,0 +1,109 @@
+#Data for Dag Holmboe
+#unique_domains <- unique(all_monthly$domain)
+#sample_domains <- sample(unique_domains, 9)
+sample_domains <- c("crs-senegal", "cahabon", "teba-hbc",
+"nyu-corrections-study", "matts-sandbox", "ucla-dermatology", "m4change",
+"adra", "mvp-mayange")
+
+sample_monthly <- all_monthly[all_monthly$domain %in% sample_domains,]
+
+month_count <- sample_monthly %.%
+ group_by(domain, user_pk) %.%
+ summarise(months_on_cc = length(unique(calendar_month)))
+month_count <- filter(month_count, months_on_cc > 1)
+
+#Keep user with >1 month on CC
+sample_monthly <-
+ sample_monthly[sample_monthly$user_pk %in% month_count$user_pk,]
+
+#Domains with >= 5 users with > 1 month on CC
+cc_tenure <- month_count %.%
+ group_by(domain) %.%
+ summarise(nusers = length(unique(user_pk)))
+cc_tenure <- filter(cc_tenure, nusers >= 5)
+
+#Create dataset only from these domains for Dag
+sample_monthly <-
+ sample_monthly[sample_monthly$domain %in% cc_tenure$domain,]
+
+#Initialize sample_users vector
+sample_users <- c()
+
+#Sample 5 users from each of these domains (shown here for ucla-dermatology; repeat per domain)
+single_domain <-
+ sample_monthly[sample_monthly$domain == "ucla-dermatology",]
+users <- sample(unique(single_domain$user_pk), 5)
+sample_users <- append(sample_users, users)
+
+#Convert to factor vector
+#sample_users <- as.factor(sample_users)
+
+#Create final dataset for Dag
+sample_monthly <-
+ sample_monthly[sample_monthly$user_pk %in% sample_users,]
+
+drops <- c("row.names", "active_days_percent",
+ "median_time_elapsed_btw_visits", "batch_entry_visit",
+ "batch_entry_percent", "user_numeric", "sample_undefined",
+ "sample_normal", "active", "domain", "register_followup",
+ "case_register_followup_rate", "morning", "afternoon",
+ "night", "after_midnight", "ncases_opened", "total_logs",
+ "audio_plays", "network_warnings", "diff_days")
+
+sample_monthly <- sample_monthly[,!(names(sample_monthly) %in% drops)]
+#names(sample_monthly)[names(sample_monthly) == "obsnum"] = "month_index"
+
+#Calculate differences between calendar months to derive previous_month_active
+#and next_month_active
+sample_monthly <- arrange(sample_monthly, domain_numeric, user_id,
+ calendar_month)
+df <- data.table(sample_monthly)
+#Can we setkey by domain and user_id since some user_ids might be the same?
+setkey(df,user_id)
+df[,diff:=c(NA,diff(calendar_month)),by=user_id]
+dag_data <- as.data.frame(df)
+dag_data$previous_month_active <- dag_data$diff <= 31
+
+users <- unique(dag_data$user_id)
+next_month_active <- c()
+for (i in users) {
+ single_user <- dag_data[dag_data$user_id == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_month_active[-1], F)
+ next_month_active <- append(next_month_active, next_active)
+}
+dag_data$next_month_active <- next_month_active
+
+#If calendar_month = 8/1/14 then next_month_active = NA
+#because we don't know if the user will be active in the following month
+is.na(dag_data$next_month_active) <- dag_data$calendar_month == "2014-08-01"
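+
+#(Aside: `is.na(x) <- cond` sets x[cond] to NA. A toy illustration with made-up values:)
+v <- c(TRUE, FALSE)
+is.na(v) <- c(FALSE, TRUE)
+v # TRUE NA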
+
+write.csv(sample_monthly, file="sample_monthly.csv")
+
+#------------------------------------------------------------------------#
+#Create previous/next month active for all_monthly (full) dataset
+#------------------------------------------------------------------------#
+
+#Calculate differences between calendar months to derive previous_month_active
+#and next_month_active
+all_monthly <- arrange(all_monthly, domain_numeric, user_pk, calendar_month)
+df <- data.table(all_monthly)
+#Can we setkey by domain and user_id since some user_ids might be the same?
+setkey(df,user_id)
+df[,diff:=c(NA,diff(calendar_month)),by=user_id]
+all_monthly <- as.data.frame(df)
+all_monthly$previous_month_active <- all_monthly$diff <= 31
+
+users <- unique(all_monthly$user_id)
+next_month_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_id == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_month_active[-1], F)
+ next_month_active <- append(next_month_active, next_active)
+}
+all_monthly$next_month_active <- next_month_active
+
+#If calendar_month = 8/1/14 then next_month_active = NA
+#because we don't know if the user will be active in the following month
+is.na(all_monthly$next_month_active) <- all_monthly$calendar_month == "2014-08-01"
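+
+#A vectorized data.table alternative to the per-user loops above, shown on toy data
+#(hypothetical values; same shift-by-one logic, grouped by user so each user's last
+#month gets FALSE):
+library(data.table)
+toy <- data.table(user_id = c("a", "a", "a", "b", "b"),
+                  previous_month_active = c(NA, TRUE, FALSE, NA, TRUE))
+toy[, next_month_active := c(previous_month_active[-1], FALSE), by = user_id]
+toy$next_month_active # TRUE FALSE FALSE TRUE FALSE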
diff --git a/analysis_scripts/rdayalu/INC_summit.R b/analysis_scripts/rdayalu/INC_summit.R
new file mode 100644
index 0000000..b19f636
--- /dev/null
+++ b/analysis_scripts/rdayalu/INC_summit.R
@@ -0,0 +1,111 @@
+#This code is to calculate the numbers that Neal wants for the INC summit
+#https://docs.google.com/a/dimagi.com/document/d/1WxjTeRNgXPvxMqn1m46jFlU_rt21AfuoVF_t2SL4mzY/edit
+
+all_monthly <- monthly_table
+library(data.table)
+library(zoo)
+detach("package:lubridate")
+library(lubridate)
+library(ggplot2)
+
+#Format dates
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+
+#Modify relevant variables
+all_monthly$domain_numeric = as.numeric(as.factor(all_monthly$domain))
+
+#Merge domain facets from domain table into all_monthly table
+facets_to_merge <- select(domain_table, name, country, Sector, Sub.Sector,
+ business_unit, active, Test.Project.)
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+
+#Convert calendar month to actual date
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+#Remove users with device_type = "Sms" at any point in time
+device_type_sms <- unique(filter(all_monthly, summary_device_type == "Sms")$user_pk)
+all_monthly <- all_monthly[!(all_monthly$user_pk %in% device_type_sms),]
+
+#Run the code below first without excluding any more users - keep these results for comparison
+users_start_month <- all_monthly %>% group_by(user_pk) %>%
+ summarise(start_month = min(calendar_month),
+ total_active_days = sum(active_days, na.rm = T),
+ start_date = min(date_first_visit, na.rm = T),
+ end_date = max(date_last_visit, na.rm = T))
+users_start_month$days_on_cc <-
+ as.numeric(users_start_month$end_date - users_start_month$start_date) + 1
+users_start_month$le_31 <- users_start_month$days_on_cc <= 31
+#There are a lot of users with days_on_cc == NA - need to figure out why.
+#Then exclude users with <= 3 active days and keep those results
+users_start_month <- filter(users_start_month, total_active_days >= 4)
+all_monthly <- all_monthly[all_monthly$user_pk %in% users_start_month$user_pk,]
+
+#2: 2014 gains and losses
+#List of users active in Q4 2013
+q4_2013 <- filter(all_monthly, calendar_month >= "2013-10-01" &
+ calendar_month <= "2013-12-01")
+users_2013_Q4 <- unique(q4_2013$user_pk)
+#List of users active in Q4 2014
+q4_2014 <- filter(all_monthly, calendar_month >= "2014-10-01" &
+ calendar_month <= "2014-12-01")
+users_2014_Q4 <- unique(q4_2014$user_pk)
+#List of users active in Q3 2014
+q3_2014 <- filter(all_monthly, calendar_month >= "2014-07-01" &
+ calendar_month <= "2014-09-01")
+users_2014_Q3 <- unique(q3_2014$user_pk)
+#List of users started in Q1/Q2/Q3 2014
+users_started_2014_three_qs <- (filter(users_start_month, start_month >= "2014-01-01" &
+ start_month <= "2014-09-01"))$user_pk
+users_started_2014_q4 <- (filter(users_start_month, start_month >= "2014-10-01"))$user_pk
+
+#Kept: #users who were active in Q4 2013 and in Q4 2014
+summary(users_2013_Q4 %in% users_2014_Q4)
+kept <- (sum(users_2013_Q4 %in% users_2014_Q4)/length(users_2013_Q4))*100
+
+#Lost: #users who were active in Q4 2013 but not active in Q4 2014
+summary(!(users_2013_Q4 %in% users_2014_Q4))
+lost <- (sum(!(users_2013_Q4 %in% users_2014_Q4))/length(users_2013_Q4))*100
+
+#New: #users who started in Q1, Q2, or Q3 2014 and were active in Q4 2014
+summary(users_started_2014_three_qs %in% users_2014_Q4)
+new <- (sum(users_started_2014_three_qs %in% users_2014_Q4)/length(users_started_2014_three_qs))*100
+
+#Temporary: #users who started in Q1, Q2, or Q3 2014 but were not active in Q4 2014
+summary(!(users_started_2014_three_qs %in% users_2014_Q4))
+temporary <- (sum(!(users_started_2014_three_qs %in% users_2014_Q4))/length(users_started_2014_three_qs))*100
+
+#Started: #users who started in Q4 2014
+length(users_started_2014_q4)
+
+##########################################################################
+
+ggplot(users_start_month, aes(x=total_active_days)) +
+ geom_histogram(binwidth=1, colour = "black", fill = "antiquewhite1") +
+ scale_x_continuous(limits=c(0,30)) +
+ xlab("Total # active days") +
+ ylab("Number of users") +
+ theme(axis.title.x=element_text(size=14), axis.text.x=element_text(size=14, colour = "black")) +
+ theme(axis.title.y=element_text(size=14), axis.text.y=element_text(size=14, colour = "black"))
+
+active_day_summary <- users_start_month %>% group_by(total_active_days) %>%
+ summarise(nusers = length(user_pk),
+ days_on_cc_le_31 = sum(le_31))
+active_day_summary$per_le_31 <-
+ (active_day_summary$days_on_cc_le_31/active_day_summary$nusers)*100
+
+g <- ggplot(active_day_summary, aes(x = total_active_days, y = per_le_31)) +
+ geom_point(shape = 15, size = 3.0, colour="darkblue") +
+ geom_line(size = 1.0, colour = "cornflowerblue") +
+ scale_x_continuous(limits = c(0,30)) +
+ xlab("Total # active days") +
+ ylab("% Users with total days on CC <= 31") +
+ theme(axis.title.x=element_text(size=14), axis.text.x=element_text(size=14, colour = "black")) +
+ theme(axis.title.y=element_text(size=14), axis.text.y=element_text(size=14, colour = "black"))
diff --git a/analysis_scripts/rdayalu/WAM_tables.R b/analysis_scripts/rdayalu/WAM_tables.R
new file mode 100644
index 0000000..2c4379e
--- /dev/null
+++ b/analysis_scripts/rdayalu/WAM_tables.R
@@ -0,0 +1,447 @@
+#This code is to create the WAM table as described in the following document
+#https://docs.google.com/a/dimagi.com/document/d/1VwaJm_wUJmHWOH0aUsAbwbWplTyzUSo0dkN30R1en7s/edit
+#First we need to add three variables to each row of the monthly table:
+#wam_eligible, wam_experienced, wam_using
+#From this table, we will calculate the WAM overview table (Table 1),
+#the WAM data table (Table 2), and the WAM/PAM annotation stats (Table 3)
+#as described in the google doc.
+
+#------------------------------------------------------------------------#
+#Import data
+#------------------------------------------------------------------------#
+
+library(dplyr)
+library(data.table)
+library(zoo)
+library(lubridate)
+
+# Load system config file
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+source(file.path("data_sources.R"))
+system_conf <- get_system_config(file.path("config_system.json"))
+# Get db connection
+db <- get_db_connection(system_conf)
+
+#Pull monthly_table for all domains
+#Be sure to set config_run first
+#permitted_data_only = T and is_test = F
+source(file.path("analysis_scripts","raw_data","data_import.R", fsep = .Platform$file.sep))
+all_monthly <- monthly_table
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+#Convert calendar month to actual date
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+#Import user table and user_type table
+users <- tbl(db, "users")
+users <- collect(users)
+user_type <- get_user_type_table(db)
+user_type <- collect(user_type)
+#Merge user tables
+users <- merge(users, user_type, by.x = "id", by.y = "user_pk", all.x = T)
+users <- select(users, -c(user_id.y, username.y, first_name, last_name, default_phone_number,
+ groups, phone_numbers, user_id.x))
+names(users)[names(users) == "username.x"] = "username"
+names(users)[names(users) == "id"] = "user_pk"
+
+#Pull the app table directly from the db
+app <- tbl(db, "application")
+app <- collect(app)
+app_amplifies <- app
+
+#Remove all double quotes from inside attributes string. Replace with underscores
+app_amplifies$attributes <- gsub('"', "_", app_amplifies$attributes)
+
+#Import form table
+form_table <- tbl(db, "form")
+form_table <- collect(form_table)
+names(form_table)[names(form_table) == "user_id"] = "user_pk"
+#Remove forms with application_id = NA
+form_table <- filter(form_table, !(is.na(application_id)))
+
+#Pull domain table
+domain <- get_domain_table(db)
+
+#------------------------------------------------------------------------#
+#User exclusions
+#------------------------------------------------------------------------#
+
+#Exclude demo users and NA/NONE users
+all_monthly = all_monthly[!(all_monthly$user_id =="demo_user"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="NONE"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="none"),]
+all_monthly = all_monthly[!is.na(all_monthly$user_id),]
+
+#Exclude users who submit to multiple domains
+chw_single_domain <- all_monthly %>% group_by(user_pk) %>%
+ summarise(n_domains = length(unique(domain)))
+chw_single_domain <- filter(chw_single_domain, n_domains == 1)
+all_monthly <- all_monthly[all_monthly$user_pk %in% chw_single_domain$user_pk,]
+
+#Exclude users with user_type = web and summary_device_type != cloud care during any month
+all_monthly <- merge(all_monthly, users, by = "user_pk", all.x = T)
+exclude1 <- all_monthly[all_monthly$user_type == "web" &
+ all_monthly$summary_device_type != "Cloudcare",]
+all_monthly <- all_monthly[!(all_monthly$user_pk %in% exclude1$user_pk),]
+
+#Exclude dimagi users and superusers
+exclude2 <- all_monthly[grep("dimagi", all_monthly$email, fixed=T),]
+all_monthly <- all_monthly[!(all_monthly$user_pk %in% exclude2$user_pk),]
+all_monthly <- filter(all_monthly, is_superuser == F | is.na(is_superuser))
+
+#Exclude users with summary_device_type = SMS during any month
+exclude3 <- all_monthly[all_monthly$summary_device_type == "Sms",]
+all_monthly <- all_monthly[!(all_monthly$user_pk %in% exclude3$user_pk),]
+
+#Exclude users without form_ids and app_ids
+#Flag users who have submitted data through just 1 app or > 1 app
+#We are going to handle single_app and multiple_app users differently
+chw_n_app <- form_table %>% group_by(user_pk) %>%
+ summarise(n_applications = length(unique(application_id)))
+#write.csv(chw_n_app, file = "chw_n_app.csv")
+chw_multiple_app <- filter(chw_n_app, n_applications > 1)
+chw_single_app <- filter(chw_n_app, n_applications == 1)
+all_monthly$submitted_single_app <- all_monthly$user_pk %in% chw_single_app$user_pk
+all_monthly$submitted_multiple_app <- all_monthly$user_pk %in% chw_multiple_app$user_pk
+all_monthly <- all_monthly[!(all_monthly$submitted_single_app == F &
+ all_monthly$submitted_multiple_app == F),]
+
+#------------------------------------------------------------------------#
+#Parse amplifies_X values from app table
+#Add amplifies_workers to form_table by app_id
+#------------------------------------------------------------------------#
+
+#amplifies_workers
+app_amplifies$amplifies_workers <- NA
+app_amplifies$amplifies_workers[grep("_amplifies_workers_=>_yes_", app_amplifies$attributes,
+ fixed=T)] <- T
+app_amplifies$amplifies_workers[grep("_amplifies_workers_=>_no_", app_amplifies$attributes,
+ fixed=T)] <- F
+#Manually tag mvp domains
+app_amplifies$amplifies_workers[app_amplifies$id %in% test2$application_id] <- T
+
+#amplifies_project
+app_amplifies$amplifies_project <- NA
+app_amplifies$amplifies_project[grep("_amplifies_project_=>_yes_", app_amplifies$attributes,
+ fixed=T)] <- T
+app_amplifies$amplifies_project[grep("_amplifies_project_=>_no_", app_amplifies$attributes,
+ fixed=T)] <- F
+#Manually tag mvp domains
+app_amplifies$amplifies_project[app_amplifies$id %in% test2$application_id] <- T
+
+#Add amplifies_workers to form_table by app_pk
+form_table <- merge(form_table,
+ select(app_amplifies, id, amplifies_workers),
+ by.x = "application_id", by.y = "id", all.x = T)
+
+#Add amplifies_workers to domain table
+domain$domain_has_amplifies_workers <- domain$id %in%
+ app_amplifies[app_amplifies$amplifies_workers == T,]$domain_id
+domain$domain_has_amplifies_project <- domain$id %in%
+ app_amplifies[app_amplifies$amplifies_project == T,]$domain_id
+#Prepare domain table for merging in domain facets
+#Bring in sector information
+sector <- tbl(db, "sector")
+sector <- collect(sector)
+names(sector)[names(sector) == "name"] = "sector_final"
+domain_sector <- tbl(db, "domain_sector")
+domain_sector <- collect(domain_sector)
+domain_sector <- select(domain_sector, domain_id, sector_id)
+domain <- merge(domain, domain_sector, by.x = "id", by.y = "domain_id", all.x = T)
+domain <- merge(domain, sector, by.x = "sector_id", by.y = "id", all.x = T)
+#Bring in subsector information
+subsector <- tbl(db, "subsector")
+subsector <- collect(subsector)
+subsector <- select(subsector, id, name)
+subsector <- filter(subsector, !is.na(name))
+subsector <- filter(subsector, name != "")
+names(subsector)[names(subsector) == "name"] = "subsector_final"
+domain_subsector <- tbl(db, "domain_subsector")
+domain_subsector <- collect(domain_subsector)
+domain_subsector <- select(domain_subsector, domain_id, subsector_id)
+domain <- merge(domain, domain_subsector, by.x = "id", by.y = "domain_id", all.x = T)
+domain <- merge(domain, subsector, by.x = "subsector_id", by.y = "id", all.x = T)
+#Consolidate country information
+is.na(domain$deployment.country) <- domain$deployment.country == ""
+is.na(domain$country) <- domain$country == ""
+domain$country_final <- domain$deployment.country
+keep_country <- which(is.na(domain$deployment.country) & !is.na(domain$country))
+domain$country_final[keep_country] <- domain$country[keep_country]
+#Consolidate Dimagi level of support
+is.na(domain$internal.services) <- domain$internal.services == ""
+is.na(domain$internal.self_started) <- domain$internal.self_started == ""
+domain$self_start[domain$internal.self_started == "True"] <- "self"
+domain$dimagi_services <- domain$internal.services
+keep_self <- which(is.na(domain$internal.services) & !is.na(domain$self_start))
+domain$dimagi_services[keep_self] <- domain$self_start[keep_self]
+
+
+all_monthly$amp_w_true <- NA
+all_monthly$amp_w_false <- NA
+all_monthly$amp_w_na <- NA
+all_monthly$wam_eligible <- NA
+
+#------------------------------------------------------------------------#
+#wam_eligible
+#------------------------------------------------------------------------#
+
+#Single_app users = 12420 (As of 3/15/15)
+all_monthly_single <- filter(all_monthly, submitted_single_app == T)
+
+#Multiple_app users = 1314 (As of 3/15/15)
+all_monthly_multiple <- filter(all_monthly, submitted_multiple_app == T)
+
+#Single app users, wam_eligible == T
+#The FLW submitted at least one form that month on a CommCare app with
+#amplifies_worker set to true
+all_monthly_single$amp_w_true <- all_monthly_single$user_pk %in%
+ form_table[form_table$amplifies_workers == T,]$user_pk
+all_monthly_single$wam_eligible[all_monthly_single$amp_w_true == T] <- T
+
+#Single app users, wam_eligible == F
+#The FLW submitted at least one form that month on a CommCare app with
+#amplifies_worker set to false
+all_monthly_single$amp_w_false <- all_monthly_single$user_pk %in%
+ form_table[form_table$amplifies_workers == F,]$user_pk
+all_monthly_single$wam_eligible[all_monthly_single$amp_w_false == T] <- F
+
+#Single app users, wam_eligible == NA
+#The FLW submitted at least one form that month on a CommCare app with
+#amplifies_worker not set
+all_monthly_single$amp_w_na <- all_monthly_single$user_pk %in%
+ form_table[is.na(form_table$amplifies_workers),]$user_pk
+all_monthly_single$wam_eligible[all_monthly_single$amp_w_na == T] <- NA
+
+#Multiple app users
+#The FLW submitted at least one form that month on a CommCare app with
+#amplifies_worker set to true
+all_monthly_multiple$amp_w_true <- all_monthly_multiple$user_pk %in%
+ form_table[form_table$amplifies_workers == T,]$user_pk
+
+#Multiple app users
+#The FLW submitted at least one form that month on a CommCare app with
+#amplifies_worker set to false
+all_monthly_multiple$amp_w_false <- all_monthly_multiple$user_pk %in%
+ form_table[form_table$amplifies_workers == F,]$user_pk
+
+#Multiple app users
+#The FLW submitted at least one form that month on a CommCare app with
+#amplifies_worker not set
+all_monthly_multiple$amp_w_na <- all_monthly_multiple$user_pk %in%
+ form_table[is.na(form_table$amplifies_workers),]$user_pk
+
+#Multiple app users, wam_eligible == T
+#This will be true if a user submitted data through an amplifies_workers app at
+#any point in time
+all_monthly_multiple$wam_eligible <- all_monthly_multiple$user_pk %in%
+ all_monthly_multiple[all_monthly_multiple$amp_w_true == T,]$user_pk
+
+#Multiple app users, wam_eligible == F
+#This will be false if the user submitted forms using amplifies_workers = F but
+#not if amplifies_workers was ever = T
+#Might not actually need to run this code!
+all_monthly_multiple[all_monthly_multiple$amp_w_false == T &
+ all_monthly_multiple$amp_w_true == F,]$wam_eligible <- F
+
+#wam_eligible == NA
+#This will be NA if the user submitted forms using amplifies_workers = NA
+#for all their apps
+all_monthly_multiple[all_monthly_multiple$amp_w_na == T &
+ all_monthly_multiple$amp_w_true == F &
+ all_monthly_multiple$amp_w_false == F,]$wam_eligible <- NA
+
+#Append single app and multiple app tables together
+all_monthly <- rbind(all_monthly_single, all_monthly_multiple)
+
+#------------------------------------------------------------------------#
+#wam_experienced
+#------------------------------------------------------------------------#
+
+#Add rolling # active months to each user's rows
+all_monthly <- arrange(all_monthly, user_pk, calendar_month)
+all_monthly <- all_monthly %>% group_by(user_pk) %>%
+ mutate(previous_active_months_rolling = seq_along(calendar_month)-1)
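+#For illustration (hypothetical user): with active calendar months Jan, Mar
+#and Apr 2015, seq_along(calendar_month) - 1 gives 0, 1, 2, i.e. the count of
+#prior *active* months for that user, not prior calendar months.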
+
+#wam_experienced == T if the user has submitted data for at least three months prior
+#to the month in question
+all_monthly$wam_experienced <- all_monthly$previous_active_months_rolling >= 3
+
+#------------------------------------------------------------------------#
+#wam_using
+#------------------------------------------------------------------------#
+
+#wam_using == T if the user met our criteria this month for sufficient usage of the app.
+#Specifically, they submitted forms on at least 4 different days, i.e. they have at
+#least four active days in the month in question.
+all_monthly$active_days <- as.numeric(all_monthly$active_days)
+all_monthly$wam_using <- all_monthly$active_days >= 4
+
+#------------------------------------------------------------------------#
+#Table 1: WAM OVERVIEW
+#------------------------------------------------------------------------#
+
+#We have 12 combinations of wam_eligible (T/F/NA), wam_experienced (T/F)
+#and wam_using (T/F)
+all_monthly$elig_exp_using <- all_monthly$wam_eligible == T & all_monthly$wam_experienced == T &
+ all_monthly$wam_using == T
+all_monthly$elig_exp_notusing <- all_monthly$wam_eligible == T & all_monthly$wam_experienced == T &
+ all_monthly$wam_using == F
+#all_monthly$elig_exp_na <- all_monthly$wam_eligible == T & all_monthly$wam_experienced == T &
+# is.na(all_monthly$wam_using)
+
+all_monthly$notelig_exp_using <- all_monthly$wam_eligible == F & all_monthly$wam_experienced == T &
+ all_monthly$wam_using == T
+all_monthly$notelig_exp_notusing <- all_monthly$wam_eligible == F & all_monthly$wam_experienced == T &
+ all_monthly$wam_using == F
+#all_monthly$notelig_exp_na <- all_monthly$wam_eligible == F & all_monthly$wam_experienced == T &
+# is.na(all_monthly$wam_using)
+
+all_monthly$na_exp_using <- is.na(all_monthly$wam_eligible) & all_monthly$wam_experienced == T &
+ all_monthly$wam_using == T
+all_monthly$na_exp_notusing <- is.na(all_monthly$wam_eligible) & all_monthly$wam_experienced == T &
+ all_monthly$wam_using == F
+#all_monthly$na_exp_na <- is.na(all_monthly$wam_eligible) & all_monthly$wam_experienced == T &
+# is.na(all_monthly$wam_using)
+
+all_monthly$elig_notexp_using <- all_monthly$wam_eligible == T & all_monthly$wam_experienced == F &
+ all_monthly$wam_using == T
+all_monthly$elig_notexp_notusing <- all_monthly$wam_eligible == T & all_monthly$wam_experienced == F &
+ all_monthly$wam_using == F
+#all_monthly$elig_notexp_na <- all_monthly$wam_eligible == T & all_monthly$wam_experienced == F &
+# is.na(all_monthly$wam_using)
+
+all_monthly$notelig_notexp_using <- all_monthly$wam_eligible == F & all_monthly$wam_experienced == F &
+ all_monthly$wam_using == T
+all_monthly$notelig_notexp_notusing <- all_monthly$wam_eligible == F & all_monthly$wam_experienced == F &
+ all_monthly$wam_using == F
+#all_monthly$notelig_notexp_na <- all_monthly$wam_eligible == F & all_monthly$wam_experienced == F &
+# is.na(all_monthly$wam_using)
+
+all_monthly$na_notexp_using <- is.na(all_monthly$wam_eligible) & all_monthly$wam_experienced == F &
+ all_monthly$wam_using == T
+all_monthly$na_notexp_notusing <- is.na(all_monthly$wam_eligible) & all_monthly$wam_experienced == F &
+ all_monthly$wam_using == F
+#all_monthly$na_notexp_na <- is.na(all_monthly$wam_eligible) & all_monthly$wam_experienced == F &
+# is.na(all_monthly$wam_using)
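+
+#Equivalent sketch: the twelve groups could also be built as a single label,
+#since paste() turns NA into the string "NA" (wam_group is a hypothetical
+#column name):
+#all_monthly$wam_group <- paste(all_monthly$wam_eligible,
+#                               all_monthly$wam_experienced,
+#                               all_monthly$wam_using, sep = "_")
+#table(all_monthly$wam_group)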
+
+#Create Table 1
+table_1_wam <- all_monthly %>% group_by(calendar_month) %>%
+ summarise(nusers = length(unique(user_pk)),
+ elig_exp_using_nusers = sum(elig_exp_using, na.rm=T),
+ elig_exp_notusing_nusers = sum(elig_exp_notusing, na.rm=T),
+ #elig_exp_na_nusers = sum(elig_exp_na, na.rm=T),
+ notelig_exp_using_nusers = sum(notelig_exp_using, na.rm=T),
+ notelig_exp_notusing_nusers = sum(notelig_exp_notusing, na.rm=T),
+ #notelig_exp_na_nusers = sum(notelig_exp_na, na.rm=T),
+ na_exp_using_nusers = sum(na_exp_using, na.rm=T),
+ na_exp_notusing_nusers = sum(na_exp_notusing, na.rm=T),
+ #na_exp_na_nusers = sum(na_exp_na, na.rm=T),
+ elig_notexp_using_nusers = sum(elig_notexp_using, na.rm=T),
+ elig_notexp_notusing_nusers = sum(elig_notexp_notusing, na.rm=T),
+ #elig_notexp_na_nusers = sum(elig_notexp_na, na.rm=T),
+ notelig_notexp_using_nusers = sum(notelig_notexp_using, na.rm=T),
+ notelig_notexp_notusing_nusers = sum(notelig_notexp_notusing, na.rm=T),
+ #notelig_notexp_na_nusers = sum(notelig_notexp_na, na.rm=T),
+ na_notexp_using_nusers = sum(na_notexp_using, na.rm=T),
+ na_notexp_notusing_nusers = sum(na_notexp_notusing, na.rm=T))
+ #na_notexp_na_nusers = sum(na_notexp_na, na.rm=T))
+
+#Filter table_1
+table_1_wam <- filter(table_1_wam, calendar_month >= "2013-01-01", calendar_month <= "2015-02-01")
+
+write.csv(table_1_wam, file = "table_1_wam.csv")
+
+#------------------------------------------------------------------------#
+#Table 2: WAM DATA
+#------------------------------------------------------------------------#
+
+#Merge domain facets to all_monthly
+names(domain)[names(domain) == "id"] = "domain_id"
+facets_to_merge <- select(domain, name, domain_id, country_final, sector_final,
+ subsector_final, self_start, domain_has_amplifies_workers,
+ domain_has_amplifies_project)
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+names(all_monthly)[names(all_monthly) == "domain_id"] = "domain_numeric"
+all_monthly$start_month <- as.Date("2015-01-01")
+all_monthly$end_month <- as.Date("2015-01-01")
+
+all_monthly_table2 <- filter(all_monthly, calendar_month >= start_month,
+ calendar_month <= end_month)
+all_monthly_table2$unique_number <- 1:nrow(all_monthly_table2)
+
+table_2_wam <- all_monthly_table2 %>% group_by(domain) %>%
+ summarise(start_month = unique(start_month),
+ end_month = unique(end_month),
+ country = unique(country_final),
+ sector = unique(sector_final),
+ subsector = unique(subsector_final),
+ nusers_active = length(unique(user_pk)),
+ domain_has_amplifies_workers = unique(domain_has_amplifies_workers),
+ domain_has_amplifies_project = unique(domain_has_amplifies_project),
+ nuser_months = length(unique(unique_number)),
+ elig_exp_using_nusers = sum(elig_exp_using, na.rm=T),
+ elig_exp_notusing_nusers = sum(elig_exp_notusing, na.rm=T),
+ #elig_exp_na_nusers = sum(elig_exp_na, na.rm=T),
+ notelig_exp_using_nusers = sum(notelig_exp_using, na.rm=T),
+ notelig_exp_notusing_nusers = sum(notelig_exp_notusing, na.rm=T),
+ #notelig_exp_na_nusers = sum(notelig_exp_na, na.rm=T),
+ na_exp_using_nusers = sum(na_exp_using, na.rm=T),
+ na_exp_notusing_nusers = sum(na_exp_notusing, na.rm=T),
+ #na_exp_na_nusers = sum(na_exp_na, na.rm=T),
+ elig_notexp_using_nusers = sum(elig_notexp_using, na.rm=T),
+ elig_notexp_notusing_nusers = sum(elig_notexp_notusing, na.rm=T),
+ #elig_notexp_na_nusers = sum(elig_notexp_na, na.rm=T),
+ notelig_notexp_using_nusers = sum(notelig_notexp_using, na.rm=T),
+ notelig_notexp_notusing_nusers = sum(notelig_notexp_notusing, na.rm=T),
+ #notelig_notexp_na_nusers = sum(notelig_notexp_na, na.rm=T),
+ na_notexp_using_nusers = sum(na_notexp_using, na.rm=T),
+ na_notexp_notusing_nusers = sum(na_notexp_notusing, na.rm=T))
+ #na_notexp_na_nusers = sum(na_notexp_na, na.rm=T))
+
+write.csv(table_2_wam, file = "table_2_wam_jan_2015.csv")
+
+#------------------------------------------------------------------------#
+#Table 3: ANNOTATION STATS
+#------------------------------------------------------------------------#
+
+table_3_wam <- table(app_amplifies$amplifies_project,
+ app_amplifies$amplifies_workers, useNA = "always")
+write.csv(table_3_wam, file = "table_3_wam.csv")
+
+#------------------------------------------------------------------------#
+#Old code
+#------------------------------------------------------------------------#
+
+form_table$form_date <- substr(form_table$time_start, 1, 10)
+form_table$form_date <- as.Date(form_table$form_date)
+test <- filter(form_table, amplifies_workers == T)
+test <- filter(form_table, amplifies_workers == T & form_date >= "2015-01-01")
+
+test2 <- app_amplifies[!(app_amplifies$id %in% test$application_id),]
+test2 <- filter(test2, amplifies_workers == T)
+test2 <- merge(test2, select(domain, domain_id, name), by = "domain_id", all.x = T)
+test2 <- select(test2, domain_id, app_name, name)
+
+crs <- filter(form_table, application_id == 428)
+
+#Import amplifies_worker data for each app
+#DUMMY CODE START#####################################
+#We are using typical_flw app annotations as dummy data till we get the WAM annotations
+#app <- tbl(db, "application")
+#app <- collect(app)
+#typical_flw_apps <- read.csv(file = "typical_flw_apps.csv")
+#app <- merge(app, typical_flw_apps, by.x = "app_id", all.x = T)
+#app$typical_flw_app[app$typical_flw_app == 1] <- "yes"
+#app$typical_flw_app[app$typical_flw_app == 0] <- "no"
+#app$typical_flw_app[is.na(app$typical_flw_app)] <- "not_set"
+#names(app)[names(app) == "typical_flw_app"] = "amplifies_workers"
+#DUMMY CODE END######################################
+#The next two lines are suggested by Yedi, but the parsing is incorrect, so the resulting table
+#has wrong values for amplifies_X variables.
+#app <- get_application_table(db)
+#app <- select(app, app_id, domain_name, amplifies_project, amplifies_workers, application_version)
+
diff --git a/analysis_scripts/rdayalu/attrition_event_calculations.R b/analysis_scripts/rdayalu/attrition_event_calculations.R
new file mode 100644
index 0000000..7a846ec
--- /dev/null
+++ b/analysis_scripts/rdayalu/attrition_event_calculations.R
@@ -0,0 +1,83 @@
+#This code can be sourced to add calculations for 1-month, 2-month and 3-month attrition
+#events to your monthly tables.
+#I've used for loops and append steps, so it takes a while to run.
+
+library(data.table)
+library(zoo)
+detach("package:lubridate")
+library(lubridate) #Need to load this after data.table, otherwise we will lose some
+#important lubridate functions
+
+#First import monthly_table for all your domains of interest
+#Be sure to set config_run first
+#source(file.path("analysis_scripts","raw_data","data_import.R", fsep = .Platform$file.sep))
+all_monthly <- monthly_table
+
+#Set report_options
+#Format date variables
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+report = run_conf$reports$modules$name
+report_options <- get_report_options(run_conf,report)
+report_end_date <- as.Date(report_options$end_date)
+end_month <- as.yearmon(report_end_date)
+end_month <- parse_date_time(paste('01', end_month), '%d %b %Y!')
+end_month <- as.Date(end_month)
+
+#Convert calendar month to an actual date to make it easier to work with
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+
+#Modify relevant variables
+all_monthly$domain_numeric = as.numeric(as.factor(all_monthly$domain))
+
+#Calculate differences (in days) between consecutive monthly rows per user to calculate
+#next_month_active and previous_month_active variables
+all_monthly <- arrange(all_monthly, domain_numeric, user_pk, calendar_month)
+df <- data.table(all_monthly)
+setkey(df,user_pk)
+df[,diff_days:=c(NA,diff(calendar_month)),by=user_pk]
+all_monthly <- as.data.frame(df)
+all_monthly$previous_month_active <- all_monthly$diff_days <= 31
+all_monthly$previous_two_months_active <- all_monthly$diff_days <= 62
+all_monthly$previous_three_months_active <- all_monthly$diff_days <= 93
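+#For illustration (hypothetical user): calendar months Jan, Feb and May 2015
+#give diff_days = NA, 31, 89, so previous_month_active = NA, TRUE, FALSE,
+#previous_two_months_active = NA, TRUE, FALSE and
+#previous_three_months_active = NA, TRUE, TRUE.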
+
+#Vector of unique users
+users <- unique(all_monthly$user_pk)
+
+#1-month attrition events
+next_month_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_month_active[-1], F)
+ next_month_active <- append(next_month_active, next_active)
+}
+all_monthly$next_month_active <- next_month_active
+
+#2-month attrition events
+next_two_months_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_two_months_active[-1], F)
+ next_two_months_active <- append(next_two_months_active, next_active)
+}
+all_monthly$next_two_months_active <- next_two_months_active
+
+#3-month attrition events
+next_three_months_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_three_months_active[-1], F)
+ next_three_months_active <- append(next_three_months_active, next_active)
+}
+all_monthly$next_three_months_active <- next_three_months_active
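+#The three loops above each shift the corresponding previous_*_active flag up
+#one row per user. A vectorized sketch using dplyr's lead() (assuming
+#all_monthly is still sorted by user_pk and calendar_month) would be:
+#all_monthly <- all_monthly %>% group_by(user_pk) %>%
+#  mutate(next_month_active = lead(previous_month_active, default = FALSE)) %>%
+#  ungroup()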
+
+#Based on the end_month in our dataset, we don't know if the user will be active in any of
+#the months following end_month. Must change all those attrition values to NA.
+is.na(all_monthly$next_month_active) <- all_monthly$calendar_month == end_month
+is.na(all_monthly$next_two_months_active) <- all_monthly$calendar_month >= end_month - months(1)
+is.na(all_monthly$next_three_months_active) <- all_monthly$calendar_month >= end_month - months(2)
diff --git a/analysis_scripts/rdayalu/attrition_visuals.R b/analysis_scripts/rdayalu/attrition_visuals.R
new file mode 100644
index 0000000..0e96a07
--- /dev/null
+++ b/analysis_scripts/rdayalu/attrition_visuals.R
@@ -0,0 +1,371 @@
+#------------------------------------------------------------------------#
+#Data for attrition blog based on Neal's attrition document
+#------------------------------------------------------------------------#
+training_typical <- arrange(training_typical, user_pk, calendar_month)
+users <- unique(training_typical$user_pk)
+
+#ncases_touched
+leadup <- data.frame(matrix(ncol = 11, nrow = 1))
+names(leadup) <- c("user_pk", "calendar_month", "previous_three_months_active",
+ "next_month_active", "next_two_months_active",
+ "next_three_months_active", "month_1", "month_2", "month_3", "month_4",
+ "diff_days")
+leadup$calendar_month <- as.Date(leadup$calendar_month)
+leadup$user_pk <- as.numeric(leadup$user_pk)
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ single_user$calendar_month <- as.Date(single_user$calendar_month)
+ for (j in 1:(nrow(single_user)-3)) {
+ leadup_single <- data.frame(matrix(ncol = 11, nrow = 1))
+ names(leadup_single) <- c("user_pk", "calendar_month", "previous_three_months_active",
+ "next_month_active", "next_two_months_active",
+ "next_three_months_active", "month_1", "month_2", "month_3",
+ "month_4", "diff_days")
+ leadup_single$user_pk <- as.numeric(single_user$user_pk[1])
+ leadup_single$calendar_month <- as.Date(single_user$calendar_month[3+j])
+ leadup_single$previous_three_months_active <- single_user$previous_three_months_active[3+j]
+ leadup_single$next_month_active <- single_user$next_month_active[3+j]
+ leadup_single$next_two_months_active <- single_user$next_two_months_active[3+j]
+ leadup_single$next_three_months_active <- single_user$next_three_months_active[3+j]
+ leadup_single$month_1 <- single_user$ncases_touched[3+j]
+ leadup_single$month_2 <- single_user$ncases_touched[2+j]
+ leadup_single$month_3 <- single_user$ncases_touched[1+j]
+ leadup_single$month_4 <- single_user$ncases_touched[j]
+ leadup_single$diff_days <- as.numeric(single_user$calendar_month[3+j] - single_user$calendar_month[j])
+ leadup <- rbind(leadup, leadup_single)
+ }
+}
+leadup <- leadup[!(is.na(leadup$calendar_month)),]
+leadup_subset <- filter(leadup, diff_days <= 92)
+leadup_subset$before_one_month_attrition <- leadup_subset$next_month_active == F &
+ leadup_subset$next_two_months_active == T
+leadup_subset$before_two_month_attrition <- leadup_subset$next_two_months_active == F &
+ leadup_subset$next_three_months_active == T
+write.csv(leadup_subset, file = "attrition_leadup_ntouched.csv")
+
+#Graph leadups to attrition events
+leadup <- read.csv(file = "attrition_leadup_ntouched.csv")
+leadup$calendar_month <- as.Date(leadup$calendar_month)
+leadup <- arrange(leadup, user_pk, calendar_month)
+
+leadup_att3 <- filter(leadup, next_three_months_active == F)
+leadup_3 <- c(median(leadup_att3$month_1), median(leadup_att3$month_2),
+ median(leadup_att3$month_3), median(leadup_att3$month_4))
+
+leadup_att2 <- filter(leadup, before_two_month_attrition == T)
+leadup_2 <- c(median(leadup_att2$month_1), median(leadup_att2$month_2),
+ median(leadup_att2$month_3), median(leadup_att2$month_4))
+
+leadup_att1 <- filter(leadup, before_one_month_attrition == T)
+leadup_1 <- c(median(leadup_att1$month_1), median(leadup_att1$month_2),
+ median(leadup_att1$month_3), median(leadup_att1$month_4))
+
+leadup_active <- filter(leadup, next_month_active == T)
+leadup_comparison <- c(median(leadup_active$month_1), median(leadup_active$month_2),
+ median(leadup_active$month_3), median(leadup_active$month_4))
+
+leadup_data <- data.frame(cbind(rep(c(1:4), 2), c(leadup_3, leadup_comparison)))
+leadup_data$X1 <- as.factor(leadup_data$X1)
+month_levels <- rev(levels(leadup_data$X1))
+leadup_data$att_duration <- as.factor(c(rep("Attrition",4), rep("No attrition",4)))
+
+g <- ggplot(leadup_data, aes(x = X1, y = X2, colour = att_duration, group = att_duration, linetype = att_duration)) +
+ geom_point(shape = 15, size = 4.0, colour="peachpuff4") +
+ geom_line(size = 1.5) +
+ scale_colour_brewer(palette="Set1") +
+ scale_y_continuous(limits=c(0,13)) +
+ scale_x_discrete(limits = month_levels) +
+ xlab("Number of months prior to attrition event") +
+ ylab("Number of cases visited") +
+ theme(axis.title.x=element_text(size=14), axis.text.x=element_text(size=14, colour = "black")) +
+ theme(axis.title.y=element_text(size=14), axis.text.y=element_text(size=14, colour = "black")) +
+ theme(legend.title=element_blank())
+
+
+
+#ncases_registered
+leadup <- data.frame(matrix(ncol = 11, nrow = 1))
+names(leadup) <- c("user_pk", "calendar_month", "previous_three_months_active",
+ "next_month_active", "next_two_months_active",
+ "next_three_months_active", "month_1", "month_2", "month_3", "month_4",
+ "diff_days")
+leadup$calendar_month <- as.Date(leadup$calendar_month)
+leadup$user_pk <- as.numeric(leadup$user_pk)
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ single_user$calendar_month <- as.Date(single_user$calendar_month)
+ for (j in 1:(nrow(single_user)-3)) {
+ leadup_single <- data.frame(matrix(ncol = 11, nrow = 1))
+ names(leadup_single) <- c("user_pk", "calendar_month", "previous_three_months_active",
+ "next_month_active", "next_two_months_active",
+ "next_three_months_active", "month_1", "month_2", "month_3",
+ "month_4", "diff_days")
+ leadup_single$user_pk <- as.numeric(single_user$user_pk[1])
+ leadup_single$calendar_month <- as.Date(single_user$calendar_month[3+j])
+ leadup_single$previous_three_months_active <- single_user$previous_three_months_active[3+j]
+ leadup_single$next_month_active <- single_user$next_month_active[3+j]
+ leadup_single$next_two_months_active <- single_user$next_two_months_active[3+j]
+ leadup_single$next_three_months_active <- single_user$next_three_months_active[3+j]
+ leadup_single$month_1 <- single_user$ncases_registered[3+j]
+ leadup_single$month_2 <- single_user$ncases_registered[2+j]
+ leadup_single$month_3 <- single_user$ncases_registered[1+j]
+ leadup_single$month_4 <- single_user$ncases_registered[j]
+ leadup_single$diff_days <- as.numeric(single_user$calendar_month[3+j] - single_user$calendar_month[j])
+ leadup <- rbind(leadup, leadup_single)
+ }
+}
+leadup <- leadup[!(is.na(leadup$calendar_month)),]
+leadup_subset <- filter(leadup, diff_days <= 92)
+leadup_subset$before_one_month_attrition <- leadup_subset$next_month_active == F &
+ leadup_subset$next_two_months_active == T
+leadup_subset$before_two_month_attrition <- leadup_subset$next_two_months_active == F &
+ leadup_subset$next_three_months_active == T
+write.csv(leadup_subset, file = "attrition_leadup_nregistered.csv")
+
+#------------------------------------------------------------------------#
+#ncases_touched
+#------------------------------------------------------------------------#
+users <- unique((filter(training_typical, next_month_active == F))$user_pk)
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Median absolute and relative values leading up to attrition period
+med_leadups_1mo <- apply(as.matrix(attrition_subset), 2, median)
+write.csv(attrition_subset[,1:5], file = "attrition_leadup_1mo_ntouched.csv")
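+#The block above is repeated below for the 2- and 3-month flags and again for
+#ncases_registered; only the attrition flag and the indicator column change.
+#A helper along these lines (hypothetical name, same lappend() as above)
+#would factor out the duplication:
+#attrition_leadup <- function(df, flag, value, n = 5) {
+#  out <- list()
+#  for (i in unique(df$user_pk[which(df[[flag]] == F)])) {
+#    su <- df[df$user_pk == i,]
+#    pos <- which(su[[flag]] == F)
+#    starts <- c(1, head(pos, -1) + 1)
+#    for (k in seq_along(pos)) {
+#      out <- lappend(out, rev(su[[value]][starts[k]:pos[k]]))
+#    }
+#  }
+#  len <- max(sapply(out, length))
+#  out <- lapply(out, function(v) c(v, rep(NA, len - length(v))))
+#  res <- data.frame(do.call(rbind, out))
+#  names(res) <- paste0("month_", 1:ncol(res))
+#  res[!is.na(res[[paste0("month_", n)]]), 1:n]
+#}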
+
+users <- unique((filter(training_typical, next_two_months_active == F))$user_pk)
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_two_months_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Median absolute and relative values leading up to attrition period
+med_leadups_2mos <- apply(as.matrix(attrition_subset), 2, median)
+write.csv(attrition_subset[,1:5], file = "attrition_leadup_2mo_ntouched.csv")
+
+users <- unique((filter(training_typical, next_three_months_active == F))$user_pk)
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_three_months_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Median absolute and relative values leading up to attrition period
+med_leadups_3mos <- apply(as.matrix(attrition_subset), 2, median)
+write.csv(attrition_subset[,1:5], file = "attrition_leadup_3mo_ntouched.csv")
+
+#------------------------------------------------------------------------#
+#ncases_registered
+#------------------------------------------------------------------------#
+users <- unique((filter(training_typical, next_month_active == F))$user_pk)
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Median absolute and relative values leading up to attrition period
+med_leadups_1mo <- apply(as.matrix(attrition_subset), 2, median)
+write.csv(attrition_subset[,1:5], file = "attrition_leadup_1mo_nregistered.csv")
+
+users <- unique((filter(training_typical, next_two_months_active == F))$user_pk)
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_two_months_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Median absolute and relative values leading up to attrition period
+med_leadups_2mos <- apply(as.matrix(attrition_subset), 2, median)
+write.csv(attrition_subset[,1:5], file = "attrition_leadup_2mo_nregistered.csv")
+
+users <- unique((filter(training_typical, next_three_months_active == F))$user_pk)
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_three_months_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Median absolute and relative values leading up to attrition period
+med_leadups_3mos <- apply(as.matrix(attrition_subset), 2, median)
+write.csv(attrition_subset[,1:5], file = "attrition_leadup_3mo_nregistered.csv")
+write.csv(data.frame(cbind(med_leadups_1mo, med_leadups_2mos, med_leadups_3mos)),
+ file = "median_nregistered_leadup.csv")
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/blog_indicator_evaluation.R b/analysis_scripts/rdayalu/blog_indicator_evaluation.R
new file mode 100644
index 0000000..a19ead0
--- /dev/null
+++ b/analysis_scripts/rdayalu/blog_indicator_evaluation.R
@@ -0,0 +1,421 @@
+all_monthly <- read.csv(file = "blog_data_12_4_14.csv")
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+#all_monthly <- filter(all_monthly, !(domain %in% exclude_domains))
+#write.csv(all_monthly, file = "blog_data_12_4_14.csv")
+library(dplyr)
+library(data.table)
+library(zoo)
+detach("package:lubridate")
+library(lubridate)
+library(ggplot2)
+library(scales) #to customize ggplot axis labeling
+library(gridExtra) #graphing plots in columns/rows for ggplot
+library(RColorBrewer) #Color palettes
+source('s_dplyr.R')
+
+#Exclude any users who logged > 100 visits in any month
+all_monthly$visits_ge_100 <- all_monthly$nvisits > 100
+user_ge_100 <- all_monthly %>%
+  group_by(user_pk) %>%
+  summarise(ge_100 = sum(visits_ge_100))
+user_le_100 <- filter(user_ge_100, ge_100 == 0)
+#Keep only users with <= 100 visits per month
+training_typical <-
+ all_monthly[all_monthly$user_pk %in% user_le_100$user_pk, ]
+
+#Exclude any users that don't have a month_index = 1
+#These users have months that started outside our data range for this dataset
+#so we shouldn't include them
+training_typical$has_index_1 <- training_typical$month_index == 1
+user_index_1 <- training_typical %>%
+  group_by(user_pk) %>%
+  summarise(keep_user = sum(has_index_1))
+user_index_1 <- filter(user_index_1, keep_user != 0)
+#Keep users that have a month_index = 1
+training_typical <-
+ training_typical[training_typical$user_pk %in% user_index_1$user_pk, ]
+
+#Add sample_increase variable: increasing by steady increments of 1
+#Add sample_decrease variable: decreasing by steady increments of 1
+training_typical <- arrange(training_typical, user_pk, calendar_month)
+users <- unique(training_typical$user_pk)
+sample_increase <- c()
+sample_decrease <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #sample_increase <- append(sample_increase, 100+(c(1:nrow(single_user))))
+ #sample_decrease <- append(sample_decrease, 100+(rev(c(1:nrow(single_user)))))
+ sample_increase <- append(sample_increase, cumsum(sample(1:5, nrow(single_user), replace=T)))
+ sample_decrease <- append(sample_decrease, rev(cumsum(sample(1:5, nrow(single_user), replace=T))))
+}
+training_typical$sample_increase <- sample_increase
+training_typical$sample_decrease <- sample_decrease
+
+#Calculate differences between month_index to calculate next_month_active and
+#previous_month_active variables
+#Also want differences between indicators for each user from one month to the next
+#Differences in indicators will be used for tests 1a/b
+training_typical <- arrange(training_typical, user_pk, calendar_month)
+df <- data.table(training_typical)
+setkey(df,user_pk)
+#df[,diff_days:=c(NA,diff(calendar_month)),by=user_pk]
+df[,diff_nvisits:=c(NA,diff(nvisits)),by=user_pk]
+df[,diff_active_day_percent:=c(NA,diff(active_day_percent)),by=user_pk]
+df[,diff_nforms:=c(NA,diff(nforms)),by=user_pk]
+df[,diff_median_visit_duration:=c(NA,diff(median_visit_duration)),by=user_pk]
+df[,diff_median_visits_per_day:=c(NA,diff(median_visits_per_day)),by=user_pk]
+df[,diff_time_using_cc:=c(NA,diff(time_using_cc)),by=user_pk]
+#df[,diff_ninteractions:=c(NA,diff(ninteractions)),by=user_pk]
+df[,diff_ncases_registered:=c(NA,diff(ncases_registered)),by=user_pk]
+df[,diff_register_followup:=c(NA,diff(register_followup)),by=user_pk]
+df[,diff_case_register_followup_rate:=c(NA,diff(case_register_followup_rate)),by=user_pk]
+df[,diff_ncases_touched:=c(NA,diff(ncases_touched)),by=user_pk]
+df[,diff_nunique_followups:=c(NA,diff(nunique_followups)),by=user_pk]
+#df[,diff_audio_plays:=c(NA,diff(audio_plays)),by=user_pk]
+#df[,diff_network_warnings:=c(NA,diff(network_warnings)),by=user_pk]
+#df[,diff_num_user_pk:=c(NA,diff(num_user_pk)),by=user_pk]
+#df[,diff_domain_numeric:=c(NA,diff(domain_numeric)),by=user_pk]
+#df[,diff_sample_undefined:=c(NA,diff(sample_undefined)),by=user_pk]
+#df[,diff_sample_normal:=c(NA,diff(sample_normal)),by=user_pk]
+#df[,diff_sample_percentile:=c(NA,diff(sample_percentile)),by=user_pk]
+#sample_increase and sample_decrease are created above, so compute their diffs too
+df[,diff_sample_increase:=c(NA,diff(sample_increase)),by=user_pk]
+df[,diff_sample_decrease:=c(NA,diff(sample_decrease)),by=user_pk]
+training_typical <- as.data.frame(df)
+
+#Choose only ONE of the three options below; the unused options should stay commented out:
+# 1. Exclude users with < 4 months on CC
+month_count <- training_typical %>%
+  group_by(user_pk) %>%
+  summarise(months_on_cc = length(unique(calendar_month)))
+month_count <- filter(month_count, months_on_cc >= 4)
+training_typical <-
+  training_typical[training_typical$user_pk %in% month_count$user_pk, ]
+
+# 2. Exclude rows before the 6th month on CC
+#training_typical <- filter(training_typical, month_index >= 6)
+
+# 3. Exclude rows before the 6th month on CC AND users with < 6 active months on CC
+#training_typical <- filter(training_typical, month_index >= 6)
+#training_typical <- filter(training_typical, active_months >= 6)
+
+#training_typical$previous_month_active <- training_typical$diff_days <= 31
+#next_month_active <- c()
+#for (i in users) {
+# single_user <- training_typical[training_typical$user_pk == i,]
+# next_active <- c()
+# next_active <- append(single_user$previous_month_active[-1], F)
+# next_month_active <- append(next_month_active, next_active)
+#}
+#training_typical$next_month_active <- next_month_active
+#If calendar_month = 10/1/14 then next_month_active = NA
+#because we don't know if the user will be active in the following month
+#is.na(training_typical$next_month_active) <- training_typical$calendar_month == "2014-10-01"
+
+#------------------------------------------------------------------------#
+#General plots
+#------------------------------------------------------------------------#
+#Number of users by calendar month
+n_user <- training_typical %>%
+  group_by(calendar_month) %>%
+  summarise(n_users = length(unique(user_id)))
+
+g <- ggplot(n_user, aes(x=calendar_month, y=n_users)) +
+ geom_point(size = 3, shape = 19, alpha = 0.5, colour = "darkblue",
+ fill = "lightblue") +
+ geom_line(colour = "darkblue") +
+ scale_size_area() +
+ xlab("Calendar month") +
+ ylab("# unique users/month") +
+ theme(axis.text=element_text(size=12), axis.title=element_text(size=14,
+ face="bold")) +
+ ggtitle("Number of users by calendar month") +
+ theme(plot.title = element_text(size=14, face="bold"))
+
+pdf("plots.pdf")
+plot(g)
+dev.off()
+
+#------------------------------------------------------------------------#
+#Code for Test 1
+#------------------------------------------------------------------------#
+# % difference in indicators for each user for consecutive months
+# These differences are not guaranteed to be between truly consecutive months,
+# so later on we only use rows with previous_month_active == T
+#This will be used for test 1b
+source(file.path("analysis_scripts","rdayalu","test_1b.R", fsep = .Platform$file.sep))
+
+#Must only include rows with previous_month_active == T. Exclude F & NA
+training_consec <- filter(training_typical, previous_month_active == T)
+training_consec$concat <- paste(training_consec$user_pk, training_consec$calendar_month,
+ sep = "_")
+
+
+#Exclude domain calendar_months with nusers < 5 for that domain
+#Use this dataset only for test 1a/1b
+nusers <- training_consec %>%
+ group_by(domain, calendar_month) %>%
+ summarise(nusers = length(unique(user_pk)))
+nusers <- filter(nusers, nusers >= 5)
+nusers$concat <- paste(nusers$domain, nusers$calendar_month, sep = "_")
+training_consec <-
+ training_consec[paste(training_consec$domain, training_consec$calendar_month, sep = "_") %in%
+ nusers$concat, ]
+
+#Domain median ABSOLUTE change per user per calendar month,
+#excluding each user from the domain median for that user's row
+#This is used for test 1a
+source(file.path("analysis_scripts","rdayalu","test_1a.R", fsep = .Platform$file.sep))
+
+#Domain median PERCENTAGE change per user per calendar month,
+#excluding each user from the domain median for that user's row
+source(file.path("analysis_scripts","rdayalu","test_1b_2.R", fsep = .Platform$file.sep))
+
+names(training_consec)
+diff_indicator <- names(training_consec[41:61])
+per_diff_indicator <- names(training_consec[64:84])
+
+#Dataset for Mengji with test 1a/1b x,y values for each user for each month
+#mengji <- select(training_consec, user_id, user_pk, domain, calendar_month, diff_ncases_touched,
+# med_ncases_touched_1a, per_diff_ncases_touched, med_ncases_touched_1b)
+#names(mengji)[names(mengji) == "diff_ncases_touched"] = "ntouched_test1a_x"
+#names(mengji)[names(mengji) == "med_ncases_touched_1a"] = "ntouched_test1a_y"
+#names(mengji)[names(mengji) == "per_diff_ncases_touched"] = "ntouched_test1b_x"
+#names(mengji)[names(mengji) == "med_ncases_touched_1b"] = "ntouched_test1b_y"
+
+#Indicator names used to label the correlation vectors (also defined again before test 2a)
+indicators <- c("nvisits", "active_day_percent", "nforms",
+                "median_visit_duration", "median_visits_per_day",
+                "time_using_cc", "ncases_registered", "register_followup",
+                "case_register_followup_rate", "ncases_touched",
+                "nunique_followups", "sample_increase", "sample_decrease")
+
+test_1a <-
+  c(cor(training_consec$med_nvisits_1a, training_consec$diff_nvisits, use = "complete.obs"),
+ cor(training_consec$med_active_day_percent_1a, training_consec$diff_active_day_percent, use = "complete.obs"),
+ cor(training_consec$med_nforms_1a, training_consec$diff_nforms, use = "complete.obs"),
+ cor(training_consec$med_median_visit_duration_1a, training_consec$diff_median_visit_duration, use = "complete.obs"),
+ cor(training_consec$med_median_visits_per_day_1a, training_consec$diff_median_visits_per_day, use = "complete.obs"),
+ cor(training_consec$med_time_using_cc_1a, training_consec$diff_time_using_cc, use = "complete.obs"),
+ #cor(training_consec$med_ninteractions_1a, training_consec$diff_ninteractions, use = "complete.obs"),
+ cor(training_consec$med_ncases_registered_1a, training_consec$diff_ncases_registered, use = "complete.obs"),
+ cor(training_consec$med_register_followup_1a, training_consec$diff_register_followup, use = "complete.obs"),
+ cor(training_consec$med_case_register_followup_rate_1a, training_consec$diff_case_register_followup_rate, use = "complete.obs"),
+ cor(training_consec$med_ncases_touched_1a, training_consec$diff_ncases_touched, use = "complete.obs"),
+ cor(training_consec$med_nunique_followups_1a, training_consec$diff_nunique_followups, use = "complete.obs"),
+    #The diff_ columns for the indicators below are commented out above, so skip them here
+    #cor(training_consec$med_audio_plays_1a, training_consec$diff_audio_plays, use = "complete.obs"),
+    #cor(training_consec$med_network_warnings_1a, training_consec$diff_network_warnings, use = "complete.obs"),
+    #cor(training_consec$med_num_user_pk_1a, training_consec$diff_num_user_pk, use = "complete.obs"),
+    #cor(training_consec$med_domain_numeric_1a, training_consec$diff_domain_numeric, use = "complete.obs"),
+    #cor(training_consec$med_sample_undefined_1a, training_consec$diff_sample_undefined, use = "complete.obs"),
+    #cor(training_consec$med_sample_normal_1a, training_consec$diff_sample_normal, use = "complete.obs"),
+    #cor(training_consec$med_sample_percentile_1a, training_consec$diff_sample_percentile, use = "complete.obs"),
+ cor(training_consec$med_sample_increase_1a, training_consec$diff_sample_increase, use = "complete.obs"),
+ cor(training_consec$med_sample_decrease_1a, training_consec$diff_sample_decrease, use = "complete.obs"))
+names(test_1a) <- indicators
+
+test_1b <-
+ c(cor(training_consec$med_nvisits_1b, training_consec$per_diff_nvisits, use = "complete.obs"),
+ cor(training_consec$med_active_day_percent_1b, training_consec$per_diff_active_day_percent, use = "complete.obs"),
+ cor(training_consec$med_nforms_1b, training_consec$per_diff_nforms, use = "complete.obs"),
+ cor(training_consec$med_median_visit_duration_1b, training_consec$per_diff_median_visit_duration, use = "complete.obs"),
+ cor(training_consec$med_median_visits_per_day_1b, training_consec$per_diff_median_visits_per_day, use = "complete.obs"),
+ cor(training_consec$med_time_using_cc_1b, training_consec$per_diff_time_using_cc, use = "complete.obs"),
+ #cor(training_consec$med_ninteractions_1b, training_consec$per_diff_ninteractions, use = "complete.obs"),
+ cor(training_consec$med_ncases_registered_1b, training_consec$per_diff_ncases_registered, use = "complete.obs"),
+ cor(training_consec$med_register_followup_1b, training_consec$per_diff_register_followup, use = "complete.obs"),
+ cor(training_consec$med_case_register_followup_rate_1b, training_consec$per_diff_case_register_followup_rate, use = "complete.obs"),
+ cor(training_consec$med_ncases_touched_1b, training_consec$per_diff_ncases_touched, use = "complete.obs"),
+ cor(training_consec$med_nunique_followups_1b, training_consec$per_diff_nunique_followups, use = "complete.obs"),
+    #The per_diff_ columns for the indicators below are commented out above, so skip them here
+    #cor(training_consec$med_audio_plays_1b, training_consec$per_diff_audio_plays, use = "complete.obs"),
+    #cor(training_consec$med_network_warnings_1b, training_consec$per_diff_network_warnings, use = "complete.obs"),
+    #cor(training_consec$med_num_user_pk_1b, training_consec$per_diff_num_user_pk, use = "complete.obs"),
+    #cor(training_consec$med_domain_numeric_1b, training_consec$per_diff_domain_numeric, use = "complete.obs"),
+    #cor(training_consec$med_sample_undefined_1b, training_consec$per_diff_sample_undefined, use = "complete.obs"),
+    #cor(training_consec$med_sample_normal_1b, training_consec$per_diff_sample_normal, use = "complete.obs"),
+    #cor(training_consec$med_sample_percentile_1b, training_consec$per_diff_sample_percentile, use = "complete.obs"),
+ cor(training_consec$med_sample_increase_1b, training_consec$per_diff_sample_increase, use = "complete.obs"),
+ cor(training_consec$med_sample_decrease_1b, training_consec$per_diff_sample_decrease, use = "complete.obs"))
+names(test_1b) <- indicators
+
+
+#Pairwise plots of absolute and % changes for individual FLWs by domain medians
+g <- ggplot(training_consec, aes(x=med_sample_increase_1a, y=diff_sample_increase)) +
+ #scale_y_continuous(limits=c(-100,100)) +
+ geom_smooth(method=lm) +
+ geom_point(shape=1, size = 1)
+
+g <- ggplot(training_consec, aes(x=med_sample_increase_1b, y=per_diff_sample_increase)) +
+ scale_y_continuous(limits=c(0,150)) +
+ geom_smooth(method=lm) +
+ geom_point(shape=1, size = 1)
+
+
+#------------------------------------------------------------------------#
+#Code for Test 2
+#------------------------------------------------------------------------#
+
+#Previous month's indicator value
+training_typical$prev_nvisits <- training_typical$nvisits - training_typical$diff_nvisits
+training_typical$prev_active_day_percent <- training_typical$active_day_percent - training_typical$diff_active_day_percent
+training_typical$prev_nforms<- training_typical$nforms - training_typical$diff_nforms
+training_typical$prev_median_visit_duration <- training_typical$median_visit_duration - training_typical$diff_median_visit_duration
+training_typical$prev_median_visits_per_day <- training_typical$median_visits_per_day - training_typical$diff_median_visits_per_day
+training_typical$prev_time_using_cc <- training_typical$time_using_cc - training_typical$diff_time_using_cc
+#training_typical$prev_ninteractions <- training_typical$ninteractions - training_typical$diff_ninteractions
+training_typical$prev_ncases_registered <- training_typical$ncases_registered - training_typical$diff_ncases_registered
+training_typical$prev_register_followup <- training_typical$register_followup - training_typical$diff_register_followup
+training_typical$prev_case_register_followup_rate <- training_typical$case_register_followup_rate - training_typical$diff_case_register_followup_rate
+training_typical$prev_ncases_touched <- training_typical$ncases_touched - training_typical$diff_ncases_touched
+training_typical$prev_nunique_followups <- training_typical$nunique_followups - training_typical$diff_nunique_followups
+#The diff_ columns for the indicators below are commented out above,
+#so their prev_ values cannot be computed
+#training_typical$prev_audio_plays <- training_typical$audio_plays - training_typical$diff_audio_plays
+#training_typical$prev_network_warnings <- training_typical$network_warnings - training_typical$diff_network_warnings
+#training_typical$prev_num_user_pk <- training_typical$num_user_pk - training_typical$diff_num_user_pk
+#training_typical$prev_domain_numeric <- training_typical$domain_numeric - training_typical$diff_domain_numeric
+#training_typical$prev_sample_undefined <- training_typical$sample_undefined - training_typical$diff_sample_undefined
+#training_typical$prev_sample_normal <- training_typical$sample_normal - training_typical$diff_sample_normal
+#training_typical$prev_sample_percentile <- training_typical$sample_percentile - training_typical$diff_sample_percentile
+training_typical$prev_sample_increase <- training_typical$sample_increase - training_typical$diff_sample_increase
+training_typical$prev_sample_decrease <- training_typical$sample_decrease - training_typical$diff_sample_decrease
+
+test2_data <- training_typical[training_typical$previous_month_active == T,]
+
+#Dataset for Mengji with test 2a x,y values for each user for each month
+#mengji2 <- select(test2_data, user_id, user_pk, domain, calendar_month, ncases_touched,
+# prev_ncases_touched)
+#names(mengji2)[names(mengji2) == "ncases_touched"] = "ntouched_test2a_x"
+#names(mengji2)[names(mengji2) == "prev_ncases_touched"] = "ntouched_test2a_y"
+
+indicators <- c("nvisits", "active_day_percent", "nforms",
+ "median_visit_duration", "median_visits_per_day",
+ "time_using_cc", "ncases_registered", "register_followup",
+ "case_register_followup_rate", "ncases_touched",
+ "nunique_followups", "sample_increase", "sample_decrease")
+
+test_2a <-
+ c(cor(training_typical$prev_nvisits, training_typical$nvisits, use = "complete.obs"),
+ cor(training_typical$prev_active_day_percent, training_typical$active_day_percent, use = "complete.obs"),
+ cor(training_typical$prev_nforms, training_typical$nforms, use = "complete.obs"),
+ cor(training_typical$prev_median_visit_duration, training_typical$median_visit_duration, use = "complete.obs"),
+ cor(training_typical$prev_median_visits_per_day, training_typical$median_visits_per_day, use = "complete.obs"),
+ cor(training_typical$prev_time_using_cc, training_typical$time_using_cc, use = "complete.obs"),
+ #cor(training_typical$prev_ninteractions, training_typical$ninteractions, use = "complete.obs"),
+ cor(training_typical$prev_ncases_registered, training_typical$ncases_registered, use = "complete.obs"),
+ cor(training_typical$prev_register_followup, training_typical$register_followup, use = "complete.obs"),
+ cor(training_typical$prev_case_register_followup_rate, training_typical$case_register_followup_rate, use = "complete.obs"),
+ cor(training_typical$prev_ncases_touched, training_typical$ncases_touched, use = "complete.obs"),
+    cor(training_typical$prev_nunique_followups, training_typical$nunique_followups, use = "complete.obs"),
+    #The prev_ columns for the indicators below are not computed, so skip them here
+    #cor(training_typical$prev_audio_plays, training_typical$audio_plays, use = "complete.obs"),
+    #cor(training_typical$prev_network_warnings, training_typical$network_warnings, use = "complete.obs"),
+    #cor(training_typical$prev_num_user_pk, training_typical$num_user_pk, use = "complete.obs"),
+    #cor(training_typical$prev_domain_numeric, training_typical$domain_numeric, use = "complete.obs"),
+    #cor(training_typical$prev_sample_undefined, training_typical$sample_undefined, use = "complete.obs"),
+    #cor(training_typical$prev_sample_normal, training_typical$sample_normal, use = "complete.obs"),
+    #cor(training_typical$prev_sample_percentile, training_typical$sample_percentile, use = "complete.obs"),
+    cor(training_typical$prev_sample_increase, training_typical$sample_increase, use = "complete.obs"),
+    cor(training_typical$prev_sample_decrease, training_typical$sample_decrease, use = "complete.obs"))
+names(test_2a) <- indicators
+
+g <- ggplot(training_typical, aes(x=prev_ncases_touched, y=ncases_touched)) +
+ geom_point(shape=1) +
+ scale_x_continuous(limits=c(0,100)) +
+ scale_y_continuous(limits=c(0,100)) +
+ geom_smooth(method=lm)
+
+#------------------------------------------------------------------------#
+#Code for Test 4
+#------------------------------------------------------------------------#
+
+#Create function to append attrition list
+lappend <- function (lst, ...){
+ lst <- c(lst, list(...))
+ return(lst)
+}
+
+#Extract users with at least one attrition event.
+#Run only ONE of the lines below, matching the attrition window under analysis:
+#One-month attrition event
+users <- unique((filter(training_typical, next_month_active == F))$user_pk)
+#Two-month attrition event
+#users <- unique((filter(training_typical, next_two_months_active == F))$user_pk)
+#Three-month attrition event
+#users <- unique((filter(training_typical, next_three_months_active == F))$user_pk)
+
+#This is the test_4b code
+source(file.path("analysis_scripts","rdayalu","test_4b.R", fsep = .Platform$file.sep))
+names(test_4b) <- indicators
+
+test <- data.frame(cbind(test_1a, test_1b, test_2a, test_4b))
+write.csv(test, file = "blog_set_results.csv")
+#------------------------------------------------------------------------#
+#Other random code
+#------------------------------------------------------------------------#
+
+#Overall attrition_data: Median of each month column
+months_median <- apply(attrition_data[,1:23], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_data[,1:23], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black")
+
+#Relative attrition_data: Median of each relative month column
+months_median <- apply(attrition_subset[,25:29], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_subset[,25:29], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black") +
+ xlab("month_before_attrition") +
+ ylab("nvisits relative to month 5 (%)")
+
+#Subset of attrition_data: Median of each month column
+months_median <- apply(attrition_subset[,1:5], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_subset[,1:5], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black")
+
+
+#Test 4A: slope of line (lm) for absolute months 1-4 for each row
+#attrition_subset$slope_abs <- apply(attrition_subset[,1:4], 1, function(x)
+# lm(x~c(1:4))$coefficients[[2]])
+
+#Reshape tula_typical from long format to wide
+#This creates only one row per user with columns for each calendar month
+users_long <- select(tula_typical, domain, user_id, nvisits, calendar_month)
+users_wide <- reshape(users_long,
+ timevar = "calendar_month",
+ idvar = c("domain", "user_id"),
+ direction = "wide")
+
+g <- ggplot(data=tula_typical, aes(x=calendar_month, y=nvisits, group = user_id)) +
+ geom_line(colour="grey", size=1.0)
+
+users_long <- select(raw_percentile, domain, user_id, percentile, calendar_month)
+users_wide <- reshape(users_long,
+ timevar = "calendar_month",
+ idvar = c("domain", "user_id"),
+ direction = "wide")
+
+# Number of users by calendar_month
+users_month_tula <- tula_typical %>%
+  group_by(domain, calendar_month) %>%
+  summarise(nusers = length(unique(user_id)))
+
+g <- ggplot(data=users_month_tula, aes(x=calendar_month, y=nusers)) +
+ geom_line(colour="black", size=1.0) +
+ geom_point(colour="red", size=3, shape=21, fill="red")
+
+users_wide <- reshape(users_month_tula,
+ timevar = "calendar_month",
+ idvar = c("domain"),
+ direction = "wide")
+
+users_wide <- users_wide[,order(names(users_wide))]
+write.csv(users_wide, file = "tula_nusers_wide.csv")
diff --git a/analysis_scripts/rdayalu/business_unit_mystery.R b/analysis_scripts/rdayalu/business_unit_mystery.R
new file mode 100644
index 0000000..7d38fe8
--- /dev/null
+++ b/analysis_scripts/rdayalu/business_unit_mystery.R
@@ -0,0 +1,52 @@
+#BU mystery
+#2/24/15
+
+#How many country_final = India domains were active in Q4 of 2014?
+#First load monthly table for all test = F and permitted_data = T domains
+all_monthly <- monthly_table
+
+#Consolidate country information
+domain_facets <- domain_table
+is.na(domain_facets$deployment.country) <- domain_facets$deployment.country == ""
+is.na(domain_facets$country) <- domain_facets$country == ""
+domain_facets$country_final <- domain_facets$deployment.country
+keep_country <- which(is.na(domain_facets$deployment.country) & !is.na(domain_facets$country))
+domain_facets$country_final[keep_country] <- domain_facets$country[keep_country]
+is.na(domain_facets$internal.self_started) <- domain_facets$internal.self_started == ""
+
+#Using just deployment.country
+#domain_facets <- domain_table
+#is.na(domain_facets$deployment.country) <- domain_facets$deployment.country == ""
+#domain_facets$country_final <- domain_facets$deployment.country
+
+#Keep only columns of interest from domain_table
+names(domain_facets)[names(domain_facets) == "id"] = "domain_id"
+facets_to_merge <- select(domain_facets, name, domain_id,is_test, country_final,
+ internal.self_started)
+
+#Merge domain facets from domain table into all_monthly table
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+
+#Keep only India domains
+all_monthly <- filter(all_monthly, country_final=="India")
+
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+names(all_monthly)[names(all_monthly) == "domain_id"] = "domain_numeric"
+
+#Convert calendar month to actual date
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+# Table of Q4 2014 and Jan 2015
+q4 <- filter(all_monthly, calendar_month >= "2014-10-01" & calendar_month <= "2014-12-31")
+test <- filter(all_monthly, calendar_month == "2014-12-01")
+jan_2015 <- filter(all_monthly, calendar_month == "2015-01-01")
+
+#Number of active domains per month/quarter
+length(unique(q4$domain))
+length(unique(test$domain))
+length(unique(jan_2015$domain))
diff --git a/analysis_scripts/rdayalu/corr_matrix.R b/analysis_scripts/rdayalu/corr_matrix.R
new file mode 100644
index 0000000..c16581c
--- /dev/null
+++ b/analysis_scripts/rdayalu/corr_matrix.R
@@ -0,0 +1,28 @@
+# 12/15/14
+# Correlation matrix for full blog dataset
+# 57,207 observations
+library(corrplot)
+
+# Make a dataset with only the following variables of interest
+indicators_short = c("nvisits", "active_day_percent", "nforms", "median_visit_duration",
+ "median_visits_per_day", "time_using_cc", "ncases_registered",
+ "register_followup", "case_register_followup_rate",
+ "ncases_touched", "nunique_followups", "ninteractions")
+corr_set <- fullset
+corr_set <- corr_set[,names(corr_set) %in% indicators_short]
+#NB: this renaming relies on the column order of corr_set (inherited from
+#fullset), not on the order of indicators_short; verify the order before renaming
+names(corr_set) <- c("# visits", "% active days", "# forms", "visit duration", "visits/day",
+                     "time using cc", "# interactions", "# cases registered", "# follow-up visits",
+                     "% follow-up visits", "# cases", "# cases followed-up")
+
+#Generate numerical correlation matrix
+num_corr <- cor(corr_set, use="complete.obs")
+num_corr <- round(num_corr, digits = 2)
+
+#Generate visual correlation plot
+corrplot(num_corr, tl.srt=45, tl.cex = 0.6,
+         diag=F, type="upper", order = "AOE")
+cor_plot <- corrplot(num_corr, tl.srt=45, tl.cex = 0.6,
+ diag=F, order = "AOE")
+
+
+
diff --git a/analysis_scripts/rdayalu/data_blog_set.R b/analysis_scripts/rdayalu/data_blog_set.R
new file mode 100644
index 0000000..17e6d68
--- /dev/null
+++ b/analysis_scripts/rdayalu/data_blog_set.R
@@ -0,0 +1,324 @@
+#The purpose of this code is to create a standard dataset that people can use for
+#the "Under the Data Tree" data blog series
+
+#First import monthly_table for all test = F domains then run the following code
+#Be sure to set config_run first
+source(file.path("analysis_scripts","raw_data","data_import.R", fsep = .Platform$file.sep))
+
+
+library(data.table)
+library(zoo)
+detach("package:lubridate")
+library(lubridate)
+library(ggplot2)
+
+#------------------------------------------------------------------------#
+#DATA MANAGEMENT
+#------------------------------------------------------------------------#
+
+# Load system config file
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+source(file.path("data_sources.R"))
+system_conf <- get_system_config(file.path("config_system.json"))
+
+# Get db connection
+# db <- get_db_connection(system_conf)
+
+#List of users by user_type, keeping only mobile users
+#Get user_type table from db (mobile, web, superuser, etc.)
+user_type <- get_user_type_table(db)
+user_type <- filter(user_type, user_type == "mobile")
+user_type <- select(user_type, user_pk, user_type)
+
+#Create aggregate monthly data set
+all_monthly <- monthly_table
+
+#Merge these two lists together, keeping only mobile users
+all_monthly <- merge(all_monthly, user_type, by = "user_pk", all.x = T)
+all_monthly <- filter(all_monthly, user_type == "mobile")
+
+#Set report_options
+report <- run_conf$reports$modules$name
+report_options <- get_report_options(run_conf,report)
+
+#Remove demo users and NA/NONE users
+all_monthly = all_monthly[!(all_monthly$user_id =="demo_user"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="NONE"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="none"),]
+all_monthly = all_monthly[!is.na(all_monthly$user_id),]
+
+#Remove any dates before report start_date and after report end_date
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+start_date = as.Date(report_options$start_date)
+end_date = as.Date(report_options$end_date)
+all_monthly = subset(all_monthly, all_monthly$date_first_visit >= start_date
+ & all_monthly$date_last_visit <= end_date)
+report_end_date <- as.Date(report_options$end_date)
+end_month <- as.yearmon(report_end_date)
+end_month <- parse_date_time(paste('01', end_month), '%d %b %Y!')
+end_month <- as.Date(end_month)
+
+#Bring in domain_numeric
+#domain <- tbl(db, "domain")
+#domain <- collect(domain)
+#domain <- select(domain, id, name)
+#write.csv(domain, file = "domain_master_list.csv")
+
+#Merge domain ID into all_monthly table
+#all_monthly <- merge(all_monthly, domain, by.x = "domain",
+# by.y = "name", all.x = T)
+
+#Merge domain facets into all_monthly table
+#Prepare domain_table for merging in domain facets
+#Bring in sector information
+sector <- tbl(db, "sector")
+sector <- collect(sector)
+names(sector)[names(sector) == "name"] = "sector_final"
+domain_sector <- tbl(db, "domain_sector")
+domain_sector <- collect(domain_sector)
+domain_sector <- select(domain_sector, domain_id, sector_id)
+domain_table <- merge(domain_table, domain_sector, by.x = "id", by.y = "domain_id", all.x = T)
+domain_table <- merge(domain_table, sector, by.x = "sector_id", by.y = "id", all.x = T)
+#Bring in subsector information
+subsector <- tbl(db, "subsector")
+subsector <- collect(subsector)
+subsector <- select(subsector, id, name)
+subsector <- filter(subsector, !is.na(name))
+subsector <- filter(subsector, name != "")
+names(subsector)[names(subsector) == "name"] = "subsector_final"
+domain_subsector <- tbl(db, "domain_subsector")
+domain_subsector <- collect(domain_subsector)
+domain_subsector <- select(domain_subsector, domain_id, subsector_id)
+domain_table <- merge(domain_table, domain_subsector, by.x = "id", by.y = "domain_id", all.x = T)
+domain_table <- merge(domain_table, subsector, by.x = "subsector_id", by.y = "id", all.x = T)
+#Consolidate country information
+is.na(domain_table$deployment.country) <- domain_table$deployment.country == ""
+is.na(domain_table$country) <- domain_table$country == ""
+domain_table$country_final <- domain_table$deployment.country
+keep_country <- which(is.na(domain_table$deployment.country) & !is.na(domain_table$country))
+domain_table$country_final[keep_country] <- domain_table$country[keep_country]
+#Consolidate Dimagi level of support
+is.na(domain_table$internal.services) <- domain_table$internal.services == ""
+is.na(domain_table$internal.self_started) <- domain_table$internal.self_started == ""
+domain_table$self_start[domain_table$internal.self_started == "True"] <- "self"
+domain_table$dimagi_services <- domain_table$internal.services
+keep_self <- which(is.na(domain_table$internal.services) & !is.na(domain_table$self_start))
+domain_table$dimagi_services[keep_self] <- domain_table$self_start[keep_self]
+
+#Keep only columns of interest
+names(domain_table)[names(domain_table) == "id"] = "domain_id"
+facets_to_merge <- select(domain_table, name, domain_id, country_final, sector_final,
+ subsector_final, dimagi_services, test)
+
+#Merge domain facets from domain table into all_monthly table
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+#names (all_monthly)[names(all_monthly) == "id"] = "domain_numeric"
+names(all_monthly)[names(all_monthly) == "domain_id"] = "domain_numeric"
+
+#Convert calendar month to actual date
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+#Exclude any users who logged > 100 visits in any month
+#These are probably atypical users
+all_monthly$visits_ge_100 <- all_monthly$nvisits > 100
+user_ge_100 <- all_monthly %.%
+ group_by(user_id) %.%
+ summarise(ge_100 = sum(visits_ge_100))
+user_le_100 <- filter(user_ge_100, ge_100 == 0)
+all_monthly <- all_monthly[all_monthly$user_id %in% user_le_100$user_id, ]
+
+#Calculate day gaps between consecutive calendar_month values to derive the
+#previous_month_active and next_month_active variables
+all_monthly <- arrange(all_monthly, domain_numeric, user_pk, calendar_month)
+df <- data.table(all_monthly)
+setkey(df,user_pk)
+df[,diff_days:=c(NA,diff(calendar_month)),by=user_pk]
+all_monthly <- as.data.frame(df)
+all_monthly$previous_month_active <- all_monthly$diff_days <= 31
+all_monthly$previous_two_months_active <- all_monthly$diff_days <= 62
+all_monthly$previous_three_months_active <- all_monthly$diff_days <= 93
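+#Note: 31/62/93 days are upper bounds on the gap between the 1st of one
+#calendar month and the 1st of the month 1, 2 or 3 months later (the true
+#maxima are 31, 62 and 92 days), so these flags cannot miss a qualifying gap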
+
+#We are working with user_pk/domain combination since user_pk
+#might not be unique across domains. A single user_pk can submit to multiple domains.
+all_monthly$domain_user <- paste(all_monthly$domain, all_monthly$user_pk, sep = "_")
+users <- unique(all_monthly$domain_user)
+
+next_month_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$domain_user == i,]
+ next_active <- c()
+ next_active <- c(single_user$previous_month_active[-1], F)
+ next_month_active <- c(next_month_active, next_active)
+}
+all_monthly$next_month_active <- next_month_active
+
+next_two_months_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$domain_user == i,]
+ next_active <- c()
+ next_active <- c(single_user$previous_two_months_active[-1], F)
+ next_two_months_active <- c(next_two_months_active, next_active)
+}
+all_monthly$next_two_months_active <- next_two_months_active
+
+next_three_months_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$domain_user == i,]
+ next_active <- c()
+ next_active <- c(single_user$previous_three_months_active[-1], F)
+ next_three_months_active <- c(next_three_months_active, next_active)
+}
+all_monthly$next_three_months_active <- next_three_months_active
+
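+#The three loops above could be replaced by a single vectorized step, sketched
+#below (commented out; assumes a dplyr version that provides lead(), and relies
+#on the rows already being sorted by calendar_month within each domain_user):
+#all_monthly <- all_monthly %>%
+#  group_by(domain_user) %>%
+#  mutate(next_month_active = lead(previous_month_active, default = FALSE),
+#         next_two_months_active = lead(previous_two_months_active, default = FALSE),
+#         next_three_months_active = lead(previous_three_months_active, default = FALSE)) %>%
+#  ungroup()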
+#Based on the end_month in our dataset, we don't know if the user will be active in any of
+#the months following end_month. Must change all those attrition values to NA.
+is.na(all_monthly$next_month_active) <- all_monthly$calendar_month == end_month
+is.na(all_monthly$next_two_months_active) <- all_monthly$calendar_month >= end_month - months(1)
+is.na(all_monthly$next_three_months_active) <- all_monthly$calendar_month >= end_month - months(2)
+
+#Get lifetime table for total nunique_followups, active_months per user
+lifetime_table <- get_aggregate_table(db, "aggregate_lifetime_interactions", domains_for_run)
+lifetime_table <- lifetime_table[lifetime_table$user_pk %in% all_monthly$user_pk,]
+
+#Merge nunique_followups, active_months to all_monthly
+lifetime_table <- select(lifetime_table, user_pk, nunique_followups, active_months, calendar_month_on_cc)
+names(lifetime_table)[names(lifetime_table) == "nunique_followups"] = "lifetime_followup"
+names(lifetime_table)[names(lifetime_table) == "calendar_month_on_cc"] = "months_on_cc"
+all_monthly <- merge(all_monthly, lifetime_table, by = "user_pk", all.x = T)
+
+#Lifetime aggregate table is not available on the db as of 2/17/15.
+#I will calculate months_on_cc, active_months here until the lifetime table is available.
+total_months_cc <- all_monthly %>% group_by(domain_user) %>%
+ summarise(first_month = min(calendar_month),
+ last_month = max(calendar_month),
+ active_months = length(unique(calendar_month)))
+total_months_cc$months_on_cc <- (interval(total_months_cc$first_month,
+ total_months_cc$last_month) %/% months(1))+1
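+#Worked example (commented) of the interval arithmetic above: a user first
+#active Jan 2014 and last active Mar 2014 spans 2 whole months, so
+#months_on_cc = 2 + 1 = 3:
+#(interval(as.Date("2014-01-01"), as.Date("2014-03-01")) %/% months(1)) + 1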
+total_months_cc <- select(total_months_cc, domain_user, active_months, months_on_cc)
+all_monthly <- merge(all_monthly, total_months_cc, by = "domain_user", all.x = T)
+
+#Exclude a sample of users (from pradan-mis-dev?) so that each domain contributes only
+#< 5% of the total users
+nusers <- all_monthly %>% group_by(domain) %>% summarise(nusers = length(unique(user_pk)))
+nusers <- arrange(nusers, desc(nusers))
+nusers$total_users <- sum(nusers$nusers)
+nusers$per_users <- (nusers$nusers/nusers$total_users)*100
+exclude_users_pradan <- sample(unique(all_monthly$user_pk[all_monthly$domain == "pradan-mis-dev"]), 299)
+all_monthly <- all_monthly[!(all_monthly$user_pk %in% exclude_users_pradan),]
+
+#Was the user ever active again after an attrition event (defined as next_month_active == F)?
+all_monthly$attrition_event <- !(all_monthly$next_month_active == T | is.na(all_monthly$next_month_active))
+all_monthly$continuing <- all_monthly$month_index < all_monthly$months_on_cc
+all_monthly$ever_active_again <- all_monthly$attrition_event == T & all_monthly$continuing == T
+is.na(all_monthly$ever_active_again) <- all_monthly$attrition_event == F
+
+all_monthly <- select(all_monthly, -c(user_id, domain, test, visits_ge_100))
+
+#Flag users from typical FLW domains
+#Note that in the future it would be better to flag at the user level (based on apps)
+#rather than at the domain level, as is done here.
+typical_flw_domains <-
+c("aaharsneha", "aarohi", "acf", "aed-hth", "arogyasarita", "care-ecd",
+ "chasssmt-moz", "crc-intervention", "crhp", "crs-catch", "crs-remind",
+ "crs-senegal", "dtree-familyplanning", "engender-ethiopia-pilot", "icap-tb",
+ "kawok-malaria-p", "keiskamma", "kgvk", "maternalznz", "nutritionmeast", "opm",
+ "pasmo-nicaragua-dmg", "pci-india", "project", "puami-tsf-mnch-myanmar", "rdi-hiht",
+ "savethechildren", "savethechildren-nepal", "slttc", "ssqh-cs", "teba-hbc",
+ "tulasalud", "world-renew", "wvindia", "wvindia-nutrition", "wvindia2",
+ "wvindonesia", "wvug", "yonsei-emco")
+
+blog <- merge(all_monthly, domain_master_list, by = "domain_numeric", all.x = T)
+blog$typical_flw <- blog$domain %in% typical_flw_domains
+blog <- select(blog, -domain)
+
+#write.csv(blog, file = "blog_data_2_13_15.csv")
+
+#----------------------------------------------------------------------#
+#Older code - not used to create future blog datasets
+#----------------------------------------------------------------------#
+
+#Import mobile and web user lists
+#Note that we need to merge/select by user_id instead of user_pk for now because we
+#don't currently have complete/correct user_pk for all user_ids. Melissa will be changing this.
+#The mobile user table has duplicate user_ids only for demo_user, ntest, etc.
+#Otherwise, all user_ids are unique. Dedup this anyway.
+mobile_users <- read.csv(file = "mobile_users.csv", na.strings = "")
+mobile_users <- unique(mobile_users)
+
+#The web_user table does have duplicate user_ids, but since user_pk is different for those users
+#(even though we know they are the same users), we need to exclude user_pk and dedup web_users
+web_users <- read.csv(file = "web_users1.csv", na.strings = "")
+web_users <- select(web_users, user_id, username, is_dimagi)
+web_users <- unique(web_users)
+
+summary(mobile_users$user_id %in% web_users$user_id)
+summary(all_monthly$user_id %in% web_users$user_id)
+summary(all_monthly$user_id %in% mobile_users$user_id)
+
+#All users in mobile_users are true mobile users
+mobile_users$user_type <- "mobile"
+
+#All users in web_users are web | dimagi users
+web_users$user_type[web_users$is_dimagi == "t"] <- "dimagi"
+web_users$user_type[web_users$is_dimagi == "f"] <- "web"
+web_users <- select(web_users, user_id, username, user_type)
+
+#Combine into one dataset
+user_type <- rbind(mobile_users, web_users)
+
+all_monthly$domain_numeric = as.numeric(as.factor(all_monthly$domain))
+
+#Keep only mobile/unknown users
+all_monthly <- merge(all_monthly, user_type, by = "user_id", all.x = T)
+all_monthly$user_type[is.na(all_monthly$user_type)] <- "unknown"
+all_monthly <- filter(all_monthly, user_type == "mobile" | user_type == "unknown")
+
+# Check # and % of users per domain
+nusers <- all_monthly %>% group_by(domain) %>% summarise(n_users = length(unique(user_pk)))
+nusers$per_users <- (nusers$n_users/length(unique(all_monthly$user_pk)))*100
+nusers <- arrange(nusers, desc(per_users))
+
+#Frequency of nunique_followups for all relevant users
+table(lifetime_table$lifetime_followup, useNA = "always")
+
+myhist <- ggplot(lifetime_table, aes(x=lifetime_followup)) +
+ geom_histogram(binwidth=1, colour="black", fill="lightblue") +
+ scale_x_continuous(limits=c(0,25)) +
+ geom_vline(aes(xintercept=median(lifetime_followup, na.rm=T)),
+ color="red", linetype="dashed", size=1)
+
+lifetime_table$zero_reg <- lifetime_table$ncases_registered == 0
+lifetime_table$one_reg <- lifetime_table$ncases_registered == 1
+lifetime_table$two_reg <- lifetime_table$ncases_registered == 2
+lifetime_table$three_plus_reg <- lifetime_table$ncases_registered >= 3
+lifetime_table$na_reg <- is.na(lifetime_table$ncases_registered)
+
+lifetime_table$zero_fu <- lifetime_table$nunique_followups == 0
+lifetime_table$one_fu <- lifetime_table$nunique_followups == 1
+lifetime_table$two_fu <- lifetime_table$nunique_followups == 2
+lifetime_table$three_plus_fu <- lifetime_table$nunique_followups >= 3
+lifetime_table$na_fu <- is.na(lifetime_table$nunique_followups)
+
+nusers_lifetime <- lifetime_table %>% group_by(domain) %>%
+ summarise(nusers_zero_reg = sum(zero_reg, na.rm = T)/length(unique(user_pk))*100,
+ nusers_one_reg = sum(one_reg, na.rm = T)/length(unique(user_pk))*100,
+ nusers_two_reg = sum(two_reg, na.rm = T)/length(unique(user_pk))*100,
+ nusers_three_plus_reg = sum(three_plus_reg, na.rm = T)/length(unique(user_pk))*100,
+ nusers_na_reg = sum(na_reg, na.rm = T)/length(unique(user_pk))*100,
+ nusers_zero_fu = sum(zero_fu, na.rm = T)/length(unique(user_pk))*100,
+ nusers_one_fu = sum(one_fu, na.rm = T)/length(unique(user_pk))*100,
+ nusers_two_fu = sum(two_fu, na.rm = T)/length(unique(user_pk))*100,
+ nusers_three_plus_fu = sum(three_plus_fu, na.rm = T)/length(unique(user_pk))*100,
+ nusers_na_fu = sum(na_fu, na.rm = T)/length(unique(user_pk))*100,
+ nusers = length(unique(user_pk)))
+
+nusers_lifetime <- arrange(nusers_lifetime, desc(nusers))
+
diff --git a/analysis_scripts/rdayalu/domains_with_forms.R b/analysis_scripts/rdayalu/domains_with_forms.R
new file mode 100644
index 0000000..2b61c8e
--- /dev/null
+++ b/analysis_scripts/rdayalu/domains_with_forms.R
@@ -0,0 +1,22 @@
+#Which domains have forms in the DP but do not have monthly rows?
+all_monthly <- test
+domain <- get_domain_table(db)
+domain <- select(domain, id, name, X..Active.Cases, X..Active.Mobile.Workers,
+ is_test, internal.can_use_data)
+
+form_table$form_date <- as.Date(substr(form_table$time_start, 1, 10))
+
+
+domains_with_forms <- form_table %>% group_by(domain_id) %>%
+ summarise(nforms = length(unique(id)),
+ wam_domain = sum(amplifies_workers, na.rm=T) > 0,
+ earliest_form_date = min(form_date))
+domains_with_forms <- merge(domains_with_forms, domain,
+ by.x = "domain_id", by.y = "id", all.x = T)
+domains_with_forms$has_monthly_rows <- domains_with_forms$name %in% all_monthly$domain
+test <- filter(domains_with_forms, has_monthly_rows == F)
+test <- arrange(test, earliest_form_date)
+
+test2 <- filter(test, wam_domain == T)
+write.csv(test, file="domains_without_monthly_rows.csv")
+
diff --git a/analysis_scripts/rdayalu/general_visuals.R b/analysis_scripts/rdayalu/general_visuals.R
new file mode 100644
index 0000000..2421a9a
--- /dev/null
+++ b/analysis_scripts/rdayalu/general_visuals.R
@@ -0,0 +1,18 @@
+#Visualize tenure on CC
+
+month_count <- master_set %.%
+ group_by(domain, user_id) %.%
+ summarise(months_on_cc = length(unique(calendar_month)))
+
+month_count <- tula_typical %.%
+ group_by(domain, user_id) %.%
+ summarise(months_on_cc = length(unique(calendar_month)))
+
+#Histogram
+# Include all observations
+myhist <- ggplot(n_apps, aes(x=n_apps)) +
+ geom_histogram(binwidth=1, colour="black", fill="lightblue") +
+ geom_vline(aes(xintercept=median(n_apps, na.rm=T)),
+ color="red", linetype="dashed", size=1)
+
+cumulative_dist <- plot(ecdf(month_count$months_on_cc))
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/google_analytics_WAR.R b/analysis_scripts/rdayalu/google_analytics_WAR.R
new file mode 100644
index 0000000..6797e09
--- /dev/null
+++ b/analysis_scripts/rdayalu/google_analytics_WAR.R
@@ -0,0 +1,41 @@
+# Google analytics - worker activity reports (WAR)
+# This script integrates csv downloads from GA. Based on user_id,
+# WAR usage statistics will be linked to the users table and eventually to
+# the monthly_table. Only non-Dimagi users will be considered.
+# 2/9/15
+
+#Import GA csv
+ga <- read.csv(file = "GA_aug_2014.csv")
+domain_name <- sub(".*/a/", "", ga$page)
+ga$domain <- sub("/reports/.*", "", domain_name)
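+#e.g. a hypothetical page value "https://www.commcarehq.org/a/tulasalud/reports/worker_activity"
+#reduces to domain = "tulasalud" after the two sub() calls above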
+
+#Import user table and user_type table
+users <- tbl(db, "users")
+users <- collect(users)
+user_type <- get_user_type_table(db)
+user_type <- collect(user_type)
+
+#Merge user tables
+users <- merge(users, user_type, by.x = "id", by.y = "user_pk", all.x = T)
+users <- select(users, -c(user_id.y, username.y))
+names(users)[names(users) == "user_id.x"] = "user_id"
+names(users)[names(users) == "username.x"] = "username"
+
+#Exclude dimagi users and superusers from GA stats
+dimagi <- grep("dimagi", users$email, fixed=T)
+dimagi_users <- users[dimagi,]
+non_dimagi_users <- users[!(users$id %in% dimagi_users$id),]
+non_dimagi_users <- filter(non_dimagi_users, is_superuser == F | is.na(is_superuser))
+
+#Merge ga values with user values
+ga_users <- merge(ga, non_dimagi_users, by = "user_id", all.x = T)
+names(ga_users)[names(ga_users) == "id"] = "user_pk"
+
+#Exclude users with user_type = NA.
+#These users were not in the "non_dimagi_users" list, so we don't want to include them.
+ga_users <- filter(ga_users, !is.na(user_type))
+
+#Merge with monthly table of interest
+#all_monthly <- filter(monthly_table, month.index == "Aug 2014")
+#ga_users <- merge(ga_users, all_monthly, by = "user_pk", all.x = T)
+
diff --git a/analysis_scripts/rdayalu/hq_dp.R b/analysis_scripts/rdayalu/hq_dp.R
new file mode 100644
index 0000000..3d4c5da
--- /dev/null
+++ b/analysis_scripts/rdayalu/hq_dp.R
@@ -0,0 +1,238 @@
+#This code is to get 100% alignment between DP and HQ, especially in terms
+#(1) number of active users per month, (2) user type, (3) general # of forms.
+#Number of forms might be difficult to get 100% alignment on.
+
+#2/16/15
+library(lubridate)
+library(dplyr)
+#Upload HQ data
+forms_hq <- read.csv(file="test.csv")
+names(forms_hq)[names(forms_hq) == "user_id"] = "user_id_hq"
+names(forms_hq)[names(forms_hq) == "domain"] = "domain_hq"
+names(forms_hq)[names(forms_hq) == "is_test"] = "is_test_hq"
+names(forms_hq)[names(forms_hq) == "user_type"] = "user_type_hq"
+forms_hq$domain_hq <- as.character(forms_hq$domain_hq)
+forms_hq$hq <- T
+
+#Upload DP data
+#forms_dp <- read.csv(file="forms_dp.csv")
+#forms_dp$form_date <- as.Date(forms_dp$form_date)
+forms_dp <- read.csv(file="forms_nov_received.csv")
+forms_dp$form_date_received <- as.Date(forms_dp$form_date_received)
+forms_dp <- select(forms_dp, -X)
+names(forms_dp)[names(forms_dp) == "user_id"] = "user_pk"
+domains <- read.csv(file="domain_master_list.csv")
+names(domains)[names(domains) == "id"] = "domain_pk"
+users <- read.csv(file="users.csv")
+names(users)[names(users) == "id"] = "user_pk"
+domains <- select(domains, -X)
+users <- select(users, user_pk, user_id)
+users$user_id <- as.character(users$user_id)
+forms_dp <- merge(forms_dp, domains, by.x = "domain_id", by.y = "domain_pk",
+ all.x = T)
+forms_dp <- merge(forms_dp, users, by="user_pk", all.x = T)
+names(forms_dp)[names(forms_dp) == "user_id"] = "user_id_dp"
+names(forms_dp)[names(forms_dp) == "name"] = "domain_dp"
+forms_dp$dp <- T
+forms_dp$domain_dp <- as.character(forms_dp$domain_dp)
+
+#Merge together
+hq_dp <- merge(forms_hq, forms_dp, by = "form_id", all = T)
+hq_dp$user_id_hq <- as.character(hq_dp$user_id_hq)
+hq_dp$user_id_dp <- as.character(hq_dp$user_id_dp)
+hq_dp$domain_dp <- as.character(hq_dp$domain_dp)
+hq_dp$domain_hq <- as.character(hq_dp$domain_hq)
+
+#Tables of users in HQ but not in DP and vice versa
+hq_not_in_dp <- filter(hq_dp, is.na(dp))
+dp_not_in_hq <- filter(hq_dp, is.na(hq))
+hq_dp_same <- filter(hq_dp, dp == T & hq == T)
+
+#Flag users in HQ but not at all in DP user table
+hq_not_in_dp$not_dp_user <- !(hq_not_in_dp$user_id_hq %in% users$user_id)
+
+#Flag domains in HQ that do not have any forms per DP
+domains_dp_nov <- as.character(unique(forms_dp$domain_dp))
+domains_hq_nov <- as.character(unique(forms_hq$domain_hq))
+hq_not_in_dp$not_dp_domain_nov <- !(hq_not_in_dp$domain_hq %in% domains_dp_nov)
+dp_domain_pk_with_forms <- read.csv(file = "dp_domain_pk_with_forms.csv")
+dp_domain_pk_with_forms <- select(dp_domain_pk_with_forms, -X)
+names(dp_domain_pk_with_forms)[names(dp_domain_pk_with_forms) ==
+ "unique.form_table.domain_id."] = "domain_pk"
+dp_domain_pk_with_forms <- merge(dp_domain_pk_with_forms, domains,
+ by="domain_pk", all.x = T)
+dp_domain_pk_with_forms$name <- as.character(dp_domain_pk_with_forms$name)
+hq_not_in_dp$not_dp_domain_ever <- !(hq_not_in_dp$domain_hq %in%
+ dp_domain_pk_with_forms$name)
+
+#Flag users in HQ that do not have any forms per DP
+user_dp_has_forms <- read.csv(file = "user_dp_has_forms.csv")
+user_dp_has_forms <- select(user_dp_has_forms, -X)
+names(user_dp_has_forms)[names(user_dp_has_forms) ==
+ "unique.form_table.user_id."] = "user_pk"
+user_dp_has_forms <- merge(user_dp_has_forms, users,
+ by="user_pk", all.x = T)
+user_dp_has_forms$user_id <- as.character(user_dp_has_forms$user_id)
+hq_not_in_dp$not_dp_user_ever <- !(hq_not_in_dp$user_id_hq %in%
+ user_dp_has_forms$user_id)
+
+sample(hq_not_in_dp$user_id_hq[hq_not_in_dp$not_dp_user_ever == F], 5,
+ replace = FALSE)
+test <- filter(hq_dp, user_id_hq == "59f0d35f0cd17844630742144a1b0c20")
+
+#Analyze discrepancies
+summary(hq_dp$user_id_hq == hq_dp$user_id_dp)
+summary(hq_dp$hq)
+summary(hq_dp$dp)
+
+table(hq_not_in_dp$domain_hq, useNA = "always")
+length(unique(hq_not_in_dp$user_id_hq))
+
+table(dp_not_in_hq$domain_dp, useNA = "always")
+length(unique(dp_not_in_hq$user_id_dp))
+
+#DP not in HQ
+#dp_users_not_in_hq <- unique(dp_not_in_hq$user_id_dp)
+#Which of these users are not in November HQ?
+dp_not_in_hq$received_on <- as.Date(dp_not_in_hq$received_on)
+forms_hq$user_id_hq <- as.character(forms_hq$user_id_hq)
+dp_not_in_hq$user_not_in_hq_nov <- !(dp_not_in_hq$user_id_dp %in%
+ forms_hq$user_id_hq)
+test <- filter(dp_not_in_hq, user_not_in_hq_nov == T)
+user_type <- read.csv(file="user_type.csv")
+user_type <- select(user_type, user_pk, deactivated, deleted)
+test <- merge(test, user_type, by="user_pk", all.x = T)
+deleted_users <- filter(test, deleted == T)
+non_deleted_users <- filter(test, deleted == F | is.na(deleted))
+non_deleted_users <- filter(non_deleted_users, deactivated == F | is.na(deactivated))
+
+
+# 2/15/15
+form_table$form_date <- substr(form_table$time_start, 1, 10)
+form_table$form_date <- as.Date(form_table$form_date)
+form_table$form_date_received <- substr(form_table$received_on, 1, 10)
+form_table$form_date_received <- as.Date(form_table$form_date_received)
+forms_nov <- filter(form_table, form_date >= "2014-11-01" & form_date <= "2014-11-30")
+dp_domain_pk_with_forms <- data.frame(unique(form_table$domain_id))
+write.csv(dp_domain_pk_with_forms, file = "dp_domain_pk_with_forms.csv")
+test <- filter(all_monthly, user_id == "1b9883444260f98aa44c2bc397dc9a9a")
+
+# 2/4/15
+hq <- read.csv(file="hq_nov_2014.csv")
+dp <- read.csv(file="dp_nov_2014.csv")
+
+hq$dup_id <- duplicated(hq$user_id) | duplicated(hq$user_id, fromLast=T)
+dp$dup_id <- duplicated(dp$user_id) | duplicated(dp$user_id, fromLast=T)
+
+dp_unique <- dp[dp$dup_id == F,]
+test <- data.frame(unique(dp$user_id[duplicated(dp$user_id)]))
+
+summary(dp_unique$user_id %in% hq$user_id)
+summary(hq$user_id %in% dp_unique$user_id)
+test <- data.frame(unique(dp_unique$user_id[!(dp_unique$user_id %in% hq$user_id)]))
+test <- data.frame(unique(hq$user_id[!(hq$user_id %in% dp_unique$user_id)]))
+test <- dp_unique[!(dp_unique$user_id %in% hq$user_id),]
+test <- hq[!(hq$user_id %in% dp_unique$user_id),]
+
+#Change column names as needed
+names(dp_unique)[names(dp_unique) == "domain"] = "domain_dp"
+names(dp_unique)[names(dp_unique) == "user_type"] = "user_type_dp"
+names(dp_unique)[names(dp_unique) == "is_test"] = "is_test_dp"
+names(dp_unique)[names(dp_unique) == "nforms"] = "nforms_dp"
+names(dp_unique)[names(dp_unique) == "ncases_touched"] = "ncases_dp"
+names(hq)[names(hq) == "domain"] = "domain_hq"
+names(hq)[names(hq) == "user_type"] = "user_type_hq"
+names(hq)[names(hq) == "is_test"] = "is_test_hq"
+names(hq)[names(hq) == "nforms"] = "nforms_hq"
+names(hq)[names(hq) == "ncase"] = "ncases_hq"
+
+#Merge two tables
+hq_dp <- merge(hq, dp_unique, by = "user_id", all.x = F)
+
+#Check domains
+hq_dp$domain_dp <- as.character(hq_dp$domain_dp)
+hq_dp$domain_hq <- as.character(hq_dp$domain_hq)
+summary(hq_dp$domain_dp == hq_dp$domain_hq)
+test <- hq_dp[!(hq_dp$domain_dp == hq_dp$domain_hq),]
+test <- select(test, user_id)
+
+#Check user_type
+test <- filter(hq_dp, user_type_dp == "web")
+
+#Check test domains
+hq_dp$is_test_hq <- as.character(hq_dp$is_test_hq)
+hq_dp$is_test_dp <- as.character(hq_dp$is_test_dp)
+hq_dp$is_test_hq[hq_dp$is_test_hq == ""] <- "none"
+hq_dp$is_test_hq[hq_dp$is_test_hq == "FALSE"] <- "false"
+hq_dp$is_test_hq[hq_dp$is_test_hq == "TRUE"] <- "true"
+summary(hq_dp$is_test_hq == hq_dp$is_test_dp)
+test <- hq_dp[!(hq_dp$is_test_hq == hq_dp$is_test_dp),]
+
+#Check # forms and # cases
+summary(hq_dp$nforms_dp == hq_dp$nforms_hq)
+summary(hq_dp$ncases_dp == hq_dp$ncases_hq)
+
+#1/9/2015
+#First import monthly_table for all domains
+#Set permitted_data_only = false
+source(file.path("analysis_scripts","raw_data","data_import.R", fsep = .Platform$file.sep))
+
+all_monthly <- monthly_table
+
+#Merge domain facets from domain table into all_monthly table
+facets_to_merge <- select(domain_table, name, is_test, active)
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+
+#Format date variables
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+
+#Convert calendar month to actual date
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+
+#Keep only rows for Nov 2014 to compare with HQ
+Nov_2014 <- filter(all_monthly, calendar_month == "2014-11-01")
+
+#Get user_type from db and merge to Nov_2014
+user_type <- get_user_type_table(db)
+user_type <- select(user_type, user_pk, username, user_type, is_superuser)
+Nov_2014 <- merge(Nov_2014, user_type, by = "user_pk", all.x = T)
+
+#Keep variables of interest per Neal's 1/9/15 email
+Nov_2014 <- select(Nov_2014, user_pk, user_id, username, domain, user_type,
+ summary_device_type, is_test, nforms, ncases_touched, is_superuser)
+
+#Create csv file
+write.csv(Nov_2014, file = "Nov_2014_dp.csv")
+
+#------------------------------------------------------------------------#
+#Older code
+#------------------------------------------------------------------------#
+nusers <- monthly_table %.%
+ group_by(domain, month.index) %.%
+ summarise (nusers = length(unique(user_id)),
+ nandroid = sum(summary_device_type == "Android", na.rm = T),
+ nnokia = sum(summary_device_type == "Nokia", na.rm = T),
+ ncloudcare = sum(summary_device_type == "Cloudcare", na.rm = T),
+ nmulti = sum(summary_device_type == "Multi", na.rm = T),
+ nnone = sum(summary_device_type == "None", na.rm = T),
+ nother = sum(summary_device_type == "Other", na.rm = T),
+ nmissing = sum(is.na(summary_device_type)))
+
+nusers_aug <- filter(nusers, month.index == "Aug 2014")
+
+#Merge domain facets from domain table into all_monthly table
+to_merge <- select(domains_HQ, Project, X..Active.Mobile.Workers, X..Mobile.Workers,
+ X..Mobile.Workers..Submitted.Form., X..Web.Users)
+
+nusers_aug <- merge(nusers_aug, to_merge, by.x = "domain",
+ by.y = "Project", all.x = T)
+
+nusers_aug <- arrange(nusers_aug, desc(nusers))
+
+nusers_aug$nusers_diff <- nusers_aug$nusers - nusers_aug$X..Active.Mobile.Workers
+
+write.csv(nusers_aug, file = "nusers_dp_hq_v2.csv")
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/improvement_table.R b/analysis_scripts/rdayalu/improvement_table.R
new file mode 100644
index 0000000..2994dd4
--- /dev/null
+++ b/analysis_scripts/rdayalu/improvement_table.R
@@ -0,0 +1,271 @@
+# Create improvement table for Gates grant
+# This table is to see if WAR usage (based on GA stats) can be correlated with
+# improvement in CommCare usage (as measured by % active days)
+# https://docs.google.com/a/dimagi.com/document/d/1Lx_q9jVaDORTSkBE6iEvPqlC2rrrRSOYB590Cq5AZmw/edit
+# 2/11/15
+
+#Create aggregate monthly data set
+all_monthly <- monthly_table
+
+#Set report_options
+report = run_conf$reports$modules$name
+report_options <- get_report_options(run_conf,report)
+
+#Remove demo users and NA/NONE users
+all_monthly = all_monthly[!(all_monthly$user_id =="demo_user"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="NONE"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="none"),]
+all_monthly = all_monthly[!is.na(all_monthly$user_id),]
+
+#Remove any dates before report start_date and after report end_date
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+start_date = as.Date(report_options$start_date)
+end_date = as.Date(report_options$end_date)
+all_monthly = subset(all_monthly, all_monthly$date_first_visit >= start_date
+ & all_monthly$date_last_visit <= end_date)
+report_end_date <- as.Date(report_options$end_date)
+end_month <- as.yearmon(report_end_date)
+end_month <- parse_date_time(paste('01', end_month), '%d %b %Y!')
+end_month <- as.Date(end_month)
+
+#Keep only columns of interest from domain_table
+names(domain_table)[names(domain_table) == "id"] = "domain_id"
+facets_to_merge <- select(domain_table, name, domain_id)
+
+#Merge domain facets from domain table into all_monthly table
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+names(all_monthly)[names(all_monthly) == "domain_id"] = "domain_numeric"
+
+#Convert calendar month to actual date
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+# Table for 2014 users
+all_2014 <- filter(all_monthly, calendar_month >= "2014-01-01" & calendar_month <= "2014-12-31")
+
+# Tables of 4 quarters
+q1 <- filter(all_monthly, calendar_month >= "2014-01-01" & calendar_month <= "2014-03-31")
+q2 <- filter(all_monthly, calendar_month >= "2014-04-01" & calendar_month <= "2014-06-30")
+q3 <- filter(all_monthly, calendar_month >= "2014-07-01" & calendar_month <= "2014-09-30")
+q4 <- filter(all_monthly, calendar_month >= "2014-10-01" & calendar_month <= "2014-12-31")
+
+# Exclude users who submitted data to > 1 domain
+# Create dataset for all active users in 2014
+n_domains <- all_2014 %>% group_by(user_pk) %>% summarise(ndomains = length(unique(domain)))
+n_domains <- filter(n_domains, ndomains == 1)
+all_2014 <- all_2014[all_2014$user_pk %in% n_domains$user_pk,]
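The exclusion above keeps a user only if all of their 2014 rows map to a single domain. A minimal Python sketch of the same filter (toy rows; the R code derives this with `group_by`/`summarise` on `user_pk`):

```python
# Keep only users who submitted to exactly one domain, mirroring the
# n_domains filter above (toy data; real rows come from all_2014)
from collections import defaultdict

rows = [("u1", "d1"), ("u1", "d2"), ("u2", "d1")]  # (user_pk, domain)
domains_per_user = defaultdict(set)
for user, domain in rows:
    domains_per_user[user].add(domain)

single_domain = {u for u, ds in domains_per_user.items() if len(ds) == 1}
kept = [r for r in rows if r[0] in single_domain]
print(kept)  # -> [('u2', 'd1')]
```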
+
+# Create user 2014 table
+user_2014 <- all_2014 %>%
+ group_by(user_pk) %>%
+ summarise(user_id = unique(user_id),
+ domain = unique(domain),
+ domain_numeric = unique(domain_numeric))
+
+# Calculate quarterly active day % only for users who were active for the full quarter
+# of interest
+n_active_months_q1 <- q1 %>% group_by(user_pk) %>%
+ summarise(nmos = length(unique(calendar_month)),
+ active_days_q1 = sum(active_days),
+ total_days_q1 = as.numeric(as.Date("2014-04-01") - as.Date("2014-01-01")),
+ per_active_days_q1 = (active_days_q1/total_days_q1)*100,
+ med_per_active_days_q1 = median(active_day_percent, na.rm = T),
+ nmos_ge_4_days_active = sum(active_days >= 4))
+n_active_months_q1$ge_4_days_active_q1 <- n_active_months_q1$nmos_ge_4_days_active == 3
+n_active_months_q1 <- filter(n_active_months_q1, nmos == 3)
+n_active_months_q1 <- select(n_active_months_q1, user_pk, per_active_days_q1,
+ med_per_active_days_q1, ge_4_days_active_q1)
+user_2014 <- merge(user_2014, n_active_months_q1, by = "user_pk", all.x = T)
+
+n_active_months_q2 <- q2 %>% group_by(user_pk) %>%
+ summarise(nmos = length(unique(calendar_month)),
+ active_days_q2 = sum(active_days),
+ total_days_q2 = as.numeric(as.Date("2014-07-01") - as.Date("2014-04-01")),
+ per_active_days_q2 = (active_days_q2/total_days_q2)*100,
+ med_per_active_days_q2 = median(active_day_percent, na.rm = T),
+ nmos_ge_4_days_active = sum(active_days >= 4))
+n_active_months_q2$ge_4_days_active_q2 <- n_active_months_q2$nmos_ge_4_days_active == 3
+n_active_months_q2 <- filter(n_active_months_q2, nmos == 3)
+n_active_months_q2 <- select(n_active_months_q2, user_pk, per_active_days_q2,
+ med_per_active_days_q2, ge_4_days_active_q2)
+user_2014 <- merge(user_2014, n_active_months_q2, by = "user_pk", all.x = T)
+
+n_active_months_q3 <- q3 %>% group_by(user_pk) %>%
+ summarise(nmos = length(unique(calendar_month)),
+ active_days_q3 = sum(active_days),
+ total_days_q3 = as.numeric(as.Date("2014-10-01") - as.Date("2014-07-01")),
+ per_active_days_q3 = (active_days_q3/total_days_q3)*100,
+ med_per_active_days_q3 = median(active_day_percent, na.rm = T),
+ nmos_ge_4_days_active = sum(active_days >= 4))
+n_active_months_q3$ge_4_days_active_q3 <- n_active_months_q3$nmos_ge_4_days_active == 3
+n_active_months_q3 <- filter(n_active_months_q3, nmos == 3)
+n_active_months_q3 <- select(n_active_months_q3, user_pk, per_active_days_q3,
+ med_per_active_days_q3, ge_4_days_active_q3)
+user_2014 <- merge(user_2014, n_active_months_q3, by = "user_pk", all.x = T)
+
+n_active_months_q4 <- q4 %>% group_by(user_pk) %>%
+ summarise(nmos = length(unique(calendar_month)),
+ active_days_q4 = sum(active_days),
+ total_days_q4 = as.numeric(as.Date("2015-01-01") - as.Date("2014-10-01")),
+ per_active_days_q4 = (active_days_q4/total_days_q4)*100,
+ med_per_active_days_q4 = median(active_day_percent, na.rm = T),
+ nmos_ge_4_days_active = sum(active_days >= 4))
+n_active_months_q4$ge_4_days_active_q4 <- n_active_months_q4$nmos_ge_4_days_active == 3
+n_active_months_q4 <- filter(n_active_months_q4, nmos == 3)
+n_active_months_q4 <- select(n_active_months_q4, user_pk, per_active_days_q4,
+ med_per_active_days_q4, ge_4_days_active_q4)
+user_2014 <- merge(user_2014, n_active_months_q4, by = "user_pk", all.x = T)
+
+# Lists of users who have been active in particular quarters
+active_q2_q4 <- n_active_months_q2$user_pk[n_active_months_q2$user_pk %in% n_active_months_q4$user_pk]
+active_q2_q3_q4 <- n_active_months_q3$user_pk[n_active_months_q3$user_pk %in% active_q2_q4]
+active_q2_q3 <- n_active_months_q2$user_pk[n_active_months_q2$user_pk %in% n_active_months_q3$user_pk]
+active_q3_q4 <- n_active_months_q3$user_pk[n_active_months_q3$user_pk %in% n_active_months_q4$user_pk]
+active_q1_q2_q3 <- n_active_months_q1$user_pk[n_active_months_q1$user_pk %in% active_q2_q3]
+
+# Flag users who were active in relevant quarters
+user_2014$active_q2_q3_q4 <- user_2014$user_pk %in% active_q2_q3_q4
+user_2014$active_q2_q3 <- user_2014$user_pk %in% active_q2_q3
+user_2014$active_q3_q4 <- user_2014$user_pk %in% active_q3_q4
+user_2014$active_q1_q2_q3 <- user_2014$user_pk %in% active_q1_q2_q3
+
+# Calculate change in % active days per user between quarters of interest
+user_2014$change_q2_q4 <- user_2014$per_active_days_q4 - user_2014$per_active_days_q2
+user_2014$change_q2_q3 <- user_2014$per_active_days_q3 - user_2014$per_active_days_q2
+user_2014$change_q3_q4 <- user_2014$per_active_days_q4 - user_2014$per_active_days_q3
+user_2014$change_q1_q3 <- user_2014$per_active_days_q3 - user_2014$per_active_days_q1
+
+user_2014$change_med_q2_q4 <- user_2014$med_per_active_days_q4 - user_2014$med_per_active_days_q2
+user_2014$change_med_q2_q3 <- user_2014$med_per_active_days_q3 - user_2014$med_per_active_days_q2
+user_2014$change_med_q3_q4 <- user_2014$med_per_active_days_q4 - user_2014$med_per_active_days_q3
+user_2014$change_med_q1_q3 <- user_2014$med_per_active_days_q3 - user_2014$med_per_active_days_q1
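Each quarterly metric above reduces to a simple ratio: total active days in the quarter over calendar days in the quarter, with improvement measured as the difference of those percentages between quarters. A Python sketch with illustrative values (the R code computes this per user via dplyr):

```python
# Percent active days per quarter and the quarter-over-quarter change,
# mirroring per_active_days_q* and change_q2_q4 above (values illustrative)

def per_active_days(active_days_by_month, days_in_quarter):
    # active_days_by_month: one entry per calendar month in the quarter
    return sum(active_days_by_month) / days_in_quarter * 100

q2 = per_active_days([8, 10, 6], 91)    # Apr-Jun 2014: 91 days
q4 = per_active_days([12, 11, 13], 92)  # Oct-Dec 2014: 92 days
change_q2_q4 = q4 - q2                  # positive => usage improved
```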
+
+
+# Create improvement table
+improve_2014 <- all_2014 %>%
+ group_by(domain_numeric) %>%
+ summarise(domain = unique(domain),
+ nusers_2014 = length(unique(user_pk)),
+ nusers_q1 = length(unique(user_pk[calendar_month >= "2014-01-01" & calendar_month <= "2014-03-31"])),
+ nusers_q2 = length(unique(user_pk[calendar_month >= "2014-04-01" & calendar_month <= "2014-06-30"])),
+ nusers_q3 = length(unique(user_pk[calendar_month >= "2014-07-01" & calendar_month <= "2014-09-30"])),
+ nusers_q4 = length(unique(user_pk[calendar_month >= "2014-10-01" & calendar_month <= "2014-12-31"])))
+
+#--------------------------------------------------------------------------------------#
+
+# Populate improvement table: Q2 to Q4
+domain_stats <- user_2014 %>% group_by(domain_numeric) %>%
+ summarise(active_users_q2_q3_q4 = length(user_pk[active_q2_q3_q4 == T]))
+improve_2014 <- merge(improve_2014, domain_stats, by = "domain_numeric", all.x = T)
+
+user_2014_active <- filter(user_2014, active_q2_q3_q4 == T)
+domain_stats_active <- user_2014_active %>%
+ group_by(domain_numeric) %>%
+ summarise(med_change_q2_q4 = median(change_q2_q4, na.rm=T),
+ mean_change_q2_q4 = mean(change_q2_q4, na.rm = T),
+ med_change_med_q2_q4 = median(change_med_q2_q4, na.rm=T),
+ nusers_ge_4_active_days_q2 = sum(ge_4_days_active_q2),
+ nusers_ge_4_active_days_q4 = sum(ge_4_days_active_q4))
+improve_2014 <- merge(improve_2014, domain_stats_active, by = "domain_numeric", all.x = T)
+improve_2014$prop_users_ge_4_active_days_q2 <- improve_2014$nusers_ge_4_active_days_q2/improve_2014$nusers_q2
+improve_2014$prop_users_ge_4_active_days_q4 <- improve_2014$nusers_ge_4_active_days_q4/improve_2014$nusers_q4
+improve_2014$per_change_users_ge_4_active_days_q2_q4 <- 100*(improve_2014$prop_users_ge_4_active_days_q4-improve_2014$prop_users_ge_4_active_days_q2)
+
+# Populate improvement table: Q2 to Q3
+domain_stats <- user_2014 %>% group_by(domain_numeric) %>%
+ summarise(active_users_q2_q3 = length(user_pk[active_q2_q3 == T]))
+improve_2014 <- merge(improve_2014, domain_stats, by = "domain_numeric", all.x = T)
+
+user_2014_active <- filter(user_2014, active_q2_q3 == T)
+domain_stats_active <- user_2014_active %>%
+ group_by(domain_numeric) %>%
+ summarise(med_change_q2_q3 = median(change_q2_q3, na.rm=T),
+ mean_change_q2_q3 = mean(change_q2_q3, na.rm = T))
+improve_2014 <- merge(improve_2014, domain_stats_active, by = "domain_numeric", all.x = T)
+
+# Populate improvement table: Q3 to Q4
+domain_stats <- user_2014 %>% group_by(domain_numeric) %>%
+ summarise(active_users_q3_q4 = length(user_pk[active_q3_q4 == T]))
+improve_2014 <- merge(improve_2014, domain_stats, by = "domain_numeric", all.x = T)
+
+user_2014_active <- filter(user_2014, active_q3_q4 == T)
+domain_stats_active <- user_2014_active %>%
+ group_by(domain_numeric) %>%
+ summarise(med_change_q3_q4 = median(change_q3_q4, na.rm=T),
+ mean_change_q3_q4 = mean(change_q3_q4, na.rm = T))
+improve_2014 <- merge(improve_2014, domain_stats_active, by = "domain_numeric", all.x = T)
+
+# Populate improvement table: Q1 to Q3
+domain_stats <- user_2014 %>% group_by(domain_numeric) %>%
+ summarise(active_users_q1_q2_q3 = length(user_pk[active_q1_q2_q3 == T]))
+improve_2014 <- merge(improve_2014, domain_stats, by = "domain_numeric", all.x = T)
+
+user_2014_active <- filter(user_2014, active_q1_q2_q3 == T)
+domain_stats_active <- user_2014_active %>%
+ group_by(domain_numeric) %>%
+ summarise(med_change_q1_q3 = median(change_q1_q3, na.rm=T),
+ mean_change_q1_q3 = mean(change_q1_q3, na.rm = T))
+improve_2014 <- merge(improve_2014, domain_stats_active, by = "domain_numeric", all.x = T)
+
+#Merge improvement table with GA table from google_analytics_WAR.R
+#First need to aggregate GA table by domain
+war_domain <- ga_users %>% group_by(domain) %>%
+ summarise(nviews = sum(pageviews),
+ nunique_views = sum(unique_pageviews))
+
+# med_time_on_page = median(avg_time_on_page, na.rm=T))
+
+improve_2014 <- merge(improve_2014, war_domain, by = "domain", all.x = T)
+improve_2014$used_war <- !is.na(improve_2014$nviews)
+
+write.csv(improve_2014, file = "improvement_q2_q4_2014.csv")
+
+# Test correlations between WAR usage and CommCare usage
+test <- filter(improve_2014, !is.na(med_change_q1_q3) & used_war == T)
+test <- filter(improve_2014, !is.na(med_change_med_q2_q4) & used_war == T)
+test <- filter(improve_2014, !is.na(per_change_users_ge_4_active_days_q2_q4) & used_war == T)
+
+cor(test$per_change_users_ge_4_active_days_q2_q4, test$nviews)
+cor(test$per_change_users_ge_4_active_days_q2_q4, test$nunique_views)
+
+# T-test for mean CommCare usage between WAR domains and non-WAR domains
+test <- filter(improve_2014, !is.na(per_change_users_ge_4_active_days_q2_q4))
+t.test(test$per_change_users_ge_4_active_days_q2_q4~test$used_war)
+
+#Visuals
+test <- all_2014 %>% group_by(month_abbr) %>%
+ summarise(med_ncases_touched = median(ncases_touched, na.rm=T))
+
+ggplot(data=test, aes(x=month_abbr, y=med_ncases_touched, group=1)) +
+ geom_line() +
+ scale_y_continuous(limits=c(0,40))
+
+ggplot(test, aes(x=med_change_med_q2_q4, y=nunique_views)) +
+ geom_point(shape=1) +
+ #geom_smooth(method=lm) +
+ scale_x_continuous(limits=c(-20,25)) +
+ scale_y_continuous(limits=c(0,22)) +
+ annotate("text", label="r^2 == 0.196", parse = T, x=-15, y=21)
+
+test <- filter(all_2014, calendar_month >= "2014-04-01" & calendar_month <= "2014-09-30")
+nactive_users <- test %>% group_by(user_pk) %>%
+ summarise(nmos = length(unique(calendar_month)))
+nactive_users <- filter(nactive_users, nmos == 6)
+test <- test[test$user_pk %in% nactive_users$user_pk,]
+test$calendar_factor <- as.factor(test$calendar_month)
+test$month_number <- as.numeric(test$calendar_factor)
+cor(test$month_number, test$active_day_percent)
+
+ggplot(test, aes(x=month_number, y=active_day_percent)) +
+ geom_point(shape=1)
+ #geom_smooth(method=lm) +
+ #scale_x_continuous(limits=c(-20,25)) +
+ #scale_y_continuous(limits=c(0,22)) +
+ #annotate("text", label="r^2 == 0.196", parse = T, x=-15, y=21)
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/indicator_evaluation_report.R b/analysis_scripts/rdayalu/indicator_evaluation_report.R
new file mode 100644
index 0000000..8478d52
--- /dev/null
+++ b/analysis_scripts/rdayalu/indicator_evaluation_report.R
@@ -0,0 +1,546 @@
+#The purpose of this code is to create quantitative tests to evaluate our
+#usage indicators. See the following document for more detail:
+#https://docs.google.com/a/dimagi.com/document/d/1hP-ewigPuUwuac8K9Tx-VC9Z8epC03lMrnwqzNWveY8/edit
+
+#Be sure to set config_run first
+source(file.path("analysis_scripts","raw_data","data_import.R", fsep = .Platform$file.sep))
+
+library(data.table)
+library(zoo)
+detach("package:lubridate")
+library(lubridate)
+library(ggplot2)
+library(scales) #to customize ggplot axis labeling
+library(gridExtra) #graphing plots in columns/rows for ggplot
+library(RColorBrewer) #Color palettes
+source('s_dplyr.R')
+
+#------------------------------------------------------------------------#
+#DATA MANAGEMENT
+#------------------------------------------------------------------------#
+
+all_monthly <- monthly_table
+
+#Set report_options
+report = run_conf$reports$modules$name
+report_options <- get_report_options(run_conf,report)
+
+#Remove demo users and NA/NONE users
+#We also need to find a way to exclude admin/web users
+#Keep only confirmed mobile users for now
+all_monthly = all_monthly[!(all_monthly$user_id =="demo_user"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="NONE"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="none"),]
+all_monthly = all_monthly[!is.na(all_monthly$user_id),]
+mobile <- read.csv(file="mobile_users.csv")
+all_monthly$mobile <- all_monthly$user_id %in% mobile$user_id
+all_monthly = filter(all_monthly, mobile == T)
+
+#Remove any dates before report start_date and after report end_date
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+start_date = as.Date(report_options$start_date)
+end_date = as.Date(report_options$end_date)
+all_monthly = subset(all_monthly, all_monthly$date_first_visit >= start_date
+ & all_monthly$date_last_visit <= end_date)
+
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+
+#Create "red herring" indicators
+#domain to numeric: First convert from character to factor
+all_monthly$domain_numeric <- as.numeric(as.factor(all_monthly$domain))
+#User_pk as num_user_pk because we need user_pk as a grouping variable
+all_monthly$num_user_pk <- all_monthly$user_pk
+#Random numbers per row, using wide range (5x nrow), not specifying distribution, mean or sd
+all_monthly$sample_undefined <- sample(1:(5*nrow(all_monthly)), nrow(all_monthly), replace=T)
+#Random numbers per row, normal distribution, defining mean and sd
+all_monthly$sample_normal <- rnorm(nrow(all_monthly), mean = 10, sd = 1)
+#Add sample_percentile variable
+all_monthly$sample_percentile <- sample(1:100, nrow(all_monthly), replace=T)
+
+#Convert calendar month to actual date
+#all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+#Add sample_increase variable: increasing by steady increments of 1
+all_monthly <- arrange(all_monthly, user_pk, calendar_month)
+all_monthly$sample_increase <- c(1:nrow(all_monthly))
+#Add sample_decrease variable: decreasing by steady increments of 1
+all_monthly$sample_decrease <- rev(c(1:nrow(all_monthly)))
+
+#------------------------------------------------------------------------#
+#Tests for usage indicator evaluation
+#------------------------------------------------------------------------#
+
+indicators_to_test = c("nvisits", "active_day_percent", "nforms", "median_visit_duration",
+                       "median_visits_per_day", "time_using_cc", "ninteractions",
+                       "ncases_registered", "register_followup", "case_register_followup_rate",
+                       "ncases_touched", "nunique_followups", "audio_plays", "network_warnings",
+                       "num_user_pk", "domain_numeric", "sample_undefined", "sample_normal",
+                       "sample_percentile", "sample_increase", "sample_decrease")
+
+#Health sector domains to exclude because median
+#cases followed-up per domain == 0
+#Note that we will also want to exclude forms submitted by atypical FLW apps
+#https://docs.google.com/a/dimagi.com/spreadsheets/d/1QwkgRZPR81rQF9h-E7END_ontZye8xQW_WojAObKsgU/edit#gid=0
+#This exclusion will need to happen at the form/app level, so we need to figure
+#out how to do it.
+
+domains_to_exclude <- c("a5288-study",
+ "aed-togo",
+ "agada-tufts-nnos",
+ "aiha-ca",
+ "aphiaplusnc-2012",
+ "bihar-project",
+ "cidrz",
+ "cmmhr",
+ "deoghar",
+ "ekam",
+ "fh-mozambique",
+ "gsid",
+ "ict-women-health",
+ "iicp",
+ "itech-etc",
+ "jhccpmz",
+ "kawok-vc-desarrollo",
+ "mc-inscale",
+ "mc-socialautopsy",
+ "mchip-haryana",
+ "mgh-india",
+ "msf-demo",
+ "mtsinai",
+ "oneworld",
+ "operation-smile",
+ "projectbom",
+ "promot",
+ "reach-india",
+ "sneha-mnh",
+ "special-olympics",
+ "stjohns-soukhya",
+ "union-jharkhand",
+ "wits-ca",
+ "wvmozambique")
+
+all_monthly <- all_monthly[!(all_monthly$domain %in% domains_to_exclude),]
+
+#------------------------------------------------------------------------#
+#Training dataset
+#------------------------------------------------------------------------#
+
+#Picked random 10% of these 73 domains for our training dataset of 8 domains
+#sample_domains <- sample(unique(all_monthly$domain_name), 8)
+#This generated the following vector on first run (below)
+#Note that the list of health domains has increased, but we will stick with this
+#training dataset of 8 domains.
+sample_domains <- c("afguinea", "nsf-lifefirst", "yonsei-emco", "keiskamma",
+ "image-sa", "ictwomenhealth", "fenway", "tulasalud")
+training_set <- all_monthly[all_monthly$domain %in% sample_domains,]
+master_set <- all_monthly
+all_monthly <- training_set
+
+#Create dataset for smallest domain of training set - nsf-lifefirst
+all_monthly_nsf <- subset(all_monthly[all_monthly$domain == "nsf-lifefirst",])
+#Create dataset for largest domain of training set - tulasalud
+all_monthly_tula <- subset(all_monthly[all_monthly$domain == "tulasalud",])
+
+#------------------------------------------------------------------------#
+#Descriptive stats and distribution histograms for the dataset
+#------------------------------------------------------------------------#
+
+all_monthly %.% group_by(domain) %.% summarise(nusers = length(unique(user_id)))
+all_monthly %.% group_by(domain) %.% summarise(nobs = length(user_id))
+all_monthly %.% group_by(domain) %.% summarise(nmonth = length(unique(calendar_month)))
+
+# Histogram overlaid with kernel density curve
+# Overlay with transparent density plot
+# Include all observations
+myhist <- ggplot(all_monthly_tula, aes(x=nvisits)) +
+ geom_histogram(binwidth=0.5, colour="black", fill="lightblue") +
+ geom_vline(aes(xintercept=median(nvisits, na.rm=T)),
+ color="red", linetype="dashed", size=1)
+
+# Exclude outliers
+no_outliers <- filter(all_monthly_tula, nvisits <= 100)
+myhist <- ggplot(no_outliers, aes(x=nvisits)) +
+ geom_histogram(binwidth=1, colour="black", fill="lightblue") +
+ geom_vline(aes(xintercept=median(nvisits, na.rm=T)),
+ color="red", linetype="dashed", size=1)
+ #scale_x_continuous(limits=c(0,20))
+
+#Histograms for training dataset(s)
+myhist <- hist(all_monthly_nsf$sample_normal)
+multiplier <- myhist$counts / myhist$density
+mydensity <- density(all_monthly_nsf$sample_normal)
+mydensity$y <- mydensity$y * multiplier[1]
+plot(myhist)
+lines(mydensity)
+abline(v = mean(all_monthly_nsf$sample_normal), col = "blue", lwd = 2)
+text(11, 7, labels = paste0("sd=", sd(all_monthly_nsf$sample_normal)))
+
+myhist <- hist(all_monthly_nsf$sample_undefined)
+multiplier <- myhist$counts / myhist$density
+mydensity <- density(all_monthly_nsf$sample_undefined)
+mydensity$y <- mydensity$y * multiplier[1]
+plot(myhist)
+lines(mydensity)
+abline(v = mean(all_monthly_nsf$sample_undefined), col = "blue", lwd = 2)
+text(150000, 10, labels = paste0("sd=", sd(all_monthly_nsf$sample_undefined)))
+
+#------------------------------------------------------------------------#
+#Function codes for tests
+#------------------------------------------------------------------------#
+
+#TEST 1
+#Calculate CVs by project by calendar_month
+#First detach plyr because it's doing weird things with sd calculation in dplyr, then
+#check # of rows with sd = NA to make sure that there really is only one observation
+#for that domain for that month
+#nrow(all_monthly[all_monthly$calendar_month == "Feb 2011" & all_monthly$domain_name == "mvp-sauri",])
+#The rows with sd = NA (because of only one month worth of observation)
+#will be excluded from all CV calculations, which is the correct thing to do.
+
+source('s_dplyr.R')
+
+test_1 <- function(indicator, data) {
+
+ test_1_compute <- data %.%
+ group_by(domain, calendar_month) %.%
+ s_summarise(paste0('mean_indicator=mean(', indicator, ', na.rm=TRUE)'),
+ paste0('sd_indicator=sd(', indicator, ', na.rm=TRUE)'))
+
+ test_1_compute$cv = (test_1_compute$sd_indicator/test_1_compute$mean_indicator)*100
+
+ #Compute CV of CVs by project
+ test_1_gp_cv = group_by(test_1_compute, domain)
+ test_1_compute_cv = summarise(test_1_gp_cv,
+ mean_indicator = mean(cv, na.rm = T),
+ sd_indicator = sd(cv, na.rm=T))
+ test_1_compute_cv$cv = (test_1_compute_cv$sd_indicator/test_1_compute_cv$mean_indicator)*100
+ test_1_score = median(test_1_compute_cv$cv)
+ return(test_1_score)
+}
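Test 1's score is a CV of CVs: for each domain, the coefficient of variation of the indicator across users is computed per month, then the CV of those monthly CVs, and the score is the median across domains. A Python sketch of that aggregation (the nested-dict data layout is hypothetical, standing in for the grouped data frame):

```python
from statistics import mean, stdev, median

def cv(xs):
    # Coefficient of variation in percent: sample sd / mean * 100
    return stdev(xs) / mean(xs) * 100

def cv_of_cvs_score(domain_months):
    # domain_months: {domain: {month: [indicator value per user]}}
    domain_cvs = []
    for months in domain_months.values():
        # CV across users within each month; months with a single
        # observation (sd undefined) are dropped, as in the R code
        monthly_cvs = [cv(vals) for vals in months.values() if len(vals) > 1]
        if len(monthly_cvs) > 1:
            domain_cvs.append(cv(monthly_cvs))
    return median(domain_cvs)
```

A domain whose between-user spread is stable month to month scores near zero, which is the desired behaviour for a reliable indicator.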
+
+#Create vector of test scores
+test_1_score_vector <- sapply(indicators_to_test, test_1, data=all_monthly_tula)
+print(test_1_score_vector)
+
+#Can also calculate differences between months per project, but need to consider cases of
+#skipped months - we are assuming no skipped months for now (too simplistic)
+#Also, might want count diff in variance between one month and all other months -
+#will this be a more robust measure rather than just diff between consecutive months?
+#list_diff = tapply(test_1_compute$cv, test_1_compute$domain_name, diff)
+#test_1_score = median(unlist(lapply(list_diff, median, na.rm=T)))
+#------------------------------------------------------------------------#
+
+#------------------------------------------------------------------------#
+#TEST 2
+#Group by project and then by obsnum
+#Similar to (2), users with just one month on CC will be excluded from this analysis
+#Check sd = NA with the following code:
+#nrow(raw_percentile[raw_percentile$user_id == "c536bc72043e1c225ed9e30884c5641a",])
+
+percentile <- function(x) ecdf(x)(x)
+
+test_2 <- function(indicator, data) {
+ percentile_s <- paste0('percentile=percentile(', indicator, ')')
+ raw_percentile <- data %.%
+ group_by(domain, calendar_month) %.%
+ s_mutate(percentile_s)
+
+ test_2_compute <- raw_percentile %.%
+ group_by(domain, user_id) %.%
+ summarise(
+ mean_indicator=mean(percentile, na.rm=TRUE),
+ sd_indicator=sd(percentile, na.rm=TRUE)
+ )
+
+ test_2_compute$cv <- (test_2_compute$sd_indicator / test_2_compute$mean_indicator) * 100
+ test_2_score <- median(test_2_compute$cv, na.rm=TRUE)
+ return(test_2_score)
+}
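The helper `percentile <- function(x) ecdf(x)(x)` evaluates the empirical CDF of a vector at its own values, i.e. each user's within-group rank expressed as a fraction. An equivalent Python sketch:

```python
def ecdf_percentile(values):
    # Fraction of observations <= each value (ties counted),
    # matching R's ecdf(x)(x)
    n = len(values)
    s = sorted(values)
    return [sum(1 for v in s if v <= x) / n for x in values]

# Four users' nvisits in one domain-month (illustrative numbers)
print(ecdf_percentile([3, 10, 7, 10]))  # -> [0.25, 1.0, 0.5, 1.0]
```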
+
+#Create vector of test scores
+test_2_score_vector <- sapply(indicators_to_test, test_2, data = all_monthly_tula)
+print(test_2_score_vector)
+
+#------------------------------------------------------------------------#
+
+#------------------------------------------------------------------------#
+#TEST 3
+#Subset data frames for holiday months and active months
+#Then calculate median indicator values per user for each dataframe
+#For now, we will work with just India domains, counting November as the main
+#holiday month. Confirming with Devu and Mohini if we also need to include
+#December as a holiday month.
+
+#Calculate % change in median indicator values for holiday vs. active months
+#Create datasets for country/countries based on known holiday months
+#To select more than one month, use perl = T. For example:
+#all_monthly[grep("Nov|Dec", all_monthly$calendar_month, perl=T), ]
+
+all_monthly_india <- all_monthly[all_monthly$country == "India",] #data1
+all_monthly_other <- all_monthly[all_monthly$country != "India",] #data2
+
+test_3 <- function(indicator, data1, data2) {
+ holiday_subset1 <- data1[data1$month_abbr == "Oct" | data1$month_abbr == "Nov", ]
+ active_subset1 <- data1[!(data1$month_abbr == "Oct" | data1$month_abbr == "Nov"), ]
+
+ holiday_subset2 <- data2[data2$month_abbr == "Dec", ]
+ active_subset2 <- data2[!(data2$month_abbr == "Dec"), ]
+
+ test_3_compute_holiday1 <- holiday_subset1 %.%
+ group_by(domain, user_id) %.%
+ s_summarise(paste0('median_indicator_holiday=median(', indicator, ', na.rm=TRUE)'))
+
+ test_3_compute_holiday2 <- holiday_subset2 %.%
+ group_by(domain, user_id) %.%
+ s_summarise(paste0('median_indicator_holiday=median(', indicator, ', na.rm=TRUE)'))
+
+ test_3_compute_active1 <- active_subset1 %.%
+ group_by(domain, user_id) %.%
+ s_summarise(paste0('median_indicator_active=median(', indicator, ', na.rm=TRUE)'))
+
+ test_3_compute_active2 <- active_subset2 %.%
+ group_by(domain, user_id) %.%
+ s_summarise(paste0('median_indicator_active=median(', indicator, ', na.rm=TRUE)'))
+
+
+ #Merge the two data frames by domain and user_id and then compute % change
+ merge1 <- merge(test_3_compute_holiday1, test_3_compute_active1,
+ by=c("domain", "user_id"))
+ merge2 <- merge(test_3_compute_holiday2, test_3_compute_active2,
+ by=c("domain", "user_id"))
+
+ test_3_compute <- rbind(merge1, merge2)
+
+ test_3_compute$per_change <-
+ (test_3_compute$median_indicator_holiday-test_3_compute$median_indicator_active)/
+ test_3_compute$median_indicator_active *100
+ test_3_score <- median(test_3_compute$per_change, na.rm = T)
+ return(test_3_score)
+}
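Test 3 scores an indicator by the median per-user percent change between the holiday-month and active-month medians; an indicator robust to seasonal dips should score near zero. A Python sketch of that final step (inputs hypothetical, standing in for the merged per-user medians):

```python
from statistics import median

def holiday_change_score(holiday_medians, active_medians):
    # Per user: % change of the holiday median relative to the active
    # median; the test score is the median change across users
    changes = [(h - a) / a * 100
               for h, a in zip(holiday_medians, active_medians)]
    return median(changes)

print(holiday_change_score([5, 8], [10, 8]))  # -> -25.0
```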
+
+test_3_score_vector <- sapply(indicators_to_test, test_3,
+ data1 = all_monthly_india, data2 = all_monthly_other)
+print(test_3_score_vector)
+
+#------------------------------------------------------------------------#
+
+#------------------------------------------------------------------------#
+#TEST 4
+#Subset data frames for one month before attrition vs. active months
+#http://r.789695.n4.nabble.com/adding-in-missing-values-in-a-sequence-td839900.html
+
+library(reshape)
+
+#Count NAs in indicators to test before running this test because we are later on
+#adding NAs as indicator values whenever there is an inactive month. There should be
+#none/very few NA values in our indicators to test before we add in the inactive months
+#for this test
+all_monthly_check <-
+  select(all_monthly_tula, domain, user_id, obsnum, nvisits, active_day_percent,
+ median_visits_per_day, ncases_registered, nunique_followups,
+ median_visit_duration, nforms, user_numeric, domain_numeric,
+ sample_undefined, sample_normal)
+sapply(all_monthly_check, summary)
+
+#Create new all_monthly with missing obsnum filled in, by user_id
+#First make two subsets: (1) users with last calendar_month == last month
+# of data pull (2) users with last calendar_month != last month of data pull
+
+print(as.Date(run_conf$reports$end_date)) #print last month of data pull
+
+pre_test_4 <- function(data) {
+  data$ongoing <- data$calendar_month == "2014-08-01"
+  domain_ongoing <- data$domain[data$ongoing]
+  user_id_ongoing <- data$user_id[data$ongoing]
+  users_ongoing <- data.frame(domain_ongoing, user_id_ongoing)
+
+  #Initialize dataframe with the same columns as the input data
+  all_monthly_ongoing <- as.data.frame(matrix(ncol = ncol(data), nrow = 10))
+  names(all_monthly_ongoing) <- names(data)
+
+  #Populate with all rows for all ongoing users by domain
+  for (i in 1:nrow(users_ongoing)) {
+    rows_ongoing <- data[(data$domain == users_ongoing$domain_ongoing[i] &
+                            data$user_id == users_ongoing$user_id_ongoing[i]), ]
+    all_monthly_ongoing <- rbind(all_monthly_ongoing, rows_ongoing)
+  }
+
+  #Drop the 10 empty initialization rows
+  all_monthly_ongoing <- all_monthly_ongoing[11:nrow(all_monthly_ongoing), ]
+  all_monthly_ongoing$calendar_month <- as.Date(all_monthly_ongoing$calendar_month)
+
+  all_monthly_lost <- rbind(data, all_monthly_ongoing)
+  dup <- duplicated(all_monthly_lost) | duplicated(all_monthly_lost, fromLast = TRUE)
+  all_monthly_lost <- all_monthly_lost[!dup, ]
+
+  #Now get max obsnum per user_id
+  #Add one month for the users that were lost before end date of report pull
+  #Keep original max for users that stayed on till end date of report pull
+  max_ongoing <- all_monthly_ongoing %>%
+    group_by(domain, user_id) %>%
+    summarise(obsnum_max = max(obsnum))
+
+  max_lost <- all_monthly_lost %>%
+    group_by(domain, user_id) %>%
+    summarise(obsnum_max = max(obsnum) + 1)
+
+  all_monthly_max <- rbind(max_ongoing, max_lost)
+
+  #Create list of full obsnum sequence based on max obsnum per user_id
+  all_monthly_new <- lapply(split(all_monthly_max, list(all_monthly_max$domain,
+                                                        all_monthly_max$user_id), drop = T),
+                            function(x) seq(x$obsnum_max))
+  #One row per user - with one column per obsnum, padded with NA up to the longest sequence
+  max_len <- max(sapply(all_monthly_new, length))
+  column_per_obsnum <- do.call(rbind, lapply(all_monthly_new,
+                                             function(x) c(as.numeric(x),
+                                                           rep(NA, max_len - length(x)))))
+  #Convert columns to rows - one row per obsnum per user
+  mdata <- melt(column_per_obsnum, id = c("row.names"))
+  #Sort by domain.user (X1) and then by obsnum
+  mdata <- arrange(mdata, X1, value)
+  #Pull apart domain_name and user_id from X1
+  mat <- as.matrix(mdata[, 1])
+  mat2 <- apply(mat, 2, function(x) unlist(strsplit(x, ".", fixed = TRUE)))
+  mat_user_id <- mat2[c(F, T), ]
+  mat_domain_name <- mat2[c(T, F), ]
+  #Column bind user_id and domain_name back to mdata and convert to dataframe
+  #Rename columns and delete rows with obsnum = NA
+  mdata <- cbind(mdata, mat_user_id)
+  mdata <- as.data.frame(cbind(mdata, mat_domain_name))
+  mdata <- select(mdata, -X2)
+  names(mdata) <- c("domain_user", "obsnum", "user_id", "domain")
+  mdata <- filter(mdata, !is.na(obsnum))
+  #Merge with all_monthly_check by domain_name, user_id, and obsnum: keeping all rows in mdata
+  mdata <- merge(mdata, all_monthly_check,
+                 by = c("domain", "user_id", "obsnum"), all.x = TRUE)
+  #Check # inactive months by looking at # NAs for all indicators (they should be the same)
+  sapply(mdata, summary)
+  #Sort by domain_name, user_id and obsnum
+  mdata <- arrange(mdata, domain, user_id, obsnum)
+  #Create logical vector of inactive months
+  mdata$inactive_month <- is.na(mdata$nvisits)
+  #Create logical vector of months immediately before an inactive month
+  #Choose only the second element in inactive_month onwards and then append "F"
+  mdata$before_inactive <- c(mdata$inactive_month[-1], F)
+  return(mdata)
+}
+
+mdata <- pre_test_4(all_monthly_tula)
+
+test_4 <- function(indicator, data) {
+ active_subset <- data[data$before_inactive == F, ]
+ before_inactive_subset <- data[data$before_inactive == T, ]
+
+  test_4_compute_active <- active_subset %>%
+    group_by(domain, user_id) %>%
+    s_summarise(paste0('median_indicator_active=median(', indicator, ', na.rm=TRUE)'))
+
+  test_4_compute_before_active <- before_inactive_subset %>%
+    group_by(domain, user_id) %>%
+    s_summarise(paste0('median_indicator_before_inactive=median(', indicator, ', na.rm=TRUE)'))
+
+ #Merge the two data frames by domain and user_id and then compute % change
+ test_4_compute <- merge(test_4_compute_active, test_4_compute_before_active,
+ by=c("domain", "user_id"))
+ test_4_compute$per_change <-
+ (test_4_compute$median_indicator_before_inactive-test_4_compute$median_indicator_active)/
+ test_4_compute$median_indicator_active *100
+ test_4_score <- median(test_4_compute$per_change, na.rm = T)
+ return(test_4_score)
+}
+
+test_4_score_vector <- sapply(indicators_to_test, test_4, data=mdata)
+print(test_4_score_vector)
+
+#Combine all test vectors into dataframe
+test_scores <- data.frame(cbind(test_1_score_vector, test_2_score_vector,
+ test_3_score_vector, test_4_score_vector))
+
+write.csv(test_scores, file = "test_scores.csv")
+
+#------------------------------------------------------------------------#
+#Test 3 code for single domain
+#------------------------------------------------------------------------#
+
+test_3 <- function(indicator, data) {
+ holiday_subset1 <- data[data$month_abbr == "Dec", ]
+ active_subset1 <- data[!(data$month_abbr == "Dec"), ]
+
+  test_3_compute_holiday1 <- holiday_subset1 %>%
+    group_by(domain, user_id) %>%
+    s_summarise(paste0('median_indicator_holiday=median(', indicator, ', na.rm=TRUE)'))
+
+  test_3_compute_active1 <- active_subset1 %>%
+    group_by(domain, user_id) %>%
+    s_summarise(paste0('median_indicator_active=median(', indicator, ', na.rm=TRUE)'))
+
+
+ #Merge the two data frames by domain and user_id and then compute % change
+ test_3_compute <- merge(test_3_compute_holiday1, test_3_compute_active1,
+ by=c("domain", "user_id"))
+
+ test_3_compute$per_change <-
+ (test_3_compute$median_indicator_holiday-test_3_compute$median_indicator_active)/
+ test_3_compute$median_indicator_active *100
+ test_3_score <- median(test_3_compute$per_change, na.rm = T)
+ return(test_3_score)
+}
+
+test_3_score_vector <- sapply(indicators_to_test, test_3, data = all_monthly_tula)
+print(test_3_score_vector)
+
+#------------------------------------------------------------------------#
+#Check all health sector domains to remove any with median # cases f/u == 0
+#------------------------------------------------------------------------#
+#ddply and this summarise come from plyr; namespace them to avoid masking dplyr::summarise
+overall_split = plyr::ddply(all_monthly, plyr::.(domain_name), plyr::summarise,
+                            cases_fu_med = median(follow_up_unique_case, na.rm=T))
+overall_split = arrange(overall_split, cases_fu_med)
+
+
+
+#------------------------------------------------------------------------#
+#Older test code
+#------------------------------------------------------------------------#
+
+test_3 <- function(indicator, data) {
+ holiday_subset <- data[grep("Oct|Nov", data$calendar_month, perl=T), ]
+ active_subset <- data[grep("Oct|Nov", data$calendar_month, invert = T, perl = T), ]
+
+  test_3_compute_holiday <- holiday_subset %>%
+    group_by(domain_name, user_id) %>%
+    s_summarise(paste0('median_indicator_holiday=median(', indicator, ', na.rm=TRUE)'))
+
+  test_3_compute_active <- active_subset %>%
+    group_by(domain_name, user_id) %>%
+    s_summarise(paste0('median_indicator_active=median(', indicator, ', na.rm=TRUE)'))
+
+ #Merge the two data frames by domain and user_id and then compute % change
+ test_3_compute <- merge(test_3_compute_holiday, test_3_compute_active,
+ by=c("domain_name", "user_id"))
+ test_3_compute$per_change <-
+ (test_3_compute$median_indicator_holiday-test_3_compute$median_indicator_active)/
+ test_3_compute$median_indicator_active *100
+ test_3_score <- median(test_3_compute$per_change, na.rm = T)
+ return(test_3_score)
+}
+
+#Create obsnum here since we can't create this at the aggregate table stage
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+
+add_index <- function(x) {
+ start <- min(x$calendar_month, na.rm=TRUE)
+ x$numeric_index <- sapply(x$calendar_month, function(end) interval(start, end) %/% months(1))
+ return(x)
+}
+
+df2 <- all_monthly %>% group_by(domain, user_id) %>% do(add_index(.))
+df2$obsnum <- df2$numeric_index + 1
+all_monthly <- select(df2, -numeric_index)
+
+
diff --git a/analysis_scripts/rdayalu/individual_domain_analysis.R b/analysis_scripts/rdayalu/individual_domain_analysis.R
new file mode 100644
index 0000000..7ffae19
--- /dev/null
+++ b/analysis_scripts/rdayalu/individual_domain_analysis.R
@@ -0,0 +1,10 @@
+form_table <- tbl(db, "form")
+form_table <- collect(form_table)
+domain <- tbl(db, "domain")
+domain <- collect(domain)
+test <- filter(domain, name == "slab-tanzania")
+forms <- filter(form_table, domain_id == 4926) #4926 = id of slab-tanzania from the lookup above
+
+forms$form_date <- as.Date(substr(forms$time_start, 1, 10))
+summary(forms$form_date)
+forms <- arrange(forms, desc(form_date))
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/jeremy_check.R b/analysis_scripts/rdayalu/jeremy_check.R
new file mode 100644
index 0000000..4219842
--- /dev/null
+++ b/analysis_scripts/rdayalu/jeremy_check.R
@@ -0,0 +1,30 @@
+#583 users
+d_238 <- filter(blog, domain_numeric == 238)
+length(unique(d_238$user_pk))
+
+#Keep only users that have a month_index = 1 in this dataset
+#Want to exclude users that started outside of this dataset
+#Now have 573 users
+d_238 <- d_238[d_238$user_pk %in% d_238$user_pk[d_238$month_index == 1],]
+
+#Keep users with at least 12 months on CC
+#Now have 128 users
+#d_238 <- filter(d_238, months_on_cc >= 12)
+#summary(d_238$months_on_cc)
+
+#Keep only the first 12 months for each user (month_index = 1 through 12)
+#So exclude month_index > 12
+#Now have 128 users
+#unique(d_238$user_pk[!(d_238$user_pk %in% with_mi_1)])
+#test <- d_238[d_238$user_pk %in% unique(d_238$user_pk[!(d_238$user_pk %in% with_mi_1)]),]
+#test <- arrange(test, user_pk, month_index)
+d_238 <- filter(d_238, month_index <= 12)
+
+#Keep only users that have a max(month_index) = 12 and have num_obs = 12
+#97 users
+users_for_12 <- d_238 %>% group_by(user_pk) %>% summarise(max_index = max(month_index),
+ num_obs = length(unique(calendar_month)))
+users_for_12 <- filter(users_for_12, max_index == 12 & num_obs == 12)
+d_238 <- d_238[d_238$user_pk %in% users_for_12$user_pk,]
+
+#Keep only users that have been active for the
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/journal_analysis_1.R b/analysis_scripts/rdayalu/journal_analysis_1.R
new file mode 100644
index 0000000..a35042b
--- /dev/null
+++ b/analysis_scripts/rdayalu/journal_analysis_1.R
@@ -0,0 +1,661 @@
+# 12/26/14
+# This analysis is for the ITI journal article titled "Measuring mobile technology
+# usage to inform public health programmatic design: a metadata analysis"
+# The manuscript for this article is here:
+# https://docs.google.com/a/dimagi.com/document/d/1AuFF40FMGfe49wWYVhvLQ0DEvGuSayAHGk4o8b67N4I/edit
+
+#------------------------------------------------------------------------#
+#SELECT CHWs TO KEEP FOR ANALYSIS
+#------------------------------------------------------------------------#
+
+#Exclude users who have submitted forms using > 1 application_id
+#First import full form_table from db
+library(dplyr)
+# Load system config file
+source(file.path("function_libraries","config_file_funcs.R", fsep = .Platform$file.sep))
+source(file.path("data_sources.R"))
+system_conf <- get_system_config(file.path("config_system.json"))
+# Get db connection
+db <- get_db_connection(system_conf)
+form_table <- tbl(db, "form")
+form_table <- collect(form_table)
+names(form_table)[names(form_table) == "user_id"] = "user_pk"
+chw_single_app <- form_table %>% group_by(user_pk) %>%
+ summarise(n_applications = length(unique(application_id)))
+chw_single_app <- filter(chw_single_app, n_applications == 1)
+
+#List of users by user_type, keeping only mobile users
+#Get user_type table from db (mobile, web, superuser, etc.)
+user_type <- get_user_type_table(db)
+user_type <- filter(user_type, user_type == "mobile")
+user_type <- select(user_type, user_pk, user_id, user_type)
+
+#Merge these two lists together, keeping only mobile users
+chw_single_app <- merge(chw_single_app, user_type, by = "user_pk", all.x = T)
+chw_single_app <- filter(chw_single_app, user_type == "mobile")
+
+#Exclude users who submitted through one of the three atypical apps in this google doc:
+#https://docs.google.com/a/dimagi.com/spreadsheets/d/1QwkgRZPR81rQF9h-E7END_ontZye8xQW_WojAObKsgU/edit#gid=0
+app <- tbl(db, "application")
+app <- collect(app)
+app <- filter(app, app_id == "4c4c0f5a7bd4834994e93cc7dde8b91a" |
+ app_id == "237e500aed04239b98f0aecb904227ac" |
+ app_id == "223c5ff68da7d1805fbf8224976f2587")
+atypical <- form_table[form_table$application_id %in% app$id,]
+chw_single_app <- chw_single_app[!(chw_single_app$user_pk %in% atypical$user_pk),]
+write.csv(chw_single_app, file = "chw_single_app.csv")
+#------------------------------------------------------------------------#
+#PREPARE MONTHLY TABLE
+#------------------------------------------------------------------------#
+
+#Pull monthly_table for all domains
+#Be sure to set config_run first
+source(file.path("analysis_scripts","raw_data","data_import.R", fsep = .Platform$file.sep))
+all_monthly <- monthly_table
+
+#Get required libraries
+library(data.table)
+library(zoo)
+detach("package:lubridate")
+library(lubridate)
+library(ggplot2)
+library(gridExtra)
+
+#Keep only the "typical FLW" domains. There are 39 of these domains
+#I set config_run to "permitted_data_only": false, otherwise tulasalud
+#would be excluded, which isn't correct.
+#However, keiskamma and ssqh-cs have opted out per EULA, so we may need to exclude them.
+#These domains eventually have 16 and 9 users respectively, though ssqh-cs has not actually
+#opted out per Sheel's knowledge.
+#Excluding those two domains for now because I haven't received information otherwise.
+#Keeping 37 domains for the journal article.
+typical_FLW_domains <- c("aaharsneha", "aarohi", "acf", "aed-hth", "arogyasarita",
+ "care-ecd", "chasssmt-moz", "crc-intervention", "crhp",
+ "crs-catch", "crs-remind", "crs-senegal", "dtree-familyplanning",
+ "engender-ethiopia-pilot", "icap-tb", "kawok-malaria-p",
+ "kgvk", "maternalznz", "nutritionmeast", "opm",
+ "pasmo-nicaragua-dmg", "pci-india", "project", "puami-tsf-mnch-myanmar",
+ "rdi-hiht", "savethechildren", "savethechildren-nepal",
+ "slttc", "teba-hbc", "tulasalud", "world-renew",
+ "wvindia", "wvindia-nutrition", "wvindia2", "wvindonesia",
+ "wvug", "yonsei-emco")
+all_monthly <- all_monthly[all_monthly$domain %in% typical_FLW_domains,]
+
+#Keep only the users in the chw_single_app list
+all_monthly <- all_monthly[all_monthly$user_pk %in% chw_single_app$user_pk,]
+
+#Get report_options from config run file
+report <- run_conf$reports$modules$name
+report_options <- get_report_options(run_conf,report)
+
+#Keep rows only from 1/1/10 - 11/30/14 (based on config run file)
+#This leaves us with 2397 FLWs
+all_monthly$date_first_visit = as.Date(all_monthly$date_first_visit)
+all_monthly$date_last_visit = as.Date(all_monthly$date_last_visit)
+start_date = as.Date(report_options$start_date)
+end_date = as.Date(report_options$end_date)
+all_monthly = subset(all_monthly, all_monthly$date_first_visit >= start_date
+ & all_monthly$date_last_visit <= end_date)
+
+#Remove demo users and NA/NONE users
+#This does not exclude any FLWs, so we still have 2397 FLWs
+all_monthly = all_monthly[!(all_monthly$user_id =="demo_user"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="NONE"),]
+all_monthly = all_monthly[!(all_monthly$user_id =="none"),]
+all_monthly = all_monthly[!is.na(all_monthly$user_id),]
+
+#Change column names as needed
+names(all_monthly)[names(all_monthly) == "month.index"] = "calendar_month"
+names(all_monthly)[names(all_monthly) == "numeric_index"] = "month_index"
+
+#Convert time_using_cc to minutes
+all_monthly$time_using_cc <- all_monthly$time_using_cc/60
+
+#Prepare domain_table for merging in domain facets
+#Bring in sector information
+sector <- tbl(db, "sector")
+sector <- collect(sector)
+names(sector)[names(sector) == "name"] = "sector_final"
+domain_sector <- tbl(db, "domain_sector")
+domain_sector <- collect(domain_sector)
+domain_sector <- select(domain_sector, domain_id, sector_id)
+domain_table <- merge(domain_table, domain_sector, by.x = "id", by.y = "domain_id", all.x = T)
+domain_table <- merge(domain_table, sector, by.x = "sector_id", by.y = "id", all.x = T)
+#Bring in subsector information
+subsector <- tbl(db, "subsector")
+subsector <- collect(subsector)
+subsector <- select(subsector, id, name)
+subsector <- filter(subsector, !is.na(name))
+subsector <- filter(subsector, name != "")
+names(subsector)[names(subsector) == "name"] = "subsector_final"
+domain_subsector <- tbl(db, "domain_subsector")
+domain_subsector <- collect(domain_subsector)
+domain_subsector <- select(domain_subsector, domain_id, subsector_id)
+domain_table <- merge(domain_table, domain_subsector, by.x = "id", by.y = "domain_id", all.x = T)
+domain_table <- merge(domain_table, subsector, by.x = "subsector_id", by.y = "id", all.x = T)
+#Consolidate country information
+is.na(domain_table$deployment.country) <- domain_table$deployment.country == ""
+is.na(domain_table$country) <- domain_table$country == ""
+domain_table$country_final <- domain_table$deployment.country
+keep_country <- which(is.na(domain_table$deployment.country) & !is.na(domain_table$country))
+domain_table$country_final[keep_country] <- domain_table$country[keep_country]
+#Consolidate Dimagi level of support
+is.na(domain_table$internal.services) <- domain_table$internal.services == ""
+is.na(domain_table$internal.self_started) <- domain_table$internal.self_started == ""
+domain_table$self_start[domain_table$internal.self_started == "True"] <- "self"
+domain_table$dimagi_services <- domain_table$internal.services
+keep_self <- which(is.na(domain_table$internal.services) & !is.na(domain_table$self_start))
+domain_table$dimagi_services[keep_self] <- domain_table$self_start[keep_self]
+
+#Keep only columns of interest
+names(domain_table)[names(domain_table) == "id"] = "domain_id"
+facets_to_merge <- select(domain_table, name, domain_id, country_final, sector_final,
+ subsector_final, dimagi_services, test)
+
+#Merge domain facets from domain table into all_monthly table
+all_monthly <- merge(all_monthly, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+names(all_monthly)[names(all_monthly) == "domain_id"] = "domain_numeric"
+
+#Convert calendar month to actual date
+all_monthly$calendar_month <- parse_date_time(paste('01', all_monthly$calendar_month), '%d %b %Y!')
+all_monthly$calendar_month <- as.Date(all_monthly$calendar_month)
+all_monthly$month_abbr <- month(all_monthly$calendar_month, label = T, abbr = T)
+
+#Exclude any users who logged > 100 visits in any month
+#These are probably atypical users
+#We lose one domain because of this step (crc-intervention)
+#We are left with 2149 FLWs that have only <= 100 visits per month
+all_monthly$visits_ge_100 <- all_monthly$nvisits > 100
+user_ge_100 <- all_monthly %>%
+  group_by(user_pk) %>%
+  summarise(ge_100 = sum(visits_ge_100))
+user_le_100 <- filter(user_ge_100, ge_100 == 0)
+all_monthly <- all_monthly[all_monthly$user_pk %in% user_le_100$user_pk, ]
+
+#Get lifetime table for total nunique_followups, active_months per user
+#lifetime_table <- get_aggregate_table(db, "aggregate_lifetime_interactions", domains_for_run)
+#lifetime_table <- lifetime_table[lifetime_table$user_pk %in% all_monthly$user_pk,]
+#Merge nunique_followups, active_months to all_monthly
+#lifetime_table <- select(lifetime_table, user_pk, nunique_followups, active_months, calendar_month_on_cc)
+#names(lifetime_table)[names(lifetime_table) == "nunique_followups"] = "lifetime_followup"
+#names(lifetime_table)[names(lifetime_table) == "calendar_month_on_cc"] = "months_on_cc"
+#all_monthly <- merge(all_monthly, lifetime_table, by = "user_pk", all.x = T)
+
+#Lifetime aggregate table is not available on the db as of 12/28/14.
+#I will calculate months_on_cc, active_months here until the lifetime table is available.
+total_months_cc <- all_monthly %>% group_by(user_pk) %>%
+ summarise(first_month = min(calendar_month),
+ last_month = max(calendar_month),
+ active_months = length(unique(calendar_month)))
+total_months_cc$months_on_cc <- (interval(total_months_cc$first_month,
+ total_months_cc$last_month) %/% months(1))+1
+total_months_cc <- select(total_months_cc, user_pk, active_months, months_on_cc)
+all_monthly <- merge(all_monthly, total_months_cc, by = "user_pk", all.x = T)
+
+#Calculate differences between month_index to calculate next_month_active and
+#previous_month_active variables
+all_monthly <- arrange(all_monthly, domain_numeric, user_pk, calendar_month)
+df <- data.table(all_monthly)
+setkey(df,user_pk)
+df[,diff_days:=c(NA,diff(calendar_month)),by=user_pk]
+all_monthly <- as.data.frame(df)
+all_monthly$previous_month_active <- all_monthly$diff_days <= 31
+all_monthly$previous_two_months_active <- all_monthly$diff_days <= 62
+all_monthly$previous_three_months_active <- all_monthly$diff_days <= 93
+
+users <- unique(all_monthly$user_pk)
+
+next_month_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_month_active[-1], F)
+ next_month_active <- append(next_month_active, next_active)
+}
+all_monthly$next_month_active <- next_month_active
+
+next_two_months_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_two_months_active[-1], F)
+ next_two_months_active <- append(next_two_months_active, next_active)
+}
+all_monthly$next_two_months_active <- next_two_months_active
+
+next_three_months_active <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_three_months_active[-1], F)
+ next_three_months_active <- append(next_three_months_active, next_active)
+}
+all_monthly$next_three_months_active <- next_three_months_active
+
+#Based on the end_month in our dataset, we don't know if the user will be active in any of
+#the months following end_month. Must change all those attrition values to NA.
+is.na(all_monthly$next_month_active) <- all_monthly$calendar_month == end_month
+is.na(all_monthly$next_two_months_active) <- all_monthly$calendar_month >= end_month - months(1)
+is.na(all_monthly$next_three_months_active) <- all_monthly$calendar_month >= end_month - months(2)
+
+#Was the user ever active again after an attrition event (defined as next_month_active == F)?
+all_monthly$attrition_event <- !(all_monthly$next_month_active == T | is.na(all_monthly$next_month_active))
+all_monthly$continuing <- all_monthly$month_index < all_monthly$months_on_cc
+all_monthly$ever_active_again <- all_monthly$attrition_event == T & all_monthly$continuing == T
+is.na(all_monthly$ever_active_again) <- all_monthly$attrition_event == F
+
+#Exclude any users that don't have a month_index = 1
+#These users have months that started outside our data range for this dataset
+#so we shouldn't include them. There are 3 of these users.
+#Keep users that have a month_index = 1. We now have 2146 users
+all_monthly$has_index_1 <- all_monthly$month_index == 1
+user_index_1 <- all_monthly %>%
+  group_by(user_pk) %>%
+  summarise(keep_user = sum(has_index_1))
+user_index_1 <- filter(user_index_1, keep_user != 0)
+all_monthly <-
+ all_monthly[all_monthly$user_pk %in% user_index_1$user_pk, ]
+
+#Add sample_increase variable and sample_decrease variables
+all_monthly <- arrange(all_monthly, user_pk, calendar_month)
+users <- unique(all_monthly$user_pk)
+sample_increase <- c()
+sample_decrease <- c()
+for (i in users) {
+ single_user <- all_monthly[all_monthly$user_pk == i,]
+ sample_increase <- append(sample_increase, cumsum(sample(1:5, nrow(single_user), replace=T)))
+ sample_decrease <- append(sample_decrease, rev(cumsum(sample(1:5, nrow(single_user), replace=T))))
+}
+all_monthly$sample_increase <- sample_increase
+all_monthly$sample_decrease <- sample_decrease
+
+#Calculate differences between month_index to calculate next_month_active and
+#previous_month_active variables
+#Also want differences between indicators for each user from one month to the next
+#Differences in indicators will be used for test 1a
+all_monthly <- arrange(all_monthly, user_pk, calendar_month)
+df <- data.table(all_monthly)
+setkey(df,user_pk)
+df[,diff_nvisits:=c(NA,diff(nvisits)),by=user_pk]
+df[,diff_active_day_percent:=c(NA,diff(active_day_percent)),by=user_pk]
+df[,diff_nforms:=c(NA,diff(nforms)),by=user_pk]
+df[,diff_median_visit_duration:=c(NA,diff(median_visit_duration)),by=user_pk]
+df[,diff_median_visits_per_day:=c(NA,diff(median_visits_per_day)),by=user_pk]
+df[,diff_time_using_cc:=c(NA,diff(time_using_cc)),by=user_pk]
+df[,diff_ninteractions:=c(NA,diff(ninteractions)),by=user_pk]
+df[,diff_ncases_registered:=c(NA,diff(ncases_registered)),by=user_pk]
+df[,diff_register_followup:=c(NA,diff(register_followup)),by=user_pk]
+df[,diff_case_register_followup_rate:=c(NA,diff(case_register_followup_rate)),by=user_pk]
+df[,diff_ncases_touched:=c(NA,diff(ncases_touched)),by=user_pk]
+df[,diff_nunique_followups:=c(NA,diff(nunique_followups)),by=user_pk]
+df[,diff_sample_increase:=c(NA,diff(sample_increase)),by=user_pk]
+df[,diff_sample_decrease:=c(NA,diff(sample_decrease)),by=user_pk]
+all_monthly <- as.data.frame(df)
+
+#In the blog, I only kept users who have been active for
+#at least 6 mos and I only counted their rows from month 6 onwards
+#See this document for further details:
+#https://docs.google.com/a/dimagi.com/spreadsheets/d/1weMI03KGQPffWHM3y2AR1VbSVCZIbJEEXiFd_0jRL7A/edit#gid=0
+#Excluded rows before the 6th month on CC AND users with < 6 active months on CC
+#We now have 937 users and 30 domains
+all_monthly <- filter(all_monthly, month_index >= 6)
+all_monthly <- filter(all_monthly, active_months >= 6)
+
+#write.csv(all_monthly, file = "all_monthly.csv")
+
+training_typical <- all_monthly
+fullset <- all_monthly
+
+#------------------------------------------------------------------------#
+#HYPOTHESIS TESTING
+#------------------------------------------------------------------------#
+
+#Create test dataset
+#Remove training set domains
+training_domains <- c("afguinea", "nsf-lifefirst", "yonsei-emco", "keiskamma",
+ "image-sa", "ictwomenhealth", "fenway", "tulasalud")
+training_typical <- training_typical[!(training_typical$domain %in% training_domains),]
+
+#Check number of users/domain for sampling purposes
+n_chw <- training_typical %>% group_by(domain) %>%
+ summarise(nusers = length(unique(user_pk)))
+n_chw <- arrange(n_chw, desc(nusers))
+n_chw$total_users <- sum(n_chw$nusers)
+n_chw$per_users <- (n_chw$nusers/n_chw$total_users)*100
+
+#Exclude a sample of 103 users from crs-remind and 59 users from maternalznz
+#so that each domain contributes <= 20% of the total users
+#We are left with 663 users
+exclude_users <- c(sample(unique(training_typical$user_pk[training_typical$domain == "crs-remind"]), 103),
+ sample(unique(training_typical$user_pk[training_typical$domain == "maternalznz"]), 59))
+training_typical <- training_typical[!(training_typical$user_pk %in% exclude_users),]
+
+#Indicators to evaluate
+indicators <- c("nvisits", "active_day_percent", "nforms",
+ "median_visit_duration", "median_visits_per_day",
+ "time_using_cc", "ninteractions", "ncases_registered",
+ "register_followup", "case_register_followup_rate",
+ "ncases_touched", "nunique_followups", "sample_increase",
+ "sample_decrease")
+
+#------------------------------------------------------------------------#
+#Code for Test 1
+#------------------------------------------------------------------------#
+# % difference in indicators for each user for consecutive months
+# This isn't for truly consecutive months, so later on,
+# we will only use rows with previous_month_active == T
+#This will be used for test 1b
+#source(file.path("analysis_scripts","rdayalu","test_1b_journal.R", fsep = .Platform$file.sep))
+#Not running test 1b for the paper
+
+#Must only include rows with previous_month_active == T. Exclude F & NA
+training_consec <- filter(training_typical, previous_month_active == T)
+training_consec$concat <- paste(training_consec$user_pk, training_consec$calendar_month,
+ sep = "_")
+
+
+#Exclude domain calendar_months with nusers < 5 for that domain
+#Use this dataset only for test 1a because we don't want to calculate medians
+#for < 5 users/month
+nusers <- training_consec %>%
+ group_by(domain, calendar_month) %>%
+ summarise(nusers = length(unique(user_pk)))
+nusers <- filter(nusers, nusers >= 5)
+nusers$concat <- paste(nusers$domain, nusers$calendar_month, sep = "_")
+training_consec <-
+ training_consec[paste(training_consec$domain, training_consec$calendar_month, sep = "_") %in%
+ nusers$concat, ]
+
+#Domain median ABSOLUTE change per user per calendar month,
+#excluding each user from the domain median for that user's row
+#This is used for test 1a
+source(file.path("analysis_scripts","rdayalu","test_1a_journal.R", fsep = .Platform$file.sep))
+
+#Domain median PERCENTAGE change per user per calendar month,
+#excluding each user from the domain median for that user's row
+#source(file.path("analysis_scripts","rdayalu","test_1b_2_journal.R", fsep = .Platform$file.sep))
+
+#names(training_consec)
+
+test_1a <-
+ c(cor(training_consec$med_nvisits_1a, training_consec$diff_nvisits, use = "complete.obs"),
+ cor(training_consec$med_active_day_percent_1a, training_consec$diff_active_day_percent, use = "complete.obs"),
+ cor(training_consec$med_nforms_1a, training_consec$diff_nforms, use = "complete.obs"),
+ cor(training_consec$med_median_visit_duration_1a, training_consec$diff_median_visit_duration, use = "complete.obs"),
+ cor(training_consec$med_median_visits_per_day_1a, training_consec$diff_median_visits_per_day, use = "complete.obs"),
+ cor(training_consec$med_time_using_cc_1a, training_consec$diff_time_using_cc, use = "complete.obs"),
+ cor(training_consec$med_ninteractions_1a, training_consec$diff_ninteractions, use = "complete.obs"),
+ cor(training_consec$med_ncases_registered_1a, training_consec$diff_ncases_registered, use = "complete.obs"),
+ cor(training_consec$med_register_followup_1a, training_consec$diff_register_followup, use = "complete.obs"),
+ cor(training_consec$med_case_register_followup_rate_1a, training_consec$diff_case_register_followup_rate, use = "complete.obs"),
+ cor(training_consec$med_ncases_touched_1a, training_consec$diff_ncases_touched, use = "complete.obs"),
+ cor(training_consec$med_nunique_followups_1a, training_consec$diff_nunique_followups, use = "complete.obs"),
+ cor(training_consec$med_sample_increase_1a, training_consec$diff_sample_increase, use = "complete.obs"),
+ cor(training_consec$med_sample_decrease_1a, training_consec$diff_sample_decrease, use = "complete.obs"))
+names(test_1a) <- indicators
+#test_1a <- data.frame(test_1a)
+
+test1_data <- training_consec
+write.csv(test1_data, file = "test1_data.csv")
+#------------------------------------------------------------------------#
+#Code for Test 2
+#------------------------------------------------------------------------#
+
+#Previous month's indicator value for each usage indicator
+ind_cols <- c("nvisits", "active_day_percent", "nforms", "median_visit_duration",
+              "median_visits_per_day", "time_using_cc", "ninteractions",
+              "ncases_registered", "register_followup", "case_register_followup_rate",
+              "ncases_touched", "nunique_followups", "sample_increase", "sample_decrease")
+for (col in ind_cols) {
+  training_typical[[paste0("prev_", col)]] <-
+    training_typical[[col]] - training_typical[[paste0("diff_", col)]]
+}
+
+#Must only include rows with previous_month_active == TRUE. Exclude FALSE & NA
+#(which() drops the NA rows that a plain logical subset would keep)
+training_typical <- training_typical[which(training_typical$previous_month_active), ]
+test2_data <- training_typical
+
+#Correlation between each indicator and its previous month's value
+test_2a <- sapply(ind_cols, function(col)
+  cor(training_typical[[paste0("prev_", col)]], training_typical[[col]],
+      use = "complete.obs"))
+names(test_2a) <- indicators
+#test_2a <- data.frame(test_2a)
+
+write.csv(test2_data, file = "test2_data.csv")
+
+test <- data.frame(cbind(test_1a, test_2a))
+test$sum <- test$test_1a + test$test_2a
+write.csv(test, file = "journal_set_results.csv")
+
+#------------------------------------------------------------------------#
+#TABLE 2 - country and subsector
+#------------------------------------------------------------------------#
+
+table(fullset$subsector_final, useNA = "always")
+
+program <- fullset %>% group_by(domain) %>%
+ summarise(country = unique(country_final),
+ subsector = unique(subsector_final),
+ nusers = length(unique(user_pk)),
+ nforms = sum(nforms))
+program$subsector[program$domain == "care-ecd"] <- "Maternal, Newborn, & Child Health"
+program$country[program$domain == "puami-tsf-mnch-myanmar"] <- "Myanmar"
+
+users <- fullset %>% group_by(domain, user_pk) %>%
+ summarise(country = unique(country_final),
+ subsector = unique(subsector_final))
+users$subsector[users$domain == "care-ecd"] <- "Maternal, Newborn, & Child Health"
+users$country[users$domain == "puami-tsf-mnch-myanmar"] <- "Myanmar"
+
+#Forms by subsector and by country (separate tables so the first is not overwritten)
+forms_by_subsector <- fullset %>% group_by(subsector_final) %>%
+  summarise(nforms = sum(nforms))
+forms_by_subsector$subsector_final[is.na(forms_by_subsector$subsector_final)] <- "Maternal, Newborn, & Child Health"
+
+forms_by_country <- fullset %>% group_by(country_final) %>%
+  summarise(nforms = sum(nforms))
+forms_by_country$country_final[is.na(forms_by_country$country_final)] <- "Myanmar"
+
+table(users$country, useNA = "always")
+
+#------------------------------------------------------------------------#
+#TABLE 3 - Univariate summary table
+#------------------------------------------------------------------------#
+
+library(pastecs)
+indicators_table <- cbind(fullset$nvisits, fullset$active_day_percent, fullset$nforms,
+ fullset$median_visit_duration, fullset$median_visits_per_day,
+ fullset$time_using_cc, fullset$ninteractions,
+ fullset$ncases_registered, fullset$register_followup,
+ fullset$case_register_followup_rate, fullset$ncases_touched,
+ fullset$nunique_followups)
+colnames(indicators_table) <- indicators[1:12]
+
+options(scipen=100)
+options(digits=2)
+summary_stats <- as.matrix(stat.desc(indicators_table))
+
+#Number of active months per CHW
+n_months <- fullset %>% group_by(user_pk) %>%
+ summarise(nmonths = length(calendar_month))
+
+#Vector of medians for all 12 usage indicators
+indicator_medians <- as.vector(apply(indicators_table, 2, FUN = median))
+names(indicator_medians) <- indicators[1:12]
+
+#How many CHWs were active for all of their first 6 mos on CC?
+test <- monthly_table
+test <- test[test$user_pk %in% fullset$user_pk, ]
+names(test)[names(test) == "numeric_index"] <- "month_index"
+test <- filter(test, month_index <= 6)
+active_6_mos <- test %>% group_by(user_pk) %>%
+  summarise(nmos = length(month_index))
+
+#------------------------------------------------------------------------#
+# M & E questions
+#------------------------------------------------------------------------#
+
+#Seasonal/monthly activity: use fullset
+#We are going to filter India to make seasonal/holiday comparisons more applicable
+# % days active and # forms
+test <- filter(fullset, country_final == "India")
+season <- test %>% group_by(month_abbr) %>%
+ summarise(med_forms = median(nforms),
+ med_active_days = median(active_day_percent))
+season <- data.frame(rep(season$month_abbr,2),
+ c(season$med_forms, season$med_active_days))
+season$indicator <- c(rep("# forms", 12), rep("% active days", 12))
+names(season) <- c("month", "median_metric", "metric")
+
+#normalized metrics
+season <- test %>% group_by(month_abbr) %>%
+ summarise(med_nvisits = median(nvisits),
+ med_median_visit_duration = median(median_visit_duration),
+ med_median_visits_per_day = median(median_visits_per_day),
+ med_time_using_cc = median(time_using_cc),
+ med_ninteractions = median(ninteractions),
+ med_ncases_registered = median(ncases_registered),
+ med_register_followup = median(register_followup),
+ med_case_register_followup_rate = median(case_register_followup_rate),
+ med_ncases_touched = median(ncases_touched),
+ med_nunique_followups = median(nunique_followups))
+season <- data.frame(rep(season$month_abbr,10),
+ c(season$med_nvisits, season$med_median_visit_duration, season$med_median_visits_per_day,
+ season$med_time_using_cc, season$med_ninteractions, season$med_ncases_registered,
+ season$med_register_followup, season$med_case_register_followup_rate,
+ season$med_ncases_touched, season$med_nunique_followups))
+season$indicator <- c(rep("# visits", 12), rep("median visit duration", 12), rep("median visits per day", 12),
+ rep("total duration using CC", 12), rep("# interactions", 12),
+ rep("# cases registered", 12), rep("# follow-up visits", 12),
+ rep("% follow-up visits", 12), rep("# cases", 12),
+ rep("# cases followed-up", 12))
+names(season) <- c("month", "median_metric", "metric")
+season$overall_max <- c(rep(max(test$nvisits), 12), rep(max(test$median_visit_duration), 12),
+ rep(max(test$median_visits_per_day), 12), rep(max(test$time_using_cc), 12),
+ rep(max(test$ninteractions), 12), rep(max(test$ncases_registered), 12),
+ rep(max(test$register_followup), 12), rep(max(test$case_register_followup_rate), 12),
+ rep(max(test$ncases_touched), 12), rep(max(test$nunique_followups), 12))
+season$normalized_median <- (season$median_metric/season$overall_max)*100
+
+#Activity by device type
+#Only include users with a single device type
+users_device_count <- fullset %>% group_by(user_pk) %>%
+ summarise(ndevice_type = length(unique(summary_device_type)))
+table(users_device_count$ndevice_type, useNA = "always")
+#918 users have only one device type
+#Keep only these users
+users_device_count <- filter(users_device_count, ndevice_type == 1)
+monthly_single_device <- fullset[fullset$user_pk %in% users_device_count$user_pk,]
+#Exclude users who have device type = None. We are left with 831 users
+monthly_single_device <- filter(monthly_single_device, summary_device_type != "None")
+#191 android users
+android <- filter(monthly_single_device, summary_device_type == "Android")
+#640 feature phone users
+feature <- filter(monthly_single_device, summary_device_type == "Nokia")
+
+#Median metrics for each device type
+metric_cols <- c("nvisits", "active_day_percent", "nforms", "median_visit_duration",
+                 "median_visits_per_day", "time_using_cc", "ninteractions",
+                 "ncases_registered", "register_followup", "case_register_followup_rate",
+                 "ncases_touched", "nunique_followups")
+android_median <- sapply(metric_cols, function(col) median(android[[col]]))
+feature_median <- sapply(metric_cols, function(col) median(feature[[col]]))
+
+median_metric <- c(android_median, feature_median)
+
+device_metrics <- c("# visits", "% active days", "# forms",
+ "median visit duration", "median visits per day",
+ "total duration using CC", "# interactions",
+ "# cases registered", "# follow-up visits",
+ "% follow-up visits", "# cases",
+ "# cases followed-up")
+
+device_data <- data.frame(cbind(rep(c("Android", "Feature"), each=12), rep(device_metrics, 2),
+ median_metric))
+names(device_data) <- c("device", "metric", "median_metric")
+device_data$median_metric <- as.numeric(levels(device_data$median_metric))[device_data$median_metric]
+
+# percent active days and # forms
+device_eval <- filter(device_data, metric == "% active days" | metric == "# forms")
+g_dev_eval <- ggplot(device_eval, aes(x = metric, y = median_metric, fill = device)) +
+ geom_bar(position = "dodge", stat = "identity") +
+ xlab("Usage metric") +
+ ylab("Median metric") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9))
+
+# Remaining 10 metrics
+device_eval <- filter(device_data, metric != "% active days" & metric != "# forms")
+device_eval$overall_max <- rep(c(max(fullset$nvisits), max(fullset$median_visit_duration),
+ max(fullset$median_visits_per_day), max(fullset$time_using_cc),
+ max(fullset$ninteractions), max(fullset$ncases_registered),
+ max(fullset$register_followup), max(fullset$case_register_followup_rate),
+ max(fullset$ncases_touched), max(fullset$nunique_followups)), 2)
+device_eval$normalized_median <- (device_eval$median_metric/device_eval$overall_max)*100
+
+g_dev_overall <- ggplot(device_eval, aes(x = metric, y = normalized_median, fill = device)) +
+ geom_bar(position = "dodge", stat = "identity") +
+ scale_fill_brewer(palette = "Set2") +
+ xlab("Usage metric") +
+ ylab("Normalized median (% of overall metric maximum)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text.x=element_blank(),
+ axis.title.x=element_blank()) +
+ theme(axis.text.y=element_text(size=9),
+ axis.title.y=element_text(size=9)) +
+ theme(legend.title=element_text(size=8),
+ legend.text=element_text(size=8))
+
+device_eval <- filter(device_eval, metric != "% active days" & metric != "# forms" &
+ metric != "% follow-up visits")
+g_dev_zoom <- ggplot(device_eval, aes(x = metric, y = normalized_median, fill = device)) +
+ geom_bar(position = "dodge", stat = "identity") +
+ scale_fill_brewer(palette = "Set2") +
+ scale_y_continuous(limits=c(0,25)) +
+ xlab("Usage metric") +
+ ylab("Normalized median (% of overall metric maximum)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text.x=element_blank(),
+ axis.title.x=element_blank()) +
+ theme(axis.text.y=element_text(size=9),
+ axis.title.y=element_text(size=9)) +
+ theme(legend.title=element_text(size=8),
+ legend.text=element_text(size=8))
+
+pdf("device_eval.pdf", width=8, height=4)
+grid.arrange(g_dev_eval)
+dev.off()
+
+pdf("dev_overall.pdf")
+grid.arrange(g_dev_overall, g_dev_zoom, nrow = 2, ncol=1)
+dev.off()
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/journal_attrition.R b/analysis_scripts/rdayalu/journal_attrition.R
new file mode 100644
index 0000000..1d106b2
--- /dev/null
+++ b/analysis_scripts/rdayalu/journal_attrition.R
@@ -0,0 +1,117 @@
+library(dplyr)
+library(ggplot2)
+library(gridExtra)
+
+training_typical <- arrange(fullset, user_pk, calendar_month)
+n_months <- fullset %>% group_by(user_pk) %>%
+ summarise(nmonths = length(calendar_month))
+n_months <- filter(n_months, nmonths >=4)
+training_typical <- training_typical[training_typical$user_pk %in% n_months$user_pk,]
+users <- unique(training_typical$user_pk)
+
+
+#Create attrition table for each indicator
+leadup <- data.frame(matrix(ncol = 8, nrow = 1))
+names(leadup) <- c("user_pk", "calendar_month", "previous_three_months_active",
+ "next_three_months_active", "month_1", "month_2", "month_3",
+ "month_4")
+leadup$calendar_month <- as.Date(leadup$calendar_month)
+leadup$user_pk <- as.numeric(leadup$user_pk)
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #single_user$calendar_month <- as.Date(single_user$calendar_month)
+ for (j in 1:(nrow(single_user)-3)) {
+ leadup_single <- data.frame(matrix(ncol = 8, nrow = 1))
+ names(leadup_single) <- c("user_pk", "calendar_month", "previous_three_months_active",
+ "next_three_months_active", "month_1", "month_2", "month_3",
+ "month_4")
+ leadup_single$user_pk <- as.numeric(single_user$user_pk[1])
+ leadup_single$calendar_month <- as.Date(single_user$calendar_month[3+j])
+ leadup_single$previous_three_months_active <- single_user$previous_three_months_active[3+j]
+ leadup_single$next_three_months_active <- single_user$next_three_months_active[3+j]
+ leadup_single$month_1 <- single_user$nunique_followups[3+j]
+ leadup_single$month_2 <- single_user$nunique_followups[2+j]
+ leadup_single$month_3 <- single_user$nunique_followups[1+j]
+ leadup_single$month_4 <- single_user$nunique_followups[j]
+ leadup <- rbind(leadup, leadup_single)
+ }
+}
+leadup <- leadup[!(is.na(leadup$calendar_month)),]
+leadup <- filter(leadup, next_three_months_active == F & previous_three_months_active == T)
+leadup_3 <- c(median(leadup$month_1), median(leadup$month_2),
+ median(leadup$month_3), median(leadup$month_4))
+
+
+#Run the loop above once per indicator (substituting each indicator column for
+#nunique_followups) and append its four medians here. Initialize the vector
+#before the first run:
+#leadup_values <- c()
+leadup_values <- c(leadup_values, leadup_3)
+
+#Create dataset for graphing
+leadup_indicators <- c(rep("# visits", 4), rep("% active days", 4), rep("# forms", 4),
+ rep("median visit duration", 4), rep("median visits per day", 4),
+ rep("total duration using CC", 4), rep("# interactions", 4),
+ rep("# cases registered", 4), rep("# follow-up visits", 4),
+ rep("% follow-up visits", 4), rep("# cases", 4),
+ rep("# cases followed-up", 4))
+leadup_data <- data.frame(cbind(rep(c(1:4), 12), leadup_indicators, leadup_values))
+names(leadup_data) <- c("months_prior", "metric", "metric_values")
+leadup_data$metric_values <- as.numeric(levels(leadup_data$metric_values))[leadup_data$metric_values]
+leadup_data$months_prior <- as.factor(leadup_data$months_prior)
+month_levels <- rev(levels(leadup_data$months_prior))
+
+# percent active days and # forms
+attrition_eval <- filter(leadup_data, metric == "% active days" | metric == "# forms")
+g_att_eval <- ggplot(attrition_eval, aes(x = months_prior, y = metric_values, colour = metric, group = metric)) +
+ geom_line() +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5) +
+ scale_y_continuous(limits=c(0,20)) +
+ scale_x_discrete(limits = month_levels) +
+ xlab("# months prior to attrition event") +
+ ylab("Median metric") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9))
+
+# Remaining 10 metrics
+attrition_eval <- filter(leadup_data, metric != "% active days" & metric != "# forms")
+attrition_eval$overall_max <- c(rep(max(fullset$nvisits), 4), rep(max(fullset$median_visit_duration), 4),
+ rep(max(fullset$median_visits_per_day), 4), rep(max(fullset$time_using_cc), 4),
+ rep(max(fullset$ninteractions), 4), rep(max(fullset$ncases_registered), 4),
+ rep(max(fullset$register_followup), 4), rep(max(fullset$case_register_followup_rate), 4),
+ rep(max(fullset$ncases_touched), 4), rep(max(fullset$nunique_followups), 4))
+attrition_eval$normalized_median <- (attrition_eval$metric_values/attrition_eval$overall_max)*100
+
+g_att_overall <- ggplot(attrition_eval, aes(x = months_prior, y = normalized_median, colour = metric, group = metric)) +
+ geom_line() +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5) +
+ #scale_y_continuous(limits=c(0,20)) +
+ scale_x_discrete(limits = month_levels) +
+ xlab("# months prior to attrition event") +
+ ylab("Normalized median (% of overall metric maximum)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9)) +
+ theme(legend.title=element_text(size=8),
+ legend.text=element_text(size=8))
+
+g_att_zoom <- ggplot(attrition_eval, aes(x = months_prior, y = normalized_median, colour = metric, group = metric)) +
+ geom_line() +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5) +
+ scale_y_continuous(limits=c(0,18)) +
+ scale_x_discrete(limits = month_levels) +
+ xlab("# months prior to attrition event") +
+ ylab("Normalized median (% of overall metric maximum)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9)) +
+ theme(legend.title=element_text(size=8),
+ legend.text=element_text(size=8))
+
+pdf("att_eval.pdf", width=8, height=4)
+grid.arrange(g_att_eval)
+dev.off()
+
+pdf("att_remaining.pdf")
+grid.arrange(g_att_overall, g_att_zoom, nrow = 2, ncol=1)
+dev.off()
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/journal_test.R b/analysis_scripts/rdayalu/journal_test.R
new file mode 100644
index 0000000..9d48593
--- /dev/null
+++ b/analysis_scripts/rdayalu/journal_test.R
@@ -0,0 +1,14 @@
+library(dplyr)
+
+test <- all_monthly
+
+test$has_index_1 <- test$month_index == 1
+user_index_1 <- test %>%
+  group_by(user_pk) %>%
+  summarise(keep_user = sum(has_index_1))
+user_index_1 <- filter(user_index_1, keep_user != 0)
+test <- test[test$user_pk %in% user_index_1$user_pk, ]
+
+test <- filter(test, month_index <= 6)
+
+#How many users have nmos = 6?
+active_6_mos <- test %>% group_by(user_pk) %>%
+ summarise(nmos = length(unique(calendar_month)))
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/journal_visuals.R b/analysis_scripts/rdayalu/journal_visuals.R
new file mode 100644
index 0000000..5b461a0
--- /dev/null
+++ b/analysis_scripts/rdayalu/journal_visuals.R
@@ -0,0 +1,366 @@
+#This code is to create visuals for the ITID journal article:
+#https://docs.google.com/a/dimagi.com/document/d/1AuFF40FMGfe49wWYVhvLQ0DEvGuSayAHGk4o8b67N4I/edit
+#The datasets used to generate the following visuals are located here:
+#fullset.csv:
+#test1_data.csv:
+#test2_data.csv:
+
+library(dplyr)
+library(ggplot2)
+library(gridExtra)
+
+#Number of active users: use fullset
+overall <- fullset %>% group_by(calendar_month) %>%
+ summarise(sum_user = length(unique(user_pk)))
+
+p_users <- ggplot(overall, aes(x=calendar_month, y=sum_user)) +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5, colour = "darkblue",
+ fill = "lightblue") +
+ geom_line(colour = "darkblue") +
+ xlab("Calendar month") +
+ ylab("Number (#) of active CHWs") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9))
+ #ggtitle("Number (#) of active CHWs by month")
+
+#Distribution of total n_active months per CHW: use fullset
+n_months <- fullset %>% group_by(user_pk) %>%
+ summarise(nmonths = length(calendar_month))
+
+g_nmonths <- ggplot(n_months, aes(x=nmonths, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ xlab("Total number (#) of active months per CHW") +
+ ylab("Number (#) of CHWs") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9)) +
+ geom_vline(xintercept = median(n_months$nmonths), linetype = "dashed")
+
+#Univariate density plots: use fullset
+#Density plots with semi-transparent fill
+g_forms <- ggplot(fullset, aes(x=nforms, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,150)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "nforms"], linetype = "dashed") +
+ xlab("# forms")
+
+g_nvisits <- ggplot(fullset, aes(x=nvisits, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "nvisits"], linetype = "dashed") +
+ xlab("# visits")
+
+g_inter <- ggplot(fullset, aes(x=ninteractions, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,100)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "ninteractions"], linetype = "dashed") +
+ xlab("# interactions")
+
+g_cases <- ggplot(fullset, aes(x=ncases_touched, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,70)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "ncases_touched"], linetype = "dashed") +
+ xlab("# cases")
+
+g_cases_reg <- ggplot(fullset, aes(x=ncases_registered, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,30)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "ncases_registered"], linetype = "dashed") +
+ xlab("# cases registered")
+
+g_cases_fu <- ggplot(fullset, aes(x=nunique_followups, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,60)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "nunique_followups"], linetype = "dashed") +
+ xlab("# cases followed-up")
+
+g_nvisits_fu <- ggplot(fullset, aes(x=register_followup, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,80)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "register_followup"], linetype = "dashed") +
+ xlab("# follow-up visits")
+
+g_pervisits_fu <- ggplot(fullset, aes(x=case_register_followup_rate, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,100)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "case_register_followup_rate"], linetype = "dashed") +
+ xlab("% of follow-up visits")
+
+g_per_active <- ggplot(fullset, aes(x=active_day_percent, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,100)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "active_day_percent"], linetype = "dashed") +
+ xlab("% of active days")
+
+g_med_vis_day <- ggplot(fullset, aes(x=median_visits_per_day, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,10), breaks=c(1,3,5,7,9)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "median_visits_per_day"], linetype = "dashed") +
+ xlab("Median # visits per active day")
+
+g_med_vis_dur <- ggplot(fullset, aes(x=median_visit_duration, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,15)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = summary_stats["median", "median_visit_duration"], linetype = "dashed") +
+ xlab("Median visit duration (minutes)")
+
+g_time_cc <- ggplot(fullset, aes(x=time_using_cc, y=..count..)) +
+ geom_density(alpha=.2, colour="black", fill="cornflowerblue") +
+ scale_x_continuous(limits=c(0,800)) +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=7),
+ axis.title=element_text(size=7)) +
+ geom_vline(xintercept = summary_stats["median", "time_using_cc"], linetype = "dashed") +
+ xlab("Total duration of CommCare use (minutes)")
+
+pdf("foo.pdf")
+grid.arrange(g_forms, g_nvisits, g_inter, g_cases, g_cases_reg, g_cases_fu,
+ g_nvisits_fu, g_pervisits_fu, g_per_active, g_med_vis_day,
+ g_med_vis_dur, g_time_cc, nrow = 4, ncol=3)
+dev.off()
+
+#Intra-user correlation: use test2_data
+g1 <- ggplot(test2_data, aes(x=prev_active_day_percent, y=active_day_percent)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,100)) +
+ scale_y_continuous(limits=c(0,100)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.70", parse = T, x=13, y=95) +
+ xlab("% of active days (month A)") +
+ ylab("% of active days (month B)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+g2 <- ggplot(test2_data, aes(x=prev_nforms, y=nforms)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,100)) +
+ scale_y_continuous(limits=c(0,100)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.68", parse = T, x=13, y=97) +
+ xlab("# forms (month A)") +
+ ylab("# forms (month B)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+g3 <- ggplot(test2_data, aes(x=prev_time_using_cc, y=time_using_cc)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,400)) +
+ scale_y_continuous(limits=c(0,400)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.21", parse = T, x=55, y=395) +
+ xlab("Total duration of CommCare use (month A)") +
+ ylab("Total duration of CommCare use (month B)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+g4 <- ggplot(test2_data, aes(x=prev_median_visit_duration, y=median_visit_duration)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,15)) +
+ scale_y_continuous(limits=c(0,15)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.19", parse = T, x=4, y=14.5) +
+ xlab("Median visit duration (month A)") +
+ ylab("Median visit duration (month B)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+pdf("foo.pdf")
+grid.arrange(g1, g2, g3, g4, nrow = 2, ncol=2)
+dev.off()
+
+#------------------------------------------------------------------------#
+# M & E: Seasons
+#------------------------------------------------------------------------#
+#Do CHWs have different levels of activity by calendar month or season?
+#% active days and # forms
+g_season <- ggplot(season, aes(x=month, y=median_metric, group = metric, colour = metric)) +
+ geom_line() +
+ scale_y_continuous(limits=c(0,35)) +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5) +
+ xlab("Month") +
+ ylab("Median metric") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9))
+#normalized graph overall
+g_season_normal_overall <- ggplot(season, aes(x=month, y=normalized_median, group = metric, colour = metric)) +
+ geom_line() +
+ #scale_y_continuous(limits=c(0,20)) +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5) +
+ xlab("Month") +
+ ylab("Normalized median (% of overall metric maximum)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9)) +
+ theme(legend.title=element_text(size=8),
+ legend.text=element_text(size=8))
+#normalized graph zoomed
+g_season_normal_zoom <- ggplot(season, aes(x=month, y=normalized_median, group = metric, colour = metric)) +
+ geom_line() +
+ scale_y_continuous(limits=c(0,20)) +
+ geom_point(size = 1.5, shape = 19, alpha = 0.5) +
+ xlab("Month") +
+ ylab("Normalized median (% of overall metric maximum)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=9),
+ axis.title=element_text(size=9)) +
+ theme(legend.title=element_text(size=8),
+ legend.text=element_text(size=8))
+
+pdf("foo.pdf", width=8, height=4)
+grid.arrange(g_season)
+dev.off()
+
+pdf("foo.pdf")
+grid.arrange(g_season_normal_overall, g_season_normal_zoom, nrow = 2, ncol=1)
+dev.off()
+
+#------------------------------------------------------------------------#
+# M & E: attrition
+#------------------------------------------------------------------------#
+
+
+#------------------------------------------------------------------------#
+# Not using these graphs
+#------------------------------------------------------------------------#
+
+#Intra-program correlation: use test1_data
+g1 <- ggplot(test1_data, aes(x=diff_active_day_percent, y=med_active_day_percent_1a)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(-15,15)) +
+ scale_y_continuous(limits=c(-15,15)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.23", parse = T, x=13, y=11) +
+ xlab("Difference in % active days (Individual CHW)") +
+ ylab("Difference in % active days (Program median)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8)) +
+ geom_vline(xintercept = 0) +
+ geom_hline(yintercept = 0)
+
+g2 <- ggplot(test1_data, aes(x=diff_nforms, y=med_nforms_1a)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,100)) +
+ scale_y_continuous(limits=c(0,100)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.21", parse = T, x=13, y=97) +
+ xlab("Difference in # forms (Individual CHW)") +
+ ylab("Difference in # forms (Program median)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+g3 <- ggplot(test1_data, aes(x=diff_time_using_cc, y=med_time_using_cc_1a)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,400)) +
+ scale_y_continuous(limits=c(0,400)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.10", parse = T, x=55, y=395) +
+ xlab("Difference in total duration CommCare use (Individual CHW)") +
+ ylab("Difference in total duration CommCare use (Program median)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+g4 <- ggplot(test1_data, aes(x=diff_median_visit_duration, y=med_median_visit_duration_1a)) +
+ geom_point(shape=1, size = 0.3) +
+ scale_x_continuous(limits=c(0,15)) +
+ scale_y_continuous(limits=c(0,15)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.00", parse = T, x=4, y=14.5) +
+ xlab("Difference in median visit duration (Individual CHW)") +
+ ylab("Difference in median visit duration (Program median)") +
+ theme_bw() +
+ theme(panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank()) +
+ theme(axis.text=element_text(size=8),
+ axis.title=element_text(size=8))
+
+pdf("foo.pdf")
+grid.arrange(g1, g2, g3, g4, nrow = 2, ncol=2)
+dev.off()
+
+
+
diff --git a/analysis_scripts/rdayalu/mbti_jung.R b/analysis_scripts/rdayalu/mbti_jung.R
new file mode 100644
index 0000000..747f9cf
--- /dev/null
+++ b/analysis_scripts/rdayalu/mbti_jung.R
@@ -0,0 +1,155 @@
+#MBTI analysis for INC summit and beyond
+#This code is to analyze the Jung/MBTI results from the Dimagi staff personality survey
+#1/8/15
+
+library(dplyr)
+library(lubridate)
+library(ggplot2)
+
+#Import survey results
+#Mike O'Donnell submitted results twice, so I deleted one submission of his
+mbti <- read.csv(file="mbti_results.csv")
+#Import file with hire date for each employee
+staff <- read.csv(file="staff_data.csv")
+staff <- select(staff, name, email)
+staff2 <- read.csv(file="staff_data2.csv")
+staff2 <- select(staff2, email, hire_date)
+
+#Manually change wrong email addresses
+#mbti$email[mbti$email == "Mcanty@dimagi.com"] <- "mcanty@dimagi.com"
+#mbti$email[mbti$email == "kumar@dimagi.com"] <- "vkumar@dimagi.com"
+#staff$email[staff$email == "ashekar@gmail.com"] <- "ashekar@dimagi.com"
+#staff$email[staff$email == "bbuczyk@dimagi.com"] <- "biyeun@dimagi.com"
+#staff$email[staff$email == "ssynder@dimagi.com"] <- "ssnyder@dimagi.com"
+
+#Merge survey results with names and hire date
+mbti <- merge(mbti, staff, by.x = "email", by.y = "email", all.x = T)
+mbti <- merge(mbti, staff2, by.x = "email", by.y = "email", all.x = T)
+
+#Convert hire_date to date variable
+mbti$hire_date <- as.Date(mbti$hire_date, "%m/%d/%Y")
+
+#Manually replace BUs for people who reported > 1 BU. Use INC as default
+mbti$bu[mbti$bu == "DWA, INC"] <- "INC"
+mbti$bu[mbti$bu == "DLAC, INC"] <- "INC"
+
+#Code MBTI results to NBI
+mbti$nbi[mbti$mbti %in% c("INTJ", "ESTP", "ESFP", "INFJ")] <- "L1"
+mbti$nbi[mbti$mbti %in% c("ISTJ", "ISFJ", "ESTJ", "ENTJ")] <- "L2"
+mbti$nbi[mbti$mbti %in% c("INTP", "ENTP", "ENFP", "ISTP")] <- "R1"
+mbti$nbi[mbti$mbti %in% c("ISFP", "INFP", "ENFJ", "ESFJ")] <- "R2"
+
+#Extract each MBTI letter
+mbti$one <- as.factor(substr(mbti$mbti, 1, 1))
+mbti$two <- as.factor(substr(mbti$mbti, 2, 2))
+mbti$three <- as.factor(substr(mbti$mbti, 3, 3))
+mbti$four <- as.factor(substr(mbti$mbti, 4, 4))
+
+#Extract hire year and calculate total # hires each year
+mbti$hire_year <- year(mbti$hire_date)
+new_hires <- mbti %>% group_by(hire_year) %>% summarise(nhires = length(hire_year),
+ nhires_L1 = sum(nbi=="L1"),
+ nhires_L2 = sum(nbi=="L2"),
+ nhires_R1 = sum(nbi=="R1"),
+ nhires_R2 = sum(nbi=="R2"))
+new_hires$per_L1 <- (new_hires$nhires_L1/new_hires$nhires)*100
+new_hires$per_L2 <- (new_hires$nhires_L2/new_hires$nhires)*100
+new_hires$per_R1 <- (new_hires$nhires_R1/new_hires$nhires)*100
+new_hires$per_R2 <- (new_hires$nhires_R2/new_hires$nhires)*100
+new_hires <- new_hires[!is.na(new_hires$hire_year),]
+new_hires <- new_hires[new_hires$hire_year != 2015,]
+new_hires <- new_hires[new_hires$hire_year >= 2010,]
+new_hires_graph <- data.frame(rep(new_hires$hire_year,4),
+ c(new_hires$per_L1, new_hires$per_L2,
+ new_hires$per_R1, new_hires$per_R2),
+ c(rep("L1",5), rep("L2",5), rep("R1",5), rep("R2",5)))
+names(new_hires_graph) <- c("hire_year", "percentage_nbi", "NBI")
+
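The long-format table above is assembled by hand, with `rep("L1",5)` etc. hardcoding five years of data; that silently breaks if the year filters change. The same pivot can be sketched with base R's `stats::reshape` (toy numbers below, not the survey data):

```r
# Toy wide table shaped like new_hires (values are made up)
new_hires <- data.frame(hire_year = 2010:2011,
                        per_L1 = c(20, 25), per_L2 = c(30, 25),
                        per_R1 = c(25, 25), per_R2 = c(25, 25))

# Pivot to long form; row count and labels follow the data,
# so there are no hardcoded rep() lengths to keep in sync
new_hires_graph <- reshape(new_hires, direction = "long",
                           varying = c("per_L1", "per_L2", "per_R1", "per_R2"),
                           v.names = "percentage_nbi",
                           timevar = "NBI", times = c("L1", "L2", "R1", "R2"),
                           idvar = "hire_year")
nrow(new_hires_graph)  # 8 rows = 2 years x 4 NBI groups
```

With `reshape`, the NBI labels and row counts follow the input columns, so the `rep(..., 5)` bookkeeping in the script disappears.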
+colours <- c("coral", "darkorchid1", "cornflowerblue", "forestgreen")
+g <- ggplot(new_hires_graph, aes(x=hire_year, y=percentage_nbi, colour = NBI)) +
+ geom_line(size = 1.5) +
+ xlab("Year of hire") +
+ ylab("Percentage (%) NBI of new hires") +
+ theme(axis.title.x=element_text(size=14), axis.text.x=element_text(size=14, colour = "black")) +
+ theme(axis.title.y=element_text(size=14), axis.text.y=element_text(size=14, colour = "black"))
+
+
+#Table of counts for MBTI types
+#Note: each statement below overwrites mbti_counts, so run only the one needed
+#for the plot that follows (the balloon plot needs the group_by(mbti) version;
+#the bar chart needs the group_by(two) version)
+mbti_counts <- mbti %>% group_by(mbti) %>% summarise(count=length(mbti))
+mbti_counts <- mbti %>% group_by(one) %>% summarise(count=length(one))
+mbti_counts <- mbti %>% group_by(two) %>% summarise(count=length(two))
+mbti_counts <- mbti %>% group_by(three) %>% summarise(count=length(three))
+mbti_counts <- mbti %>% group_by(four) %>% summarise(count=length(four))
+
+#MBTI results to first_half and second_half for balloon plot
+mbti_counts$first_half_mbti <- as.factor(substr(mbti_counts$mbti, 1, 2))
+mbti_counts$second_half_mbti <- as.factor(substr(mbti_counts$mbti, 3, 4))
+
+#Balloon plot
+g <- ggplot(mbti_counts, aes(x=second_half_mbti, y=first_half_mbti)) +
+ geom_point(aes(size=count), shape=21, colour="black", fill="cornsilk") +
+ scale_size_area(max_size=20, guide=FALSE) +
+ geom_text(aes(y=as.numeric(mbti_counts$first_half_mbti)-sqrt(count)/12, label=count),
+ vjust=1, colour="grey60", size=4) +
+ theme(axis.title.x=element_blank(), axis.title.y=element_blank())
+
+#Bar charts
+g <- ggplot(mbti_counts, aes(x=two, y=count, fill=two)) +
+ geom_bar(stat = "identity", width = 0.5) +
+ theme(axis.title.x=element_blank())
+
+#Table of counts for NBI types
+nbi_counts <- mbti %>% group_by(nbi) %>% summarise(count=length(nbi))
+
+#NBI results to first_half and second_half for balloon plot
+nbi_counts$first_half_nbi <- as.factor(substr(nbi_counts$nbi, 1, 1))
+nbi_counts$second_half_nbi <- as.factor(substr(nbi_counts$nbi, 2, 2))
+
+#Balloon plot
+nbi_counts$second_half_nbi = with(nbi_counts, factor(second_half_nbi, levels = rev(levels(second_half_nbi))))
+nbi_counts$nbi <- as.factor(nbi_counts$nbi)
+colours <- c("coral", "darkorchid1", "cornflowerblue", "forestgreen")
+g <- ggplot(nbi_counts, aes(x=first_half_nbi, y=second_half_nbi)) +
+ geom_point(aes(size=count), shape=21, colour="black", fill=colours) +
+ scale_size_area(max_size=20, guide=FALSE) +
+ geom_text(aes(y=as.numeric(nbi_counts$second_half_nbi)-sqrt(count)/20, label=count),
+ vjust=1, colour="grey60", size=4) +
+ theme(axis.title.x=element_blank(), axis.title.y=element_blank(), legend.position="none")
+
+
+
+#BY TEAM/BU
+#Table of counts for NBI types
+team <- filter(mbti, team == "Business Development")
+bu <- filter(mbti, bu == "DSI")
+nbi_counts <- bu %>% group_by(nbi) %>% summarise(count=length(nbi))
+
+#NBI results to first_half and second_half for balloon plot
+nbi_counts$first_half_nbi <- as.factor(substr(nbi_counts$nbi, 1, 1))
+nbi_counts$second_half_nbi <- as.factor(substr(nbi_counts$nbi, 2, 2))
+
+#Balloon plot
+nbi_counts$second_half_nbi = with(nbi_counts, factor(second_half_nbi, levels = rev(levels(second_half_nbi))))
+#nbi_counts$nbi <- as.factor(nbi_counts$nbi)
+colours <- c("coral", "darkorchid1", "cornflowerblue", "forestgreen")
+#colours <- c("coral", "darkorchid1", "forestgreen")
+#colours <- c("darkorchid1", "cornflowerblue", "forestgreen")
+g <- ggplot(nbi_counts, aes(x=first_half_nbi, y=second_half_nbi)) +
+ geom_point(aes(size=count), shape=21, colour="black", fill = colours) +
+ scale_size_area(max_size=20, guide=FALSE) +
+ geom_text(aes(y=as.numeric(nbi_counts$second_half_nbi)-sqrt(count)/10, label=count),
+ vjust=1, colour="grey60", size=4) +
+ theme(axis.title.x=element_blank(), axis.title.y=element_blank(), legend.position="none")
+
+
+
+#INC Summit workshop numbers
+#Variable "name" is the list of 59 people who are attending the INC summit
+#We have 52 (identified) people who took the survey.
+#3 people did not give their email addresses, but they are with INC, so I am
+#including them in the workshop total, assuming that they will attend
+inc <- filter(mbti, !(name == "") | is.na(name))
+table(inc$nbi, useNA = "always")
+test <- filter(inc, nbi == "L2")
+test <- filter(inc, team == "Ops")
+
+
+
+
+
diff --git a/analysis_scripts/rdayalu/mvp_wam.R b/analysis_scripts/rdayalu/mvp_wam.R
new file mode 100644
index 0000000..6b882e7
--- /dev/null
+++ b/analysis_scripts/rdayalu/mvp_wam.R
@@ -0,0 +1,22 @@
+library(dplyr)
+
+test <- domain
+test$mvp <- NA
+test$mvp[grep("mvp", test$name, fixed = T)] <- T
+test <- filter(test, mvp == T)
+
+test$has_amplifies_workers_app <- test$domain_id %in% filter(app_amplifies, amplifies_workers == T)$domain_id
+test$has_amplifies_project_app <- test$domain_id %in% filter(app_amplifies, amplifies_project == T)$domain_id
+
+#What is the app volume for all mvp domains?
+app_volume <- form_table %>% group_by(application_id) %>%
+ summarise(nforms = length(unique(id)),
+ domain_id = unique(domain_id))
+
+app_volume <- filter(app_volume, app_volume$domain_id %in% test$domain_id)
+test2 <- merge(app_volume, select(domain, domain_id, name), by = "domain_id", all.x = T)
+test2 <- filter(test2, nforms >= 10000)
+test2 <- merge(test2, select(app_amplifies, id, app_id), by.x = "application_id",
+ by.y = "id", all.x = T)
+
+#MVP apps with >10K form submissions (nforms filter already applied above)
diff --git a/analysis_scripts/rdayalu/pat_data.R b/analysis_scripts/rdayalu/pat_data.R
new file mode 100644
index 0000000..15a0a7e
--- /dev/null
+++ b/analysis_scripts/rdayalu/pat_data.R
@@ -0,0 +1,100 @@
+library(foreign)
+library(dplyr)
+library(data.table)
+library(lubridate)
+pat <- read.dta("Data_blogset_Rashmi.dta")
+pat$calendar_month <- as.Date(pat$calendar_month, "%m/%d/%Y")
+
+#Calculate differences between month_index to calculate next_month_active and
+#previous_month_active variables
+pat <- arrange(pat, domain_numeric, calendar_month)
+df <- data.table(pat)
+setkey(df,domain_numeric)
+df[,diff_days_domain:=c(NA,diff(calendar_month)),by=domain_numeric]
+pat <- as.data.frame(df)
+pat$previous_month_active_domain <- pat$diff_days_domain <= 31
+pat$previous_two_months_active_domain <- pat$diff_days_domain <= 62
+pat$previous_three_months_active_domain <- pat$diff_days_domain <= 93
+
+domains <- unique(pat$domain_numeric)
+
+next_month_active_domain <- c()
+for (i in domains) {
+ single_user <- pat[pat$domain_numeric == i,]
+ next_active <- c()
+ next_active <- c(single_user$previous_month_active_domain[-1], F)
+ next_month_active_domain <- c(next_month_active_domain, next_active)
+}
+pat$next_month_active_domain <- next_month_active_domain
+
+next_two_months_active_domain <- c()
+for (i in domains) {
+ single_user <- pat[pat$domain_numeric == i,]
+ next_active <- c()
+ next_active <- c(single_user$previous_two_months_active_domain[-1], F)
+ next_two_months_active_domain <- c(next_two_months_active_domain, next_active)
+}
+pat$next_two_months_active_domain <- next_two_months_active_domain
+
+next_three_months_active_domain <- c()
+for (i in domains) {
+ single_user <- pat[pat$domain_numeric == i,]
+ next_active <- c()
+ next_active <- c(single_user$previous_three_months_active_domain[-1], F)
+ next_three_months_active_domain <- c(next_three_months_active_domain, next_active)
+}
+pat$next_three_months_active_domain <- next_three_months_active_domain
+
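The three copy-pasted loops above all perform the same per-domain shift: next-month activity is previous-month activity moved up one row, padded with `FALSE` at the end of each domain. A base-R sketch of that shift using `ave` (toy data; column names match the script, values are illustrative):

```r
# Toy data: two domains, rows already sorted by month within domain
pat <- data.frame(domain_numeric = c(1, 1, 1, 2, 2),
                  previous_month_active_domain = c(NA, TRUE, TRUE, NA, FALSE))

# Shift each domain's flag up one row, padding the last row with FALSE
shift_up <- function(x) c(x[-1], FALSE)
pat$next_month_active_domain <- ave(pat$previous_month_active_domain,
                                    pat$domain_numeric, FUN = shift_up)
```

The same one-liner, with `previous_two_months_active_domain` and `previous_three_months_active_domain` substituted in, would replace the other two loops.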
+#Based on the end_month in our dataset, we don't know if the domain will be active in any of
+#the months following end_month. Must change all those attrition values to NA.
+end_month <- as.Date("2015-01-01")
+is.na(pat$next_month_active_domain) <- pat$calendar_month == end_month
+is.na(pat$next_two_months_active_domain) <- pat$calendar_month >= end_month - months(1)
+is.na(pat$next_three_months_active_domain) <- pat$calendar_month >= end_month - months(2)
+
+#Calculate months_on_cc_domain and active_calendar_months_domain
+total_months_cc <- pat %>% group_by(domain_numeric) %>%
+ summarise(first_month = min(calendar_month),
+ last_month = max(calendar_month),
+ active_calendar_months_domain = length(unique(calendar_month)))
+total_months_cc$months_on_cc_domain <- (interval(total_months_cc$first_month,
+ total_months_cc$last_month) %/% months(1))+1
+total_months_cc <- select(total_months_cc, domain_numeric, active_calendar_months_domain,
+ months_on_cc_domain)
+pat <- merge(pat, total_months_cc, by = "domain_numeric", all.x = T)
+
+#active_months_rolling
+#Add rolling # active months to each domain's rows
+pat <- pat %>% group_by(domain_numeric) %>%
+ mutate(active_months_rolling = seq_along(calendar_month))
+
+#Was the domain ever active again after an attrition event
+#(defined as next_month_active_domain == F)?
+pat$attrition_event_domain <- !(pat$next_month_active_domain == T |
+ is.na(pat$next_month_active_domain))
+pat$continuing_domain <- pat$active_months_rolling < pat$active_calendar_months_domain
+pat$ever_active_again_domain <- pat$attrition_event_domain == T & pat$continuing_domain == T
+is.na(pat$ever_active_again_domain) <- pat$attrition_event_domain == F
+
+#active_users_max_rolling
+#Maximum number of active users for any month including and prior to this month
+
+active_users_max_rolling <- c()
+for (i in domains) {
+  single_domain <- pat[pat$domain_numeric == i,]
+  standing_max <- single_domain$domain_user_per_month[1]
+  users_max <- standing_max
+  if (nrow(single_domain) > 1) {
+    for (j in 2:nrow(single_domain)) {
+      if (single_domain$domain_user_per_month[j] > standing_max) {
+        standing_max <- single_domain$domain_user_per_month[j]
+      }
+      users_max <- c(users_max, standing_max)
+    }
+  }
+  active_users_max_rolling <- c(active_users_max_rolling, users_max)
+}
+pat$active_users_max_rolling <- active_users_max_rolling
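The nested loop above maintains a running maximum by hand; that "max so far" is exactly what `cummax` computes, and `ave` applies it within each domain. A sketch on toy data (not the real dataset):

```r
# Toy data: monthly active-user counts for two domains, in month order
pat <- data.frame(domain_numeric = c(1, 1, 1, 2, 2),
                  domain_user_per_month = c(3, 5, 4, 10, 2))

# Running maximum of active users within each domain
pat$active_users_max_rolling <- ave(pat$domain_user_per_month,
                                    pat$domain_numeric, FUN = cummax)
pat$active_users_max_rolling  # 3 5 5 10 10
```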
+
+write.csv(pat, file="domain_monthly_rows_pat.csv")
+
diff --git a/analysis_scripts/rdayalu/test_1a.R b/analysis_scripts/rdayalu/test_1a.R
new file mode 100644
index 0000000..27d4e38
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_1a.R
@@ -0,0 +1,421 @@
+#Detach data.table so its verbs don't mask the dplyr functions used below
+detach("package:data.table")
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_nvisits_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_nvisits_1a = median(diff_nvisits, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_nvisits_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
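Each of the near-identical blocks in this file computes, for every user, the leave-one-out program median of one `diff_*` metric by calendar month. A hypothetical base-R helper could replace each block with a single call (`loo_median` is not in the repository; this is a sketch assuming a `training_consec`-like frame with `user_pk`, `domain`, and `calendar_month` columns):

```r
# Hypothetical helper: for each user, the median of a diff_* column across
# all OTHER users in the same domain, per calendar month (leave-one-out)
loo_median <- function(dat, col) {
  do.call(rbind, lapply(unique(dat$user_pk), function(u) {
    dom <- unique(dat$domain[dat$user_pk == u])
    peers <- dat[dat$user_pk != u & dat$domain %in% dom, ]
    agg <- aggregate(list(med = peers[[col]]),
                     list(calendar_month = peers$calendar_month),
                     median, na.rm = TRUE)
    agg$user_pk <- u
    agg
  }))
}

# Toy data: three users in one domain, one month
toy <- data.frame(user_pk = c(1, 2, 3), domain = "a",
                  calendar_month = as.Date("2014-01-01"),
                  diff_nvisits = c(2, 4, 6))
loo_median(toy, "diff_nvisits")  # medians 5, 4, 3 for users 1, 2, 3
```

Each block above would then reduce to one call like `loo_median(training_consec, "diff_nvisits")` followed by the same `paste`/`merge` step.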
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_active_day_percent_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_active_day_percent_1a = median(diff_active_day_percent, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_active_day_percent_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_nforms_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_nforms_1a = median(diff_nforms, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_nforms_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_median_visit_duration_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_median_visit_duration_1a = median(diff_median_visit_duration, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_median_visit_duration_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_median_visits_per_day_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_median_visits_per_day_1a = median(diff_median_visits_per_day, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_median_visits_per_day_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_time_using_cc_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_time_using_cc_1a = median(diff_time_using_cc, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_time_using_cc_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_ninteractions_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_ninteractions_1a = median(diff_ninteractions, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_ninteractions_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_ncases_registered_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_ncases_registered_1a = median(diff_ncases_registered, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_ncases_registered_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_register_followup_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_register_followup_1a = median(diff_register_followup, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_register_followup_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_case_register_followup_rate_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_case_register_followup_rate_1a = median(diff_case_register_followup_rate, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_case_register_followup_rate_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_ncases_touched_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_ncases_touched_1a = median(diff_ncases_touched, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_ncases_touched_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_nunique_followups_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_nunique_followups_1a = median(diff_nunique_followups, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_nunique_followups_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_audio_plays_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_audio_plays_1a = median(diff_audio_plays, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_audio_plays_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_network_warnings_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_network_warnings_1a = median(diff_network_warnings, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_network_warnings_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_num_user_pk_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_num_user_pk_1a = median(diff_num_user_pk, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_num_user_pk_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_domain_numeric_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_domain_numeric_1a = median(diff_domain_numeric, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_domain_numeric_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_undefined_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_sample_undefined_1a = median(diff_sample_undefined, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_undefined_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_normal_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %>%
+ group_by(calendar_month) %>%
+ summarise(med_sample_normal_1a = median(diff_sample_normal, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_normal_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize matrix
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_percentile_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_percentile_1a = median(diff_sample_percentile, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_percentile_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_increase_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_increase_1a = median(diff_sample_increase, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_increase_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_decrease_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_decrease_1a = median(diff_sample_decrease, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_decrease_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
\ No newline at end of file
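
The blocks above repeat one computation, varying only the indicator column: for each user, the per-month median of the other users' values in the same domain. A sketch of a generic helper — hypothetical name `add_loo_median`, assuming dplyr >= 1.0 (where `%>%` and the `.data` pronoun have long since replaced the deprecated `%.%` chain operator used in these scripts):

```r
library(dplyr)

# Hypothetical helper: for each user, the per-calendar_month median of
# `diff_col` over all *other* users in the same domain, merged back in
# as `out_col`. Assumes dplyr >= 1.0 (`.data` pronoun, `.groups`).
add_loo_median <- function(df, diff_col, out_col) {
  meds <- do.call(rbind, lapply(unique(df$user_pk), function(i) {
    domain_id <- unique(df$domain[df$user_pk == i])
    df %>%
      filter(user_pk != i, domain == domain_id) %>%
      group_by(calendar_month) %>%
      summarise(med = median(.data[[diff_col]], na.rm = TRUE),
                .groups = "drop") %>%
      mutate(user_pk = i)
  }))
  names(meds)[names(meds) == "med"] <- out_col
  # Merging on (user_pk, calendar_month) avoids both the string concat
  # key and the NA seed row of the rbind-in-a-loop version.
  merge(df, meds, by = c("user_pk", "calendar_month"), all.x = TRUE)
}

# e.g. training_consec <- add_loo_median(training_consec,
#                                        "diff_nvisits", "med_nvisits_1a")
```

Each repeated block then collapses to a single call with the diff column and output column names.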
diff --git a/analysis_scripts/rdayalu/test_1a_journal.R b/analysis_scripts/rdayalu/test_1a_journal.R
new file mode 100644
index 0000000..2c55bba
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_1a_journal.R
@@ -0,0 +1,280 @@
+#Detach data.table so it does not conflict with the dplyr verbs used below
+detach("package:data.table")
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_nvisits_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nvisits_1a = median(diff_nvisits, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_nvisits_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_active_day_percent_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_active_day_percent_1a = median(diff_active_day_percent, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_active_day_percent_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_nforms_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nforms_1a = median(diff_nforms, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_nforms_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_median_visit_duration_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_median_visit_duration_1a = median(diff_median_visit_duration, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_median_visit_duration_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_median_visits_per_day_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_median_visits_per_day_1a = median(diff_median_visits_per_day, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_median_visits_per_day_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_time_using_cc_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_time_using_cc_1a = median(diff_time_using_cc, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_time_using_cc_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_ninteractions_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ninteractions_1a = median(diff_ninteractions, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_ninteractions_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_ncases_registered_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ncases_registered_1a = median(diff_ncases_registered, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_ncases_registered_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_register_followup_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_register_followup_1a = median(diff_register_followup, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_register_followup_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_case_register_followup_rate_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_case_register_followup_rate_1a = median(diff_case_register_followup_rate, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_case_register_followup_rate_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_ncases_touched_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ncases_touched_1a = median(diff_ncases_touched, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_ncases_touched_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_nunique_followups_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nunique_followups_1a = median(diff_nunique_followups, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_nunique_followups_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_increase_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_increase_1a = median(diff_sample_increase, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_increase_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_abs <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_abs) <- c("calendar_month", "med_sample_decrease_1a", "user_pk")
+test_1_abs$calendar_month <- as.Date(test_1_abs$calendar_month)
+test_1_abs$user_pk <- as.numeric(test_1_abs$user_pk)
+users <- unique(training_consec$user_pk)
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_decrease_1a = median(diff_sample_decrease, na.rm = T),
+ user_pk = i)
+ test_1_abs <- rbind(test_1_abs, exclude_user_median)
+}
+test_1_abs$concat <- paste(test_1_abs$user_pk, test_1_abs$calendar_month, sep = "_")
+test_1_abs <- select(test_1_abs, concat, med_sample_decrease_1a)
+training_consec <- merge(training_consec, test_1_abs, by = "concat", all.x = T)
\ No newline at end of file
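
A side note on the accumulator pattern used throughout this file: the one-row NA seed frame survives every `rbind` and is discarded only because its `concat` key (`"NA_NA"`) matches nothing in the later merge. A small sketch of the list-then-bind alternative, which needs no seed row (`compute_for_user` is a hypothetical stand-in for the per-user `group_by`/`summarise` step):

```r
users <- c(1, 2, 3)

# Hypothetical stand-in for the per-user summarise step above.
compute_for_user <- function(i) data.frame(user_pk = i, med = i * 10)

# Collect per-user results in a list and bind once at the end: no NA
# seed row to carry through, and a single rbind is cheaper than growing
# a data frame inside the loop.
test_1_abs <- do.call(rbind, lapply(users, compute_for_user))
```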
diff --git a/analysis_scripts/rdayalu/test_1b.R b/analysis_scripts/rdayalu/test_1b.R
new file mode 100644
index 0000000..557e57d
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_1b.R
@@ -0,0 +1,309 @@
+# Percent difference in indicators for each user across consecutive months.
+# The rows are not guaranteed to be truly consecutive, so downstream we
+# will only use rows with previous_month_active == T.
+# This will be used for test 1b.
+
+training_typical <- arrange(training_typical, user_pk, calendar_month)
+users <- unique(training_typical$user_pk)
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$nvisits)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nvisits/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_nvisits <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$active_day_percent)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_active_day_percent/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_active_day_percent <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$nforms)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nforms/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_nforms <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$median_visit_duration)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_median_visit_duration/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_median_visit_duration <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$median_visits_per_day)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_median_visits_per_day/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_median_visits_per_day <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$time_using_cc)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_time_using_cc/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_time_using_cc <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$ninteractions)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_ninteractions/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_ninteractions <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$ncases_registered)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_ncases_registered/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_ncases_registered <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$register_followup)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_register_followup/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_register_followup <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$case_register_followup_rate)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_case_register_followup_rate/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_case_register_followup_rate <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$ncases_touched)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_ncases_touched/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_ncases_touched <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$nunique_followups)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nunique_followups/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_nunique_followups <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$audio_plays)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_audio_plays/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_audio_plays <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$network_warnings)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_network_warnings/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_network_warnings <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$num_user_pk)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_num_user_pk/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_num_user_pk <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$domain_numeric)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_domain_numeric/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_domain_numeric <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_undefined)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_undefined/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_undefined <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_normal)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_normal/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_normal <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_percentile)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_percentile/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_percentile <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_increase)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_increase/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_increase <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_decrease)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_decrease/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_decrease <- per_diff_indicator
\ No newline at end of file
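
Two observations on the loops above: the `is.na()` assignments rescan the whole accumulated vector on every iteration, and the `== Inf` test would miss a `-Inf` (which `is.finite()` catches along with `NaN` and `Inf`). Both concerns disappear in a single grouped pass with `lag()`; a sketch over a toy stand-in for `training_typical`, assuming dplyr >= 0.3 for grouped `mutate()` with `dplyr::lag()`:

```r
library(dplyr)

# Toy stand-in for training_typical: two users, two months each.
training_typical <- data.frame(
  user_pk        = c(1, 1, 2, 2),
  calendar_month = as.Date(c("2014-01-01", "2014-02-01",
                             "2014-01-01", "2014-02-01")),
  nvisits        = c(10, 15, 0, 4),
  diff_nvisits   = c(NA, 5, NA, 4))

# Percent change of nvisits relative to the same user's previous row,
# computed in one grouped pass. is.finite() blanks NaN, Inf and -Inf.
training_typical <- training_typical %>%
  arrange(user_pk, calendar_month) %>%
  group_by(user_pk) %>%
  mutate(per_diff_nvisits = 100 * diff_nvisits / lag(nvisits)) %>%
  ungroup() %>%
  mutate(per_diff_nvisits = ifelse(is.finite(per_diff_nvisits),
                                   per_diff_nvisits, NA))
```

The same one-liner applies to each of the other indicators by swapping the column pair.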
diff --git a/analysis_scripts/rdayalu/test_1b_2.R b/analysis_scripts/rdayalu/test_1b_2.R
new file mode 100644
index 0000000..c15b1f5
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_1b_2.R
@@ -0,0 +1,438 @@
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_nvisits_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nvisits_1b = median(per_diff_nvisits, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_nvisits_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_active_day_percent_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_active_day_percent_1b = median(per_diff_active_day_percent, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_active_day_percent_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_nforms_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nforms_1b = median(per_diff_nforms, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_nforms_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_median_visit_duration_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_median_visit_duration_1b = median(per_diff_median_visit_duration, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_median_visit_duration_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_median_visits_per_day_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_median_visits_per_day_1b = median(per_diff_median_visits_per_day, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_median_visits_per_day_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_time_using_cc_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_time_using_cc_1b = median(per_diff_time_using_cc, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_time_using_cc_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_ninteractions_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ninteractions_1b = median(per_diff_ninteractions, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_ninteractions_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_ncases_registered_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ncases_registered_1b = median(per_diff_ncases_registered, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_ncases_registered_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_register_followup_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_register_followup_1b = median(per_diff_register_followup, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_register_followup_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_case_register_followup_rate_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_case_register_followup_rate_1b = median(per_diff_case_register_followup_rate, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_case_register_followup_rate_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_ncases_touched_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ncases_touched_1b = median(per_diff_ncases_touched, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_ncases_touched_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_nunique_followups_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nunique_followups_1b = median(per_diff_nunique_followups, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_nunique_followups_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_audio_plays_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_audio_plays_1b = median(per_diff_audio_plays, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_audio_plays_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_network_warnings_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_network_warnings_1b = median(per_diff_network_warnings, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_network_warnings_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_num_user_pk_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_num_user_pk_1b = median(per_diff_num_user_pk, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_num_user_pk_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_domain_numeric_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_domain_numeric_1b = median(per_diff_domain_numeric, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_domain_numeric_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_undefined_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_undefined_1b = median(per_diff_sample_undefined, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_undefined_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_normal_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_normal_1b = median(per_diff_sample_normal, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_normal_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_percentile_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_percentile_1b = median(per_diff_sample_percentile, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_percentile_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_increase_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_increase_1b = median(per_diff_sample_increase, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_increase_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_decrease_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_decrease_1b = median(per_diff_sample_decrease, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_decrease_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
\ No newline at end of file
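Every block in the file above repeats the same leave-one-out computation: for each user, take all *other* users in the same domain, compute the per-month median of one `per_diff_*` column, and left-join the result back onto `training_consec` by the `user_pk`/`calendar_month` concat key. A minimal base-R sketch of a helper that captures that pattern is below; it is not part of this commit, `loo_domain_median` is a hypothetical name, and it assumes the data frame already carries `user_pk`, `domain`, `calendar_month`, and `concat` columns as the blocks above do:

```r
# Hypothetical helper (not in this commit): leave-one-out per-month median
# of `metric_col` among same-domain peers, joined back as `out_col`.
loo_domain_median <- function(df, metric_col, out_col) {
  out <- NULL
  for (i in unique(df$user_pk)) {
    dom <- unique(df$domain[df$user_pk == i])
    # All rows for other users in this user's domain(s)
    peers <- df[df$user_pk != i & df$domain %in% dom, ]
    if (nrow(peers) == 0) next
    # Per-calendar-month median across the peer rows
    med <- tapply(peers[[metric_col]], peers$calendar_month,
                  median, na.rm = TRUE)
    out_i <- data.frame(concat = paste(i, names(med), sep = "_"),
                        stringsAsFactors = FALSE)
    out_i[[out_col]] <- as.numeric(med)
    out <- rbind(out, out_i)
  }
  # Same left join the repeated blocks perform via merge(..., all.x = T)
  merge(df, out, by = "concat", all.x = TRUE)
}
```

With a helper like this, each repeated block would reduce to one call such as `training_consec <- loo_domain_median(training_consec, "per_diff_nvisits", "med_nvisits_1b")`, and the single-row NA seed dataframe would no longer be needed.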
diff --git a/analysis_scripts/rdayalu/test_1b_2_journal.R b/analysis_scripts/rdayalu/test_1b_2_journal.R
new file mode 100644
index 0000000..cd2b6be
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_1b_2_journal.R
@@ -0,0 +1,291 @@
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_nvisits_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nvisits_1b = median(per_diff_nvisits, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_nvisits_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_active_day_percent_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_active_day_percent_1b = median(per_diff_active_day_percent, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_active_day_percent_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_nforms_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nforms_1b = median(per_diff_nforms, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_nforms_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_median_visit_duration_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_median_visit_duration_1b = median(per_diff_median_visit_duration, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_median_visit_duration_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_median_visits_per_day_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_median_visits_per_day_1b = median(per_diff_median_visits_per_day, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_median_visits_per_day_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_time_using_cc_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_time_using_cc_1b = median(per_diff_time_using_cc, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_time_using_cc_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_ninteractions_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ninteractions_1b = median(per_diff_ninteractions, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_ninteractions_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_ncases_registered_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ncases_registered_1b = median(per_diff_ncases_registered, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_ncases_registered_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_register_followup_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_register_followup_1b = median(per_diff_register_followup, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_register_followup_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_case_register_followup_rate_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_case_register_followup_rate_1b = median(per_diff_case_register_followup_rate, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_case_register_followup_rate_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_ncases_touched_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_ncases_touched_1b = median(per_diff_ncases_touched, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_ncases_touched_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_nunique_followups_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_nunique_followups_1b = median(per_diff_nunique_followups, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_nunique_followups_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_increase_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_increase_1b = median(per_diff_sample_increase, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_increase_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
+
+
+#Initialize dataframe
+test_1_rel <- data.frame(matrix(ncol = 3, nrow = 1))
+names(test_1_rel) <- c("calendar_month", "med_sample_decrease_1b", "user_pk")
+test_1_rel$calendar_month <- as.Date(test_1_rel$calendar_month)
+test_1_rel$user_pk <- as.numeric(test_1_rel$user_pk)
+users <- unique(training_consec$user_pk)
+
+for (i in users) {
+ domain_id <- unique((filter(training_consec, user_pk == i))$domain)
+ exclude_user <- filter(training_consec, user_pk != i & domain == domain_id)
+ exclude_user_median <- exclude_user %.%
+ group_by(calendar_month) %.%
+ summarise(med_sample_decrease_1b = median(per_diff_sample_decrease, na.rm = T),
+ user_pk = i)
+ test_1_rel <- rbind(test_1_rel, exclude_user_median)
+}
+test_1_rel$concat <- paste(test_1_rel$user_pk, test_1_rel$calendar_month, sep = "_")
+test_1_rel <- select(test_1_rel, concat, med_sample_decrease_1b)
+training_consec <- merge(training_consec, test_1_rel, by = "concat", all.x = T)
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/test_1b_journal.R b/analysis_scripts/rdayalu/test_1b_journal.R
new file mode 100644
index 0000000..e266f28
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_1b_journal.R
@@ -0,0 +1,210 @@
+# Percent difference in each indicator for each user across consecutive months.
+# Rows are not guaranteed to be truly consecutive, so downstream we only
+# keep rows with previous_month_active == TRUE.
+# This will be used for test 1b.
+
+training_typical <- arrange(training_typical, user_pk, calendar_month)
+users <- unique(training_typical$user_pk)
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$nvisits)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nvisits/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_nvisits <- per_diff_indicator
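+
+# The loops below are identical except for the indicator column, so a single
+# helper could generate every per_diff_* column. A sketch (hypothetical
+# helper name, base R only; assumes training_typical is already sorted by
+# user_pk, calendar_month as above):
+#
+# per_diff_for <- function(df, diff_col, level_col) {
+#   unlist(lapply(split(df, df$user_pk), function(u) {
+#     prev <- c(NA, u[[level_col]][-nrow(u)])  # previous month's value
+#     pd <- (u[[diff_col]] / prev) * 100
+#     pd[is.nan(pd) | pd == Inf] <- NA         # 0/0 and x/0 become NA
+#     pd
+#   }), use.names = FALSE)
+# }
+# for (col in c("nvisits", "active_day_percent", "nforms")) {
+#   training_typical[[paste0("per_diff_", col)]] <-
+#     per_diff_for(training_typical, paste0("diff_", col), col)
+# }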
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$active_day_percent)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_active_day_percent/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_active_day_percent <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$nforms)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nforms/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_nforms <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$median_visit_duration)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_median_visit_duration/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_median_visit_duration <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$median_visits_per_day)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_median_visits_per_day/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_median_visits_per_day <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$time_using_cc)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_time_using_cc/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_time_using_cc <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$ninteractions)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_ninteractions/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_ninteractions <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$ncases_registered)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_ncases_registered/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_ncases_registered <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$register_followup)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_register_followup/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_register_followup <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$case_register_followup_rate)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_case_register_followup_rate/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_case_register_followup_rate <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$ncases_touched)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_ncases_touched/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_ncases_touched <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$nunique_followups)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nunique_followups/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_nunique_followups <- per_diff_indicator
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_increase)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_increase/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_increase <- per_diff_indicator
+
+
+per_diff_indicator <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ prev_indicator <- c()
+ prev_indicator <- append(NA, single_user$sample_decrease)
+ prev_indicator <- prev_indicator[-length(prev_indicator)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_sample_decrease/prev_indicator)*100
+ per_diff_indicator <- append(per_diff_indicator, per_diff)
+ is.na(per_diff_indicator) <- is.nan(per_diff_indicator)
+ is.na(per_diff_indicator) <- per_diff_indicator == Inf
+}
+training_typical$per_diff_sample_decrease <- per_diff_indicator
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/test_4b.R b/analysis_scripts/rdayalu/test_4b.R
new file mode 100644
index 0000000..aae167d
--- /dev/null
+++ b/analysis_scripts/rdayalu/test_4b.R
@@ -0,0 +1,855 @@
+test_4b <- c()
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$nvisits[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$nvisits[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]]))
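+
+# Each block below repeats this isolate/pad/normalise/slope pipeline,
+# changing only the indicator column passed to rev(). A reusable wrapper
+# (hypothetical name) would reduce each block to one call. Note the code
+# above implicitly assumes every user has at least one attrition event
+# (attrition_positions[1] would be NA otherwise), hence the guard here:
+#
+# attrition_slope <- function(col) {
+#   alist <- list()
+#   for (i in users) {
+#     u <- training_typical[training_typical$user_pk == i, ]
+#     pos <- which(u$next_month_active == F)
+#     if (length(pos) == 0) next               # user never churned
+#     alist <- lappend(alist, rev(u[[col]][1:pos[1]]))
+#     if (length(pos) > 1)
+#       for (j in 2:length(pos))
+#         alist <- lappend(alist, rev(u[[col]][(pos[j - 1] + 1):pos[j]]))
+#   }
+#   max_len <- max(sapply(alist, length))
+#   adata <- data.frame(do.call(rbind, lapply(alist,
+#     function(v) c(v, rep(NA, max_len - length(v))))))
+#   names(adata) <- paste0("month_", 1:ncol(adata))
+#   asub <- adata[!is.na(adata$month_5), paste0("month_", 1:5)]
+#   rel <- (asub / asub$month_5) * 100         # normalise to month 5
+#   is.na(rel) <- rel == Inf
+#   rel <- rel[rowSums(sapply(rel, is.nan)) == 0, ]
+#   median(apply(rel[, 1:4], 1, function(x) lm(x ~ c(1:4))$coefficients[[2]]))
+# }
+# test_4b <- sapply(c("nvisits", "active_day_percent"), attrition_slope)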
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$active_day_percent[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$active_day_percent[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$nforms[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$nforms[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$median_visit_duration[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$median_visit_duration[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$median_visits_per_day[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$median_visits_per_day[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$time_using_cc[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$time_using_cc[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ninteractions[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ninteractions[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_registered[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$register_followup[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$register_followup[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$case_register_followup_rate[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$case_register_followup_rate[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$ncases_touched[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$nunique_followups[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$nunique_followups[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$audio_plays[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$audio_plays[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$network_warnings[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$network_warnings[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$num_user_pk[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$num_user_pk[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$domain_numeric[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$domain_numeric[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_undefined[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_undefined[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_normal[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_normal[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_percentile[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_percentile[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_increase[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_increase[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
+
+
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ #Create vector of all attrition positions for this user
+ attrition_positions <- which(single_user$next_month_active == F)
+ #Append "months to first attrition event" to the attrition list
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_decrease[1:attrition_positions[1]]))
+ #Append "months to subsequent attrition events" to the attrition list
+ if(length(attrition_positions)>1) {
+ for(j in 2:length(attrition_positions)) {
+ attrition_list <- lappend(attrition_list, rev(single_user$sample_decrease[(attrition_positions[j-1]+1):attrition_positions[j]]))
+ }
+ }
+}
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+#Keep rows with at least "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+attrition_subset <- select(attrition_subset, month_1, month_2, month_3, month_4, month_5)
+#Calculate indicators per month relative to N = 5
+attrition_subset$rel_1 <- (attrition_subset$month_1/attrition_subset$month_5)*100
+attrition_subset$rel_2 <- (attrition_subset$month_2/attrition_subset$month_5)*100
+attrition_subset$rel_3 <- (attrition_subset$month_3/attrition_subset$month_5)*100
+attrition_subset$rel_4 <- (attrition_subset$month_4/attrition_subset$month_5)*100
+attrition_subset$rel_5 <- (attrition_subset$month_5/attrition_subset$month_5)*100
+is.na(attrition_subset) <- attrition_subset == Inf
+attrition_subset <- filter(attrition_subset, !is.nan(rel_1) & !is.nan(rel_2) & !is.nan(rel_3) & !is.nan(rel_4) & !is.nan(rel_5))
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+test_4b <- append(test_4b,
+ median(apply(attrition_subset[,6:9], 1, function(x) lm(x~c(1:4))$coefficients[[2]])))
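The eleven blocks above repeat the same pipeline, varying only the indicator column. A minimal sketch of how it could be factored into one function, assuming `training_typical`, `users`, `lappend`, and dplyr are in scope as in the surrounding script (`attrition_slope` is a hypothetical name, not part of the original code):

```r
# Hypothetical helper: runs the attrition pipeline for one indicator column
# and returns the median lm slope over relative months 1-4.
attrition_slope <- function(col) {
  attrition_list <- list()
  for (i in users) {
    single_user <- training_typical[training_typical$user_pk == i, ]
    pos <- which(single_user$next_month_active == FALSE)
    bounds <- c(0, pos)
    for (j in seq_along(pos)) {
      # Months leading up to the j-th attrition event, reversed so that
      # month_1 is the attrition month itself
      run <- single_user[[col]][(bounds[j] + 1):bounds[j + 1]]
      attrition_list <- lappend(attrition_list, rev(run))
    }
  }
  # Pad to equal length and build the month_1..month_N data frame
  max_length <- max(sapply(attrition_list, length))
  attrition_data <- as.data.frame(do.call(rbind, lapply(
    attrition_list, function(v) c(v, rep(NA, max_length - length(v))))))
  names(attrition_data) <- paste0("month_", seq_len(ncol(attrition_data)))
  a <- filter(attrition_data, !is.na(month_5))
  # Indicator values relative to month 5, as percentages
  rel <- (a[, paste0("month_", 1:4)] / a$month_5) * 100
  rel[rel == Inf] <- NA
  rel <- rel[!apply(is.na(as.matrix(rel)) | is.nan(as.matrix(rel)), 1, any), ]
  median(apply(rel, 1, function(x) lm(x ~ c(1:4))$coefficients[[2]]))
}
# test_4b <- sapply(c("ncases_touched", "nunique_followups", "audio_plays"),
#                   attrition_slope)
```

Dropping rows where month_5 is zero (Inf or NaN ratios) matches the filtering in the blocks above, since rel_5 is NaN exactly when month_5 is zero.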
diff --git a/analysis_scripts/rdayalu/top_bottom_users.R b/analysis_scripts/rdayalu/top_bottom_users.R
new file mode 100644
index 0000000..dbadebd
--- /dev/null
+++ b/analysis_scripts/rdayalu/top_bottom_users.R
@@ -0,0 +1,134 @@
+#User consistency blog
+#ID top/bottom 10% ncases_touched
+#Only keep domains with nusers >= 30
+
+#nusers <- test2_data %>% group_by(domain) %>%
+# summarise(nusers = length(unique(user_pk)))
+nusers <- test2_data %>% group_by(domain, calendar_month) %>%
+ summarise(nusers = length(unique(user_pk)))
+nusers <- filter(nusers, nusers >= 30)
+nusers$concat <- paste(nusers$domain, nusers$calendar_month, sep = "_")
+
+test2_data$concat <- paste(test2_data$domain, test2_data$calendar_month, sep = "_")
+tb <- test2_data[test2_data$concat %in% nusers$concat,]
+#tb <- test2_data[test2_data$domain %in% nusers$domain,]
+
+#Percentile function
+percentile <- function(x) ecdf(x)(x)
+raw_percentile <- tb %>%
+  group_by(domain, calendar_month) %>%
+  mutate(percentile_ntouched = percentile(ncases_touched)*100)
+
+#Mark top/bottom 10% users
+raw_percentile$top_10p_ntouched <- raw_percentile$percentile_ntouched >= 90
+raw_percentile$bot_10p_ntouched <- raw_percentile$percentile_ntouched <= 10
+top_users <- unique(filter(raw_percentile, top_10p_ntouched == T)$user_pk)
+bot_users <- unique(filter(raw_percentile, bot_10p_ntouched == T)$user_pk)
+true_top <- top_users[!(top_users %in% bot_users)]
+true_bot <- bot_users[!(bot_users %in% top_users)]
+
+#Create dataset for exclusively top/bottom users
+top <- test2_data[test2_data$user_pk %in% true_top,]
+bot <- test2_data[test2_data$user_pk %in% true_bot,]
+#top <- test2_data[test2_data$user_pk %in% top_users,]
+#bot <- test2_data[test2_data$user_pk %in% bot_users,]
+
+#Create dataset for top/bottom observations
+top <- filter(raw_percentile, top_10p_ntouched == T)
+bot <- filter(raw_percentile, bot_10p_ntouched == T)
+
+#Calculate correlation
+cor(top$prev_ncases_touched, top$ncases_touched, use = "complete.obs")
+cor(bot$prev_ncases_touched, bot$ncases_touched, use = "complete.obs")
+#Correlation function
+correlation <- function(x, y) cor(x, y, use = "complete.obs")
+corr_domains <- tb %>%
+ group_by(domain) %>%
+ summarise(corr_ntouched = correlation(prev_ncases_touched, ncases_touched))
+corr_domains <- arrange(corr_domains, corr_ntouched)
+top_domains <- top %>%
+ group_by(domain) %>%
+ summarise(corr_top_ntouched = correlation(prev_ncases_touched, ncases_touched))
+
+bot_domains <- bot %>%
+ group_by(domain) %>%
+ summarise(corr_bot_ntouched = correlation(prev_ncases_touched, ncases_touched))
+detach("package:data.table")
+top_domains <- merge(top_domains, bot_domains, by = "domain", all.x = T)
+top_domains <- arrange(top_domains, corr_top_ntouched)
+
+#Scatterplots
+
+tb <- filter(tb, domain_numeric == 18 | domain_numeric == 264)
+tb <- select(tb, domain_numeric, user_pk, calendar_month, prev_ncases_touched, ncases_touched)
+
+test <- filter(tb, domain_numeric == 18)
+g <- ggplot(test, aes(x=prev_ncases_touched, y=ncases_touched)) +
+ geom_point(shape=1) +
+ scale_x_continuous(limits=c(0,50)) +
+ scale_y_continuous(limits=c(0,50)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.75", parse = T, x=38, y=3)
+
+test <- filter(tb, domain_numeric == 264)
+g <- ggplot(test, aes(x=prev_ncases_touched, y=ncases_touched)) +
+ geom_point(shape=1) +
+ scale_x_continuous(limits=c(0,50)) +
+ scale_y_continuous(limits=c(0,50)) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.36", parse = T, x=38, y=3)
+
+write.csv(test, file = "domain_consistency_comparison.csv")
+
+#Top 10% and bottom 10% scatterplots
+g <- ggplot(top, aes(x=prev_ncases_touched, y=ncases_touched)) +
+ geom_point(shape=1) +
+ geom_smooth(method=lm) +
+ annotate("text", label="r^2 == 0.69", parse = T, x=6, y=98)
+
+g <- ggplot(bot, aes(x=prev_ncases_touched, y=ncases_touched)) +
+ geom_point(shape=1) +
+ geom_smooth(method=lm) +
+ scale_y_continuous(limits=c(0,50)) +
+ scale_x_continuous(limits=c(0,50)) +
+ annotate("text", label="r^2 == 0.34", parse = T, x=3, y=48)
+
+#Ratio of top/bottom 10% months per user
+#First calculate # months that each user_pk has in the top/bottom category
+top_total <- top %>%
+ group_by(user_pk) %>%
+ summarise(top_total = sum(top_10p_ntouched))
+bot_total <- bot %>%
+ group_by(user_pk) %>%
+ summarise(bot_total = sum(bot_10p_ntouched))
+#Merge these totals back to the tb dataset
+tb <- merge(tb, bot_total, by = "user_pk", all.x = T)
+tb <- merge(tb, top_total, by = "user_pk", all.x = T)
+#Create top/bottom ratio for each user
+tb$top_ratio <- (tb$top_total/tb$active_months)*100
+tb$bot_ratio <- (tb$bot_total/tb$active_months)*100
+tb$top_user <- tb$user_pk %in% top_users
+tb$bot_user <- tb$user_pk %in% bot_users
+tb$user_rank[tb$top_user==T] <- "top_user"
+tb$user_rank[tb$bot_user==T] <- "bot_user"
+
+#Boxplot
+g <- ggplot(tb, aes(x="top_users", y=top_ratio)) +
+ geom_boxplot(width = 0.3) +
+ scale_y_continuous(limits=c(0,80))
+
+g <- ggplot(tb, aes(x="bottom_users", y=bot_ratio)) +
+ geom_boxplot(width = 0.3) +
+ scale_y_continuous(limits=c(0,80))
+
+#Test statistic for correlations
+cor.test(top$prev_ncases_touched, top$ncases_touched,
+ alternative = "two.sided", conf.level = 0.95)
+cor.test(bot$prev_ncases_touched, bot$ncases_touched,
+ alternative = "two.sided", conf.level = 0.95)
+
+test <- test2_data[!(test2_data$user_pk %in% top_users |
+ test2_data$user_pk %in% bot_users),]
+cor(test$prev_ncases_touched, test$ncases_touched, use = "complete.obs")
+cor.test(test$prev_ncases_touched, test$ncases_touched,
+ alternative = "two.sided", conf.level = 0.95)
\ No newline at end of file
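One subtlety of the `ecdf`-based percentile used in top_bottom_users.R: `ecdf(x)(x)` is the fraction of values less than or equal to each point, so the `>= 90` and `<= 10` cutoffs are not symmetric. A self-contained toy illustration:

```r
# ecdf(x)(x) is the fraction of values <= each x, i.e. rank/n for distinct data
x <- 1:10
pct <- ecdf(x)(x) * 100   # 10, 20, ..., 100
# top flag as in top_10p_ntouched: catches ranks 9 and 10 (pct 90 and 100)
sum(pct >= 90)   # 2
# bottom flag as in bot_10p_ntouched: catches rank 1 only (pct 10)
sum(pct <= 10)   # 1
```

With distinct values, the tails are therefore slightly unbalanced: the top flag captures one more observation per group than the bottom flag.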
diff --git a/analysis_scripts/rdayalu/training_set.R b/analysis_scripts/rdayalu/training_set.R
new file mode 100644
index 0000000..6654f2f
--- /dev/null
+++ b/analysis_scripts/rdayalu/training_set.R
@@ -0,0 +1,349 @@
+#Training set has 782 unique users
+#Exclude any users who logged > 100 visits in any month
+all_monthly$visits_ge_100 <- all_monthly$nvisits > 100
+user_ge_100 <- all_monthly %>%
+  group_by(user_id) %>%
+  summarise(ge_100 = sum(visits_ge_100))
+user_le_100 <- filter(user_ge_100, ge_100 == 0)
+#696 users have only <= 100 visits per month
+training_typical <-
+ all_monthly[all_monthly$user_id %in% user_le_100$user_id, ]
+
+#Exclude users with < 4 months on CC
+month_count <- training_typical %>%
+  group_by(domain, user_id) %>%
+  summarise(months_on_cc = length(unique(calendar_month)))
+month_count <- filter(month_count, months_on_cc >= 4)
+#390 users have >= 4 months on CC
+training_typical <-
+ training_typical[training_typical$user_id %in% month_count$user_id, ]
+
+#Calculate differences between month_index to calculate next_month_active and
+#previous_month_active variables
+#Also want differences between indicators for each user from one month to the next
+#Differences in indicators will be used for tests 1a/b
+training_typical <- arrange(training_typical, user_pk, calendar_month)
+df <- data.table(training_typical)
+setkey(df,user_pk)
+df[,diff_days:=c(NA,diff(calendar_month)),by=user_pk]
+df[,diff_nvisits:=c(NA,diff(nvisits)),by=user_pk]
+df[,diff_active_day_percent:=c(NA,diff(active_day_percent)),by=user_pk]
+df[,diff_nforms:=c(NA,diff(nforms)),by=user_pk]
+df[,diff_median_visit_duration:=c(NA,diff(median_visit_duration)),by=user_pk]
+df[,diff_median_visits_per_day:=c(NA,diff(median_visits_per_day)),by=user_pk]
+df[,diff_time_using_cc:=c(NA,diff(time_using_cc)),by=user_pk]
+df[,diff_ninteractions:=c(NA,diff(ninteractions)),by=user_pk]
+df[,diff_ncases_registered:=c(NA,diff(ncases_registered)),by=user_pk]
+df[,diff_register_followup:=c(NA,diff(register_followup)),by=user_pk]
+df[,diff_case_register_followup_rate:=c(NA,diff(case_register_followup_rate)),by=user_pk]
+df[,diff_ncases_touched:=c(NA,diff(ncases_touched)),by=user_pk]
+df[,diff_nunique_followups:=c(NA,diff(nunique_followups)),by=user_pk]
+df[,diff_audio_plays:=c(NA,diff(audio_plays)),by=user_pk]
+df[,diff_network_warnings:=c(NA,diff(network_warnings)),by=user_pk]
+df[,diff_num_user_pk:=c(NA,diff(num_user_pk)),by=user_pk]
+df[,diff_domain_numeric:=c(NA,diff(domain_numeric)),by=user_pk]
+df[,diff_sample_undefined:=c(NA,diff(sample_undefined)),by=user_pk]
+df[,diff_sample_normal:=c(NA,diff(sample_normal)),by=user_pk]
+df[,diff_sample_percentile:=c(NA,diff(sample_percentile)),by=user_pk]
+df[,diff_sample_increase:=c(NA,diff(sample_increase)),by=user_pk]
+df[,diff_sample_decrease:=c(NA,diff(sample_decrease)),by=user_pk]
+training_typical <- as.data.frame(df)
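The 22 near-identical `diff_*` assignments above could be generated in a single data.table call; a sketch, assuming `df` is the keyed data.table built from `training_typical` and `ind_cols` lists the indicator columns (abbreviated here):

```r
# Hypothetical condensed form of the per-column diff assignments:
# one lapply over .SD creates all diff_* columns, grouped by user_pk.
ind_cols <- c("nvisits", "active_day_percent", "nforms", "ninteractions",
              "ncases_registered", "ncases_touched")  # ...and the rest
df[, paste0("diff_", ind_cols) := lapply(.SD, function(x) c(NA, diff(x))),
   by = user_pk, .SDcols = ind_cols]
```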
+training_typical$previous_month_active <- training_typical$diff_days <= 31
+users <- unique(training_typical$user_pk)
+
+next_month_active <- c()
+for (i in users) {
+ single_user <- training_typical[training_typical$user_pk == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_month_active[-1], F)
+ next_month_active <- append(next_month_active, next_active)
+}
+training_typical$next_month_active <- next_month_active
+#If calendar_month = 10/1/14 then next_month_active = NA
+#because we don't know if the user will be active in the following month
+is.na(training_typical$next_month_active) <- training_typical$calendar_month == "2014-10-01"
+
+#Sample 43 users from largest domain (tulasalud)
+#Need to exclude 264 users
+nusers_domain <- training_typical %>% group_by(domain) %>% summarise(nusers = length(unique(user_pk)))
+exclude_users_tula <- sample(unique(training_typical$user_pk[training_typical$domain == "tulasalud"]), 264)
+training_typical <- training_typical[!(training_typical$user_pk %in% exclude_users_tula),]
+
+#------------------------------------------------------------------------#
+#General plots
+#------------------------------------------------------------------------#
+#Number of users by calendar month
+n_user <- training_typical %>%
+  group_by(calendar_month) %>%
+  summarise(n_users = length(unique(user_id)))
+
+g <- ggplot(n_user, aes(x=calendar_month, y=n_users)) +
+ geom_point(size = 3, shape = 19, alpha = 0.5, colour = "darkblue",
+ fill = "lightblue") +
+ geom_line(colour = "darkblue") +
+ scale_size_area() +
+ xlab("Calendar month") +
+ ylab("# unique users/month") +
+ theme(axis.text=element_text(size=12), axis.title=element_text(size=14,
+ face="bold")) +
+ ggtitle("Number of users by calendar month") +
+ theme(plot.title = element_text(size=14, face="bold"))
+
+pdf("plots.pdf")
+plot(g)
+dev.off()
+
+#------------------------------------------------------------------------#
+#Code for Test 1
+#------------------------------------------------------------------------#
+# % difference in indicators for each user for consecutive months
+# This isn't for truly consecutive months, so later on,
+# we will only use rows with previous_month_active == T
+#This will be used for test 1b
+source(file.path("analysis_scripts","rdayalu","test_1b.R", fsep = .Platform$file.sep))
+
+#Must only include rows with previous_month_active == T/NA.
+#We need to keep previous_month_active = NA because the indicator values for that
+#month's row need to contribute to the domain median.
+training_consec <- filter(training_typical, previous_month_active == T |
+ is.na(previous_month_active))
+training_consec$concat <- paste(training_consec$user_pk, training_consec$calendar_month,
+ sep = "_")
+
+
+#Exclude domain calendar_months with nusers < 5 for that domain
+#Use this dataset only for test 1a/1b
+nusers <- training_consec %>%
+ group_by(domain, calendar_month) %>%
+ summarise(nusers = length(unique(user_pk)))
+nusers <- filter(nusers, nusers >= 5)
+nusers$concat <- paste(nusers$domain, nusers$calendar_month, sep = "_")
+training_consec <-
+ training_consec[paste(training_consec$domain, training_consec$calendar_month, sep = "_") %in%
+ nusers$concat, ]
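The paste-and-`%in%` filter above can also be expressed as a join; a sketch with `dplyr::semi_join` on toy data (assumes dplyr is attached, as elsewhere in this script):

```r
library(dplyr)

obs  <- data.frame(domain = c("a", "a", "b"),
                   calendar_month = c("m1", "m2", "m1"),
                   nvisits = 1:3)
keep <- data.frame(domain = "a", calendar_month = c("m1", "m2"))

# semi_join keeps rows of obs with a match in keep, without adding columns
filtered <- semi_join(obs, keep, by = c("domain", "calendar_month"))
```

Joining on the two key columns directly avoids building a throwaway `concat` key.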
+
+#Domain median ABSOLUTE change per user per calendar month,
+#excluding each user from the domain median for that user's row
+#This is used for test 1a
+source(file.path("analysis_scripts","rdayalu","test_1a.R", fsep = .Platform$file.sep))
+
+#Domain median PERCENTAGE change per user per calendar month,
+#excluding each user from the domain median for that user's row
+source(file.path("analysis_scripts","rdayalu","test_1b_2.R", fsep = .Platform$file.sep))
+
+names(training_consec)
+diff_indicator <- names(training_consec[41:61])
+per_diff_indicator <- names(training_consec[64:84])
+
+test_1a <-
+ c(cor(training_consec$med_nvisits_1a, training_consec$diff_nvisits, use = "complete.obs"),
+ cor(training_consec$med_active_day_percent_1a, training_consec$diff_active_day_percent, use = "complete.obs"),
+ cor(training_consec$med_nforms_1a, training_consec$diff_nforms, use = "complete.obs"),
+ cor(training_consec$med_median_visit_duration_1a, training_consec$diff_median_visit_duration, use = "complete.obs"),
+ cor(training_consec$med_median_visits_per_day_1a, training_consec$diff_median_visits_per_day, use = "complete.obs"),
+ cor(training_consec$med_time_using_cc_1a, training_consec$diff_time_using_cc, use = "complete.obs"),
+ cor(training_consec$med_ninteractions_1a, training_consec$diff_ninteractions, use = "complete.obs"),
+ cor(training_consec$med_ncases_registered_1a, training_consec$diff_ncases_registered, use = "complete.obs"),
+ cor(training_consec$med_register_followup_1a, training_consec$diff_register_followup, use = "complete.obs"),
+ cor(training_consec$med_case_register_followup_rate_1a, training_consec$diff_case_register_followup_rate, use = "complete.obs"),
+ cor(training_consec$med_ncases_touched_1a, training_consec$diff_ncases_touched, use = "complete.obs"),
+ cor(training_consec$med_nunique_followups_1a, training_consec$diff_nunique_followups, use = "complete.obs"),
+ cor(training_consec$med_audio_plays_1a, training_consec$diff_audio_plays, use = "complete.obs"),
+ cor(training_consec$med_network_warnings_1a, training_consec$diff_network_warnings, use = "complete.obs"),
+ cor(training_consec$med_num_user_pk_1a, training_consec$diff_num_user_pk, use = "complete.obs"),
+ cor(training_consec$med_domain_numeric_1a, training_consec$diff_domain_numeric, use = "complete.obs"),
+ cor(training_consec$med_sample_undefined_1a, training_consec$diff_sample_undefined, use = "complete.obs"),
+ cor(training_consec$med_sample_normal_1a, training_consec$diff_sample_normal, use = "complete.obs"),
+ cor(training_consec$med_sample_percentile_1a, training_consec$diff_sample_percentile, use = "complete.obs"),
+ cor(training_consec$med_sample_increase_1a, training_consec$diff_sample_increase, use = "complete.obs"),
+ cor(training_consec$med_sample_decrease_1a, training_consec$diff_sample_decrease, use = "complete.obs"))
+names(test_1a) <- indicators
+
+test_1b <-
+ c(cor(training_consec$med_nvisits_1b, training_consec$per_diff_nvisits, use = "complete.obs"),
+ cor(training_consec$med_active_day_percent_1b, training_consec$per_diff_active_day_percent, use = "complete.obs"),
+ cor(training_consec$med_nforms_1b, training_consec$per_diff_nforms, use = "complete.obs"),
+ cor(training_consec$med_median_visit_duration_1b, training_consec$per_diff_median_visit_duration, use = "complete.obs"),
+ cor(training_consec$med_median_visits_per_day_1b, training_consec$per_diff_median_visits_per_day, use = "complete.obs"),
+ cor(training_consec$med_time_using_cc_1b, training_consec$per_diff_time_using_cc, use = "complete.obs"),
+ cor(training_consec$med_ninteractions_1b, training_consec$per_diff_ninteractions, use = "complete.obs"),
+ cor(training_consec$med_ncases_registered_1b, training_consec$per_diff_ncases_registered, use = "complete.obs"),
+ cor(training_consec$med_register_followup_1b, training_consec$per_diff_register_followup, use = "complete.obs"),
+ cor(training_consec$med_case_register_followup_rate_1b, training_consec$per_diff_case_register_followup_rate, use = "complete.obs"),
+ cor(training_consec$med_ncases_touched_1b, training_consec$per_diff_ncases_touched, use = "complete.obs"),
+ cor(training_consec$med_nunique_followups_1b, training_consec$per_diff_nunique_followups, use = "complete.obs"),
+ cor(training_consec$med_audio_plays_1b, training_consec$per_diff_audio_plays, use = "complete.obs"),
+ cor(training_consec$med_network_warnings_1b, training_consec$per_diff_network_warnings, use = "complete.obs"),
+ cor(training_consec$med_num_user_pk_1b, training_consec$per_diff_num_user_pk, use = "complete.obs"),
+ cor(training_consec$med_domain_numeric_1b, training_consec$per_diff_domain_numeric, use = "complete.obs"),
+ cor(training_consec$med_sample_undefined_1b, training_consec$per_diff_sample_undefined, use = "complete.obs"),
+ cor(training_consec$med_sample_normal_1b, training_consec$per_diff_sample_normal, use = "complete.obs"),
+ cor(training_consec$med_sample_percentile_1b, training_consec$per_diff_sample_percentile, use = "complete.obs"),
+ cor(training_consec$med_sample_increase_1b, training_consec$per_diff_sample_increase, use = "complete.obs"),
+ cor(training_consec$med_sample_decrease_1b, training_consec$per_diff_sample_decrease, use = "complete.obs"))
+names(test_1b) <- indicators
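The 21 hand-written `cor()` calls in `test_1a`/`test_1b` follow a single naming pattern, so they can be generated from the indicator names. A sketch (`cor_by_indicator` is a hypothetical helper; `indicators` is assumed to hold the bare indicator names, as in `names(test_1a) <- indicators` above):

```r
cor_by_indicator <- function(df, med_suffix, diff_prefix, indicators) {
  # correlate med_<ind><med_suffix> with <diff_prefix><ind> for each indicator
  sapply(indicators, function(ind) {
    cor(df[[paste0("med_", ind, med_suffix)]],
        df[[paste0(diff_prefix, ind)]],
        use = "complete.obs")
  })
}
# e.g. test_1a <- cor_by_indicator(training_consec, "_1a", "diff_", indicators)
#      test_1b <- cor_by_indicator(training_consec, "_1b", "per_diff_", indicators)
```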
+
+
+#Pairwise plots of absolute and % changes for individual FLWs by domain medians
+g <- ggplot(tula_consec, aes(x=med_domain_abs_change, y=diff_nvisits)) +
+ geom_point(shape=1) +
+ #scale_y_continuous(limits=c(-100,100)) +
+ geom_smooth(method=lm)
+
+
+#------------------------------------------------------------------------#
+#Code for Test 2
+#------------------------------------------------------------------------#
+
+#Previous month's indicator value
+training_typical$prev_nvisits <- training_typical$nvisits - training_typical$diff_nvisits
+training_typical$prev_active_day_percent <- training_typical$active_day_percent - training_typical$diff_active_day_percent
+training_typical$prev_nforms<- training_typical$nforms - training_typical$diff_nforms
+training_typical$prev_median_visit_duration <- training_typical$median_visit_duration - training_typical$diff_median_visit_duration
+training_typical$prev_median_visits_per_day <- training_typical$median_visits_per_day - training_typical$diff_median_visits_per_day
+training_typical$prev_time_using_cc <- training_typical$time_using_cc - training_typical$diff_time_using_cc
+training_typical$prev_ninteractions <- training_typical$ninteractions - training_typical$diff_ninteractions
+training_typical$prev_ncases_registered <- training_typical$ncases_registered - training_typical$diff_ncases_registered
+training_typical$prev_register_followup <- training_typical$register_followup - training_typical$diff_register_followup
+training_typical$prev_case_register_followup_rate <- training_typical$case_register_followup_rate - training_typical$diff_case_register_followup_rate
+training_typical$prev_ncases_touched <- training_typical$ncases_touched - training_typical$diff_ncases_touched
+training_typical$prev_nunique_followups <- training_typical$nunique_followups - training_typical$diff_nunique_followups
+training_typical$prev_audio_plays <- training_typical$audio_plays - training_typical$diff_audio_plays
+training_typical$prev_network_warnings <- training_typical$network_warnings- training_typical$diff_network_warnings
+training_typical$prev_num_user_pk <- training_typical$num_user_pk - training_typical$diff_num_user_pk
+training_typical$prev_domain_numeric <- training_typical$domain_numeric - training_typical$diff_domain_numeric
+training_typical$prev_sample_undefined <- training_typical$sample_undefined - training_typical$diff_sample_undefined
+training_typical$prev_sample_normal <- training_typical$sample_normal - training_typical$diff_sample_normal
+training_typical$prev_sample_percentile <- training_typical$sample_percentile - training_typical$diff_sample_percentile
+training_typical$prev_sample_increase <- training_typical$sample_increase - training_typical$diff_sample_increase
+training_typical$prev_sample_decrease <- training_typical$sample_decrease - training_typical$diff_sample_decrease
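Since every `prev_*` column above is just the indicator minus its `diff_` column, the block can be generated in a loop. A base-R sketch (`add_prev_columns` is a hypothetical helper; `indicators` is assumed to hold the bare column names):

```r
add_prev_columns <- function(df, indicators) {
  for (ind in indicators) {
    # previous month's value = current value minus the month-over-month diff
    df[[paste0("prev_", ind)]] <- df[[ind]] - df[[paste0("diff_", ind)]]
  }
  df
}
# e.g. training_typical <- add_prev_columns(training_typical, indicators)
```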
+
+test2_data <- filter(training_typical, previous_month_active == T)
+
+test_2a <-
+ c(cor(training_typical$prev_nvisits, training_typical$nvisits, use = "complete.obs"),
+ cor(training_typical$prev_active_day_percent, training_typical$active_day_percent, use = "complete.obs"),
+ cor(training_typical$prev_nforms, training_typical$nforms, use = "complete.obs"),
+ cor(training_typical$prev_median_visit_duration, training_typical$median_visit_duration, use = "complete.obs"),
+ cor(training_typical$prev_median_visits_per_day, training_typical$median_visits_per_day, use = "complete.obs"),
+ cor(training_typical$prev_time_using_cc, training_typical$time_using_cc, use = "complete.obs"),
+ cor(training_typical$prev_ninteractions, training_typical$ninteractions, use = "complete.obs"),
+ cor(training_typical$prev_ncases_registered, training_typical$ncases_registered, use = "complete.obs"),
+ cor(training_typical$prev_register_followup, training_typical$register_followup, use = "complete.obs"),
+ cor(training_typical$prev_case_register_followup_rate, training_typical$case_register_followup_rate, use = "complete.obs"),
+ cor(training_typical$prev_ncases_touched, training_typical$ncases_touched, use = "complete.obs"),
+ cor(training_typical$prev_nunique_followups, training_typical$nunique_followups, use = "complete.obs"),
+ cor(training_typical$prev_audio_plays, training_typical$audio_plays, use = "complete.obs"),
+ cor(training_typical$prev_network_warnings, training_typical$network_warnings, use = "complete.obs"),
+ cor(training_typical$prev_num_user_pk, training_typical$num_user_pk, use = "complete.obs"),
+ cor(training_typical$prev_domain_numeric, training_typical$domain_numeric, use = "complete.obs"),
+ cor(training_typical$prev_sample_undefined, training_typical$sample_undefined, use = "complete.obs"),
+ cor(training_typical$prev_sample_normal, training_typical$sample_normal, use = "complete.obs"),
+ cor(training_typical$prev_sample_percentile, training_typical$sample_percentile, use = "complete.obs"),
+ cor(training_typical$prev_sample_increase, training_typical$sample_increase, use = "complete.obs"),
+ cor(training_typical$prev_sample_decrease, training_typical$sample_decrease, use = "complete.obs"))
+names(test_2a) <- indicators
+
+g <- ggplot(raw_percentile, aes(x=sample_percentile, y=prev_percentile_sample)) +
+ geom_point(shape=1) +
+ geom_smooth(method=lm)
+
+#------------------------------------------------------------------------#
+#Code for Test 4
+#------------------------------------------------------------------------#
+
+#Create function to append attrition list
+lappend <- function (lst, ...){
+ lst <- c(lst, list(...))
+ return(lst)
+}
+
+#Extract users with at least one attrition event
+users <- unique((filter(training_typical, next_month_active == F))$user_pk)
+
+#This is the test_4b code
+source(file.path("analysis_scripts","rdayalu","test_4b_2.R", fsep = .Platform$file.sep))
+names(test_4b) <- indicators
+
+test <- data.frame(cbind(test_1a, test_1b, test_2a, test_4b))
+write.csv(test, file = "training_set_results.csv")
+#------------------------------------------------------------------------#
+#Other random code
+#------------------------------------------------------------------------#
+
+#Overall attrition_data: Median of each month column
+months_median <- apply(attrition_data[,1:23], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_data[,1:23], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black")
+
+#Relative attrition_data: Median of each relative month column
+months_median <- apply(attrition_subset[,25:29], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_subset[,25:29], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black") +
+ xlab("month_before_attrition") +
+ ylab("nvisits relative to month 5 (%)")
+
+#Subset of attrition_data: Median of each month column
+months_median <- apply(attrition_subset[,1:5], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_subset[,1:5], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black")
+
+
+#Test 4A: slope of line (lm) for absolute months 1-4 for each row
+#attrition_subset$slope_abs <- apply(attrition_subset[,1:4], 1, function(x)
+# lm(x~c(1:4))$coefficients[[2]])
+
+#Reshape all_monthly from long format to wide
+#This creates only one row per user with columns for each calendar month
+users_long <- select(tula_typical, domain, user_id, nvisits, calendar_month)
+users_wide <- reshape(users_long,
+ timevar = "calendar_month",
+ idvar = c("domain", "user_id"),
+ direction = "wide")
+
+g <- ggplot(data=tula_typical, aes(x=calendar_month, y=nvisits, group = user_id)) +
+ geom_line(colour="grey", size=1.0)
+
+users_long <- select(raw_percentile, domain, user_id, percentile, calendar_month)
+users_wide <- reshape(users_long,
+ timevar = "calendar_month",
+ idvar = c("domain", "user_id"),
+ direction = "wide")
+
+# Number of users by calendar_month
+users_month_tula <- tula_typical %>%
+  group_by(domain, calendar_month) %>%
+ summarise(nusers = length(unique(user_id)))
+
+g <- ggplot(data=users_month_tula, aes(x=calendar_month, y=nusers)) +
+ geom_line(colour="black", size=1.0) +
+ geom_point(colour="red", size=3, shape=21, fill="red")
+
+users_wide <- reshape(users_month_tula,
+ timevar = "calendar_month",
+ idvar = c("domain"),
+ direction = "wide")
+
+users_wide <- users_wide[,order(names(users_wide))]
+write.csv(users_wide, file = "tula_nusers_wide.csv")
diff --git a/analysis_scripts/rdayalu/tula_salud.R b/analysis_scripts/rdayalu/tula_salud.R
new file mode 100644
index 0000000..54c3def
--- /dev/null
+++ b/analysis_scripts/rdayalu/tula_salud.R
@@ -0,0 +1,550 @@
+#Develop indicator tests using tula salud dataset
+
+all_monthly_tula <- filter(all_monthly, domain == "tulasalud")
+
+#Tula has 637 unique users
+all_monthly_tula$calendar_month <- as.Date(all_monthly_tula$calendar_month)
+
+#Exclude any users who logged > 100 visits in any month
+all_monthly_tula$visits_ge_100 <- all_monthly_tula$nvisits > 100
+user_ge_100 <- all_monthly_tula %>%
+  group_by(user_id) %>%
+ summarise(ge_100 = sum(visits_ge_100))
+user_le_100 <- filter(user_ge_100, ge_100 == 0)
+#513 users have only <= 100 visits per month
+tula_typical <-
+ all_monthly_tula[all_monthly_tula$user_id %in% user_le_100$user_id, ]
+
+#Exclude users who started in or after June 2014
+user_start_month <- tula_typical %>%
+  group_by(user_id) %>%
+ summarise(first_month = min(calendar_month))
+user_start_month <- user_start_month[user_start_month$first_month <= "2014-05-01",]
+#118 users have start months before June 2014
+tula_typical <-
+ tula_typical[tula_typical$user_id %in% user_start_month$user_id, ]
+
+#Exclude users with < 4 months on CC
+month_count <- tula_typical %>%
+  group_by(domain, user_id) %>%
+ summarise(months_on_cc = length(unique(calendar_month)))
+month_count <- filter(month_count, months_on_cc >= 4)
+#91 users have >= 4 months on CC
+tula_typical <-
+ tula_typical[tula_typical$user_id %in% month_count$user_id, ]
+
+#Exclude visits prior to 2012-09-01
+#(because those months have very few users)
+tula_typical <- filter(tula_typical, calendar_month >= "2012-09-01")
+
+#Add sample_percentile variable
+tula_typical$sample_percentile <-
+ sample(1:100, nrow(tula_typical), replace=T)
+
+#Add sample_per_diff variable
+tula_typical$sample_per_diff <-
+ sample(-100:100, nrow(tula_typical), replace=T)
+
+#Calculate day gaps between consecutive calendar months to derive the
+#next_month_active and previous_month_active variables
+tula_typical <- arrange(tula_typical, domain_numeric, user_id,
+ calendar_month)
+df <- data.table(tula_typical)
+#Can we setkey by domain and user_id since some user_ids might be the same?
+setkey(df,user_id)
+df[,diff_days:=c(NA,diff(calendar_month)),by=user_id]
+df[,diff_nvisits:=c(NA,diff(nvisits)),by=user_id]
+tula_data <- as.data.frame(df)
+tula_data$previous_month_active <- tula_data$diff_days <= 31
+
+users <- unique(tula_data$user_id)
+
+next_month_active <- c()
+for (i in users) {
+ single_user <- tula_data[tula_data$user_id == i,]
+ next_active <- c()
+ next_active <- append(single_user$previous_month_active[-1], F)
+ next_month_active <- append(next_month_active, next_active)
+}
+tula_data$next_month_active <- next_month_active
+
+#If calendar_month = 8/1/14 then next_month_active = NA
+#because we don't know if the user will be active in the following month
+is.na(tula_data$next_month_active) <- tula_data$calendar_month == "2014-08-01"
+
+# % difference in nvisits for each user for consecutive months
+# This isn't for truly consecutive months, so later on,
+# we will only use rows with previous_month_active == T
+per_diff_nvisits <- c()
+for (i in users) {
+ single_user <- tula_data[tula_data$user_id == i,]
+ prev_visits <- c()
+ prev_visits <- append(NA, single_user$nvisits)
+ prev_visits <- prev_visits[-length(prev_visits)]
+ per_diff <- c()
+ per_diff <- (single_user$diff_nvisits/prev_visits)*100
+ per_diff_nvisits <- append(per_diff_nvisits, per_diff)
+}
+tula_data$per_diff_nvisits <- per_diff_nvisits
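The percentage-difference loop above can be vectorized per user with a simple lag. A base-R sketch on toy data (`lag1` is a hypothetical helper; rows are assumed sorted by user and month, as above):

```r
lag1 <- function(x) c(NA, x[-length(x)])

toy <- data.frame(user_id = c(1, 1, 1), nvisits = c(10, 15, 12))
# % change relative to the previous row within each user; NA for first rows
toy$per_diff_nvisits <-
  unlist(lapply(split(toy$nvisits, toy$user_id),
                function(v) (v - lag1(v)) / lag1(v) * 100),
         use.names = FALSE)
```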
+
+#General plots
+#Number of users by calendar month
+n_user <- tula_data %>%
+  group_by(calendar_month) %>%
+ summarise(n_users = length(unique(user_id)))
+
+g <- ggplot(n_user, aes(x=calendar_month, y=n_users)) +
+ geom_point(size = 6, shape = 19, alpha = 0.5, colour = "darkblue",
+ fill = "lightblue") +
+ geom_line(colour = "darkblue") +
+ scale_size_area() +
+ scale_y_continuous(limits=c(0,100)) +
+ xlab("Calendar month") +
+ ylab("# unique users/month") +
+ theme(axis.text=element_text(size=12), axis.title=element_text(size=14,
+ face="bold")) +
+ ggtitle("Number of users by calendar month") +
+ theme(plot.title = element_text(size=14, face="bold"))
+
+pdf("plots.pdf")
+plot(g)
+dev.off()
+
+#------------------------------------------------------------------------#
+#Code for Test 1
+#------------------------------------------------------------------------#
+
+#Calculate median absolute deviation (MAD) per calendar month
+#Use this to calculate rCV for the domain per calendar month
+test_1 <- function(data) {
+
+  test_1_compute <- data %>%
+    group_by(domain, calendar_month) %>%
+ summarise(median_indicator=median(nvisits, na.rm=TRUE),
+ mad_indicator=mad(nvisits, na.rm=TRUE))
+
+ test_1_compute$rcv = (test_1_compute$mad_indicator/test_1_compute$median_indicator)*100
+
+ #Compute CV of CVs by project
+ test_1_gp_cv = group_by(test_1_compute, domain)
+ test_1_compute_cv = summarise(test_1_gp_cv,
+ median_indicator = median(rcv, na.rm = T),
+ mad_indicator = mad(rcv, na.rm=T))
+ test_1_compute_cv$rcv = (test_1_compute_cv$mad_indicator/test_1_compute_cv$median_indicator)*100
+ test_1_score = median(test_1_compute_cv$rcv)
+ #return(test_1_score)
+ return(test_1_compute)
+}
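A quick arithmetic check of the rCV definition used in `test_1` (MAD over median, times 100; base R only):

```r
x <- c(10, 12, 14, 100)
# mad() applies the default consistency constant 1.4826
rcv <- mad(x, na.rm = TRUE) / median(x, na.rm = TRUE) * 100
```

With these values, `median(x)` is 13 and the raw median absolute deviation is 2, so `rcv` is about 22.8%.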
+
+test_1_compute <- data.frame(test_1(tula_data))
+
+#Plot rCV
+g <- ggplot(test_1_compute, aes(x=calendar_month, y=median_indicator, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=median_indicator-mad_indicator, ymax=median_indicator+mad_indicator),
+ width=.3, colour = "black")
+
+g <- ggplot(test_1_compute, aes(x=calendar_month, y=rcv)) +
+ geom_point(size = 6, shape = 19, alpha = 0.5, colour = "red",
+ fill = "pink") +
+ geom_line(colour = "red") +
+ scale_size_area() +
+ scale_y_continuous(limits=c(0,100)) +
+ xlab("Calendar month") +
+ ylab("rCV(%)") +
+ theme(axis.text=element_text(size=12), axis.title=element_text(size=14,
+ face="bold")) +
+ ggtitle("rCV(%) by calendar month") +
+ theme(plot.title = element_text(size=14, face="bold"))
+
+#Domain median absolute change per user per calendar month, excluding each user from the median
+#Must only include rows with previous_month_active == T in any median calculations
+#So we should have 2093 domain median values (91 users * (24-1) calendar months)
+
+tula_consec <- tula_data[tula_data$previous_month_active == T,]
+
+#Initialize dataframe
+detach("package:data.table")
+tula_median_abs_change <- data.frame(matrix(ncol = 3, nrow = 1))
+names(tula_median_abs_change) <- c("calendar_month", "med_domain_abs_change", "user_id")
+tula_median_abs_change$calendar_month <- as.Date(tula_median_abs_change$calendar_month)
+tula_median_abs_change$user_id <- as.factor(tula_median_abs_change$user_id)
+
+for (i in users) {
+ exclude_user <- tula_consec[tula_consec$user_id != i,]
+  exclude_user_median <- exclude_user %>%
+    group_by(calendar_month) %>%
+ summarise(med_domain_abs_change = median(diff_nvisits, na.rm = T),
+ user_id = i)
+ exclude_user_median <- exclude_user_median[!(is.na(exclude_user_median$calendar_month)), ]
+ tula_median_abs_change <- rbind(tula_median_abs_change, exclude_user_median)
+}
+
+tula_median_abs_change <- tula_median_abs_change[!(is.na(tula_median_abs_change$calendar_month)), ]
+
+#Domain median % change per user per calendar month, excluding each user from the median
+#Must only include rows with previous_month_active == T in any median calculations
+#So we should have 2093 domain median values (91 users * (24-1) calendar months)
+
+#Initialize dataframe
+tula_median_per_change <- data.frame(matrix(ncol = 3, nrow = 1))
+names(tula_median_per_change) <- c("calendar_month", "med_domain_per_change", "user_id")
+tula_median_per_change$calendar_month <- as.Date(tula_median_per_change$calendar_month)
+tula_median_per_change$user_id <- as.factor(tula_median_per_change$user_id)
+
+for (i in users) {
+ exclude_user <- tula_consec[tula_consec$user_id != i,]
+  exclude_user_median <- exclude_user %>%
+    group_by(calendar_month) %>%
+ summarise(med_domain_per_change = median(per_diff_nvisits, na.rm = T),
+ user_id = i)
+ exclude_user_median <- exclude_user_median[!(is.na(exclude_user_median$calendar_month)), ]
+ tula_median_per_change <- rbind(tula_median_per_change, exclude_user_median)
+}
+
+tula_median_per_change <- tula_median_per_change[!(is.na(tula_median_per_change$calendar_month)), ]
+
+#Domain median sample % change (sample_per_diff) per user per calendar month, excluding each user from the median
+#Must only include rows with previous_month_active == T in any median calculations
+#So we should have 2093 domain median values (91 users * (24-1) calendar months)
+
+#Initialize dataframe
+tula_median_sample_change <- data.frame(matrix(ncol = 3, nrow = 1))
+names(tula_median_sample_change) <- c("calendar_month", "med_domain_sample_change", "user_id")
+tula_median_sample_change$calendar_month <- as.Date(tula_median_sample_change$calendar_month)
+tula_median_sample_change$user_id <- as.factor(tula_median_sample_change$user_id)
+
+for (i in users) {
+ exclude_user <- tula_consec[tula_consec$user_id != i,]
+  exclude_user_median <- exclude_user %>%
+    group_by(calendar_month) %>%
+ summarise(med_domain_sample_change = median(sample_per_diff, na.rm = T),
+ user_id = i)
+ exclude_user_median <- exclude_user_median[!(is.na(exclude_user_median$calendar_month)), ]
+ tula_median_sample_change <- rbind(tula_median_sample_change, exclude_user_median)
+}
+
+tula_median_sample_change <- tula_median_sample_change[!(is.na(tula_median_sample_change$calendar_month)), ]
+
+#Merge med_domain_abs_change and med_domain_per_change to tula_consec
+tula_consec <- merge(tula_consec, tula_median_abs_change, by = c("user_id", "calendar_month"),
+ all.x = T)
+tula_consec <- merge(tula_consec, tula_median_per_change, by = c("user_id", "calendar_month"),
+ all.x = T)
+tula_consec <- merge(tula_consec, tula_median_sample_change, by = c("user_id", "calendar_month"),
+ all.x = T)
+
+g <- ggplot(data=tula_median_abs_change, aes(x=calendar_month, y=med_domain_abs_change)) +
+ geom_line(colour="black", size=1.0) +
+ geom_point(colour="red", size=3, shape=21, fill="red")
+
+g <- ggplot(data=tula_median_per_change, aes(x=calendar_month, y=med_domain_per_change)) +
+ geom_line(colour="black", size=1.0) +
+ geom_point(colour="red", size=3, shape=21, fill="red")
+
+g <- ggplot(data=tula_median_sample_change, aes(x=calendar_month, y=med_domain_sample_change)) +
+ geom_line(colour="black", size=1.0) +
+ geom_point(colour="red", size=3, shape=21, fill="red")
+
+#Pairwise plots of absolute and % changes for individual FLWs by domain medians
+g <- ggplot(tula_consec, aes(x=med_domain_abs_change, y=diff_nvisits)) +
+ geom_point(shape=1) +
+ #scale_y_continuous(limits=c(-100,100)) +
+ geom_smooth(method=lm)
+
+cor(tula_consec$med_domain_abs_change,
+ tula_consec$diff_nvisits, use = "complete.obs")
+
+g <- ggplot(tula_consec, aes(x=med_domain_per_change, y=per_diff_nvisits)) +
+ geom_point(shape=1) +
+ scale_y_continuous(limits=c(-100,100)) +
+ geom_smooth(method=lm)
+
+cor(tula_consec$med_domain_per_change,
+ tula_consec$per_diff_nvisits, use = "complete.obs")
+
+g <- ggplot(tula_consec, aes(x=med_domain_sample_change, y=sample_per_diff)) +
+ geom_point(shape=1) +
+ #scale_y_continuous(limits=c(-100,100)) +
+ geom_smooth(method=lm)
+
+cor(tula_consec$med_domain_sample_change,
+ tula_consec$sample_per_diff, use = "complete.obs")
+
+
+#------------------------------------------------------------------------#
+#Code for Test 2
+#------------------------------------------------------------------------#
+
+percentile <- function(x) rank(x, ties.method = "average")/length(x)*100
+percentile_s <- paste0('percentile=percentile(', 'nvisits', ')')
+
+raw_percentile <- tula_data %>%
+  group_by(domain, calendar_month) %>%
+ s_mutate(percentile_s)
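A quick base-R check of the `percentile` helper defined above: with `ties.method = "average"`, tied values share a rank, so tied observations get the same percentile.

```r
percentile <- function(x) rank(x, ties.method = "average") / length(x) * 100
p <- percentile(c(5, 10, 20, 20))
# ranks are 1, 2, 3.5, 3.5 -> percentiles 25, 50, 87.5, 87.5
```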
+
+test_2_compute <- raw_percentile %>%
+  group_by(domain, user_id) %>%
+ summarise(
+ median_indicator=median(percentile, na.rm=TRUE),
+ mad_indicator=mad(percentile, na.rm=TRUE)
+ )
+
+test_2_compute$rcv <- (test_2_compute$mad_indicator / test_2_compute$median_indicator) * 100
+
+
+g <- ggplot(data=tula_data, aes(x=calendar_month, y=nvisits, group = user_id)) +
+ geom_line(colour="grey", size=1.0)
+
+g <- ggplot(data=raw_percentile, aes(x=calendar_month, y=percentile, group = user_id)) +
+ geom_line(colour="grey", size=1.0)
+
+
+#Calculate differences between percentile values for each FLW
+raw_percentile <- arrange(raw_percentile, user_id, calendar_month)
+df <- data.table(raw_percentile)
+setkey(df,user_id)
+df[,diff:=c(NA,diff(percentile)),by=user_id]
+diff_percentile <- as.data.frame(df)
+diff_percentile <- diff_percentile[diff_percentile$previous_month_active == T,]
+
+myhist <- ggplot(diff_percentile, aes(x=diff)) +
+ geom_histogram(binwidth=5, colour="black", fill="lightblue")
+
+diff_percentile_sample <- df[,diff:=c(NA,diff(sample_percentile)),by=user_id]
+diff_percentile_sample <- as.data.frame(diff_percentile_sample)
+diff_percentile_sample <- diff_percentile_sample[diff_percentile_sample$previous_month_active == T,]
+
+myhist <- ggplot(diff_percentile_sample, aes(x=diff)) +
+ geom_histogram(binwidth=5, colour="black", fill="lightblue")
+
+myhist <- ggplot(diff_sample, aes(x=diff)) +
+ geom_histogram(binwidth=1, colour="black", fill="lightblue")
+
+#Pairwise plots of true consecutive months pairs
+prev_nvisits <- c()
+prev_percentile <- c()
+prev_percentile_sample <- c()
+for (i in users) {
+ single_user <- raw_percentile[raw_percentile$user_id == i,]
+
+ prev_vis <- c()
+ prev_vis <- append(NA, single_user$nvisits)
+ prev_vis <- prev_vis[-length(prev_vis)]
+ prev_nvisits <- append(prev_nvisits, prev_vis)
+
+ prev <- c()
+ prev <- append(NA, single_user$percentile)
+ prev <- prev[-length(prev)]
+ prev_percentile <- append(prev_percentile, prev)
+
+ prev_sample <- c()
+ prev_sample <- append(NA, single_user$sample_percentile)
+ prev_sample <- prev_sample[-length(prev_sample)]
+ prev_percentile_sample <- append(prev_percentile_sample, prev_sample)
+}
+raw_percentile$prev_percentile <- prev_percentile
+raw_percentile$prev_percentile_sample <- prev_percentile_sample
+raw_percentile$prev_nvisits <- prev_nvisits
+raw_percentile <- raw_percentile[raw_percentile$previous_month_active == T,]
+
+g <- ggplot(raw_percentile, aes(x=sample_percentile, y=prev_percentile_sample)) +
+ geom_point(shape=1) +
+ geom_smooth(method=lm)
+cor(raw_percentile$sample_percentile, raw_percentile$prev_percentile_sample, use = "complete.obs")
+
+#Older code not excluding skipped months
+v1 = NA
+v2 = NA
+user_id = NA
+grp <- unique(raw_percentile$user_id)
+raw_percentile <- arrange(raw_percentile, user_id, calendar_month)
+for(i in grp){
+ rownums = which(raw_percentile$user_id==i)
+ v1 = append(v1,raw_percentile$percentile[rownums[1:(length(rownums)-1)]])
+ v2 = append(v2,raw_percentile$percentile[rownums[2:(length(rownums))]])
+ user_id = append(user_id,rep(i,times=(length(rownums)-1)))
+}
+
+datset_pair_plot = data.frame(user_id,v1,v2)
+datset_pair_plot <- datset_pair_plot[-1,]
+
+tula_typical %>%
+  group_by(calendar_month) %>%
+ summarise(nusers = length(unique(user_id)))
+
+summary(tula_typical$user_id[tula_typical$calendar_month == "2012-09-01"] %in%
+ tula_typical$user_id[tula_typical$calendar_month == "2014-05-01"])
+
+#------------------------------------------------------------------------#
+#Code for Test 3
+#------------------------------------------------------------------------#
+
+overall <- tula_data %.%
+ group_by(month_abbr) %.%
+ summarise(visits_median = median(nvisits, na.rm = T),
+ visits_mad = mad(nvisits, na.rm = T))
+
+g <- ggplot(overall, aes(x=month_abbr, y=visits_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=visits_median-visits_mad, ymax=visits_median+visits_mad),
+ width=.3, colour = "black") +
+ scale_y_continuous(limits=c(-5,100))
+
+#------------------------------------------------------------------------#
+#Code for Test 4
+#------------------------------------------------------------------------#
+
+attrition <- tula_data %.%
+ group_by(user_id) %.%
+ summarise(attrition = sum(!next_month_active, na.rm = T))
+
+test <- data.frame(table(attrition$attrition))
+
+g <- ggplot(test, aes(x=Var1, y=Freq, group = 1)) +
+ geom_point(size = 6, shape = 19, alpha = 0.5, colour = "red",
+ fill = "pink") +
+ geom_line(colour = "red") +
+ scale_size_area() +
+ scale_y_continuous(limits=c(0,50)) +
+ xlab("# attrition events") +
+ ylab("# of unique FLWs") +
+ theme(axis.text=element_text(size=12), axis.title=element_text(size=14,
+ face="bold")) +
+ ggtitle("Number of unique FLWs by number of attrition events") +
+ theme(plot.title = element_text(size=14, face="bold"))
+
+#Isolate all months to each attrition event for all users
+#Create empty list
+attrition_list <- list()
+
+#Create function to append attrition list
+lappend <- function (lst, ...){
+ lst <- c(lst, list(...))
+ return(lst)
+}
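+#Quick sanity check of lappend (illustrative values): list(...) wraps each
+#extra argument as its own element, so the result grows by one element per
+#argument, whereas plain c() would flatten vectors into separate elements
+stopifnot(length(lappend(list(1:3), 4:5, 6)) == 3)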
+
+#Extract users with at least one attrition event
+users <- unique((filter(tula_data, next_month_active == F))$user_id)
+
+for (i in users) {
+  single_user <- tula_data[tula_data$user_id == i, ]
+  #Create vector of all attrition positions for this user
+  attrition_positions <- which(single_user$next_month_active == F)
+  #Append "months to first attrition event" to the attrition list
+  attrition_list <- lappend(attrition_list, rev(single_user$nvisits[1:attrition_positions[1]]))
+
+  #Append "months to subsequent attrition events" to the attrition list
+  if (length(attrition_positions) > 1) {
+    for (j in 2:length(attrition_positions)) {
+      attrition_list <- lappend(attrition_list, rev(single_user$nvisits[(attrition_positions[j-1]+1):attrition_positions[j]]))
+    }
+  }
+}
+
+
+## Compute maximum length
+max_length <- max(sapply(attrition_list, length))
+## Add NA values to list elements
+attrition_list <- lapply(attrition_list, function(v) { c(v, rep(NA, max_length-length(v)))})
+## Create dataframe
+attrition_data <- data.frame(do.call(rbind, attrition_list))
+names(attrition_data) <- paste0("month_", 1:ncol(attrition_data))
+attrition_data$user_id <- filter(tula_data, next_month_active == F)$user_id
+
+#Overall attrition_data: Median of each month column
+#(columns 1:max_length are the month_* columns; the last column is user_id)
+months_median <- apply(attrition_data[, 1:max_length], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_data[, 1:max_length], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black")
+
+#Keep rows with at least "N" months before attrition
+#Then graph only those "N" months before attrition
+#Here, N = 5
+attrition_subset <- filter(attrition_data, !is.na(month_5))
+
+#Subset of attrition_data: Median of each month column
+months_median <- apply(attrition_subset[,1:5], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_subset[,1:5], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black")
+
+#Calculate indicators per month relative to N = 5
+for (m in 1:5) {
+  attrition_subset[[paste0("rel_", m)]] <-
+    (attrition_subset[[paste0("month_", m)]] / attrition_subset$month_5) * 100
+}
+
+#Relative attrition_data: Median of each relative month column
+#(select the rel_* columns by name rather than by fragile position index)
+rel_cols <- paste0("rel_", 1:5)
+months_median <- apply(attrition_subset[, rel_cols], 2, function(x) median(x, na.rm = T))
+months_median <- data.frame(months_median)
+months_median$month_before_attrition <- c(1:nrow(months_median))
+months_median$months_mad <- apply(attrition_subset[, rel_cols], 2, function(x) mad(x, na.rm = T))
+
+#Plot
+g <- ggplot(months_median, aes(x=month_before_attrition, y=months_median, group = 1)) +
+ geom_line(colour="blue", size=1.0) +
+ geom_errorbar(aes(ymin=months_median-months_mad, ymax=months_median+months_mad),
+ width=.3, colour = "black") +
+ xlab("month_before_attrition") +
+ ylab("nvisits relative to month 5 (%)")
+
+#Test 4A: slope of line (lm) for absolute months 1-4 for each row
+attrition_subset$slope_abs <- apply(attrition_subset[, paste0("month_", 1:4)], 1, function(x)
+  lm(x ~ c(1:4))$coefficients[[2]])
+
+#Test 4B: slope of line (lm) for relative months 1-4 for each row
+attrition_subset$slope_rel <- apply(attrition_subset[, paste0("rel_", 1:4)], 1, function(x)
+  lm(x ~ c(1:4))$coefficients[[2]])
+
+
+#------------------------------------------------------------------------#
+#Other random code
+#------------------------------------------------------------------------#
+
+#Reshape the monthly data (here tula_typical) from long format to wide
+#This creates only one row per user with columns for each calendar month
+users_long <- select(tula_typical, domain, user_id, nvisits, calendar_month)
+users_wide <- reshape(users_long,
+ timevar = "calendar_month",
+ idvar = c("domain", "user_id"),
+ direction = "wide")
+
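+#After reshaping, users_wide has one row per (domain, user_id) pair and one
+#column per month, named nvisits.2012-09-01, nvisits.2012-10-01, ...
+#(reshape pastes each timevar value onto the varying column name with a ".")
+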
+g <- ggplot(data=tula_typical, aes(x=calendar_month, y=nvisits, group = user_id)) +
+ geom_line(colour="grey", size=1.0)
+
+users_long <- select(raw_percentile, domain, user_id, percentile, calendar_month)
+users_wide <- reshape(users_long,
+ timevar = "calendar_month",
+ idvar = c("domain", "user_id"),
+ direction = "wide")
+
+# Number of users by calendar_month
+users_month_tula <- tula_typical %.%
+ group_by(domain, calendar_month) %.%
+ summarise(nusers = length(unique(user_id)))
+
+g <- ggplot(data=users_month_tula, aes(x=calendar_month, y=nusers)) +
+ geom_line(colour="black", size=1.0) +
+ geom_point(colour="red", size=3, shape=21, fill="red")
+
+users_wide <- reshape(users_month_tula,
+ timevar = "calendar_month",
+ idvar = c("domain"),
+ direction = "wide")
+
+users_wide <- users_wide[,order(names(users_wide))]
+write.csv(users_wide, file = "tula_nusers_wide.csv")
diff --git a/analysis_scripts/rdayalu/ur_study.R b/analysis_scripts/rdayalu/ur_study.R
new file mode 100644
index 0000000..e54b161
--- /dev/null
+++ b/analysis_scripts/rdayalu/ur_study.R
@@ -0,0 +1,126 @@
+#Final data for attrition study for Nisha and Reem
+#Dec 4, 2014
+#Utilization Ratio study (UR)
+#Using blog data with 60,047 observations
+#Packages used below: dplyr (tbl/collect/select/filter/arrange), plus
+#data.table and lubridate further down; the `db` connection and domain_table
+#are assumed to already exist in the workspace
+suppressMessages(library(dplyr))
+suppressMessages(library(data.table))
+suppressMessages(library(lubridate))
+ur_study <- read.csv(file = "blog_data.csv")
+ur_study$calendar_month <- as.Date(ur_study$calendar_month)
+
+#Prepare domain_table for merging
+#Bring in sector information
+sector <- tbl(db, "sector")
+sector <- collect(sector)
+names(sector)[names(sector) == "name"] = "sector_final"
+domain_sector <- tbl(db, "domain_sector")
+domain_sector <- collect(domain_sector)
+domain_sector <- select(domain_sector, domain_id, sector_id)
+domain_table <- merge(domain_table, domain_sector, by.x = "id", by.y = "domain_id", all.x = T)
+domain_table <- merge(domain_table, sector, by.x = "sector_id", by.y = "id", all.x = T)
+#Bring in subsector information
+subsector <- tbl(db, "subsector")
+subsector <- collect(subsector)
+subsector <- select(subsector, id, name)
+subsector <- filter(subsector, !is.na(name))
+subsector <- filter(subsector, name != "")
+names(subsector)[names(subsector) == "name"] = "subsector_final"
+domain_subsector <- tbl(db, "domain_subsector")
+domain_subsector <- collect(domain_subsector)
+domain_subsector <- select(domain_subsector, domain_id, subsector_id)
+domain_table <- merge(domain_table, domain_subsector, by.x = "id", by.y = "domain_id", all.x = T)
+domain_table <- merge(domain_table, subsector, by.x = "subsector_id", by.y = "id", all.x = T)
+#Consolidate country information
+is.na(domain_table$deployment.country) <- domain_table$deployment.country == ""
+is.na(domain_table$country) <- domain_table$country == ""
+domain_table$country_final <- domain_table$deployment.country
+keep_country <- which(is.na(domain_table$deployment.country) & !is.na(domain_table$country))
+domain_table$country_final[keep_country] <- domain_table$country[keep_country]
+#Consolidate Dimagi level of support
+is.na(domain_table$internal.services) <- domain_table$internal.services == ""
+is.na(domain_table$internal.self_started) <- domain_table$internal.self_started == ""
+domain_table$self_start[domain_table$internal.self_started == "True"] <- "self"
+domain_table$dimagi_services <- domain_table$internal.services
+keep_self <- which(is.na(domain_table$internal.services) & !is.na(domain_table$self_start))
+domain_table$dimagi_services[keep_self] <- domain_table$self_start[keep_self]
+
+#Keep only columns of interest
+names(domain_table)[names(domain_table) == "id"] = "domain_id"
+facets_to_merge <- select(domain_table, name, domain_id, country_final, sector_final,
+ subsector_final, dimagi_services, test)
+
+#Merge to ur_study
+ur_study <- merge(ur_study, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+ur_study <- filter(ur_study, sector_final == "Health")
+
+#Exclude domains for EULA
+exclude_domains <- read.csv(file = "can_use_true.csv")
+exclude_domains <- exclude_domains$x
+exclude_domains <- exclude_domains[-c(2,3,4,13)]
+exclude_domains <- append(as.character(exclude_domains), "ccdt")
+ur_study <- filter(ur_study, !(domain %in% exclude_domains))
+
+write.csv(ur_study, file = "ur_study_12_4_14.csv")
+
+
+
+
+#------------------------------------------------------------------------#
+#Older code
+#------------------------------------------------------------------------#
+
+ur_study <- monthly_table
+ur_study <- ur_study[!(ur_study$user_id =="demo_user"),]
+ur_study$keep_domain <- ur_study$domain %in% true_domains$name
+ur_study <- ur_study[ur_study$keep_domain == T,]
+ur_study <- merge(ur_study, facets_to_merge, by.x = "domain",
+ by.y = "name", all.x = T)
+ur_study$domain_numeric <- as.numeric(as.factor(ur_study$domain))
+#Need to convert user_id to user_numeric, but have a problem with duplicate
+#user_ids from different domains
+ur_study$date_first_visit <- as.Date(ur_study$date_first_visit)
+ur_study$date_last_visit <- as.Date(ur_study$date_last_visit)
+
+ur_study <- select(ur_study, domain_numeric, user_id, month.index, numeric_index,
+ date_first_visit, date_last_visit, active_days, summary_device_type,
+ country, Sector, Sub.Sector, Test.Project., active)
+
+names(ur_study)[names(ur_study) == "active"] = "active_domain"
+names(ur_study)[names(ur_study) == "month.index"] = "calendar_month"
+names(ur_study)[names(ur_study) == "Sector"] = "sector"
+names(ur_study)[names(ur_study) == "Sub.Sector"] = "subsector"
+names(ur_study)[names(ur_study) == "Test.Project."] = "test_project"
+
+#Convert calendar month to actual date
+ur_study$calendar_month <- parse_date_time(paste('01', ur_study$calendar_month), '%d %b %Y!')
+ur_study$calendar_month <- as.Date(ur_study$calendar_month)
+
+#Calculate differences between month_index to calculate next_month_active and
+#previous_month_active variables
+ur_study <- arrange(ur_study, domain_numeric, user_id, calendar_month)
+df <- data.table(ur_study)
+#Key by domain and user_id since the same user_id can appear in several domains
+setkey(df, domain_numeric, user_id)
+df[, diff_days := c(NA, diff(calendar_month)), by = list(domain_numeric, user_id)]
+ur_study <- as.data.frame(df)
+ur_study$previous_month_active <- ur_study$diff_days <= 31
+ur_study <- arrange(ur_study, domain_numeric, user_id, calendar_month)
+#Loop over (domain, user) pairs rather than bare user_ids, so that rows from
+#different domains that share a user_id are not mixed together
+groups <- unique(ur_study[, c("domain_numeric", "user_id")])
+
+next_month_active <- c()
+for (g in seq_len(nrow(groups))) {
+  single_user <- ur_study[ur_study$domain_numeric == groups$domain_numeric[g] &
+                            ur_study$user_id == groups$user_id[g], ]
+  next_active <- append(single_user$previous_month_active[-1], F)
+  next_month_active <- append(next_month_active, next_active)
+}
+ur_study$next_month_active <- next_month_active
+#If calendar_month = 8/1/14 then next_month_active = NA
+#because we don't know if the user will be active in the following month
+is.na(ur_study$next_month_active) <- ur_study$calendar_month == "2014-08-01"
+
+ur_study <- select(ur_study, -(diff_days))
+
+write.csv(ur_study, file = "attrition_study.csv")
+
+#175 domains in total (only includes Test.Project. == F)
+#119 domains have active == T and 56 have active == F
\ No newline at end of file
diff --git a/analysis_scripts/rdayalu/vis1.r b/analysis_scripts/rdayalu/vis1.r
new file mode 100644
index 0000000..14b3ca7
--- /dev/null
+++ b/analysis_scripts/rdayalu/vis1.r
@@ -0,0 +1,131 @@
+#!/usr/bin/Rscript
+
+####################################################################################
+# code developed by Klurig Analytics (Dag Holmboe)
+# Oct 18, 2014
+# Please don't remove this section
+####################################################################################
+
+####################################################################################
+rm(list=ls())
+suppressMessages(library(dplyr))
+
+################################################################
+add.months <- function(date,n) seq(date, by = paste (n, "months"), length = 2)[2]
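+#Example: add.months(as.Date("2014-01-15"), 2) returns as.Date("2014-03-15");
+#note that month-end dates can roll forward (e.g. Jan 31 plus one month
+#normalizes past the end of February)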
+
+################################################################
+# main
+if( length( commandArgs() ) == 7 ) {
+ input_file <- commandArgs()[6]
+ output_file <- commandArgs()[7]
+} else {
+  cat("\nError: incorrect parameters\n")
+  print(getwd())
+ cat("Usage: program