This document details the construction of all features and scores used in RISSK. In the development branch, each feature has a corresponding Jupyter notebook stored in rissk/notebook
. For a deeper dive into the individual feature analyses, users familiar with Python can refer to the respective notebook.
Note
Use outline button in the top-right corner for swift navigation between features.
The terms below are frequently used in the ensuing descriptions. Here are their precise meanings:
-
Interviewing events refer to those events in the paradata that are recorded prior to any recorded action by a Supervisor or HQ role, i.e. the first event of type
['RejectedBySupervisor', 'OpenedBySupervisor', 'RejectedByHQ', 'OpenedByHQ']
. These events approximate the initial interview process, prior to any corrections or updates that may occur after the first intervention by Supervisor or HQ roles. -
Item refers to a specific question in Survey Solutions for a given roster row. For questions located on the main questionnaire level, the term "item" is synonymous with the question itself. However, for questions that are part of rosters, each roster row for a question is considered an individual item.
-
Unit denotes a Survey Solutions interview, uniquely identified by its
interview__id
. -
Score Type are groups of scores that are constructed in similar ways. Refer to chapter Process description for a classification.
The features listed below are included by default in the construction of the unit_risk_score
.
This feature identifies atypical number of alterations to question responses.
Feature
f__answer_changed
is derived from all active interviewing events of type AnswerSet
in the paradata. It quantifies the number of times the response to an item has been modified. Changes are computed as follows:
- For single-answer questions, the answer value is compared with prior answer values for the item (if it exists). The event is counted if there's a discrepancy.
- For questions of type
MultyOptionsQuestion
andTextListQuestion
, events are counted if the set of answers no longer contains any elements from the previous set of answers. This can occur if an answer option has been unselected or removed, or if the text of a list item has been removed. - For multi-answer questions with Yes/No mode, both sets of Yes and No answers are evaluated and events counted if an answer option has been removed or changed from Yes to No or vice versa.
Score
Type 1 Score. Anomalies within f__answer_changed
are detected using ECOD, an efficient, non-parametric algorithm that leverages cumulative distribution functions. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item-level by variable_name
.
To clarify, anomalies are unusual high (or low) frequencies of changes to an item's answer, relative to the question's norm. For instance, if altering a member's age is typically done 0-2 times, items with 1 or 2 changes wouldn't be flagged as anomalies. However, if the age of a member is adjusted 3 or more times, this item could be earmarked as anomalous.
s__answer_changed
represents the fraction of items in an interview that were determined to be anomalous based on their f__answer_changed
values.
This feature captures active interviewing events that occur during unconventional hours of the day.
Feature
f__answer_hour_set
represents the hour of the day extracted from the paradata, adjusted for timezone and rounded to half-hour intervals. It is constructed on the item level for all active interviewing actions performed by the interviewer in the paradata.
Score
Type 1 Score. Anomalies in f__answer_hour_set
are detected using ECOD, an efficient, non-parametric algorithm that leverages cumulative distribution functions. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item level across all observations.
s__answer_time_set
represents the proportion of anomalies within a given interview. Its values range from 0 (indicating the entire interview was conducted during typical hours) to 1 (indicating the entire interview took place during atypical hours). It's worth noting that this score treats every day identically, without differentiating based on the specific date or day of the week.
This feature identifies atypical frequencies of answer removals.
Feature
f__answer_removed
is constructed using interviewing events in the paradata. It counts the number of times an event of type AnswerRemoved
was logged for an item. These events can result from an interviewer actively deleting an answer or from Survey Solutions removing answers due to specific interviewer actions. Modifying the response of a question linked to a roster can trigger multiple such events.
Score
Type 1 Score. Anomalies in f__answer_removed
are detected using ECOD, an efficient, non-parametric algorithm that leverages cumulative distribution functions. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item-level by variable_name
.
s__answer_removed
represents the fraction of items in an interview that were determined to be anomalous based on their f__answer_removed
values.
This feature identifies unusual high or low number of answer options selected in multi select options.
Feature
f__answers_selected
is constructed on the item level for all answered questions of type MultyOptionsQuestion
in the microdata. It contains:
- For YesNo question, the number of YES options selected.
- For other multi-answer questions, the number of answers selected.
Score
Type 1 Score. Anomalies within f__answers_selected
are detected using ECOD, an efficient, non-parametric algorithm that leverages cumulative distribution functions. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item-level by variable_name
.
s__answers_selected
represents the fraction of items in an interview that were determined to be anomalous based on their f__answers_selected
values.
This feature highlights anomalous durations, either too short or too long, for individual questions.
Feature
f__answer_duration
is constructed at the item level using active interviewing events from the paradata. It represents the total duration, in seconds, for events of type AnswerSet
and AnswerRemoved
linked to an item. It is constructed by:
- Keeping only active interviewing events. This removes events generated by Survey Solutions and events that cannot be attributed to an item.
- For every event, computing the time interval to its preceding event (sorted by column
order
). - Assign a value of NaN (Not a Number) to negative time intervals, which are unrealistic durations often resulting from interviewers altering the tablet's time settings.
- Sum all time intervals for
AnswerSet
andAnswerRemoved
events related to the item.
It's important to understand that f__answer_duration
is an approximation of the actual time taken to answer an item. The open navigation design of Survey Solutions makes it challenging to pinpoint when a question is activated or deactivated. This is approximated using the timestamp of the preceding event and the time the answer is set. During these intervals, interviewers might undertake other tasks without logging any active event, like confirming the response to a prior question. Similarly, brief pauses recorded while administering a question (e.g., when accessing a calculator app) aren't allocated to a particular item.
Score
Type 1 Score. First, anomalies in f__answer_duration
are identified using ECOD, an efficient, non-parametric algorithm that leverages cumulative distribution functions. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item-level by variable_name
.
Subsequently, two scores, s__answer_duration_lower
and s__answer_duration_upper
are computed. They represent the share of items in an interview with anomalies at the lower or upper bounds respectively. Lower and upper end anomalies are treated separately as they indicate different potential undesired behaviours. Lower end anomalies point at interviewers rushing or fabricating data, upper end at interviewer struggling (or interview/respondent effects).
This feature identifies anomalies in the first two decimal digits in numeric questions.
Feature
f__first_decimal
contains the first two decimal digits of the response to all questions of type NumericQuestion
in the microdata that are not integer questions (IsInteger == False
).
Score
Type 1 Score. Anomalies within f__first_decimal
are detected using COF, an proximity-based algorithm that is able to handle outliers deviating from low density patterns. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item-level by variable_name
.
To clarify, anomalies are unusual first two decimal values, relative to the question's norm. For instance, if most answer values to a question contain decimals such as [0.25, 0.33, 0.5, 0.66, 0.75]
, a different and rare value such as 0.47
would be flagged as anomaly.
s__first_decimal
represents the fraction of items in an interview that were determined to be anomalous based on their f__first_decimal
values.
This feature detects atypical distribution of first digits in numeric questions.
Feature
f__first_digit
extracts the leading digit from the response of all questions classified as NumericQuestion
in the microdata.
Score
Type 3 Score. s__first_digit
quantifies the fraction of questions for which a specific interviewer exhibits a distinctly different first digit distribution as compared to their counterparts. The score is constructed by:
- Identifying those
variable_name
where answer values span over three orders of magnitude (a necessary condition for Benford's Law to apply). - For each interviewer and
variable_name
, compute the Jensen-Shannon (JS) divergence. This measures the deviation of an interviewer's first digit distribution from the collective distribution of the same variable across all other interviewers. - Tagging a
variable_name
as anomalous for an interviewer if its JS divergence value is less than half of the median JS divergence value for that variable.- For each interviewer, calculating the share of anomalous variables.
This feature identifies anomalies in the recorded locations for GPS-type questions.
Feature
Sub-features gps_latitude
, gps_longitude
and gps_accuracy
are constructed for all items of type GpsCoordinateQuestion in the microdata. It's important to note that there might be no or multiple GPS questions within an interview. Additionally, these GPS questions could refer to various places beyond the interview location, such as the location of household plots.
Score
Type 1 Score. Three distinct scores are derived:
s__gps_proximity_count
: Represents the total number of items located within a radius of 10 meters, plus the accuracies of both points, of any other points recorded within an interview. Note that high accuracy values could skew this score. The score essentially detects anomalies in location density. For instance, in stratified household surveys, numerous points at the same location could be suspicious. Conversely, in institutional surveys (like school or health center surveys), a high count is expected since interviews are usually conducted at the same location.
s__gps_extreme_outlier
contains the count of extreme outlier locations in an interview. It indicates locations that were recorded outside the survey area. The score is derived by:
- Computing the median longitude and latitude across all locations to establish a midpoint.
- Determining the distance of each point to this midpoint.
- Calculating the 50th and 75th percentile of the distances, and calculating the range p75-p50.
- Flagging points as anomalous if their distance to the midpoint exceeds 3.5 times the p75-p50 range.
- Counting these anomalous points within an interview. The score is treated in absolute terms, ensuring that interviews with any extreme outliers are deemed suspicious without being offset by other non-anomalous points within the same interview.
s__gps_outlier
counts outlier locations within an interview. While some outliers might indicate legitimately remote locations, many arise from interviewers recording the GPS locations while on the move or at their accommodation. The score is derived by:
- Excluding points flagged as extreme outliers.
- Detecting anomalies using COF for datasets containing fewer than 10k records, and LOF for larger datasets. Both methods are proximity-based, with COF generally performing better, albeit with higher memory demands.
- Counting these anomalous points within an interview. The score is treated in absolute terms, ensuring that interviews with any extreme outliers are deemed suspicious without being offset by other non-anomalous points within the same interview.
This feature identifies interviewers' tendency to repeatedly select the same answer options in multi-select questions.
Feature
f__multi_option_question
is constructed on the item level for all questions of type MultyOptionsQuestion
in the microdata with fixed answer options (question is not linked), 2 or more answer options, and which are not of the Combobox type. The feature indicates the relative position of the chosen answers: a value of 0 means the top answer was selected, and a value of 1 means the bottom-most answer was selected. The feature is scaled in intervals of 1/N, where N is the total number of answer options available.
Score
Type 3 Score. s__multi_option_question
represents the proportion of questions for a given interviewer where there's a noticeable pattern in selecting similar or identical answer options compared to other interviewers. The score is derived by:
- Identifying those
variable_name
for each interviewer with observations exceeding five times the number of available answer options. - Computing the entropy for these
variable_name
for each interviewer, capturing the diversity in their answers. - Designating a
variable_name
as anomalous for an interviewer if its entropy is less than half the median entropy value compared to other interviewers. - Calculating by interviewer the share of anomalous variables over all variables for which entropy was assessed.
This feature quantifies the total number of answers recorded in an interview.
Feature
s__number_answered
represents the count of answers recorded in the microdata that correspond to interviewing events in the paradata. Multi-select questions contribute as a single count. Answers with values [-999999999, '##N/A##'] are excluded since they signify questions that were enabled but remain unanswered. Survey Solution variables and system generated variables are also excluded.
Score
Type 2 Score. s__number_answered
is directly equated to f__number_answered
.
This feature highlights unusual frequencies of pauses during interviews.
Feature
f__pause_count
is the unit-level count of interviewing events of type Resumed
or Restarted
in the paradata. These are logged when the interview has been restarted or if the interviewer application has resumed to be active after the tablet screen was turned off or another application was accessed. It's important to note that this feature includes both short system-recorded pauses and longer deliberate pauses when the interview is intentionally interrupted.
Score
Type 2 Score. s__pause_count
is derived by taking the ratio of f__pause_count
to f_number_answered
. This quantifies the frequency of pauses in relation to the length of the interview, measure in question answers.
This feature identifies unusual duration of pauses during interviews.
Feature
f__pause_duration
represents the total duration of all pauses in an interview. It is derived by summing up the durations between paradata events of type Resumed
or Restarted
and their preceding events. It's important to note that this feature includes the duration of both short system-recorded pauses and longer deliberate pauses when the interview is intentionally interrupted.
Score
Type 2 Score. s__pause_duration
is computed as the proportion between f__pause_duration
and f__total_elapsed
. It measures the fraction of the total elapsed time that is attributed to pauses.
This feature identifies irregular patterns in the question answering sequence.
Feature
f__sequence_jump
quantifies the number of question sequences the interviewer skipped or jumped over when answering an item, relative to the previous item. It is derived as follows:
- For each item, retain only the last interviewing event of type
AnswerSet
from the paradata. If an item was answered multiple times (e.g., re-answered later during the interview), only the last event is considered. - For each item, calculate the difference
diff
between the answer sequence (sequential number in paradata, sorted by columnorder
) and the question sequence (the order of question in the questionnaire). - Determine the change in
diff
relative to the previous item. This step helps in identifying whether (groups of) questions were answered sequentially, even if there were previous sequence jumps.
A f__sequence_jump
value of 0 implies that a question was answered immediately after the preceding question, according to the questionnaire sequence. Negative values denote a jump back in the questionnaire sequence, while positive values indicate a forward jump. A positive jump can also arise when preceding questions are disabled and therefore unanswerable. It's worth noting that the initial question in roster instances, other than the first row, often has negative jumps, as interviewers "revert" a few questions in the questionnaire sequence to commence with the subsequent roster item.
Score
Type 1 Score. Anomalies within f__sequence_jump
are detected using iNNE, a fast, isolation-based algorithm adept at detecting local anomalies. The default contamination level is set to 0.1. If required, it can also be set manually or be determined automatically by the system. Anomalies are detected on the item-level by variable_name
.
In essence, anomalies are considered as unusual sequence jumps to an item. These can arise either from legitimate yet atypical enablement patterns in preceding items or from interviewers skipping or backtracking to answer the item.
s__sequence_jump
represents the fraction of items in an interview that were determined to be anomalous based on their f__sequence_jump
values.
This feature identifies interviewers' tendency to repeatedly select the same answer options in single-select questions.
Feature
f__single_question
is constructed on the item level for all questions of type SingleQuestion
in the microdata with fixed answer options (question is not linked), 2 or more answer options, and which are not of the Combobox type. The feature indicates the relative position of the chosen answer: a value of 0 means the top answer was selected, and a value of 1 means the bottom-most answer was selected. The feature is scaled in intervals of 1/N, where N is the total number of answer options available.
Score
Type 3 Score. s__single_question
represents the proportion of questions for a given interviewer where there's a noticeable pattern in selecting similar or identical answer options compared to other interviewers. The score is derived by:
- Identifying those
variable_name
for each interviewer with observations exceeding five times the number of available answer options. - Computing the entropy for these
variable_name
for each interviewer, capturing the diversity in their answers. - Designating a
variable_name
as anomalous for an interviewer if its entropy is less than half the median entropy value compared to other interviewers. - Calculating by interviewer the share of anomalous variables over all variables for which entropy was assessed.
This feature identifies the extent to which tablet time has been adjusted backward in an interview, usually done deliberately by interviewers.
Feature
f__time_changed
is calculated at the unit level by examining the time differences between consecutive active interviewing events recorded in the paradata. All negative time differences, excluding those within a 180-second range, are aggregated. This approach deliberately omits minor negative time intervals (up to 3 minutes) which can arise when questions are answered after initiating a GPS question but before its response has been logged. The negative time differences captured in f__time_changed generally indicate intentional adjustments to the tablet's clock by the interviewer. Such adjustments might be made to alter the timestamps shown in timestamp questions or to give the appearance of conducting interviews at different times or dates.
Score
Type 2 Score. s__time_changed
is computed by rounding f__time_changed
to the nearest 10 minutes interval. This is done to control sensibility of the algorithm used for score aggregation.
This feature quantifies the total interview duration.
Feature
s__total_duration
is constructed on the unit level and contains an approximation of the total duration of active interviewing events. It is built by:
- Keeping only active interviewing events, i.e of type
['AnswerSet', 'AnswerRemoved', 'CommentSet', 'Resumed', 'Restarted']
. Note, as of now, events of typePaused
are excluded and not counted towards the total. - For every event, computing the time interval to its preceding event (sorted by column
order
). - Assigning NaN (Not a Number) to negative time intervals, which are unrealistic durations often resulting from interviewers altering the tablet's time settings.
- Sum all time intervals for an interview that are shorter than 30 minutes. Longer time intervals occasionally occur, but it is save to assume that no active interviewing activities occurred if no event was recorded in the time interval.
It's important to understand that f__total_duration
is an approximation of the actual duration of all interviewing events. Pauses are excluded. In the future, it could be considered to include short pauses.
Score
Type 2 Score. s__total_duration
is computed by rounding f__total_duration
to the nearest 10 minutes interval. This is done to control sensibility of the algorithm used for score aggregation.
This feature quantifies the total time that has elapsed in an interview.
Feature
f__total_elapsed
is computed at the unit level by calculating the time difference between the first and last interviewing events recorded in the paradata. It measures the total time that has elapsed from the moment an interview was started to the last action taken by an interviewer before any Supervisor or Headquarters role first interacted with the interview. In scenarios where interviews were conducted in a single session without any interruptions, f__total_elapsed
closely mirrors f__total_duration
. However, for interviews that experienced pauses and were resumed later, the elapsed time captured by f__total_elapsed
could be substantially longer than that of f__total_duration
.
Score
Type 2 Score. First, anomalies in f__total_elapsed
are identified using ECOD, an efficient, non-parametric algorithm that leverages cumulative distribution functions.
Subsequently, two scores are computed: s__total_elapsed_lower
and s__elapse_duration_upper
. Both are boolean values representing anomalies detected at the lower or upper boundaries of elapsed time, respectively. By distinguishing between the lower and upper anomalies, we mitigate potential correlations with the s__total_duration
score.