browseMetadata

This R package was created to help a researcher browse the health datasets in the SAIL databank. It is intended to be useful in the earlier stages of a project, where datasets are being scoped out. When a research team has not yet got access to the data they can still browse the metadata, and start to address such questions as:

❓ what datasets are available?

❓ what datasets do I need for my research question?

❓ which variables within these datasets map onto my research domains of interest? (e.g. socioeconomic factors, childhood adverse events, medical diagnoses, culture and community)

What does the R package do?

This R package is a planning tool, designed to be used alongside other tools and sources of information about health datasets for research.

If a researcher wants to access datasets within SAIL databank, how do they know which variables will represent the concepts they care about for their research question? For many health datasets, including SAIL, the metadata is publicly available. This R package uses the Health Data Research Gateway and the connected Metadata Catalogue. This R package has a function which takes a metadata file as input and facilitates the process of browsing through each dataset and variable. The user is asked to categorise each variable into a domain related to their research question, and these categorisations get saved in a csv file for later reference. To speed up this process, the function automatically categorises some variables that regularly appear in health datasets (e.g. ID, Sex, Age).

🚧 ⚠️ This package is in early development, and has only been tested on a limited number of metadata files. In theory, this package should work for any dataset listed on the Health Data Research Gateway (not just SAIL) as long as a json metadata file can be downloaded. In practice, it has only been tested on a limited number of metadata files for SAIL databank.

Getting started with metadata

There are many existing tools that allow you to browse metadata for health datasets. These are listed in the RESOURCES.md file in this repository.

💡 These tools may be sufficient for you to address the example questions listed above.

Getting started with this `R` package `browseMetadata`

Install

Run in the R console:

install.packages("devtools")
devtools::install_github("aim-rsf/browseMetadata")

Demo

library(browseMetadata)

Read the documentation, then run the function in demo mode:

?domain_mapping

domain_mapping()

The R console will show:

ℹ Running domain_mapping in demo mode using package data files

Enter Initials: RS

Respond with your initials and press enter. It will ask you if you want to read the description of Data Assets and Data Classes (tables):

── Data Asset Name ────────────────────────────────────────────────────────────────────
Maternity Indicators Dataset (MIDS)

── Data Asset Last Updated ────────────────────────────────────────────────────────────
2023-12-04T14:13:49.131Z

── Data Asset File Exported By ────────────────────────────────────────────────────────
Rachael Stickland at 2024-01-05T13:22:09.774Z

ℹ Found 2 Data Classes (2 tables) in this Data Asset

Would you like to read a description of the Data Asset? (Y/N) Y

Enter Y to read these descriptions, for the purpose of the demo.

For this example, the Data Asset is called MIDS and the tables inside this Data Class are BIRTH and INITIAL_ASSESSMENT.

For each table, it will ask which variables to process:

Enter the range of variables (data elements) to process. Press Enter to process all: 1,10

If you press enter it will process all the variables, so use a smaller range like 1 to 10 the first time you run this demo.

For each data element (variable) you will be shown this structure:

DATA ELEMENT ----->  SERVICE_USER_HAS_MENTAL_HEALTH_CONDITION_CD 

DESCRIPTION ----->  Code indicating whether or not the woman has an existing mental health condition. 

DATA TYPE ----->  CHARACTER

By referencing the plots tab, and other info you may have, categorise this variable with a number(s). A variable can map to more than one domain so a comma separated list of numbers can be given (7,8).

There is an (optional) note field to explain your choice.

Categorise this variable: 8

Notes (write 'N' if no notes): N

If you make a mistake, the next prompt allows you to redo. Or press enter if you are happy to continue.

Press enter to continue or write 'redo' to correct previous answer:

When you get to the end of your requested number of variables it will show you variables that have been auto categorised.

If you want to change these auto categorisations, and do them manually, include the row number (1,9) in the list.

! Please check the auto categorised data elements are accurate:

 DataClass      DataElement        Domain_code
1      BIRTH    AVAIL_FROM_DT      1
9      BIRTH    CHILD_ALF_E        2
10     BIRTH    CHILD_ALF_STS_CD   2

Enter row numbers you'd like to edit or press enter to accept the auto categorisations:

Finally, it will ask you if you want to review the categorisations you previously made.

Would you like to review your categorisations? (Y/N)

If you say yes (with Y) it will take you through the same review process you just did for auto categorisations.

At the end of processing that Data Class (table) it will show:

ℹ Your final categorisations have been saved to LOG_MaternityIndicatorsDataset(MIDS)_BIRTH_2024-01-30_10-42-15.csv

The function will then repeat the same steps for the next Data Class (table).

Understanding the domain list

For this demo, a simple list of domains are provided, see data-raw/domain_list_demo.csv.

This list shows up in the R plot tab:

[0] NO MATCH / UNSURE
[1] METADATA
[2] ALF ID
[3] OTHER ID
[4] DEMOGRAPHICS
[5] Socioeconomic factors
[6] Location
[7] Education
[8] Health

There are 5 default domains always included [0-4], appended on to any domain list given.

For a research study, your domains are likely to be much more specific e.g. 'Prenatal, antenatal, neonatal and birth' or 'Health behaviours and diet'.

Output

The output of your decisions will be saved to a csv file. The csv file name includes the data asset, data class, and date stamp. This csv file, in addition to what is shown on the console, contains:

user initials (from user input)
metadata version (from json)
date time stamp the metadata was last updated (from json)
data asset (from json)

The intended use case for this log file is to be loaded up, compared across users, and used as an input in later analysis steps when working out which variables can be used to represent which research domains.

A subset of columns from the csv outputs are shown below, running with '1,10' data elements:

   DataClass         DataElement Domain_code                   Note
1      BIRTH       AVAIL_FROM_DT           1       AUTO CATEGORISED
2      BIRTH       BABY_BIRTH_DT           4                      N
3      BIRTH   BIRTH_APGAR_SCORE           8                      N
4      BIRTH       BIRTH_MODE_CD           8                      N
5      BIRTH         BIRTH_ORDER           8                      N
6      BIRTH    BIRTH_OUTCOME_CD           8                      N
7      BIRTH      BIRTH_TREAT_CD           0   No description given
8      BIRTH BIRTH_TREAT_SITE_CD           6                      N
9      BIRTH         CHILD_ALF_E           2       AUTO CATEGORISED
10     BIRTH    CHILD_ALF_STS_CD           2       AUTO CATEGORISED

            DataClass                                 DataElement Domain_code                           Note
1  INITIAL_ASSESSMENT                               AVAIL_FROM_DT           1               AUTO CATEGORISED
2  INITIAL_ASSESSMENT                                  GEST_WEEKS           8                              N
3  INITIAL_ASSESSMENT                              INITIAL_ASS_DT           8           Date of health visit
4  INITIAL_ASSESSMENT                              MAT_AGE_AT_ASS           4               AUTO CATEGORISED
5  INITIAL_ASSESSMENT                                MOTHER_ALF_E           2               AUTO CATEGORISED
6  INITIAL_ASSESSMENT                           MOTHER_ALF_STS_CD           2               AUTO CATEGORISED
7  INITIAL_ASSESSMENT                                     PROV_CD         6,8   Org code for health provider
8  INITIAL_ASSESSMENT                     SERVICE_USER_GRAVIDA_CD           8                              N
9  INITIAL_ASSESSMENT SERVICE_USER_HAS_MENTAL_HEALTH_CARE_PLAN_CD           8                              N
10 INITIAL_ASSESSMENT SERVICE_USER_HAS_MENTAL_HEALTH_CONDITION_CD           8                              N

Using your own input files

domain_mapping(json_file, domain_file)

This code is in early development. To see known bugs or sub-optimal features refer to the Issues.

Run the code the same as the demo, using your own input files.

The json file:

contains metadata about datasets of interest
downloaded from the metadata catalogue
see data-raw/maternity_indicators_dataset_(mids)_20240105T132210.json for an example download
the metadata catalogue refers to datasets as 'data assets' and tables within these as 'data classes'

The domain_file:

a csv file created by the user, with each domain listed on a separate line
see data-raw/domain_list_demo.csv for a template
the first 5 domains will be auto populated (see demo above)

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

The GNU General Public License is a free, copyleft license for software and other kinds of works. For more information, please refer to https://www.gnu.org/licenses/gpl-3.0.en.html.

Citation

To cite package ‘browseMetadata’ in publications use:

Stickland R (2024). browseMetadata: Browses available metadata, to catergorise/label each variable in a dataset. R package version 0.1.0.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {browseMetadata: Browses available metadata, to catergorise/label each variable in a dataset},
    author = {Rachael Stickland},
    year = {2024},
    note = {R package version 0.1.0},
    doi = {https://doi.org/10.5281/zenodo.10581500},
  }

Contributing

We warmly welcome contributions to the browseMetadata project. Whether it's fixing bugs, adding new features, or improving documentation, we welcome your involvement.

Report Issues: If you find a bug or have a feature request, please report it via GitHub Issues.
Submit Pull Requests: We welcome pull requests. Please read our Contribution Guidelines on how to make contributions.
Feedback and Suggestions: We're always looking to improve, and we value feedback and suggestions. Feel free to open an issue to share your thoughts.

For more information on how to contribute, please refer to our Contribution Guidelines.

Contributors ✨

This project follows the all-contributors specification, using the (emoji key). Contributions of any kind welcome!

_{Rachael Stickland}
🖋 📖 🚧 🤔

_{Batool Almarzouq}
📓 👀 🤔

_{Mahwish Mohammad}
📓

Acknowledgements ✨

Thank you to multiple members of the MELD-B research project and the SAIL Databank team for providing use-cases of meta data browsing, ideas and feedback. Thank you to the Health Data Research Innovation Gateway for hosting openly available metadata for health datasets, and for data providers that have included their datasets on this gateway.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

browseMetadata

What does the R package do?

Getting started with metadata

Getting started with this `R` package `browseMetadata`

Install

Demo

Understanding the domain list

Output

Using your own input files

License

Citation

Contributing

Contributors ✨

Acknowledgements ✨

Files

README.md

Latest commit

History

README.md

File metadata and controls

browseMetadata

What does the R package do?

Getting started with metadata

Getting started with this R package browseMetadata

Install

Demo

Understanding the domain list

Output

Using your own input files

License

Citation

Contributing

Contributors ✨

Acknowledgements ✨

Getting started with this `R` package `browseMetadata`