Status
| Package | mighty.metadata |
| Status | Approved |
| Version | 0.1.0 |
| Description | ADR for base structure and initial release of mighty.metadata |
Success criteria
With the initial v0.1.0 release of mighty.metadata, users will be able to specify ADaM data sets using our new YAML-based format. Functions and classes in mighty.metadata will facilitate easy manipulation, validation, and downstream usage.
Context
mighty.metadata is a new R package defining a new way of working with metadata and specifications for programming CDISC ADaM data sets. It replaces legacy Excel metadata that are stored in tabular form generally known as CST files.
The specification of ADaM datasets in mighty.metadata has a dual purpose of equal importance:
- Full specification in a machine readable format that can be used for automation and to reduce existing manual coding workflows.
- Automatic generation of ADaM scripts using the complete mighty framework.
In order to achieve this we need agreement on the aspects below.
User interface:
- How will a Clinical Data Scientist specify their ADaM dataset?
- What will be the main functionalities exported by the package?
Scope of the package:
- What is the technological scope of the solutions?
- To what degree will additional business logic be implemented?
Integration:
- What is our main tech stack?
- How do we ensure compatibility with mighty and mighty.toolbox?
- Should mighty.metadata be developed for Open-Source?
Decisions
1. Use YAML
YAML is used to specify ADaM data sets since it is machine readable, integrates well with version control in Git, and is easy for a Clinical Data Scientist (CDS) to edit by hand. There is one YAML file per ADaM domain. The structure of the YAML files will be defined with a JSON schema.
2. Scope of solutions (YAML, R)
The package will only read YAML specification files, and only write modified YAML specification files. All other functionalities will ingest and return R objects.
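As a rough illustration of the read/modify/write cycle, the sketch below parses a small specification, changes it as an R object, and serialises it back to YAML. The field names (`domain`, `label`, `columns`) are assumptions made for this example and are not taken from the actual schema; the yaml package is used only as a stand-in for the package's own reader and writer, and no schema validation is performed here.

```r
library(yaml)

# Hypothetical specification; the real structure is defined by the JSON schema
spec_text <- "
domain: ADSL
label: Subject-Level Analysis Dataset
columns:
  - name: USUBJID
    label: Unique Subject Identifier
  - name: AGE
    label: Age
"

# Read: YAML text becomes a nested R list
spec <- yaml.load(spec_text)

# Modify: all manipulation happens on the R object
spec$columns[[2]]$label <- "Age (years)"

# Write: serialise the modified object back to YAML text
spec_out <- as.yaml(spec)
cat(spec_out)
```

For files on disk, `yaml::read_yaml()` and `yaml::write_yaml()` would play the same roles as `yaml.load()` and `as.yaml()` above.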
3. S7 based tech stack for robust validation and modern OOP
Main classes and functionalities are based on S7, and S7 generics and dispatch are used. YAML functionalities are based on S7schema and use the validator from there.
4. Main classes
The package will define two main classes: `mighty_study()` and `mighty_domain()`.
- `mighty_domain()` is a list of metadata regarding a single ADaM domain. A new instance will be created based on the YAML specification.
- `mighty_study()` is a named list of `mighty_domain()` objects. A new instance will be created based on a path to a folder of YAML specifications.
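The relationship between the two classes can be sketched with plain S7. This is a minimal sketch, not the package's implementation: `domain_sketch` and `study_sketch` are hypothetical stand-ins, the real `mighty_domain()` is expected to inherit from `S7schema::S7schema()` rather than directly from a list, and the objects are built in memory here instead of from YAML files.

```r
library(S7)

# Hypothetical stand-in for mighty_domain(): a list of metadata plus a name
domain_sketch <- new_class(
  "domain_sketch",
  parent = class_list,
  properties = list(name = class_character),
  validator = function(self) {
    if (length(self@name) != 1L) {
      return("@name must be a single string")
    }
  }
)

# Hypothetical stand-in for mighty_study(): a named list of domain objects
study_sketch <- new_class(
  "study_sketch",
  parent = class_list,
  validator = function(self) {
    entries <- S7_data(self)
    is_domain <- vapply(entries, S7_inherits, logical(1), class = domain_sketch)
    if (!all(is_domain)) {
      return("all entries must be domain_sketch objects")
    }
    if (length(entries) > 0 &&
        (is.null(names(entries)) || any(names(entries) == ""))) {
      return("all entries must be named")
    }
  }
)

adsl <- domain_sketch(list(columns = c("USUBJID", "AGE")), name = "ADSL")
study <- study_sketch(list(ADSL = adsl))
names(study)
```

Because validation runs at construction, a study containing anything other than domain objects is rejected before it can reach downstream code.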
5. Business logic handling
The package will provide utility functions to reduce redundancies in ADaM specifications. Initially this will include inheritance of core variables, conditional filtering of specifications (e.g. subsetting a pooled specification to only include columns relevant for a single trial), and full population of sparse specifications (e.g. reusing label and format information from a predecessor). These utilities will ingest and return `mighty_study()` and/or `mighty_domain()` objects as applicable. Often a method for `mighty_study()` will call the generic on each `mighty_domain()` entry.
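The dispatch pattern described here, where the study method delegates to the domain method, could look roughly as follows. All names (`domain_sketch`, `study_sketch`, `filter_spec()`, the `trial` column) are hypothetical and chosen for illustration only; the filtering rule is a simplified assumption, not the package's actual logic.

```r
library(S7)

# Hypothetical stand-ins for mighty_domain() and mighty_study()
domain_sketch <- new_class("domain_sketch", properties = list(
  name = class_character,
  columns = class_data.frame
))
study_sketch <- new_class("study_sketch", parent = class_list)

# Hypothetical generic: keep only columns relevant for one trial
filter_spec <- new_generic("filter_spec", "x")

# Domain method: a column with trial = NA is assumed to apply to all trials
method(filter_spec, domain_sketch) <- function(x, ..., trial) {
  keep <- is.na(x@columns$trial) | x@columns$trial == trial
  x@columns <- x@columns[keep, , drop = FALSE]
  x
}

# Study method: call the generic on each domain entry, return a study again
method(filter_spec, study_sketch) <- function(x, ..., trial) {
  S7_data(x) <- lapply(S7_data(x), filter_spec, trial = trial)
  x
}

adsl <- domain_sketch(
  name = "ADSL",
  columns = data.frame(
    name = c("USUBJID", "AGE", "REGION"),
    trial = c(NA, NA, "B")
  )
)
study <- study_sketch(list(ADSL = adsl))
study_a <- filter_spec(study, trial = "A")
study_a[["ADSL"]]@columns$name
```

Since every method returns an object of the class it received, such utilities can be chained with the pipe.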
6. Process integration
In general all downstream usage, e.g. code generation in mighty, should be based on `mighty_study()`. For easier reuse of metadata specifications in scripts, the package will also include utility functions to create metadata in tabular format. The metadata will concern domains, columns, and parameters specified in `mighty_study()` and will be returned as a `data.frame()`/`tibble`. Similarly, the package will also contain utility functions to do the reverse: creating `mighty_study()` and `mighty_domain()` objects from the same tabular format.
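A minimal sketch of the two directions, using plain nested lists in place of the S7 classes. The functions `as_column_table()` and `from_column_table()` and the `name`/`label` fields are hypothetical and only illustrate the idea of a lossless round trip between the nested and the tabular form.

```r
# Plain-list stand-in for a study: one entry per domain, each with columns
study <- list(
  ADSL = list(columns = list(
    list(name = "USUBJID", label = "Unique Subject Identifier"),
    list(name = "AGE", label = "Age")
  )),
  ADAE = list(columns = list(
    list(name = "USUBJID", label = "Unique Subject Identifier")
  ))
)

# Nested -> tabular: one row per column, with the domain as a key
as_column_table <- function(study) {
  rows <- lapply(names(study), function(domain) {
    cols <- study[[domain]]$columns
    data.frame(
      domain = domain,
      name = vapply(cols, function(col) col$name, character(1)),
      label = vapply(cols, function(col) col$label, character(1)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Tabular -> nested: split by domain, keeping the original domain order
from_column_table <- function(tbl) {
  groups <- split(tbl, factor(tbl$domain, levels = unique(tbl$domain)))
  lapply(groups, function(rows) {
    list(columns = lapply(seq_len(nrow(rows)), function(i) {
      list(name = rows$name[i], label = rows$label[i])
    }))
  })
}

md_col <- as_column_table(study)
rebuilt <- from_column_table(md_col)
```

Keeping the domain as an explicit key column is what makes the reverse direction possible without extra information.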
7. Open-Source
mighty.metadata will be developed to be sufficiently generic to be suitable as a publicly available R package. It will be hosted on the NovoNordisk-OpenSource GitHub organization and submitted to CRAN.
8. CDISC Standards support
The aim is for mighty.metadata and its schema to be compatible with the following standards:
| Standard | Version |
|---|---|
| ADaM | 2.1 |
| ADaM IG | 1.2 |
| OCCDS | 1.1 |
| Define-XML | 2.1 |
Consequences
Changes to current content
- Add `mighty_study()` class.
- Remove HTML viewer and formatting function.
- Move `build_adam_metadata()`, `write_adam_yaml()`, and `write_adam_domain_yaml()` to {plz} (utility package for NN specific workflows).
- Update `make_mdcol_from_yaml()` to ingest `mighty_study()`.
Alternatives Considered
1. Work only directly with YAML files
- Pros: Simplifies technical implementation if we just read/write YAML files directly instead of defining new `mighty_domain()` and `mighty_study()` classes.
- Cons: Lose the ability to work with the specifications in memory (suited for e.g. a later Shiny editor app).
- R-specific considerations: Validation without using the parent S7schema class (or similar) would require significant development that is out of scope for this package.
- Why not chosen: Validation is needed both when reading and when writing specifications. In-memory manipulation of specs is needed for any further app development, and it also makes it easier to work with the metadata manually from R.
2. Use S3 base R classes and methods
- Pros: Keep it simple, without using new “cutting-edge” libraries such as S7.
- Cons: Makes adopting S7schema validation harder.
- R-specific considerations: Lose the new more robust dispatch features of S7.
- Why not chosen: We need S7schema for validation, and S7schema needs to be based on S7 due to how the JavaScript validator is implemented.
3. Use R6
- Pros: Use a more widely adopted OOP library.
- Cons: Lose the more pipeable user interface from S7.
- R-specific considerations: We do not need to modify objects in place, which is the standard in R6.
- Why not chosen: Implementing the pipeable user interface is simpler with S7.
4. Use jsonvalidate instead of S7schema
- Pros: Would have saved development time, since S7schema was initially developed for use in mighty.metadata.
- Cons: jsonvalidate is less robust, and its reporting of validation errors is lacking. It also does not include functions to document a schema (planned in S7schema).
- R-specific considerations: jsonvalidate is based on R6, and we would therefore lose the `S7::validate()` integration. The serializer in jsonvalidate looked really promising, but when tested it was not very robust.
- Why not chosen: Reporting of validation errors is not good enough for us to be able to present them to a user in a meaningful way.
Implementation Details
Key Functions
- `mighty_domain()`: Metadata object for a single ADaM domain. Inherits from `S7schema::S7schema()`.
- `mighty_study()`: Metadata object for a study. Named list of `mighty_domain()` objects.
- Utility functions to manipulate columns, rows, and parameters in a `mighty_domain()` object.
- Utility functions to create tabular representations of `mighty_study()` metadata.
- Function to add core variables (e.g. from ADSL) as predecessors in another `mighty_domain()` domain.
- Function to conditionally filter specifications suitable for subsetting a pooled study specification.
- Function to fully specify a domain when sparse information was provided. E.g. defining the format of a predecessor column based on the format of the origin column.
- Utility function to create YAML files based on tabular representations of `mighty_study()` metadata. Consider if we want to add a wrapper covering the Pinnacle 21E Excel metadata format.
### Example workflow
Below is an example workflow. Note that function names are stand-ins unless they are explicitly specified above:

    mighty_study("path/to/specs") |> # --> List of `mighty_domain()` specs
      filter_with_metadata() |>      # --> Remove e.g. columns from a pooled spec that are not relevant for study A
      populate_core() |>             # --> Adds core variables as predecessors
      populate_sparse() |>           # --> All info filled out and still returning a `mighty_study()` object
      create_md_col()                # --> data.frame with all column specs

Documentation Strategy
- roxygen2 documentation for all exported functions. Group in meaningful sections.
- README showing simple use case relevant for a new user
- Vignette documenting the schema
- Getting Started vignette showing general workflow (see example above)
- Article containing this ADR until we have a better place to put it.
- Include in overview of the mightyverse in the mighty package.
- Include as part of holistic ADaM programming section in r-docs2.
Testing Strategy
- All scripts (`/R/{scriptname}.R`) should have a corresponding test file (`/tests/testthat/test-{scriptname}.R`).
- Focus on unit testing individual functions, including helper functions.
- Always aim for 100% test coverage.
- Also treat testing of larger exported functions as integration tests, making sure realistic use is tested, but rely on helper functions to throw errors on invalid input in order to keep tests modular.
- Use standard GitHub Actions workflows from r.workflows.
- Avoid repo-level lint configurations as much as possible to ensure adherence to the QA standards established by r.workflows.
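The split between helper-level unit tests and realistic tests of exported functions could look roughly like the sketch below. Both `check_domain_name()` and `new_domain_spec()` are hypothetical functions invented for this example; in the package they would live in `/R/` with the tests in the matching `/tests/testthat/` file.

```r
library(testthat)

# Hypothetical helper: owns the input check and its error message
check_domain_name <- function(name) {
  if (!is.character(name) || length(name) != 1L || is.na(name) || !nzchar(name)) {
    stop("`name` must be a single non-empty string", call. = FALSE)
  }
  invisible(name)
}

# Hypothetical exported function: relies on the helper for invalid input
new_domain_spec <- function(name, columns = character()) {
  check_domain_name(name)
  list(name = name, columns = columns)
}

# Unit test: invalid input is covered once, at the helper level
test_that("check_domain_name() rejects invalid input", {
  expect_error(check_domain_name(1), "single non-empty string")
  expect_error(check_domain_name(c("ADSL", "ADAE")), "single non-empty string")
  expect_identical(check_domain_name("ADSL"), "ADSL")
})

# Integration-style test: realistic use only, no repeated input checks
test_that("new_domain_spec() builds a spec for realistic input", {
  spec <- new_domain_spec("ADSL", columns = c("USUBJID", "AGE"))
  expect_identical(spec$name, "ADSL")
  expect_identical(spec$columns, c("USUBJID", "AGE"))
})
```

With this split, a change to the error message only affects the helper's test file.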
Compliance Considerations
- All development on GitHub using Pull Requests for merges to main branch, and standard ATMOS branch protection rules.
- R CMD check is required to pass on all relevant platforms before a PR is approved.
References
- jsonvalidate
- mighty.metadata
- mighty.toolbox (internal package)
- plz (internal package)
- r.workflows
- roxygen2
- S7
- S7schema
