SDTM defines 'Domain' as a collection of logically related observations with a common topic.
SDTM domains are broadly classified into three general observation classes (Findings, Events and Interventions) and special purpose domains.
Findings class captures the observations from planned evaluations like laboratory, ECG testing, questionnaires, vital signs. The sub-type 'Findings about' captures the findings related to an event or intervention.
Events class captures 1) Planned protocol milestones such as randomization and study completion (disposition data) 2) Occurrences, conditions or incidents independent of planned study evaluations (MH, AE)
Interventions class captures information about 1) Investigational treatment (EC, EX) 2) Therapeutic treatments or procedures coincident with or prior to study (CM, PR) 3) Other substances self-administered by subject (SU)
Based on the 'core', variables are classified as Required, Expected and Permissible.
Required: Variables that must be present in a dataset and cannot contain null values. Examples include identifier variables like DOMAIN, USUBJID, xxSEQ and topic variables like xxTESTCD, xxTERM, xxTRT.
Expected: Variables that must be present in a dataset but can contain null values. Examples include reference dates like RFSTDTC, RFENDTC, RFXSTDTC in the demographics domain, AESTDTC, AEENDTC in the adverse events domain, EXDOSE in the exposure domain, and LBCAT, LBORRES in the laboratory domain.
Permissible: Variables that are kept in a dataset only when the data is collected; their values can be null. Examples include BRTHDTC, ETHNIC in the demographics domain, some identifier variables like xxREFID, xxSPID, and some timing variables like xxTPT, xxRTPT.
Based on the 'role', variables are classified as Identifier, Topic, Qualifier, Timing and Rule variables.
The five types of qualifier variables in SDTM are Grouping qualifiers, Synonym qualifiers, Result qualifiers, Variable qualifiers and Record qualifiers.
Variables which are used to identify the study, domain, subject and record sequence are the common identifier variables.
Some additional identifiers include xxGRPID, xxREFID, xxSPID, xxLNKID, xxLNKGRP
Date/time of collection in findings class xxDTC
Start date/time and End date/time of an event in events class
Start date/time and End date/time of a medication in interventions class
VISITNUM, VISIT, EPOCH, VISITDY associated with a date of collection or start date of an event/intervention
Study days associated with above date variables etc.
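The study day calculation mentioned in the last item can be sketched in Python as below. This is a minimal illustration that assumes complete dates and the usual SDTM convention that there is no study day 0; the function name study_day is hypothetical.

```python
from datetime import date


def study_day(event_date: date, ref_start: date) -> int:
    """Study day of event_date relative to the reference start date (e.g. RFSTDTC).

    Assumes complete dates. The reference start date itself is day 1 and
    the day before it is day -1, so there is no day 0.
    """
    delta = (event_date - ref_start).days
    return delta + 1 if delta >= 0 else delta
```

For example, an assessment on the reference start date gets study day 1, and one on the previous calendar day gets study day -1.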
The topic variable in a findings domain is xxTESTCD, where xx stands for the domain of interest.
For example, LBTESTCD is the topic variable in LB domain, EGTESTCD is the topic variable in EG domain, VSTESTCD is the topic variable in VS domain, QSTESTCD is the topic variable in QS domain and so on.
The topic variable in an interventions domain is xxTRT, where xx stands for the domain of interest.
For example, CMTRT is the topic variable in CM domain and EXTRT is the topic variable in EX domain.
The topic variable in an events domain is xxTERM, where xx stands for the domain of interest.
For example, AETERM is the topic variable in AE domain, MHTERM is the topic variable in MH domain, DVTERM is the topic variable in DV domain and DSTERM is the topic variable in DS domain.
Possible values of origin are: CRF, eDT, Assigned, Derived.
The majority of the data values in SDTM come from the CRF.
Some variable values, like DOMAIN, are directly assigned in programming.
When data is collected in other systems, like electrocardiogram and central laboratory systems, the origin becomes 'eDT'.
Some variables, like RFSTDTC, RFENDTC, relative study day and xxSEQ, are derived programmatically.
Demographics (DM), Subject Visits (SV), Subject Elements (SE) and Comments (CO) are the special purpose domains in SDTM. SUBJID is the topic variable in DM, VISITNUM is the topic variable in SV, ETCD is the topic variable in SE and COVAL is the topic variable in CO.
SDTM allows only a set of standard variables to be kept in a dataset. However, in most studies additional information is collected based on study needs, and this information cannot be represented in standard variables. It is represented as non-standard variables in a supplementary dataset linked to a parent domain, and these non-standard variables are called supplemental qualifiers.
The structure of a supplemental qualifier dataset is one record per IDVAR, IDVARVAL, and QNAM value per subject.
Findings about domain is used to collect additional information about an event or intervention.
Example 1:
In a study evaluating a new treatment for anemia, anemia history is captured as part of medical history. The medical history domain structure does not allow collection of additional information about anemia apart from the start date, end date, severity etc. Additional information about symptoms associated with the anemia history, like shortness of breath and fatigue, can be collected in the Findings About domain.
Example 2:
In a study related to Chronic Kidney Disease (CKD), information related to start date of CKD is collected as part of medical history. However additional follow-up questions about this medical history (event) like the cause of CKD (diabetic nephropathy, hypertensive nephropathy etc), and stage of CKD do not fit in medical history, so this information is collected as part of FA domain.
The Findings About domain is used to collect additional information about an event or intervention, and it has a structure similar to a findings domain. The tests (TESTCDs) collected in FA are linked to the parent event or intervention through the FAOBJ variable. In most cases, FAOBJ will contain the --TERM, --TRT or --DECOD value from the event or intervention domain.
TPT and associated variables are required when a collection or assessment is made with reference to another timepoint. Example: A subject's vital signs are to be collected at 30 min, 60 min and 90 min post treatment dosing. In such cases, we use VSTPT and associated variables for data collection. We populate VSTPT as 30 min, 60 min and 90 min for the respective collections, populate VSTPTREF as 'DOSE ADMINISTRATION' and populate VSRFTDTC with the date/time of dosing to indicate that the values 30 min, 60 min and 90 min are with reference to the dose administration date/time. We can also populate the planned elapsed time using VSELTM with the values PT30M, PT1H and PT1H30M to indicate that the samples are meant to be collected at the specified timepoints after dosing.
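The time point example above can be sketched in Python as follows. This is only an illustration: the helper names, the dosing date/time and the use of VSTPTNUM as a simple running number are assumptions, not something taken from a specific study.

```python
def minutes_to_iso8601(minutes: int) -> str:
    """Format a planned elapsed time in minutes as an ISO 8601 duration."""
    hours, mins = divmod(minutes, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if mins or not hours:
        text += f"{mins}M"
    return text


def build_timepoint_records(planned_minutes, dose_dtc):
    """Build the time point variables for each planned post-dose collection."""
    records = []
    for number, minutes in enumerate(planned_minutes, start=1):
        records.append({
            "VSTPT": f"{minutes} min",
            "VSTPTNUM": number,  # assumed: running number of the time point
            "VSELTM": minutes_to_iso8601(minutes),
            "VSTPTREF": "DOSE ADMINISTRATION",
            "VSRFTDTC": dose_dtc,
        })
    return records


# Hypothetical dosing date/time, used for illustration only.
timepoints = build_timepoint_records([30, 60, 90], "2024-01-10T08:00")
```

With the three planned collections above, VSELTM comes out as PT30M, PT1H and PT1H30M, matching the values described in the text.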
SDTM defines 'EPOCH' as a planned period of time that serves a purpose in the trial as a whole (like screening, treatment, follow-up).
SDTM defines 'ELEMENT' as the basic building block in the trial design (like screening, planned treatments such as investigational product or comparator, washout and follow-up).
So, EPOCH is an upper level grouping of ELEMENT. For example, in the treatment epoch a subject belongs to either the investigational product element or the control element, based on the arm to which the subject is randomized.
Clinical Events domain is used to collect information about events of clinical interest as per protocol.
In a study evaluating a new treatment for chronic kidney disease, certain events like patient requiring dialysis, receiving or planning kidney transplantation are events of clinical interest. Such events are captured in CE domain.
Adverse event is defined as 'Any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have a causal relationship with this treatment'
SDTM defines clinical events as any event of interest that would not be classified as an adverse event. Examples include episodes of symptoms of the disease under study: an episode of migraine in a study treating migraine, an episode of seizure in an epilepsy study, an episode of hypoglycemia in a study treating diabetes, or the occurrence of stroke in a study providing preventive treatment for stroke. Another example is a subject experiencing unusual physical activity or a stressful event because of wearing a device in a study evaluating a new wearable device.
Findings domains of SDTM have a normalized (long) structure in which the results of different tests are presented as rows using the same variables (TESTCD and ORRES). Sometimes we need more specificity about the data present on a given record (e.g. the origin, derivation or presentation requirement of one TESTCD may differ from other TESTCDs). In such cases, we present the variable-level metadata separately for each value of the TESTCD. Such metadata is called value-level metadata.
Dataset name
Dataset Description
SDTM class
Record structure
Purpose of the dataset
Unique keys and Documentation
Type (data type)
Controlled terms or Format
In addition to the variable level metadata items, we will have the variable (for which value-level metadata-VLM is required) and a where clause (to identify the list of values for which the VLM is applicable)
Trial summary (TS): contains the summary (planned and actual characteristics) of trial in a structured format
Trial elements (TE): contains the start rule and end rule definitions of each element in the study
Trial arms (TA): contains the ordered sequence of elements of each trial arm in the study
Trial visits (TV): describes the planned visits in the trial including the planned study day of each visit
Trial Inclusion/Exclusion (TI): contains all the inclusion and exclusion criteria for the trial, including different amendments
Trial Disease Assessments (TD): contains the protocol-specified disease assessment schedule
The first step is to create the required QNAMs as variables. The dataset can then be restructured in one of two ways:
1) TRANSPOSE these QNAM variables and remove the records with a null QVAL after the transpose. 2) Use explicit OUTPUT statements to restructure the dataset.
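The restructuring step can be sketched in Python as below, as an equivalent of the transpose approach. The function name, study and subject identifiers, and QLABEL text are hypothetical, and only a subset of the supplemental qualifier variables is shown (QORIG and QEVAL are left out for brevity).

```python
def build_supp(parent_rows, domain, idvar, qnam_labels):
    """Turn non-standard columns of a parent domain into SUPP-- records."""
    supp = []
    for row in parent_rows:
        for qnam, qlabel in qnam_labels.items():
            qval = row.get(qnam)
            if qval is None or qval == "":
                continue  # drop records with a null QVAL
            supp.append({
                "STUDYID": row["STUDYID"],
                "RDOMAIN": domain,
                "USUBJID": row["USUBJID"],
                "IDVAR": idvar,
                "IDVARVAL": str(row[idvar]),
                "QNAM": qnam,
                "QLABEL": qlabel,
                "QVAL": str(qval),
            })
    return supp


# Hypothetical parent records with one non-standard variable (AETRTEM).
ae = [
    {"STUDYID": "XYZ", "USUBJID": "XYZ-001", "AESEQ": 1, "AETRTEM": "Y"},
    {"STUDYID": "XYZ", "USUBJID": "XYZ-001", "AESEQ": 2, "AETRTEM": ""},
]
suppae = build_supp(ae, "AE", "AESEQ", {"AETRTEM": "Treatment Emergent Flag"})
```

Only the first AE record yields a supplemental record here, because the second one has no value for the qualifier.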
AETRTEM in SUPPAE and TRTEMFL in ADAE are used to indicate if an event is treatment-emergent. These two represent the same concept and will yield the same results unless there are partial dates with which treatment emergence cannot be clearly identified. If imputation of dates is required for analysis, TRTEMFL will be derived in the ADAE dataset using imputed dates.
When the collected result (original result) is not in the standard unit, we need to apply a conversion factor to convert the original result to the standard result.
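A minimal Python sketch of this conversion is shown below. The lookup table and function name are hypothetical; the single entry uses the hemoglobin conversion from g/dL to g/L (factor 10) purely as an illustration, and a real study would maintain its own conversion table.

```python
# Hypothetical conversion table: (test code, original unit, standard unit) -> factor.
CONVERSION_FACTORS = {
    ("HGB", "g/dL", "g/L"): 10.0,
}


def to_standard(testcd, orres, orresu, stresu):
    """Return the numeric standard result for a collected result."""
    if orresu == stresu:
        return float(orres)  # already in the standard unit
    factor = CONVERSION_FACTORS[(testcd, orresu, stresu)]
    return float(orres) * factor
```

A missing entry raises a KeyError, which makes unit combinations without a defined conversion factor visible instead of silently passing them through.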
We have to fetch the normal ranges from a central repository of normal ranges by merging based on age and gender. Before merging, we may have to derive a variable to calculate AGE at collection.
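The lookup can be sketched in Python as follows. The range table layout, the column names AGELO and AGEHI, and the limits themselves are placeholders for illustration only and are not clinical reference values.

```python
from datetime import date

# Hypothetical range table; the limits are placeholders, not clinical values.
NORMAL_RANGES = [
    {"LBTESTCD": "HGB", "SEX": "F", "AGELO": 18, "AGEHI": 120,
     "LBSTNRLO": 1.0, "LBSTNRHI": 2.0},
    {"LBTESTCD": "HGB", "SEX": "M", "AGELO": 18, "AGEHI": 120,
     "LBSTNRLO": 3.0, "LBSTNRHI": 4.0},
]


def age_at(birth: date, collected: date) -> int:
    """Age in completed years on the collection date."""
    before_birthday = (collected.month, collected.day) < (birth.month, birth.day)
    return collected.year - birth.year - before_birthday


def find_range(testcd, sex, age):
    """Return (low, high) from the first matching range row, or None."""
    for row in NORMAL_RANGES:
        if (row["LBTESTCD"] == testcd and row["SEX"] == sex
                and row["AGELO"] <= age <= row["AGEHI"]):
            return row["LBSTNRLO"], row["LBSTNRHI"]
    return None
```

Returning None when no row matches keeps records without an applicable range identifiable for review.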
Metadata items associated with an SDTM variable in a data definition file (define.xml) are :
1) Variable Name
2) Variable Label
3) Variable Type
4) Variable Length
5) Controlled terms, codelist or Format
6) Origin and Role
7) Derivation/Comment
The subjects who are not screen failures but are not randomized can be identified using the information present in the demographics domain. For these subjects, the DM.ARM value will be "NOT ASSIGNED", for screen failures the DM.ARM value will be "SCREEN FAILURE", and for randomized subjects the ARM value will contain the treatment to which the subject is randomized.
The planned and actual arm variables of Demographics (ARMCD/ARM vs ACTARMCD/ACTARM) can be used to identify subjects who received an incorrect treatment when compared to the planned treatment arm.
The UNPLAN element of the Subject Elements domain can also be used to check if a subject received a wrong treatment.
The subjects who are randomized but not treated can be identified using the information present in the demographics domain. For these subjects, DM.ACTARM will be "NOT TREATED", for screen failures DM.ACTARM will be "SCREEN FAILURE", for subjects who are neither screen failures nor randomized DM.ACTARM will be "NOT ASSIGNED", and for treated subjects DM.ACTARM will contain a value corresponding to the actual treatment the subject is exposed to.
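The ARM and ACTARM checks described above can be combined into one small Python sketch. The function name and the category labels it returns are hypothetical; the special values are the ones named in the text.

```python
def classify_subject(arm: str, actarm: str) -> str:
    """Classify a subject from the DM.ARM and DM.ACTARM values."""
    if arm == "SCREEN FAILURE":
        return "screen failure"
    if arm == "NOT ASSIGNED":
        return "not randomized"
    if actarm == "NOT TREATED":
        return "randomized, not treated"
    if actarm != arm:
        return "treated, not as planned"
    return "treated as planned"
```

For example, a subject with ARM "DRUG A" and ACTARM "PLACEBO" (hypothetical arm names) falls into the "treated, not as planned" group.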
DM, Demographics domain belongs to Special Purpose Domains
SUBJID is the topic variable in Demographics domain
RFXSTDTC is used to capture the subject's treatment start date. RFSTDTC can also be used to capture the subject's treatment start date, however it may vary in certain cases.
For example, RFSTDTC can be populated with treatment start for treated subjects, randomization date for subjects who are randomized but not treated, informed consent or enrollment date for subjects who are not screenfailures but not randomized yet or discontinued before randomization.
In studies where the treatment start date is considered the reference start date, it is populated as the earliest exposure date. In cases where alternate definitions of the reference start date exist for different subjects, dates from disposition related datasets, like the randomization date, enrollment date or informed consent date, can be used.
RFENDTC is generally populated from End of study disposition page of CRF.
RFXSTDTC and RFXENDTC are generally derived as the earliest and latest exposure dates respectively, using exposure related raw datasets.
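The derivation can be sketched in Python as below. It assumes that all exposure dates are complete ISO 8601 values of the same precision, so that plain string comparison matches chronological order. The fallback to the start date when an end date is missing is also an assumption, and the sample records are hypothetical.

```python
def exposure_reference_dates(ex_rows):
    """Return {USUBJID: (RFXSTDTC, RFXENDTC)} from exposure records.

    Assumes complete ISO 8601 dates of the same precision, so that
    string comparison matches chronological order.
    """
    result = {}
    for row in ex_rows:
        start = row["EXSTDTC"]
        end = row.get("EXENDTC") or start  # assumed fallback for a missing end date
        first, last = result.get(row["USUBJID"], (start, end))
        result[row["USUBJID"]] = (min(first, start), max(last, end))
    return result


# Hypothetical exposure records.
ex = [
    {"USUBJID": "XYZ-001", "EXSTDTC": "2024-01-10", "EXENDTC": "2024-01-20"},
    {"USUBJID": "XYZ-001", "EXSTDTC": "2024-01-21", "EXENDTC": "2024-02-05"},
    {"USUBJID": "XYZ-002", "EXSTDTC": "2024-03-01", "EXENDTC": ""},
]
ref_dates = exposure_reference_dates(ex)
```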
RFPENDTC is the date when subject ended participation or follow-up or the last known date of contact in a trial.
In a randomized multiple arm study, ARMCD corresponds to the planned treatment which a subject is supposed to take as assigned at the time of randomization, while ACTARMCD corresponds to the actual treatment the subject is exposed to.
Different algorithms can be used to determine a subject's actual arm. For example, the first exposed treatment can be considered as actual treatment for a subject or the most frequently exposed treatment can be considered as the actual treatment.
In studies where death information is not explicitly collected, we can use adverse events data to get the death date and death flag. Different variables of adverse events indicate if death occurred. The presence of a record in AE with AEOUT="FATAL" or AESDTH="Y" or AETOXGR="5" indicates the death of the subject.
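The check on the adverse event records can be sketched in Python as follows. The function name is hypothetical, and only the flag is derived here; the choice of the death date is left out because the text does not state which AE date is used.

```python
def death_flag(ae_rows):
    """Return 'Y' if any AE record of a subject indicates death, otherwise None."""
    for row in ae_rows:
        if (row.get("AEOUT") == "FATAL"
                or row.get("AESDTH") == "Y"
                or str(row.get("AETOXGR")) == "5"):
            return "Y"
    return None
```

The function is meant to be called with the AE records of one subject at a time.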
RFPENDTC is derived by appending all the date variables from all applicable input raw datasets and picking the latest date for each subject.
Based on study team's decision, if the last known date is a partial date, we can keep partial date in RFPENDTC.
Most of the date variables in the Demographics domain are derived variables. For example, RFSTDTC and RFENDTC are derived using disposition, enrollment and exposure related raw datasets. RFXSTDTC and RFXENDTC are derived using exposure related raw datasets. RFPENDTC is derived from the majority of the raw datasets that contain a date variable.
When CRF allows collection of multiple RACE values for a subject, we generally populate RACE variable value as 'MULTIPLE', and the individual RACE values are populated in supplementary domain using QNAMs like RACE1, RACE2, RACE3 etc.
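A minimal Python sketch of this mapping is given below. The function name and sample values are hypothetical, and only a subset of the supplemental qualifier variables is shown.

```python
def map_race(usubjid, collected_races):
    """Return (DM.RACE, SUPPDM rows) for the races collected for one subject."""
    if len(collected_races) <= 1:
        # Zero or one race collected: no supplemental records are needed.
        return (collected_races[0] if collected_races else None), []
    supp = [
        {"RDOMAIN": "DM", "USUBJID": usubjid, "QNAM": f"RACE{i}", "QVAL": race}
        for i, race in enumerate(collected_races, start=1)
    ]
    return "MULTIPLE", supp
```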
Information related to the death of a subject is generally captured on the end of study page and in adverse event related raw datasets. We fetch the death information from these datasets. In disposition raw datasets, we find the reason for study discontinuation as 'Death', while in adverse event related raw datasets, we will have a variable indicating if an event led to the death of the subject, or the event outcome variable may be marked as 'Fatal'. When the toxicity grade of an event is collected as per CTCAE, a grade of 5 indicates that the event resulted in the death of the subject.
Subject Visits domain is used for presenting the start and end dates of the visits that the subjects completed in the study. Whereas in a Trial Visits domain we will have all the protocol scheduled visits.
Trial visits dataset contains all the planned visits of the study and it is not a subject level dataset. Whereas, Subject visit dataset contains the start and end date of the actual visits completed by each subject. Some of the actual visits completed by a subject can be unplanned visits. TV dataset does not contain any information about unplanned visits as we will not know which subject will need an unscheduled visit and the timing of the unscheduled visits.
Unscheduled visits are remapped such that they are in a chronological sequence with planned visits, and are incremented by decimal places to the planned visit number value.
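One possible numbering scheme is sketched in Python below. The increment of 0.01 and the function name are assumptions, and the sketch expects complete ISO 8601 dates and a planned visit as the first visit of the subject.

```python
def number_visits(visits, step=0.01):
    """Assign VISITNUM to the unscheduled visits of one subject.

    visits: list of dicts with SVSTDTC and VISITNUM (None when unscheduled).
    Assumes complete ISO 8601 dates and that the first visit is a planned one.
    """
    numbered = []
    last_planned = None
    offset = 0
    for visit in sorted(visits, key=lambda v: v["SVSTDTC"]):
        visit = dict(visit)  # copy so the input is not modified
        if visit["VISITNUM"] is not None:
            last_planned = visit["VISITNUM"]
            offset = 0
        else:
            offset += 1
            visit["VISITNUM"] = round(last_planned + offset * step, 2)
        numbered.append(visit)
    return numbered
```

With this scheme, two unscheduled visits between planned visits 1 and 2 are numbered 1.01 and 1.02, which keeps them in chronological order with the planned visits.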
VISITDY represents the planned study day of the visit, whereas SVSTDY and SVENDY represent the actual study days of the start and end of the visit respectively.
If a subject's experience for a particular period of time cannot be represented with one of the planned elements (like receiving the wrong treatment or a dose other than the planned dose), that period of time can be presented as an unplanned element. We populate ETCD as UNPLAN on such a record in SE and use SEUPDES to populate the description of the unplanned element.
When compared to other parent SDTM domains, the structure of Comments domain is different and the structure resembles that of a supplementary domain. So, we will have RDOMAIN, IDVAR, IDVARVAL variables to identify the domain and record to which a comment is applicable. COVAL variable is used to present the actual comment collected.
Generally, most of the data in SDTM domains come directly from CRF. In very few instances of findings domains, there can be some derived records. We use --DRVFL to indicate that a record is derived record (when compared to other collected records).
Examples include:
1) A new TESTCD is created for capturing the total score of a questionnaire.
AETRTEM in SUPPAE is derived by comparing AESTDTC with RFXSTDTC and RFXENDTC. Based on the drug's half-life we may have to add 'n' number of days to RFXENDTC.
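The comparison can be sketched in Python as below. The function name and the window_days parameter (the 'n' days added for the drug's half-life) are hypothetical, and the sketch returns no value when any of the dates is partial or missing, since treatment emergence cannot be decided without imputation in that case.

```python
from datetime import date, timedelta


def treatment_emergent(aestdtc, rfxstdtc, rfxendtc, window_days=0):
    """Return 'Y', 'N', or None when a date is missing or partial."""
    values = (aestdtc, rfxstdtc, rfxendtc)
    if any(v is None or len(v) < 10 for v in values):
        return None  # partial or missing date: cannot decide without imputation
    # Compare on the date part only (first 10 characters of the ISO 8601 value).
    start, first_dose, last_dose = (date.fromisoformat(v[:10]) for v in values)
    cutoff = last_dose + timedelta(days=window_days)
    return "Y" if first_dose <= start <= cutoff else "N"
```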
Disposition domain is used to store subjects' information about protocol milestones like informed consent, randomization and disposition events like treatment, follow-up, study completion status.
EX contains the data related to a subject's exposure to protocol-specified study treatments in protocol-specified units.
EC, in contrast, is used to capture a subject's exposure to protocol-specified study treatment as collected on the CRF, that is, the collected units need not be the same as the protocol-specified units. For example, the protocol-specified unit of dosage is 'mg/kg' and the CRF collects the total volume of injection administered. We present the volume administered in the EC domain and the derived dosage in terms of mg/kg in the EX domain.
If the collected information is in terms of protocol-specified units then usage of EC domain is optional.