Cross-trial register analysis of safety data

Example workflow to show mangling and merging of EUCTR and CTGOV data on results of clinical trials
analyses
trials
Author

Ralf Herold

Published

2026-03-30

Modified

2026-06-07

This is an example workflow to show the use of trial data and package ctrdata.

Safety data analysis across registers

Registers of clinical trials include safety data that is collected during the trial.

The safety data collection is part of pharmacovigilance activities and employs a highly structured approach to the definition and detection of adverse events, to their assessment including causality and severity, and to the categorisation of the affected organs, functioning or other observations.

The overall approach is to obtain details of allocation or reporting groups, to merge this with details (counts) of all recorded types of serious adverse events SAEs per group, and finally to merge EUCTR and CTGOV data into a single data frame.

To start, necessary packages are loaded and a database collection is prepared:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(ctrdata)
dbc <- nodbi::src_sqlite(
  dbname = "database.sqlite", 
  collection = "saes")

Get trials of interest

Here, the research question is to tabulate safety data and events of special interest, for a class of medicinal products.

Safety data are part of the results-related data in registers such as EUCTR and CTGOV. Only EUCTR and CTGOV provide structured results-related information.

This use case looks into antibody drug conjugates, which can be associated with specific immunological inflammatory phenomena.

Trials for such medicinal products with result-related data in the therapeutic area oncology can be found with queries proposed by ctrdata, which are then loaded from these registers into the database collection.

queries <- ctrGenerateQueries(
  intervention = "antibody drug conjugate",
  onlyWithResults = TRUE,
  condition = "cancer"
)

result <- lapply(
  queries[c("EUCTR", "CTGOV2")],
  ctrLoadQueryIntoDb,
  con = dbc,
  euctrresults = TRUE
)

Several hundred records are found and stored.

Get data fields of interest

As a next step, from all retrieved trials, those of phase 2, 3 or 4 trials are selected, as they can be assumed to have a sufficiently long observation of the potential occurrences and larger participant numbers. Furthermore, the trials should have an internal control group and not be an extension or follow-up of a previous trial.

This selection is done by calculating trial concepts that are pre-defined in ctrdata and using them to filter relevant from all trials. In the same steps, fields are obtained from the database collection that hold the safety data of interest. The fields were identified beforehand (e.g., using https://regulatorysciencedata.eu/shiny/clinicaltrials/).

trialData <- dbGetFieldsIntoDf(
  fields = c(
    # This example only looks at SAEs
    # but can be extended to all AEs
    # by adding relevant fields and 
    # adapting subsequent code chunks
    #
    "adverseEvents.reportingGroups", # EUCTR
    "adverseEvents.seriousAdverseEvents.seriousAdverseEvent", # EUCTR
    #
    "resultsSection.adverseEventsModule.seriousEvents", # CTGOV2
    "resultsSection.adverseEventsModule.eventGroups" # CTGOV2
  ),
  calculate = c(
    "f.isUniqueTrial",
    "f.trialPhase",
    "f.trialTitle",
    "f.controlType"
  ),
  con = dbc
) %>% 
  filter(.isUniqueTrial) 

# Overview cross-tabulation
with(trialData, table(
  .trialPhase, .controlType))
               .controlType
.trialPhase     none no-treatment placebo active placebo+active other
  phase 1          0            0       0     19              0     0
  phase 1+2        1            0       0     27              0     0
  phase 2          4            0       0     64              0     2
  phase 2+3        0            0       0      1              0     0
  phase 2+4        0            0       0      0              0     0
  phase 3          0            0       4     24              0     1
  phase 3+4        0            0       0      0              0     0
  phase 1+2+3      0            0       0      0              0     0
  phase 4          0            0       0      5              0     0
  phase 1+2+3+4    0            0       0      0              0     0
# Filter relevant trials
trialData <- trialData %>% 
  filter(grepl("2|3|4", .trialPhase)) %>% 
  filter(!grepl("extension", .trialTitle)) %>% 
  filter(.controlType %in% c("placebo", "active"))

# Show example trial data
trialData %>% 
  slice_sample(n = 5L)
# A tibble: 5 × 9
  `_id`       adverseEvents.seriousAdverseEvents.seriou…¹ resultsSection.adver…²
  <chr>       <list>                                      <list>                
1 NCT02001623 <lgl [1]>                                   <df [87 × 5]>         
2 NCT02980341 <lgl [1]>                                   <df [52 × 5]>         
3 NCT01476410 <lgl [1]>                                   <df [40 × 5]>         
4 NCT02988817 <lgl [1]>                                   <df [103 × 5]>        
5 NCT04014075 <lgl [1]>                                   <df [40 × 5]>         
# ℹ abbreviated names: ¹​adverseEvents.seriousAdverseEvents.seriousAdverseEvent,
#   ²​resultsSection.adverseEventsModule.seriousEvents
# ℹ 6 more variables: resultsSection.adverseEventsModule.eventGroups <list>,
#   adverseEvents.reportingGroups.reportingGroup <list>, .isUniqueTrial <lgl>,
#   .trialPhase <ord>, .trialTitle <chr>, .controlType <fct>

The above shows that the safety data are represented in a nested data structure, that is, in a hierarchy of several levels (shown as lists or data frames within the trialData data frame).

To simplify handling nested data, ctrdata provides the function dfTrials2Long(), which generates for each scalar a data item (field) a single row in a long and narrow data frame.

A unique identifier is created for fields that belong together in a trial, e.g. reporting groups, endpoint definition and results.

trialDataLong <- trialData %>%
  dfTrials2Long()
# Total 139712 rows, 45 unique names of variables

trialDataLong %>% 
  slice_sample(n = 10L)
# A tibble: 10 × 4
   `_id`             identifier name                                 value      
   <chr>             <chr>      <chr>                                <chr>      
 1 NCT00947856       18.1       resultsSection.adverseEventsModule.… 3          
 2 NCT01777152       91.1       resultsSection.adverseEventsModule.… 223        
 3 NCT03907488       63.1       resultsSection.adverseEventsModule.… EG000      
 4 2015-001377-40-IT 11.1       adverseEvents.seriousAdverseEvents.… 0          
 5 NCT02573324       20.6       resultsSection.adverseEventsModule.… EG005      
 6 NCT02341625       63.9       resultsSection.adverseEventsModule.… 30         
 7 NCT02341625       42         resultsSection.adverseEventsModule.… Peritonitis
 8 NCT02631876       6.2        resultsSection.adverseEventsModule.… 109        
 9 NCT03729596       3.10       resultsSection.adverseEventsModule.… 1          
10 NCT03474107       184.1      resultsSection.adverseEventsModule.… 0          

Mangle EUCTR data

First, EUCTR data are used to obtain the reporting groups details, including id’s which are references in results data.

# Obtain reporting group data for adverse events,
# (including group identifier and title as well as 
# number of subjects with an AE, number with an SAE
# number of exposed, for each reporting group) 
euctrGroups <- dfName2Value(
  df = trialDataLong,
  valuename = paste0(
    "adverseEvents.reportingGroups.reportingGroup.", 
    "(id|title|subject)")
) %>%
  # Remove empty reporting groups
  na.omit() %>%
  # Expand into one row per trial and reporting group
  pivot_wider(id_cols = c(`_id`, identifier)) %>%
  select(-identifier) %>% 
  # Shorten variable names
  rename(
    title = 
      adverseEvents.reportingGroups.reportingGroup.title,
    reportingGroupId = 
      adverseEvents.reportingGroups.reportingGroup.id)
# Returning values for 24 out of 125 trials

euctrGroups %>% 
  slice_sample(n = 5L)
# A tibble: 5 × 6
  `_id`             reportingGroupId      adverseEvents.reportingGroups.report…¹
  <chr>             <chr>                 <chr>                                 
1 2009-010000-28-FR ReportingGroup-21307  40                                    
2 2017-002985-28-BE ReportingGroup-114825 6                                     
3 2020-002818-41-HU ReportingGroup-167229 46                                    
4 2012-004902-82-BE ReportingGroup-125107 902                                   
5 2015-001377-40-IT ReportingGroup-70662  11                                    
# ℹ abbreviated name:
#   ¹​adverseEvents.reportingGroups.reportingGroup.subjectsAffectedByNonSeriousAdverseEvents
# ℹ 3 more variables:
#   adverseEvents.reportingGroups.reportingGroup.subjectsAffectedBySeriousAdverseEvents <chr>,
#   adverseEvents.reportingGroups.reportingGroup.subjectsExposed <chr>,
#   title <chr>

Next, serious adverse events (SAEs) details are obtained. From the long data frame obtained with dfTrials2Long(), a range of fields is extracted and pivoted into a wide format that has one row for each SAE for each group for each trial:

# Obtain details for SAEs, e.g. organ system, 
# dictionary info, occurrences and fatalities
euctrSaesDetails <- dfName2Value(
  df = trialDataLong,
  valuename = paste0(
    "adverseEvents.seriousAdverseEvents.seriousAdverseEvent.", 
    "(.*)")
)
# Returning values for 24 out of 125 trials

# For each trial and each AR,
# - first create a set without the values (counts)
# - merge both into a table where one row corresponds 
#   to one reporting group for one AR for one trial
euctrSaesDetails <- left_join(
  euctrSaesDetails %>%
    pivot_wider(id_cols = c(`_id`, identifier)) %>%
    filter(!grepl("[.]", identifier)) %>%
    select(-contains("values")),
  # - then create a set of only the values (counts)
  euctrSaesDetails %>%
    filter(grepl("[.]", identifier)) %>%
    pivot_wider(id_cols = c(`_id`, identifier)) %>%
    mutate(identifier = sub("([0-9]+)[.]?.*", "\\1", identifier)),
  by = c("_id", "identifier")
) %>% 
  # Shorten reportingGroupId
  rename(
    reportingGroupId =
      adverseEvents.seriousAdverseEvents.seriousAdverseEvent.values.value.reportingGroupId
  ) %>% 
  # Remove empty reportingGroupId
  filter(!is.na(reportingGroupId))

euctrSaesDetails %>% 
  slice_sample(n = 5L)
# A tibble: 5 × 17
  `_id`              identifier adverseEvents.seriousAd…¹ adverseEvents.seriou…²
  <chr>              <chr>      <chr>                     <chr>                 
1 2012-000660-22-BE  26         ADV_EVT_ASSESS_TYPE.non_… false                 
2 2011-000033-36-BE  22         ADV_EVT_ASSESS_TYPE.syst… true                  
3 2018-001321-68-3RD 33         ADV_EVT_ASSESS_TYPE.non_… false                 
4 2018-001321-68-3RD 65         ADV_EVT_ASSESS_TYPE.non_… false                 
5 2012-004879-38-BE  4          ADV_EVT_ASSESS_TYPE.syst… false                 
# ℹ abbreviated names:
#   ¹​adverseEvents.seriousAdverseEvents.seriousAdverseEvent.assessmentMethod.value,
#   ²​adverseEvents.seriousAdverseEvents.seriousAdverseEvent.dictionaryOverridden
# ℹ 13 more variables:
#   adverseEvents.seriousAdverseEvents.seriousAdverseEvent.organSystem.eutctId <chr>,
#   adverseEvents.seriousAdverseEvents.seriousAdverseEvent.organSystem.version <chr>,
#   adverseEvents.seriousAdverseEvents.seriousAdverseEvent.term <chr>, …

To complete the EUCTR steps, the group details and SAE details are merged:

saesEuctr <- left_join(
  x = euctrSaesDetails,
  y = euctrGroups,
  by = c("_id", "reportingGroupId")
) %>%
  # Shorten variable names by removing leading parts
  rename_with(~ sub(
    "adverseEvents.seriousAdverseEvents.seriousAdverseEvent.(|values.value.)", "", .x),
    starts_with("adverse")
  ) %>% 
  # Select variables of interest
  select(
    `_id`, 
    title,
    term,
    subjectsExposed, 
    subjectsAffected
  ) %>% 
  # Convert to integer (because dfTrials2Long
  # extracts data always as character strings)
  mutate(
    subjectsExposed = as.integer(subjectsExposed), 
    subjectsAffected = as.integer(subjectsAffected)
  )

# Show example rows
saesEuctr %>%  
  slice_sample(n = 5L)
# A tibble: 5 × 5
  `_id`             title                               term                    
  <chr>             <chr>                               <chr>                   
1 2021-003160-27-BE Part A                              Haemolytic anaemia      
2 2012-000660-22-BE Trastuzumab Emtansine 3.6 mg        Haematemesis            
3 2018-002816-29-DE Part1:BelantamabMafodotin3.4mg/kg+… Platelet count decreased
4 2012-004879-38-BE T-DM1 + P                           Epistaxis               
5 2011-000033-36-BE Part A Arm 4 100mg/m2/20 MBq        Endometrial cancer      
# ℹ 2 more variables: subjectsExposed <int>, subjectsAffected <int>

Mangle CTGOV data

Second, CTGOV data are mangled, starting with details on reporting groups:

ctgovGroups <- trialData %>% 
  # Select variables of interest
  select(`_id`, resultsSection.adverseEventsModule.eventGroups) %>% 
  # Expand into columns
  unnest(resultsSection.adverseEventsModule.eventGroups) %>% 
  # Rename to indicate id of group
  rename(groupId = id) %>% 
  # Remove empty event groups
  filter(!is.na(groupId))

ctgovGroups %>% 
  slice_sample(n = 5L)
# A tibble: 5 × 10
  `_id`       groupId title                            description              
  <chr>       <chr>   <chr>                            <chr>                    
1 NCT04296890 EG000   Mirvetuximab Soravtansine        "Participants received s…
2 NCT04622774 EG006   Dose Escalation - Schedule A: I… "Participants received I…
3 NCT02988817 EG014   Expansion Part Cohort 5: Enapot… "Participants with advan…
4 NCT01703949 EG001   Arm B (Brentuximab Vedotin, Niv… "Patients receive brentu…
5 NCT02001623 EG010   Dose Expansion Part: Endometria… "Participants with endom…
# ℹ 6 more variables: seriousNumAffected <int>, seriousNumAtRisk <int>,
#   otherNumAffected <int>, otherNumAtRisk <int>, deathsNumAffected <int>,
#   deathsNumAtRisk <int>

The SAE details are obtained from a different field, and the stats field provides counts by allocation group, type of SAE and organ system.

ctgovSaesDetails <- trialData %>% 
  # Select variables of interest
  # and expand into columns
  unnest(resultsSection.adverseEventsModule.seriousEvents) %>%
  # Expand into further columns
  unnest(stats)

ctgovSaesDetails %>% 
  slice_sample(n = 5L)
# A tibble: 5 × 17
  `_id`       adverseEvents.seriou…¹ term              organSystem              
  <chr>       <list>                 <chr>             <chr>                    
1 NCT02573324 <lgl [1]>              ATELECTASIS       Respiratory, thoracic an…
2 NCT03677596 <lgl [1]>              Fungal sepsis     Infections and infestati…
3 NCT01925612 <lgl [1]>              Visual impairment Eye disorders            
4 NCT04014075 <lgl [1]>              Nausea            Gastrointestinal disorde…
5 NCT03729596 <lgl [1]>              Flank pain        Musculoskeletal and conn…
# ℹ abbreviated name: ¹​adverseEvents.seriousAdverseEvents.seriousAdverseEvent
# ℹ 13 more variables: sourceVocabulary <chr>, assessmentType <chr>,
#   groupId <chr>, numAffected <int>, numAtRisk <int>, numEvents <int>,
#   notes <chr>, resultsSection.adverseEventsModule.eventGroups <list>,
#   adverseEvents.reportingGroups.reportingGroup <list>, .isUniqueTrial <lgl>,
#   .trialPhase <ord>, .trialTitle <chr>, .controlType <fct>

To complete the CTGOV steps, the group and SAE details are merged:

saesCtgov <- left_join(
  x = ctgovSaesDetails,
  y = ctgovGroups,
  # Merge by trial id and event group id
  by = c("_id", "groupId")
) %>% 
  select(
    `_id`, 
    title,
    term,
    numAtRisk,
    numAffected
  ) 

saesCtgov %>%  
  slice_sample(n = 10L)
# A tibble: 10 × 5
   `_id`       title                         term                      numAtRisk
   <chr>       <chr>                         <chr>                         <int>
 1 NCT03729596 SCCHN Expansion               Sepsis pasteurella               13
 2 NCT02785900 Placebo + HMA                 Facial pain                     128
 3 NCT01461538 BV 1.8 mg/kg Q3Week           Abdominal pain                   46
 4 NCT02341625 BMS-986148 0.2MG/KG Q3W       Impaired gastric emptying         2
 5 NCT02572167 Part 3: Same-day Dose         Jugular vein thrombosis          30
 6 NCT02573324 Open-Label Depatuxizumab Maf… WHITE BLOOD CELL COUNT D…         5
 7 NCT03525678 PACT Phase: GSK2857916 2.5 m… Gastric fibrosis                  1
 8 NCT02001623 Dose Escalation Part: 1.8 mg… Stress fracture                   3
 9 NCT03149549 Part C1 6 mg/kg Q2W           Ear pain                          6
10 NCT01925612 Part 3: RCHOP                 Bronchiolitis                    12
# ℹ 1 more variable: numAffected <int>

Merge register data

Finally, data from EUCTR and CTGOV (after adjusting column names) can directly be merged, representing SAEs from both registers for the selected types of trials.

saesMerged <- bind_rows(
  saesEuctr,
  saesCtgov %>% 
    # Rename to names in saesEuctr
    rename(
      subjectsExposed = numAtRisk,
      subjectsAffected = numAffected
    )
)

# Dataset rows and columns
dim(saesMerged)
[1] 26830     5
# Show sample rows with higher
# number of affected subjects
saesMerged %>%
  arrange(desc(subjectsAffected)) %>% 
  slice_head(n = 50L) %>% 
  slice_tail(n = 10L)
# A tibble: 10 × 5
   `_id`              title                            term                     
   <chr>              <chr>                            <chr>                    
 1 NCT01979536        Arm CZ (Crizotinib, Combination… Mucositis oral           
 2 NCT03474107        Chemotherapy                     Febrile neutropenia      
 3 2018-001321-68-3RD Mylotarg: Monotherapy            Febrile neutropenia      
 4 NCT01712490        ABVD                             Pneumonia                
 5 NCT01780662        Dose 1.8mg/kg                    White blood cell decreas…
 6 NCT02363283        Treatment (Glembatumumab Vedoti… Peripheral Neuropathy    
 7 NCT02785900        Placebo + HMA                    Pneumonia                
 8 2018-001321-68-3RD Mylotarg with Standard Chemothe… Disease progression      
 9 NCT01631552        Overall Safety Population        Pneumonia                
10 NCT01712490        A+AVD                            Sepsis                   
# ℹ 2 more variables: subjectsExposed <int>, subjectsAffected <int>

The dataset saesMerged has 26830 rows of SAEs for 116 trials. It can be exported into a spreadsheet format as follows for further work or reference.

writexl::write_xlsx(
  x = saesMerged, 
  path = "SAEs.xlsx")