user@machine:~/localdir$ tree -d emamds
emamds
└── en
    ├── documents
    │   ├── all-authorised-presentations
    │   ├── assessment-report
    │   ├── procedural-steps
    │   ├── procedural-steps-after
    │   ├── product-information
    │   ├── referral
    │   ├── scientific-conclusion
    │   ├── scientific-discussion
    │   ├── scientific-discussion-variation
    │   ├── steps-after-cutoff
    │   └── variation-report
    └── medicines
        └── human
            └── EPAR
1 Overview
Markdown files hold semi-structured data, here on medicines, their development and their regulatory assessment. Such data are organised but not tabular, and they lend themselves to analysis methods for both text and numbers.
This offering contains Markdown files that are continually generated from the following types of regulatory documents, which are currently obtained automatically at least once a week from https://www.ema.europa.eu/:
- EPAR webpages of all medicinal products for human use
- PDF files of all medicinal products for human use
  - Summary of product characteristics
  - Initial assessment reports
  - Variation assessment reports
  - Referral assessment reports
  - Scientific discussions or conclusions
  - All authorised presentations
  - Steps after authorisation
To overcome challenges with PDF files (visual markup, sections split over pages, bitmap images), files are converted to Markdown so as to reflect structural markup and to include text recognised in images (OCR). Markdown files are versatile input for various analyses, see examples.
1.1 Credits
The EMA is acknowledged as a source of information used here and should also be acknowledged by users.
1.2 Disclaimer
The correctness of information in the Markdown files is not guaranteed. Conversion of, and extraction from, PDF files and webpages can introduce errors. Users have to verify the information they use.
When using any Markdown file from this repository or offering, please include the source credits and the citation:
Herold, R. (2026). Regulatory documents as Markdown files [Data set]. https://github.com/rfhb/emamds
Contact me for additional types of regulatory documents (e.g., nationally authorised medicinal products, paediatric investigation plans, medicines supply shortages, orphan designation), with the use case of interest.
2 Repository
The repository is located at https://github.com/rfhb/emamds. It contains Markdown files and (in the future) helper scripts. The files in the repository are versioned using git. Users can create a local copy of the repository (git clone ...) and efficiently keep it updated (git pull). The size of the repository is about 1.8 GB, compressed 350 MB, for the more than 15,000 included files. The folder structure of the repository is shown in the tree output at the top of this document.
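The clone-then-pull workflow described above can be wrapped in a small helper. This is a minimal sketch only: sync_repo is a hypothetical name, and the target folder is whatever the user chooses.

```shell
# sync_repo URL DIR
# Clone the repository on first use; afterwards only fast-forward the copy.
# sync_repo is a hypothetical helper name, not part of the repository.
sync_repo() (
  url="$1"
  dir="$2"
  if [ -d "${dir}/.git" ]; then
    # existing copy: fetch updates, refuse anything that is not a fast-forward
    git -C "${dir}" pull --ff-only --quiet
  else
    # first use: full clone (about 350 MB compressed, as stated above)
    git clone --quiet "${url}" "${dir}"
  fi
)

# usage: sync_repo https://github.com/rfhb/emamds emamds
```

Using --ff-only keeps a local copy from silently diverging if files were edited by hand.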
3 Use cases
Here is a first set of example use cases for Markdown files from regulatory documents. The use cases show how to extract sections of interest from Markdown files. These can then be used for aggregations and further analyses.
Such use cases and analyses can be part of regulatory science research based on regulatory documents (Barbier et al. 2025; Ladanie et al. 2018; Papathanasiou et al. 2016).
The examples also showcase command line tools which allow rapid processing of a large number of documents.
3.1 Effects tables
Get the Effects tables from assessment reports of medicinal products for human use; the result can be saved as an NDJSON file. This example uses mdq (https://github.com/yshavit/mdq/wiki/) and jq (https://jqlang.org/) in shell functions and operations. The output can directly be converted into tables and be analysed (not shown).
mdqJq() {
  # select tables in sections titled "Effects Table"; emit one JSON record per file
  mdq -o json '# Effects Table | :-: * :-: ' "${1}" | \
  jq --compact-output --arg FN "${1}" '{fn: $FN, tbl: [.items[].table.rows[]]}'
}
export -f mdqJq
# bash (not sh) is needed so that the exported function is available
find . -name '*-assessment-report*_en.md' -exec bash -c 'mdqJq "$0"' {} \;
{"fn":"./zynteglo-epar-public-assessment-report_en.md","tbl":[["Effect","Short Description","Unit","Treatment","Control","Uncertainties/ Strength of evidence","References"],["Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects"],["Transfusion Independence (TI) for evaluable non- β 0 /β 0 patients","Proportion of patients with a weighted average haemoglobin (Hb) ≥ 9 g/dL without any pRBC transfusions for a continuous period of ≥ 12 months at any time during the study after Zynteglo infusion","n/N (%)","15/19 (79%; 95% CI: 54% to 94%)","NA","longer authorised The single patient in Study HGB-207 who has not achieved TI exhibited a substantial decline from DP VCN to PB VCN, resulting in low HbA T87Q production Study HGB-205 was performed with the base manufacturing process, HGB-204 with the original manufacturing process, HGB-207 with the refined or commercial manufacturing process.","Studies HGB-204, HGB-205, HGB-207, LTF-303."],["Observed duration of TI for non- β 0 /β 0 patients","","min, max","no 12.0+ to 56.3+ months. * median duration of TI not reached","","All non- β 0 /β 0 patients who have achieved TI at any time have maintained their TI status through all Hb assessments. 
True duration of TI is unknown as follow up time is limited.","Studies HGB-204, HGB-205, HGB-207, LTF-303."],["Transfusion Reduction (TR) in non- β 0 /β 0 patients that did not achieve TI at any time (n=4)","Medicinal % change in annualized transfusion volume from the period from 6 months post- DPI through Month 24 as compared to annualized baseline pretreatment transfusion requirements","%","product -26.8%, -71.4% -86.9%, -100%","NA","Small number of patients","Studies HGB-204, HGB-205, HGB-207."],["Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects"],["AE for all patients with TDT (ITT)","Patients with at Least 1 AE","% (n)","98.2% (n=56)","NA","Limited safety database, contribution of Zynteglo to the overall safety profile cannot be distinguished from that of concomitant treatment/HSCT procedure",""],["Effect","Short Description","Unit","Treatment","Control","Uncertainties/ Strength of evidence","References"],["SAE for all patients with TDT (ITT)","Patients with at Least 1 SAE","% (n)","50.9% (n=29)","NA","Limited safety database, one AE was assessed as possibly related",""],["Common AE for all patients with TDT (ITT)","AEs occurring in at least 50% of patients","%","Thrombocytopenia (82.5%), Anaemia (71.9%), Stomatitis (64.9%), Vomiting (50.9.%), Neutropenia (57.9%), Alopecia (52.6%)","NA","authorised to drug product Limited safety database, contribution of Zynteglo to the overall safety profile cannot be distinguished from that of concomitant treatment/HSCT procedure",""],["Delayed platelet engraftment for all patients with TDT","Time to platelet values ≥ 20 ×10 9 /L on 3 consecutive days","Median, days","41.5 (min 19, max 191).","< 30 days","contribution of Zynteglo to the delayed platelet engraftment remains unexplained",""]]}
{"fn":"./kalydeco-h-c-002494-x-0114-g-epar-assessment-report-extension_en.md","tbl":[["Effect","Short Description","Unit","Treatment ELX/TEZ/IVA","Control -","Uncertainties/ Strength of evidence","References"],["Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects","Favourable Effects"],["SwCl","Change 0-24 wks LS mean (95% CI) from baseline","Mmol/l","-57.9 (-61.3, - 54.6)","-","Unc : open-label, single-arm trial","Study 111"],["LCI2.5","Change 0-24 wks LS mean (95% CI) from baseline","number","-0.83 (-1.02, - 0.66)","","Unc : open-label, single-arm trial","Study 111"],["PEx","Event rate/year","number","0.32","-","Unc : open-label, single-arm trial Unc : 24 week trial","Study 111"],["Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects","Unfavourable Effects"],["Elevated Transaminase Events","Events of increased transaminases (Part B)","%","10.7%","-","Unc : Open-label, single arm study,","Study 111"],["Rash","Events (Part B)","%","20.0%","-","Unc : Open-label, single arm study","Study 111"]]}
{"fn":"./abecma-epar-public-assessment-report_en.md","tbl":[["Effect","Short Description","Treatment","Result","Uncertainties/ Strength of evidence"],["Favourable effects","Favourable effects","Favourable effects","Favourable effects","Favourable effects"],["ORR (%)","Percentage of subjects who achieved PR or better as assessed by an IRC according to IMWG uniform response criteria for multiple myeloma.","150 x 10 6 CAR+T cells n=4 300 x 10 6 CAR+T cells n=70 450 x 10 6 CAR+T cells n=54 150-450 x 10 6 CAR+T cells n=128 Ide-cel treated pop. Enrolled pop. n= 140","2 (50.0%) 48 (68.6%) 44 (81.5%) 94 (73.4%) (95% CI: 65.8, 81.1) * 94 (67.1%) (95% CI: 59.4, 74.9)","No control arm. Patients were not randomised to the different dose cohorts. *p < 0.0001, 1- sample binomial test rejecting the null hypothesis of ≤ 50% for ORR [and ≤ 10% for CR rate]"],["CR (%)","Percentage of subjects who achieved CR or sCR as assessed by an IRC according to IMWG uniform response criteria for multiple myeloma.","150-450 x 106 CAR+T cells n=128 Ide-cel treated pop. Enrolled pop.","42 (32.8%)* (95% CI 24.7, 40.9) 42 (30.0%)",""],["DOR, (median, months) -EMA censoring","Time from first documentation of response of PR or better to first documentation of disease progression or death from any cause, whichever occurred first.","150-450 x 10 6 CAR+T cells n=128 Idecel treated pop. Enrolled pop. 
n= 140","10.6 (95% CI 8.0, 11.4) 10.6 (95% CI 8.0, 11.4)",""],["Effect","Short Description","Treatment","Result","Uncertainties/ Strength of evidence"],["Favourable effects","Favourable effects","Favourable effects","Favourable effects","Favourable effects"],["CRS","","150 - 450 x 10 6 CAR+T cells N=184","81.0% Grade ≥ 5.4 %","Few subjects with dose 150 x 10 6 CAR+T cells"],["Neurologic toxicity - 'focused'","","150 x 450 10 6 CAR+T cells N=184","41.8%","Few subjects with dose 150 x 10 6 CAR+T cells"],["Neurologic toxicity - 'broad'","","150-450 x 10 6 CAR+T cells N=184","73.4%","Two other ways of recording neurotoxicity have been used"],["Cytopenias","","150 - 450 x 10 6 CAR+T cells n=184","95.7% Grade ≥ 3: 95.1%","Few subjects with dose 150 x 10 6 CAR+T cells"],["Infections","","150 - 450 x 10 6 CAR+T cells n=184","71.2% Grade ≥ 3: 23.4%","Few subjects with dose 150 x 10 6 CAR+T cells"],["Secondary malignancy","","150 - 450 x 10 6 CAR+T cells n=184","8.7%",""]]}
3.2 Indications
Get the therapeutic indications from the Summary of product characteristics; the result can be written as an NDJSON file. The example uses mq (https://mqlang.org/) and jq (https://jqlang.org/) in shell functions and operations. The output could be used for a RAG approach, for example (not shown).
mq --aggregate '
import "section"
| nodes | section::split(1) + section::split(2) + section::split(3)
| section::title_match("herapeutic indic")
| self[0]["children"]
| try: join(self, "\\n\\n") catch: ""
| gsub(self, "<div style=\\\\\"page-break-after: always\\\\\"></div>", "")
| let ti = self
| let pf = s"${__FILE__}"
| let pn = gsub(s"${pf}", "-epar-product-information_en[.]md", "")
| s"\{ \"productName\": \"${pn}\", \"section41\": \"${ti}\" \}"
' *-epar-product-information_en[.]md \
| jq --compact-output '.'
{"productName":"abevmy","section41":"Abevmy in combination with fluoropyrimidine-based chemotherapy is indicated for treatment of adult patients with metastatic carcinoma of the colon or rectum.\n\nAbevmy in combination with paclitaxel is indicated for first-line treatment of adult patients with metastatic breast cancer. For further information as to human epidermal growth factor receptor 2 (HER2) status, please refer to section 5.1.\n\nAbevmy in combination with capecitabine is indicated for first-line treatment of adult patients with metastatic breast cancer in whom treatment with other chemotherapy options including taxanes or anthracyclines is not considered appropriate. Patients who have received taxane and anthracyclinecontaining regimens in the adjuvant setting within the last 12 months should be excluded from treatment with Abevmy in combination with capecitabine. For further information as to HER2 status, please refer to section 5.1.\n\nAbevmy, in addition to platinum-based chemotherapy, is indicated for first-line treatment of adult patients with unresectable advanced, metastatic or recurrent non-small cell lung cancer other than predominantly squamous cell histology.\n\n\n\nAbevmy, in combination with erlotinib, is indicated for first-line treatment of adult patients with unresectable advanced, metastatic or recurrent non-squamous non-small cell lung cancer with Epidermal Growth Factor Receptor (EGFR) activating mutations (see section 5.1).\n\nAbevmy in combination with interferon alfa-2a is indicated for first line treatment of adult patients with advanced and/or metastatic renal cell cancer.\n\nAbevmy, in combination with carboplatin and paclitaxel is indicated for the front-line treatment of adult patients with advanced (International Federation of Gynecology and Obstetrics [FIGO] stages III B, III C and IV) epithelial ovarian, fallopian tube, or primary peritoneal cancer (see section 5.1).\n\nAbevmy, in combination with 
carboplatin and gemcitabine or in combination with carboplatin and paclitaxel, is indicated for treatment of adult patients with first recurrence of platinum-sensitive epithelial ovarian, fallopian tube or primary peritoneal cancer who have not received prior therapy with bevacizumab or other VEGF inhibitors or VEGF receptor -targeted agents.\n\nAbevmy in combination with paclitaxel, topotecan, or pegylated liposomal doxorubicin is indicated for the treatment of adult patients with platinum-resistant recurrent epithelial ovarian, fallopian tube, or primary peritoneal cancer who received no more than two prior chemotherapy regimens and who have not received prior therapy with bevacizumab or other VEGF inhibitors or VEGF receptor -targeted agents (see section 5.1).\n\nAbevmy, in combination with paclitaxel and cisplatin or, alternatively, paclitaxel and topotecan in patients who cannot receive platinum therapy, is indicated for the treatment of adult patients with persistent, recurrent, or metastatic carcinoma of the cervix (see section 5.1)."}
{"productName":"zytiga","section41":"ZYTIGA is indicated with prednisone or prednisolone for:\n\n- the treatment of newly diagnosed high risk metastatic hormone sensitive prostate cancer (mHSPC) in adult men in combination with androgen deprivation therapy (ADT) (see section 5.1)\n\n- the treatment of metastatic castration resistant prostate cancer (mCRPC) in adult men who are asymptomatic or mildly symptomatic after failure of androgen deprivation therapy in whom chemotherapy is not yet clinically indicated (see section 5.1)\n\n- the treatment of mCRPC in adult men whose disease has progressed on or after a docetaxel-based chemotherapy regimen."}
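Where mq is not available, a section can also be cut out with awk alone. The following is a minimal sketch under stated assumptions: headings are ATX-style lines starting with #, fenced code blocks are not handled, and md_section is a hypothetical helper name. It prints the body from the first heading matching a pattern up to the next heading of the same or a higher level.

```shell
# md_section PATTERN FILE
# Print the body of the first section whose heading matches PATTERN.
# md_section is a hypothetical helper name; ATX headings ("# ...") are assumed.
md_section() {
  awk -v pat="$1" '
    /^#+ / {
      level = length($1)                      # number of leading # characters
      if (found && level <= start) exit       # same or higher level ends the section
      if (!found && $0 ~ pat) { found = 1; start = level; next }
    }
    found { print }                           # body lines and deeper sub-headings
  ' "$2"
}

# usage: md_section 'herapeutic indic' abevmy-epar-product-information_en.md
```

As in the mq example, the pattern omits the first letter so that it matches both "Therapeutic" and "therapeutic".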
3.3 Metadata analysis
Explore metadata about regulatory documents. The example uses yq (https://mikefarah.gitbook.io/yq) and R (https://www.r-project.org/) to model the processing time (in seconds) needed for conversion as a linear function of the number of pages in a document.
find . -type f -name "*.md" -exec \
yq --front-matter="extract" --indent=0 --output-format="json" \
'{"time": .processing_time, "pages": .document_pages}' {} \; \
| Rscript -e \
"jsonlite::stream_in(file('stdin'), verbose = FALSE) |> lm()"
Call:
lm(formula = jsonlite::stream_in(file("stdin"), verbose = FALSE))
Coefficients:
(Intercept)        pages
    -113.51         1.87
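The same two metadata fields can also be read without yq. The sketch below assumes that the YAML header is delimited by lines consisting of --- at the top of each file (an assumption, as the delimiters are not shown in this document) and uses the hypothetical helper name fm_pages_time. It prints pages, processing time and time per page, separated by tabs.

```shell
# fm_pages_time FILE...
# Print "pages<TAB>processing_time<TAB>time per page" for each file.
# Assumes front matter between "---" lines; fm_pages_time is a hypothetical name.
fm_pages_time() (
  for f in "$@"; do
    awk -F': *' '
      FNR == 1 && $0 != "---" { exit }        # no front matter: nothing to report
      FNR > 1  && $0 == "---" { exit }        # end of front matter
      $1 == "document_pages"  { pages = $2 }
      $1 == "processing_time" { secs = $2 }
      END {
        if (pages > 0 && secs != "")
          printf "%d\t%s\t%.3f\n", pages, secs, secs / pages
      }
    ' "$f"
  done
)

# usage: find . -type f -name '*.md' -exec sh -c '...' \; or simply
#        fm_pages_time en/documents/referral/*.md
```

The tab-separated output can be read directly by most statistics tools for further modelling.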
4 Technical aspects
This offering is based on consecutive steps:
- obtaining the medicines spreadsheet and selecting links to recent updates
- mirroring PDF files and webpages efficiently with wget2
- setting file dates and times of webpages from their metadata
- converting updates to Markdown using docling
- adding a YAML header with metadata to Markdown files
- adding, committing and pushing into the repository
4.1 Mirror using wget2
File ‘emaepars.links’ contains one hyperlink per line, one for each EPAR webpage; the links can be obtained from EMA.
wget2 \
--accept-regex='^.+/en/documents/(all-auth|scientific-[dc]|steps|assessment|procedur|variation|referral|product).+_en[.]pdf$' \
--adjust-extension \
--filter-urls \
--header="Accept: text/html" \
--ignore-tags=img,link,script \
--input-file='emaepars.links' \
--max-threads=10 \
--mirror \
--no-follow-sitemaps \
--random-wait \
--retry-on-http-error=429 \
--robots \
--stats-site=csv:emaeparsstats.csv \
--tries=20 \
--wait=3 \
--waitretry=150
This will create the folder “www.ema.europa.eu” with the following subfolder structure.
user@machine:~/localdir$ tree -d www*
www.ema.europa.eu
└── en
    ├── documents
    │   ├── all-authorised-presentations
    │   ├── assessment-report
    │   ├── procedural-steps
    │   ├── procedural-steps-after
    │   ├── product-information
    │   ├── referral
    │   ├── scientific-conclusion
    │   ├── scientific-discussion
    │   ├── scientific-discussion-variation
    │   ├── steps-after-cutoff
    │   └── variation-report
    └── medicines
        └── human
            └── EPAR
4.2 Correcting timestamps
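Above, wget2 saves HTML files from EPAR webpages but stamps them with the time of saving, which prevents using file times to identify recent updates for conversion. The following R code sketches how to read the modification time from the metadata of each webpage and use it to set the file timestamp.
# folder created by wget2 (assumed to be in the working directory)
mirrorPrefix <- "www.ema.europa.eu"
# find HTML files
htmlFiles <- dir(
  path = mirrorPrefix,
  pattern = "[.]html$",
  recursive = TRUE,
  full.names = TRUE
)
# get content modification time from an HTML file
getModifiedTime <- function(htmlFile) {
  page <- rvest::read_html(htmlFile)
  # for some, not-yet-modified HTML files only published_time is available
  for (prop in c("article:modified_time", "article:published_time")) {
    value <- page |>
      rvest::html_elements(
        xpath = sprintf('/html/head/meta[@property="%s"]', prop)
      ) |>
      rvest::html_attr("content")
    # e.g. "2017-09-20T13:54:00+0200"
    if (length(value) > 0 && !is.na(value[1])) {
      return(lubridate::as_datetime(value[1]))
    }
  }
  as.POSIXct(NA)
}
# set date and time of files
for (htmlFile in htmlFiles) {
  modifiedTime <- getModifiedTime(htmlFile)
  if (!is.na(modifiedTime)) Sys.setFileTime(htmlFile, modifiedTime)
}
4.3 Convert with docling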
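Before converting, the PDF files that still need conversion can be picked by comparing modification times between the mirror and the repository copy. The sketch below is one possible approach, not necessarily the one used for this offering: needs_conversion is a hypothetical helper name, and it assumes that the modification time of a Markdown file reflects its last conversion and that target files mirror the relative path of the source with .pdf replaced by .md.

```shell
# needs_conversion SRC_ROOT DST_ROOT
# Print each PDF below SRC_ROOT whose Markdown counterpart below DST_ROOT
# is missing or older than the PDF. Hypothetical helper, POSIX sh.
needs_conversion() (
  src="$1"
  dst="$2"
  find "${src}" -type f -name '*_en.pdf' | while IFS= read -r pdf; do
    md="${dst}/${pdf#"${src}"/}"      # same relative path below the target root
    md="${md%.pdf}.md"                # .pdf becomes .md
    if [ ! -e "${md}" ] || [ -n "$(find "${pdf}" -newer "${md}")" ]; then
      printf '%s\n' "${pdf}"
    fi
  done
)

# usage: needs_conversion www.ema.europa.eu emamds
```

Each printed path can then be passed to a conversion call, with the target path derived in the same way.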
For docling, see https://docling-project.github.io/docling/; for the docling API server details, see https://github.com/docling-project/docling-serve/blob/main/docs/configuration.md. These settings were used on Apple Silicon.
DOCLING_SERVE_MAX_SYNC_WAIT=600 \
DOCLING_NUM_THREADS=16 \
DOCLING_PERF_PAGE_BATCH_SIZE=8 \
docling-serve run
For client use of the docling API, see https://github.com/docling-project/docling-serve/blob/main/docs/usage.md. Note the source and target file and folder specifications. Also note that “emamds” is the folder that contains the repository https://github.com/rfhb/emamds as created using git clone ... (see section 2, Repository, and the output of tree above).
curl -sS -X 'POST' \
-F files="@www.ema.europa.eu/en/documents/assessment-report/tepezza-epar-public-assessment-report_en.pdf" \
-F do_ocr="true" \
-F do_table_structure="true" \
-F force_ocr="false" \
-F image_export_mode="placeholder" \
-F include_images="true" \
-F md_page_break_placeholder='<div style="page-break-after: always"></div>' \
-F ocr_engine="rapidocr" \
-F ocr_lang="en" \
-F pdf_backend="dlparse_v4" \
-F pipeline="standard" \
-F table_mode="accurate" \
-F table_cell_matching="true" \
-F to_formats="md" \
'http://localhost:5001/v1/convert/file' \
> emamds/en/documents/assessment-report/tepezza-epar-public-assessment-report_en.md
4.4 Add YAML header
Markdown files need to start with a YAML header that includes the following information. The docling_version list is obtained from the API endpoint http://localhost:5001/version.
document_datetime: 2026-01-10 11:55:33.574005
document_pages: 33
document_pathfilename: www.ema.europa.eu/en/documents/procedural-steps-after/foclivia-epar-procedural-steps-taken-scientific-information-after-authorisation-archive_en.pdf
document_name: foclivia-epar-procedural-steps-taken-scientific-information-after-authorisation-archive_en.pdf
version: success
processing_time: 17.5662868
conversion_datetime: 2026-01-10 11:57:58.57619
docling_version:
  docling-serve: 1.9.0
  docling-jobkit: 1.8.1
  docling: 2.67.0
  docling-core: 2.58.0
  docling-ibm-models: 3.10.3
  docling-parse: 4.7.2
  python: cpython-313 (3.13.11)
  plaform: macOS-26.2-arm64-arm-64bit-Mach-O
4.5 Update repository
Use conventional commands to update a user’s copy of the source repository. Preferably one commit corresponds to one Markdown file added or updated.
git add <single Markdown file>
git commit -m "fileName: <...>" -m "fileDateTime: <...>"
git push
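The preference for one commit per Markdown file can be scripted. The following is a minimal sketch, to be run inside the working copy: commit_each is a hypothetical helper name, and only the fileName line of the commit message is filled in, because this document does not specify where fileDateTime comes from (it could be appended with a second -m option).

```shell
# commit_each FILE...
# Create one commit per added or changed file; unchanged files are skipped.
# commit_each is a hypothetical helper name.
commit_each() (
  for f in "$@"; do
    git add -- "${f}"
    # nothing staged for this file: skip it instead of failing on an empty commit
    if git diff --cached --quiet -- "${f}"; then
      continue
    fi
    # the pathspec after -- restricts the commit to this one file
    git commit --quiet -m "fileName: ${f}" -- "${f}"
  done
)

# usage: commit_each en/documents/referral/*.md && git push
```

Pushing once after the loop, rather than after every commit, keeps the number of network round trips low.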
5 Contributing
In the future, pull requests (PRs) to update the source repository may be accepted if users generate Markdown files following the technical aspects described above; scripts that users can use for these steps will be added to the repository shortly.