Creator/Principal investigator(s)
Gabriel Westman
- Uppsala University
Description
Line breaks and special characters (excluding punctuation characters) were removed, and punctuation was added to sentences where it was missing (such as headings) to avoid false aggregation. All paragraphs were tokenized at the sentence level using the Natural Language Toolkit (NLTK) version 3.7 tokenizer.
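The preprocessing described above could be approximated along the following lines. This is a minimal sketch, not the authors' actual code: the function names `clean_paragraph` and `split_sentences` are hypothetical, the exact set of characters treated as "special" is an assumption, and the choice of `nltk.tokenize.sent_tokenize` is an assumption, since the entry only says that the NLTK 3.7 tokenizer was used.

```python
import re

# Assumption: "special characters" means anything that is not a letter, digit,
# whitespace, or common punctuation. The dataset description does not list them.
_SPECIAL = re.compile(r"""[^\w\s.,;:!?()'"%/-]""")


def clean_paragraph(text: str) -> str:
    """Remove line breaks and special characters, then add missing end punctuation."""
    text = re.sub(r"\s+", " ", text).strip()   # line breaks and runs of spaces -> one space
    text = _SPECIAL.sub("", text)              # drop non-punctuation special characters
    text = re.sub(r"\s+", " ", text).strip()   # tidy gaps left by removed characters
    if text and text[-1] not in ".!?":
        text += "."                            # e.g. headings, to avoid merging with the next sentence
    return text


def split_sentences(paragraph: str) -> list[str]:
    """Sentence-tokenize a cleaned paragraph with NLTK (assumed: sent_tokenize)."""
    # Imported lazily so clean_paragraph works without NLTK installed.
    # With NLTK 3.7 this needs the Punkt model: nltk.download("punkt")
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(paragraph)
```

Adding a full stop to headings before tokenization keeps a heading from being joined to the first sentence of the paragraph that follows it, which is the "false aggregation" the description refers to.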
Language
English
Research principal
Responsible department/unit
Department of Medical Sciences
Commissioning organisation
Swedish Medical Products Agency
Data contains personal data
No
Population
All centrally approved medicinal products within the EU
Study design
Observational study
Description of study design
Health informatics study on information about approved medicinal products.
Bergman E, Sherwood K, Forslund M, Arlett P, Westman G (2022) A natural language processing approach towards harmonisation of European medicinal product information. PLoS ONE 17(10): e0275386. https://doi.org/10.1371/journal.pone.0275386
DOI:
https://doi.org/10.1371/journal.pone.0275386
Description
This database contains sentence-level tokenized product information from all centrally approved medicinal products within the EU (as of May 3, 2022), including Summary of product characteristics (SmPC) and Package leaflet (PL) documents. A total of 1258 medicinal products were initially included, of which 5 were subsequently excluded due to document compatibility issues. From these, a total of 783 K sentences were extracted from PL and SmPC documents.
Version 1
https://doi.org/10.57804/ggrw-hr06
Data format / data structure
Text