Tokenized product information for centrally approved medicines within the EU (extracted May 3, 2022)
2022-157-1-1
https://doi.org/10.57804/ggrw-hr06
Swedish National Data Service
Svensk nationell datatjänst
Landing page
Westman, Gabriel
2022-09-29
Information Science
Linguistics
Artificial Intelligence
Pharmacy
Health Occupations
Algorithms
Computing Methodologies
Mathematical Concepts
Computer and Information Science
Basic Medicine
Natural Sciences
Medical and Health Sciences
The text corpus was compiled on May 3, 2022, by scripted download of all available English-language product information files for all centrally approved medicinal products within the EU from the European Medicines Agency website. The Package Leaflet (PL) and Summary of Product Characteristics (SmPC) documents for each medicinal product were used, excluding duplicate documents for medicinal products with more than one strength or pharmaceutical preparation. Text was extracted from the PDF files using the pdfplumber package, version 0.6.1, in Python 3.8.10, omitting page numbering, headers, and footers.
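The extraction step can be sketched roughly as follows. The record does not say how page numbering, headers, and footers were excluded; this sketch assumes they sit in fixed top and bottom margins and crops those away before calling pdfplumber's `extract_text`. The function names and the 50-point margin are illustrative assumptions, not the author's script.

```python
from typing import List, Tuple


def body_bbox(width: float, height: float, margin: float = 50.0) -> Tuple[float, float, float, float]:
    """Return (x0, top, x1, bottom) for the page area between the margins.

    Assumption: headers, footers, and page numbers lie within `margin`
    points of the top and bottom edges of the page.
    """
    if margin < 0 or 2 * margin >= height:
        raise ValueError("margin must be non-negative and less than half the page height")
    return (0.0, margin, width, height - margin)


def extract_body_text(pdf_path: str, margin: float = 50.0) -> List[str]:
    """Extract the text of each page, skipping the top and bottom margins."""
    import pdfplumber  # third-party; imported here so body_bbox works without it

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            body = page.crop(body_bbox(page.width, page.height, margin))
            # extract_text may return None for empty pages in some versions
            pages.append(body.extract_text() or "")
    return pages
```

A fixed margin is the simplest option; documents whose headers or footers vary in height would need a per-document margin or a different filtering rule.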
Line breaks and special characters (excluding punctuation characters) were removed, and punctuation was added where it was missing (for example, after headings) to avoid false aggregation of sentences. All paragraphs were tokenized at the sentence level using the Natural Language Toolkit (NLTK) version 3.7 tokenizer.
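A minimal sketch of the cleaning and sentence tokenization described above. The record does not give the exact character rules or how missing punctuation was detected, so the regular expression, the choice of a full stop as the added mark, and the helper names are all assumptions. NLTK's `sent_tokenize` relies on the Punkt model, which has to be downloaded once (with NLTK 3.7, via `nltk.download("punkt")`).

```python
import re
from typing import List

# Assumption: keep letters, digits, whitespace, and common punctuation; drop the rest.
_DISALLOWED = re.compile(r"[^\w\s.,;:!?()'\"%/-]")
_TERMINAL = (".", "!", "?", ":")


def clean_paragraph(text: str) -> str:
    """Collapse line breaks, strip special characters, and end with punctuation."""
    text = _DISALLOWED.sub(" ", text)
    text = " ".join(text.split())  # removes line breaks and repeated spaces
    if text and not text.endswith(_TERMINAL):
        text += "."  # e.g. headings, so they do not merge with the next sentence
    return text


def tokenize_paragraphs(paragraphs: List[str]) -> List[str]:
    """Split cleaned paragraphs into sentences with NLTK."""
    from nltk.tokenize import sent_tokenize  # third-party; needs the Punkt model

    sentences = []
    for paragraph in paragraphs:
        cleaned = clean_paragraph(paragraph)
        if cleaned:
            sentences.extend(sent_tokenize(cleaned))
    return sentences
```

For example, `clean_paragraph("4.1 Therapeutic\nindications")` joins the two lines and appends a full stop, so the heading becomes a sentence of its own instead of running into the text that follows.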
This database contains sentence-level tokenized product information for all centrally approved medicinal products within the EU as of May 3, 2022, drawn from Summary of Product Characteristics (SmPC) and Package Leaflet (PL) documents.
A total of 1258 medicinal products were initially included, of which 5 were subsequently excluded due to document compatibility issues. From the remaining 1253 products, approximately 783,000 sentences were extracted from the PL and SmPC documents.
Tokenized product information for centrally approved medicinal products within the EU. See the description above for details on how the data were compiled and processed.
The database contains sentence-level tokenized product information extracted from all centrally approved medicinal products within the EU (2022-05-03). See the description above for further details.
Access to data through SND. Data are freely accessible.