ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology
SND-ID: 2022-190-1. Version: 1. DOI: https://doi.org/10.48723/w728-p041
Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs in the data set are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E stained tissue and at least one, and up to four, WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available. The data set was acquired as part of the CHIME study (chimestudy.se) and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org). The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology image digitisation workflow, using three different Hamamatsu WSI scanners, specifically one NanoZoomer S360 and two NanoZoomer XR. The WSIs in this data set are accompanied by a data ta...
The data set consists of three subsets (training, validation and test), following the split used in the ACROBAT WSI registration challenge. The training set contains 750 cases, each with one H&E WSI and one to four IHC WSIs, for 3,406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total and the test set of 303 cases with 606 WSIs in total. For both the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available per case.
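The subset counts can be cross-checked against the totals stated in the description. The following is a minimal Python sketch with the figures copied from the text; the dictionary layout and variable names are chosen purely for illustration.

```python
# Case and WSI counts per subset, copied from the description above.
subsets = {
    "training": {"cases": 750, "wsis": 3406},
    "validation": {"cases": 100, "wsis": 200},
    "test": {"cases": 303, "wsis": 606},
}

# Sum over the three subsets and compare with the stated totals
# (1,153 patients and 4,212 WSIs).
total_cases = sum(s["cases"] for s in subsets.values())
total_wsis = sum(s["wsis"] for s in subsets.values())

print(total_cases)  # 1153
print(total_wsis)   # 4212
```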
WSIs were anonymised by deleting the associated macro images, by generating filenames with random case IDs and by overwriting metadata fields that could contain personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org/). WSIs are available as generic tiled TIFF WSIs (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.
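Since the files are generic tiled TIFFs in a format documented by OpenSlide, one plausible way to inspect the image levels is the third-party openslide-python package. The sketch below is an illustration, not the data set authors' tooling. It assumes that level 0 corresponds to the stated 10X magnification, and the function names and the file path in the usage comment are hypothetical.

```python
def level_magnifications(downsamples, base_magnification=10.0):
    """Approximate magnification of each pyramid level.

    Assumes level 0 is the 10X base image mentioned in the description.
    """
    return [base_magnification / d for d in downsamples]


def describe_wsi(path):
    """Return (dimensions, approximate magnification) for each level."""
    # Imported lazily so the helper above works without OpenSlide installed.
    import openslide  # third-party package: openslide-python

    with openslide.OpenSlide(path) as slide:
        mags = level_magnifications(slide.level_downsamples)
        return list(zip(slide.level_dimensions, mags))


# Usage (hypothetical file name):
# for dims, mag in describe_wsi("example_wsi.tif"):
#     print(dims, f"{mag:.2f}X")
```

The downsample factors are read from the file rather than assumed, because the description does not state how many lower levels each WSI contains.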
The data set is available for download in seven separate ZIP archives: five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip (21.79 GB)) and one for the test data (test.zip (68.11 GB)).
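Before downloading, it can help to add up the archive sizes and compare them with the free disk space. The sketch below uses the sizes listed above. It assumes that 1 GB means 10^9 bytes, which the catalogue does not specify, and `has_enough_space` is an illustrative helper rather than part of any SND tooling.

```python
import shutil

# Archive sizes in GB, copied from the list above.
archive_sizes_gb = {
    "train_part1.zip": 71.47,
    "train_part2.zip": 70.59,
    "train_part3.zip": 75.91,
    "train_part4.zip": 71.63,
    "train_part5.zip": 69.09,
    "valid.zip": 21.79,
    "test.zip": 68.11,
}

total_gb = sum(archive_sizes_gb.values())
print(round(total_gb, 2))  # 448.59


def has_enough_space(path, required_gb):
    """Check free space at `path`, assuming 1 GB = 10**9 bytes."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


# Usage: has_enough_space(".", total_gb)
```

The extracted files need additional space on top of the archives themselves, so the total here is a lower bound.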
File listings and checksums in SHA1 format are available for checking archive/data integrity when downloading.
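A downloaded archive can be checked by recomputing its SHA1 digest and comparing it with the published value. The following sketch uses only the Python standard library. The layout of the checksum listing is an assumption (one `hexdigest  filename` pair per line, as produced by `sha1sum`), and the function names and the listing file name in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_listing(text):
    """Parse 'hexdigest  filename' lines (sha1sum style; assumed layout)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha1sum marks binary mode with a leading '*' on the name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(directory, expected):
    """Map each listed file name to True if present with a matching digest."""
    results = {}
    for name, digest in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha1_of_file(path) == digest
    return results


# Usage (hypothetical listing file name):
# expected = parse_checksum_listing(Path("checksums.sha1").read_text())
# print(verify(".", expected))
```

Reading in chunks keeps memory use constant, which matters for archives of roughly 70 GB each.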
While it would be helpful to notify SND of any publications using this data set by sending an email to firstname.lastname@example.org, please note that this is not required to use the data.
Data contains personal data
Unit of analysis
Anonymised female primary breast cancer patients from the Stockholm region
Time period(s) investigated
2012 – 2018
Number of individuals/objects
Data format / data structure
- Description of the mode of collection: Archived routine clinical diagnostic tissue slides with tissue material were scanned using whole-slide-image scanners at Karolinska Institutet.
- Time period(s) for data collection: 2012 – 2018
- Data collector: Karolinska Institutet
- Instrument: NanoZoomer S360 (Technical instrument(s)) - Hamamatsu whole-slide-imaging scanner
- Instrument: NanoZoomer XR (Technical instrument(s)) - Hamamatsu whole-slide-imaging scanner
Geographic location: Stockholm County
Department of Medical Epidemiology and Biostatistics
Aino Kuusela - University of Turku, Institute of Biomedicine
Sonja Koivukoski - University of Eastern Finland, Institute of Biomedicine
Circe Carr - University of Turku, Institute of Biomedicine
Sandra Pouplier - Zealand University Hospital, Department of Surgical Pathology
- Funding agency: ERA PerMed
- Funding agency's reference number: ERAPERMED2019-224-ABCAP
- Project name on the application: Advancing Breast Cancer histopathology towards AI-based Personalised medicine
Stockholm - Ref. 2017/2106-31
Science and technology (CESSDA Topic Classification)
Information technology (CESSDA Topic Classification)
Medical image processing (Standard för svensk indelning av forskningsämnen 2011)
Medical and health sciences (Standard för svensk indelning av forskningsämnen 2011)
Cancer and oncology (Standard för svensk indelning av forskningsämnen 2011)
Weitz, P. et al. (2022). ACROBAT -- a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology. doi:10.48550/ARXIV.2211.13621
Weitz P, Valkonen M, Solorzano L, Carr C, Kartasalo K, Boissin C, Koivukoski S, Kuusela A, Rasic D, Feng Y, Sinius Pouplier S, Sharma A, Ledesma Eriksson K, Latonen L, Laenkholm AV, Hartman J, Ruusuvuori P, Rantalainen M. A Multi-Stain Breast Cancer Histological Whole-Slide-Image Data Set from Routine Diagnostics. Sci Data. 2023 Aug 24;10(1):562.