Downloading the data set¶
CAMELYON16 and CAMELYON17 data sets are open access and shared publicly on GigaScience, Google Drive and on Baidu Pan.
GigaScience Database:
AWS Registry of Open Data
Baidu Pan:
Meta files: These files are available in the shared folders. They are shared here too for convenience.
- CAMELYON16: checksums.md5, README.md
- CAMELYON17: checksums.md5, README.md
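The checksum files can be used to confirm that a download completed without corruption. The following is a minimal sketch, assuming that checksums.md5 follows the common md5sum layout (one hex digest followed by a relative file path per line); the function names are illustrative and not part of any CAMELYON tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse md5sum-style lines ('<hex digest>  <relative path>') into a dict.

    Assumption: checksums.md5 uses this layout; adjust the parsing if it differs.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' before the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(checksum_file, data_root):
    """Return the relative paths that are missing or whose MD5 does not match."""
    expected = parse_checksums(Path(checksum_file).read_text())
    failed = []
    for name, digest in expected.items():
        target = Path(data_root) / name
        if not target.is_file() or md5_of_file(target) != digest:
            failed.append(name)
    return failed
```

Calling `verify('checksums.md5', 'CAMELYON17')` would then return an empty list when every listed file is present and intact; the directory name here is only an example.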
This work is made available under CC0.
Data¶
The data in this challenge contains whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections.
Depending on the particular data set (see below), ground truth is provided:
- At the lesion level: detailed annotations of metastases in the WSIs.
- At the patient level: a pN-stage label per patient.
All ground truth annotations were carefully prepared under the supervision of expert pathologists. To revise the slides, additional slides stained with cytokeratin immunohistochemistry were used. If, however, you encounter problems with the data set, please report your findings at the forum.
The data set for CAMELYON17 was collected from 5 medical centres in the Netherlands. WSIs are provided as TIFF images. Lesion-level annotations are provided as XML files. The training set contains 100 patients and the test set another 100 patients. With 5 slides per patient, this amounts to 1000 slides in total.
Training data set¶
The first training data set was released 18 November 2016.
Lesion-level training data:
- Data from CAMELYON16, which was collected from Radboud UMC and UMC Utrecht, will be re-used as lesion-level training data for CAMELYON17.
- Lesion-level annotations are also provided for 10 training slides from every medical centre within CAMELYON17 (50 annotated slides total).
- In this set the micro- and macro-metastases are annotated exhaustively. Note, however, that isolated tumour cells (ITCs) are not annotated exhaustively.
Patient-level training data:
- For each of the 5 data sources we provide slides that are organised by patient.
- Each patient is labelled with a pN-stage.
- Each patient is represented by 5 lymph nodes.
- Every slide holds sections of just 1 lymph node.
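Because the slides are organised by patient, it can help to group the downloaded files before training. The sketch below assumes that slide files follow the `patient_XXX_node_Y.tif` naming seen in the example paths on this page; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming pattern, taken from the example paths used in this document.
SLIDE_NAME = re.compile(r"patient_(\d+)_node_(\d+)\.tif$")


def group_slides_by_patient(paths):
    """Map patient id -> list of slide paths ordered by node index."""
    nodes = defaultdict(dict)
    for path in paths:
        match = SLIDE_NAME.search(str(path))
        if match is None:
            continue  # skip files that do not match the assumed pattern
        patient, node = int(match.group(1)), int(match.group(2))
        nodes[patient][node] = str(path)
    return {
        patient: [by_node[n] for n in sorted(by_node)]
        for patient, by_node in sorted(nodes.items())
    }


def incomplete_patients(groups, slides_per_patient=5):
    """Return the patient ids that do not have the expected number of slides."""
    return [p for p, slides in groups.items() if len(slides) != slides_per_patient]
```

Running `incomplete_patients` on the grouped result is a quick way to spot patients for whom fewer than 5 slides were downloaded.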
Test data set¶
The test data set was released in March 2017.
Just like the training data set, the test data set contains 500 slides, which are also organised by patient, with 5 slides per patient. These slides are neither annotated nor labelled, and no pN-stage is given per patient.
Visualising whole-slide images and annotations¶
Reading the multi-resolution images using standard image tools or libraries is a challenge because these tools are typically designed for images that can comfortably be uncompressed into RAM or a swap file. OpenSlide is a C library that provides a simple interface to read WSIs of different formats.
Automated Slide Analysis Platform (ASAP) is an open source platform developed by DIAG for visualising, annotating and automatically analysing whole-slide histopathology images. ASAP is built on top of several well-developed open source packages such as OpenSlide, Qt and OpenCV. We strongly recommend that participants use this platform for visualising the slides and viewing the annotations.
Accessing the data¶
The whole-slide images provided in this challenge are standard TIFF files. Standard libraries like OpenSlide can be used to open and read these files. We used ASAP to prepare the data. Its multiresolutionimageinterface C++ library and Python package provide an easy-to-use interface for accessing pixel data in TIFF files efficiently. Only 3 simple steps are needed to use it in Python:
- Download and install ASAP.
- Configure your PYTHONPATH environment variable to contain the path of the /bin directory of the ASAP installation.
- Import multiresolutionimageinterface into your Python module.
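As an alternative to setting PYTHONPATH in the shell, the same directory can be added to `sys.path` at run time. This is only a sketch: the `ASAP_BIN` value is a hypothetical install location and has to be replaced with the actual /bin directory of your ASAP installation.

```python
import importlib
import sys

# Hypothetical install location; replace with the bin directory of your ASAP installation.
ASAP_BIN = "/opt/ASAP/bin"

# Make the directory visible to the import system for this process only.
if ASAP_BIN not in sys.path:
    sys.path.append(ASAP_BIN)

try:
    mir = importlib.import_module("multiresolutionimageinterface")
except ImportError:
    mir = None  # ASAP is not installed or ASAP_BIN points to the wrong directory
    print("multiresolutionimageinterface not found; check ASAP_BIN")
```

Unlike PYTHONPATH, this change applies only to the running interpreter, which can be convenient in notebooks or on shared machines.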
A few things to know about the library and the TIFF image format:
- The TIFF contains multiple down-sampled versions of the original image. The highest resolution is on level 0.
- Pixel values can be read in patches from any available level.
- Pixel indexing is done in (column, row) manner.
- Regardless of which level a patch is read from, the indexing is done on level 0.
The following Python code snippet loads a TIFF file and reads a 300 pixel wide, 200 pixel high image patch starting at the (568, 732) XY coordinate on level 2:

    import multiresolutionimageinterface as mir

    reader = mir.MultiResolutionImageReader()
    mr_image = reader.open('camelyon17/centre_0/patient_000_node_0.tif')

    level = 2
    # Downsampling factor of the chosen level relative to level 0.
    ds = mr_image.getLevelDownsample(level)

    # Start coordinates are indexed on level 0, so the level-2 position is scaled by ds.
    # The patch width and height are given in pixels of the requested level.
    image_patch = mr_image.getUCharPatch(int(568 * ds), int(732 * ds), 300, 200, level)
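The same level-0 indexing rule applies when a whole level is read tile by tile. The helper below is a minimal sketch of that arithmetic in plain Python; `tile_origins` is a hypothetical name, and the level size and downsample factor are passed in as plain numbers so that the example does not depend on ASAP.

```python
def tile_origins(level_width, level_height, tile_size, downsample):
    """Yield (x0, y0, width, height) for tiles covering one pyramid level.

    x0 and y0 are level-0 coordinates; width and height are in pixels of
    the level being read, clipped at the right and bottom borders.
    """
    for row in range(0, level_height, tile_size):
        for col in range(0, level_width, tile_size):
            width = min(tile_size, level_width - col)
            height = min(tile_size, level_height - row)
            # Scale the tile position on this level back to level 0.
            yield int(col * downsample), int(row * downsample), width, height
```

Each yielded tuple could then be passed, together with the level, to `getUCharPatch` as in the snippet above, with the downsample factor taken from `getLevelDownsample(level)`.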
Annotations¶
The annotations were made in ASAP. The annotation ROIs are polygons that are stored as an ordered list of vertex (X, Y) pixel coordinates on level 0 of the multi-resolution images.
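The polygon vertices can also be read without ASAP by parsing the XML directly. The sketch below assumes the usual ASAP layout, in which each `Annotation` element carries `Name` and `PartOfGroup` attributes and contains `Coordinate` elements with `Order`, `X` and `Y` attributes; check this against your own files, since the layout is an assumption and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_polygons(xml_text):
    """Return a list of (name, group, [(x, y), ...]) tuples from ASAP-style XML.

    Assumed layout: Annotation elements with Name and PartOfGroup attributes,
    each holding Coordinate elements with Order, X and Y attributes.
    """
    root = ET.fromstring(xml_text)
    polygons = []
    for annotation in root.iter("Annotation"):
        # Restore the vertex order in case the elements are not stored sequentially.
        points = sorted(
            annotation.iter("Coordinate"),
            key=lambda c: int(c.get("Order")),
        )
        vertices = [(float(c.get("X")), float(c.get("Y"))) for c in points]
        polygons.append((annotation.get("Name"), annotation.get("PartOfGroup"), vertices))
    return polygons


def scale_vertices(vertices, downsample):
    """Convert level-0 vertex coordinates to the pixel grid of a lower-resolution level."""
    return [(x / downsample, y / downsample) for x, y in vertices]
```

Since the vertices are stored on level 0, `scale_vertices` divides by the downsample factor of the target level when the polygon needs to be drawn onto a lower-resolution image.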
The following Python code snippet converts an annotation into a mask file:

    import multiresolutionimageinterface as mir

    reader = mir.MultiResolutionImageReader()
    mr_image = reader.open('camelyon17/centre_0/patient_010_node_4.tif')

    # Load the polygon annotations from the XML file.
    annotation_list = mir.AnnotationList()
    xml_repository = mir.XmlRepository(annotation_list)
    xml_repository.setSource('camelyon17/centre_0/patient_010_node_4.xml')
    xml_repository.load()

    # Path of the mask file to write (example value; choose any writable location).
    output_path = 'camelyon17/centre_0/patient_010_node_4_mask.tif'

    annotation_mask = mir.AnnotationToMask()
    camelyon17_type_mask = True
    # Map annotation group names to mask labels; the alternative covers
    # annotations that use the '_0', '_1' and '_2' groups.
    label_map = {'metastases': 1, 'normal': 2} if camelyon17_type_mask else {'_0': 1, '_1': 1, '_2': 0}
    # Order in which the annotation groups are converted.
    conversion_order = ['metastases', 'normal'] if camelyon17_type_mask else ['_0', '_1', '_2']
    annotation_mask.convert(annotation_list, output_path, mr_image.getDimensions(), mr_image.getSpacing(), label_map, conversion_order)