INTRODUCTION: Knee osteoarthritis (OA) is a leading cause of disability in older adults, with no cure currently available and a dissatisfaction rate of nearly 20% among patients who undergo joint replacement surgery. Given the importance of radiographic staging in both the development of preventive treatments and decision making for joint replacement surgery, automated tools that assist clinicians and mitigate human bias offer the promise of significantly improving OA management and patient quality of life. Deep learning is a machine learning paradigm that is transforming image analysis across medical fields (e.g., cancer detection, brain imaging) and has the potential to advance research and clinical decision making in OA as well. The aim of this study was to develop a fully automated tool for staging knee OA severity from radiographs and to compare its performance against that of a board-certified musculoskeletal radiologist.

METHODS: We used six longitudinal X-ray images taken at baseline (0 years) and 1, 2, 3, 4, and 6 years afterwards for each lower limb of 4,508 patients, yielding 40,280 total images (20,140 images that were each split into two single-limb images). All the images were bilateral fixed-flexion plain film X-rays acquired in a standardized manner by the Osteoarthritis Initiative (OAI). All radiographs were staged using the Kellgren-Lawrence (KL) system by a committee of two trained musculoskeletal radiologists. Subjects were split into training (3,606 subjects, 32,116 images), validation (450 subjects, 4,074 images), and test (452 subjects, 4,090 images) sets. We used a convolutional neural network (CNN) to learn radiographic features that are predictive of OA severity from the dataset. We used the DenseNet architecture, an established CNN architecture shown to perform well on other orthopedic image analysis tasks, but modified the final output to produce five probabilities for each image, corresponding to the likelihood that the image represented each of the five possible KL grades (0 to 4). Before being used as input to the model, images were preprocessed in an automated fashion to standardize their size, resolution, and pixel variance. Our preprocessing procedure differs from past work in that it does not involve manual cropping of the knee joint space within the larger image and is therefore fully automated. Data augmentation was applied to the training set to improve the model’s ability to generalize to new images. The model was trained using the augmented training set images and iteratively tuned using the validation set images. The model that performed best on the validation set was selected as the final model and was then evaluated using the images in the test set. We also assessed the model’s performance relative to a board-certified radiologist using a subset of 50 images from the test set.
The radiologist was blinded to the OAI committee-assigned scores and asked to score each image as they would in their clinical practice. The 50 images comprised 10 randomly selected images for each KL score. An F1 score was calculated separately for each KL score for both the radiologist and the model. The average F1 score was calculated as the simple (unweighted) average of the five per-score F1 values. An overall accuracy was calculated for both the radiologist and the model using all 50 images. Saliency maps were used to obtain a qualitative understanding of how the trained model arrived at its predictions. They were produced by calculating the contribution of each pixel to the probability that the model assigned to the true KL class.
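The per-score F1, simple-average F1, and overall accuracy described above can be illustrated with a minimal Python sketch. The function names and the toy labels are illustrative assumptions, not the study's code or data, and the sketch assumes the convention that F1 is 0 for a KL score with no true or predicted examples.

```python
# Illustrative sketch only: names and example labels are hypothetical.
KL_SCORES = [0, 1, 2, 3, 4]


def per_class_f1(y_true, y_pred, labels=KL_SCORES):
    """Return a dict mapping each KL score to its one-vs-rest F1."""
    scores = {}
    for k in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == k and p == k)
        fp = sum(1 for t, p in pairs if t != k and p == k)
        fn = sum(1 for t, p in pairs if t == k and p != k)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); assumed to be 0 when undefined.
        scores[k] = 2 * tp / denom if denom else 0.0
    return scores


def average_f1(y_true, y_pred, labels=KL_SCORES):
    """Simple (unweighted) average of the per-score F1 values."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of images whose predicted score matches the reference."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Toy reference (committee) and predicted scores, two images per KL score.
    committee = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
    predicted = [0, 1, 1, 1, 2, 3, 3, 3, 4, 4]
    print(per_class_f1(committee, predicted))
    print(average_f1(committee, predicted))
    print(accuracy(committee, predicted))
```

Because the 50-image subset contains the same number of images for each KL score, the simple average weights every score equally, so a rater's weak performance on one score cannot be hidden by strong performance on a more common one.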

RESULTS: With the committee scores used as ground truth, the model’s predictions had an average F1 score of 0.70 across all KL classes and an overall accuracy of 0.71 on the full test set (Table 1). On the fifty-image test subset, the individual radiologist had an average F1 score of 0.53 and an overall accuracy of 0.54, while the model had an average F1 score of 0.64 and an overall accuracy of 0.66 (Table 2). The model’s F1 scores for individual KL scores exceeded those of the radiologist for KL = 2, 3, and 4, while the radiologist had higher F1 scores for KL = 0 and 1. The saliency maps (Figure 1) frequently identified large contributions at the medial and lateral joint margins, and to a lesser degree at the intercondylar tubercles (i.e., tibial spines), all of which are primary sites of osteophyte formation in OA and indicators used in the determination of KL scores. They show that the model is using the portions of images that the KL scoring system emphasizes and suggest that the model is basing its predictions on clinically relevant radiographic information.

DISCUSSION: We developed a model that takes a full knee radiograph as input and predicts the KL score of the joint with state-of-the-art accuracy that exceeds individual radiologist-level agreement with a committee of radiologists. Applying a sequence of data augmentation methods to each training set image enhanced the generalizability of the model to new images and eliminated the need to manually annotate the joint space in each new image before applying the model to classify the image. This allows the model to use the full radiograph directly after it is produced by the X-ray machine without the need for any intermediate labor. The model was trained and evaluated on radiographs collected using a standardized protocol and must still be evaluated on unstandardized radiographs from clinical practice. While our saliency maps provide only a qualitative understanding of how the model arrives at its predictions, they demonstrate a valuable approach for interpreting and assessing deep learning models for knee X-ray classification. Deep learning shows great promise for the development of clinical decision-making tools that give clinicians insight into joint pathology; in the future, such tools may be able to point out important features not apparent to the human eye and detect disease earlier than is currently possible.

SIGNIFICANCE/CLINICAL RELEVANCE: We have developed an accurate, fully automated tool for evaluating OA severity. This tool establishes the potential of a deep learning model to improve surgical outcomes and accelerate the development of preventive treatments for OA.

ORS 2019 Annual Meeting