Development and Validation of a Diagnostic 35-Gene Expression Profile Test for Ambiguous or Difficult-To-Diagnose Suspicious Pigmented Skin Lesions

Purpose: A clinical hurdle for dermatopathology is the accurate diagnosis of melanocytic neoplasms. While histopathologic assessment is frequently sufficient, high rates of diagnostic discordance are reported. The development and validation of a 35-gene expression profile (35-GEP) test that accurately differentiates benign and malignant pigmented lesions is described. Methods: Lesion samples were reviewed by at least three independent dermatopathologists and included in the study if 2/3 or 3/3 diagnoses were concordant. Diagnostic utility of 76 genes was assessed with quantitative RT-PCR; neural network modeling and cross-validation were utilized for diagnostic gene selection using 200 benign nevi and 216 melanomas for training. To reflect the complex biology of melanocytic neoplasia, the 35-GEP test was developed to include an intermediate-

INTRODUCTION
Over 5 million skin biopsies are performed annually in the US, leading to the diagnosis of over 130,000 invasive melanomas. [1][2][3][4][5] Because melanoma is one of the most aggressive skin cancers, early detection and diagnosis are crucial. 1,6 Current methods for the definitive diagnosis of melanoma are sufficient for the majority of lesions; however, histopathologic assessment can be challenging, even for experienced dermatopathologists, and high rates of diagnostic discordance have been reported. [7][8][9][10][11] Even pigmented lesions with clear pathological features consistent with benign nevi or invasive melanoma have concordance rates of 92% and 72%, respectively, 7 indicating that a subset of lesions with a typical histopathological presentation is subject to differing assessments.
Visual assessment of hematoxylin and eosin (H&E)-stained lesions is inherently subjective. It relies on expert interpretation and integration of a wide spectrum of architectural and cytologic features that are weighted differently based on the presumed subtype of melanocytic neoplasm and are heavily influenced by the pathologist's personal experience and training. All melanoma subtypes, including desmoplastic, spitzoid, nevoid, lentigo maligna, and superficial spreading melanoma, can mimic benign nevus variants to varying degrees. Diagnosis requires the integration of multiple factors, including histopathologic and clinical features, melanoma subtype variants, patient age, and anatomic location. 12 Difficult-to-diagnose lesions are commonly sent for second opinions to expert dermatopathologists with more experience in challenging cases; however, the nature of many lesions remains ambiguous, with discordance rates of 25-43% for lesions in this category. 7 Studies detailing the prevalence, outcome, and misdiagnosis of these lesions indicate that improved ancillary diagnostic technologies could greatly benefit dermatopathologists and dermatologists in determining the most appropriate treatment plan. [8][9][10][13][14][15][16][17][18] Efforts to improve melanoma diagnosis have traditionally focused on ancillary tests such as immunohistochemistry (IHC), fluorescence in situ hybridization (FISH), and comparative genomic hybridization (CGH), but each has limitations. [19][20][21][22] FISH and CGH can have less than optimal specificity and are less widely available than IHC. IHC is the most commonly used diagnostic tool for melanocytic lesions, but IHC markers, including Ki-67, Melan-A/MART-1, and p16, are limited in their ability to distinguish benign from malignant melanocytic lesions. 23 Similarly, a recently developed PRAME IHC assay has shown staining in approximately 14% of nevi, in some cases above the threshold established for a diagnosis of melanoma. 24
Definitive diagnosis is also complicated for a subset of lesions described as borderline, indeterminate, of unknown malignant potential (UMP), atypical melanocytic proliferation (AMP), or in a 'grey' zone. [25][26][27][28][29][30][31][32][33] Clinical management of these cases usually results in conservative treatment for the 'most significant consideration in the differential diagnosis'. 27 Gene expression profiling (GEP) has been employed to improve the diagnosis of these suspicious pigmented lesions. A 2-gene pigmented lesion assay (2-gene) [34][35][36][37][38][39] and a 23-gene expression profile (23-GEP) test 40,41 have been developed previously; however, the utility of the 2-gene test is focused on guiding biopsy decisions by dermatologists, whereas the 23-GEP labels a substantial number of lesions (~15% across studies) as indeterminate rather than providing a result of benign or malignant. 40,41 Approximately 10% of unequivocal cases and 15% of ambiguous lesions may be labeled indeterminate by the 23-GEP test. 42 Although sensitivity and specificity for the 23-GEP are reported at 91.5% and 92.5%, respectively, [40][41][42][43][44] there is an opportunity to significantly increase accuracy and thus optimize the management of melanoma patients, particularly given the advances in melanoma prognosis and treatment over the past decade.
In this study, we describe the development and validation of a 35-GEP test that differentiates between benign and malignant pigmented lesions with greater accuracy than previously developed tests. A training cohort of samples, including subtypes considered challenging to diagnose, was established, and bioinformatic and machine-learning approaches were used to select and prioritize genes associated with benign or malignant biology. The test was validated using an independent cohort of cases and demonstrates sensitivity and specificity exceeding those currently reported in the melanoma diagnostic literature while maintaining a minimal intermediate-risk zone. The novel 35-GEP test could aid in the diagnosis of suspicious pigmented lesions and improve accuracy, either alone or in combination with currently applied diagnostic tools.

METHODS
Sample and Clinical Data Collection
Archival benign samples and associated de-identified clinical data were collected from multiple independent dermatopathology laboratories as part of this Institutional Review Board (IRB)-approved study. Formalin-fixed, paraffin-embedded (FFPE) pigmented lesion tissue was collected as 5 μm sections for diagnosis based on H&E staining and for real-time quantitative reverse transcription PCR (qRT-PCR) analysis. Additionally, archival melanoma samples and de-identified clinical data were obtained from specimens submitted to Castle Biosciences for clinical testing with the 31-GEP (DecisionDx-Melanoma). A total of 951 samples diagnosed between January 2013 and August 2020 were included in the training and validation cohorts, of which 498 were benign and 453 were malignant. All laboratory personnel were blinded to the clinical diagnoses of all 951 samples.
Samples were excluded from the study if tumor volume was less than 10% (cellularity of all samples was determined by a single dermatopathologist), tissue originated from melanoma metastases, lesions were not primary to the skin, tissue was derived from re-excisions (including wide local excision), the diagnosis was a nonmelanocytic neoplasm, or patients had received previous radiation or immunotherapy. Melanoma subtypes of acral lentiginous, desmoplastic, lentiginous, lentigo maligna, nevoid, nodular, spitzoid, superficial spreading, and melanoma in situ were included. Benign subtypes of blue nevus, common nevus (compound, junctional, and intradermal), deep penetrating nevus, dysplastic nevus (compound and junctional), and Spitz nevus were included.

Histopathologic Examination
Eight dermatopathologists participated in sample acquisition, and six dermatopathologists participated in sample review for diagnostic concordance. Most of these dermatopathologists are affiliated with private practice and have an average of 12 years of experience reviewing skin lesions. All acquired samples were received with the original pathology report.
For all benign diagnoses, the contributing dermatopathologist provided a free-text description of the lesion, which was entered into the clinical research form. All benign samples then underwent H&E diagnostic review by a second and a third dermatopathologist, who were blinded to the original diagnosis and provided only with patient age and the anatomic location of the lesion. Reviewing dermatopathologists were asked to select a diagnosis (benign, malignant, or unknown malignant potential (UMP)) as well as a subtype classification from a predetermined list. If the three diagnoses were discordant, the case was reviewed in a blinded manner by additional dermatopathologists for adjudication. A total of 395 samples diagnosed as benign by 3 of 3 dermatopathologists were included in the study; additionally, 78 cases diagnosed as UMP by no more than 1 dermatopathologist (i.e., 2 benign and 1 UMP) were added to the training and validation cohorts. As a result, the final benign training and validation cohorts consisted of samples with full diagnostic concordance (167/200 and 228/273, respectively) and samples with no more than one UMP classification (33/200 and 45/273, respectively).
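The benign-cohort inclusion rule above can be written as a small predicate. This is a minimal sketch, assuming each review is encoded as one of the hypothetical labels `"benign"`, `"malignant"`, or `"UMP"`; it covers only the three-review rule, not the additional adjudication of discordant cases. The assertions re-check the cohort counts stated in the text.

```python
from collections import Counter

def include_benign_case(calls):
    """Apply the benign inclusion rule to three reviewer diagnoses.

    calls: three labels from the hypothetical encoding
    {"benign", "malignant", "UMP"}.
    Included: 3/3 benign, or 2 benign + 1 UMP. Anything else is excluded.
    """
    if len(calls) != 3:
        raise ValueError("expected exactly three diagnoses")
    counts = Counter(calls)
    fully_concordant = counts["benign"] == 3
    one_ump = counts["benign"] == 2 and counts["UMP"] == 1
    return fully_concordant or one_ump

# Benign cohort composition reported in the text.
training = {"concordant": 167, "one_ump": 33}
validation = {"concordant": 228, "one_ump": 45}
assert sum(training.values()) == 200 and sum(validation.values()) == 273
assert training["concordant"] + validation["concordant"] == 395
assert training["one_ump"] + validation["one_ump"] == 78

print(include_benign_case(["benign", "UMP", "benign"]))        # True
print(include_benign_case(["benign", "malignant", "benign"]))  # False
```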

Real-Time Quantitative Reverse-Transcription PCR
Pigmented lesions were processed for qRT-PCR expression analysis in a central CLIA-certified, CAP-accredited, and New York State Department of Health-permitted laboratory.
Tumor sections were macrodissected from unstained FFPE tissue, and total RNA was extracted according to the manufacturer's instructions using either the QIAsymphony SP Automated Nucleic Acid Extractor (Qiagen) or the KingFisher Flex (ThermoFisher Scientific) platform. Total RNA concentration was quantified using the NanoDrop 8000 (ThermoFisher Scientific). cDNA was synthesized using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems). cDNA pre-amplification was performed with the TaqMan PreAmp Master Mix (Applied Biosystems) for 14 cycles. Pre-amplified samples were diluted 1:2.5 in TE Buffer pH 7.0 (ThermoFisher Scientific) and combined with an equal volume of 2X Open Array Real-Time PCR Master Mix (Applied Biosystems). The samples were loaded onto a custom Open Array gene card using the QuantStudio 12K Flex AccuFill system (Applied Biosystems) and subsequently run on the QuantStudio 12K PCR system.
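The dilution steps above can be checked with simple arithmetic. This sketch assumes that "1:2.5" means one part pre-amplified product in 2.5 parts total volume (that is, 1.5 parts TE buffer); the 2 µL starting volume is hypothetical. Under that reading, mixing with an equal volume of 2X master mix brings the master mix to 1X and the pre-amplified product to one-fifth of the final reaction volume.

```python
from fractions import Fraction

def openarray_reaction(preamp_ul, dilution=Fraction(5, 2)):
    """Volumes for diluting pre-amplified cDNA and mixing 1:1 with 2X master mix.

    Assumes dilution = total diluted volume / pre-amplified volume, so
    1:2.5 means 1 part product + 1.5 parts TE buffer (an interpretation).
    """
    preamp = Fraction(preamp_ul)
    diluted = preamp * dilution          # product + TE buffer
    te_buffer = diluted - preamp
    master_mix_2x = diluted              # "equal volume" of 2X master mix
    total = diluted + master_mix_2x
    return {
        "te_buffer_ul": te_buffer,
        "master_mix_ul": master_mix_2x,
        "total_ul": total,
        "master_mix_final_x": 2 * master_mix_2x / total,
        "preamp_fraction": preamp / total,
    }

r = openarray_reaction(2)  # hypothetical 2 uL of pre-amplified cDNA
print(r["te_buffer_ul"], r["total_ul"], r["master_mix_final_x"], r["preamp_fraction"])
# 3 10 1 1/5
```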

Expression Analysis and Diagnosis Assignment
The array data were analyzed to identify the genes that best segregated benign and malignant lesions based on their expression levels. [45][46][47] The resulting gene set was then reviewed to ensure that a wide variety of biological pathways were represented and to confirm the biological relevance of those genes. As a result, 76 candidate diagnostic genes were selected for model training. Three genes (FXR1, HNRNPL, and YKT6) were reliably and consistently expressed in the study cohort and were chosen as control genes. Triplicate gene expression data were aggregated and normalized using the control probes. Failure of three or more candidate genes (multiple gene failure, MGF) led to exclusion of the sample from the training and validation cohorts; control genes were evaluated independently, and failure of any control gene also resulted in sample exclusion. Following quality control measures to assess amplification and stability of gene expression, 58 discriminant probes and 3 control probes were selected for further analysis. Deep learning techniques were applied to the gene expression data for gene selection and model identification. [48][49][50] Neural network modeling of the gene expression data resulted in two diagnostic algorithms. 51,52 Tumors with spitzoid or melanoma in situ features had poorer initial classification accuracy; therefore, the presence of those features in the diagnosis was added as an input to the algorithm to improve accuracy. Algorithm refinement continued until the mean kappa value improved by less than 0.01 for the top 25% of the assay population. Hyperparameter selection and model evaluation were performed using 4×4-fold cross-validation, 50,53 and kappa was calculated as the average of the kappa values across the cross-validation runs. The final model was trained on all training data using the optimal gene set. The two models together constitute the locked algorithm for the 35-GEP test.
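The cross-validation and kappa steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes "4×4-fold" means four repetitions of stratified four-fold cross-validation (the phrase could also denote nested cross-validation), assumes Cohen's kappa as the agreement statistic, and replaces the neural networks with a hypothetical one-feature threshold model (`fit_threshold`, `predict_threshold`). The stopping rule based on a kappa improvement below 0.01 is not implemented.

```python
import random
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("need two non-empty label sequences of equal length")
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    if expected == 1:  # both sequences contain one identical class only
        return 1.0
    return (observed - expected) / (1 - expected)

def stratified_folds(labels, k, rng):
    """Shuffle indices within each class and deal them round-robin into k folds."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

def repeated_cv_kappa(X, y, fit, predict, k=4, repeats=4, seed=0):
    """Mean held-out kappa over `repeats` rounds of stratified k-fold CV."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(repeats):
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train])
            predictions = predict(model, [X[i] for i in fold])
            kappas.append(cohen_kappa([y[i] for i in fold], predictions))
    return sum(kappas) / len(kappas)

# Hypothetical stand-in for the neural networks: midpoint threshold on one feature.
def fit_threshold(X, y):
    benign = [x[0] for x, label in zip(X, y) if label == 0]
    malignant = [x[0] for x, label in zip(X, y) if label == 1]
    return (sum(benign) / len(benign) + sum(malignant) / len(malignant)) / 2

def predict_threshold(threshold, X):
    return [1 if x[0] > threshold else 0 for x in X]

X = [[float(v)] for v in list(range(20)) + list(range(100, 120))]
y = [0] * 20 + [1] * 20
print(repeated_cv_kappa(X, y, fit_threshold, predict_threshold))  # 1.0
```

Averaging kappa over all 16 held-out folds matches the text's description of kappa as the mean across cross-validation runs; swapping in a real classifier only requires new `fit` and `predict` callables.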
Classification into benign (gene expression profile suggestive of benign neoplasm), intermediate-risk (gene expression profile cannot exclude malignancy) or malignant (gene expression profile suggestive of melanoma) zones was determined from the probability scores from both algorithms.
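The text states only that the three result zones were derived from the probability scores of both algorithms. As an illustration, a hypothetical rule could require both models to agree beyond a cut-off before returning a definitive result. The cut-offs (`low=0.3`, `high=0.7`) and the agreement rule below are assumptions, not the locked 35-GEP thresholds.

```python
from enum import Enum

class Zone(Enum):
    BENIGN = "gene expression profile suggestive of benign neoplasm"
    INTERMEDIATE = "gene expression profile cannot exclude malignancy"
    MALIGNANT = "gene expression profile suggestive of melanoma"

def assign_zone(p1, p2, low=0.3, high=0.7):
    """Combine two malignancy probabilities into a three-zone result.

    Hypothetical rule: both models must agree beyond a cut-off for a
    definitive call; otherwise the result is intermediate-risk.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if not low < high:
        raise ValueError("low cut-off must be below high cut-off")
    if p1 >= high and p2 >= high:
        return Zone.MALIGNANT
    if p1 <= low and p2 <= low:
        return Zone.BENIGN
    return Zone.INTERMEDIATE

print(assign_zone(0.92, 0.88).name)  # MALIGNANT
print(assign_zone(0.05, 0.40).name)  # INTERMEDIATE
```

Requiring agreement between the two models is one simple way to keep definitive calls conservative; the width of the band between the cut-offs controls how many lesions fall into the intermediate-risk zone.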
Analyses were performed with R version 3.3.3. Differences in age were assessed using the Wilcoxon rank-sum test. Differences in categorical variables, including sex, ulceration status, and location, were assessed using Pearson's chi-square test. P values <0.05 were considered statistically significant.
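For a 2×2 comparison such as sex by cohort, Pearson's chi-square statistic sums (observed - expected)^2 / expected over the cells, with expected counts taken from the row and column totals. The sketch below uses only the standard library and hypothetical counts. It omits Yates' continuity correction, which R's `chisq.test` applies to 2×2 tables by default (`correct = TRUE`), and it covers neither the Wilcoxon rank-sum test nor tables with more than two categories, such as anatomic location.

```python
import math

def pearson_chi2_2x2(table):
    """Pearson chi-square for a 2x2 table, without continuity correction.

    Returns (statistic, p_value). With 1 degree of freedom the statistic is a
    squared standard normal, so P(X > x) = erfc(sqrt(x / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.erfc(math.sqrt(statistic / 2))

# Hypothetical counts: rows = training/validation, columns = female/male.
stat, p = pearson_chi2_2x2([[20, 10], [10, 20]])
print(round(stat, 3), round(p, 4))  # 6.667 0.0098
```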

Sample Cohorts
Quantitative RT-PCR was performed on 498 benign and 453 malignant lesions accrued under an IRB-approved protocol in a multicenter cohort (Figure 1). Thirty-two samples (~3.4%) of the study cohort were excluded from further analysis due to MGF with the remaining 919 samples randomized to training or validation cohorts while conserving benign or melanoma subtype representation in each cohort. Training (200 benign nevi and 216 melanomas) and validation (273 benign nevi and 230 melanomas) cohorts' demographic details are shown in Table 1. No statistically significant differences were observed in the training vs. validation cohorts. The median age of patients with benign lesions was 47 (range 7-85) years in the training cohort and 48   The majority of malignant lesions were biopsied from arms and legs (extremities, 40% of cases in training and validation, p=0.812), while benign lesions were mainly located on patients' backs (36.5% in training cohort and 41% in validation, p=0.863). The distribution of different subtypes of melanoma and nevi in the training and validation sets are provided in Table 2.
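The exclusion and randomization steps described above can be sketched as two small functions. This is an illustration under stated assumptions: the per-sample failure counts, the subtype labels, and the training fraction are hypothetical inputs, and rounding within each stratum is one simple way to conserve subtype proportions. The text does not specify the randomization procedure.

```python
import random
from collections import defaultdict

def passes_qc(failed_candidate_genes, failed_control_genes):
    """Exclude on multiple gene failure (>= 3 candidate genes) or any control failure."""
    return failed_candidate_genes < 3 and failed_control_genes == 0

def stratified_split(samples, train_fraction, seed=0):
    """Randomize (sample_id, subtype) pairs to training/validation within each subtype."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample_id, subtype in samples:
        strata[subtype].append(sample_id)
    training, validation = [], []
    for ids in strata.values():
        rng.shuffle(ids)
        n_train = round(len(ids) * train_fraction)
        training.extend(ids[:n_train])
        validation.extend(ids[n_train:])
    return training, validation

# Cohort arithmetic from the text: 951 assayed, 32 excluded for MGF.
assert 951 - 32 == 919 == (200 + 216) + (273 + 230)

# Hypothetical example: 10 Spitz nevi and 40 common nevi.
samples = [(f"s{i}", "spitz" if i < 10 else "common") for i in range(50)]
training, validation = stratified_split(samples, train_fraction=0.4)
print(len(training), len(validation))  # 20 30
```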

Development of 35-GEP Profile
Artificial neural networks were selected as the model type because of their ability to recognize multiple patterns, which is critical for distinguishing the different subtypes of benign nevi and melanomas. To represent biological diversity and different growth patterns and features, the training set included both lesions unanimously diagnosed as benign by 3/3 reviewers and lesions with less definitive histopathology that reached 2/3 concordance, so that the resulting algorithm could classify both typical and heterogeneous lesions. A 35-GEP comprising 32 discriminant genes and 3 control genes was developed on a diverse set of benign and malignant samples, using neural networks for model fitting and genetic algorithms for feature selection. The 35-GEP is primarily composed of genes involved in cytoskeletal and barrier functions, gene regulation, and melanin biosynthesis (Table 3). Multiple molecular pathways have been associated with melanoma progression, and the 35-GEP signature includes several genes from key signaling networks to capture the complexity of the disease. Biological processes such as epithelial cell differentiation, tissue and epidermis development, programmed cell death, and keratinocyte differentiation were identified as the top functional enrichments for this gene set.
Table 6). One of the melanomas classified as benign was in situ, and another was a nodular melanoma with a Breslow thickness of 4.0 mm that had a low-risk prognostic Class 1B 31-GEP result. Among the 15 benign lesions classified as malignant by the 35-GEP, four were dysplastic nevi (one compound with mild atypia and three junctional with mild/moderate atypia), one was a compound nevus, one a combined blue and intradermal nevus, one a blue nevus, one a benign melanocytic nevus (not otherwise specified), and seven were Spitz nevi.
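The text states that genetic algorithms were used for feature selection but does not give their settings. The sketch below is a generic genetic algorithm over binary gene masks with tournament selection, uniform crossover, bit-flip mutation, and elitism; the population size, rates, and the toy `fitness` function are assumptions. In practice the fitness could be, for example, the cross-validated kappa of a model trained on the selected genes.

```python
import random

def ga_select(n_features, fitness, pop_size=30, generations=40,
              mutation_rate=0.05, seed=0):
    """Search binary feature masks for high fitness with a simple genetic algorithm.

    fitness(mask) -> float, higher is better. The best mask of each generation
    is copied unchanged into the next (elitism), so the best score never drops.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    scored = [(fitness(m), m) for m in population]

    def tournament():
        # Pick two random individuals; keep the fitter one.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        elite = max(scored, key=lambda s: s[0])[1]
        children = [elite[:]]
        while len(children) < pop_size:
            parent1, parent2 = tournament(), tournament()
            # Uniform crossover: each gene is taken from either parent.
            child = [g1 if rng.random() < 0.5 else g2
                     for g1, g2 in zip(parent1, parent2)]
            # Bit-flip mutation: include/exclude a gene at a small rate.
            child = [not g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        scored = [(fitness(m), m) for m in children]

    best_score, best_mask = max(scored, key=lambda s: s[0])
    return best_mask, best_score

# Toy fitness: reward masks matching a hypothetical set of informative genes.
target = [True, False, True, False, True, False, False, False]

def toy_fitness(mask):
    return -sum(a != b for a, b in zip(mask, target))

mask, score = ga_select(len(target), toy_fitness)
print(score, [i for i, keep in enumerate(mask) if keep])
```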
Six of the seven misclassified Spitz nevi were in pediatric patients, suggesting that this may be a limitation of the 35-GEP (accuracy metrics for the validation cohort excluding lesions with spitzoid features are provided in Table 4). Spitzoid lesions are particularly difficult to diagnose, as many have ambiguous histologic characteristics and may involve regional lymph nodes without increased mortality or malignant potential. 16
Dysplastic nevi had different degrees of atypia: a, mild (n=19), moderate (n=4), and severe (n=3); b, mild (n=24), moderate (n=2), and severe (n=3); c, mild (n=20) and moderate (n=6); d, mild (n=22) and moderate (n=17).
Table 3 (excerpt): CXCL14*, C-X-C motif chemokine 14 (tumorigenesis); S100A8*, Protein S100-A8 (tumorigenesis); S100A9*, Protein S100-A9 (tumorigenesis).

DISCUSSION
Dermatopathologists have a number of ancillary tools available to assist with the diagnosis of pigmented lesions, yet there is substantial diagnostic discordance that may lead to overtreatment of patients with benign lesions and undertreatment of patients with melanoma. 80 The 35-GEP test for distinguishing benign from malignant pigmented lesions was developed to improve diagnostic accuracy and reduce diagnostic uncertainty for difficult-to-diagnose cases. A cross-study analysis shows that in an independent validation cohort of 503 benign lesions and melanomas, the 35-GEP test demonstrated improved accuracy compared with other diagnostic tools, based on their primary validation studies (Table 7).

Validation of the 35-GEP
Unlike FISH, CGH or IHC, gene expression profiling captures transcriptomic events within the lesion and the surrounding tissue, allowing for a more comprehensive assessment of the biological changes that are associated with the transition to a malignant phenotype. 22,81,82 IHC generally allows for evaluation of changes in the expression of a single biomarker at the protein level, which can be limited by subjective quantification systems. 83 PRAME IHC has been reported as a reliable method to distinguish benign from malignant pigmented lesions; however, ~14% of nevi can have some staining for PRAME and the interpretation of positive staining (4+, ≥76% of immunoreactive tumor cells are PRAME positive) can be subjective. Thus, PRAME IHC requires further validation for widespread clinical use due to the potential for misdiagnosis of benign lesions as malignant. 24,84 In the current study, PRAME expression did not improve diagnostic accuracy above the results reported for the 35-GEP (data not shown).
In this study, the 35-GEP provided a definitive benign or malignant result for 96.4% of lesions. In cross-study comparison (Table 7), the 35-GEP test outperformed a 23-GEP diagnostic test, whose previously reported accuracy metrics for unequivocal samples ranged from 91.5-94% for sensitivity and 90.0-92.5% for specificity, with technical failures in 14.7% of samples. Moreover, ~15% of diagnostically concordant (i.e., 3 of 3) cases could not be classified as benign or malignant by the 23-GEP. 40,41 By comparison, the 35-GEP test demonstrated a sensitivity of 99.1% and specificity of 94.3% across all ages and a sensitivity of 99.1% and specificity of 96.2% in patients ≥18 years old, a low rate of technical failures (3.4%), and an intermediate-risk result in no more than 3.8% of cases. The improved classification compared with the 23-GEP test is likely due to the use of neural network modeling, which resulted in two algorithms with 32 diagnostic and 3 control genes; the inclusion of samples with different growth patterns (nine melanoma subtypes and eight benign subtypes in total) in the training cohort; and the incorporation of lesions with 2/3 concordance.
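The accuracy metrics quoted above can be computed from a confusion matrix with an extra intermediate-risk column. This sketch assumes that intermediate-risk results are excluded from the sensitivity and specificity denominators and counted against the definitive-result rate; the text does not spell out this convention, and the counts in the example are hypothetical rather than the study's data.

```python
def diagnostic_metrics(tp, fn, tn, fp, intermediate):
    """Accuracy metrics for a three-zone test.

    tp, fn: melanomas called malignant / benign
    tn, fp: nevi called benign / malignant
    intermediate: lesions of either class with an intermediate-risk result
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one definitive call per class")
    total = tp + fn + tn + fp + intermediate
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "definitive_rate": (total - intermediate) / total,
    }

# Hypothetical counts, for illustration only.
m = diagnostic_metrics(tp=95, fn=5, tn=90, fp=10, intermediate=50)
print(m)  # {'sensitivity': 0.95, 'specificity': 0.9, 'definitive_rate': 0.8}
```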
Data supporting the utility of the 23-GEP test in ambiguous or diagnostically discordant lesions are limited. 44 Recently, a sensitivity of 90.4% and specificity of 95.5% were reported for 125 'uncertain' cases; however, the definition of uncertainty was broad and classified lesions as discordant if just 1 of 7 reviewing dermatopathologists gave a differing diagnosis. 85 In this study, we included cases with concordance among 2 of 3 reviewing dermatopathologists in the independent 35-GEP validation. The 35-GEP was developed and validated using fully concordant lesions and a small set of 'borderline' cases in which no more than 1 of 3 dermatopathologists indicated 'unknown malignant potential' as a diagnosis. Because the 35-GEP will most likely be used for difficult-to-diagnose lesions, 2/3 concordant cases were included during test development so that differentially expressed genes from these histopathologically challenging cases were factored into the neural network configuration. With the improved accuracy metrics and substantially reduced intermediate-risk zone, dermatopathologists can expect a definitive result from the 35-GEP test in ≥95% of lesions submitted for testing. We hope that these improved test characteristics for the disambiguation of pigmented lesions will help refine guidelines for when to use GEP in the diagnosis of challenging pigmented lesions.
Although the vast majority of cases tested by the 35-GEP will have a definitive score of benign or malignant risk potential, 3 Up to one-third of melanomas arise from pre-existing nevi, so a subset of lesions may be clinically identified during this progression. 12,86 In addition, there are atypical melanocytic proliferations (AMPs) that never evolve to full malignancy despite metastasis to regional lymph nodes; the spectrum of outcomes for these lesions warrants special consideration in clinical management. 28 Clinical management of AMPs varies because there are no official guidelines governing their treatment, but common practice is definitive surgical treatment, with removal of the lesion and a margin of normal skin. 28 In addition, the 35-GEP can provide the dermatologist and/or patient with treatment options that cover the most severe diagnoses, including melanoma. Studies are underway in a true AMP population with Table 4. The Spitz subtype is particularly challenging, and thus far all available ancillary tests have had limitations in sensitivity and specificity. 42,87,88 Of note, the absence of spitzoid melanomas and the suboptimal classification of Spitz lesions in pediatric patients in this study mean that further studies are being undertaken to confirm whether this is a limitation of the 35-GEP.
For the dermatologist, metastatic risk assessment is critical for guiding appropriate patient management following a melanoma diagnosis. A prognostic 31-GEP test has been validated to determine individualized 5-year risk of recurrence, metastasis, and melanoma-specific survival. [89][90][91] Based on accuracy metrics and multivariate models demonstrating that the test is an independent and significant risk-prediction tool, the value of GEP testing as an adjunct to current staging factors has been recognized by the National Comprehensive Cancer Network. 92 Thus, patients diagnosed with malignant lesions have effective prognostic tools and contemporary therapies with demonstrated improvements in outcomes at their disposal.
Given the availability of the prognostic 31-GEP test for cutaneous melanoma, the 35-GEP test was developed to refine the diagnosis of benign nevi and melanomas by providing dermatopathologists with an objective ancillary tool for difficult-to-diagnose pigmented lesions. Clinically implemented GEP tests for diagnostically challenging melanocytic lesions have demonstrated a substantial impact on clinical decision-making. 93,94 Although not the focus of this study, assessment of the clinical utility of the 35-GEP, as well as correlation of test results with outcomes, is underway. In the zone of significant diagnostic uncertainty, the high accuracy metrics of the test may increase dermatopathologists' and dermatologists' confidence in the diagnosis while providing reassurance to patients. The test also provided a definitive result for 96.4% of lesions in the validation study, offering an opportunity to reduce the uncertainty associated with pigmented lesions and to support more definitive management by dermatologists. An ancillary test with the characteristics reported here could reduce expenditure on overdiagnosis by decreasing unnecessary surgeries, imaging, and follow-up, while allocating healthcare resources more appropriately to lesions in which malignant risk is identified.