Data science with Microsoft Fabric – Plotting ROC curve and distribution of scores (2024)

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]


The ROC (Receiver Operating Characteristic) curve is a graph that shows how a classifier performs by plotting the true positive rate against the false positive rate. It is used to evaluate binary classification models by illustrating the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at various threshold settings.

Key concepts

True Positive Rate (TPR): Also known as Sensitivity or Recall, it measures the proportion of actual positives correctly identified by the model.

TPR = TP / (TP + FN)

False Positive Rate (FPR): It measures the proportion of actual negatives that are incorrectly identified as positives.

FPR = FP / (FP + TN)

Area Under the Curve (AUC): This value summarizes the performance of the model. A value of 1 indicates a perfect model, while a value of 0.5 suggests a model with no discrimination ability.

ROC (Receiver Operating Characteristic) Curve: A graph that illustrates the performance of a binary classifier model at varying threshold values.

    How to read ROC

The ROC curve is the plot of the true positive rate against the false positive rate at each threshold setting. The threshold values correspond to the confusion-matrix statistics from which the TPR and FPR are computed.
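As a minimal sketch (using toy scores, not the post's dataset), the sweep over thresholds behind a ROC curve can be done by hand:

```python
import numpy as np

# Toy labels and classifier scores (hypothetical values for illustration)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

points = []
# Sweep thresholds from high to low; each threshold yields one (FPR, TPR) point
for t in sorted(set(scores), reverse=True):
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)

print(points)
```

Plotting these points in order traces the ROC curve: it starts near (0, 0) at the highest threshold and ends at (1, 1) at the lowest.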

[Figure: confusion matrix]

Before going into the threshold value, let's look at some synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
```

    Let’s split the data and train the model (using logistic regression):

```python
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get predicted probabilities and predicted classes
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for class 1
y_pred = model.predict(X_test)
```

    And check the statistics:

```python
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
```
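At this point the ROC curve itself can be plotted from fpr, tpr and roc_auc. A self-contained sketch (regenerating the same synthetic data as above, and assuming a headless matplotlib backend) might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# Recreate the same synthetic data and model as above
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot TPR against FPR, with the diagonal as the no-skill baseline
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey", label="No-skill baseline")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.savefig("roc_curve.png")
```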

And add the results to a data frame:

```python
import pandas as pd

# Helper to compute confusion-matrix counts at a given threshold
def get_confusion_matrix_stats(y_true, y_proba, threshold):
    y_pred_t = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_t).ravel()
    return tp, fn, fp, tn

confusion_stats = []

# Loop over each threshold and calculate TP, FN, FP, TN
for threshold in thresholds:
    tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, threshold)
    confusion_stats.append({
        'Threshold': threshold,
        'True Positive (TP)': tp,
        'False Negative (FN)': fn,
        'False Positive (FP)': fp,
        'True Negative (TN)': tn
    })

# Convert the results into a pandas DataFrame
confusion_df = pd.DataFrame(confusion_stats)
confusion_df
```

Here are the top and bottom five thresholds with the corresponding TP, FN, FP and TN statistics:

[Table: confusion-matrix statistics for the top and bottom five thresholds]

For the same ROC curve, let us set the threshold to 0.08 and check the statistics:

[Figure: ROC curve with the 0.08 threshold marked]

Adding the calculation formulas for TPR and FPR at this threshold:

[Figure: TPR and FPR calculations at threshold 0.08]

So, for that threshold we can compute the correctly and incorrectly classified data (based on the model's predictions) as the true positive rate (TPR) against the false positive rate (FPR).

The values are:
TP (True Positives) = 149
FN (False Negatives) = 6
FP (False Positives) = 80
TN (True Negatives) = 65
TPR = TP / (TP + FN) = 149 / (149 + 6) ≈ 0.9613
FPR = FP / (FP + TN) = 80 / (80 + 65) ≈ 0.5517
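The same arithmetic, checked in a few lines of Python (counts taken from the example above):

```python
# Confusion-matrix counts at threshold 0.08, as reported above
tp, fn, fp, tn = 149, 6, 80, 65

tpr = tp / (tp + fn)  # true positive rate (sensitivity / recall)
fpr = fp / (fp + tn)  # false positive rate

print(round(tpr, 7), round(fpr, 7))  # 0.9612903 0.5517241
```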


Each threshold value corresponds to a point on the graph, where FPR is plotted on the X-axis (values from 0 to 1) and TPR is plotted on the Y-axis (values from 0 to 1). All values lie on a scale from 0 to 1 since we are plotting the results of a binary classifier built with logistic regression.
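One common heuristic for picking a single operating threshold (not used in the original post, but a natural follow-up) is Youden's J statistic, which selects the threshold maximising TPR − FPR:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Recreate the data and model from above
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Youden's J statistic: the threshold where TPR - FPR is largest
j = tpr - fpr
best_idx = int(np.argmax(j))
best_threshold = thresholds[best_idx]
print(best_threshold, tpr[best_idx], fpr[best_idx])
```

The chosen point is the one furthest above the diagonal no-skill line; whether it is the right operating point still depends on the relative cost of false positives versus false negatives.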

    Where and how to understand the curve

To understand the curve and decide where to "cut" (how to segment the results), the score distributions of both classes can be displayed.

In binary classification, the class prediction for each instance is often made based on a continuous random variable X, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). If f1(x) is the probability density of the score for the positive class and f0(x) the density for the negative class, then the true positive rate is given by

TPR(T) = ∫_T^∞ f1(x) dx

and the false positive rate is given by

FPR(T) = ∫_T^∞ f0(x) dx.
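A quick numerical illustration of these integrals, assuming two hypothetical Gaussian score densities (f0 for negatives, f1 for positives; the parameters are made up for the example):

```python
from statistics import NormalDist

# Hypothetical score densities: negatives ~ N(0.3, 0.1), positives ~ N(0.7, 0.1)
f0 = NormalDist(mu=0.3, sigma=0.1)
f1 = NormalDist(mu=0.7, sigma=0.1)

T = 0.5  # threshold
tpr = 1 - f1.cdf(T)  # ∫_T^∞ f1(x) dx: positives scoring above the threshold
fpr = 1 - f0.cdf(T)  # ∫_T^∞ f0(x) dx: negatives scoring above the threshold

print(round(tpr, 4), round(fpr, 4))  # 0.9772 0.0228
```

Moving T to the right shrinks both integrals: fewer false positives, but also fewer true positives.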

Plotting the probabilities with a KDE (Kernel Density Estimate) from the Python seaborn package helps us understand how increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the ROC curve. The actual shape of the curve is determined by how much overlap the two distributions have.

```python
# Build a DataFrame of actual labels and predicted probabilities
df = pd.DataFrame({'y_test': y_test, 'y_pred_proba': y_pred_proba})

# Compute confusion-matrix stats at the chosen threshold
chosen_threshold = 0.08
y_pred_t = (y_pred_proba >= chosen_threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t).ravel()

# Now, let's mark the threshold on the distribution plot and display confusion matrix stats
plt.figure(figsize=(10, 6))

# Plot the distributions of predicted probabilities for both classes
sns.kdeplot(df[df['y_test'] == 0]['y_pred_proba'], label="Class 0 (actual)", fill=True, color="blue")
sns.kdeplot(df[df['y_test'] == 1]['y_pred_proba'], label="Class 1 (actual)", fill=True, color="orange")

# Highlight the chosen threshold on the x-axis
plt.axvline(chosen_threshold, color='green', linestyle='--', label=f'Threshold = {chosen_threshold:.2f}')
plt.text(chosen_threshold + 0.05, 0.5, f"TP = {tp}\nFN = {fn}\nFP = {fp}\nTN = {tn}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.5))
plt.xlabel('Predicted Probability')
plt.ylabel('Density')
plt.title('Distribution of Predicted Probabilities with Highlighted Threshold')
plt.legend()
plt.grid(True)
plt.show()
```

And the distributions for each class, with the True Negative, True Positive, False Negative and False Positive regions shown as different shades:

[Figure: KDE distributions of predicted probabilities per class, with the threshold marked]

Depending on the problem you are trying to solve, it is also crucial to decide which predicted values you want to evaluate and reduce or increase. For some problems you can reduce the complexity of the model by classifying the results in additional sub-models, making the solution more resilient and robust. This complexity reduction can be done using SVD, t-SNE, PCA, Isomap and others.
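As a minimal sketch of one of the listed techniques, PCA can project the 20-feature synthetic dataset from above down to two components:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Same synthetic dataset as above
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Project the 20 features onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (1000, 2)
```

The explained variance ratio (`pca.explained_variance_ratio_`) indicates how much structure the two components retain.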

As always, the complete Fabric notebook is available in the GitHub repository for Data science with Microsoft Fabric.

Working with Fabric, you can always investigate the Spark operations and runs further and optimise the workload. The complete analysis can also be done in Microsoft Fabric using the R language.


    Stay healthy and keep exploring!


    To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


