Quantile Normalization in R with the {TidyDensity} Package (2024)


This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers.


In data analysis, especially when dealing with multiple samples or distributions, ensuring comparability and removing biases are crucial. One powerful technique for achieving this is quantile normalization. This method aligns the distributions of values across different samples so that, after normalization, the samples share the same quantiles.

Quantile normalization is a statistical method used to adjust the distributions of values in different datasets so that they have similar quantiles. This technique is particularly valuable when working with high-dimensional data, such as gene expression data or other omics datasets, where ensuring comparability across samples is essential.

The quantile_normalize() function is a new addition to the TidyDensity package, designed to simplify the process of quantile normalization within R. Let’s delve into how this function works and how you can integrate it into your data analysis pipeline.

The quantile_normalize() function takes a numeric matrix as input, where each column represents a sample. Here’s a breakdown of its usage:

quantile_normalize(.data, .return_tibble = FALSE)
  • .data: A numeric matrix where each column corresponds to a sample that requires quantile normalization.
  • .return_tibble: A logical value (default: FALSE) indicating whether the output should be returned as a tibble.
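If your samples live in a data frame rather than a matrix, some preparation may be needed first. The sketch below uses only base R. The data frame expr_df and its column names are made up for illustration, and the final call is left commented out so the snippet runs even without TidyDensity installed.

```r
# Hypothetical data frame: one row per feature, one column per sample
expr_df <- data.frame(
  sample_a = c(5.1, 2.3, 3.8, 4.4),
  sample_b = c(4.0, 1.2, 6.5, 3.3),
  sample_c = c(3.9, 4.8, 2.1, 5.6)
)

# Convert to a numeric matrix, keeping samples in columns
expr_mat <- as.matrix(expr_df)

# Sanity checks before normalizing
stopifnot(is.matrix(expr_mat), is.numeric(expr_mat))
dim(expr_mat)

# With TidyDensity installed, the call would then be:
# result <- TidyDensity::quantile_normalize(expr_mat, .return_tibble = FALSE)
```

Keeping the conversion and the checks separate from the normalization call makes it easier to spot a non-numeric column before it reaches the function.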

When you apply quantile_normalize() to your data, you receive a list object. Judging by the example output later in this post, it contains the following components:

  1. normalized_data: A numeric matrix where each column has been quantile-normalized.
  2. row_means: The means of each row across the sorted columns, in ascending order.
  3. duplicated_ranks: The rows of the rank-index matrix in which the same index occurs in more than one column.
  4. duplicated_rank_row_indices: The row numbers of those rows.
  5. duplicated_rank_data: The corresponding rows of the input data.
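Because the return value is an ordinary R list, components are extracted by name with [[ ]]. One thing to watch for: a misspelled name returns NULL silently instead of raising an error. The sketch below illustrates this with a stand-in list (not real package output) and a small hypothetical helper, get_component(), that fails loudly instead.

```r
# Stand-in for a result list; the values are made up for illustration
result_like <- list(
  normalized_data = matrix(c(1.5, 2.5, 2.5, 1.5), ncol = 2),
  row_means       = c(1.5, 2.5)
)

# A correct name returns the component
result_like[["row_means"]]

# A misspelled name returns NULL without any warning
is.null(result_like[["row_meens"]])

# Hypothetical helper that stops on unknown names
get_component <- function(x, name) {
  if (!name %in% names(x)) {
    stop("No component named '", name, "'. Available: ",
         paste(names(x), collapse = ", "))
  }
  x[[name]]
}

get_component(result_like, "normalized_data")
```

If a component unexpectedly comes back as NULL, checking names(result) is a quick way to find the correct spelling.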

The quantile_normalize() function performs quantile normalization through the following steps:

  1. Sorting: Each column of the input matrix is sorted.
  2. Row Mean Calculation: The mean of each row across the sorted columns is computed.
  3. Normalization: Each column’s sorted values are replaced with the corresponding row means.
  4. Unsorting: The columns are restored to their original order, ensuring that the quantile-normalized matrix maintains the same structure as the input.
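The four steps above can be sketched in base R as follows. This is a minimal illustration of the general technique, not TidyDensity's own code: qn_sketch() is a hypothetical helper, it assumes a numeric matrix with at least two rows and no missing values, and it breaks ties by order of appearance. Its results are not guaranteed to match the package output shown in this post entry for entry.

```r
# Hypothetical helper: quantile-normalize the columns of a numeric matrix.
# Assumes at least two rows and no missing values.
qn_sketch <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) >= 2, !anyNA(x))

  # 1. Sorting: sort each column independently
  sorted <- apply(x, 2, sort)

  # 2. Row mean calculation: mean of each row across the sorted columns
  row_means <- rowMeans(sorted)

  # 3 and 4. Normalization and unsorting: give each value the row mean
  # that matches its rank within its own column
  ranks <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) row_means[r])

  dimnames(normalized) <- dimnames(x)
  list(normalized_data = normalized, row_means = row_means)
}

# Small worked example: 4 features, 3 samples (made-up values)
m <- cbind(a = c(5, 2, 3, 4), b = c(4, 1, 4.5, 2), c = c(3, 4, 6, 8))
out <- qn_sketch(m)
out$normalized_data
out$row_means
```

In this sketch, the largest value in each column receives the largest row mean, the second largest receives the second-largest row mean, and so on, so every column ends up holding the same set of values in its own original order.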

Let’s demonstrate the usage of quantile_normalize() with a simple example:

# Load TidyDensity
library(TidyDensity)

# Create a sample matrix
# Note: 50 values do not fill 4 columns evenly, so R recycles the first
# two values (with a warning) to complete a 13 x 4 matrix
set.seed(123)
data <- matrix(rnorm(50), ncol = 4)
head(data, 5)

            [,1]       [,2]       [,3]       [,4]
[1,] -0.56047565  0.1106827  0.8377870 -0.3804710
[2,] -0.23017749 -0.5558411  0.1533731 -0.6947070
[3,]  1.55870831  1.7869131 -1.1381369 -0.2079173
[4,]  0.07050839  0.4978505  1.2538149 -1.2653964
[5,]  0.12928774 -1.9666172  0.4264642  2.1689560
# Apply quantile normalization
result <- quantile_normalize(data)

# Access the quantile-normalized matrix
normalized_matrix <- result[["normalized_data"]]

# View the normalized matrix
head(normalized_matrix, 5)

            [,1]       [,2]        [,3]       [,4]
[1,] -0.65451945 -0.3180877  0.84500772 -0.6545195
[2,] -0.06327669  0.8450077  1.09078797 -0.9506544
[3,] -1.40880292 -0.5235134  0.33150422  0.0863713
[4,]  0.84500772  1.0907880  0.08637130  0.1991151
[5,] -0.31808774 -0.6545195 -0.06327669  0.3315042

Let’s now look at the rest of the output components:

head(result[["row_means"]], 5)
[1] -1.4088029 -0.9506544 -0.6545195 -0.5235134 -0.3180877
head(result[["duplicated_ranks"]], 5)

     [,1] [,2] [,3] [,4]
[1,]    9   13   13    7
[2,]   10   10   12   12
[3,]    2   11    2    9
[4,]   13    9    9    3
[5,]    7    1    1   11
head(result[["duplicated_rank_row_indices"]], 5)
[1]  2  4  5  9 10
head(result[["duplicated_rank_data"]], 5)

            [,1]       [,2]      [,3]       [,4]
[1,] -0.23017749 -0.5558411 0.1533731 -0.6947070
[2,]  0.07050839  0.4978505 1.2538149 -1.2653964
[3,]  0.12928774 -1.9666172 0.4264642  2.1689560
[4,] -0.68685285 -0.2179749 0.8215811 -0.4666554
[5,] -0.44566197 -1.0260044 0.6886403  0.7799651

Now, let's take a look at the quantile summary before and after quantile normalization:

as.data.frame(data) |> sapply(function(x) quantile(x, probs = seq(0, 1, 1/4)))

             V1         V2          V3          V4
0%   -1.2650612 -1.9666172 -1.13813694 -1.26539635
25%  -0.4456620 -1.0260044 -0.06191171 -0.56047565
50%   0.1292877 -0.5558411  0.55391765 -0.38047100
75%   0.4609162  0.1106827  0.83778704 -0.08336907
100%  1.7150650  1.7869131  1.25381492  2.16895597

as.data.frame(normalized_matrix) |> sapply(function(x) quantile(x, probs = seq(0, 1, 1/4)))

              V1          V2          V3          V4
0%   -1.40880292 -1.40880292 -1.40880292 -1.40880292
25%  -0.52351344 -0.52351344 -0.52351344 -0.52351344
50%  -0.06327669 -0.06327669 -0.06327669 -0.06327669
75%   0.33150422  0.33150422  0.33150422  0.33150422
100%  1.73118725  1.73118725  1.73118725  1.73118725
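The identical quantiles in the second summary can also be checked programmatically. The base R sketch below assumes a hypothetical helper, same_distribution(), that tests whether every column of a matrix holds the same values once sorted. The two small matrices are made up for illustration.

```r
# Hypothetical helper: TRUE if all columns contain the same sorted values
same_distribution <- function(x, tolerance = 1e-8) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) >= 2)
  sorted <- apply(x, 2, sort)
  # Compare every sorted column against the first one
  all(abs(sorted - sorted[, 1]) <= tolerance)
}

# Columns with different spreads: not aligned
raw <- cbind(s1 = c(1, 5, 3), s2 = c(10, 20, 60))
same_distribution(raw)      # FALSE

# Columns that are permutations of the same values: aligned
aligned <- cbind(s1 = c(2, 9, 4), s2 = c(9, 2, 4))
same_distribution(aligned)  # TRUE
```

A check like this looks at every value rather than five quantiles, so it is a stricter test than the summary tables.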

Now let’s use the .return_tibble argument to return the output as a tibble:

quantile_normalize(data, .return_tibble = TRUE)
$normalized_data
# A tibble: 13 × 4
        V1      V2      V3      V4
     <dbl>   <dbl>   <dbl>   <dbl>
 1 -0.655  -0.318   0.845  -0.655
 2 -0.0633  0.845   1.09   -0.951
 3 -1.41   -0.524   0.332   0.0864
 4  0.845   1.09    0.0864  0.199
 5 -0.318  -0.655  -0.0633  0.332
 6  1.73   -0.0633 -0.133  -0.133
 7 -0.524  -0.133  -0.524  -0.524
 8 -0.133   1.73    1.73    1.73
 9  0.332   0.0864  0.199   1.09
10  1.09   -0.951  -0.655  -0.318
11 -0.951  -1.41   -0.318  -1.41
12  0.199   0.199  -1.41    0.845
13  0.0864  0.332  -0.951  -0.0633

$row_means
# A tibble: 13 × 1
     value
     <dbl>
 1 -1.41
 2 -0.951
 3 -0.655
 4 -0.524
 5 -0.318
 6 -0.133
 7 -0.0633
 8  0.0864
 9  0.199
10  0.332
11  0.845
12  1.09
13  1.73

$duplicated_ranks
# A tibble: 6 × 4
     V1    V2    V3    V4
  <int> <int> <int> <int>
1     9    13    13     7
2    10    10    12    12
3     2    11     2     9
4    13     9     9     3
5     7     1     1    11
6     3     6     7     6

$duplicated_rank_row_indices
# A tibble: 6 × 1
  row_index
      <int>
1         2
2         4
3         5
4         9
5        10
6        12

$duplicated_rank_data
# A tibble: 6 × 4
       V1     V2      V3     V4
    <dbl>  <dbl>   <dbl>  <dbl>
1 -0.230  -0.556  0.153  -0.695
2  0.0705  0.498  1.25   -1.27
3  0.129  -1.97   0.426   2.17
4 -0.687  -0.218  0.822  -0.467
5 -0.446  -1.03   0.689   0.780
6  0.360  -0.625 -0.0619 -0.560

Conclusion

In summary, the quantile_normalize() function from the TidyDensity package offers a convenient way to perform quantile normalization on numeric matrices in R. It places every sample on a common set of quantiles, which makes values comparable across samples or distributions. Consider adding quantile_normalize() to your data preprocessing workflow when sample-to-sample comparability matters.

To explore more functionalities of TidyDensity and leverage its capabilities for advanced data analysis tasks, check out the package documentation and experiment with different parameters and options provided by the quantile_normalize() function.
