Normal Data

A number of statistical tools require that the underlying data be normally distributed.  No real-world data set is perfectly normal, but when a given statistical tool requires normality, the data should be checked to confirm that it is reasonably normal.  Note: The 3.4 defects-per-million (DPM) level associated with Six Sigma processes assumes normal data.
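To make the note about 3.4 DPM concrete, here is a minimal sketch of the arithmetic using Python's standard library.  It rests on an assumption not stated above: the conventional 1.5-sigma long-term shift, which leaves the nearer specification limit 4.5 standard deviations from the mean.  The function name defects_per_million is ours, chosen for illustration.

```python
from statistics import NormalDist


def defects_per_million(z_limit: float) -> float:
    """One-sided tail area of a standard normal beyond z_limit, per million."""
    tail = 1.0 - NormalDist().cdf(z_limit)
    return tail * 1_000_000


# Assumption: a 1.5-sigma shift of the process mean, so the nearer
# spec limit sits 6.0 - 1.5 = 4.5 standard deviations away.
dpm_shifted = defects_per_million(6.0 - 1.5)
print(round(dpm_shifted, 1))
```

The result is only as good as the normality assumption: the tail area comes straight from the normal curve, so a process with heavier or lighter tails would give a different defect rate at the same sigma level.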

First Option – Plot a Histogram

In the DMAIC world, plotting a histogram and looking at its shape is usually sufficient for checking normality.  The only exception is when sample sizes are very small, in which case a normal probability plot (below) may be the better approach.  Normally distributed data will form a bell-shaped histogram, with the highest bars in the middle and progressively smaller bars toward the edges, as shown in the following data (randomly generated using MINITAB):

[Figure: histogram of normally distributed data]
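For readers without MINITAB, a rough text version of such a histogram can be sketched with Python's standard library.  Everything specific here is an assumption for illustration: the mean of 100, the standard deviation of 10, the sample size of 500, the bin width of 5, and the helper name text_histogram.

```python
import random
from collections import Counter


def text_histogram(values, bin_width):
    """Return (bin_start, count) pairs for equal-width bins, lowest bin first."""
    counts = Counter(int(v // bin_width) for v in values)
    # Walk every bin between the lowest and highest so empty bins show as 0.
    keys = range(min(counts), max(counts) + 1)
    return [(k * bin_width, counts[k]) for k in keys]


random.seed(1)  # fixed seed so the sketch is repeatable
# Hypothetical process data: 500 readings, mean 100, standard deviation 10.
sample = [random.gauss(100.0, 10.0) for _ in range(500)]

bins = text_histogram(sample, bin_width=5.0)
for start, count in bins:
    # One '#' per two observations keeps the bars a readable width.
    print(f"{start:6.1f} | {'#' * (count // 2)}")
```

The longest bars should sit near the middle of the range and shrink toward both edges, which is the bell shape to look for.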

Second Option – Normal Probability Plot


Normal probability plots can take different forms, but all have one thing in common:  the closer the data points are to the theoretical-normal line, the more likely it is that the data is normal.

The normal probability plots below show data values on the x-axis against the cumulative percentage of data points collected on the y-axis.  The blue line on the chart reflects a perfectly normal distribution:

[Figure: normal probability plot, with the theoretical-normal line in blue]
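The construction behind such a plot can be sketched in a few lines of standard-library Python.  This is not MINITAB's method: the plotting position (i - 0.5)/n is one common convention that we assume here, and the function names and sample data are hypothetical.  Each sorted value is paired with its cumulative percentage and with the z-score a normal distribution would give at that percentage; the closer the points fall to a straight line, the closer the correlation is to 1.

```python
import math
import random
from statistics import NormalDist, mean


def probability_plot_points(values):
    """Pair each sorted value with its cumulative percentage and normal z-score."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n              # assumed plotting position
        z = NormalDist().inv_cdf(p)    # where a normal distribution puts p
        points.append((x, 100.0 * p, z))
    return points


def straightness(points):
    """Pearson correlation between the data values and their z-scores."""
    xs = [x for x, _, _ in points]
    zs = [z for _, _, z in points]
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / math.sqrt(sxx * szz)


random.seed(2)  # fixed seed so the sketch is repeatable
normal_sample = [random.gauss(50.0, 5.0) for _ in range(200)]   # hypothetical
skewed_sample = [random.expovariate(1.0) for _ in range(200)]   # hypothetical

r_normal = straightness(probability_plot_points(normal_sample))
r_skewed = straightness(probability_plot_points(skewed_sample))
print(f"normal: {r_normal:.3f}  skewed: {r_skewed:.3f}")
```

The normal sample should score noticeably closer to 1 than the skewed one.  The correlation is only a numeric stand-in for eyeballing the plot, not a formal normality test.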

Here are some examples of normal and non-normal data (made into histograms), and their corresponding probability plots (generated with MINITAB software).

[Figures: histograms of normal and non-normal data, with their corresponding normal probability plots]

Note that the histograms are as indicative of normality (or non-normality) as the probability plots in these cases.

Defect-Rate Predictions and Non-Normal Data

Statistical techniques are available for dealing with non-normal data, but we’d like to bring some “real-world” perspective into the discussion from a Six Sigma practitioner’s viewpoint: Six Sigma practitioners get paid to reduce variation, not to model it.  It is far better for a team to put its energy into learning the underlying causes of variation than to get wrapped up in finding the correct distribution or transformation method for making defect-rate predictions.

Once the underlying causes are understood, process redesign and process control provide much greater assurance of zero defects over the long run than the fact that a sample taken from the population happened to be normal and capable at one point in time.

So the message here is that there are very few cases where non-normal data should stop a project from moving forward.
