# Standardization

ML CH2

**Definition:** Standardization is the process of rescaling values so that each value becomes itself minus the mean, divided by the standard deviation.

This is preferable in some cases because [MinMaxScaling](MinMaxScaling.md) is sensitive to outliers. If one outlier is much bigger than all other values, the max will be very large, squeezing most values into a narrow range of small numbers, which can affect the accuracy of models.

See [FeatureScaling](FeatureScaling.md) for more.

Sample implementation:

```python
# Assumes `df` is an existing pandas DataFrame.

# Keep only the numeric columns
df = df.select_dtypes(include=['number'])

# Standardize each column: subtract its mean, divide by its standard deviation.
# Note: pandas' .std() uses the sample standard deviation (ddof=1) by default.
for col in df:
    mean = df[col].mean()
    std = df[col].std()
    df[col] = (df[col] - mean) / std

print(df)
```

## Probabilistic Interpretation

Standardization maps an arbitrary [NormalDistribution](NormalDistribution.md) onto the standard normal distribution, which is centered at 0 with a standard deviation of 1. This is done by subtracting the mean of the distribution from each element and then dividing the resulting values by the standard deviation.

We do this because there is no closed-form solution for the percentiles of a normal (Gaussian) distribution, so we use a lookup table that assumes the distribution is centered at 0 with a standard deviation of 1. The mean and standard deviation are all that is needed to fully describe a Gaussian distribution.
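
The percentile lookup described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function names `standard_normal_cdf` and `percentile_of` and the example numbers (mean 100, standard deviation 15) are hypothetical, and `math.erf` from the Python standard library stands in for the printed lookup table.

```python
import math


def standard_normal_cdf(z: float) -> float:
    """Fraction of the standard normal distribution (mean 0, std 1) below z."""
    # math.erf plays the role of the lookup table here.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_of(x: float, mean: float, std: float) -> float:
    """Percentile (as a fraction) of x under a normal distribution with the given mean and std."""
    # Step 1: standardize x so it can be looked up on the standard normal.
    z = (x - mean) / std
    # Step 2: look up the standardized value.
    return standard_normal_cdf(z)


# Hypothetical example: a value one standard deviation above the mean.
p = percentile_of(115.0, mean=100.0, std=15.0)
print(f"{p:.4f}")  # 0.8413
```

Because every normal distribution reduces to the same standard normal after standardization, one table (or one function) is enough for any mean and standard deviation.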