# Standardization

ML CH2

**Definition:** Standardization rescales each value by subtracting the mean and then dividing by the standard deviation: `z = (x - mean) / std`. The result has a mean of 0 and a standard deviation of 1.

This is preferable in some cases because [MinMaxScaling](MinMaxScaling.md) has issues with outliers. If one outlier is much bigger than all other values, the max will be very large, squishing most values into a narrow range of low numbers, which can affect the accuracy of models. Standardization does not force values into a fixed range, so it is less affected by outliers.
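
As a rough illustration of the outlier issue, here is a sketch using only the Python standard library. The helper names `min_max_scale` and `standardize` and the data are made up for this example. Both methods are linear rescalings, so the outlier still has influence either way; the point is only how tightly the typical values end up packed.

```python
from statistics import mean, stdev

def min_max_scale(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical data: 100 typical values (1..100) plus one large outlier
values = [float(v) for v in range(1, 101)] + [10000.0]

scaled = min_max_scale(values)
standardized = standardize(values)

# Width of the band that the 100 typical values occupy after each transform
spread_min_max = max(scaled[:100]) - min(scaled[:100])
spread_standardized = max(standardized[:100]) - min(standardized[:100])

print(f"min-max spread of typical values:      {spread_min_max:.4f}")
print(f"standardized spread of typical values: {spread_standardized:.4f}")
```

With min-max scaling the typical values are confined to a thin slice at the bottom of `[0, 1]`, while the standardized values keep a wider spread because the standard deviation grows more slowly than the max as the outlier gets larger.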

See [FeatureScaling](FeatureScaling.md) for more.

Sample implementation:
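
```python
import pandas as pd

# Small placeholder frame so the snippet runs; swap in your own data
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0], "name": ["x", "y", "z"]})

# Keep only the numeric columns
df = df.select_dtypes(include=['number'])

# Standardize each column: subtract its mean, divide by its standard deviation
for col in df:
    mean = df[col].mean()
    std = df[col].std()  # pandas uses the sample standard deviation (ddof=1) by default
    df[col] = (df[col] - mean) / std

print(df)
```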

## Probabilistic Interpretation

Standardization maps an arbitrary [NormalDistribution](NormalDistribution.md) onto the standard normal distribution, which is centered at 0 with a standard deviation of 1. This is done by subtracting the mean of the distribution from each element and then dividing the result by the standard deviation.
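
A quick way to check this mapping is to draw samples from some normal distribution, standardize them, and look at the resulting mean and standard deviation. The sketch below assumes a made-up distribution (mean 50, standard deviation 12) and uses the sample statistics in place of the true parameters.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical source distribution: normal with mean 50 and standard deviation 12
samples = [random.gauss(50, 12) for _ in range(10_000)]

mu = mean(samples)
sigma = stdev(samples)

# Subtract the mean from each element, then divide by the standard deviation
z_scores = [(x - mu) / sigma for x in samples]

z_mean = mean(z_scores)
z_std = stdev(z_scores)

print(f"before: mean={mu:.2f}, std={sigma:.2f}")
print(f"after:  mean={abs(z_mean):.2f}, std={z_std:.2f}")
```

Because the sample's own mean and standard deviation are used, the standardized values come out with mean 0 and standard deviation 1 up to floating-point error.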

We do this because the percentiles of a normal (Gaussian) distribution have no closed-form solution, so we use a lookup table that assumes the distribution is centered at 0 with a standard deviation of 1. Because the mean and standard deviation are all that is needed to fully describe a Gaussian distribution, standardizing lets the same table serve any normal distribution.
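
The lookup idea can be sketched in code, with `math.erf` standing in for the printed table. The distribution parameters and query point are hypothetical, and `standard_normal_cdf` is a helper name chosen for this example. `statistics.NormalDist` (Python 3.8+) is used only to cross-check the answer against the original, unstandardized distribution.

```python
from math import erf, sqrt
from statistics import NormalDist

def standard_normal_cdf(z):
    """Fraction of a standard normal distribution (mean 0, std 1) lying below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical distribution and query point
mu, sigma = 100.0, 15.0
x = 130.0

z = (x - mu) / sigma                     # standardize the query point
p_via_z = standard_normal_cdf(z)         # "table lookup" on the standard normal
p_direct = NormalDist(mu, sigma).cdf(x)  # same question asked of the original distribution

print(f"z = {z:.2f}, fraction below x = {p_via_z:.4f}")
# prints: z = 2.00, fraction below x = 0.9772
```

Both routes give the same probability, which is why a single table for the standard normal is enough.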