Data Normalization with Spark
Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly.
In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by a lack of normalization.
The Spark based implementation is available in my open source project chombo. There is also a Hadoop based implementation in the same project.
Some Machine Learning algorithms are sensitive to the relative magnitudes of the feature attributes. Normalization alleviates this problem.
The K Nearest Neighbor (KNN) algorithm is based on the distance between records. Unless the data is normalized, the distance will be skewed, because different attributes will not contribute to it in a uniform way. Attributes with a larger value range will have an unduly large influence on the distance.
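To make the effect concrete, here is a minimal sketch in plain Python. The three houses and the attribute ranges used for rescaling are made-up values for illustration only. It shows how the nearest neighbor of a record can change once the attributes are brought to a common scale.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical houses: (floor area in sq ft, number of bedrooms)
h1 = (1500.0, 2.0)
h2 = (1520.0, 5.0)  # similar area, very different bedroom count
h3 = (1600.0, 2.0)  # different area, same bedroom count

# On raw values, floor area dominates the distance
raw_12 = euclidean(h1, h2)
raw_13 = euclidean(h1, h3)

# Assumed (min, max) range per attribute, used to rescale to [0, 1]
ranges = [(500.0, 3500.0), (1.0, 6.0)]

def scale(rec):
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(rec, ranges))

scaled_12 = euclidean(scale(h1), scale(h2))
scaled_13 = euclidean(scale(h1), scale(h3))

print(raw_12 < raw_13)        # on raw data, h2 looks closer to h1
print(scaled_12 > scaled_13)  # after scaling, h3 is closer to h1
```

On the raw data, the 3 bedroom difference barely registers next to a 100 sq ft difference in floor area. After rescaling, both attributes contribute on comparable terms.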
In an Artificial Neural Network (ANN), linear algebra operations are performed between the input vector and the weight vectors. With ANN, normalization is not strictly necessary, as the weights can accommodate varying ranges of the input feature attributes. However, training can be more efficient and convergence can be reached faster when the data is normalized.
In Support Vector Machine (SVM), the algorithm finds the hyperplane separating the data points belonging to the different classes using optimization techniques, and distance calculation enters the picture. Hence, normalization becomes a necessity. However, if kernel functions are used instead of calculating distance directly, the kernel function may be able to handle the difference in scales between attributes, and normalization may be skipped.
As a counterexample, let’s consider Decision Tree and Random Forest. In a Decision Tree, the feature space is subdivided into different regions, with data homogeneity in each region as the criterion. The algorithm operates on each attribute independently, and the relative values of different attributes are irrelevant. So, normalization is not necessary.
There are various normalization techniques. The appropriate technique depends on the machine learning algorithm to be used on the normalized data. The most popular techniques are minmax and zscore.
The minmax technique is based on the min and max values of the attribute, as follows. Normalized values will be between 0 and 1.
vn = (v - vmin) / (vmax - vmin) where
vn = normalized value
v = original value
vmin = minimum value
vmax = maximum value
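The minmax formula can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the chombo implementation. The function name, the sample floor areas, and the choice to return 0 when all values are identical are assumptions made here.

```python
def minmax_normalize(values):
    """Scale values to [0, 1] using vn = (v - vmin) / (vmax - vmin)."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        # Assumption: a constant attribute maps to 0 to avoid division by zero
        return [0.0 for _ in values]
    return [(v - vmin) / span for v in values]

# Hypothetical floor areas in sq ft
floor_areas = [1200.0, 1800.0, 2400.0, 3000.0]
normalized = minmax_normalize(floor_areas)
print(normalized)
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses everything else into a narrow band.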
The max technique only uses the maximum absolute value for normalization. The normalized values will be between -1 and 1.
vn = v / vamax where
vamax = max(abs(vmax), abs(vmin))
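As a quick illustration of the max technique, here is a plain Python sketch. The function name and sample values are hypothetical, and returning zeros for an all-zero attribute is an assumption.

```python
def max_abs_normalize(values):
    """Scale values to [-1, 1] using vn = v / vamax."""
    vamax = max(abs(max(values)), abs(min(values)))
    if vamax == 0:
        # Assumption: an all-zero attribute is returned unchanged
        return [0.0 for _ in values]
    return [v / vamax for v in values]

# Hypothetical attribute with both negative and positive values
values = [-50.0, 10.0, 25.0]
normalized = max_abs_normalize(values)
print(normalized)
```

Unlike minmax, this technique does not shift the data, so zero stays at zero and the signs are preserved.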
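The zscore technique is based on mean and standard deviation. The normalized data will have a mean of 0 and a standard deviation of 1, which is why this technique is also known as standardization. It does not change the shape of the distribution: only if the original data is normally distributed will the normalized data follow the standard normal distribution N(0,1), which is a normal distribution with a mean of 0 and standard deviation of 1. In that case, about 68% of the normalized values will be between -1 and 1.
vn = (v - vmean) / s where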
vmean = mean value
s = standard deviation
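Here is a small plain Python sketch of the zscore formula. It uses the population standard deviation from the standard library, which is an assumption, since the formula above does not say whether the population or sample version is meant. The function name and the data are also made up for illustration.

```python
from statistics import mean, pstdev

def zscore_normalize(values):
    """Standardize values using vn = (v - vmean) / s."""
    vmean = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    if s == 0:
        # Assumption: a constant attribute maps to 0
        return [0.0 for _ in values]
    return [(v - vmean) / s for v in values]

# Hypothetical values with mean 5 and population standard deviation 2
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = zscore_normalize(values)
print(z)
print(mean(z), pstdev(z))  # mean 0 and standard deviation 1, up to rounding
```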
The center technique is based on mean only as below. The normalized data is not constrained by any range limit.
vn = v - vmean
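Centering is the simplest of the techniques. A minimal plain Python sketch, with a hypothetical function name and made-up values:

```python
from statistics import mean

def center(values):
    """Shift values so that their mean becomes 0, using vn = v - vmean."""
    vmean = mean(values)
    return [v - vmean for v in values]

# Hypothetical attribute values
values = [10.0, 20.0, 30.0]
centered = center(values)
print(centered)
```

The spread of the data is untouched, so attributes with different value ranges still have different scales after centering.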
In the decimal technique, the value is scaled by a quantity which is a power of 10 and greater than the maximum absolute value. Normalized values will be within the limits -1 and 1.
vn = v / 10^m where
vamax = max(abs(vmax), abs(vmin))
m = smallest integer such that 10^m is greater than vamax
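The only subtle part of decimal scaling is finding the exponent m. The plain Python sketch below searches for it with simple loops rather than logarithms, to avoid rounding trouble at exact powers of 10. The function names and sample values are hypothetical, and leaving an all-zero attribute unchanged is an assumption.

```python
def decimal_exponent(vamax):
    """Smallest integer m such that 10^m is strictly greater than vamax (vamax > 0)."""
    m = 0
    while 10.0 ** m <= vamax:       # grow m until 10^m exceeds vamax
        m += 1
    while 10.0 ** (m - 1) > vamax:  # shrink m while a smaller power still exceeds vamax
        m -= 1
    return m

def decimal_normalize(values):
    """Scale values using vn = v / 10^m."""
    vamax = max(abs(max(values)), abs(min(values)))
    if vamax == 0:
        # Assumption: an all-zero attribute is returned unchanged
        return list(values)
    scale = 10.0 ** decimal_exponent(vamax)
    return [v / scale for v in values]

# Hypothetical attribute values; vamax is 978, so m is 3
values = [-345.0, 12.0, 978.0]
normalized = decimal_normalize(values)
print(decimal_exponent(978.0), normalized)
```

Because 10^m must be strictly greater than vamax, a maximum absolute value of exactly 1000 gives m = 4, and the normalized values stay strictly inside -1 and 1.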
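The unitSum technique is based on the sum of the values, as below. The normalized values add up to 1. If all the original values are non-negative, the normalized values lie between 0 and 1; otherwise the normalized data is not constrained by any range limit.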
vn = v / sum where
sum = ∑vi
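A plain Python sketch of the unitSum formula follows. The function name and the sample values are hypothetical, and raising an error when the sum is 0 is an assumption, since the formula is undefined in that case.

```python
def unit_sum_normalize(values):
    """Scale values so that they add up to 1, using vn = v / sum."""
    total = sum(values)
    if total == 0:
        # Assumption: the technique is undefined for a zero sum, so fail loudly
        raise ValueError("sum of values is zero; unitSum is undefined")
    return [v / total for v in values]

# Hypothetical non-negative attribute values
values = [10.0, 20.0, 30.0, 40.0]
normalized = unit_sum_normalize(values)
print(normalized, sum(normalized))
```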
All the techniques described are sensitive to outliers, since a single extreme value distorts the min, max, mean, or sum used for scaling. The zscore technique provides the option of purging outlier data while normalizing. Since outliers have a high absolute zscore, we could remove any record whose absolute zscore is above some threshold.
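The purging idea can be sketched for a single attribute in plain Python. This is an illustration only. The function name, the data, the threshold of 2, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def zscore_purge(values, threshold):
    """Return (value, zscore) pairs, dropping values with abs(zscore) above threshold."""
    vmean = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    kept = []
    for v in values:
        z = (v - vmean) / s if s > 0 else 0.0
        if abs(z) <= threshold:
            kept.append((v, z))
    return kept

# Hypothetical attribute values with one obvious outlier (50)
values = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 50.0]
kept = zscore_purge(values, threshold=2.0)
print([v for v, _ in kept])
```

Note that the mean and standard deviation in this sketch are computed with the outlier still present, so the zscores of the remaining values are based on slightly inflated statistics.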
We will use house price data as the use case. Consider a scenario where you want to build a regression-based predictive model for house prices based on various input feature attributes.
You have also decided to use the KNN regression algorithm. As alluded to earlier, nearest neighbor based algorithms perform distance calculation, which requires normalized data. Here are the attributes of the house price data set.
- transaction ID
- zip code
- floor area
- number of bedrooms
- number of bathrooms
Here is some sample input
Normalization Spark Job
The normalization is implemented in the Scala object Normalizer. We are doing zscore normalization and also purging outlier records. The threshold for outliers has been set at 2 x the standard deviation. Here is some sample output
The first field, which is the ID, and the last field, which is the output field for regression analysis, have been skipped from normalization.
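The overall logic of such a job can be sketched in plain Python, without Spark. This is not the Normalizer code from chombo; it is a rough single-machine illustration. The function name, its parameters, the CSV layout (ID, floor area, bedrooms, price), and all the records are hypothetical.

```python
from statistics import mean, pstdev

def normalize_records(lines, skip_cols, threshold=None, delim=","):
    """zscore-normalize every column not in skip_cols; optionally drop outlier records."""
    rows = [line.split(delim) for line in lines]
    target = [c for c in range(len(rows[0])) if c not in skip_cols]

    # First pass: per-column mean and (population) standard deviation
    stats = {}
    for c in target:
        col = [float(r[c]) for r in rows]
        stats[c] = (mean(col), pstdev(col))

    # Second pass: replace values with zscores, purging outlier records
    out = []
    for r in rows:
        new_row = list(r)
        keep = True
        for c in target:
            mu, sd = stats[c]
            z = (float(r[c]) - mu) / sd if sd > 0 else 0.0
            if threshold is not None and abs(z) > threshold:
                keep = False
                break
            new_row[c] = f"{z:.3f}"
        if keep:
            out.append(delim.join(new_row))
    return out

# Hypothetical records: ID, floor area, bedrooms, price
sample = [
    "T001,1500,3,450000",
    "T002,1600,3,470000",
    "T003,1550,3,460000",
    "T004,1450,2,400000",
    "T005,1650,4,520000",
    "T006,1500,3,455000",
    "T007,1600,2,430000",
    "T008,6000,4,1500000",  # floor area outlier
]

# Skip the first (ID) and last (price) columns, purge beyond 2 standard deviations
result = normalize_records(sample, skip_cols={0, 3}, threshold=2.0)
for line in result:
    print(line)
```

In a distributed setting, the first pass would be an aggregation over the data set and the second pass a map and filter over the records, but the two-pass structure stays the same.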
We have used house price data as an example and gone through the data normalization process using a Spark based implementation. Execution steps for this use case are detailed in the tutorial document.