Data Normalization with Spark

Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly.

In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by a lack of normalization.

The Spark based implementation is available in my open source project chombo. There is also a Hadoop based implementation in the same project.

Why Normalize

Some Machine Learning algorithms are sensitive to the relative magnitudes of the feature attributes. Normalization alleviates this problem.

The K Nearest Neighbor Algorithm (KNN) is based on the distance between records. Unless the data is normalized, the distance will be distorted, because different attributes will not contribute to it in a uniform way. Attributes with a larger value range will have an unduly large influence on the distance.
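As a rough illustration of this effect, the sketch below compares Euclidean distances before and after minmax scaling. The three records and the attribute ranges are made-up values chosen only for the example, and the helper names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up records: (floor area in sq ft, number of bedrooms)
house_a = (1500.0, 2.0)
house_b = (1520.0, 5.0)  # similar area, very different bedroom count
house_c = (1600.0, 2.0)  # same bedroom count, somewhat larger area

# On raw values the floor area dominates, so b looks closer to a than c does.
raw_ab = euclidean(house_a, house_b)
raw_ac = euclidean(house_a, house_c)

# Assumed (min, max) range of each attribute, used for minmax scaling.
ranges = [(500.0, 3500.0), (1.0, 6.0)]

def minmax(record):
    """Scale each attribute of a record into [0, 1] using the assumed ranges."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(record, ranges))

# After scaling, both attributes contribute comparably and c is closer to a.
norm_ab = euclidean(minmax(house_a), minmax(house_b))
norm_ac = euclidean(minmax(house_a), minmax(house_c))

print(raw_ab < raw_ac)    # True
print(norm_ab > norm_ac)  # True
```

The nearest neighbor of the first record changes once the attributes are on a comparable scale, which is the kind of distortion normalization is meant to remove.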

In an Artificial Neural Network (ANN), linear algebra operations are performed between the input vector and the weight vectors. With ANN, normalization is not strictly necessary, as the weights can accommodate varying ranges of the input feature attributes. However, training can be more efficient and convergence can be reached faster when the data is normalized.

In Support Vector Machine (SVM), the algorithm uses optimization techniques to find the hyperplane separating the data points that belong to different classes, and distance calculation enters the picture. Hence, normalization becomes a necessity. However, if kernel functions are used instead of calculating distance directly, the function may be able to handle differences in scale between attributes, in which case normalization may be skipped.

As a counter example, let’s consider Decision Tree and Random Forest. In a Decision Tree, the feature space is subdivided into different regions, with data homogeneity in each region as the criterion. The algorithm operates on each attribute independently, and the relative values of different attributes are irrelevant. So, normalization is not necessary.

Normalization Techniques

There are various normalization techniques. The appropriate technique depends on the machine learning algorithm that will be applied to the normalized data. The most popular techniques are minmax and zscore.

The minmax technique is based on the min and max values of the attribute as follows. Normalized values will be between 0 and 1.

vn = (v - vmin) / (vmax - vmin)       where
vn = normalized value
v = original value
vmin = minimum value 
vmax = maximum value
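The minmax formula can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the chombo implementation; the function name and the choice to return 0.0 for a constant attribute are assumptions.

```python
def minmax_normalize(values):
    """Scale values into [0, 1] using vn = (v - vmin) / (vmax - vmin)."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        # Assumption: a constant attribute carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - vmin) / span for v in values]

# Made-up floor areas, for illustration only.
areas = [1000.0, 1500.0, 2000.0, 3000.0]
print(minmax_normalize(areas))  # [0.0, 0.25, 0.5, 1.0]
```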

The max technique uses only the maximum absolute value for normalization. The normalized values will be between -1 and 1.

vn = v / vamax    where 
vamax = max(abs(vmax), abs(vmin))
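A minimal sketch of the max technique follows, assuming a plain list of numbers. The function name is hypothetical, and returning zeros for an all-zero attribute is an assumption made to avoid dividing by zero.

```python
def max_abs_normalize(values):
    """Scale values into [-1, 1] using vn = v / max(abs(vmax), abs(vmin))."""
    vamax = max(abs(max(values)), abs(min(values)))
    if vamax == 0:
        # Assumption: an all-zero attribute stays all zero.
        return [0.0 for _ in values]
    return [v / vamax for v in values]

# Made-up values with mixed signs, for illustration only.
print(max_abs_normalize([-50.0, 0.0, 25.0, 100.0]))  # [-0.5, 0.0, 0.25, 1.0]
```

Unlike minmax, this scaling keeps zero at zero and preserves the sign of each value.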

The zscore technique is based on mean and standard deviation. The normalized data has a mean of 0 and a standard deviation of 1, and for roughly bell-shaped data most of the normalized values will be between -1 and 1 (about 68% if the data is normally distributed). This technique is also known as standardization, because normally distributed data is mapped to the standard normal distribution N(0,1), a normal distribution with a mean of 0 and a standard deviation of 1.

vn = (v - vmean) / s    where 
vmean = mean value 
s = standard deviation
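The zscore formula can be sketched as below. The text does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (`statistics.pstdev`), and the function name is hypothetical.

```python
import statistics

def zscore_normalize(values):
    """Standardize values using vn = (v - vmean) / s."""
    vmean = statistics.mean(values)
    s = statistics.pstdev(values)  # assumption: population standard deviation
    if s == 0:
        # Assumption: a constant attribute is mapped to 0.0.
        return [0.0 for _ in values]
    return [(v - vmean) / s for v in values]

# Made-up values with mean 5 and population standard deviation 2.
z = zscore_normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```

After the transformation the values have mean 0 and standard deviation 1, whatever the shape of the original distribution.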

The center technique is based on mean only as below. The normalized data is not constrained by any range limit.

vn = v - vmean
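For completeness, here is a small sketch of the center technique on a plain list of numbers; the function name is hypothetical.

```python
import statistics

def center(values):
    """Shift values so that their mean becomes 0, using vn = v - vmean."""
    vmean = statistics.mean(values)
    return [v - vmean for v in values]

# Made-up values, for illustration only.
print(center([10.0, 20.0, 30.0]))  # [-10.0, 0.0, 10.0]
```

Centering changes only the location of the data; the spread of the values is left as it was.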

In the decimal technique, the value is scaled by a quantity which is a power of 10 and greater than the maximum absolute value. Normalized values will be within the limits -1 and 1.

vn = v / 10^m      where
vamax = max(abs(vmax), abs(vmin))
m = smallest integer such that 10^m is greater than vamax
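The only non-obvious part of the decimal technique is finding the exponent m. The sketch below searches for it with integer steps rather than a logarithm, to sidestep floating point rounding at exact powers of 10. Both function names are hypothetical, and the handling of an all-zero attribute is an assumption.

```python
def decimal_exponent(vamax):
    """Return the smallest integer m such that 10**m > vamax (vamax > 0)."""
    m = 0
    while 10 ** m <= vamax:
        m += 1
    while 10 ** (m - 1) > vamax:
        m -= 1
    return m

def decimal_normalize(values):
    """Scale values into (-1, 1) by dividing by a power of 10."""
    vamax = max(abs(max(values)), abs(min(values)))
    if vamax == 0:
        # Assumption: an all-zero attribute stays all zero.
        return [0.0 for _ in values]
    scale = 10 ** decimal_exponent(vamax)
    return [v / scale for v in values]

# Made-up values: vamax is 999, so m is 3 and the divisor is 1000.
print(decimal_normalize([-350.0, 20.0, 999.0]))  # [-0.35, 0.02, 0.999]
```

Because 10^m is strictly greater than vamax, the normalized values never reach -1 or 1 exactly.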

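The unitSum technique is based on the sum of the values as below. The normalized values add up to 1. If all the original values are non-negative, the normalized values lie between 0 and 1; otherwise they are not constrained by any range limit.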

vn = v / sum     where 
sum = ∑vi
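A short sketch of the unitSum technique follows. The function name is hypothetical, and raising an error when the values sum to zero is an assumption, since the formula is undefined in that case.

```python
def unit_sum_normalize(values):
    """Scale values so they add up to 1, using vn = v / sum."""
    total = sum(values)
    if total == 0:
        # Assumption: fail loudly, since division by zero has no sensible result.
        raise ValueError("values sum to zero; unitSum is undefined")
    return [v / total for v in values]

# Made-up non-negative values, for illustration only.
print(unit_sum_normalize([1.0, 3.0, 4.0]))  # [0.125, 0.375, 0.5]
```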

All the techniques described are sensitive to outliers. The zscore technique provides the option of purging outlier data while normalizing. Since outliers have a zscore with a large absolute value, we can remove any record whose absolute zscore is above some threshold.
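Outlier purging with zscore can be sketched as follows for a single attribute. This is a simplified illustration under a few assumptions: the mean and standard deviation are computed once over all values, including the outliers, the population standard deviation is used, and the function name and sample values are made up.

```python
import statistics

def zscore_purge(values, threshold=2.0):
    """Return (value, zscore) pairs, dropping values with |zscore| > threshold."""
    vmean = statistics.mean(values)
    s = statistics.pstdev(values)  # assumption: population standard deviation
    if s == 0:
        # Assumption: a constant attribute has no outliers.
        return [(v, 0.0) for v in values]
    scored = [(v, (v - vmean) / s) for v in values]
    return [(v, z) for v, z in scored if abs(z) <= threshold]

# Made-up readings with one obvious outlier (50.0).
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 9.0, 11.0, 10.0, 50.0]
kept = zscore_purge(readings)
print(len(readings), len(kept))  # 10 9
```

Note that a large outlier inflates the standard deviation itself, so with very few records an outlier may not exceed the threshold at all.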

House Price

We will use house price data as the use case. Consider a scenario where you want to build a regression based predictive model for house price, based on various input feature attributes.

You have also decided to use the KNN regression algorithm. As alluded to earlier, nearest neighbor based algorithms perform distance calculations, which require normalized data. Here are the attributes of the house price data set.

  1. transaction ID
  2. zip code
  3. floor area
  4. number of bedrooms
  5. number of bathrooms
  6. price

Here is some sample input


Normalization Spark Job

The normalization is implemented in the Scala object Normalizer. We are doing zscore normalization and also purging outlier records. The threshold for outliers has been set at 2 standard deviations. Here is some sample output


The first field, which is the ID, and the last field, which is the output field for regression analysis, have been skipped from normalization.
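The record level behavior described here can be approximated in plain Python as below. This is not the chombo Normalizer and does not use Spark; it is a single machine sketch that assumes comma separated records, the population standard deviation, and a record being dropped when any normalized field exceeds the threshold. The function name and the sample records are made up.

```python
import statistics

def normalize_records(lines, threshold=2.0, delim=","):
    """Zscore-normalize all fields except the first (ID) and last (target)."""
    rows = [line.split(delim) for line in lines]
    cols = range(1, len(rows[0]) - 1)  # skip ID and target fields

    # First pass: mean and standard deviation of each normalized column.
    stats = {}
    for c in cols:
        vals = [float(r[c]) for r in rows]
        stats[c] = (statistics.mean(vals), statistics.pstdev(vals))

    # Second pass: normalize each record and purge outliers.
    out = []
    for r in rows:
        z = []
        for c in cols:
            mean, std = stats[c]
            z.append((float(r[c]) - mean) / std if std > 0 else 0.0)
        if any(abs(v) > threshold for v in z):
            continue  # assumption: one outlying field removes the whole record
        out.append(delim.join([r[0]] + [f"{v:.3f}" for v in z] + [r[-1]]))
    return out

# Made-up records: ID, zip code, floor area, bedrooms, bathrooms, price.
records = [
    "T1,94025,1500,3,2,650000",
    "T2,94026,1600,3,2,700000",
    "T3,94025,1550,2,1,610000",
    "T4,94027,1700,4,3,820000",
    "T5,94026,1650,3,2,730000",
    "T6,94025,1580,3,2,690000",
    "T7,94027,1620,4,2,760000",
    "T8,94026,6000,5,3,2900000",  # floor area is an outlier
]
normalized = normalize_records(records)
print(len(normalized))  # 7
```

The two passes mirror what a distributed job would typically do: one aggregation to collect per column statistics, followed by a map and filter over the records.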

Wrapping Up

We have used house price data as an example and gone through the data normalization process using a Spark based implementation. Execution steps for this use case are detailed in the tutorial document.

Originally posted here.
