# Data Normalization with Spark

**Data normalization** is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly.

In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by the lack of normalization.

The *Spark* based implementation is available in my open source project *chombo*. There is also a *Hadoop* based implementation in the same project.

## Why Normalize

Some Machine Learning algorithms are sensitive to the relative magnitudes of the feature attributes. Normalization alleviates this problem.

The **K Nearest Neighbor (KNN)** algorithm is based on the distance between records. Unless the data is normalized, the distance will be skewed, because different attributes will not contribute to it in a uniform way. Attributes with a larger value range will have an unduly large influence on the distance.
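To make the effect concrete, here is a minimal sketch in plain Python. The three house records and the attribute ranges used for rescaling are made up for illustration; they are not taken from the data set used later in this post.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rescale(record, ranges):
    """Map each attribute to [0, 1] using an assumed (low, high) range."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(record, ranges))

# Hypothetical records: (floor area in sq ft, number of bedrooms)
house_a = (1400.0, 2.0)
house_b = (1450.0, 5.0)  # similar area, very different bedroom count
house_c = (1500.0, 2.0)  # same bedroom count, somewhat larger area

# Assumed value ranges for the two attributes (illustrative only)
ranges = [(1000.0, 2000.0), (1.0, 6.0)]

raw_ab = euclidean(house_a, house_b)
raw_ac = euclidean(house_a, house_c)
scaled_ab = euclidean(rescale(house_a, ranges), rescale(house_b, ranges))
scaled_ac = euclidean(rescale(house_a, ranges), rescale(house_c, ranges))

# On raw values the floor area dominates, so B looks closer to A than C does.
print(raw_ab < raw_ac)
# After rescaling, the bedroom difference counts, and C becomes the closer one.
print(scaled_ab > scaled_ac)
```

On the raw values, the three-bedroom difference between A and B is almost invisible next to a 50 square foot difference in floor area. Once both attributes are on a comparable scale, the nearest neighbor of A changes.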

In an **Artificial Neural Network (ANN)**, linear algebra operations are performed between the input vector and the weight vectors. With ANN, normalization is not strictly necessary, as the weights can accommodate varying ranges of the input feature attributes. However, training can be more efficient and convergence can be reached faster when the data is normalized.

In a **Support Vector Machine (SVM)**, the algorithm finds the hyperplane separating the data points belonging to different classes using optimization techniques, and distance calculation enters the picture. Hence, normalization becomes a necessity. However, if kernel functions are used instead of calculating distance directly, the function may be able to handle the difference in scales between attributes, and normalization may be skipped.

As a counterexample, let’s consider **Decision Tree and Random Forest**. In a Decision Tree, the feature space is subdivided into different regions, with data homogeneity in each region as the criterion. The algorithm operates on each attribute independently, and the relative values of different attributes are irrelevant. So, normalization is not necessary.

## Normalization Techniques

There are various normalization techniques. The appropriate technique depends on the machine learning algorithm to be used on the normalized data. The most popular techniques are *minmax* and *zscore*.

The **minmax** technique is based on the min and max values of the attribute, as follows. Normalized values will be between 0 and 1.

$$v_{n} = \frac{v - v_{min}}{v_{max} - v_{min}}$$

where

$v_{n}$ = normalized value

$v$ = original value

$v_{min}$ = minimum value

$v_{max}$ = maximum value
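The formula can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the Spark implementation; the function name `minmax_normalize` and the handling of a constant column (all values mapped to 0) are assumptions made here.

```python
def minmax_normalize(values):
    """Scale values to [0, 1] using the column min and max."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - v_min) / span for v in values]

# Floor areas taken from the sample input shown later in this post
floor_areas = [1987, 1473, 1680, 1779, 1365, 1406, 1368]
normalized = minmax_normalize(floor_areas)
print([round(v, 3) for v in normalized])
```

The smallest floor area maps to 0 and the largest to 1, with everything else in between.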

The **max** technique only uses the maximum absolute value for normalization. The normalized values will be between -1 and 1.

$$v_{n} = \frac{v}{v_{amax}}$$

where

$v_{amax} = \max(|v_{max}|, |v_{min}|)$
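As a minimal sketch, assuming a hypothetical helper named `maxabs_normalize` and made-up input values, the calculation looks like this:

```python
def maxabs_normalize(values):
    """Divide every value by the largest absolute value in the column."""
    v_amax = max(abs(max(values)), abs(min(values)))
    if v_amax == 0:
        # Assumption: an all-zero column is returned unchanged.
        return [0.0 for _ in values]
    return [v / v_amax for v in values]

# Made-up values with mixed signs
values = [-40.0, 10.0, 25.0]
normalized = maxabs_normalize(values)
print(normalized)
```

Unlike *minmax*, this technique preserves the sign of each value and keeps 0 at 0.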

The **zscore** technique is based on the mean and standard deviation. The normalized data has a mean of 0 and a standard deviation of 1, which is why this technique is also known as standardization. If the original data is normally distributed, the normalized data follows the standard normal distribution N(0,1), and about 68% of the normalized values fall between -1 and 1.

$$v_{n} = \frac{v - v_{mean}}{s}$$

where

$v_{mean}$ = mean value

$s$ = standard deviation
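Here is a small sketch of the same calculation using Python's `statistics` module. The post does not say whether the population or the sample standard deviation is used; the population version (`pstdev`) is an assumption made for this example, as is the function name.

```python
import statistics

def zscore_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant column is mapped to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Floor areas taken from the sample input shown later in this post
floor_areas = [1987, 1473, 1680, 1779, 1365, 1406, 1368]
z = zscore_normalize(floor_areas)
print([round(v, 3) for v in z])
```

By construction, the resulting values have mean 0 and standard deviation 1, whatever the shape of the original distribution.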

The **center** technique is based on the mean only, as below. The normalized data is not constrained by any range limit.

$$v_{n} = v - v_{mean}$$
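A minimal sketch, with a hypothetical function name and made-up values:

```python
import statistics

def center_normalize(values):
    """Shift values so that their mean becomes 0; the spread is unchanged."""
    mean = statistics.mean(values)
    return [v - mean for v in values]

# Made-up values with a mean of 5
values = [2.0, 4.0, 9.0]
centered = center_normalize(values)
print(centered)
```

Centering moves the data but does not rescale it, so the differences between values stay exactly as they were.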

In the **decimal** technique, the value is scaled by a power of 10 that is greater than the maximum absolute value. Normalized values will be within the limits -1 and 1.

$$v_{n} = \frac{v}{10^{m}}$$

where

$v_{amax} = \max(|v_{max}|, |v_{min}|)$

$m$ = smallest integer such that $10^{m}$ is greater than $v_{amax}$
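The only non-obvious step is finding the exponent. A simple search is sketched below; the helper names are hypothetical, and the all-zero case is handled by an assumption noted in the code.

```python
def decimal_exponent(v_amax):
    """Smallest integer m such that 10**m is strictly greater than v_amax (> 0)."""
    m = 0
    while 10 ** m <= v_amax:       # grow until 10^m exceeds v_amax
        m += 1
    while 10 ** (m - 1) > v_amax:  # shrink when v_amax is well below 1
        m -= 1
    return m

def decimal_normalize(values):
    """Divide every value by the smallest power of 10 greater than max |v|."""
    v_amax = max(abs(max(values)), abs(min(values)))
    if v_amax == 0:
        # Assumption: an all-zero column is returned unchanged.
        return [0.0 for _ in values]
    scale = 10 ** decimal_exponent(v_amax)
    return [v / scale for v in values]

# Made-up values; the largest magnitude is 1987, so the scale is 10^4
values = [1987, 1473, -250]
normalized = decimal_normalize(values)
print(normalized)
```

Because the power of 10 is strictly greater than the largest magnitude, the normalized values never reach -1 or 1 exactly.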

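The **unitSum** technique is based on the sum of the values, as below. In general, the normalized data is not constrained by any range limit, although when all values are non-negative the normalized values lie between 0 and 1 and add up to 1.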

$$v_{n} = \frac{v}{\text{sum}}$$

where

$\text{sum} = \sum_{i} v_{i}$
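A minimal sketch follows. The function name is hypothetical, and raising an error when the values sum to zero is a choice made for this example.

```python
def unit_sum_normalize(values):
    """Divide every value by the sum of all values."""
    total = sum(values)
    if total == 0:
        # Assumption: the technique is undefined when the values sum to zero.
        raise ValueError("cannot apply unitSum normalization when the sum is 0")
    return [v / total for v in values]

# Made-up non-negative values
values = [1.0, 3.0, 4.0]
normalized = unit_sum_normalize(values)
print(normalized)
```

With non-negative input, the result can be read as each value's share of the total.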

All the techniques described are sensitive to outliers. The *zscore* technique provides the option of purging outlier data while normalizing. Since outliers have a high absolute *zscore*, we can remove any record whose *zscore* magnitude is above some threshold.
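The purging step can be sketched as a filter on the absolute *zscore*. This is an illustration only: the function name, the use of the population standard deviation, and the extra 9000 sq ft record appended to the floor areas are all assumptions made here.

```python
import statistics

def zscore_purge(values, threshold=2.0):
    """Return z-scores of the values, dropping those with |z| above the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    scores = [(v - mean) / std for v in values]
    return [z for z in scores if abs(z) <= threshold]

# Floor areas from the sample input, plus one made-up extreme value
floor_areas = [1987, 1473, 1680, 1779, 1365, 1406, 1368, 9000]
kept = zscore_purge(floor_areas, threshold=2.0)
print(len(floor_areas), len(kept))
```

Note that the mean and standard deviation are computed with the outlier still present, so a single extreme value inflates the standard deviation and shrinks the *zscore* of everything else, including its own.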

## House Price

We will use house price data as the use case. Consider a scenario where you want to build a regression based predictive model for house price from various input feature attributes.

You have also decided to use the KNN regression algorithm. As alluded to earlier, nearest neighbor based algorithms perform distance calculations, which require normalized data. Here are the attributes of the house price data set.

- *transaction ID*
- *zip code*
- *floor area*
- *number of bedrooms*
- *number of bathrooms*
- *price*

Here is some sample input:

8544WY7325,94602,1987,5,2,1394000

VSK634510N,94702,1473,3,2,1178000

07C64O7OK0,94540,1680,4,2,1191000

6KR117M8EA,94538,1779,5,2,1186000

2A0T80P51T,95129,1365,3,2,894000

JMM83NVNM6,94540,1406,3,2,930000

PD7ES0I5G1,94602,1368,3,2,950000

## Normalization Spark Job

The normalization is implemented in the Scala object *Normalizer*. We are doing *zscore* normalization and also purging outlier records. The threshold for outliers has been set at 2 standard deviations. Here is some sample output:

7OARSM21CR,94501,0.238,0.327,0.395,928000

R0JW4A171T,94538,0.213,0.327,0.395,867000

VSK634510N,94702,0.427,0.327,0.395,1178000

07C64O7OK0,94540,0.685,0.641,0.395,1191000

6KR117M8EA,94538,0.809,0.954,0.395,1186000

2A0T80P51T,95129,0.293,0.327,0.395,894000

JMM83NVNM6,94540,0.344,0.327,0.395,930000

The first field, which is the ID, and the last field, which is the output field for regression analysis, have been skipped from normalization. The zip code also appears unchanged in the sample output.
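To show how selected fields can be normalized while others are passed through, here is a rough sketch in plain Python. It is not the *Normalizer* Spark job and is not expected to reproduce the output above; the function name `normalize_records`, the choice of columns, the population standard deviation, and the three-decimal formatting are all assumptions for illustration.

```python
import statistics

def normalize_records(lines, columns, precision=3):
    """Z-score normalize the given column indexes of CSV lines; keep the rest as-is."""
    rows = [line.split(",") for line in lines]
    for col in columns:
        values = [float(row[col]) for row in rows]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)  # assumption: population standard deviation
        for row, v in zip(rows, values):
            # A constant column (std == 0) is mapped to 0 by assumption.
            z = 0.0 if std == 0 else (v - mean) / std
            row[col] = f"{z:.{precision}f}"
    return [",".join(row) for row in rows]

# The sample input records from this post
sample = [
    "8544WY7325,94602,1987,5,2,1394000",
    "VSK634510N,94702,1473,3,2,1178000",
    "07C64O7OK0,94540,1680,4,2,1191000",
    "6KR117M8EA,94538,1779,5,2,1186000",
    "2A0T80P51T,95129,1365,3,2,894000",
    "JMM83NVNM6,94540,1406,3,2,930000",
    "PD7ES0I5G1,94602,1368,3,2,950000",
]

# Normalize floor area, bedrooms and bathrooms; skip ID, zip code and price
normalized = normalize_records(sample, columns=[2, 3, 4])
for line in normalized:
    print(line)
```

Passing the column indexes explicitly keeps identifiers, categorical fields such as the zip code, and the regression target out of the calculation.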

## Wrapping Up

We have used house price data as an example and gone through the data normalization process using a Spark based implementation. Execution steps for this use case are detailed in the tutorial document.
