Data Normalization with Spark

Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly.

In this post, we will go through various data normalization techniques as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively affected by a lack of normalization.

The Spark based implementation is available in my open source project chombo. There is also a Hadoop based implementation in the same project.

Why Normalize

Some Machine Learning algorithms are sensitive to the relative magnitudes of the feature attributes. Normalization alleviates this problem.

The K Nearest Neighbor (KNN) algorithm is based on the distance between records. Unless the data is normalized, the distance will be misleading, because different attributes will not contribute to it in a uniform way. Attributes with a larger value range will have an unduly large influence on the distance.
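To make the scale effect concrete, here is a minimal sketch in plain Python (not the Spark implementation). The two houses and the helper name `euclidean` are hypothetical and chosen only for illustration; each record is a (floor area, number of bedrooms) pair.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical houses: (floor area in sq ft, number of bedrooms)
house_a = (1473.0, 3.0)
house_b = (1680.0, 4.0)

raw_distance = euclidean(house_a, house_b)

# Fraction of the squared distance contributed by floor area alone
area_share = (house_a[0] - house_b[0]) ** 2 / raw_distance ** 2

print(raw_distance, area_share)
```

With unnormalized values, the floor area accounts for almost all of the squared distance, and the one-bedroom difference is practically invisible to KNN.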

In an Artificial Neural Network (ANN), linear algebra operations are performed between the input vector and the weight vectors. With ANN, normalization is not strictly necessary, as the weights can accommodate varying ranges of the input feature attributes. However, training can be more efficient and convergence can be reached faster when the data is normalized.

In a Support Vector Machine (SVM), the algorithm uses optimization techniques to find the hyperplane separating the data points belonging to the different classes, and distance calculation enters the picture. Hence, normalization becomes a necessity. However, if kernel functions are used instead of calculating distance directly, the function may be able to handle differences in scale between attributes, and normalization may be skipped.

As a counterexample, let’s consider Decision Tree and Random Forest. In a Decision Tree, the feature space is subdivided into different regions, with data homogeneity in each region as the criterion. The algorithm operates on each attribute independently, and the relative values of different attributes are irrelevant. So, normalization is not necessary.

Normalization Techniques

There are various normalization techniques. The appropriate technique depends on the machine learning algorithm that will be applied to the normalized data. The most popular techniques are minmax and zscore.

The minmax technique is based on the min and max values of the attribute, as follows. Normalized values will be between 0 and 1.

v_n = (v - v_min) / (v_max - v_min) where
v_n = normalized value
v = original value
v_min = minimum value
v_max = maximum value
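
The minmax formula can be sketched in plain Python as follows. This is an illustration rather than the chombo code; the function name `minmax_normalize` and the choice to map a constant column to 0.0 are assumptions.

```python
def minmax_normalize(values):
    """Scale values into [0, 1] using v_n = (v - v_min) / (v_max - v_min)."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Assumption: a constant column carries no information, map it to 0.0
        return [0.0 for _ in values]
    return [(v - v_min) / span for v in values]

print(minmax_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```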

The max technique uses only the maximum absolute value for normalization. The normalized values will be between -1 and 1.

v_n = v / v_amax where
v_amax = max(abs(v_max), abs(v_min))
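
A minimal plain Python sketch of the max technique, under the assumption that an all-zero column is simply returned as zeros (the function name `max_abs_normalize` is hypothetical):

```python
def max_abs_normalize(values):
    """Scale values into [-1, 1] using v_n = v / max(|v_max|, |v_min|)."""
    v_amax = max(abs(max(values)), abs(min(values)))
    if v_amax == 0:
        # Assumption: an all-zero column stays all zero
        return [0.0 for _ in values]
    return [v / v_amax for v in values]

print(max_abs_normalize([-4, 2, 8]))  # [-0.5, 0.25, 1.0]
```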

The zscore technique is based on the mean and standard deviation. The normalized data will have a mean of 0 and a standard deviation of 1, which is why this technique is also known as standardization. If the original data is approximately normally distributed, the normalized data will approximately follow the standard normal distribution N(0,1), and about 68% of the normalized values will fall between -1 and 1.

v_n = (v - v_mean) / s where
v_mean = mean value
s = standard deviation
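
As a rough illustration in plain Python, the zscore formula could look like the sketch below. It assumes the population standard deviation (`statistics.pstdev`); the post does not say whether the population or sample form is used, and the function name is hypothetical.

```python
import statistics

def zscore_normalize(values):
    """Standardize values using v_n = (v - v_mean) / s."""
    v_mean = statistics.mean(values)
    s = statistics.pstdev(values)  # assumption: population standard deviation
    if s == 0:
        # Assumption: a constant column maps to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - v_mean) / s for v in values]

# Mean is 5 and population standard deviation is 2 for this data
print(zscore_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```

The result has mean 0 and standard deviation 1 regardless of the shape of the original distribution.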

The center technique is based on mean only as below. The normalized data is not constrained by any range limit.

v_n = v - v_mean

In the decimal technique, the value is scaled by a power of 10 that is greater than the maximum absolute value. Normalized values will be within the limits -1 and 1.

v_n = v / 10^m where
v_amax = max(abs(v_max), abs(v_min))
m = smallest integer such that 10^m is greater than v_amax
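
The search for m can be sketched in plain Python as below. The function name `decimal_normalize` is hypothetical, and returning m alongside the scaled values is a choice made only for this illustration.

```python
def decimal_normalize(values):
    """Scale values by 10^m, where m is the smallest integer with 10^m > v_amax."""
    v_amax = max(abs(v) for v in values)
    m = 0
    # Increase m until 10^m is strictly greater than the largest magnitude
    while 10 ** m <= v_amax:
        m += 1
    # Decrease m while a smaller power of 10 still exceeds v_amax (values below 1)
    while v_amax > 0 and 10 ** (m - 1) > v_amax:
        m -= 1
    scale = 10 ** m
    return [v / scale for v in values], m

scaled, m = decimal_normalize([1987, -350, 42])
print(m, scaled)  # 4 [0.1987, -0.035, 0.0042]
```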

The unitSum technique is based on the sum of the values, as below. In general, the normalized data is not constrained by any range limit, although for non-negative values the results fall between 0 and 1 and add up to 1.

v_n = v / sum where
sum = ∑v_i
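
A small plain Python sketch of unitSum follows. Raising an error for a zero sum is an assumption of this illustration, not something the post specifies, and the function name is hypothetical.

```python
def unit_sum_normalize(values):
    """Scale values using v_n = v / sum, so the results add up to 1."""
    total = sum(values)
    if total == 0:
        # Assumption: a zero sum cannot be normalized, so fail loudly
        raise ValueError("sum of values is zero")
    return [v / total for v in values]

print(unit_sum_normalize([1, 3, 4]))  # [0.125, 0.375, 0.5]
```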

All the techniques described are sensitive to outliers. The zscore technique provides the option of purging outlier data while normalizing. Since outliers have a zscore of high magnitude, we could remove any record whose zscore exceeds some threshold.
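
Outlier purging during zscore normalization might be sketched as follows in plain Python. The function name, the use of the absolute zscore, the population standard deviation, and the sample values are all assumptions made for illustration.

```python
import statistics

def zscore_purge(values, threshold=2.0):
    """Return (original, zscore) pairs, dropping values with |zscore| > threshold."""
    v_mean = statistics.mean(values)
    s = statistics.pstdev(values)  # assumption: population standard deviation
    if s == 0:
        return [(v, 0.0) for v in values]
    scored = [(v, (v - v_mean) / s) for v in values]
    return [(v, z) for v, z in scored if abs(z) <= threshold]

# Hypothetical data with one obvious outlier (100)
kept = zscore_purge([10, 12, 11, 13, 12, 11, 100], threshold=2.0)
print([v for v, _ in kept])  # [10, 12, 11, 13, 12, 11]
```

Note that in this sketch the zscores are computed once, with the outlier still included in the mean and standard deviation, and are not recomputed after purging.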

House Price

We will use house price data as the use case. Consider a scenario where you want to build a regression-based predictive model for house price from various input feature attributes.

You have also decided to use the KNN regression algorithm. As alluded to earlier, nearest neighbor based algorithms perform distance calculations, which require normalized data. Here are the attributes of the house price data set.

transaction ID
zip code
floor area
number of bedrooms
number of bathrooms
price

Here is some sample input:

8544WY7325,94602,1987,5,2,1394000
VSK634510N,94702,1473,3,2,1178000
07C64O7OK0,94540,1680,4,2,1191000
6KR117M8EA,94538,1779,5,2,1186000
2A0T80P51T,95129,1365,3,2,894000
JMM83NVNM6,94540,1406,3,2,930000
PD7ES0I5G1,94602,1368,3,2,950000

Normalization Spark Job

The normalization implementation is the Scala object Normalizer. We are doing zscore normalization and also purging outlier records. The threshold for outliers has been set at 2 x standard deviation. Here is some sample output:

7OARSM21CR,94501,0.238,0.327,0.395,928000
R0JW4A171T,94538,0.213,0.327,0.395,867000
VSK634510N,94702,0.427,0.327,0.395,1178000
07C64O7OK0,94540,0.685,0.641,0.395,1191000
6KR117M8EA,94538,0.809,0.954,0.395,1186000
2A0T80P51T,95129,0.293,0.327,0.395,894000
JMM83NVNM6,94540,0.344,0.327,0.395,930000

The first field, which is the ID, and the last field, which is the output field for regression analysis, have been skipped from normalization. The zip code is likewise left unchanged in the sample output.
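
The idea of normalizing only selected columns of each record can be sketched in plain Python as below. This is not the Normalizer Spark job: the function name `zscore_columns`, the column indexes, the population standard deviation, and the three-decimal formatting are assumptions. Because the statistics here come from only the seven sample input records, the values will not match the sample output shown earlier.

```python
import statistics

SAMPLE_INPUT = """8544WY7325,94602,1987,5,2,1394000
VSK634510N,94702,1473,3,2,1178000
07C64O7OK0,94540,1680,4,2,1191000
6KR117M8EA,94538,1779,5,2,1186000
2A0T80P51T,95129,1365,3,2,894000
JMM83NVNM6,94540,1406,3,2,930000
PD7ES0I5G1,94602,1368,3,2,950000"""

def zscore_columns(lines, columns):
    """Apply zscore normalization to the given column indexes only."""
    rows = [line.split(",") for line in lines]
    for c in columns:
        col = [float(r[c]) for r in rows]
        mean = statistics.mean(col)
        std = statistics.pstdev(col)  # assumption: population standard deviation
        for r, v in zip(rows, col):
            # A constant column (std == 0) is mapped to 0.0
            z = (v - mean) / std if std > 0 else 0.0
            r[c] = f"{z:.3f}"
    return [",".join(r) for r in rows]

# Columns 2, 3 and 4: floor area, bedrooms, bathrooms.
# ID (0), zip code (1) and price (5) pass through untouched.
normalized = zscore_columns(SAMPLE_INPUT.splitlines(), columns=[2, 3, 4])
for line in normalized:
    print(line)
```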

Wrapping Up

We have used house price data as an example and gone through the data normalization process using a Spark based implementation. Execution steps for this use case are detailed in the tutorial document.

Originally posted here.
