Scaling Features

Rajesh R.
4 min read · Sep 24, 2020

Scaling of features is an essential technique in machine learning. Without scaling, an ML model is likely to become weaker or even ineffective. A machine learning algorithm works on numbers, so the model tends to give undue weight to features with larger numeric ranges. To prevent this, the numbers across the entire feature set are scaled to lie within the same range as far as possible.

Let’s discuss this using a simple example in Python.

Import the Python Libraries

We begin by importing the pandas, NumPy, Matplotlib, and seaborn libraries. Pandas and NumPy are used for constructing the DataFrame and performing mathematical operations, respectively, while Matplotlib and seaborn are used for visualizing the data.

Import Python Libraries
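The original screenshot is not reproduced here, but the imports it describes would look like this (a sketch, assuming the conventional import aliases):

```python
# pandas builds the DataFrame, NumPy handles the math,
# Matplotlib and seaborn handle the visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```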

Create a DataFrame

We then construct a simple DataFrame using Pandas to store medical test records of individuals.

Medical Feature DataFrame
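The original records are not shown, so the sketch below uses placeholder values in plausible clinical ranges; the column names are assumptions based on the features the article mentions later (platelet count, body temperature, blood oxygen level):

```python
import pandas as pd

# Illustrative medical test records; the actual values in the
# article's screenshot are not available, so these are placeholders.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "platelet_count": [210000, 250000, 180000, 320000, 240000],  # per microlitre
    "body_temperature": [98.6, 99.1, 97.9, 98.4, 100.2],         # degrees F
    "blood_oxygen": [98, 97, 99, 95, 96],                        # SpO2 %
})
print(df)
```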

Calculate the Euclidean distance

If we compute the Euclidean distance between any two columns of the DataFrame, we can see large variations.

Euclidean distance between two different columns
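A sketch of that computation, using the same placeholder data and assumed column names as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "platelet_count": [210000, 250000, 180000, 320000, 240000],
    "body_temperature": [98.6, 99.1, 97.9, 98.4, 100.2],
    "blood_oxygen": [98, 97, 99, 95, 96],
})

# Treat each column as a vector and measure the Euclidean distance
# between pairs of columns.
d1 = np.linalg.norm(df["platelet_count"] - df["body_temperature"])
d2 = np.linalg.norm(df["body_temperature"] - df["blood_oxygen"])
print(d1, d2)  # d1 is several orders of magnitude larger than d2
```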

The large variations in the Euclidean distances between the DataFrame columns lead to inefficiencies in the ML model. To correct this, we need to scale the column features to be within the same range.

Scale the Features in the DataFrame

We will now scale the values in the columns of the DataFrame. Before that, we need to select the features that are essential for the ML model, i.e., choose the columns that represent the model's features.

Features selected for scaling
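The selection step might look like this, again assuming the illustrative column names; the non-numeric `name` column is left out:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "platelet_count": [210000, 250000, 180000, 320000, 240000],
    "body_temperature": [98.6, 99.1, 97.9, 98.4, 100.2],
    "blood_oxygen": [98, 97, 99, 95, 96],
})

# Keep only the numeric columns that act as model features
features = df[["platelet_count", "body_temperature", "blood_oxygen"]]
print(features.head())
```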

Scikit-learn's StandardScaler

There are many libraries (or even custom code) that can achieve scaling. In this example, I am using scikit-learn's StandardScaler to scale the features in the DataFrame. To do this, we instantiate a StandardScaler object and then call fit_transform on the selected columns of the DataFrame. The scaled features are returned as a numpy.ndarray object.

Scale: Fit and transform the selected features
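A minimal sketch of that step, using the placeholder feature values from earlier:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "platelet_count": [210000, 250000, 180000, 320000, 240000],
    "body_temperature": [98.6, 99.1, 97.9, 98.4, 100.2],
    "blood_oxygen": [98, 97, 99, 95, 96],
})

# StandardScaler standardizes each column to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
print(type(scaled))  # <class 'numpy.ndarray'>
```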

From the returned scaled features, we then create a new feature DataFrame as below.

Scaled Features
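Wrapping the ndarray back into a DataFrame might look like this (same illustrative data as before):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "platelet_count": [210000, 250000, 180000, 320000, 240000],
    "body_temperature": [98.6, 99.1, 97.9, 98.4, 100.2],
    "blood_oxygen": [98, 97, 99, 95, 96],
})
scaled = StandardScaler().fit_transform(features)

# Rebuild a DataFrame from the ndarray, reusing the feature names
scaled_df = pd.DataFrame(scaled, columns=features.columns)
print(scaled_df)
```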

Visualizing The Effect of Scaling

The effect of scaling can be understood by doing a simple visualization of the features through a distribution plot. The graphs are plotted using the seaborn library.

Prepare two distribution plots for the features: With and Without Scaling

It is easy to notice from the distribution plots below that the data is now scaled to values comparable across columns and is also more tightly clustered than before.

Let’s also look at the scaled features information and statistics below.

Scaled Feature statistics
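These statistics can be reproduced with `describe()` on the scaled DataFrame (same placeholder data). One subtlety: `describe()` reports the sample standard deviation (ddof=1), so the values sit slightly above the population value of 1 that StandardScaler targets:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "platelet_count": [210000, 250000, 180000, 320000, 240000],
    "body_temperature": [98.6, 99.1, 97.9, 98.4, 100.2],
    "blood_oxygen": [98, 97, 99, 95, 96],
})
scaled_df = pd.DataFrame(StandardScaler().fit_transform(features),
                         columns=features.columns)

# Means are ~0 for every column; sample stds are slightly above 1
print(scaled_df.describe())
```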

Scaling the data produces a standardized feature set whose values can no longer be read in the original units, such as platelet count, body temperature, or blood oxygen level. The standardization transform ensures that each feature has approximately zero mean and unit variance.

Closing Thoughts

Feature scaling is an essential step in data preprocessing before ML modeling. I have used the StandardScaler in the code sample above. Several other scalers, such as MinMaxScaler, RobustScaler, and MaxAbsScaler, can also be applied. Scaling improves the results we seek from the ML model, but the choice of scaler is something of an art, often settled by trial and error to find the most desirable outcome.

The GitHub link to the source code is below.

Rajesh Ramachander: linkedin.com/in/rramachander/

