When basic addition leads to exponential gains: Feature Engineering for Machine Learning.

Photo by Sajad Nori on Unsplash

When you join your first Kaggle competition, you will probably see the term “Feature Engineering” tossed around a lot in the discussions, and you may be confused as to what it means. According to Wikipedia:

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. Feature engineering has been employed in Kaggle competitions and machine learning projects. ~ Wikipedia

If that didn’t make sense, don’t worry. In this article, I’m going to explain it, and give an example of some of the feature engineering which I did to achieve one of the best metadata-only CV scores.

What is feature engineering?

All models have 3 basic components:

  • Input (what you feed the model for training)
  • The model itself
  • Output (the result)

While training optimizes the second and third components, we can’t exactly train the input. Most of the time, we take our input from a pre-made dataset, clean it up, and feed it into our model. However, this underutilizes the input!

Enter feature engineering: it is simply mixing and matching some of the data you already have to form new data. This can be helpful for the model, as it can bring to light new “features” which it wouldn’t otherwise have access to.

These new features can make a real difference to your model’s performance. They essentially allow the model to “skip” a few layers and get features earlier than it could have learned them by itself. In other words, you’re giving the model more time to work with a feature than it would otherwise have had (and sometimes you give it features it wouldn’t even find on its own).

How exactly do I engineer features?

One of the ways that I like to “engineer” features is to play around with the data (Exploratory Data Analysis — EDA), and figure out what has a major impact on the results. For me, EDA truly shines when given lots of quantitative data (numbers, booleans, etc), and less qualitative data (text, images, etc).

Feature engineering works best with large amounts of data. The more data you have, the more likely you are to get significant results out of it.

For example, maybe you choose to add two channels (columns of the dataset) together to create a new channel. Or, you multiply two channels. Either way, you should check to see what impact the new channel has on the results. The heavier the impact, the better.
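To make that check concrete, here is a minimal sketch. The channels `a` and `b` and the target values are toy numbers I made up for illustration, and I use Pearson correlation as the measure of “impact,” which is only one possible choice and only captures linear relationships.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: two raw channels and the value we want to predict.
a = [1, 0, 1, 0, 1, 0, 1, 0]
b = [1, 1, 0, 0, 1, 1, 0, 0]
target = [9, 5, 6, 1, 10, 4, 5, 2]

# Engineered channels: the sum and the product of the raw channels.
a_plus_b = [x + y for x, y in zip(a, b)]
a_times_b = [x * y for x, y in zip(a, b)]

scores = {
    "a": pearson(a, target),
    "b": pearson(b, target),
    "a+b": pearson(a_plus_b, target),
    "a*b": pearson(a_times_b, target),
}

# Print the channels from strongest to weakest correlation with the target.
for name, score in sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>4}: {score:+.3f}")
```

On this toy data the summed channel correlates more strongly with the target than either raw channel does, which is the kind of signal you are looking for before keeping a new feature.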

In general, there are 4 main ways to “engineer” features:

  1. Binning
  2. Transforming
  3. Splitting
  4. Combining

You can read more about the ways to “engineer” features here.
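Here is a rough sketch of what each of the four techniques can look like in plain Python. Every field name (`age_months`, `price`, `listed_at`, `height`, `width`) and the bin boundaries are hypothetical, chosen only to illustrate the idea; they are not taken from any real dataset.

```python
from datetime import datetime
from math import log1p

# Hypothetical raw records; all field names and values are made up.
rows = [
    {"age_months": 3, "price": 0, "listed_at": "2021-11-02 09:15", "height": 30, "width": 20},
    {"age_months": 40, "price": 150, "listed_at": "2021-11-06 18:40", "height": 45, "width": 60},
    {"age_months": 120, "price": 20, "listed_at": "2021-11-07 23:05", "height": 50, "width": 50},
]

def bin_age(months):
    """Binning: collapse a continuous value into a few labelled buckets."""
    if months < 12:
        return "young"
    if months < 84:
        return "adult"
    return "senior"

def engineer(row):
    """Build one engineered record from one raw record."""
    listed = datetime.strptime(row["listed_at"], "%Y-%m-%d %H:%M")
    return {
        "age_bin": bin_age(row["age_months"]),   # 1. Binning
        "log_price": log1p(row["price"]),        # 2. Transforming (log(1 + x) tames skew)
        "listed_hour": listed.hour,              # 3. Splitting a timestamp into parts
        "listed_weekday": listed.weekday(),      #    (0 = Monday, 6 = Sunday)
        "area": row["height"] * row["width"],    # 4. Combining two columns
    }

features = [engineer(row) for row in rows]
for feature_row in features:
    print(feature_row)
```

Which of the four is worth trying depends on the data: binning and transforming reshape a single column, splitting pulls several columns out of one, and combining merges several columns into one.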

Can I have an example?

Sure!

Let’s take a look at some of the feature engineering I did for the “PetFinder.my” Kaggle competition.

Here (above), I add 2 channels and subtract a channel in order to get an “Overall Focus Channel.” In comparison to the original channels, this channel has a significant impact on the result.
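In code, an add-two-subtract-one feature like this can be sketched as follows. The column names (`Subject Focus`, `Near`, `Blur`) and the specific combination are assumptions for illustration, not necessarily the exact channels I used, and the rows are made-up binary values.

```python
# Hypothetical rows shaped like binary photo metadata.
# Column names and values are assumptions for illustration only.
photos = [
    {"Subject Focus": 1, "Near": 1, "Blur": 0},
    {"Subject Focus": 0, "Near": 1, "Blur": 1},
    {"Subject Focus": 0, "Near": 0, "Blur": 1},
]

def overall_focus(row):
    """Add two channels and subtract a third (an assumed combination)."""
    return row["Subject Focus"] + row["Near"] - row["Blur"]

# Attach the engineered channel to every row.
for row in photos:
    row["Overall Focus"] = overall_focus(row)

print([row["Overall Focus"] for row in photos])  # prints [2, 0, -1]
```

The new channel takes more distinct values than any single binary channel, so it can separate photos that the original channels lump together.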

Here (above) is another example of a feature which I “engineered.” I combined multiple different factors which contributed to the overall neatness of the picture. After doing this, I was able to come up with a feature called “good looks” which had a significant impact on the Pawpularity score.

The results of this simple feature engineering were pretty staggering. The new features took my CV score from 23.35 all the way down to 19.81 (in this case, lower is better). My public leaderboard (LB) score ended up being 19.15, which was one of the best scores for metadata-only submissions. I’ll update this with more information once the contest ends and private leaderboard scores are available!
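If “CV score” is unfamiliar, here is a minimal sketch of how a cross-validated error can be computed. I am assuming the score is a root-mean-squared error (RMSE), which is my understanding of the competition metric. The targets are made-up numbers, and the “model” is just a mean-value baseline standing in for a real one, so the output has nothing to do with the scores quoted above.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root-mean-squared error: lower means predictions are closer to the truth."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def kfold_rmse(targets, k, fit_predict):
    """Average RMSE over k folds.

    fit_predict(train_idx, val_idx) must return one prediction per index in val_idx.
    """
    n = len(targets)
    fold_scores = []
    for fold in range(k):
        # Every k-th sample is held out for validation; the rest is used for training.
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        preds = fit_predict(train_idx, val_idx)
        fold_scores.append(rmse([targets[i] for i in val_idx], preds))
    return sum(fold_scores) / k

# Hypothetical target values, made up for illustration.
targets = [30, 45, 28, 60, 35, 50]

def mean_baseline(train_idx, val_idx):
    """Stand-in model: predict the mean of the training targets for every sample."""
    mean = sum(targets[i] for i in train_idx) / len(train_idx)
    return [mean] * len(val_idx)

cv_score = kfold_rmse(targets, k=3, fit_predict=mean_baseline)
print(round(cv_score, 2))
```

To judge an engineered feature, you would compute this score once with the feature and once without it, using the same folds, and keep the feature only if the score drops.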

In conclusion, feature engineering can be a very powerful strategy for increasing your model’s accuracy. Hopefully, you were able to learn a thing or two from this blog. Make sure to follow and clap!

Sophomore Student at Redmond High School. Python and Java programmer, Mathelete. https://www.youtube.com/channel/UCm2rP74T

Aadit Ambadkar
