How to streamline feature engineering for machine learning

Prepping for feature engineering

Data preparation for machine learning, culminating in feature engineering, involves several major steps.

The first step is data collection, which consists of gathering raw data from various sources, such as web services, mobile apps, desktop apps and back-end systems, and bringing it all into one place. Tools like Kafka and Amazon Kinesis are often used to collect raw event data and stream it to data lakes, such as Amazon S3 or Azure Data Lake, or data warehouses, such as Snowflake or Amazon Redshift.
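In its simplest form, this collect-and-land pattern wraps raw events in collector metadata and appends them to durable storage. The sketch below is a minimal, hypothetical illustration using a local JSON Lines file in place of a data lake; a production pipeline would instead publish through a Kafka or Kinesis producer and land the data in S3 or a warehouse.

```python
import json
from datetime import datetime, timezone

def collect_event(source: str, payload: dict) -> dict:
    """Wrap a raw event with the metadata a collector typically adds."""
    return {
        "source": source,  # e.g. "web", "mobile_app", "backend"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }

def land_events(events: list[dict], lake_path: str) -> int:
    """Append events as JSON Lines, a common raw-data-lake layout."""
    with open(lake_path, "a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return len(events)

# Hypothetical events from two different front ends.
events = [
    collect_event("web", {"user_id": 1, "action": "click"}),
    collect_event("mobile_app", {"user_id": 2, "action": "purchase"}),
]
written = land_events(events, "raw_events.jsonl")
```

The append-only JSON Lines layout mirrors how raw event streams are usually landed: immutable, schema-light records that downstream jobs validate and reshape later.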

The second step involves validating, cleaning and merging data to create a single source of truth for all data analysis. On top of this single source of truth, new data sets are usually created to support specific use cases in a convenient, high-performing and cost-effective way.
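A minimal pandas sketch of this validate-clean-merge step might look like the following. The tables and column names here are hypothetical, chosen only to show deduplication, missing-value handling, basic validation and a join into one combined table:

```python
import pandas as pd

# Hypothetical raw extracts from two back-end systems.
users = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "country": ["US", "DE", "DE", None],
})
orders = pd.DataFrame({
    "user_id": [1, 2, 4],
    "amount": [30.0, -5.0, 12.5],
})

# Clean: drop duplicate user records and fill missing countries.
users = users.drop_duplicates(subset="user_id")
users["country"] = users["country"].fillna("unknown")

# Validate: reject orders with non-positive amounts.
orders = orders[orders["amount"] > 0]

# Merge: an inner join keeps only orders that match a known user.
source_of_truth = users.merge(orders, on="user_id", how="inner")
```

Only the order from user 1 survives both the validation filter and the join, which is the point of this step: downstream analysis sees one consistent, already-reconciled table.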

Feature engineering is essentially the third step in the machine learning lifecycle, said Pavel Dmitriev, vice president of data science at Outreach, a sales engagement company. “The feature engineering step transforms the data from the single source of truth dataset into a set of features that can be directly used in a machine learning model,” Dmitriev said.

Typical transformations include scaling, truncating outliers, binning, handling missing values and encoding categorical values as numbers. Dmitriev said the importance of manual feature engineering has declined in recent years, thanks to deep learning algorithms that require less of it and to the development of automated feature engineering techniques.
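The transformations listed above can be sketched in a few lines of pandas. The table below is a hypothetical extract from a single source of truth; each step corresponds to one of the named techniques:

```python
import pandas as pd

# Hypothetical single-source-of-truth extract.
df = pd.DataFrame({
    "age": [22, 35, None, 95],
    "income": [30_000, 52_000, 48_000, 1_000_000],
    "plan": ["free", "pro", "pro", "free"],
})

# Handle missing values: impute age with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Truncate outliers: clip income to a plausible upper bound.
df["income"] = df["income"].clip(upper=200_000)

# Scale: min-max scale income into [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# Bin: bucket age into coarse groups.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                       labels=["young", "mid", "senior"])

# Categorical to numeric: one-hot encode the plan column.
features = pd.get_dummies(df, columns=["plan"])
```

After these steps, every column is numeric or one-hot encoded, which is the form most machine learning models can consume directly.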