Walk me through your feature engineering process for a new tabular ML problem. What do you look at first, and what transformations do you commonly apply?
Formulate your own answer first, then compare —
tldr
Feature engineering pipeline: (1) understand data types and distributions; (2) handle missing values: impute and add a missingness indicator column; (3) transform numerics: log-transform skewed features, clip outliers, create domain ratios; (4) scale for non-tree models using training statistics only; (5) encode categoricals by cardinality (one-hot when low, target or frequency encoding when high); (6) decompose datetimes into cyclical sin/cos features. Always fit transformations on the training split only, then apply them to val/test. Tree models can skip scaling; linear models and NNs need it.
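A minimal sketch of steps (2)-(4) and (6) on a single toy numeric feature, using numpy only; the data here is synthetic and illustrative, not from any real problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature with ~10% missing values.
train = rng.lognormal(mean=2.0, sigma=1.0, size=100)
test = rng.lognormal(mean=2.0, sigma=1.0, size=20)
train[rng.random(100) < 0.1] = np.nan

# (2) Impute with the TRAIN median and keep a missingness indicator.
med = np.nanmedian(train)
train_missing = np.isnan(train).astype(float)  # indicator feature
train_imp = np.where(np.isnan(train), med, train)
test_imp = np.where(np.isnan(test), med, test)  # same train-derived median

# (3) Log-transform the skewed numeric; log1p is safe at zero.
train_log = np.log1p(train_imp)
test_log = np.log1p(test_imp)

# (4) Standardize with TRAIN statistics only, then apply to test.
mu, sd = train_log.mean(), train_log.std()
train_scaled = (train_log - mu) / sd
test_scaled = (test_log - mu) / sd

# (6) Cyclical encoding of a datetime component, e.g. hour of day,
# so hour 23 sits next to hour 0 in feature space.
hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```

The key discipline is that `med`, `mu`, and `sd` come from the training split and are reused unchanged on test data.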
follow-up
- How do you prevent data leakage when using target encoding in cross-validation?
- Your dataset has a feature with 10% missing values. The missingness correlates with the target. How do you handle it?
- When would you use automated feature engineering (e.g., Featuretools) vs. manual domain-driven engineering?
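One common answer to the first follow-up is out-of-fold target encoding: each row is encoded using target statistics computed only from the other folds, so a row's own label never leaks into its feature. A hedged sketch (the function name and random fold assignment are illustrative, not a standard API):

```python
import numpy as np

def oof_target_encode(cat, y, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row's encoding uses target
    means computed from the OTHER folds only, preventing leakage."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_splits, size=len(y))  # random fold labels
    enc = np.empty(len(y), dtype=float)
    global_mean = y.mean()  # fallback for categories unseen out-of-fold
    for k in range(n_splits):
        mask = folds == k
        oof_cat, oof_y = cat[~mask], y[~mask]  # rows outside fold k
        means = {c: oof_y[oof_cat == c].mean() for c in np.unique(oof_cat)}
        enc[mask] = [means.get(c, global_mean) for c in cat[mask]]
    return enc
```

For test-time data, the encoding is fit once on the full training set; the out-of-fold scheme applies only to the training rows themselves.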