Credit Default Risk Prediction (3): Simple Feature Engineering



In the data exploration and feature engineering stages so far, only the main tables have been used, which mainly contain the clients' application details:

  • application_train.csv
  • application_test.csv

After the re-encoding and column alignment performed during data exploration, the number of feature columns grew from the original 121 to 241, including TARGET.
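The jump from 121 to 241 columns comes from one-hot encoding the categorical columns and then aligning train and test so both keep only their shared columns. A minimal sketch of that pattern on toy data (the column names here are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({'contract': ['cash', 'revolving', 'cash'], 'TARGET': [0, 1, 0]})
test = pd.DataFrame({'contract': ['cash', 'cash']})  # 'revolving' never appears in test

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# align drops columns missing from either frame, so stash TARGET first
target = train_enc['TARGET']
train_enc, test_enc = train_enc.align(test_enc, join='inner', axis=1)
train_enc['TARGET'] = target

print(train_enc.shape, test_enc.shape)  # (3, 2) (2, 1)
```

The `contract_revolving` dummy exists only in train, so the inner join removes it from both frames, leaving the two datasets with identical feature columns.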

Feature Engineering

As Andrew Ng put it: "applied machine learning is basically feature engineering." Whatever model you choose, good feature engineering tends to make it perform better.

Feature engineering

  • Feature construction: build new features on top of the existing data
  • Feature selection: keep the important features, or reduce dimensionality

Two simple feature-construction methods are used below:

  • Polynomial features
  • Domain knowledge features

Polynomial features

Polynomial features: a very simple way to construct new features from the existing ones, for example EXT_SOURCE_1^2, EXT_SOURCE_2^2, EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on: new features built by combining several individual variables.
Why do this?
Because a single variable may have little influence on TARGET, while a combination of several variables can have a stronger influence and capture interactions between them. (Polynomial features are also summarized in the book Python for Data Science.) Below, the EXT_SOURCE columns and DAYS_BIRTH are used to construct some polynomial features. Scikit-learn provides a PolynomialFeatures class that makes this convenient; when creating a PolynomialFeatures instance you pass a degree parameter, and the number of constructed features grows rapidly with degree, so keep the degree in check to avoid overfitting.
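To see how fast the feature count grows, here is a small sketch independent of the competition data: for n input features and degree d, PolynomialFeatures emits C(n+d, d) columns (including the bias column '1'), so for 4 inputs that is 15 at degree 2 and 35 at degree 3:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(8.0).reshape(2, 4)  # any data with 4 feature columns

for degree in (2, 3, 4):
    n_out = PolynomialFeatures(degree=degree).fit(X).n_output_features_
    print(degree, n_out)  # 2 -> 15, 3 -> 35, 4 -> 70
```

Going from degree 3 to degree 4 doubles the column count; this combinatorial growth is why the degree has to be kept small.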

```python
import pandas as pd
from sklearn.preprocessing import Imputer  # removed in scikit-learn >= 0.22; use sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import PolynomialFeatures
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

```python
# load data: the data processed in the previous data exploration step
train_data = pd.read_csv('data/recode_train_data.csv')
test_data = pd.read_csv('data/recode_test_data.csv')

train_data.shape, test_data.shape
```

((, 241), (48744, 240))
```python
poly_features = train_data[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
poly_features_test = test_data[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
target = train_data['TARGET']

# fill missing values with the median
imputer = Imputer(strategy='median')
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)

# create and fit the transformer
poly_transformer = PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features)

# transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)

# number of features after polynomial construction
print(poly_features.shape[1])
```

35
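The Imputer class used above was removed in scikit-learn 0.22. A minimal equivalent with the current SimpleImputer API, on toy data rather than the competition data, looks like:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.5, np.nan],
              [0.2, 0.4],
              [0.8, 0.6]])

imputer = SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)  # NaN replaced by the column median, here 0.5

poly = PolynomialFeatures(degree=2).fit_transform(X_filled)
print(poly.shape)  # (3, 6): columns 1, a, b, a^2, a*b, b^2
```

The fit/transform pattern is identical, so swapping the import is the only change needed in the notebook code.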

The names of the new features can be retrieved with get_feature_names() (renamed to get_feature_names_out() in scikit-learn 1.0):

```python
poly_features_names = poly_transformer.get_feature_names(
    input_features=['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])
poly_features_names
```

['1', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'EXT_SOURCE_1^2', 'EXT_SOURCE_1 EXT_SOURCE_2', 'EXT_SOURCE_1 EXT_SOURCE_3', 'EXT_SOURCE_1 DAYS_BIRTH', 'EXT_SOURCE_2^2', 'EXT_SOURCE_2 EXT_SOURCE_3', 'EXT_SOURCE_2 DAYS_BIRTH', 'EXT_SOURCE_3^2', 'EXT_SOURCE_3 DAYS_BIRTH', 'DAYS_BIRTH^2', 'EXT_SOURCE_1^3', 'EXT_SOURCE_1^2 EXT_SOURCE_2', 'EXT_SOURCE_1^2 EXT_SOURCE_3', 'EXT_SOURCE_1^2 DAYS_BIRTH', 'EXT_SOURCE_1 EXT_SOURCE_2^2', 'EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3', 'EXT_SOURCE_1 EXT_SOURCE_2 DAYS_BIRTH', 'EXT_SOURCE_1 EXT_SOURCE_3^2', 'EXT_SOURCE_1 EXT_SOURCE_3 DAYS_BIRTH', 'EXT_SOURCE_1 DAYS_BIRTH^2', 'EXT_SOURCE_2^3', 'EXT_SOURCE_2^2 EXT_SOURCE_3', 'EXT_SOURCE_2^2 DAYS_BIRTH', 'EXT_SOURCE_2 EXT_SOURCE_3^2', 'EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH', 'EXT_SOURCE_2 DAYS_BIRTH^2', 'EXT_SOURCE_3^3', 'EXT_SOURCE_3^2 DAYS_BIRTH', 'EXT_SOURCE_3 DAYS_BIRTH^2', 'DAYS_BIRTH^3']

In the earlier correlation analysis, the EXT_SOURCE external-data features were negatively correlated with TARGET. Below, the correlation coefficients between TARGET and the combinations of these weakly correlated features are computed.

```python
poly_features = pd.DataFrame(poly_features, columns=poly_features_names)
# add TARGET back
poly_features['TARGET'] = target
# correlation coefficients with TARGET
poly_corrs = poly_features.corr()['TARGET'].sort_values()
```

Correlations sorted from smallest to largest:

```python
poly_corrs = poly_corrs.drop(['TARGET'])
plt.figure(figsize=(10, 10))
poly_corrs.plot(kind='barh')
```

<matplotlib.axes._subplots.AxesSubplot at 0x7f9e7c757cf8>

[Figure: horizontal bar chart of the correlations between the polynomial features and TARGET]

```python
poly_corrs.tail(10)
```

EXT_SOURCE_1                -0.098887
EXT_SOURCE_1^2 DAYS_BIRTH   -0.097507
EXT_SOURCE_1 DAYS_BIRTH^2   -0.094913
EXT_SOURCE_1^2              -0.091034
EXT_SOURCE_1^3              -0.083005
DAYS_BIRTH                  -0.078239
DAYS_BIRTH^2                -0.076672
DAYS_BIRTH^3                -0.074273
TARGET                       1.000000
1                                 NaN
Name: TARGET, dtype: float64

Some of the newly constructed features are somewhat more strongly correlated than the originals, but when modeling starts, not all of the new features will be added to the training data; we can keep some and drop others as appropriate. In real machine learning projects this choice is rarely easy, and the only real option is to experiment, because it is often hard to tell whether one feature is better than another.

```python
poly_features['SK_ID_CURR'] = train_data['SK_ID_CURR']
# merge into train_data
poly_train_data = train_data.merge(poly_features, on='SK_ID_CURR', how='left')

# merge poly_features_test into test_data
poly_features_test = pd.DataFrame(poly_features_test, columns=poly_features_names)
poly_features_test['SK_ID_CURR'] = test_data['SK_ID_CURR']
poly_test_data = test_data.merge(poly_features_test, on='SK_ID_CURR', how='left')

# align the columns
poly_train_data, poly_test_data = poly_train_data.align(poly_test_data, join='inner', axis=1)
print('Train data shape:', poly_train_data.shape)
print('Test data shape:', poly_test_data.shape)
```

Train data shape: (, 274)
Test data shape: (48744, 274)

```python
poly_train_data['TARGET'] = target
poly_train_data.shape
```

(, 275)

Save data

```python
poly_train_data.to_csv('data/poly_train_data.csv', index=False)
poly_test_data.to_csv('data/poly_test_data.csv', index=False)
```

Domain Knowledge Features

  • CREDIT_INCOME_PERCENT: credit amount as a percentage of the client's income
  • ANNUITY_INCOME_PERCENT: loan annuity as a percentage of the client's income
  • ANNUITY_CREDIT_PERCENT: annuity as a percentage of the credit amount
  • DAYS_EMPLOYED_PERCENT: days employed as a percentage of the client's age in days

```python
domain_train_data = train_data.copy()
domain_test_data = test_data.copy()

domain_train_data['CREDIT_INCOME_PERCENT'] = domain_train_data['AMT_CREDIT'] / domain_train_data['AMT_INCOME_TOTAL']
domain_train_data['ANNUITY_INCOME_PERCENT'] = domain_train_data['AMT_ANNUITY'] / domain_train_data['AMT_INCOME_TOTAL']
domain_train_data['ANNUITY_CREDIT_PERCENT'] = domain_train_data['AMT_ANNUITY'] / domain_train_data['AMT_CREDIT']
domain_train_data['DAYS_EMPLOYED_PERCENT'] = domain_train_data['DAYS_EMPLOYED'] / domain_train_data['DAYS_BIRTH']

domain_test_data['CREDIT_INCOME_PERCENT'] = domain_test_data['AMT_CREDIT'] / domain_test_data['AMT_INCOME_TOTAL']
domain_test_data['ANNUITY_INCOME_PERCENT'] = domain_test_data['AMT_ANNUITY'] / domain_test_data['AMT_INCOME_TOTAL']
domain_test_data['ANNUITY_CREDIT_PERCENT'] = domain_test_data['AMT_ANNUITY'] / domain_test_data['AMT_CREDIT']
domain_test_data['DAYS_EMPLOYED_PERCENT'] = domain_test_data['DAYS_EMPLOYED'] / domain_test_data['DAYS_BIRTH']
```
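One caveat worth noting with these ratio features (a sketch with made-up toy numbers, not the competition data): pandas division propagates NaN and produces inf when the denominator is 0, so the new columns can need the same missing-value handling as the originals:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'AMT_CREDIT':       [400000.0, 250000.0, 90000.0],
    'AMT_INCOME_TOTAL': [200000.0, 100000.0, 0.0],     # a zero income
    'AMT_ANNUITY':      [20000.0,  np.nan,   9000.0],  # a missing annuity
})

df['CREDIT_INCOME_PERCENT'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['ANNUITY_CREDIT_PERCENT'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']

print(df['CREDIT_INCOME_PERCENT'].tolist())   # [2.0, 2.5, inf]
print(df['ANNUITY_CREDIT_PERCENT'].tolist())  # [0.05, nan, 0.1]
```

A common follow-up is to replace inf with NaN (for example with df.replace) and then impute, so the correlation and modeling steps below do not choke on infinite values.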

Correlation analysis

```python
domain_features = domain_train_data[['TARGET', 'AMT_CREDIT', 'AMT_ANNUITY', 'DAYS_EMPLOYED',
                                     'DAYS_BIRTH', 'AMT_INCOME_TOTAL', 'CREDIT_INCOME_PERCENT',
                                     'ANNUITY_INCOME_PERCENT', 'ANNUITY_CREDIT_PERCENT',
                                     'DAYS_EMPLOYED_PERCENT']]
domain_corrs = domain_features.corr()['TARGET'].sort_values()

domain_corrs = domain_corrs.drop(['TARGET'])
plt.figure(figsize=(10, 6))
plt.xticks(rotation=90)
plt.ylim(-0.1, 0.2)
domain_corrs.plot.bar()
```

[Figure: bar chart of the correlations between the domain features and TARGET]
Judging by the correlation coefficients, the newly constructed features are stronger than the originals (except DAYS_EMPLOYED), so the new features do carry some signal about the target.

Kernel density estimation (KDE)

Observe these features across the two groups of clients:

```python
plt.figure(figsize=(12, 20))
for i, f_name in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT',
                            'ANNUITY_CREDIT_PERCENT', 'DAYS_EMPLOYED_PERCENT']):
    plt.subplot(4, 1, i + 1)
    # TARGET = 0
    sns.kdeplot(domain_features.loc[domain_features['TARGET'] == 0, f_name], label='target=0')
    # TARGET = 1
    sns.kdeplot(domain_features.loc[domain_features['TARGET'] == 1, f_name], label='target=1')
    plt.title('Distribution of %s by TARGET' % f_name)
    plt.xlabel('%s' % f_name)
    plt.ylabel('Density')
# adjust vertical spacing between subplots
plt.tight_layout(h_pad=2.5)
```

[Figure: KDE plots of the four domain features, split by TARGET]

For the first two features, the density curves of the two groups overlap almost completely, so they barely separate the two classes of clients; the last two show clear differences, although the overall shapes are still similar. From correlations and density estimates alone it is hard to know how much these features will help a model; the only way to find out is to try them.
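Beyond eyeballing KDE curves, a two-sample Kolmogorov-Smirnov test is one way to put a number on how different the two groups' distributions are. A sketch on synthetic data (scipy.stats.ks_2samp; this is an addition, not part of the original notebook):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
group0 = rng.normal(0.0, 1.0, 1000)  # stand-in for a feature's values where TARGET == 0
group1 = rng.normal(0.3, 1.0, 1000)  # stand-in for the values where TARGET == 1

stat, p_value = ks_2samp(group0, group1)
print(round(stat, 3))  # KS statistic in [0, 1]; larger means more separated distributions
```

Applied to the real columns, a larger statistic for DAYS_EMPLOYED_PERCENT than for CREDIT_INCOME_PERCENT would confirm what the plots suggest, though as noted above, only modeling shows whether that separation helps.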
As a supplement, a quick look at Featuretools.

Featuretools

Feature Labs released a feature-generation tool that constructs features automatically:

```python
import featuretools as ft

auto_train_data = train_data.copy()
auto_test_data = test_data.copy()
target = train_data['TARGET']

auto_train_data.shape, auto_test_data.shape
```

((, 241), (48744, 240))
```python
# initialize an EntitySet and give it an id
es = ft.EntitySet(id='train_data')
# load the dataframe as an entity
es = es.entity_from_dataframe(entity_id='train_data',
                              dataframe=auto_train_data,
                              index='SK_ID_CURR')
es
```

Entityset: train_data
  Entities:
    train_data [Rows: , Columns: 241]
  Relationships:
    No relationships

```python
auto_train_data, features = ft.dfs(entityset=es, target_entity='train_data', verbose=True)
```

Built 240 features
Elapsed: 00:08 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks
```python
es2 = ft.EntitySet(id='test_data')
es2 = es2.entity_from_dataframe(entity_id='test_data',
                                dataframe=auto_test_data,
                                index='SK_ID_CURR')
auto_test_data, features = ft.dfs(entityset=es2, target_entity='test_data', verbose=True)
```

Built 239 features
Elapsed: 00:03 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks

```python
# align the columns
auto_train_data, auto_test_data = auto_train_data.align(auto_test_data, join='inner', axis=1)

auto_train_data.shape
```

(, 238)

```python
auto_test_data.shape
```

(48744, 238)

```python
auto_train_data['TARGET'] = target

auto_test_data.to_csv('data/auto_test_data.csv')
auto_train_data.to_csv('data/auto_train_data.csv')
```