Partial discharge measurement is an important method for monitoring power equipment. This article uses partial-discharge data as a running example of feature selection, covering three methods: L1 regularization, sequential feature selection, and random forests.
Partial Discharge
When the electric field that an applied voltage produces inside electrical equipment is strong enough to cause discharge in part of the insulation, but no fixed discharge channel forms in the discharge region, the phenomenon is called partial discharge.
Partial discharge is widely used in power-equipment testing: from the discharge behavior one can assess equipment faults and aging state.
This article takes a set of partial-discharge data and applies several feature-selection methods to pick out suitable PD signal features.
Feature Selection Overview
There is a famous saying in machine learning:
garbage in, garbage out
The point is that no matter how powerful the model, it needs good data, and a key property of good data is that it is strongly related to the problem being solved.
There are two main approaches to feature engineering:
- feature selection
- feature extraction
This article focuses on feature selection: picking a subset of the existing features rather than constructing new ones. The advantages are:
- With too many features, the model becomes overly complex and may overfit
- With too many features, the data becomes sparse (the curse of dimensionality)
- With too many features, training and prediction require more computation
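The cost of extra features can be seen in a small sketch on synthetic data (generated with `make_classification`, a stand-in for the PD data): a logistic regression is scored once with only the informative columns and once with 45 pure-noise columns added.

```python
# Sketch: irrelevant features make the learning problem harder.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5 informative features plus 45 pure-noise features; with shuffle=False
# the informative columns come first.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

clf = LogisticRegression(max_iter=1000)
acc_all = cross_val_score(clf, X, y, cv=5).mean()          # all 50 features
acc_informative = cross_val_score(clf, X[:, :5], y, cv=5).mean()  # 5 informative

print("accuracy with all features:", acc_all)
print("accuracy with informative features only:", acc_informative)
```

The exact numbers depend on the random seed; the point is that discarding the noise columns typically does not hurt, and often helps, while cutting the computation.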
Data Preprocessing
```python
# The discharge data contains the following features:
# pd_location, signal_width, rise_time, fall_time, peak_voltage, polarity,
# mean_voltage, rms, sd, skewness, kurtosis, crest, form_factor, MainFreq,
# phase_angle, T, W, pC

# Load the data
import pandas as pd
d = pd.read_csv("2C4KPdNoise.csv")

# Inspect the data
d.head()
d.groupby("pd_class").agg({'T': len})
d.isnull().any()

# Train/test split
# (sklearn.cross_validation was removed; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
n1 = d.shape[1] - 1
X, y = d.iloc[:, 0:n1].values, d.iloc[:, n1].values
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Standardize the features
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
```
L1-regularization
L1 regularization is solved with coordinate descent and the subgradient, and it quickly drives individual feature weights exactly to zero; this is the key difference from L2 regularization, which only shrinks weights toward zero.
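A minimal sketch on synthetic data (not the PD data) of this zeroing effect: with the same regularization strength, the L1 penalty sets some coefficients exactly to zero while the L2 penalty only shrinks them.

```python
# Sketch: L1 produces sparse coefficients, L2 does not.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Small C = strong regularization; liblinear supports both penalties.
l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.1, solver='liblinear').fit(X, y)

print("coefficients at zero with L1:", int(np.sum(l1.coef_ == 0)))
print("coefficients at zero with L2:", int(np.sum(l2.coef_ == 0)))
```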
By repeatedly adjusting the regularization strength, the following feature subsets were obtained:

```
['W', 'pC']
['T', 'W', 'pC']
['pd_location', 'kurtosis', 'phase_angle', 'T', 'W', 'pC']
['pd_location', 'skewness', 'kurtosis', 'phase_angle', 'T', 'W', 'pC']
```
Each feature's weight can then be plotted against the regularization strength C to give the L1 path.
The code:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

COLUMNS = d.columns[:-1]

# First pass over the regularization strength
weights, params = [], []
for c in np.arange(-4, 6):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)
weights = np.array(weights)
pd.DataFrame({"c": np.arange(-4, 6),
              "num": [sum(i != 0) for i in weights]})

# Second pass, zooming in
weights, params = [], []
for c in np.linspace(-2, -1, 5):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)
weights = np.array(weights)
pd.DataFrame({"c": np.linspace(-2, -1, 5),
              "num": [sum(i != 0) for i in weights]})

# Third pass, zooming in further
weights, params = [], []
for c in np.linspace(-1.25, -1, 10):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)
weights = np.array(weights)
pd.DataFrame({"c": np.linspace(-1.25, -1, 10),
              "num": [sum(i != 0) for i in weights]})

# Inspect which features survive at each strength
for i in range(len(weights)):
    print(COLUMNS[weights[i] != 0])

# Visualization; colors were picked with iWantHue
# (two hex codes were truncated in the original and are left as "#")
import matplotlib.pyplot as plt
%matplotlib inline
colors = ["#6FC9CA", "#CD5430", "#CE4DC4", "#88D54D", "#3D455E",
          "#5A752D", "#CC436D", "#", "#D2B43F", "#CD88BC", "#C8C28D",
          "#71D195", "#6A89C8", "#C18170", "#7F68D1", "#4A6553",
          "#BABAC7", "#"]

# Compute the weights across a wide range of C
weights, params = [], []
for c in np.arange(-2, 10):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)
weights = np.array(weights)

# Plot the L1 path
fig = plt.figure()
ax = plt.subplot(111)
for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=COLUMNS[column], color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=5)
plt.xlim([10.**(-5), 10.**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
ax.legend(loc='upper center', bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
plt.savefig('l1_path.png', dpi=600)
plt.show()
```
sequential feature selection
This is a family of greedy algorithms; the main variants are:
- forward selection
- backward selection
- forward and backward selection
It can be combined with different models and different evaluation criteria; common criteria include:
- accuracy on a validation set
- p-values of the features
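Recent scikit-learn (0.24+) also ships a ready-made `SequentialFeatureSelector` implementing forward and backward selection with cross-validated scoring; a minimal sketch on synthetic stand-in data (not the PD data):

```python
# Sketch: built-in sequential feature selection with logistic regression.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction='forward',  # or 'backward'
                                cv=5)
sfs.fit(X, y)

print("selected feature mask:", sfs.get_support())
print("reduced shape:", sfs.transform(X).shape)
```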
Below are the feature subsets selected with a logistic regression model, scored by classification accuracy on the test set:

```
['pd_location' 'signal_width' 'peak_voltage' 'form_factor' 'pC']
['signal_width' 'peak_voltage' 'form_factor' 'pC']
['signal_width' 'peak_voltage' 'pC']
['peak_voltage' 'pC']
['peak_voltage']
```
The selection process is plotted below:

The code:
```python
# Define a sequential backward selection (SBS) class
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation
from sklearn.metrics import accuracy_score


class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train,
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]
        # Greedily drop the feature whose removal hurts accuracy least
        while dim > self.k_features:
            scores = []
            subsets = []
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, X_test, y_test, p)
                scores.append(score)
                subsets.append(p)
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        return self.scoring(y_test, y_pred)


# Train
%matplotlib inline
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

lr = LogisticRegression()

# Select features
sbs = SBS(lr, k_features=1)
sbs.fit(X_train_std, y_train)

# Plot the performance of each feature subset
k_feat = [len(k) for k in sbs.subsets_]
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0, 1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('./sbs.png', dpi=300)
plt.show()

# Inspect the selected feature subsets
for i in sbs.subsets_:
    print(np.array(COLUMNS[list(i)]))
```
random forest
Random forests select features according to feature importance. The core idea is to use the out-of-bag (OOB) samples: permute the values of one feature and measure how much the model's performance degrades.

The results:

```
 1) pd_location     0.
 2) signal_width    0.
 3) rise_time       0.
 4) fall_time       0.
 5) peak_voltage    0.084795
 6) polarity        0.071462
 7) mean_voltage    0.056699
 8) rms             0.056144
 9) sd              0.048083
10) skewness        0.044781
11) kurtosis        0.041479
12) crest           0.036688
13) form_factor     0.033465
14) MainFreq        0.029730
15) phase_angle     0.029136
16) T               0.011835
17) W               0.010076
18) pC              0.000000
```

Note that when two features are highly correlated, this method may assign a high importance to one and a very low importance to the other. This does not affect prediction, but it does harm interpretability.
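scikit-learn's `permutation_importance` implements the same permutation idea (computed on a held-out set rather than on OOB samples); a minimal sketch on synthetic stand-in data (not the PD data):

```python
# Sketch: permutation importance of a random forest on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times and record the accuracy drop
result = permutation_importance(forest, X_te, y_te,
                                n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.3f}")
```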
The code:
```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt

feat_labels = d.columns[0:-1]
forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    # label and importance must both be indexed through `indices`
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), importances[indices],
        color='lightblue', align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
# plt.savefig('./figures/random_forest.png', dpi=300)
plt.show()

# Select features above a given importance threshold
# (forest.transform was removed from scikit-learn; use SelectFromModel)
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
X_selected = sfm.transform(X_train)
X_selected.shape
```
References
- Python Machine Learning (Sebastian Raschka)



Source: https://51itzy.com/kjqy/21781.html