Linear Regression in Machine Learning: Starting the Machine Learning Journey with Linear Regression

Linear Regression

Linear regression is a statistical method that models the relationship between two numerical variables. It is a linear model: it assumes that the output variable depends linearly on the input variable.

It takes into account an input variable (x) and an output variable (y). The model implies that y can be calculated from a linear combination of the input variables (x).

Linear Regression Model Representation

Linear regression can be expressed as the equation:

y = B0 + B1 * x

Here x is the input variable and y is the predicted output. B stands for the Greek letter beta and denotes the coefficients: B1 is a scale factor assigned to the input variable, and the additional coefficient B0 is the intercept, also called the bias.
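The equation above can be sketched as a small function. This is a minimal illustration only; the function name `predict` and the coefficient values are assumptions made up for the example, not values from the article.

```python
def predict(x, b0, b1):
    """Return y = B0 + B1 * x for a single input value x."""
    return b0 + b1 * x

# Hypothetical coefficients, chosen only for illustration:
# an intercept of 50000 and a slope of 150 per square foot.
b0, b1 = 50000.0, 150.0

# Predicted price for a 1000 sq ft house: 50000 + 150 * 1000
price = predict(1000.0, b0, b1)
print(price)
```

Changing `b1` changes how strongly the prediction reacts to x, while `b0` shifts every prediction by the same amount.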

Types of Linear Regression

Simple Linear Regression: It uses a single input variable (x) to predict the output variable (y).

Example: Predicting the price of a house from its square footage. Here, the square footage of the house is the input variable and the price of the house is the output variable.

Multiple Regression: It uses more than one input variable to predict the output variable (y).

Example: Predicting the house price from the area of the house, the number of rooms, and HouseStyle. Here, the area, the number of rooms, and HouseStyle are the input variables, and the house price is the output variable.
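As a rough sketch of multiple regression, the snippet below fits two numeric inputs (area and number of rooms) with NumPy's least-squares solver. The data is invented for illustration and generated from known coefficients so the fit can be checked; HouseStyle is left out because a categorical input would first need to be encoded as numbers.

```python
import numpy as np

# Made-up data: columns are area (sq ft) and number of rooms.
X = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 3.0],
    [2500.0, 4.0],
    [3000.0, 5.0],
])

# Prices generated from y = 10000 + 100 * area + 5000 * rooms (an assumption
# for this example), so the fit should recover these three coefficients.
y = 10000.0 + 100.0 * X[:, 0] + 5000.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept B0.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimizes the sum of squared prediction errors.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)
```

Each input variable receives its own coefficient, and the prediction is their linear combination plus the intercept.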

Regularization

Regularization is a technique that adds information to the regression problem, typically a penalty on the size of the coefficients, so that the coefficients shrink toward zero. This helps avoid overfitting and keeps the model from becoming overly complex. It is commonly used when there is collinearity among the input variables.

Types of Regularization-Based Regression

Lasso Regression: It is also known as L1 regularization. Ordinary Least Squares is modified so that it also minimizes the sum of the absolute values of the coefficients.

Example: With 10,000 features available to predict a variable, the Lasso model keeps only a few coefficients and sets the rest to zero.
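The feature-selection effect of Lasso can be sketched with a small coordinate-descent solver. This is one common way to solve the L1-penalized problem, not necessarily what any particular library does; the function names, the synthetic data, the penalty strength `alpha=0.1`, and the omission of an intercept are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it is within threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * sum(|w|), without an intercept."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            residual_without_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual_without_j / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Synthetic data: 10 features, but only the first two actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

w = lasso_coordinate_descent(X, y, alpha=0.1)
print(np.round(w, 2))
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is why Lasso is often used when only a few of many features are expected to matter.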

Ridge Regression: It is also known as L2 regularization. Ordinary Least Squares is modified so that it also minimizes the sum of the squared coefficients. When the coefficients in the regression are unbalanced, we introduce an alpha value, which controls the strength of the penalty, to improve the model.

Example: When predicting the sales of outlets, if the type of outlet receives a much higher weight than the items sold there, we introduce alpha, which shrinks the coefficients.
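A minimal sketch of ridge regression, using the closed-form solution w = (XᵀX + alpha·I)⁻¹ Xᵀy. The function name `ridge_fit`, the synthetic nearly collinear inputs, and the omission of an intercept are assumptions made for the illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; alpha = 0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Two nearly collinear inputs: x2 is x1 plus a small amount of noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * rng.normal(size=100)

w_ols = ridge_fit(X, y, alpha=0.0)    # no penalty
w_ridge = ridge_fit(X, y, alpha=1.0)  # penalized

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

With collinear inputs the unpenalized coefficients can become large and unstable; increasing alpha pulls them toward smaller, more balanced values.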

Gradient Descent

Gradient descent is a process that optimizes the coefficients by iteratively reducing the model's error on the training data. At each iteration, the coefficients are updated in the direction that reduces the error, with the size of the update scaled by a learning rate. The process repeats until the sum of squared errors reaches a minimum or no further improvement is possible.

The learning rate is the size of the improvement step taken at each iteration and should be chosen carefully: too large a value can overshoot the minimum, while too small a value makes convergence slow.
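The update loop described above might look like the following sketch for simple linear regression. It minimizes the mean squared error, which has the same minimizer as the sum of squared errors; the function name, the learning rate of 0.05, the fixed iteration count, and the toy data are assumptions for the example.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iters=2000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = (b0 + b1 * x) - y                 # prediction error per sample
        grad_b0 = 2.0 * error.sum() / n           # d(MSE)/d(b0)
        grad_b1 = 2.0 * (error * x).sum() / n     # d(MSE)/d(b1)
        b0 -= learning_rate * grad_b0             # step against the gradient
        b1 -= learning_rate * grad_b1
    return b0, b1

# Toy data generated from y = 1 + 2x, so the fit should approach b0=1, b1=2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

b0, b1 = gradient_descent(x, y)
print(b0, b1)
```

On this data a much larger learning rate (for example 0.2) makes the updates diverge instead of converge, which is why the step size has to be chosen with care.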

Types of Gradient Descent

Stochastic Gradient Descent: This method iterates through the training set, and each time it encounters a training example it updates the parameters according to the error gradient of that single example only.

Example: If the training data has 200 samples, the parameters are updated 200 times in one pass, once for each individual sample.

Batch Gradient Descent: This method looks at every example in the entire training set on every step.

Example: If the training set has 100 samples, the parameters of the model are updated only once per pass, based on all examples.
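The difference in update frequency can be sketched by counting parameter updates during one pass over the data. The helper names `batch_epoch` and `sgd_epoch`, the learning rate, and the synthetic data are assumptions for the illustration.

```python
import numpy as np

def batch_epoch(x, y, b0, b1, lr):
    """One pass of batch gradient descent: a single update from all samples."""
    error = (b0 + b1 * x) - y
    b0 -= lr * 2.0 * error.mean()
    b1 -= lr * 2.0 * (error * x).mean()
    return b0, b1, 1  # exactly one update per pass

def sgd_epoch(x, y, b0, b1, lr, rng):
    """One pass of stochastic gradient descent: one update per sample."""
    updates = 0
    for i in rng.permutation(len(x)):  # visit samples in random order
        error = (b0 + b1 * x[i]) - y[i]
        b0 -= lr * 2.0 * error
        b1 -= lr * 2.0 * error * x[i]
        updates += 1
    return b0, b1, updates

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 1.0 + 2.0 * x

_, _, n_batch = batch_epoch(x, y, 0.0, 0.0, lr=0.1)
_, _, n_sgd = sgd_epoch(x, y, 0.0, 0.0, lr=0.1, rng=rng)
print(n_batch, n_sgd)
```

Batch updates are smoother but each one costs a full pass over the data, while stochastic updates are cheap and frequent but noisier.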

Regression Line Properties

With regression coefficients B0 and B1, the line has the following properties:

The line minimizes the sum of squared differences between the actual values and the predicted values.
The regression line passes through the point given by the means of the X and Y values.
B0 is the y-intercept of the regression line.
B1 is the average change in Y for a 1-unit change in X. It is also known as the slope of the regression line.

The least-squares regression line is the only straight line that has all of these properties.
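These properties can be checked numerically on a small example. The sketch below uses the standard closed-form least-squares estimates for B1 and B0; the five data points and the helper name `sse` are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form least-squares estimates for simple linear regression.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def sse(intercept, slope):
    """Sum of squared differences between actual and predicted values."""
    return np.sum((y - (intercept + slope * x)) ** 2)

# Property: the line passes through the point (mean of X, mean of Y).
print(np.isclose(b0 + b1 * x.mean(), y.mean()))

# Property: nudging either coefficient away from the fit increases the error.
print(sse(b0, b1) < sse(b0 + 0.1, b1))
print(sse(b0, b1) < sse(b0, b1 + 0.1))
```

For this data the slope works out to 0.6 and the intercept to 2.2, so each 1-unit increase in x raises the predicted y by 0.6.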

Defining the Relationship Between Input and Output Variables

When B1 > 0, x and y have a positive relationship: as x increases, y increases.

When B1 < 0, x and y have a negative relationship: as x increases, y decreases.

For example, when we are trying to predict the house price, the house type and the number of rooms used to define the model are the input variables, and the house price is the output variable.

How to Check Model Performance?

We plot the actual values and the predicted values on a graph. The main idea is to find the line that best fits the data. The best line is the one for which the total prediction error is smallest. The error is the distance between a data point and the regression line.
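The total prediction error can be computed directly from the actual and predicted values. The sketch below uses invented numbers; the root mean squared error (RMSE) is included as a commonly used summary and is an addition for illustration, not a metric named in the article.

```python
import numpy as np

# Made-up actual values and model predictions for four data points.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 10.0])

# Error per point: the vertical distance between the data point and the line.
residuals = actual - predicted

# Total prediction error as the sum of squared errors (SSE).
sse = np.sum(residuals ** 2)

# RMSE expresses the typical error in the same units as the output variable.
rmse = np.sqrt(sse / len(actual))

print(sse, rmse)
```

Comparing the SSE of two candidate lines on the same data shows which one fits better: the smaller value wins.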