Machine Learning Notes: Linear Regression with Time Series
2022-06-25 07:52:00 【Sit and watch the clouds rise】
I. Time series forecasting
Forecasting is perhaps the most common application of machine learning in the real world. Businesses forecast product demand, governments forecast economic and population growth, and meteorologists forecast the weather. Understanding what is to come is a pressing need across science, government, and industry (not to mention our personal lives!), and practitioners in these fields are increasingly applying machine learning to address it.
Time series forecasting is a broad field with a long history. This course focuses on applying modern machine learning methods to time series data, with the goal of producing the most accurate predictions. The lessons were inspired by winning solutions from past Kaggle forecasting competitions, but they apply whenever accurate forecasts are a priority.
After completing this course, you will know how to:
- engineer features to model the major time series components (trends, seasons, and cycles),
- visualize time series with many kinds of time series plots,
- create forecasting hybrids that combine the strengths of complementary models, and
- adapt machine learning methods to a variety of forecasting tasks.
As part of the exercises, you will get a chance to participate in the Store Sales - Time Series Forecasting Getting Started competition. In this competition, your task is to forecast sales for Corporación Favorita (a large Ecuadorian grocery retailer) in almost 1800 product categories.
Store Sales - Time Series Forecasting | Kaggle: Use machine learning to predict grocery sales
https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data
The basic object of forecasting is the time series, which is a set of observations recorded over time. In forecasting applications, the observations are typically recorded at a regular frequency, such as daily or monthly.
Date | Hardcover |
---|---|
2000-04-01 | 139 |
2000-04-02 | 128 |
2000-04-03 | 172 |
2000-04-04 | 139 |
2000-04-05 | 191 |
This series records the number of hardcover books sold at a retail store over 30 days; the table above shows the first five. Notice that we have a single column of observations, Hardcover, with a time index, Date.
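The code later in these notes works on a DataFrame called `df` without showing how it is built. As a minimal sketch, assuming only the five rows shown in the table above (the full 30-day dataset is not reproduced here), such a frame could be constructed like this:

```python
import pandas as pd

# Fixed daily frequency, matching the dates in the table above.
dates = pd.date_range("2000-04-01", periods=5, freq="D", name="Date")

# Only the five observations shown in the table; the remaining days
# of the series are not reproduced in these notes.
df = pd.DataFrame({"Hardcover": [139, 128, 172, 139, 191]}, index=dates)
print(df)
```

The time index carries the recording frequency, and the single column holds the observations.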
II. Linear regression with time series
In the first part of this course, we will use the linear regression algorithm to build forecasting models. Linear regression is widely used in practice and adapts naturally to even complex forecasting tasks.
The linear regression algorithm learns how to make a weighted sum from its input features. For two features, we would have:
target = weight_1 * feature_1 + weight_2 * feature_2 + bias
During training, the regression algorithm learns the values of the parameters weight_1, weight_2, and bias that best fit the target. (This algorithm is often called ordinary least squares, since it chooses values that minimize the squared error between the target and the predictions.) The weights are also called regression coefficients, and the bias is also called the intercept because it tells you where the graph of this function crosses the y-axis.
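To make the least-squares idea concrete, here is a small sketch using NumPy's `np.linalg.lstsq`. The data is synthetic and the true values 2.0, -3.0, and 5.0 are assumptions chosen for illustration; none of it comes from the course dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): the target is an
# exact weighted sum of two features plus a bias.
rng = np.random.default_rng(0)
feature_1 = rng.normal(size=50)
feature_2 = rng.normal(size=50)
target = 2.0 * feature_1 - 3.0 * feature_2 + 5.0

# Append a column of ones so the last coefficient acts as the bias.
A = np.column_stack([feature_1, feature_2, np.ones_like(feature_1)])

# Ordinary least squares: pick the coefficients that minimize the
# squared error between A @ coef and the target.
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
weight_1, weight_2, bias = coef
print(weight_1, weight_2, bias)
```

Because the synthetic target contains no noise, the fitted parameters recover the chosen weights and bias up to floating-point error.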
1. Time-step features
There are two kinds of features unique to time series: time-step features and lag features.
Time-step features are features we can derive directly from the time index. The most basic time-step feature is the time dummy, which counts off the time steps in the series from beginning to end.
import numpy as np
df['Time'] = np.arange(len(df.index))
df.head()
Date | Hardcover | Time |
---|---|---|
2000-04-01 | 139 | 0 |
2000-04-02 | 128 | 1 |
2000-04-03 | 172 | 2 |
2000-04-04 | 139 | 3 |
2000-04-05 | 191 | 4 |
Linear regression with the time dummy produces the model:
target = weight * time + bias
The time dummy then lets us fit curves to a time series in a time plot, where Time forms the x-axis.
import matplotlib.pyplot as plt
import seaborn as sns
# Note: Matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc(
    "figure",
    autolayout=True,
    figsize=(11, 4),
    titlesize=18,
    titleweight='bold',
)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
%config InlineBackend.figure_format = 'retina'
fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=df, color='0.75')
ax = sns.regplot(x='Time', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Hardcover Sales');
Time-step features let you model time dependence. A series is time dependent if its values can be predicted from the time they occurred. In the Hardcover Sales series, we can predict that sales later in the month are generally higher than sales earlier in the month.
2. Lag features
To make a lag feature, we shift the observations of the target series so that they appear to have occurred later in time. Here we have created a 1-step lag feature, though shifting by multiple steps is possible too.
df['Lag_1'] = df['Hardcover'].shift(1)
df = df.reindex(columns=['Hardcover', 'Lag_1'])
df.head()
Date | Hardcover | Lag_1 |
---|---|---|
2000-04-01 | 139 | NaN |
2000-04-02 | 128 | 139.0 |
2000-04-03 | 172 | 128.0 |
2000-04-04 | 139 | 172.0 |
2000-04-05 | 191 | 139.0 |
Linear regression with a lag feature produces the model:
target = weight * lag + bias
So lag features let us fit curves to lag plots, where each observation in a series is plotted against the previous observation.
fig, ax = plt.subplots()
ax = sns.regplot(x='Lag_1', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_aspect('equal')
ax.set_title('Lag Plot of Hardcover Sales');
You can see from the lag plot that sales on one day (Hardcover) are correlated with sales from the previous day (Lag_1). When you see a relationship like this, you know a lag feature will be useful.
More generally, lag features let you model serial dependence. A time series has serial dependence when an observation can be predicted from previous observations. In Hardcover Sales, we can predict that high sales on one day usually mean high sales the next day.
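One rough way to quantify serial dependence is the lag-1 autocorrelation, which pandas exposes as `Series.autocorr`. The sketch below uses a synthetic series (an assumption; it is not the Hardcover data) in which each value is 0.8 times the previous value plus noise.

```python
import numpy as np
import pandas as pd

# Synthetic series (an assumption for illustration): each value depends
# on the previous one, so the series is serially dependent by design.
rng = np.random.default_rng(0)
values = [0.0]
for noise in rng.normal(size=499):
    values.append(0.8 * values[-1] + noise)
series = pd.Series(values)

# Correlation between the series and its 1-step lag.
lag_1_corr = series.autocorr(lag=1)

# The same quantity computed by hand with shift, as in the lag feature.
manual_corr = series.corr(series.shift(1))
print(lag_1_corr, manual_corr)
```

A value well above zero suggests that a lag feature carries useful information about the next observation.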
Adapting machine learning algorithms to time series problems is largely a matter of feature engineering with the time index and lags. For most of the course, we use linear regression for its simplicity, but these features will be useful whichever algorithm you choose for your forecasting task.
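As a sketch of what that feature engineering can look like in one place, the hypothetical helper `make_features` below (the name and signature are assumptions, not part of the course) builds a time dummy and several lag columns from a series. It uses the five Hardcover values shown earlier.

```python
import numpy as np
import pandas as pd

def make_features(y: pd.Series, lags=(1, 2)) -> pd.DataFrame:
    """Hypothetical helper: a time dummy plus one column per lag."""
    features = pd.DataFrame(index=y.index)
    features["Time"] = np.arange(len(y))       # time-step feature
    for lag in lags:
        features[f"Lag_{lag}"] = y.shift(lag)  # multi-step lag features
    return features

# The five Hardcover observations shown earlier in these notes.
y = pd.Series(
    [139, 128, 172, 139, 191],
    index=pd.period_range("2000-04-01", periods=5, freq="D"),
    name="Hardcover",
)
X = make_features(y)
print(X)
```

Each lag of k steps leaves k missing values at the start of its column, which have to be filled or dropped before fitting a model.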
III. Example - Tunnel Traffic
Tunnel Traffic is a time series describing the number of vehicles traveling through the Baregg Tunnel in Switzerland each day from November 2003 to November 2005. In this example, we will get some practice applying linear regression to time-step features and lag features.
The hidden cell sets everything up.
from pathlib import Path
from warnings import simplefilter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
simplefilter("ignore") # ignore warnings to clean up output cells
# Set Matplotlib defaults
# Note: Matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 4))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'
# Load Tunnel Traffic dataset
data_dir = Path("../input/ts-course-data")
tunnel = pd.read_csv(data_dir / "tunnel.csv", parse_dates=["Day"])
# Create a time series in Pandas by setting the index to a date
# column. We parsed "Day" as a date type by using `parse_dates` when
# loading the data.
tunnel = tunnel.set_index("Day")
# By default, Pandas creates a `DatetimeIndex` with dtype `Timestamp`
# (equivalent to `np.datetime64`), representing a time series as a
# sequence of measurements taken at single moments. A `PeriodIndex`,
# on the other hand, represents a time series as a sequence of
# quantities accumulated over periods of time. Periods are often
# easier to work with, so that's what we'll use in this course.
tunnel = tunnel.to_period()
tunnel.head()
Day | NumVehicles |
---|---|
2003-11-01 | 103536 |
2003-11-02 | 92051 |
2003-11-03 | 100795 |
2003-11-04 | 102352 |
2003-11-05 | 106569 |
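The difference between the two index types described in the comments above can be seen on a tiny frame. This is a minimal sketch that rebuilds the first three rows of the table by hand rather than reading `tunnel.csv`; the variable names are assumptions.

```python
import pandas as pd

# First three rows of the table above, indexed by timestamps.
stamps = pd.DataFrame(
    {"NumVehicles": [103536, 92051, 100795]},
    index=pd.date_range("2003-11-01", periods=3, freq="D", name="Day"),
)

# Convert single moments (timestamps) into spans of time (periods).
periods = stamps.to_period()

print(type(stamps.index).__name__)   # DatetimeIndex
print(type(periods.index).__name__)  # PeriodIndex
```

With a daily `PeriodIndex`, each row stands for the whole day over which the vehicles were counted rather than a single instant.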
1. Time-step features
Provided the time series has no missing dates, we can create a time dummy by counting out the length of the series.
df = tunnel.copy()
df['Time'] = np.arange(len(tunnel.index))
df.head()
Day | NumVehicles | Time |
---|---|---|
2003-11-01 | 103536 | 0 |
2003-11-02 | 92051 | 1 |
2003-11-03 | 100795 | 2 |
2003-11-04 | 102352 | 3 |
2003-11-05 | 106569 | 4 |
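The time dummy above is only a faithful count of days if no dates are missing. A minimal sketch of how one might check this, using a hypothetical three-row series with 2003-11-03 deliberately left out:

```python
import pandas as pd

# Hypothetical daily series with one date (2003-11-03) removed
# on purpose to illustrate the check.
index = pd.PeriodIndex(
    ["2003-11-01", "2003-11-02", "2003-11-04"], freq="D", name="Day"
)
series = pd.Series([103536, 92051, 102352], index=index, name="NumVehicles")

# Every day between the first and last observation.
full_index = pd.period_range(index.min(), index.max(), freq="D")

# Dates that should be present but are not.
missing = full_index.difference(index)
print(missing)

# Reindexing restores the gap as NaN, so a time dummy built from the
# result counts real days rather than rows.
complete = series.reindex(full_index)
```

If `missing` is empty, `np.arange(len(df.index))` is a safe time dummy; otherwise the gaps need handling first.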
The procedure for fitting a linear regression model follows the standard steps for scikit-learn.
from sklearn.linear_model import LinearRegression
# Training data
X = df.loc[:, ['Time']] # features
y = df.loc[:, 'NumVehicles'] # target
# Train the model
model = LinearRegression()
model.fit(X, y)
# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)
The model actually created is (approximately): Vehicles = 22.5 * Time + 98176. Plotting the fitted values over time shows how fitting linear regression to the time dummy creates the trend line defined by this equation.
ax = y.plot(**plot_params)
ax = y_pred.plot(ax=ax, linewidth=3)
ax.set_title('Time Plot of Tunnel Traffic');
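The approximate equation quoted above can be read off a fitted model through its `coef_` and `intercept_` attributes. The sketch below uses a synthetic noisy trend as a stand-in (an assumption, since `tunnel.csv` is not included here), so its numbers will not match the tunnel model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the tunnel data (an assumption): a linear
# trend with slope 20 and intercept 1000, plus noise.
rng = np.random.default_rng(0)
time = np.arange(100)
X = pd.DataFrame({"Time": time})
y = pd.Series(20.0 * time + 1000.0 + rng.normal(scale=5.0, size=100))

model = LinearRegression().fit(X, y)
weight = model.coef_[0]    # slope of the trend line
bias = model.intercept_    # where the line crosses the y-axis
print(f"Vehicles = {weight:.1f} * Time + {bias:.0f}")

# Because the feature is just a step count, the trend can be
# extended by predicting on time steps beyond the training data.
X_future = pd.DataFrame({"Time": np.arange(100, 105)})
forecast = model.predict(X_future)
```

Extending the time dummy past the end of the training data is what turns the fitted trend line into a forecast.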
2. Lag features
Pandas provides a simple way to lag a series: the shift method.
df['Lag_1'] = df['NumVehicles'].shift(1)
df.head()
Day | NumVehicles | Time | Lag_1 |
---|---|---|---|
2003-11-01 | 103536 | 0 | NaN |
2003-11-02 | 92051 | 1 | 103536.0 |
2003-11-03 | 100795 | 2 | 92051.0 |
2003-11-04 | 102352 | 3 | 100795.0 |
2003-11-05 | 106569 | 4 | 102352.0 |
When creating lag features, we need to decide what to do with the missing values they produce. Filling them in is one option, perhaps with 0.0 or by "backfilling" with the first known value. Instead, we will simply drop the missing values, making sure to also drop the values in the target from the corresponding dates.
from sklearn.linear_model import LinearRegression
X = df.loc[:, ['Lag_1']]
X.dropna(inplace=True) # drop missing values in the feature set
y = df.loc[:, 'NumVehicles'] # create the target
y, X = y.align(X, join='inner') # drop corresponding values in target
model = LinearRegression()
model.fit(X, y)
y_pred = pd.Series(model.predict(X), index=X.index)
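For comparison with the drop approach used above, the sketch below shows the two filling options mentioned earlier side by side. It uses the first four NumVehicles values from the table; the variable names are assumptions.

```python
import pandas as pd

# A 1-step lag of the first four NumVehicles values; the first
# entry is NaN because nothing precedes it.
lag_1 = pd.Series([103536, 92051, 100795, 102352]).shift(1)

filled_zero = lag_1.fillna(0.0)  # option 1: fill with 0.0
backfilled = lag_1.bfill()       # option 2: backfill with the first known value
dropped = lag_1.dropna()         # option 3: drop the row (used in these notes)

print(pd.DataFrame({"zero": filled_zero, "bfill": backfilled}))
print(len(dropped))
```

Dropping keeps only rows with a genuine previous observation, at the cost of one training row per lag step.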
The lag plot shows how well we were able to fit the relationship between the number of vehicles on one day and the number on the previous day.
fig, ax = plt.subplots()
ax.plot(X['Lag_1'], y, '.', color='0.25')
ax.plot(X['Lag_1'], y_pred)
ax.set_aspect('equal')
ax.set_ylabel('NumVehicles')
ax.set_xlabel('Lag_1')
ax.set_title('Lag Plot of Tunnel Traffic');
What does this prediction from a lag feature mean for how well we can predict the series across time? The following time plot shows how our forecasts now respond to the behavior of the series in the recent past.
ax = y.plot(**plot_params)
ax = y_pred.plot()
The best time series models usually include some combination of time-step features and lag features. Over the next few lessons, we will learn how to engineer features that model the most common patterns in time series, using the features from this lesson as a starting point.
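A minimal sketch of such a combination, fitting one linear regression on both a time dummy and a 1-step lag. The series is synthetic (an assumption for illustration): an upward trend plus a component that depends on its own previous value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series (an assumption): trend plus serially dependent noise.
rng = np.random.default_rng(0)
n = 200
noise = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + noise[t]
y = pd.Series(0.1 * np.arange(n) + u, name="y")

# Time-step feature and lag feature side by side; drop the row
# whose lag is missing and align the target to what remains.
X = pd.DataFrame({"Time": np.arange(n), "Lag_1": y.shift(1)}).dropna()
y_aligned = y.loc[X.index]

model = LinearRegression().fit(X, y_aligned)
print(dict(zip(X.columns, model.coef_)), model.intercept_)
y_pred = pd.Series(model.predict(X), index=X.index)
```

The model then has the form target = weight_1 * time + weight_2 * lag + bias, matching the two-feature weighted sum introduced at the start of these notes.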
Move on to the exercise, where you will begin forecasting Store Sales using the techniques you learned in this tutorial.