Machine learning notes: linear regression with time series

2022-06-25 07:52:00 【Sit and watch the clouds rise】

One. Time series forecasting

Forecasting is perhaps the most common application of machine learning in the real world. Businesses forecast product demand, governments forecast economic and population growth, and meteorologists forecast the weather. Understanding what is to come is a pressing need across science, government, and industry (not to mention our personal lives!), and practitioners in these fields increasingly apply machine learning to address it.

Time series forecasting is a broad field with a long history. This course focuses on applying modern machine learning methods to time series data, with the goal of producing the most accurate predictions. The lessons were inspired by winning solutions from past Kaggle forecasting competitions, but they apply whenever accurate forecasts are a priority.

After completing this course, you will know how to:

engineer features to model the major time series components (trends, seasons, and cycles),

visualize time series with many kinds of time series plots,

create forecasting hybrids that combine the strengths of complementary models, and

adapt machine learning methods to a variety of forecasting tasks.

As part of the exercises, you will have the chance to participate in our Store Sales - Time Series Forecasting Getting Started competition. In this competition, your task is to forecast sales for Corporación Favorita (a large Ecuadorian grocery retailer) across almost 1800 product categories.

Store Sales - Time Series Forecasting | Kaggle: Use machine learning to predict grocery sales. https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data

The basic object of forecasting is the time series, which is a set of observations recorded over time. In forecasting applications, the observations are typically recorded at a regular frequency, such as daily or monthly.

Date	Hardcover
2000-04-01	139
2000-04-02	128
2000-04-03	172
2000-04-04	139
2000-04-05	191

This series records the number of hardcover book sales at a retail store over 30 days (the table above shows the first five). Notice that we have a single column of observations, Hardcover, with a time index, Date.

Two. Linear regression with time series

In the first part of this course, we will use the linear regression algorithm to build forecasting models. Linear regression is widely used in practice and adapts naturally to even complex forecasting tasks.

The linear regression algorithm learns how to make a weighted sum from its input features. For two features, we would have:

target = weight_1 * feature_1 + weight_2 * feature_2 + bias

During training, the regression algorithm learns the values of the parameters weight_1, weight_2, and bias that best fit the target. (This algorithm is often called ordinary least squares because it chooses values that minimize the squared error between the target and the predictions.) The weights are also called regression coefficients, and the bias is also called the intercept because it tells you where the graph of this function crosses the y-axis.
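To make the least-squares idea concrete, here is a minimal sketch that recovers weight_1, weight_2, and bias from a small, made-up dataset. The data and the "true" parameter values are assumptions chosen for illustration, and NumPy's `np.linalg.lstsq` stands in for whatever solver a regression library would use internally.

```python
import numpy as np

# Hypothetical toy data: the target is built from known parameters
# (weight_1 = 2.0, weight_2 = -1.0, bias = 5.0) with no noise added.
feature_1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
feature_2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
target = 2.0 * feature_1 - 1.0 * feature_2 + 5.0

# Append a column of ones so that the last coefficient plays the role of the bias.
A = np.column_stack([feature_1, feature_2, np.ones_like(feature_1)])

# Least squares chooses the coefficients that minimize the squared error
# between A @ coef and the target.
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
weight_1, weight_2, bias = coef

print(weight_1, weight_2, bias)
```

Because the toy target contains no noise, the fitted parameters should match the ones used to build it up to floating-point error; with real data they would only approximate the underlying relationship.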

1. Time-step features

There are two kinds of features unique to time series: time-step features and lag features.

Time-step features are features we can derive directly from the time index. The most basic time-step feature is the time dummy, which counts off the time steps in the series from beginning to end.

import numpy as np
df['Time'] = np.arange(len(df.index))
df.head()

Date	Hardcover	Time
2000-04-01	139	0
2000-04-02	128	1
2000-04-03	172	2
2000-04-04	139	3
2000-04-05	191	4

Linear regression with the time dummy produces the model:

target = weight * time + bias
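As a rough illustration of this model, the sketch below fits a straight line to only the five rows shown in the table above, so the numbers it produces are not those of the full 30-day series. The name `df_toy` is hypothetical, and `np.polyfit` is used here as a lightweight stand-in for a regression library.

```python
import numpy as np
import pandas as pd

# The five rows shown in the table above (not the full 30-day series).
df_toy = pd.DataFrame(
    {"Hardcover": [139, 128, 172, 139, 191]},
    index=pd.to_datetime(
        ["2000-04-01", "2000-04-02", "2000-04-03", "2000-04-04", "2000-04-05"]
    ),
)

# Time dummy: count the time steps from the start of the series.
df_toy["Time"] = np.arange(len(df_toy.index))

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line.
weight, bias = np.polyfit(df_toy["Time"], df_toy["Hardcover"], deg=1)

# Fitted values of the model: target = weight * time + bias
fitted = weight * df_toy["Time"] + bias
print(f"Hardcover = {weight:.1f} * Time + {bias:.1f}")
```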

The time dummy then lets us fit curves to time series in a time plot, where Time forms the x-axis.

import matplotlib.pyplot as plt
import seaborn as sns
# Note: on Matplotlib >= 3.6 this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc(
    "figure",
    autolayout=True,
    figsize=(11, 4),
    titlesize=18,
    titleweight='bold',
)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
%config InlineBackend.figure_format = 'retina'

fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=df, color='0.75')
ax = sns.regplot(x='Time', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Hardcover Sales');

Time-step features let you model time dependence. A series is time dependent if its values can be predicted from the time they occurred. In the Hardcover Sales series, we can predict that sales later in the month are generally higher than sales earlier in the month.

2. Lag features

To make a lag feature, we shift the observations of the target series so that they appear to have occurred later in time. Here we have created a 1-step lag feature, though shifting by multiple steps is possible too.

df['Lag_1'] = df['Hardcover'].shift(1)
df = df.reindex(columns=['Hardcover', 'Lag_1'])
df.head()

Date	Hardcover	Lag_1
2000-04-01	139	NaN
2000-04-02	128	139.0
2000-04-03	172	128.0
2000-04-04	139	172.0
2000-04-05	191	139.0
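Shifting by more than one step works the same way. The sketch below builds 1-step and 2-step lags from the five observations shown in the table; `make_lags` is a hypothetical helper written for this illustration.

```python
import pandas as pd

# The five observations shown in the table above.
sales = pd.Series(
    [139, 128, 172, 139, 191],
    index=pd.period_range("2000-04-01", periods=5, freq="D"),
    name="Hardcover",
)


def make_lags(series, lags):
    """Hypothetical helper: return one shifted column per requested lag."""
    return pd.concat({f"Lag_{k}": series.shift(k) for k in lags}, axis=1)


lagged = make_lags(sales, lags=[1, 2])
print(lagged)
```

Each additional lag step leaves one more missing value at the start of the series, so a 2-step lag has two leading NaN entries.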

Linear regression with a lag feature produces the model:

target = weight * lag + bias

So lag features let us fit curves to lag plots, where each observation in a series is plotted against the previous observation.

fig, ax = plt.subplots()
ax = sns.regplot(x='Lag_1', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_aspect('equal')
ax.set_title('Lag Plot of Hardcover Sales');

You can see from the lag plot that sales on one day (Hardcover) are correlated with sales from the previous day (Lag_1). When you see a relationship like this, you know a lag feature will be useful.

More generally, lag features let you model serial dependence. A time series has serial dependence when an observation can be predicted from previous observations. In Hardcover Sales, we can predict that high sales on one day usually mean high sales the next day.
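One simple way to quantify serial dependence is the lag-1 autocorrelation, that is, the correlation between a series and its own 1-step lag. The sketch below compares two synthetic series, both assumptions made up for illustration: a random walk, which depends strongly on its previous value, and white noise, which does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (assumption): white noise has no serial dependence,
# while its running sum (a random walk) depends strongly on the previous value.
noise = pd.Series(rng.normal(size=200))
random_walk = noise.cumsum()

# Series.autocorr(lag=1) is the correlation between the series and its 1-step lag.
walk_autocorr = random_walk.autocorr(lag=1)
noise_autocorr = noise.autocorr(lag=1)

print(f"random walk: {walk_autocorr:.2f}, white noise: {noise_autocorr:.2f}")
```

A value close to 1 suggests that a lag feature would carry useful information, while a value near 0 suggests it would not.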

Adapting machine learning algorithms to time series problems is largely about feature engineering with the time index and lags. For most of the course, we use linear regression for its simplicity, but these features will be useful whichever algorithm you choose for your forecasting task.

Three. Example - Tunnel Traffic

Tunnel Traffic is a time series describing the number of vehicles traveling through the Baregg Tunnel in Switzerland each day from November 2003 to November 2005. In this example, we will get some practice applying linear regression to time-step features and lag features.

The hidden cell sets everything up.

from pathlib import Path
from warnings import simplefilter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

simplefilter("ignore")  # ignore warnings to clean up output cells

# Set Matplotlib defaults
# Note: on Matplotlib >= 3.6 this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 4))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'


# Load Tunnel Traffic dataset
data_dir = Path("../input/ts-course-data")
tunnel = pd.read_csv(data_dir / "tunnel.csv", parse_dates=["Day"])

# Create a time series in Pandas by setting the index to a date
# column. We parsed "Day" as a date type by using `parse_dates` when
# loading the data.
tunnel = tunnel.set_index("Day")

# By default, Pandas creates a `DatetimeIndex` with dtype `Timestamp`
# (equivalent to `np.datetime64`), representing a time series as a
# sequence of measurements taken at single moments. A `PeriodIndex`,
# on the other hand, represents a time series as a sequence of
# quantities accumulated over periods of time. Periods are often
# easier to work with, so that's what we'll use in this course.
tunnel = tunnel.to_period()

tunnel.head()

Day	NumVehicles
2003-11-01	103536
2003-11-02	92051
2003-11-03	100795
2003-11-04	102352
2003-11-05	106569

1. Time-step features

Provided the time series has no missing dates, we can create a time dummy by counting out the length of the series.

df = tunnel.copy()
df['Time'] = np.arange(len(tunnel.index))
df.head()

Day	NumVehicles	Time
2003-11-01	103536	0
2003-11-02	92051	1
2003-11-03	100795	2
2003-11-04	102352	3
2003-11-05	106569	4
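The "no missing dates" condition can be checked before building the time dummy. The sketch below uses a hypothetical daily index with one date deliberately left out, and a plain `DatetimeIndex` rather than the `PeriodIndex` used for `tunnel`; both are assumptions made to keep the example small.

```python
import pandas as pd

# Hypothetical daily index with one date (2003-11-03) deliberately left out.
idx = pd.to_datetime(["2003-11-01", "2003-11-02", "2003-11-04", "2003-11-05"])

# Every day between the first and last observation.
full_range = pd.date_range(idx.min(), idx.max(), freq="D")

# Dates that appear in the full range but not in the index.
missing = full_range.difference(idx)
has_gaps = len(missing) > 0
print(missing)

# If there are gaps, counting elapsed days keeps the dummy evenly spaced in time,
# whereas np.arange(len(idx)) would silently skip over the missing day.
time_from_dates = (idx - idx[0]).days
print(list(time_from_dates))
```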

The procedure for fitting a linear regression model follows the standard steps for scikit-learn.

from sklearn.linear_model import LinearRegression

# Training data
X = df.loc[:, ['Time']]  # features
y = df.loc[:, 'NumVehicles']  # target

# Train the model
model = LinearRegression()
model.fit(X, y)

# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)

The model actually created is (approximately): Vehicles = 22.5 * Time + 98176. Plotting the fitted values over time shows us how fitting linear regression to the time dummy creates the trend line defined by this equation.
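To read this equation, it can help to evaluate it at a few time steps. The sketch below hard-codes the approximate coefficients quoted above; with a fitted scikit-learn `LinearRegression`, the corresponding values are available as `model.coef_` and `model.intercept_`. The function name `trend` is hypothetical.

```python
# Approximate coefficients quoted in the text (rounded, not exact fitted values).
weight = 22.5
bias = 98176


def trend(time):
    """Hypothetical helper: evaluate the trend line at a given time step."""
    return weight * time + bias


start = trend(0)         # the intercept: the fitted value on the first day
after_year = trend(365)  # the fitted value 365 time steps later
growth = after_year - start

print(start, after_year, growth)
```

In other words, the trend line rises by about 22.5 vehicles per day, which adds up to roughly 8,200 vehicles per day over a year.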

ax = y.plot(**plot_params)
ax = y_pred.plot(ax=ax, linewidth=3)
ax.set_title('Time Plot of Tunnel Traffic');

2. Lag features

Pandas provides a simple way to lag a series: the shift method.

df['Lag_1'] = df['NumVehicles'].shift(1)
df.head()

Day	NumVehicles	Time	Lag_1
2003-11-01	103536	0	NaN
2003-11-02	92051	1	103536.0
2003-11-03	100795	2	92051.0
2003-11-04	102352	3	100795.0
2003-11-05	106569	4	102352.0

When creating lag features, we need to decide what to do with the missing values they produce. Filling them in is one option, perhaps with 0.0 or by "backfilling" with the first known value. Instead, we will simply drop the missing values, making sure to also drop the values in the target from the corresponding dates.

from sklearn.linear_model import LinearRegression

X = df.loc[:, ['Lag_1']].dropna()  # drop missing values in the feature set
y = df.loc[:, 'NumVehicles']  # create the target
y, X = y.align(X, join='inner')  # drop corresponding values in target

model = LinearRegression()
model.fit(X, y)

y_pred = pd.Series(model.predict(X), index=X.index)
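For comparison, the three ways of handling the missing first value can be sketched side by side. The example uses only the first five NumVehicles values from the table above, and the variable names are hypothetical.

```python
import pandas as pd

# First five values from the table above.
num_vehicles = pd.Series(
    [103536, 92051, 100795, 102352, 106569],
    index=pd.period_range("2003-11-01", periods=5, freq="D"),
    name="NumVehicles",
)
lag_1 = num_vehicles.shift(1)  # the first entry becomes NaN

filled_zero = lag_1.fillna(0.0)  # option 1: fill with 0.0
backfilled = lag_1.bfill()       # option 2: backfill with the first known value
dropped = lag_1.dropna()         # option 3: drop the row (the approach used here)

# When dropping, the target has to lose the same dates as the feature.
target = num_vehicles.loc[dropped.index]
print(len(lag_1), len(dropped), len(target))
```

Filling with 0.0 introduces a value far outside the range of the data, while backfilling makes the first lag equal to that day's own value; dropping avoids both at the cost of one observation.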

The lag plot shows us how well we were able to fit the relationship between the number of vehicles on one day and the number on the previous day.

fig, ax = plt.subplots()
ax.plot(X['Lag_1'], y, '.', color='0.25')
ax.plot(X['Lag_1'], y_pred)
ax.set_aspect('equal')
ax.set_ylabel('NumVehicles')
ax.set_xlabel('Lag_1')
ax.set_title('Lag Plot of Tunnel Traffic');

What does this prediction from a lag feature tell us about how well we can predict the series across time? The following time plot shows how our forecasts now respond to the behavior of the series in the recent past.

ax = y.plot(**plot_params)
ax = y_pred.plot(ax=ax)

The best time series models usually include some combination of time-step features and lag features. Over the next few lessons, we will learn how to engineer features that model the most common patterns in time series, using the features from this lesson as a starting point.
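As a rough sketch of what such a combination could look like, the example below fits one linear model to both a time dummy and a 1-step lag. The series is synthetic, generated from a known rule (an assumption made purely for illustration), and NumPy least squares again stands in for a regression library.

```python
import numpy as np
import pandas as pd

# Synthetic series (assumption): each value depends on both the time step
# and the previous value, with known parameters 0.5, 0.6, and 2.0.
n = 20
y = np.empty(n)
y[0] = 10.0
for t in range(1, n):
    y[t] = 2.0 + 0.5 * t + 0.6 * y[t - 1]

df_toy = pd.DataFrame({"y": y})
df_toy["Time"] = np.arange(n)            # time-step feature
df_toy["Lag_1"] = df_toy["y"].shift(1)   # lag feature
df_toy = df_toy.dropna()                 # drop the row with the missing lag

# Model: y = w_time * Time + w_lag * Lag_1 + bias
A = np.column_stack([df_toy["Time"], df_toy["Lag_1"], np.ones(len(df_toy))])
coef, *_ = np.linalg.lstsq(A, df_toy["y"].to_numpy(), rcond=None)
w_time, w_lag, bias = coef

print(w_time, w_lag, bias)
```

Because the synthetic series has no noise, the fit should recover the generating parameters up to floating-point error; real series would give only approximate values.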

Move on to the exercise, where you will begin forecasting Store Sales using the techniques you learned in this tutorial.

Copyright notice
This article was written by [Sit and watch the clouds rise]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/176/202206250549055051.html