
Machine learning notes: linear regression with time series

2022-06-25 07:52:00 Sit and watch the clouds rise

One 、 Time series prediction

Forecasting is perhaps the most common application of machine learning in the real world. Businesses forecast product demand, governments forecast economic and population growth, and meteorologists forecast the weather. Understanding what is to come is a pressing need across science, government, and industry (not to mention our personal lives!), and practitioners in these fields are increasingly applying machine learning to address it.

Time series forecasting is a broad field with a long history. This course focuses on applying modern machine learning methods to time series data, with the goal of producing the most accurate predictions. The lessons in this course were inspired by winning solutions from past Kaggle forecasting competitions, but they apply whenever accurate forecasts are a priority.

After completing this course, you will know how to:

Engineer features to model the major time series components (trend, seasons, and cycles),

Visualize time series with many kinds of time series plots,

Create forecasting hybrids that combine the strengths of complementary models, and

Adapt machine learning methods to a variety of forecasting tasks.

As part of the exercises, you will have a chance to participate in our Store Sales - Time Series Forecasting getting-started competition. In this competition, your task is to forecast sales for Corporación Favorita (a large Ecuadorian grocery retailer) in nearly 1800 product categories.

Store Sales - Time Series Forecasting | Kaggle: Use machine learning to predict grocery sales
https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data

The basic object of forecasting is the time series, a set of observations recorded over time. In forecasting applications, observations are typically recorded at a regular frequency, such as daily or monthly.

Date        Hardcover
2000-04-01  139
2000-04-02  128
2000-04-03  172
2000-04-04  139
2000-04-05  191

This series records the number of hardcover books sold at a retail store over 30 days. Note that we have a single column of observations, Hardcover, with a time index, Date.

Two 、 Linear regression with time series

In the first part of this course, we will use the linear regression algorithm to build forecasting models. Linear regression is widely used in practice and adapts naturally to even complex forecasting tasks.

The linear regression algorithm learns how to make a weighted sum from its input features. For two features, we would have:

target = weight_1 * feature_1 + weight_2 * feature_2 + bias

During training, the regression algorithm learns the values of the parameters weight_1, weight_2, and bias that best fit the target. (This algorithm is often called ordinary least squares, because it chooses the values that minimize the squared error between the target and the predictions.) The weights are also called regression coefficients, and the bias is also called the intercept, because it tells you where the graph of this function crosses the y-axis.
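
As a minimal sketch of what "ordinary least squares" means here, the following fits the two-feature model above with NumPy's least-squares solver. The data is synthetic, and the names `true_weights` and `true_bias` are made up for illustration; they do not come from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two features and a
# target built from known weights and a known bias, with no noise.
true_weights = np.array([2.0, -3.0])
true_bias = 5.0
features = rng.normal(size=(50, 2))
target = features @ true_weights + true_bias

# Append a column of ones so the bias is learned as one more weight.
design = np.column_stack([features, np.ones(len(features))])

# Least squares picks the parameters that minimize the sum of squared
# errors between `target` and `design @ params`.
params, *_ = np.linalg.lstsq(design, target, rcond=None)
weight_1, weight_2, bias = params
print(weight_1, weight_2, bias)
```

Because the synthetic target contains no noise, the recovered parameters should match the ones used to build it, up to floating-point error. With real data the fit is only the best compromise across all observations.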

1、 Time-step features

There are two kinds of features unique to time series: time-step features and lag features.

Time-step features are features we can derive directly from the time index. The most basic time-step feature is the time dummy, which counts off the time steps in the series from beginning to end.

import numpy as np

# Time dummy: 0, 1, 2, ... with one step per row of df
df['Time'] = np.arange(len(df.index))
df.head()
Date        Hardcover  Time
2000-04-01  139        0
2000-04-02  128        1
2000-04-03  172        2
2000-04-04  139        3
2000-04-05  191        4

Linear regression with the time dummy produces the model:

target = weight * time + bias

The time dummy then lets us fit curves to time series in a time plot, where Time forms the x-axis.

import matplotlib.pyplot as plt
import seaborn as sns
# Note: Matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc(
    "figure",
    autolayout=True,
    figsize=(11, 4),
    titlesize=18,
    titleweight='bold',
)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
%config InlineBackend.figure_format = 'retina'

fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=df, color='0.75')
ax = sns.regplot(x='Time', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Hardcover Sales');

Time-step features let you model time dependence. A series is time dependent if its values can be predicted from the time they occurred. In the Hardcover Sales series, we can predict that sales later in the month are generally higher than sales earlier in the month.

2、 Lag features

To make a lag feature, we shift the observations of the target series so that they appear to have occurred later in time. Here we have created a 1-step lag feature, though shifting by multiple steps is possible too.

df['Lag_1'] = df['Hardcover'].shift(1)
df = df.reindex(columns=['Hardcover', 'Lag_1'])
df.head()
Date        Hardcover  Lag_1
2000-04-01  139        NaN
2000-04-02  128        139.0
2000-04-03  172        128.0
2000-04-04  139        172.0
2000-04-05  191        139.0
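
To make the remark about multi-step shifts concrete, here is a small self-contained sketch that rebuilds the five rows shown above by hand and adds both a 1-step and a 2-step lag. The `Lag_2` column is an illustrative addition and is not part of the original lesson.

```python
import pandas as pd

# The first five hardcover observations from the table above, rebuilt
# by hand so the example runs without the original data file.
sales = pd.Series(
    [139, 128, 172, 139, 191],
    index=pd.date_range("2000-04-01", periods=5, freq="D", name="Date"),
    name="Hardcover",
)

# shift(k) moves each observation k rows later; the first k rows have
# no earlier value to draw on, so they become NaN.
lags = pd.DataFrame(
    {
        "Hardcover": sales,
        "Lag_1": sales.shift(1),
        "Lag_2": sales.shift(2),
    }
)
print(lags)
```

Each additional lag step costs one more row of missing values at the start of the series, which matters when the series is short.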

Linear regression with a lag feature produces the model:

target = weight * lag + bias

So lag features let us fit curves to lag plots, where each observation in a series is plotted against the previous observation.

fig, ax = plt.subplots()
ax = sns.regplot(x='Lag_1', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_aspect('equal')
ax.set_title('Lag Plot of Hardcover Sales');

You can see from the lag plot that sales on one day (Hardcover) are correlated with sales from the previous day (Lag_1). When you see a relationship like this, you know a lag feature will be useful.
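
The relationship visible in a lag plot can also be summarized as a single number. The sketch below is one way to do that, assuming a synthetic random-walk series rather than the book sales data: it computes the correlation between a series and its 1-step lag, which pandas also exposes as `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Synthetic series (an assumption for illustration): a random walk,
# where each value is the previous value plus a small random step.
rng = np.random.default_rng(0)
series = pd.Series(100 + rng.normal(size=200).cumsum(), name="y")

lag_1 = series.shift(1)

# Pearson correlation between the series and its 1-step lag. pandas
# skips the rows where either value is missing.
corr = series.corr(lag_1)

# Series.autocorr(lag=1) computes the same quantity directly.
print(corr, series.autocorr(lag=1))
```

A value near 1 suggests that a lag feature carries a lot of information about the next observation; a value near 0 suggests it carries little.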

More generally, lag features let you model serial dependence. A time series has serial dependence when an observation can be predicted from previous observations. In Hardcover Sales, we can predict that high sales on one day usually mean high sales the next day.

Adapting machine learning algorithms to time series problems is largely about feature engineering with the time index and lags. For most of the course, we use linear regression for its simplicity, but these features will be useful whichever algorithm you choose for your forecasting task.

Three 、 Example - Tunnel Traffic

Tunnel Traffic is a time series describing the number of vehicles traveling through the Baregg Tunnel in Switzerland each day from November 2003 to November 2005. In this example, we will get some practice applying linear regression to time-step features and lag features.

The hidden cell sets everything up.

from pathlib import Path
from warnings import simplefilter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

simplefilter("ignore")  # ignore warnings to clean up output cells

# Set Matplotlib defaults
# Note: Matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 4))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'


# Load Tunnel Traffic dataset
data_dir = Path("../input/ts-course-data")
tunnel = pd.read_csv(data_dir / "tunnel.csv", parse_dates=["Day"])

# Create a time series in Pandas by setting the index to a date
# column. We parsed "Day" as a date type by using `parse_dates` when
# loading the data.
tunnel = tunnel.set_index("Day")

# By default, Pandas creates a `DatetimeIndex` with dtype `Timestamp`
# (equivalent to `np.datetime64`), representing a time series as a
# sequence of measurements taken at single moments. A `PeriodIndex`,
# on the other hand, represents a time series as a sequence of
# quantities accumulated over periods of time. Periods are often
# easier to work with, so that's what we'll use in this course.
tunnel = tunnel.to_period()

tunnel.head()
Day         NumVehicles
2003-11-01  103536
2003-11-02  92051
2003-11-03  100795
2003-11-04  102352
2003-11-05  106569

1、 Time-step features

Provided the time series has no missing dates, we can create a time dummy by counting out the length of the series.

df = tunnel.copy()
df['Time'] = np.arange(len(tunnel.index))
df.head()
Day         NumVehicles  Time
2003-11-01  103536       0
2003-11-02  92051        1
2003-11-03  100795       2
2003-11-04  102352       3
2003-11-05  106569       4
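
The "no missing dates" condition can be checked before building the time dummy. The following is a hedged sketch using a small hypothetical `DatetimeIndex` with one date removed on purpose (the lesson itself works with a `PeriodIndex`, and the values here are invented). It also shows one possible alternative, counting elapsed days, which keeps a gap visible.

```python
import numpy as np
import pandas as pd

# Hypothetical daily index with one date (2003-11-03) missing.
index = pd.to_datetime(
    ["2003-11-01", "2003-11-02", "2003-11-04", "2003-11-05"]
)
counts = pd.Series([10, 12, 11, 13], index=index)

# Every date that a gap-free daily series would contain.
full_range = pd.date_range(index.min(), index.max(), freq="D")
missing = full_range.difference(index)
print(missing)

# Counting rows would label 2003-11-04 as step 2, hiding the gap.
naive_time = np.arange(len(counts))

# Counting days elapsed since the first date keeps the gap visible.
elapsed_time = (index - index[0]).days.to_numpy()
print(naive_time, elapsed_time)
```

If `missing` is empty, the two versions of the time dummy are identical and the simple row count used in the lesson is safe.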

The procedure for fitting a linear regression model follows the standard steps for scikit-learn.

from sklearn.linear_model import LinearRegression

# Training data
X = df.loc[:, ['Time']]  # features
y = df.loc[:, 'NumVehicles']  # target

# Train the model
model = LinearRegression()
model.fit(X, y)

# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)

The model actually created is (approximately): Vehicles = 22.5 * Time + 98176. Plotting the fitted values over time shows how fitting linear regression to the time dummy creates the trend line defined by this equation.
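
An equation like this can be read off a fitted scikit-learn model through its `coef_` and `intercept_` attributes. The sketch below assumes a synthetic, noiseless stand-in for the tunnel data, built from the slope and intercept quoted above, so that the example runs without the original file; `df_demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the tunnel data (an assumption): a noiseless
# straight line with a known slope and intercept.
time = np.arange(100)
df_demo = pd.DataFrame({"Time": time, "NumVehicles": 22.5 * time + 98176.0})

model = LinearRegression()
model.fit(df_demo[["Time"]], df_demo["NumVehicles"])

# coef_ holds one weight per feature; intercept_ is the bias.
weight = model.coef_[0]
bias = model.intercept_
print(f"NumVehicles = {weight:.1f} * Time + {bias:.0f}")
```

On the real data the same two attributes give the approximate values stated in the text; here they simply recover the numbers used to build the synthetic line.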

ax = y.plot(**plot_params)
ax = y_pred.plot(ax=ax, linewidth=3)
ax.set_title('Time Plot of Tunnel Traffic');

2、 Lag features

Pandas provides a simple method for lagging a series: the shift method.

df['Lag_1'] = df['NumVehicles'].shift(1)
df.head()
Day         NumVehicles  Time  Lag_1
2003-11-01  103536       0     NaN
2003-11-02  92051        1     103536.0
2003-11-03  100795       2     92051.0
2003-11-04  102352       3     100795.0
2003-11-05  106569       4     102352.0

When creating lag features, we need to decide what to do with the missing values they produce. Filling them in is one option, perhaps with 0.0 or by "backfilling" with the first known value. Instead, we will simply drop the missing values, making sure to also drop the values in the target from the corresponding dates.

from sklearn.linear_model import LinearRegression

X = df.loc[:, ['Lag_1']]
X = X.dropna()  # drop missing values in the feature set
y = df.loc[:, 'NumVehicles']  # create the target
y, X = y.align(X, join='inner')  # drop corresponding values in target

model = LinearRegression()
model.fit(X, y)

y_pred = pd.Series(model.predict(X), index=X.index)
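
The options for handling the missing lag value described above can be compared side by side. This is a small sketch that rebuilds the first five rows of the tunnel table by hand; the variable names are illustrative, and it aligns two Series rather than a Series and a DataFrame to keep the example short.

```python
import pandas as pd

# First five rows of the tunnel table above, rebuilt by hand.
y = pd.Series(
    [103536, 92051, 100795, 102352, 106569],
    index=pd.period_range("2003-11-01", periods=5, freq="D"),
    name="NumVehicles",
)
lag_1 = y.shift(1).rename("Lag_1")

# Option 1: fill the gap with a constant.
filled_zero = lag_1.fillna(0.0)

# Option 2: backfill with the first known value.
backfilled = lag_1.bfill()

# Option 3 (the one used in this lesson): drop the row, then align the
# target so it covers exactly the same dates as the feature.
y_aligned, lag_aligned = y.align(lag_1.dropna(), join="inner")
print(len(y), len(y_aligned), len(lag_aligned))
```

Filling keeps every row but invents a value the model will treat as real, while dropping loses one observation per lag step and leaves only genuine data.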

The lag plot shows how well we were able to fit the relationship between the number of vehicles on one day and the number on the previous day.

fig, ax = plt.subplots()
ax.plot(X['Lag_1'], y, '.', color='0.25')
ax.plot(X['Lag_1'], y_pred)
ax.set_aspect('equal')
ax.set_ylabel('NumVehicles')
ax.set_xlabel('Lag_1')
ax.set_title('Lag Plot of Tunnel Traffic');

What does this prediction from a lag feature tell us about how well we can predict the series across time? The following time plot shows how our forecasts now respond to the behavior of the series in the recent past.

ax = y.plot(**plot_params)
ax = y_pred.plot(ax=ax)  # draw the predictions on the same axes

The best time series models usually include some combination of time-step features and lag features. Over the next few lessons, we will learn how to engineer features that model the most common patterns in time series, using the features from this lesson as a starting point.

Move on to the exercise, where you will begin forecasting store sales using the techniques you learned in this tutorial.
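
As a rough illustration of combining the two kinds of features, the sketch below fits one linear regression on both a time dummy and a 1-step lag. The series is synthetic (a trend plus noise that carries over from day to day), and all names are hypothetical; it is a sketch of the idea, not the course's own model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series (an assumption for illustration): an upward trend
# plus noise that carries over from one day to the next.
rng = np.random.default_rng(0)
n = 200
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal()
df_demo = pd.DataFrame({"y": 0.5 * np.arange(n) + noise})

# One time-step feature and one lag feature.
df_demo["Time"] = np.arange(n)
df_demo["Lag_1"] = df_demo["y"].shift(1)

# Drop the first row, whose lag is missing, from features and target.
df_demo = df_demo.dropna()
X = df_demo[["Time", "Lag_1"]]
y = df_demo["y"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)), model.intercept_)
```

The model learns one weight per feature, so the time dummy can account for the trend while the lag accounts for short-term carry-over.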

Original site

Copyright notice
This article was written by [Sit and watch the clouds rise]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/176/202206250549055051.html