A Minimalist Tutorial for Scikit-learn, the Machine Learning Powerhouse
Author: Peter    Editor: Peter

Hello everyone, I'm Peter~
Scikit-learn is a very well-known Python machine learning library, widely used in data science fields such as statistical analysis and machine learning modeling.
- Unbeatable modeling: with scikit-learn you can build all kinds of supervised and unsupervised learning models
- Versatile functionality: sklearn also handles data preprocessing, feature engineering, dataset splitting, model evaluation, and more
- Rich built-in data: datasets such as Titanic and Iris ship with the library, so there is no need to worry about finding data
This article is a concise introduction to using scikit-learn; please refer to the official website for full details:

- Using the built-in datasets
- Splitting datasets
- Data normalization and standardization
- Label encoding
- The six parts of modeling
<!--MORE-->
The scikit-learn algorithm cheat sheet

The official website provides the map below. Starting from the sample size, it summarizes how to use scikit-learn across four areas: classification, regression, clustering, and dimensionality reduction:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Installation

To install scikit-learn, Anaconda is recommended, as it avoids configuration and environment issues. Of course, you can also install directly with pip:
```bash
pip install scikit-learn
```
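A quick way to confirm the installation succeeded (the printed version will vary by environment):

```python
# Check that scikit-learn is importable and see which version is installed
import sklearn
print(sklearn.__version__)
```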
Built-in datasets

sklearn ships with some excellent built-in datasets, such as the Iris data, the Boston house price data, and the Titanic data.
```python
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets  # the built-in datasets
```
Classification data: the Iris dataset

```python
# Iris data
iris = datasets.load_iris()
type(iris)
```

sklearn.utils.Bunch
What does the iris data look like? Each built-in dataset carries a lot of information: the feature matrix, the targets, the feature names, a description, and more.

The data above can be turned into a DataFrame we are comfortable viewing, and the dependent variable can be added as a column, as sketched below:
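A minimal sketch, assuming the iris object loaded above; the column name `target` is a label of my choosing:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build a DataFrame from the feature matrix, naming columns after the features
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Append the dependent variable (the class label) as an extra column
df["target"] = iris.target

print(df.head())
```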
Regression data: Boston house prices

The attributes we focus on:

- data
- target, target_names
- feature_names
- filename
This data can also be turned into a DataFrame, as sketched below:
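A similar sketch for the Boston data; note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this only runs on older versions, and the `PRICE` column name is an illustrative choice:

```python
import pandas as pd
from sklearn import datasets

# load_boston works on scikit-learn < 1.2 only
boston = datasets.load_boston()

# Features as columns, with the target appended as a price column
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["PRICE"] = boston.target

print(df.head())
```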
There are three ways to load the data

Way 1
```python
# Import the loader
from sklearn.datasets import load_iris

# Load the data, then pull out features and labels
data = load_iris()
data_X = data.data
data_y = data.target
```
Way 2
```python
from sklearn import datasets

# Load the dataset object with all its attributes
loaded_data = datasets.load_iris()

# Feature matrix
data_X = loaded_data.data
# Labels
data_y = loaded_data.target
```
Way 3
```python
# Return the features and labels directly
data_X, data_y = load_iris(return_X_y=True)
```
Summary of dataset usage
```python
from sklearn import datasets  # import the library

boston = datasets.load_boston()  # load the Boston house price data (scikit-learn < 1.2 only)

print(boston.keys())  # view the keys (attributes)
# ['data', 'target', 'feature_names', 'DESCR', 'filename']

print(boston.data.shape, boston.target.shape)  # shapes of the data and the target
print(boston.feature_names)  # which features are included
print(boston.DESCR)          # description of the dataset
print(boston.filename)       # file path
```
Dataset splitting

```python
# Import the splitter
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data_X,
    data_y,
    test_size=0.2,
    random_state=111
)

len(X_train)  # 150 * 0.8 = 120
```
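For classification data like iris, it can also help to keep the class proportions identical in both splits; a hedged variant using the stratify parameter:

```python
# Stratified split: each class appears in the same ratio in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data_X,
    data_y,
    test_size=0.2,
    random_state=111,
    stratify=data_y
)
```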
Data standardization and normalization

```python
from sklearn.preprocessing import StandardScaler  # standardization
from sklearn.preprocessing import MinMaxScaler    # normalization

# Standardization: zero mean, unit variance
ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)  # pass in the data to be standardized

# Normalization: scale each feature to [0, 1]
mm = MinMaxScaler()
X_scaled = mm.fit_transform(X_train)
```
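One caveat worth making explicit: the scaler should be fitted on the training data only, and the test set should then be transformed with those same learned parameters rather than re-fitted. A minimal sketch:

```python
# Learn mean and standard deviation from the training set only
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)

# Reuse the same fitted parameters on the test set (no re-fitting)
X_test_scaled = ss.transform(X_test)
```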
Label encoding

From the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

LabelEncoder can encode both numbers and strings, as the sketch below shows.
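A minimal sketch of both cases, following the LabelEncoder documentation linked above:

```python
from sklearn.preprocessing import LabelEncoder

# Encoding numbers: classes are sorted, then mapped to 0..n_classes-1
le = LabelEncoder()
le.fit([1, 2, 2, 6])
print(le.classes_)                 # [1 2 6]
print(le.transform([1, 1, 2, 6]))  # [0 0 1 2]

# Encoding strings: same idea, with classes sorted alphabetically
le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(list(le.classes_))                          # ['amsterdam', 'paris', 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))  # [2 2 1]
print(le.inverse_transform([2, 2, 1]))            # ['tokyo' 'tokyo' 'paris']
```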
A modeling example

Import the modules
```python
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis  # models
from sklearn.datasets import load_iris                 # data loading
from sklearn.model_selection import train_test_split   # data splitting
from sklearn.model_selection import GridSearchCV       # grid search
from sklearn.pipeline import Pipeline                  # pipeline operations
from sklearn.metrics import accuracy_score             # score verification
```
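Pipeline and NeighborhoodComponentsAnalysis are imported here but not exercised below; a minimal sketch of how they could chain together, loosely following the NCA example in the scikit-learn docs (the step names and n_neighbors value are my own choices):

```python
# Learn a distance metric with NCA, then classify with KNN, as one estimator
nca = NeighborhoodComponentsAnalysis(random_state=42)
nca_pipe = Pipeline([
    ("nca", nca),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
nca_pipe.fit(X_train, y_train)
print(nca_pipe.score(X_test, y_test))
```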
Model instantiation
```python
# Instantiate the model with k = 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
```
Train the model

```python
knn.fit(X_train, y_train)
```

KNeighborsClassifier()
Test set prediction

```python
y_pred = knn.predict(X_test)
y_pred  # predictions from the fitted model
```

array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 1, 2,
       1, 1, 2, 0, 0, 2, 0, 2])

Score verification
Two ways to verify the model's score:

```python
knn.score(X_test, y_test)
```

0.9333333333333333

```python
accuracy_score(y_test, y_pred)
```

0.9333333333333333
Grid search

How to search over parameters:

```python
from sklearn.model_selection import GridSearchCV

# The parameter grid to search over
knn_paras = {"n_neighbors": [1, 3, 5, 7]}

# A base model with default settings
knn_grid = KNeighborsClassifier()

# Instantiate the grid-search object
grid_search = GridSearchCV(
    knn_grid,
    knn_paras,
    cv=10  # 10-fold cross-validation
)

grid_search.fit(X_train, y_train)
```

GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 5, 7]})

```python
# The best estimator found by the search
grid_search.best_estimator_
```

KNeighborsClassifier(n_neighbors=7)

```python
grid_search.best_params_
```

{'n_neighbors': 7}

```python
grid_search.best_score_  # mean cross-validated score of the best estimator
```

0.975
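Beyond the best estimator, the full cross-validation table is available as well; a brief sketch using pandas (the column selection is my own presentation choice):

```python
import pandas as pd

# Mean test score and rank for every candidate parameter value
results = pd.DataFrame(grid_search.cv_results_)
print(results[["param_n_neighbors", "mean_test_score", "rank_test_score"]])
```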
Modeling based on the search results

```python
knn1 = KNeighborsClassifier(n_neighbors=7)
knn1.fit(X_train, y_train)
```

KNeighborsClassifier(n_neighbors=7)

As the results below show, the model built with the grid-searched parameters performs better than the one without grid search:

```python
y_pred_1 = knn1.predict(X_test)
knn1.score(X_test, y_test)
```

1.0

```python
accuracy_score(y_test, y_pred_1)
```

1.0