Machine learning powerhouse scikit-learn: a minimalist tutorial
2022-06-23 06:36:00
Author: Peter | Editor: Peter
Hello everyone, I'm Peter~
Scikit-learn is a very well-known Python machine learning library, widely used in statistical analysis, machine learning modeling, and other data science work.
- Invincible modeling: with scikit-learn, users can build all kinds of supervised and unsupervised learning models
- Rich functionality: sklearn also covers data preprocessing, feature engineering, dataset splitting, model evaluation, and more
- Abundant data: it ships with classic built-in datasets such as the Titanic and Iris data, so finding data is never a worry
This article gives a concise introduction to using scikit-learn; please refer to the official website for more details:
- Using the built-in datasets
- Splitting a dataset
- Data normalization and standardization
- Label encoding
- The six steps of modeling
<!--MORE-->
The scikit-learn algorithm cheat sheet
The official website provides the map below. Starting from the size of the sample, it summarizes how to use scikit-learn across four areas: regression, classification, clustering, and dimensionality reduction:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Installation
To install scikit-learn, Anaconda is recommended so that you don't have to worry about configuration and environment issues. Of course, you can also install it directly with pip:
```bash
pip install scikit-learn
```
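Once installed, a quick sanity check (my addition, not from the original post) is to print the version from Python:

```python
import sklearn

# Any reasonably recent version is fine for this tutorial
print(sklearn.__version__)
```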
Dataset generation
sklearn has some excellent built-in datasets, such as the Iris data, the Boston house-price data, and the Titanic data.
```python
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets  # import the built-in datasets
```
Classification data: the Iris dataset
```python
# Iris data
iris = datasets.load_iris()
type(iris)
# sklearn.utils.Bunch
```
What does the iris object look like? Every built-in dataset carries a lot of information.
The data above can be turned into a more readable DataFrame, and we can also append the dependent variable, as sketched below:
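A minimal sketch of that step (my reconstruction; the original code block did not survive extraction), using the standard Bunch attributes data, feature_names, and target:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix with readable column names
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Append the dependent variable (the class label) as an extra column
iris_df["target"] = iris.target
iris_df.head()
```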
Regression data: Boston house prices
The attributes we focus on:
- data
- target, target_names
- feature_names
- filename
It can also be turned into a DataFrame, as sketched below:
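Again a sketch rather than the original code. Note that load_boston was deprecated and removed in scikit-learn 1.2, so this only runs on older releases:

```python
import pandas as pd
from sklearn import datasets

boston = datasets.load_boston()  # requires scikit-learn < 1.2

boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df["MEDV"] = boston.target  # MEDV: median house value, the regression target
boston_df.head()
```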
There are three ways to load the data
Method 1
```python
# Import the loader
from sklearn.datasets import load_iris

# Load the data and labels
data = load_iris()
data_X = data.data
data_y = data.target
```
Method 2
```python
from sklearn import datasets

loaded_data = datasets.load_iris()  # load the dataset object
data_X = loaded_data.data    # the sample data
data_y = loaded_data.target  # the labels
```
Method 3
```python
# Return the features and labels directly
data_X, data_y = load_iris(return_X_y=True)
```
Dataset usage summary
```python
from sklearn import datasets  # import the library

boston = datasets.load_boston()  # load the Boston house-price data
print(boston.keys())  # view the keys (attributes):
# ['data', 'target', 'feature_names', 'DESCR', 'filename']
print(boston.data.shape, boston.target.shape)  # shapes of the data and the target
print(boston.feature_names)  # the feature names
print(boston.DESCR)          # the dataset description
print(boston.filename)       # the file path
```
Dataset splitting
```python
# Import the splitter
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data_X,
    data_y,
    test_size=0.2,
    random_state=111
)

len(X_train)  # 150 * 0.8 = 120
```
Data standardization and normalization
```python
from sklearn.preprocessing import StandardScaler  # standardization
from sklearn.preprocessing import MinMaxScaler    # normalization

# Standardization
ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)  # pass in the data to be standardized

# Normalization
mm = MinMaxScaler()
X_scaled = mm.fit_transform(X_train)
```
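One detail worth adding here (my note, not in the original post): fit the scaler on the training set only, then reuse its statistics on the test set with transform, so no test-set information leaks into preprocessing:

```python
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = ss.transform(X_test)        # apply the training statistics to the test set
```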
Label encoding
From the official website: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Encoding numbers
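The original example was lost in extraction; here is a minimal sketch following the LabelEncoder documentation linked above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([1, 2, 2, 6])                 # learn the unique classes
le.classes_                          # array([1, 2, 6])
le.transform([1, 1, 2, 6])           # array([0, 0, 1, 2])
le.inverse_transform([0, 0, 1, 2])   # back to array([1, 1, 2, 6])
```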
Encoding strings
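Likewise, a sketch of encoding string labels, again following the linked documentation:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
list(le.classes_)                           # ['amsterdam', 'paris', 'tokyo']
le.transform(["tokyo", "tokyo", "paris"])   # array([2, 2, 1])
list(le.inverse_transform([2, 2, 1]))       # ['tokyo', 'tokyo', 'paris']
```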
Modeling case
Importing the modules
```python
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis  # models
from sklearn.datasets import load_iris                # data
from sklearn.model_selection import train_test_split  # dataset splitting
from sklearn.model_selection import GridSearchCV      # grid search
from sklearn.pipeline import Pipeline                 # pipeline operations
from sklearn.metrics import accuracy_score            # accuracy scoring
```
Model instantiation
```python
# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=5)
```
Training the model
```python
knn.fit(X_train, y_train)
# KNeighborsClassifier()
```
Test set prediction
```python
y_pred = knn.predict(X_test)  # predictions from the model
y_pred
# array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 1, 2,
#        1, 1, 2, 0, 0, 2, 0, 2])
```
Score verification
Two ways to verify the model's score:
```python
knn.score(X_test, y_test)
# 0.9333333333333333

accuracy_score(y_pred, y_test)
# 0.9333333333333333
```
Grid search
How to search over the parameters:
```python
from sklearn.model_selection import GridSearchCV

# Parameters to search over
knn_paras = {"n_neighbors": [1, 3, 5, 7]}
# Default model
knn_grid = KNeighborsClassifier()

# Instantiate the grid-search object
grid_search = GridSearchCV(
    knn_grid,
    knn_paras,
    cv=10  # 10-fold cross-validation
)
grid_search.fit(X_train, y_train)
# GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
#              param_grid={'n_neighbors': [1, 3, 5, 7]})

# The best estimator found by the search
grid_search.best_estimator_
# KNeighborsClassifier(n_neighbors=7)

grid_search.best_params_
# {'n_neighbors': 7}

grid_search.best_score_
# 0.975
```
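A small addition not in the original post: since GridSearchCV refits the best estimator on the full training set by default (refit=True), it can be used directly instead of re-instantiating the model by hand:

```python
best_knn = grid_search.best_estimator_  # already refit on the whole training set
best_knn.score(X_test, y_test)
```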
Modeling based on search results
```python
knn1 = KNeighborsClassifier(n_neighbors=7)
knn1.fit(X_train, y_train)
# KNeighborsClassifier(n_neighbors=7)
```
As the results below show, the model tuned by grid search performs better than the untuned one:
```python
y_pred_1 = knn1.predict(X_test)
knn1.score(X_test, y_test)
# 1.0

accuracy_score(y_pred_1, y_test)
# 1.0
```
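The imports above also bring in Pipeline, which the post never demonstrates. As a closing sketch (my addition, reusing the same iris split), standardization and KNN can be chained into a single estimator and tuned together:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Chain preprocessing and the model into one estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step>__<param>
pipe_paras = {"knn__n_neighbors": [1, 3, 5, 7]}

pipe_search = GridSearchCV(pipe, pipe_paras, cv=10)
pipe_search.fit(X_train, y_train)
pipe_search.score(X_test, y_test)
```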