Machine learning practice notes
Chapter One — End-to-end machine learning projects
Common operations in data preprocessing (a small pandas sketch follows the list):
– Set axis labels and ticks on a subplot —> .set_xlabel, .set_xticks
– Map values in a column —> data.loc[data[col_name] == original_value, col_name] = mapped_value
– Get the list of column names —> col_names = data.columns.tolist()
– Remove irrelevant columns —> to_drop = ["", ""]; data.drop(to_drop, axis=1)
– Two equally important columns differ greatly in scale —> standardize them
– Select rows that meet a condition —> first collect the qualifying indexes in a list, then data.loc[list, :]
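A minimal pandas sketch of these operations, using hypothetical column names (col_name, col_a, col_b, ...):
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("data.csv")                                        # hypothetical file
col_names = data.columns.tolist()                                     # list of column names
data.loc[data["col_name"] == "old_value", "col_name"] = "new_value"   # map values in a column
data = data.drop(["col_a", "col_b"], axis=1)                          # remove irrelevant columns
data[["col_c", "col_d"]] = StandardScaler().fit_transform(data[["col_c", "col_d"]])  # standardize two columns on different scales
keep = [i for i in data.index if data.loc[i, "col_c"] > 0]            # collect qualifying row indexes
subset = data.loc[keep, :]                                            # then select those rows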
Download the data
Write a function to fetch the data:
import os
import tarfile
import urllib.request

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
Write a function to load the data:
import os
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
Look at the big picture
Quick look at the data structure
info() gives a quick description of the dataset: the total number of rows, each attribute's type, and the number of non-null values
value_counts() shows which categories exist and how many instances belong to each category
describe() shows a summary of the numerical attributes
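For example, assuming housing is the DataFrame loaded above and ocean_proximity is one of its columns:
housing.info()                               # row count, dtypes, number of non-null values per column
housing["ocean_proximity"].value_counts()    # number of instances in each category
housing.describe()                           # count/mean/std/min/quartiles/max of the numerical attributes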
Calling the hist() method on the whole dataset plots a histogram for each attribute
example:
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
Create a test set
(Pure random sampling)
from sklearn.model_selection import train_test_split
predictors = ["col1", "col2"]
x_train, x_test, y_train, y_test = train_test_split(data[predictors], data["target"], test_size=0.4, random_state=0)
(Stratified sampling)
The "col" column is the attribute used for stratification (a sketch of creating such a column from a continuous attribute follows below)
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["col"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
To drop the same attribute from both the training set and the test set:
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["col"], axis=1, inplace=True)
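The stratification column is usually created by binning a continuous attribute; a minimal sketch, assuming a hypothetical median_income column and hand-picked bin edges:
import numpy as np
import pandas as pd

data["col"] = pd.cut(data["median_income"],
                     bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                     labels=[1, 2, 3, 4, 5])   # discrete strata used by split.split() above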
Data exploration and visualization
① Geographic data visualization:
import matplotlib.pyplot as plt
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=xxx,
             s=housing["population"]/100, label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
# such a scatter plot also shows the correlation between the x and y attributes
plt.show()
② Looking for correlations:
(Correlation of the "col" column with every other column)
corr_matrix = data.corr()
corr_matrix["col"].sort_values(ascending=False)
Pearson correlation coefficient: the closer it is to 1, the stronger the positive correlation; the closer it is to -1, the stronger the negative correlation
Plot the correlation of each numerical attribute against every other numerical attribute:
from pandas.plotting import scatter_matrix
attributes = ["col1", "col2", "col3", "col4"]
scatter_matrix(housing[attributes], figsize=(12,8))
Experiment with combinations of attributes
example: housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
③ Data preparation:
data = train_set.drop("col", axis=1)
label = train_set["col"].copy()
drop() works on a copy of the data and does not affect train_set itself
Data cleaning:
Some values of the "col" attribute are missing; possible solutions:
- Drop the corresponding rows: data.dropna(subset=["col"])
- Drop the whole attribute: data.drop("col", axis=1)
- Fill the missing values with some value: data["col"].fillna(median)
Handling missing values:
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")
Use the fit() method to fit the imputer instance to the training data:
imputer.fit(housing_num)
At this point the imputer has simply computed the median of each attribute and stored the results in its statistics_ instance variable
X = imputer.transform(housing_num)
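Note: in newer versions of scikit-learn the Imputer class has been removed; the equivalent is SimpleImputer from sklearn.impute. A minimal sketch, assuming housing_num holds only the numerical columns:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")   # same behaviour: fill with each column's median
imputer.fit(housing_num)                     # medians are stored in imputer.statistics_
X = imputer.transform(housing_num)           # NumPy array with the missing values filled in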
Handle text and classification properties
scikit-learn provides a transformer for this task, LabelEncoder:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
The resulting problem: machine learning algorithms will assume that two nearby values are more similar than two distant ones
A common solution is to create one binary attribute per category, i.e. one-hot encoding
Scikit-learn provides a OneHotEncoder that converts integer categorical values into one-hot vectors
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
The output here is a Scipy sparse matrix
The LabelBinarizer class can perform both transformations at once (from text categories to integer categories, then from integer categories to one-hot vectors)
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
Passing sparse_output=True to the LabelBinarizer constructor returns a sparse matrix instead of a dense NumPy array
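In scikit-learn 0.20 and later, OneHotEncoder can encode string categories directly, so the LabelEncoder step is no longer needed; a minimal sketch:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])  # needs a 2D input
housing_cat_1hot.toarray()   # convert the SciPy sparse matrix to a dense array if needed
cat_encoder.categories_      # the category values learned during fit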
Computing the RMSE
import numpy as np
from sklearn.metrics import mean_squared_error
# lin_reg is a regression model already fitted on (housing_prepared, housing_labels)
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
Use cross-validation for a better evaluation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg,housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=10)
rmse_scores = np.sqrt(-scores)
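A small helper sketch to summarize the cross-validation results:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(rmse_scores)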
Random forests
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared,housing_labels)
Fine-tune the model
Grid search
You can use scikit-learn's GridSearchCV to explore hyperparameter combinations
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators':[3,10,30],'max_features':[2,4,6,8]},
{'bootstrap':[False],'n_estimators':[3,10],'max_features':[2,3,4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared,housing_labels)
param_grid tells scikit-learn to first evaluate all 3×4=12 combinations of the n_estimators and max_features hyperparameter values in the first dict,
then try all 2×3=6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (its default)
Best parameter combination: grid_search.best_params_
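The evaluation score of every combination that was tried is also available; a small sketch using the cv_results_ attribute:
import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)   # RMSE of each hyperparameter combination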
Random search
When the hyperparameter search space is large, RandomizedSearchCV is usually preferred.
If you run a random search for 1000 iterations, it will explore 1000 different values for each hyperparameter.
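A minimal sketch, assuming the same random forest regressor and using scipy.stats.randint to define sampling distributions for the hyperparameters:
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(housing_prepared, housing_labels)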
Analyze the best models and their errors
RandomForestRegressor can indicate the relative importance of each attribute:
feature_importances = grid_search.best_estimator_.feature_importances_
Display these importance scores next to the corresponding attribute name :
sorted( zip(feature_importances,attributes),reverse=True )
Evaluate the system on the test set
final_model = grid_search.best_estimator_
x_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
x_test_prepared = full_pipeline.transform(x_test)
final_predictions = final_model.predict(x_test_prepared)
final_mse = mean_squared_error(y_test,final_predictions)
final_rmse = np.sqrt(final_mse)
Chapter Two — Classification
The MNIST dataset: each image is labeled with the digit it represents.
x, y = mnist["data"], mnist["target"]
x.shape
y.shape
Each image has 784 features, because each image is 28×28 pixels and each feature represents one pixel's intensity, from 0 (white) to 255 (black)
Display an image
import matplotlib
import matplotlib.pyplot as plt
some_digit = x[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
Split off the training set and shuffle it
x_train, x_test, y_train, y_test = x[:60000], x[60000:], y[:60000], y[60000:]
import numpy as np
shuffle_index = np.random.permutation(60000)
x_train, y_train = x_train[shuffle_index], y_train[shuffle_index]
Train a binary classifier
First create the target vectors for this classification task:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
Training :
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(x_train,y_train_5)
sgd_clf.predict([some_digit])
Implementing cross-validation by hand:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)
for train_index, test_index in skfolds.split(x_train, y_train_5):
    clone_clf = clone(sgd_clf)
    x_train_folds = x_train[train_index]
    y_train_folds = y_train_5[train_index]
    x_test_fold = x_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(x_train_folds, y_train_folds)
    y_pred = clone_clf.predict(x_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
Each fold is generated by StratifiedKFold using stratified sampling, so the class proportions in each fold are representative of the overall dataset.
Confusion matrix
A better way to evaluate the performance of classifiers is the confusion matrix
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf,x_train,y_train_5,cv=3)
Like cross_val_score(), cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores it returns the predictions made on each fold.
from sklearn.metrics import confusion_matrix  # use this function to get the confusion matrix
confusion_matrix(y_train_5, y_train_pred)     # (true categories, predicted categories)
Rows of the confusion matrix represent actual categories, columns represent predicted categories.
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have non-zero values only on its main diagonal
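For example, a hypothetical perfect predictor (pretending every prediction equals the label):
y_train_perfect_predictions = y_train_5
confusion_matrix(y_train_5, y_train_perfect_predictions)
# -> a 2x2 matrix with non-zero counts only on the main diagonal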
Precision = TP/(TP+FP): of all positive predictions, how many are correct
Recall = TP/(TP+FN): of all actual positives, how many are detected
scikit-learn provides functions for computing several classifier metrics, including precision and recall:
from sklearn.metrics import precision_score,recall_score
precision_score(y_train_5, y_train_pred)
recall_score(y_train_5, y_train_pred)
Precision and recall can be combined into a single metric called the F1 score; F1 is the harmonic mean of precision and recall
To compute the F1 score, just call f1_score()
from sklearn.metrics import f1_score
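A minimal usage sketch, with the harmonic-mean formula as a comment:
f1_score(y_train_5, y_train_pred)
# F1 = 2 * precision * recall / (precision + recall) = TP / (TP + (FN + FP) / 2)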
In some cases you care more about precision, in other cases you care more about recall
Multilabel classification
Outputs multiple binary labels for each instance, for example: the digit is a large number (>= 7) and it is also odd
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train %2 ==1)
y_multilabel = np.c_[y_train_large,y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train, y_multilabel)
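Predicting on a single instance then returns both labels at once; a small sketch, assuming some_digit is the image of the digit 5 shown earlier:
knn_clf.predict([some_digit])
# -> array([[False,  True]])   # not "large" (5 < 7), but odd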
Multioutput classification
Use NumPy's randint() function to add noise to the pixel intensities of the MNIST images. The goal is to train a system that restores the noisy images to the original clean ones:
import numpy.random as rnd

noise = rnd.randint(0, 100, (len(x_train), 784))
x_train_mod = x_train + noise
noise = rnd.randint(0, 100, (len(x_test), 784))
x_test_mod = x_test + noise
y_train_mod = x_train   # the targets are the original (clean) images
y_test_mod = x_test

knn_clf.fit(x_train_mod, y_train_mod)
clean_digit = knn_clf.predict([x_test_mod[some_index]])
plot_digit(clean_digit)   # plot_digit: helper that displays a digit image