Love Number Experiment | Issue 7: Financial Crisis Analysis Based on Random Forest
2022-06-27 20:59:00 【Data science artificial intelligence】
Love Number Class (iDataCourse): idatacourse.cn
Field: other
Brief introduction: From the 1960s onward, a wave of independence from colonial rule swept Africa. For historical reasons reaching back hundreds of years, most countries on the African continent are relatively underdeveloped, with fragile economic systems and frequent crises of various kinds. This case performs an exploratory analysis of financial crises in 13 African countries over the past century and more, and builds a random forest model for prediction.
data :
./dataset/african_crises.csv
./dataset/SimHei.ttf
The dataset contains 1,059 records. The meaning of each field is shown in the following table:
| Field | Meaning |
|---|---|
| case | Country number: a number identifying a particular country |
| cc3 | Country code: three-letter country/region code |
| country | Country name |
| year | Year of observation |
| systemic_crisis | Systemic crisis: "0" means no systemic crisis occurred in that year, "1" means a systemic crisis occurred in that year |
| exch_usd | Exchange rate of the country's currency against the US dollar |
| domestic_debt_in_default | Domestic debt default: "0" means no domestic debt default occurred in that year, "1" means a domestic debt default occurred in that year |
| sovereign_external_debt_default | Sovereign external debt default: "0" means no sovereign debt default occurred in that year, "1" means a sovereign debt default occurred in that year |
| gdp_weighted_default | Ratio of total defaulted debt to GDP |
| inflation_annual_cpi | Annual CPI inflation rate |
| independence | Independence: "0" means not independent, "1" means independent |
| currency_crises | Currency crisis: "0" means no currency crisis occurred in that year, "1" means a currency crisis occurred in that year |
| inflation_crises | Inflation crisis: "0" means no inflation crisis occurred in that year, "1" means an inflation crisis occurred in that year |
| banking_crisis | Banking crisis: "no_crisis" means no banking crisis occurred in that year, "crisis" means a banking crisis occurred in that year |
1. Data reading and preprocessing
1.1 Reading data
# Import corresponding modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
# Set the font
font = FontProperties(fname = "./dataset/SimHei.ttf", size=14)
import seaborn as sns
import random
# Set the drawing style
%matplotlib inline
sns.set(style='whitegrid')
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
# Reading data
data = pd.read_csv('./dataset/african_crises.csv')
data.sample(5)
1.2 View the basic information of the data
First, let's look at the total number of countries .
unique_countries = data.country.unique()
unique_countries
The data covers 13 African countries, in order: Algeria, Angola, Central African Republic, Ivory Coast, Egypt, Kenya, Mauritius, Morocco, Nigeria, South Africa, Tunisia, Zambia and Zimbabwe. Next, we use descriptive statistics functions to check the basic state of the data and whether any values are missing or abnormal.
# Basic information about data sets
data.______()
Apart from the country code cc3, the country name country and the banking crisis indicator banking_crisis, which are strings, all fields are numeric, and the data has no missing values.
# View statistical indicators of data
data.____________(include = 'all')
From the summary statistics, the year field ranges from 1860 to 2014, and Egypt has the most records, 155 in total. There is also one anomaly: currency_crises should only take the values 0 and 1, yet the value 2 appears in the data, so it needs separate handling. The other indicators show no obvious abnormality.
1.3 Data preprocessing
# Inspect the rows where currency_crises equals 2
data[data['currency_crises'] == 2]
Only 4 records are anomalous, so we delete them directly.
data = data[data['currency_crises'] != 2]  # Keep only the rows where currency_crises is not 2
data.______  # View the size of the resulting dataset
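The anomaly check and removal described above can be tried on a small made-up frame. This is a minimal sketch under stated assumptions: the toy rows are invented for illustration, and only the column name `currency_crises` comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "currency_crises": [0, 1, 2, 0, 2],
})

# Count each distinct value to spot anything outside the documented {0, 1}
value_counts = toy["currency_crises"].value_counts().to_dict()

# Keep only the rows whose value is a valid binary flag
valid = toy[toy["currency_crises"].isin([0, 1])]

print(value_counts)
print(valid.shape)
```

Filtering with `isin([0, 1])` is a slightly stricter choice than `!= 2`, since it would also drop any other unexpected value.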
2. Exploratory analysis of economic indicators
After World War II the present world order took shape, and from the 1960s onward a wave of independence from colonial rule swept Africa. For historical reasons reaching back hundreds of years, Africa remains the least developed continent: most of its countries lag in economic and political development, levels of human development are low, economic systems are fragile, and crises of various kinds occur frequently. Next, we use this dataset to analyze each country's economic development and economic crises before and after independence. First, we draw line charts of the exchange rate of each of the 13 countries' currencies against the US dollar.
2.1 Changes in the exchange rate of the currency against the US dollar
plt.figure(figsize=(12, 20))
for i in range(13):
    plt.subplot(7, 2, i + 1)
    country = unique_countries[i]
    sub = data[data.country == country]  # Rows for this country
    # Randomly generate a color: random.choice() draws one hex digit at a time, and six draws form a "#RRGGBB" string
    col = "#" + ''.join([random.choice('0123456789ABCDEF') for j in range(6)])
    # Draw the line chart (x and y are passed by keyword, which recent seaborn versions require)
    sns.____________(x=sub['year'], y=sub['exch_usd'], label=country, color=col)
    # First year in which the country is recorded as independent
    indep_year = np.min(sub[sub.independence == 1]['year'])
    # Mark the independence year with a dotted vertical line
    plt.plot([indep_year, indep_year], [0, np.max(sub['exch_usd'])],
             color='black', linestyle='dotted', alpha=0.8)
    plt.______(country)  # Add the subplot title
plt.tight_layout()  # Automatically adjust the subplot spacing
plt.show()  # Show the exchange-rate line charts for the 13 countries
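The random color trick used in the loop above can be isolated into a small helper. This is only a sketch: the helper name `random_hex_color` and the seeded generator are assumptions added for illustration and are not part of the original notebook.

```python
import random

def random_hex_color(rng=random):
    """Return a '#RRGGBB' color string built from six random hex digits."""
    return "#" + "".join(rng.choice("0123456789ABCDEF") for _ in range(6))

rng = random.Random(0)  # seeded generator so the colors are reproducible
colors = [random_hex_color(rng) for _ in range(13)]  # one color per country
print(colors)
```

Seeding the generator is a design choice for repeatable figures; calling `random_hex_color()` with no argument behaves like the unseeded original.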
As the charts show, most countries had no monetary system of their own in the years just before and after independence and still used the currency of the colonial power, such as the franc or the pound. For Angola, Zimbabwe, Zambia, Nigeria and some other countries, the exchange rate stays at 0 for a long stretch, and a national currency only appears around the start of the 21st century. Tunisia's exchange rate fell sharply after independence and then stayed at around 1:1 for a long time. More generally, the currencies of most African countries have gradually depreciated against the US dollar over time.
2.2 Changes in inflation rate
The inflation rate, also called the rate of price change, is the ratio of the excess money issued to the amount of money actually needed, and it reflects the degree of inflation and currency depreciation. In practice it is calculated as the growth rate of a price index; this dataset uses the consumer price index (CPI).
By the rate of price increase, inflation is classified as:
- Moderate inflation (annual price increase between 1% and 6%)
- Severe inflation (annual price increase between 6% and 9%)
- Galloping inflation (annual price increase from 10% to under 50%)
- Hyperinflation (annual price increase of 50% or more)
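The categories above can be written as a simple threshold function. This is a sketch under stated assumptions: the function name is hypothetical, the list leaves the 9% to 10% range and rates under 1% undefined, and the choices made for those two cases below are assumptions rather than part of the original classification.

```python
def classify_inflation(rate_percent):
    """Map an annual price increase (in percent) to the categories listed above."""
    if rate_percent >= 50:
        return "hyperinflation"
    if rate_percent >= 10:
        return "galloping"
    if rate_percent >= 6:
        return "severe"  # assumption: also covers the 9%-10% gap in the list
    if rate_percent >= 1:
        return "moderate"
    return "below moderate"  # assumption: label for rates under 1%, including deflation

labels = [classify_inflation(r) for r in (0.5, 3.0, 7.5, 9.5, 25.0, 4000.0)]
print(labels)
```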
Next, let's analyze how the inflation rate changed in each country.
plt.figure(figsize=(12, 20))
for i in range(13):
    plt.subplot(7, 2, i + 1)
    country = unique_countries[i]
    sub = data[data.country == country]  # Rows for this country
    # Randomly generate a color
    col = "#" + ''.join([random.choice('0123456789ABCDEF') for j in range(6)])
    # Draw the line chart (x and y are passed by keyword, which recent seaborn versions require)
    sns.lineplot(x=sub['year'], y=sub['inflation_annual_cpi'], label=country, color=col)
    # Add the scatter points
    plt.______(sub['year'], sub['inflation_annual_cpi'], color=col, s=28)  # s is the marker area
    # Mark the first year of independence with a dotted vertical line
    indep_year = np.min(sub[sub.independence == 1]['year'])
    plt.plot([indep_year, indep_year],
             [np.min(sub['inflation_annual_cpi']), np.max(sub['inflation_annual_cpi'])],
             color='black', linestyle='dotted', alpha=0.8)  # alpha is the line transparency
    plt.title(country)
plt.tight_layout()  # Automatically adjust the subplot spacing
plt.show()  # Show the inflation-rate charts for the 13 countries
Most of these African countries have experienced inflation of varying severity. South Africa, for example, was hit hard between 1970 and 1990, when its apartheid policy brought sustained sanctions from Western economies. Angola went through repeated civil wars in the 1990s, and the fighting sent inflation soaring, to above 4000 in the worst year. Some economies were comparatively stable: in Tunisia, the inflation rate rose briefly after independence, then gradually fell to a low level and stayed there.
2.3 Distribution of other crises
Next, let's analyze the remaining fields for each country: systemic crisis systemic_crisis, domestic debt default domestic_debt_in_default, sovereign debt default sovereign_external_debt_default, currency crisis currency_crises, inflation crisis inflation_crises and banking crisis banking_crisis.
sns.set(style='darkgrid')
columns = ['systemic_crisis', 'domestic_debt_in_default', 'sovereign_external_debt_default',
           'currency_crises', 'inflation_crises', 'banking_crisis']
# Plot the distribution of the remaining crisis indicators
plt.figure(figsize=(16, 16))
for i in range(6):
    plt.subplot(3, 2, i + 1)  # Integer arguments: string specs such as '321' are rejected by recent Matplotlib
    sns.countplot(y=data.country, hue=data[columns[i]], palette='rocket')  # palette selects the color palette
    plt.______(loc=0)  # loc=0 picks the best legend location automatically
    plt.title(columns[i])
plt.tight_layout()
plt.show()
- Sovereign debt default sovereign_external_debt_default means that a country's government cannot repay the principal and interest on its externally borrowed debt on time. The figure above shows that the Central African Republic, Zimbabwe and Ivory Coast had a large number of sovereign debt defaults, which leads to extremely low sovereign credit ratings.
- At the same time, apart from Angola and Zimbabwe, most of the remaining countries have not defaulted on their domestic debt domestic_debt_in_default. This is because a government with the right to issue money can print new money and pay off local-currency (domestic) debt by expanding the money supply, so it does not default on domestic debt. This is also why sovereign bonds denominated in the local currency enjoy the highest credit rating within the country. In reality, however, issuing too much money brings inflation and fluctuations in the value of the local currency, so the total debt a government can carry toward all creditors is still capped.
- A systemic financial crisis, also called a "comprehensive financial crisis", means serious disorder across the major financial fields, with currency crises, banking crises and foreign debt crises occurring simultaneously or in succession. It tends to occur in market-oriented countries and regions with a relatively developed financial economy, financial system and financial assets, and in countries with serious deficits and foreign debts, and it is highly destructive to the development of the world economy.
The country with the most systemic crises systemic_crisis is the Central African Republic, followed by Zimbabwe and Kenya. By the definition of a systemic crisis, there should be a link between systemic crises and banking crises. Let us examine whether these countries had a banking crisis banking_crisis at the same time as a systemic crisis.
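The ranking of countries by number of systemic crises can also be checked numerically rather than read off the chart. A minimal sketch, assuming a toy frame with invented rows; only the column names follow the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration; only the column names follow the dataset
toy = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Kenya", "Zimbabwe", "Zimbabwe", "Egypt"],
    "systemic_crisis": [1, 1, 0, 1, 0, 0],
})

# Summing a 0/1 flag per country gives the number of crisis years
crisis_years = (
    toy.groupby("country")["systemic_crisis"]
    .sum()
    .sort_values(ascending=False)
)
print(crisis_years)
```

On the real data, the same pattern would be applied to `data` instead of `toy`.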
2.4 The correlation between systemic crisis and banking crisis
# Build a dataset with the year, country, systemic crisis and banking crisis columns
systemic = data[['year', 'country', 'systemic_crisis', 'banking_crisis']]
# Keep the three countries with the most systemic crises and plot how the two crises overlap
systemic = systemic[systemic['country'].isin(['Central African Republic', 'Kenya', 'Zimbabwe'])]
plt.figure(figsize=(12, 12))
count = 1
for country in systemic.country.unique():
    plt.subplot(len(systemic.country.unique()), 1, count)
    subset = systemic[systemic['country'] == country]
    sns.lineplot(x=subset['year'], y=subset['systemic_crisis'], ci=None)  # ci sets the confidence-interval size; None disables it
    plt.scatter(subset['year'], subset["banking_crisis"], color='coral', label='Banking Crisis')
    plt.subplots_adjust(hspace=0.6)  # hspace sets the vertical spacing between subplots
    plt.______('Years')  # Label the x axis
    plt.______('Systemic Crisis/Banking Crisis')  # Label the y axis
    plt.title(country)
    count += 1
plt.show()
The blue line shows whether a systemic crisis occurred and the coral points show whether a banking crisis occurred. The charts show how the two kinds of crisis overlap in time, which supports our hypothesis that systemic crises and banking crises are linked.
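The overlap read off the charts can be quantified with a cross-tabulation. This is a hedged sketch on invented rows, not a result from the dataset; at this point in the notebook banking_crisis still holds the strings "crisis" and "no_crisis", which the toy frame mirrors.

```python
import pandas as pd

# Invented yearly records for one hypothetical country
toy = pd.DataFrame({
    "systemic_crisis": [0, 0, 1, 1, 1, 0],
    "banking_crisis": ["no_crisis", "no_crisis", "crisis", "crisis", "no_crisis", "no_crisis"],
})

# Rows: systemic crisis flag; columns: banking crisis label; cells: number of years
overlap = pd.crosstab(toy["systemic_crisis"], toy["banking_crisis"])

# Share of systemic-crisis years that were also banking-crisis years
share = overlap.loc[1, "crisis"] / overlap.loc[1].sum()
print(overlap)
print(share)
```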
Calculate the correlation between all features
# Encode the banking_crisis column
# Records without a crisis are marked 0, records with a crisis are marked 1
data['banking_crisis'] = data['banking_crisis'].map({"no_crisis": 0, "crisis": 1})
# Select all features
selected_features = ['systemic_crisis', 'exch_usd', 'domestic_debt_in_default', 'sovereign_external_debt_default', 'gdp_weighted_default',
                     'inflation_annual_cpi', 'independence', 'currency_crises', 'inflation_crises', 'banking_crisis']
corr = data[selected_features].______()  # Compute the pairwise correlations and build the correlation matrix
fig = plt.figure(figsize=(12, 8))
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # Blue-white-red diverging colormap
mask = np.zeros_like(corr, dtype=bool)  # All-False boolean array shaped like the correlation matrix (np.bool was removed from NumPy)
mask[np.triu_indices_from(mask)] = True  # Mask the upper triangle, diagonal included
# Draw the heatmap
sns.______(corr, mask=mask, cmap=cmap, vmin=-0.5, vmax=0.7, center=0, annot=True,
           square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title("Correlation between features", fontproperties=font)
plt.show()
Besides the high correlation between systemic crisis and banking crisis, the heatmap shows that the exchange rate against the US dollar and domestic debt default are both highly correlated with sovereign debt default.
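The triangular mask used for the heatmap is easiest to understand on a tiny matrix. A minimal sketch, assuming an invented 3x3 symmetric matrix in place of the real correlation matrix:

```python
import numpy as np

# Small symmetric matrix standing in for a correlation matrix (values invented)
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Boolean mask of the same shape; True marks the cells the heatmap should hide
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True  # upper triangle, diagonal included

# The cells that stay visible are the strictly lower triangle
visible = corr[~mask]
print(mask)
print(visible)
```

Because a correlation matrix is symmetric, hiding the upper triangle removes only duplicated values and the trivial diagonal of ones.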
Next, we build a random forest classification model to predict banking crises and identify the features that affect their occurrence.
3. Build a banking crisis prediction model
- Feature encoding
- Dataset splitting and stratified sampling
- Building a random forest prediction model
- Model evaluation
- Optimizing the model with SMOTE oversampling
- Feature importance ranking
3.1 Feature encoding
data.drop(['case', 'cc3'], axis=1, inplace=True)  # Drop the case and cc3 columns from the dataset
data.head()
# Label-encode the country column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.______(data['country'].values)  # Learn the mapping from country names to integer labels
data['country'] = le.____________(data['country'].values)  # Replace each country name with its integer label
print(data['country'])  # View the country column after encoding
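How LabelEncoder maps names to integers can be seen on a short list. This sketch shows one common way to use the class on invented sample values; whether the blanks in the lab cell above expect exactly these calls is an assumption.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["Kenya", "Egypt", "Kenya", "Zambia", "Egypt"]  # invented sample

le = LabelEncoder()
le.fit(countries)                       # learns the sorted unique labels
codes = le.transform(countries)         # maps each label to its index in le.classes_
restored = le.inverse_transform(codes)  # maps the integers back to the names

print(list(le.classes_))
print(codes)
```

The integer codes carry no ordering meaning; tree-based models such as random forests tolerate this better than linear models do.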
# Bar chart comparing the number of no_crisis and crisis records
fig = plt.figure(figsize=(8, 6))
data['banking_crisis'].value_counts().plot(kind='______', rot=360, color='lightseagreen')
plt.xticks([0, 1], ["no_crisis", "crisis"])
plt.show()
The bar chart shows that records with a banking crisis are far fewer than records without one, at a ratio of about 1:10. With such imbalanced data, the class proportions in the training set and the test set should be kept as consistent as possible, so we split the data with stratified sampling.
3.2 Dataset splitting and stratified sampling
Let's now split the data into a training set and a test set. The model_selection module of sklearn provides the train_test_split() function for this purpose. Its syntax is train_test_split(x, y, test_size=None, random_state=None, stratify=y), where:
- x, y: the features used for prediction and the target to be predicted, respectively.
- test_size: the proportion of the test set; for example, test_size=0.2 holds out 20% of the data as the test set.
- random_state: the random seed. Because the split is random, fixing random_state makes training repeatable and the results reproducible.
- stratify: enables stratified sampling, so that samples with and without a banking crisis appear in the same proportion in the training set and the test set.
The function returns four objects: the training and test sets of x, followed by the training and test sets of y.
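The effect of the stratify parameter can be verified on synthetic labels. A minimal sketch, assuming invented data with the same roughly 1:10 imbalance; the seed value 33 simply mirrors the one used in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
x = np.arange(100).reshape(-1, 1)  # a single dummy feature

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=33, stratify=y
)

train_rate = y_train.mean()  # share of positives in the training set
test_rate = y_test.mean()    # share of positives in the test set
print(train_rate, test_rate)
```

Without `stratify=y`, a small test set could by chance contain very few or no positives, which would make recall on the minority class unreliable.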
# Split into a training set and a test set
from sklearn import model_selection
x = data.drop('banking_crisis', axis=1)  # Every column except banking_crisis is a feature
y = data['banking_crisis']  # The banking_crisis column is the target
x_train, x_test, y_train, y_test = model_selection.__________________(x, y, test_size=0.2, random_state=33, stratify=y)
3.3 Establish a random forest prediction model
Random forest is an ensemble learning method: it draws samples and features from the data at random, trains many different decision trees, and combines them into a "forest". Each tree gives its own classification, called a "vote". For classification, the forest picks the class with the most votes; for regression, it takes the mean of the trees' outputs. In Python, the RandomForestClassifier class in sklearn.ensemble builds the classification model. Its main parameters include:
- n_estimators: number of trees to train (default 100);
- max_depth: maximum depth of each tree (default None, meaning trees grow until their leaves are pure or too small to split);
- max_features: maximum number of features considered at each split (default 'sqrt' in current scikit-learn releases; older releases used 'auto', which meant the same thing for classifiers);
- random_state: random seed.
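The voting and averaging described above can be illustrated without any library. This is a sketch of the textbook aggregation only, with hypothetical helper names and invented tree outputs.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees (ties go to the first seen)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def mean_prediction(tree_predictions):
    """Regression counterpart: average the trees' numeric outputs."""
    return sum(tree_predictions) / len(tree_predictions)

votes = [1, 0, 1, 1, 0]  # invented class votes from five trees
winner = majority_vote(votes)
average = mean_prediction([2.0, 4.0, 6.0])
print(winner, average)
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch matches the description in the text more closely than the library's internals.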
from sklearn.ensemble import RandomForestClassifier
# Train the random forest classification model
rf = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=10, random_state=20)
rf.fit(x_train, y_train)
y_pred = rf.______(x_test)  # Predict the labels of the test set
3.4 Model evaluation
To evaluate the model, we use the functions classification_report(), confusion_matrix() and accuracy_score() to output the model's classification report, confusion matrix and classification accuracy, respectively.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_pred))  # Output the classification report
print(accuracy_score(y_test, y_pred))  # Output the classification accuracy
cm = __________________(y_test, y_pred)  # Stored as cm so the imported function is not overwritten
print(cm)  # Output the confusion matrix
# Draw the confusion matrix heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns._________(cm, ax=ax, annot=True, annot_kws={'size': 15}, fmt='d', cmap='YlGnBu_r')
ax.set_ylabel('True value', fontproperties=font)
ax.set_xlabel('Predicted value', fontproperties=font)
ax.set_title('Confusion matrix heatmap', fontproperties=font)
plt.show()  # Show the confusion matrix heatmap
The classification report shows that recall for the banking crisis class (the minority class) reaches 89%. The confusion matrix and its heatmap show that correctly classified samples make up a high proportion, which indicates that the random forest model performs well. Next, we draw the ROC curve for this binary classification problem and compute the AUC for further evaluation.
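Recall, precision and accuracy all follow from the four cells of the confusion matrix. A minimal sketch with invented labels and predictions, written in plain Python so each count is explicit; the helper name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for binary labels (1 = crisis)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # invented labels
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]  # invented predictions

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)         # share of real crises that were caught
precision = tp / (tp + fp)      # share of predicted crises that were real
accuracy = (tp + tn) / len(y_true)
print(tn, fp, fn, tp, recall, precision, accuracy)
```

With imbalanced classes, accuracy can stay high even when many crises are missed, which is why the article focuses on recall for the minority class.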
ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
y_score = rf.predict_proba(x_test)[:, 1]  # Predicted probability of the crisis class
# Score the probabilities so the AUC matches the plotted curve; hard labels from rf.predict() generally give a different value
roc_auc = ____________(y_test, y_score)  # Compute the AUC
fpr, tpr, thresholds = ____________(y_test, y_score)  # Compute the TPR and FPR at each threshold
# Draw the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'r--')  # Random-guess baseline
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc="lower right")  # Place the legend at the bottom right
plt.show()
The random forest's ROC curve performs well, hugging the upper-left corner, and the reported AUC is 0.94.
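AUC has a direct interpretation: the chance that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force on invented scores and also shows that thresholded 0/1 labels can yield a different value than probabilities; the function name is hypothetical.

```python
def auc_from_scores(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]                   # invented labels
proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]       # invented P(crisis) scores
hard = [1 if p >= 0.5 else 0 for p in proba]  # labels at a 0.5 threshold

auc_proba = auc_from_scores(y_true, proba)
auc_hard = auc_from_scores(y_true, hard)
print(auc_proba, auc_hard)
```

This is why an AUC shown next to a ROC curve is normally computed from the same probability scores that were used to draw the curve.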
Because the training set is small and its classes are imbalanced, we apply SMOTE oversampling to the minority class, which expands the minority samples and optimizes the model.
3.5 Optimizing the model with SMOTE oversampling
The basic idea of SMOTE is to analyze the minority-class samples and synthesize new samples from them, which are then added to the dataset. For a minority sample a, the algorithm randomly selects one of its nearest minority-class neighbors b, and then picks a random point c on the line segment between a and b as a new minority sample.
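The interpolation step can be written in a few lines of NumPy. This is a sketch of the idea only, not the imblearn implementation: the points are invented, and the helper name and the choice of k = 2 neighbors are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [5.0, 5.0]])

def smote_point(samples, i, k, rng):
    """Create one synthetic point from sample i and one of its k nearest neighbors."""
    a = samples[i]
    dist = np.linalg.norm(samples - a, axis=1)
    neighbors = np.argsort(dist)[1:k + 1]  # skip position 0, which is a itself
    b = samples[rng.choice(neighbors)]
    lam = rng.random()                     # position along the segment, in [0, 1)
    return a + lam * (b - a), b

c, b = smote_point(minority, i=0, k=2, rng=rng)
print(c, b)
```

Because c always lies between two real minority samples, the synthetic points stay inside the region the minority class already occupies.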
After splitting the dataset, we oversample only the training set to expand the minority class. In Python, the SMOTE class in imblearn.over_sampling builds the SMOTE oversampling model.
# Apply SMOTE oversampling to x_train and y_train
from imblearn.over_sampling import SMOTE
x_train_resampled, y_train_resampled = SMOTE(random_state=4).fit_resample(x_train, y_train)
print(x_train_resampled.shape, y_train_resampled.shape)  # View the dataset size after oversampling
# Select the best parameters by grid search
from sklearn.model_selection import GridSearchCV
param_grid = [{
    'n_estimators': [10, 20, 30, 40, 50],
    'max_depth': [5, 8, 10, 15, 20, 25]
}]
grid_search = GridSearchCV(rf, param_grid, scoring='recall')
grid_search.fit(x_train_resampled, y_train_resampled)
# Output the best parameter combination and its score
print("best params:", grid_search.best_params_)
print("best score:", grid_search.best_score_)
After SMOTE oversampling and grid search, the best parameters are max_depth: 10 and n_estimators: 10, with a best recall of 0.97, up from 0.89 for the model before optimization. Note that 0.97 is a cross-validated score on the oversampled training set, whereas 0.89 was measured on the held-out test set, so the two figures are not strictly comparable.
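The size of the search performed above follows from the parameter grid. A minimal sketch that enumerates the candidates with the standard library; the 5-fold figure is an assumption based on the GridSearchCV default in recent scikit-learn releases, since the article does not set cv explicitly.

```python
from itertools import product

param_grid = {
    "n_estimators": [10, 20, 30, 40, 50],
    "max_depth": [5, 8, 10, 15, 20, 25],
}

# Every combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * 5  # assumption: 5-fold CV, the GridSearchCV default in recent scikit-learn
print(n_candidates, n_fits)
```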
Finally, we use the random forest to pick out the important features that affect the occurrence of a banking crisis, and draw a feature importance ranking chart.
3.6 Feature importance ranking
# Get the feature importance scores from the random forest
rf_importance = rf.__________________
index = data.drop(['banking_crisis'], axis=1).columns  # Feature names: every column except banking_crisis
# Sort ascending so the most important feature ends up at the top of the horizontal bar chart
rf_feature_importance = pd.DataFrame(rf_importance.T, index=index, columns=['score']).sort_values(by='score', ascending=True)
# Draw the horizontal bar chart; figsize is passed to plot() because DataFrame.plot creates its own figure
rf_feature_importance.plot(kind='______', legend=False, color='deepskyblue', figsize=(16, 12))
plt.title('Importance of random forest features', fontproperties=font)
plt.show()
The systemic crisis indicator systemic_crisis is the most important feature, and the annual CPI inflation rate inflation_annual_cpi, the year year and the exchange rate against the US dollar exch_usd also rank highly. These features matter most for predicting whether a banking crisis occurs, which is consistent with the conclusion of the earlier feature correlation analysis.
4. Summary
Love Number Class (iDataCourse) is a big data and artificial intelligence course and resource platform for colleges and universities. The platform provides authoritative course resources, data resources and case experiment resources to help universities build big data and artificial intelligence majors, develop curricula and strengthen teaching capacity.