iDataCourse Lab | Issue 7: Financial Crisis Analysis Based on Random Forest
2022-06-27 20:59:00 【Data science artificial intelligence】
iDataCourse (爱数课): idatacourse.cn
Field: Other
Introduction: In the 1960s, a wave of independence from colonial rule swept across Africa. For historical reasons going back hundreds of years, most countries on the African continent are relatively underdeveloped economically, have fragile economic systems, and frequently experience crises of various kinds. This case study carries out an exploratory analysis of financial crises in 13 African countries over the past century and builds a random forest model to predict them.
Data:
./dataset/african_crises.csv
./dataset/SimHei.ttf
The dataset contains 1,059 records. The meaning of each field is shown in the table below:
| Field | Meaning |
|---|---|
| case | Country number: a number identifying a particular country |
| cc3 | Country code: three-letter country/region code |
| country | Country name |
| year | Year of observation |
| systemic_crisis | Systemic crisis: "0" means no systemic crisis occurred that year, "1" means a systemic crisis occurred that year |
| exch_usd | Exchange rate of the country's currency against the US dollar |
| domestic_debt_in_default | Domestic debt default: "0" means no domestic debt default occurred that year, "1" means a domestic debt default occurred that year |
| sovereign_external_debt_default | Sovereign debt default: "0" means no sovereign debt default occurred that year, "1" means a sovereign debt default occurred that year |
| gdp_weighted_default | Ratio of total defaulted debt to GDP |
| inflation_annual_cpi | Annual CPI inflation rate |
| independence | Independence: "0" means not independent, "1" means independent |
| currency_crises | Currency crisis: "0" means no currency crisis occurred that year, "1" means a currency crisis occurred that year |
| inflation_crises | Inflation crisis: "0" means no inflation crisis occurred that year, "1" means an inflation crisis occurred that year |
| banking_crisis | Banking crisis: "no_crisis" means no banking crisis occurred that year, "crisis" means a banking crisis occurred that year |
1. Data reading and preprocessing
1.1 Reading data
# Import corresponding modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
# Set the font
font = FontProperties(fname = "./dataset/SimHei.ttf", size=14)
import seaborn as sns
import random
# Set the drawing style
%matplotlib inline
sns.set(style='whitegrid')
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
# Reading data
data = pd.read_csv('./dataset/african_crises.csv')
data.sample(5)
1.2 Viewing basic information about the data
First, let's look at which countries the data covers.
unique_countries = data.country.unique()
unique_countries
The data covers 13 African countries, in order: Algeria, Angola, Central African Republic, Ivory Coast, Egypt, Kenya, Mauritius, Morocco, Nigeria, South Africa, Tunisia, Zambia, and Zimbabwe. Next, we use descriptive statistics functions to check the basic state of the data and whether any values are missing or anomalous.
# Basic information about the dataset
data.info()
Apart from the country code cc3, the country name country, and the banking crisis field banking_crisis, which are strings, all fields are numeric, and the data has no missing values.
# View summary statistics of the data
data.describe(include = 'all')
The summary statistics show that year ranges from 1860 to 2014, and that Egypt has the most records, with 155 rows. We also find an anomaly: currency_crises should only take the values 0 and 1, but the value 2 appears in the data, so we need to handle it separately. The other fields show no obvious anomalies.
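The range check described above can be sketched on a small made-up frame. This is only an illustration: the `toy` DataFrame and its values are assumptions and are not taken from african_crises.csv.

```python
import pandas as pd

# Toy frame standing in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({"currency_crises": [0, 1, 0, 2, 1, 0]})

# value_counts() lists every distinct value, so an out-of-range code stands out.
counts = toy["currency_crises"].value_counts().sort_index()
print(counts)

# Rows whose value falls outside the documented {0, 1} range.
invalid = toy[~toy["currency_crises"].isin([0, 1])]
print(invalid)
```

Compared with reading the min and max from the summary statistics, `value_counts()` also shows how many rows carry each unexpected value.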
1.3 Data preprocessing
# Inspect the rows where currency_crises equals 2
data[data['currency_crises'] == 2]
Only 4 rows are anomalous, so we delete them directly.
data = data[data['currency_crises'] != 2]  # Keep only the rows where currency_crises is not 2
data.shape  # Check the size of the resulting dataset
2. Exploratory analysis of economic indicators
After World War II, today's world order took shape, and from the 1960s onward a wave of independence from colonial rule swept across Africa. For historical reasons going back hundreds of years, the African continent remains among the least developed regions in the world: most countries lag in economic and political development, human capital is underdeveloped, economic systems are fragile, and crises of various kinds occur frequently. Next, we use this dataset to analyze the economic development and economic crises of these countries before and after independence. First, we draw line charts of the exchange rate of each of the 13 countries' currencies against the US dollar.
2.1 Changes in the exchange rate of the currency against the US dollar
plt.figure(figsize=(12,20))
for i in range(13):
    plt.subplot(7,2,i+1)
    country = unique_countries[i]
    subset = data[data.country == country]
    # Generate a random color: random.choice() picks one element from a sequence; six picks form a 6-digit hex color
    col = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
    # Draw the line chart (x and y are passed as keywords, which newer seaborn releases require)
    sns.lineplot(x = subset['year'], y = subset['exch_usd'], label = country, color = col)
    # np.logical_and() is True only where both conditions hold; the first year with independence == 1 marks independence
    independence_year = np.min(data[np.logical_and(data.country == country, data.independence == 1)]['year'])
    plt.plot([independence_year, independence_year],
             [0, np.max(subset['exch_usd'])], color = 'black', linestyle = 'dotted', alpha = 0.8)
    plt.title(country)  # Add the subplot title
plt.tight_layout()  # Automatically adjust subplot spacing
plt.show()  # Show the exchange rate line charts for the 13 countries
As the charts show, most countries had no monetary system of their own in the short period around independence and continued to use the currency of the colonial power, such as the franc or the pound. For Angola, Zimbabwe, Zambia, Nigeria, and some other countries, the recorded exchange rate stays at 0 for a long time, and a national currency only appears around the turn of the 21st century. Tunisia's exchange rate fell sharply after independence and then stabilized at about 1:1 over the long term. In addition, measured against the US dollar, the currencies of most African countries have gradually depreciated over time.
2.2 Changes in inflation rate
The inflation rate, also called the rate of price change, is the ratio of the excess money issued to the amount of money actually needed, and it reflects the degree of inflation and currency depreciation. In practice it is calculated as the growth rate of a price index; this dataset uses the consumer price index (CPI).
Inflation is classified by the annual rate of price increase:
- Moderate inflation (annual price increase of 1% to 6%)
- Severe inflation (annual price increase of 6% to 9%)
- Galloping inflation (annual price increase of 10% to 50%)
- Hyperinflation (annual price increase above 50%)
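The bands above can be turned into a small helper. This is a sketch under stated assumptions: the function name `classify_inflation` is hypothetical, and because the list leaves rates below 1% and between 9% and 10% unassigned, the code folds 9% to 10% into "severe" and labels anything below 1% as "none/low".

```python
def classify_inflation(rate_pct):
    """Map an annual price-increase rate (in percent) to the bands listed above.

    Assumptions (not in the source): rates below 1% are labeled "none/low",
    and the unassigned 9%-10% gap is folded into "severe".
    """
    if rate_pct < 1:
        return "none/low"
    if rate_pct <= 6:
        return "moderate"
    if rate_pct < 10:
        return "severe"
    if rate_pct <= 50:
        return "galloping"
    return "hyperinflation"


for rate in (3, 8, 25, 4000):
    print(rate, classify_inflation(rate))
```

Applied to a column such as inflation_annual_cpi, a helper like this would give a categorical view of each country-year, which is easier to count than the raw rate.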
Next, let's analyze how inflation rates have changed in each country.
plt.figure(figsize=(12,20))
for i in range(13):
    plt.subplot(7,2,i+1)
    country = unique_countries[i]
    subset = data[data.country == country]
    # Generate a random color
    col = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
    # Draw the line chart
    sns.lineplot(x = subset['year'], y = subset['inflation_annual_cpi'], label = country, color = col)
    # Overlay a scatter plot; s is the marker area
    plt.scatter(subset['year'], subset['inflation_annual_cpi'], color = col, s = 28)
    # Dotted vertical line at the first year with independence == 1; alpha sets the transparency
    independence_year = np.min(data[np.logical_and(data.country == country, data.independence == 1)]['year'])
    plt.plot([independence_year, independence_year],
             [np.min(subset['inflation_annual_cpi']), np.max(subset['inflation_annual_cpi'])],
             color = 'black', linestyle = 'dotted', alpha = 0.8)
    plt.title(country)
plt.tight_layout()  # Automatically adjust subplot spacing
plt.show()  # Show the inflation rate charts for the 13 countries
Most of these African countries have experienced inflation to varying degrees. In South Africa, for example, the economy was hit hard between 1970 and 1990 as Western economies imposed repeated sanctions over its apartheid policy. Angola went through repeated civil wars in the 1990s, and the war sent inflation soaring, to above 4,000% in the peak year. Some countries have had relatively stable economies: in Tunisia, the inflation rate rose briefly after independence, then gradually fell to a low level and stayed stable.
2.3 Distribution of other crises
Next, let's analyze the remaining fields for each country: systemic crisis (systemic_crisis), domestic debt default (domestic_debt_in_default), sovereign debt default (sovereign_external_debt_default), currency crisis (currency_crises), inflation crisis (inflation_crises), and banking crisis (banking_crisis).
sns.set(style='darkgrid')
columns = ['systemic_crisis','domestic_debt_in_default','sovereign_external_debt_default','currency_crises','inflation_crises','banking_crisis']
# Draw the distribution of each of the other features by country
plt.figure(figsize=(16,16))
for i in range(6):
    plt.subplot(3,2,i+1)  # Integer arguments; the string form '32'+str(i+1) is rejected by newer matplotlib releases
    sns.countplot(y = data.country,hue = data[columns[i]],palette = 'rocket')  # palette sets the color palette
    plt.legend(loc = 0)  # loc = 0 picks the best legend location automatically
    plt.title(columns[i])
plt.tight_layout()
plt.show()
- Sovereign debt default (sovereign_external_debt_default) refers to a national government being unable to repay on time the principal and interest on debt it has borrowed externally under its guarantee. The charts above show that the Central African Republic, Zimbabwe, and Ivory Coast recorded a large number of sovereign debt defaults, which leads to extremely low sovereign credit ratings.
- Apart from Angola and Zimbabwe, most of the remaining countries have not defaulted on their domestic debt (domestic_debt_in_default). This is because a government with the right to issue money can issue new money to pay off its local-currency (domestic) debt, so it need not default on that debt; this is also why sovereign bonds denominated in local currency enjoy the highest credit rating domestically. In practice, however, issuing too much money brings inflation and fluctuations in the value of the local currency, so the government's total debt to all creditors is capped.
- A systemic financial crisis, also called a "comprehensive financial crisis", means serious disorder across the major financial sectors, such as currency, banking, and foreign debt crises occurring simultaneously or in succession. It tends to occur in market-oriented countries and regions with relatively developed financial economies, financial systems, and financial assets, and in countries with serious deficits and foreign debt, and it is highly destructive to the world economy.
The country with the most systemic crises (systemic_crisis) is the Central African Republic, followed by Zimbabwe and Kenya. By the definition of a systemic crisis, systemic crises and banking crises should be linked. Let us examine whether these countries experienced a banking crisis (banking_crisis) at the same time as a systemic crisis.
2.4 The correlation between systemic crisis and banking crisis
# Build a dataset containing year, country, systemic crisis, and banking crisis
systemic = data[['year','country', 'systemic_crisis', 'banking_crisis']]
# Keep the three countries with the most systemic crises and plot how systemic and banking crises overlap
systemic = systemic[(systemic['country'] == 'Central African Republic') | (systemic['country']=='Kenya') | (systemic['country']=='Zimbabwe') ]
plt.figure(figsize=(12,12))
count = 1
for country in systemic.country.unique():
    plt.subplot(len(systemic.country.unique()),1,count)
    subset = systemic[(systemic['country'] == country)]
    sns.lineplot(x = subset['year'], y = subset['systemic_crisis'], ci=None)  # ci sets the confidence interval; None disables it
    plt.scatter(subset['year'],subset["banking_crisis"], color='coral', label='Banking Crisis')
    plt.subplots_adjust(hspace=0.6)  # hspace sets the vertical spacing between subplots
    plt.xlabel('Years')  # Label the x-axis
    plt.ylabel('Systemic Crisis/Banking Crisis')  # Label the y-axis
    plt.title(country)
    count+=1
The blue line shows whether a systemic crisis occurred, and the coral points show whether a banking crisis occurred. The charts show that the two kinds of crisis largely overlap, which is consistent with our hypothesis that systemic crises and banking crises are linked.
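The overlap read off the charts can also be counted with a cross-tabulation. The sketch below uses a small made-up frame (the `toy` values are assumptions, not rows from the dataset) and reports the share of systemic-crisis years that were also banking-crisis years.

```python
import pandas as pd

# Made-up country-year records, for illustration only.
toy = pd.DataFrame({
    "systemic_crisis": [0, 0, 1, 1, 1, 0, 0, 1],
    "banking_crisis": ["no_crisis", "no_crisis", "crisis", "crisis",
                       "no_crisis", "no_crisis", "crisis", "crisis"],
})

# Rows are systemic_crisis values, columns are banking_crisis values.
overlap = pd.crosstab(toy["systemic_crisis"], toy["banking_crisis"])
print(overlap)

# Share of systemic-crisis years that coincide with a banking crisis.
share = overlap.loc[1, "crisis"] / overlap.loc[1].sum()
print(share)
```

On the real data, the same two lines applied to the `systemic` frame would put a number on what the charts suggest visually.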
Calculate the correlation between all features
# Encode the banking_crisis column:
# rows without a crisis are labeled 0, rows with a crisis are labeled 1
data['banking_crisis'] = data['banking_crisis'].map({"no_crisis":0,"crisis":1})
# Select all features
selected_features = ['systemic_crisis', 'exch_usd', 'domestic_debt_in_default','sovereign_external_debt_default', 'gdp_weighted_default',
                     'inflation_annual_cpi', 'independence', 'currency_crises','inflation_crises','banking_crisis']
corr = data[selected_features].corr()  # Correlation matrix of the selected features
fig = plt.figure(figsize = (12,8))
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # Blue-white-red diverging colormap
mask = np.zeros_like(corr, dtype=bool)  # All-False array shaped like the correlation matrix (np.bool was removed in NumPy 1.24)
mask[np.triu_indices_from(mask)] = True  # Mask the upper triangle of the correlation matrix
# Draw the heat map
sns.heatmap(corr, mask=mask, cmap=cmap,vmin=-0.5,vmax=0.7, center=0,annot = True,
            square=True, linewidths=.5,cbar_kws={"shrink": .5})
plt.title("Correlation between features",fontproperties = font)
plt.show()
Besides the high correlation between systemic crises and banking crises, we also see that the exchange rate against the US dollar and domestic debt default are both highly correlated with sovereign debt default.
Next, we build a random forest classification model to predict whether a banking crisis occurs and to identify the features that affect it.
3. Build a banking crisis prediction model
- Feature encoding
- Dataset splitting and stratified sampling
- Building the random forest prediction model
- Evaluating the model
- Optimizing the model with SMOTE oversampling
- Feature importance ranking
3.1 Feature encoding
data.drop(['case','cc3'],axis = 1,inplace = True)  # Drop the case and cc3 columns from the dataset
data.head()
# Label-encode the country column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data['country'].values)  # Learn the mapping from country names to integer codes
data['country']=le.transform(data['country'].values)  # Replace each country name with its integer code
print(data['country'])  # View the country column after encoding
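To make the fit/transform steps concrete, here is a minimal sketch on a made-up list of country names (the list and the variable names `toy_le`, `codes`, and `decoded` are illustrative assumptions). LabelEncoder sorts the distinct values and assigns codes in that order, and inverse_transform recovers the names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of country names, for illustration only.
countries = ["Kenya", "Egypt", "Algeria", "Egypt"]

toy_le = LabelEncoder()
codes = toy_le.fit_transform(countries)  # fit and transform in one step

print(list(toy_le.classes_))  # distinct values in sorted order
print(list(codes))            # integer code of each input value

# Map the integer codes back to the original names.
decoded = toy_le.inverse_transform(codes)
print(list(decoded))
```

The integer codes carry no meaningful order between countries; tree-based models such as random forest tolerate this better than linear models do.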
# Draw a bar chart of the number of no_crisis and crisis records
fig = plt.figure(figsize = (8,6))
data['banking_crisis'].value_counts().plot(kind='bar',rot = 360,color = 'lightseagreen')
plt.xticks([0,1],["no_crisis","crisis"])
plt.show()
Records with a banking crisis are far fewer than records without one, at a ratio of about 1:10. With such imbalanced data, the class proportions in the training set should match those in the test set as closely as possible, so we use stratified sampling to split the data into training and test sets.
3.2 Dataset splitting and stratified sampling
Let's now split the data into a training set and a test set. The model_selection module of Sklearn provides the train_test_split() function for this purpose. Its call syntax is train_test_split(x, y, test_size = None, random_state = None, stratify = y), where:
- x, y: the features used for prediction and the target to be predicted, respectively.
- test_size: the proportion of the test set; for example, test_size=0.2 puts 20% of the data in the test set.
- random_state: the random seed; because the split is random, fixing random_state makes training repeatable and the results reproducible.
- stratify: enables stratified sampling, so that the training and test sets contain the same proportion of samples with and without a banking crisis.
The function returns four variables: the training and test sets of x, followed by the training and test sets of y.
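The effect of stratify can be checked on a toy example. The sketch below assumes made-up labels with the same roughly 1:10 imbalance (90 negatives, 10 positives) and shows that both splits keep a 10% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=33, stratify=y
)

# With stratify=y, both splits keep the 10% positive rate.
print(y_train.mean(), y_test.mean())
```

Without stratify, a 20-row test set could easily end up with no positive samples at all, which would make recall on the minority class impossible to measure.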
# Split into training and test sets
from sklearn import model_selection
x = data.drop('banking_crisis',axis = 1)  # Use every column except banking_crisis as x
y = data['banking_crisis']  # Use the banking_crisis column as y
x_train,x_test,y_train,y_test = model_selection.train_test_split(x, y,test_size=0.2,random_state = 33,stratify=y)
3.3 Building the random forest prediction model
Random forest is an ensemble learning method: it randomly samples both records and features from the data to train many different decision trees, which together form a "forest". Each tree gives its own classification, called a "vote". For classification, the forest chooses the class with the most votes; for regression, it takes the mean. In Python, we build the classification model with RandomForestClassifier from sklearn.ensemble. Its main parameters include:
- n_estimators: the number of trees to train (default 100);
- max_depth: the maximum depth of each tree (default None, meaning nodes are expanded until the leaves are pure or too small to split);
- max_features: the maximum number of features considered at each split (default 'sqrt'; older scikit-learn releases called this 'auto');
- random_state: the random seed.
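A note on the "vote": scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting one hard vote per tree. The sketch below illustrates this on synthetic data from make_classification (the data and the names `forest`, `tree_probs`, and `manual_pred` are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0).fit(X, y)

# Each fitted tree is available in estimators_; average their class probabilities.
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# The class with the highest averaged probability is the forest's prediction.
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]

print(np.allclose(tree_probs, forest.predict_proba(X)))
print((manual_pred == forest.predict(X)).all())
```

For fully grown trees with pure leaves the two schemes mostly agree, but with a depth limit the averaged probabilities let confident trees count for more than uncertain ones.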
from sklearn.ensemble import RandomForestClassifier
# Train the random forest classification model
rf = RandomForestClassifier(n_estimators = 100, max_depth = 20,max_features = 10, random_state = 20)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)  # Predict the labels of the test set
3.4 Model evaluation
To evaluate the model, we use the functions classification_report(), confusion_matrix(), and accuracy_score() to output the model's classification report, confusion matrix, and classification accuracy.
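How these metrics relate to the confusion matrix can be seen on a tiny hand-made example. The labels below are made up; the point is only that recall and accuracy fall out of the four cells of the matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up true labels and predictions (1 = crisis, 0 = no crisis).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()  # sklearn lays the 2x2 matrix out as [[tn, fp], [fn, tp]]

recall = tp / (tp + fn)          # share of actual crises that were caught
accuracy = (tp + tn) / cm.sum()  # share of all samples classified correctly

print(recall, recall_score(y_true, y_hat))
print(accuracy, accuracy_score(y_true, y_hat))
```

On imbalanced data, accuracy is dominated by the majority class, which is why the recall of the crisis class is the figure to watch here.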
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print(classification_report(y_test, y_pred))  # Print the classification report
print(accuracy_score(y_test, y_pred))  # Print the classification accuracy
cm = confusion_matrix(y_test, y_pred)  # Named cm so the confusion_matrix function is not shadowed
print(cm)  # Print the confusion matrix
# Draw the confusion matrix heat map
fig,ax = plt.subplots(figsize=(8,6))
sns.heatmap(cm,ax=ax,annot=True,annot_kws={'size':15}, fmt='d',cmap = 'YlGnBu_r')
ax.set_ylabel('True value',fontproperties = font)
ax.set_xlabel('Predicted value',fontproperties = font)
ax.set_title('Confusion matrix heat map',fontproperties = font)
plt.show()  # Show the confusion matrix heat map
The classification report shows that recall for the banking crisis class (the minority class) reaches 89%, and the confusion matrix and its heat map show that a high proportion of samples are classified correctly, indicating that the random forest model performs well. Next, we draw the ROC curve for this binary classification problem and compute the AUC for further evaluation.
ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Compute the AUC value from the hard 0/1 predictions; passing rf.predict_proba(x_test)[:,1]
# instead would give the area under the curve plotted below
roc_auc = roc_auc_score(y_test, rf.predict(x_test))
fpr, tpr, thresholds = roc_curve(y_test, rf.predict_proba(x_test)[:,1])  # Compute TPR and FPR at different thresholds
# Draw the ROC curve
plt.figure(figsize = (8,6))
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1],'r--')  # Draw the random-guess line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc="lower right") # The legend is located at the bottom right
plt.show()
The random forest's ROC curve performs well, lying close to the upper-left corner, and the AUC value reaches 0.94.
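One detail worth knowing when reading an AUC figure: roc_auc_score gives different values depending on whether it receives predicted probabilities or hard 0/1 labels. The sketch below uses made-up scores and an assumed 0.5 threshold to show the gap.

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]

# Hard labels obtained with an assumed 0.5 threshold.
labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking of the scores
auc_from_labels = roc_auc_score(y_true, labels)  # only one operating point is left

print(auc_from_scores, auc_from_labels)
```

With hard labels the ROC curve collapses to a single point joined to the corners, so the value equals the mean of the true positive rate and the true negative rate at that threshold.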
Because the training set is small and its classes are imbalanced, we apply SMOTE oversampling to the minority class to expand its samples and optimize the model.
3.5 Optimizing the model with SMOTE oversampling
The basic idea of the SMOTE algorithm is to analyze the minority-class samples and synthesize new samples from them to add to the dataset. For a minority-class sample a, it randomly selects one of its nearest neighbors b, and then randomly picks a point c on the line segment between a and b as a new minority-class sample.
After splitting the dataset, we oversample only the training set to expand the minority class. In Python, the SMOTE class from imblearn.over_sampling builds the SMOTE oversampling model.
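The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of one synthetic sample, not the imblearn implementation; the points in `minority`, the neighbor count `k`, and the seed are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class samples in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 8.0]])
k = 2  # number of nearest neighbors to choose from

a = minority[0]
dist = np.linalg.norm(minority - a, axis=1)
neighbours = np.argsort(dist)[1:k + 1]  # position 0 of the sort is a itself

b = minority[rng.choice(neighbours)]  # pick one of the k nearest neighbors at random
u = rng.random()                      # random position along the segment, in [0, 1)
c = a + u * (b - a)                   # new synthetic minority-class sample

print(a, b, c)
```

Because c always lies between two existing minority samples, SMOTE fills in the region the minority class already occupies instead of duplicating points exactly.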
# Apply SMOTE oversampling to x_train, y_train
from imblearn.over_sampling import SMOTE
x_train_resampled, y_train_resampled = SMOTE(random_state=4).fit_resample(x_train, y_train)
print(x_train_resampled.shape, y_train_resampled.shape)  # Check the dataset size after oversampling
# Select the best parameters with a grid search
from sklearn.model_selection import GridSearchCV
param_grid = [{
    'n_estimators':[10,20,30,40,50],
    'max_depth':[5,8,10,15,20,25]
}]
grid_search = GridSearchCV(rf, param_grid, scoring = 'recall')
# Fit the grid search, then print the best parameter combination and its score
grid_search.fit(x_train_resampled, y_train_resampled)
print("best params:", grid_search.best_params_)
print("best score:", grid_search.best_score_)
After SMOTE oversampling and a grid search for the best parameters, the selected combination is max_depth: 10, n_estimators: 10, and the best recall score reaches 0.97, up from 0.89 for the model before optimization. Note that 0.97 is a cross-validated score on the oversampled training data while 0.89 was measured on the test set, so the two figures are not strictly comparable.
Finally, we use the random forest to identify the important features that affect whether a banking crisis occurs, and draw a feature importance ranking chart.
3.6 Feature importance ranking
fig = plt.figure(figsize=(16,12))
# Get the random forest feature importance scores
rf_importance = rf.feature_importances_
index = data.drop(['banking_crisis'], axis=1).columns  # Feature names: all columns except banking_crisis
# Sort the scores in ascending order so that the most important feature is drawn at the top
rf_feature_importance = pd.DataFrame(rf_importance.T, index=index,columns=['score']).sort_values(by='score', ascending=True)
# Draw a horizontal bar chart
rf_feature_importance.plot(kind='barh',legend=False,color = 'deepskyblue')
plt.title('Random forest feature importance',fontproperties = font)
plt.show()
Systemic crisis (systemic_crisis) is the most important feature, and the annual CPI inflation rate (inflation_annual_cpi), the year (year), and the exchange rate against the US dollar (exch_usd) also rank high. These features matter most for whether a banking crisis occurs, which further supports the conclusion of our earlier correlation analysis.
4. Summary
iDataCourse (爱数课) is a big data and artificial intelligence course and resource platform for colleges and universities. The platform provides authoritative course resources, data resources, and case-study lab resources to help universities build big data and artificial intelligence degree programs, develop curricula, and strengthen teaching capacity.