iDataCourse experiment | Issue 7 - Financial crisis analysis based on random forest

2022-06-27 20:59:00 【Data science artificial intelligence】

iDataCourse: idatacourse.cn

Field: other

Brief introduction: From the 1960s onward, a wave of independence from colonial rule swept Africa. After centuries of colonial history, most countries on the continent lag in economic development, their economic systems are fragile, and crises of various kinds occur often. This case carries out an exploratory analysis of financial crises in 13 African countries over roughly the past century and builds a random forest model to predict banking crises.

Data:

./dataset/african_crises.csv

./dataset/SimHei.ttf

The dataset contains 1,059 records. The meaning of each field is shown in the following table:

Field	Meaning
case	Country number; a number identifying a particular country
cc3	Country code; three-letter country/region code
country	Country name
year	Observation year
systemic_crisis	Systemic crisis; "0" means no systemic crisis occurred in that year, "1" means one occurred
exch_usd	Exchange rate of the country's currency against the US dollar
domestic_debt_in_default	Domestic debt default; "0" means no domestic debt default occurred in that year, "1" means one occurred
sovereign_external_debt_default	Sovereign external debt default; "0" means no sovereign debt default occurred in that year, "1" means one occurred
gdp_weighted_default	Ratio of total defaulted debt to GDP
inflation_annual_cpi	Annual CPI inflation rate
independence	Independence; "0" means not independent, "1" means independent
currency_crises	Currency crisis; "0" means no currency crisis occurred in that year, "1" means one occurred
inflation_crises	Inflation crisis; "0" means no inflation crisis occurred in that year, "1" means one occurred
banking_crisis	Banking crisis; "no_crisis" means no banking crisis occurred in that year, "crisis" means one occurred

1. Data reading and preprocessing

1.1 Reading data

#  Import corresponding modules 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
#  Set the font 
font = FontProperties(fname = "./dataset/SimHei.ttf", size=14)

import seaborn as sns
import random

#  Set the drawing style 
%matplotlib inline
sns.set(style='whitegrid')

#  Ignore all warnings 
import warnings
warnings.filterwarnings('ignore')

#  Reading data 
data = pd.read_csv('./dataset/african_crises.csv')
data.sample(5)

1.2 View the basic information of the data

First, let's look at which countries the data covers.

unique_countries = data.country.unique()
unique_countries

You can see that the data contains 13 African countries, in order: Algeria, Angola, Central African Republic, Ivory Coast, Egypt, Kenya, Mauritius, Morocco, Nigeria, South Africa, Tunisia, Zambia, and Zimbabwe. Next, we use summary functions to check the basic state of the data and whether any values are missing or abnormal.

# Basic information about the dataset (column types and non-null counts)
data.info()

Apart from the country code cc3, the country name country, and the banking crisis flag banking_crisis, which are strings, all fields are numeric, and there are no missing values.

# View descriptive statistics for all columns
data.describe(include = 'all')

From the statistics, the year field ranges from 1860 to 2014, and Egypt has the most records, with 155. We also find an anomaly: the currency crisis field currency_crises should only take the values 0 and 1, but the value 2 appears in the data and needs to be handled separately. The other indicators show no obvious anomalies.

1.3 Data preprocessing

# Show the records where currency_crises has the value 2
data[data['currency_crises'] == 2]

Only 4 records are abnormal, so we delete them directly.

data = data[data['currency_crises'] != 2] # Keep only the records where currency_crises is not 2
data.shape # View the size of the filtered dataset
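The check above was done by eye for one column. As a minimal sketch, the same idea can be applied to several indicator columns at once. The helper name find_out_of_range and the toy DataFrame below are illustrative assumptions and are not part of the original experiment.

```python
import pandas as pd

def find_out_of_range(df, columns, allowed=(0, 1)):
    """Return {column: sorted unexpected values} for columns holding values outside `allowed`."""
    problems = {}
    for col in columns:
        # .tolist() converts NumPy scalars to plain Python values
        unexpected = sorted(set(df[col].unique().tolist()) - set(allowed))
        if unexpected:
            problems[col] = unexpected
    return problems

# Made-up toy data that imitates the anomaly described in the text
toy = pd.DataFrame({
    "systemic_crisis": [0, 1, 0, 0],
    "currency_crises": [0, 2, 1, 0],
})
report = find_out_of_range(toy, ["systemic_crisis", "currency_crises"])
print(report)
```

On the real dataset, the list of columns would be the 0/1 indicator fields from the field table.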

2. Exploratory analysis of economic indicators

After World War II the present world order took shape, and from the 1960s onward a wave of independence from colonial rule swept Africa. After centuries of colonial history, the continent remains one of the least developed regions in the world: most countries lag in economic and political development, their economic systems are fragile, and crises of various kinds occur often. Next, we use this dataset to analyze the economic development and economic crises of these countries before and after independence. First, we draw line charts of each of the 13 countries' exchange rates against the US dollar.

2.1 Changes in the exchange rate of the currency against the US dollar

plt.figure(figsize=(12,20))

for i in range(13):
    
    plt.subplot(7,2,i+1)
    country = unique_countries[i]
    subset = data[data.country == country] # Records for the current country
    
    # Build a random color: random.choice() picks one hex digit at a time, six digits in total
    col = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
    
    # Draw the line chart (x and y are passed as keywords, which recent seaborn versions require)
    sns.lineplot(x = subset['year'],y = subset['exch_usd'],label = country,color = col)
    
    # Mark the first year of independence with a dotted vertical line
    independence_year = subset[subset.independence == 1]['year'].min()
    plt.plot([independence_year,independence_year],
             [0,subset['exch_usd'].max()],color = 'black',linestyle = 'dotted',alpha = 0.8)

    plt.title(country) # Add the subplot title
    
plt.tight_layout() # Automatically adjust subplot spacing so the panels do not overlap
plt.show() # Show the exchange rate line charts for the 13 countries

Most countries did not have their own monetary system in the short period before and after independence and still used the currency of the colonial power, such as the franc or the pound. For countries such as Angola, Zimbabwe, Zambia, and Nigeria, the exchange rate stays at 0 until around the turn of the 21st century, when a national currency appears. Tunisia's exchange rate fell sharply after independence and then stabilized at about 1:1 for a long time. In addition, for most African countries the currency has gradually depreciated against the US dollar over time.

2.2 Changes in inflation rate

The inflation rate, also called the rate of price change, is the ratio of the excess money issued to the amount of money actually needed, and it reflects the degree of inflation and currency depreciation. It is usually calculated from the growth rate of a price index; this dataset uses the consumer price index (CPI).

Inflation is classified by the annual rate of price increase:

Moderate inflation (annual price increase of 1% to 6%)
Severe inflation (annual price increase of 6% to 9%)
Galloping inflation (annual price increase of 10% to 50%)
Hyperinflation (annual price increase above 50%)
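The bands above can be sketched as a small lookup function. The function name classify_inflation is hypothetical. The list leaves the range between 9% and 10% and the exact boundary values unassigned, so the sketch assumes that rates from 6% up to (but not including) 10% count as severe and that each lower bound is inclusive.

```python
def classify_inflation(rate_pct):
    """Map an annual price increase (in percent) to the bands listed above.

    Assumptions not stated in the text: lower bounds are inclusive, and the
    unassigned 9%-10% range is treated as severe inflation.
    """
    if rate_pct > 50:
        return "hyperinflation"
    if rate_pct >= 10:
        return "galloping inflation"
    if rate_pct >= 6:
        return "severe inflation"
    if rate_pct >= 1:
        return "moderate inflation"
    return "below the moderate band"

for rate in (0.5, 3, 7.5, 25, 4000):
    print(rate, "->", classify_inflation(rate))
```

Applied to the inflation_annual_cpi column (for example with pandas' apply), this would label each country-year with its band.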

Next, let's analyze how the inflation rate changed in each country.

plt.figure(figsize=(12,20))

for i in range(13):
    
    plt.subplot(7,2,i+1)
    country = unique_countries[i]
    subset = data[data.country == country] # Records for the current country
    
    # Build a random color
    col = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
    
    # Draw the line chart (x and y are passed as keywords, which recent seaborn versions require)
    sns.lineplot(x = subset['year'],y = subset['inflation_annual_cpi'],label = country,color = col)
    
    # Add a scatter plot on top of the line
    plt.scatter(subset['year'],subset['inflation_annual_cpi'],color = col,s = 28) # s is the marker area
    
    # Mark the first year of independence with a dotted vertical line
    independence_year = subset[subset.independence == 1]['year'].min()
    plt.plot([independence_year,independence_year],
             [subset['inflation_annual_cpi'].min(),subset['inflation_annual_cpi'].max()],
             color = 'black',linestyle = 'dotted',alpha = 0.8) # alpha is the line transparency

    plt.title(country)
    
plt.tight_layout() # Automatically adjust subplot spacing so the panels do not overlap
plt.show() # Show the inflation rate charts for the 13 countries

Most of these African countries have experienced inflation of varying degrees. In South Africa, for example, the economy was hit hard between 1970 and 1990 as Western economies imposed sustained sanctions over the apartheid policy. Angola went through repeated civil wars in the 1990s, and the war sent inflation soaring, with the value exceeding 4000 in the peak year. Some countries have had relatively stable economies: in Tunisia, the inflation rate rose briefly after independence, then gradually fell to a low level and stayed stable.

2.3 Distribution of other crises

Next, let's analyze the remaining fields by country: systemic crisis (systemic_crisis), domestic debt default (domestic_debt_in_default), sovereign debt default (sovereign_external_debt_default), currency crisis (currency_crises), inflation crisis (inflation_crises), and banking crisis (banking_crisis).

sns.set(style='darkgrid')
columns = ['systemic_crisis','domestic_debt_in_default','sovereign_external_debt_default','currency_crises','inflation_crises','banking_crisis']

# Plot the distribution of each crisis indicator by country
plt.figure(figsize=(16,16))

for i in range(6):
    plt.subplot(3,2,i+1) # 3 rows x 2 columns; integer arguments work across matplotlib versions
    sns.countplot(y = data.country,hue = data[columns[i]],palette = 'rocket') # palette sets the color palette
    plt.legend(loc = 0) # loc = 0 lets matplotlib choose the best legend location
    plt.title(columns[i])
    
plt.tight_layout()
plt.show()

Sovereign debt default (sovereign_external_debt_default) refers to a national government being unable to repay, on time, the principal and interest on debt it has borrowed externally under sovereign guarantee. The chart above shows that the Central African Republic, Zimbabwe, and Ivory Coast had a large number of sovereign debt defaults, which leads to extremely low sovereign credit ratings.
We can also see that, apart from Angola and Zimbabwe, most countries have not defaulted on their domestic debt (domestic_debt_in_default). This is because a government with the right to issue currency can issue new money to pay off debt denominated in its own currency (domestic debt), so it does not have to default on that debt. This is also why sovereign bonds denominated in local currency enjoy the highest credit rating within the country. In practice, however, issuing too much money brings inflation and fluctuations in the value of the local currency, so the total debt a government can carry is still limited.
A systemic financial crisis, also called a "comprehensive financial crisis", means serious disorder across the major financial fields, with currency crises, banking crises, and foreign debt crises occurring at the same time or one after another. It tends to occur in market-oriented countries and regions with relatively developed financial economies, financial systems, and financial assets, and in countries with serious deficits and foreign debt, and it is highly destructive to the development of the world economy.

The country with the most systemic crises (systemic_crisis) is the Central African Republic, followed by Zimbabwe and Kenya. By the definition of a systemic crisis, there should be a link between systemic crises and banking crises. Let us examine whether these countries had a banking crisis (banking_crisis) at the same time as a systemic crisis.

2.4 The correlation between systemic crisis and banking crisis

# Create a dataset with year, country, systemic crisis, and banking crisis
systemic = data[['year','country', 'systemic_crisis', 'banking_crisis']]

# Keep the three countries of interest and plot how systemic and banking crises overlap
systemic = systemic[systemic['country'].isin(['Central African Republic','Kenya','Zimbabwe'])]
plt.figure(figsize=(12,12))
count = 1

for country in systemic.country.unique():
    plt.subplot(len(systemic.country.unique()),1,count)
    subset = systemic[(systemic['country'] == country)]
    sns.lineplot(x = subset['year'],y = subset['systemic_crisis'],ci=None) # ci=None turns off the confidence interval band
    # banking_crisis is still a string column here, so map it to 0/1 to share the y axis
    plt.scatter(subset['year'],subset['banking_crisis'].map({'no_crisis':0,'crisis':1}), color='coral', label='Banking Crisis')
    plt.subplots_adjust(hspace=0.6) # hspace sets the vertical spacing between subplots
    plt.xlabel('Years') # Label the x axis
    plt.ylabel('Systemic Crisis/Banking Crisis') # Label the y axis
    plt.title(country)
    count+=1

plt.show()

The blue line shows whether a systemic crisis occurred, and the coral scatter points show whether a banking crisis occurred. The chart shows how the two kinds of crisis overlap, which supports our hypothesis that systemic crises and banking crises are linked.

Calculate the correlation between all features

# Encode the banking_crisis column:
# records without a crisis become 0 and records with a crisis become 1
data['banking_crisis'] = data['banking_crisis'].map({"no_crisis":0,"crisis":1})

# Select all features
selected_features = ['systemic_crisis', 'exch_usd', 'domestic_debt_in_default','sovereign_external_debt_default', 'gdp_weighted_default',
       'inflation_annual_cpi', 'independence', 'currency_crises','inflation_crises','banking_crisis']

corr = data[selected_features].corr() # Compute the pairwise correlations between features as a correlation matrix

fig = plt.figure(figsize = (12,8))

cmap = sns.diverging_palette(220, 10, as_cmap=True) # Blue-white-red diverging colormap
mask = np.zeros_like(corr, dtype=bool) # Array of False with the same shape as the correlation matrix (np.bool was removed from recent NumPy)
mask[np.triu_indices_from(mask)] = True # Mask the upper triangle, including the diagonal

# Draw the heatmap
sns.heatmap(corr, mask=mask, cmap=cmap,vmin=-0.5,vmax=0.7, center=0,annot = True,
            square=True, linewidths=.5,cbar_kws={"shrink": .5})

plt.title("Correlation between features",fontproperties = font)
plt.show()

Besides the high correlation between systemic crises and banking crises, we also see that the exchange rate against the US dollar and domestic debt default are highly correlated with sovereign debt default.
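The masking step used for the heatmap can be looked at in isolation. The following is a minimal sketch on a made-up three-column table (the column names a, b, c are placeholders), showing that the mask hides the diagonal and the upper triangle so that each pair of features appears only once.

```python
import numpy as np
import pandas as pd

# Made-up toy data: b rises with a, c falls as a rises
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = toy.corr()

# Boolean mask: True marks the cells to hide (upper triangle plus diagonal)
mask = np.zeros_like(corr.to_numpy(), dtype=bool)
mask[np.triu_indices_from(mask)] = True

# DataFrame.mask replaces the masked cells with NaN, leaving the lower triangle
lower = corr.mask(mask)
print(lower)
```

Only the three cells below the diagonal keep their values, which is what the heatmap displays.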

Next, we build a random forest classification model to predict banking crises and to find the features that influence whether one occurs.

3. Build a banking crisis prediction model

Feature encoding
Dataset splitting and stratified sampling
Building a random forest prediction model
Model evaluation
Optimizing the model with SMOTE oversampling
Feature importance ranking

3.1 Feature encoding

data.drop(['case','cc3'],axis = 1,inplace = True) # Drop the case and cc3 columns from the dataset
data.head()

# Label-encode the country column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data['country'].values) # Learn the mapping from country names to integer labels
data['country']=le.transform(data['country'].values) # Replace each country name with its integer label

print(data['country']) # View the country column after encoding

# Draw a bar chart of the number of records without and with a banking crisis
fig = plt.figure(figsize = (8,6))

data['banking_crisis'].value_counts().plot(kind='bar',rot = 360,color = 'lightseagreen')

plt.xticks([0,1],["no_crisis","crisis"])
plt.show()
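The label-encoding step can be seen in isolation on a short, made-up list of country names. This is a minimal sketch rather than part of the original experiment; it shows that LabelEncoder assigns integers following the sorted order of the class names and that inverse_transform recovers the names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of country names, with one repeat
countries = ["Kenya", "Egypt", "Zambia", "Egypt"]

le_demo = LabelEncoder()
le_demo.fit(countries)                 # learn the sorted set of classes
codes = le_demo.transform(countries)   # map each name to its integer label

print(list(le_demo.classes_))
print(codes.tolist())
print(le_demo.inverse_transform([0, 2]).tolist())
```

Because the integers carry no meaningful order, tree-based models such as random forest tolerate this encoding better than models that treat the labels as magnitudes.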

Records with a banking crisis are far fewer than records without one, at a ratio of about 1:10. For such imbalanced data, the class proportions in the training set should match those in the test set as closely as possible, so we use stratified sampling to split the data into a training set and a test set.

3.2 Dataset splitting and stratified sampling

Let's split the data into a training set and a test set. The model_selection module of sklearn provides the train_test_split() function for this purpose. It is called as train_test_split(x, y, test_size = None, random_state = None, stratify = y), where:

x, y: the features used for prediction and the target to be predicted, respectively.
test_size: the proportion of the test set; for example, test_size=0.2 puts 20% of the samples in the test set.
random_state: the random seed. Because the split is random, fixing random_state makes the split, and therefore the results, reproducible.
stratify: enables stratified sampling, so that samples with and without a banking crisis appear in the same proportion in the training set and the test set.

The function returns four objects: the training and test sets of x, followed by the training and test sets of y.

# Split into a training set and a test set
from sklearn import model_selection

x = data.drop('banking_crisis',axis = 1) # All columns except banking_crisis are the features x
y = data['banking_crisis'] # The banking_crisis column is the target y

x_train,x_test,y_train,y_test = model_selection.train_test_split(x, y,test_size=0.2,random_state = 33,stratify=y)
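To see what stratify does, the split can be repeated on synthetic data. The sketch below uses made-up arrays (90 negatives and 10 positives, roughly the imbalance described above) and checks that the positive share is the same in both parts; the variable names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, made-up data: 90 negatives and 10 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33, stratify=y_demo
)

# With stratification, both parts keep the 10% positive share
print(len(y_tr), int(y_tr.sum()), y_tr.mean())
print(len(y_te), int(y_te.sum()), y_te.mean())
```

Without stratify, a random 20-sample test set could easily contain zero or four positives, which would make recall on the minority class unstable.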

3.3 Building a random forest prediction model

Random forest is an ensemble learning method. It randomly draws samples and features from the data and trains many different decision trees, which together form the "forest". Each tree gives its own classification, called a "vote". For classification, the forest chooses the class with the most votes; for regression, it uses the mean of the trees' outputs. In Python, we build the classification model with RandomForestClassifier from sklearn.ensemble. Its main parameters include:

n_estimators: the number of trees to train (the default is 100);
max_depth: the maximum depth of each tree (the default is None, which grows each tree until its leaves are pure or too small to split);
max_features: the maximum number of features considered at each split (the default is 'sqrt' in recent scikit-learn versions; older versions used 'auto');
random_state: the random seed.
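The voting idea described above can be sketched in a few lines of plain Python. The helper names majority_vote and mean_prediction are hypothetical, and the tree outputs are made up. Note that scikit-learn's RandomForestClassifier, according to its documentation, averages the trees' predicted class probabilities rather than counting hard votes, so this is an illustration of the concept and not of the library's internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Classification: return the class predicted by the most trees for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

def mean_prediction(tree_predictions):
    """Regression: return the average of the trees' outputs for one sample."""
    return sum(tree_predictions) / len(tree_predictions)

# Made-up outputs of five trees for a single sample
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))

# Made-up outputs of three regression trees for a single sample
print(mean_prediction([2.0, 3.0, 4.0]))
```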

from sklearn.ensemble import RandomForestClassifier

# Train the random forest classification model
rf = RandomForestClassifier(n_estimators = 100, max_depth = 20,max_features = 10, random_state = 20)
rf.fit(x_train, y_train) 
y_pred = rf.predict(x_test) # Predict the target y for the test set

3.4 Model evaluation

To evaluate the model, we use the functions classification_report(), confusion_matrix(), and accuracy_score() to output the classification report, the confusion matrix, and the classification accuracy.

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

print(classification_report(y_test, y_pred)) # Print the classification report
cm = confusion_matrix(y_test, y_pred) # Stored as cm so the confusion_matrix function is not overwritten
print(cm) # Print the confusion matrix
print(accuracy_score(y_test, y_pred)) # Print the classification accuracy

# Draw the confusion matrix heatmap
fig,ax = plt.subplots(figsize=(8,6)) 
sns.heatmap(cm,ax=ax,annot=True,annot_kws={'size':15}, fmt='d',cmap = 'YlGnBu_r')
ax.set_ylabel('True value',fontproperties = font)
ax.set_xlabel('Predicted value',fontproperties = font)
ax.set_title('Confusion matrix heatmap',fontproperties = font)
plt.show() # Show the confusion matrix heatmap

The classification report shows that recall for the banking crisis class (the minority class) reaches 89%. The confusion matrix and its heatmap show that correctly classified samples make up a high proportion, which indicates that the random forest model performs well. Next, we draw the ROC curve for this binary classification and compute its AUC for further evaluation.
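Recall and precision for the minority class can be read directly off a 2x2 confusion matrix. The sketch below uses a made-up matrix, not the one produced by this experiment, laid out the way scikit-learn returns it: rows are true classes, columns are predicted classes, in the class order [0, 1].

```python
import numpy as np

# Hypothetical confusion matrix (made-up counts, not the experiment's result)
# [[TN, FP],
#  [FN, TP]]
cm_demo = np.array([[90, 5],
                    [2, 8]])

tn, fp, fn, tp = cm_demo.ravel()

recall = tp / (tp + fn)        # share of real crises that the model found
precision = tp / (tp + fp)     # share of predicted crises that were real
accuracy = (tp + tn) / cm_demo.sum()

print(recall, precision, accuracy)
```

For crisis prediction, recall on the crisis class is usually the figure to watch, since a missed crisis is more costly than a false alarm.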

ROC curve

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Compute the AUC value. Note that this call scores the hard 0/1 predictions;
# passing rf.predict_proba(x_test)[:,1] instead would give the area under the curve plotted below
roc_auc = roc_auc_score(y_test, rf.predict(x_test))
fpr, tpr, thresholds = roc_curve(y_test, rf.predict_proba(x_test)[:,1]) # Compute the TPR and FPR at different thresholds

#  draw ROC curve 
plt.figure(figsize = (8,6))

plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1],'r--')#  Draw random guess lines 
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')

plt.legend(loc="lower right") #  The legend is located at the bottom right 
plt.show()

The ROC curve of the random forest performs well, staying close to the upper left corner, and the AUC reaches 0.94.
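AUC can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The hand-rolled helper below (pairwise_auc, a hypothetical name, not scikit-learn's implementation) makes that definition concrete on made-up data, and also shows that computing AUC from hard 0/1 predictions generally gives a different number than computing it from predicted probabilities.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1]
proba = [0.1, 0.4, 0.6, 0.45, 0.9]               # made-up predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]   # hard predictions at a 0.5 threshold

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

When an AUC is reported next to a ROC curve, it is worth checking which of the two inputs it was computed from.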

Because the training set is small and its classes are imbalanced, we apply SMOTE oversampling to the minority class to expand its samples and optimize the model.

3.5 Optimizing the model with SMOTE oversampling

The basic idea of the SMOTE algorithm is to analyze the minority-class samples and add new synthetic samples to the dataset based on them. For a minority-class sample a, it randomly selects one of its nearest neighbors b, and then randomly picks a point c on the line segment between a and b as a new minority-class sample.
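The interpolation step can be sketched with NumPy. This is a simplified illustration and not the imblearn implementation: the function name smote_point is hypothetical, the two points are made up, and the nearest-neighbor search that would normally pick b is skipped by passing b in directly.

```python
import numpy as np

def smote_point(a, b, rng):
    """Return a random point c = a + lam * (b - a) on the segment from a to b."""
    lam = rng.random()  # uniform in [0, 1)
    return a + lam * (b - a)

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0])   # a minority-class sample (made up)
b = np.array([3.0, 6.0])   # one of its nearest minority neighbors (assumed given)

c = smote_point(a, b, rng)
print(c)
```

Repeating this for many (a, b) pairs produces synthetic minority samples that lie between existing ones instead of duplicating them.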

After splitting the dataset, we oversample the training set to expand the minority class. In Python, we use the SMOTE class from imblearn.over_sampling to build the oversampling model.

# Apply SMOTE oversampling to x_train, y_train
from imblearn.over_sampling import SMOTE
x_train_resampled, y_train_resampled = SMOTE(random_state=4).fit_resample(x_train, y_train)

print(x_train_resampled.shape, y_train_resampled.shape) # View the dataset size after oversampling

# Select the best parameters through grid search
from sklearn.model_selection import GridSearchCV
param_grid = [{
    'n_estimators':[10,20,30,40,50],
    'max_depth':[5,8,10,15,20,25]
}]
grid_search = GridSearchCV(rf, param_grid, scoring = 'recall')

# Fit the grid search, then print the best parameter combination and its score
grid_search.fit(x_train_resampled, y_train_resampled)

print("best params:", grid_search.best_params_)

print("best score:", grid_search.best_score_)

After SMOTE oversampling and a grid search for the best parameters, the selected parameters are max_depth: 10 and n_estimators: 10, and the best recall score reaches 0.97, compared with 0.89 before optimization. Note that 0.97 is a cross-validated score on the oversampled training set, while 0.89 was measured on the test set, so the two figures are not directly comparable.

Finally, we use the random forest to identify the features that matter most for the occurrence of a banking crisis, and draw a feature importance ranking chart.

3.6 Feature importance ranking

# Get the random forest feature importance scores
rf_importance = rf.feature_importances_
index = data.drop(['banking_crisis'], axis=1).columns # Feature names: all columns except banking_crisis

# Sort the scores in ascending order so that the most important feature is drawn at the top of the horizontal bar chart
rf_feature_importance = pd.DataFrame(rf_importance, index=index,columns=['score']).sort_values(by='score', ascending=True)

# Draw the horizontal bar chart (figsize is passed here because DataFrame.plot creates its own figure)
rf_feature_importance.plot(kind='barh',legend=False,color = 'deepskyblue',figsize=(16,12))

plt.title('Random forest feature importance',fontproperties = font)

plt.show()

The systemic crisis feature systemic_crisis is the most important. The annual CPI inflation rate inflation_annual_cpi, the year year, and the exchange rate against the US dollar exch_usd also rank high, which indicates that these features matter more for whether a banking crisis occurs. This is consistent with the earlier correlation analysis.

4. Summary

iDataCourse (idatacourse.cn) is a big data and artificial intelligence course and resource platform for colleges and universities. The platform provides authoritative course resources, data resources, and case experiment resources to help universities build big data and artificial intelligence programs, develop curricula, and strengthen teaching capacity.

Copyright notice
This article was created by [Data science artificial intelligence]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/178/202206271840139710.html