
How to compare two or more distributions: a summary of methods from visualization to statistical testing

2022-06-24 22:21:00 deephub

Comparing the distribution of a variable across different groups is a common problem in data science. When we want to evaluate the causal effect of an intervention (a user-experience feature, an advertising campaign, a drug, etc.), the gold standard of causal inference is the randomized controlled trial, also known as an A/B test. In practice, we select a sample for the study and randomly split it into a control group and a treatment group, then compare the outcomes between the two groups. Randomization ensures that the only systematic difference between the two groups is the treatment, so we can attribute any difference in outcomes to the treatment effect.

Because the assignment is random, the two groups will never be exactly identical. Sometimes they are not even "similar" in aggregate: for example, one group may contain more men, or older people, and so on (we usually call these characteristics covariates or control variables). When that happens, we can no longer be sure that a difference in outcomes is due only to the treatment. Therefore, after randomization it is very important to check that all observed variables are balanced across groups and that there are no systematic differences.

In this article we will look at different methods for comparing two (or more) distributions and for assessing the magnitude and significance of their differences. We will consider two complementary approaches: visualization and statistical testing. The two approaches give us intuition and rigorous numbers, respectively, for quickly evaluating and exploring differences. The hard part is telling whether those differences are systematic or simply due to noise.

Suppose we are running an experiment on a group of people and have randomly split them into a treatment group and a control group. We want the groups to be as comparable as possible, so that any difference between them can be attributed solely to the treatment effect. We also split the treatment group into different arms in order to test different variants of the treatment (for example, slight variations of the same drug).

For this example, I simulated a dataset of 1000 individuals, for whom we observe a set of characteristics. I imported the data-generating process dgp_rnd_assignment() from src.dgp, and some plotting functions and libraries from src.utils.

from src.utils import *
from src.dgp import dgp_rnd_assignment

df = dgp_rnd_assignment().generate_data()
df.head()


We have information on 1000 individuals, for whom we observe gender, age, and weekly income. Each individual is assigned either to one of 4 different treatment arms or to the control group.

Comparing two groups: visualization

Let's start with the simplest case: comparing the income distribution of the treatment group as a whole with that of the control group. We first explore visual methods, then statistical ones. The advantage of the former is that we can rely on intuition; the advantage of the latter is that we can rely on rigorous numbers.

For most visualizations we will use Python's seaborn library.

Boxplot

The first visualization is the boxplot. The boxplot is a good compromise between summary statistics and data visualization. The center of the box represents the median, and its borders represent the 1st (Q1) and 3rd (Q3) quartiles. The whiskers extend to the first data point within 1.5 times the interquartile range (Q3 - Q1) beyond the box. Points that fall outside the whiskers are plotted individually and are usually considered outliers.

The boxplot therefore provides both summary statistics (the box and the whiskers) and direct data visualization (the outliers).
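The box and whisker positions described above can be computed directly with numpy. This is a minimal sketch on a hypothetical right-skewed sample (not the article's dataset):

```python
import numpy as np

# Hypothetical right-skewed income sample (not the article's dataset)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=6, sigma=0.5, size=500)

# Box: median in the center, Q1 and Q3 as borders
q1, median, q3 = np.percentile(income, [25, 50, 75])
iqr = q3 - q1

# Whiskers: most extreme data points within 1.5 * IQR of the box
lower_whisker = income[income >= q1 - 1.5 * iqr].min()
upper_whisker = income[income <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn individually as an outlier
outliers = income[(income < lower_whisker) | (income > upper_whisker)]
print(f"Q1={q1:.0f}, median={median:.0f}, Q3={q3:.0f}, outliers={len(outliers)}")
```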

sns.boxplot(data=df, x='Group', y='Income');
plt.title("Boxplot");

The income distribution of the treatment group is more dispersed: the orange box is larger and its whiskers cover a wider range. The problem with the boxplot is that it hides the shape of the data: it tells us some aggregate statistics but does not show the actual distribution.

Histogram

The most intuitive way to plot a distribution is the histogram. A histogram groups the data into equal-width bins and plots the number of observations in each bin.

sns.histplot(data=df, x='Income', hue='Group', bins=50);
plt.title("Histogram");


Histograms have some problems, though:

  • Since the two groups have a different number of observations, the two histograms are not comparable
  • The number of bins is arbitrary

We can solve the first problem with the stat option, plotting the density instead of the count, and by setting common_norm to False so that each histogram is normalized separately.

sns.histplot(data=df, x='Income', hue='Group', bins=50, stat='density', common_norm=False);
plt.title("Density Histogram");


Now the two histograms are comparable!

But an important problem remains: the size of the bins is arbitrary. In the extreme, if we bin the data less, we end up with bins containing at most one observation; if we bin the data more, we end up with a single bin. In both cases the plot tells us nothing. This is a classic bias-variance tradeoff.
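The two extremes can be seen numerically with np.histogram on a hypothetical sample: a single bin collapses everything into one bar, while one bin per observation leaves most bins with at most one point.

```python
import numpy as np

# Hypothetical sample (not the article's data)
rng = np.random.default_rng(1)
x = rng.normal(size=200)

# One huge bin: everything collapses into a single bar (all bias, no detail)
counts_coarse, _ = np.histogram(x, bins=1)

# One bin per observation: bins mostly hold 0 or 1 points (all variance, pure noise)
counts_fine, _ = np.histogram(x, bins=len(x))

print(counts_coarse, (counts_fine <= 1).mean())
```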

Kernel density

One possible solution is the kernel density plot, which uses kernel density estimation (KDE) to approximate the histogram with a continuous function.

sns.kdeplot(x='Income', data=df, hue='Group', common_norm=False);
plt.title("Kernel Density Function");

From the plot, the kernel density of income appears to have higher variance in the treatment group, while the group means look similar. The problem with kernel density estimation is that it is a bit of a black box and may mask relevant features of the data.

Cumulative distribution

A more transparent representation of the two distributions is their cumulative distribution function (CDF). At every point on the x axis (income), we plot the percentage of data points with an equal or lower value. The main advantages of the cumulative distribution function are:

  • No arbitrary choices are needed (e.g. the number of bins)
  • No approximation is needed (e.g. with KDE): the plot represents all the data points

sns.histplot(x='Income', data=df, hue='Group', bins=len(df), stat="density",
             element="step", fill=False, cumulative=True, common_norm=False);
plt.title("Cumulative distribution function");


How should we interpret this plot?

  • Since the two lines cross more or less at 0.5 (on the y axis), the two medians are similar
  • Since the orange line is above the blue line on the left and below it on the right, the distribution of the treatment group has fatter tails

QQ plot

A related approach is the QQ plot, where q stands for quantile. The QQ plot plots the quantiles of the two distributions against each other. If the distributions are the same, we should get a 45-degree line.

There is no native QQ plot function in Python, and the statsmodels package provides a qqplot function, but it is rather cumbersome. So we will do it by hand.

First, we compute the quantiles of the two groups with the percentile function.

income = df['Income'].values
income_t = df.loc[df.Group=='treatment', 'Income'].values
income_c = df.loc[df.Group=='control', 'Income'].values

df_pct = pd.DataFrame()
df_pct['q_treatment'] = np.percentile(income_t, range(100))
df_pct['q_control'] = np.percentile(income_c, range(100))

Now we can plot the two quantile distributions against each other, along with the 45-degree line that represents the benchmark of a perfect fit.

plt.figure(figsize=(8, 8))
plt.scatter(x='q_control', y='q_treatment', data=df_pct, label='Actual fit');
sns.lineplot(x='q_control', y='q_control', data=df_pct, color='r', label='Line of perfect fit');
plt.xlabel('Quantile of income, control group')
plt.ylabel('Quantile of income, treatment group')
plt.legend()
plt.title("QQ plot");

The QQ plot conveys information very similar to the cumulative distribution plot: the treatment group has the same median income (the lines cross in the center) but wider tails (the dots are below the line on the left and above it on the right).

Comparing two groups: statistical methods

So far, we have seen different ways to visualize differences between distributions. The main advantage of visualization is intuition: we can eyeball the differences and assess them visually.

However, we may want to be more rigorous and try to assess the statistical significance of the differences between distributions, i.e. answer the question: "Is the observed difference systematic or due to sampling noise?"

We will now analyze different tests for telling two distributions apart.

T-test

The first and most common is the Student's t-test. T-tests are generally used to compare means. We want to test whether the mean of the income distribution is the same across the two groups. The test statistic for the two-means comparison test is:

stat = (x̅₁ − x̅₂) / √(s₁²/n₁ + s₂²/n₂)

where x̅ is the sample mean and s is the sample standard deviation. Under mild conditions, the test statistic is asymptotically distributed as a Student's t distribution.

We perform the t-test using the ttest_ind function from scipy. The function returns the test statistic and the implied p-value.

from scipy.stats import ttest_ind

stat, p_value = ttest_ind(income_c, income_t)
print(f"t-test: statistic={stat:.4f}, p-value={p_value:.4f}")

t-test: statistic=-1.5549, p-value=0.1203

The p-value of the test is 0.12, so we do not reject the null hypothesis that the means of the treatment and control groups are equal.
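As a sanity check on the formula, we can compute the statistic by hand on synthetic data and compare it with scipy. This is a sketch with hypothetical samples; note that the formula above corresponds to the Welch version of the test, which ttest_ind reproduces when called with equal_var=False (the call above uses the pooled-variance default).

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples (not the article's income data)
rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, scale=1.0, size=80)
b = rng.normal(loc=0.3, scale=1.5, size=120)

# stat = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
t_manual = (a.mean() - b.mean()) / np.sqrt(
    a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
)

# The Welch version of scipy's test reproduces the manual statistic
t_scipy, p = ttest_ind(a, b, equal_var=False)
print(f"manual={t_manual:.4f}, scipy={t_scipy:.4f}")
```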

Standardized mean difference (SMD)

In general, when we run a randomized controlled trial or an A/B test, it is good practice to test the difference in means of all variables between the treatment and control group.

However, since the denominator of the t-test statistic depends on the sample size, the p-value of the t-test is hard to compare across studies: we may obtain a significant result in an experiment with a tiny difference but a large sample size, and an insignificant result in an experiment with a large difference but a small sample size.

One solution to this problem is the standardized mean difference (SMD). As the name suggests, this is not a proper test statistic but just a standardized difference, which can be computed as:

SMD = (x̅₁ − x̅₂) / √((s₁² + s₂²) / 2)

Usually, a value below 0.1 is considered a "small" difference.
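The SMD has no dedicated scipy function, but it is a one-liner in numpy. A minimal sketch on hypothetical samples (the denominator averages the two sample variances):

```python
import numpy as np

# Hypothetical treatment / control samples (not the article's data)
rng = np.random.default_rng(3)
x_t = rng.normal(loc=10.5, scale=2.0, size=300)
x_c = rng.normal(loc=10.0, scale=2.0, size=300)

# SMD = (mean difference) / sqrt of the average of the two sample variances
smd = (x_t.mean() - x_c.mean()) / np.sqrt(
    (x_t.var(ddof=1) + x_c.var(ddof=1)) / 2
)
print(f"SMD = {smd:.3f}")  # values below 0.1 are usually considered "small"
```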

The means of all variables in the treatment and control group, together with a distance measure between them (either the t-test or the SMD), are collected in a table called the balance table. We can use the create_table_one function from the causalml library to generate it. As the name of the function suggests, the balance table should be the first table you present when running an A/B test.

from causalml.match import create_table_one

df['treatment'] = df['Group']=='treatment'
create_table_one(df, 'treatment', ['Gender', 'Age', 'Income'])


In the first two columns we see the means of the different variables in the treatment and control group, with standard errors in parentheses. In the last column, the SMD values indicate that the standardized difference is greater than 0.1 for all variables, suggesting that the two groups may be different.

Mann–Whitney U test

Another test is the Mann-Whitney U test, which compares the two distributions in terms of their medians. The null hypothesis of the test is that the two groups have the same distribution; the alternative hypothesis is that one group has larger (or smaller) values than the other.

Unlike the other tests we have seen above, the Mann-Whitney U test is agnostic to outliers.

The test procedure is as follows.

  • Combine all data points and rank them (in increasing or decreasing order)
  • Compute U₁ = R₁ − n₁(n₁ + 1)/2, where R₁ is the sum of the ranks of the data points in the first group and n₁ is the number of data points in the first group.
  • Compute U₂ for the second group in the same way.
  • The test statistic is given by stat = min(U₁, U₂).

Under the null hypothesis of no systematic difference between the two distributions (i.e. same median), the test statistic is asymptotically normally distributed with known mean and variance.

The intuition behind the computation of R and U is the following: if the values in the first sample were all bigger than the values in the second sample, then R₁ = n₁(n₁ + 1)/2 and, as a consequence, U₁ would be zero (the minimum attainable value). Otherwise, if the two samples were similar, U₁ and U₂ would both be very close to n₁n₂/2 (the maximum attainable value).
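The steps above can be sketched by hand with scipy's rankdata and cross-checked against mannwhitneyu. Hypothetical samples, not the article's data; note that newer scipy versions return U for the first sample rather than the minimum, so we compare min(U₁, U₂) on both sides:

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

# Hypothetical samples (not the article's income data)
rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = rng.normal(size=40)
n1, n2 = len(x), len(y)

# Rank the pooled data (average ranks handle ties)
ranks = rankdata(np.concatenate([x, y]))
r1 = ranks[:n1].sum()                # rank sum of the first group

u1 = r1 - n1 * (n1 + 1) / 2
u2 = n1 * n2 - u1                    # U1 + U2 = n1 * n2
stat_manual = min(u1, u2)

u_scipy, _ = mannwhitneyu(x, y)
print(stat_manual, min(u_scipy, n1 * n2 - u_scipy))
```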

We use the mannwhitneyu function from scipy.

from scipy.stats import mannwhitneyu

stat, p_value = mannwhitneyu(income_t, income_c)
print(f" Mann–Whitney U Test: statistic={stat:.4f}, p-value={p_value:.4f}")

Mann–Whitney U Test: statistic=106371.5000, p-value=0.6012

We get a p-value of 0.6, which means that we do not reject the null hypothesis that there is no difference in medians between the treatment and control groups.

Permutation test

A nonparametric alternative is the permutation test. Under the null hypothesis, the two distributions are the same, so shuffling the group labels should not significantly change any statistic.

We can choose any statistic and check how its value in the original sample compares with its distribution across permutations of the group labels. For example, let's use the difference in sample means between the treatment and control groups as the test statistic.

sample_stat = np.mean(income_t) - np.mean(income_c)

stats = np.zeros(1000)
for k in range(1000):
    labels = np.random.permutation((df['Group'] == 'treatment').values)
    stats[k] = np.mean(income[labels]) - np.mean(income[labels==False])
p_value = np.mean(stats > sample_stat)

print(f"Permutation test: p-value={p_value:.4f}")

Permutation test: p-value=0.0530

The p-value of the permutation test is 0.053, a marginal non-rejection of the null hypothesis at the 5% level. How should we interpret this p-value? It means that the difference in means in the data is larger than 1 − 0.053 = 94.7% of the differences in means across the permuted samples.

We can visualize the distribution of the test statistic across permutations against its value in the original sample.

plt.hist(stats, label='Permutation Statistics', bins=30);
plt.axvline(x=sample_stat, c='r', ls='--', label='Sample Statistic');
plt.legend();
plt.xlabel('Income difference between treatment and control group')
plt.title('Permutation Test');


As we can see, the sample statistic is quite extreme with respect to the values in the permuted samples, but not overly so.

Chi-squared test

The chi-squared test is a very powerful test, mainly used to test differences in frequencies.

One of the lesser-known applications of the chi-squared test is testing the similarity between two distributions. The idea is to bin the observations of the two groups. If the two distributions were the same, we would expect the same frequency of observations in each bin. The important point here is that we need enough observations in each bin for the test to be valid.

I generate bins corresponding to the deciles of the income distribution in the control group, and then compute the expected number of observations in each bin for the treatment group, under the hypothesis that the two distributions are the same.

# Init dataframe
df_bins = pd.DataFrame()

# Generate bins from control group
_, bins = pd.qcut(income_c, q=10, retbins=True)
df_bins['bin'] = pd.cut(income_c, bins=bins).value_counts().index

# Apply bins to both groups
df_bins['income_c_observed'] = pd.cut(income_c, bins=bins).value_counts().values
df_bins['income_t_observed'] = pd.cut(income_t, bins=bins).value_counts().values

# Compute expected frequency in the treatment group
df_bins['income_t_expected'] = df_bins['income_c_observed'] / np.sum(df_bins['income_c_observed']) * np.sum(df_bins['income_t_observed'])

df_bins


Now we can perform the test by comparing the expected (E) and observed (O) numbers of observations in the treatment group across bins. The test statistic is given by:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

where the bins are indexed by i, O is the number of data points observed in bin i, and E is the expected number of data points in bin i. Since we generated the bins using the deciles of the income distribution in the control group, we expect the same number of observations per bin in the treatment group. The test statistic is asymptotically distributed as a chi-squared distribution.
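The statistic itself is a simple sum. A minimal sketch with made-up bin counts (not the article's table), cross-checked against scipy:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts in 5 bins and a uniform expectation
observed = np.array([18, 25, 21, 30, 26])
expected = np.full(5, observed.sum() / 5)   # 24 per bin

# chi2 = sum over bins of (O - E)^2 / E
stat_manual = ((observed - expected) ** 2 / expected).sum()

stat_scipy, p_value = chisquare(observed, expected)
print(f"manual={stat_manual:.4f}, scipy={stat_scipy:.4f}")
```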

To compute the test statistic and the p-value of the test, we use the chisquare function from scipy.

from scipy.stats import chisquare

stat, p_value = chisquare(df_bins['income_t_observed'], df_bins['income_t_expected'])
print(f"Chi-squared Test: statistic={stat:.4f}, p-value={p_value:.4f}")

Chi-squared Test: statistic=32.1432, p-value=0.0002

Unlike all the other tests so far, the chi-squared test strongly rejects the null hypothesis that the two distributions are the same. Why?

The reason is that the two distributions have similar centers but different tails, and the chi-squared test assesses the similarity of the whole distribution, not only the center as the previous tests did.

This result is a cautionary tale: it is very important to understand what a test actually checks before drawing blind conclusions from its p-value!

Kolmogorov-Smirnov test

The idea of the Kolmogorov-Smirnov test is to compare the cumulative distributions of the two groups. In particular, the Kolmogorov-Smirnov test statistic is the maximum absolute difference between the two cumulative distributions:

stat = supₓ |F₁(x) − F₂(x)|

where F₁ and F₂ are the two cumulative distribution functions and x is the value of the underlying variable. The asymptotic distribution of the Kolmogorov-Smirnov test statistic is the Kolmogorov distribution.

To build intuition, let's plot the cumulative distribution functions and the test statistic. First, we compute the cumulative distribution functions.

df_ks = pd.DataFrame()
df_ks['Income'] = np.sort(df['Income'].unique())
df_ks['F_control'] = df_ks['Income'].apply(lambda x: np.mean(income_c<=x))
df_ks['F_treatment'] = df_ks['Income'].apply(lambda x: np.mean(income_t<=x))
df_ks.head()

Now we need to find the point where the absolute distance between the two cumulative distribution functions is largest.

k = np.argmax( np.abs(df_ks['F_control'] - df_ks['F_treatment']))
ks_stat = np.abs(df_ks['F_treatment'][k] - df_ks['F_control'][k])

We can visualize the value of the test statistic by plotting the two cumulative distribution functions together with the test statistic.

y = (df_ks['F_treatment'][k] + df_ks['F_control'][k])/2
plt.plot('Income', 'F_control', data=df_ks, label='Control')
plt.plot('Income', 'F_treatment', data=df_ks, label='Treatment')
plt.errorbar(x=df_ks['Income'][k], y=y, yerr=ks_stat/2, color='k',
             capsize=5, mew=3, label=f"Test statistic: {ks_stat:.4f}")
plt.legend(loc='center right');
plt.title("Kolmogorov-Smirnov Test");

As we can see from the plot, the value of the test statistic corresponds to the distance between the two cumulative distributions at income ~650. At that income value we have the largest imbalance between the two groups.

We can now perform the actual test using the kstest function from scipy.

from scipy.stats import kstest

stat, p_value = kstest(income_t, income_c)
print(f" Kolmogorov-Smirnov Test: statistic={stat:.4f}, p-value={p_value:.4f}")

Kolmogorov-Smirnov Test: statistic=0.0974, p-value=0.0355

The p-value is below 5%: we reject the null hypothesis that the two distributions are the same, at the 95% confidence level.

Comparing multiple groups: visualization

So far, we have only considered the case of two groups, but what if we have multiple groups? Some of the methods we have seen above scale well to many groups, while others don't.

As a working example, we will now check whether the income distribution is the same across treatment arms.

Boxplot

The boxplot scales very well when we have multiple groups, since we can place the different boxes side by side.

sns.boxplot(x='Arm', y='Income', data=df.sort_values('Arm'));
plt.title("Boxplot, multiple groups");


From the plot, it seems that the income distribution differs across treatment arms, with higher-numbered arms having higher average income.

Violin plot

A very nice extension of the boxplot that combines summary statistics and kernel density estimation is the violin plot. The violin plot displays the separate densities along the y axis so that they don't overlap. By default, it also adds a miniature boxplot inside.

sns.violinplot(x='Arm', y='Income', data=df.sort_values('Arm'));
plt.title("Violin Plot, multiple groups");


The violin plot suggests that income differs across treatment arms.

Ridgeline plot

The ridgeline plot stacks multiple kernel density plots along the x axis; it is more intuitive than the violin plot. There is no default ridgeline plot in matplotlib or seaborn, so we need the joypy package.

from joypy import joyplot

joyplot(df, by='Arm', column='Income', colormap=sns.color_palette("crest", as_cmap=True));
plt.xlabel('Income');
plt.title("Ridgeline Plot, multiple groups");


The ridgeline plot shows that higher-numbered treatment arms have higher income. It is also easier to appreciate the different shapes of the distributions from this plot.

Comparing multiple groups: statistical methods

Lastly, let's consider hypothesis tests that compare multiple groups. For simplicity, we will concentrate on the most popular one: the F-test.

F-test

The most popular test with multiple groups is the F-test. The F-test compares the variance of a variable across different groups. This analysis is also called analysis of variance, or ANOVA.

The F-test statistic is given by:

F = [Σg ng (x̅g − x̅)² / (G − 1)] / [Σg Σi (xig − x̅g)² / (N − G)]

where G is the number of groups, N is the number of observations, x̅ is the overall mean, and x̅g is the mean within group g. Under the null hypothesis of no difference across groups, the statistic follows an F distribution.

from scipy.stats import f_oneway

income_groups = [df.loc[df['Arm']==arm, 'Income'].values for arm in df['Arm'].dropna().unique()]
stat, p_value = f_oneway(*income_groups)
print(f"F Test: statistic={stat:.4f}, p-value={p_value:.4f}")

F Test: statistic=9.0911, p-value=0.0000

The p-value of the test is basically zero, a strong rejection of the null hypothesis of no difference in the income distribution across treatment arms.
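The formula above can be checked by hand: a sketch on hypothetical groups (not the article's treatment arms), compared against f_oneway.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical groups (not the article's treatment arms)
rng = np.random.default_rng(5)
groups = [rng.normal(loc=m, size=50) for m in (0.0, 0.2, 0.5)]

all_x = np.concatenate(groups)
G, N = len(groups), len(all_x)
grand_mean = all_x.mean()

# Between-group variability per degree of freedom (G - 1)
between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (G - 1)
# Within-group variability per degree of freedom (N - G)
within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - G)

f_manual = between / within
f_scipy, _ = f_oneway(*groups)
print(f"manual={f_manual:.4f}, scipy={f_scipy:.4f}")
```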

Summary

In this article, we have seen a host of different ways to compare two or more distributions, both visually and statistically. This is a primary concern in many applications, especially in causal inference, where we use randomization to make the treatment and control groups as comparable as possible.

We have also seen how different methods fit different situations. Visual methods are very intuitive, but statistical methods are essential for decision making, since we need to be able to assess the magnitude and statistical significance of the differences.

Code for this article:

https://avoid.overfit.cn/post/883ffe04b1e340c8a5fd2a7ea6364cbf

Author: Matteo Courthoud
