Statistical genetics: Chapter 2, the concept of statistical analysis
2022-06-26 11:18:00 【Analysis of breeding data】
Hello everyone, I'm Brother Fei.
I recommended this book a few days ago; you can get the PDF and the accompanying data and code. Here I will walk through each chapter, since summarizing is itself a way of learning.
The quoted parts are Google translations of the original book; the main text is my own understanding.
Part I, Foundations, is divided into six chapters:
- Chapter 1: Basic concepts of the genome (introduced in a previous post)
- Chapter 2: Concepts of statistical analysis
- Chapter 3: Genotype data parameters
- Chapter 4: GWAS analysis
- Chapter 5: Polygenic effects
- Chapter 6: Gene-environment interaction
Today I introduce the contents of Chapter 2, concepts of statistical analysis. Take a look at the outline:
Main contents

This chapter includes:
- Basic statistical concepts, including the mean, variance, standard deviation, covariance, and the variance-covariance matrix
- The basic framework of statistical models, including the null hypothesis, the alternative hypothesis, and significance thresholds
- Correlation and causality
- Causal models
- Fixed-effects, random-effects, and mixed-effects models
- Replication and overfitting
Why study statistics
So far, we have focused on mastering the basic concepts of the human genome. Before moving on to more advanced topics, especially the applied statistics chapters later in this book, you also need to master some core statistical concepts. As we pointed out at the outset, this book is written at an introductory level and aims to serve the various groups of researchers entering this field for the first time. Readers who already have some basic statistical knowledge may find this chapter too elementary. Those whose statistics courses were a long time ago, or who studied statistics only as part of a broader course, may find this basic review useful. Although many of the concepts introduced in this chapter are familiar from non-genetic data analysis, we also emphasize the statistical concepts and problems specific to genetic data analysis.
The purpose of this chapter is to provide the introductory knowledge needed to understand the core concepts of genetic data analysis. We then touch on more advanced topics that you may wish to explore further. If you are interested in applying these techniques at a higher level of statistical complexity, we strongly encourage you to consult more advanced or specialized textbooks, some of which are mentioned in the further reading section at the end of this chapter. We first review some basic statistical concepts that are not unique to statistical genetic data analysis, such as measures of central tendency. Most of the chapter then covers the basics of statistical models and offers a refresher on related concepts such as the null hypothesis, the alternative hypothesis, and significance thresholds. We then distinguish between correlation and causation, which is critical when evaluating these models. We also guide the reader through various types of causal models, including direct and bidirectional causation, additive and common-cause models, and mediation and moderation (or interaction) models. Next, the commonly used fixed-effects, random-effects, and mixed models are briefly distinguished. Finally, we briefly discuss replication and overfitting, and then provide a short summary.
Brother Fei's note: Statistical genetics is not just genetics and molecular biology; you also need to learn statistics. Describing genetic characteristics with statistical parameters and summarizing patterns from them is very important.
Basic statistical concepts
Because we know that readers may come from different disciplines and backgrounds, let's start by introducing the main concepts. As in Chapter 1, you will often encounter the terms phenotype (the dependent variable in these statistical models) and genotype (often called a covariate, predictor, or independent variable). As we will explain in the following chapters, these variables can take different forms depending on how they are measured. The measurement in turn affects which statistical model we choose. For example, if a phenotype is measured as a binary variable (1 = disease, 0 = no disease), you would use a logistic or similar model. However, if the phenotype is measured as a continuous or quantitative outcome (height, for example), you need a model that can capture both the graded scale of the data and, usually, the distribution of the measurement.
Brother Fei's note: In statistical genetics there is very little analysis of variance; most analyses are regression-based. Binary traits use logistic models, and continuous traits use general linear models or mixed linear models. Heritability and genetic correlations are computed from variance components, SNP significance is a significance test of a regression coefficient, polygenic scores are predictive regression models, and so on.
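To make this concrete, here is a minimal R sketch (not from the book; the data frame and column names are made up) showing the two model choices: a linear model for a continuous phenotype and a logistic model for a binary phenotype.

```r
# Made-up illustration: one SNP, one continuous and one binary phenotype
set.seed(1)
pheno <- data.frame(
  snp     = rbinom(200, 2, 0.3),   # genotype coded as 0/1/2 minor alleles
  height  = rnorm(200, 170, 10),   # continuous phenotype
  disease = rbinom(200, 1, 0.2)    # binary phenotype (1 = disease, 0 = none)
)

fit_linear   <- lm(height ~ snp, data = pheno)                       # continuous trait
fit_logistic <- glm(disease ~ snp, family = binomial, data = pheno)  # binary trait

summary(fit_linear)
summary(fit_logistic)
```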
Mean, standard deviation, and variance
These parameters generally refer to continuous, normally distributed traits:
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; the standard deviation is the square root of the variance.
The formula for the sample variance:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
R code example:
Simulate a data frame with 20 observations:
library(tidyverse)
dat = data.frame(ID = 1:20, y = rnorm(20)+100)
dat
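Note that rnorm() draws random numbers, so the values printed below came from the author's own run; the post does not show a random seed, so your numbers will differ. If you want your own run to be reproducible, you could set an (arbitrary) seed before creating dat, for example:

```r
# Arbitrary seed, not from the original post; it makes your simulation
# reproducible but will not recreate the exact numbers shown below.
set.seed(123)
```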
Calculate the mean:
> mean1 = mean(dat$y);mean1
[1] 100.2073
Calculate the variance:
> var1 = var(dat$y);var1
[1] 1.27476
Calculate the standard deviation:
> sd1 = sd(dat$y);sd1
[1] 1.129053
Another way to calculate the variance:
> sum((dat$y - mean1)^2)/(20-1)
[1] 1.27476
Variance-covariance matrix
The covariance of X and Y is written COV(X, Y); the covariance of X with itself is the variance of X.
$\mathrm{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
With multiple variables, the variance-covariance matrix can be written as follows: the diagonal holds the variances and the off-diagonal entries hold the pairwise covariances.
$$\Sigma = \begin{pmatrix} \mathrm{var}(y_1) & \mathrm{cov}(y_1, y_2) & \mathrm{cov}(y_1, y_3) \\ \mathrm{cov}(y_2, y_1) & \mathrm{var}(y_2) & \mathrm{cov}(y_2, y_3) \\ \mathrm{cov}(y_3, y_1) & \mathrm{cov}(y_3, y_2) & \mathrm{var}(y_3) \end{pmatrix}$$
R code demonstration:
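Note: the covariance examples below use a data frame dat with three columns y1, y2, and y3, but the post does not show how it was simulated. A minimal sketch producing a data frame of the same shape might look like the following (the column means are guesses based on the outputs shown later, and the exact values printed below will not be reproduced):

```r
# Hypothetical re-creation of the three-column data frame used below;
# only the structure matches, the values are newly simulated.
set.seed(123)
dat <- data.frame(ID = 1:20,
                  y1 = rnorm(20) + 100,
                  y2 = rnorm(20) + 200,
                  y3 = rnorm(20) + 100)
head(dat)
```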
The covariance of y1 with itself equals its variance:
> cov(dat$y1,dat$y1)
[1] 0.9460778
> var1 = var(dat$y1);var1
[1] 0.9460778
The variance-covariance matrix of y1, y2, and y3: the diagonal holds the variances, the off-diagonal entries the pairwise covariances.
> cov(dat[,2:4])
           y1          y2         y3
y1  0.94607782 -0.07404345 -0.1165461
y2 -0.07404345  0.68879820  0.0776964
y3 -0.11654608  0.07769640  0.9165009
> cov(dat$y1,dat$y2)
[1] -0.07404345
> cov(dat$y1,dat$y3)
[1] -0.1165461
Statistical models
The regression model
Polygenic score (PGS)
For a regression y = a*x + b, a is the regression coefficient and b is the intercept.
The regression coefficient is calculated as:
$a = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$
Note that the denominator here is the variance of x.
In R, the two approaches below give the same result:
> lm(dat$y2 ~ dat$y1)
Call:
lm(formula = dat$y2 ~ dat$y1)
Coefficients:
(Intercept) dat$y1
207.78619 -0.07826
> cov(dat$y2,dat$y1)/var(dat$y1)
[1] -0.0782636
Null hypothesis and significance test
The goal of regression methods is usually to test the null hypothesis, a statistical test used to determine whether there is no effect or no difference between specific groups. Thinking back to an introductory statistics course, this means the parameter you estimate (β) is equal to zero. The alternative hypothesis is that the parameter is not equal to zero. We run a statistical test on the data and calculate a p-value, which measures how inconsistent the data are with the null hypothesis being true. In short, if the p-value is small, the data are inconsistent with the null hypothesis. If the parameter passes the significance threshold (for example, 0.05 or 0.001), the null hypothesis is rejected in favor of the alternative hypothesis. Statistical significance has attracted considerable criticism and heated debate, mainly around the fact that results are often evaluated only in relation to the null hypothesis.
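As a small illustration with the simulated data above (a sketch, not the book's code), summary() on a fitted regression reports the estimate, standard error, test statistic, and p-value that are used to decide whether to reject the null hypothesis that β = 0:

```r
# The last column, Pr(>|t|), is the p-value for testing beta = 0;
# comparing it with a threshold such as 0.05 decides whether to reject H0.
fit <- lm(y2 ~ y1, data = dat)
summary(fit)$coefficients
```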
Correlation, causation, and multiple causal models
In this book the terms correlation (r) and causation are used frequently, so it is necessary to distinguish them. Correlation measures the statistical association between two variables. It is a scaled version of the covariance, with values between -1 and 1. A value close to zero means the covariance between the variables is very small; a value close to 1 represents a strong positive association, and a value close to -1 a strong negative association. Note that the covariance, correlation, and regression models we discuss here assume a linear relationship between the two variables. In other words, we draw a line through the scatter of points and assume that one variable changes by a constant amount per unit change of the other across its distribution. When the correlation is close to 1, the variables tend to move in the same direction and the scatter around the line is relatively small; when it is close to -1, they move in opposite directions.
The formula for the correlation coefficient:
$r = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x)\,\mathrm{sd}(y)}$
In R, the cor function computes the correlation coefficient directly; it can also be calculated from the formula above. Let's compare the two in code:
> cor(dat$y1,dat$y2)
[1] -0.09172278
>
> cov(dat$y1,dat$y2)/(sd(dat$y1)*sd(dat$y2))
[1] -0.09172278
Causality and correlation
Correlation does not necessarily imply causation. For example:
Correlation simply describes the size and direction of the relationship between two or more variables. It does not mean that a change in one variable causes a change in the value of another. Causation, in contrast, means that a change in the value of one variable is the result of a change in another variable.
Take smoking as an example. You will find that smoking is likely to be highly correlated with other behaviors, such as heavy alcohol consumption. However, you cannot naively infer that smoking causes heavy drinking. Causality is difficult to establish in almost all types of data analysis, including statistical genetic data analysis.
How can causality be established?
In medical research, randomized controlled trials are the most effective way to establish causality. In these studies, the sample is usually divided into two groups that are similar in most respects. The groups then receive different treatments, for example a placebo versus a specific treatment or drug, and the outcomes of each group are evaluated. If the outcomes differ significantly, a causal effect can be established.
In fact, many of the complex traits we study in this field cannot easily be studied with case-control or randomized controlled designs. We can observe environmental changes (for example, policy changes or pollution exposure), but determining causality remains a challenge.
In later chapters of this book we present a variety of applied statistical methods that attempt to establish causal relationships in statistical genetics, including Mendelian randomization. We also highlight the challenges and areas of ongoing debate in this research. Most applications of quantitative statistical genetics aim to estimate multivariate statistical models, the topic we turn to now.
Fixed-effects, random-effects, and mixed linear models
So far we have only discussed the fixed-effects model, in which the effect of a covariate on the phenotypic outcome is modeled as fixed, that is, the same increase per unit of the covariate across the sample. Fixed-effects models thus differ from random-effects or mixed models, in which some or all of the model parameters are treated as random variables. Readers should note that these terms are used somewhat differently in biostatistics and econometrics; Andrew Gelman has written an excellent blog post describing these differences. In econometrics, fixed-effects models are often used for a specific set of variables in hierarchical or panel data. In biostatistics and genetics, "fixed effects" refer to average population-level effects, while "random effects" describe the distribution of subject-specific effects, which are usually treated as unknown latent variables. We often use these models to control for so-called unobserved heterogeneity; the usual assumption is that this heterogeneity is constant over time and independent of the other covariates. Random-effects models are very useful when the data contain clusters of individuals, such as families, schools, neighborhoods, cities, countries, or hospitals. With longitudinal data, the clusters can be repeated measurements on the same individual; with recurrent-event data, they may be repeated disease episodes. We therefore model random effects to account for clustering in the data that may in turn affect the main effect.
Mixed linear models contain both fixed and random effects. They are often used to examine repeated measurements on the same individuals, or specific clusters of measurements, in longitudinal cohort studies. In the genetic research covered in this book, mixed models are useful for controlling for population structure and for estimating heritability.
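A minimal R sketch of the three kinds of models (illustration only, not the book's code; the lme4 package and the made-up family clustering variable are assumptions):

```r
library(lme4)  # provides lmer(); install.packages("lme4") if needed

# Made-up data: 50 families with 4 members each, one SNP, one phenotype
set.seed(1)
n_fam   <- 50
d       <- data.frame(family = factor(rep(1:n_fam, each = 4)),
                      snp    = rbinom(4 * n_fam, 2, 0.3))
fam_eff <- rnorm(n_fam)                            # shared family-level effect
d$pheno <- 0.5 * d$snp + fam_eff[as.integer(d$family)] + rnorm(4 * n_fam)

# Fixed-effects model: the SNP effect is a constant shift per allele
fixed_fit  <- lm(pheno ~ snp, data = d)

# Random-effects model: only a random intercept per family (cluster)
random_fit <- lmer(pheno ~ 1 + (1 | family), data = d)

# Mixed model: fixed SNP effect plus the random family intercept
mixed_fit  <- lmer(pheno ~ snp + (1 | family), data = d)
summary(mixed_fit)
```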
When these models are applied to population structure, the random effect captures the contribution of genotypic correlation between individuals to their phenotypic correlation. As mentioned earlier, the correlations between individuals are calculated using the genomic relationship matrix (GRM). The mixed model can therefore account for the genetic distance between individuals in the sample, controlling for potential confounding caused by the correlation between differences in genetic ancestry and differences in geography.
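A common way to build a GRM is to standardize each SNP and average the cross-products across loci. The sketch below (made-up genotype matrix, not the book's code) shows that construction in R:

```r
# Simulate a small genotype matrix: 100 individuals x 500 SNPs coded 0/1/2
set.seed(2)
n <- 100; m <- 500
X <- matrix(rbinom(n * m, 2, 0.3), nrow = n, ncol = m)

# Standardize each SNP column (mean 0, variance 1), then average cross-products;
# GRM[i, j] measures genome-wide genetic similarity of individuals i and j
Z   <- scale(X)
GRM <- tcrossprod(Z) / m

dim(GRM)        # 100 x 100
GRM[1:3, 1:3]   # diagonal close to 1; off-diagonal near 0 for unrelated people
```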
As discussed in the heritability section of Chapter 1, mixed models are also commonly used to estimate SNP heritability with genome-based restricted maximum likelihood (GREML). This is a variance-component estimation method that quantifies the narrow-sense additive contribution to the heritability of a phenotype. It is specific to a particular subset of genetic variation, usually restricted to loci with MAF greater than 1%. For this reason it is often called "chip" or "SNP" heritability (h2SNP, as in Chapter 1). As mentioned earlier, with the advent of genome-wide data, researchers can go beyond twin models and examine genetic similarity between unrelated individuals. The software used for this analysis is GCTA, Genome-wide Complex Trait Analysis (see "Mixed model analysis software" at the end of this chapter). These estimates provide a lower bound on the genetic contribution to a phenotype or trait without relying on the often restrictive assumptions of twin or family analyses. In short, if a particular phenotype is heritable, then individuals who are more closely related genetically should have more similar phenotypic values. If genetic relatedness between individuals is not an indicator of similar phenotypic values, we can conclude that the phenotype is probably not influenced by genetics. In Chapter 9 at the end of this book, we provide an example of how to perform such an analysis with GCTA.
Brother Fei's note: Mixed linear models are widely used in animal and plant breeding. In human statistical genetics, heritability is estimated with the GREML method, which estimates variance components and computes heritability from them. It puts a G matrix built from genotype (SNP) data into the random effect of a mixed linear model, similar to the GBLUP method in genomic selection. In human genetics GCTA is commonly used for this estimation, and the two approaches are equivalent.
Software for estimating heritability:
- GCTA:https://cnsgenomics.com/software/gcta/#Overview
- FastLMM:http://research.microsoft.com/en-us/um/redmond/projects/mscompbio/fastlmm/
- GEMMA:http://www.xzlab.org/software.html
- MMM:http://www.helsinki.fi/mjxpirin/download.html.
Other software commonly used in animal and plant breeding:
- R packages: asreml, sommer, BGLR
- ASReml, DMU, BLUPF90, HIBLUP, etc.
Replication of results and overfitting
If only one data set or sample is used for the analysis, you may run into overfitting, which is closely tied to whether results replicate in independent samples.
Overfitting refers to the problem that the predictors in a model perform better in the specific data or sample they were fitted on than in a new, independent data set.
First, overfitting can be a consequence of multiple testing. Any observed association between covariates and phenotypes reflects both the real population effect and random chance. As described in Chapter 4, these techniques test the association of sometimes millions of genetic variants with a specific phenotype. The variants showing the largest associations are more likely than we would expect to have been inflated by chance. When researchers try to replicate the results in a new, often smaller sample, they typically find weaker associations. This is because the top results of the initial, overfitted model have effect estimates (for example, regression coefficients) that are larger than, or exaggerate, the real effects. In fact, any joint model with multiple covariates or predictors that is built and tested on a single sample will be overfitted.
This is because we estimate the parameters to optimize the fit of the model to that specific data set. It is therefore only logical that the model does not perform as well on new, independent data.
In this introductory textbook we cannot describe all the ways to deal with overfitting, so only a few are outlined here. One approach, now widely used to address this replication problem, is to use training and validation data sets. One option is to re-test a discovery in a similar independent data set to see whether the result replicates. Another option is to split the data within the same sample into a training set and a validation set, a choice made feasible by the release of large data sets such as the UK Biobank (about 500,000 people). This can then be repeated with different data partitions to improve robustness.
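A minimal sketch of the training/validation idea (made-up data, not the book's example): fit a model with many predictors on a training split and compare its in-sample fit with its fit on the held-out validation split.

```r
# Made-up data: 200 individuals, 50 noisy predictors, phenotype driven by only 2
set.seed(3)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), nrow = n)
y <- 1.0 * X[, 1] + 0.5 * X[, 2] + rnorm(n)
d <- data.frame(y = y, X)

# Split into training and validation halves
train_id <- sample(seq_len(n), size = n / 2)
train <- d[train_id, ]
valid <- d[-train_id, ]

# Fit on the training set using all 50 predictors (prone to overfitting)
fit <- lm(y ~ ., data = train)

# In-sample vs out-of-sample correlation between predicted and observed values
cor(predict(fit, train), train$y)   # typically high
cor(predict(fit, valid), valid$y)   # typically noticeably lower
```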
Other techniques for dealing with this problem are called regularization or shrinkage methods.
Shrinkage methods effectively perform variable selection: predictors are kept in the model, but some parameter estimates are shrunk. Lasso regression can be used to perform variable selection, and ridge regression to shrink parameter estimates. Although the details are far beyond the scope of this introductory text, these methods work by adding a penalty on the parameters to the optimization. There is also an elastic net method that combines ridge and lasso regression. In Bayesian shrinkage methods, the penalty is expressed as a prior probability. The penalty can be set in many ways, for example penalizing large effects or shrinking small effects to zero or close to zero; the choice depends on the model and the analysis. Because the ground truth is usually unknown, several analyses should be tested, with prediction in independent data and cross-validation. These methods are becoming more and more common in genetics, and interested readers should consult more advanced material.
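A sketch of ridge, lasso, and elastic net with the glmnet package (reusing the X and y simulated in the previous snippet; glmnet itself is an assumption here, not software named by the book): alpha = 0 gives ridge, alpha = 1 gives lasso, and cross-validation chooses the penalty strength.

```r
library(glmnet)  # install.packages("glmnet") if needed

# Reusing X and y from the train/validation sketch above
cv_ridge <- cv.glmnet(X, y, alpha = 0)    # ridge: shrinks all coefficients
cv_lasso <- cv.glmnet(X, y, alpha = 1)    # lasso: shrinks some exactly to zero
cv_enet  <- cv.glmnet(X, y, alpha = 0.5)  # elastic net: a mixture of the two

# Coefficients at the cross-validated penalty; lasso keeps only a few predictors
coef(cv_ridge, s = "lambda.min")
coef(cv_lasso, s = "lambda.min")
```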
Brother Fei's note: Overfitting often occurs when building models. For example, you identify GWAS hits in one population and then use the significant SNPs for prediction: accuracy is very high in the discovery population but very low, or the model fails entirely, in other populations. That is the typical signature of overfitting. One solution is to enlarge the sample size, since models built in large samples are more robust; another is to choose algorithms with built-in penalties, such as ridge regression and LASSO regression.