R language: classifying the iris dataset with logistic regression, ANOVA, outlier analysis and visualization
2022-07-25 02:01:00 【Extension Research Office】
Full text link: http://tecdat.cn/?p=27650
Source: the Tecdat data tribe official WeChat account
Abstract
This article explores the relationship among three of the variables in the iris dataset introduced by Fisher and Anderson, specifically a logistic regression of the dependent variable species, restricted to the levels virginica and versicolor, on the predictors petal length and petal width. Both a one-way ANOVA and data visualization show that one level of the species factor, I. setosa, is easily linearly separated from the other two and has a clearly distinct mean and variance, so it is not of interest for the logistic regression.
Introduction
A preliminary look at the iris data raises a direct question about the nature of the dataset itself: why collect such simple data? In fact, one of our first instincts was to ask whether, given the information in the dataset, it is possible to carry out a meaningful analysis and build a model that classifies new observations.
We were pleasantly surprised to learn that the dataset is routinely analyzed for exactly this purpose. Its most common use is in machine learning, in particular classification and pattern-recognition applications. We begin by examining the data with the tools covered so far: we fit a logistic regression to two of the iris species, virginica and versicolor (coded as π = 0 and π = 1). The third species, I. setosa, is excluded because it is highly separated from the other two species in every dimension.
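As a quick check on that claim, here is a minimal one-way ANOVA sketch (assumed; this code is not shown in the original excerpt) comparing petal length across all three species:

# One-way ANOVA of petal length by species on the full iris data
iris.aov <- aov(Petal.Length ~ Species, data = iris)
summary(iris.aov)
# Pairwise comparisons show I. setosa sits far from the other two species
TukeyHSD(iris.aov)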
Method
In this setting, logistic regression is more appropriate than a chi-square or Fisher's exact test, because we have a binary dependent variable and several continuous predictors, and it lets us quantify the strength of each effect while controlling for the others (via the odds ratio of each parameter).
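The diagnostic snippet below refers to a fitted object logit.fit, but the fitting step itself is not shown in this excerpt. Here is a minimal sketch of how it might be created, assuming a data frame irises that contains only the versicolor and virginica rows (so virginica is coded as 1):

# Assumed model-fitting step (not shown in the original excerpt)
irises <- droplevels(subset(iris, Species != "setosa"))
logit.fit <- glm(Species ~ Petal.Length + Petal.Width,
                 family = binomial, data = irises)
summary(logit.fit)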
# Residuals against fitted values, with a spline smooth and pointwise error bars
library(splines)
plot(predict(logit.fit), residuals(logit.fit),
     xlab = "Predicted values", ylab = "Residuals")
rl <- lm(residuals(logit.fit) ~ bs(predict(logit.fit), 8))
# rl <- loess(residuals(logit.fit) ~ predict(logit.fit))  # alternative smoother
y <- predict(rl, se = TRUE)
segments(predict(logit.fit), y$fit + 2 * y$se.fit,
         predict(logit.fit), y$fit - 2 * y$se.fit)
Results
We fitted the logistic model; the overall model fit and the parameter estimates are summarized below:


The meaning of the parameter estimates is best summarized by their odds ratios. The intercept term is not particularly interesting, since the data point (0, 0) is theoretically impossible and lies far outside the range of the data. β1 and β2 are more interesting: each represents the multiplicative increase in the odds that a plant belongs to I. virginica for a one-unit increase in the corresponding variable, holding the other fixed. Here it is clear that increasing petal width has a very large effect on the odds of classifying a particular plant as I. virginica; the effect is roughly 110 times that of petal length. Furthermore, neither 95% confidence interval for the odds ratios includes 1, so we conclude that both effects are statistically significant.
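A minimal sketch (assumed; not the original output) of how these odds ratios and their 95% confidence intervals can be computed from the fitted model:

# Odds ratios and profile-likelihood 95% confidence intervals
exp(coef(logit.fit))
exp(confint(logit.fit))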
library(ggplot2)
# Plot petal width against petal length, colored by species
qplot(Petal.Width, Petal.Length, colour = Species, data = irises, main = "Iris classification")

Using the coefficient estimates from the model, we can define a criterion, a linear discriminant, that best separates the data. The accuracy of this linear discriminant is summarized in the following confusion matrix:
# Get the prediction results from the model
logit.predictions <- ifelse(predict(logit.fit) > 0,'virginica', 'versicolor')
# Confusion matrix
table(irises[,5],logit.predictions)
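A minimal sketch (assumed; not from the original) of how the misclassification rate quoted in the conclusion, and the linear boundary implied by the coefficients, can be obtained:

# Misclassification rate from the confusion matrix
cm <- table(irises[, 5], logit.predictions)
1 - sum(diag(cm)) / sum(cm)

# The boundary predict(logit.fit) = 0 corresponds to
# b0 + b1 * Petal.Length + b2 * Petal.Width = 0
b <- coef(logit.fit)
qplot(Petal.Width, Petal.Length, colour = Species, data = irises) +
  geom_abline(intercept = -b[1] / b["Petal.Length"],
              slope = -b["Petal.Width"] / b["Petal.Length"])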

Diagnostics
By examining the residuals and the influence of individual data points, we identified several potentially anomalous observations:

Among the potentially problematic observations, observation 57 stands out as a possible outlier. Examining the diagnostic plots, we see features typical of logistic regression, including the two distinct curves in the residuals-versus-fitted plot. Observation 57 appears in each diagnostic plot, but fortunately its Cook's distance does not exceed the usual threshold.
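A minimal sketch (assumed; not from the original) of the standard diagnostic plots and a Cook's-distance check:

# Residuals vs fitted, Q-Q, scale-location and residuals-vs-leverage plots
par(mfrow = c(2, 2))
plot(logit.fit)
par(mfrow = c(1, 1))

# Flag observations with unusually large Cook's distance
cd <- cooks.distance(logit.fit)
which(cd > 4 / length(cd))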

Conclusion and discussion
In this case the logistic model was instructive, because it shows how effectively data can be classified into a binary dependent variable from several continuous predictors. As expected, the model is most uncertain near the boundary between the two species in the given dimensions, where observations lie close to the overall mean. It would be interesting to consider whether the model can be improved, or whether a different model suits the data better; perhaps a k-nearest-neighbours approach is needed for this classification problem. In any case, a misclassification rate of about 6% is actually quite good, and more data would likely improve it further.
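As a point of comparison for the k-nearest-neighbours idea, a minimal sketch (assumed; not from the original) using the class package on the same two predictors:

library(class)
# 3-nearest-neighbour predictions on the two petal measurements
knn.pred <- knn(train = irises[, c("Petal.Length", "Petal.Width")],
                test  = irises[, c("Petal.Length", "Petal.Width")],
                cl    = irises$Species, k = 3)
table(irises$Species, knn.pred)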
Self-test question
Diagnosis of Depression in Primary Care
To study factors related to the diagnosis of depression in primary care, 400 patients were randomly selected and the following variables were recorded:
DAV: Diagnosis of depression in any visit during one year of care.
0 = Not diagnosed
1 = Diagnosed
PCS: Physical component of SF-36 measuring health status of the patient.
MCS: Mental component of SF-36 measuring health status of the patient.
BECK: The Beck depression score.
PGEND: Patient gender
0 = Female
1 = Male
AGE: Patient’s age in years.
EDUCAT: Number of years of formal schooling.
The response variable is DAV (0 = not diagnosed, 1 = diagnosed), and it is recorded in the first column of the data. The data are stored in the file final.dat and are available from the course web site. Perform a multiple logistic regression analysis of these data using SAS or any other statistical package. This includes
estimation, hypothesis testing, model selection, residual analysis and diagnostics. Explain your findings in a 3- to 4-page report. Your report may include the following sections:
• Introduction: Statement of the problem.
• Material and Methods: Description of the data and methods that you used for the analysis.
• Results: Explain the results of your analysis in detail. You may cut and paste some of your computer
outputs and refer to them in the explanation of your results.
• Conclusion and Discussion: Highlight the main findings and discuss.
Please cut and paste the computer outputs into your report, and do not include any direct computer output as an attachment.
Please note that you also have the option of using a similar dataset in your own field of interest.
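For readers working in R rather than SAS, a minimal starting sketch (assumed; not part of the assignment), under the assumption that final.dat is whitespace-delimited with the columns in the order listed above:

# Read the data; column names follow the variable list above
depress <- read.table("final.dat",
                      col.names = c("DAV", "PCS", "MCS", "BECK",
                                    "PGEND", "AGE", "EDUCAT"))
# Multiple logistic regression of DAV on all predictors
dep.fit <- glm(DAV ~ PCS + MCS + BECK + PGEND + AGE + EDUCAT,
               family = binomial, data = depress)
summary(dep.fit)
# Odds ratios with 95% confidence intervals
exp(cbind(OR = coef(dep.fit), confint(dep.fit)))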
