
In some data sets, there are values (observed data points) called outliers. Outliers are observed data points that are far from the least-squares line. They have large "errors", where the "error" or residual is the vertical distance from the line to the point.

Outliers need to be examined closely. Sometimes they should be excluded from the analysis, for example when an outlier is the result of erroneous data. Other times, an outlier holds valuable information about the population under study and should remain in the data. The key is to examine carefully what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see whether the slope of the regression line changes substantially.
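The remove-and-refit check described above can be sketched in a few lines of Python. This is only an illustration: the six data points are hypothetical (five clustered points plus one far to the right in x), and the helper name slope is made up for this sketch.

```python
# Hypothetical data: five clustered points plus one far away horizontally.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 8.0]


def slope(x, y):
    """Least-squares slope: sum of cross-deviations over sum of squared x-deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


full_slope = slope(xs, ys)

# Drop each point in turn, refit, and record how much the slope moves.
changes = []
for i in range(len(xs)):
    x_rest = xs[:i] + xs[i + 1:]
    y_rest = ys[:i] + ys[i + 1:]
    changes.append(slope(x_rest, y_rest) - full_slope)

for (x, y), change in zip(zip(xs, ys), changes):
    print(f"drop ({x}, {y}): slope changes by {change:+.3f}")
```

In this made-up data set, dropping the point at x = 15 moves the slope far more than dropping any other point, which is the behavior that marks a candidate influential point.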

Computers and many calculators can be used to identify outliers from the data. Computer output for regression analysis will often identify both outliers and influential points so that you can examine them.

Identifying outliers

We could guess at outliers by looking at the scatter plot and best-fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule of thumb, we can flag any point that is located farther than two standard deviations above or below the best-fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatter plot by drawing an extra pair of lines that are two standard deviations above and below the best-fit line. Any data points that are outside this extra pair of lines are flagged as potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the standard deviation. On the TI-83, 83+, or 84+, the graphical approach is easier. The graphical procedure is shown first, followed by the numerical calculations. You would generally need to use only one of these methods.
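For readers working on a computer rather than a calculator, the numerical comparison might look like the following Python sketch. The two score lists are an assumption: they are taken to be the third exam/final exam data from the earlier example, which is not reproduced in this section. The variable names are illustrative.

```python
from math import sqrt

# Assumed data (not shown in this section): third exam (x) and final exam (y) scores.
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(third_exam)
mean_x = sum(third_exam) / n
mean_y = sum(final_exam) / n

# Least-squares slope b and intercept a of the best-fit line y-hat = a + b*x.
s_xx = sum((x - mean_x) ** 2 for x in third_exam)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(third_exam, final_exam))
b = s_xy / s_xx
a = mean_y - b * mean_x

# Residual = observed y minus predicted y (signed vertical distance from the line).
residuals = [y - (a + b * x) for x, y in zip(third_exam, final_exam)]

# Standard deviation of the residuals, using n - 2 degrees of freedom.
sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - 2))

# Rule of thumb: flag points whose residual exceeds 2s in absolute value.
flagged = [
    (x, y)
    for x, y, e in zip(third_exam, final_exam, residuals)
    if abs(e) > 2 * s
]

print(f"line: y-hat = {a:.2f} + {b:.2f}x, s = {s:.3f}")
print("flagged as potential outliers:", flagged)
```

The flag is a screening step only; a flagged point still needs to be examined before deciding whether to keep it.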

In the third exam/final exam example, you can determine whether there is an outlier. If there is one, as an exercise, delete it and fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This means the SSE should be smaller and the correlation coefficient ought to be closer to 1 or –1.
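One way to carry out this exercise in Python is sketched below. The score lists are assumed to be the third exam/final exam data from the earlier example (they are not listed in this section), the deleted point (65, 175) is the outlier this section identifies, and fit_summary is a hypothetical helper written for this sketch.

```python
from math import sqrt

# Assumed data (not shown in this section): third exam (x) and final exam (y) scores.
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def fit_summary(xs, ys):
    """Return (intercept, slope, SSE, r) for the least-squares line through the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r = s_xy / sqrt(s_xx * s_yy)
    return intercept, slope, sse, r


# Delete the outlier and refit on the remaining ten points.
outlier = (65, 175)
kept = [p for p in zip(third_exam, final_exam) if p != outlier]

before = fit_summary(third_exam, final_exam)
after = fit_summary([x for x, _ in kept], [y for _, y in kept])

for label, (a, b, sse, r) in (("with outlier", before), ("without outlier", after)):
    print(f"{label}: y-hat = {a:.2f} + {b:.2f}x, SSE = {sse:.1f}, r = {r:.4f}")
```

Comparing the two printed lines shows whether the SSE dropped and whether r moved closer to 1, which is the improvement the exercise asks you to confirm.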

Graphical identification of outliers

With the TI-83, 83+, and 84+ graphing calculators, it is easy to identify outliers graphically. If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance were equal to 2s or more, where s is the standard deviation of the residuals, then we would consider the data point to be "too far" from the line of best fit. We need to find and graph the lines that are two standard deviations below and above the regression line. Any points that are outside these two lines are outliers. We will call these lines Y2 and Y3.

As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us. Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412.
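For reference, the s reported here is the standard deviation of the residuals. Written out, under the usual least-squares definition with n – 2 degrees of freedom (the symbols n, y_i, and ŷ_i are notation introduced for this illustration, with n the number of data points):

```latex
s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n-2}}
```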

Line Y2 = –173.5 + 4.83x – 2(16.4) and line Y3 = –173.5 + 4.83x + 2(16.4)

where ŷ = –173.5 + 4.83x is the line of best fit. Y2 and Y3 have the same slope as the line of best fit.
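The same pair of lines can be written as small Python functions. This is a minimal sketch using the rounded values quoted above; the function names y1, y2, y3, and outside_band are illustrative, not calculator commands.

```python
# Rounded values quoted in the text for the third exam/final exam example.
INTERCEPT = -173.5
SLOPE = 4.83
S = 16.4  # standard deviation of the residuals, rounded


def y1(x):
    """Line of best fit."""
    return INTERCEPT + SLOPE * x


def y2(x):
    """Lower flag line: two standard deviations below the best-fit line."""
    return y1(x) - 2 * S


def y3(x):
    """Upper flag line: two standard deviations above the best-fit line."""
    return y1(x) + 2 * S


def outside_band(x, y):
    """True when the point (x, y) lies below Y2 or above Y3."""
    return y < y2(x) or y > y3(x)


# Check the student with a third exam score of 65 and a final exam score of 175.
x, y = 65, 175
print(f"band at x = {x}: {y2(x):.2f} to {y3(x):.2f}; outside: {outside_band(x, y)}")
```

At x = 65 the upper line sits at about 173.25, so a final exam score of 175 falls only slightly above it.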

Graph the scatter plot with the best-fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the "Y=" equation editor and press ZOOM 9. You will find that the only data point that is not between lines Y2 and Y3 is the point x = 65, y = 175. On the calculator screen it is just barely outside these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final exam; this point is farther than two standard deviations away from the best-fit line.

Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to tell if the point is between or outside the lines. On a computer, enlarging the graph may help; on a small calculator screen, zooming in may make the graph clearer. Note that when the graph does not give a clear enough picture, you can use the numerical comparisons to identify outliers.

The scatter plot of exam scores with a line of best fit. Two yellow dashed lines run parallel to the line of best fit. The dashed lines run above and below the best-fit line at equal distances. One data point falls outside the boundary created by the dashed lines; it is an outlier.


Source:  OpenStax, Introductory statistics. OpenStax CNX. May 06, 2016 Download for free at http://legacy.cnx.org/content/col11562/1.18