Statistics 201 HW2: Introduction to SAS and Data Screening

Due Saturday, September 19

 

This assignment gives you practice with SAS while you do some preliminary data screening and analysis.  Here's what to hand in:

 

Email the following to the TA at kmaccorm@uvm.edu:

 

A SAS program (xxxx.sas) that is ready to run.  Send the final version, with titles and no mistakes.

 

A Word document with answers to the questions in part 3.

 

The data for this homework are in a file called ‘hw2neighorhoodsf09.txt’, which I will email to you.  (You can either paste it into your editor as “in-stream” data or write an “infile” statement.)  The data come from the Yahoo neighborhoods website (http://realestate.yahoo.com/neighborhoods).  We selected a random sample of 42 ZIP codes from across the United States and gathered data on each of them.  The coding manual (below) describes the variables recorded for each town.

 

 

1.       Write a SAS program to input the data based on the coding manual.  Some things you should do:

2.     Run descriptive statistics in SAS:

 

a.  Run PROC FREQ on the qualitative variables REGION and COST.

 

 

b.  Run PROC MEANS on all quantitative variables (POPGROWTH -- GRADDEG).  Use MAXDEC=2, and have PROC MEANS report the mean, standard deviation, minimum, and maximum (these are the default statistics, along with N).

 

c.  Run PROC UNIVARIATE on all quantitative variables (POPGROWTH -- GRADDEG ).  Use the option PLOT.  Use TOWN as the ID.       (Don't print this -- too long!)
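
Steps a through c could be sketched roughly as follows. This is only an outline under stated assumptions: the dataset name neighborhoods is made up for illustration, and the variables are assumed to have been read in the order given in the coding manual, so that the range POPGROWTH -- GRADDEG picks up every quantitative variable. COST does not appear in the coding manual, so it is left out of the TABLES statement until it has been created in the data step.

```sas
/* Sketch only: "neighborhoods" is an assumed dataset name from step 1. */

/* 2a. Frequencies of the qualitative variables.                        */
/* Add COST to the TABLES statement once that variable exists.          */
proc freq data=neighborhoods;
    tables region;
    title '2a. Frequencies of qualitative variables';
run;

/* 2b. Default statistics (N, mean, std dev, min, max), two decimals.   */
/* The double dash selects variables by their position in the dataset.  */
proc means data=neighborhoods maxdec=2;
    var popgrowth -- graddeg;
    title '2b. Descriptive statistics for quantitative variables';
run;

/* 2c. Screening plots, with TOWN identifying the extreme observations. */
proc univariate data=neighborhoods plot;
    var popgrowth -- graddeg;
    id town;
    title '2c. Univariate screening of quantitative variables';
run;
```

Because TOWN is the ID variable, the extreme observations listed by PROC UNIVARIATE are labeled with the town name, which makes suspicious values easier to look up.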

 

Stop!   Before continuing, examine the output from PROC MEANS, PROC UNIVARIATE, and PROC FREQ, and look for potential data errors.  There are (at least) 4 errors in the data.*   Fix them by placing statements in your data block (use if ... then ... statements, which we’ll discuss in class).  Don’t change the original data file: we want to leave a 'data trail'.  When you’ve found and fixed the errors, clear your output and log files, rerun a, b, and c, and continue.

 

*HINT:  Look for unexpected values.  Check missing values, unexpected 0's, and (maybe) extreme outliers (see the boxplots in PROC UNIVARIATE for these).  To determine whether they are actually wrong, go to the Yahoo website:  http://realestate.yahoo.com/neighborhoods.
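
As an illustration of the if ... then ... approach, here is a toy sketch. Everything in it is hypothetical: the dataset name demo, the three in-stream rows, and the two "errors" are invented to show the pattern and have nothing to do with the real errors in the homework file. The idea is that the raw values stay as they are, and the corrections live in the program where anyone can see them.

```sas
/* Toy example: dataset name, rows, and corrections are all made up. */
data demo;
    input ntown medage females;

    /* A median age of 0 is impossible, so set it to missing. */
    if ntown = 2 and medage = 0 then medage = .;

    /* A misplaced decimal point: 5.1 percent should be 51.0. */
    if ntown = 3 and females = 5.1 then females = 51.0;

    datalines;
1 34 50.2
2 0 51.3
3 41 5.1
;
run;

proc print data=demo;
    title 'Toy data after corrections';
run;
```

Tying each correction to a specific town number keeps the fix from accidentally changing a legitimate value in another row.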

 

d.       Using only the variables CostLiving, Females, and MedHomeAge, run PROC UNIVARIATE.  This time, do the 'full univariate' below (where ...... stands for the list of variable names), using TOWN as the ID:

proc univariate normal plot;

id town;

var .......;

histogram ...... / normal;

qqplot ...... / normal (mu=est sigma=est);

run;

 

e.  Sort the data by region, then find the means of CostLiving, Females, and MedHomeAge by region.

 

f.   Create boxplots of CostLiving, Females, and MedHomeAge by region.

 

g.  Produce a scatterplot (using PROC GPLOT) of CostLiving against MedHomeCost, and find their correlation coefficient.
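
Steps e through g might look something like the sketch below. The dataset name neighborhoods is an assumption. The assignment does not say which procedure to use for the boxplots or the correlation, so PROC BOXPLOT and PROC CORR are my choices here; running PROC UNIVARIATE with the PLOT option and a BY statement is another way to get boxplots by group.

```sas
/* Sketch only: "neighborhoods" is an assumed dataset name. */

/* 2e. BY-group processing requires the data to be sorted first. */
proc sort data=neighborhoods;
    by region;
run;

proc means data=neighborhoods maxdec=2;
    by region;
    var costliving females medhomeage;
    title '2e. Means by region';
run;

/* 2f. One set of side-by-side boxplots per analysis variable. */
proc boxplot data=neighborhoods;
    plot (costliving females medhomeage) * region;
    title '2f. Boxplots by region';
run;

/* 2g. Scatterplot: vertical variable * horizontal variable. */
symbol1 value=dot;
proc gplot data=neighborhoods;
    plot costliving * medhomecost;
    title '2g. Cost of living versus median home value';
run;
quit;

/* 2g. Pearson correlation coefficient for the same pair. */
proc corr data=neighborhoods;
    var costliving medhomecost;
run;
```

PROC GPLOT stays active after RUN, which is why the QUIT statement follows it.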

 

3.       Describe output in words:

 

a.  Screening:  Explain what values you checked as potential data errors, whether they were errors or not,  and how you fixed them.

b.  Description:  Examine the PROC UNIVARIATE output on ..... from 2d.  Describe each of these variables in words in terms of distribution shape, central tendency, and variability.  (Note:  You should choose different statistics for central tendency and variability, depending on the shape of the data!)  For each variable, is there evidence that it comes from a population that is not normally distributed?   Look for any “outliers.”  Can you explain any of them?

 

c.  Comparing groups:  Look at the means output and boxplots from 2e and 2f.  Do you think there is a real regional difference?  Explain, suggesting a reason for the general pattern.

 

d.  Correlation:  Does there appear to be a relationship between CostLiving and MedHomeCost?  If so, describe it, and offer an explanation for the relationship.

 

Coding Manual for Neighborhood Data:

 #  Variable Name Description                       Location Type
 1  NTOWN         Town Number                        1 - 2   Numeric
 2  ZIP           Zip Code                           3 - 7   Character
 3  TOWN          Town Name                          8 - 24  Character
 4  POPGROWTH     Population Growth                 25 - 29  Numeric
 5  POPDEN        Population Density                31 - 35  Numeric
 6  MEDAGE        Median Age                        36 - 37  Numeric
 7  MEDINC        Median Income (in $1000)          38 - 40  Numeric
 8  COSTLIVING    Cost of Living Index              41 - 43  Numeric
 9  FEMALES       Percent Females                   44 - 47  Numeric
10  SINGLE        Percent Single                    49 - 52  Numeric
11  MARRIED       Percent Married                   53 - 56  Numeric
12  FAMILIES      Percent Families with Kids        57 - 60  Numeric
13  DEM           Percent Democrat                  61 - 62  Numeric
14  MEDHOMEAGE    Median Home Age                   64 - 65  Numeric
15  MEDHOMECOST   Median Home Value (in $1000)      66 - 69  Numeric
16  SCHOOLEXP     School Expenditure per Pupil      70 - 74  Numeric
17  PTRATIO       School Pupil to Teacher Ratio     76 - 77  Numeric
18  HSGRAD        Percent HS Graduates              78 - 82  Numeric
19  SOMECOLL      Percent with Some College         83 - 86  Numeric
20  BACHELOR      Percent with Bachelor's Degree    88 - 91  Numeric
21  GRADDEG       Percent with Graduate Degree      93 - 96  Numeric
22  REGION        Region: 1=NE 2=SE 3=MW 4=WE       97       Numeric
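
One way to turn the coding manual into a data step is column input, where each variable is followed by the columns it occupies. The sketch below is a starting point rather than a finished program: the dataset name neighborhoods and the file path are assumptions, and you should substitute wherever you saved the emailed file. The TRUNCOVER option is also my addition, so that a short line does not cause SAS to read from the next one.

```sas
/* Sketch only: the dataset name and the file path are hypothetical. */
data neighborhoods;
    infile 'C:\stat201\hw2neighorhoodsf09.txt' truncover;
    input ntown         1-2
          zip         $ 3-7      /* $ marks a character variable */
          town        $ 8-24
          popgrowth    25-29
          popden       31-35
          medage       36-37
          medinc       38-40
          costliving   41-43
          females      44-47
          single       49-52
          married      53-56
          families     57-60
          dem          61-62
          medhomeage   64-65
          medhomecost  66-69
          schoolexp    70-74
          ptratio      76-77
          hsgrad       78-82
          somecoll     83-86
          bachelor     88-91
          graddeg      93-96
          region       97;
run;

/* Quick check that the columns line up with the manual. */
proc print data=neighborhoods (obs=5);
    title 'First five towns, to check the input';
run;
```

Reading the variables in the manual's order matters later, because a range such as POPGROWTH -- GRADDEG selects variables by their position in the dataset.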