Statistics 201 HW2: Introduction to SAS and Data Screening
Due Saturday, September 19
This assignment will allow you some practice with SAS while doing some preliminary data screening and analysis. Here's what to hand in:
Email the following to the TA at kmaccorm@uvm.edu:
A SAS program (xxxx.sas) that is ready to run: send the final version, with titles and no mistakes.
A Word document with answers to the questions in part 3.
The data for this homework are in a file called 'hw2neighorhoodsf09.txt', which I will email to you. (You can either paste it into your editor as "in-stream" data or write an "infile" statement.) The data come from the Yahoo neighborhoods website (http://realestate.yahoo.com/neighborhoods). We selected a random sample of 42 ZIP codes from across the United States and gathered data on each of them. The coding manual (below) describes the variables for each town.
1. Write a SAS program to input the data based on the coding manual. Some things you should do:
Use title 'HW#2 – Neighborhood data set – Stat201 – Fall 2009' and title2 'your name' throughout the program. Use title3 to describe the particular analysis done in that part of the program, e.g., 'Description of Quantitative Variables'. (Type the quotes as plain straight quotes in your SAS program.)
Use the coding manual to write an Input statement with column numbers.
Create a numeric format for REGION (see p. 77 in C&S for examples).
Create labels (p. 75 in C&S) for all variables, using the descriptions from the coding manual. (Feel free to copy and paste from the coding manual below.)
Create a new character variable called COST. COST should equal 'More' if the cost of living is greater than or equal to 100; 'Less' otherwise.
In procedures below, use TOWN as the ID when useful (mainly in proc univariate).
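A minimal sketch of how the pieces of part 1 might fit together, not a finished solution. The data set name neighborhoods, the format name regfmt, and the file path are placeholders. Only a few of the 22 variables are read here; the rest follow the same column pattern from the coding manual. The truncover option is an assumption that guards against lines shorter than 97 columns.

```sas
title  'HW#2 - Neighborhood data set - Stat201 - Fall 2009';
title2 'your name';

/* Numeric format for REGION; the codes come from the coding manual. */
/* The format name regfmt is a placeholder.                          */
proc format;
  value regfmt 1 = 'NE'
               2 = 'SE'
               3 = 'MW'
               4 = 'WE';
run;

data neighborhoods;                     /* placeholder data set name */
  /* Placeholder path: point it at your copy of the data file. */
  infile 'C:\stat201\hw2neighorhoodsf09.txt' truncover;

  /* Column input: variable name, $ if character, then the column range. */
  /* Only some of the variables are shown.                               */
  input ntown        1-2
        zip        $ 3-7
        town       $ 8-24
        popgrowth    25-29
        costliving   41-43
        region       97;

  label ntown      = 'Town Number'
        zip        = 'Zip Code'
        town       = 'Town Name'
        popgrowth  = 'Population Growth'
        costliving = 'Cost of Living Index'
        region     = 'Region';

  format region regfmt.;

  /* COST is character; declare its length before the first assignment. */
  length cost $ 4;
  if costliving >= 100 then cost = 'More';
  else cost = 'Less';
  /* Note: a missing COSTLIVING also lands in 'Less', because SAS
     treats a missing numeric value as smaller than any number. */
run;
```

Because a missing cost of living ends up coded 'Less', it is worth checking COSTLIVING for missing values before trusting the COST frequencies.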
2. Run Descriptive Statistics on SAS:
a. Run PROC FREQ on qualitative variables REGION and COST.
b. Run PROC MEANS on all quantitative variables (POPGROWTH -- GRADDEG). Use MAXDEC=2, and have PROC MEANS give you the mean, standard deviation, min, and max (the default).
c. Run PROC UNIVARIATE on all quantitative variables (POPGROWTH -- GRADDEG). Use the option PLOT. Use TOWN as the ID. (Don't print this; it's too long!)
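Steps a through c might look like the following sketch. It assumes the data step has already produced a data set called neighborhoods (a placeholder name) that holds every variable in the coding manual plus COST, read in coding-manual order so that the list POPGROWTH -- GRADDEG picks up exactly the quantitative variables. The title3 texts are only suggestions.

```sas
/* a. Frequency tables for the two qualitative variables. */
title3 'Frequencies of Qualitative Variables';
proc freq data=neighborhoods;
  tables region cost;
run;

/* b. Default statistics (N, mean, std dev, min, max) to 2 decimals. */
title3 'Description of Quantitative Variables';
proc means data=neighborhoods maxdec=2;
  var popgrowth -- graddeg;   /* all variables stored from POPGROWTH through GRADDEG */
run;

/* c. Screening run with plots; ID labels the extreme observations by town. */
title3 'Univariate Screening of Quantitative Variables';
proc univariate data=neighborhoods plot;
  id town;
  var popgrowth -- graddeg;
run;
```

The double-dash list depends on the order in which the variables were created, so it only works as intended if the INPUT statement follows the coding manual's order.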
Stop! Before continuing, examine the output from PROC MEANS, PROC UNIVARIATE, and PROC FREQ, and look for potential data errors. There are (at least) 4 errors in the data.* Fix them by placing statements in your data step (use if ... then ... statements, which we'll discuss in class). In other words, don't change the original data file: we want to leave a 'data trail'. When you've found and fixed them, clear your output and log files, rerun a, b, and c, and continue.
*HINT: Look for unexpected values. Check missing values, unexpected 0's, and (maybe) extreme outliers (see the boxplots in PROC UNIVARIATE for these). To determine whether they are actually wrong, go to the Yahoo website: http://realestate.yahoo.com/neighborhoods.
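The idea of fixing errors inside the data step can be sketched as below. This shows the pattern only: town number 99 and the action taken are made up and are not among the real errors. The file path and data set name are placeholders, and only a few variables are read to keep the example short.

```sas
data neighborhoods;                     /* placeholder data set name */
  /* Placeholder path: the raw file itself is never edited. */
  infile 'C:\stat201\hw2neighorhoodsf09.txt' truncover;

  /* Subset of the variables, for illustration only. */
  input ntown        1-2
        town       $ 8-24
        medage       36-37
        costliving   41-43;

  /* Corrections go after INPUT and before any derived variables. */
  /* Hypothetical example: town 99 does not exist in this sample. */
  if ntown = 99 then medage = .;        /* set an impossible value to missing */

  /* Derived variables are computed from the corrected values. */
  length cost $ 4;
  if costliving >= 100 then cost = 'More';
  else cost = 'Less';
run;
```

Keeping each correction as a visible IF-THEN statement is what leaves the 'data trail': anyone reading the program can see which observations were changed and how.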
d. Run PROC UNIVARIATE using only the variables CostLiving, Females, and MedHomeAge. This time, do the 'full univariate' shown below, where the dots stand for the list of variable names. Use TOWN as the ID.
proc univariate normal plot;
  id town;
  var ......;
  histogram ...... / normal;
  qqplot ...... / normal(mu=est sigma=est);
run;
e. Sort the data by region, then find means of CostLiving, Females, MedHomeAge by region.
f. Create boxplots of CostLiving, Females, MedHomeAge by region.
g. Produce a scatterplot (using PROC GPLOT) and find the correlation coefficient for CostLiving and MedHomeCost.
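Steps e through g could be sketched as follows. The sketch assumes a corrected data set named neighborhoods (a placeholder name) containing all the variables. PROC BOXPLOT is one possible way to get boxplots by region, not necessarily the one intended in class, and the title3 texts are only suggestions.

```sas
/* e. BY-group processing requires the data to be sorted by the BY variable. */
proc sort data=neighborhoods;
  by region;
run;

title3 'Means by Region';
proc means data=neighborhoods maxdec=2;
  by region;
  var costliving females medhomeage;
run;

/* f. One boxplot display per analysis variable, grouped by region. */
/*    The data must already be sorted by region (done above).       */
title3 'Boxplots by Region';
proc boxplot data=neighborhoods;
  plot (costliving females medhomeage) * region;
run;

/* g. Scatterplot: vertical variable * horizontal variable. */
title3 'Cost of Living versus Median Home Value';
proc gplot data=neighborhoods;
  plot costliving * medhomecost;
run;
quit;

/* Pearson correlation between the same two variables. */
proc corr data=neighborhoods;
  var costliving medhomecost;
run;
```

PROC GPLOT needs SAS/GRAPH, and PROC BOXPLOT needs SAS/STAT; if either is unavailable, PROC UNIVARIATE with the PLOT option and a BY REGION statement also produces boxplots by region.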
3. Describe output in words:
a. Screening: Explain what values you checked as potential data errors, whether they were errors or not, and how you fixed them.
b. Description: Examine the PROC UNIVARIATE output from 2d. Describe each of the variables analyzed there in words, in terms of distribution shape, central tendency, and variability. (Note: You should choose different statistics for central tendency and variability, depending on the shape of the data!) For each variable, is there evidence that it comes from a population that is not normally distributed? Look for any "outliers." Are there any explanations for the outliers?
c. Comparing groups: Look at the means output and boxplots from 2e and 2f. Do you think there is a real regional difference? Explain, suggesting a reason for the general pattern.
d. Correlation: Does there appear to be a relationship between CostLiving and MedHomeCost? If so, describe it, and offer an explanation for the relationship.
Coding Manual for Neighborhood Data:
#   Variable Name  Description                     Location  Type
1   NTOWN          Town Number                     1 - 2     Numeric
2   ZIP            Zip Code                        3 - 7     Character
3   TOWN           Town Name                       8 - 24    Character
4   POPGROWTH      Population Growth               25 - 29   Numeric
5   POPDEN         Population Density              31 - 35   Numeric
6   MEDAGE         Median Age                      36 - 37   Numeric
7   MEDINC         Median Income (in $1000)        38 - 40   Numeric
8   COSTLIVING     Cost of Living Index            41 - 43   Numeric
9   FEMALES        Percent Females                 44 - 47   Numeric
10  SINGLE         Percent Single                  49 - 52   Numeric
11  MARRIED        Percent Married                 53 - 56   Numeric
12  FAMILIES       Percent Families with Kids      57 - 60   Numeric
13  DEM            Percent Democrat                61 - 62   Numeric
14  MEDHOMEAGE     Median Home Age                 64 - 65   Numeric
15  MEDHOMECOST    Median Home Value (in $1000)    66 - 69   Numeric
16  SCHOOLEXP      School Expenditure per Pupil    70 - 74   Numeric
17  PTRATIO        School Pupil to Teacher Ratio   76 - 77   Numeric
18  HSGRAD         Percent HS Graduates            78 - 82   Numeric
19  SOMECOLL       Percent with Some College       83 - 86   Numeric
20  BACHELOR       Percent with Bachelor's Degree  88 - 91   Numeric
21  GRADDEG        Percent with Graduate Degree    93 - 96   Numeric
22  REGION         Region: 1=NE 2=SE 3=MW 4=WE     97        Numeric