American Community Survey (ACS)
Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)
Field | Value | |
Title |
American Community Survey (ACS)
|
|
Identifier |
https://doi.org/10.7910/DVN/DKI9L4
|
|
Creator |
Damico, Anthony
|
|
Publisher |
Harvard Dataverse
|
|
Description |
analyze the american community survey (acs) with r and monetdb experimental.
think of the american community survey (acs) as the united states' census for off-years - the ones that don't end in zero. every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it's not a survey). the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it's often accurate at sub-state geographies and - depending on how many years are pooled - down to small counties. households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey. this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure. the census bureau runs and finances this behemoth, of course.
the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files. merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file. for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files. use as few pooled years as you can, unless you like sentences that start with, "over the period of 2006 - 2010, the average american ... [insert yer findings here]."
rather than processing the acs public use microdata sample line-by-line, the r language brazenly reads everything into memory by default. to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package principally to deal with this ram-gobbling monster.
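a minimal base-r sketch of the household-to-person merge described above, run on toy data rather than the real files. the column names (serialno, sporder, ten, agep) follow acs pums naming, but treat them and every value here as assumptions for illustration; check the data dictionary for the year you download. this is not taken from the repository's scripts.

```r
# toy stand-ins for the household-level and person-level files.
# column names and values are illustrative assumptions, not real acs data.
household <- data.frame(
  serialno = c(1, 2, 3),            # household identifier
  ten = c(1, 3, 2)                  # housing tenure code
)

person <- data.frame(
  serialno = c(1, 1, 2, 3, 3, 3),   # household each person belongs to
  sporder = c(1, 2, 1, 1, 2, 3),    # person number within the household
  agep = c(34, 36, 71, 40, 12, 9)   # age
)

# the merge only "rectangulates" if each person matches exactly one household
stopifnot(!any(duplicated(household$serialno)))
stopifnot(all(person$serialno %in% household$serialno))

# copy the household columns onto every person row
merged <- merge(person, household, by = "serialno")

# still one row per person: nothing gained, nothing lost
stopifnot(nrow(merged) == nrow(person))
```

on the full microdata this in-memory merge is exactly the ram-hungry step that the database-backed approach is meant to avoid, so the sketch is only useful for seeing the shape of the result.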
if you're already familiar with syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn't behave as you expect it to - some sqlsurvey commands require a different structure (i.e. svyby gets called through svymean) and others might not exist anytime soon (like svyolr).
gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this acs code. monetdb imports, writes, recodes data slowly, but reads it hyper-fast. a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat. importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding.
the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat.
this new github repository contains four scripts:
2005-2011 - download all microdata.R
fair warning: this full script takes a loooong time. run it friday afternoon, commune with nature for the weekend, and if you've got a fast processor and speedy internet connection, monday morning it should be ready for action. otherwise, either download only the years and sizes you need or - if you gotta have 'em all - run it, minimize it, and then don't disturb it for a week.
2011 single-year - analysis examples.R
de example.R
click here to view these four scripts
for more detail about the american community survey (acs), visit:
notes:
if you're just looking for a couple data points, you ought to give the census bureau's american factfinder a whirl. it's a table creator (click here to watch me blab about table creators), so it's easy-to-use but inflexible. here's a li'l tip: if you run a statistic using american factfinder and then the same statistic using these scripts, they will be close but won't match exactly. it's not a mistake, and both are methodologically correct.
every now and then, grumpy lawmakers threaten to defund the acs because, well, it's expensive. use it or lose it.
since data types in sql are not as plentiful as they are in the r language, the definition of a monet database-backed complex design object requires a cutoff be specified between the categorical variables and the linear ones. that cut point gets defined using the check.factors argument in the sqlsurvey() and sqlrepsurvey() function calls. check.factors defaults to ten, but can be raised or lowered as needed. here's how it works:
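a rough base-r illustration of the cutoff idea, not sqlsurvey's actual code. classify_columns() is a hypothetical helper invented for this sketch, the toy columns are made up, and the boundary rule (at most check.factors distinct values counts as categorical, rather than strictly fewer) is an assumption; consult the sqlsurvey documentation for the real behavior.

```r
# hypothetical helper, NOT part of sqlsurvey: label each column by how many
# distinct values it holds. the "<=" boundary is an assumption of this sketch.
classify_columns <- function(df, check.factors = 10) {
  n_distinct <- vapply(df, function(x) length(unique(x)), integer(1))
  ifelse(n_distinct <= check.factors, "categorical", "linear")
}

# toy data with made-up values
toy <- data.frame(
  sex = rep(1:2, 20),                   #  2 distinct values
  relp = rep(0:17, length.out = 40),    # 18 distinct values
  agep = 18:57                          # 40 distinct values
)

# with the default cutoff of ten, only sex is labelled categorical
default_labels <- classify_columns(toy)

# raising the cutoff to twenty pulls relp over to the categorical side
raised_labels <- classify_columns(toy, check.factors = 20)

print(default_labels)
print(raised_labels)
```

the practical point: a coded variable with many categories (like relp in this toy example) can land on the linear side of the default cutoff, which is the situation where raising check.factors matters.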
unless specified by the question's phrasing, most acs variables should be treated as point-in-time, as opposed to either annualized or ever during the year. this distinction is particularly important for health insurance coverage. think about these three statistics --
the number of americans who won't ever have health insurance during this year
although the automated ftp download program for this data set only retrieves files back as far as 2005, a nationwide version of the american community survey has been conducted since 2000. i skipped those years for two reasons --
confidential to sas, spss, stata, sudaan users: the decennial census is enshrined in our constitution. your statistical software isn't. time to transition to r. :D |
|