Data manipulation for political stats: a tech note

Awaiting the happy tidings just a few hours away - but, in the meantime...

Very rarely are datasets relating to the legislative process straightforward to handle. (In my small, cheapskate experience; with enough moolah and savvy, I suspect things can be different!)

Take as an example the roll call data I've been working on in the last few days: what I was after were the votes of individual Dem reps on 29 particular votes in the 109th House. The data required was available on one of the Voteview pages - but the array was too big to fit on a standard spreadsheet.

The solution (highly circuitous, as all freebies tend to be) was to use the statistical program R - well worth having and getting to know for those needing the occasional statistical number-cruncher but unwilling to pay a cent for it!

R imported the Stata file containing the 109th data (Stata is a commercial program - muchos dólares!) without difficulty - in the end. R uses a command line interface (something of a culture shock for relative PC newbies), but there are plenty of free manuals on the site to help.
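A minimal sketch of the import step, for reference. It assumes the `foreign` package (bundled with standard R installations), whose `read.dta` function reads Stata files in the older formats. The file, the object name `house109` and the column names are made up for illustration; a tiny stand-in file is written out first so the example runs on its own.

```r
library(foreign)  # recommended package bundled with standard R installs

# Stand-in for the real download: a tiny made-up table written out
# in Stata format, so that this sketch is self-contained.
toy <- data.frame(
  id   = c(1, 2, 3),
  name = c("SMITH", "JONES", "BROWN"),
  v1   = c(1, 6, 9),
  stringsAsFactors = FALSE
)
dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)

# The import step itself: read the Stata file into a data frame.
house109 <- read.dta(dta_path)

dim(house109)    # rows (members) by columns (identifiers plus votes)
names(house109)  # column names, handy for locating the roll calls
```

With the real file, only the `read.dta` line is needed, pointed at wherever the download was saved.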

Once the file was in R, it was left to this unofficial page to reveal the secret of how to select the particular roll calls I wanted, and output them as a file. The relevant example (halfway down the page) is this:

hsb3 <- hsb2.small[, c(1, 7, 8)]

The trick is to insert, in place of the 1, 7, 8, the numbers of the columns containing the names and districts of the reps and the numerical codes for each of the RCVs in question. The first nine columns in the main dataset are the names, districts, parties, etc for each of the reps - identifying the column numbers for these is no problem.
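A small sketch of that column selection on a made-up data frame. The nine identifying columns, the five vote columns and all the values are illustrative assumptions, not the real dataset; only the `[, c(...)]` indexing is taken from the example above.

```r
# Made-up stand-in for the full dataset: nine identifying columns
# followed by five roll call columns (the real file has far more).
members <- data.frame(
  cong = 109, id = 1:3, state = c(41, 71, 13), dist = c(1, 2, 3),
  lstate = c("ALABAMA", "CALIFOR", "NEW YOR"), party = c(100, 200, 100),
  eh1 = 0, eh2 = 0, name = c("SMITH", "JONES", "BROWN"),
  stringsAsFactors = FALSE
)
votes <- data.frame(V1 = c(1, 6, 9), V2 = c(6, 6, 1), V3 = c(1, 1, 1),
                    V4 = c(9, 1, 6), V5 = c(6, 1, 1))
house109 <- cbind(members, votes)

# Identifying columns wanted: district (4), party (6), name (9).
id_cols <- c(4, 6, 9)
# Columns holding the roll calls of interest (here the 2nd and 5th votes).
rcv_cols <- c(11, 14)

# Same pattern as the example: all rows, only the listed columns.
subset109 <- house109[, c(id_cols, rcv_cols)]
names(subset109)
```

Leaving the part before the comma empty keeps every row, so each member stays in the subset and only the columns are thinned out.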

Then, starting with a spreadsheet (that I already had worked up) of the total votes given by party for and against for each of the RCVs, with the identifying numbers in the first column, I only needed to paste that column (transposed to be a row) to a fresh spreadsheet, save it as a CSV file, open the CSV file in Notepad and paste in the numbers between the brackets.
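As an alternative to pasting the numbers in by hand, R could read the one-row CSV file itself. This is a sketch of a different route from the one described, and it rests on a loudly flagged assumption: that roll call number k sits in column 9 + k of the dataset. The file contents and roll call numbers are made up.

```r
# Made-up stand-in for the CSV saved from the spreadsheet:
# one row of roll call numbers, no header.
csv_path <- tempfile(fileext = ".csv")
writeLines("12,47,303", csv_path)

# scan() reads the comma-separated numbers straight into a vector.
rcv_ids <- scan(csv_path, sep = ",", quiet = TRUE)

# ASSUMPTION: roll call k sits in column 9 + k, because the first nine
# columns identify the member. Check this against the real file.
n_id_cols <- 9
rcv_cols <- n_id_cols + rcv_ids

# Full selection: chosen identifying columns, then the roll calls.
keep <- c(4, 6, 9, rcv_cols)
keep
```

The vector `keep` would then go between the square brackets in place of a hand-pasted list of numbers.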

Having labeled the required subsetted information, all I needed to do was to output the subset as a CSV file and open that in a spreadsheet to work on.
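A minimal sketch of the output step using base R's `write.csv`. The data frame and file name are made up for illustration.

```r
# Made-up stand-in for the subset produced earlier.
subset109 <- data.frame(
  name = c("SMITH", "JONES"),
  dist = c(1, 2),
  V2   = c(6, 1),
  stringsAsFactors = FALSE
)

# Write the subset as CSV; row.names = FALSE avoids an extra
# unnamed first column of row numbers in the spreadsheet.
out_path <- file.path(tempdir(), "subset109.csv")
write.csv(subset109, out_path, row.names = FALSE)

# Quick look at what was written.
readLines(out_path)
```

The resulting file opens directly in a spreadsheet, which is where the rest of the work happens.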

A second set of problems arose in trying to put tables summarizing information garnered from such a spreadsheet into a piece on our very own MyDD.

Clearly, manually coding a table of any complexity is out of the question; and absurd, given that the table is already in machine-readable form!

I use Open Office, and, like every other spreadsheet under the sun - I suppose - it gives an option for output in HTML.

Simple, then: save the summary spreadsheet in HTML, open the HTML page, get the code (it's Ctrl + U in Firefox), top and tail (to leave only the bits between the TABLE tags) and paste into the MyDD diary form.
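The top-and-tail step could also be done in R rather than by hand. A rough sketch with base R regular expressions follows; the page content is a made-up stand-in, and the pattern only picks up the first table on the page.

```r
# Made-up stand-in for the HTML page saved from the spreadsheet.
html <- c(
  "<html><head><title>summary</title></head>",
  "<body>",
  "<table border=\"1\">",
  "<tr><td>A</td><td>B</td></tr>",
  "</table>",
  "</body></html>"
)
page <- paste(html, collapse = "\n")

# (?is) makes the match case-insensitive and lets "." span newlines;
# the lazy ".*?" stops at the first closing TABLE tag.
m <- regexpr("(?is)<table\\b.*?</table>", page, perl = TRUE)
table_html <- regmatches(page, m)

cat(table_html, "\n")
```

For a real file, `readLines` on the saved HTML page would take the place of the hand-built `html` vector.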

Not so simple: Scoop (or its current implementation) doesn't read half the tags in Open Office HTML. It needs something more primitive.

Workaround time: I open the OO HTML file in MS Word, and save it as Rich Text (RTF); I open the RTF file in Arachnophilia 4.0 (later versions aren't primitive enough!) to use its RTF-HTML converter.

I then take the HTML code thus produced, top and tail it as above - and paste that into the diary form.

As long as the form is operating in HTML Formatted and not in Auto Format mode, the table will reproduce more or less as it should.

(It may need tinkering with the TABLE attributes - a minor matter compared to the shlep it's taken to get it thus far!)
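Another possible route, not the one used here, would be to have R write the bare markup directly from a summary table, using nothing beyond TABLE, TR, TH and TD tags. A sketch under that assumption, with a made-up summary table and a hypothetical helper `make_row`; whether Scoop accepts the result is untested.

```r
# Made-up summary table standing in for the spreadsheet contents.
summary_tbl <- data.frame(
  vote = c("RCV A", "RCV B"),
  yea  = c(10, 20),
  nay  = c(5, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wrap each cell in the given tag, and the row in TR.
make_row <- function(cells, tag) {
  inner <- paste0("<", tag, ">", cells, "</", tag, ">", collapse = "")
  paste0("<TR>", inner, "</TR>")
}

header <- make_row(names(summary_tbl), "TH")

# One TR per data row; each cell is converted to text individually.
body <- vapply(seq_len(nrow(summary_tbl)), function(i) {
  cells <- vapply(summary_tbl[i, ], as.character, character(1))
  make_row(cells, "TD")
}, character(1))

table_html <- paste(c("<TABLE BORDER=1>", header, body, "</TABLE>"),
                    collapse = "\n")
cat(table_html, "\n")
```

Cell text is not HTML-escaped in this sketch, so any cell containing `<` or `&` would need handling first.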

I should say I was stuck for ages trying to find a way to deal with the too-big-for-spreadsheets roll call dataset.

Bear in mind that, apart from the subsetting, I've done absolutely no work on the data within R - it's all been on the spreadsheet.

Crazy...

(This is more by way of an aide-memoire for the next time I need to do something like this. But if it's helpful to anyone else - bonus.)





Re: Data manipulation - nice explanation (none / 0)

Wow.  I don't use R, but can see why you went this direction, although I would think that a relational database would make your life a lot easier...?

Meanwhile, nothing beats a quick summary, and there's no doubt in my mind that you probably are grasping some relationships that others of us are missing b/c we're not playing with this dataset.  

Best of luck to you -- be interesting to see where you go with this ;-)


by readerOfTeaLeaves on Tue Nov 07, 2006 at 03:23:58 PM EST

Re: Data manipulation - nice explanation (none / 0)

I tend to think you're right that using database software would be easier.

Unfortunately, whilst I have just enough experience with spreadsheets to do relatively simple stuff with RCV info and the like, I have zero experience with databases. So far, at least!

I've got one or two ideas for further pieces: the real killer, I find, is to call a piece Part I: then, it's virtually guaranteed to be the first and last...


by skeptic06 on Wed Nov 08, 2006 at 06:33:07 PM EST

Re: Data manipulation for political stats: a tech (none / 0)

Yeah, I'd have to agree that using a database is a better solution.  I recommend mysql: it's also free, and has a limit of "only" 3398 columns.


end the occupation of Iraq
by aip on Tue Nov 07, 2006 at 03:31:55 PM EST

Re: Data manipulation (none / 0)

Ah? I've never counted the columns ;-)

But I agree with your recommendation ;-)


by readerOfTeaLeaves on Tue Nov 07, 2006 at 04:40:31 PM EST

Re: Data manipulation (none / 0)

I "cheated" and just looked it up.


end the occupation of Iraq
by aip on Wed Nov 08, 2006 at 12:13:22 AM EST

