My take on an R introduction talk

Author Scott Chamberlain

UPDATE: I put in an R tutorial as a GitHub gist below.

Here is a short intro R talk I gave today...for what it's worth...

R Introduction


Posted in openaccess Methods sciencetalks R


09 Sep 2011

A Data Visualization Book

Author Scott Chamberlain

Note: thanks to Scott for inviting me to contribute to the Recology blog despite being an ecology outsider; my work is primarily in atomic physics. -Pascal

A part of me has always liked thinking about how to effectively present information, but until the past year, I had not read much to support my (idle) interest in information visualization. That changed in the spring when I read Edward Tufte's The Visual Display of Quantitative Information, a book that stimulated me to think more deeply about presenting information. I originally started with a specific task in mind--a wonderful tool for focusing one's interests--but quickly found that Tufte's book was less a practical guide and more a list of general design principles. Then, a few months ago, I stumbled upon Nathan Yau's blog, FlowingData, and found out he was writing a practical guide to design and visualization. Conveniently enough for me, Yau's book, Visualize This, would be released within a month of my discovery of his blog; what follows are my impressions of Visualize This.

I like Visualize This a lot. Yau writes with much the same informal tone as on his blog, and the layout is visually pleasing (good thing, too, for a book about visualizing information!). The first few chapters are pretty basic if you have done much data manipulation before, but it is really nice to have something laid out so concisely. The examples are good, too, in that he is very explicit about every step: you never have to guess at a missing step. The author even acknowledges in the introduction that the first part of the book is at an introductory level.

Early in the book, Yau discusses where to obtain data. This compilation of sources is potentially a useful reference for someone, like me, who almost always generates his own data in the lab. Unfortunately, Yau does not talk much about preparation of (or best practices for) your own data. Additionally, from the perspective of a practicing scientist, it would have been nice to hear about how to archive data to make sure it is readable far into the future, but that is probably outside the scope of the book.

Yau is a big proponent of open source software for getting and analyzing data (e.g. Python and R), but he is surprisingly attached to the proprietary Adobe Illustrator for turning figures into presentation-quality graphics. He says that the default options in most analysis programs do not make for very good graphics (and he is right), but he does not really acknowledge that you can generate nice output if you go beyond the default settings. For me, the primary advantage of generating output programmatically is that it is easy to regenerate when you need to change the data or the formatting of the plot. Using a graphical user interface, as in Adobe Illustrator, is nice if you are only doing something once (how often does that happen?), but when you have to regenerate the darn figure fifty times to satisfy your advisor, it gets tedious to move things around pixel by pixel.
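
As a rough illustration of that regeneration workflow (a sketch only, not an example from the book), the figure can be wrapped in a small R function so that any change to the data or the styling is a single call away. The function name make_figure, the axis labels, the output file name, and the toy data are all hypothetical.

```r
# Hypothetical helper: writes the same scatter plot to a PDF each time it is called.
make_figure <- function(data, file, point_col = "steelblue", main = "Signal vs. time") {
  pdf(file, width = 6, height = 4)   # open a PDF graphics device
  on.exit(dev.off())                 # close the device when the function exits
  plot(data$x, data$y,
       pch = 19, col = point_col,
       xlab = "Time (s)", ylab = "Signal (arb. units)",
       main = main)
  invisible(file)                    # return the path without printing it
}

# Toy data, made up for illustration.
toy <- data.frame(x = 1:10, y = (1:10)^2)

out <- file.path(tempdir(), "figure.pdf")
make_figure(toy, out)                       # first version
make_figure(toy, out, point_col = "black")  # restyled version: one call, no manual editing
```

Because the styling lives in function arguments rather than in hand edits, the fiftieth revision costs the same as the first.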

By the time I reached the middle chapters, I started finding many of the details to be repetitive. Part of this repetition stems from the fact that Yau divides these chapters by the type of visualization. For example, "Visualizing Proportions" and "Visualizing Relationships" are two of the chapter titles. While I think these distinctions are important ones for telling the right story about one's data, creating figures for the different data types often boils down to choosing different functions in R or Python. People with less analysis and presentation experience should find the repetition helpful, but I increasingly skimmed these sections as I went along.

Working through Yau's examples for steps you do not already know would probably be the most useful way of getting something out of the book. So, for example, I started trying to use Python to scrape data from a webpage, something I had not previously done. I followed the book's example of this data-scraping just fine, but as with most things in programming, you find all sorts of minor hurdles to clear when you try your own thing. In my case, I am re-learning the Python I briefly learned about 10 years ago--partly in anticipation of not having access to Matlab licenses once I vacate the academy--since I have forgotten a lot of the syntax. A lot of this stuff would be faster if I were working in Matlab, which I grew more familiar with in graduate school.

Overall, Visualize This is a really nice looking book and will continue to be useful to me as a reference. Yau concludes his book with a refreshing reminder to provide context for the data we present. This advice is particularly relevant when presenting to a wider or lay audience, but it is still important for us, as scientists, to clearly communicate our findings in the literature. Patterns in the data are not often self-evident, and therefore we should think carefully about which visualization tools will best convey the meaning of our results.

Edited to add a link to Visualize This here and in the introductory paragraph.


08 Sep 2011

FigShare Talk

Author Scott Chamberlain

FigShare - I very much like this idea of a place to put your data online that is NOT published. Dryad is a nice place for datasets linked with published papers, but there isn't really a place for datasets that did not make the cut for a published paper, and such datasets, if known to the scientific community, could potentially help resolve the "file-drawer" effect in meta-analyses. (wow, run-on sentence)

"Figshare - Why don't you publish all your research?" Mark Hahnel Imperial College London from London Biogeeks on Vimeo.

Posted in openaccess vimeo sciencetalks


08 Sep 2011

Jonathan Eisen on the Fourth Domain and Open Science

Author Scott Chamberlain

Stalking the Fourth Domain with Jonathan Eisen, Ph.D. from mendelspod on Vimeo.

Posted in vimeo Evolution Phylogenetics sciencetalks


06 Sep 2011

rnpn: An R interface for the National Phenology Network

Author Scott Chamberlain

The team at rOpenSci and I have been working on a wrapper for the USA National Phenology Network API. The following is a demo of some of the current possibilities. We will have more functions down the road. Get the publicly available code, and contribute, at GitHub here. If you try this out, look at the DESCRIPTION file for the R packages required to run rnpn. Let us know at GitHub (here), at our website http://ropensci.org/, in the comments below, or on Twitter (@rOpenSci) what use cases you would like to see with the rnpn package.

Method and demo of each:
Get observations for species by day
From the documentation: "This function will return a list of species, containing all the dates which observations were made about the species, and a count of the number of such observations made on that date."

#### Note, the data below is truncated for blogging brevity...

> getobsspbyday(c(1, 2), '2008-01-01', '2011-12-31') # Searched for species 1 and 2 from Jan 1, 2008 to Dec 31, 2011
          date count   species
1   2009-03-08     2 species 1
2   2009-03-15     1 species 1
3   2009-03-22     1 species 1
4   2009-03-24     1 species 1
5   2009-03-26     1 species 1
6   2009-04-17     1 species 1
7   2009-04-24     1 species 1
8   2009-05-12     1 species 1
9   2009-05-20     1 species 1
10  2009-11-24     1 species 1
11  2009-12-07     1 species 1
12  2010-01-18     1 species 1
13  2010-01-23     1 species 1
62  2011-05-29     1 species 1
63  2011-06-27     1 species 1
64  2011-06-30     2 species 1
65  2009-03-17     1 species 2
66  2009-04-03     3 species 2
67  2009-04-05     3 species 2
68  2009-04-10     3 species 2
69  2009-04-17     3 species 2
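
Once the result is in hand, it can be summarized with base R. The sketch below assumes the function returns a data frame with the columns date, count, and species as printed above; that is an assumption about the return type, and the obs data frame here is just a handful of rows copied by hand from the truncated output so the example runs without calling the API.

```r
# A few rows copied by hand from the truncated output above (not a live API call).
obs <- data.frame(
  date    = c("2009-03-08", "2009-03-15", "2011-06-30", "2009-03-17", "2009-04-03"),
  count   = c(2, 1, 2, 1, 3),
  species = c("species 1", "species 1", "species 1", "species 2", "species 2"),
  stringsAsFactors = FALSE
)

# Parse the ISO-formatted dates so they sort and compare as dates, not strings.
obs$date <- as.Date(obs$date)

# Total number of observations per species.
totals <- aggregate(count ~ species, data = obs, FUN = sum)

# First and last observation date per species.
date_range <- do.call(rbind, lapply(split(obs, obs$species), function(d) {
  data.frame(species = d$species[1], first = min(d$date), last = max(d$date))
}))

print(totals)
print(date_range)
```

With the full result in place of the hand-copied rows, the same two summaries give a quick check on how much data each species has and over what period.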

Get individuals at specific stations
From the documentation: "This function returns all of the individuals at a series of stations."

> getindsatstations(c(507, 523)) # Searched for any individuals at stations 507 and 523
   individual_id individual_name species_id kingdom
1           1200         dogwood         12 Plantae
2           1197    purple lilac         36 Plantae
3           1193         white t         38 Plantae
4           3569     forsythia-1         73 Plantae
5           1206            jack        150 Plantae
6           1199      trout lily        161 Plantae
7           1198           dandy        189 Plantae
8           1192           red t        192 Plantae
9           1710    common lilac         36 Plantae
10          1711  common lilac 2         36 Plantae
11          1712       dandelion        189 Plantae
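
A table like this is easy to tally by species. The sketch below rebuilds the printed rows by hand as a data frame named inds (a hypothetical name, and an assumption that the function returns a data frame with these columns), then counts individuals per species_id.

```r
# Rows copied by hand from the output above (kingdom column omitted for brevity).
inds <- data.frame(
  individual_id   = c(1200, 1197, 1193, 3569, 1206, 1199, 1198, 1192, 1710, 1711, 1712),
  individual_name = c("dogwood", "purple lilac", "white t", "forsythia-1", "jack",
                      "trout lily", "dandy", "red t", "common lilac", "common lilac 2",
                      "dandelion"),
  species_id      = c(12, 36, 38, 73, 150, 161, 189, 192, 36, 36, 189),
  stringsAsFactors = FALSE
)

# Number of individuals per species_id, most common first.
per_species <- sort(table(inds$species_id), decreasing = TRUE)

# Names of the individuals recorded under species_id 36.
species_36 <- inds$individual_name[inds$species_id == 36]

print(per_species)
print(species_36)
```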

Get individuals of species at stations
From the documentation: "This function will return a list of all the individuals, which are members of a species, among any number of stations."

> getindspatstations(35, c(60, 259), 2009)  # Search for individuals of species 35 at stations 60 and 259 in year 2009
  individual_id individual_name number_observations
1          1715            west                   5
2          1716            east                   5

Get comment associated with a particular observation
From the documentation: "This function will return the comment associated with a particular observation."

> getobscomm(1938) # The comment for observation number 1938

$observation_comment
[1] "some lower branches are bare"
