My take on an R introduction talk

Author  Scott Chamberlain

UPDATE: I put in an R tutorial as a Github gist below.


Here is a short intro R talk I gave today...for what it's worth...



Posted in  openaccess Methods sciencetalks R


A Data Visualization Book

Author  Scott Chamberlain

Note: thanks to Scott for inviting me to contribute to the Recology blog despite being an ecology outsider; my work is primarily in atomic physics. -Pascal

A part of me has always liked thinking about how to effectively present information, but until the past year, I had not read much to support my (idle) interest in information visualization. That changed in the spring when I read Edward Tufte's The Visual Display of Quantitative Information, a book that stimulated me to think more deeply about presenting information. I originally started with a specific task in mind--a wonderful tool for focusing one's interests--but quickly found that Tufte's book was less a practical guide and more a list of general design principles. Then, a few months ago, I stumbled upon Nathan Yau's blog, FlowingData, and found out he was writing a practical guide to design and visualization. Conveniently enough for me, Yau's book, Visualize This, would be released within a month of my discovery of his blog; what follows are my impressions of Visualize This.

I like Visualize This a lot. Yau writes with much the same informal tone as on his blog, and the layout is visually pleasing (good thing, too, for a book about visualizing information!). The first few chapters are pretty basic if you have done much data manipulation before, but it is really nice to have something laid out so concisely. The examples are good, too, in that he is very explicit about every step: you never have to guess at a missing step. The author even acknowledges in the introduction that the first part of the book is at an introductory level.

Early in the book, Yau discusses where to obtain data. This compilation of sources is potentially a useful reference for someone, like me, who almost always generates his own data in the lab. Unfortunately, Yau does not talk much about preparation of (or best practices for) your own data.  Additionally, from the perspective of a practicing scientist, it would have been nice to hear about how to archive data to make sure it is readable far into the future, but that is probably outside the scope of the book.

Yau seems really big into using open source software for getting and analyzing data (e.g., Python and R), but he is surprisingly attached to the proprietary Adobe Illustrator for turning figures into presentation-quality graphics. He says that the default options in most analysis programs do not make for very good graphics (and he is right), but he does not really acknowledge that you can generate nice output if you go beyond the default settings. For me, the primary advantage of generating output programmatically is that it is easy to regenerate when you need to change the data or the formatting on the plot. Using a graphical user interface, like in Adobe Illustrator, is nice if you are only doing something once (how often does that happen?), but when you have to regenerate the darn figure fifty times to satisfy your advisor, it gets tedious to move things around pixel by pixel.
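To make the point about regenerating figures concrete, here is a minimal sketch in base R of a scripted figure that goes beyond the default settings. The data, the function name make_figure, and the output file name are all made up for illustration; none of this comes from Yau's book.

```r
# Hypothetical data standing in for lab measurements
set.seed(1)
measurements <- data.frame(x = 1:20, y = cumsum(rnorm(20)))

# make_figure is an illustrative helper, not a function from any package.
# All formatting choices live in one place, so changing the data or the
# styling only means re-running the script.
make_figure <- function(data, file, point_col = "steelblue",
                        width = 5, height = 4) {
  pdf(file, width = width, height = height)  # open a PDF graphics device
  on.exit(dev.off())                         # always close the device
  par(mar = c(4, 4, 1, 1), las = 1, bty = "l")  # tighter margins, horizontal tick labels, L-shaped box
  plot(data$x, data$y, type = "b", pch = 19, col = point_col,
       xlab = "Time (arbitrary units)", ylab = "Signal (arbitrary units)")
  invisible(file)
}

out_file <- make_figure(measurements, file.path(tempdir(), "figure1.pdf"))
```

If the advisor wants red points or a wider figure, that is a one-argument change (for example, point_col = "firebrick") followed by re-running the script, rather than another round of nudging things around by hand.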

By the time I reached the middle chapters, I started finding many of the details to be repetitive. Part of this repetition stems from the fact that Yau divides these chapters by the type of visualization. For example, "Visualizing Proportions" and "Visualizing Relationships" are two of the chapter titles. While I think these distinctions are important ones for telling the right story about one's data, creating figures for the different data types often boils down to choosing different functions in R or Python. People with less analysis and presentation experience should find the repetition helpful, but I increasingly skimmed these sections as I went along.  

Working through Yau's examples for steps you do not already know would probably be the most useful way of getting something out of the book. So, for example, I started trying to use Python to scrape data from a webpage, something I had not previously done. I followed the book's example of this data-scraping just fine, but as with most things in programming, you find all sorts of minor hurdles to clear when you try your own thing. In my case, I am re-learning the Python I briefly learned about 10 years ago--partly in anticipation of not having access to Matlab licenses once I vacate the academy--since I have forgotten a lot of the syntax. A lot of this stuff would be faster if I were working in Matlab, which I grew more familiar with in graduate school.

Overall, Visualize This is a really nice looking book and will continue to be useful to me as a reference. Yau concludes his book with a refreshing reminder to provide context for the data we present. This advice is particularly relevant when presenting to a wider or lay audience, but it is still important for us, as scientists, to clearly communicate our findings in the literature. Patterns in the data are not often self-evident, and therefore we should think carefully about which visualization tools will best convey the meaning of our results.

Edited to add a link to Visualize This here and in the introductory paragraph.



FigShare Talk

Author  Scott Chamberlain

FigShare - I very much like this idea of a place to put your data online that is NOT published. Dryad is a nice place for datasets linked with published papers, but there isn't really a place for datasets that did not make the cut for a published paper. If such datasets were known to the scientific community, they could potentially help resolve the "file-drawer" effect in meta-analyses.

 
"Figshare - Why don't you publish all your research?" Mark Hahnel Imperial College London from London Biogeeks on Vimeo.

Posted in  openaccess vime sciencetalks



rnpn: An R interface for the National Phenology Network

Author  Scott Chamberlain

The team at rOpenSci and I have been working on a wrapper for the USA National Phenology Network API. The following is a demo of some of the current possibilities. We will have more functions down the road. Get the publicly available code, and contribute, at GitHub here. If you try this out, look at the DESCRIPTION file for the R packages required to run rnpn. Let us know on GitHub (here), at our website http://ropensci.org/, in the comments below, or on Twitter (@rOpenSci) what use cases you would like to see with the rnpn package.

Method and demo of each:
Get observations for species by day
From the documentation: "This function will return a list of species, containing all the dates which observations were made about the species, and a count of the number of such observations made on that date."

#### Note, the data below is truncated for blogging brevity...

> getobsspbyday(c(1, 2), '2008-01-01', '2011-12-31') # Searched for species 1 and 2 from Jan 1, 2008 to Dec 31, 2011
         date count   species
1  2009-03-08     2 species 1
2  2009-03-15     1 species 1
3  2009-03-22     1 species 1
4  2009-03-24     1 species 1
5  2009-03-26     1 species 1
6  2009-04-17     1 species 1
7  2009-04-24     1 species 1
8  2009-05-12     1 species 1
9  2009-05-20     1 species 1
10 2009-11-24     1 species 1
11 2009-12-07     1 species 1
12 2010-01-18     1 species 1
13 2010-01-23     1 species 1
...
62 2011-05-29     1 species 1
63 2011-06-27     1 species 1
64 2011-06-30     2 species 1
65 2009-03-17     1 species 2
66 2009-04-03     3 species 2
67 2009-04-05     3 species 2
68 2009-04-10     3 species 2
69 2009-04-17     3 species 2
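
Once a result like this is in hand, ordinary data frame tools apply. The sketch below assumes the result is a data frame with date, count, and species columns like the output shown (an assumption about the return type, not something checked here). It rebuilds five of the printed rows by hand rather than calling the API, so the totals describe only those five rows.

```r
# Five rows copied from the printed output (rows 1, 2, 64, 65, 66)
obs <- data.frame(
  date = as.Date(c("2009-03-08", "2009-03-15", "2011-06-30",
                   "2009-03-17", "2009-04-03")),
  count = c(2, 1, 2, 1, 3),
  species = c("species 1", "species 1", "species 1",
              "species 2", "species 2"),
  stringsAsFactors = FALSE
)

# Total number of observations per species
totals <- aggregate(count ~ species, data = obs, FUN = sum)

# Same thing, broken down by calendar year
obs$year <- format(obs$date, "%Y")
by_year <- aggregate(count ~ species + year, data = obs, FUN = sum)

print(totals)
print(by_year)
```

For these five rows, the totals come out to 5 for species 1 and 4 for species 2.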






Get individuals at specific stations
From the documentation: "This function returns all of the individuals at a series of stations."

> getindsatstations(c(507, 523)) # Searched for any individuals at stations 507 and 523
   individual_id individual_name species_id kingdom
1           1200         dogwood         12 Plantae
2           1197    purple lilac         36 Plantae
3           1193         white t         38 Plantae
4           3569     forsythia-1         73 Plantae
5           1206            jack        150 Plantae
6           1199      trout lily        161 Plantae
7           1198           dandy        189 Plantae
8           1192           red t        192 Plantae
9           1710    common lilac         36 Plantae
10          1711  common lilac 2         36 Plantae
11          1712       dandelion        189 Plantae
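
A quick follow-up question is how many individuals of each species are being tracked across those stations. The sketch below is a hedged illustration: it types in the eleven rows printed above by hand (it does not call the API) and assumes the columns are named as shown.

```r
# Rows copied from the printed output above
inds <- data.frame(
  individual_id = c(1200, 1197, 1193, 3569, 1206, 1199,
                    1198, 1192, 1710, 1711, 1712),
  individual_name = c("dogwood", "purple lilac", "white t", "forsythia-1",
                      "jack", "trout lily", "dandy", "red t",
                      "common lilac", "common lilac 2", "dandelion"),
  species_id = c(12, 36, 38, 73, 150, 161, 189, 192, 36, 36, 189),
  stringsAsFactors = FALSE
)

# Count individuals per species_id, most common first
per_species <- sort(table(inds$species_id), decreasing = TRUE)
print(per_species)
```

Here species 36 (the lilacs) has three individuals and species 189 has two; every other species appears once.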







Get individuals of species at stations
From the documentation: "This function will return a list of all the individuals, which are members of a species, among  any number of stations."

> getindspatstations(35, c(60, 259), 2009)  # Search for individuals of species 35 at stations 60 and 259 in year 2009
  individual_id individual_name number_observations
1          1715            west                   5
2          1716            east                   5









Get the comment associated with a particular observation
From the documentation: "This function will return the comment associated with a particular observation."

> getobscomm(1938) # The comment for observation number 1938
$observation_comment
[1] "some lower branches are bare"

Posted in  ropensci API Ecology R Datasets

