Blogging/tweeting from #ESA11

Author  Scott Chamberlain

I will be blogging about the upcoming Ecological Society of America meeting in Austin, TX, focusing on talks/posters that:

  1. Have taken a cool approach to using data, or
  2. Have focused on open science/data, or
  3. Have done something cool with R software, or
  4. Are just exciting in general

I will also tweet throughout the meeting from @recology_ (yes, the underscore is part of the name; recology was already taken).

The hashtag for the meeting this year is #ESA11.

Posted in  openaccess Meetings Ecology sciencetalks


Models in Evolutionary Ecology seminar, organized by Timothee Poisot

Author  Scott Chamberlain

Here is one of the talks, by Thomas Broquet; see the rest here.

[Video: Thomas Broquet, by mez_en_video]

Posted in  Meetings Evolution Ecology sciencetalks


Archiving ecology/evolution data sets online

Author  Scott Chamberlain


We now have many options for archiving data sets online:

Dryad, KNB, Ecological Archives, Ecology Data Papers, Ecological Data, etc.

However, as far as I know these portals largely do not communicate with one another, and there is no way to search across all of the data set sources at once. So I wonder whether finding all these different data sets would be easier if these sites had their data sets cloned on, or linked from, a site like Infochimps. Infochimps already has an API (and there's an R wrapper for it by Drew Conway: http://cran.r-project.org/web/packages/infochimps/index.html), discussion threads, and so on.
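As a rough sketch, here is what querying a JSON web API like this from R could look like, using RCurl and rjson. The URL and parameters below are placeholders for illustration only, not the real Infochimps endpoints; see their API docs, or Drew Conway's package, for the actual interface.

library(RCurl)
library(rjson)

# hypothetical endpoint and parameters -- illustration only, not the real API
search_datasets <- function(term, apikey) {
  url <- paste("http://api.example.com/search?query=", term,
               "&apikey=", apikey, sep = "")
  fromJSON(getURL(url))  # fetch the response and parse the JSON into an R list
}

# res <- search_datasets("phenology", "YOUR_API_KEY")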

Does it make sense to post data sets linked to published works on Infochimps? Probably not, now that I think about it. But perhaps it makes sense for other data sets, or subsets of data sets not linked to published works, to be posted there, since at least Dryad only accepts data sets tied to published papers.

One use case: someone tweeted recently that his students were excited about putting their data sets on their resumes/CVs, but he didn't think there was anywhere to deposit them without the precondition that the data set be linked to a published work. This seems like a good opportunity to place those data sets on Infochimps; at least they would then be available somewhere many people already search for data sets.

What I think would be ideal is if Dryad, KNB, etc. linked their data sets to Infochimps, where they could be discovered; users could then either get them from Infochimps or, if need be, from the original site (Dryad, for example). At least you could then search across all ecological data sets in one place.

Posted in  openaccess Datasets


CRdata vs. Cloudnumbers

Author  Scott Chamberlain

Cloudnumbers and CRdata are two new cloud computing services.


I tested the two services with a very simple script. The script creates a data frame of 10000 numbers drawn with rnorm and assigns them to a factor with one of two levels (a or b); I then take the mean of each factor level with the aggregate function.


In CRdata you need extra code to format the output for a browser window. For example, the last line of the script below wraps the output object in <crdata_object> tags so it can be rendered in a browser, and the same goes for anything else you would normally print to the console. With Cloudnumbers you don't need any such extra code.

 
# simulate 10000 values, assigning each to one of two groups, a or b
dat <- data.frame(n = rnorm(10000), p = rep(c('a','b'), each = 5000))

# mean of n within each level of p
out <- aggregate(n ~ p, data = dat, mean)

# CRdata-specific markup: wrap the output object so it renders in the browser
#<crdata_object>out</crdata_object>
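For reference, running those three lines locally returns a two-row data frame from aggregate; the exact means will differ with each random draw, but the output looks something like this:

  p           n
1 a  0.01356657
2 b -0.00916391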


Here is a screenshot of the output from CRdata with the simple script above.

This simple script ran in about 20 seconds or so from starting the job to finishing. However, it seems the only output option is HTML. Can that be right? If so, HTML as the sole output format is a serious limitation.


In Cloudnumbers there are more steps:

  1. Start a workspace and upload your R code file.
  2. Start a session...
  3. Choose your software platform...
  4. Choose packages (one at a time; very slow)...
  5. Choose the number of clusters, etc.
  6. Finally, start the job.
  7. It initializes, and then you can open the console.
  8. From there it is like running R as you normally would, except on the web.


Who wins (at least for our very minimal example above)?

  1. Speed of the entire process (not just running code): CRdata
  2. Ease of use: CRdata
  3. Cost: CRdata (it's free)
  4. Least annoying: Cloudnumbers (you don't have to add extra code to run your own code)
  5. Open source: CRdata (you can use publicly available code on the site)
  6. Long-term use: Cloudnumbers (more powerful, flexible, etc.)

I imagine Cloudnumbers could be faster for larger jobs, but of course you would have to pay for that speed.

What I really want to see is a cloud computing service that accepts code sent directly from R or RStudio. Hmmm...that would be so tasty indeed. I think Cloudnumbers may be able to do this, but I haven't tested it yet.

Perhaps using the server version of RStudio along with Amazon's EC2 is a better option than both of these. See Karthik Ram's post about using RStudio Server along with Amazon's EC2. Even just running RStudio Server on your Ubuntu machine or virtual machine is a pretty cool option, even without EC2 (it works like a charm on my Parallels Ubuntu VM on my Mac).

Posted in  cloudcomputing R


rbold: An R interface for the BOLD Systems barcode repository

Author  Scott Chamberlain

Have you ever wanted to search and fetch barcode data from BOLD Systems?

I am developing functions to interface with BOLD from R. I have just started, but hopefully folks will find them useful.

The code is at GitHub here. The two functions are still very buggy, so please bring up issues below or in the Issues area on GitHub. For example, some searches work while other, similar searches don't. Apologies in advance for the bugs.

As an example, the getsampleids function gets barcode identifiers for specimens; you can then use the getseqs function to grab barcode data for one or many of those specimens.
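Here is a minimal sketch of what a session could look like. The argument names below are illustrative guesses, not necessarily the real function signatures; check the code on GitHub for the actual interface.

# after sourcing the two functions from the GitHub repository...

# get barcode identifiers for specimens matching a taxon (hypothetical argument)
ids <- getsampleids("Osmia")

# then fetch barcode sequence data for those specimens (hypothetical argument)
seqs <- getseqs(ids)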

Posted in  openaccess API R Datasets

