Cleaning species names with R I: taxonstand

4/11/2014

Some biologists must spend dozens of hours of their lives staring at the screen, polishing up species names. It's important to get it right, you know. A simple spelling error in a species name and you have a mess: your data won't merge with another database. Or even worse: once published, the spelling error will be quickly sniffed out by many generations of taxonomically savvy peers.

For my first thesis, I spent hours comparing taxonomy and spelling with books and online databases. Luckily those days are over, thanks to some recent R packages. Take for example the taxonstand package in R.

Written by Luis Cayuela and Jari Oksanen (see this publication), taxonstand was developed for those of us who work with vegetation databases. It performs taxonomic standardization and checks spelling for vascular plant and bryophyte names by connecting to The Plant List (TPL).

Installation: taxonstand is available on CRAN and can be installed directly from R (note the capital T in the package name, R is case-sensitive):
> install.packages("Taxonstand")

Data cleaning with taxonstand

Let's create a short taxon string.
To keep things simple, we omit any typos and author names:

 > spec1 <- c("Stipa capillata",
+            "Salvia tortuosa",
+            "Bryum capillare")

Load taxonstand and use the TPL() function:

> library(Taxonstand)
> TPL(spec1)

## [1] "0 substitutions of standard annotations in specific epithet"
## [1] "0 substitutions of subsp. and var. in infraspecific epithet"

##    Genus   Species Abbrev Infraspecific Plant.Name.Index TPL_version
## 1  Stipa capillata   <NA>                           TRUE         1.1
## 2 Salvia  tortuosa   <NA>                           TRUE         1.1
## 3  Bryum capillare   <NA>                           TRUE         1.1
##   Taxonomic.status    Family    New.Genus New.Species New.Infraspecific
## 1         Accepted   Poaceae        Stipa   capillata                  
## 2         Accepted Lamiaceae       Salvia    tortuosa                  
## 3          Synonym  Bryaceae Ptychostomum   capillare                  
##                             Authority  Typo WFormat
## 1                                  L. FALSE   FALSE
## 2                               Kunth FALSE   FALSE
## 3 (Hedw.) D. T. Holyoak & N. Pedersen FALSE   FALSE

In my first string, taxonstand recognized all names and split them successfully into genus and specific epithet. It also provided matching family names, which is a nice feature. According to The Plant List (TPL), 'Bryum capillare' is a synonym, and the accepted name should be 'Ptychostomum capillare'.
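If you only want to see which of your names TPL treats as synonyms, you can subset the result on Taxonomic.status. Here is a minimal sketch. It does not call TPL() itself: the data frame res is typed in by hand from the output above (only five of its columns), so that it runs without an internet connection. With a real result you would start from res <- TPL(spec1) instead.

```r
# Hand-typed copy of five columns from the TPL(spec1) output above
# (an assumption for illustration; normally: res <- TPL(spec1))
res <- data.frame(
  Genus            = c("Stipa", "Salvia", "Bryum"),
  Species          = c("capillata", "tortuosa", "capillare"),
  Taxonomic.status = c("Accepted", "Accepted", "Synonym"),
  New.Genus        = c("Stipa", "Salvia", "Ptychostomum"),
  New.Species      = c("capillata", "tortuosa", "capillare"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose submitted name is a synonym
syn <- res[res$Taxonomic.status == "Synonym", ]

# Report submitted name -> accepted name
changes <- paste(syn$Genus, syn$Species, "->", syn$New.Genus, syn$New.Species)
changes
## [1] "Bryum capillare -> Ptychostomum capillare"
```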

Now, let's make things a little bit more complicated by introducing an (unnecessary) author name and spelling mistakes, both in the genus name ('Saalvia' and 'Bryuum') and in the epithet ('tortuuosa').

> spec2 <- c("Stipa capillata L.",
+            "Salvia tortuuosa",
+            "Saalvia tortuosa",
+            "Bryuum capillare")

> library(Taxonstand)
> TPL(spec2, corr=TRUE)

## [1] "0 substitutions of standard annotations in specific epithet"
## [1] "0 substitutions of subsp. and var. in infraspecific epithet"

##     Genus   Species Abbrev Infraspecific Plant.Name.Index TPL_version
## 1   Stipa capillata   <NA>            L.             TRUE         1.1
## 2  Salvia tortuuosa   <NA>                           TRUE         1.1
## 3 Saalvia  tortuosa   <NA>                          FALSE         1.1
## 4  Bryuum capillare   <NA>                          FALSE         1.1
##   Taxonomic.status    Family New.Genus New.Species New.Infraspecific
## 1         Accepted   Poaceae     Stipa   capillata                  
## 2         Accepted Lamiaceae    Salvia    tortuosa                  
## 3                              Saalvia    tortuosa                NA
## 4                               Bryuum   capillare                NA
##   Authority  Typo WFormat
## 1        L. FALSE   FALSE
## 2     Kunth  TRUE   FALSE
## 3           FALSE   FALSE
## 4           FALSE   FALSE

In my second string, taxonstand successfully removed an author name (in 'Stipa capillata L.'). Interestingly, it saved the author name as an 'Infraspecific' value.

However, it only recognized one of my three typos. My typo in the species epithet was corrected (in 'tortuuosa'). However, the two typos in the genus name remained unrecognized. In this case, taxonstand just recycled the unrecognized names in columns $New.Genus and $New.Species.
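A quick way to see which names still need your attention is to look at the Plant.Name.Index and Typo columns. The following is a minimal sketch, assuming the columns mean what the output above suggests (Plant.Name.Index is TRUE when a name was found in The Plant List, Typo is TRUE when a spelling was corrected). Again the data frame is typed in by hand from the four species rows above rather than fetched with TPL().

```r
# Hand-typed copy of four columns from the TPL(spec2, corr=TRUE) output
# (an assumption for illustration; normally: res2 <- TPL(spec2, corr = TRUE))
res2 <- data.frame(
  Genus            = c("Stipa", "Salvia", "Saalvia", "Bryuum"),
  Species          = c("capillata", "tortuuosa", "tortuosa", "capillare"),
  Plant.Name.Index = c(TRUE, TRUE, FALSE, FALSE),
  Typo             = c(FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

submitted <- paste(res2$Genus, res2$Species)

# Names that were not found and need manual checking
unresolved <- submitted[!res2$Plant.Name.Index]
unresolved
## [1] "Saalvia tortuosa" "Bryuum capillare"

# Names that were found only after a spelling correction
corrected <- submitted[res2$Typo]
corrected
## [1] "Salvia tortuuosa"
```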

According to Luis Cayuela, this is not the fault of taxonstand but rather of The Plant List. Taxonstand accesses information from the front end of The Plant List. This basically means that it extracts the same information as an online user of The Plant List does. For example, if you use the online search field in The Plant List and include a spelling mistake in your species epithet, The Plant List will offer you all species names in the respective genus. Enter a spelling mistake in the genus and you will not get any hits at all.
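Since genus typos slip through, one possible workaround is to check the genus names yourself before sending them to TPL(). The sketch below is not part of taxonstand. It uses the edit distance from base R's adist() to suggest the closest genus from a reference list. The list known_genera is hypothetical (in practice you would use a genus list you trust), and the cut-off of one edit is an arbitrary choice.

```r
# Hypothetical reference list of genera you trust
known_genera <- c("Stipa", "Salvia", "Bryum")

# Genus names as submitted, two of them misspelled
submitted <- c("Stipa", "Saalvia", "Bryuum")

# Edit distance between every submitted genus (rows)
# and every reference genus (columns)
d <- utils::adist(submitted, known_genera)

# For each submitted genus, find the closest reference genus
closest  <- known_genera[apply(d, 1, which.min)]
distance <- apply(d, 1, min)

# Accept a suggestion only if it is at most one edit away (arbitrary cut-off)
suggestion <- ifelse(distance <= 1, closest, NA_character_)

data.frame(submitted, suggestion, distance)
```

Names with a distance of 0 were spelled correctly. Anything with NA as a suggestion is too far from the reference list and should be checked by hand.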

By the way, you can merge the new genus and species names by using paste:

 TPL_spec1 <- TPL(spec1)

 ## [1] "0 substitutions of standard annotations in specific epithet"
## [1] "0 substitutions of subsp. and var. in infraspecific epithet"

new.names <- paste(TPL_spec1$New.Genus, TPL_spec1$New.Species)
new.names

## [1] "Stipa capillata"        "Salvia tortuosa"       
## [3] "Ptychostomum capillare"
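The cleaned names are what you would then use as the key when merging with another database. A minimal sketch of why this matters, with two hypothetical tables (plots and traits, with made-up values) and a lookup vector typed in by hand from the names above:

```r
# Two hypothetical tables that list the same moss under different names
# (cover and height values are made up for illustration)
plots  <- data.frame(name  = c("Stipa capillata", "Bryum capillare"),
                     cover = c(40, 5),
                     stringsAsFactors = FALSE)
traits <- data.frame(name      = c("Stipa capillata", "Ptychostomum capillare"),
                     height_cm = c(60, 2),
                     stringsAsFactors = FALSE)

# Lookup: submitted name -> standardized name (typed from the output above)
lookup <- c("Stipa capillata" = "Stipa capillata",
            "Salvia tortuosa" = "Salvia tortuosa",
            "Bryum capillare" = "Ptychostomum capillare")

# Merging on the raw names silently drops the moss
raw <- merge(plots, traits, by = "name")
nrow(raw)
## [1] 1

# Replace the raw names with the standardized ones, then merge again
plots$name <- unname(lookup[plots$name])
merged <- merge(plots, traits, by = "name")
nrow(merged)
## [1] 2
```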

There are some other issues that users have reported, including infraspecific epithet matches and homonyms. Check ?TPL to learn more.

In summary, taxonstand is a nice little package that might help you to clean and standardize your taxon names. But beware of its minor limitations.
