While it is easy to get lists of threatened and endangered plants by state, it is not so easy to get them by county. The Greenwich Land Trust has tasked me with providing a list of endangered plants on its lands (over 700 acres) in lower Fairfield County, adjacent to southern Westchester County. To come up with a list of endangered plants by county, I hit on the idea of gathering the lists of endangered plants for New York, Connecticut, Massachusetts, and Rhode Island and comparing them. I found these lists on the website of the Natural Resources Conservation Service, a US government agency. By scraping the state listings on this site with Beautiful Soup and pulling out the plants that appeared on more than one list, I discovered that Rhode Island and Massachusetts each share only one endangered plant with Connecticut and none with New York, so I discarded these two states. Connecticut and New York share 31 endangered plants. Extrapolating from this, I decided that Westchester and Fairfield counties would be the likeliest place for these 31 plants to be found (or not found).
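The comparison step can be sketched roughly as below. This is a minimal illustration, not the script actually used: it assumes each state's list has already been scraped into a plain Python list of names, and the function names (`normalize`, `shared_by_pair`) and the plant names are placeholders invented for the example.

```python
from itertools import combinations


def normalize(name):
    """Collapse whitespace and case so the same name matches across lists."""
    return " ".join(name.split()).casefold()


def shared_by_pair(state_lists):
    """Return, for every pair of states, the set of names found on both lists."""
    normalized = {
        state: {normalize(n) for n in names}
        for state, names in state_lists.items()
    }
    return {
        (a, b): normalized[a] & normalized[b]
        for a, b in combinations(sorted(normalized), 2)
    }


# Placeholder data only: these are NOT real plant names or real state listings.
state_lists = {
    "CT": ["Planta alpha", "Planta beta", "Planta gamma"],
    "NY": ["planta  beta", "Planta gamma", "Planta delta"],
    "RI": ["Planta alpha", "Planta epsilon"],
}

overlaps = shared_by_pair(state_lists)
for pair, names in overlaps.items():
    print(pair, len(names), sorted(names))
```

Normalizing before intersecting matters because the same plant can be written with different spacing or capitalization on different lists. It does not help when two lists use different synonyms for the same plant.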
Once I had a list of over thirty endangered plants likely to be found in Westchester and Fairfield counties, I gathered details about these plants from several other websites. I created a Python dictionary holding the name of each plant, its unique identifier from wikidata.org, its identifier from PlantList.org, its identifier from Tropicos.org, and its URL on Gobotany.org. My professor then provided Python code to turn this into a JSON document and to let my code loop through it like a dictionary in order to pull specific information from each website. I used Beautiful Soup for the scraping. The scraping code was attached to the looping code, so that specific information from each website was pulled out for each plant in one long list. I broke up the list manually and created a page for each plant within a Kapsul website, with important and descriptive information. I had to discard seven of these plants because the name given by the Natural Resources Conservation Service was not an official name, and it was unclear which plant was meant.
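The dictionary-to-JSON-and-back pattern described above might look something like the following sketch. The field names (`wikidata_id`, `plantlist_id`, `tropicos_id`, `gobotany_url`), the helper `iter_lookups`, and every identifier value are assumptions made up for illustration. The real records and the professor's code are not reproduced here.

```python
import json

# Placeholder records: names, identifiers, and URLs are invented for illustration.
plants = {
    "Planta alpha": {
        "wikidata_id": "Q0000001",
        "plantlist_id": "placeholder-1",
        "tropicos_id": "00000001",
        "gobotany_url": "https://example.org/planta-alpha",
    },
    "Planta beta": {
        "wikidata_id": "Q0000002",
        "plantlist_id": "",  # no record found on this site
        "tropicos_id": "00000002",
        "gobotany_url": "https://example.org/planta-beta",
    },
}

# Serialize to a JSON document, then load it back as an ordinary dictionary.
text = json.dumps(plants, indent=2, sort_keys=True)
loaded = json.loads(text)


def iter_lookups(records):
    """Yield (plant name, source field, identifier) for every non-empty identifier."""
    for name, ids in records.items():
        for source, value in ids.items():
            if value:
                yield name, source, value


lookups = list(iter_lookups(loaded))
for name, source, value in lookups:
    # A real script would fetch and parse the page for this identifier here.
    print(name, source, value)
```

Keeping one record per plant, keyed by name, means the per-site scraping step only needs the identifier for that site and can skip plants for which a site has no entry.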
The most important information for any plant is its accepted botanical name. “Accepted” is a technical term, indicating that the botanical academic community has agreed that this is the proper Latin name for a plant. I chose to scrape two authoritative websites, Tropicos.org and PlantList.org. One comes out of Missouri Botanical Garden and the other out of Kew Botanical Garden. Both refer to each other and to other authoritative websites, such as those of NYBG and the International Plant Names Index (IPNI). (I know these to be authoritative, having interned at NYBG.) Only certain organizations (and hence their websites) have the authority to call a name “accepted,” and even then, not all authoritative websites give the same information. The botanical academic community continues to evolve its knowledge, and as it does so, it changes the accepted name. An “accepted” name really means what the academic community agrees, today, that the plant’s name is. An accepted name today may become a “tentative” name tomorrow, or even a synonym. In the process of scraping botanical websites, I ran into this issue all the time. What a plant’s name is remains an open question. How to look for it, and whether two websites are actually referring to the same plant, were issues I struggled with throughout. My questions were not always resolvable, because the academic community was not in agreement. There is no clear way to normalize or clean up the data from a website by using a controlled vocabulary. Controlled vocabularies do not exist in the botanical world, or if they do, the vocabulary keeps changing and hence is not so controlled.
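One way to surface the disagreements described above is to compare the accepted names reported by two sources and flag every plant where they differ or where one source has none. The sketch below is an assumption about how such a check could be written, not part of the original code. The function name `find_disagreements`, the keys, and the plant names are placeholders.

```python
def find_disagreements(source_a, source_b):
    """Return plants whose accepted name differs, or is missing, between two sources."""
    flagged = {}
    for plant in sorted(set(source_a) | set(source_b)):
        name_a = source_a.get(plant, "")
        name_b = source_b.get(plant, "")
        if not name_a or not name_b or name_a.casefold() != name_b.casefold():
            flagged[plant] = (name_a, name_b)
    return flagged


# Placeholder data: keys are list names, values are the accepted names each site reports.
site_one = {"plant-1": "Planta alpha", "plant-2": "Planta beta", "plant-3": ""}
site_two = {"plant-1": "planta alpha", "plant-2": "Planta gamma"}

needs_review = find_disagreements(site_one, site_two)
for plant, (a, b) in needs_review.items():
    print(plant, repr(a), repr(b))
```

A check like this cannot decide which source is right. It only produces the list of plants that need a person to look at them.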
The code I created to scrape botanical websites scrapes for both “accepted” names and synonyms. Because it reads the names from the websites each time it runs, it allows for these names to change and so can provide updated lists of names. It also allows for there to be no “accepted” name, in which case that space is simply left blank. It is not, however, able to manage and express all the variations of name status that some of the websites I scraped can. In managing the complexities of “tentative” name, “unresolved” name, synonym, and “accepted” name, these websites become quite idiosyncratic and increasingly hard to scrape.
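The behavior described above (keep the accepted name and the synonyms, leave the accepted slot blank when there is none, and ignore other statuses) can be illustrated with a small sketch. It assumes the scraping step has already produced `(name, status)` pairs. The function name `summarize_names`, the status strings, and the plant names are hypothetical and do not reflect how any particular website labels its records.

```python
def summarize_names(scraped):
    """Reduce (name, status) pairs to one accepted name plus a list of synonyms."""
    accepted = ""  # stays blank when no name is marked as accepted
    synonyms = []
    for name, status in scraped:
        status = status.strip().lower()
        if status == "accepted" and not accepted:
            accepted = name
        elif status == "synonym":
            synonyms.append(name)
        # Any other status (for example "unresolved" or "tentative") is
        # dropped, which is the limitation noted in the text.
    return {"accepted": accepted, "synonyms": sorted(synonyms)}


# Placeholder input: invented names and statuses, not scraped data.
with_accepted = summarize_names([
    ("Planta alpha", "Accepted"),
    ("Planta beta", "Synonym"),
    ("Planta gamma", "Unresolved"),
])
without_accepted = summarize_names([("Planta gamma", "unresolved")])

print(with_accepted)
print(without_accepted)
```

Because the summary is rebuilt from whatever the sites report on each run, a name that moves from accepted to synonym is picked up on the next scrape without changing the code.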
Through this project I came to a better understanding of some of the challenges that arise in scraping and then managing information from botanical websites. These challenges come in great part from the fact that the websites I relied upon, while authoritative, are constantly changing and not always in agreement. This means that gathering data on the naming of plants will continue to require a manual component on an ongoing basis.