
The methods used to develop Plum


Data collection and organization

Lists of substances were obtained directly from authoritative sources on the web. Bibliographic information and methodological details are available directly in Plum: URLs, dates of access, and specific procedures for obtaining the data are provided together with the data for each list.

From each source, we extracted a list of substance names, CAS registry numbers (where available), and other relevant information, and organized these data in a spreadsheet. Different data sources contained different types of information in addition to chemical identifiers (e.g. reasons for inclusion, dates of listing, and URLs of supporting documents). Controlled vocabularies were developed for naming all of these data elements.
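As a rough illustration of what a controlled vocabulary does for field names, the sketch below renames source-specific column headers to a single agreed name. Every header and field name in it is hypothetical and chosen only for the example; none is taken from Plum's actual vocabularies.

```python
# Hypothetical controlled vocabulary: source-specific column headers on the
# left, a single agreed field name on the right. None of these names come
# from Plum itself.
CONTROLLED_FIELDS = {
    "cas no.": "casrn",
    "cas number": "casrn",
    "chemical name": "name",
    "substance": "name",
    "date listed": "date_of_listing",
    "reason for inclusion": "reason_for_inclusion",
}


def normalize_record(raw):
    """Rename the keys of one source row to controlled field names.

    Unrecognized headers are kept under an 'unmapped' key rather than
    silently dropped, so they can be reviewed by hand.
    """
    record = {}
    unmapped = {}
    for header, value in raw.items():
        key = CONTROLLED_FIELDS.get(header.strip().lower())
        if key is None:
            unmapped[header] = value
        else:
            record[key] = value
    if unmapped:
        record["unmapped"] = unmapped
    return record
```

Keeping unrecognized headers visible, rather than discarding them, is a design choice of this sketch: it makes gaps in the vocabulary easy to spot.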




Establishing unique identifiers for substances

Cross-referencing several lists of substances requires each substance to have a single unique identifier. We wished to use CAS registry numbers (CASRN) as unique identifiers to the greatest extent possible. When a substance was listed without a CASRN, we took one of the following approaches:

  1. Determined the correct CASRN for that substance, as accurately as possible. We used the chemical name, structure, or other information provided by the authoritative source to identify the substance, cross-referencing with other authoritative lists and using tools such as OECD eChemPortal, USEPA ACToR, PubChem, and ChemSpider.
  2. Where we determined that the substance has no CASRN, assigned a unique identifier of our own, based on other available information (see below).
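
To show why a single identifier per substance matters for cross-referencing, here is a minimal sketch that builds an index from identifier to the lists containing it. The list names and the `LOCAL-0001` identifier are made up for the example and do not reflect Plum's data or its identifier schema.

```python
from collections import defaultdict


def cross_reference(lists):
    """Map each substance identifier to the sorted names of the lists containing it.

    `lists` maps a list name to an iterable of substance identifiers
    (a CASRN where one exists, otherwise a locally assigned identifier).
    """
    index = defaultdict(set)
    for list_name, identifiers in lists.items():
        for identifier in identifiers:
            index[identifier.strip()].add(list_name)
    return {identifier: sorted(names) for identifier, names in index.items()}


# Made-up list names and a made-up local identifier, for illustration only.
example = {
    "list_a": ["50-00-0", "7732-18-5"],
    "list_b": ["50-00-0", "LOCAL-0001"],
}
result = cross_reference(example)
```

The lookup only works if the same substance always carries the same identifier; two spellings of one substance would appear as two unrelated entries in the index.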

Using CASRN as unique identifiers for substances has some limitations.


For chemical entities that could not be identified using a CASRN, we created unique identifiers using the following schema.


Reorganizing data from authoritative sources

We organized the information from each authoritative source into a list, constructed as a spreadsheet, with the following characteristics.


Achieving this form of data organization required us to reformat or reorganize the information obtained from many of the authoritative sources. The following operations were commonly performed:


Data quality and sources of error

Some list data were edited in ways that went beyond the simple reorganizations above, primarily because of errors or insufficiencies in chemical identifiers (name, CASRN). These changes reflect our judgment of what the original source data meant. All such edits were documented in the "list methodology" metadata for each list. Common sources of error and actions taken are described below.
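
One kind of CASRN error can be caught mechanically, because the final digit of a CASRN is a check digit. It equals the sum of the other digits, each weighted by its position counted from the right, modulo 10. The sketch below applies that standard rule; it is offered as an illustration and is not a description of the procedure used to prepare Plum's lists. The function name is an assumption of the example.

```python
import re

# A CASRN has 2 to 7 digits, a hyphen, 2 digits, a hyphen, and 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(text):
    """Return True if `text` has CASRN format and a correct check digit."""
    match = CASRN_PATTERN.match(text.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit
```

For example, water's CASRN 7732-18-5 passes: 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, and 105 mod 10 = 5. A passing check digit only shows that the number is well formed; it does not show that the number belongs to the named substance.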