The methods used to develop Plum
Data collection and organization
Lists of substances were obtained directly from authoritative sources on the web. Bibliographic information and methodological details are available directly in Plum: URLs, dates of access, and specific procedures for obtaining the data are provided together with the data for each list.
From each source, we extracted a list of substance names, CAS registry numbers (where available), and other relevant information, and organized these data in a spreadsheet. Different data sources contained different types of information in addition to chemical identifiers (e.g. reasons for inclusion, dates of listing, URLs of support documents). We developed controlled vocabularies so that these data elements are named consistently.
Establishing unique identifiers for substances
Cross-referencing several lists of substances requires each unique substance to have a single unique identifier. We wished to use CAS registry numbers (CASRN) as unique identifiers to the greatest extent possible. When a substance was found to be listed without a CASRN, we did one of the following:
- Determined the correct CASRN for that substance, as accurately as possible. We used the chemical name, structure or other information provided by the authoritative source to identify the substance, cross-referencing with other authoritative lists and using tools such as OECD eChemPortal, USEPA ACToR, PubChem and ChemSpider.
- Determined that the substance has no CASRN and assigned a unique identifier of our own, based on other available information (see below).
Using CASRN as unique identifiers for substances has some limitations.
- Our data set contains substances that do not have a CASRN. These include some specific chemical compounds, but are mostly chemically undefined or biological materials.
- Our data sources list some specific preparations of substances. They also list some groups or families of substances whose members are impossible or impractical to enumerate (e.g. "Chromium compounds"). To ensure that Plum's representation of the data sources is accurate, we must regard these groups, mixtures or preparations as unique entities to the same extent that specific chemical compounds in the CAS registry are unique entities. Such entities usually do not have CAS registry numbers.
For chemical entities that could not be identified using a CASRN, we created unique identifiers using the following schema.
- Chemical compounds having a European Commission (EC) number but no CASRN. Plum identifiers for such substances are as follows: EC-XXX-XXX-X where XXX-XXX-X is the EC number. Examples:
- EC-403-250-2 A mixture of: 4-[[bis-(4-fluorophenyl)methylsilyl]methyl]-4H-1,2,4-triazole; 1-[[bis-(4-fluorophenyl)methylsilyl]methyl]-1H-1,2,4-triazole
- EC-432-750-3 O-hexyl-N-ethoxycarbonylthiocarbamate
- Groups, mixtures or derivatives related to a common or representative substance that has a CASRN. This includes compounds of an element; salts of an ion/acid/base; groups or mixtures of congeners or isomers; polymers or copolymers; preparations; formulations; phases. Plum identifiers for such substances are as follows: G-XX-XX-X-yy where XX-XX-X is the CASRN of a representative substance, and yy is a sequentially-assigned number starting at 01. Where practical, CASRN of individual components of mixtures are included in the substance name. Examples:
- G-33419-42-0-01 Etoposide [33419-42-0] in combination with cisplatin [15663-27-1] and bleomycin [11056-06-7]
- G-92-87-5-01 Salts of benzidine
- G-92-87-5-02 Benzidine-based dyes
- G-7440-38-2-01 Arsenic compounds
- G-7440-38-2-02 Arsenic compounds, inorganic
- G-7440-38-2-03 Arsenic oxides, inorganic
- Biological materials, chemically undefined materials, and proprietary materials. Plum identifiers for such substances are as follows: O-xxxx where xxxx is a sequentially-assigned number starting at 0001. Examples:
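The three identifier formats in the schema above (EC-XXX-XXX-X, G-XX-XX-X-yy and O-xxxx) can be sketched in code as follows. This is a minimal illustration, not Plum's own implementation: the function names and regular expressions are assumptions introduced here, and the patterns encode only the formats as described in the text.

```python
import re

# Hypothetical patterns for the identifier formats described in the text.
# A CASRN has 2 to 7 digits, then 2 digits, then 1 check digit.
CASRN_RE = re.compile(r"^\d{2,7}-\d{2}-\d$")
EC_ID_RE = re.compile(r"^EC-\d{3}-\d{3}-\d$")
GROUP_ID_RE = re.compile(r"^G-\d{2,7}-\d{2}-\d-\d{2}$")
OTHER_ID_RE = re.compile(r"^O-\d{4}$")


def make_ec_id(ec_number: str) -> str:
    """Identifier for a compound that has an EC number but no CASRN."""
    return f"EC-{ec_number}"


def make_group_id(representative_casrn: str, sequence: int) -> str:
    """Identifier for a group, mixture or derivative; sequence starts at 1."""
    return f"G-{representative_casrn}-{sequence:02d}"


def make_other_id(sequence: int) -> str:
    """Identifier for biological, undefined or proprietary materials."""
    return f"O-{sequence:04d}"


def classify_identifier(identifier: str) -> str:
    """Return which kind of identifier a string looks like."""
    if CASRN_RE.match(identifier):
        return "casrn"
    if EC_ID_RE.match(identifier):
        return "ec"
    if GROUP_ID_RE.match(identifier):
        return "group"
    if OTHER_ID_RE.match(identifier):
        return "other"
    return "unknown"


print(make_group_id("92-87-5", 1))            # identifier for "Salts of benzidine"
print(classify_identifier("EC-432-750-3"))
```

Because each assigned prefix (EC-, G-, O-) starts with a letter, an identifier of this kind cannot be mistaken for a CASRN, which consists only of digits and hyphens.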
Reorganizing data from authoritative sources
We organized the information from each authoritative source into a list, constructed as a spreadsheet, with the following characteristics.
- Unique identifier: Every substance on the list is identified by a unique identifier that is consistent across all lists.
- Minimum redundancy: Every unique substance on the list is identified in exactly one list entry, unless there is a meaningful reason for the same substance to be listed more than once (e.g. for different hazard traits).
- Synonyms: The names of the substances given in each list are those given by the source of that list. This allows for a natural multiplicity in chemical nomenclature for each unique substance.
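To illustrate why a consistent unique identifier matters, the sketch below cross-references several lists by identifier while keeping the names each source used. The list names (`list_a`, `list_b`), the function name and the data layout are hypothetical; the source describes spreadsheets, not this structure.

```python
from collections import defaultdict


def cross_reference(lists):
    """Index entries from several lists by unique identifier.

    `lists` maps a list name to an iterable of (identifier, name) entries.
    Returns a dict mapping each identifier to the set of lists it appears
    on and the set of names (synonyms) used for it by the sources.
    """
    index = defaultdict(lambda: {"lists": set(), "names": set()})
    for list_name, entries in lists.items():
        for identifier, name in entries:
            index[identifier]["lists"].add(list_name)
            index[identifier]["names"].add(name)
    return dict(index)


# Hypothetical lists; each keeps the name given by its own source.
example_lists = {
    "list_a": [("92-87-5", "Benzidine"), ("G-92-87-5-01", "Salts of benzidine")],
    "list_b": [("92-87-5", "4,4'-Diaminobiphenyl")],
}

index = cross_reference(example_lists)
print(sorted(index["92-87-5"]["lists"]))
print(sorted(index["92-87-5"]["names"]))
```

Keying on the identifier rather than the name lets the two differently named entries for benzidine resolve to the same substance, while the group entry remains a separate entity.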
Achieving this form of data organization required us to reformat or reorganize the information obtained from many of the authoritative sources. The following operations were commonly performed:
- List entries identifying multiple substances (i.e., combined entries with multiple CASRN and multiple names) were separated, retaining the correct CASRN/name pairs. Entries containing more than one CASRN, but only one name, were split into separate entries, each retaining the same name. Entries containing one CASRN and multiple names were separated only if the names clearly referred to different substances.
- Entries that specified both a single compound and a class of compounds at once, for example, "benzidine and its salts," were separated into two entries, for example benzidine [92-87-5] and "salts of benzidine" [G-92-87-5-01].
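The rules above for separating combined entries can be sketched as a small function. This is an illustration under stated assumptions, not the procedure actually used: it assumes that CASRN and names in a combined entry appear in matching order, and it only flags the one-CASRN, many-names case for manual review, since deciding whether names refer to different substances requires judgment. The function name and return shape are hypothetical.

```python
def split_entry(casrns, names):
    """Split a combined list entry into (CASRN, name) pairs.

    Returns (entries, needs_review), where needs_review is True when a
    person must decide whether the names refer to different substances.
    """
    if not casrns or not names:
        raise ValueError("entry needs at least one CASRN and one name")
    if len(casrns) == len(names):
        # Pair by position (assumes the source lists them in matching order).
        return list(zip(casrns, names)), False
    if len(names) == 1:
        # Several CASRN, one name: one entry per CASRN, same name.
        return [(casrn, names[0]) for casrn in casrns], False
    if len(casrns) == 1:
        # One CASRN, several names: keep together and flag for review.
        return [(casrns[0], name) for name in names], True
    raise ValueError("cannot pair CASRN and names; resolve manually")


# Hypothetical combined entries built from identifiers used in this document.
print(split_entry(["92-87-5", "7440-38-2"], ["Benzidine", "Arsenic"]))
print(split_entry(["92-87-5", "7440-38-2"], ["Example name"]))
```

Entries such as "benzidine and its salts" are not handled here, because separating a single compound from a class of compounds depends on reading the name rather than counting identifiers.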
Data quality and sources of error
Some list data were edited beyond the simple reorganizations described above, primarily because of errors or insufficiencies in chemical identifiers (name, CASRN). These changes reflect our judgment of what the original source data meant. All such edits were documented in the "list methodology" metadata for each list. Common sources of error and actions taken are described below.
- Obvious typographical error in chemical name: Corrected.
- Source used deprecated CASRN for the substance: Replaced with current CASRN.
- Name did not match CASRN: Gathered more information from the original source and other sources to determine the most likely identity of the substance, if possible.
- Impossible to identify a substance from the source document: The substance was not included in the records.
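One inexpensive way to catch some CASRN errors is the check digit built into every CAS registry number: the last digit equals the sum of the other digits, each multiplied by its position counted from the right, modulo 10. The sketch below shows that check. The source does not state that this check was part of the procedure, so treat it as an illustrative assumption; it detects malformed or mistyped numbers only, and cannot detect a deprecated CASRN or a CASRN that does not match the name.

```python
import re

# A CASRN has 2 to 7 digits, then 2 digits, then 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CASRN_PATTERN.match(casrn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position, counting from the right and from 1.
    total = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_casrn("92-87-5"))   # benzidine, as listed above
print(is_valid_casrn("92-78-5"))   # same digits with two transposed
```

For benzidine [92-87-5], the weighted sum is 7×1 + 8×2 + 2×3 + 9×4 = 65, and 65 modulo 10 gives the check digit 5.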