Title: | Extract Tox Info from Various Databases |
---|---|
Description: | Extract toxicological and chemical information from databases maintained by scientific agencies and resources, including the Comparative Toxicogenomics Database <https://ctdbase.org/>, the Integrated Chemical Environment <https://ice.ntp.niehs.nih.gov/>, the Integrated Risk Information System <https://cfpub.epa.gov/ncea/iris/>, the CompTox Chemicals Dashboard Resource Hub <https://www.epa.gov/comptox-tools/comptox-chemicals-dashboard-resource-hub>, and PubChem <https://pubchem.ncbi.nlm.nih.gov/>. |
Authors: | Claudio Zanettini [aut, cre, cph], Lucio Queiroz [aut] |
Maintainer: | Claudio Zanettini |
License: | MIT + file LICENSE |
Version: | 0.1.9001 |
Built: | 2024-12-19 03:37:45 UTC |
Source: | https://github.com/c1au6i0/extractox |
This function retrieves the CASRN for a given set of PubChem Compound Identifiers (CIDs). It queries PubChem through the webchem package and extracts the CASRN from the depositor-supplied synonyms.
extr_casrn_from_cid(pubchem_id, verbose = TRUE)
pubchem_id |
A numeric vector of PubChem CIDs. These are unique identifiers for chemical compounds in the PubChem database. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
A data frame containing the CID, CASRN, and IUPAC name of the compound. The returned data frame includes three columns:
The PubChem Compound Identifier.
The corresponding CASRN of the compound.
The IUPAC name of the compound.
# Example with formaldehyde and aflatoxin B1
cids <- c(712, 14434) # CIDs for formaldehyde and aflatoxin B1
extr_casrn_from_cid(cids)
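As a rough illustration of how a CASRN can be picked out of a list of synonyms, the base R sketch below keeps only strings that match the CASRN layout and pass the CAS check-digit rule. The helper `is_casrn()` and the synonym vector are hypothetical and are not part of extractox or webchem; the package may use a different approach.

```r
# Hypothetical helper (not part of extractox): TRUE for strings that have the
# CASRN layout and a valid check digit.
is_casrn <- function(x) {
  # Layout: 2-7 digits, then 2 digits, then 1 check digit, hyphen-separated
  ok_format <- grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)
  vapply(seq_along(x), function(i) {
    if (!ok_format[i]) return(FALSE)
    digits <- as.integer(strsplit(gsub("-", "", x[i]), "")[[1]])
    check <- digits[length(digits)]
    body <- rev(digits[-length(digits)])
    # Weighted sum of the body digits (rightmost digit has weight 1), modulo 10
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1))
}

# Made-up synonym list for illustration
synonyms <- c("formaldehyde", "50-00-0", "Methanal", "50-00-1")
casrn <- synonyms[is_casrn(synonyms)]
print(casrn)
```

The check-digit test filters out strings that merely look like a CASRN, such as "50-00-1" above.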
This function takes a vector of IUPAC names and queries the PubChem database (using the webchem package) to obtain the corresponding CASRN and CID for each compound. It reshapes the resulting data, ensuring that each compound has a unique row with the CID, CASRN, and additional chemical properties.
extr_chem_info(IUPAC_names, stop_on_warning = FALSE, verbose = TRUE)
IUPAC_names |
A character vector of IUPAC names. These are standardized names of chemical compounds that will be used to search in the PubChem database. |
stop_on_warning |
Logical. If set to TRUE, the function will stop and throw an error if any substances are not found in PubChem. Defaults to FALSE, in which case a warning is issued. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
A data frame with information on the queried compounds, including:
The IUPAC name of the compound.
The PubChem Compound Identifier (CID).
The SMILES string (Simplified Molecular Input Line Entry System).
# Example with formaldehyde and aflatoxin
extr_chem_info(IUPAC_names = c("Formaldehyde", "Aflatoxin B1"))
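The difference between the two settings of stop_on_warning can be sketched in base R. The helper `report_missing()` below is hypothetical and only mimics the described behavior (a warning by default, an error when stop_on_warning is TRUE); it is not the package's implementation.

```r
# Hypothetical helper (not part of extractox): warn or stop when some of the
# queried names are absent from the results.
report_missing <- function(queried, found, stop_on_warning = FALSE) {
  missing <- setdiff(queried, found)
  if (length(missing) == 0) return(invisible(character(0)))
  msg <- paste("Not found in PubChem:", paste(missing, collapse = ", "))
  if (stop_on_warning) stop(msg, call. = FALSE)
  warning(msg, call. = FALSE)
  invisible(missing)
}

# With stop_on_warning = TRUE the condition is an error and can be caught
res <- tryCatch(
  report_missing(c("Formaldehyde", "Not a chemical"), "Formaldehyde",
                 stop_on_warning = TRUE),
  error = function(e) conditionMessage(e)
)
print(res)
```

Under this pattern, setting stop_on_warning = TRUE is useful in batch scripts where a missing substance should halt the pipeline rather than pass silently.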
This function interacts with the CompTox Chemicals Dashboard to download and extract a wide range of chemical data based on user-defined search criteria. It allows for flexible input types and supports downloading various chemical properties, identifiers, and predictive data.
It was inspired by the ECOTOXr::websearch_comptox function.
extr_comptox( ids, download_items = c("CASRN", "INCHIKEY", "IUPAC_NAME", "SMILES", "INCHI_STRING", "MS_READY_SMILES", "QSAR_READY_SMILES", "MOLECULAR_FORMULA", "AVERAGE_MASS", "MONOISOTOPIC_MASS", "QC_LEVEL", "SAFETY_DATA", "EXPOCAST", "DATA_SOURCES", "TOXVAL_DATA", "NUMBER_OF_PUBMED_ARTICLES", "PUBCHEM_DATA_SOURCES", "CPDAT_COUNT", "IRIS_LINK", "PPRTV_LINK", "WIKIPEDIA_ARTICLE", "QC_NOTES", "ABSTRACT_SHIFTER", "TOXPRINT_FINGERPRINT", "ACTOR_REPORT", "SYNONYM_IDENTIFIER", "RELATED_RELATIONSHIP", "ASSOCIATED_TOXCAST_ASSAYS", "TOXVAL_DETAILS", "CHEMICAL_PROPERTIES_DETAILS", "BIOCONCENTRATION_FACTOR_TEST_PRED", "BOILING_POINT_DEGC_TEST_PRED", "48HR_DAPHNIA_LC50_MOL/L_TEST_PRED", "DENSITY_G/CM^3_TEST_PRED", "DEVTOX_TEST_PRED", "96HR_FATHEAD_MINNOW_MOL/L_TEST_PRED", "FLASH_POINT_DEGC_TEST_PRED", "MELTING_POINT_DEGC_TEST_PRED", "AMES_MUTAGENICITY_TEST_PRED", "ORAL_RAT_LD50_MOL/KG_TEST_PRED", "SURFACE_TENSION_DYN/CM_TEST_PRED", "THERMAL_CONDUCTIVITY_MW/(M*K)_TEST_PRED", "TETRAHYMENA_PYRIFORMIS_IGC50_MOL/L_TEST_PRED", "VISCOSITY_CP_CP_TEST_PRED", "VAPOR_PRESSURE_MMHG_TEST_PRED", "WATER_SOLUBILITY_MOL/L_TEST_PRED", "ATMOSPHERIC_HYDROXYLATION_RATE_(AOH)_CM3/MOLECULE*SEC_OPERA_PRED", "BIOCONCENTRATION_FACTOR_OPERA_PRED", "BIODEGRADATION_HALF_LIFE_DAYS_DAYS_OPERA_PRED", "BOILING_POINT_DEGC_OPERA_PRED", "HENRYS_LAW_ATM-M3/MOLE_OPERA_PRED", "OPERA_KM_DAYS_OPERA_PRED", "OCTANOL_AIR_PARTITION_COEFF_LOGKOA_OPERA_PRED", "SOIL_ADSORPTION_COEFFICIENT_KOC_L/KG_OPERA_PRED", "OCTANOL_WATER_PARTITION_LOGP_OPERA_PRED", "MELTING_POINT_DEGC_OPERA_PRED", "OPERA_PKAA_OPERA_PRED", "OPERA_PKAB_OPERA_PRED", "VAPOR_PRESSURE_MMHG_OPERA_PRED", "WATER_SOLUBILITY_MOL/L_OPERA_PRED", "EXPOCAST_MEDIAN_EXPOSURE_PREDICTION_MG/KG-BW/DAY", "NHANES", "TOXCAST_NUMBER_OF_ASSAYS/TOTAL", "TOXCAST_PERCENT_ACTIVE"), mass_error = 0, verify_ssl = FALSE, verbose = TRUE, ... )
ids |
A character vector containing the items to be searched within the CompTox Chemicals Dashboard. These can be chemical names, CAS Registry Numbers (CASRN), InChIKeys, or DSSTox substance identifiers (DTXSID). |
download_items |
A character vector of items to be downloaded. This includes a comprehensive set of chemical properties, identifiers, predictive data, and other relevant information. By default, all items are downloaded. |
mass_error |
Numeric value indicating the mass error tolerance for searches involving mass data. Default is 0. |
verify_ssl |
Logical value indicating whether SSL certificates should be verified. Default is FALSE. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
... |
Additional arguments passed to |
Please note that this function, which pulls data from EPA servers, may encounter issues on some Linux systems. Those servers do not support secure renegotiation, and on Linux the function depends on curl and OpenSSL, whose newer versions have known problems with unsafe legacy renegotiation. One workaround is to downgrade to curl v7.78.0 and OpenSSL v1.1.1. However, be aware that these older versions might introduce security vulnerabilities. Refer to this gist for instructions on how to downgrade curl and OpenSSL on Ubuntu.
A cleaned data frame containing the requested data from CompTox.
CompTox Chemicals Dashboard Resource Hub
# Example usage of the function:
extr_comptox(ids = c("Aspirin", "50-00-0"))
This function queries the Comparative Toxicogenomics Database API to retrieve data related to chemicals, diseases, genes, or other categories.
extr_ctd(
  input_terms,
  category = "chem",
  report_type = "genes_curated",
  input_term_search_type = "directAssociations",
  action_types = NULL,
  ontology = NULL,
  verify_ssl = FALSE,
  verbose = TRUE,
  ...
)
input_terms |
A character vector of input terms such as CAS numbers or IUPAC names. |
category |
A string specifying the category of data to query. Valid options are "all", "chem", "disease", "gene", "go", "pathway", "reference", and "taxon". Default is "chem". |
report_type |
A string specifying the type of report to return. Default is "genes_curated". Valid options include:
|
input_term_search_type |
A string specifying the search method to use. Options are "hierarchicalAssociations" or "directAssociations". Default is "directAssociations". |
action_types |
An optional character vector specifying one or more interaction types for filtering results. Default is "ANY". Other acceptable inputs include "abundance", "activity", "binding", "cotreatment", "expression", "folding", "localization", and "metabolic processing". See https://ctdbase.org/tools/batchQuery.go for a full list. |
ontology |
An optional character vector specifying one or more ontologies for filtering GO reports. Default is NULL. |
verify_ssl |
Logical value controlling whether SSL certificates should be verified. Default is FALSE. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
... |
Any other arguments to be supplied to |
A data frame containing the queried data in CSV format.
Davis, A. P., Grondin, C. J., Johnson, R. J., Sciaky, D., McMorran, R., Wiegers, T. C., & Mattingly, C. J. (2019). The Comparative Toxicogenomics Database: update 2019. Nucleic acids research, 47(D1), D948–D954. doi:10.1093/nar/gky868
Comparative Toxicogenomics Database
input_terms <- c("50-00-0", "64-17-5", "methanal", "ethanol")
dat <- extr_ctd(
  input_terms = input_terms,
  category = "chem",
  report_type = "genes_curated",
  input_term_search_type = "directAssociations",
  action_types = "ANY",
  ontology = c("go_bp", "go_cc")
)
str(dat)

# Get expression data
dat2 <- extr_ctd(
  input_terms = input_terms,
  report_type = "cgixns",
  category = "chem",
  action_types = "expression"
)
str(dat2)
The extr_ice function sends a POST request to the ICE API to search for information based on specified chemical IDs and assays.
extr_ice(casrn, assays = NULL, verify_ssl = FALSE, verbose = TRUE, ...)
casrn |
A character vector specifying the CASRNs for the search. |
assays |
A character vector specifying the assays to include in the search. Default is NULL,
meaning all assays are included. If you don't know the exact assay name, you can use the
|
verify_ssl |
Logical value controlling whether SSL certificates should be verified. Default is FALSE. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
... |
Any other arguments to be supplied to |
A data frame containing the extracted data from the ICE API.
extr_ice_assay_names, NTP ICE database
extr_ice(c("50-00-0"))
This function allows users to search for assay names in the ICE database using a regular expression. If no search pattern is provided (regex = NULL), it returns all available assay names.
extr_ice_assay_names(regex = NULL)
regex |
A character string containing the regular expression to search for, or NULL to return all assay names. |
A character vector of matching assay names.
extr_ice_assay_names("OPERA")
extr_ice_assay_names(NULL)
extr_ice_assay_names("Vivo")
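The search behavior described above (a regular expression filter, with NULL returning everything) can be sketched with base R's `grep()`. The function `search_names()` and the assay names below are made up for illustration; they are not the real ICE assay list, and the package's matching rules (for example, case sensitivity) may differ.

```r
# Invented assay names, for illustration only (not the real ICE list)
assay_names <- c(
  "OPERA, Boiling Point",
  "OPERA, Melting Point",
  "Rat Acute Oral Toxicity, In Vivo",
  "Ames Mutagenicity"
)

# Hypothetical sketch: regex search over a vector of names;
# regex = NULL returns all names unchanged.
search_names <- function(regex = NULL, names = assay_names) {
  if (is.null(regex)) return(names)
  grep(regex, names, value = TRUE)
}

print(search_names("OPERA"))
print(search_names("Vivo"))
print(length(search_names(NULL)))
```

In this sketch matching is case-sensitive, so "vivo" would not match "In Vivo" unless `ignore.case = TRUE` were passed to `grep()`.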
The extr_iris function sends a request to the EPA IRIS database to search for information based on the specified CASRNs and cancer types. It retrieves and parses the HTML content from the response.
Note that if casrn is not provided, all datasets are retrieved.
extr_iris(
  casrn = NULL,
  cancer_types = c("non_cancer", "cancer"),
  verbose = TRUE
)
casrn |
A vector of CASRNs for the search. |
cancer_types |
A character vector specifying the types of cancer to include in the search. Accepted values are "non_cancer" and "cancer"; both are included by default. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
A data frame containing the extracted data.
extr_iris(c("1332-21-4", "50-00-0"))
This function retrieves information regarding Monographs from the World Health Organization (WHO) International Agency for Research on Cancer (IARC) based on CAS Registry Number or Name of the chemical.
extr_monograph(ids, search_type = "casrn", verbose = TRUE)
ids |
A character vector of IDs to search for. |
search_type |
A character string specifying the type of search to perform. Valid options are "casrn" (CAS Registry Number)
and "name" (name of the chemical). If |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
A data frame containing the relevant information from the WHO IARC, including Monograph volume, volume_publication_year, evaluation_year, and additional_information where the chemical was described.
https://monographs.iarc.who.int/list-of-classifications/
{
  dat <- extr_monograph(search_type = "casrn", ids = c("105-74-8", "120-58-1"))
  str(dat)

  # Example usage for name search
  dat2 <- extr_monograph(search_type = "name", ids = c("Aloe", "Schistosoma", "Styrene"))
  str(dat2)
}
This function retrieves FEMA (Flavor and Extract Manufacturers Association) flavor profile information for a list of CAS Registry Numbers (CASRN) from the PubChem database using the webchem package. It applies the function extr_fema_pubchem_ to each CASRN in the input vector and combines the results into a single data frame.
extr_pubchem_fema(casrn, verbose = TRUE)
casrn |
An atomic vector of CAS Registry Numbers (CASRN). |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
A data frame containing the FEMA flavor profile information for each CASRN. If no information is found for a particular CASRN, the output will include a row indicating this.
extr_pubchem_fema(c("83-67-0", "1490-04-6"))
This function extracts GHS (Globally Harmonized System) codes from PubChem. It relies on the webchem package to interact with PubChem.
extr_pubchem_ghs(casrn, verbose = TRUE)
casrn |
Character vector of CAS Registry Numbers (CASRN). |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
A dataframe containing GHS information.
extr_pubchem_ghs(casrn = c("50-00-0", "64-17-5"))
This function queries the Comparative Toxicogenomics Database API to retrieve tetramer data based on chemicals, diseases, genes, or other categories.
extr_tetramer(
  chem,
  disease = "",
  gene = "",
  go = "",
  input_term_search_type = "directAssociations",
  qt_match_type = "equals",
  verify_ssl = FALSE,
  verbose = TRUE,
  ...
)
chem |
A character vector of chemical identifiers, such as CAS numbers or IUPAC names. |
disease |
A string indicating a disease term. Default is an empty string. |
gene |
A string indicating a gene symbol. Default is an empty string. |
go |
A string indicating a Gene Ontology term. Default is an empty string. |
input_term_search_type |
A string specifying the search method to use. Options are "hierarchicalAssociations" or "directAssociations". Default is "directAssociations". |
qt_match_type |
A string specifying the query type match method. Options are "equals" or "contains". Default is "equals". |
verify_ssl |
Boolean to control if SSL should be verified or not. Default is FALSE. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
... |
Any other arguments to be supplied to |
A data frame containing the queried tetramer data in CSV format.
Comparative Toxicogenomics Database: http://ctdbase.org
Davis, A. P., Grondin, C. J., Johnson, R. J., Sciaky, D., McMorran, R., Wiegers, T. C., & Mattingly, C. J. (2019). The Comparative Toxicogenomics Database: update 2019. Nucleic acids research, 47(D1), D948–D954. doi:10.1093/nar/gky868
Davis, A. P., Wiegers, T. C., Wiegers, J., Wyatt, B., Johnson, R. J., Sciaky, D., Barkalow, F., Strong, M., Planchart, A., & Mattingly, C. J. (2023). CTD tetramers: A new online tool that computationally links curated chemicals, genes, phenotypes, and diseases to inform molecular mechanisms for environmental health. Toxicological Sciences, 195(2), 155–168. doi:10.1093/toxsci/kfad069
Comparative Toxicogenomics Database
tetramer_data <- extr_tetramer(
  chem = c("50-00-0", "ethanol"),
  disease = "",
  gene = "",
  go = "",
  input_term_search_type = "directAssociations",
  qt_match_type = "equals"
)
str(tetramer_data)
This wrapper function retrieves toxicological information for specified chemicals by calling several external functions to query multiple databases, including PubChem, the Integrated Chemical Environment (ICE), CompTox Chemicals Dashboard, and the Integrated Risk Information System (IRIS).
extr_tox(casrn, verbose = TRUE)
casrn |
A character vector of CAS Registry Numbers (CASRN) representing the chemicals of interest. |
verbose |
A logical value indicating whether to print detailed messages. Default is TRUE. |
Specifically, this function:
Calls extr_monograph to return monograph information from WHO IARC.
Calls extr_pubchem_ghs to retrieve GHS classification data from PubChem.
Calls extr_ice to gather assay data from the ICE database.
Calls extr_iris to retrieve risk assessment information from the IRIS database.
Calls extr_comptox to retrieve data from the CompTox Chemicals Dashboard.
A list of data frames containing toxicological information retrieved from each database:
Lists, if any, the WHO IARC monographs related to the chemical.
Toxicity data from PubChem's Globally Harmonized System (GHS) classification.
Assay data from the Integrated Chemical Environment (ICE) database.
Risk assessment data from the IRIS database.
List of data frames with toxicity information from the CompTox Chemicals Dashboard.
extr_tox(casrn = c("100-00-5", "107-02-8"))
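Because the result is a list of data frames, it can be summarized or filtered with ordinary list tools. The sketch below uses a small stand-in list; the element names, column names, and values are invented for illustration and do not reflect the actual structure returned by extr_tox.

```r
# Stand-in for a list of data frames; all names and values are hypothetical
tox_list <- list(
  source_a = data.frame(casrn = c("100-00-5", "107-02-8"), value = c("x1", "x2")),
  source_b = data.frame(casrn = "107-02-8", value = "y1")
)

# Number of rows retrieved from each source
rows_per_source <- vapply(tox_list, nrow, integer(1))
print(rows_per_source)

# Keep only the rows for one chemical in every table
one_chem <- lapply(tox_list, function(df) df[df$casrn == "107-02-8", , drop = FALSE])
print(one_chem$source_a)
```

Using `drop = FALSE` keeps each element a data frame even when a single row or column remains.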
This function creates an Excel file with each dataframe in a list as a separate sheet.
write_dataframes_to_excel(df_list, filename)
df_list |
A named list of dataframes to write to the Excel file. |
filename |
The name of the Excel file to create. |
No return value. The function prints a message indicating the completion of the Excel file writing.
tox_dat <- extr_tox("50-00-0")
temp_file <- tempfile(fileext = ".xlsx")
write_dataframes_to_excel(tox_dat, filename = temp_file)