NCBI EUtils. Get Homolog information using Gene ID

How can I retrieve homolog information in XML or JSON format from NCBI using a gene ID?

I tried one URL:

I don't know what to add in the question-mark regions.

From Entrez Programming Utilities Help [Internet]:

Input: Any Entrez text query (&term); Entrez database (&db); &usehistory=y; Existing web environment (&WebEnv) from a prior E-utility call

To avoid the error messages, web1 and key1 can be used as terms (these are usually used to associate a request with other searches). However, this returns no data, probably because the ID you supplied was simply "9", and it remains unclear exactly what you are searching for in the context of other queries.

For example:

esearch.fcgi?db=&term=&usehistory=y
# esearch produces WebEnv value ($web1) and QueryKey value ($key1)
esummary.fcgi?db=&query_key=$key1&WebEnv=$web1

So although it is a valid query, it is not pointing to anything.
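The two-step pattern above (esearch with usehistory=y, then esummary with the returned WebEnv and QueryKey) can be sketched in Python as follows. This is a minimal sketch that only builds the URLs and parses a response. The database name, the search term, and the sample XML are illustrative assumptions, not values from the question, and the sketch does not claim to return homolog data.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term):
    """Build an esearch URL that keeps its result on the History server."""
    params = {"db": db, "term": term, "usehistory": "y"}
    return BASE + "esearch.fcgi?" + urlencode(params)


def parse_history(esearch_xml):
    """Return (QueryKey, WebEnv) from an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def esummary_url(db, query_key, web_env, retmode="json"):
    """Build an esummary URL that reuses a stored esearch result."""
    params = {"db": db, "query_key": query_key,
              "WebEnv": web_env, "retmode": retmode}
    return BASE + "esummary.fcgi?" + urlencode(params)


# Made-up esearch response, used here instead of a live request.
sample_xml = (
    "<eSearchResult><Count>1</Count>"
    "<QueryKey>1</QueryKey><WebEnv>MCID_example</WebEnv>"
    "</eSearchResult>"
)

# Hypothetical query: the db and term values are assumptions for illustration.
search_url = esearch_url("gene", "BRCA1[gene] AND human[orgn]")
key1, web1 = parse_history(sample_xml)
summary_url = esummary_url("gene", key1, web1)
print(search_url)
print(summary_url)
```

In a real run, the XML passed to parse_history would be the body returned by fetching search_url, and both requests would need the same non-empty db value.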

A protocol for adding knowledge to Wikidata: aligning resources on human coronaviruses

Pandemics, even more than other medical problems, require swift integration of knowledge. When caused by a new virus, understanding the underlying biology may help finding solutions. In a setting where there are a large number of loosely related projects and initiatives, we need common ground, also known as a “commons.” Wikidata, a public knowledge graph aligned with Wikipedia, is such a commons and uses unique identifiers to link knowledge in other knowledge bases. However, Wikidata may not always have the right schema for the urgent questions. In this paper, we address this problem by showing how a data schema required for the integration can be modeled with entity schemas represented by Shape Expressions.


As a telling example, we describe the process of aligning resources on the genomes and proteomes of the SARS-CoV-2 virus and related viruses as well as how Shape Expressions can be defined for Wikidata to model the knowledge, helping others studying the SARS-CoV-2 pandemic. How this model can be used to make data between various resources interoperable is demonstrated by integrating data from NCBI (National Center for Biotechnology Information) Taxonomy, NCBI Genes, UniProt, and WikiPathways. Based on that model, a set of automated applications or bots were written for regular updates of these sources in Wikidata and added to a platform for automatically running these updates.


Although this workflow was developed and applied in the context of the COVID-19 pandemic, it was also applied to other human coronaviruses (MERS, SARS, human coronavirus NL63, human coronavirus 229E, human coronavirus HKU1, human coronavirus OC43) to demonstrate its broader applicability.

The GEOmetadb package is an attempt to make querying the metadata describing microarray experiments, platforms, and datasets both easier and more powerful. At the heart of GEOmetadb is a SQLite database that stores nearly all the metadata associated with all GEO data types including GEO samples (GSM), GEO platforms (GPL), GEO data series (GSE), and curated GEO datasets (GDS), as well as the relationships between these data types. This database is generated by our server by parsing all the records in GEO and needs to be downloaded via a simple helper function to the user’s local machine before GEOmetadb is useful. Once this is done, the entire GEO database is accessible with simple SQL-based queries. With the GEOmetadb database, queries that are simply not possible using NCBI tools or web pages are often quite simple.
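The kind of cross-type query described above can be illustrated with Python's built-in sqlite3 module. This is a sketch against a tiny in-memory stand-in, not the downloaded GEOmetadb file: the table names, column names, and rows below are simplified assumptions made up for illustration.

```python
import sqlite3

# In-memory toy database; the schema is an assumed, simplified stand-in.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gse (gse TEXT PRIMARY KEY, title TEXT);
CREATE TABLE gsm (gsm TEXT PRIMARY KEY, gpl TEXT);
CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
INSERT INTO gse VALUES ('GSE_A', 'toy series A'), ('GSE_B', 'toy series B');
INSERT INTO gsm VALUES ('GSM_1', 'GPL_X'), ('GSM_2', 'GPL_X'), ('GSM_3', 'GPL_Y');
INSERT INTO gse_gsm VALUES ('GSE_A', 'GSM_1'), ('GSE_A', 'GSM_2'), ('GSE_B', 'GSM_3');
""")

# Count samples per series and platform by joining through the link table.
rows = con.execute("""
SELECT gse.gse, gsm.gpl, COUNT(*) AS n_samples
FROM gse
JOIN gse_gsm ON gse.gse = gse_gsm.gse
JOIN gsm ON gsm.gsm = gse_gsm.gsm
GROUP BY gse.gse, gsm.gpl
ORDER BY gse.gse
""").fetchall()

for row in rows:
    print(row)
```

Against the real database, the connection would point at the downloaded SQLite file, and the table and column names should be checked against the actual schema first.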

The relationships between the tables in the GEOmetadb SQLite database can be seen in the following entity-relationship diagram.

Best way to get list of SNPs by gene id?

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)

So far, the best solution I have found is using the biomaRt package (bioconductor). There is an example of the kind of lookup I need to do here. Fitted for my purposes, here is my code:

This outputs a data frame that begins like this:

The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculates on the fly. But looking up the bare minimum, only the SNP rs IDs, still takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 Mbit/s).

I tried looking up more genes per query. This did not help. Looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.

What is the best way to get a vector of SNPs that are associated with a vector of gene IDs?

Record versions

Accession versioning is done by appending a period followed by a version number, e.g. Q12345.1 or Q12345.2 would be two different versions of the same record. Versioning represents updates made to records, typically as new information becomes available.

One potential source of problems is that NCBI records obtained directly through the eUtils interface (as opposed to through the website) do not contain any information on related versions. This means that geeneus is unable to give this information either.

Querying a non-versioned accession (e.g. Q12345 or NP_1234567) will give the most up-to-date record associated with that accession, while querying a versioned value (Q12345.3 or NP_1234567.5) will give that specific version. However, there is no way to know whether any specific versioned record is the most up-to-date record, or to access previous records. This is not necessarily a problem; it is just worth being aware that a query with an explicit version number may not return the most up-to-date version.

Note that GI numbers are unique for each different version, so they deal with versioning in a different manner. The version number returned here refers to the non-GI accession version, where available. If no explicit version is available then we assume the version is 1.
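The versioning convention described above can be sketched as a small parsing helper. This is an illustrative sketch, not the package's own code: the function name and the returned tuple layout are assumptions, and the default of 1 simply mirrors the stated assumption for values with no explicit version.

```python
def split_accession(accession):
    """Split an accession into (base, version, explicit).

    'Q12345.2' gives ('Q12345', 2, True). A value with no numeric
    '.N' suffix gives version 1 and explicit=False.
    """
    text = accession.strip()
    base, dot, suffix = text.rpartition(".")
    if dot and base and suffix.isdigit():
        return base, int(suffix), True
    return text, 1, False


print(split_accession("Q12345.2"))
print(split_accession("NP_1234567"))
```

The explicit flag lets calling code tell a pinned version apart from a defaulted one, which matters because a pinned version may not be the latest record.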

1 Introduction

The increasing availability of biological data has not only resulted in a multitude of genome sequence data, but also substantial increases in the amount of accompanying metadata, including phylogenies, sampling conditions and locations, and gene ontologies. To use such data in a biological analysis pipeline, a programmatic approach is required to query and retrieve data from these databases. The National Center for Biotechnology Information (NCBI) is one of the largest such repositories and develops and maintains the Entrez databases, which currently comprise 37 individual databases storing 2.1 billion records related to the life sciences (NCBI Resource Coordinators, 2016).

NCBI offers two approaches to interact programmatically with its Entrez databases: (i) E-utilities, a set of tools that allow the user to query and retrieve NCBI data using specific Uniform Resource Identifiers (URIs), where a URI describes the function and its parameters, such as searching a database with a specific term; and (ii) Entrez Direct, a powerful Perl program that allows ad hoc access to the NCBI databases through a command line interface (Kans, 2016). E-Utilities offer a low-level interface to the Entrez databases via Entrez Direct. However, Entrez Direct is designed as a command line tool and is therefore primarily incorporated into analysis pipelines via a shell, such as Bash, but is not designed as a library. Although Python is increasingly used by biologists, incorporating Entrez Direct into Python pipelines requires the use of new processes outside Python, adding an additional layer of complexity.

Herein, we present Entrezpy. To our knowledge, this is the first Python library to offer the same functionalities as Entrez Direct. Existing libraries, such as Biopython (Cock et al., 2009) or ETE 3 (Huerta-Cepas et al., 2016), offer either a basic or a very narrow interaction with E-utilities. Biopython does not handle whole queries, leaving the user to implement the logic to fetch large requests, while ETE is a library focusing only on phylogenetics. In contrast, Entrezpy is specifically designed to interact with E-Utilities. It offers fine-grained control over how to download data and can cache results locally for quick retrieval. This allows querying and downloading data from Entrez databases as an integral part of an analysis pipeline. Entrezpy automatically configures itself to retrieve large datasets according to the implemented E-Utility function and the limits enforced by NCBI.
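The "logic to fetch large requests" mentioned above usually amounts to splitting one large result set into batches addressed by the E-utilities retstart and retmax parameters. The following is a minimal sketch of that idea only; it is not Entrezpy's implementation, and the helper name and the batch size in the example are assumptions chosen for illustration.

```python
def batch_params(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield start, min(batch_size, total - start)


# Example: 25 records fetched in batches of at most 10.
batches = list(batch_params(25, 10))
print(batches)
```

Each pair would be added to a fetch request as retstart and retmax. The appropriate batch size depends on the E-utility function and on the limits NCBI enforces for it.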

Entrezpy includes a helper class, termed Conduit, that facilitates the creation and execution of query pipelines, that is, several consecutive queries that may depend on previous queries, and the ability to re-use previously obtained results. Entrezpy is licensed under the GNU Lesser General Public License and is packaged on PyPI. The Entrezpy source code is documented using Sphinx, and the documentation includes usage examples.


The number of records in Entrez Gene will continue to increase as new species are sequenced and genes are identified. During 2011, sections will be added to the web interface and/or the content will be enhanced so that users will be provided more information in the full report before navigating to related sites at NCBI. This transition was started in 2010 with the addition of the phenotype section. Finally, as new databases with gene-specific content are implemented at NCBI, content and/or links will be added to Entrez Gene.

Assessment of Tumor Sequencing as a Replacement for Lynch Syndrome Screening and Current Molecular Tests for Patients With Colorectal Cancer

Importance: Universal tumor screening for Lynch syndrome (LS) in colorectal cancer (CRC) is recommended and involves up to 6 sequential tests. Somatic gene testing is performed on stage IV CRCs for treatment determination. The diagnostic workup for patients with CRC could be simplified and improved using a single up-front tumor next-generation sequencing test if it has higher sensitivity and specificity than the current screening protocol.

Objective: To determine whether up-front tumor sequencing (TS) could replace the current multiple sequential test approach for universal tumor screening for LS.

Design, setting, and participants: Tumor DNA from 419 consecutive CRC cases undergoing standard universal tumor screening and germline genetic testing when indicated as part of the multicenter, population-based Ohio Colorectal Cancer Prevention Initiative from October 2015 through February 2016 (the prospective cohort) and 46 patients with CRC known to have LS due to a germline mutation in a mismatch repair gene from January 2013 through September 2015 (the validation cohort) underwent blinded TS.

Main outcomes and measures: Sensitivity of TS compared with microsatellite instability (MSI) testing and immunohistochemical (IHC) staining for the detection of LS.

Results: In the 465 patients, mean age at diagnosis was 59.9 years (range, 20-96 years), and 241 (51.8%) were female. Tumor sequencing identified all 46 known LS cases from the validation cohort and an additional 12 LS cases from the 419-member prospective cohort. Testing with MSI or IHC, followed by BRAF p.V600E testing, missed 5 and 6 cases of LS, respectively. Tumor sequencing alone had better sensitivity (100%; 95% CI, 93.8%-100%) than IHC plus BRAF (89.7%; 95% CI, 78.8%-96.1%; P = .04) and MSI plus BRAF (91.4%; 95% CI, 81.0%-97.1%; P = .07). Tumor sequencing had equal specificity (95.3%; 95% CI, 92.6%-97.2%) to IHC plus BRAF (94.6%; 95% CI, 91.9%-96.6%; P > .99) and MSI plus BRAF (94.8%; 95% CI, 92.2%-96.8%; P = .88). Tumor sequencing identified 284 cases with KRAS, NRAS, or BRAF mutations that could affect therapy for stage IV CRC, avoiding another test. Finally, TS identified 8 patients with germline DPYD mutations that confer toxicity to fluorouracil chemotherapy, which could also be useful for treatment selection.

Conclusions and relevance: Up-front TS in CRC is simpler and has superior sensitivity to current multitest approaches to LS screening, while simultaneously providing critical information for treatment selection.
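The reported sensitivities can be checked from the case counts in the Results. The sketch below assumes a denominator of 58 LS cases (46 from the validation cohort plus 12 from the prospective cohort) and treats the 5 and 6 missed cases as false negatives. The exact binomial lower bound for the 100% estimate is also an assumption: the abstract does not state which interval method was used.

```python
def sensitivity(true_pos, false_neg):
    """Sensitivity = TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


total_ls = 46 + 12  # assumed denominator: all LS cases in both cohorts

ts = sensitivity(total_ls, 0)        # tumor sequencing missed none
msi = sensitivity(total_ls - 5, 5)   # MSI plus BRAF missed 5
ihc = sensitivity(total_ls - 6, 6)   # IHC plus BRAF missed 6

# Exact (Clopper-Pearson) lower 95% bound when all n cases are detected:
# (alpha / 2) ** (1 / n), with alpha = 0.05.
ts_lower = 0.025 ** (1 / total_ls)

for name, value in [("TS", ts), ("MSI+BRAF", msi),
                    ("IHC+BRAF", ihc), ("TS lower bound", ts_lower)]:
    print(f"{name}: {value * 100:.1f}%")
```

Under these assumptions the computed values round to the percentages given in the Results.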

Conflict of interest statement

Conflict of Interest Disclosures: Ms Hampel discloses a consulting or advising role with Invitae and Genome Medical, and stock in Genome Medical. Dr Paskett has a research grant (to the institution) from Merck Foundation and stock in Pfizer. Dr de la Chapelle discloses a patent or intellectual property interest with Genzyme and Ipsogen. No other disclosures are reported.


Figure 1. Present Paradigm for Universal Tumor Screening for Lynch Syndrome Among Patients With Colorectal…

Figure 2. Proposed Universal Tumor Screening Pathway Using Tumor Sequencing for All Patients With Colorectal…

CCDC151 mutations cause primary ciliary dyskinesia by disruption of the outer dynein arm docking complex formation

A diverse family of cytoskeletal dynein motors powers various cellular transport systems, including axonemal dyneins generating the force for ciliary and flagellar beating essential to movement of extracellular fluids and of cells through fluid. Multisubunit outer dynein arm (ODA) motor complexes, produced and preassembled in the cytosol, are transported to the ciliary or flagellar compartment and anchored into the axonemal microtubular scaffold via the ODA docking complex (ODA-DC) system. In humans, defects in ODA assembly are the major cause of primary ciliary dyskinesia (PCD), an inherited disorder of ciliary and flagellar dysmotility characterized by chronic upper and lower respiratory infections and defects in laterality. Here, by combined high-throughput mapping and sequencing, we identified CCDC151 loss-of-function mutations in five affected individuals from three independent families whose cilia showed a complete loss of ODAs and severely impaired ciliary beating. Consistent with the laterality defects observed in these individuals, we found Ccdc151 expressed in vertebrate left-right organizers. Homozygous zebrafish ccdc151(ts272a) and mouse Ccdc151(Snbl) mutants display a spectrum of situs defects associated with complex heart defects. We demonstrate that CCDC151 encodes an axonemal coiled coil protein, mutations in which abolish assembly of CCDC151 into respiratory cilia and cause a failure in axonemal assembly of the ODA component DNAH5 and the ODA-DC-associated components CCDC114 and ARMC4. CCDC151-deficient zebrafish, planaria, and mice also display ciliary dysmotility accompanied by ODA loss. Furthermore, CCDC151 coimmunoprecipitates CCDC114 and thus appears to be a highly evolutionarily conserved ODA-DC-related protein involved in mediating assembly of both ODAs and their axonemal docking machinery onto ciliary microtubules.

Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.


Zebrafish ccdc151 Is Expressed in Ciliated Tissues and Required for Ciliary Motility-Dependent Processes…

CCDC151 Is Localized to Respiratory Ciliary Axonemes (A) Immunoblot analysis (right lane) of…

Mutations in CCDC151 Affect the Localization of ODA-Microtubule Docking-Complex-Associated Components CCDC114 and ARMC4…

A ketosynthase homolog uses malonyl units to form esters in cervimycin biosynthesis

Ketosynthases produce the carbon backbones of a vast number of biologically active polyketides by catalyzing Claisen condensations of activated acyl and malonyl building blocks. Here we report that a ketosynthase homolog from Streptomyces tendae, CerJ, unexpectedly forms malonyl esters during the biosynthesis of cervimycin, a glycoside antibiotic against methicillin-resistant Staphylococcus aureus (MRSA). Deletion of cerJ yielded a substantially more active cervimycin variant lacking the malonyl side chain, and in vitro biotransformations revealed that CerJ is capable of transferring malonyl, methylmalonyl and dimethylmalonyl units onto the glycoside. According to phylogenetic analyses and elucidation of the crystal structure, CerJ is functionally and structurally positioned between the ketosynthase catalyzing Claisen condensations and acyl-ACP shuttles, and it features a noncanonical catalytic triad. Site-directed mutagenesis and structures of CerJ in complex with substrates not only allowed us to establish a model for the reaction mechanism but also provided insights into the evolution of this important subclass of the thiolase superfamily.