Meaning of 'NC number' associated with a gene?

Genes in listings etc. often have a number of the type NC_000012.12 associated with them. How should this be interpreted?

This ID is a RefSeq accession number. The NC_ prefix denotes a complete genomic molecule, usually from a reference assembly. Every sequence has a stable accession number, a version number, and an integer identifier (gi) assigned to it; in NC_000012.12, the accession is NC_000012 and the version is 12. RefSeq records can be distinguished from INSDC records by the inclusion of an underscore (“_”) at the third position of the accession number.
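As a minimal sketch of the accession structure described above, the following Python splits an identifier into its accession and version and applies the underscore rule. The function name `parse_accession` and the small prefix table are illustrative assumptions, not part of any NCBI tool; only the NC_ meaning comes from the text above, while the NM_ and NP_ entries reflect general RefSeq conventions.

```python
# Hypothetical helper for illustration; not an NCBI API.
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule",  # from the answer above
    "NM_": "mRNA transcript",            # assumption: general RefSeq convention
    "NP_": "protein",                    # assumption: general RefSeq convention
}


def parse_accession(identifier):
    """Split 'ACCESSION.VERSION' and flag RefSeq-style accessions."""
    accession, _, version = identifier.partition(".")
    # RefSeq accessions carry an underscore at the third position.
    is_refseq = len(accession) > 2 and accession[2] == "_"
    prefix = accession[:3] if is_refseq else None
    return {
        "accession": accession,
        "version": int(version) if version else None,
        "is_refseq": is_refseq,
        "molecule": REFSEQ_PREFIXES.get(prefix),
    }


record = parse_accession("NC_000012.12")
print(record["accession"], record["version"], record["molecule"])

# An INSDC-style accession (no underscore), used here only as an example.
print(parse_accession("U49845.1")["is_refseq"])
```

Prefixes missing from the table simply yield `None` for the molecule type, so the sketch does not guess at meanings it has not been given.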

NOTCH1 gene

The NOTCH1 gene provides instructions for making a protein called Notch1, a member of the Notch family of receptors. Receptor proteins have specific sites into which certain other proteins, called ligands, fit like keys into locks. Attachment of a ligand to the Notch1 receptor sends signals that are important for normal development of many tissues throughout the body, both before birth and after. Notch1 signaling helps determine the specialization of cells into certain cell types that perform particular functions in the body (cell fate determination). It also plays a role in cell growth and division (proliferation), maturation (differentiation), and self-destruction (apoptosis).

The protein produced from the NOTCH1 gene has such diverse functions that the gene is considered both an oncogene and a tumor suppressor. Oncogenes typically promote cell proliferation or survival, and when mutated, they have the potential to cause normal cells to become cancerous. In contrast, tumor suppressors keep cells from growing and dividing too fast or in an uncontrolled way, preventing the development of cancer; mutations that impair tumor suppressors can lead to cancer development.

The format

The format of a complete variant description is “reference : description” (spaces added for clarity only), e.g. NM_004006.3:c.4375C>T and NC_000023.11:g.32389644G>A.

All variants are described in relation to a reference, the so-called reference sequence, in the examples NM_004006.3 (from the GenBank database) and NC_000023.11 (from the GenBank database). After the reference a description of the variant is given, in the examples c.4375C>T and g.32389644G>A.
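The “reference : description” layout can be sketched in Python as follows. `split_variant` is a hypothetical helper written for this illustration; it only handles the simple case of one reference and one description separated by a colon, and it does not validate the variant itself. It also refuses input without a reference rather than guessing one.

```python
# Hypothetical helper for illustration; not part of any HGVS library.
def split_variant(description):
    """Split 'reference:description' into (reference, coordinate type, description)."""
    reference, separator, change = description.partition(":")
    if not separator or not reference or not change:
        raise ValueError("expected 'reference:description', got %r" % description)
    # The letter before the first dot names the coordinate system,
    # e.g. 'c' (coding DNA) or 'g' (genomic).
    coordinate_type = change.split(".", 1)[0]
    return reference, coordinate_type, change


for text in ("NM_004006.3:c.4375C>T", "NC_000023.11:g.32389644G>A"):
    print(split_variant(text))

try:
    split_variant("c.4375C>T")  # no reference sequence given
except ValueError as error:
    print("rejected:", error)
```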

A description without a reference sequence is nearly useless. Additional information is then required to guess which reference sequence may have been used. If the guess is wrong, you end up with a wrong variant description, and any information retrieved with it is also incorrect. So be very careful when guessing; it is better to check the source of the original description and ask which reference sequence was used. Additional information to support a guess may come from the name of the gene containing the variant, the associated phenotype studied (disease), the chromosome number, and possibly the predicted consequences of the variant at the RNA and/or protein level. Since reference sequences usually change over time, the date of the report describing the variant can give useful information as well.

DNA > RNA > protein

In nature the DNA code is first transcribed into an RNA molecule (see Wikipedia). Next, there are two options:

  • the RNA molecule is translated into a protein and the protein is the final product of the gene. Proteins perform a vast array of functions, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, transporting molecules from one location to another, etc. (see Wikipedia).
  • the RNA molecule is the final product of the gene (so the RNA is not translated into a protein). RNA molecules perform a vast array of functions; examples are rRNAs (ribosomal RNAs) and tRNAs (transfer RNAs), both active in protein translation.
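The first route (DNA to RNA to protein) can be illustrated with a short Python sketch. It assumes the input is the coding strand of an already spliced sequence in the correct reading frame and uses the standard genetic code; `transcribe` and `translate` are illustrative names, and the nine-base sequence is made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an in-frame mRNA into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGCGATAA")  # made-up example: start codon, Arg codon, stop codon
print(mrna, translate(mrna))
```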

Variants are usually detected by reading the DNA code, a method called DNA sequencing. A proper report always contains the variant described on the DNA level. In addition, a report usually contains a description of the predicted consequence of the variant on the protein, rarely the consequence on RNA. In rare cases, not following current standards, only the predicted consequences at the protein level are reported.

Some variants have an effect on how the transcript (RNA) is generated and consequently on its translation into protein. When only DNA has been analysed, the consequences of the variant at the RNA and the protein level can only be predicted. The HGVS standard requires that predicted consequences be reported in parentheses. The predicted consequence of the variant NM_004006.2:c.4375C>T at the protein level is described as p.(Arg1459*). The “()” warn that the variant described is a predicted consequence only.
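The link between c.4375C>T and p.(Arg1459*) can be checked with a little arithmetic. This sketch assumes the usual convention that coding positions are numbered from the first base of the start codon, and it only applies to plain positions inside the coding region (no intronic or UTR offsets). Both function names are hypothetical.

```python
# Hypothetical helpers for illustration; not part of any HGVS library.
def codon_number(coding_position):
    """Return the codon (amino acid) number containing a coding DNA position."""
    return (coding_position - 1) // 3 + 1


def is_predicted(protein_description):
    """True when the part after 'p.' is wrapped in parentheses."""
    _, _, body = protein_description.partition(".")
    return body.startswith("(") and body.endswith(")")


print(codon_number(4375))            # codon affected by c.4375C>T
print(is_predicted("p.(Arg1459*)"))  # predicted from DNA only
print(is_predicted("p.Arg1459*"))    # written without parentheses
```

Position 4375 is the first base of codon 1459, which matches the amino acid number in p.(Arg1459*).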

A database providing information on the structure of assembled genomes, assembly names and other meta-data, statistical reports, and links to genomic sequence data.

A curated set of metadata for culture collections, museums, herbaria and other natural history collections. The records display collection codes, information about the collections' home institutions, and links to relevant data at NCBI.

A collection of genomics, functional genomics, and genetics studies and links to their resulting datasets. This resource describes project scope, material, and objectives and provides a mechanism to retrieve datasets that are often difficult to find due to inconsistent annotation, multiple independent submissions, and the varied nature of diverse data types which are often stored in different databases.

The BioSample database contains descriptions of biological source materials used in experimental assays.

Database that groups biomedical literature, small molecules, and sequence data in terms of biological relationships.

A collection of biomedical books that can be searched directly or from linked data in other NCBI databases. The collection includes biomedical textbooks, other scientific titles, genetic resources such as GeneReviews, and NCBI help manuals.

A resource to provide a public, tracked record of reported relationships between human variation and observed health status with supporting evidence. Related information in the NIH Genetic Testing Registry (GTR), MedGen, Gene, OMIM, PubMed and other sources is accessible through hyperlinks on the records.

A registry and results database of publicly- and privately-supported clinical studies of human participants conducted around the world.

A centralized page providing access and links to resources developed by the Structure Group of the NCBI Computational Biology Branch (CBB). These resources cover databases and tools to help in the study of macromolecular structures, conserved domains and protein classification, small molecules and their biological activity, and biological pathways and systems.

A collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality.

A collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database.

The dbVar database has been developed to archive information associated with large scale genomic variation, including large insertions, deletions, translocations and inversions. In addition to archiving variation discovery, dbVar also stores associations of defined variants with phenotype information.

An archive and distribution center for the description and results of studies which investigate the interaction of genotype and phenotype. These studies include genome-wide association (GWAS), medical resequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.

Includes single nucleotide variations, microsatellites, and small-scale insertions and deletions. dbSNP contains population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations.

The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis. GenBank consists of several divisions, most of which can be accessed through the Nucleotide database. The exceptions are the EST and GSS divisions, which are accessed through the Nucleotide EST and Nucleotide GSS databases, respectively.

A searchable database of genes, focusing on genomes that have been completely sequenced and that have an active research community to contribute gene-specific data. Information includes nomenclature, chromosomal localization, gene products and their attributes (e.g., protein interactions), associated markers, phenotypes, interactions, and links to citations, sequences, variation details, maps, expression reports, homologs, protein domain content, and external databases.

A public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted and tools are provided to help users query and download experiments and curated gene expression profiles.

Stores curated gene expression and molecular abundance DataSets assembled from the Gene Expression Omnibus (GEO) repository. DataSet records contain additional resources, including cluster tools and differential expression queries.

Stores individual gene expression and molecular abundance Profiles assembled from the Gene Expression Omnibus (GEO) repository. Search for specific profiles of interest based on gene annotation or pre-computed profile characteristics.

A collection of expert-authored, peer-reviewed disease descriptions on the NCBI Bookshelf that apply genetic testing to the diagnosis, management, and genetic counseling of patients and families with specific inherited conditions.

Summaries of information for selected genetic disorders with discussions of the underlying mutation(s) and clinical features, as well as links to related databases and organizations.

A voluntary registry of genetic tests and laboratories, with detailed information about the tests such as what is measured and analytic and clinical validity. GTR also is a nexus for information about genetic conditions and provides context-specific links to a variety of resources, including practice guidelines, published literature, and genetic data/information. The initial scope of GTR includes single gene tests for Mendelian disorders, as well as arrays, panels and pharmacogenetic tests.

Contains sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.

The Genome Reference Consortium (GRC) maintains responsibility for the human and mouse reference genomes. Members consist of The Genome Center at Washington University, the Wellcome Trust Sanger Institute, the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI). The GRC works to correct misrepresented loci and to close remaining assembly gaps. In addition, the GRC seeks to provide alternate assemblies for complex or structurally variant genomic loci. At the GRC website, the public can view genomic regions currently under review, report genome-related problems and contact the GRC.

A centralized page providing access and links to glycoinformatics and glycobiology related resources.

A database of known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data.

A collection of consolidated records describing proteins identified in annotated coding regions in GenBank and RefSeq, as well as SwissProt and PDB protein sequences. This resource allows investigators to obtain more targeted search results and quickly identify a protein of interest.

A compilation of data from the NIAID Influenza Genome Sequencing Project and GenBank. It provides tools for flu sequence analysis, annotation and submission to GenBank. This resource also has links to other flu sequence resources, and publications and general information about flu viruses.

Subset of the NLM Catalog database providing information on journals that are referenced in NCBI database records, including PubMed abstracts. This subset can be searched using the journal title, MEDLINE or ISO abbreviation, ISSN, or the NLM Catalog ID.

MeSH (Medical Subject Headings) is the U.S. National Library of Medicine's controlled vocabulary for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.

A portal to information about medical genetics. MedGen includes term lists from multiple sources and organizes them into concept groupings and hierarchies. Links are also provided to information related to those concepts in the NIH Genetic Testing Registry (GTR), ClinVar, Gene, OMIM, PubMed, and other sources.

A comprehensive manual on the NCBI C++ toolkit, including its design and development framework, a C++ library reference, software examples and demos, FAQs and release notes. The manual is searchable online and can be downloaded as a series of PDF documents.

Provides links to tutorials and training materials, including PowerPoint slides and print handouts.

Part of the NCBI Handbook, this glossary contains descriptions of NCBI tools and acronyms, bioinformatics terms and data representation formats.

An extensive collection of articles about NCBI databases and software. Designed for a novice user, each article presents a general overview of the resource and its design, along with tips for searching and using available analysis tools. All articles can be searched online and downloaded in PDF format; the handbook can be accessed through the NCBI Bookshelf.

Accessed through the NCBI Bookshelf, the Help Manual contains documentation for many NCBI resources, including PubMed, PubMed Central, the Entrez system, Gene, SNP and LinkOut. All chapters can be downloaded in PDF format.

A project involving the collection and analysis of bacterial pathogen genomic sequences originating from food, environmental and patient isolates. Currently, an automated pipeline clusters and identifies sequences supplied primarily by public health laboratories to assist in the investigation of foodborne disease outbreaks and discover potential sources of food contamination.

Bibliographic data for all the journals, books, audiovisuals, computer software, electronic resources and other materials that are in the library's holdings.

A collection of nucleotide sequences from several sources, including GenBank, RefSeq, the Third Party Annotation (TPA) database, and PDB. Searching the Nucleotide Database will yield available results from each of its component databases.

A database of human genes and genetic disorders. NCBI maintains current content and continues to support its searching and integration with other NCBI databases. However, OMIM now has a new home on its own site, and users are directed there for full record displays.

Database of related DNA sequences that originate from comparative studies: phylogenetic, population, environmental and, to a lesser degree, mutational. Each record in the database is a set of DNA sequences. For example, a population set provides information on genetic variation within an organism, while a phylogenetic set may contain sequences, and their alignment, of a single gene obtained from several related organisms.

A collection of related protein sequences (clusters), consisting of Reference Sequence proteins encoded by complete prokaryotic and organelle plasmids and genomes. The database provides easy access to annotation information, publications, domains, structures, external links, and analysis tools.

A database that includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

A database that includes a collection of models representing homologous proteins with a common function. It includes conserved domain architecture, hidden Markov models and BlastRules. A subset of these models are used by the Prokaryotic Genome Annotation Pipeline (PGAP) to assign names and other attributes to predicted proteins.

Consists of deposited bioactivity data and descriptions of bioactivity assays used to screen the chemical substances contained in the PubChem Substance database, including descriptions of the conditions and the readouts (bioactivity levels) specific to the screening procedure.

Contains unique, validated chemical structures (small molecules) that can be searched using names, synonyms or keywords. The compound records may link to more than one PubChem Substance record if different depositors supplied the same structure. These Compound records reflect validated chemical depiction information provided to describe substances in PubChem Substance. Structures stored within PubChem Compounds are pre-clustered and cross-referenced by identity and similarity groups. Additionally, calculated properties and descriptors are available for searching and filtering of chemical structures.

PubChem Substance records contain substance information electronically submitted to PubChem by depositors. This includes any chemical structure information submitted, as well as chemical names, comments, and links to the depositor's web site.

A database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals. Links are provided when full text versions of the articles are available via PubMed Central (described below) or other websites.

A digital archive of full-text biomedical and life sciences journal literature, including clinical medicine and public health.

RefSeqGene: A collection of human gene-specific reference genomic sequences. RefSeqGene is a subset of NCBI’s RefSeq database, and its sequences are defined based on review from curators of locus-specific databases and the genetic testing community. They form a stable foundation for reporting mutations, for establishing consistent intron and exon numbering conventions, and for defining the coordinates of other biologically significant variation. RefSeqGene is a part of the Locus Reference Genomic (LRG) Collaboration.

Reference Sequence (RefSeq): A collection of curated, non-redundant genomic DNA, transcript (RNA), and protein sequences produced by NCBI. RefSeqs provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. The RefSeq collection is accessed through the Nucleotide and Protein databases.

A collection of resources specifically designed to support the research of retroviruses, including a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of numerous retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records.

A summary of data for the SARS coronavirus (CoV), including links to the most recent sequence data and publications, links to other SARS related resources, and a pre-computed alignment of genome sequences from various isolates.

The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

Contains macromolecular 3D structures derived from the Protein Data Bank, as well as tools for their visualization and comparative analysis.

Contains the names and phylogenetic lineages of more than 160,000 organisms that have molecular data in the NCBI databases. New taxa are added to the Taxonomy database as data are deposited for them.

A database that contains sequences built from the existing primary sequence data in GenBank. The sequences and corresponding annotations are experimentally supported and have been published in a peer-reviewed scientific journal. TPA records are retrieved through the Nucleotide Database.

A repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.

A wide range of resources, including a brief summary of the biology of viruses, links to viral genome sequences in Entrez Genome, and information about viral Reference Sequences, a collection of reference sequences for thousands of viral genomes.

An extension of the Influenza Virus Resource to other organisms, providing an interface to download sequence sets of selected viruses, analysis tools, including virus-specific BLAST pages, and genome annotation pipelines.


BLAST executables for local use are provided for Solaris, LINUX, Windows, and MacOSX systems. See the README file in the ftp directory for more information. Pre-formatted databases for BLAST nucleotide, protein, and translated searches also are available for downloading under the db subdirectory.

Sequence databases for use with the stand-alone BLAST programs. The files in this directory are pre-formatted databases that are ready to use with BLAST.

This site provides full data records for CDD, along with individual Position Specific Scoring Matrices (PSSMs), mFASTA sequences and annotation data for each conserved domain. See the README file for full details.

This site provides full data extractions in XML and summary data in VCF format. It contains files with information about standard terms used in ClinVar, MedGen, and GTR.

Sequence databases in FASTA format for use with the stand-alone BLAST programs. These databases must be formatted using formatdb before they can be used with BLAST.

This site contains files for all sequence records in GenBank in the default flat file format. The files are organized by GenBank division, and the full contents are described in the README.genbank file.

The protein sequences corresponding to the translations of coding sequences (CDS) in GenBank are collected for each GenBank release. Please see the README file in the directory for more information.

This site contains three directories: DATA, GeneRIF and tools. The DATA directory contains files listing all data linked to GeneIDs along with subdirectories containing ASN.1 data for the Gene records. The GeneRIF (Gene References into Function) directory contains PubMed identifiers for articles describing the function of a single gene or interactions between products of two genes. Sample programs for manipulating gene data are provided in the tools directory. Please see the README file for details.

This site contains GEO data in two formats: SOFT (Simple Omnibus in Text Format) and MINiML (MIAME Notation in Markup Language). Summary text files and supplementary data are also available. Please see the README.TXT file for more information.

This site contains genome sequence and mapping data for organisms in Entrez Genome. The data are organized in directories for single species or groups of species. Mapping data are collected in the directory MapView and are organized by species. See the README file in the root directory and the README files in the species subdirectories for detailed information.

Contains directories for each genome that include available mapping data for current and previous builds of that genome.

This site contains the full taxonomy database along with files associating nucleotide and protein sequence records with their taxonomy IDs. See the taxdump_readme.txt and gi_taxid.readme files for more information.

This site provides data from the PubChem Substance, Compound and Bioassay databases for download via ftp. Full downloads of the databases are available along with daily, weekly and monthly updates for Substance and Compound. Substance and Compound data are provided in ASN.1, SDF and XML formats. See the README files for more information.

This site contains all nucleotide and protein sequence records in the Reference Sequence (RefSeq) collection. The "release" directory contains the most current release of the complete collection, while data for selected organisms (such as human, mouse and rat) are available in separate directories. Data are available in FASTA and flat file formats. See the README file for details.

This site contains SKY-CGH data in ASN.1, XML and EasySKYCGH formats. See the skycghreadme.txt file for more information.

Downloadable data for SNP.

This site contains next-generation sequencing data organized by the submitted sequencing project.

FTP download site for NCBI databases, tools, and utilities.

This site contains ASN.1 data for all records in MMDB along with VAST alignment data and the non-redundant PDB (nr-PDB) data sets. See the README file for more information.

This site contains the trace chromatogram data organized by species. Data include chromatogram, quality scores, FASTA sequences from automatic base calls, and other ancillary information in tab-delimited text as well as XML formats. See the README file for details.

This site contains the UniVec and UniVec_Core databases in FASTA format. See the README.uv file for details.

This site contains whole genome shotgun sequence data organized by the 4-digit project code. Data include GenBank and GenPept flat files, quality scores and summary statistics. See the README.genbank.wgs file for more information.

Open-access data generally include summaries of genotype/phenotype association studies, descriptions of the measured variables, and study documents, such as the protocol and questionnaires. Access to individual-level data, including phenotypic data tables and genotypes, requires varying levels of authorization.

NLM leases MEDLINE/PubMed to U.S. individuals or organizations.

Specifications for NCBI data in ASN.1 or DTD format are available on the Index of data_specs page. The "NCBI_data_conversion.html" links to the conversion tool.

A suite of tag sets for authoring and archiving journal articles as well as transferring journal articles from publishers to archives and between archives. There are four tag sets:

  • Archiving and Interchange Tag Set - Created to enable an archive to capture as many of the structural and semantic components of existing printed and tagged journal material as conveniently as possible.
  • Journal Publishing Tag Set - Optimized for archives that wish to regularize and control their content, not to accept the sequence and arrangement presented to them by any particular publisher.
  • Article Authoring Tag Set - Designed for authoring new journal articles.
  • NCBI Book Tag Set - Written specifically to describe volumes for the NCBI online libraries.

This service allows users to download compound or substance records corresponding to a set of PubChem identifiers, which can be supplied manually or through a text file. Numerous download formats are available, including SDF, XML and SMILES.

The PMC Open-Access Subset is a relatively small part of the total collection of articles in PMC. Whereas the majority of articles in PMC are subject to traditional copyright restrictions, the articles in this subset are still protected by copyright but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyright. Please refer to the license statement in each article for specific terms of use.

Subscribe to Web/RSS feeds for updates about NCBI resources.


An online form that provides an interface for researchers, consortia and organizations to register their BioProjects. This serves as the starting point for the submission of genomic and genetic data for the study. The data does not need to be submitted at the time of BioProject registration.

Guidelines and instructions for submitting assertions about the pathogenicity of human genetic variants. These submissions can include summary data about a variant (variant level/aggregate data); support for variants per case (case-level) is in development.

Guidelines and requirements for submitting genotype and phenotype association data to dbGaP.

A web-based sequence submission tool for one or a few submissions to the GenBank database, designed to make the submission process quick and easy.

Tool for submission to the GenBank database of Barcode short nucleotide sequences from a standard genetic locus for use in species identification.

A stand-alone software tool developed by the NCBI for submitting and updating entries to public sequence databases (GenBank, EMBL, or DDBJ). It is capable of handling simple submissions that contain a single short mRNA sequence, complex submissions containing long sequences, multiple annotations, segmented sets of DNA, as well as sequences from phylogenetic and population studies with alignments. For simple submission, use the online submission tool BankIt instead.

A command-line program that automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences.

Submit expression data, such as microarray, SAGE or mass spectrometry datasets to the NCBI Gene Expression Omnibus (GEO) database.

GeneRIF provides a simple mechanism to allow scientists to add to the functional annotation of genes in the Gene database.

Guidelines and instructions for registering laboratories and submitting genetic test information including clinical and research tests for germline or somatic test targets. GTR welcomes registration of cytogenetic, biochemical, and molecular tests for Mendelian disorders, pharmacogenetic phenotypes and complex panels.

The NIH Manuscript Submission (NIHMS) System is used to submit manuscripts that arise from NIH funding to the PubMed Central digital archive, in accordance with the NIH Public Access Policy and the law it implements. The law and Public Access Policy are intended to ensure that the public has access to the published results of NIH-funded research.

This site enables users to submit data to the PubChem Substance and BioAssay databases, including chemical structures, experimental biological activity results, annotations, siRNA data and more. It can also be used to update previously submitted records.

The SNP database tools page provides links to the general submission guidelines and to the submission handle request. The page also has two specific links for single or batch submissions of human variation data using Human Genome Variation Society nomenclature.

This link describes how submitters of SRA data can obtain a secure NCBI FTP site for their data, and also describes the allowed data formats and directory structures.

A single entry point for submitters to link to and find information about all of the data submission processes at NCBI. Currently, this serves as an interface for the registration of BioProjects and BioSamples and submission of data for WGS and GTR. Future additions to this site are planned.

This link describes how submitters of trace data can obtain a secure NCBI FTP site for their data, and also describes the allowed data formats and directory structures.


An interactive graphical viewer that allows users to explore variant calls, genotype calls and supporting evidence (such as aligned sequence reads) that have been produced by the 1000 Genomes Project.

This tool allows users to explore the characteristics of amino acids by comparing their structural and chemical properties, predicting protein sequence changes caused by mutations, viewing common substitutions, and browsing the functions of given residues in conserved domains.

Performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

Performs a BLAST search of the genomic sequences in the RefSeqGene/LRG set. The default display provides ready navigation to review alignments in the Graphics display.

This page links to a number of BLAST-related tutorials and guides, including a selection guide for BLAST algorithms, descriptions of BLAST output formats, explanations of the parameters for stand-alone BLAST, directions for setting up stand-alone BLAST on local machines and using the BLAST URL API.

Finds regions of local similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Allows you to retrieve records from many Entrez databases by uploading a file of GI or accession numbers from the Nucleotide or Protein databases, or a file of unique identifiers from other Entrez databases. Search results can be saved in various formats directly to a local file on your computer.

A stand-alone application for classifying protein sequences and investigating their evolutionary relationships. CDTree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies, and also allows users to create their own. CDTree is tightly integrated with Entrez CDD and Cn3D, and allows users to create and update protein domain alignments.

COBALT is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.

Part of the NCBI Bookshelf, Coffee Break combines reports on recent biomedical discoveries with use of NCBI tools. Each report incorporates interactive tutorials that show how NCBI bioinformatics tools are used as a part of the research process.

Displays the functional domains that make up a given protein sequence. It lists proteins with similar domain architectures and can retrieve proteins that contain particular combinations of domains.

Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

Tools that provide access to data within NCBI's Entrez system outside of the regular web query interface. They provide a method of automating Entrez tasks within software applications. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL.
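As a minimal sketch of what "writing a specially formatted URL" can look like, the Python snippet below assembles an EFetch request for a nucleotide record in FASTA format. The helper name `efetch_url` is hypothetical and introduced only for illustration; the snippet builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build an EFetch URL (hypothetical helper; no network access happens here)."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Request the chromosome 12 reference record as FASTA text.
url = efetch_url("nuccore", "NC_000012.12")
print(url)
```

Opening the printed URL in a browser, or passing it to any HTTP client, would retrieve the record. Scripts that send many such requests should follow NCBI's usage guidelines.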

A tool that allows users to construct an E-utility analysis pipeline using an online form, and then generates a Perl script to execute the pipeline.

Tool for aligning a query sequence (nucleotide or protein) to GenBank sequences included on microarray or SAGE platforms in the GEO database.

Displays the genetic codes for organisms in the Taxonomy database in tables and on a taxonomic tree.

This tool compares nucleotide or protein sequences to genomic sequence databases and calculates the statistical significance of matches using the Basic Local Alignment Search Tool (BLAST) algorithm.

A genome browser for interactive navigation of eukaryotic RefSeq genome assemblies with comprehensive inspection of gene, expression, variation and other annotations. GDV offers easy-to-load analytical track pre-configurations, a menu of data tracks for easy display and customization, and supports upload and analysis of user data. This browser also enables the production of displays for publishing.

An online tool that assists in the production of journal quality figures of annotations on an ideogram or sequence representation of an assembly.

NCBI's Remap tool allows users to project annotation data and convert locations of features from one genomic assembly to another or to RefSeqGene sequences through a base by base analysis. Options are provided to adjust the stringency of remapping, and summary results are displayed on the web page. Full results can be downloaded for viewing in NCBI's Genome Workbench graphical viewer, and annotation data for the remapped features, as well as summary data, is also available for download.

An integrated application for viewing and analyzing sequence data. With Genome Workbench, you can view data in publicly available sequence databases at NCBI, and mix these data with your own data.

A service that allows third parties to link directly from PubMed and other Entrez database records to relevant web-accessible resources beyond the Entrez system. Examples of LinkOut resources include full-text publications, biological databases, consumer health information and research tools.

Provides special browsing capabilities of maps and assembled sequences for a subset of organisms. You can view and search an organism's complete genome, display maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest.

An interactive web application that enables users to visualize multiple alignments created by database search results or other software applications. The MSA Viewer allows users to upload an alignment and set a master sequence, and to explore the data using features such as zooming and changing of coloration.

Provides information on new and updated resources and NCBI research and development projects. The News site contains feature articles highlighting services, resource features and tools, as well as frequent postings describing important announcements regarding key datasets and services of interest to the user community. Links to NCBI's social media sites, along with a list of available RSS feeds and email listservs, are provided.

A set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the Toolbox is primarily designed to read records in Abstract Syntax Notation 1 (ASN.1) format, an International Standards Organization (ISO) data representation format.

A public domain quality assurance software package that facilitates the assessment of multiplex short tandem repeat (STR) DNA profiles based on laboratory-specific protocols. OSIRIS evaluates the raw electrophoresis data using an independently derived mathematically-based sizing algorithm. It offers two new peak quality measures - fit level and sizing residual. It can be customized to accommodate laboratory-specific signatures such as background noise settings, customized naming conventions and additional internal laboratory controls.

A graphical analysis tool that finds all open reading frames in a user's sequence or in a sequence already in the database. Sixteen different genetic codes can be used. The deduced amino acid sequence can be saved in various formats and searched against protein databases using BLAST.
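The idea behind open reading frame detection can be sketched in a few lines of Python. This is not the tool's actual implementation: it is a simplified illustration that assumes the standard genetic code, scans only the three forward-strand frames, and uses a hypothetical function name, `find_orfs`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


example = "CCATGAAATAGGATGCCCTGA"
print(find_orfs(example))
```

A fuller version would also scan the reverse complement and allow alternative genetic codes, as the tool description indicates.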

Allows users to display, sort, subset and download position-specific score matrices (PSSMs) either from CDD records or from Position Specific Iterated (PSI)-BLAST protein searches. The tool also can align a query protein to the PSSM and highlight positions of high conservation.

Supports finding human phenotype/genotype relationships with queries by phenotype, chromosome location, gene, and SNP identifiers. Currently includes information from dbGaP, the NHGRI GWAS Catalog, and GTEx. Displays results on the genome, on sequence, or in tables for download.

The Primer-BLAST tool uses Primer3 to design PCR primers to a sequence template. The potential products are then automatically analyzed with a BLAST search against user specified databases, to check the specificity to the target intended.

A utility for computing alignments of proteins to genomic nucleotide sequence. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

PUG provides access to PubChem services via a programmatic interface. PUG allows users to download data, initiate chemical structure searches, standardize chemical structures and interact with the E-utilities. PUG can be accessed using either standard URLs or via SOAP.

Standardization, in PubChem terminology, is the processing of chemical structures in the same way used to create PubChem Compound records from contributors' original structures. This service lets users see how PubChem would handle any structure they would like to submit.

PubChem Structure Search allows the PubChem Compound Database to be queried by chemical structure or chemical structure pattern. The PubChem Sketcher allows a query to be drawn manually. Users may also specify the structural query input by PubChem Compound Identifier (CID), SMILES, SMARTS, InChI, Molecular Formula, or by upload of a supported structure file format.

A specialized PubMed search form targeted to clinicians and health services researchers. The page simplifies searching by clinical study category, finding systematic reviews and searching the medical genetics literature.

A collection of web and Flash tutorials on PubMed searching and linking, saving searches in MyNCBI, using MeSH and other PubMed services.

The Related Structures tool allows users to find 3D structures from the Molecular Modeling Database (MMDB) that are similar in sequence to a query protein. Although the query protein may not yet have a resolved structure, the 3D shape of a similar protein sequence can shed light on the putative shape and biological function of the query protein.

A variety of tools are available for searching the SNP database, allowing search by genotype, method, population, submitter, markers and sequence similarity using BLAST. These are linked under "Search" on the left side bar of the dbSNP main page.

Sequence Cytogenetic Conversion Service

An online tool that converts sequence and cytogenetic coordinates for human, rat, mouse and fruit fly genomic assemblies.

Sequence Viewer

Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component. Detailed documentation including an API Reference guide is available for developers wishing to embed the viewer in their own pages.

A utility for computing cDNA-to-Genomic sequence alignments. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, Splign is accurate in determining splice sites and tolerant to sequencing errors.
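For orientation, the plain Needleman-Wunsch recurrence that such tools build on can be sketched as follows. This is only the textbook global alignment score with a linear gap penalty; it does not model introns or splice signals, and the scoring values are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two residues, gap in b, gap in a.
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

A splice-aware aligner would additionally let long gaps in the cDNA be scored as introns when they are flanked by plausible splice signals, rather than penalizing them residue by residue.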

Supports searching the taxonomy tree using partial taxonomic names, common names, wild cards and phonetically similar names. For each taxonomic node, the tool provides links to all data in Entrez for that node, displays the lineage, and provides links to external sites related to the node.

Generates a taxonomic tree for a selected group of organisms. Users can upload a file of taxonomy IDs or names, or they can enter names or IDs directly.

Displays the number of taxonomic nodes in the database for a given rank and date of inclusion.

Displays the current status of a set of taxonomic nodes or IDs.

A tool for creating and displaying phylogenetic tree data. Tree Viewer enables analysis of your own sequence data, produces printable vector images as PDFs, and can be embedded in a webpage.

Variation Viewer

A genomic browser to search and view genomic variations listed in dbSNP, dbVar, and ClinVar databases. Searches can be performed using chromosomal location, gene symbol, phenotype, or variant IDs from dbSNP and dbVar. The browser enables exploration of results in a dynamic graphical sequence viewer with annotated tables of variations.

VecScreen

A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. VecScreen searches a query sequence for segments that match any sequence in a specialized non-redundant vector database (UniVec).

A computer algorithm that identifies similar protein 3-dimensional structures. Structure neighbors for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone.

This tool helps identify the genotype of a viral sequence. A window is slid along the query sequence and each window is compared by BLAST to each of the reference sequences for a particular virus.
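The sliding-window idea can be sketched in Python as below. This is a simplified stand-in, not the tool's method: each window is scored by percent identity rather than by BLAST, the reference sequences are assumed to be already aligned to the query coordinates, and all names and sequences are hypothetical.

```python
def sliding_windows(seq, size, step):
    """Yield (start, window) pairs for windows of `size` moved along by `step`."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def genotype_windows(query, references, size, step):
    """Assign each window to the best-matching reference (identity replaces BLAST)."""
    calls = []
    for start, window in sliding_windows(query, size, step):
        best = max(
            references,
            key=lambda name: identity(window, references[name][start:start + size]),
        )
        calls.append((start, best))
    return calls


# Toy data: the query resembles genotype A at the start and genotype B at the end.
references = {"A": "AAAAAAAA", "B": "CCCCCCCC"}
calls = genotype_windows("AAAACCCC", references, size=4, step=4)
print(calls)
```

Reporting a call per window, rather than one call for the whole sequence, is what allows a query whose segments match different reference genotypes to be recognized.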

Genes: Properties, Classification and Fine Structure | Genetics

Gene has been described by different researchers in various ways.

A gene has various structural and functional properties which are briefly described below:

The alternative form of a gene is known as an allele. Generally each gene has two allelic forms. One of these forms is known as the wild type and the other as the mutant type. Allelic forms are known as dominant and recessive. Some genes have multiple allelic forms, but only two of them are present at a time in a true diploid individual.

Genes are located on the chromosome in a linear fashion like beads on a string. The position which is occupied by a gene on the chromosome is called a locus. Studies on linkage, crossing over, sex chromosomes, sex linkage and bacterial transformation and transduction have clearly demonstrated that genes are located on the chromosomes.

Earlier it was believed that genes are the smallest units of inheritance, which cannot be divided further. But Benzer demonstrated in 1955 that a gene consists of several smaller units, namely the cistron, recon and muton, which are the units of function, recombination and mutation within the gene.

Each diploid individual has two copies of each gene, and gametic cells have one copy of each gene. Each individual has a large number of structural and functional features or characters, and each character is controlled by one or more genes.

Thus, each individual has a large number of genes. The total number of genes in an individual is always higher than the number of chromosomes. Thus, each chromosome has several genes. The number of genes per chromosome is also fixed, but it may be altered by deletion and duplication.

Genes have a specific sequence on the chromosome. The gene sequence is altered by structural chromosomal changes, especially translocations and inversions.

Genes express in various ways. They may show incomplete dominance, complete dominance, overdominance and lack of dominance. When there is lack of dominance, the expression is intermediate between the two parents. The gene which is expressed is known as the dominant gene and the one which is suppressed is known as the recessive gene. The phenotypic expression of genes depends on allelic and non-allelic interactions.

7. Change in Form:

The gene may sometimes change from one allelic form to another. The change in the form of a gene is brought about by gene mutation, and the changed form is called the mutant gene. Generally the change occurs from the dominant to the recessive form; the reverse change is very rare.

8. Exchange of Genes:

The exchange of genes occurs between non-sister chromatids of homologous chromosomes due to crossing over and between non-homologous chromosomes due to translocation.

A gene is a macromolecule composed of DNA. In most organisms the genetic material is DNA. However, the genetic material in some bacteriophages is RNA.

Each gene is duplicated at the time of chromosome duplication or replication. It is believed that chromosome duplication takes place because of gene duplication.

The primary function of each gene is to control the expression of a specific character in an organism. However, sometimes two or more genes are involved in the expression of some characters. The characters which are governed by one or few genes are known as oligogenic traits and those characters which are governed by several genes are referred to as polygenic characters.

In some cases, a single gene has manifold effects, that is, it controls the expression of more than one character. Such genes are known as pleiotropic genes. Each gene controls the production of one enzyme or one polypeptide chain, which in turn governs the expression of a specific character.

Genes in diploid organisms occur in pairs of alleles. The members of a pair segregate precisely like chromosomes during meiosis. Thus genes show segregation during meiosis.

When a character is governed by two or more genes, they sometimes show interaction. In such interaction one gene has masking effect over the other. The masking gene is known as epistatic gene and the gene which is masked or suppressed is called hypostatic gene. Gene interaction leads to modification of normal dihybrid segregation ratio into various other types of ratios.

Sometimes two or more genes are inherited together; such genes are referred to as linked genes. Some genes are linked with a particular sex; they are called sex-linked genes.

It is quite clear from the above discussion that there are some similarities or parallel features between chromosomes and genes (Table 13.3).

Classification of Genes:

Genes can be classified in various ways. The classification of genes is generally done on the basis of:

A brief classification of genes on the basis of above criteria is presented in Table 13.4.

Changing Concept of Gene:

The concept of gene has been the focal point of study from the beginning of twentieth century to establish the basis of heredity. The gene has been examined from two main angles, i.e., (1) genetic view, and (2) biochemical and molecular view.

These aspects are briefly described below:

1. A Genetic View:

The genetic view or perspective of gene is based mainly on the Mendelian inheritance, chromosomal theory of inheritance and linkage studies. Mendel used the term factors for genes and reported that factors were responsible for transmission of characters from parents to their offspring.

Sutton and Boveri (1903), based on studies of mitosis and meiosis, established the parallel behaviour of chromosomes and genes. They reported that both chromosomes and genes segregate and exhibit random assortment, which clearly demonstrated that genes are located on chromosomes. The Sutton-Boveri hypothesis is known as the chromosome theory of inheritance.

Morgan, based on linkage studies in Drosophila, reported that genes are located on the chromosome in a linear fashion. Some genes do not assort independently because of linkage between them. He suggested that recombinants are the result of crossing over.

The frequency of crossing over increases with the distance between two genes. The number of linkage groups is the same as the haploid number of chromosomes. The chromosome theory and linkage studies reveal that genes are located on the chromosomes. This view is sometimes called the bead theory.

The important points about the bead theory are given below:

i. The gene is viewed as a fundamental unit of structure, indivisible by crossing over. Crossing over occurs between genes but not within a gene.

ii. The gene is considered as a basic unit of change or mutation. It changes from one allelic form to another, but there are no smaller components within a gene that can change.

iii. The gene is viewed as a basic unit of function. Parts of a gene, if they exist, cannot function.

The chromosome has been viewed merely as a vector or transporter of genes that exists simply to permit their orderly segregation and to shuffle them in recombination. The bead theory is no longer valid for any of the above three points.

Evidence is now available which indicates that:

(i) A gene is divisible.

(ii) Part of a gene can function.

i. The Gene is Divisible:

Earlier it was believed that gene is a basic unit of structure which is indivisible by crossing over. In other words, crossing over occurs between genes but not within a gene. Now, intragenic recombination has been observed in many organisms which indicates that a gene is divisible.

Intragenic recombination has the following two main features:

1. It occurs at a very low frequency, so a very large test cross progeny is required for its detection. Benzer expected to be able to detect a recombination frequency as low as 10⁻⁶, but the lowest he actually found was 10⁻⁴ (0.01% × 2 = 0.02%).

2. The alleles in which intragenic recombination occurs are separated by small distances within a gene and are functionally related.
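The arithmetic in point 1 can be sketched as a short Python calculation. It rests on one reading of the text's "× 2": only the wild-type recombinants are assumed to be scored, and the count is doubled to allow for the reciprocal double-mutant class that goes undetected. The function name and the progeny counts are hypothetical.

```python
import math


def recombination_frequency_percent(wild_type_recombinants, total_progeny):
    """Recombination frequency in percent, doubling the scored wild-type class.

    Assumption: the reciprocal (double-mutant) recombinants are not detected,
    so the observed wild-type recombinants represent half of all recombinants.
    """
    if total_progeny <= 0:
        raise ValueError("total_progeny must be positive")
    return 2 * wild_type_recombinants / total_progeny * 100


# One wild-type recombinant per 10,000 progeny is a detected frequency of 10^-4 (0.01%).
frequency = recombination_frequency_percent(1, 10_000)
print(f"{frequency:.2f}%")
```

The same function shows why very large progenies are needed: at a detected frequency of 10⁻⁶, about a million progeny must be screened to expect a single recombinant.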

Examples of intragenic recombination include bar eye, star asteroid eye and lozenge eye in Drosophila. The bar locus is briefly described below. Lozenge eye and star asteroid have been discussed under pseudo alleles.

Bar Eye in Drosophila:

The first case of intragenic recombination was recorded in Drosophila for the bar locus, which controls the size of the eye. The bar locus contains more than one unit of function. The dominant bar gene in Drosophila produces a slit-like eye instead of the normal oval eye. The bar phenotype is caused by tandem duplication of the 16A region of the X chromosome, which results from unequal crossing over.

Flies with different doses of the 16A region have different types of eye as follows:

Homozygous bar-eyed flies (B/B) produced both wild-type and ultra-bar progeny at a low frequency. This frequency was nevertheless much higher than that expected from spontaneous mutation, which indicated intragenic recombination at the bar locus.

ii. Part of a Gene Can Function:

It was earlier considered that the gene is the basic unit of function and that parts of a gene, if they exist, cannot function. This concept is now outdated. Based on studies on the rII locus of T4 phage, Benzer (1955) concluded that there are three subdivisions of a gene, viz., recon, muton and cistron.

These are briefly described below:

a. Recon:

Recons are the regions (units) within a gene between which recombination can occur, but recombination cannot occur within a recon. There is a minimum recombination distance within a gene which separates recons. The map of a gene is a completely linear sequence of recons.

b. Muton:

The muton is the smallest element within a gene that can give rise to a mutant phenotype or mutation. This indicates that part of a gene can mutate or change. This disproved the bead theory, according to which the entire gene had to mutate or change.

c. Cistron:

The cistron is the largest element within a gene and is the unit of function. This also disproved the bead theory, according to which the entire gene was the unit of function. The name cistron is derived from the test which is performed to determine whether two mutations are within the same cistron or in different cistrons. It is called the cis-trans test, which is described below.

d. Cis-Trans Test:

When two mutations in the trans position produce a mutant phenotype, they are in the same cistron. Complementation in the trans position (appearance of the wild type) indicates that the mutant sites are in different cistrons. There is no complementation between mutations within a cistron.

It is now known that some genes consist of only one cistron, while others consist of two or even more. For example, the mutants miniature (m) and dusky (dy) both decrease wing size in Drosophila and map in the same part of the X chromosome. But when they are brought together in a dy +/+ m heterozygote, the phenotype is normal, which indicates that the locus concerned with wing size is composed of at least two cistrons.
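The logic of the cis-trans test can be sketched as a small Python model. It is an illustration under simplifying assumptions: each mutation is taken to knock out exactly one cistron, a single intact copy of a cistron is taken to be sufficient for function, and the cistron labels are hypothetical.

```python
def trans_phenotype(cistron_a, cistron_b, cistrons):
    """Phenotype of a trans heterozygote carrying one mutation on each homologue.

    cistron_a and cistron_b name the cistron hit by each mutation; `cistrons`
    lists every cistron required for the wild-type phenotype.
    """
    needed = set(cistrons)
    working_on_homologue_1 = needed - {cistron_a}
    working_on_homologue_2 = needed - {cistron_b}
    # Complementation: every cistron is intact on at least one homologue.
    if working_on_homologue_1 | working_on_homologue_2 == needed:
        return "wild type"
    return "mutant"


wing_cistrons = ["m", "dy"]  # hypothetical labels for the two wing-size cistrons
print(trans_phenotype("m", "dy", wing_cistrons))  # mutations in different cistrons
print(trans_phenotype("m", "m", wing_cistrons))   # two mutations in the same cistron
```

Read in reverse, this is how the test is used in practice: a wild-type trans heterozygote, as in the miniature and dusky example, places the two mutations in different cistrons.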

2. A Biochemical View:

It is now generally believed that a gene is a sequence of nucleotides in DNA which controls a single polypeptide chain. The different mutations of a gene may be due to change in single nucleotide at more than one location in the gene. Crossing over can take place between the altered nucleotides within a gene.

Since the mutant nucleotides are placed so close together, crossing over between them is expected at a very low frequency. When several different genes which affect the same trait are located so close together that crossing over between them is rare, the term complex locus is applied to them. Within the nucleotide sequence of DNA which represents a gene, multiple alleles are due to mutations at different points within the gene.

Fine Structure of Gene:

Benzer, in 1955, divided the gene into recon, muton and cistron which are the units of recombination, mutation and function within a gene. Several units of this type exist in a gene. In other words, each gene consists of several units of function, mutation and recombination. The fine structure of gene deals with mapping of individual gene locus.

This is parallel to the mapping of chromosomes. In chromosome mapping, various genes are assigned on a chromosome, whereas in case of a gene several alleles are assigned to the same locus. The individual gene maps are prepared with the help of intragenic recombination.

Since the frequency of intragenic recombination is extremely low, a very large population has to be grown to obtain such rare recombinants. Prokaryotes are suitable material for growing large populations. In Drosophila, 14 alleles of the lozenge gene map at four mutational sites which belong to the same locus (Green, 1961). Similarly, for rosy eye in Drosophila, different alleles map at 10 mutational sites of the same locus.

Description of Each Gene:

There are some genes which are different from normal genes either in terms of their nucleotide sequences or functions. Some examples of such genes are split gene, jumping gene, overlapping gene and pseudo gene.

A brief description of each of these genes is presented below:

Usually a gene has a continuous sequence of nucleotides. In other words, there is no interruption in the nucleotide sequence of a gene. Such a nucleotide sequence codes for a particular single polypeptide chain. However, it was observed that in some genes the sequence of nucleotides is not continuous but is interrupted by intervening sequences.

Such genes with interrupted sequence of nucleotides are referred to as split genes or interrupted genes. Thus, split genes have two types of sequences, viz., normal sequences and interrupted sequences.

i. Normal Sequence:

This represents the sequence of nucleotides which are included in the mature mRNA that is transcribed from the DNA of a split gene (Fig. 13.2). These sequences code for a particular polypeptide chain and are known as exons.

ii. Interrupted Sequence:

The intervening or interrupted sequences of a split gene are known as introns. These sequences do not code for any peptide chain. Moreover, interrupted sequences are not included in the mature mRNA derived from the DNA of split genes.

The interrupted sequences are removed from the primary transcript during its processing (Fig. 13.2). In other words, the intervening sequences are discarded as they are non-coding sequences. The coding sequences or exons are then joined together (spliced) to form the mature mRNA.
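The removal of intervening sequences and the joining of exons can be sketched in Python as simple string surgery. The toy transcript, the intron coordinates and the function name `splice` are all hypothetical; real splicing depends on splice-site signals that this sketch does not model.

```python
def splice(pre_mrna, introns):
    """Remove intron intervals (0-based, end-exclusive) and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon lying before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


# Toy primary transcript: exons in upper case, one intron in lower case.
exon_1, intron, exon_2 = "AUGGCC", "guaaguuuag", "GAUUGA"
pre_mrna = exon_1 + intron + exon_2
introns = [(len(exon_1), len(exon_1) + len(intron))]

mature_mrna = splice(pre_mrna, introns)
print(mature_mrna)
```

A gene with seven introns would be handled the same way, with seven intervals in the list and eight exons joined in order.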

The first case of split gene was reported for ovalbumin gene of chickens. The ovalbumin gene has been reported to consist of seven intervening sequences (Fig. 13.2). Later on interrupted sequences (split genes) were reported for beta globin genes of mice and rabbits, tRNA genes of yeast and ribosomal genes of Drosophila.

The intervening sequences are determined with the help of the R-loop technique. This technique consists of hybridization between the mRNA and the DNA of the same gene under ideal conditions, i.e., at high temperature and a high concentration of formamide. The mRNA pairs with a single strand of DNA.

The non-coding sequences or intervening sequences of DNA make loop in such pairing. The number of loops indicates the number of interrupted sequences and the size of loop indicates length of the intervening sequence. These loops can be viewed under electron microscope.

The ovalbumin gene has seven interrupted sequences (introns) and eight coding sequences (exons). The beta globin gene has been reported to have two intervening sequences, one 550 nucleotides long and the other 125 nucleotides long.

The intervening sequences are excised during processing to form mature mRNA molecule. Thus, about half of the ovalbumin gene is discarded during processing. Earlier it was believed that there is co-linearity (correspondence) between the nucleotide sequence and the sequence of amino acids which it specifies.

The discovery of split genes has disproved the concept of co-linearity of genes. Now co-linearity between genes and their products is considered as a chance rather than a rule. Split genes have been reported mostly in eukaryotes.

2. Jumping Genes:

Generally, a gene occupies a specific position on the chromosome called locus. However, in some cases a gene keeps on changing its position within the chromosome and also between the chromosomes of the same genome. Such genes are known as jumping genes or transposons or transposable elements.

The first case of a jumping gene was reported by Barbara McClintock in maize as early as 1950. However, like that of Mendel, her work did not get recognition for a long time. Because she was far ahead of her time and this was an unusual finding, it was not appreciated for many years. The concept was recognized in the early seventies and McClintock was awarded the Nobel Prize for this work in 1983.

Later on transposable elements were reported in the chromosome of E. coli and other prokaryotes. In E. coli, some DNA segments were found moving from one location to other location. Such DNA segments are detected by their presence at such a position in the nucleotide sequence, where they were not present earlier. The transposable elements are of two types, viz., insertion sequence and transposons.

There are different types of insertion sequences each with specific properties. Such sequences do not specify for protein and are of very short length. Such sequences have been reported in some bacteria, bacteriophages and plasmids.

Transposons are coding sequences which code for one or more proteins. They are usually very long sequences of nucleotides including several thousand base pairs. Transposable elements are considered to be associated with chromosomal changes such as inversion and deletion.

They are hot spots for such changes and are useful tools for the study of mutagenesis. In eukaryotes, moving DNA segments have been reported in maize, yeast and Drosophila.

3. Overlapping Genes:

Earlier it was believed that a nucleotide sequence codes only for one protein. Investigations, especially with viruses, have shown beyond doubt that some nucleotide sequences (genes) can code for two or even more proteins.

The genes which code for more than one protein are known as overlapping genes. In the case of overlapping genes, the complete nucleotide sequence codes for one protein and a part of such a nucleotide sequence can code for another protein.

Overlapping genes are found in viruses such as the bacteriophages φX174 and G4 and the tumor-producing virus SV40. In φX174, gene A overlaps gene B. In SV40, the same nucleotide sequence codes for the protein VP3 and also for the carboxyl-terminal end of the protein VP2. In G4, gene A overlaps gene B and gene E overlaps gene D.

The gene of this virus also contains some portions of nucleotide sequences which are common for gene A and gene C.
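The idea that one stretch of DNA can encode two proteins can be illustrated with a small Python sketch. The sequence below is a made-up toy example, not taken from any of the viruses named above, and the function and variable names are hypothetical. Reading the same sequence from two different start positions (that is, in two reading frames) yields two different proteins, the shorter one lying entirely within the longer gene.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence, start):
    """Translate codons from `start` until a stop codon or the end of the sequence."""
    protein = []
    for pos in range(start, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence (an assumption for illustration): an ATG at position 0 opens
# one reading frame, and a second ATG at position 4 opens a different frame.
TOY_SEQUENCE = "ATGAATGGCTTAAGCTGA"

long_protein = translate(TOY_SEQUENCE, 0)   # whole sequence, frame 0
short_protein = translate(TOY_SEQUENCE, 4)  # internal part, shifted frame

print(long_protein, short_protein)
```

Because the second start codon is offset by a number of bases that is not a multiple of three, the internal gene is read in a different frame and its amino acid sequence is unrelated to that of the longer protein.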

There are some DNA sequences, especially in eukaryotes, which are non-functional or defective copies of normal genes. These sequences do not have any function. Such DNA sequences or genes are known as pseudogenes. Pseudogenes have been reported in humans, mouse and Drosophila.

The main features of pseudogenes are given below:

1. Pseudogenes are non-functional or defective copies of some normal genes. These genes are found in large numbers.

2. Being defective, these genes cannot be translated into a functional protein.

3. Because these genes do not code for proteins, they have no known functional significance.

4. The well-known examples of pseudogenes are alpha and beta globin pseudogenes of mouse.

Materials and Methods

Code and build scripts for all analyses, including the downloading and preparation of the data sets, are available in a Git repository. In addition to specific tools referenced below, these analyses relied on the R language (R Core Team 2016), Snakemake (Köster and Rahmann 2012), and many components of the SciPy stack, including Matplotlib (Hunter 2007).

Gene Expression Data

Gene expression levels were obtained from a microarray study of brain regions throughout human development (supplementary tables S1 and S2, Supplementary Material online) (Kang et al. 2011). The total data set consisted of 1,331 samples. Genes were filtered to protein-coding genes known to Gencode 19. Normalized gene expression values were also downloaded for the Johnson et al. (2009) and Lambert et al. (2011) studies.

RNA-seq data for tissues from the GTEx project (The GTEx Consortium 2015) were downloaded from the consortium’s website (last accessed October 23, 2015). Analyses considered samples from 11 tissues: cerebellum, cerebral cortex, heart (left ventricle), kidney (cortex), liver, lung, skeletal muscle, ovary, pancreas, spleen, and testis. For comparison, each sample was classified as belonging to one of the three adult stages from the Kang et al. data set (supplementary table S2, Supplementary Material online), and the genes analyzed were restricted to those present in the microarray used in the Kang et al. study.

Identification of Candidate Regulatory Element Sets

The locations of HACNSs, CACNSs, and MACNSs were retrieved from the supporting online material of the Prabhakar et al. (2006a) study. The set of CNSs was generated according to the reported filtering criteria of the original analysis. Specifically, an element in the eight-way vertebrate phastCons data set (last accessed April 6, 2015) was retained if it had a conservation score ≥ 400 and if it did not overlap with human mRNAs, human spliced ESTs, retroposed genes, or duplicated blocks. Note that the CNS set in the original analysis was generated with additional filtering steps based on non-human constraint and statistical power. We used the set of HARs generated by Lindblad-Toh et al. (2011) and filtered the coordinates to those that did not overlap with exons. All coordinates were converted to hg19 coordinates using UCSC Genome Browser’s LiftOver executable.
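The retention rule for conserved elements (score of at least 400 and no overlap with excluded annotations) can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the tuple layout, the function names, and the toy coordinates are assumptions, and intervals are treated as half-open, as in the BED convention.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) intervals share at least one base.

    Intervals are assumed to be half-open: start inclusive, end exclusive.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_elements(elements, excluded, min_score=400):
    """Keep elements with score >= min_score that overlap no excluded interval."""
    return [
        element
        for element in elements
        if element[3] >= min_score
        and not any(overlaps(element, region) for region in excluded)
    ]


# Toy data (hypothetical): (chrom, start, end, score)
elements = [
    ("chr1", 100, 200, 450),  # kept
    ("chr1", 300, 400, 350),  # dropped: score below 400
    ("chr1", 500, 600, 700),  # dropped: overlaps an excluded region
    ("chr2", 100, 200, 500),  # kept
]
# Excluded annotations (e.g. mRNAs or ESTs), pooled into one list here.
excluded = [("chr1", 550, 650)]

retained = filter_elements(elements, excluded)
print(retained)
```

For genome-scale data a sorted sweep or an interval tree would replace the nested loop; the quadratic scan is used here only to keep the rule readable.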

Human-specific LOF and GOF sets (Schrider and Kern 2015) were downloaded from the popCons data repository (http://www.github/kern-lab/popCons; last accessed April 14, 2016). Coordinates that overlapped with exons were removed. An OCNS set was generated that did not contain any LOF or GOF coordinates. A second set of OCNSs was also generated from the 100-way vertebrate phastCons elements (last accessed June 17, 2016), as phastCons elements from this species set, rather than the 8-way set, were used in the original filtering of LOF and GOF candidates.

Determination of the Nearest Genes to CNSs

To find the nearest gene for each element, the coordinates were intersected with the longest transcripts of protein-coding genes from Gencode 19 using BEDTools (Quinlan and Hall 2010). If an element’s coordinates were found within the start and end coordinates of a transcript, the corresponding gene was counted as a nearest gene. Otherwise, the gene with the minimum distance to an element, based on either bound of its longest transcript, was taken as the nearest gene. These nearest gene assignments were then used to tally the total number of times that each gene was the nearest gene to any element from a given set.
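The nearest-gene rule can be sketched in plain Python. This is only an illustration of the logic described above, not the BEDTools-based procedure itself: the function names, the data layout, and the toy coordinates are hypothetical, and the treatment of partial overlaps (distance zero) is an assumption.

```python
from collections import Counter


def nearest_genes(element, transcripts):
    """Return the gene(s) nearest to one element.

    element: (chrom, start, end)
    transcripts: {gene: (chrom, start, end)} for each gene's longest transcript
    """
    chrom, start, end = element
    containing = []
    best_gene, best_dist = None, None
    for gene, (t_chrom, t_start, t_end) in transcripts.items():
        if t_chrom != chrom:
            continue
        if t_start <= start and end <= t_end:
            # Element lies within the transcript bounds.
            containing.append(gene)
            continue
        # Gap to the nearer transcript bound (0 on partial overlap).
        dist = max(t_start - end, start - t_end, 0)
        if best_dist is None or dist < best_dist:
            best_gene, best_dist = gene, dist
    if containing:
        return containing
    return [best_gene] if best_gene is not None else []


def tally_nearest(elements, transcripts):
    """Count how often each gene is the nearest gene to any element."""
    counts = Counter()
    for element in elements:
        counts.update(nearest_genes(element, transcripts))
    return counts


# Toy data (hypothetical gene names and coordinates).
transcripts = {
    "GENE_A": ("chr1", 100, 500),
    "GENE_B": ("chr1", 1000, 2000),
    "GENE_C": ("chr2", 100, 900),
}
elements = [
    ("chr1", 150, 200),    # inside GENE_A
    ("chr1", 700, 750),    # 200 bp from GENE_A, 250 bp from GENE_B
    ("chr1", 2100, 2200),  # 100 bp past the end of GENE_B
    ("chr2", 950, 980),    # only GENE_C is on chr2
]

counts = tally_nearest(elements, transcripts)
print(counts)
```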

Classification of Genes as DEX

Before classifying genes in the Kang et al. data set as DEX, genes were filtered to those that had an average detection-above-background P value across all samples of 0.01 or lower. After filtering, two different linear models were constructed using the limma package (Smyth 2004): one where the neocortical areas were taken as a single region, resulting in 6 brain regions, and another where only the 11 neocortical areas were considered. With both these model structures, each brain region or area was nested within its respective time period. These models also included covariates for the sample individual, treated as a random effect, and the sample RNA integrity number (RIN). Pairwise contrasts were formed for all region factors within that period. To be classified as DEX among brain regions, a gene was required to have a log2-fold change above 1, tested in limma using the TREAT method (McCarthy and Smyth 2009), and an FDR-adjusted P value at or below 0.01 for at least one contrast. A similar procedure was used to classify genes in the Johnson et al. data set as DEX between regions, but all samples were taken as belonging to a single time period. For the Lambert et al. data set, which consisted of two brain regions from two individuals, region and individual were used as covariates, with the latter treated as a random effect.

As an alternative method, an ANOVA model was constructed that considered period 6 samples and included a factor for either 6 brain regions or 11 neocortical areas, with sample RIN as a covariate. Following the criteria of Kang et al. (2011), a gene was called DEX if it had an FDR-adjusted P value below 0.01, at least one sample with a log2-transformed signal intensity above 6, and an average log2-fold change above 1 between at least two regions.
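The three calling criteria in this alternative method can be written as a single check. The sketch below assumes that the FDR-adjusted P value has already been computed by the ANOVA step and that expression values are log2-transformed signal intensities grouped by region. The function name and the region labels are hypothetical, and the fold-change criterion is read as the largest difference between region means.

```python
from statistics import mean


def is_dex(adjusted_p, region_values, p_cutoff=0.01,
           intensity_cutoff=6.0, lfc_cutoff=1.0):
    """Apply the three criteria to one gene.

    adjusted_p: FDR-adjusted P value from the ANOVA (computed elsewhere)
    region_values: {region: [log2 intensities of the samples in that region]}
    """
    all_values = [v for values in region_values.values() for v in values]
    region_means = [mean(values) for values in region_values.values()]
    # On the log2 scale, a difference of means is a log2-fold change.
    largest_lfc = max(region_means) - min(region_means)
    return (
        adjusted_p < p_cutoff
        and max(all_values) > intensity_cutoff
        and largest_lfc > lfc_cutoff
    )


# Toy genes (hypothetical values).
gene_1 = {"region_1": [8.0, 8.4], "region_2": [6.5, 6.7], "region_3": [7.0, 7.2]}
gene_2 = {"region_1": [5.0, 5.5], "region_2": [3.0, 3.2]}

print(is_dex(0.001, gene_1))  # all three criteria met
print(is_dex(0.001, gene_2))  # no sample exceeds a log2 intensity of 6
print(is_dex(0.05, gene_1))   # adjusted P value too high
```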

To classify genes as DEX between tissues in the GTEx data set, genes were first filtered to include only those that had a minimum count of ten in at least three samples. The expression counts were transformed with the voom package (Law et al. 2014) for modeling with limma. The sequencing batch, individual, and RIN were included as covariates, with the individual taken as a random effect. Pairwise contrasts were made between each tissue.
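The count filter applied before modeling is simple enough to show directly. The following is a minimal sketch in Python rather than the R code used for the analysis; the function names and toy counts are hypothetical, and "a minimum count of ten" is interpreted as a count of ten or more.

```python
def passes_count_filter(sample_counts, min_count=10, min_samples=3):
    """True if at least `min_samples` samples have a count >= `min_count`."""
    samples_at_or_above = sum(1 for count in sample_counts if count >= min_count)
    return samples_at_or_above >= min_samples


def filter_genes(count_table, min_count=10, min_samples=3):
    """Keep only the genes that pass the count filter."""
    return {
        gene: counts
        for gene, counts in count_table.items()
        if passes_count_filter(counts, min_count, min_samples)
    }


# Toy count table (hypothetical): one list of per-sample read counts per gene.
toy_counts = {
    "gene_A": [0, 12, 15, 30, 2],  # three samples at or above ten: kept
    "gene_B": [9, 9, 9, 9, 9],     # never reaches ten: dropped
    "gene_C": [10, 10, 0, 0, 10],  # exactly three samples at ten: kept
}

kept = filter_genes(toy_counts)
print(sorted(kept))
```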

Biology 171

By the end of this section, you will be able to do the following:

  • Describe how changes to gene expression can cause cancer
  • Explain how changes to gene expression at different levels can disrupt the cell cycle
  • Discuss how understanding regulation of gene expression can lead to better drug design

Cancer is not a single disease but includes many different diseases. In cancer cells, mutations modify cell-cycle control and cells don’t stop growing as they normally would. Mutations can also alter the growth rate or the progression of the cell through the cell cycle. One example of a gene modification that alters the growth rate is increased phosphorylation of cyclin B, a protein that controls the progression of a cell through the cell cycle and serves as a cell-cycle checkpoint protein.

For cells to move through each phase of the cell cycle, the cell must pass through checkpoints. This ensures that the cell has properly completed the step and has not encountered any mutation that will alter its function. Many proteins, including cyclin B, control these checkpoints. The phosphorylation of cyclin B, a post-translational event, alters its function. As a result, cells can progress through the cell cycle unimpeded, even if mutations exist in the cell and its growth should be terminated. This post-translational change of cyclin B prevents it from controlling the cell cycle and contributes to the development of cancer.

Cancer: Disease of Altered Gene Expression

Cancer can be described as a disease of altered gene expression. There are many proteins that are turned on or off (gene activation or gene silencing) that dramatically alter the overall activity of the cell. A gene that is not normally expressed in that cell can be switched on and expressed at high levels. This can be the result of gene mutation or changes in gene regulation (epigenetic, transcription, post-transcription, translation, or post-translation).

Changes in epigenetic regulation, transcription, RNA stability, protein translation, and post-translational control can be detected in cancer. While these changes don’t occur simultaneously in one cancer, changes at each of these levels can be detected when observing cancer at different sites in different individuals. Therefore, changes in histone acetylation (an epigenetic modification whose loss leads to gene silencing), activation of transcription factors by phosphorylation, increased RNA stability, increased translational control, and protein modification can all be detected at some point in various cancer cells. Scientists are working to understand the common changes that give rise to certain types of cancer or how a modification might be exploited to destroy a tumor cell.

Tumor Suppressor Genes, Oncogenes, and Cancer

In normal cells, some genes function to prevent excess, inappropriate cell growth. These are tumor-suppressor genes, which are active in normal cells to prevent uncontrolled cell growth. There are many tumor-suppressor genes in cells. The most studied tumor-suppressor gene is p53, which is mutated in over 50 percent of all cancer types. The p53 protein itself functions as a transcription factor. It can bind to sites in the promoters of genes to initiate transcription. Therefore, the mutation of p53 in cancer will dramatically alter the transcriptional activity of its target genes.


Proto-oncogenes are positive cell-cycle regulators. When mutated, proto-oncogenes can become oncogenes and cause cancer. Overexpression of the oncogene can lead to uncontrolled cell growth. This is because oncogenes can alter transcriptional activity, stability, or protein translation of another gene that directly or indirectly controls cell growth. An example of an oncogene involved in cancer is the gene encoding a protein called myc. Myc is a transcription factor that is aberrantly activated in Burkitt’s lymphoma, a cancer of the lymph system. Overexpression of myc transforms normal B cells into cancerous cells that continue to grow uncontrollably. High B-cell numbers can result in tumors that can interfere with normal bodily function. Patients with Burkitt’s lymphoma can develop tumors on their jaw or in their mouth that interfere with the ability to eat.

Cancer and Epigenetic Alterations

Silencing genes through epigenetic mechanisms is also very common in cancer cells. There are characteristic modifications to histone proteins and DNA that are associated with silenced genes. In cancer cells, the DNA in the promoter region of silenced genes is methylated on cytosine residues in CpG islands. Histone proteins that surround that region lack the acetylation modification that is present when the genes are expressed in normal cells. This combination of DNA methylation and histone deacetylation (epigenetic modifications that lead to gene silencing) is commonly found in cancer. When these modifications occur, the gene present in that chromosomal region is silenced. Increasingly, scientists understand how epigenetic changes are altered in cancer. Because these changes are temporary and can be reversed (for example, by blocking the histone deacetylase proteins that remove acetyl groups, or the DNA methyltransferase enzymes that add methyl groups to cytosines in DNA), it is possible to design new drugs and new therapies that take advantage of the reversible nature of these processes. Indeed, many researchers are testing how a silenced gene can be switched back on in a cancer cell to help re-establish normal growth patterns.

Genes involved in the development of many other illnesses, ranging from allergies to inflammation to autism, are thought to be regulated by epigenetic mechanisms. As our knowledge of how genes are controlled deepens, new ways to treat diseases like cancer will emerge.

Cancer and Transcriptional Control

Alterations in cells that give rise to cancer can affect the transcriptional control of gene expression. Mutations that activate transcription factors, such as increased phosphorylation, can increase the binding of a transcription factor to its binding site in a promoter. This could lead to increased transcriptional activation of that gene that results in modified cell growth. Alternatively, a mutation in the DNA of a promoter or enhancer region can increase the binding ability of a transcription factor. This could also lead to the increased transcription and aberrant gene expression that is seen in cancer cells.

Researchers have been investigating how to control the transcriptional activation of gene expression in cancer. Identifying how a transcription factor binds, or which activating pathway can be blocked to turn a gene off, has led to new drugs and new ways to treat cancer. In breast cancer, for example, many proteins are overexpressed. This can lead to increased phosphorylation of key transcription factors that increase transcription. One such example is the overexpression of the epidermal growth-factor receptor (EGFR) in a subset of breast cancers. The EGFR pathway activates many protein kinases that, in turn, activate many transcription factors which control genes involved in cell growth. New drugs that prevent the activation of EGFR have been developed and are used to treat these cancers.

Cancer and Post-transcriptional Control

Changes in the post-transcriptional control of a gene can also result in cancer. Recently, several groups of researchers have shown that specific cancers have altered expression of miRNAs. Because miRNAs bind to the 3′ UTR of RNA molecules to degrade them, overexpression of these miRNAs could be detrimental to normal cellular activity. Too many miRNAs could dramatically decrease the RNA population, leading to a decrease in protein expression. Several studies have demonstrated a change in the miRNA population in specific cancer types. It appears that the subset of miRNAs expressed in breast cancer cells is quite different from the subset expressed in lung cancer cells or even from normal breast cells. This suggests that alterations in miRNA activity can contribute to the growth of breast cancer cells. These types of studies also suggest that if some miRNAs are specifically expressed only in cancer cells, they could be potential drug targets. It would, therefore, be conceivable that new drugs that turn off miRNA expression in cancer could be an effective method to treat cancer.

Cancer and Translational/Post-translational Control

There are many examples of how translational or post-translational modifications of proteins arise in cancer. Modifications are found in cancer cells from the increased translation of a protein to changes in protein phosphorylation to alternative splice variants of a protein. An example of how the expression of an alternative form of a protein can have dramatically different outcomes is seen in colon cancer cells. The c-Flip protein, a protein involved in mediating the cell-death pathway, comes in two forms: long (c-FLIPL) and short (c-FLIPS). Both forms appear to be involved in initiating controlled cell-death mechanisms in normal cells. However, in colon cancer cells, expression of the long form results in increased cell growth instead of cell death. Clearly, the expression of the wrong protein dramatically alters cell function and contributes to the development of cancer.

New Drugs to Combat Cancer: Targeted Therapies

Scientists are using what is known about the regulation of gene expression in disease states, including cancer, to develop new ways to treat and prevent disease development. Many scientists are designing drugs on the basis of the gene expression patterns within individual tumors. This idea, that therapy and medicines can be tailored to an individual, has given rise to the field of personalized medicine. With an increased understanding of gene regulation and gene function, medicines can be designed to specifically target diseased cells without harming healthy cells. Some new medicines, called targeted therapies, exploit the overexpression of a specific protein or the mutation of a gene to treat disease. One such example is the use of anti-EGF receptor medications to treat the subset of breast cancer tumors that have very high levels of the EGF receptor protein. Undoubtedly, more targeted therapies will be developed as scientists learn more about how gene expression changes can cause cancer.

Clinical Trial Coordinator

A clinical trial coordinator is the person managing the proceedings of the clinical trial. This job includes coordinating patient schedules and appointments, maintaining detailed notes, building the database to track patients (especially for long-term follow-up studies), ensuring proper documentation has been acquired and accepted, and working with the nurses and doctors to facilitate the trial and publication of the results. A clinical trial coordinator may have a science background, like a nursing degree, or other certification. People who have worked in science labs or in clinical offices are also qualified to become clinical trial coordinators. These jobs are generally in hospitals; however, some clinics and doctor’s offices also conduct clinical trials and may hire a coordinator.

Section Summary

Cancer can be described as a disease of altered gene expression. Changes at every level of eukaryotic gene expression can be detected in some form of cancer at some point in time. In order to understand how changes to gene expression can cause cancer, it is critical to understand how each stage of gene regulation works in normal cells. By understanding the mechanisms of control in normal, non-diseased cells, it will be easier for scientists to understand what goes wrong in disease states including complex ones like cancer.

Free Response

New drugs are being developed that decrease DNA methylation and prevent the removal of acetyl groups from histone proteins. Explain how these drugs could affect gene expression to help kill tumor cells.

These drugs will keep the histone proteins and the DNA methylation patterns in the open chromosomal configuration so that transcription is feasible. If a gene is silenced, these drugs could reverse the epigenetic configuration to re-express the gene.

How can understanding the gene expression pattern in a cancer cell tell you something about that specific form of cancer?

Understanding which genes are expressed in a cancer cell can help diagnose the specific form of cancer. It can also help identify treatment options for that patient. For example, if a breast cancer tumor expresses the EGFR in high numbers, it might respond to specific anti-EGFR therapy. If that receptor is not expressed, it would not respond to that therapy.


Scientists have observed the following types of recombination in nature:

  • Homologous (general) recombination: As the name implies, this type occurs between DNA molecules of similar sequences. Our cells carry out general recombination during meiosis.
  • Nonhomologous (illegitimate) recombination: Again, the name is self-explanatory. This type occurs between DNA molecules that are not necessarily similar. Often, there will be a degree of similarity between the sequences, but it’s not as obvious as it would be in homologous recombination.
  • Site-specific recombination: This is observed between particular, very short, sequences, usually containing similarities.
  • Mitotic recombination: This doesn’t actually happen during mitosis, but during interphase, which is the resting phase between mitotic divisions. The process is similar to that in meiotic recombination, and has its possible advantages, but it’s usually harmful and can result in tumors. This type of recombination is increased when cells are exposed to radiation.

Prokaryotic cells can undergo recombination through one of these three processes:

  • Conjugation: This is where genes are donated from one organism to another after they have been in contact. At any point, the contact is lost and the genes that were donated to the recipient replace their equivalents in its chromosome. What the offspring ends up having is a mix of traits from different strains of bacteria.
  • Transformation: This is where the organism acquires new genes by taking up naked DNA from its surroundings. The source of the free DNA is another bacterium that has died, and therefore its DNA was released to the environment.
  • Transduction: This is gene transfer that is mediated by viruses. Viruses called bacteriophages attack bacteria and carry the genes from one bacterium to another.

Modern Applications

Sturtevant's discovery led to the golden age of chromosome transmission genetics, with an emphasis on identifying genes through alleles with visible phenotypes, and using them as markers for determining their position on the linkage map. Since then the emphasis in genetics has shifted to understanding the functions of genes. Linkage and gene mapping studies have progressed to being a critical tool in cloning genes and providing more description of their roles in the organism. These approaches include:

  • Using map locations to distinguish different genes with similar sequences, mutant phenotypes, or functions. Examples are the cell division cycle mutants of the yeast Saccharomyces cerevisiae or the uncoordinated mutants of the roundworm C. elegans. In some cases mutants with different phenotypes have been shown to be due to different mutations in the same gene, as is the case with the Drosophila circadian rhythm period mutants termed short, long, and none (per[S], per[L] and per[0]).
  • Using map locations to track down genes to clone their deoxyribonucleic acid (DNA) by chromosome position. Examples are the human cystic fibrosis transmembrane regulator gene mutated in cystic fibrosis, or the polyglutamine repeat gene that is mutated in Huntington's disease. With genome sequences available on databases, mapping mutant phenotypes points to candidate loci for the gene at the chromosome position.

New classes of markers in linkage analysis are based on naturally occurring DNA variation in the genome, and have many advantages. These variations are usually harmless and don't interrupt a gene, so there is no selection against them, meaning they persist over many generations. They are quite numerous and are distributed throughout the genome. Individuals are likely to be heterozygous for many of them and therefore the markers are informative for linkage. If the DNA variant is present heterozygously, can be detected, and shows Mendelian segregation, it is as good a linkage marker as yellow bodies or white eyes. The disadvantage is that analysis to detect the variant is sometimes more laborious and requires the techniques of molecular biology.

The common types of DNA markers and the molecular techniques used to follow their inheritance are:

  • Restriction fragment length polymorphisms (RFLPs) are derived from sequence variation that results in the loss of a restriction enzyme digestion site. The result is a longer fragment of the DNA from that location following digestion with that enzyme. A heterozygous parent will transmit either the allele specifying the long fragment or the allele specifying the short fragment to each child. After size separation of DNA fragments by gel electrophoresis and transfer to a Southern blot, the DNA fragments of interest can be identified with a specific DNA or ribonucleic acid (RNA) probe that also comes from that location. If the long fragment, for example, is linked to a disease gene, the child's DNA can reveal if he or she is likely to develop the disease.
  • Randomly amplified polymorphic DNAs (RAPDs) are derived from sequence variation that results in the loss of the complementary site to a primer necessary to initiate chain amplification by polymerase chain reaction (PCR). If the DNA used as template contains complementary sites for both primers, a PCR product is obtained that can be detected by gel electrophoresis. If either site is absent or changed in the template, no product will be obtained from the reaction.
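Both marker types come down to whether a short recognition sequence is present in the DNA, which can be illustrated with a small Python sketch. The sequences are invented toy examples, the function names are hypothetical, and the model is deliberately simplified (one strand only, exact matches). EcoRI, which recognizes GAATTC and cuts after the G, is used as the example enzyme.

```python
def digest_lengths(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting at every occurrence of `site`."""
    lengths = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        lengths.append(cut - start)
        start = cut
        pos = sequence.find(site, pos + 1)
    lengths.append(len(sequence) - start)
    return lengths


def reverse_complement(seq):
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def pcr_product_length(template, forward_primer, reverse_primer):
    """Return the product length, or None if either primer site is missing."""
    fwd = template.find(forward_primer)
    if fwd == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the strand
    # given here its site appears as the primer's reverse complement.
    rev_site = reverse_complement(reverse_primer)
    rev = template.find(rev_site, fwd + len(forward_primer))
    if rev == -1:
        return None
    return rev + len(rev_site) - fwd


# RFLP: the second allele has lost one EcoRI site (GAATTC changed to GAGTTC).
allele_with_site = "A" * 10 + "GAATTC" + "T" * 14 + "GAATTC" + "C" * 8
allele_without_site = "A" * 10 + "GAATTC" + "T" * 14 + "GAGTTC" + "C" * 8
short_pattern = digest_lengths(allele_with_site)
long_pattern = digest_lengths(allele_without_site)

# RAPD-style PCR: a product appears only if both primer sites are present.
forward_primer = "ACGGATCC"
reverse_primer = reverse_complement("GGCTTAAGCA")
template_ok = "TT" + forward_primer + "A" * 20 + "GGCTTAAGCA" + "TT"
template_mutated = "TT" + forward_primer + "A" * 20 + "GGCTTTAGCA" + "TT"
product_ok = pcr_product_length(template_ok, forward_primer, reverse_primer)
product_mutated = pcr_product_length(template_mutated, forward_primer, reverse_primer)

print(short_pattern, long_pattern, product_ok, product_mutated)
```

In the RFLP example, losing the second site merges two fragments into one longer fragment, which is what appears as a size difference on the gel. In the PCR example, a single base change in one primer site is enough to abolish the product.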

Why use gene expression profiling?

Gene expression profiling enables you to investigate the effects of different conditions on gene expression by altering the environment to which the cell is exposed, and determining which genes are expressed. Alternatively, if you already know a gene is involved in a certain cell behavior, gene expression profiling helps you to determine whether a cell is carrying out this function. For example, certain genes are known to be involved in cell division; if these genes are active in a cell, you can tell that the cell is undergoing division. In the same way, marker genes can indicate whether a cell is differentiated [7,8].

Gene expression profiling is often used in hypothesis generation. If very little is known about when and why a gene will be expressed, expression profiling under different conditions can help design a hypothesis to test in future experiments. For example, if gene A is expressed only when the cell is exposed to other cells, this gene may be involved in intercellular communication. Further experiments could determine whether this is the case [4].

Gene profiling can also investigate the effect of drug-like molecules on cellular response. You could identify the gene markers of drug metabolism, or determine whether cells express genes known to be involved in response to toxic environments when exposed to the drug [4].

Gene profiling can also be used as a diagnostic tool. If cancerous cells express higher levels of certain genes, and these genes code for a protein receptor, this receptor may be involved in the cancer, and targeting it with a drug might treat the disease. Gene expression profiling might then be a key diagnostic tool for people with this cancer [9].
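A common first step in comparing expression profiles between two conditions is to compute a log2-fold change for each gene. The sketch below is a minimal illustration with made-up numbers: the gene names, the pseudocount of 1, and the threshold of 1 (a two-fold change) are all assumptions, and a real analysis would also use replicates and a statistical test.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return {gene: log2((treated + pseudocount) / (control + pseudocount))}."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def changed_genes(fold_changes, threshold=1.0):
    """Split genes into up- and down-regulated sets by absolute log2-fold change."""
    up = sorted(g for g, lfc in fold_changes.items() if lfc >= threshold)
    down = sorted(g for g, lfc in fold_changes.items() if lfc <= -threshold)
    return up, down


# Toy expression levels (hypothetical) in control and drug-treated cells.
control = {"gene_A": 7, "gene_B": 15, "gene_C": 31}
treated = {"gene_A": 31, "gene_B": 15, "gene_C": 7}

fold_changes = log2_fold_changes(control, treated)
up, down = changed_genes(fold_changes)
print(fold_changes, up, down)
```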