1 Departments of Pediatrics
2 Internal Medicine
3 Ophthalmology
4 Electrical and Computer Engineering
5 Biomedical Engineering
6 Center for Bioinformatics and Computational Biology, Roy J. and Lucille A. Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242
| ABSTRACT |
|---|
53,000 3'-expressed sequence tags (3'-ESTs). From these, a nonredundant UniGene set of more than 19,000 sequences was generated. Despite the relatively small contribution of airway epithelia to the total mass of the lung, focused gene discovery in this tissue yielded novel results. The ESTs included several thousand transcripts (6,416) not previously identified from cDNA sequences as expressed in the lung. Among the abundant transcripts were several genes involved in host defense. Most importantly, the set also included 879 3'-ESTs that appear to be novel sequences not previously represented in the National Center for Biotechnology Information UniGene collection. This UniGene set should be useful for studies of pulmonary diseases involving the airway epithelium, including cystic fibrosis, respiratory infections, and asthma. It also provides a reagent for large-scale expression profiling.
Key words: normalization; subtraction; expressed sequence tag; UniGene; cystic fibrosis
| INTRODUCTION |
|---|
The lung is composed of airway and alveolar epithelia, submucosal glands, interstitial cells, vascular tissue, smooth muscle, cartilage, neuronal tissue, and circulating and resident hematopoietic cells. Mercer et al. (15) measured the total surface area of human airways from the trachea to the bronchioles and found it to be only 0.2 m². This is a small proportion of the estimated total human alveolar surface area of 100 m² (15). Furthermore, the estimated number of cells in the airways (1 × 10¹⁰ cells) is a minor fraction of the estimated total number of cells in the alveoli (2 × 10¹² cells) (15). These calculations indicate that the airway epithelium represents a small portion of the total cell mass in the lung. Therefore, cDNA libraries derived from whole lung RNA may greatly underrepresent the transcripts expressed in the airway epithelium. Moreover, mRNAs from normal lungs may not completely represent the transcriptional capabilities of this highly environmentally regulated mucosal surface. For example, antimicrobial peptides such as human β-defensin-2 are normally expressed at very low levels in airway epithelia but are markedly induced under conditions associated with inflammation (8, 21). We evaluated the representation of ESTs from human lung epithelia or lung tissue in the dbEST database, particularly in the human UniGene collection developed at the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/). Consistent with the underrepresentation of airway epithelia, a search of the dbEST database found no sequences from human non-CF or CF airway epithelia or from CF lung. These results indicate that human airway epithelia and lung tissue are likely to be poorly represented in the human UniGene and EST collections. This suggests that microarray-based studies using data sets derived from the human UniGene and EST collections may not fully reflect the transcript diversity of the airway epithelium. Furthermore, this finding suggests that focused gene discovery efforts may rapidly produce comprehensive collections of ESTs from airway epithelia or lung tissue. Such approaches have been used in many cell and tissue types to identify a more comprehensive set of transcripts (9, 14, 18).
In the present study, three tissue sources were utilized to construct cDNA libraries: 1) primary cultures of well-differentiated non-CF epithelia grown under several conditions, 2) primary cultures of well-differentiated CF epithelia grown under several conditions, and 3) whole fetal and adult lung. Each of the three initial nonnormalized libraries was analyzed to generate 1,000 sequences each. These libraries were then each individually normalized, and a further 9,000–13,000 sequences were generated. Finally, a subtracted library constructed from a pool of the two epithelial libraries was generated and sequenced. The resultant comprehensive UniGene set of the cDNAs expressed in airway epithelia and lung provides a novel tool for gene discovery and expression profiling in the airway epithelium and lung and may be of broad interest for studies of CF, asthma, and other lung diseases.
| MATERIALS AND METHODS |
|---|
Primary cultures of human airway epithelia.
Airway epithelial cells were isolated from nasal, tracheal, and bronchial tissues obtained from CF and non-CF donors. Cells were seeded onto collagen-coated, semipermeable membranes (0.6 cm² Millicell-HA; Millipore, Bedford, MA) and grown at the air-liquid interface as previously described (10). Epithelial cells were cultured in a 1:1 mixture of Dulbecco's modified Eagle's medium and Ham's F12 medium that was supplemented with 2% Ultroser G (BioSepra, Villeneuve la Garenne, France) and 100 mU/ml penicillin, 100 µg/ml streptomycin, 10 µg/ml gentamicin, 25 µg/ml colimycin, 75 µg/ml ceftazidime, 25 µg/ml imipenem, 25 µg/ml cilastatin, and 2 µg/ml fluconazole. Basolateral culture medium was changed every 2–4 days. Representative samples from all epithelia preparations were evaluated for morphology using scanning electron microscopy to document the development of a ciliated apical surface. The bioelectric properties of each preparation were also characterized to verify phenotypes. All specimens were genotyped for CFTR mutations. All CF specimens used in this study were homozygous or heterozygous for the ΔF508 mutation, the most common CF-causing mutation. Samples used in the analysis were all well differentiated as determined by scanning electron microscopy and either showed bioelectric properties consistent with normal epithelia or manifested the chloride transport defect characteristic of CF. All samples were cultured for >4 wk prior to use in the studies. Samples were collected with approval from the University of Iowa Institutional Review Board.
Cell culture conditions.
To prepare samples that reflect a broad range of the transcripts expressed by airway epithelia, we exposed cells to a variety of conditions (see Table 1). Because cells from non-CF epithelia were more abundant, they were treated with a greater number of conditions. The sources of the reagents are as follows: Clonetics media (BioWhittaker, Walkersville, MD), Ultroser G (BioSepra), keratinocyte growth factor (KGF; Amgen, Thousand Oaks, CA), heregulin, IL-1β, -6, -8, -9, and -13, secretory leukocyte protease inhibitor (SLPI; R & D Systems, Minneapolis, MN), dexamethasone, triiodothyronine (T3), neutrophil elastase, Escherichia coli lipopolysaccharide (LPS), Pseudomonas aeruginosa LPS, Klebsiella pneumoniae LPS (Sigma, St. Louis, MO), and adenovirus (ATCC, Manassas, VA). Pseudomonas elastase and pyocyanin were a generous gift of Dr. Charles Cox. Haemophilus influenzae strain 12 was a gift from Dr. Dwight Look. P. aeruginosa PAO1 was provided by Dr. Pete Greenberg.
cDNA Libraries
Directionally cloned start (nonnormalized), normalized, and serially subtracted cDNA libraries were constructed in a plasmid vector (pT7T3-Pac) from DNase-treated poly(A)+ mRNA isolated from a number of fetal and adult lung tissues and primary cultures of human CF and non-CF airway epithelia, as previously described (3, 22). A complete list of the culture conditions used is provided in Table 1. Briefly, first-strand cDNA was primed with a poly-dT oligonucleotide (TGTTACCATTCTGATGTTGGAGCGGCCGC-N[6–10]-T[18]) that contained a NotI restriction site for directional cloning and a library tag used to identify the tissue of origin (7). Double-stranded cDNA was ligated to EcoRI adaptors (5'-AATTGGCACGAGG-3', 3'-GCCGTGCTCC-5'), digested with NotI, and directionally cloned into pT7T3-Pac.
Sequencing
Dideoxy terminator sequencing was performed in 96-well format by cycle sequencing using dRhodamine dye terminator chemistry (Applied Biosystems, Foster City, CA). After thermal cycling, sequencing reactions were ethanol precipitated, resuspended in loading buffer containing formamide, denatured, and analyzed on an ABI377 or an ABI3700 capillary sequencer. A detailed description of the sequencing protocol is available online at the Univ. of Iowa Rat EST Project web page (http://ratest.eng.uiowa.edu/localdocs/sequencing_protocol.html).
After data capture on the ABI sequencers, the gels were tracked (if necessary) and transferred to a centralized server. From there, the sequences were processed as outlined below and placed into a file-system hierarchy. Nucleotide sequences and per-base quality values were extracted from the ABI-generated chromatograph files (SCF files) using the phred base-calling program (6). All of the sequences generated as a part of this research were submitted to dbEST and incorporated into the human UniGene data set.
Feature Identification and Quality Assessment
Expected EST features and overall sequence quality were assessed using ESTprep (19) and RepeatMasker (A. Smit and P. Green, unpublished data), as described in Scheetz and Casavant (17). Briefly, the features detected include vector and cloning site sequence, polyadenylation tail and signal, and potential contaminating sequences (bacterial, mitochondrial, vector). In addition, the library tag (as described under cDNA Libraries) is identified, allowing discrimination of tissue source from a pooled cDNA library. The quality assessment protocol requires that several additional criteria be satisfied: an overall sequence quality (in phred q scores) greater than 25, more than 50% of the sequence (in nt) over q20, and a quality-trimmed EST insert length of more than 100 bp.
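The three acceptance criteria above can be written as a simple filter. The following is a minimal Python sketch, not the ESTprep implementation: the function name and input format are hypothetical, "overall sequence quality" is assumed to mean the mean per-base phred score, and "over q20" is assumed to mean a score of 20 or higher.

```python
from statistics import mean


def passes_quality(phred_scores, trimmed_insert_len,
                   min_mean_q=25, min_frac_q20=0.50, min_insert_len=100):
    """Return True if a read satisfies all three quality criteria.

    phred_scores: per-base phred quality values for the read.
    trimmed_insert_len: length (bp) of the quality-trimmed EST insert.
    """
    if not phred_scores:
        return False
    # Assumption: overall quality is the mean per-base phred score.
    mean_q = mean(phred_scores)
    # Assumption: a base counts as "over q20" when its score is >= 20.
    frac_q20 = sum(q >= 20 for q in phred_scores) / len(phred_scores)
    return (mean_q > min_mean_q
            and frac_q20 > min_frac_q20
            and trimmed_insert_len > min_insert_len)
```

A read is rejected as soon as any one criterion fails, so a high-quality read with a 100-bp insert is still excluded because the insert must be longer than 100 bp.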
Clustering
Local clustering of the ESTs was performed using the UIcluster program (v3.0.5) (25). Default parameters were used, except that matching was allowed on both the forward and reverse-complement strands. This allowed rapid and robust novelty assessment of the ESTs generated in this project, an important component of the subtractive cDNA sequencing process. The human UniGene set (Ref. 20; ftp://ftp.ncbi.nih.gov/repository/UniGene) was also used to further evaluate the novelty of the ESTs.
BLAST Analysis
BLAST-based sequence similarity was used to compare a representative element from each cluster against the nonredundant nucleotide database, dbEST, and the Affymetrix consensus sequences used to design the oligos. These sequences were obtained from the NCBI and Affymetrix web sites. A significance criterion of at least 100 bp and 90% identity was used in the BLAST analysis.
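The significance criterion above (at least 100 bp aligned at 90% identity or better) amounts to a two-condition filter on each hit. The sketch below is illustrative only; the `BlastHit` record and its field names are hypothetical and do not correspond to any particular BLAST output parser.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Hypothetical minimal record for one BLAST alignment."""
    query_id: str
    subject_id: str
    percent_identity: float   # 0-100
    alignment_length: int     # aligned length in bp


def is_significant(hit, min_length=100, min_identity=90.0):
    """Apply the criterion: at least 100 bp aligned at >= 90% identity."""
    return (hit.alignment_length >= min_length
            and hit.percent_identity >= min_identity)


def significant_hits(hits):
    """Keep only the hits that meet the significance criterion."""
    return [h for h in hits if is_significant(h)]
```

Both thresholds are treated as inclusive here, which is an assumption about how "at least" applies to the identity cutoff.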
Assessment of Genomic Localization
The 3' and 5' sequences for each clone were aligned to the human genome (June 2003 release) using the BLAT alignment tool. A comparison of localization and orientation was made between the EST alignments of each clone and known genes and mRNAs in the Univ. of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu). Assessment of novelty vs. in silico gene predictions was performed using the GenScan track on the UCSC site. The 3' ESTs with a poly-A tail should align in the opposite orientation with respect to the known transcript. Sequences that did not overlap a known mRNA sequence and showed evidence of untemplated polyadenylation were considered novel in this analysis. The quality of the BLAT hits was assessed by their alignment scores. The distances reported are the minimum distance between the mRNA and EST alignment locations. In cases where the clone sequences are partially or completely contained within an mRNA, the clone was placed in the "overlap" category.
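The minimum-distance and "overlap" rules described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, intervals are assumed to be closed (start, end) coordinates on the same chromosome and strand, and the distance bins are assumed to follow the groupings reported in the Results (overlap, under 1 kb, 1 to 5 kb, 5 to 10 kb, beyond 10 kb).

```python
def min_distance(est, mrna):
    """Gap in bp between two closed (start, end) intervals; 0 if they overlap."""
    est_start, est_end = est
    mrna_start, mrna_end = mrna
    if est_start <= mrna_end and mrna_start <= est_end:
        return 0
    # Exactly one of these differences is positive when the intervals are disjoint.
    return max(est_start - mrna_end, mrna_start - est_end)


def categorize(est, mrnas):
    """Assign an EST alignment to a localization bin relative to known mRNAs."""
    if not mrnas:
        return ">10 kb"
    # Use the nearest known mRNA, as the reported distance is the minimum.
    d = min(min_distance(est, m) for m in mrnas)
    if d == 0:
        return "overlap"
    if d < 1_000:
        return "<1 kb"
    if d < 5_000:
        return "1-5 kb"
    if d < 10_000:
        return "5-10 kb"
    return ">10 kb"
```

Partial and complete containment both yield a distance of zero, so both fall into the "overlap" category, matching the rule stated in the text.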
ORF Analysis
The 3' and 5' sequences were assembled and translated in all three frames. These assemblies were then compared by BLAST against the nonredundant amino acid database from NCBI. Assemblies with hits having an E-value less than 0.01 were manually inspected to assess the identity of the hit.
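Three-frame translation of an assembled sequence can be illustrated with the standard genetic code. The sketch below is a minimal, self-contained example with hypothetical function names; it is not the software used in the study, and it translates the forward strand only, as the text describes three frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]
```

For example, `three_frame_translation("ATGGCCTAA")` returns `["MA*", "WP", "GL"]`, where `*` marks a stop codon. Each resulting peptide could then be used as a protein query, with hits below the E-value cutoff of 0.01 set aside for manual inspection.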
| RESULTS |
|---|
The ESTs generated are expected to consist primarily of untranslated sequence (UTR). In a comparison to annotated human mRNAs, ESTs from 2,228 of the 19,059 clusters aligned to one of the mRNAs, and 1,088 of these extended into the CDS. Thus we expect that slightly less than half (1,088 of 2,228, or 49%) of the ESTs with a polyadenylation tail and signal will contain coding sequence. This same analysis estimated an average UTR length of 772 bp.
Transcript Profiles of Nonnormalized Libraries
Although normalized and subtracted libraries are excellent for efficiently identifying a comprehensive set of mRNA transcripts, they cannot be used to infer an expression profile. Therefore, 1,000 clones were sequenced from each of the three initial nonnormalized libraries. The 20 most frequently sequenced transcripts from each of the three nonnormalized libraries are presented in Tables 2–4. The commonly sequenced epithelial transcripts included several gene products previously recognized for their roles in mucosal host defense. In non-CF epithelia (Table 2), these included the polymeric immunoglobulin receptor, IL-8, and β2-microglobulin. Epithelial cytoskeletal and adhesion-related gene products sequenced included β1-integrin and annexin A1. In addition, many genes with "housekeeping functions" were identified, including ribosomal subunit RNAs, chaperones, β-actin, and cellular enzymes. The CF epithelial library (Table 3) shared many transcripts with the non-CF library. The 3' ESTs frequently sequenced from CF epithelia included keratin 19, CD74, cathepsin D, MEN1, and properdin B-factor. In addition, two abundant transcripts of epithelial origin were sequenced more frequently in the CF library: the human homolog of the mouse palate, lung, and nasal epithelium clone "PLUNC" (also termed LUNX or SPLUNC1) and the von Ebner minor salivary gland protein (also termed LPLUNC1) (2). The most commonly sequenced genes from the lung library (Table 4) included surfactant protein C, surfactant protein A1, and many "housekeeping" genes.
A graphical representation of the relative discovery from the tissue sources utilized in building the cDNA libraries is presented in Fig. 2. Each of the circles in the Venn diagram presents the number of clusters containing at least one sequence from that tissue. It is important to note that for 660 clusters the tissue source could not be determined, and these were therefore not included in Fig. 2. The places where the circles overlap denote clusters with sequences derived from two or more of the tissues. From Fig. 2, several points can be made. First, of the 19,059 clusters identified, 1,932 contain messages common to all three tissues. Second, each tissue uniquely contributes a few thousand clusters. Thus each of the starting tissues contributed to the overall gene discovery process. Within the UniGene build, 2,014 of the clusters were lung specific (i.e., composed only of ESTs derived from lung tissues) and 1,190 were epithelia specific. We observed a substantially higher number of unique sequences contributed from the non-CF epithelial library (5,686 3'-ESTs) than from the CF epithelial library (3,488 3'-ESTs). This result was expected, as the non-CF epithelia samples were treated with many more conditions designed to induce gene expression (Table 1). Confirming our prediction that the plasticity of the airway epithelial transcription profile would be highly regulated by environmental and nonenvironmental factors, both epithelial libraries contributed substantially to the collection of 19,059 clusters.
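The region counts of a three-way Venn diagram such as Fig. 2 follow directly from the set of tissues observed in each cluster. The sketch below uses toy data and hypothetical names to show the bookkeeping; it does not reproduce the study's actual cluster assignments.

```python
from collections import Counter


def venn_counts(cluster_tissues):
    """Count clusters per exact tissue combination (one Venn region each).

    cluster_tissues maps a cluster ID to the set of tissues that contributed
    at least one EST. Clusters with an empty set (tissue source undetermined)
    are skipped, mirroring their exclusion from the figure.
    """
    counts = Counter()
    for tissues in cluster_tissues.values():
        if tissues:
            counts[frozenset(tissues)] += 1
    return counts


def circle_total(counts, tissue):
    """Clusters with at least one sequence from the given tissue (one circle)."""
    return sum(n for region, n in counts.items() if tissue in region)


# Toy example with made-up cluster IDs.
toy = {
    "c1": {"non-CF", "CF", "lung"},
    "c2": {"lung"},
    "c3": {"lung"},
    "c4": {"non-CF", "CF"},
    "c5": set(),  # tissue source undetermined
}
toy_counts = venn_counts(toy)
```

In this toy set, the lung circle totals three clusters, two of which are lung specific, and the cluster with no identifiable tissue is left out of every region.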
600 bp) during the labeling reaction, labeled targets derived from these unrepresented (or poorly represented) transcripts are unlikely to hybridize with the Affymetrix GeneChip probe sets. In other words, the limited length of labeled targets implies that probes not specifically designed for the prevalent lung transcripts are unlikely to hybridize. Therefore, it would be difficult to use transcript profiling with current commercial arrays to investigate their importance in the development, progression, or treatment of CF or other lung diseases. From the complete set of 3' EST sequences submitted to GenBank, 3,168 were not selected for inclusion within the current UniGene build. Although these 3,168 ESTs were not represented within the current UniGene build, they were available for incorporation into UniGene and were included in the local clustering. These sequences defined a set of 879 clusters comprising only sequences not included in the NCBI UniGene set.
A representative clone was selected from each of the 879 clusters not included in the UniGene set. These clones were resequenced from the 3' and 5' ends to further assess their novelty. A total of 491 of these clones were further validated as novel based upon the lack of a significant BLAST hit (other than themselves) to a database of all human ESTs in dbEST. Those with a weak BLAST hit (less than 90% identity over 100 nt) are probably homologs of known genes. Those ESTs lacking any BLAST hit likely represent either novel transcripts or previously unobserved 3' ends. A final sequence composition analysis was applied to these 491 clones, identifying 199 clones in which the 3' EST contained a polyadenylation tail and signal (canonical or alternative). As mentioned above, those sequences lacking a polyadenylation tail and/or signal likely represent internal sequence for previously discovered but incompletely characterized or sequenced genes. Both the 3' and 5' sequences from these 199 clones were aligned to the human genome using BLAT (11). Of these 199 sequences, 134 were determined to be the result of untemplated polyadenylation based upon alignment to the human genome (i.e., the homology to the genomic sequence does not extend into the polyadenylation tail). The genomic location and context were then evaluated using the UCSC genome browser (http://genome.ucsc.edu/; Ref. 12). Specifically, the sequences were evaluated to determine whether they were associated with novel forms of known transcripts or represented potentially novel transcripts.
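The test for untemplated polyadenylation described above asks whether the genomic sequence continues to match once the EST enters its poly(A) tail. One simple way to approximate this is to inspect the genomic bases immediately downstream of where the alignment of the EST body ends. The following Python sketch is an illustration under stated assumptions, not the procedure used in the study: the function name, the 10-nt window, and the 70% adenine cutoff are all hypothetical.

```python
def is_untemplated_polya(genomic_downstream, window=10, max_a_fraction=0.7):
    """Guess whether a poly(A) tail was added post-transcriptionally.

    genomic_downstream: genomic bases (transcript orientation) that follow
    the last aligned base of the EST body. If this stretch is A-rich, the
    tail could have been templated by the genome (for example, internal
    priming), so the tail is not counted as untemplated.
    """
    window_seq = genomic_downstream[:window].upper()
    if not window_seq:
        raise ValueError("no downstream genomic sequence supplied")
    # Hypothetical cutoff: an A fraction above max_a_fraction means templated.
    a_fraction = window_seq.count("A") / len(window_seq)
    return a_fraction <= max_a_fraction
```

Under these assumptions, a downstream stretch such as `GCTTAGCCATGG` supports an untemplated tail, whereas `AAAAAAAAAAGC` does not.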
Only transcripts in the proper orientation were considered in this analysis, the results of which are presented in Fig. 4. Of these re-arrayed clones, 61 overlapped at least partially with previously reported transcripts, and another 5 fell within 1 kb of known genes. These clones most likely represent novel 3' ends of previously identified transcripts. Another 21 ESTs mapped between 1 and 10 kb from known transcripts. This group of sequences is more difficult to classify definitively. Because they lie farther from known transcripts, the probability that they are derived from a different (novel) transcript increases. It is likely that some of the 16 clones localizing within 1–5 kb of a known gene represent products of different transcriptional units. However, the majority may represent alternative 3' splicing and/or polyadenylation events. The five clones that localize even farther (5–10 kb) from neighboring transcripts are more likely to represent novel transcripts than additional 3' sequence for known transcripts. Of special note are the 47 transcripts that did not localize within 10 kb of any reported human mRNA sequence. It is quite likely that these clones represent previously unidentified transcripts. They may be low-abundance transcripts or may be specific to lung epithelia. A list of these transcripts is available in the online data supplement (Supplemental Table 6, available at the Physiological Genomics web site).1
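As a consistency check, the localization groups reported above can be tallied against the 134 clones with untemplated polyadenylation. The short sketch below assumes the five groups are mutually exclusive; the dictionary keys are labels chosen here for illustration.

```python
# Clone counts per localization group, taken from the text above.
localization_counts = {
    "overlap": 61,   # overlap a previously reported transcript
    "<1 kb": 5,      # within 1 kb of a known gene
    "1-5 kb": 16,
    "5-10 kb": 5,
    ">10 kb": 47,    # not within 10 kb of any reported mRNA
}

total = sum(localization_counts.values())
between_1_and_10_kb = localization_counts["1-5 kb"] + localization_counts["5-10 kb"]

print(total)                # 134
print(between_1_and_10_kb)  # 21
```

The groups sum to 134, and the two intermediate bins account for the 21 ESTs that mapped between 1 and 10 kb from known transcripts.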
| DISCUSSION |
|---|
From these 879 clusters, 80 were eventually determined to have a high probability of representing novel transcripts. None of the novel ESTs identified were included in previous human genome annotations, indicating that they were missed. These findings demonstrate how focused cell- or tissue-specific gene discovery may reveal novel alternative transcripts of known genes and identify many new genes. They also call into question current estimates of 25,000–30,000 genes in the human genome (13, 16, 27). The functions of these transcripts are unknown at present.
Of significant interest from the sequencing of nonnormalized libraries were the contrasts among the transcripts derived from epithelia and those from the lung (Tables 2–4). The abundant transcripts from the lung libraries included many sequences recognized for their "housekeeping" functions. In contrast, the more frequently sequenced epithelial ESTs included antimicrobial proteins, cytokines, immunoregulatory genes, and genes involved in cellular metabolism. This is consistent with the role of the airway epithelium as an important interface between the host and the environment. The observation that some sequences were more frequently identified from the CF libraries than the non-CF (i.e., keratin 19, LPLUNC1, and PLUNC) may merely reflect the greater number of treatments applied to the non-CF epithelia (see Table 1) and should be further investigated in additional studies.
These results confirm and emphasize the potential yields from focused gene discovery efforts in specific underrepresented cells and tissues and the value of in vitro manipulation of the cells prior to isolating input RNA for library construction and gene discovery. Our findings are consistent with previous findings in other organisms and tissues [human (9), mouse (14), rat (18)].
In summary, we generated a UniGene collection comprising more than 19,000 transcripts expressed in human airway epithelia and lung, including many novel transcripts and hundreds of sequences not represented on commercial arrays. This gene collection may have broad applications for gene discovery and will be useful for large-scale expression analysis for investigators interested in lung diseases.
| ACKNOWLEDGMENTS |
|---|
The complete set of 19,059 nonredundant lung and epithelial expressed clones is available from Open Biosystems (http://www.openbiosystems.org).
| FOOTNOTES |
|---|
Address for reprint requests and other correspondence: P. B. McCray, Jr., Dept. of Pediatrics, 240-G EMRB, Univ. of Iowa College of Medicine, Iowa City, IA 52242 (E-mail: paul-mccray{at}uiowa.edu).
10.1152/physiolgenomics.00188.2003.
1 The Supplementary Material for this article (Supplementary Table 6, a list of transcripts) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00188.2003/DC1.