1 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064 Japan
2 Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama Institute, Yokohama 230-0045 Japan
3 Department of Biotechnology, University of Tokyo, Tokyo 113-8657 Japan
| ABSTRACT |
|---|
outlier detection; tissue-specific expression; DNA microarray; AIC; expression analysis
| INTRODUCTION |
|---|
Bortoluzzi et al. (6), who analyzed 4,080 putative muscle genes, found that most genes were present in at least one additional tissue, possibly because all cells have a cytoskeleton, most cells exhibit some contractile properties, and most tissues share certain types of cells.
For the analysis of several tissue-specific expression patterns, a few methods, e.g., analysis of variance (ANOVA) and the so-called template-matching method, can be applied by assigning a confidence estimate such as a P value to the markedly contracting genes (21). However, those methods are less useful in the following cases. First, in the two-color competitive hybridization assays on cDNA microarrays customarily used in most of the published studies, the origins of the transcripts on the glass slides are often strongly biased. For example, in a mouse 18,816-clone array made from 23 libraries, 3,110 clones were derived from tongue, whereas only 17 clones were from spleen (19). This bias can result in the mistaken conclusion that a large number of clones are upregulated with a high statistical significance when tongue tissue is used as the target. Hence, this bias may confuse the confidence estimation. Furthermore, since hybridization experiments are typically noisy (8, 13), the combined single-expression matrix of tissues and genes often includes missing data after data processing (15). Second, among the expression levels of particular tissues whose levels are significantly different from those of other tissues, similar intra-tissue levels cannot always be identified. For example, when an observed expression ratio profile for 48 tissues is (2, 1, 1, 0,..., 0) and the template profile is (1, 1, 1, 0,..., 0), then the P value between these profiles is 5.2E-23. The P value increases to 1.49E-07 if the observed profile is (10, 1, 1, 0,..., 0). In general, the greater the number of different levels in particular tissues of a markedly contracting clone, the lower the confidence level assigned to the clone. However, such clones are just as important as clones with similar expression levels in particular tissues.
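The second case can be illustrated with a short calculation. The sketch below assumes that template matching scores a clone by the Pearson correlation between the observed profile and the template, and that the P value is derived from the usual t transformation of r with n - 2 degrees of freedom; the function names are illustrative and are not taken from the original study. Only r and t are computed here, because a smaller t corresponds to a larger P value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t value for testing r against zero with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

N_TISSUES = 48
template = [1, 1, 1] + [0] * (N_TISSUES - 3)
similar_levels = [2, 1, 1] + [0] * (N_TISSUES - 3)   # profile (2, 1, 1, 0, ..., 0)
uneven_levels = [10, 1, 1] + [0] * (N_TISSUES - 3)   # profile (10, 1, 1, 0, ..., 0)

r_similar = pearson_r(template, similar_levels)
r_uneven = pearson_r(template, uneven_levels)
t_similar = t_statistic(r_similar, N_TISSUES)
t_uneven = t_statistic(r_uneven, N_TISSUES)

print(f"similar levels: r = {r_similar:.3f}, t = {t_similar:.2f}")
print(f"uneven levels:  r = {r_uneven:.3f}, t = {t_uneven:.2f}")
```

Although both clones are specific to the same three tissues, the clone with uneven levels receives the lower correlation and therefore the weaker confidence estimate.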
A large expression data matrix of 49 adult and embryonic mouse tissues and 18,816 mouse cDNAs, and a Web interface (called READ, for "RIKEN Expression Array Database") have recently been constructed (5, 19). The READ system facilitates tissue-specific expression searches by inputting an optional value for the expression ratios under the "search by tissue form" option. It also provides a search tool, RINGENE, that can dynamically calculate expression neighbors (or anti-neighbors) to a hand-selected clone across the expression pattern of specified tissues based on an arbitrary threshold (5); thus it involves "thresholding."
Akaike's information criterion (AIC), introduced almost 30 years ago by H. Akaike, is an information criterion for the identification of an optimal model from a class of competing models (2). Kitagawa (17) subsequently used AIC to detect outliers, and Ueda (25) more recently simplified the procedure. The most significant advantages of those methods are that 1) it is possible to reach a relatively objective decision because the procedure does not require the selection of a significance level, and 2) various situations (e.g., single outlier, multiple lowest or highest outliers, two-sided and grouped cases) can be treated equally. We now report the application of the simplified method to the identification of markedly contracting clones from mouse cDNA microarray data. The validity of this novel approach is demonstrated by the distribution of the data detected as outliers and by comparison with another method.
| METHODS |
|---|
Minimum AIC procedure.
In general, the problem of identifying tissue-specific expression patterns in multisource data can be viewed as an outlier identification problem (10). We applied a procedure based on AIC to detect outliers (17, 25). Unlike other conventional approaches (9, 11), this method has several favorable characteristics for dealing with ratio-type microarray data: 1) determination of the number of outliers and the "test" can be performed simultaneously, 2) various situations (e.g., single outlier, several lowest or highest outliers, two-sided and grouped cases) can be treated equally, and 3) objective decision-making is possible because the procedure does not require the selection of a significance level such as 1% or 5% (17).
According to Ueda (25), a statistic U to identify outliers is defined as

$$U = n \log \sigma + \sqrt{2}\, s\, \frac{\log n!}{n}$$

where n denotes the number of non-outlier candidates (n + s ≤ 48), s denotes the number of outlier candidates, and σ denotes the standard deviation of scores assigned to n samples excluding outlier candidates. The statistic U has a clear interpretation in outlier detection. A low value for the first term in the equation does, whereas a high value does not, indicate that the combination of s outlying observations is likely to be bona fide. The second term indicates increased unreliability due to an increased number of parameters (in this case, s). Therefore, a low value for the first term and a high value for the second term would indicate the incorrect prediction of non-outliers as outliers and the correct prediction of true outliers (i.e., high sensitivity and low specificity). The best approximating combination is one that achieves the lowest value for U and is termed the minimum AIC estimate (MAICE). The procedure aimed at obtaining the MAICE of the models is called the minimum AIC procedure (17).
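To give a sense of scale for the second term, the worked values below assume that U has the form n log σ + √2 · s · log(n!)/n with natural logarithms, and take a clone with 48 observations in which one or two observations are treated as outlier candidates. The figures are illustrative arithmetic only.

```latex
\sqrt{2}\cdot 1\cdot\frac{\ln 47!}{47} \approx \sqrt{2}\cdot\frac{136.80}{47} \approx 4.12
\qquad
\sqrt{2}\cdot 2\cdot\frac{\ln 46!}{46} \approx 2\sqrt{2}\cdot\frac{132.95}{46} \approx 8.17
```

Under this assumption, treating one more observation as an outlier lowers U only if its removal reduces the first term, n log σ, by roughly 4 or more.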
Detecting tissue-specific expressions as outliers.
The minimum AIC procedure is executed for each clone. In the procedure, (n + s) observations for each clone are included (n + s = 48 except for missing data). Consider, for example, centrin2 (clone ID 1700007M18), which is known to be specifically upregulated in testis (12, 27). We expect the observation (expression ratio) in testis to be identified as an outlier on the high (upregulated) side since the clone is derived from mouse testis ("17" in the clone ID indicates testis).
The (n + s) observations are normalized by subtracting the mean and dividing by the standard deviation, then sorted in increasing order (-1.86, -1.08, ..., 1.36, 5.66). With the resultant values, MAICE is decided by considering various combinations of outlier candidates starting from both sides of the values. We set the maximum number of the outlier candidates to be half of the (n + s) observations. Accordingly, in practice, the number of combinations considered is X(X + 1)/2, where X = 1 + (n + s)/2 and the value of (n + s)/2 is truncated to an integer. For example, MAICE is decided by considering 25(25 + 1)/2 combinations for a clone with 48 observations; for a clone with 46 or 47 observations, it is 24(24 + 1)/2. A schematic illustration of this procedure is shown in Fig. 1. When this procedure is applied to the centrin2 clone, the MAICE is the case with two outliers: the observation in testis on the upregulated side and the observation in muscle on the downregulated side. The result we obtained for the upregulated side coincides with earlier reports (12, 27). We applied the procedure to each of 14,610 clones and constructed a matrix (called the outlier matrix) for storing the information about the outliers detected on the up- and downregulated sides. The program was developed in the C language, and the computation time for 14,610 clones across 48 tissues was about 10 s on a Pentium III 933 MHz (1 GB memory) running RedHat Linux 7.1.
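The search over outlier candidates described above can be sketched as follows. This is a minimal Python illustration, not the authors' C program. It assumes that U has the form n log σ + √2 · s · log(n!)/n with natural logarithms and with σ computed as the maximum-likelihood standard deviation of the non-outlier candidates; all function names, the handling of zero variance, and the example data are hypothetical.

```python
import math

def u_statistic(inliers, s):
    """Assumed form of U: n*log(sigma) + sqrt(2)*s*log(n!)/n."""
    n = len(inliers)
    mean = sum(inliers) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in inliers) / n)
    if sigma == 0.0:
        return math.inf  # sketch-level choice: skip degenerate candidate sets
    return n * math.log(sigma) + math.sqrt(2) * s * math.lgamma(n + 1) / n

def n_combinations(n_obs):
    """X(X + 1)/2 with X = 1 + (n + s)/2, truncating (n + s)/2 to an integer."""
    x = 1 + n_obs // 2
    return x * (x + 1) // 2

def maice_outliers(values):
    """Return (down, up): indices of outliers on the low and high sides.

    Assumes missing data were removed and the values are not all identical.
    """
    n_obs = len(values)
    mean = sum(values) / n_obs
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n_obs - 1))
    # Normalize, then sort while remembering the original positions.
    z = sorted(((v - mean) / sd, i) for i, v in enumerate(values))
    max_out = n_obs // 2  # at most half of the observations may be outliers
    best_u, best_low, best_high = math.inf, 0, 0
    for low in range(max_out + 1):
        for high in range(max_out + 1 - low):
            inliers = [score for score, _ in z[low:n_obs - high]]
            u = u_statistic(inliers, low + high)
            if u < best_u:
                best_u, best_low, best_high = u, low, high
    down = [i for _, i in z[:best_low]]
    up = [i for _, i in z[n_obs - best_high:]]
    return down, up

# Hypothetical clone: 47 ratios close to zero and one strongly elevated ratio.
ratios = [0.1 * (i % 5 - 2) for i in range(47)] + [8.0]
down, up = maice_outliers(ratios)
print(n_combinations(len(ratios)), down, up)
```

The double loop visits every pair (low, high) with low + high no greater than half the number of observations, which is exactly the X(X + 1)/2 combinations counted in the text. Running it once per clone and recording the returned indices would fill one row of an outlier matrix.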
| RESULTS |
|---|
Table 1 shows the number of observations and the outliers for each of the examined tissues. Of 669,214 observations, 16,389 (2.45%) were identified as outliers on the up- or downregulated side. Interestingly, the number of outliers in testis was the highest; in "E16head" it was the lowest. The trend in the number of outliers across the examined tissues was dissimilar from the number in clones derived from those tissues on the array (correlation coefficient between the numbers across tissues, 0.17). We posit that this was ascribable to the biased collection of cDNA clones to reduce the chance of capturing clones already collected in a cDNA library.
1.0. The closest observation to zero among the outliers was -0.04689 in thymus (preg1) tissue in clone 2510028M24. The outlier was detected on the upregulated side because the majority of the observations in that clone were negative. On the other hand, for values >5.0, 1,682 of 1,711 (98.3%) observations were identified as outliers. An example of a clone containing some of the 29 remaining observations was clone 2310028E01. There were three observations with values >5.0 in that clone (5.05, 6.03, and 8.26 in spleen, stomach, and pancreas, respectively). Of the three high values, only 8.26 in pancreas was detected as an outlier because of the high deviation in the observations across tissues.
| DISCUSSION |
|---|
The method is based on AIC, an information criterion that has been used for modeling in the fields of statistics, engineering, numerical analysis, and recently gene expression analysis (1, 3, 23; and Kadota K, Tominaga D, Akiyama Y, and Takahashi K, unpublished observations). A detailed explanation of the method has been presented elsewhere (17, 25; and Kadota et al., unpublished observations). The most significant advantage of the method is that it is possible to arrive at an objective decision because the method does not require the selection of a significance level such as 1% or 5%.
The current procedure is quite different from previous procedures in which genes were ranked in order of the confidence level defined as a P value. Both the current and conventional strategies have pros and cons. Results obtained by the current procedure do not depend on a confidence level, since the procedure is applied independently to each of the N clones on the array. Therefore, if N is increased, the clones detected with the current strategy continue to be detected. With conventional means, on the other hand, some clones may disappear from the results, and the P value required for confidence will change.
Interpretation regarding the population of outliers in the examined tissues (Table 1) is difficult and strongly dependent on the distribution of the observations (correlation coefficient between the number of outliers and the standard deviations across 48 tissues, 0.82). On the other hand, the correlation coefficient between the number of outliers and the number of cDNA clones printed on the array (18 tissues) was relatively low at 0.17, possibly because of a bias in the collection of cDNA clones from biased libraries. These considerations point to the importance of valid normalization, a topic discussed at the Microarray Gene Expression Data Society Meeting (MGED; http://www.mged.org). The data analyzed here were processed with the conventional global median normalization strategy. This strategy treats two expression ratios such as 2,000/1,000 and 20/10 as equally reliable, although the former must be more robust than the latter. Application of a more sophisticated data processing method such as the intensity-dependent method (29) should yield more reliable results.
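The limitation of global median normalization can be made concrete with a small sketch. It assumes that normalization is applied to log2 expression ratios by subtracting the array-wide median; the function name and the example values are hypothetical and are not taken from the data set analyzed here.

```python
import math
import statistics

def global_median_normalize(log_ratios):
    """Shift all log ratios so that the median of the array becomes zero."""
    median = statistics.median(log_ratios)
    return [value - median for value in log_ratios]

# Two spots with the same ratio but very different signal intensities.
strong_spot = math.log2(2000 / 1000)
weak_spot = math.log2(20 / 10)

# Hypothetical log2 ratios from one array.
array_log_ratios = [0.4, 1.4, -0.6, 2.4, 0.9]
normalized = global_median_normalize(array_log_ratios)

print(strong_spot, weak_spot)
print(normalized)
```

Because only the ratio enters the calculation, the strong and the weak spot receive the same value and the same correction, regardless of how reliable each measurement is. An intensity-dependent method would instead let the correction vary with signal intensity.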
The detection of tissue-specific gene expression in the data set yielded important findings. Among tissues derived from two or more outliers in each clone (Table 2), we observed histological similarities and the proximity of the regions. This is reasonable because most such tissues share certain types of cells. Moreover, the strong tendency toward the upregulated side in the outlying observations is validated by the degree of homogeneity of the target samples (see Table 3). Namely, the whole E17.5 embryos used as a reference are quite heterogeneous compared with the experimental target tissues (19). Therefore, we conclude that overall, we observed target tissue-dominant upregulation (in contrast to the whole E17.5 embryo-specific downregulations).
Once an outlier matrix that corresponds to the original gene-expression matrix is constructed, it will become easy to extract specific expression patterns from arbitrarily selected tissues, lung-specific patterns, for example (see Fig. 2). The clustering technique frequently used in microarray analysis might be able to obtain a cluster, most of which shows a tissue-specific pattern. However, the procedure includes manual retrieval with an arbitrarily determined threshold. Moreover, there is no guarantee that such clusters will be formed.
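A query against such an outlier matrix can be sketched in a few lines. The encoding used here (1 for an upregulated outlier, -1 for a downregulated outlier, 0 otherwise), the function name, the tissue list, and the clone IDs are all hypothetical and serve only to illustrate the idea.

```python
def tissue_specific_clones(outlier_matrix, tissues, targets, direction=1):
    """Return clone IDs flagged in `direction` in every target tissue
    and in that direction in no other tissue."""
    target_idx = {tissues.index(name) for name in targets}
    hits = []
    for clone_id, flags in outlier_matrix.items():
        if all((flag == direction) == (i in target_idx)
               for i, flag in enumerate(flags)):
            hits.append(clone_id)
    return hits

tissues = ["lung", "testis", "muscle", "brain"]
outlier_matrix = {
    "clone_A": [1, 0, 0, 0],    # upregulated outlier in lung only
    "clone_B": [1, 1, 0, 0],    # upregulated in lung and testis
    "clone_C": [0, 1, -1, 0],   # upregulated in testis, downregulated in muscle
    "clone_D": [1, 0, -1, 0],   # upregulated in lung, downregulated in muscle
}

lung_specific = tissue_specific_clones(outlier_matrix, tissues, ["lung"])
lung_and_testis = tissue_specific_clones(outlier_matrix, tissues, ["lung", "testis"])
print(lung_specific, lung_and_testis)
```

No threshold is supplied at query time, because the decision about which observations are outliers was already made clone by clone when the matrix was built.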
The problem of "thresholding" persists in the conventional methods (e.g., template-matching and ANOVA). Although we showed template matching here only as an example (see Fig. 3), we also observed that the results of the ANOVA method were similar to those obtained by template matching (see Supplementary Material at the Physiological Genomics web site). One can set a threshold P value of <1/(total number of clones) for a single comparison with a template. However, the number of possible comparisons (2^48) in the 48-tissue array is very large. If we apply the Bonferroni correction to address the problem of multiple comparisons, then we must make the threshold more stringent according to the number of comparisons; thus the number of tissue-specific clones that can be called with confidence also depends on these considerations.
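The scale of this correction can be shown with simple arithmetic. The sketch below assumes that every up/not-up pattern over 48 tissues counts as one template (2^48 comparisons) and uses the 14,610 clones analyzed here as the total number of clones; the variable names are illustrative.

```python
n_clones = 14_610
n_tissues = 48

# Threshold for a single template comparison: P < 1 / (total number of clones).
single_threshold = 1 / n_clones

# Assumed number of possible templates over 48 tissues.
n_templates = 2 ** n_tissues

# Bonferroni correction: divide the threshold by the number of comparisons.
bonferroni_threshold = single_threshold / n_templates

print(f"single comparison: P < {single_threshold:.3g}")
print(f"templates:         {n_templates}")
print(f"Bonferroni:        P < {bonferroni_threshold:.3g}")
```

Under these assumptions the corrected threshold is many orders of magnitude stricter than the single-comparison threshold, which is why the choice of how many templates to count directly affects how many clones pass.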
We observed a remarkable difference in the detected brain-specific clones (highly expressed only in "brain" and "cerebellum") between the minimum AIC procedure and the template-matching method. The former seemed to be able to detect the clones, whereas the latter also included some extra observations especially in "cortex" and "eyeball." We explain the unsatisfactory results obtained by the template-matching method as follows. One reason may be that template matching based on the correlation coefficient considers all variants to be essentially equivalent, a fact already discussed by Pavlidis and Noble (21). Another reason may be the strong similarities among the four tissues: hierarchical clustering of the 49 tissues showed the cluster consisting of the four tissues (15). Accordingly, we conclude that the minimum AIC procedure is specifically applicable to the extraction of specific expression patterns from arbitrarily selected tissues under the condition of coexisting similar tissues.
The advantages of the method we proposed here are 1) the acquired answer is objective and 2) various situations (e.g., single outlier, multiple lowest or highest outliers, two-sided and grouped cases) can be treated equally. As these characteristics mirror those of the method currently in wide use, our method appears to be readily applicable to various expression data.
| ACKNOWLEDGMENTS |
|---|
This work was supported by a Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
| FOOTNOTES |
|---|
Address for reprint requests and other correspondence: K. Takahashi, Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, 2-41-6 Aomi, Koto-ku, Tokyo 135-0064 Japan (E-mail: takahashi-k{at}aist.go.jp).
10.1152/physiolgenomics.00153.2002.
1 The Supplementary Material for this article is available online at http://physiolgenomics.physiology.org/cgi/content/full/12/3/251/DC1.