Toolbox
Department of Physiological Sciences, Oklahoma State University, Stillwater, Oklahoma
| ABSTRACT |
|---|
10 min to validate results of spot quality after initial evaluation and correct about 0.3% of falsely assigned qualities of 10,000 spots. We validated 1,641 of 2,110 differentially expressed genes identified by SAM analysis in about 1/2 h by comparing each gene with its respective spot image. Furthermore, we found that 6 of 48 genes in one cluster from the k-means clustering method showed trends inconsistent with their spot images. RealSpot is efficient for validating microarray results and thus helps experimentalists improve the reliability of the whole microarray experiment.
spot quality; data normalization; data filtering
| INTRODUCTION |
|---|
Raw data from the hybridization images are the fundamental information for further data analysis, including data normalization (8), statistical inference (10), cluster analysis (4), principal component analysis (PCA) (9), pathway construction (3, 11), and data interpretation. The results of a microarray experiment depend on the quality of hybridization images and the respective raw data sets. However, in current data analysis methods and software packages, results from each analysis step are rarely validated against spot images. A typical microarray data analysis procedure is as follows: extracting quantification data from images, filtering data with chosen standards (e.g., background-to-noise ratio >2), normalizing log2 ratios, identifying differentially expressed genes (e.g., 2-fold change of expression or statistical inference with P = 0.05), clustering genes (e.g., hierarchical or k-means clustering), and selecting target genes for further study. Microarray data analysis is challenged by the diversity of methods, standards, and software packages. An example is spot quality evaluation, discussed below.
The image quality varies from spot to spot because of printing, sample quality, and hybridization. Ideally, all spots of poor quality should be filtered out before further data analysis. There are two main approaches for filtering poor-quality spots: manually flagging spots and automatically filtering genes (1, 6, 12, 13). Manually flagging spots is time consuming because of the large number of spots on a microarray slide. Automatic methods based on quantification information are fast and efficient. These methods calculate composite scores from spot size, intensity, signal-to-background ratio (SBR), and/or circularity coefficient (area-to-perimeter ratio). Generally, the composite scores represent several aspects of spot images. These methods may fail at spots with irregular morphology, e.g., donuts, black holes ("ghost images"), and spots containing tiny dust specks. Although they are suitable for large-scale data analysis, different methods or parameters frequently generate nontrivially different results from an identical raw data set. In such a case, validating the results and choosing an appropriate method are not straightforward. Associating spot images with the derived data may identify irregular spots. Some software packages, such as Acuity and the Longhorn/Stanford Microarray Database (LMD/SMD), can locate each spot image on a single slide (7). Acuity locates a spot image on a scanned slide image using GenePix results, whereas LMD can show a spot image as well as other retrieved data in a data query report. In both cases, only a single spot image from one slide can be retrieved to validate the derived data.
Here, we report a software package, RealSpot, for validating results from dual-color DNA microarray hybridizations. RealSpot evaluates spot quality and validates it with spot images. RealSpot links the images and raw data of each spot side by side and organizes them in a spreadsheet table. Through standard table operations such as sorting, searching, and editing, a user can directly compare the spot images, raw data, and processed data in an efficient and reliable way. Furthermore, RealSpot provides tools for one-way ANOVA, gene ontology, and web page export, which are also helpful in choosing target genes for further study and thus in improving the reliability of the whole microarray experiment. It is freely available for academic use and can be obtained at http://www.lungmicroarray.org or via an electronic mail request (liulin{at}okstate.edu).
| IMPLEMENTATION |
|---|
65,531) directly generated by a scanner, e.g., ScanArray Express. During data import, each 16-bit TIFF image is split into spot images using the spot geometry (x, y, and diameter) from the respective raw data file. The spot images are then linearly transformed to 8 bit for visualization. The linear transformation is based on the whole image or on an individual subgrid. By default, the lowest 5% of image data are converted to 0, the highest 5% to 255, and the rest to values between 0 and 255, calculated from
$$F_{8} = 255 \times \frac{F_{16} - P_{5}}{P_{95} - P_{5}}$$
where F8 is the transformed 8-bit intensity of each pixel (limited to 0–255 for image visualization), F16 is the original 16-bit fluorescence intensity of each pixel, and P5 and P95 are the 16-bit intensities at the 5th and 95th percentiles from a slide image or a subgrid, respectively. During data import, a user is also asked to import sample information, including the sample names and the respective dye channels (green or red). In a RealSpot spreadsheet, each row represents a gene, and the columns contain gene probe information (gene ID and name), array print layout information such as block and subarray, spot images, and raw data such as fluorescence intensity and background (Fig. 2). Additional columns are added to the table, e.g., quality index (QI), 16-bit spot signal, and SBR calculated directly from the 16-bit TIFF images (see below).
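The default percentile stretch described above can be illustrated with a minimal Python sketch. The function names `percentile` and `to_8bit` are hypothetical, and the linear interpolation used to locate the 5th and 95th percentiles is an assumption; the text does not state how RealSpot computes them.

```python
def percentile(sorted_values, q):
    """Return the q-th percentile (0-100) of an ascending list.

    Linear interpolation between neighbors is an assumption of this sketch.
    """
    if not sorted_values:
        raise ValueError("no pixel values")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def to_8bit(pixels16, low_q=5.0, high_q=95.0):
    """Linearly map 16-bit intensities to 0-255, clipping both tails."""
    ordered = sorted(pixels16)
    p_low = percentile(ordered, low_q)
    p_high = percentile(ordered, high_q)
    if p_high <= p_low:
        # Flat image: no contrast to stretch.
        return [0 for _ in pixels16]
    out = []
    for value in pixels16:
        scaled = (value - p_low) * 255.0 / (p_high - p_low)
        # Values below P5 clip to 0, values above P95 clip to 255.
        out.append(int(round(min(255.0, max(0.0, scaled)))))
    return out
```

Passing the pixels of a single subgrid instead of the whole slide would give the per-subgrid variant of the transformation.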
4 are shown as columns: the shorter the height, the weaker the intensity. A QI of 5 is shown as a prohibiting cross.
n), to the selected spot C, marked as Ci (i = 1 to n). The summations run over index i from i = 1 to n. The IS value ranges from 1 (identical spots) to 0 (entirely different spots). RealSpot then sorts the spots by IS, so that spots with similar images are arranged together. After sorting, the selected spot moves to the first row of the table, followed by the other similar spots. Because these spots have similar morphology, a user may check them manually and correct their QIs accordingly.
Data organization.
When a DNA microarray experiment includes more than one hybridization or slide, RealSpot organizes the evaluated slides as an experiment for calculating the QI summary and spot signal summary, performing one-way ANOVA, associating gene ontology information with each gene, and retrieving data to verify results. RealSpot uses the sample information of each slide and aligns slides by sample names. A user can import multiple slide files into a metatable at the same time by selecting the file names. A column summarizing the QI of each spot across multiple slides is added to the metatable. This column is calculated as follows: the contaminated or bad spots are first removed; the mean and SD of the QIs are calculated from spots with a QI of 0, 1, 2, 3, or 4; the mean QI is rounded to an integer and shown as an icon, as shown in Table 1; and the SD is shown as an error bar (Fig. 2). A column of spot signal summary from multiple slides is calculated as follows. The 16-bit signal intensities of each channel are scaled to an arbitrary range (1–1,000 in RealSpot). The data scaling serves as a global normalization, so that the gene expression data from different slides are comparable. The data scaling is based on the assumption that the lowest 5% of spots are not expressed (i.e., negative spots) and are converted to 1, and that the highest 5% of spots (those above the 95th percentile) are highly expressed (i.e., saturated spots, typically from housekeeping genes) and are converted to 1,000. The rest of the spot signals are linearly scaled to 1–1,000. The mean and SD of the scaled spot signals for each sample group are calculated and visualized in the column as bar plots. A one-way ANOVA is performed on the globally normalized signals if three or more slides are used in an experiment. Before ANOVA, the normalized signals are logarithm transformed, which can improve the homogeneity of standard deviations among sample groups (a prerequisite of ANOVA). A P value for each gene is obtained from the ANOVA. RealSpot highlights the bar plots of significant genes (P value < significance level, default = 0.05) with thick lines. A highlighted gene shows a significant difference in expression between at least two samples. RealSpot also accepts gene ontology association files [tab-delimited text files with columns of Gene Ontology (GO) ID, gene symbol, gene ID, GO term, and GO part]. If a GO file is read, an ontology column is added to display the functions of known genes (Fig. 2). In the metatable, some columns are the same for all the slides, such as gene ID and name, printing layout information, and summary QI. Other columns, such as QI and spot images, are specific to each slide and are organized as a subtable within the respective row of each gene. A sorting column shows the information used for sorting, e.g., the P value column in Fig. 2.
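The QI summary and the global signal scaling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not RealSpot's code: the function names are hypothetical, the mean QI is rounded half up, and the 5th and 95th percentiles are located by linear interpolation.

```python
from statistics import mean, stdev

BAD_QI = 5  # QI of contaminated or bad spots, excluded from the summary


def summarize_qi(qis):
    """Return (rounded mean QI, SD) for one gene across slides."""
    usable = [q for q in qis if q != BAD_QI]
    if not usable:
        return None, None
    sd = stdev(usable) if len(usable) > 1 else 0.0
    # Round half up (assumption); QIs are non-negative.
    return int(mean(usable) + 0.5), sd


def _percentile(sorted_values, q):
    """q-th percentile with linear interpolation (assumption)."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def scale_signals(signals, low=1.0, high=1000.0):
    """Scale one channel of one slide to the range low..high."""
    if not signals:
        return []
    ordered = sorted(signals)
    p5 = _percentile(ordered, 5.0)
    p95 = _percentile(ordered, 95.0)
    if p95 <= p5:
        return [low for _ in signals]
    scaled = []
    for value in signals:
        if value <= p5:
            scaled.append(low)    # treated as not expressed
        elif value >= p95:
            scaled.append(high)   # treated as saturated
        else:
            scaled.append(low + (value - p5) * (high - low) / (p95 - p5))
    return scaled
```

Because every slide is mapped to the same range, the scaled signals of a gene can be compared directly across slides before the logarithm transformation.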
Data verification.
The data verification module directly compares DNA microarray data with spot images, providing an additional step for quality control and, more importantly, a method to validate data analysis results. After data quality evaluation, the filtered data set can be used for further data analysis, such as cluster analysis or identification of differentially expressed genes. From the downstream analysis, a researcher may obtain a list of genes and associated data, such as genes with similar normalized ratios, differential expression, or expression patterns across a series of conditions. Before further analysis or functional studies, a user may compare the final genes with the respective spot images by clicking the "search genelist" button in Fig. 2 to search and group these genes and, optionally, the associated data. RealSpot shows all the found genes at the top of the metatable. It is relatively easy for a human being to pick out a few distinct spots among spots showing a similar pattern. Consequently, searching by gene list in RealSpot is efficient for identifying inconsistencies between data analysis results and spot images. The inconsistent genes may be eliminated from further analysis, or an alternative method may be chosen to analyze the same data set to see whether consistent results are achieved.
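The grouping step of a gene list search can be pictured with a short Python sketch. The function name, the `gene_id` key, and the choice to order the found genes as they appear in the query list are assumptions made for illustration.

```python
def group_by_gene_list(rows, gene_list, key="gene_id"):
    """Move rows whose gene ID is in gene_list to the top of the table.

    rows is a list of dicts, one per gene. Found rows follow the order of
    gene_list (an assumption); the remaining rows keep their table order.
    """
    rank = {gene: i for i, gene in enumerate(gene_list)}
    found = [row for row in rows if row[key] in rank]
    found.sort(key=lambda row: rank[row[key]])
    rest = [row for row in rows if row[key] not in rank]
    return found + rest
```

With the found genes stacked in adjacent rows, their spot images can be scanned by eye for a spot that breaks the shared pattern.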
Data export.
Spot images and raw data as well as QIs can be exported by selecting genes or items of interest. An export module guides a user through exporting the respective information. For image export, the spot images of selected genes can be exported as a Windows Enhanced Metafile (WMF), a bitmap file (BMP), or web pages (hypertext markup language files; HTML). The WMF file is the default format. It contains the instructions for drawing the text and spot images and has a very high resolution, which makes it best for printing high-quality images. These file formats can be read by most image and word processing software packages. RealSpot also exports the summary QI and the gene expression ratio of two samples. RealSpot can export hundreds or thousands of genes as an HTML file, which a user may post directly on the internet for data communication among lab members or with external DNA microarray communities. Before data export, RealSpot provides tools for data normalization and scatter plotting. A user may select two samples and filter spots by QI or directly select spots from the metatable. The scatter plot visualizes the global distribution of the signal intensity of the selected genes in the two samples. Global and intensity-dependent normalization methods are provided. LOWESS normalization (15) based on print tip is the default. A user may select an appropriate normalization method based on the scatter plot. RealSpot exports the gene ID and name, QI, and normalized expression ratio of the selected genes and samples as a tab-delimited text file.
| RESULTS |
|---|
Performance.
It took about 10 s for RealSpot to import the raw data and two images of a slide with 30,000 features. The table file created by RealSpot was about 20 MB, or one-third of the total size of the imported raw data and image files (60–70 MB). RealSpot evaluates one slide, based on intensity and SBR, in about 500 ms. It takes about 5–10 min for a user to semi-automatically correct spot QIs. The current version of RealSpot can manage an experiment with hundreds of slides. Loading 77 slide files (30,000 spots each) from a whole experiment took RealSpot only about 10 s because of the compact size and binary format of the table files. The slowest operation in RealSpot was data export, because of the intensity-dependent LOWESS normalization. RealSpot spent about 10 s to normalize a table, or 3 min for a whole experiment of 20 slides.
Quality evaluation.
During the above performance test, about 0.3% of the spots (100 of 30,000) of each slide were semi-automatically corrected after the initial quality evaluation, based on visual and subjective observation of spot images. By sorting, these spots were grouped at the end of the table. Most of these spots were extremely large and had been falsely identified as bad spots. It took 5–10 min to correct these spots for each slide. This is a substantial time saving compared with GenePix, where we normally spent several hours manually evaluating the location and quality of individual spots across a whole slide.
To assess the data quality after the quality evaluation, we compared the scatter plots after data filtering using different ranges of QIs. As shown in Fig. 4, most of the bad spots and empty spots were at the low-intensity end. After these spots were filtered (QI = 0 or 5), more consistent results were obtained.
0). It is noteworthy that the false-positive spots from RealSpot had a QI of 1, which indicates weak or ambiguous spots. A user may choose to filter such weak spots in a particular experiment, in which case these spots might not be false positives. For positive control probes, we chose 86 highly abundant genes such as ribosomal proteins and GAPDH. The false-negative spot count was significantly lower in RealSpot than in GenePix (P < 0.05).
1/2 h in this way. About 22% of the differentially expressed genes were eliminated because of inconsistency between the log ratio and the spot images or because of very low gene expression levels (weak spots). Using k-means cluster analysis (4), we identified 10 clusters from 6 tissue hybridizations (lung, heart, kidney, liver, spleen, and brain). One of the clusters contained spleen-specific genes (Fig. 5, E and F). The spot images were generally consistent with the expression patterns from cluster analysis, except for the second gene, which was apparently a false-positive result and was thus eliminated from further study. We found that 6 of 48 genes in a whole cluster showed inconsistent trends between gene expression levels (normalized signal in arbitrary units) and the respective spot images.
| DISCUSSION |
|---|
The spot images are directly imported from scanned microarray slides and linearly transformed to 0–255, representing the original image information. This linear transformation trims the extreme signals, i.e., the lowest and the highest 5% of image signals, since the former is usually background noise and the latter saturated signal. Trimming these signals does not lose much image information but makes 90% of the pixels visible without adjusting brightness and contrast. This method is similar to Affymetrix data normalization, which linearly transforms fluorescence intensity to an arbitrary range, e.g., 0–10,000. The advantage is that different slides are comparable after transformation. We noticed that some slides show differences among printing tips. We therefore added an option in RealSpot for a separate transformation of each subarray or block from an identical printing tip to compensate for such differences, similar to the printing tip-based LOWESS normalization (8). We generated quantitative data (signal intensity and SBR) in RealSpot from the original 16-bit spot images, located by geometric data from GenePix (x, y, and diameter), for quality evaluation and verification of the original raw data from image analysis software packages. Signal intensity and SBR were affected by spot alignment algorithms but not by spot segmentation algorithms (14). In RealSpot, the image of each spot was split from a whole slide at the respective spot center (x, y). The size of each spot image was identical, i.e., the average distance between two adjacent spots. The signal intensity was the average intensity of the whole transformed image of a spot, and the SBR was estimated from the center one-fourth area (for signal) and the four corners (for background). The problem of assigning a QI to large spots was associated with the estimation of SBR in RealSpot. The problematic spots usually had a diameter larger than the average distance between two adjacent spots, so the four corners of these spot images were largely occupied by the spots themselves. Consequently, the estimated SBR was lower than the true SBR, and a large spot could be falsely assigned the QI of a contaminated spot. This problem would be even worse in slide areas where large spots were clustered together. Because the problem results from the estimation of local background, a potential solution would be to estimate SBR using a global background across the whole slide. We have not yet tested global background, since local background worked well for identifying low-quality spots in 99.7% of the spots in our results.
However, in RealSpot, these falsely assigned QIs can be corrected manually through quick table operations: sorting, selecting, and editing. The quick manual correction of a table in RealSpot is more efficient than the time-consuming manual spot correction of a whole slide in GenePix.
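The local SBR estimate and its failure on oversized spots can be reproduced with a small Python sketch. The function name `estimate_sbr` and the size of the corner patches (`corner_fraction`) are assumptions; the text states only that the center one-fourth area gives the signal and the four corners give the background.

```python
def estimate_sbr(image, corner_fraction=0.25):
    """Estimate SBR of a square spot image given as a 2D list of pixels.

    Signal: mean of the central window with half the side length, which
    covers one-fourth of the image area. Background: mean of four square
    corner patches whose side is corner_fraction of the image side
    (the patch size is an assumption of this sketch).
    """
    n = len(image)
    if n < 4 or any(len(row) != n for row in image):
        raise ValueError("expected a square image of at least 4x4 pixels")
    start, stop = n // 4, n // 4 + n // 2
    center = [image[r][c] for r in range(start, stop)
              for c in range(start, stop)]
    k = min(max(1, int(n * corner_fraction)), n // 2)
    edge = list(range(k)) + list(range(n - k, n))
    corners = [image[r][c] for r in edge for c in edge]
    signal = sum(center) / len(center)
    background = sum(corners) / len(corners)
    if background == 0:
        return float("inf")
    return signal / background
```

In this sketch, a spot confined to the central window of an 8 x 8 image keeps its corners at background level, whereas a spot that fills the whole image puts spot signal into the corners and drives the estimated SBR toward 1, which mirrors the misassignment of large spots described above.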
The data imported from raw data and transformed images are useful for manual evaluation. For instance, the contaminated spots move together when the table from a slide is sorted by flags, signal-to-noise ratio (SNR), or the "B535 mean" column (from the GPR file). These contaminated spots may then be manually selected and re-marked simultaneously. Sorting the table also helps to correct errors, in particular for weak, noisy, or irregular spots. On the other hand, some spots may be marked as bad spots based on SBR, although they are good spots by visual assessment. Most of the reported automatic methods were based on raw data, using criteria such as SNR, SBR, circularity coefficient, or composite score (1, 6, 12). We found that these criteria were sometimes inconsistent with spot images. For instance, many weak spots have high SBRs, because both their signals and backgrounds are close to 0. Some spots with good morphology and intensity may be contaminated with a few tiny dust specks. These spots are good spots if manually marked but may be identified as bad spots because their SNRs are low. These spots can be corrected by sorting the QI column, followed by the spot diameter column. In RealSpot, the QIs of all the spots can be corrected in 5–10 min for 30,000 spots, and thus the mistakes that are unavoidable for automatic tools are minimized.
The one-way ANOVA implemented in RealSpot simplifies multiple statistical factors (e.g., dye, slide, and sample treatment) to one factor (sample treatment). The original 16-bit gene expression data from each slide are globally scaled to an identical range (1–1,000) and logarithm transformed for the calculation of P values. The P values from this simplified one-way ANOVA can be used for fast monitoring of differentially expressed genes. For instance, a gene with a P value lower than the significance cutoff (P value = 0.05) may be considered significantly differentially expressed and selected for further investigation. The significance cutoff may be adjusted, e.g., by Bonferroni adjustment, which sets the P value cutoff to 0.05/n (n = total number of genes). The adjusted cutoff may decrease type I errors (false positives).
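As a worked illustration of the simplified test, the following Python sketch computes the one-way ANOVA F statistic for one gene from log-transformed scaled signals, together with the Bonferroni-adjusted cutoff. The function names are hypothetical, and the natural logarithm is an assumption (the text does not give the base). Converting F to a P value requires the F distribution, which is omitted here; where SciPy is available, `scipy.stats.f_oneway` returns both values.

```python
import math


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for one gene.

    groups holds one list of scaled signals (1-1,000) per sample group;
    values are log transformed before the test.
    """
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    logged = [[math.log(v) for v in g] for g in groups]
    n_total = sum(len(g) for g in logged)
    df_between = len(logged) - 1
    df_within = n_total - len(logged)
    if df_within <= 0:
        raise ValueError("need more observations than groups")
    grand_mean = sum(sum(g) for g in logged) / n_total
    means = [sum(g) / len(g) for g in logged]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(logged, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(logged, means) for v in g)
    if ss_within == 0:
        return float("inf"), df_between, df_within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_cutoff(alpha, n_genes):
    """Per-gene significance cutoff after Bonferroni adjustment."""
    return alpha / n_genes
```

For an array of 10,000 genes and a significance level of 0.05, the adjusted per-gene cutoff is 0.05/10,000, which is far stricter than the unadjusted default.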
The GO information of known genes is helpful for understanding gene functions. In a typical ontological annotation file, a gene is assigned multiple GO IDs, reflecting the molecular function, biological process, and cellular location. RealSpot summarizes all the ontological annotations of each gene and thus provides a user with comprehensive information on a gene. On the basis of the P value from one-way ANOVA and ontological annotation, a user can quickly find interesting genes and their potential roles in a particular experiment.
RealSpot is efficient. It evaluates and assigns a QI to each spot immediately after data import. A user can also identify incorrectly evaluated spots by sorting similar spots together using spot images and raw data. Another feature is the icons used for QIs. RealSpot uses standard scores of 0–4 and represents them as bar plot-like icons to visualize spot quality (Table 1). It is easier for a user to compare a spot with the respective icon than with a number. The icons also help visualize the trends of gene expression level when several slides or samples are grouped together.
RealSpot is flexible and easy to use. First, the images from many DNA microarray scanners and the raw data from commonly used image analysis software packages can be imported directly. The import module guides a user through importing data step by step with detailed help information. RealSpot skips the description section of some raw data files, e.g., GenePix GPR files, and imports user-selected columns of raw data. Because raw data and spot images are organized in a table, the user interface of RealSpot is similar to Microsoft Excel worksheets. In such an operating environment, a user can focus on data evaluation without learning new instructions for operating the software. The export module exports the table as a text file for import into a database or data analysis software. Images can be exported in bitmap or metafile formats, which are widely supported on Windows operating systems. Organizing data by samples and slides clearly displays microarray experimental designs, such as a loop or reference design, and helps a user interpret data from biological samples.
RealSpot is designed for quality evaluation of raw data and spot images, not for data analysis, although some simple data processing tools such as data normalization are included. Further improvements of RealSpot may include image transformation, a direct link to a database, and data analysis tools. In the current version of RealSpot, a 16-bit image is linearly transformed to an 8-bit image for display. A square-root transformation may be used to strengthen weak spots. A direct link between RealSpot and a database would help a user manage microarray data. More powerful sort and search tools may be implemented in the metatable. Another limitation of the current RealSpot version is that it works only with images from dual-color hybridization; this issue should be addressed in a future version. In summary, the software package RealSpot is efficient for validating microarray results and thus helpful for improving the reliability of the whole microarray experiment. The improvement results from the association of microarray data with the respective spot images.
| FOOTNOTES |
|---|
Address for reprint requests and other correspondence: L. Liu, Dept. of Physiological Sciences, Oklahoma State Univ., 264 McElroy Hall, Stillwater, OK 74078 (E-mail: liulin{at}okstate.edu).
10.1152/physiolgenomics.00236.2004.