Testing The Readability Of Web Page Colors

Authors

Chris Ridpath, Jutta Treviranus, Patrice L. (Tamar) Weiss

Affiliations

Introduction

Unlike the typical printed page, web documents are generally designed to include color. The careful use of color can make the document easier to read (Hoadley, 1990), easier to navigate and more appealing to the reader. The proper use of color can also increase the user's performance in computer-based decision support systems (Legge, Parish, Luebker, Andres and Wurm, 1990). In contrast, a poor selection of text-background color combinations can significantly detract from a document's readability. But what colors should be used for the text and background on web pages? This paper describes an algorithm that can be used to machine test the readability of colors used for web pages. We also describe a study undertaken over the Internet to test the effectiveness of the algorithm.

This work was undertaken at the University of Toronto's Adaptive Technology Resource Center as part of the development of a software program called A-Prompt. A-Prompt is a software toolkit that may be integrated into HTML editor programs and will prompt the user to write more accessible web pages (http://aprompt.snow.utoronto.ca/). A-Prompt is based on the Web Accessibility Initiative (WAI) Page Authoring Guidelines, which specify what must be done to make a Hypertext Markup Language (HTML) page more accessible (http://www.w3.org/TR/WAI-WEBCONTENT/). The WAI guideline 2, checkpoint 2.2 specifies:

Ensure that foreground and background color combinations provide sufficient contrast when viewed by someone having color deficits or when viewed on a black and white screen.

The guideline does not go so far as to specify which color combinations provide good visibility nor does it specify how color combinations may be tested for readability. To implement a test for compliance with WAI guideline 2, checkpoint 2.2 into A-Prompt, we needed to create an algorithm that would test the readability of text and background color combinations used in web pages. We also needed to test the validity of the algorithm and to determine whether it would work under real world conditions.

There has been much research on color perception and how the eye reacts to color. From this research several general principles have emerged that can be applied to colored text perception on web pages. The general principles presented by Arditi and Knoblauch (1996) (cf. http://www.lighthouse.org/color_contrast.htm) suggest that for greater visibility, a large difference in brightness between the background and text is necessary. They also suggest that a large difference in color hue between the background and text is necessary for good readability. Regrettably, how the values for color brightness and color hue in web pages are calculated is not specified.

Algorithm

To create an algorithm that can distinguish the readability of web page colors, we chose to test colors based on brightness difference and hue. Web page colors are described in an HTML document by their Red, Green and Blue (RGB) components.

Difference In Brightness - To measure the perceived brightness of a color we used an algorithm that performs a linear transformation from RGB values to Luminance, Intensity and Chrominance (YIQ) values. YIQ is a color system used by National Television System Committee (NTSC) broadcasters to optimize the transmission of color pictures for television and for downward compatibility with black and white television. It exploits a characteristic of the human eye that makes it more sensitive to changes in brightness than to changes in hue or saturation. The Y parameter in this system carries the brightness information and can be calculated from the RGB value as a weighted linear combination. We used only the Y value in our calculations as we were interested in measuring brightness. To calculate the Y value, or brightness, from an RGB value, the following calculation was used:

Y=((R X 299) + (G X 587) + (B X 114)) / 1000

This calculation will produce a value in the range of 0 - 255 with 0 being least brightness and 255 being greatest brightness. It can be seen that the blue color component is given a much lower weighting than the green component for perceived brightness. Red is weighted approximately in the middle of the three components for perceived brightness.

After calculating the perceived brightness of the background and the perceived brightness of the text we took the difference between the two values to determine the brightness difference.
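The brightness calculation above can be sketched in a few lines of Python. This is an illustration only, not the A-Prompt source: the function names `brightness` and `brightness_difference` are assumptions, as is the choice to round half up to a whole number and to report the difference as an absolute value.

```python
def brightness(r, g, b):
    """Perceived brightness (the Y of YIQ) of an RGB color, in the range 0-255."""
    for component in (r, g, b):
        if not 0 <= component <= 255:
            raise ValueError("RGB components must be in the range 0-255")
    # Integer arithmetic; adding 500 before dividing rounds half up
    # to the nearest whole number (an assumption of this sketch).
    return (r * 299 + g * 587 + b * 114 + 500) // 1000


def brightness_difference(text_rgb, background_rgb):
    """Absolute difference in perceived brightness between text and background."""
    return abs(brightness(*text_rgb) - brightness(*background_rgb))


# Pale yellow text on a dark blue background.
print(brightness_difference((255, 255, 204), (0, 0, 51)))
```

Keeping the arithmetic in integers avoids floating point rounding surprises for values that fall exactly halfway between two whole numbers.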

Difference In Hue - To measure the difference in hue we used the following algorithm:

Maximum ((TextR - BackgroundR), (BackgroundR - TextR)) + Maximum ((TextG - BackgroundG), (BackgroundG - TextG)) + Maximum ((TextB - BackgroundB), (BackgroundB - TextB))

This calculation will produce a value in the range of 0 - 765 with 0 being no difference in color and 765 being the greatest difference in color. For our calculations, we rounded the values to whole numbers.
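Since the larger of (a - b) and (b - a) is simply the absolute value of (a - b), the hue calculation above reduces to a sum of absolute per-channel differences. A minimal Python sketch, in which the function name `color_difference` is an assumption:

```python
def color_difference(text_rgb, background_rgb):
    """Sum of the absolute red, green and blue differences, in the range 0-765."""
    return sum(abs(t - b) for t, b in zip(text_rgb, background_rgb))


# Identical colors differ by 0; black against white differs by 765.
print(color_difference((0, 0, 0), (0, 0, 0)))
print(color_difference((0, 0, 0), (255, 255, 255)))
```

Because the RGB components are whole numbers, the result is already a whole number and no rounding step is needed in this sketch.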

Example Calculation

To illustrate the calculations used in the algorithm, the RGB values shown in the table below were used.

             Red    Green    Blue
Background     0        0      51
Text         255      255     204

Background brightness=(( 0 X 299) + (0 X 587) + (51 X 114)) / 1000=6

Text brightness=((255 X 299) + (255 X 587) + (204 X 114)) / 1000=249

Difference in hue=((red text 255) - (red background 0)) + ((green text 255) - (green background 0)) + ((blue text 204) - (blue background 51))=255 + 255 + 153=663

In the above example, the contrast (difference in brightness)=(249 - 6)=243 and the difference in color (hue)=663.

Our supposition was that a large difference in brightness and a large difference in hue should indicate a high degree of readability. A small difference in brightness and/or a small difference in hue should indicate a low degree of readability. The large difference in brightness and large difference in hue used in the example image below (Figure 1) should indicate that it has a high degree of readability.

Figure 1: Sample text block used in the study. According to the algorithm this sample has a high readability value.

Use Of Color On The Web

The colors used in web pages for text and background are specified according to their red, green and blue (RGB) components. RGB color specification is a system commonly used in video displays. The HTML specification states that each component (red, green or blue) of the RGB color value has a range of 0 to 255. A complete RGB color value is then described by three values that each range from 0 to 255, giving a potential of 16,777,216 (256 X 256 X 256) colors that may be used. With a potential of 16,777,216 colors for the background and 16,777,216 colors for the text, this makes a total of 281,474,976,710,656 color combinations that may be used on a web page.

The number of colors typically used in web page design is a reduced set of only 216 "browser safe" colors. The reduced set is used because of the display limitations of some, usually older, personal computer systems. Almost all personal computers can display the reduced set of 216 colors, while many systems cannot display all 16,777,216 colors. This is true for Microsoft Windows based computers and Apple Macintosh based computers, which together make up the vast majority of personal computers in use today. When web page authors use any of these 216 "browser safe" colors they can be assured that the color selected will be supported on the video display of their audience. If a color is selected outside of this set, the color may not appear as intended on the viewer's display. In some instances, the color may appear 'dithered'. That is, two or more of the supported colors are displayed together to try to create a color that is close to the unsupported color. Alternatively, a color may be selected from the supported set and used instead. In either case, the web page author's selection of color is different from what is displayed to the reader. There are several software programs for facilitating the selection of colors used in web pages. All of the programs surveyed promote the use of these 216 colors. Each of the RGB components in the reduced set of 216 colors has a value of 0, 51, 102, 153, 204 or 255. Most browsers, including Microsoft Internet Explorer and Netscape Navigator, allow the user to override the colors present in the web page and display user-defined colors instead. However, this is an uncommon practice and most users will view their web pages using the colors selected by the page author.
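The structure of the 216-color set can be made concrete with a short Python sketch. The six component levels come from the text above; the names `WEB_SAFE_LEVELS`, `web_safe_palette`, `is_web_safe` and `nearest_web_safe` are assumptions, and the nearest-level mapping is only one plausible substitution rule, not a description of how any particular display system remaps colors.

```python
from itertools import product

# The six levels each RGB component may take in the "browser safe" set.
WEB_SAFE_LEVELS = (0, 51, 102, 153, 204, 255)


def web_safe_palette():
    """All 6 x 6 x 6 = 216 "browser safe" colors as (r, g, b) tuples."""
    return list(product(WEB_SAFE_LEVELS, repeat=3))


def is_web_safe(rgb):
    """True if every component is one of the six supported levels."""
    return all(component in WEB_SAFE_LEVELS for component in rgb)


def nearest_web_safe(rgb):
    """Map each component to the closest supported level (hypothetical rule)."""
    # The levels are multiples of 51, so adding 25 and dividing by 51
    # selects the nearest multiple.
    return tuple(51 * ((component + 25) // 51) for component in rgb)


print(len(web_safe_palette()))
print(is_web_safe((0, 51, 255)), is_web_safe((0, 50, 255)))
```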

Validating The Algorithm

To gather real world data to investigate the validity of the algorithm we developed a study that could be conducted over the Internet. Initially, we created several different samples of text and background color combinations and measured their brightness difference and color difference. These samples were viewed by our development group and rated for readability. In general, the brightness difference and color difference values did give an indication of how readable the text was to us. However, we were unsure whether our algorithm could accurately predict readability under the varying conditions presented by other web page viewers. Web pages may be viewed using a wide variety of computer systems, a wide variety of displays and a large variance of lighting conditions. We also needed to test the algorithm with users who have visual deficits.

Subjects - A convenience sample of 149 volunteer subjects (65 males and 84 females) was recruited electronically from various Internet mailing lists and listservs over a period of four weeks. Almost 90% of the respondents were between the ages of 21 and 65 years with 61 subjects aged 21 to 40 years old and 72 subjects aged 41 to 65 years old. Twelve subjects were 20 years old or less and 4 subjects were older than 65. Slightly fewer than 20% of the participants had good visual acuity without any need for correction. About 44% were myopic (short-sighted), 23% were hyperopic (far-sighted), 30% had astigmatism, and 20% were presbyopic (needed correction for reading due to aging). (The percentages add up to more than 100 because many subjects had multiple acuity problems.) For the subjects whose data were analyzed in the present study, these problems of visual acuity were corrected with regular glasses or contact lenses. They reported no impairment of visual field or color vision.

A further 50 subjects with various visual impairments also responded to the study. These data will be presented in a subsequent paper.

Experimental Procedure - The study presented a series of sample text blocks that used different background / text color combinations. Users were asked to rate the text images for readability using a sliding scale (visual analog scale). At the end of the test, users were asked several questions about their visual ability (e.g., acuity, field, color vision), personal characteristics (e.g., age, sex), and their computer system (e.g., monitor quality).

A total of 42 samples were used in the study including 35 unique and randomly selected samples and 7 duplicate samples (which were used to test the reliability of the user's ratings). The sample text blocks were displayed using images stored in the GIF file format. Although it would have been possible to set the background color and text color on the HTML document and then use embedded text, we decided not to use this method because some users may override the test colors with their own default color preferences. Also, we would not have been able to control the text size displayed on the user's browser.

Thus, using images of text enabled us to ensure that all subjects saw the same colors on the text block and that the text was always the same size.

The text used in the sample was drawn from a source that we thought would have neutral content. It was an English translation of the Greek text on the Rosetta stone of Egypt. The text was displayed using a common serif typeface with a large typeface heading and ample room for background color. We felt that the sample would be generally similar to what is encountered on typical web pages.

Subjects were asked to rate the readability of the sample text blocks on a visual analog scale of 'impossible to read' to 'effortless to read' as shown in Figure 2. The scale anchor labels were chosen to encourage subjects to base their judgement of color readability primarily on ease or difficulty. We were not looking for color combinations that were visually appealing or had other personal significance. To make a selection, the user clicked on the scale using the computer's pointing device. A red indicator marked the scale where the user had selected. This scale was intended to function in a similar manner to traditional paper and pencil visual analog scales.

Figure 2: Sliding visual analog scale used to mark readability

The location of the red marker was adjusted by the subject by pointing to any location on the sliding scale and clicking the mouse button. The subject could move the red marker repeatedly by clicking elsewhere on the scale until he or she was satisfied with its location.

For users who were unable to use a pointing device on their computer system, and could not access this sliding scale, we created an alternate selection method that did not require a pointing device. Our alternate selection method allowed the user to enter a number between 0 and 100 to indicate the readability of the samples. Users of older browsers that do not support the JavaScript required for the sliding scale were also asked to use the numeric input version of the study, as it did not require a JavaScript-enabled browser. These data were analyzed separately from the main data pool, and are not presented in this paper.

For our sample text images we used only the 216 "browser safe" colors. As indicated above, if we had used colors outside this set we would have had to factor color dithering or color mapping into our results. With 216 background colors and 216 text colors a total of 46,656 color combinations is possible. Using our algorithm as described above, we grouped each of the 46,656 color combinations into 7 categories based upon their projected readability ranging from "good" readability (category 1) to "bad" readability (category 7). For our study, we knew that we could present subjects with only a small subset of the possible 46,656 color combinations since asking subjects to grade all 46,656 color combinations would not be possible.
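Scoring every one of the 46,656 combinations is a small computation. The Python sketch below enumerates all text and background pairs from the 216-color set and computes both measures for each. It is an illustration under stated assumptions: the function names are hypothetical, values are rounded half up, and the sketch stops at the two scores because the thresholds used to form the seven categories are not given in the text.

```python
from itertools import product

LEVELS = (0, 51, 102, 153, 204, 255)
PALETTE = list(product(LEVELS, repeat=3))  # the 216 "browser safe" colors


def brightness(rgb):
    """Perceived brightness (Y of YIQ), rounded half up to a whole number."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114 + 500) // 1000


def color_difference(text, background):
    """Sum of the absolute per-channel differences (0-765)."""
    return sum(abs(t - b) for t, b in zip(text, background))


def score_all_combinations():
    """Yield (text, background, brightness_diff, color_diff) for every pair."""
    for text, background in product(PALETTE, repeat=2):
        yield (
            text,
            background,
            abs(brightness(text) - brightness(background)),
            color_difference(text, background),
        )


scores = list(score_all_combinations())
print(len(scores))  # 216 x 216 pairs, including text equal to background
```

Any categorization scheme would then be a function of the last two fields of each tuple.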

Results

The 42 ratings of a typical subject are presented in the scatter plot shown in Figure 3. Each response from this subject to one of the randomly presented text-background images is plotted with respect to the category number for that image. Several interesting observations are immediately apparent. Most importantly, the data show a clear positive association between increases in the subject's rating and increments in category number. Nevertheless, within each category number there is a moderate amount of variability. For example, the responses for category 6, which according to the algorithm should be rated similarly, vary from about 30 to about 60. For other categories, e.g., number 3, the variability is quite small.

Figure 3: Ratings of 42 text-background color combinations from a typical subject.

Figure 3 also illustrates the consistent response of the subject to identical images. As indicated above, one color combination from each category was randomly selected to be duplicated in order to permit analysis of the test-retest reliability of the users' ratings. The close juxtaposition of the responses for the repeated images, shown by the filled-in diamond shaped markers, points to the repeatability of the test. The non-parametric Spearman's correlation test demonstrated a significant test-retest reliability of the subjects' median responses (r=.997, p<.001).

We noted some difficulty in combining the responses from all 152 subjects since, for a large number of the tested images, the responses were highly skewed. This is illustrated by the boxplot shown in Figure 4 (below), a graph commonly used to illustrate the distribution of individual responses. The upper and lower edges of this particular box represent the upper and lower quartile values respectively for all responses to one of the images in category 1. The central notched line depicts the median value whereas the whiskers extending above and below the box show the extent of the rest of the data (in this case, 1.5 times the inter-quartile range). Asterisks denote outliers beyond even this range. It is evident from this example plot that although many subjects responded with low ratings for this color combination, a few subjects responded with quite high ratings. Preliminary testing using all of the raw data confirmed the fact that significant rating differences between subjects were present and were thus not well represented by the mean value. It was decided therefore that the median subject rating for each color combination would be used as the summary response variable in all subsequent analyses.

Figure 4: Boxplot summarizing subject responses to one of the images in category 1. The plot illustrates that the data tended to be skewed and were therefore better represented by the median summary statistic.
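The quantities drawn in a boxplot of this kind can be computed with Python's standard `statistics` module, as in the sketch below. The function name `boxplot_summary` is an assumption, and the ratings are made-up values chosen to show a skewed sample; they are not data from the study. Quartile conventions differ between statistical packages, so the quartiles here may not match those of the software used for Figure 4.

```python
from statistics import mean, median, quantiles


def boxplot_summary(ratings):
    """Median, quartiles, 1.5 x IQR fences and outliers for a list of ratings."""
    q1, _, q3 = quantiles(ratings, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = sorted(r for r in ratings if r < lower_fence or r > upper_fence)
    return {
        "median": median(ratings),
        "q1": q1,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Made-up ratings on the 0-100 scale, NOT data from the study.
example_ratings = [5, 10, 12, 15, 18, 20, 22, 25, 90]
summary = boxplot_summary(example_ratings)
print(summary["median"], mean(example_ratings), summary["outliers"])
```

In a sample like this one, a single high rating pulls the mean well above the median, which is the behavior that motivates using the median as the summary statistic.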

Figure 5 (below) shows a plot of the median subject ratings versus category. Due to the nature of the scale used for the category data, non-parametric statistics (Spearman's correlation coefficient) were used to evaluate the goodness of fit between the readability rating output by the algorithm based on the category number and the median subject ratings. The correlation coefficient is 0.77 and it is significant at p < .001. It can be seen that as the image category increases, so does the user rating.

Figure 5: Graph showing median subject rating and image category number.
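Spearman's correlation coefficient is the ordinary (Pearson) correlation computed on ranks, with tied values sharing the average of their ranks. Ties matter here because many images share the same category number. The following Python sketch shows the calculation; the function names are assumptions, and it computes only the coefficient, not the significance level.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

A call such as `spearman(category_numbers, median_ratings)`, where both argument names are hypothetical lists with one entry per image, would give the coefficient for a set of images.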

In order to assess whether the direction of the brightness difference (i.e. bright text on dim background vs. dim text on bright background) is of any importance in determining readability ratings, brightness difference was calculated as the difference between background and text brightness. In the case where the background was brighter than the text, this difference was found to be positive. For those images in which the text was brighter than the background, this difference was negative. The brightness difference was then plotted against the median rating, and yielded a 'U-shaped' plot, shown in Figure 6 (below). This plot suggests that the direction of the brightness difference does not appear to be of importance in determining overall readability ratings. Low readability ratings were observed to be associated with brightness differences close to zero, and ratings were found to increase as the distance from the brightness difference to zero increased. This provides support to the theory that readability ratings are higher for images which contain a greater difference in the brightness of the text and background, regardless of the direction of this difference.

Figure 6: Graph showing the relationship between user rating and brightness difference

To support this hypothesis, a regression analysis was run in which brightness difference was separated into two components. In the case where background brightness was found to be greater than text brightness, the first term in the model was set to be equal to the brightness difference, and the second was set to 0. In the case where text brightness was greater than background brightness, the first term was set to 0 and the second term was equal to the absolute value of the brightness difference. A t-test was then performed on these two parameter estimates in order to assess whether they differed significantly from each other. This test yielded a t-value of 1.36 (p=0.1827), indicating that we have no evidence to suggest a difference in readability rating due to the direction of the brightness difference.
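The coding of the two regression terms described above can be sketched as a small Python function. The name `split_brightness_difference` and the order of the returned pair are assumptions; the sketch covers only the construction of the two predictor values for one color combination, not the regression or the t-test.

```python
def split_brightness_difference(background_brightness, text_brightness):
    """Return (dark_on_light_term, light_on_dark_term) for one combination.

    Exactly one term is non-zero unless the two brightness values are equal,
    in which case both terms are 0.
    """
    difference = background_brightness - text_brightness
    if difference > 0:
        # Background brighter than text: first term carries the difference.
        return (difference, 0)
    # Text brighter than (or equal to) background: second term carries
    # the absolute value of the difference.
    return (0, abs(difference))


print(split_brightness_difference(249, 6))
print(split_brightness_difference(6, 249))
```

If the two fitted coefficients do not differ significantly, a given brightness difference contributes the same amount to the predicted rating in either direction.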

To assess the effects of color difference, brightness difference and overall brightness on the readability of web pages, a linear regression was performed. The median subject rating was used as the response variable, and the independent variables consisted of the absolute value of the brightness difference between text and background, the absolute value of the color difference between text and background, and the brightness value of the text on the web page. The overall model was found to be highly significant (p < 0.0001), suggesting that a significant relationship exists between readability ratings and these three predictors. A t-test was performed on the parameter estimate of brightness difference, and was found to be significant at alpha=0.05 (p=0.0049). This result provides evidence to suggest that brightness difference is an important factor in readability ratings when taking the effects of color difference and overall brightness into account. Similarly, a t-test performed on the parameter estimate for color difference was also found to be significant (p<0.0001), which indicates that color difference plays an important role in determining ratings as well, when controlling for the effects of the other two terms. Testing of the parameter estimate for text brightness yielded a p-value of 0.0544. While this result is not significant at alpha=0.05, it does provide some evidence to suggest that overall brightness may be of moderate importance in determining readability ratings.

Discussion

These results demonstrate that it is possible to judge the readability of web pages based on their color specifications. However, the judgement, based on brightness difference and color difference, is not entirely accurate. There are other factors that influence readability that call for further investigation. Several color samples that were rated excellent in terms of brightness difference and hue difference by our algorithm were rated poorly by our subjects in terms of readability. We noticed that in some of these samples, the large difference in hue caused artifacts around the borders between the text and the background when displayed on a Cathode Ray Tube (CRT) monitor. It may be that the very large difference in hue causes these artifacts and decreases their readability. Other samples that were rated poor by our algorithm in terms of brightness difference and hue difference were rated very highly by our subjects in terms of readability. It may be that several color combinations are more easily detected by our visual system and are inherently more readable. Some color combinations may display better or worse depending on the design of the computer display, the display settings or the type of ambient light.

Our analysis of the subjects' responses resulted in an algorithm that predicts the readability of our test color samples. We are unsure if this algorithm can accurately predict the readability of all color combinations but are proceeding to test if this is true. We hope that an algorithm derived from the subjects' ratings could account for the factors that are difficult to measure.

It is important to note that we used only the 'browser safe' set of 216 colors in our test. Other color combinations, outside of this set, may be 'dithered' on the user's display system and cause different readability results. As computer systems evolve, they are able to display more colors so the problem of dithering may disappear over time.

Web pages commonly use more than two colors for the display of text. Most pages use one color for normal text, another color for unaccessed hyperlink text and a third color for accessed hyperlink text. How these three colors interact with each other and the background color was not investigated and may be a general factor in readability.

In summary, the results of this experiment demonstrated that despite the variability in individual ratings, the subjects' responses fit very well to the readability of colors predicted by the algorithm. Several other important findings also resulted from the analysis including the fact that light-on-dark color combinations were rated the same as dark-on-light color combinations and the fact that as the overall brightness of the text/background combination increased, so did the overall user rating. We plan to carry out further testing which will explicitly examine these and other color readability phenomena.

Acknowledgments

We thank Taras Kowaliw for the CGI and JavaScript programming on the Internet-based test, Tamara Arenovich for the statistical design and analysis, Jeff Jutai and Kent Campbell for advice about the study design, Linda Petty for help with the visual function questionnaire, and Ethan Hoddes for data entry.

References