Trade-Off Analysis of Consumer Values

by Richard M. Johnson, Ph.D.

Reprinted from Journal of Marketing Research, published by the American Marketing Association, Vol. 11 (May 1974), pp. 121-127. All rights reserved.

Introduction

This article develops and describes a method for evaluating the value systems of consumers. The three components of this method are: (1) a technique of data collection requiring a respondent to consider "trade-offs" among desirable alternatives; (2) a computational method which derives "utilities" accounting as nearly as possible for each respondent's choice behavior; and (3) a simple market simulation model which attempts to determine those characteristics of a product which will maximize its share of preference within any particular competitive context. This method has been used in several problem areas in the recent past. These include pricing of condominium units [2] and forecasting air traffic between cities [1].

Much marketing research activity is directed toward trying to find out what consumers want. Consumers are often asked what product attributes are most important to them, or what their "ideal" levels of various product attributes are. Neither of these traditional approaches is entirely satisfactory. For instance, judgments concerning the importance of various attributes are usually ambiguous unless great care is taken in defining attributes. Odor, for instance, may be an "important" attribute when considering products which differ noticeably in odor, but may be quite unimportant with a different sample of products from the same category if they all happen to smell the same. Safety may be regarded as an overpoweringly important attribute of airlines, when considered in the abstract. Yet, if airlines are not considered to differ in degree of safety, it cannot affect a passenger's choice of airline. Importance judgments are therefore not necessarily meaningful unless discussed in a highly specific context.

The identification of "ideal levels" of attributes is also frequently inadequate. There are many product attributes for which ideal levels do differ from consumer to consumer, such as saltiness of pretzels, lightness of beer, or sudsiness of detergent. For attributes such as convenience, economy, or level of performance, however, we can safely assume that every consumer would prefer a product having as high a level of each attribute as possible. What is needed in such cases is information about consumer "trade-offs": since no manufacturer can afford to sell an infinitely convenient and high performing product for a price of zero, it becomes relevant to determine how consumers value various levels of each attribute and the extent to which they would forego a high level of one attribute to achieve a high level of another. The method to be described here is based on the premise that each consumer's choice behavior is governed by such trade-off values and that, although he or she may be unable to articulate them, they may be revealed by choices among product concepts having characteristics which are varied in systematic ways.

Techniques of conjoint measurement have generated much interest in the field of mathematical psychology in the last few years, where the notion was first enunciated by Luce and Tukey [7]. Green and Rao [3] describe the application of such methods to marketing research problems. The basic idea is that by providing consumers with stimuli from among which to choose we can make inferences about their value systems based upon behavior rather than upon self-reports. The word "conjoint" has to do with the fact that we can measure relative values of things considered jointly which might be unmeasurable taken one at a time.

Conjoint measurement is fundamentally different from those types of measurement with which most market researchers are familiar. It requires a basic assumption, or "measurement model," regarding the ways in which attributes of objects are related. Although it requires only rank-order data, it produces measurements which are "stronger" than rank orders. Conjoint measurement is similar in this respect to several nonmetric scaling procedures.

Measurement of Consumer Values

Suppose that we wish to assess the "importance" or "utility" to a prospective car buyer of each level of several car attributes. As a way of collecting data we might give a respondent a pair of attributes and ask for his rank order of preference for cars differing on these two attributes. He would thus be asked to "trade off" these attributes against one another.

Consider cars differing only in price and top speed, and suppose a respondent were to state his rank order of preference for cars with nine combinations of price and top speed. Such data could be arranged as follows:

                Top Speed (MPH)
Price          130    100     70
$2,500           1      2      5
$4,000           3      4      6
$6,000           7      8      9

If we were to examine these data one attribute at a time, we would conclude that this respondent prefers lower prices to higher prices, and faster cars to slower cars, other things being equal. Although we can obtain such potentially valuable information by examining these attributes separately, we can learn much more by examining them jointly. For instance, we see that while this respondent's preferred car will cost $2,500 and go 130 MPH, his second choice shows that he would rather drop to a top speed of 100 MPH than pay the higher price of $4,000. Thus, by considering these two attributes jointly, we can learn something about their relative importance in influencing his preferences. If we wished to investigate this respondent's value system more generally, we could have him express his preferences for cars differing in warranty and seating capacity, warranty and price, and so on. If he were very highly motivated, we could ask him to provide trade-off data for all possible pairs of attributes in which we were interested.

One possible data-gathering procedure consists of giving each respondent a booklet in which each page contains a trade-off matrix with rows representing various levels of one attribute and columns representing levels of a second attribute. The respondent is asked to rank those combinations of attributes presented in each matrix according to his preferences.

This data collection approach is considerably different from another technique described by Green and Rao [3]. With that procedure, which might be called a "concept evaluation" technique, respondents provide rank orders of preference for product concepts which differ simultaneously with respect to all attributes being studied. Each approach has advantages. The concept evaluation approach has the advantages of greater "realism," since respondents are choosing among concepts which are more elaborately specified, and, at least theoretically, of being able to quantify interactions among attributes.

However, for many product categories it is desirable to study upwards of a dozen product attributes. It is hard to handle this many attributes if all concepts are to be given a specified level of each attribute. The pairwise approach has the advantage that the number of attributes to be studied is limited only by constraints of interview length and respondent endurance. A second advantage of the pairwise approach is that the respondents provide information about trade-offs among pairs of attributes in such a direct form that one can infer relative "importances" of attributes by simple tabulations of the data.

We shall now provide a numerical example of how conjoint measurement can be used to infer consumer values from pairwise trade-off data. Let us suppose that automobiles could be described adequately in terms of four attributes, each with three levels. Rank orders of preference for an actual respondent are shown in Table 1, which contains six trade-off matrices, one for each pair of attributes.

Table 1
One Respondent's Trade-off Data
(Rank Orders of Preference)

                       Top Speed         Seating Capacity     Months of Warranty
                     130   100    70       2     4     6        60    12     3
Price
  $2,500               1     2     5       2     1     3         1     3     4
  $4,000               3     4     6       5     4     6         2     5     6
  $6,000               7     8     9       8     7     9         7     8     9
Top Speed
  130 MPH                                  2     1     3         1     2     5
  100 MPH                                  5     4     6         3     4     6
  70 MPH                                   8     7     9         7     8     9
Seating Capacity
  2                                                              2     5     8
  4                                                              1     4     7
  6                                                              3     6     9

Consider a simple model of preference formation which assumes that each respondent has a positive "utility" value for each level of each attribute, and that the relative degree of his "liking" for a specific car is obtained by multiplying together his utilities for the attribute levels describing that car. If we knew a respondent's utilities for the relevant attributes we could predict his rank order of preference for specific cars. A set of utilities for this respondent is provided in Table 2.

Table 2
Estimated Utility Values for One Respondent

Attribute            Level         Utility
Price                $2,500          .57
                     $4,000          .33
                     $6,000          .10
Top Speed            130 MPH         .51
                     100 MPH         .34
                     70 MPH          .15
Seating Capacity     2 persons       .31
                     4 persons       .42
                     6 persons       .27
Warranty             60 months       .49
                     12 months       .31
                     3 months        .20

This person's relative liking for a $4,000, 130 MPH car would be .33 x .51 = .1683. This is only a relative value and will have meaning only when compared with other similarly-derived values for cars having other levels of price and top speed. For this person a $2,500, 100 MPH car would have a relative value of .57 x .34 = .1938. Therefore, this respondent should prefer the $2,500, 100 MPH car. In choosing among cars differing in all four attributes, our respondent's relative values would be obtained by computing the products of four utility values at a time rather than two at a time.
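The multiplicative rule just described can be sketched in a few lines of Python. The utility values are those of Table 2; the dictionary layout and the function name `relative_liking` are illustrative assumptions, not part of the original procedure.

```python
from math import prod

# Utilities from Table 2 (one respondent). The dictionary layout is an
# illustrative assumption made for this sketch.
UTILITIES = {
    "price": {"$2,500": 0.57, "$4,000": 0.33, "$6,000": 0.10},
    "top_speed": {"130 MPH": 0.51, "100 MPH": 0.34, "70 MPH": 0.15},
    "seating": {"2": 0.31, "4": 0.42, "6": 0.27},
    "warranty": {"60 months": 0.49, "12 months": 0.31, "3 months": 0.20},
}


def relative_liking(car):
    """Multiply the utilities of the attribute levels that describe a car."""
    return prod(UTILITIES[attribute][level] for attribute, level in car.items())


car_a = {"price": "$4,000", "top_speed": "130 MPH"}
car_b = {"price": "$2,500", "top_speed": "100 MPH"}
print(round(relative_liking(car_a), 4))  # 0.1683
print(round(relative_liking(car_b), 4))  # 0.1938
```

Passing a dictionary with all four attributes gives the four-way product mentioned in the text; only comparisons between such values are meaningful, not the values themselves.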

This respondent's utilities are estimated so as to account simultaneously for all six of his pairwise trade-off matrices in Table 1. By way of illustration, Table 3 indicates the computations of pairwise products for the price versus speed comparison. This respondent's utilities for the three price levels are shown at the left margin, and his utilities for the three speeds are shown at the top. The value in each cell is obtained by multiplying together his utilities for that row and column. The rank orders of the numerical values in the cells of this table are indicated by the numbers in parentheses. We find that these pairwise products have nearly the same rank order as the data themselves, the single exception being the cells ranked 6 and 7. Thus, the estimated utilities are quite consistent with the data and may be taken as a summary. These utility values are only meaningful in a relative sense. If we were to raise them to any positive exponent (such as squaring them or taking their square roots) their meaning would be unchanged. Also, since their absolute magnitudes are arbitrary, they are scaled so that the sum for each attribute is unity.

Table 3
Pairwise Products of Utilities
(rank order of each product in parentheses)

                     130 MPH        100 MPH        70 MPH
                       .51            .34            .15
$2,500    .57      .2907 (1)      .1938 (2)      .0855 (5)
$4,000    .33      .1683 (3)      .1122 (4)      .0495 (7)
$6,000    .10      .0510 (6)      .0340 (8)      .0150 (9)
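The comparison summarized in Table 3 can be reproduced with a short Python sketch. It uses the price and top-speed utilities from Table 2 and the stated ranks from Table 1; the variable names are illustrative.

```python
# Utilities from Table 2 and stated ranks from the price x top-speed matrix.
price = [0.57, 0.33, 0.10]   # $2,500, $4,000, $6,000
speed = [0.51, 0.34, 0.15]   # 130, 100, 70 MPH
stated = [[1, 2, 5], [3, 4, 6], [7, 8, 9]]

# Each cell value is the row utility times the column utility.
products = [[p * s for s in speed] for p in price]

# Rank the nine products from largest (rank 1) to smallest (rank 9).
flat = sorted((v for row in products for v in row), reverse=True)
predicted = [[flat.index(v) + 1 for v in row] for row in products]

# Collect (stated rank, predicted rank) for every cell where they differ.
mismatches = [
    (stated[i][j], predicted[i][j])
    for i in range(3) for j in range(3)
    if stated[i][j] != predicted[i][j]
]
print(predicted)   # [[1, 2, 5], [3, 4, 7], [6, 8, 9]]
print(mismatches)  # [(6, 7), (7, 6)]
```

The only disagreement is the swap of the cells ranked 6 and 7, as noted in the text.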

Although the model underlying this computation is a multiplicative one, it is not different in any important sense from additive models in more common use. By taking logarithms of these values we could get new values for which sums rather than products would have the desired rank orders. Even considering the arbitrariness of scaling conventions, these particular utility values are not unique; other values obtained by slight modifications of these will still provide pairwise products having almost the same rank order as the data. However, if the respondent had reacted to several pairs involving each attribute and we were to solve simultaneously for utilities "best fitting" all his preference data there is likely to be a unique solution apart from scaling.
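The equivalence with an additive model follows from the fact that the logarithm is strictly increasing. Writing $u_p$ and $u_s$ for the utilities of a price level and a speed level (symbols introduced here for illustration only), and $k$ for any positive exponent:

```latex
\[
\log(u_p\,u_s) = \log u_p + \log u_s ,
\qquad
u_p u_s > u_{p'} u_{s'}
\;\Longleftrightarrow\;
\log u_p + \log u_s > \log u_{p'} + \log u_{s'} ,
\]
\[
u_p u_s > u_{p'} u_{s'}
\;\Longleftrightarrow\;
u_p^{\,k}\, u_s^{\,k} > u_{p'}^{\,k}\, u_{s'}^{\,k}
\qquad (k > 0).
\]
```

The first line shows that sums of log-utilities order the cells exactly as the products do; the second shows why raising all utilities to a positive power leaves the predicted rank orders unchanged.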

Computation

The numerical techniques available to convert the observed rank orders into estimates of utilities are similar to techniques of nonmetric scaling in [4, 5, 6]. The computing method used most frequently here is an iterative procedure which attempts to minimize a measure of "badness of fit" of the utilities to the data. Since the data consist of only rank orders, the measure of fit must indicate the extent to which the pairwise products of utilities have rank orders similar to the data.

Two measures have been helpful; the first of these is Kendall's tau. Suppose we have n objects which have been approximately rank ordered from largest to smallest. The tau statistic is the difference between the proportions of pairs in "right order" and "wrong order." A tau of 1.0 indicates a perfect rank order, a tau of -1.0 indicates a perfect negative relationship, and a value of zero indicates an unrelated ordering.

Suppose that a respondent has filled out a trade-off matrix of size 3 x 3, and we have estimated utilities for him which are multiplied together to produce a "theoretical" value for each cell of the matrix as in Table 3. With 9 cells there are 36 pairs of cells. Tau would be the number of these pairs of cells for which the difference between theoretical values is in the right direction (the same direction as his data for that pair of concepts) minus the number in the wrong direction, all divided by 36. If a respondent had filled out six such matrices and we wished to measure the overall extent to which his utilities fit his data, we would cumulate the numerator and denominator of tau over all 6 matrices. When the utilities in Table 2 are applied to explain the data in Table 1, we get a tau of .935, indicating a reasonably close but not perfect fit.
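The cumulated tau can be sketched in Python as follows. The data are the six matrices of Table 1 and the utilities of Table 2; the function names and data layout are illustrative assumptions. Because Table 2 lists utilities rounded to two decimals, the result (about .94 here) need not reproduce the reported .935 exactly.

```python
from itertools import combinations

# Utilities from Table 2, listed in the level order used in Table 1.
U = {
    "price": [0.57, 0.33, 0.10],     # $2,500, $4,000, $6,000
    "speed": [0.51, 0.34, 0.15],     # 130, 100, 70 MPH
    "seats": [0.31, 0.42, 0.27],     # 2, 4, 6 persons
    "warranty": [0.49, 0.31, 0.20],  # 60, 12, 3 months
}

# Table 1: (row attribute, column attribute) -> stated ranks, 1 = most preferred.
TABLE_1 = {
    ("price", "speed"): [[1, 2, 5], [3, 4, 6], [7, 8, 9]],
    ("price", "seats"): [[2, 1, 3], [5, 4, 6], [8, 7, 9]],
    ("price", "warranty"): [[1, 3, 4], [2, 5, 6], [7, 8, 9]],
    ("speed", "seats"): [[2, 1, 3], [5, 4, 6], [8, 7, 9]],
    ("speed", "warranty"): [[1, 2, 5], [3, 4, 6], [7, 8, 9]],
    ("seats", "warranty"): [[2, 5, 8], [1, 4, 7], [3, 6, 9]],
}


def pair_counts(row_attr, col_attr, ranks):
    """Return (right, wrong, n_pairs) over all pairs of cells in one matrix."""
    cells = [
        (ranks[i][j], U[row_attr][i] * U[col_attr][j])
        for i in range(3) for j in range(3)
    ]
    right = wrong = 0
    for (rank_a, value_a), (rank_b, value_b) in combinations(cells, 2):
        if value_a == value_b:
            continue  # a tie in theoretical value counts as neither
        # A lower rank number means "more preferred", so it should go
        # with the larger theoretical value.
        if (rank_a < rank_b) == (value_a > value_b):
            right += 1
        else:
            wrong += 1
    n_pairs = len(cells) * (len(cells) - 1) // 2  # 36 for a 3 x 3 matrix
    return right, wrong, n_pairs


def overall_tau():
    """Cumulate the numerator and denominator of tau over all six matrices."""
    numerator = denominator = 0
    for (row_attr, col_attr), ranks in TABLE_1.items():
        right, wrong, n_pairs = pair_counts(row_attr, col_attr, ranks)
        numerator += right - wrong
        denominator += n_pairs
    return numerator / denominator


print(round(overall_tau(), 3))
```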

Table 4
Actual versus Predicted Preferences
for Five Optical Product Concepts

Concept    Actual first choice votes    Predicted first choice votes
A                      43                            28
B                     101                           114
C                     157                           117
D                     204                           252
E                     152                           146
Total                 657                           657

The tau statistic is based on a count of numbers of errors without regard to their size. A second measure, phi, takes into account the sizes of errors. For each pair of trade-off cells we consider the ratio of the computed values. If this ratio is denoted by the symbol r, then the quantity [r + (1/r) - 2] may be regarded as a measure of the "distance" of the ratio from one. This quantity is zero if the ratio is one and increases as the ratio becomes either larger or smaller than one. The statistic phi is defined by the expression:

This index would have a value of zero if there were no errors of fit, and a value of one if the order of every pair of cells were incorrectly predicted. The most successful computing algorithm currently available uses a "gradient" technique to minimize phi. Normally, those respondents with low values of phi also have high values of tau, suggesting that either of these indices may work reasonably well in practice as a measure of lack of fit.
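The defining expression for phi does not appear in this copy of the article. One form consistent with the properties stated above (zero when no pair is mispredicted, one when every pair is, and sensitive to the size of each error) is given below. It is an assumption offered for illustration, not necessarily the author's exact formula; the indicator $\delta_{ij}$ and the pair index $(i,j)$ are notation introduced here.

```latex
\[
\phi \;=\;
\frac{\sum_{(i,j)} \delta_{ij}\,\bigl[\,r_{ij} + 1/r_{ij} - 2\,\bigr]}
     {\sum_{(i,j)} \bigl[\,r_{ij} + 1/r_{ij} - 2\,\bigr]},
\qquad
\delta_{ij} =
\begin{cases}
1 & \text{if the pair of cells } (i,j) \text{ is predicted in the wrong order},\\
0 & \text{otherwise,}
\end{cases}
\]
```

Here $r_{ij}$ is the ratio of the computed values for cells $i$ and $j$, and the sums run over all pairs of cells in all of a respondent's trade-off matrices.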

Assumptions

The model of preference formation underlying this method assumes that the attributes studied are independent. This assumption has two ramifications. The first is that the attributes must be nonredundant, or more accurately, they must all be equally redundant. The utility for a collection of attributes is considered to be the product of the utilities of each of its attributes. If an attribute were represented twice, for instance, its utility would figure into the overall as its square, rather than its first power. Lacking any good way to measure the extent of redundancy among the attributes in a list, it seems prudent to conduct preliminary research to formulate attribute lists which are as nonredundant as possible. The second ramification of the independence assumption is one regarding interaction among variables. The model assumes, for instance, that the extent to which a respondent prefers a red car to a black one will be independent of size, price, and model type. It seems possible that red may be someone's preferred color for a convertible while black may be his preferred color for a limousine. This assumption of no interaction is most certainly false when applied to such extreme cases; however, it appears to be tenable under ordinary circumstances. If such interactions do exist in a specific set of data, they will be indicated by unfavorable values of tau and phi.

Determining Optimal Product Characteristics

A simple model of preference formation has been described which expresses an individual's theoretical relative preferences as products of sets of utilities. A method for estimating these utilities from rank order data has also been suggested. We next consider the problem of converting these relative values to something more nearly approaching shares of the market. Suppose a market currently consists of products A, B, C, ..., etc. We wish to predict the relative sales of a new product, X, if X were to become available. The most natural approach would seem to consist of estimating each respondent's overall liking for each product and then counting the number of respondents for whom X has the highest value. This approach assumes that an individual restricts his purchases to his preferred product. This may be nearly true in product categories with high brand loyalties, such as cigarettes, or with infrequent "large ticket" purchases, such as houses. In other product categories it may be more appropriate to employ a probabilistic model which distributes an individual's probability of purchase in some way over his several most preferred products.

Suppose that an appropriate sample of respondents has provided the necessary data and that utilities have been computed for each respondent. We may also have gathered demographic, product consumption, media exposure, and other information about each individual. Suppose we have several experimental versions of a product in mind (which do not necessarily yet exist). We assume that these versions are all feasible from a manufacturing and pricing standpoint, that we could produce any of them, and we wish to choose the "best" version.

We compute each individual's overall liking for the first version of the experimental product, determining whether or not it would have a value higher than any currently available product. If it would have a higher value than any current product, we conclude that this individual would in fact buy it if it were available. If our respondent sample is well chosen, if we weight individuals appropriately to reflect individual differences in consumption, and if we have included the relevant product attributes, then the resulting proportions of respondents with predicted preferences for each product should correspond approximately to actual market shares for currently available products (apart from differences caused by variables unaccounted for, such as advertising and sales force effectiveness). We could then estimate:

  1. how many respondents would choose the experimental product X, in the context of A, B, C, ..., etc.;
  2. what the likely volume of consumption would be;
  3. what products such individuals are now using and from which they would be switched if X were introduced;
  4. who they are, demographically, and how they may be contacted by advertising.

By repeating the process for experimental versions X, X2, ..., etc., we can determine which of these optimizes whichever criterion we wish.
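The first-choice simulation described above can be sketched in Python. Everything in this example is hypothetical: the four respondents, their utilities, the attribute levels, and the products A, B, and X are invented for illustration, and the function names are not from the original work.

```python
from collections import Counter
from math import prod

# Hypothetical utilities for four respondents (invented for illustration).
respondents = [
    {"price": {"low": 0.50, "mid": 0.30, "high": 0.20},
     "speed": {"fast": 0.60, "medium": 0.30, "slow": 0.10}},
    {"price": {"low": 0.70, "mid": 0.25, "high": 0.05},
     "speed": {"fast": 0.40, "medium": 0.35, "slow": 0.25}},
    {"price": {"low": 0.45, "mid": 0.40, "high": 0.15},
     "speed": {"fast": 0.50, "medium": 0.40, "slow": 0.10}},
    {"price": {"low": 0.40, "mid": 0.35, "high": 0.25},
     "speed": {"fast": 0.50, "medium": 0.35, "slow": 0.15}},
]

# Hypothetical current products A and B, plus the experimental product X.
products = {
    "A": {"price": "low", "speed": "slow"},
    "B": {"price": "high", "speed": "fast"},
    "X": {"price": "mid", "speed": "medium"},
}


def liking(utilities, product):
    """Overall liking: the product of the utilities of the product's levels."""
    return prod(utilities[attr][level] for attr, level in product.items())


def first_choices(respondents, products):
    """Return each respondent's highest-valued product (first-choice rule)."""
    return [
        max(products, key=lambda name: liking(u, products[name]))
        for u in respondents
    ]


current = {name: spec for name, spec in products.items() if name != "X"}
before = first_choices(respondents, current)   # market without X
after = first_choices(respondents, products)   # market with X

shares = {name: n / len(after) for name, n in Counter(after).items()}
switched_from = Counter(b for b, a in zip(before, after) if a == "X")
print(shares)
print(switched_from)
```

Each respondent counts equally in this sketch; weighting respondents by consumption, as the text suggests, would replace the simple counts with weighted sums.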

Since computations are done on a respondent-by-respondent basis, it is possible to study interactions among products. For instance, that pair of experimental products could be selected involving relatively little overlap with one another which will theoretically maximize total profitability for the corporation. Likewise, a companion product or line extension can be chosen which appears capable of producing the greatest net increase in total corporate profit.

Evidence Regarding Appropriateness of the Model

Although the procedures described here have been in use for a fairly short time, a number of methodological studies have been conducted, three of which will now be described briefly.

Context Dependence. Even with as few as seven or eight attributes it becomes impractical to have a respondent fill out trade-off matrices for all attribute pairs. It is therefore relevant to inquire whether the utilities obtained for an attribute depend upon the other attributes with which it is compared. In one experiment, involving 24 product attributes, respondents were divided randomly into 2 groups. Respondents in each group filled out trade-off matrices for a subset of 18 attributes. Only 12 attributes were common to both questionnaires, and no pair of attributes appeared in both questionnaires.

It was possible to examine the mean utilities for each level of each attribute to see whether different utilities were produced by the experimental groups as a function of context. The 12 common attributes had a total of 46 levels. A t-test was conducted for each of these to determine whether the means were significantly dissimilar. We would have expected between 4 and 5 differences to appear significant at the 90% level of confidence due to chance alone, but only 3 values this large were observed, somewhat less than chance. Therefore this experiment failed to demonstrate any difficulty with context dependence, and lends support to the practice of exposing respondents to subsets of attribute pairs.*1

External Validation - Prediction of Preference. The most critical question regarding validity of the model is whether a respondent's utilities, when multiplied together properly, do in fact provide an accurate prediction of his preferences. This question was examined in two experiments. In the first of these, respondents filled out 6 trade-off matrices comprising all possible pairings of 4 attributes. Each attribute had 3 levels so that 12 utilities were computed for each respondent. It would have been possible to specify 3^4 = 81 possible hypothetical product descriptions using these attributes. A subset of 12 of these was chosen, each having the characteristic that it had the "best" level of one attribute, the "worst" level of another, and "middle" levels of the remaining two attributes. The same respondents also provided rank orders of preference for these 12 hypothetical products. We were interested in determining how closely the actual rank orders of preference for these 12 concepts would be predicted by the model.
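A short enumeration in Python confirms the counts in this design: of the 81 possible descriptions, exactly 12 have one "best," one "worst," and two "middle" levels. The numeric level codes are an illustrative convention, and the sketch makes no claim about how the original subset was actually constructed.

```python
from itertools import product

ATTRIBUTES = 4
BEST, MIDDLE, WORST = 0, 1, 2  # level codes, an illustrative convention

# All 3^4 hypothetical product descriptions.
all_profiles = list(product((BEST, MIDDLE, WORST), repeat=ATTRIBUTES))

# Keep profiles with exactly one "best", one "worst", and two "middle" levels.
balanced = [
    p for p in all_profiles
    if p.count(BEST) == 1 and p.count(WORST) == 1 and p.count(MIDDLE) == 2
]

print(len(all_profiles))  # 81
print(len(balanced))      # 12
```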

It should be noted that the model was being asked to work under exceptionally difficult circumstances. The 12 hypothetical products were chosen so as to be as nearly equivalent in overall desirability as possible. If a product had been included with the "best" level of each attribute and/or one with the "worst" level of each attribute, the prediction of preference would most surely have been easier.

In order to assess the goodness of prediction, a rank order correlation coefficient was computed between the actual and predicted rank orders of preference for each respondent. The median of these values was .80. This was felt to constitute a reasonable level of prediction, given the unreliability inherent in the measure being predicted.

Since the fit was not perfect, however, it seemed prudent to inquire whether the errors tended to be random or systematic. If, for instance, the model tended systematically to over- or underpredict level of preference for any of the 12 concepts, we would have evidence of its failure to account for some aspect of the respondents' preferences. The respondents' rank orders of preference were therefore averaged, as were the rank orders of their predicted preference. The rank order correlation between these two sets of averages was .91. This appeared to be an acceptably high value, and inspection of the differences between the two sets of averages provided no evidence of systematic over- or undervaluing any attribute.

In another study respondents filled out 15 trade-off matrices dealing with 10 attributes of products in a "hard goods" category. The same respondents were also presented with five concept statements describing hypothetical products from this category and asked to rank these concepts in order of their preferences. Each respondent's trade-off data were used to estimate his utilities for the 10 attributes, and these were used in turn to predict the rank order of his preferences for the 5 concepts.

The distribution of actual and predicted first choices is given in Table 4. The distribution of first choice votes estimated from the trade-off data is similar to the actual distribution (r= .92), with the exceptions that Concept D is overpredicted by about 25%, while Concept C is underpredicted by a similar amount. The fit is less impressive on a person-by-person basis, however. With 5 products we should expect to predict a respondent's first choice correctly 1 time in 5, or 20% of the time. The actual number of "hits" is 294, representing a success rate of approximately 45%. Thus the similarity of the distributions of actual and estimated first choices is partially caused by compensating errors. This corresponds to experience in other product categories, where the success rate at predicting first choice has ranged from a low of about 40% (twice the chance level with 5 products) to a high of 85% (about 6 times the chance level with 7 products).
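The arithmetic behind these comparisons can be checked with a few lines of Python, using the vote counts from Table 4 and the stated number of hits. The variable names are illustrative.

```python
# First-choice votes from Table 4.
actual = {"A": 43, "B": 101, "C": 157, "D": 204, "E": 152}
predicted = {"A": 28, "B": 114, "C": 117, "D": 252, "E": 146}
hits = 294  # respondents whose first choice was predicted correctly

total = sum(actual.values())  # 657 respondents

# Relative over-prediction (+) or under-prediction (-) for each concept.
relative_error = {c: (predicted[c] - actual[c]) / actual[c] for c in actual}

hit_rate = hits / total         # observed success rate
chance_rate = 1 / len(actual)   # 1 in 5 when guessing among five concepts

print(round(relative_error["D"], 2))  # 0.24
print(round(relative_error["C"], 2))  # -0.25
print(round(hit_rate, 3))             # 0.447
print(chance_rate)                    # 0.2
```

The observed hit rate is a little more than twice the chance rate, in line with the range reported for other product categories.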

An analysis was also conducted to determine whether prediction was more successful among most and least preferred products than among products in the middle of the preference distribution. For each respondent the five concepts were arranged in order of stated preference, from first choice to last choice. For each pair of positions a check was then made to see whether the model made the correct prediction of pairwise order. Cumulating over respondents, we can determine the accuracy with which the model predicts pairwise preference for any two actual preference ranks. These percentages are provided in Table 5.
Table 5
Percent Accuracy for Pairwise Preference Predictions

Rank order of less      Rank order of more preferred concept
preferred concept          1       2       3       4
        2                 64
        3                 67      60
        4                 75      69      59
        5                 82      82      75      70

With these data the model was successful in predicting preference of the most preferred over the second most preferred concept only 64% of the time, while it predicted the preference of the most over the least preferred concept 82% of the time. The lowest percent accuracy figures are for discrimination between rank orders 2 versus 3 and 3 versus 4, as might be expected. The success rate increases in general as the spread between rank order positions increases, and is somewhat higher for most and least preferred concepts than for those more in the middle. Perhaps surprisingly, the model is somewhat more successful at predicting least preferred product than most preferred, though this may be a specific characteristic of these data.

Evaluation and Conclusions

The greatest strength of the procedure seems to be its ability to generate rather refined predictions from quite primitive data. This is a characteristic of all the nonmetric scaling methods. A second strength is its apparently wide applicability. Not only can the model provide predictions about levels of buying interest for new concepts, but it can also provide information about the "trade-offs" among product attributes. For instance, the model can estimate how much price might be increased when a new feature is included without loss of market share, or whether one feature might be substituted for another without loss of share. The procedure is by no means limited to verbalized product dimensions. It is possible to study color, odor, texture, size, shape, and other physical product attributes with the aid of visual or other sensory aids. Indeed, the simulation procedure can easily be generalized to incorporate each respondent's perceptions of current products or new product concepts on such subjective attributes as "beauty" or "satisfaction."

A third important benefit of the procedure is that the concepts tested need not actually exist in concept form. Even if there were only 10 relevant attributes with only 2 levels each, a total of 2^10 = 1024 possible concept statements would be definable. Using traditional concept testing procedures, several hundred tests might be required to explore all the attractive possibilities. Using a model such as that described here, all 1024 of them could be evaluated in some sense without ever exposing even one concept to any respondent. The cost of this strength is the rather heroic assumption of no interaction among attributes. The model assumes that the whole is equal to the sum (literally, the product) of its parts and, to whatever extent this assumption is false, it will produce misleading results.

It is clear that the usefulness of the procedures described here will ultimately be judged on an empirical basis. At this time there are not yet any clear predictions made by the model which have had time to be proven either true or false in the marketplace. Such information should soon become available, however, since the model has been applied in a number of subject areas.

*1 The t-statistic is not strictly appropriate for this purpose, since it involves a normality assumption which utilities probably do not satisfy (although their logarithms might). However, the t-statistic is generally considered to be relatively robust under this condition, a property not so characteristic of multivariate analysis of variance, which might otherwise have been a more appropriate technique.

References

  1. Davidson, J. D. "Forecasting Traffic on STOL," Operational Research Quarterly, 24 (1973), 561-9.
  2. Fiedler, John A. "Condominium Design and Pricing: A Case Study in Consumer Trade-Off Analysis" in M. Venkatesan, ed., Proceedings, Third Annual Conference, Association for Consumer Research, 1972, 279-93.
  3. Green, Paul E. and Vithala Rao, "Conjoint Measurement for Quantifying Judgmental Data," Journal of Marketing Research, 8 (August 1971), 355-63.
  4. Johnson, Richard M. "Pairwise Nonmetric Multidimensional Scaling," Psychometrika, 38 (March 1973), 11-8.
  5. Kruskal, Joseph B. "Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis," Psychometrika, 29 (March 1964), 1-27.
  6. -----. "Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data," Journal of the Royal Statistical Society, Series B, 27 (March 1965), 251-63.
  7. Luce, R. Duncan and John W. Tukey, "Simultaneous Conjoint Measurement: A New Type of Fundamental Measurement," Journal of Mathematical Psychology, 1 (February 1964), 1-27.


Sawtooth Technologies, Inc. 1500 Skokie Blvd., Suite 510, Northbrook, IL 60062
Tel: 847.239.7300 Fax: 847.239.7301 E-mail: info@sawtooth.com

©1999 Sawtooth Technologies, Inc. All rights reserved. A-00-01b