12 Nov 2015

Research Issues and Methods in Menu Selection

It is one thing to generate theory. It is another thing to prove it. When it comes to theory that involves human behavior, verification cannot rest on intuition, argument, or opinion. Instead, verification must rest on the bench of empirical research. In the arena of human/computer interaction it is all too easy to generate theory, principles, and guidelines and to apply them without restraint. Although intuition aids in good design, there are many cases in which the best design is initially counterintuitive even to experts in the field. Research in the form of controlled laboratory experiments, observation in usability labs, and even field observations is needed to verify decisions about design options. Furthermore, research is needed to generate theory and to discover general principles in design.
This chapter emphasizes the importance of research based on systematic experiments involving human subjects. It is conceded that such research is difficult, costly, and time-consuming. However, it is argued that menu design in the absence of empirical research spells disaster. Two examples of lack of verification will be discussed.
A number of issues central to empirical research will be discussed as they pertain to studies of menu selection. The first question is how to provide reliable, valid, and efficient results; here the issues of statistical reliability and generality will be discussed. Second, the principles of experimental design will be presented along with the unique problems faced in human/computer interaction research. Finally, the issues of efficiency of design and statistical power will be discussed.
The material in this chapter is not meant as a review of statistical methods; a familiarity with statistics is assumed. Rather, it is meant to elucidate particular issues and to motivate researchers to conduct more and better experiments.

5.1 Intuition and Data in Conflict
Design in the absence of data results in arbitrary, capricious, and uninformed decisions. Of course, such decisions have to be made. Production deadlines force the designer to operate on intuition rather than data when established guidelines do not exist. The result is uncertain, and the cost of design errors varies with their type and severity. Five different types of errors are defined below:
(a) Undetected Positive Feature: The designer may not realize that some factor has an effect, either positive or negative, on performance. Since the factor is not viewed as relevant, an arbitrary level may be selected that has a positive effect. Since innumerable design decisions are made for any menu system, it stands to reason that designers must make some good decisions. The problem is that they may not realize when they have hit upon an important feature. It may well be that the positive nature of a feature is only discovered when it is changed or omitted in subsequent versions of the software.
(b) Undetected Negative Feature: Again, the designer may not realize that a factor has an effect, but in this case, may arbitrarily select a level that has a negative effect. It may only be after successive usability testing or field observation that the negative feature is detected and changed.
(c) False Relevance: The designer may think that some factor has an effect when it is actually irrelevant. Although design decisions along such factors do not affect behavior, they may affect cost. Much time and effort may be expended to provide a feature that has little or no benefit.
(d) False Positive Feature: The designer may think that a design option has a positive effect when in actuality it is negative. This is often the result of designers thinking that users think like they do. But what is good for the expert may be bad for the novice and vice versa.
(e) Missed Positive Feature: Finally, the designer may totally miss a feature that could have a positive effect on performance.
All but the first of these errors result in negative outcomes and could be avoided by careful research. However, as noted, there is little time for empirical research at the point of design, although some have run studies on crucial design questions. Rather, the time for research is prior to design, in an effort to establish a literature that will answer design questions as they arise. Such a literature may also be used to generate guidelines and design principles.
The central question is how to build such a literature. In essence, the answer lies in what research questions to ask. Basic research progresses from the question, "Does Factor X have an effect on Performance Y?" Such questions arise either from theories about human/computer interaction that postulate the effects of factors on behavior or from lists of design features that could have an effect. In the first case research is theory driven; in the second, it is design driven. In either case, the effect must be subjected to empirical test or it remains in the realm of intuition.

5.2 Replicability
Empirical research demands replicability. If the same result cannot be replicated, the conclusion is not valid. Replicability adds inductive support to the conclusion. Anecdotal evidence and single case studies lack replicability. They constitute one instance without corroborating evidence. Empirical evidence must be provided that the results have predictive power and application beyond the one case at hand.
Consider a comparison between two systems, A and B. Imagine that only two users, 1 and 2, are randomly sampled and each assigned to one system. Now suppose that User 1 on System A performed better than User 2 on System B. The advantage could be due to the fact that (a) System A is superior to System B, (b) User 1 is more proficient than User 2, or (c) User 1 happens to work better on System A and/or User 2 happens to work better on System B. The problem is that the difference cannot be unambiguously ascribed to a system advantage. The question is whether the same advantage would remain if two more users were randomly sampled and assigned to groups. When additional observations corroborate the result, one's confidence in the conclusion is increased. The question is "How much?"
Statistical theory provides an estimate of replicability within a population. The significance level of a test indicates the probability that the result was due to sampling variation rather than a true effect. For example, consider that the means of two groups are compared and a p-value of .01 is reported for significance. One would be assured that a difference of this size or greater would occur only one time out of a hundred when no true (repeatable) difference exists. The power of a statistical test is defined as the probability that a result is significant when indeed there is a true difference in the population. Statistical power is gained by increasing the number of observations. In general, this means increasing the number of users tested in the experiment or repeating observations on the same user.
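To make the relationship between effect size, sample size, and power concrete, the following sketch (a hypothetical illustration using Python's statsmodels package) estimates how many users per group a two-group comparison needs to reach 80% power at the .05 significance level:

```python
# Sketch: users needed per group for 80% power at alpha = .05.
# effect_size is Cohen's d, discussed in Section 5.3.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # small, medium, and large effects
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative='two-sided')
    print(f"d = {d}: about {n:.0f} users per group")
```

Detecting a small effect (d = .2) requires roughly fifteen times as many users as detecting a large one (d = .8), which is one reason small effects so often go unverified.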
Another form of replicability occurs across experiments. One researcher may find that Condition A results in better performance than Condition B. However, another researcher runs a similar study and does not replicate the finding. Such a result raises interesting questions. It could be that the first study was in error and that no true difference exists. Additional research is required to resolve the conflict. On the other hand, it could be that some difference existed between the two studies that resulted in the discrepancy (e.g., different levels of user experience, different equipment, different tasks, etc.). Subsequent research is required in which additional factors are varied to see when the effect occurs and when it does not. These studies investigate moderating factors and interactions of factors.
But to the extent that results are replicated in different experiments, by different researchers, and under somewhat different conditions, the effect under investigation is said to be robust. Robust effects are extremely important because they are most likely to impact the performance of current design. In addition, procedures have been developed to combine results from a number of experiments to generate a "meta-analysis" of the effect (e.g., Hedges, 1985; Hunter, 1982). Meta-analyses are often useful in resolving discrepancies among studies when they categorize studies by factors that were not initially varied in any one experiment. One result may have occurred, for example, in studies using only novice users and a different result in studies using only experienced users. Thus, meta-analytic procedures may uncover issues not originally considered by the researchers.
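The core of a simple fixed-effect meta-analysis is inverse-variance weighting: each study's effect size is weighted by the precision with which it was estimated, so more precise studies count more toward the pooled estimate. The sketch below illustrates the arithmetic with hypothetical effect sizes and standard errors:

```python
# Sketch of fixed-effect meta-analysis by inverse-variance weighting.
# The effect sizes (d) and standard errors below are hypothetical.
import math

studies = [(0.45, 0.20),  # (d, standard error), e.g., novice users
           (0.30, 0.15),
           (0.60, 0.25)]

weights = [1 / se ** 2 for _, se in studies]  # precision weights
pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled d = {pooled_d:.2f}, 95% CI +/- {1.96 * pooled_se:.2f}")
```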

5.3 Importance of the Result
Just because a factor has a reliable effect does not necessarily mean that it has a big effect on performance or should have great impact on design. It all depends on the size of the effect and its frequency of occurrence. Effect size may be measured in two ways. It may be measured in terms of absolute magnitude. For example, the difference in user response time between one menu layout and another may be only 100 ms. If the menu is accessed only a few times in a session, the effect is certainly not an important one. However, if the difference were 1 second and the menu were accessed hundreds of times, the impact would be more substantial. Effects that are measured in terms of time, number of transactions, or percent error can be easily translated into impact. Furthermore, one can gauge the importance of the effect by percent improvement. A menu organization that reduces selection error by 15% can be directly compared to others.
Effect size is also gauged against total variability in performance. Total variability is the sum of all effects due to differences among conditions and users. Effect size can be defined as the observed difference divided by the population standard deviation (Cohen, 1977). An effect of size .2 is said to be a small effect, .5 a medium effect, and .8 a large effect.
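Computing a standardized effect size is straightforward. The sketch below uses hypothetical selection times for two menu layouts and estimates Cohen's d as the mean difference divided by the pooled standard deviation:

```python
# Sketch: Cohen's d for two independent groups, using hypothetical
# selection times (seconds) under menu layouts A and B.
import statistics

times_a = [2.1, 2.4, 1.9, 2.6, 2.2, 2.5]
times_b = [2.8, 3.1, 2.6, 3.0, 2.7, 3.3]

n_a, n_b = len(times_a), len(times_b)
var_a, var_b = statistics.variance(times_a), statistics.variance(times_b)

# Pooled standard deviation under an equal-variance assumption
pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
d = (statistics.mean(times_b) - statistics.mean(times_a)) / pooled_sd
print(f"Cohen's d = {d:.2f}")  # .2 small, .5 medium, .8 large (Cohen, 1977)
```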
Small Effects: An effect may be small either because it truly has a minor effect on performance or because the experimental conditions tested only subtle variations (e.g., font size of 10 vs. 12 rather than 10 vs. 24). If small effects are the result of subtle variations in conditions, it suggests that a large effect may surface when conditions differ to a greater degree. Furthermore, it must be remembered that even though an effect may truly be small, its cumulative impact on system design can be substantial if there is a high frequency of occurrence. Finally, the fact that an effect exists, no matter what its size, is important from a theoretical standpoint in that it reveals something about the cognitive processing of the user.
Medium Effects: Most experimental results seem to produce effects in the moderate range. These are often the design features that must be carefully traded off against time and cost, aspects of productivity, and throughput of work. A number of medium effects can work together to gain an overall advantage.
Large Effects: Large effects must always be considered. To the extent that a new feature results in a substantial improvement in performance, it is worth the effort. These are the effects that will help to drive technology and design change. There are large effects, however, that the designer can do little about. It would be nice to design a system in which the users are immediately trained experts; in practice, one has to live with some large effects. Furthermore, some large effects are reported in the literature that do not bear on design. The experiment may have compared a reasonable design with a severely crippled version to illustrate a point. For example, a large effect in search time may be found using an alphabetic listing of names versus a randomly ordered listing. Hopefully, no designer would consider using the random order.

5.4 Generalization of Results
It has been noted in previous chapters that subject variables, task variables, and system variables all affect performance. Statistical theory stringently limits the generality of the results to the same subject population sampled, the same experimental conditions, and the same fixed levels of the independent variables. Strict adherence to this requirement would make all research futile. Instead, a principle of reasonable generalization must be adopted. Results from one class of users may be generalized to a similar class of users. Results from one type of task may be generalized to a similar type of task. Results from one set of experimental conditions may be generalized to a similar set of conditions. But it should be clear that one cannot generalize results found using a group of novice users to a group of experienced users without empirical evidence that the same results obtain for both groups.
Table 5.1 lists a number of variables that should be considered when one attempts to generalize results. This list is by no means exhaustive. Any specific experiment can only manipulate a small number of these variables. The rest are typically fixed at some arbitrary value. The question is whether the effect will be the same at different levels of other variables. If so, then the results are robust and can be generalized to other conditions. If not, additional experiments are needed to investigate the limits of generalizability.
The insightful researcher must ascertain the relative impact and importance of conditions and the way in which they may attenuate or mitigate the effects of other variables. First, the generalizability of results is increased if the experimenter arranges the experimental conditions to be most similar to the conditions to which the results are to be applied; that is, standard equipment is used, similar users are sampled, typical task demands are imposed, etc. The experiment is said to be ecologically valid if the conditions in the laboratory simulate the real-world conditions or ecology (Neisser, 1976).

Table 5.1
Classes of User, Task, and System Variables that Should Be Considered in Generalization



  • User Classes
    • Experience (knowledge, understanding, automaticity, etc.)
    • Skills (typing, clerical, writing, drawing, motor, visual, etc. )
    • Motivation (achievement, excitation level, etc.)
    • Personality (style, temperament, etc.)
  • Task Conditions
    • Knowledge domain (structure, concreteness, richness, etc.)
    • Definition (problem definition, types of targets, restrictions on solution)
    • Demands (time pressure, costs of errors, etc.)
  • System Conditions
    • Input Devices (keyboard, mouse, etc.)
    • Display Devices (monitor, audio, etc.)
  • Environmental Conditions
    • Social/Organization (audience, status, role, etc.)
    • Physical (lighting, sound, etc.)

In order to generalize the results from a specific experiment to a design situation, a number of steps must be considered. These steps are outlined in the following four questions:
* What are the specific conditions (Source Conditions) under which the empirical results were obtained?
* To what extent can these specific conditions be assumed to generalize to a wider set of conditions (Generalized Conditions) and how robust are the empirical results (Generalizability Factor)?
* What are the specific conditions (Target Conditions) to which the empirical results are to be applied?
* To what extent are the specific conditions in the application subsumed by the wider set (Applicability Factor)?
Figure 5.1 shows a schematic of the generalization process. The box on the left-hand side symbolizes the specific conditions of the experiment. The set of boxes in the middle symbolizes equivalence classes for the original conditions. Finally, the box at the right-hand side symbolizes the set of conditions to which the results are applied.


5.5 Experimental Designs
Experimental design is concerned with the assignment of cases to conditions. Assignment is made in such a way that confounding factors, called nuisance variables, are systematically controlled or cancelled out while allowing true effects to be clearly ascribed to the independent variables of interest. Experimental design is a very complex topic and beyond the scope of the present discussion. Nevertheless, a few key issues will be presented as they relate to experimental research on menu selection.
In a typical experiment on menu selection one or more design features are selected and varied systematically to create treatment conditions. There are two basic ways of assigning treatments to subjects. In a between subjects design (also called a "completely randomized design") treatment levels are assigned to independent groups of subjects (see the left panel of Figure 5.2). In a within subjects design (also called a "randomized block" or "repeated measures" design) the treatment levels are assigned to the same subjects, that is, the subject participates in multiple treatments (see the right panel of Figure 5.2). The advantage of the within subjects design is that performance of individual subjects in one condition can be directly compared to their performance in another. In a sense, subjects serve as their own baseline or control. The between subjects design has less statistical power because comparisons are made between two independent groups of subjects.
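The power advantage of the within subjects design is easy to demonstrate by simulation. In the hypothetical sketch below, user-to-user variation is large relative to a 1-second system effect; analyzing the data as independent groups can miss the effect, while the paired analysis, in which each subject serves as their own control, recovers it:

```python
# Sketch: the same simulated 1-second system effect analyzed two ways.
# Large individual differences inflate the between-subjects error term
# but cancel out in the paired (within-subjects) comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 12
baseline = rng.normal(10.0, 3.0, n)              # stable user differences
time_a = baseline + rng.normal(0, 0.5, n)        # task time on System A
time_b = baseline + 1.0 + rng.normal(0, 0.5, n)  # System B is 1 s slower

print("unpaired p =", stats.ttest_ind(time_a, time_b).pvalue)
print("paired   p =", stats.ttest_rel(time_a, time_b).pvalue)
```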

Unfortunately, the within subjects design is not without problems. Since each subject participates in several conditions, differences may be due to practice and/or sequence effects. Subjects may get better with practice so that the second treatment has an order advantage. The second panel of Figure 5.2 illustrates this case. One does not know if the 4 point advantage for System B is due to the system or to being second. This problem can in general be controlled by counterbalancing treatment orders. Half of the subjects work with System A first and System B second; the other half work with System B first and System A second. Essentially, order of treatment is added as a between subjects factor in the design. The left panel of Figure 5.3 shows how order is added to the design. In these data there is a 2 point difference due to system and a 2 point difference due to order.
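Operationally, counterbalancing is a balanced random assignment of subjects to treatment orders, as in this sketch (the subject labels are hypothetical):

```python
# Sketch: randomly assign half the subjects to order A-B, half to B-A,
# so that practice effects are balanced across the two systems.
import random

subjects = [f"S{i:02d}" for i in range(1, 13)]
random.shuffle(subjects)
half = len(subjects) // 2
orders = {s: ("A", "B") for s in subjects[:half]}
orders.update({s: ("B", "A") for s in subjects[half:]})

for s in sorted(orders):
    print(s, "->", " then ".join(orders[s]))
```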

Sequence effects occur if there is an advantage or disadvantage to participating in one condition before the other. Directional transfer of training and contrast effects can cause sequence effects. For example, in working with System A first, the user may gain a full understanding of the task, which subsequently results in good performance on System B. However, in working with System B first, the user may become confused, which subsequently results in poor performance on System A. The right panel of Figure 5.3 illustrates the problem with such a system by order interaction. Overall, there is no difference in the means for the two systems and no difference due to order. However, there is a 2 point difference between the A-B and B-A sequences. When this occurs, the only clear indication that one system is superior to the other is in first session performance. The experimenter must fall back on a between subjects comparison using only first session performance. Thus, in the first session data, one can see the 2 point difference due to system.
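In that fallback analysis, the first session is treated as a small between subjects experiment in its own right; with hypothetical first-session scores the comparison might look like this:

```python
# Sketch: between-subjects test on first-session scores only, used when
# a system-by-order interaction contaminates the within-subjects data.
from scipy import stats

first_on_a = [14, 15, 13, 16, 14, 15]  # subjects who started on System A
first_on_b = [12, 13, 12, 14, 11, 13]  # subjects who started on System B

t, p = stats.ttest_ind(first_on_a, first_on_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```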
Between subjects designs should be used when one wants to exclude any transfer of information or practice from one condition to another. It is often the case that the same test material or database is used to assess performance on different systems. A between subjects design allows the experimenter to use that same database and test questions. A within subjects design requires different databases and test materials for each system. Unless the databases and test materials are matched, they must also be counterbalanced for order effects. Figure 5.4 shows the layout for such a design. The analysis of such designs and the interpretation of the results becomes more and more complex due to the three-way interaction of order by system by material.

Within subjects designs are particularly called for when one is interested in practice effects and effects due to time. Mixed designs are used when learning curves are compared between two different systems: successive practice is varied as a within subjects factor and system is varied as a between subjects factor.
Within subjects designs are also required when users are asked to make comparisons among different systems. In order to rate systems, users must be familiar with them prior to the ratings. Ratings made using between subjects designs are made in isolation and are often insensitive to true system characteristics and biased by prior expectations.

5.6 Summary
There are many other issues to consider in the design of experiments. The serious researcher must have a background in statistics and experimental design to do good work. The issues are complex, and the best design is not always intuitively obvious. Every experiment seems to have its unique set of problems and requires a well-thought-out design.
The serious reader of empirical literature must also be knowledgeable about statistical issues and experimental design and be critical of the results. Rather than accepting the author's conclusions at face value, the reader should ask: Are the results reliable? Are they valid, or confounded by other variables? How far can they be generalized?

The research reported in the following chapters must be subjected to the same questions. Although reported in leading journals and conducted by respected researchers, one must nevertheless view all results with a healthy degree of skepticism. As noted earlier, it is all too easy to write guidelines on the basis of intuition. It may be even more dangerous to write them armed with only a little knowledge.
