Sensory analysis

 

Sensory analysis is the science of determining the attributes of products using the human senses, with the expert wine taster as a familiar example.

Although chemistry, physics or microbiology can tell us about some of the ways in which products differ, other important attributes, such as pleasantness or the similarity of one odour to another, can be measured only by using human assessors.  There are well-developed methods for characterizing products in this way.  Difference testing is a special branch of sensory analysis that addresses a particular kind of problem.

 


 

Difference testing

Difference testing is concerned with questions such as whether or not two batches of a product are noticeably different.  Probably no two batches are ever completely identical, given sufficiently sensitive analytical methods, but even real differences may not be noticed by consumers.  Here are a few examples.

 

Many types of difference test have been devised.  Those in widespread use are described in detail in textbooks of sensory methods and in national and international standards.  All require an 'assessor' or, more usually, a panel of several assessors, to carry out a task.  An assessor who detects the difference with certainty can perform the task perfectly, but an assessor who is unsure about the difference will frequently make errors.

Results from difference testing are rarely clear-cut. Usually, they are statistical in nature, which means that answers always have some uncertainty. The aim of statistical analysis of difference tests is to obtain answers that are as precise as the data allow and whose uncertainty is quantified.

One tool for quantifying the uncertainty of a difference test is a test of statistical significance. This calculates a numerical probability, the significance level of the result. This is the probability of obtaining results showing at least as much detectability as was actually observed if, in truth, the difference is completely undetectable.  This probability can be calculated, since if the difference is totally undetectable, answers must be given at random and the probability of being correct on any one trial can be deduced from the nature of the test.

 


 

Statistical significance

Suppose a difference test required each of 12 assessors to select the sweetest sample from a different set of three.  In each set, two samples were identical and one had more sweetener in it.  Of course, various precautions had to be taken to ensure that no difference other than sweetness could be used to select the odd sample.  On each trial (each presentation of three samples), the assessor was asked to select one of the three as sweetest, even if the choice had to be a guess.  This procedure is known as 3-alternative forced choice, or 3-AFC.  (Note that this procedure is different from the triangle test, which also involves the selection of one out of three samples.)

If the difference is truly undetectable, the assessor has to choose with no help from the samples.  All three are therefore equally likely to be chosen, so the probability of choosing the right one is one in three (0.333).  The probability of being correct on a single trial is written p(C), so here p(C) = 1/3.
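The one-in-three figure can be checked with a small simulation. This is only an illustrative sketch: the function name `simulate_guessing` and its parameters are hypothetical, and it assumes the assessor picks each of the three positions with equal probability on every trial.

```python
import random


def simulate_guessing(n_trials, n_alternatives=3, seed=0):
    """Return the fraction of trials answered correctly by pure guessing.

    Assumption: the assessor gets no help from the samples, so every
    alternative is equally likely to be chosen.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    correct = 0
    for _ in range(n_trials):
        odd_position = rng.randrange(n_alternatives)  # where the sweeter sample sits
        choice = rng.randrange(n_alternatives)        # the assessor's blind pick
        if choice == odd_position:
            correct += 1
    return correct / n_trials


if __name__ == "__main__":
    # With many trials the proportion correct should settle near 1/3.
    print(simulate_guessing(100_000))
```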

 

If this is so, it is straightforward to calculate the probability of the correct choice being made by 1, 2, 3 or any number (k) of the 12 assessors.  These probabilities are shown in the graph.
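The page does not write the formula out. Assuming that the 12 assessors answer independently and that each has the same probability p(C) of being correct, the probability that exactly k of them are correct would follow the binomial distribution:

```latex
P(k) = \binom{12}{k}\, p(C)^{k}\, \bigl(1 - p(C)\bigr)^{12-k},
\qquad p(C) = \tfrac{1}{3}, \quad k = 0, 1, \dots, 12
```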

Now one of these outcomes must occur. That is, out of the 12 assessors, the number giving a correct answer must be one of 0, 1, 2 ... 11 or 12.  Therefore, the probabilities of these 13 possible events must add up to 1.  So the total blue area in the graph is 1.0.
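As a rough check on this reasoning, the 13 probabilities can be computed and summed. The sketch below assumes independent assessors and a binomial model; the function names `binomial_pmf` and `guessing_distribution` are illustrative, not taken from any standard.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k correct answers out of n, each correct with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def guessing_distribution(n=12, p=1 / 3):
    """Probabilities of 0, 1, ..., n correct answers when every assessor guesses."""
    return [binomial_pmf(k, n, p) for k in range(n + 1)]


if __name__ == "__main__":
    probs = guessing_distribution()
    for k, prob in enumerate(probs):
        print(f"{k:2d} correct: {prob:.4f}")
    # The 13 outcomes are exhaustive, so the total should be 1 (up to rounding).
    print(f"total: {sum(probs):.4f}")
```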

The probability of 10, 11 or all 12 of the assessors making correct choices is tiny (less than 1 in 1000) if they are choosing without detecting any difference in sweetness.  The probability of exactly 9 being correct is also small (about 1 in 300), so the probability of 9 or more being correct is only about 1 in 260.  If we find that 9 or more out of the panel of 12 make correct choices, we can feel fairly confident that they had some assistance from the samples.  Since we have taken care that the only difference between samples is in the amount of sweetener, we interpret that as evidence that they detected a difference in sweetness.

 

The red area in this graph is the total probability of 8, 9, 10, 11 or 12 assessors out of 12 giving correct answers if the probability of being correct is 1 in 3 for each of them.

This area is about one fiftieth of the total coloured area in the graph and indicates that the probability of 8 or more giving correct answers is a little less than one in fifty (actually, 0.019).

If we had tested 12 assessors in this way and found that 8 or more had given correct answers, we would conclude that the evidence indicated that at least some of them were detecting a difference in sweetness.  We could say that we have found a statistically significant difference between the results of the test and the predictions made by assuming that no difference was detectable.  More briefly, we often say that we have found a significant difference in sweetness between the samples.
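The significance level described here is an upper-tail probability: the chance of at least the observed number of correct answers under guessing. A minimal sketch, assuming a binomial model with p(C) = 1/3 (the function name `tail_probability` is hypothetical):

```python
from math import comb


def tail_probability(k_min, n, p):
    """Probability of k_min or more correct answers out of n under guessing."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    # 8 or more correct out of 12, each assessor guessing with p(C) = 1/3.
    print(f"{tail_probability(8, 12, 1 / 3):.3f}")
```

Comparing the printed value with the conventional 0.05 criterion gives the significance decision for 8 correct out of 12.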

 

The red area in this graph has been increased by adding in the probability of exactly 7 correct answers.

The red area is now 0.066 of the total coloured area. This is more than one twentieth (0.05) of the total.  Although getting as many as 7 correct answers out of 12 is also unlikely (about 1 in 15), it is conventional to regard any probability greater than 0.05 (1 in 20) as not significant.
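The same reasoning can be turned around to find the smallest number of correct answers that meets the 0.05 convention for a given panel size. The sketch below assumes a binomial guessing model; `minimum_correct_for_significance` and `tail_probability` are illustrative names, not part of any standard procedure.

```python
from math import comb


def tail_probability(k_min, n, p):
    """Probability of k_min or more correct answers out of n under guessing."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


def minimum_correct_for_significance(n, p=1 / 3, alpha=0.05):
    """Smallest count whose upper-tail probability is no greater than alpha.

    Returns None if even n correct out of n would not be significant.
    """
    for k_min in range(n + 1):
        # The tail shrinks as k_min grows, so the first hit is the minimum.
        if tail_probability(k_min, n, p) <= alpha:
            return k_min
    return None


if __name__ == "__main__":
    print(minimum_correct_for_significance(12))
```

For a panel of 12 this should agree with the text: 7 correct falls short of the criterion, while 8 meets it.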

If we find that 7 out of 12 assessors give correct answers in this test, we very emphatically do not conclude that they were unable to detect any difference in sweetness. The proper conclusion when a result is not significant is that the data do not allow you to conclude with the confidence you want that the assessors did detect a difference in sweetness. This is quite a different conclusion. In fact, if we get 7 correct answers out of 12, this is quite a lot more than the number to expect if the assessors were always just guessing. So far as it goes, a result of 7 correct out of 12 does suggest that some of them detected a difference.  It is just that with the amount of data gathered (12 trials) the chance of 7 or more successful guesses is not small enough to meet the significance standard that convention demands.

If the same success rate is maintained when we carry out the test with a lot more assessors, the result will be significant.  For instance, if we use 24 assessors and obtain 14 correct answers (the same success rate), the probability of getting 14 or more correct in 24 random guesses is about 0.01 (roughly one in a hundred), which is comfortably below the 0.05 criterion.
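The effect of panel size can be illustrated by computing the tail probability for the same success rate with 12 and with 24 assessors. This is a sketch under the same binomial guessing assumption; `tail_probability` and `compare_panels` are hypothetical helper names.

```python
from math import comb


def tail_probability(k_min, n, p=1 / 3):
    """Probability of k_min or more correct answers out of n under guessing."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


def compare_panels(results):
    """Map each (correct, panel_size) pair to its upper-tail probability."""
    return {(k, n): tail_probability(k, n) for k, n in results}


if __name__ == "__main__":
    # Same proportion correct (7/12 = 14/24), different amounts of data.
    for (k, n), prob in compare_panels([(7, 12), (14, 24)]).items():
        print(f"{k} or more correct out of {n}: {prob:.4f}")
```

Under these assumptions the tail probability for 7 out of 12 stays above 0.05, while the one for 14 out of 24 comes out very close to 0.01, far below the 0.05 criterion.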

For this reason, a test of significance can be used to draw a fairly confident conclusion about a difference being detectable, if the results are appropriate, but it cannot tell us that a difference is undetectable whatever the results are.  If we are interested in Similarity Testing – seeking reassurance that a difference is undetectable – we need a different approach.

 
