JOURNAL OF RESEARCH IN NATIONAL DEVELOPMENT VOLUME 5 NO 2, DECEMBER, 2007

COMPARISON OF CLUSTERING TECHNIQUES AND A UNIVARIATE TECHNIQUE IN THE SITE SELECTION ON MULTILOCATION TRIAL


A.F. Busari
Department of Mathematics/Computer Science
Federal University of Technology, Minna

Abstract
The experiment was carried out in four zones of the Kwara Agricultural Development Programmes. Two techniques, clustering and Analysis of Variance followed by the Duncan Multiple Range test (ANOVA/DMR test), were each used to group the sixteen locations into groups of similar locations, and the groups produced by each technique were then examined critically. At the initial stage of analysis, the two techniques disagreed. On further analysis with the clustering technique, using the complete linkage (also called furthest neighbour linkage) method, there was a perfect positive correlation in the memberships of the groups. Therefore, either of the two techniques could be used to group locations. The only difference is that the clustering technique is more subjective than the ANOVA/DMR test technique.

Keywords: Locations, Clustering technique, Analysis of Variance technique and Multilocation trial


Introduction
Multilocation trials are commonly practised by many agricultural research institutes to determine the best genotype (variety) for an environment or location in terms of optimum yield. A multilocation trial is an experiment conducted at two or more locations in order to identify the best treatments for those locations or environments. In each of these locations, the experimental design and the treatments are usually the same. This experiment was carried out in each zone of the Kwara State Agricultural Development Programmes (KWADP). In each of these zones, five or more different locations were chosen in which the experiment was carried out. The experiment involved five cowpea populations in a randomized block design with five replicates on the farm plots.

In this paper, a set of quantitative data obtained from the multilocation trial described above was used, and two statistical techniques were studied for the grouping and selection of sites. The two techniques are Analysis of Variance and Clustering. The first is a univariate technique, while the second is a multivariate but primitive method which is commonly used in Taxonomy, Biology and Ecology. Analysis of variance is a univariate method used to test the significance of both the main and interaction effects. It is also known as an additive model because it can analyze the main effects effectively but not the interaction effect (Kempton, 1984).

The clustering technique, by contrast, is a multivariate technique used to reveal the fundamental features or natural groupings inherent in the data, or to provide a convenient summary of the data into groups, without prior knowledge on the part of the analyst regarding group memberships. The significant interrelationships present in the data are thereby highlighted. This is one of the many multivariate statistical uses of the clustering technique. Hierarchical clustering methods have been used in the field of biological taxonomy from the 1950s to date. In this paper we compare the results of ANOVA (a univariate technique) with those of the clustering technique (a primitive multivariate technique) in the grouping of similar sites or locations.

If the ANOVA null hypothesis is rejected at a given level of significance, a significant difference exists between the treatment means. Further analysis is then carried out to determine which of the treatments differ from the others, usually by making further comparisons between groups of treatment means.

There are many multiple comparison tests whose type I error is not inflated. Among these are the tests of Newman (1939), Tukey (1953), Duncan (1955), Bonferroni and Keuls (1952), and the Newman-Keuls test. In an extensive comparison of these and other test procedures, Carmer and Swanson (1973) showed that the Duncan Multiple Range procedure is superior to the Newman-Keuls test in detecting true differences between pairs of means.

Methods of Grouping

i.          Analysis of Variance

As stated earlier, Analysis of Variance (ANOVA) is a univariate technique that analyzes the main effects effectively because of its additive nature.
The ANOVA model is written as
            $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$
where
$Y_{ijk}$      -           is the observation in the ith level of factor A and jth level of factor B in the kth replicate
$\mu$      -           is the overall mean
$\alpha_i$      -           is the true effect of the ith level of factor A (main effect)
$\beta_j$      -           is the true effect of the jth level of factor B (main effect)
$(\alpha\beta)_{ij}$      -           is the true effect of the interaction between $\alpha_i$ and $\beta_j$
$\varepsilon_{ijk}$      -           is the random or residual error component, which is assumed to be independently and identically distributed as a normal variate. That is, $\varepsilon_{ijk} \sim N(0, \sigma^2)$
            Finally, the Duncan Multiple Range Test (DMRT) is used to compare the location values.
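
To make the additive structure of the model concrete, the sketch below fits only its additive part (an overall mean plus one effect per row and one per column) to a small two-way table and returns what is left over. It is a minimal illustration and not the analysis carried out in this paper: the three rows are the first three locations of Table 1.1, the function name `additive_fit` is hypothetical, rows are assumed to be locations and columns varieties, and with a single value per cell the leftover term mixes interaction and error.

```python
# Minimal sketch, assuming rows are locations and columns are varieties.
# Data: first three locations of Table 1.1 (yield per plot in kg).
yields = [
    [4.80, 2.80, 2.20, 1.30, 1.70],
    [2.20, 2.00, 1.50, 1.50, 2.00],
    [6.00, 6.50, 4.00, 3.00, 3.00],
]


def additive_fit(table):
    """Return (grand mean, row effects, column effects, residuals)."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    # Each effect is a marginal mean expressed as a deviation from the grand mean.
    row_eff = [sum(row) / n_cols - grand for row in table]
    col_eff = [sum(row[j] for row in table) / n_rows - grand
               for j in range(n_cols)]
    # Whatever the additive part cannot explain (interaction plus error).
    resid = [[table[i][j] - grand - row_eff[i] - col_eff[j]
              for j in range(n_cols)]
             for i in range(n_rows)]
    return grand, row_eff, col_eff, resid


grand, row_eff, col_eff, resid = additive_fit(yields)
print("grand mean:", round(grand, 4))
print("row effects:", [round(e, 4) for e in row_eff])
print("column effects:", [round(e, 4) for e in col_eff])
```

By construction the row effects, the column effects and each row of residuals sum to zero, which is what allows the main effects to be read off independently of one another.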

ii.          Clustering

The clustering technique is used to determine the inherent or natural groupings in the data, or to provide a convenient summary of the data into groups.  The objects to be classified have numerical measurements on a set of variables or characteristics.  Therefore, the analysis is carried out on the rows of a data matrix.  The rows of the matrix can be viewed as vectors in a multidimensional space whose dimensionality is the number of variables (columns).  Clustering can also be applied to the columns of a data matrix, since there is no direct relationship between the clustering of the rows and that of the columns.
There is a great deal of subjectivity involved in choosing a similarity structure.  The choice depends on the nature of the variables, the scales of measurement and subject matter knowledge.
Items or units are clustered on the basis of the proximity between them, while variables are usually grouped on the basis of correlation coefficients or measures of association.

Method of Clustering

There are two major classes of clustering algorithms.  These are called
(i)         Hierarchical Algorithm  (ii)        Non-Hierarchical Algorithm.
The former has been dominant.  The hierarchies could be, for example, the evolutionary relationships between the organisms under study.  Under the hierarchical method there are three main branches, called linkage hierarchical clustering methods.  These are the Single linkage method, the Complete linkage method and the Average linkage method.

For this paper, single linkage was applied to cluster items or objects.  This method fuses groups according to the distance between their nearest members.  The dissimilarity coefficient is assumed to be symmetric, so the clustering algorithm is implemented on half of the dissimilarity matrix.  In the Single Linkage Method, one starts with a set of pairwise dissimilarities and determines the smallest dissimilarity, $d_{ik}$.  Objects i and k are then agglomerated, and the dissimilarities are updated for all objects j ≠ i, k.

That is, replace objects i and k with a new object i ∪ k:
            $d_{(i \cup k)j} = \min\{d_{ij}, d_{kj}\}$
Then delete the dissimilarities $d_{ij}$ and $d_{kj}$ for all j, as these are no longer in use.
Repeat the steps above as long as at least two objects remain.
The name single linkage derives from the fact that the dissimilarity between two clusters, i ∪ k and j, is the smallest dissimilarity that exists between a member of one and a member of the other.  The other hierarchical clustering methods mentioned above likewise derive their names from the functions of the interconnecting linkage dissimilarities.
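
The steps above can be sketched in a few lines of Python. This is only an illustrative implementation under stated assumptions: the four objects `a` to `d` and their dissimilarities are made up for the example, the function name `single_linkage` is hypothetical, and ties between equal dissimilarities are not treated specially.

```python
def single_linkage(labels, dissim):
    """Agglomerate objects with the single linkage update rule.

    labels: list of object names.
    dissim: dict mapping a pair (a, b) to its dissimilarity; only half of
            the symmetric matrix is needed.
    Returns the list of (merged cluster, fusion level) in merge order.
    """
    clusters = [(lab,) for lab in labels]
    d = {frozenset([(a,), (b,)]): v for (a, b), v in dissim.items()}
    merges = []
    while len(clusters) >= 2:
        pair = min(d, key=d.get)            # smallest remaining dissimilarity
        i, k = tuple(pair)
        level = d[pair]
        merged = tuple(sorted(i + k))       # the new object "i u k"
        for j in clusters:
            if j != i and j != k:
                # d(i u k, j) = min{d_ij, d_kj}
                d[frozenset([merged, j])] = min(d[frozenset([i, j])],
                                                d[frozenset([k, j])])
        # Delete every dissimilarity that still involves i or k.
        for key in [key for key in d if i in key or k in key]:
            del d[key]
        clusters = [c for c in clusters if c != i and c != k] + [merged]
        merges.append((merged, level))
    return merges


# Hypothetical dissimilarities between four objects (upper triangle only).
toy = {("a", "b"): 1.0, ("a", "c"): 4.0, ("a", "d"): 6.0,
       ("b", "c"): 3.0, ("b", "d"): 5.0, ("c", "d"): 2.0}
merges = single_linkage(["a", "b", "c", "d"], toy)
for cluster, level in merges:
    print(cluster, level)
```

The returned list of fusion levels is exactly the information needed to draw the merge diagram for the objects; replacing `min` in the update step with `max` would give the complete linkage rule instead.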

The single linkage method is also referred to as the Minimum Method by Johnson (1967) and as the nearest neighbour method by Lance and Williams (1967).  Its results can be displayed in the form of a two-dimensional diagram called a dendrogram.  A dendrogram illustrates the mergers that take place at successive levels.  It also reveals set inclusion relationships, partitions of the object set and significant clusters.

Data Analysis and Results

Five different elite sorghum varieties were planted in 16 different locations of KWADP in a randomized block design under different ecological conditions of the state.  The varieties were given equal treatment in each location.  The gross weights of the yield / plot of the varieties in each of the sixteen selected locations are as recorded below:


Table 1.1: Sorghum Yield/Plot in Kg.

Location   V1     V2     V3     V4     V5
1          4.80   2.80   2.20   1.30   1.70
2          2.20   2.00   1.50   1.50   2.00
3          6.00   6.50   4.00   3.00   3.00
4          8.00   8.00   6.00   5.00   5.00
5          4.00   2.50   2.00   1.00   0.50
6          1.00   0.50   0.50   0.50   0.50
7          3.00   1.50   1.00   0.50   0.50
8          6.00   3.50   4.00   3.00   3.00
9          5.50   2.00   2.00   2.50   2.00
10         6.00   3.50   4.00   3.00   3.00
11         4.20   2.60   2.40   2.00   1.30
12         4.50   3.10   2.50   2.10   1.60
13         4.30   3.00   2.60   2.20   1.40
14         3.70   3.40   3.50   3.00   2.60
15         4.30   3.80   3.20   2.60   3.00
16         2.80   2.80   2.50   2.00   1.50

(Rows are locations; columns V1 to V5 are the five varieties.)
           
Analysis of Variance (ANOVA)

Table 1.2: ANOVA Table for Sorghum

Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares   F-Ratio   Fcal.
Genotype              4                    56.8525          14.2131               2.74      187.2365
Environment           15                   138.654          9.2436                2.07      121.7705
Interaction           34                   20.733           0.6098                1.88      8.798
  IPCA 1              18                   11.70            0.6498                2.02      8.5601
  IPCA 2              16                   9.033            0.5646                2.05      7.4373
Error                 26                   1.974            0.07591               -         -
Total                 79                   218.21

From Table 1.2 above, it can be observed that the calculated F values for the genotype, environment and interaction effects were highly significant at the α = 5% level of significance.  The significance of the interaction effect implies the significance of IPCA 1 and IPCA 2 at the same α = 5% level.  The significance of the calculated F values led to the rejection of the three null hypotheses set for the genotype, environment and interaction effects.
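
As a quick check on how the calculated F values in Table 1.2 arise, the sketch below recomputes two of them from the tabulated sums of squares and degrees of freedom, assuming each calculated F is the effect mean square divided by the error mean square. The function name `f_ratio` is hypothetical; the numbers are copied from Table 1.2.

```python
# Values copied from Table 1.2; the error mean square is the denominator.
MSE = 0.07591
effects = {
    "Genotype": (56.8525, 4),      # (sum of squares, degrees of freedom)
    "Environment": (138.654, 15),
}


def f_ratio(sum_sq, df, mse):
    """Return (mean square, calculated F) for one source of variation."""
    mean_sq = sum_sq / df
    return mean_sq, mean_sq / mse


results = {name: f_ratio(ss, df, MSE) for name, (ss, df) in effects.items()}
for name, (mean_sq, f_cal) in results.items():
    print(f"{name}: mean square = {mean_sq:.4f}, F calculated = {f_cal:.2f}")
```

Each calculated F is then compared with the corresponding tabulated value in the F-Ratio column; an effect is declared significant when the calculated value exceeds the tabulated one.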

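 Further analysis was required using the Duncan Multiple Range Test (DMR Test) as follows:
            $R_p = r_\alpha(p, f)\, S_{\bar{y}}$
            $S_{\bar{y}} = \sqrt{MS_E / n}$
where
$MS_E$ is the mean square error taken from the ANOVA table
$S_{\bar{y}}$ is the standard error
$r_\alpha(p, f)$ is the tabulated significant range for p means and f error degrees of freedom
p = 2, 3, 4, ..., 16, n = 5, α = 5%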
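
The standard error that enters the DMR test can be worked out directly from the figures given. The sketch below assumes the conventional form, the square root of the error mean square divided by n, with the error mean square taken from Table 1.2 and n = 5. The function names are hypothetical, and the tabulated Duncan value `r_crit` is deliberately left as an argument because it has to be read from published tables and is not given in this paper.

```python
from math import sqrt

MSE = 0.07591   # error mean square from Table 1.2
N_REPS = 5      # n = 5


def standard_error(mse, n):
    """Standard error of a mean: sqrt(MSE / n)."""
    return sqrt(mse / n)


def least_significant_range(r_crit, mse, n):
    """Range that p ordered means must exceed to be declared different.

    r_crit is the tabulated Duncan value for p means at the chosen
    significance level; it must be supplied by the user.
    """
    return r_crit * standard_error(mse, n)


se = standard_error(MSE, N_REPS)
print("standard error:", round(se, 4))
```

Two location means spanning p ordered positions are declared different when their difference exceeds `least_significant_range(r_crit, MSE, N_REPS)` for the tabulated `r_crit` of that p.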

The DMR test therefore resulted in the following ordering and grouping of locations:
B  E  P  K  A  M  L  I  N  O  J  H  C  D  F  G
Adjacent locations that do not differ significantly form a group, while the remaining locations stand alone, that is, they are significantly different from the other locations.  The groups are therefore (B, E, P), (K, A, M, L, I), (N, O) and (J, H).  There was some overlap in the group formation, so a subjective decision may have to be taken.  The major groups formed by fusing the minor groups at different levels of analysis are stated below:


Table 1.3: Groups Formed Using ANOVA Technique

No. of Groups   Group I                   Group II                  Group III   Group IV      Group V
2               B, E, P, K, A, M, L, I    N, O, J, H, C, D, F, G    -           -             -
5               B, E, P                   K, A, M, L, I             N, O        J, H          C, D, F, G
4               B, E, P, K, A, M, L, I    N, O                      J, H        C, D, F, G    -


Note that locations C, D, F and G are outliers, significantly different from the other locations.
Using the complete linkage method, the following groups were obtained at the initial level of analysis: (A, K, L, M, I), (C, H, J, N, O) and (B, P, E, G), with F and D being outliers.  At a higher level of analysis the first two groups fused together to form one large group of locations that are not significantly different, that is, (A, K, L, M, I, C, H, J, N, O), while F joined the third group.  At this level of analysis we have two major groups, (A, K, L, M, I, C, H, J, N, O) and (B, P, E, G, F), and D still remains an outlier.  At a still higher level of analysis the two major groups fused together to form one large group.  With the single linkage method, however, all the locations form one group, with location D being an outlier.
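
The contrast between the two linkage methods can be explored on the data of Table 1.1 with the short sketch below. Several things here are assumptions rather than statements from the paper: locations 1 to 16 are taken to correspond to the letters A to P in order, the dissimilarity is taken to be the Euclidean distance between the five-variety yield vectors, and the function name `agglomerate` is hypothetical. The output is therefore not guaranteed to reproduce the groups reported above.

```python
from math import dist
from string import ascii_uppercase

# Table 1.1, one row per location (V1..V5, yield per plot in kg).
rows = [
    (4.80, 2.80, 2.20, 1.30, 1.70), (2.20, 2.00, 1.50, 1.50, 2.00),
    (6.00, 6.50, 4.00, 3.00, 3.00), (8.00, 8.00, 6.00, 5.00, 5.00),
    (4.00, 2.50, 2.00, 1.00, 0.50), (1.00, 0.50, 0.50, 0.50, 0.50),
    (3.00, 1.50, 1.00, 0.50, 0.50), (6.00, 3.50, 4.00, 3.00, 3.00),
    (5.50, 2.00, 2.00, 2.50, 2.00), (6.00, 3.50, 4.00, 3.00, 3.00),
    (4.20, 2.60, 2.40, 2.00, 1.30), (4.50, 3.10, 2.50, 2.10, 1.60),
    (4.30, 3.00, 2.60, 2.20, 1.40), (3.70, 3.40, 3.50, 3.00, 2.60),
    (4.30, 3.80, 3.20, 2.60, 3.00), (2.80, 2.80, 2.50, 2.00, 1.50),
]
# Assumption: location 1 is A, location 2 is B, ..., location 16 is P.
points = dict(zip(ascii_uppercase, rows))


def agglomerate(points, linkage, k):
    """Merge the closest pair of clusters until only k clusters remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[name] for name in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = linkage(dist(points[p], points[q])
                              for p in clusters[a] for q in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        absorbed = clusters.pop(b)      # b > a, so index a stays valid
        clusters[a].extend(absorbed)
    return [sorted(c) for c in clusters]


single_two = agglomerate(points, min, 2)
complete_five = agglomerate(points, max, 5)
print("single linkage, 2 clusters:", single_two)
print("complete linkage, 5 clusters:", complete_five)
```

Under these assumptions single linkage chains fifteen locations together and leaves location D on its own, in line with the behaviour described above, while complete linkage with a larger number of clusters gives a finer partition that can be compared with the groups reported for that method.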


Table 1.4:  Groups formed using Clustering Method

No. of Groups   Group I                                        Group II         Group III     Group IV   Remark
2               A, K, L, M, I, C, H, J, N, O                   B, P, E, G, F    -             -          D outlier
3               A, K, L, M, I                                  C, H, J, N, O    B, P, E, G    -          F and D outliers
1               A, K, L, M, I, C, H, J, N, O, B, P, E, G, F    -                -             -          D outlier


Discussion and Conclusion
From Tables 1.2 and 1.3, it can be observed that two major groups are formed when the ANOVA/DMR technique is applied.  One of them consists of locations that are not significantly different from one another, while the other group is made up of locations that are significantly different from the members of group I.  Four locations of the second group are complete outliers, while the remaining four locations form two minor groups: the locations within each minor group are not significantly different from each other, but the two minor groups are significantly different from one another.

The complete linkage method (or furthest neighbour linkage method) gives a detailed analysis.  At the initial stage of analysis, three major groups were formed, with locations F and D standing apart.  At the second stage of analysis, the first two groups merged to form one large group, while location F merged with the third group, leaving location D as the only outlier.

When two groups are required, the ANOVA/DMR test gives an equal number of locations in each group, whereas the clustering technique gives ten locations in one group and five locations in the other, with location D as an outlier.  When three groups were required, however, they could be obtained only with the clustering technique, at its initial stage of analysis, because the clustering technique is more subjective than the ANOVA/DMR test technique.
Since only two groups are necessary, one containing the locations of optimum yield and the other the locations of insignificant yield, obtaining many groups would be a wasteful exercise.  However, when two groups are required there is disagreement between the two techniques: the ANOVA/DMR test separates locations K, A, M, I, L from locations N, O, J, H and C, whereas the clustering technique places them in the same group.  But at the next level of analysis for the clustering technique there is a perfect positive correlation between the two techniques, that is, the memberships of the groups are almost, if not entirely, the same.  What is most important, therefore, is the identification of the group of optimum yield.  We can therefore say that at some stage in the further analysis the two techniques give the same result.
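
The extent of agreement at the two-group level can be read off by cross-tabulating the memberships listed in Tables 1.3 and 1.4. The sketch below does this with plain Python sets; the group memberships are copied from those tables, and the function name `cross_tabulate` is hypothetical.

```python
# Two-group rows of Table 1.3 (ANOVA/DMR) and Table 1.4 (clustering).
anova = {"I": set("BEPKAMLI"), "II": set("NOJHCDFG")}
cluster = {"I": set("AKLMICHJNO"), "II": set("BPEGF"), "outlier": set("D")}


def cross_tabulate(first, second):
    """Map each non-empty pair of groups to the locations they share."""
    return {(a, b): sorted(first[a] & second[b])
            for a in first for b in second
            if first[a] & second[b]}


table = cross_tabulate(anova, cluster)
for (a, b), members in table.items():
    print(f"ANOVA group {a} / cluster group {b}: {', '.join(members)}")
```

The cell shared by ANOVA group II and cluster group I holds locations C, H, J, N and O, which are the locations on which the two techniques disagree at this level.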

References
Anderson, R. W. and Bancroft (1952). Statistical Theory in Research, New York: McGraw-Hill

Anderson, T.W. (1985). Introduction to Multivariate Statistical Analysis, New York: John Wiley

Bradu, D. and Gabriel, K. R. (1978). The Biplot as a Diagnostic Tool for Model of Two-way Tables, Technometrics, 20, 47-68.

Chatfield, C. and Collins, A. J. (1995). Introduction to Multivariate Analysis, New York: Chapman and Hall

Crossa, et al., (1991). AMMI Adjustment for Statistical Analysis of  an International Wheat Yield Trial, Theo. App. Genet., 27-33.

Eisenhart, C. (1947). The Assumptions Underlying the Analysis of Variance, Biometrics, 1-21.

Fisher, R. A. (1918). The Design of Experiment, Edinburgh: Oliver and Boyd

Mandel, J. (1961). Non Additivity in Two-way Analysis of Variance. Journal of American Statistics Association. 878-888.

Mandel, J. (1971). A New Analysis of Variance Model for Non-additive Technometric 1-18.

Oyejola. B. A., Riley, J. and Balton, S. (1998). A Study of form the Northern Brazil.

Payne, R. W. et al. (1994). Genstat 5 Release 3 Reference Manual, Oxford: Clarendon Press, P 538-585.

Daniel, J. A., Miock & Hebert, J. W. (1975). Introductory Multivariate Analysis for Educational, Psychological and Social Research, 113-208 Bruce Korth Explanatory Factor Analysis 123-150.

Gabriel, R. K. (1971). The Biplot Graphical Display of Matrices with Application to Principal Component Analysis, Biometrika, 452-467.

Zobel, R. W., Wright, M. J. and Gauch, H. J. (1988). Statistics Analysis of Yield Trial, Journal of Agro, P 388-393.

Oscar, K. (1975). Design and Analysis of Experiment. P 85-92, 184-191.

Lum, M. D. (1954). Rules for Determining Error Terms in Hierarchical and Partially Hierarchical Model Wright Air Devicentre Dayton Ohio.