L. A. Nafiu
Department of Mathematics/Computer Science, Federal University of Technology, Minna

and

A. A. Adewara
Department of Statistics, University of Ilorin


Surveys of diabetic patients have to deal with the lack of proper formal sampling frames. In survey research at the community level, partial sampling frames, such as the medical centres or households to which a person is linked, are often available. These frames can be used to draw a network sample. At a selected household, the adult occupants are asked to report on the occurrence of the characteristic not only in themselves but also in their siblings. Using this network sampling design, the total number of people with diabetes can be estimated with lower variance than with conventional procedures. The design is illustrated by an analysis of the network data of Nafiu (2007), from a survey to estimate the population of diabetic patients in Niger State, Nigeria. Two estimators, the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator, were considered, and the results were obtained using a program written in the Microsoft Visual C++ programming language.

Keywords: Graph, Sampling Frame, Households, Hansen-Hurwitz estimator, Horvitz-Thompson estimator.

In network management, accurate measures of network status are needed to aid in planning, troubleshooting and monitoring. For example, it may be necessary to monitor the bandwidth consumption of several hundred links in a distributed system to pinpoint bottlenecks. If the monitoring is too aggressive, it may create artificial bottlenecks. With too passive a scheme, the network monitor may miss important events. Network query rates must strike a balance between accurate performance characterization and low bandwidth consumption to avoid changing the behaviour of the network while still providing a clear picture of the behaviour. This balance is often achieved through sampling. Sampling techniques are used to study the behaviour of a population of elements based on a representative subset.
In a survey to estimate the prevalence of a disease like diabetes, a random sample of medical centres is selected. From the records of each medical centre in the sample, records of patients treated for that disease are obtained. However, a given patient may have been treated at more than one medical centre. The more medical centres at which a given patient has been treated, the higher is the probability that that patient's records will be obtained in the sample.
In another survey, also with the purpose of estimating the prevalence of a rare characteristic in a population, a simple random sample of households is selected. At a selected household, the adult occupants are asked to report on the occurrence of the characteristic not only in themselves but also in their siblings. Thus a person with several siblings who are living in different households has a higher inclusion probability than one with no siblings living in separate households. Even within a single household, the inclusion probabilities for different occupants are not necessarily equal.

Designs of the above type are referred to as network sampling. In this case, a simple random sample or stratified random sample of units (selection units) is selected, and all observation units (diabetic patients) which are linked to any of the selected units are included or observed. The multiplicity of a person is the number of selection units, that is, medical centres or households, to which that person is linked. Defining a network to be a set of observation units with a given linkage pattern, a network may be linked with more than one selection unit (for example, siblings living in more than one household). If the population of selection units is stratified, a network may also intersect more than one stratum.
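The linkage idea can be illustrated with a minimal sketch. The household identifiers, person labels and helper names below are hypothetical and are not taken from the survey data; the sketch only assumes that each household reports its occupants together with their siblings, and it counts for each person the number of households (selection units) linked to that person.

```python
# Hypothetical linkage data: household id -> persons reported at that household
# (occupants plus siblings living elsewhere). Not data from the paper.
links = {
    "H1": ["p1", "p2"],
    "H2": ["p2", "p3"],
    "H3": ["p2"],
    "H4": [],
}


def multiplicities(links):
    """For each person, count the number of selection units linked to them."""
    counts = {}
    for persons in links.values():
        for person in set(persons):  # a household counts a person at most once
            counts[person] = counts.get(person, 0) + 1
    return counts


def observed(links, sample):
    """Set of persons observed when the households in `sample` are selected."""
    seen = set()
    for household in sample:
        seen.update(links[household])
    return seen


m = multiplicities(links)
print(m)
print(sorted(observed(links, ["H1", "H4"])))
```

In this toy example person `p2` is linked to three households and is therefore more likely to be observed than `p1` or `p3`, which is the source of the unequal inclusion probabilities.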

Because of the unequal selection or inclusion probabilities, the sample total does not form an unbiased estimator of the population total with such a design. Unbiased estimators for such designs were given by Thompson (1992). In one of these estimators, termed the "Hansen-Hurwitz estimator", each observation is divided by its multiplicity. In this case, the multiplicity is proportional to the draw-by-draw selection probability. The Horvitz-Thompson estimator for network sampling, in which each person's inclusion probability is determined by the multiplicity, was also given.
References to many innovative applications of network sampling are found in Cowan (1986) and Anderson (1980). Faulkenberry and Garoui (1991) discussed network sampling estimators in the context of area sampling methods used in agricultural surveys.
Problem Definition
We consider an undirected graph $G$ with vertex set $V$ and adjacency matrix $Y$, representing a set of social actors and some relationship between them. The adjacency matrix is defined on the set $V^2$ of ordered pairs of vertices: $y_{ij} = 1$ if there is an edge between vertices $i$ and $j$, and $y_{ij} = 0$ otherwise ($y_{ii} = 0$ for all $i$). Since the graph is undirected, $y_{ij} = y_{ji}$ for all $i, j$. Based on some binary auxiliary variable $x$, vertex set $V$ can be partitioned into two disjoint vertex subsets $V_A$ and $V_B$, that is, $V = V_A \cup V_B$, with $V_A$ of order $N_A$ and $V_B$ of order $N_B$.
For the sake of clarity, throughout this paper vertices $i$ and $j$ refer to subset $V_A$ while vertices $k$ and $l$ refer to subset $V_B$. Based on vertex sets $V_A$ and $V_B$, population graph $G$ can be decomposed into three subgraphs:

  1. Subgraph $G_{AA}$ with arcs between the vertices of set $V_A$.
  2. Subgraph $G_{BB}$ with arcs between the vertices of set $V_B$.
  3. Subgraph $G_{AB}$ with arcs between the vertices of sets $V_A$ and $V_B$.

Figure 1.1 below is an illustration of population graph $G$ with vertex set $V$ of order $N$ and size $R$, that is, $G$ consists of $N$ vertices and $R$ arcs. Based on auxiliary variable $x$, vertex set $V$ is partitioned into subset $V_A$ (the uncoloured vertices) and subset $V_B$ (the coloured vertices).

Figure 1.1: Population graph $G$ with vertex set $V$.

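The number of relations between vertices of $V_A$, that is, the size of subgraph $G_{AA}$, is denoted
$$R_{AA} = \sum_{i < j} y_{ij}, \quad i, j \in V_A \qquad (1.1)$$
The number between the vertices of $V_B$, that is, the size of subgraph $G_{BB}$, is denoted
$$R_{BB} = \sum_{k < l} y_{kl}, \quad k, l \in V_B \qquad (1.2)$$
and the number between vertices of $V_A$ and $V_B$, that is, the size of subgraph $G_{AB}$, is denoted
$$R_{AB} = \sum_{i \in V_A} \sum_{k \in V_B} y_{ik} \qquad (1.3)$$
In Figure 1.1, $R_{AA}$, $R_{BB}$ and $R_{AB}$ are obtained by counting the arcs within and between the two vertex subsets.
The mean number of relations for a vertex $i$ with other vertices $j$ of $V_A$ is
$$\mu_{AA} = \frac{2 R_{AA}}{N_A} \qquad (1.4a)$$
and for a vertex $k$ with other vertices $l$ of $V_B$ it is
$$\mu_{BB} = \frac{2 R_{BB}}{N_B} \qquad (1.4b)$$
The mean number of relations for a vertex $i$ of $V_A$ with vertices $k$ of $V_B$ is
$$\mu_{AB} = \frac{R_{AB}}{N_A} \qquad (1.5a)$$
and for a vertex $k$ of $V_B$ with vertices $i$ of $V_A$ it is
$$\mu_{BA} = \frac{R_{AB}}{N_B} \qquad (1.5b)$$
In Figure 1.1, $\mu_{AB}$ and $\mu_{BA}$ follow from these counts.
Hence, we observe the relationship
$$N_A \, \mu_{AB} = N_B \, \mu_{BA} \qquad (1.6)$$
which can be used to get an indication of the total number of vertices.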
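The subgraph sizes and mean numbers of relations can be checked numerically on a small example. The sketch below uses a hypothetical toy graph (it is not the graph of Figure 1.1), and the names `V_A`, `V_B`, `R_AA`, `R_BB`, `R_AB` and `mu_*` are assumed notation for the two vertex subsets, the three subgraph sizes and the four means. It verifies that the number of vertices in one subset times its mean number of cross relations equals the same product for the other subset, so that one subset's order can be recovered from the other's.

```python
from itertools import combinations

# Hypothetical toy graph (assumed for illustration only).
V_A = [1, 2, 3, 4]   # e.g. vertices covered by the partial sampling frame
V_B = [5, 6, 7]      # remaining vertices
edges = {(1, 2), (2, 3), (1, 5), (2, 6), (4, 6), (4, 7), (5, 6)}


def linked(u, v):
    """True if there is an undirected edge between u and v."""
    return (u, v) in edges or (v, u) in edges


# Sizes of the three subgraphs: arcs within V_A, within V_B, and between them.
R_AA = sum(linked(u, v) for u, v in combinations(V_A, 2))
R_BB = sum(linked(u, v) for u, v in combinations(V_B, 2))
R_AB = sum(linked(u, v) for u in V_A for v in V_B)

# Mean numbers of relations per vertex.
mu_AA = 2 * R_AA / len(V_A)
mu_BB = 2 * R_BB / len(V_B)
mu_AB = R_AB / len(V_A)
mu_BA = R_AB / len(V_B)

# Order of V_B recovered from the order of V_A and the two cross means.
N_B_recovered = len(V_A) * mu_AB / mu_BA
print(R_AA, R_BB, R_AB, N_B_recovered)
```

In practice only estimates of the two cross means would be available, so the recovered order would be a ratio estimate rather than an exact value.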
The graph-theoretical representation described above reflects many surveys of diabetic patients. In such surveys, there are urban areas with unknown populations of diabetic patients, and only a partial sampling frame is available from which probability samples can be drawn. Frequently, a non-probability sample is used to describe the study population. However, using a network sample in these situations provides more accurate information about the distributions of individual characteristics; in addition, structural information about the distributions of relations between diabetic patients can be estimated. The purpose of this paper is to estimate some simple network parameters that can be used to describe the study population. For that purpose, we use network data from Nafiu (2007), from a survey to estimate the prevalence of diabetes in Niger State, Nigeria, in which the register lists of adults play the role of a single partial sampling frame, with $V_A$ denoting the diabetic patients in the household and $V_B$ the non-diabetic patients in the household.
Estimation of Population Total
Let the value of the variable of interest for the $i$th observational unit in the population be denoted $y_i$. In a survey to estimate the prevalence of a disease or other characteristic, $y_i$ is an indicator variable, equal to one if the unit has the characteristic and zero otherwise. The variable of interest $y_i$ need not be an indicator variable; it could, for example, be the cost of medical treatment for the disease for the $i$th person. Let $N$ denote the number of observational units in the population. The population total is $\tau = \sum_{i=1}^{N} y_i$. Let $m_i$ be the multiplicity of the $i$th observational unit, that is, the number of selection units to which that observational unit is linked. The number of selection units in the population will be denoted $M$. The population mean per selection unit is $\mu = \tau / M$.

(A). Hansen-Hurwitz Estimator
Consider a sampling design in which a simple random sample (without replacement) of $n$ selection units is obtained and every observational unit linked to any selected selection unit is included in the sample. The draw-by-draw selection probability $p_i$ for the $i$th observational unit is the probability that any one of the $m_i$ selection units to which it is linked is selected, that is,
$$p_i = \frac{m_i}{M} \qquad (2.1)$$

An unbiased estimator of the population total $\tau$ may be formed by dividing each observed $y_i$ value by the associated selection probability. The Hansen-Hurwitz estimator thus obtained is
$$\hat{\tau}_{HH} = \frac{1}{n} \sum_{i \in s} \frac{y_i}{p_i} = \frac{M}{n} \sum_{i \in s} \frac{y_i}{m_i} \qquad (2.2)$$
in which $s$ is the sequence of observational units in the sample, including repeat selections. An observational unit may be selected more than once, even though selection units are sampled without replacement, because the observational unit may be linked to more than one selection unit. The expected number of times the $i$th observational unit is selected is $n p_i = n m_i / M$.
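The unbiasedness claim can be checked by brute force on a tiny population. The sketch below is an illustration under assumed data, not the program used in the paper: `links`, `y` and `hansen_hurwitz` are hypothetical names, `M` is the number of selection units, and `m[i]` is the number of selection units linked to observational unit `i`. It enumerates every simple random sample of size 2 and averages the estimator, which multiplies `M/n` by the sum of `y[i] / m[i]` over the sample sequence with repeats.

```python
from itertools import combinations

# Hypothetical population: selection unit -> linked observational units.
links = {0: [0, 1], 1: [1], 2: [1, 2], 3: []}
y = {0: 2.0, 1: 5.0, 2: 1.0}          # values of the variable of interest

M = len(links)                         # number of selection units
m = {i: sum(i in linked for linked in links.values()) for i in y}
tau = sum(y.values())                  # true population total


def hansen_hurwitz(sample):
    """(M / n) * sum of y_i / m_i over the sample sequence, repeats included."""
    n = len(sample)
    total = sum(y[i] / m[i] for j in sample for i in links[j])
    return M / n * total


# Average the estimator over all equally likely samples of n = 2 selection units.
samples = list(combinations(links, 2))
mean_estimate = sum(hansen_hurwitz(s) for s in samples) / len(samples)
print(tau, mean_estimate)
```

Observational unit 1 is linked to three selection units, so it can enter the sample sequence more than once; dividing by `m[1]` is what compensates for this.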
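The notation for the Hansen-Hurwitz estimator may be simplified in a way which renders its statistical properties transparent. For the $j$th selection unit in the population, define the variable $w_j$ to be the sum of the $y_i / m_i$ for all observational units linked with selection unit $j$, that is,
$$w_j = \sum_{i \in A_j} \frac{y_i}{m_i} \qquad (2.3)$$
where $A_j$ is the set of observational units that are linked to selection unit $j$.
With this notation, the Hansen-Hurwitz estimator may be written
$$\hat{\tau}_{HH} = \frac{M}{n} \sum_{j=1}^{n} w_j \qquad (2.4)$$
Thinking of $w_j$ as a new variable of interest associated with the $j$th selection unit, the Hansen-Hurwitz estimator is just $M \bar{w}$, where $\bar{w}$ is the sample mean of a simple random sample of size $n$. Thus, from the basic results on simple random sampling,
$$\mathrm{Var}(\hat{\tau}_{HH}) = \frac{M(M - n)}{n} \, \sigma_w^2 \qquad (2.5)$$
where
$$\sigma_w^2 = \frac{1}{M - 1} \sum_{j=1}^{M} (w_j - \mu)^2 \qquad (2.6)$$
in which $\mu$ is the population mean per selection unit.
An unbiased estimator of this variance is
$$\widehat{\mathrm{Var}}(\hat{\tau}_{HH}) = \frac{M(M - n)}{n} \, s_w^2 \qquad (2.7)$$
where
$$s_w^2 = \frac{1}{n - 1} \sum_{j=1}^{n} (w_j - \bar{w})^2 \qquad (2.8)$$
For estimating the population mean per selection unit, $\hat{\mu}_{HH} = \hat{\tau}_{HH} / M$ and $\mathrm{Var}(\hat{\mu}_{HH}) = \mathrm{Var}(\hat{\tau}_{HH}) / M^2$.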
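Once each sampled selection unit has been reduced to a single value, the estimate and its estimated variance follow from ordinary simple-random-sampling formulas. The sketch below assumes that value is the sum of `y_i / m_i` over the observational units linked to the selection unit, that `M` is the number of selection units in the population, and that the sample values in `w_sample` are hypothetical numbers, not figures from the survey.

```python
from statistics import mean, variance


def hh_from_w(w_sample, M):
    """Return (estimated total, estimated variance) from per-unit values w_j."""
    n = len(w_sample)
    tau_hat = M * mean(w_sample)
    # statistics.variance divides by n - 1, i.e. it is the sample variance.
    var_hat = M * (M - n) / n * variance(w_sample)
    return tau_hat, var_hat


# Hypothetical per-unit values for n = 4 sampled selection units out of M = 20.
w_sample = [2.0, 0.0, 1.5, 0.5]
M = 20
tau_hat, var_hat = hh_from_w(w_sample, M)
se_tau = var_hat ** 0.5

# Mean per selection unit and its standard error.
mu_hat, se_mu = tau_hat / M, se_tau / M
print(tau_hat, se_tau, mu_hat, se_mu)
```

The factor `(M - n)` is the finite population correction; it makes the estimated variance zero when every selection unit is sampled.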

(B). Horvitz-Thompson Estimator
The probability that the $i$th observational unit is included in the sample is the probability that one or more of the $m_i$ selection units to which it is linked is selected. Since the inclusion probabilities are identical for all observational units in a network, the problem can be simplified by changing notation to be in terms of networks rather than individual observational units. The population can be partitioned into $K$ networks, which will be labelled $1, 2, \ldots, K$. Let $y_k$ now denote the total of the $y_i$ over all the observational units in the $k$th network, and let $m_k$ denote the common multiplicity for any observational unit within this network.
The inclusion probability for the $k$th network, which is in fact the inclusion probability for any of the observational units within that network, is
$$\pi_k = 1 - \binom{M - m_k}{n} \Big/ \binom{M}{n} \qquad (2.9)$$
that is, one minus the probability that the entire simple random sample of $n$ selection units is selected from the $M - m_k$ selection units which are not linked with network $k$.
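The inclusion probability is straightforward to compute with binomial coefficients. The sketch below is a minimal illustration; the function name and the numerical values are assumptions chosen for the example, with `M` the number of selection units, `n` the sample size and `m_k` the number of selection units linked to the network.

```python
from math import comb


def inclusion_probability(m_k, M, n):
    """One minus the chance that all n sampled units avoid the m_k linked units."""
    return 1 - comb(M - m_k, n) / comb(M, n)


# Hypothetical design: n = 3 selection units sampled out of M = 10.
M, n = 10, 3
probabilities = {m_k: inclusion_probability(m_k, M, n) for m_k in (1, 2, 5, 8)}
print(probabilities)
```

For a network linked to a single selection unit the result reduces to `n / M`, and once fewer than `n` unlinked selection units remain the network is certain to be included.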
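Let $\kappa$ denote the number of distinct networks of observational units included in the sample. The Horvitz-Thompson estimator of the population total is
$$\hat{\tau}_{HT} = \sum_{k=1}^{\kappa} \frac{y_k}{\pi_k} \qquad (2.10)$$
Let $m_{kl}$ denote the number of selection units linked to both networks $k$ and $l$. The probability that both networks $k$ and $l$ are included in the sample is
$$\pi_{kl} = \pi_k + \pi_l - 1 + \binom{M - m_k - m_l + m_{kl}}{n} \Big/ \binom{M}{n} \qquad (2.11)$$
The usual variance formulae for the Horvitz-Thompson estimator then apply, giving
$$\mathrm{Var}(\hat{\tau}_{HT}) = \sum_{k=1}^{K} \frac{1 - \pi_k}{\pi_k} \, y_k^2 + \sum_{k=1}^{K} \sum_{l \neq k} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_k \pi_l} \, y_k y_l \qquad (2.12)$$

An unbiased estimator of this variance is
$$\widehat{\mathrm{Var}}(\hat{\tau}_{HT}) = \sum_{k=1}^{\kappa} \frac{1 - \pi_k}{\pi_k^2} \, y_k^2 + \sum_{k=1}^{\kappa} \sum_{l \neq k} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_k \pi_l} \, \frac{y_k y_l}{\pi_{kl}} \qquad (2.13)$$
For estimating the population mean per selection unit, $\hat{\mu}_{HT} = \hat{\tau}_{HT} / M$ and $\mathrm{Var}(\hat{\mu}_{HT}) = \mathrm{Var}(\hat{\tau}_{HT}) / M^2$.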
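A compact sketch of the Horvitz-Thompson calculation is given below. It is an illustration under assumed inputs rather than the program used for Table 1.1: each distinct sampled network is given as a hypothetical pair (network total, multiplicity), `overlap` holds the number of selection units shared by each pair of networks, `M` is the number of selection units and `n` the sample size. The variance estimate uses the standard Horvitz-Thompson form with joint inclusion probabilities obtained by inclusion-exclusion.

```python
from math import comb


def pi_single(m_k, M, n):
    """Inclusion probability of a network linked to m_k selection units."""
    return 1 - comb(M - m_k, n) / comb(M, n)


def pi_joint(m_k, m_l, m_kl, M, n):
    """Probability that two networks sharing m_kl selection units are both included."""
    neither = comb(M - m_k - m_l + m_kl, n) / comb(M, n)
    return pi_single(m_k, M, n) + pi_single(m_l, M, n) - 1 + neither


def horvitz_thompson(networks, overlap, M, n):
    """networks: list of (y_k, m_k) for the distinct sampled networks.
    overlap: dict {(k, l): m_kl} with k < l; missing pairs share no units.
    Returns (estimated total, estimated variance)."""
    pis = [pi_single(m_k, M, n) for _, m_k in networks]
    tau_hat = sum(y_k / p for (y_k, _), p in zip(networks, pis))
    var_hat = sum((1 - p) / p ** 2 * y_k ** 2 for (y_k, _), p in zip(networks, pis))
    for k in range(len(networks)):
        for l in range(k + 1, len(networks)):
            (y_k, m_k), (y_l, m_l) = networks[k], networks[l]
            p_kl = pi_joint(m_k, m_l, overlap.get((k, l), 0), M, n)
            cross = (p_kl - pis[k] * pis[l]) / (pis[k] * pis[l]) * y_k * y_l / p_kl
            var_hat += 2 * cross  # the (k, l) and (l, k) terms are equal
    return tau_hat, var_hat


# Hypothetical sample: three distinct networks observed, M = 4, n = 2.
networks = [(2.0, 1), (5.0, 3), (1.0, 1)]
overlap = {(0, 1): 1, (1, 2): 1}
tau_hat, var_hat = horvitz_thompson(networks, overlap, M=4, n=2)
print(tau_hat, var_hat)
```

Each network contributes once regardless of how many of its linked selection units were sampled, which is why this estimator does not depend on the number of times a unit is selected.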
Illustration and Results
In this section, we analyse the network data on diabetes in Niger State, Nigeria, obtained from Nafiu (2007), an unpublished M.Sc. thesis of the Department of Statistics, University of Ilorin, Ilorin, Nigeria, using the Horvitz-Thompson and Hansen-Hurwitz estimators. The results in Table 1.1 below, giving the standard errors of the estimates for the years 2000-2003, were obtained with the help of a computer program written in the Microsoft Visual C++ programming language (Hubbard, 2000).

Table 1.1: Estimates of the standard errors using the Horvitz-Thompson and Hansen-Hurwitz estimators.

Discussion of Results
The results presented in Table 1.1 indicate that substantial reductions in the standard error can be obtained through the use of the network design without forfeiting an unbiased estimate of the sampling standard error. We also observed that, irrespective of the year considered, the standard error of the Horvitz-Thompson estimator ($\hat{\tau}_{HT}$) is always less than that of the Hansen-Hurwitz estimator ($\hat{\tau}_{HH}$). The Horvitz-Thompson estimator is an unbiased estimator which, unlike the Hansen-Hurwitz estimator, does not depend on the number of times any unit is selected.

Conclusion and Recommendations
When an unbiased estimator of high precision and an unbiased sample estimate of its standard error are required, the network sampling design gives a better indication of the total size of the diabetic population than conventional designs. If we accept the assumption that each unregistered diabetic patient knows at least one other diabetic patient who is a client of a medical centre, then by using the network sampling design we can define a simple ratio estimator for the total population.



References
Anderson, D. R. (1980). Estimation of Density from Line Transect Sampling of Biological Populations. Journal of Wildlife Management, 72, 325-336.
Cowan, C. D. (1986). Capture-Recapture Models when both sources have Clustered Observations. Journal of the American Statistical Association, 81, 347-353.
Faulkenberry, G. D. and Garoui, A. (1991). Estimating a Population Total Using an Area Frame. Journal of the American Statistical Association, 86, 445-449.
Frank, O. (1977). Survey Sampling in Graphs. Journal of Statistical Planning and Inference, 1, 224-235.
Frank, O. (1978). Sampling and Estimation in Large Social Networks. Social Networks, 1, 91-101.
Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Hubbard, J. R. (2000). Programming with C++. Second Edition. Schaum's Outlines, New Delhi: Tata McGraw-Hill Publishing Company Limited.
Nafiu, L. A. (2007). Comparison of Four Estimators under Sampling without Replacement. Unpublished M.Sc. Thesis, University of Ilorin, Ilorin, Nigeria.
Thompson, S. K. (1992). Sampling. New York: John Wiley and Sons Inc.