
# Comparison of Spurious Correlation Methods Using Probability Distributions and Proportion of Rejecting a True Null Hypothesis


**ABSTRACT**

A limitation of spurious correlation analysis, for example the Pearson product-moment correlation test, is that the data need to be normally distributed. This research compares spurious correlation methods under several non-normal probability distributions in order to identify the method with the best degree of association. The methods were compared using the proportion of rejecting a true null hypothesis, obtained from the t and z test statistics for testing correlation coefficients. Data from normal, log-normal, exponential and contaminated normal distributions were generated by simulation with different sample sizes. The results indicate that, when the data follow normal, exponential or contaminated normal distributions, Pearson’s and Spearman’s rank correlation coefficients have the best proportion of rejecting the true null hypothesis. When the data follow a log-normal distribution, only Spearman’s rank correlation coefficient has the best proportion of rejecting the true null hypothesis. Thus, Pearson’s and Spearman’s rank coefficients have the best degree of association under normal, exponential and contaminated normal distributions, while under the log-normal distribution only Spearman’s rank coefficient has the best degree of association.

**TABLE OF CONTENTS**

TITLE PAGE

DEDICATION

CERTIFICATION

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ABBREVIATIONS

ABSTRACT

CHAPTER ONE

1.0 INTRODUCTION

1.1 Background to the Study

1.2 Statement of the Problems

1.3 Aim and Objectives

1.4 Significance of Study

1.5 Scope and Limitations

1.6 Definition of Terms

CHAPTER TWO

2.0 LITERATURE REVIEW

2.1 Introduction

2.2 Theory of Spurious Correlation Coefficients

2.2.1 The Pearson Correlation Coefficient

2.2.2 The Spearman Rank Correlation Coefficients

2.2.3 Kendall Rank Correlation Coefficients

CHAPTER THREE

3.0 MATERIALS AND METHODOLOGY

3.1 Data Used for the Study

3.2 Simulation Study

3.3 Probability Distribution Used for Simulation

3.3.1 Normal Distribution

3.3.2 Log-normal Distribution

3.3.3 Exponential Distribution

3.3.4 Contaminated Normal Distribution

3.4 Software Used

3.5 Level of Significance

3.6 The Pearson Correlation Coefficient

3.7 The Spearman Rank Correlation Coefficients

3.8 Kendall Rank Correlation Coefficients

3.9 Testing a Single Correlation Coefficient

3.10 Testing Two Correlation Coefficients

3.11 Criteria for Identifying Best Proportion of Rejecting True Null Hypothesis

CHAPTER FOUR

4.0 RESULTS AND DISCUSSION

4.1 Introduction

4.2 Spurious Correlation Test for Poverty Levels in Nigeria

4.3 Comparison of Correlation Coefficient Tests

4.4 Discussion of Findings

CHAPTER FIVE

5.0 SUMMARY, CONCLUSION AND SUGGESTION FOR FURTHER STUDIES

5.1 Summary

5.2 Conclusion

5.3 Suggestion for Further Studies

REFERENCES

APPENDICES

**CHAPTER ONE**

**1.0 INTRODUCTION**

**1.1 Background to the Study**

Awareness of the problems related to the statistical analysis of spurious correlation began as early as 1897, when Karl Pearson published his seminal paper on the subject, whose title began significantly with the words “On a form of spurious correlation”. The issue was later raised again by the geologist Chayes (1960).

According to the main historical accounts of the spurious correlation test, Pearson used the term spurious correlation to “distinguish the correlations of scientific importance from those that were not.” The problem, according to Pearson, was that some correlations did not indicate an “organic relationship.” Although this term is never defined, the examples used suggest that a spurious correlation was a correlation between two variables that were not causally connected. The correlation coefficient itself measures only the strength of linear relationships (Johnson and Kotz, 1992).

Simplicity and interpretability should be the main considerations when selecting measures of association. Historically, the Pearson correlation has been the main association measure in multivariate analysis. It is simple, as it relates only two variables of a random vector, and it concerns only linear transformations in $\mathbb{R}^n$, i.e. a change of scale plus a shift. Its interpretation relies on linear regression ideas, which in turn are related to the geometry of $\mathbb{R}^n$, where covariance appears as a Euclidean inner product in the space of samples (Lovell et al., 2013). These desirable properties hold when the Pearson correlation is applied to study association.

Correlations between variables can be measured with different indices (coefficients). The three most popular are Pearson’s coefficient r, Spearman’s rho coefficient and Kendall’s tau coefficient. On the history of measuring correlation strength, Aitchison (1986) created the basis for a proper and correct application and interpretation of correlations (in the modern meaning of the word), and the history was presented by Rodgers and Nicewander (1988).

Pearson’s coefficient of correlation was discovered by Bravais in 1846, but Karl Pearson was the first to describe, in 1896, the standard method, that is, the formula for its calculation, application and interpretation. Pearson also offered some comments about an extension of the idea made by Galton (1879), who applied it to anthropometric data. He called this method the “product-moments” method, or the Galton function for the coefficient of correlation r.

In 1904 Spearman adopted Pearson’s correlation coefficient as a measure of the strength of the relationship between two variables that cannot be measured quantitatively. He noted: “The most fundamental requisite is to be able to measure our observed correspondence by a plain numerical symbol. There is no reason whatever to be satisfied either with vague generalities such as ‘large’, ‘medium’, ‘small,’ or, on the other hand, with complicated tables and compilations.”

Kendall’s tau, introduced by Kendall (1938), can be used as an alternative to Spearman’s rho for data in the form of ranks. It is a simple function of the minimum number of neighbour swaps needed to produce one ordering from another. Its properties were also analyzed by Kendall in his book on rank correlation methods, first published in 1948. The main advantages of Kendall’s tau are that its distribution has slightly better statistical properties and that the statistic has a direct interpretation in terms of the probabilities of observing concordant and discordant pairs. Nonetheless, over the last sixty years the coefficient τ has been used less often than Spearman’s coefficient for measuring rank correlation, mainly because it was more difficult to compute.
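As a small illustration of the neighbour-swap and concordant/discordant interpretations described above, the following sketch computes Kendall’s τ for data without ties. The function names and the toy data are assumptions made for this illustration and are not taken from the study; the identity τ = 1 − 4Q/(n(n − 1)), with Q the number of discordant pairs, is the standard form when there are no ties.

```python
from itertools import combinations


def discordant_pairs(x, y):
    """Count pairs ordered one way in x and the opposite way in y (no ties assumed)."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if (x1 - x2) * (y1 - y2) < 0
    )


def adjacent_swaps(x, y):
    """Count the neighbour swaps a bubble sort needs to put y into the order of x."""
    seq = [b for _, b in sorted(zip(x, y))]  # y values listed by increasing x
    swaps = 0
    for end in range(len(seq) - 1, 0, -1):
        for i in range(end):
            if seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                swaps += 1
    return swaps


def kendall_tau(x, y):
    """Kendall's tau from the number of discordant pairs (no ties assumed)."""
    n = len(x)
    return 1 - 4 * discordant_pairs(x, y) / (n * (n - 1))


# Hypothetical toy data: two rankings of five items.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

print("discordant pairs:", discordant_pairs(x, y))
print("neighbour swaps: ", adjacent_swaps(x, y))
print("Kendall's tau:   ", kendall_tau(x, y))
```

For untied data the two counts coincide, which is why τ can be read either as a rescaled swap distance or as the difference between the proportions of concordant and discordant pairs.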

Nowadays the calculation of Kendall’s τ poses no problem. Kendall’s τ is equivalent to Spearman’s $r_s$ in terms of the underlying assumptions, but the two are not identical in magnitude, since their underlying logic and computational formulae are quite different.

The relationship between the two measures for large numbers of pairs is given by Daniels (1944) as $-1 \le 3\tau - 2r_s \le 1$. Properties and comparisons of Kendall’s τ and Spearman’s $r_s$ have been analyzed by many researchers and are still under investigation (Valz and Thompson, 1994; Weichao et al., 2010).

The association between two variables is often of interest in data analysis and methodological research. Pearson’s, Spearman’s and Kendall’s correlation coefficients are the most commonly used measures of monotone association, with the latter two usually suggested for non-normally distributed data. These three correlation coefficients can be represented as differently weighted averages of the same concordance indicators. The weighting used in Pearson’s correlation coefficient could be preferable for reflecting monotone association in some types of continuous and not necessarily bivariate normal data (Nian, 2010).

The Pearson correlation coefficient r, which measures a linear relationship between the variables, is one of the most frequently used tools in statistics (Rodgers and Nicewander, 1988). Generally, correlation indicates how well two normally distributed variables move together in a linear way (Aczel, 1998). When the assumption of normally distributed variables is not valid, or the data are in the form of ranks, other measures of the degree of association between two variables are used, namely the Spearman rank correlation coefficient $r_s$ (Aczel, 1998) or the Kendall correlation coefficient. In addition, since the normality assumption usually does not provide an adequate approximation to data sets with heavy tails, non-normal distributions are used in practice (Johnson and Kotz, 1992; Kotz et al., 2000).
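The remark that the three coefficients weight the same pairwise comparisons differently can be made concrete with Kendall’s general correlation coefficient, Γ = Σ a_ij b_ij / √(Σ a_ij² · Σ b_ij²), where the scores a_ij and b_ij are raw differences (Pearson), rank differences (Spearman) or signs of differences (Kendall). The sketch below is only an illustration under assumed settings: the sample size, the seed and the linear-plus-noise data are hypothetical choices, and the final line simply evaluates the large-sample quantity 3τ − 2r_s from Daniels’ bound on this one sample.

```python
import math
import random
from itertools import combinations


def rank(values):
    """Ranks 1..n of the values (no ties assumed, as for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def sign(d):
    """Return -1, 0 or 1 according to the sign of d."""
    return (d > 0) - (d < 0)


def identity(d):
    """Leave a pairwise difference unchanged."""
    return d


def general_correlation(x, y, score):
    """General coefficient: sum(a*b) / sqrt(sum(a^2) * sum(b^2)) over all pairs."""
    num = sum_a2 = sum_b2 = 0.0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        a, b = score(xj - xi), score(yj - yi)
        num += a * b
        sum_a2 += a * a
        sum_b2 += b * b
    return num / math.sqrt(sum_a2 * sum_b2)


# Hypothetical data: y depends linearly on x plus independent noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [xi + rng.gauss(0, 1) for xi in x]

pearson_r = general_correlation(x, y, identity)               # raw differences
spearman_rs = general_correlation(rank(x), rank(y), identity)  # rank differences
kendall_tau = general_correlation(x, y, sign)                  # signs only

daniels = 3 * kendall_tau - 2 * spearman_rs
print(f"r = {pearson_r:.3f}, r_s = {spearman_rs:.3f}, tau = {kendall_tau:.3f}")
print(f"3*tau - 2*r_s = {daniels:.3f}")
```

Writing all three coefficients through one function makes the difference between them explicit: only the score applied to each pairwise difference changes.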

**1.2 Statement of the Problems**

The spurious correlation coefficient tests (Pearson, Spearman and Kendall) are the major methods used in measuring the degree of association. These methods have been adopted by many researchers, such as Nian (2010) and Jan and Tomasz (2011), who used normal data and found that Pearson’s product-moment correlation coefficient has the best degree of association among the methods. However, comparisons of these methods under different distributions have received little or no attention. Hence, the present research studies these methods under normal and some non-normal distributions.

**1.3 Aim and Objectives**

The aim of this research work is to compare the different “spurious correlation tests” using data on poverty levels and simulated data sets. The objectives are to:

i. Assess the proportion of rejecting a true null hypothesis under Pearson’s product-moment correlation test, Spearman’s rank correlation rho test and Kendall’s rank correlation tau test.

ii. Identify the method with the best degree of association under normal, log-normal, exponential and contaminated normal random distributions.

**1.4 Significance of Study**

The problem of “spurious correlation” in measuring the dependence between two variables was recognized as early as 1897. These methods have since been adopted by many researchers, but most of them used primary and secondary data without investigating the nature of the data, that is, which probability distribution the data follow. It is our hope that, at the end of this study, we will identify the best methods to use for particular probability distributions when the need arises.

**1.5 Scope and Limitations**

There are many different types of spurious correlation coefficients that reflect somewhat different aspects of association and are interpreted differently in statistical analysis. This study focuses on three popular methods that are often provided next to each other, namely Pearson’s, Spearman’s and Kendall’s correlation coefficients. Only normal, log-normal, exponential and contaminated normal random distributions are used for the study. In addition, the proportion of rejecting a true null hypothesis (type I error rate) is used as the criterion of comparison in this research work.
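A minimal sketch of how the proportion of rejecting a true null hypothesis might be estimated by simulation for the three tests is given below. Everything specific in it is an assumption made for illustration: the sample size, the number of replications, the seed, the distribution parameters, the contamination settings (10% of draws from a normal with three times the standard deviation) and the large-sample z approximations. The study’s own settings and its t and z tests are not reproduced here.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05                                  # assumed significance level
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value


def ranks(values):
    """Ranks 1..n (no ties assumed, as for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2 * s / (n * (n - 1))


def z_statistics(x, y):
    """Approximate z statistics for H0: no association (assumed test forms)."""
    n = len(x)
    return {
        # Fisher z transform of Pearson's r
        "pearson": math.atanh(pearson(x, y)) * math.sqrt(n - 3),
        # Fisher z transform of Spearman's r_s with the 1.06 variance factor
        "spearman": math.atanh(pearson(ranks(x), ranks(y))) * math.sqrt((n - 3) / 1.06),
        # Normal approximation using Var(tau) = 2(2n + 5) / (9n(n - 1))
        "kendall": kendall(x, y) * math.sqrt(9 * n * (n - 1) / (2 * (2 * n + 5))),
    }


def contaminated_normal(rng, eps=0.1, scale=3.0):
    """Standard normal contaminated by a wider normal (hypothetical settings)."""
    return rng.gauss(0, scale) if rng.random() < eps else rng.gauss(0, 1)


SAMPLERS = {
    "normal": lambda rng: rng.gauss(0, 1),
    "log-normal": lambda rng: rng.lognormvariate(0, 1),
    "exponential": lambda rng: rng.expovariate(1.0),
    "contaminated normal": contaminated_normal,
}


def rejection_proportions(draw, n=30, reps=300, seed=1):
    """Share of replications in which each test rejects a true H0."""
    rng = random.Random(seed)
    rejections = {"pearson": 0, "spearman": 0, "kendall": 0}
    for _ in range(reps):
        x = [draw(rng) for _ in range(n)]
        y = [draw(rng) for _ in range(n)]  # drawn independently of x, so H0 is true
        for name, z in z_statistics(x, y).items():
            rejections[name] += abs(z) > Z_CRIT
    return {name: count / reps for name, count in rejections.items()}


results = {dist: rejection_proportions(draw) for dist, draw in SAMPLERS.items()}
for dist, proportions in results.items():
    print(dist, proportions)
```

Because x and y are generated independently, every rejection is a type I error, so each proportion can be compared directly with the nominal level of 0.05.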

**1.6 Definition of Terms**

Poverty is multifaceted and has no single universally accepted definition. The World Bank defines poverty as a pronounced deprivation of human wellbeing, which includes vulnerability to adverse events outside one’s control, being badly treated by the institutions of state and society, and being excluded from having a voice and power. Any household or individual with insufficient income or expenditure to acquire the basic necessities of life is considered to be poor. Those who fall below the absolute poverty line live on less than one US dollar per day.

Those who are moderately or relatively poor live on more than one US dollar but less than two US dollars per day (NBS, 2010).

**Food Poverty** – is an aspect of the absolute poverty measure that considers only food expenditure for the affected household.

**Relative Poverty** – is defined by reference to the living standards of the majority of people in a given society.

**Absolute Poverty** – is defined in terms of the minimal requirements for food, clothing, healthcare and shelter.

**Dollar Per Day** – this measure of poverty refers to the proportion of people living below the US$1 per day poverty line, based on the World Bank’s Purchasing Power Parity (PPP) index (NSBS, 2012).