ODAM: An optimized distributed association rule mining algorithm




Variants of a basic word commonly exist in natural language texts [6]. Morphological variations are usually the most typical, along with other sources such as alternate spellings, misspellings, and variations arising from transliteration and abbreviation [6]. Stemming addresses the challenges posed by varying morphological forms by reducing semantically related words to a common stem [1], [6].

As stated in [1], stemming algorithms are automated rules that reduce all terms with the same root to a common form, normally by removing the words’ morphological affixes. The same author notes that stemming research is of interest today in many fields of computational linguistics and IR, though for different reasons. In morphological analysis, the stem of a term may be of less immediate interest than its affixes, which can be used as clues to grammatical structure [1], [3].

The motivation behind research on stemming algorithms is the need to improve information retrieval accuracy [1]. Nowadays, stemmers are widely applied in different fields of NLP such as IR, text classification, text summarization and automatic machine translation [1].

According to Sharma [2], in the manual approach a word in a document is queried by searching for one of its variants at a time. The same researcher notes that this technique is very tiresome and misses related information of equal importance [2]. This is why stemming is broadly applied in information retrieval systems: it avoids such difficulties and enhances retrieval performance [3].

Stemming is applied as a preprocessing stage in the development of automated text summarization systems. Stemming algorithms are also used in machine translation to obtain stemmed words or sentences [14].

Designing a stemming algorithm for the Kambaata language also benefits the development of other natural language processing applications such as text classification, text categorization and morphological analyzers [16].


An example of a stem can be the word “mar” (go – 2M) which is the stem for the variants “marro” (goes), “marree(u)” (went), “marimba’a” (didn’t go), “marano” (will go), “marayyoo(u)” (is going), and “marota” (to go).

Morphology is the study of the structure of words and defines word formation in a language. The most common ways of forming word variants in natural language text are suffixing and prefixing [5]. Inflectional and derivational morphology are the two types of morphology [6]. Inflectional morphology creates different forms of the same word without changing its part of speech. Usually, the variations result from changes in person, number, tense and gender. As stated in [7], such variations have no effect on a word’s class; that is, a verb remains a verb after its tense form is changed. For example, “agud” (look), “agujjo” (looks), “agudayyoo(u)” (looking), “agujjee(u)” (looked).

Derivational morphology, by contrast, changes the word’s class [7]. For instance, an affix can change a word from adjective to noun, from verb to noun, from noun to verb, and so on, as in “jaalu” (friend), “jaalloomaan” (friendly), “jaalloomat” (friendliness) and “jaalloomata” (friendship).

Depending on how rich the morphology of a language is, many variants of a term can result from a single stem [2]. This large number of variants has a strong impact on information retrieval programs. As a result, there is a demand for an automated procedure that can reduce the number of distinct terms to a manageable level and also capture the strong connections that exist among different word forms [8]. Even though languages differ in their degree of morphological complexity, stemming is generally employed in information retrieval on the fundamental assumption that morphological variants carry equivalent meaning [16].

Morphological processing is commonly used for effective information retrieval, machine translation and text summarization [8], [9]. Consequently, it is crucial for IR, which needs to identify the proper word variants to use as index terms [10]. According to Salton [11], an automatic IR system is a software component that helps diverse end users request and access information from databases.

According to Baeza-Yates [5], stemmers are grouped into four categories based on their stemming strategy: affix removal, table lookup, successor variety, and n-gram stemmers.


The rule-based technique, also known as affix removal, is a strategy that can be implemented easily and efficiently [63]. In this strategy, affixes are removed from terms, leaving stems. This technique was applied by [1], [12], [15], [16], [17], and [18]. The table lookup strategy, also called dictionary-based, looks up the stem of a word in a dictionary table. Table lookup was employed by [14] to stem Amharic words. This approach is straightforward but depends on the size of the stem dictionary, and it requires significant storage space. The successor variety technique is based on identifying the morpheme boundaries of terms and draws on knowledge from structural linguistics. This approach is more complicated than affix removal. The n-gram method is based on identifying n-grams, for instance bi-grams and tri-grams. This method was used by [13] to stem language-independent words using uni-grams.
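To make the n-gram idea concrete, the following minimal sketch compares two words by the character bi-grams they share, using the Dice coefficient. It is an illustration only, not the method of [13]; the function names `bigrams` and `dice` are chosen here for the example, and the words are taken from the “mar” variants quoted earlier.

```python
def bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over the unique bi-grams of two words (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        # Words shorter than two characters have no bi-grams to compare.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# "marro" and "marano" share the bi-grams "ma" and "ar";
# "marro" and "agud" share none.
related = dice("marro", "marano")
unrelated = dice("marro", "agud")
```

Words whose score exceeds a chosen threshold would be grouped together; the threshold itself has to be tuned per language and is not specified in the text.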

In contrast to morphologically simple languages such as English, Cushitic languages such as Sidaama and Kambaata have very complex morphology [19]. According to Treis and [20], Kambaata does not use prefixes for word formation. Nevertheless, complex terms can be created by suffixation, infixation, compounding and reduplication, specifically by full reduplication or by reduplication of part of the word. The reduplicated section of the syllable is prefixed in Kambaata [19].

Kambaata is known as “Kambaati afoo”, literally ‘the mouth of Kambaata’, in the Kambaata language. It belongs to the Highland East Cushitic branch, which encompasses languages spoken in south-central Ethiopia such as Hadiyya, Libido, Kambaata, Alaaba, Qabeena, Sidaama, Gedeo, and Burji [19]. The language is spoken and institutionalized in the Kambaata and Tambaaro (KT) Zone, which is located in the northeastern part of the Southern Nations, Nationalities, and Peoples Region (SNNPR) of Ethiopia, 250 km south west of Addis Ababa, Ethiopia’s capital. The language is also spoken by Kambaata migrants in other parts of the country and abroad. For instance, there is a significant population of Kambaata-speaking migrants in South Africa.

The name of the Kambaata people and of their language appears in the literature in numerous spellings in addition to Kambaata; the most frequent ones include Kambata, Kambatta, Kembata, Kembatta, Cambata, Cambatta, Kambara, Kemata and Donga. The people of Kambaata call their language by the name Kambaatissata or



script – Amharic), and sometimes Kambatic (in English, just like the ‘ic’ ending of “Amharic or Arabic”) [19] , [22] .


Kambaata is a Highland East Cushitic language, part of the Cushitic family and the much bigger Afro-Asiatic group, and is spoken by the people of Kambaata. Kambaata dialects (with lexical similarity between dialects) are: Tambaaro (95%), Alaaba (81%), Kabeena/Qabeena (81%). Kambaata also has high lexical similarity with other HEC groups, i.e. Sidaamo (62%), Libido (57%), Hadiyya (56%), and Gedeo (54%) [22].


Kambaata is also the name of a smaller Highland East Cushitic division, the Kambaata group, which comprises Kambaata itself (the main member) and Tambaaro, as well as Alaaba and Qabeena, which are usually regarded as its dialects [19], [23].


Kambaata is one of the Zonal languages in the SNNPR of Ethiopia. At present, it is estimated to be spoken by more than one million people [22], [23]. Currently, the language serves as a medium of instruction in primary schools, and is also taught as a subject in the junior and secondary high schools and preparatory schools of the KT Zone. The first Kambaata-Amharic dictionary was published by today’s KT Zone Culture and Tourism Department (1995 E.C./2003) and the second, the ‘Kambaatissa-Amharic-English’ dictionary, was published by Alamu Banta (2009 E.C./2016). The Kambaata Old Testament Bible translation in the official Latin orthography is 71% complete as of today, according to data from the Bible Society of Ethiopia [68]. Booklets of Bible stories such as “Haaroo Woqqaa” ‘New Way’, written in both Ge’ez and Latin script, have also been published by the Bible Society of Ethiopia [19]. Kambaata proverbs, tales and legends are among the few works that have been partially completed so far [33].

Many translation efforts have already started to render materials from other languages into Kambaata. The language is being studied at numerous levels, both locally and by researchers from abroad. For example, different research works on the language have been carried out by the research unit Languages and Cultures of Sub-Saharan Africa (LLACAN), which is affiliated with the National Centre for Scientific Research (NCSR) and the National Institute of Oriental Languages and Civilizations (INALCO) [24]. Such activity opens up opportunities to produce many more written materials in the language.


Technology and digital media in Ethiopia are advancing steadily and ever more quickly. Access to and use of the Internet and search engines are becoming part of everyday activity, not only in Ethiopia but also in the rest of Africa and the world. Textbooks, reference books, publications, articles and various other documents can be accessed digitally on pervasive devices that support flexible access. Networking of educational institutions (i.e. universities, colleges, high schools), corporations and businesses is in progress, and a number of research projects with national and international institutions on the study of the Kambaata language have been started recently.


Along with growing access to the Internet, there is evidence of a swiftly growing number of Kambaata educational, cultural and religious documents, journal articles and other kinds of documents in electronic media. One of the pressing issues in the field today is how to store and access this pervasive information effectively and efficiently. Document summarization, classification and information retrieval are fields that attempt to deal with these kinds of problems [2].

Kambaata has rich morphology [19], [54]. It uses both types of morphology, inflectional and derivational, for word formation. For instance, more than two hundred variants can be formed from a single stem by inflection and derivation (see Appendix VII) [55]. An example is given here for the progressive inflection of the verb “kul” (tell). For the full list, see Appendix VI and Appendix VII.






Person        e.g. “kul” (tell)
1SG           kul-ayyoom(m)
2SG           kul-tayyoont
3M            kul-ayyoo(u)
3F/3PL        kul-tayyoo(u)
3HON          kul-eenayyoomma
1PL           kun-nayyoom(m)
2PL/2HON      kul-teenayyoonta


However, to the best of the researcher’s knowledge, Kambaata has very few linguistic resources, and no computational work has been carried out to automate the language in relation to NLP applications.


As reported by [7], the morphological complexity of a language can result in an extremely large number of variants for a word. Word variation can in turn have a substantial influence on the efficiency of IR systems and of morphological analysis tools. Since Kambaata is a morphologically complex language [19], there is a need for automated programs that can stem words and reduce the number of distinct word forms, in order to minimize storage requirements and support text summarizers and retrieval applications, and that can also capture the strong associations among word variants in the language [17].

Stemming also plays a crucial role in determining the stem of a word by removing both inflectional and derivational affixes, and there has therefore been much demand for stemmers with this goal [5], [7]. This demand is likely to grow further as more text-processing applications become important [10].

According to the researcher’s observation during visits to the zone, there is already a need for a stemming algorithm and other applications for Kambaata text. This requires designing an automated procedure that removes the inflectional and derivational affixes of Kambaata words, and exploring how semantically related terms in the language can be conflated automatically using a rule-based approach that depends exclusively on the morphology of the language. Exploring word conflation techniques and designing the stemming algorithm depend entirely on the morphological properties of the specific language, because every language has a distinct morphological structure [1], [2], [6]. Thus, it requires finding patterns for stemming and defining and developing a stemmer based on the language’s morphology and word formation [1], [15], [16], [17], [18].

The factors outlined above established the need to design an algorithm that conflates Kambaata texts effectively for the users of the language.

Research on stemming algorithms has been conducted for several languages, both internationally and locally. Locally, stemmers have been attempted for Amharic [7], [14], Afaan Oromo [15], [29], Tigrigna [12], [16], Wolaytta [17], Silt’e [18] and a few others. To the best of the researcher’s knowledge, no research has been carried out to explore stemming methods for Kambaata words, and there has never been an attempt to design a rule-based stemmer for Kambaata text. Thus, the researcher used this opportunity to explore stemming rules and design an appropriate algorithm for stemming Kambaata words.

Therefore, the purpose of this study is to explore methods and rules for stemming Kambaata words and to design a rule-based stemmer that provides an automatic word conflation method for documents.

Hence, the research attempts to answer the following research questions:


  • What are the morphological properties of Kambaata, and how are words formed in the language?
  • What are the challenges in designing a rule-based stemmer for Kambaata?
  • What optimal stemming performance can be achieved on a given test corpus?



The research has the following general and specific objectives.


The main objective of this research is to design a stemming algorithm for Kambaata text using Rule based approach.


To achieve the above general objective, the following specific tasks and objectives were carried out in the research work.


  • To carry out a literature analysis of stemming algorithm research.
  • To review the morphological behavior of the language in order to become acquainted with its word formation.
  • To prepare the corpus needed to identify affixes and stems in the language.
  • To construct the list of affixes used in Kambaata from the corpus and the literature.
  • To explore and adapt techniques and define rules for stemming Kambaata words.
  • To design a rule-based stemmer for Kambaata words that conflates the inflectional and derivational affixes of the language.
  • To test the stemmer on a selected test set and measure the performance of the stemmer designed in the study.




The general research approach adopted for this study is a Design Science Research methodology which is employed for the design of the algorithm. As stated in [70] , the design science research requires the creation of an innovative, purposeful artifact for a specified problem domain. This research process involves problem identification, solution suggestion, development, evaluation and conclusion [70] .

Problem identification: The research problem in this study was identified by reading about NLP research problems in Ethiopia, and more specifically about gaps in stemming research. Reading about the research gaps in the field made the researcher aware of the limitations of stemming research and helped identify which languages have not been studied in this regard.


Suggestion: After problem identification, a research proposal has been prepared with a need to apply an existing knowledge of stemming to new area of Kambaata language as a new research effort.

Development: As part of this process, a rule-based stemming technique was selected and the appropriate algorithm was designed for the Kambaata language based on a detailed study of its morphology. Finally, the artifact of the study (the stemming algorithm) was developed and implemented in the Python programming language using context-sensitive and longest-match approaches.

Evaluation: After the design of the algorithm, the stemmer was evaluated using error counting and dictionary reduction methods. The evaluation results were measured in terms of correctly stemmed, over-stemmed and under-stemmed words.

Conclusion: At the end of the research process, conclusions have been derived from the main research findings. The challenges during designing stemmer for the Kambaata language are also discussed and the summarized behavior of the artifact is also discussed in this phase.


To understand stemming algorithms and strategies for developing or adapting them, several research works on stemming algorithms for various languages such as English, French, Greek, Amharic, Afaan Oromo, Tigrigna, Wolaytta and Silt’e have been reviewed. To understand the morphological properties of Kambaata, research on the language was reviewed by consulting diverse sources such as books, journal articles, dictionaries and textbooks. For the literature review of Kambaata morphology, books and journal articles were downloaded from online journal libraries, and further information was gathered by email from linguists who work on the language.


A corpus is the fundamental data source needed in the development of a stemming algorithm [16]. Text data was collected, and a literature survey was used to compile affixes. As stated in [16], a large text can exhibit a reasonable picture of a language’s morphological behavior. Selecting a larger text is therefore an essential element in designing a stemmer [18]. Hence, for the purpose of this research, the researcher used a corpus of 129,929 word tokens collected from school textbooks from Kambaata educational offices and high schools, which is believed to be representative of the language because of its size. As stated in the limitation section, the researcher was not able to obtain a corpus covering different domains.


Python was used to implement the stemmer because it provides extensive library support for string operations [18] and because the researcher is more familiar with Python than with any other language. Kambaata has rich morphology [54]; hence stemming Kambaata words mainly involves suffix stripping, along with handling infixation and irregular words to a lesser degree. The algorithm is designed by examining the morphological rules of the language.

For the development of the Kambaata stemmer, affix removal (often called rule-based) with the longest-match technique is employed, since it is a broadly used stemming approach. Rule-based stemming is simpler and can be implemented more efficiently than other techniques, provided the rules for affix removal in the language are known [63]. Most of the stemmers developed so far are based on this approach [2]. Sharma [2] mentions that the rule-based technique has the following advantages over statistical approaches:

  1. Stemming programs built with the rule-based technique are faster than statistical stemmers. Thus, stems can be obtained in a short computing time using the rule-based approach.
  2. The accuracy of words stemmed by rule-based stemmers is considerably higher.
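The longest-match affix removal described above can be sketched as follows. This is a minimal illustration and not the stemmer developed in the study: the suffix list contains only the progressive endings from the “kul” (tell) paradigm quoted earlier, with the optional final “(m)” and “(u)” omitted, and the minimum stem length of 2 is an assumption made for the example.

```python
# Illustrative suffix list: only the progressive endings from the "kul" paradigm.
# A real rule set for Kambaata would be far larger.
SUFFIXES = [
    "teenayyoonta", "eenayyoomma", "tayyoont",
    "nayyoom", "tayyoo", "ayyoom", "ayyoo",
]
MIN_STEM_LEN = 2  # assumption: never strip a word down to fewer than 2 letters

# Try longer suffixes first so that "tayyoo" wins over "ayyoo" when both match.
_BY_LENGTH = sorted(SUFFIXES, key=len, reverse=True)


def stem(word):
    """Strip the longest matching suffix, or return the word unchanged."""
    word = word.lower()
    for suffix in _BY_LENGTH:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


forms = ["kulayyoom", "kultayyoont", "kulteenayyoonta", "kunnayyoom"]
stems = [stem(f) for f in forms]
```

With this list the first three forms reduce to “kul”, while “kunnayyoom” reduces to “kun”. Mapping “kun” back to “kul” is the kind of case that plain suffix stripping does not cover and that a context-sensitive or recoding rule would have to handle.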



Error counting is a commonly applied method for evaluating stemmer performance, and it is used here to examine the effectiveness of the stemmer. Correctly stemmed, over-stemmed and under-stemmed words are identified, and the reduction in dictionary size is measured, to observe the result of the stemmer. The results are reported as counts and percentages, with the percentage of correctly stemmed words indicating the accuracy of the stemmer.
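A minimal sketch of how such counts and percentages might be computed is given here. It assumes a manually prepared list of (stemmer output, expected stem) pairs, and it uses a simplifying rule that is not stated in the study: an output shorter than the expected stem counts as over-stemmed, any other mismatch as under-stemmed. The function names are hypothetical.

```python
from collections import Counter


def classify(predicted, gold):
    """Label one stemmer output against the manually assigned stem."""
    if predicted == gold:
        return "correct"
    # Assumption: too short means too much was removed (over-stemming).
    return "over" if len(predicted) < len(gold) else "under"


def error_counts(pairs):
    """Return {label: (count, percentage)} for (predicted, gold) pairs."""
    counts = Counter(classify(p, g) for p, g in pairs)
    total = len(pairs)
    return {
        label: (counts[label], 100.0 * counts[label] / total if total else 0.0)
        for label in ("correct", "over", "under")
    }


def dictionary_reduction(words, stems):
    """Percentage drop from distinct word forms to distinct stems."""
    unique_words = len(set(words))
    if unique_words == 0:
        return 0.0
    return 100.0 * (unique_words - len(set(stems))) / unique_words


# Toy data for illustration only; these are not results from the study.
sample = [("kul", "kul"), ("ku", "kul"), ("kulayy", "kul"), ("mar", "mar")]
report = error_counts(sample)
```

The accuracy figure is then the percentage stored under the "correct" label.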



In this study, the very first stemmer for Kambaata words, which removes affixes for use in different NLP applications, has been designed. Kambaata is a strictly suffixing language; consequently, it has no known prefixes [19], [20]. This study is the first attempt to explore stemming techniques and design an algorithm for Kambaata words, and the stemmer is the first of its kind. In this research, rules for stemming words have been explored, and an algorithm that handles suffixes, infixes and some irregular words in Kambaata has been designed. The stemmer does not only remove suffixes; it also has context-sensitive and recoding rules to transform words that are not handled by the suffix removal rules. Reduplicated and compound words in the language are very few and are outside the scope of this study because of the complexity of their morphological behavior and other resource limitations. The limitation of the research is that the corpus used does not cover different domains. The corpus is mainly from the educational domain, and the researcher was not able to find corpora from other domains because such resources were unavailable.


Designing a stemming algorithm for Kambaata words helps speakers of the language find the information they want quickly, without difficulties when querying words. The artifact of this study could also be a foundation for exploring and developing various other NLP applications for Kambaata, such as IR systems, text summarizers, machine translation, text categorization applications and morphological analyzers. Kambaata word processing tools could also use a stemming algorithm together with spell checker software to improve the efficiency of spell checking [69].

A Kambaata word stemmer could also offer the advantage of reducing the size of documents [14]. Because an individual stem usually corresponds to several full terms, storing stems rather than words could achieve a data compression rate of more than 50 percent [2].

In Kambaata, a word has a large number of variants, and conflating all these variants increases retrieval performance [64]. It also decreases the storage space needed for index documents [2]. Moreover, exploring stemming techniques for the language’s words could provide the following advantage:


  • The study helps to design tools such as a term frequency counter.
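The retrieval and index-size benefits described above can be illustrated with a small inverted index keyed by stems. This is a sketch under stated assumptions: the stem table lists only some of the “mar” variants quoted earlier, the three toy documents are invented for the example, and `lookup_stem` and `build_index` are hypothetical names rather than part of the study’s implementation.

```python
from collections import defaultdict

# Tiny table-lookup stemmer built from the "mar" variants quoted in the text.
STEM_TABLE = {
    "marro": "mar", "marree": "mar", "marano": "mar",
    "marayyoo": "mar", "marota": "mar",
}


def lookup_stem(word):
    """Return the stored stem, or the word itself if it is not in the table."""
    return STEM_TABLE.get(word, word)


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


# Invented one-line documents, each using a different variant of "mar".
docs = {1: "marro", 2: "marano marota", 3: "marree"}

raw_index = build_index(docs, lambda w: w)       # one entry per surface form
stem_index = build_index(docs, lookup_stem)      # one entry per stem
```

Here the unstemmed index holds four separate entries, while the stemmed index holds a single entry for “mar” that points to all three documents, so a query for any one variant retrieves every document that uses a related form.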



The thesis document is structured in five chapters. The first chapter is the introduction, which covers the background, the statement of the problem and its justification, the objectives, the methodology, the scope and limitations, and the significance and application of the study.

The second chapter explains principles and strategies of stemming algorithms. Comprehensive discussions are made on methods to stemming. Review is also made on stemmers developed for local and foreign languages as part of related works.

Kambaata language morphology is introduced and reviewed in the third chapter. In depth outline of Kambaata morphology is provided in this chapter.

Chapter four is the key portion of the thesis work. It presents the exploration and design of the Kambaata stemming algorithm, beginning with a short introduction followed by corpus preparation for the algorithm. The discussion proceeds with the collection of Kambaata affixes, followed by the implementation of the stemmer. Finally, the evaluation of the stemming algorithm is discussed in depth.

The last chapter presents conclusions comprehended from the findings and recommendations for future study.

