Behavior Research Methods, Instruments, & Computers
2000, 32 (3), 482-500

A Japanese logographic character frequency list for cognitive science research

NOBUKO CHIKAMATSU
DePaul University, Chicago, Illinois

SHOICHI YOKOYAMA
National Language Research Institute of Japan, Tokyo, Japan

HIRONARI NOZAKI
Aichi University of Education, Kariya, Japan

ERIC LONG
National Language Research Institute of Japan, Tokyo, Japan

and

SACHIO FUKUDA

This paper describes a Japanese logographic character (kanji) frequency list, which is based on an analysis of the largest recently available corpus of Japanese words and characters. This corpus comprised a full year of morning and evening editions of a major newspaper, containing more than 23 million kanji characters and more than 4,000 different kanji characters. This paper lists the 3,000 most frequent kanji characters, as well as an analysis of kanji usage and correlations between the present list and previous Japanese frequency lists. The authors believe that the present list will help researchers more accurately and efficiently control the selection of kanji characters in cognitive science research and interpret related psycholinguistic data.

In many empirical psycholinguistic studies, word frequency is used as an independent variable to select materials having desired frequency characteristics or as a control variable to match two or more sets of materials in order to minimize performance differences attributable to word frequency effects in word recognition, memory, psycholinguistic research to focus on the frequency effects of linguistic units smaller than words, such as letter clusters (e.g., bigrams or trigrams), syllabic-type units (syllable vs. nonsyllable), morpheme units, as well as position Appleman & Mayzner, 1981; Grainger & Jacobs, 1993; Srinivas, Roediger, & Rajaram, 1992). For example, logographic character frequency is a crucial factor to consider in word experiments using languages with logographic scripts, such as Chinese, Japanese, or Korean, where each logographic character may function as a word (Matsunaga, 1996). In short, it is important to carefully control the frequency of printed characters and/or words when empirical psycholinguistic studies are conducted.

N. Chikamatsu, 802 West Belden Avenue, Chicago, IL 60614 (e-mail: nchikama@condor.depaul.edu).

In the past, compiling linguistic corpora was an ex concerns caused by human error. However, as computer technology continues to develop, researchers are obtaining more reliable linguistic corpora and compiling word or character frequency lists on the basis of these corpora for linguistic or cognitive science research. For American English, some widely used word frequency lists are the Brown corpus (Kucera & Francis, 1967), the American Heritage Word Frequency Book (Carroll, Davies, & Richman, 1971), and the Thorndike-Lorge count (Thorndike & Lorge, 1944; see the summary in Solso, Juel, & Rubin, 1982). Many of these corpora and lists are available in computer database form and/or through the Internet. Consequently, researchers may use these corpora and lists to control word frequency in empirical language research more easily, efficiently, and accurately than in the past. remain limited in number or are still under development (Edwards, 1993; Leech & Fligelstone, 1992).

Word and Character Frequency Lists in Japanese

Over the last two decades, researchers in the area of experimental psychology, especially word recognition and memory, have increasingly focused on the Japanese language, owing to the uniqueness of its writing system (Kess & Miyamoto, 1994; Paradis, Hagiwara, & Hildebrandt, 1985; Yokoyama, 1997). In particular, kanji (characters in a logographic script that is one of three scripts used in writing Japanese) has been widely used in experimental materials in order to examine new aspects of cognitive ment in the acquisition and usage of language. However, although many studies have been conducted that use Japanese words, the development of Japanese word frequency lists or kanji character frequency lists has not kept up with the demand for such lists. As a result, for example, the kanji character or word frequency of selected kanji items has often not been controlled or mentioned in Japanese word recognition studies when frequency has not been used as a dependent variable (e.g., Eko & Nakamizo, 1989; Flores d'Arcais & Saito, 1993; Flores d'Arcais, Saito, Kawakami, & Masuda, 1994; Kikuchi, 1996; Morton, Sasanuma, Patterson, & Sakuma, 1992; Nagae, 1994; Naito & Komatsu, 1989; Osaka, 1992; Sekiguchi & Abe, 1992; Wang, 1988; Yokosawa & Shimomura, 1993). In many other studies, the frequency of kanji characters or words is controlled on the basis of (1) the researcher's subjective, intuitive judgment (e.g., Flores d'Arcais, Saito, & Kawakami, 1995; Hatta, Koike, & Langman, 1994; Shimomura & Yokosawa, 1991), (2) the examinee's judgment, such as subjects' rating on selected items (e.g., Yamada, Mitarai, & Yoshida, 1991), (3) the categorization of kanji characters standardized by the Japanese Ministry of Education (e.g., Kyoiku kanji or Gakushu kanji; see, e.g., Hayashi, 1988; Hirose, 1992; Nakagawa, 1994; Sakuma, Itoh, & Sasanuma, 1989), (4) lists compiled by examiners themselves (e.g., Wydell, Butterworth, & Patterson, 1995; Wydell, Patterson, & Humphreys, 1993), or (5) the National Language Research Institute's (NLRI's) 1962 or 1976 word/character frequency lists (Cabeza, 1995; Morikawa, 1985; Naito & Komatsu, 1988; Sasanuma, Sakuma, & Kitano, 1992; Tsuzuki, 1993).

Copyright 2000 Psychonomic Society, Inc.

One of the main impediments to the development of Japanese word frequency lists is that the electronic representation of Japanese characters is more complicated than that of alphabetical languages. At present, there are characters (i.e., kana and kanji): Japanese Industrial Standards (JIS), Shift-JIS (SJIS), and Extended Unix Code (EUC). Generally, EUC is used in Unix workstations on the Internet, whereas JIS is used for Japanese electronic mail. However, SJIS has been adopted for use with per tems are used across tasks or methods, one must transfer one character code to another, using a converter such as the network kanji code conversion filter (NKF).
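The need to transfer text from one character code to another can be illustrated with a short sketch. It is written in Python rather than with NKF, and it assumes that Python's built-in codecs `iso2022_jp`, `shift_jis`, and `euc_jp` are acceptable stand-ins for JIS, SJIS, and EUC. The sample string and the helper name `convert` are hypothetical and do not come from the study.

```python
# Illustrative sketch only: Python's standard codecs stand in for the
# JIS, SJIS, and EUC character codes; this is not the NKF tool itself.

SAMPLE = "日本語のテキスト"  # hypothetical sample text


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode a byte string from one character code to another."""
    return data.decode(source).encode(target)


jis = SAMPLE.encode("iso2022_jp")  # 7-bit code that uses escape sequences
sjis = SAMPLE.encode("shift_jis")  # Shift-JIS
euc = SAMPLE.encode("euc_jp")      # Extended Unix Code

# The same characters are stored as different byte sequences, so text
# must be converted to a single code before characters are counted.
print(len(jis), len(sjis), len(euc))
print(convert(sjis, "shift_jis", "euc_jp") == euc)
```

Decoding to a common internal representation before counting is one way to avoid treating the same character as different byte patterns across sources.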

Another factor impeding the development of Japanese frequency lists is the Japanese writing system itself, which comprises three types of orthographies: hiragana, katakana, and kanji. Hiragana and katakana are syllabic scripts in which each symbol represents a sound unit (a syllable). These scripts each contain 46 basic forms, with additional diacritic and historical forms giving a total of 83 hiragana and 86 katakana forms encoded in JIS and EUC. Hiragana and katakana share the same syllabic sound representation and can be transcribed one by the other (e.g., a syllable /sa/ is transcribed as in hiragana and kanji, is a logographic script adopted from the Chinese language, in which each symbol represents meaning and functions as a morpheme. A single kanji character may represent an independent word (e.g., 本 /hon/, book) or part of a word (e.g., 本 in 日本 /nihon/, Japan). The meaning of each constituent (i.e., a single character) in a kanji word is sometimes less clear or transparent than that of an independent word. Owing to the manner in which kanji characters were transferred from the Chinese to the Japanese language over the centuries, a single kanji character may have obtained more than one reading and may be pronounced in several different ways. For instance, the character 頭, which means head, is read as /to/, a great number of homophones (i.e., kanji characters that share a common pronunciation but represent different meanings) occur in Japanese kanji usage. For instance, and many others are all pronounced /ki/. Thus, in contrast to both hiragana and katakana, kanji characters do not have a systematic sound representation or a one-to-one relationship between sound and symbol. The number of kanji characters is quite large and practically uncountable (i.e., kanji dictionaries may contain between 12,000 and 50,000 entries; Kindaichi, 1991; Morohashi, 1989).

Among hiragana, katakana, and kanji, usually only one is conventionally chosen and used to write a given Japanese word. Hiragana is used primarily for words that have a grammatical function, such as particles or case markers, and for some native Japanese content words. Katakana is used for loan words (i.e., words mainly borrowed from western languages, such as English, French, and Portuguese). Kanji is used for content words, such as nouns, verbs, and adjectives. Thus, a single Japanese sentence is usually written with all three scripts combined.
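The mixture of three scripts within one sentence can be made concrete with a small classifier. The sketch below rests on several assumptions: it uses Unicode code point ranges, which are not the JIS or EUC codes discussed in this paper, it treats only the basic CJK Unified Ideographs block as kanji, and the function names and the sample sentence are hypothetical.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by Unicode block (a simplifying assumption)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"  # this block also holds the long-vowel mark
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"     # basic CJK Unified Ideographs only
    return "other"


def script_counts(text: str) -> Counter:
    """Count how many characters of each script occur in the text."""
    return Counter(script_of(ch) for ch in text)


# Hypothetical sentence ("I drink coffee") that mixes all three scripts.
sentence = "私はコーヒーを飲みます"
print(script_counts(sentence))
```

A classifier of this kind is what allows characters to be tallied separately by category before any ranking is done.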

However, choice of script is not always consistent and may vary, depending on a writer's intention or a publisher's guidelines for style. For instance, the Japanese word meaning egg, pronounced /tamago/, could be written as たまご in hiragana in one context, but in kanji in another context. In their study of the subjective frequency of script type, Ukita, Sugishima, Minagawa, Inoue, and Kashu (1996) studied 750 Japanese words that can be written in more than one Japanese script. The study was judge whether a given word (written in a given script) is seen often, occasionally, or rarely. The results showed that more than half of the tested words were identified as words seen in more than one script. This inconsistency in orthographic representation makes word counting in Japa word segmentation in Japanese is more complex than that in English, since word boundaries are not separated with spaces in written texts. A single kanji character could be a morpheme of a part of a word or a word by itself and may be pronounced differently, depending on the context. Without clear word boundaries, compound words are easily formed. Thus, complications in word counting and segmentation present nontrivial challenges for those compiling word frequency lists in Japanese.

Owing to these problems, few have attempted to make word and kanji character frequency lists in Japanese. The NLRI in Japan published a word frequency list in 1962, based on a corpus derived from 90 different journals and magazines with five different genres, all published in 1956. A total of 140 million tokens, consisting of 40,000 different words (i.e., types), were analyzed in order to develop a frequency list of words possessing a frequency of at least nine. In 1976, the NLRI also published a kanji (character) frequency list based on a corpus published in 1966 derived from three major newspapers, Asahi, Yomiuri, and Mainichi. This corpus provided a total of 991,375 kanji tokens and a frequency list of 3,213 different kanji characters. This was the first attempt to analyze a Japanese corpus with computers, and the results were used to standardize and regulate the use of kanji characters for mass media and education in Japan. For the past three decades, researchers have used these lists as an informative resource for many language-related research projects. In 1997, the NLRI published its 1962 list in floppy disk format (NLRI, 1997).

However, several problems are associated with the use of the NLRI lists for empirical studies. First, the lists are based on dated media samples. Almost three or four decades have passed since the corpora of the 1962 and 1976 lists were collected. Consequently, the reliability of these lists is open to question, since the use of words or kanji in mass media and education may have changed. Second, the 1976 list does not identify low-frequency kanji. The 1962 word list does not contain words possessing a frequency less than nine, and the 1976 kanji list does not provide characters possessing a frequency less than nine. Low-frequency words or characters with the frequency of one are required for many empirical language studies. Third, the lists are not easily accessed, since both were available only in hard copy form until recently. Although the NLRI 1962 list is now available in floppy disk format, it is not as accessible, especially to re available over the Internet. It is crucial that these lists be accessible to researchers in computer database format to help make the control and selection of word and kanji frequency simpler, more effective, and more accurate.

In 1994, the situation for Japanese corpus linguistics changed for the better, when CD-ROM databases of newspaper articles became available at a relatively low cost. These computerized corpora of newspapers made it possible to develop an updated kanji frequency list accessible on computers and over the Internet.

The purpose of the present project is to develop a new kanji character frequency list to be made accessible through the Internet. In consideration of the word segmentation and other problems mentioned above, the authors decided to start by developing frequency lists of kanji characters before attempting kanji word frequency lists. Furthermore, the current usage of kanji in printed mass media is analyzed on the basis of a comparison with 1966 frequencies from the NLRI 1976 list.

Source of Data, Method of Analysis, and Results

The present corpus is available on a CD-ROM entitled which contains the text of articles appearing in a major newspaper, covering 1 year of morning and evening editions published in 1993. The data analysis was conducted on Unix workstations, with the program written by the authors in Perl and awk. First, headlines were excluded from the approximately 110,000 articles in the data. Second, all the printable characters were counted and ranked by frequency from highest to lowest in each category (e.g., kanji, hiragana, or katakana). Frequency ratio (%) and cumulative frequency ratio (%) were also calculated.

The corpus provided a total of 56,563,595 printable characters, making it, at the time, the largest corpus used for the compilation of Japanese word/character frequency lists. Ideally, in addition to newspapers, corpora would be collected from other printed texts, such as magazines and novels of various genres. However, most of these printed materials are not yet available in computer readable form. As a result, it is not currently feasible to gather sufficient amounts of data from these materials. of may pose an obstacle to making lists based on such materials freely available to the public.
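The counting and ranking step described in the method above can be sketched as follows. The authors' programs were written in Perl and awk and are not reproduced in this paper, so this Python version is only an assumed reconstruction of the general idea. The function name `frequency_table`, the tie-breaking rule, and the toy input are hypothetical.

```python
from collections import Counter


def frequency_table(characters):
    """Return rows of (rank, character, raw frequency, ratio %, cumulative %)."""
    counts = Counter(characters)
    total = sum(counts.values())
    # Highest frequency first; ties are broken by code point (an assumption,
    # since the tie-breaking rule of the original programs is not stated).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    running = 0
    for rank, (char, freq) in enumerate(ranked, start=1):
        running += freq
        rows.append((rank, char, freq, 100 * freq / total, 100 * running / total))
    return rows


# Toy input standing in for the kanji extracted from article bodies.
toy_kanji = "日日日本本人"
for row in frequency_table(toy_kanji):
    print(row)
```

Running the same function once per category (kanji, hiragana, katakana) would yield one ranked list for each, with the same four columns per character.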

The 56,563,595 characters in the 1993 collection of the Asahi newspaper form the basis of the present character frequency list and break down into the general categories shown in Table 1. A total of 23,408,236 kanji tokens were found, making up 4,476 different types. Kanji covered 41% of all the printable characters in the newspapers.
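The 41% figure follows directly from the totals reported in this paragraph. The short check below is only arithmetic on those published totals; the variable names are hypothetical, and the tokens-per-type average is an extra descriptive figure that the paper itself does not report.

```python
# Totals as reported in the text for the 1993 corpus.
TOTAL_PRINTABLE = 56_563_595  # all printable characters
KANJI_TOKENS = 23_408_236     # kanji occurrences
KANJI_TYPES = 4_476           # distinct kanji characters

# Share of kanji among all printable characters, in percent.
kanji_share = 100 * KANJI_TOKENS / TOTAL_PRINTABLE

# Average number of occurrences per distinct kanji (not a figure from the paper).
tokens_per_type = KANJI_TOKENS / KANJI_TYPES

print(f"kanji share: {kanji_share:.2f}%")
print(f"mean tokens per type: {tokens_per_type:.1f}")
```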

The central result of this survey is a list of the kanji char first 3,000 characters are listed in Appendix A with their ranking, raw frequency, frequency ratio (%), and cumulative frequency ratio (%). In addition, frequency lists of hiragana and katakana characters are also included in Appendices B and C, respectively, with the same headings.

Cumulative percentage of kanji character use. The cumulative frequency ratio of kanji characters, ranked from high to low frequency, is shown in Table 2. Although it has been conventionally said among learners and teachers of Japanese that knowledge of 2,000~3,000 kanji characters is required for one to read Japanese newspapers, this conventional wisdom was not supported by the present data. According to the results of the present analysis, the top 500 most frequent kanji characters accounted for approximately 80% of total kanji use. Furthermore, the top 1,600 most frequent characters covered 99% of total use and the rest, that is, the next 3,000 characters, made up only 1% of the total kanji use.

Table 1
Printable Character Type Totals
Characters    Types    Tokens    Frequency Ratio (%)

Discussion

In the present project, the authors introduced the first computer-based kanji character frequency lists derived from the largest corpus of Japanese newspaper texts. It is important to control the frequency of kanji characters often neglected this variable, owing to the unavailability of reliable frequency lists. Using the present list, the authors have compared kanji usage observed in data from 1966 and 1993. Although changes in language usage are often discussed in Japanese linguistic studies in terms of usage of kanji characters did not show any significant variation over the past 30 years. However, there are several issues left to discuss in the present study. First, the corpus of the present word list was

The cumulative frequency ratio obtained from the 1966 corpus (i.e., the NLRI 1976 list) is also indicated in Table 2. The results from the two corpora, which are 30 years apart, indicate almost identical ratios for each frequency level. Furthermore, according to the current frequency list, of the 500 most frequent characters ranked in 1966, 445 characters are also ranked among the 500 most frequent characters in the 1993 corpus, but all of those 55 characters lower than the first 500 fall within the 1,000 most frequent characters in the 1993 corpus. Thus, the high-frequency kanji characters that account for over 80% of the total kanji use have not changed much in the last 30 years.

use. To examine change in kanji usage over the past 30 years, a correlation analysis was conducted between the present 1993 list and the NLRI 1976 list based on the 1966 newspaper corpus. The raw frequencies of the 3,000 highest frequency characters in the present list and the equivalent characters in the NLRI 1976 list were converted to their base-10 logarithmic equivalents and submitted to a Pearson correlation coefficient analysis. The result indicates a high correlation (r = .95) and was significant [F(1, 3029) = 28,037.67, p < .01]. In Figure 1, although low-frequency characters scatter more than high-frequency characters, the distribution is close to linear. Thus, on the basis of the analysis, it appears that the overall pattern of kanji usage has not significantly changed over the past 30 years.
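The correlation analysis just described can be sketched in a few lines. The code below is an illustrative reconstruction, not the authors' program: the function names and the toy frequencies are hypothetical. The conversion from r to F assumes the standard simple-regression identity F(1, df) = r^2 / (1 - r^2) x df, with which the reported values (r = .95, df = 3029, F = 28,037.67) are consistent.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_frequency_correlation(freqs_a, freqs_b):
    """Correlate base-10 logarithms of paired raw frequencies."""
    # Frequencies must be positive, because log10(0) is undefined.
    log_a = [math.log10(f) for f in freqs_a]
    log_b = [math.log10(f) for f in freqs_b]
    return pearson_r(log_a, log_b)


def f_from_r(r, df_error):
    """F statistic with (1, df_error) degrees of freedom implied by r."""
    r_squared = r * r
    return r_squared / (1 - r_squared) * df_error


# Hypothetical paired frequencies for five characters in two corpora.
toy_1993 = [1200, 450, 90, 12, 3]
toy_1966 = [1000, 500, 60, 20, 2]
print(round(log_frequency_correlation(toy_1993, toy_1966), 3))
print(round(f_from_r(0.95, 3029), 2))
```

Taking logarithms first keeps a handful of very frequent characters from dominating the coefficient, which is one common motivation for this transformation with frequency data.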

Table 2
Cumulative Frequency Ratio (%)

Rank      1993     1966
10       10.00    10.61
50       27.41    27.64
100      40.71    40.15
200      57.02    56.05
500      80.68    79.42
1,000    94.56    93.88
1,500    98.63    98.40
2,000    99.72    99.63
2,500    99.92    99.90
         99.97    99.98
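The tabulated cumulative ratios can also be compared programmatically. The sketch below simply transcribes the nine rank levels given in the table together with their 1993 and 1966 values. The function names are hypothetical, and the lookup works only at the tabulated rank levels, so it gives a coarser answer than the full frequency list would.

```python
# Rank levels and cumulative frequency ratios (%) transcribed from the table.
RANKS = [10, 50, 100, 200, 500, 1000, 1500, 2000, 2500]
CUM_1993 = [10.00, 27.41, 40.71, 57.02, 80.68, 94.56, 98.63, 99.72, 99.92]
CUM_1966 = [10.61, 27.64, 40.15, 56.05, 79.42, 93.88, 98.40, 99.63, 99.90]


def largest_gap(ranks, series_a, series_b):
    """Return (rank, gap) where the two series differ most, in percentage points."""
    gaps = [(abs(a - b), rank) for rank, a, b in zip(ranks, series_a, series_b)]
    gap, rank = max(gaps)
    return rank, round(gap, 2)


def first_rank_reaching(ranks, cumulative, target):
    """Smallest tabulated rank whose cumulative ratio meets the target, if any."""
    for rank, value in zip(ranks, cumulative):
        if value >= target:
            return rank
    return None


print(largest_gap(RANKS, CUM_1993, CUM_1966))
print(first_rank_reaching(RANKS, CUM_1993, 80))
print(first_rank_reaching(RANKS, CUM_1966, 80))
```

Because only tabulated levels are searched, a target such as 99% resolves to the next listed rank rather than to the finer figure of about 1,600 characters quoted in the text.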
