Behavior Research Methods, Instruments, & Computers
2000, 32 (3), 482-500
A Japanese logographic character frequency list for cognitive science research
NOBUKO CHIKAMATSU
DePaul University, Chicago, Illinois
SHOICHI YOKOYAMA
National Language Research Institute of Japan, Tokyo, Japan
HIRONARI NOZAKI
Aichi University of Education, Kariya, Japan
ERIC LONG
National Language Research Institute of Japan, Tokyo, Japan
and SACHIO FUKUDA
This paper describes a Japanese logographic character (kanji) frequency list, which is based on an analysis of the largest recently available corpus of Japanese words and characters. This corpus comprised a full year of morning and evening editions of a major newspaper, containing more than 23 million kanji characters and more than 4,000 different kanji characters. This paper lists the 3,000 most frequent kanji characters, as well as an analysis of kanji usage and correlations between the present list and previous Japanese frequency lists. The authors believe that the present list will help researchers more accurately and efficiently control the selection of kanji characters in cognitive science research and interpret related psycholinguistic data.

In many empirical psycholinguistic studies, word frequency is used as an independent variable to select materials having desired frequency characteristics or as a control variable to match two or more sets of materials in order to minimize performance differences attributable to word frequency effects in word recognition, memory, [...] psycholinguistic research to focus on the frequency effects of linguistic units smaller than words, such as letter clusters (e.g., bigrams or trigrams), syllabic-type units (syllable vs. nonsyllable), morpheme units, as well as position [...] (Appleman & Mayzner, 1981; Grainger & Jacobs, 1993; Srinivas, Roediger, & Rajaram, 1992). For example, logographic character frequency is a crucial factor to consider in word experiments using languages with logographic scripts, such as Chinese, Japanese, or Korean, where each logographic character may function as a word (Matsunaga, 1996). In short, it is important to carefully control the frequency of printed characters and/or words when empirical psycholinguistic studies are conducted.

N. [...] 802 West Belden Avenue, Chicago, IL 60614 (e-mail: nchikama@condor.depaul.edu).

In the past, compiling linguistic corpora was an ex[...] concerns caused by human error. However, as computer technology continues to develop, researchers are obtaining more reliable linguistic corpora and compiling word or character frequency lists on the basis of these corpora for linguistic or cognitive science research. For American English, some widely used word frequency lists are the Brown corpus (Kucera & Francis, 1967), the American Heritage Word Frequency Book (Carroll, Davies, & Richman, 1971), and the Thorndike-Lorge count (Thorndike & Lorge, 1944; see the summary in Solso, Juel, & Rubin, 1982). Many of these corpora and lists are available in computer database form and/or through the Internet. Consequently, researchers may use these corpora and lists to control word frequency in empirical language research more easily, efficiently, and accurately than in the past. [...] remain limited in number or are still under development (Edwards, 1993; Leech & Fligelstone, 1992).

Word and Character Frequency Lists in Japanese
Over the last two decades, researchers in the area of experimental psychology, especially word recognition and memory, have increasingly focused on the Japanese language, owing to the uniqueness of its writing system (Kess & Miyamoto, 1994; Paradis, Hagiwara, & Hildebrandt, 1985; Yokoyama, 1997). In particular, kanji (characters in a logographic script that is one of three scripts used in writing Japanese) has been widely used in experimental materials in order to examine new aspects of cognitive [...]ment in the acquisition and usage of language. However, although many studies have been conducted that use Japanese words, the development of Japanese word frequency lists or kanji character frequency lists has not kept up with the demand for such lists. As a result, for example, the kanji character or word frequency of selected kanji items has often not been controlled or mentioned in Japanese word recognition studies when frequency has not been used as a dependent variable (e.g., Eko & Nakamizo, 1989; Flores d'Arcais & Saito, 1993; Flores d'Arcais, Saito, Kawakami, & Masuda, 1994; Kikuchi, 1996; Morton, Sasanuma, Patterson, & Sakuma, 1992; Nagae, 1994; Naito & Komatsu, 1989; Osaka, 1992; Sekiguchi & Abe, 1992; Wang, 1988; Yokosawa & Shimomura, 1993). In many other studies, the frequency of kanji characters or words is controlled on the basis of (1) the researcher's subjective, intuitive judgment (e.g., Flores d'Arcais, Saito, & Kawakami, 1995; Hatta, Koike, & Langman, 1994; Shimomura & Yokosawa, 1991), (2) the examinee's judgment, such as subjects' rating on selected items (e.g., Yamada, Mitarai, & Yoshida, 1991), (3) the categorization of kanji characters standardized by the Japanese Ministry of Education (e.g., Kyoiku kanji or Gakushu kanji;1 see, e.g., Hayashi, 1988; Hirose, 1992; Nakagawa, 1994; Sakuma, Itoh, & Sasanuma, 1989), (4) lists compiled by examiners themselves (e.g., Wydell, Butterworth, & Patterson, 1995; Wydell, Patterson, & Humphreys, 1993), or (5) the National Language Research Institute's (NLRI's) 1962 or 1976 word/character frequency lists (Cabeza, 1995; Morikawa, 1985; Naito & Komatsu, 1988; Sasanuma, Sakuma, & Kitano, 1992; Tsuzuki, 1993).

Copyright 2000 Psychonomic Society, Inc.

One of the main impediments to the development of Japanese word frequency lists is that the electronic representation of Japanese characters is more complicated than that of alphabetical languages. At present, there are [...] characters (i.e., kana and kanji): Japanese Industrial Standards (JIS), Shift-JIS (SJIS), and Extended Unix Code (EUC). Generally, EUC is used in Unix workstations on the Internet, whereas JIS is used for Japanese electronic mail. However, SJIS has been adopted for use with per[...]tems are used across tasks or methods, one must transfer one character code to another, using a converter such as the network kanji code conversion filter (NKF).

Another factor impeding the development of Japanese frequency lists is the Japanese writing system itself, which comprises three types of orthographies: hiragana, katakana, and kanji. Hiragana and katakana are syllabic scripts in which each symbol represents a sound unit (a syllable). These scripts each contain 46 basic forms, with additional diacritic and historical forms giving a total of 83 hiragana and 86 katakana forms encoded in JIS and EUC. Hiragana and katakana share the same syllabic sound representation and can be transcribed one by the other (e.g., a syllable /sa/ is transcribed as [...] in hiragana and [...]). [...] kanji, is a logographic script adopted from the Chinese language, in which each symbol represents meaning and functions as a morpheme. A single kanji character may represent an independent word (e.g., 本 /hon/, book) or part of a word (e.g., 日本 /nihon/, Japan). The meaning of each constituent (i.e., a single character) in a kanji word is sometimes less clear or transparent than that of an independent word. Owing to the manner in which kanji characters were transferred from the Chinese to the Japanese language over the centuries, a single kanji character may have obtained more than one reading and may be pronounced in several different ways. For instance, the character 頭, which means head, is read as /to/, [...] a great number of homophones (i.e., kanji characters that share a common pronunciation but represent different meanings) occur in Japanese kanji usage. For instance, [...] and many others are all pronounced /ki/. Thus, in contrast to both hiragana and katakana, kanji characters do not have a systematic sound representation or a one-to-one relationship between sound and symbol. The number of kanji characters is quite large and practically uncountable (i.e., kanji dictionaries may contain between 12,000 and 50,000 entries; Kindaichi, 1991; Morohashi, 1989).

Among hiragana, katakana, and kanji, usually only one is conventionally chosen and used to write a given Japanese word. Hiragana is used primarily for words that have a grammatical function, such as particles or case markers, and for some native Japanese content words. Katakana is used for loan words (i.e., words mainly borrowed from western languages, such as English, French, and Portuguese). Kanji is used for content words, such as nouns, verbs, and adjectives. Thus, a single Japanese sentence is usually written with all three scripts combined. However, choice of script is not always consistent and may vary, depending on a writer's intention or a publisher's guidelines for style. For instance, the Japanese word meaning egg, pronounced /tamago/, could be written as たまご in hiragana in one context, but as [...] in kanji in another context. In their study of the subjective frequency of script type, Ukita, Sugishima, Minagawa, Inoue, and Kashu (1996) studied 750 Japanese words that can be written in more than one Japanese script. The study was [...] judge whether a given word (written in a given script) is seen often, occasionally, or rarely. The results showed that more than half of the tested words were identified as words seen in more than one script. This inconsistency in orthographic representation makes word counting in Japa[...] word segmentation in Japanese is more complex than that in English, since word boundaries are not separated with spaces in written texts. A single kanji character could be a morpheme of a part of a word or a word by itself and may be pronounced differently, depending on the context. Without clear word boundaries, compound words are easily formed. Thus, complications in word counting and segmentation present nontrivial challenges for those compiling word frequency lists in Japanese.

Owing to these problems, few have attempted to make word and kanji character frequency lists in Japanese. The NLRI in Japan published a word frequency list in 1962, based on a corpus derived from 90 different journals and magazines with five different genres, all published in 1956. A total of 140 million tokens, consisting of 40,000 different words (i.e., types), were analyzed in order to develop a frequency list of words possessing a frequency of at least nine. In 1976, the NLRI also published a kanji (character) frequency list based on a corpus published in 1966 derived from three major newspapers, Asahi, Yomiuri, and Mainichi. This corpus provided a total of 991,375 kanji tokens and a frequency list of 3,213 different kanji characters. This was the first attempt to analyze a Japanese corpus with computers, and the results were used to standardize and regulate the use of kanji characters for mass media and education in Japan. For the past three decades, researchers have used these lists as an informative resource for many language-related research projects. In 1997, the NLRI published its 1962 list in floppy disk format (NLRI, 1997).

However, several problems are associated with the use of the NLRI lists for empirical studies. First, the lists are based on dated media samples. Almost three or four decades have passed since the corpora of the 1962 and 1976 lists were collected. Consequently, the reliability of these lists is open to question, since the use of words or kanji in mass media and education may have changed. Second, the 1976 list does not identify low-frequency kanji. The 1962 word list does not contain words possessing a frequency less than nine, and the 1976 kanji list does not provide characters possessing a frequency less than nine. Low-frequency words or characters with the frequency of one are required for many empirical language studies. Third, the lists are not easily accessed, since both were available only in hard copy form until recently. Although the NLRI 1962 list is now available in floppy disk format, it is not as accessible, especially to re[...] available over the Internet. It is crucial that these lists be accessible to researchers in computer database format to help make the control and selection of word and kanji frequency simpler, more effective, and more accurate.

In 1994, the situation for Japanese corpus linguistics changed for the better, when CD-ROM databases of newspaper articles became available at a relatively low cost. These computerized corpora of newspapers made it possible to develop an updated kanji frequency list accessible on computers and over the Internet.

The purpose of the present project is to develop a new kanji character frequency list to be made accessible through the Internet. In consideration of the word segmentation and other problems mentioned above, the authors decided to start by developing frequency lists of kanji characters before attempting kanji word frequency lists. Furthermore, the current usage of kanji in printed mass media is analyzed on the basis of a comparison with 1966 frequencies from the NLRI 1976 list.

Source of Data, Method of Analysis, and Results
The present corpus is available on a CD-ROM entitled [...] which contains the text of articles appearing in a major newspaper, covering 1 year of morning and evening editions published in 1993. The data analysis was conducted on Unix workstations, with the program written by the authors in Perl and awk. First, headlines were excluded from the approximately 110,000 articles in the data. Second, all the printable characters were counted and ranked by frequency from highest to lowest in each category (e.g., kanji, hiragana, or katakana). Frequency ratio (%) and cumulative frequency ratio (%) were also calculated.

The corpus provided a total of 56,563,595 printable characters, making it, at the time, the largest corpus used for the compilation of Japanese word/character frequency lists. Ideally, in addition to newspapers, corpora would be collected from other printed texts, such as magazines and novels of various genres. However, most of these printed materials are not yet available in computer-readable form. As a result, it is not currently feasible to gather sufficient amounts of data from these materials. [...] of [...] may pose an obstacle to making lists based on such materials freely available to the public.

The 56,563,595 characters in the 1993 collection of the Asahi newspaper form the basis of the present character frequency list and break down into the general categories shown in Table 1. A total of 23,408,236 kanji tokens were found, making up 4,476 different types. Kanji covered 41% of all the printable characters in the newspapers.
The central result of this survey is a list of the kanji char[...] first 3,000 characters are listed in Appendix A with their ranking, raw frequency, frequency ratio (%), and cumulative frequency ratio (%). In addition, frequency lists of hiragana and katakana characters are also included in Appendices B and C,7 respectively, with the same headings.
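
The counting procedure described above (count every printable character, rank the characters within each script category, then compute a frequency ratio and a cumulative frequency ratio) can be sketched as follows. This is a minimal illustration in Python, not the authors' Perl and awk program. The names `script_of`, `frequency_table`, and `coverage` are hypothetical, and classifying characters by Unicode block is an assumption made here for convenience; the original analysis worked with JIS-family character codes.

```python
from collections import Counter


def script_of(ch):
    """Classify one character by Unicode block (an assumption of this sketch).

    Only the main CJK Unified Ideographs block and the basic kana letters
    are covered; extension blocks and marks such as U+3005 are ignored.
    """
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if 0x3041 <= cp <= 0x3096:
        return "hiragana"
    if 0x30A1 <= cp <= 0x30FA:
        return "katakana"
    return "other"


def frequency_table(text, script="kanji"):
    """Return rows of (rank, char, raw freq, freq ratio %, cumulative ratio %)."""
    counts = Counter(ch for ch in text if script_of(ch) == script)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    # Highest frequency first; ties are broken by code point so the
    # ranking is deterministic (the tie rule is an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for rank, (ch, freq) in enumerate(ordered, start=1):
        cumulative += freq
        rows.append((rank, ch, freq,
                     100.0 * freq / total,
                     100.0 * cumulative / total))
    return rows


def coverage(rows, top_n):
    """Cumulative frequency ratio (%) reached by the top_n ranked characters."""
    if not rows or top_n <= 0:
        return 0.0
    return rows[min(top_n, len(rows)) - 1][4]
```

With a table built this way, `coverage(rows, 500)` would give the kind of figure reported for the 500 most frequent kanji, assuming `rows` was produced from the full corpus text.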
Table 1
Printable Character Type Totals
Columns: Characters, Types, Tokens, Frequency Ratio (%)
[...]

Cumulative percentage of kanji character use. The cumulative frequency ratio of kanji characters, ranked from high to low frequency, is shown in Table 2. Although it has been conventionally said among learners and teachers of Japanese that knowledge of 2,000-3,000 kanji characters is required for one to read Japanese newspapers, this conventional wisdom was not supported by the present data. According to the results of the present analysis, the top 500 most frequent kanji characters accounted for approximately 80% of total kanji use. Furthermore, the top 1,600 most frequent characters covered 99% of total use, and the rest (that is, the next 3,000 characters) made up only 1% of the total kanji use.

The cumulative frequency ratio obtained from the 1966 corpus (i.e., the NLRI 1976 list) is also indicated in Table 2. The results from the two corpora, which are 30 years apart, indicate almost identical ratios for each frequency level. Furthermore, according to the current frequency list, of the 500 most frequent characters ranked in 1966, 445 characters are also ranked among the 500 most frequent characters in the 1993 corpus, and all of the 55 characters that now rank below the first 500 fall within the 1,000 most frequent characters in the 1993 corpus. Thus, the high-frequency kanji characters that account for over 80% of the total kanji use have not changed much in the last 30 years.

Table 2
Cumulative Frequency Ratio (%)
Top n characters    1993     1966
10                  10.00    10.61
50                  27.41    27.64
100                 40.71    40.15
200                 57.02    56.05
500                 80.68    79.42
1,000               94.56    93.88
1,500               98.63    98.40
2,000               99.72    99.63
2,500               99.92    99.90
[...]               99.97    99.98

[...] use. To examine change in kanji usage over the past 30 years, a correlation analysis was conducted between the present 1993 list and the NLRI 1976 list based on the 1966 newspaper corpus. The raw frequencies of the 3,000 highest frequency characters in the present list and the equivalent characters in the NLRI 1976 list were converted to their base-10 logarithmic equivalents and submitted to a Pearson correlation coefficient analysis. The result indicates a high correlation (r = .95) and was significant [F(1, 3029) = 28,037.67, p < .01]. In Figure 1, although low-frequency characters scatter more than high-frequency characters, the distribution is close to linear. Thus, on the basis of the analysis, it appears that the overall pattern of kanji usage has not significantly changed over the past 30 years.

Discussion
In the present project, the authors introduced the first computer-based kanji character frequency lists derived from the largest corpus of Japanese newspaper texts. It is important to control the frequency of kanji characters [...] often neglected this variable, owing to the unavailability of reliable frequency lists. Using the present list, the authors have compared kanji usage observed in data from 1966 and 1993. Although changes in language usage are often discussed in Japanese linguistic studies in terms of [...] usage of kanji characters did not show any significant variation over the past 30 years. However, there are several issues left to discuss in the present study. First, the corpus of the present word list was [...]
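
The comparison between the 1993 and 1966 lists rests on a Pearson correlation of base-10 log frequencies. The sketch below shows one way such a computation could be set up in Python. It is an illustration under stated assumptions, not the authors' procedure: the function names `log10_pearson` and `f_from_r` are hypothetical, the inputs are assumed to be dictionaries mapping each character to its raw frequency, and characters missing from either list are simply dropped (the text here does not say how such cases were handled). The F statistic uses the standard simple-regression identity F = r²·df / (1 − r²); with r = .95 and 3,029 error degrees of freedom this identity gives about 28,037.67, which agrees with the reported value.

```python
import math


def log10_pearson(freq_a, freq_b):
    """Pearson r between base-10 log frequencies of characters in both lists.

    freq_a and freq_b map a character to its raw frequency. Characters
    absent from either list, or with a zero count, are skipped (an
    assumption of this sketch).
    """
    shared = [c for c in freq_a
              if freq_a[c] > 0 and freq_b.get(c, 0) > 0]
    if len(shared) < 3:
        raise ValueError("need at least three shared characters")
    xs = [math.log10(freq_a[c]) for c in shared]
    ys = [math.log10(freq_b[c]) for c in shared]
    n = len(shared)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("log frequencies have zero variance")
    return sxy / math.sqrt(sxx * syy)


def f_from_r(r, df_error):
    """F(1, df_error) corresponding to a Pearson r in simple regression."""
    r2 = r * r
    if r2 >= 1.0:
        return math.inf
    return r2 * df_error / (1.0 - r2)
```

Taking logarithms before correlating keeps the handful of extremely frequent characters from dominating the coefficient, which is consistent with the near-linear scatter described for Figure 1.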