English Wikipedia: Chinese character frequency

Chinese character frequency is the frequency with which individual characters are used in written Chinese. It is calculated on a corpus, i.e., a collection of texts representing one or more languages. The frequency of a character is the ratio of the number of its occurrences to the total number of characters in the corpus, given by the formula

$F_i = \dfrac{n_i}{N}$,

where $n_i$ is the number of times a certain ($i$-th) Chinese character appears in the corpus, and $N$ is the total number of (occurrences of) characters in the corpus.
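
As an illustration of the ratio defined above, the following minimal sketch counts characters in a short string and divides each count by the total. The function names, the sample sentence, and the decision to count only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are assumptions made for this example, not part of any cited study.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block.

    Assumption: punctuation, digits and Latin letters are excluded from counts.
    """
    return "\u4e00" <= ch <= "\u9fff"


def character_frequencies(text: str) -> dict[str, float]:
    """Map each Chinese character to (its occurrences) / (total characters)."""
    counts = Counter(ch for ch in text if is_cjk(ch))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical one-sentence "corpus", for illustration only.
sample = "我的书是我的，不是他的。"
freqs = character_frequencies(sample)

# Print characters from most to least frequent.
for ch, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(ch, round(f, 3))
```

A real study would run the same computation over a corpus of hundreds of thousands of characters or more; the arithmetic is unchanged.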

Chinese character frequency is fundamental to the quantitative linguistics of Chinese, and serves as a reference for Chinese language teaching and information processing.

Origins

The first person to make a serious statistical study of the frequency of Chinese characters was Chen Heqin. In the 1920s, he and his assistants spent over two years manually counting and comparing the characters in a corpus of six categories of texts. The corpus contained 554,478 characters in total, in 4,261 different character forms. They then compiled a book entitled Applied Lexis of Vernacular Chinese. The 10 most frequently used characters in their corpus are, in descending order of frequency,

的 (of), 不 (no, not), 一 (one, a(n)), 了 (grammatical particle), 是 (to be), 我 (I/me), 上 (on, up), 他 (he/him), 有 (to have), 人 (person).

A trans-regional diachronic survey

In 2001, the Chinese University of Hong Kong (CUHK) published a number of frequency lists on the Web, entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a Trans-regional Diachronic Survey". The frequency data came from a large corpus made up of sub-corpora representing the Chinese language in the three regions of Hong Kong, Mainland China and Taiwan, and in the two time periods of the 1960s and the 1980s/1990s. Each sub-corpus consists of approximately 660,000 characters, making a total of 3,970,514 characters for the whole corpus. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists.

From the data of these frequency lists, some important and interesting features of Chinese can be discovered:

的, 一 and 是 are the three most frequently-used characters across the regions and time periods of the corpora, and 的 is number one in all the frequency lists.
The 10 most frequently-used characters across the three regions and two time periods are very consistent. That means a frequently-used character in one region or period is very likely to be frequently-used in another region or period.
The 100 most frequently-used characters in the 1980s/1990s cover (i.e., have an accumulated frequency of) 41.00% of the Hong Kong texts of that period, 41.34% of the Mainland texts, and 41.88% of the Taiwan texts. That is more than 4 out of every 10 characters in each of the three regions.
The 1,000 most frequently-used characters in the 1980s/1990s cover 89.25% of the Hong Kong texts of that period, 90.26% of the Mainland texts, and 88.74% of the Taiwan texts.
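
The coverage figures in the list above are accumulated frequencies of the top n characters. A minimal sketch of that computation follows; the function name and the toy counts are hypothetical and are not taken from the CUHK lists.

```python
from collections import Counter


def coverage_of_top(counts: Counter, n: int) -> float:
    """Share of all character occurrences covered by the n most frequent characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = counts.most_common(n)  # list of (character, count), highest count first
    return sum(c for _, c in top) / total


# Hypothetical toy counts (100 occurrences in total), for illustration only.
toy_counts = Counter({"的": 40, "一": 20, "是": 15, "不": 10, "人": 10, "中": 5})

print(coverage_of_top(toy_counts, 1))
print(coverage_of_top(toy_counts, 3))
```

With real frequency lists, calling the function with n = 100 and n = 1,000 would produce figures comparable to the percentages quoted for each region.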

The top 10 characters in the frequency lists for the three regions in the 1980s/1990s are:

Hong Kong: 的，一，是，不，人，有，在，了，我，中;
Taiwan:    的，一，是，不，人，在，有，我，了，中;
Mainland:  的，一，是，了，不，在，有，人，我，他.

More information can be found in the English Users' Guide on the survey's home page.

Frequencies in different divisions

Most of the frequency studies described above concern the general usage of Chinese characters. Frequencies can also be computed for a particular domain, such as news reporting, literature and art, or information technology.

There are also frequency lists for linguistic divisions. Polyphonic characters may be counted separately by pronunciation, for example, the frequencies for 的 (de), 的 (di1), 的 (di2) and 的 (di4). Polysemous characters may be counted separately by meaning, for example, 里 (裡/裏, inside) and 里 (里, 0.5 km). There are also frequencies for different parts of speech, for example, 花 (n) and 花 (v), or for a combination of the above divisions.

Application of frequency statistics

Chinese character frequency is essential to quantitative research on Chinese characters, and has been applied to language teaching, dictionary compilation, the compilation of character lists, Chinese character information processing, etc.

Chinese character utility decline rate

The use of Chinese characters is concentrated mainly on frequently used characters. Zhou Youguang summarized the Chinese character utility decline rate based on frequency statistics from various sources. Its basic content is:

The 1,000 most frequently used characters cover about 90% of the corpus, which means the missing rate is about 10%. For every additional 1,400 next most frequent characters, the missing rate falls to 10% of its previous value. For example, the missing rate of the 1,000 + 1,400 = 2,400 most frequently used characters is approximately 10% × 10% = 1% of the corpus, so the coverage rate is 99%. The missing rate of the 2,400 + 1,400 = 3,800 most frequently used characters is about 1% × 10% = 0.1%, and the coverage rate is 99.9%. The rule is also supported by later experimental results, such as:

**Coverage rates of the most frequently-used n characters on a corpus of 4,868 different characters**
Top n characters	Cumulative occurrences	Coverage (%)
100	782,866	42.14
500	1,439,352	77.48
1,000	1,681,228	90.50
2,000	1,817,047	97.81
3,000	1,848,648	99.51
4,000	1,856,226	99.92
4,868	1,857,660	100
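
To see how closely the table follows the utility decline rate, the sketch below compares the observed coverage with the rule's prediction. The rule is stated only at 1,000, 2,400, 3,800, ... characters, so the exponential interpolation between those points is an assumption of this example, as are the function and variable names.

```python
def predicted_missing_rate(n_chars: int) -> float:
    """Missing rate predicted by the rule for the top n_chars characters.

    Assumption: exponential interpolation between the rule's anchor points
    (1,000, 2,400, 3,800, ...). The rule itself is stated only at the anchors.
    """
    if n_chars < 1000:
        raise ValueError("the rule is stated only from 1,000 characters upward")
    return 0.10 * 10 ** (-(n_chars - 1000) / 1400)


# (top-n characters, cumulative occurrences), copied from the table.
table = [(1000, 1_681_228), (2000, 1_817_047), (3000, 1_848_648), (4000, 1_856_226)]
total_occurrences = 1_857_660  # occurrences covered by all 4,868 characters

for n, occurrences in table:
    observed = occurrences / total_occurrences
    predicted = 1 - predicted_missing_rate(n)
    print(f"{n:>5}  observed {observed:.2%}  predicted {predicted:.2%}")
```

Under this interpolation the predicted coverage stays within about one percentage point of the observed values at each of the four rows.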

Decreasing rate of frequently-used character strokes

The basic content of the decreasing rate of frequently-used character strokes is:

The application rate of a character is inversely related to its number of strokes, that is, characters with high application rates have fewer strokes on average. This is supported by the data in the article on stroke numbers: according to its second and third tables, the average number of strokes of the 3,500 frequently-used characters is 9.74, and the average number of strokes of the 7,000 commonly-used characters (a superset of the 3,500 characters) is 10.75. Generally speaking, then, frequently-used characters have fewer strokes than less frequently-used characters.

The reason is convenience of writing. If a character with many strokes is used frequently, people will try to simplify it. If there are multiple variant characters with the same function, then, other things being equal, the one with fewer strokes is more likely to be used.

Distribution rate and application rate

When determining the importance of a character, it is often necessary to consider its distribution rate in addition to its frequency of use. The formula for calculating the distribution rate is

$D_i = \dfrac{t_i}{T}$,

where $D_i$ is the distribution rate of character or word $i$, $t_i$ is the number of texts in which the character or word appears, and $T$ is the total number of texts in the corpus.
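
A minimal sketch of the distribution rate follows, treating the corpus as a list of text strings. The function name and the three sample texts are hypothetical.

```python
def distribution_rate(char: str, texts: list[str]) -> float:
    """Fraction of texts in which `char` appears at least once."""
    if not texts:
        raise ValueError("corpus must contain at least one text")
    containing = sum(1 for text in texts if char in text)
    return containing / len(texts)


# Hypothetical three-text corpus, for illustration only.
texts = ["我是学生", "他不是学生", "我有一本书"]

print(distribution_rate("是", texts))  # appears in 2 of the 3 texts
print(distribution_rate("书", texts))  # appears in 1 of the 3 texts
```

A character that occurs many times but only in one text gets a high frequency and a low distribution rate, which is why the two measures are considered together.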

Application rate is a combination of distribution rate and frequency. A newer formula for calculating it is:

$U_i = \dfrac{F_i D_i}{\sum_{j=1}^{n} F_j D_j}$

where $U_i$ is the application rate of character $i$, $F_i$ is the frequency of character $i$, $D_i$ is the distribution rate of character $i$, and $n$ is the total number of different characters. With this normalization, the cumulative application rate approaches 1 as more characters are included, and equals 1 over all $n$ characters.
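
The application rate can be sketched by combining frequency and distribution rate and then normalizing. The function name and sample texts below are hypothetical, and the sketch assumes the texts contain only Chinese characters (no punctuation filtering is done).

```python
from collections import Counter


def application_rates(texts: list[str]) -> dict[str, float]:
    """Compute frequency * distribution rate for each character, normalized to sum to 1.

    Assumption: every symbol in `texts` counts as a character.
    """
    counts = Counter(ch for text in texts for ch in text)
    total = sum(counts.values())
    if total == 0:
        return {}
    weighted = {}
    for ch, n in counts.items():
        frequency = n / total                                       # F_i
        distribution = sum(ch in t for t in texts) / len(texts)     # D_i
        weighted[ch] = frequency * distribution
    norm = sum(weighted.values())                                   # sum of F_j * D_j
    return {ch: w / norm for ch, w in weighted.items()}


# Hypothetical corpus: 的 is frequent and widespread, 书 is rare and local.
sample_texts = ["的的的的", "的人", "人书"]
rates = application_rates(sample_texts)
for ch, u in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(ch, round(u, 4))
```

Because of the division by the sum of all weighted values, the rates returned for a corpus always add up to 1.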

Application in media

Large-scale surveys by the Ministry of Education and the State Language Commission of the PRC over the years have shown that the use of Chinese characters and words has a strong distribution pattern. The number of different characters used in modern Chinese is stable at about 10,000, and the number of different words has stabilized at around 2.3 million.

The number of most frequently-used characters needed for a coverage rate of 80%, 90%, and 99% is about 590, 960, and 2,400 respectively. The number of words needed for coverage rates of 80%, 90%, and 95% is about 4,800, 14,000, and 30,000. Words whose frequency of use changes most from the previous year reflect the hot topics of social life and media attention in that year.
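
Figures such as "about 590 characters for 80% coverage" answer the inverse question: how many of the most frequent characters are needed to reach a target coverage. A minimal sketch, with a hypothetical function name and toy counts unrelated to the survey data:

```python
from collections import Counter
from itertools import accumulate


def characters_needed(counts: Counter, target: float) -> int:
    """Smallest n such that the n most frequent characters cover at least `target`."""
    total = sum(counts.values())
    # Running totals of occurrences, taking characters from most to least frequent.
    running = accumulate(c for _, c in counts.most_common())
    for n, covered in enumerate(running, start=1):
        if covered / total >= target:
            return n
    raise ValueError("target coverage cannot be reached")


# Hypothetical toy counts (100 occurrences in total), for illustration only.
survey_counts = Counter({"的": 40, "一": 20, "是": 15, "不": 10, "人": 10, "中": 5})

for target in (0.80, 0.90, 0.99):
    print(target, characters_needed(survey_counts, target))
```

The same function applies unchanged to word counts, which is how the word-level thresholds would be obtained.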

