English Wikipedia: International Corpus of English

The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

History

Sidney Greenbaum's goal of compiling corpora that would allow the syntax of world English to be compared became the ICE project, which was realized by Professor Charles F. Meyer. Greenbaum envisaged international teams of researchers collecting comparable samples of national varieties of English, both written and spoken.^[1] Varieties such as British English, American English, and Indian English would each be represented by a computer corpus.^[2] Researchers use the corpora to compare the syntax of these varieties.^[3] Once complete, the ICE corpora are intended to support comprehensive linguistic analysis of the varieties of English that have emerged.^[4] Research for ICE is carried out by international teams in many regions.^[5]

The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989.^[6] For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team follows a common corpus design, as well as a common scheme for grammatical annotation.

Description

Each corpus contains one million words in 500 texts of approximately 2,000 words each,^[7] following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, ICE derives the majority of its texts from spoken data.

With only one million words per corpus, ICE corpora are considered very small by modern standards.^[8] Each ICE corpus contains 60% (600,000 words) of orthographically transcribed spoken English. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription distinguishes ICE from many other corpora, including those containing, for example, parliamentary or legal paraphrases.

The corpora consist entirely of data from 1990 or later. The subjects from whom the data was collected are all adults who were educated in English and were either born in, or moved at an early age to, the country to which their data is attributed.^[7] There are speech and text samples from both men and women of many age groups, but the corpus website notes: "The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields."^[7]

The British component of ICE, ICE-GB, is fully parsed with a detailed Quirk et al.^[9] phrase structure grammar, and the analyses have been thoroughly checked and completed. This analysis includes part-of-speech tagging and parsing of the entire corpus. The treebank can be searched and explored with the ICE Corpus Utility Program (ICECUP) software. More information is available in the handbook.^[10]

To ensure compatibility between the individual corpora in ICE, each team is following a common corpus design, as well as a common scheme for grammatical annotation.^[11] Many corpora are currently available for download on the ICE official webpage, though some require a license. Others, however, are not ready for publication.^[12]

Textual and Grammatical Annotation

Researchers and linguists follow specific guidelines when annotating data for the corpus, set out in the International Corpus of English manuals and documentation. The three levels of annotation are textual markup, word-class tagging, and syntactic parsing.^[13]

Textual Markup

Original markup and layout, such as sentence and paragraph breaks, are preserved, with special markers indicating that they are original. Spoken data is transcribed orthographically, with indicators for hesitations, false starts, and pauses.^[13]

Word Class Tagging

Word classes, also called parts of speech, are grammatical categories for words based upon their function in a sentence.

British texts are automatically tagged for wordclass by the ICE tagger, developed at University College London, which uses a comprehensive grammar of the English language.

All other varieties are tagged automatically using the Penn Treebank and CLAWS tagsets. While the tags are not corrected manually, they are checked regularly for quality.^[13]

Syntactic Parsing

The sentences are parsed automatically and, if necessary, manually corrected with ICECUP, a syntax tree editor created specifically for the corpus.

Dependency parsing is also done automatically, with the Pro3Gres dependency parser. The results are not manually verified.^[13]

Pragmatic Parsing

Ireland is currently the only participant country that includes pragmatic annotation in its data.

Design of the Corpora

Below are the subsections of each ICE corpus, with the number of texts for each category and sub-category in parentheses.^[7]

Spoken (300)
Dialogues (180)	Private (100)	Face-to-face conversations (90), Phone calls (10)
Dialogues (180)	Public (80)	Classroom Lessons (20), Broadcast Discussions (20), Broadcast Interviews (10), Parliamentary Debates (10), Legal cross-examinations (10), Business Transactions (10)
Monologues (120)	Unscripted (70)	Spontaneous commentaries (20), Unscripted Speeches (30), Demonstrations (10), Legal Presentations (10)
Monologues (120)	Scripted (50)	Broadcast News (20), Broadcast Talks (20), Non-broadcast Talks (10)

Written (200)
Non-Printed (50)	Student Writing (20)	Student Essays (10), Exam Scripts (10)
Non-Printed (50)	Letters (30)	Social Letters (15), Business Letters (15)
Printed (150)	Academic Writing (40)	Humanities (10), Social Sciences (10), Natural Sciences (10), Technology (10)
Printed (150)	Popular Writing (40)	Humanities (10), Social Sciences (10), Natural Sciences (10), Technology (10)
Printed (150)	Reportage (20)	Press news reports (20)
Printed (150)	Instructional Writing (20)	Administrative Writing (10), Skills/hobbies (10)
Printed (150)	Persuasive Writing (10)	Press editorials (10)
Printed (150)	Creative Writing (20)	Novels & short stories (20)
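
The design can also be checked arithmetically. The following is a minimal Python sketch that transcribes the text counts from the table above into a nested dictionary and sums them. The dictionary layout and the names ICE_DESIGN, WORDS_PER_TEXT, count_texts, and word_share are illustrative assumptions and are not part of any official ICE tooling; the 2,000-word figure is the nominal text length given in the Description section.

```python
# Text counts per category, transcribed from the ICE design table.
# The nested-dict layout and all names here are illustrative assumptions,
# not an official ICE data format.
ICE_DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"Face-to-face conversations": 90, "Phone calls": 10},
            "Public": {
                "Classroom Lessons": 20,
                "Broadcast Discussions": 20,
                "Broadcast Interviews": 10,
                "Parliamentary Debates": 10,
                "Legal cross-examinations": 10,
                "Business Transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "Spontaneous commentaries": 20,
                "Unscripted Speeches": 30,
                "Demonstrations": 10,
                "Legal Presentations": 10,
            },
            "Scripted": {
                "Broadcast News": 20,
                "Broadcast Talks": 20,
                "Non-broadcast Talks": 10,
            },
        },
    },
    "Written": {
        "Non-Printed": {
            "Student Writing": {"Student Essays": 10, "Exam Scripts": 10},
            "Letters": {"Social Letters": 15, "Business Letters": 15},
        },
        "Printed": {
            "Academic Writing": {"Humanities": 10, "Social Sciences": 10,
                                 "Natural Sciences": 10, "Technology": 10},
            "Popular Writing": {"Humanities": 10, "Social Sciences": 10,
                                "Natural Sciences": 10, "Technology": 10},
            "Reportage": {"Press news reports": 20},
            "Instructional Writing": {"Administrative Writing": 10,
                                      "Skills/hobbies": 10},
            "Persuasive Writing": {"Press editorials": 10},
            "Creative Writing": {"Novels & short stories": 20},
        },
    },
}

WORDS_PER_TEXT = 2000  # nominal length of each text


def count_texts(node):
    """Recursively sum the text counts below a category (or a single leaf)."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


def word_share(section):
    """Fraction of the whole corpus that falls in a top-level section."""
    return count_texts(ICE_DESIGN[section]) / count_texts(ICE_DESIGN)


if __name__ == "__main__":
    total = count_texts(ICE_DESIGN)
    print("texts:", total, "words:", total * WORDS_PER_TEXT)
    for section in ICE_DESIGN:
        print(section, count_texts(ICE_DESIGN[section]), word_share(section))
```

Under these assumptions the leaf counts add up to the subtotals shown in parentheses: 300 spoken and 200 written texts, 500 texts in all, which at 2,000 words each gives the one million words and the 60% spoken share stated in the Description section.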

Publications

There are a number of books published about the International Corpus of English, as well as books based in part on the corpora.^[14]

English in the Caribbean: Variation, Style and Standards in Jamaica and Trinidad (2014) by Dagmar Deuber
The Present Perfect in World Englishes: Charting Unity and Diversity (2014) by Valentin Werner
Mapping Unity and Diversity Worldwide: Corpus-based Studies of New Englishes (2012) by Marianne Hundt and Ulrike Gut
The Syntax of Spoken Indian English (2012) by Claudia Lange
Oxford Modern English Grammar (2011) by Bas Aarts
Adjunct Adverbials in English (2010) by Hilde Hasselgård
ICAME Journal No 34 (2010)
An Introduction to English Grammar (2009) by Sidney Greenbaum and Gerald Nelson
Word-Formation in New Englishes: A corpus-based Analysis (2008) by Thomas Biermeier
Special issue of World Englishes Volume 23 Number 2 (2004)
Exploring Natural Language: Working with the British component of the International Corpus of English (2002) by Gerald Nelson, Sean Wallis, and Bas Aarts
Comparing English Worldwide: The International Corpus of English (1996) by Sidney Greenbaum
Oxford English Grammar (1996) by Sidney Greenbaum

Participants

The current participant countries are listed below (* = available):

Australia
Cameroon
Canada*
East Africa (Kenya, Malawi, Tanzania)*
Fiji
Ghana
Great Britain* (parsed)
Hong Kong*
India*
Ireland*
Jamaica*
Malta
Malaysia
New Zealand*
Nigeria* (tagged)
Pakistan
The Philippines*
Sierra Leone
Singapore*
South Africa
Sri Lanka
Trinidad and Tobago
USA*

External links

The International Corpus of English website

References

[1] Template:Cite web

[2] Template:Cite web

[3] Template:Cite journal

[4] Template:Cite web

[5] Template:Cite web

[6] Template:Cite web

[7] Template:Cite web

[8] Template:Cite journal

[9] Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. London: Longman.

[10] Nelson, Gerald, Wallis, Sean, and Aarts, Bas (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.

[11] Template:Cite web

[12] Template:Cite web

[13] Template:Cite web

[14] Template:Cite web
