English Wikipedia: FM-index

In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,^[1] who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space.^[2]

It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. The query time, as well as the required storage space, has a sublinear complexity with respect to the size of the input data.

The original authors have devised improvements to their original approach and dubbed it "FM-Index version 2".^[3] A further improvement, the alphabet-friendly FM-index, combines the use of compression boosting and wavelet trees^[4] to significantly reduce the space usage for large alphabets.

The FM-index has found use in, among other places, bioinformatics.^[5]

Background

Using an index is a common strategy to efficiently search a large body of text. When the text is larger than what reasonably fits within a computer's main memory, there is a need to compress not only the text but also the index. When the FM-index was introduced, there were several suggested solutions that were based on traditional compression methods and tried to solve the compressed matching problem. In contrast, the FM-index is a compressed self-index, which means that it compresses the data and indexes it at the same time.

FM-index data structure

An FM-index is created by first taking the Burrows–Wheeler transform (BWT) of the input text. For example, the BWT of the string T = "abracadabra$" is "ard$rcaaaabb", and here it is represented by the matrix M where each row is a rotation of the text, and the rows have been sorted lexicographically. The transform corresponds to the concatenation of the characters from the last column (labeled L).

I	F		L
1	$	abracadabr	a
2	a	$abracadab	r
3	a	bra$abraca	d
4	a	bracadabra	$
5	a	cadabra$ab	r
6	a	dabra$abra	c
7	b	ra$abracad	a
8	b	racadabra$	a
9	c	adabra$abr	a
10	d	abra$abrac	a
11	r	a$abracada	b
12	r	acadabra$a	b
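
The matrix above can be reproduced with a short sketch. It sorts all rotations explicitly, which is quadratic in space and only suitable for small examples; the function names are illustrative and not taken from the original paper. It assumes the text ends with a unique sentinel "$" that sorts before every other character.

```python
def bwt_matrix(text):
    """Return the lexicographically sorted rotations of text.

    Assumes text ends with a unique sentinel '$' that sorts before
    every other character (true for ASCII letters in Python).
    """
    n = len(text)
    rotations = [text[i:] + text[:i] for i in range(n)]
    return sorted(rotations)


def bwt(text):
    """Concatenate the last column L of the sorted rotation matrix."""
    return "".join(row[-1] for row in bwt_matrix(text))


rows = bwt_matrix("abracadabra$")
first_column = "".join(row[0] for row in rows)  # column F
last_column = bwt("abracadabra$")               # column L
print(first_column)  # $aaaaabbcdrr
print(last_column)   # ard$rcaaaabb
```

Practical implementations usually derive the transform from a suffix array instead of materialising every rotation.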

The BWT in itself allows for some compression with, for instance, move to front and Huffman encoding, but the transform has even more uses. The rows in the matrix are essentially the sorted suffixes of the text and the first column F of the matrix shares similarities with suffix arrays. How the suffix array relates to the BWT lies at the heart of the FM-index.

It is possible to make a last-to-first column mapping LF(i) from an index i to an index j, such that F[j] = L[i], with the help of a table C[c] and a function Occ(c, k).

C[c] is a table that, for each character c in the alphabet, contains the number of occurrences of lexically smaller characters in the text.
The function Occ(c, k) is the number of occurrences of character c in the prefix L[1..k]. Ferragina and Manzini showed^[1] that it is possible to compute Occ(c, k) in constant time.

C[c] of "ard$rcaaaabb"
c	$	a	b	c	d	r
C[c]	0	1	6	8	9	10
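
As a minimal sketch, the table can be computed by counting characters and accumulating the counts in sorted order. The helper name `build_c_table` is an assumption made for illustration.

```python
from collections import Counter


def build_c_table(bwt_string):
    """Map each character c to the number of characters in the text
    that are lexically smaller than c."""
    counts = Counter(bwt_string)
    c_table = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total   # everything counted so far is smaller than ch
        total += counts[ch]
    return c_table


c_table = build_c_table("ard$rcaaaabb")
print(c_table)  # {'$': 0, 'a': 1, 'b': 6, 'c': 8, 'd': 9, 'r': 10}
```

Because the BWT is a permutation of the text, counting over the transformed string gives the same table as counting over the text itself.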

The last-to-first mapping can now be defined as LF(i) = C[L[i]] + Occ(L[i], i). For instance, on row 9, L is a and the same a can be found on row 5 in the first column F, so LF(9) should be 5 and LF(9) = C[a] + Occ(a, 9) = 1 + 4 = 5. For any row i of the matrix, the character in the last column L[i] precedes the character in the first column F[i] also in T. Finally, if L[i] = T[k], then L[LF(i)] = T[k − 1], and using the equality it is possible to extract a string of T from L.
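
The following sketch illustrates the mapping and how repeated application recovers the text from the last column. It uses 1-based row indices to match the article and a naive linear scan for the occurrence count, so it does not reflect the constant-time bound; all function names are illustrative assumptions.

```python
from collections import Counter


def build_c_table(last):
    """Number of characters in the text lexically smaller than each character."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def occ(last, c, k):
    """Occurrences of c in L[1..k] (1-based, inclusive); naive O(k) scan."""
    return last[:k].count(c)


def lf(last, c_table, i):
    """Last-to-first mapping for the 1-based row i."""
    ch = last[i - 1]
    return c_table[ch] + occ(last, ch, i)


def invert_bwt(last):
    """Recover the text by following LF, starting from row 1.

    Row 1 begins with the sentinel '$', so L[1] is the character just
    before '$' in the text; each LF step moves one position to the left.
    """
    c_table = build_c_table(last)
    i = 1
    out = []
    for _ in range(len(last) - 1):
        out.append(last[i - 1])
        i = lf(last, c_table, i)
    return "".join(reversed(out)) + "$"


last = "ard$rcaaaabb"
print(lf(last, build_c_table(last), 9))  # 5
print(invert_bwt(last))                  # abracadabra$
```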

The FM-index itself is a compression of the string L together with C and Occ in some form, as well as information that maps a selection of indices in L to positions in the original string T.

Occ(c, k) of "ard$rcaaaabb"
L[k]	a	r	d	$	r	c	a	a	a	a	b	b
k	1	2	3	4	5	6	7	8	9	10	11	12
$	0	0	0	1	1	1	1	1	1	1	1	1
a	1	1	1	1	1	1	2	3	4	5	5	5
b	0	0	0	0	0	0	0	0	0	0	1	2
c	0	0	0	0	0	1	1	1	1	1	1	1
d	0	0	1	1	1	1	1	1	1	1	1	1
r	0	1	1	1	2	2	2	2	2	2	2	2
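
A table like the one above can be generated with running counts, as in this sketch. It stores a full row per character, which is the uncompressed form shown here and not the compressed representation an actual FM-index would keep; the name `build_occ_table` is an illustrative assumption.

```python
def build_occ_table(last):
    """occ_table[c][k] = occurrences of c in L[1..k].

    Column 0 is the empty prefix, so every row starts with 0.
    """
    alphabet = sorted(set(last))
    occ_table = {c: [0] * (len(last) + 1) for c in alphabet}
    for k, ch in enumerate(last, start=1):
        for c in alphabet:
            occ_table[c][k] = occ_table[c][k - 1] + (1 if c == ch else 0)
    return occ_table


occ_table = build_occ_table("ard$rcaaaabb")
for c in sorted(occ_table):
    # Skip column 0 so the printed rows line up with k = 1..12.
    print(c, *occ_table[c][1:], sep="\t")
```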

Count

The operation count takes a pattern P[1..p] and returns the number of occurrences of that pattern in the original text T. Since the rows of matrix M are sorted, and it contains every suffix of T, the occurrences of pattern P will be next to each other in a single continuous range. The operation iterates backwards over the pattern. For every character in the pattern, the range of rows that begin with the part of the pattern processed so far is found. For example, the count of the pattern "bra" in "abracadabra" follows these steps:

The first character we look for is a, the last character in the pattern. The initial range is set to [C[a] + 1 .. C[a+1]] = [2..6], where a+1 denotes the character that follows a in the alphabet (here b). The entries of L in this range are the characters of T that are immediately followed by a suffix beginning with a.
The next character to look for is r. The new range is [C[r] + Occ(r, start − 1) + 1 .. C[r] + Occ(r, end)] = [10 + 0 + 1 .. 10 + 2] = [11..12], if start is the index of the beginning of the range and end is the end. The entries of L in this range are the characters of T that are immediately followed by a suffix beginning with ra.
The last character to look at is b. The new range is [C[b] + Occ(b, start − 1) + 1 .. C[b] + Occ(b, end)] = [6 + 0 + 1 .. 6 + 2] = [7..8]. The entries of L in this range are the characters that are immediately followed by a suffix beginning with bra. Now that the whole pattern has been processed, the count is the same as the size of the range: 8 − 7 + 1 = 2.

If the range becomes empty or the range boundaries cross each other before the whole pattern has been looked up, the pattern does not occur in T. Because Occ(c, k) can be performed in constant time, count can complete in linear time in the length of the pattern: O(p) time.
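
The backward search can be sketched as follows. This version starts from the full range [1..n], which yields the same initial range for the last pattern character as the first step above. The tables are stored uncompressed, and the names `build_index`, `backward_search` and `count` are illustrative assumptions.

```python
from collections import Counter


def build_index(last):
    """Build uncompressed C and Occ tables for the BWT string `last`."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ_table = {c: [0] * (len(last) + 1) for c in counts}
    for k, ch in enumerate(last, start=1):
        for c in counts:
            occ_table[c][k] = occ_table[c][k - 1] + (c == ch)
    return c_table, occ_table


def backward_search(last, pattern):
    """Return the 1-based inclusive row range (start, end) of rows that
    begin with pattern, or None if the pattern does not occur."""
    c_table, occ_table = build_index(last)
    start, end = 1, len(last)
    for ch in reversed(pattern):
        if ch not in c_table:
            return None
        start = c_table[ch] + occ_table[ch][start - 1] + 1
        end = c_table[ch] + occ_table[ch][end]
        if start > end:          # boundaries crossed: no occurrence
            return None
    return start, end


def count(last, pattern):
    """Number of occurrences of pattern in the original text."""
    rng = backward_search(last, pattern)
    return 0 if rng is None else rng[1] - rng[0] + 1


print(backward_search("ard$rcaaaabb", "bra"))  # (7, 8)
print(count("ard$rcaaaabb", "bra"))            # 2
```

Rebuilding the tables on every query keeps the sketch short; an index meant for repeated queries would build them once.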

Locate

The operation locate takes as input an index of a character in L and returns its position i in T. For instance locate(7) = 8. To locate every occurrence of a pattern, first the range of characters whose suffix is the pattern is found, in the same way as the count operation finds the range. Then the position of every character in the range can be located.

To map an index in L to one in T, a subset of the indices in L are associated with a position in T. If L[j] has a position associated with it, locate(j) is trivial. If it is not associated, the string is followed with LF(i) until an associated index is found. By associating a suitable number of indices, an upper bound can be found. Locate can be implemented to find occ occurrences of a pattern P[1..p] in a text T[1..u] in <math>O(p + occ \log^\epsilon u)</math> time with <math>O \left(H_k(T) + {{\log\log u}\over{\log^\epsilon u}} \right)</math> bits per input symbol for any k ≥ 0.^[1]
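
A minimal sketch of locate with sampled positions follows. The sampling rule (every `step`-th text position, always including position 1) and all names are assumptions chosen for illustration, and the tables are stored uncompressed, so the sketch does not achieve the space bound quoted above. Positions and row indices are 1-based.

```python
from collections import Counter


def build_locate_index(text, step=4):
    """Build L, C, Occ and a sampled map from rows of L to text positions.

    `step` is a hypothetical sampling rate. Position 1 is always sampled,
    so a walk to the left never wraps around the start of the text.
    """
    n = len(text)
    starts = sorted(range(n), key=lambda i: text[i:] + text[:i])
    last = "".join(text[s - 1] for s in starts)  # text[-1] wraps to '$'
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ_table = {c: [0] * (n + 1) for c in counts}
    for k, ch in enumerate(last, start=1):
        for c in counts:
            occ_table[c][k] = occ_table[c][k - 1] + (c == ch)
    samples = {}
    for row, s in enumerate(starts, start=1):
        pos = s if s > 0 else n   # 1-based position of L[row] in text
        if (pos - 1) % step == 0:
            samples[row] = pos
    return last, c_table, occ_table, samples


def locate(index, j):
    """Position in the text of the character L[j]."""
    last, c_table, occ_table, samples = index
    steps = 0
    while j not in samples:
        ch = last[j - 1]
        j = c_table[ch] + occ_table[ch][j]  # LF step: one position left in T
        steps += 1
    return samples[j] + steps


def occurrence_starts(index, start, end):
    """Text positions where the suffixes of rows start..end begin."""
    n = len(index[0])
    return sorted(locate(index, j) % n + 1 for j in range(start, end + 1))


index = build_locate_index("abracadabra$", step=4)
print(locate(index, 7))                # 8
print(occurrence_starts(index, 7, 8))  # [2, 9], the two occurrences of "bra"
```

A smaller `step` stores more samples and shortens each walk; a larger one saves space at the cost of more LF steps per query.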

Applications

DNA read mapping

The FM-index with backtracking has been successfully applied (>2000 citations) to approximate string matching and sequence alignment; see Bowtie (http://bowtie-bio.sourceforge.net/index.shtml).

References

[1] Paolo Ferragina and Giovanni Manzini (2000). "Opportunistic Data Structures with Applications". Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p. 390.
[2] Paolo Ferragina and Giovanni Manzini (2005). "Indexing Compressed Text". Journal of the ACM, 52, 4 (Jul. 2005). p. 553.
[3] (Web citation; details not preserved in this copy.)
[4] P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. "An Alphabet-Friendly FM-index". In Proc. SPIRE'04, pages 150-160. LNCS 3246.
[5] (Journal citation; details not preserved in this copy.)

