Discretization of continuous features
In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features, or variables into discretized or nominal attributes/features/variables/intervals. This can be useful when creating probability mass functions – formally, in density estimation. It is a form of discretization in general, and also of binning, as in making a histogram. Whenever continuous data is discretized, some amount of discretization error is introduced; the goal is to keep that error at a level considered negligible for the modeling purposes at hand.
Typically, data is discretized into K partitions of equal width (equal intervals) or into K partitions each containing roughly the same number of observations (equal frequencies).[1]
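The two simple schemes above can be sketched as follows; this is an illustrative example with made-up data and K = 4, not code from any of the packages listed below.

```python
import numpy as np

# Hypothetical sample of a continuous feature.
data = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 8.0, 12.0])
K = 4  # number of bins

# Equal-width: split the range [min, max] into K intervals of equal length.
width_edges = np.linspace(data.min(), data.max(), K + 1)
width_bins = np.digitize(data, width_edges[1:-1])  # bin indices 0..K-1

# Equal-frequency: choose edges at quantiles so each bin receives
# roughly len(data) / K observations.
freq_edges = np.quantile(data, np.linspace(0, 1, K + 1))
freq_bins = np.digitize(data, freq_edges[1:-1])
```

Note that equal-width bins can be badly unbalanced when the data are skewed (here the widest values pull most points into the first bin), whereas equal-frequency bins adapt their widths to the data.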
Mechanisms for discretizing continuous data include Fayyad and Irani's MDL method,[2] which uses mutual information to recursively define the best bins, as well as CAIM, CACC, Ameva, and many others.[3]
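The core step of the entropy-based approach can be sketched as below: pick the cut point that minimizes the class-weighted entropy of the two resulting intervals. This is a simplified, hypothetical sketch of one step only; Fayyad and Irani's full MDLP method applies it recursively and uses an MDL-based criterion to decide when to stop splitting.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a multiset of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # Return the cut point minimizing the weighted entropy of the two
    # intervals it induces (a single step of recursive entropy-based
    # discretization; illustrative, not a full MDLP implementation).
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_cut = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a cut
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if score < best_score:
            best_score, best_cut = score, cut
    return best_cut
```

For example, with values `[1, 2, 3, 10, 11, 12]` and labels `["a", "a", "a", "b", "b", "b"]`, the cut at 6.5 yields two pure intervals (weighted entropy 0) and is selected.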
Many machine learning algorithms are known to produce better models by discretizing continuous attributes.[4]
Software
This is a partial list of software packages that implement the MDL discretization algorithm.
- discretize4crf, a tool designed to work with popular CRF implementations (C++)
- mdlp in the R package discretization
- Discretize in the R package RWeka
References
- ↑ Template:Cite journal
- ↑ Fayyad, Usama M.; Irani, Keki B. (1993). "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning". Proc. 13th Int. Joint Conf. on Artificial Intelligence (Q334 .I571 1993), pp. 1022–1027.
- ↑ Dougherty, J.; Kohavi, R.; Sahami, M. (1995). "Supervised and Unsupervised Discretization of Continuous Features". In A. Prieditis & S. J. Russell (eds.), Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann, pp. 194–202.
- ↑ Template:Cite journal