English Wikipedia: Datasaurus dozen


The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and look very different when graphed.[1] It was inspired by the smaller Anscombe's quartet, created in 1973.

Data

The following table contains summary statistics for all thirteen data sets.

Property | Value | Accuracy
Number of elements | 142 | exact
Mean of x | 54.26 | to 2 decimal places
Sample standard deviation of x: s_x | 16.76 | to 2 decimal places
Mean of y | 47.83 | to 2 decimal places
Sample standard deviation of y: s_y | 26.93 | to 2 decimal places
Correlation between x and y | −0.06 | to 2 decimal places
Linear regression line | y = 53 − 0.1x | to 0 and 1 decimal places, respectively
Coefficient of determination of the linear regression: R^2 | 0.004 | to 3 decimal places
Figure: The thirteen datasets in the Datasaurus Dozen, visualized and summarized. The statistical summaries are similar, while the graphs are not.
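
The quantities in the table above can be computed for any paired sample with a few lines of code. The following is a minimal sketch in plain Python; the function name `summary_stats` and the toy data are illustrative assumptions, not part of the Datasaurus dozen or of any published tooling. It returns both the sample variance and the sample standard deviation so that either measure of spread can be compared.

```python
import math


def summary_stats(xs, ys):
    """Compute basic summary statistics for paired samples xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation
    slope = sxy / sxx                    # least-squares slope of y on x
    intercept = mean_y - slope * mean_x  # least-squares intercept
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),          # sample variance (n - 1 denominator)
        "var_y": syy / (n - 1),
        "sd_x": math.sqrt(sxx / (n - 1)),
        "sd_y": math.sqrt(syy / (n - 1)),
        "r": r,
        "slope": slope,
        "intercept": intercept,
        "r_squared": r * r,              # equals R^2 for simple linear regression
    }


# Toy data for illustration only; NOT one of the thirteen data sets.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]
toy_stats = summary_stats(toy_x, toy_y)
for name, value in toy_stats.items():
    print(f"{name}: {value:.3f}")
```

Running the same function on each of the thirteen data sets and rounding the results should reproduce the near-identical values in the table, which is exactly why the numbers alone cannot tell the shapes apart.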

The thirteen data sets are labeled as follows:

  • away
  • bullseye
  • circle
  • dino
  • dots
  • h_lines
  • high_lines
  • slant_down
  • slant_up
  • star
  • v_line
  • wide_lines
  • x_shape

Like Anscombe's quartet, the Datasaurus dozen was designed to illustrate the importance of graphing a set of data before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.[2][3][4][5][1][6]

Creation

Figure (Datasaurus.png): The Datasaurus dataset created by Alberto Cairo, which inspired the creation of the Datasaurus Dozen

The initial "datasaurus" dataset was constructed in 2016 by Alberto Cairo.[7] Maarten Lambrechts proposed that this dataset also be called "Anscombosaurus".[7]

This dataset was then accompanied by twelve other datasets created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike Anscombe's quartet, for which it is not known how the data sets were generated,[8] the authors are known to have used simulated annealing to make these data sets. They made small, random changes to each point, biased towards the desired shape. Each shape took 200,000 iterations of perturbations to complete.[1]

The pseudocode for this algorithm is as follows:

current_ds ← initial_ds
for x iterations, do:
    test_ds ← perturb(current_ds, temp)
    if similar_enough(test_ds, initial_ds):
        current_ds ← test_ds

function perturb(ds, temp):
    loop:
        test ← move_random_points(ds)
        if fit(test) > fit(ds) or temp > random():
            return test

where

  • initial_ds is the seed dataset
  • current_ds is the latest version of the dataset
  • fit() is a function used to check whether moving the points gets closer to the desired shape
  • temp is the temperature of the simulated annealing algorithm
  • similar_enough() is a function that checks whether the statistics for the two given datasets are similar enough
  • move_random_points() is a function that randomly moves data points
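
The pseudocode above can be turned into a small runnable sketch. The version below is an illustration under stated assumptions, not the authors' implementation: the target shape (a circle with hypothetical centre and radius), the linear cooling schedule, the jitter size, and the choice to compare statistics after rounding to two decimal places are all assumptions made for this example. The function names follow the pseudocode; `stats` and `anneal` are added helper names.

```python
import math
import random

# Hypothetical target shape: a circle (centre and radius are assumptions).
CX, CY, RADIUS = 50.0, 50.0, 30.0


def stats(ds):
    """Means, sample standard deviations, and Pearson correlation of (x, y) pairs."""
    n = len(ds)
    mx = sum(x for x, _ in ds) / n
    my = sum(y for _, y in ds) / n
    sxx = sum((x - mx) ** 2 for x, _ in ds)
    syy = sum((y - my) ** 2 for _, y in ds)
    sxy = sum((x - mx) * (y - my) for x, y in ds)
    return (mx, my, math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1)),
            sxy / math.sqrt(sxx * syy))


def similar_enough(a, b, places=2):
    """True if all statistics agree after rounding to `places` decimals (assumption)."""
    return all(round(p, places) == round(q, places)
               for p, q in zip(stats(a), stats(b)))


def fit(ds):
    """Higher is better: negative total distance of the points from the circle."""
    return -sum(abs(math.hypot(x - CX, y - CY) - RADIUS) for x, y in ds)


def move_random_points(ds, jitter=0.3):
    """Return a copy of ds with one randomly chosen point nudged slightly."""
    new = list(ds)
    i = random.randrange(len(new))
    x, y = new[i]
    new[i] = (x + random.gauss(0, jitter), y + random.gauss(0, jitter))
    return new


def perturb(ds, temp):
    """Propose moves until one improves the fit or is accepted by temperature."""
    while True:
        test = move_random_points(ds)
        if fit(test) > fit(ds) or temp > random.random():
            return test


def anneal(initial_ds, iterations=5000, max_temp=0.4, min_temp=0.01):
    """Move points towards the target shape while preserving the statistics."""
    current_ds = list(initial_ds)
    for i in range(iterations):
        # Linear cooling from max_temp to min_temp (an assumed schedule).
        temp = max_temp - (max_temp - min_temp) * i / iterations
        test_ds = perturb(current_ds, temp)
        if similar_enough(test_ds, initial_ds):
            current_ds = test_ds
    return current_ds


random.seed(0)  # reproducible run
initial = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(60)]
result = anneal(initial)
print("initial:", [round(s, 2) for s in stats(initial)])
print("result: ", [round(s, 2) for s in stats(result)])
```

Because `current_ds` is only ever replaced by a candidate that passes `similar_enough` against the seed dataset, the rounded statistics of the result match those of the input by construction, however far the points have drifted towards the target shape. This sketch recomputes `fit` and `stats` over the whole dataset on every step for clarity; with a run of 200,000 iterations as described above, incremental updates would be the obvious optimization.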
