Classic Datasets



2017-09-04    Author: Guofei

Category: 0x13_特征工程 (Feature Engineering)    Article No.: 199

Copyright notice: the author of this article is Guofei. Feel free to repost, but please cite the original link and notify the author.
Original link: https://www.guofei.site/2017/09/04/dataset.html


Datasets in sklearn

Synthetic data generated by sklearn

samples-generator

Regression

sklearn.datasets.make_regression

from sklearn import datasets

X, y, coef = \
    datasets.make_regression(n_samples=1000,
                             n_features=5,
                             n_informative=3,  # 3 of the 5 features are informative
                             n_targets=1,  # number of target columns
                             bias=1,  # i.e. the intercept
                             coef=True,  # if True, also return the true coefficients
                             noise=1,  # standard deviation of the Gaussian noise
                             )
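
As a quick sanity check (a small sketch, not from the original post), you can fit a linear model on the generated data and compare the estimated coefficients with the returned coef:

from sklearn import datasets
from sklearn.linear_model import LinearRegression

X, y, coef = datasets.make_regression(n_samples=1000, n_features=5,
                                      n_informative=3, bias=1,
                                      coef=True, noise=1)
model = LinearRegression().fit(X, y)
print(coef)              # true coefficients; uninformative features get 0
print(model.coef_)       # estimated coefficients, close to the true ones
print(model.intercept_)  # close to bias=1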

Classification

sklearn.datasets.make_classification

from sklearn import datasets

X, y = datasets. \
    make_classification(n_samples=1000,
                        n_features=10,
                        n_informative=2,
                        n_redundant=3,  # random linear combinations of the informative features
                        n_repeated=3,  # duplicates drawn from the informative and redundant features
                        n_classes=2,
                        n_clusters_per_class=1,
                        weights=[0.2, 0.8],  # imbalanced class proportions
                        scale=[5] + [1] * 8 + [3],  # per-feature scale
                        flip_y=0.1  # fraction of labels flipped at random, to add label noise
                        )
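
A quick way to see the effect of weights and flip_y is to count the generated labels (a small sketch; most parameters are copied from the call above):

from sklearn import datasets
import numpy as np

X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=3, n_repeated=3,
                                    weights=[0.2, 0.8], flip_y=0.1)
print(np.bincount(y))  # roughly a 20/80 split, blurred slightly by the flipped labels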

In addition, make_multilabel_classification generates multilabel data (i.e. the target has multiple columns), as in the sketch below.
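
A minimal sketch (parameter values here are only illustrative):

from sklearn import datasets

X, Y = datasets.make_multilabel_classification(n_samples=1000,
                                               n_features=20,
                                               n_classes=5,  # number of possible labels
                                               n_labels=2)   # average number of labels per sample
# Y has shape (1000, 5): each row is a binary indicator vector over the 5 labels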

Others

datasets.make_s_curve # S-shaped 3D curve
datasets.make_circles # concentric circles
datasets.make_moons # two interleaving half moons
...
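
A minimal sketch of the shape generators above (parameter values are illustrative):

from sklearn import datasets

X, y = datasets.make_moons(n_samples=500, noise=0.1)        # X: (500, 2), y: cluster labels 0/1
X, y = datasets.make_circles(n_samples=500, noise=0.05,
                             factor=0.5)                    # factor: inner/outer circle radius ratio
X, t = datasets.make_s_curve(n_samples=500, noise=0.05)     # X: (500, 3), t: position along the curve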

Real datasets bundled with sklearn

import sklearn.datasets as datasets

dataset = datasets.load_iris()
dataset.data, dataset.target                   # feature matrix and labels
dataset.feature_names, dataset.target_names    # feature names and class names
print(dataset.DESCR)                           # full description of the dataset

Convert to pandas

import pandas as pd
import sklearn.datasets as datasets

dataset = datasets.load_iris()
df_feature = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df_target = pd.DataFrame(data=dataset.target, columns=['target'])
df = pd.concat([df_feature, df_target], axis=1)
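
In newer sklearn versions (0.23+), most loaders can also return pandas objects directly via as_frame=True, which saves the manual conversion above (a small sketch):

import sklearn.datasets as datasets

dataset = datasets.load_iris(as_frame=True)
df = dataset.frame       # features plus a 'target' column in a single DataFrame
X_df = dataset.data      # features as a DataFrame
y_sr = dataset.target    # target as a Series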

boston (regression).

Boston House Prices dataset
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive

breast_cancer (classification).

:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class

diabetes (regression).

:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values

digits (classification).

:Number of Instances: 5620 in the original UCI set (sklearn ships a 1797-sample subset)
:Number of Attributes: 64 (8x8 image of integer pixels)

iris (classification).

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class

linnerud (multivariate regression).

:Number of Instances: 20
:Number of Attributes: 3

load_digits (classification)

Handwritten digit recognition: 8x8 pixel images with integer grey levels from 0 to 16; some digits are hard to recognize even by eye.

=================   ==============
Classes                         10
Samples per class             ~180
Samples total                 1797
Dimensionality                  64
Features             integers 0-16
=================   ==============

load_linnerud (multivariate regression)

==============    ============================
Samples total     20
Dimensionality    3 (for both data and target)
Features          integer
Targets           integer
==============    ============================

load_wine (classification)

=================   ==============
Classes                          3
Samples per class        [59,71,48]
Samples total                  178
Dimensionality                  13
Features            real, positive
=================   ==============
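
As a quick way to verify the numbers in the tables above, a small sketch using load_wine:

from sklearn import datasets
import numpy as np

wine = datasets.load_wine()
print(wine.data.shape)           # (178, 13)
print(np.bincount(wine.target))  # [59 71 48]
print(wine.target_names)         # ['class_0' 'class_1' 'class_2']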

Datasets in statsmodels

import statsmodels.api as sm

# fetch the Guerry dataset from R's HistData package via the Rdatasets repository
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
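
statsmodels also ships its own built-in datasets; a small sketch using the classic longley dataset as an example:

import statsmodels.api as sm

longley = sm.datasets.longley.load_pandas()
print(longley.data.head())   # full DataFrame
print(longley.endog.head())  # target series
print(longley.exog.head())   # feature DataFrame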

Online resources

UCI hosts many classic datasets:
http://archive.ics.uci.edu/ml/
(PS: the UCI site is occasionally down and rather unreliable, so I made my own mirror repo, datasets_for_ml, on GitHub.)

Example:

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data',
                 sep=r'\s+', na_values='?', header=None,
                 names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                        'acceleration', 'model_year', 'origin', 'car_name'])
df.head()

Sogou Labs:
http://www.sogou.com/labs/resource/list_pingce.php

Quandl: https://www.quandl.com/search

MNIST

  1. LeCun's official website
  2. I converted the data to CSV format and put it on GitHub, so it is easy to read from code (a small visualization sketch follows after this list):
    # Step 1: download
    import requests
    url = 'https://github.com/guofei9987/datasets_for_ml/blob/master/MNIST_data_csv.zip?raw=true'
    r = requests.get(url)
    with open('MNIST_data_csv.zip', 'wb') as f:
        f.write(r.content)
    # Step 2: unzip
    import zipfile
    azip = zipfile.ZipFile('MNIST_data_csv.zip')
    azip.extractall()
    # Step 3: read
    import pandas as pd
    mnist_train_images = pd.read_csv('mnist_train_images.csv', header=None).values
    mnist_test_images = pd.read_csv('mnist_test_images.csv', header=None).values
    mnist_train_labels = pd.read_csv('mnist_train_labels.csv', header=None).values
    mnist_test_labels = pd.read_csv('mnist_test_labels.csv', header=None).values
    
  3. TensorFlow (the old TF 1.x tutorials loader)
    import numpy as np
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
    mnist.train.images, mnist.train.labels  # training images and one-hot labels
    test_size = 100  # size of a random test batch (value chosen arbitrarily here)
    rand_test_indices = np.random.choice(len(mnist.test.images), test_size)
    mnist.test.images[rand_test_indices], mnist.test.labels[rand_test_indices]
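
For the CSV version from step 2 above, each image row should be a flattened 28x28 array (the standard MNIST layout). A small sketch for displaying one image; the exact label format depends on the CSV files:

import pandas as pd
import matplotlib.pyplot as plt

images = pd.read_csv('mnist_train_images.csv', header=None).values
labels = pd.read_csv('mnist_train_labels.csv', header=None).values

plt.imshow(images[0].reshape(28, 28), cmap='gray')  # assumes 784 = 28*28 columns per row
plt.title('label: {}'.format(labels[0]))            # labels[0] may be a digit or a one-hot row
plt.show()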
    

Image recognition datasets

CIFAR-10

CIFAR-10, compiled by Hinton et al., contains 60,000 images in 10 classes, all $32\times 32$ color images; human labeling accuracy is about 94%.
http://www.cs.toronto.edu/~kriz/cifar.html
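
Besides the official download, one convenient way to load CIFAR-10 from Python is the Keras loader (a small sketch, assuming TensorFlow/Keras is installed; it is not part of the original post):

import tensorflow as tf

# downloads and caches the dataset on first call
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
print(x_test.shape, y_test.shape)    # (10000, 32, 32, 3) (10000, 1)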

ImageNet

ImageNet is a large image database organized around WordNet, built under the leadership of Fei-Fei Li; about 15 million images are linked to roughly 20,000 noun synsets.
The images were crawled from the Internet and labeled through Amazon's crowdsourced annotation service (Mechanical Turk).
A single image may contain multiple objects.
http://www.image-net.org/challenges/LSVRC/

