sklearn中的数据集
sklearn随机造的数据
回归
sklearn.datasets.make_regression
from sklearn import datasets
X, y, coef = \
datasets.make_regression(n_samples=1000,
n_features=5,
n_informative=3, # 其中,3个feature是有信息的
n_targets=1, # 多少个 target
bias=1, # 就是 intercept
coef=True, # 为True时,会返回真实的coef值
noise=1, # 噪声的标准差
)
分类
sklearn.datasets.make_classification
from sklearn import datasets
X, y = datasets. \
make_classification(n_samples=1000,
n_features=10,
n_informative=2,
n_redundant=3, # 用 n_informative 线性组合出这么多个特征
n_repeated=3, # 用 n_informative+n_redundant 线性组合出这么多个特征
n_classes=2,
n_clusters_per_class=1,
weights=[0.2, 0.8], # class 数量不均衡
scale=[5] + [1] * 8 + [3], # feature 的 scale
flip_y=0.1 # 随机交换这么多比例的y,以制造噪声
)
另外,还有一个 make_multilabel_classification 用来生成多标签(target是多个维度)
聚类
from sklearn import datasets
X, y_true = datasets.make_blobs(
n_samples=1000,
n_features=2,
centers=3,
center_box=(-1, 1), # 取值范围
cluster_std=0.05,
random_state=23)
其它
datasets.make_s_curve # S型
datasets.make_circles # 圆环
datasets.make_moons # 月亮
...
sklearn真实数据
import sklearn.datasets as datasets
dataset=datasets.load_iris()
dataset.data,dataset.target
dataset.feature_names,dataset.target_names
print(dataset.DESCR)
转为pandas
import pandas as pd
import sklearn.datasets as datasets
dataset=datasets.load_iris()
df_feature=pd.DataFrame(data=dataset.data,columns=dataset.feature_names)
df_target=pd.DataFrame(data=dataset.target,columns=['target'])
df=pd.concat([df_feature,df_target],axis=1)
boston (regression).
Boston House Prices dataset
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
breast_cancer (classification).
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
diabetes (regression).
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
digits (classification).
:Number of Instances: 5620
:Number of Attributes: 64 (8x8 image of integer pixels)
iris (classification).
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
linnerud (multivariate regression).
:Number of Instances: 20
:Number of Attributes: 3
load_digits (classification)
手写数字识别,8x8像素,16色,肉眼都不容易判断
================= ==============
Classes 10
Samples per class ~180
Samples total 1797
Dimensionality 64
Features integers 0-16
================= ==============
load_linnerud (multivariate regression)
============== ============================
Samples total 20
Dimensionality 3 (for both data and target)
Features integer
Targets integer
============== ============================
load_wine (classification)
================= ==============
Classes 3
Samples per class [59,71,48]
Samples total 178
Dimensionality 13
Features real, positive
================= ==============
statsmodels中的数据集
import statsmodels.api as sm
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
网络资源
UCI,有很多经典数据:
http://archive.ics.uci.edu/ml/
(PS)UCI有时候会崩,太不靠谱了,git上自己做了个库datasets_for_ml
应用举例:
df=pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data',sep='\s+',na_values='?',
header=None,names=['mpg','cylinders','displacement','horsepower','weight','acceleration','model_year','origin','car_name'])
df.head()
sogou数据实验室:
http://www.sogou.com/labs/resource/list_pingce.php
https://www.quandl.com/search
MNIST
- lecun官网
- 我给转成了csv格式,放到了github上,用代码读也方便
# step1:下载 import requests url = 'https://github.com/guofei9987/datasets_for_ml/blob/master/MNIST_data_csv.zip?raw=true' r = requests.get(url) with open('MNIST_data_csv.zip', 'wb') as f: f.write(r.content) # step2:解压 import zipfile azip = zipfile.ZipFile('MNIST_data_csv.zip') azip.extractall() # step3:读取 import pandas as pd mnist_train_images = pd.read_csv('mnist_train_images.csv', header=None).values mnist_test_images = pd.read_csv('mnist_test_images.csv', header=None).values mnist_train_labels = pd.read_csv('mnist_train_labels.csv', header=None).values mnist_test_labels = pd.read_csv('mnist_test_labels.csv', header=None).values
- tensorflow
from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets('MNIST_data/', one_hot=True) mnist.train.images,mnist.train.labels rand_test_indices = np.random.choice(len(mnist.test.images), test_size) mnist.test.images[rand_test_indices],mnist.test.labels[rand_test_indices]
图片识别库
Cifar-10
Cifar-10,由Hinton等整理出,搜集了10种类型60000张图片,全部是$32\times 32$彩图,人工标注正确率为94%
http://www.cs.toronto.edu/~kriz/cifar.html
ImageNet
ImageNet是基于WordNet的大型图片数据库,由李飞飞带头整理,1500万张图片关联到20000个名词同义词上。
图片从互联网上爬虫,通过亚马逊人工标注服务分类。
每张图片上由多个实体。
http://www.image-net.org/challenges/LSVRC/