Tuitionet

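Notice └→ Statement regarding "Tuitionet.blogspot.com" ◎ Thank you for using the reading service provided by "Tuitionet.blogspot.com". "Tuitionet.blogspot.com" asks you to note the following: 1. Some "Tuitionet.blogspot.com" articles publish or repost content through the newsletter subscription and automatic delivery mechanisms offered by publishers such as the "ePaper" e-newsletter. That content is supplied by the respective newsletters or information providers, and "Tuitionet.blogspot.com" does not own it. 2. "Tuitionet.blogspot.com" does not intervene in any ideological disputes between readers and content providers. 3. The opinions or statements expressed in those newsletters do not represent the position of "Tuitionet.blogspot.com".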

2009-03-04

DM Dataset

KDnuggets http://www.kdnuggets.com/

Datasets for Data Mining

KDD Cup center, with all data, tasks, and results.

UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
UCI Machine Learning Repository.
AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
Delve, Data for Evaluating Learning in Valid Experiments
FEDSTATS, a comprehensive source of US statistics and more
FIMI repository for frequent itemset mining, implementations and datasets.
Financial Data Finder at OSU, a large catalog of financial data sets
GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval.
Grain Market Research, financial data including stocks, futures, etc.
ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
Infobiotics PSP (protein structure prediction) datasets, adjustable real-world family of benchmarks for testing the scalability of classification/regression methods.
Investor Links, includes financial data
Microsoft's TerraServer, aerial photographs and satellite images you can view and purchase.
MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
NASDAQ Data Store, provides access to market data.
National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
PubGene(TM) Gene Database and Tools, genomic-related publications database
SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.
STATOO Datasets part 1 and STATOO Datasets part 2
UCR Time Series Classification/Clustering page, offering datasets, papers, links, and code.
United States Census Bureau.

KDD Cup and Workshop 2007

Co-organized by ACM SIGKDD and Netflix

To be held at KDD-2007, San Jose, California, Aug 12, 2007

http://www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html#download

3. Obtaining the Training Dataset and the Qualifying Answer Sets

The Netflix Prize training dataset is available for download from here. You must register separately at that site to download the training dataset, even if you elect not to enter the Netflix Prize contest itself. The format of the training data is described on the Netflix Prize website and in the training dataset file. No additional training data will be provided. The qualifying answer sets can be downloaded from the links below.

The who_rated_what_2006.txt file consists of 100,000 lines, each containing a user_id and movie_id pair.
The how_many_ratings_2006.txt file consists of 8863 lines, each containing a movie_id.

The user_ids and movie_ids are taken from the Netflix Prize training dataset.
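
A minimal sketch of reading the two qualifying answer sets in Python. The page does not state the field delimiter, so the parser below assumes that fields are separated by commas or whitespace. The function names are hypothetical, and the field order (user_id, then movie_id) simply follows the description above.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: fields are separated by commas and/or whitespace.
_FIELD_SEPARATOR = re.compile(r"[,\s]+")


def parse_who_rated_what(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse lines that each hold one pair of integer ids.

    The pair is returned in file order, taken to be (user_id, movie_id)
    as in the description above.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first, second = _FIELD_SEPARATOR.split(line)
        pairs.append((int(first), int(second)))
    return pairs


def parse_how_many_ratings(lines: Iterable[str]) -> List[int]:
    """Parse lines that each hold a single integer movie_id."""
    return [int(line) for line in lines if line.strip()]
```

Both functions accept any iterable of lines, so an open file handle such as `open("who_rated_what_2006.txt")` can be passed directly.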

*

Welcome to the UCI Knowledge Discovery in Databases Archive

This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.

Creation of this archive was supported by a grant from the Information and Data Management Program at the National Science Foundation. The archive is intended to serve as a permanent repository of publicly accessible data sets for research in KDD and data mining. It complements the original UCI Machine Learning Archive, which typically focuses on smaller classification-oriented data sets.

In addition to storing data and description files, we also archive task files that describe a specific analysis, such as clustering or regression, for the data sets stored. The call for data sets lists typical data types and tasks of interest.

Data Sets: by data type, by application area, by name, by date (reverse chronological), Machine Learning Repository
Task Files: by task type, by application area, by name, by date (reverse chronological)

http://kdd.ics.uci.edu/

*

Welcome to the UC Irvine Machine Learning Repository!

We currently maintain 177 data sets as a service to the machine learning community. You may view all data sets through our searchable interface. Our old web site is still available, for those who prefer the old format. For a general overview of the Repository, please visit our About page. For information about citing data sets in publications, please read our citation policy. If you wish to donate a data set, please consult our donation policy. For any other questions, feel free to contact the Repository librarians. We have also set up a mirror site for the Repository.

http://archive.ics.uci.edu/ml/
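
A number of the repository's classic data sets are distributed as plain-text files with one comma-separated record per line, no header row, and the class label in the last column. The following is a minimal sketch under that assumption, which does not hold for every data set. It also assumes that all feature columns are numeric. The function name is hypothetical.

```python
import csv
from typing import Iterable, List, Tuple


def load_labeled_rows(lines: Iterable[str]) -> Tuple[List[List[float]], List[str]]:
    """Split headerless CSV records into numeric features and a label.

    Assumptions: every column except the last is numeric, and the last
    column is the class label.
    """
    features: List[List[float]] = []
    labels: List[str] = []
    for row in csv.reader(lines):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        *values, label = row
        features.append([float(value) for value in values])
        labels.append(label.strip())
    return features, labels
```

The function takes any iterable of lines, so it works on an open file handle as well as on a list of strings.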

*

Frequent Itemset Mining Dataset Repository

The following two datasets were generated using the generator from the IBM Almaden Quest research group. This generator can be downloaded from their website.
Another implementation, which can be compiled with the g++ compiler, can be downloaded from Paolo Palmerini's website.

The following datasets were prepared by Roberto Bayardo from the UCI datasets and PUMSB.

The next dataset was provided to us by Ferenc Bodon and contains (anonymized) click-stream data of a Hungarian online news portal.

kosarak (.gz)

Three datasets that were used for the KDD Cup 2000 are available.
They are described in the paper "Real world performance of association rule algorithms" by Zheng, Kohavi and Mason.
Before you can download the datasets, you must click through an agreement,
after which you receive a password that allows you to download them.

The following dataset was donated by Tom Brijs and contains the (anonymized) retail market basket data from an anonymous Belgian retail store.
The data are provided 'as is'. Basically, any use of the data is allowed as long as the proper acknowledgment is provided and a copy of the work is provided to Tom Brijs.
More details can be found here.

retail (.gz)

The following dataset was donated by Karolien Geurts and contains (anonymized) traffic accident data.
The data are provided 'as is'. Basically, any use of the data is allowed as long as the proper acknowledgement is provided and a copy of the work is provided to Karolien Geurts.
More details can be found here.

accidents (.gz)

The following dataset was donated by Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri and was built from a spidered collection of HTML web documents.
More details can be found here.

webdocs.dat.gz (488 MB zipped!)

http://fimi.cs.helsinki.fi/data/
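
The sketch below assumes the usual layout of these files: one transaction per line, written as integer item identifiers separated by spaces. Under that assumption, it streams a gzip-compressed file and counts how many transactions contain each item, without loading the whole file into memory. That matters for a file the size of webdocs. The function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List


def read_transactions(lines: Iterable[str]) -> Iterator[List[int]]:
    """Yield one transaction (a list of item ids) per non-empty line."""
    for line in lines:
        items = line.split()
        if items:
            yield [int(item) for item in items]


def item_supports(transactions: Iterable[List[int]]) -> Counter:
    """Count, for each item, the number of transactions that contain it."""
    counts: Counter = Counter()
    for transaction in transactions:
        # Use a set so an item repeated within one transaction counts once.
        counts.update(set(transaction))
    return counts


def supports_from_gzip(path: str) -> Counter:
    """Stream a gzip-compressed transaction file and count item supports."""
    with gzip.open(path, "rt") as handle:
        return item_supports(read_transactions(handle))
```

For example, `supports_from_gzip("retail.dat.gz").most_common(10)` would list the ten most frequent items, assuming the file has been downloaded under that local name.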

*

Gene Expression Omnibus: a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.

Public data