- KDnuggets http://www.kdnuggets.com/
Datasets for Data Mining
- KDD Cup center, with all data, tasks, and results.
- UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
- UCI Machine Learning Repository.
- AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- Delve, Data for Evaluating Learning in Valid Experiments
- FEDSTATS, a comprehensive source of US statistics and more
- FIMI repository for frequent itemset mining, implementations and datasets.
- Financial Data Finder at OSU, a large catalog of financial data sets
- GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval.
- Grain Market Research, financial data including stocks, futures, etc.
- ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
- Infobiotics PSP (protein structure prediction) datasets, adjustable real-world family of benchmarks for testing the scalability of classification/regression methods.
- Investor Links, includes financial data
- Microsoft's TerraServer, aerial photographs and satellite images you can view and purchase.
- MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
- NASDAQ Data Store, provides access to market data.
- National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
- National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
- PubGene(TM) Gene Database and Tools, genomic-related publications database
- SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
- SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.
- STATOO Datasets part 1 and STATOO Datasets part 2
- UCR Time Series Classification/Clustering page, offering datasets, papers, links, and code.
- United States Census Bureau.
KDD Cup and Workshop 2007
Co-organized by ACM SIGKDD and Netflix
To be held at KDD-2007, San Jose, California, Aug 12, 2007
3. Obtaining the Training Dataset and the Qualifying Answer Sets
The Netflix Prize training dataset is available for download from here. You must register separately at that site to download the training dataset, even if you elect not to enter the Netflix Prize contest itself. The format of the training data is described on the Netflix Prize website and in the training dataset file. No additional training data will be provided. The qualifying answer sets can be downloaded from the links below.
- The who_rated_what_2006.txt file consists of 100,000 lines, each containing a user_id and movie_id pair.
- The how_many_ratings_2006.txt file consists of 8863 lines, each containing a movie_id.
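As a rough illustration of the two file layouts described above, the following Python sketch reads a file of id pairs and a file of single ids. The exact delimiter is not stated here, so the sketch assumes the two ids on a line are separated by a comma or whitespace; the function names `parse_pairs` and `parse_ids` are hypothetical and not part of any contest tooling.

```python
import re
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse lines that each hold two integer ids.

    Assumption: the ids are separated by a comma and/or whitespace.
    The field order is returned exactly as it appears on the line.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = re.split(r"[,\s]+", line)
        if len(fields) != 2:
            raise ValueError(f"expected two fields, got: {line!r}")
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def parse_ids(lines: Iterable[str]) -> List[int]:
    """Parse lines that each hold a single integer id."""
    return [int(line.strip()) for line in lines if line.strip()]


if __name__ == "__main__":
    # Tiny in-memory samples; real use would pass an open file object instead,
    # e.g. open("who_rated_what_2006.txt").
    print(parse_pairs(["12,345", "67 89"]))  # [(12, 345), (67, 89)]
    print(parse_ids(["5\n", "\n", "6"]))     # [5, 6]
```

After loading the real files, a quick sanity check is to compare `len(...)` of each result against the line counts quoted above (100,000 and 8863).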
Welcome to the UCI Knowledge Discovery in Databases Archive
This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.
Creation of this archive was supported by a grant from the Information and Data Management Program at the National Science Foundation. The archive is intended to serve as a permanent repository of publicly-accessible data sets for research in KDD and data mining. It complements the original UCI Machine Learning Archive , which typically focuses on smaller classification-oriented data sets.
In addition to storing data and description files, we also archive task files that describe a specific analysis, such as clustering or regression, for the data sets stored. The call for data sets lists typical data types and tasks of interest.
Welcome to the UC Irvine Machine Learning Repository!
We currently maintain 177 data sets as a service to the machine learning community. You may view all data sets through our searchable interface. Our old web site is still available, for those who prefer the old format. For a general overview of the Repository, please visit our About page. For information about citing data sets in publications, please read our citation policy. If you wish to donate a data set, please consult our donation policy. For any other questions, feel free to contact the Repository librarians. We have also set up a mirror site for the Repository.
http://archive.ics.uci.edu/ml/
Frequent Itemset Mining Dataset Repository
Gene Expression Omnibus: a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.