Where to get datasets? Top 10 public dataset finders

As you might already know, machine learning (ML) is a branch of artificial intelligence (AI) in which computer systems learn and adapt without following explicit human instructions. Instead, these systems use algorithms and statistical models to analyze data and draw conclusions from its patterns. So when you work on ML projects and build AI and ML models, the data you have for training and testing is the core component that decides how your model will eventually perform. This means that the first step in model development is gathering (or finding) a suitable dataset to build SuperData for your AI. Only then can you focus on pixel-accurate annotations, proper labeling, and model training.

What is a dataset?

A dataset is a ready-made collection of images, audio, videos, texts, or tables for model training. Today, the internet is full of paid and free ML datasets, and we've compiled a complete list of public ML datasets to make it easier for you to pick the one that fits you best. However, depending on the use case, you may have a hard time finding a dataset for your specific project. In this article, you'll find the most popular dataset finders, so the next time you wonder where to get datasets, you'll know where to look.

Top 10 open dataset finders

Kaggle
Google Dataset Search
UCI Machine Learning Repository
Papers with Code
VisualData
OpenML
DataHub
Data.Gov
HealthData.gov
NLP Index

Kaggle

With over 50,000 public datasets and 400,000 public notebooks, Kaggle is both a data science competition website and a data science community, with tools and resources that include externally contributed machine learning datasets of all kinds. As a result, Kaggle is one of the best places to look for quality training data in areas ranging from health and sports to food, travel, and education. In addition, Kaggle supports various dataset publication formats and encourages sharing accessible data so that more researchers, and the platform overall, can benefit from it.

Google Dataset Search

This one is a search engine from Google that helps researchers locate freely available online data. Similar to how Google Scholar works, Google Dataset Search lets you find datasets wherever they are hosted, whether on a publisher's site, in a digital library, or on an author's web page. Among over 25 million datasets, you can find economic and financial data and datasets uploaded by organizations like WHO, Statista, or Harvard.

UCI Machine Learning Repository

One of the oldest dataset aggregators on the web, the UCI Machine Learning Repository is a collection of databases, domain theories, and data generators used by the machine learning community to analyze machine learning algorithms. It was created as an FTP archive in 1987, and the website design was updated in 2007. Widely used by students, educators, and researchers worldwide, the repository has earned the status of a primary source of machine learning datasets. As an indication of the archive's impact, it has been cited over 1,000 times, making it one of the top 100 most cited "papers" in all of computer science. The 588 datasets are categorized by task, attribute, data type, and area of expertise, and are available for download without registration.

Papers with Code

As stated on the website, the mission of Papers with Code is to create a free and open resource with ML papers, code, datasets, methods, and evaluation tables. The community project now includes over 3,900 datasets that are easily filtered by modality, task, or language. The content on this website is openly licensed under CC-BY-SA (same as Wikipedia), meaning everyone can contribute and make edits. There are also specialized portals for papers with code in astronomy, physics, computer science, mathematics, and statistics.

VisualData

VisualData is a search engine for computer vision datasets and an excellent source of data for image classification, image processing, and image segmentation projects. Each dataset is either curated or submitted by the community and tagged with relevant topics for filtering. You can filter datasets by category, date, or popularity, or type keywords in the search bar, which are matched against the title and description of each dataset, to quickly find a theme-specific one. Results are sorted by publication date by default, so the most recent additions surface first. You can also sort by popularity to see trending datasets based on visit frequency.
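To make the search behavior described above concrete, here is a minimal sketch of keyword matching against titles and descriptions, followed by sorting by date or popularity. It is an illustration only, not VisualData's actual implementation: the DatasetEntry record, the search function, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetEntry:
    """One catalog record; every field here is hypothetical."""
    title: str
    description: str
    published: date
    visits: int


def search(entries, keywords, sort_by="published"):
    """Return entries whose title or description contains every keyword."""
    terms = [k.lower() for k in keywords.split()]
    matches = [
        e for e in entries
        if all(t in f"{e.title} {e.description}".lower() for t in terms)
    ]
    # Newest first by default; most visited first when sorting by popularity.
    if sort_by == "popularity":
        return sorted(matches, key=lambda e: e.visits, reverse=True)
    return sorted(matches, key=lambda e: e.published, reverse=True)


# Made-up catalog entries, not real listings.
catalog = [
    DatasetEntry("Street scenes", "Urban images for segmentation", date(2021, 3, 1), 120),
    DatasetEntry("Aerial crops", "Drone images for segmentation of fields", date(2022, 6, 15), 45),
    DatasetEntry("Handwritten digits", "Grayscale digits for classification", date(2020, 1, 10), 900),
]

recent = search(catalog, "segmentation images")
popular = search(catalog, "segmentation images", sort_by="popularity")
print([e.title for e in recent])   # newest match first
print([e.title for e in popular])  # most visited match first
```

Requiring every keyword to match keeps results narrow; a real search engine would more likely rank partial matches instead of dropping them.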

OpenML

This one is an open-source online database of ML experiments for sharing and organizing data with ML engineers and researchers. Its more than 21,000 datasets are regularly updated and automatically versioned. In addition, OpenML analyzes each dataset and annotates it with rich metadata to streamline analysis.

DataHub

DataHub is a collection of thousands of machine learning datasets available for free without registration. The collection spans financial market data, macroeconomic data, population growth figures, and cryptocurrency prices.

Data.Gov

Data.Gov was launched in 2009 and is managed and hosted by the Technology Transformation Service of the U.S. General Services Administration. The platform is powered by two open-source applications, CKAN and WordPress, and is developed publicly on GitHub. There are over 330,700 datasets available to download in multiple formats, including CSV, JSON, PDF, RDF, and XML.
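Because the catalog runs on CKAN, it can in principle be queried programmatically as well as through the website. The sketch below assumes the catalog exposes CKAN's standard Action API (the package_search action) at catalog.data.gov; the endpoint URL, the helper names, and the sample payload are assumptions for illustration, and the payload is hand-written rather than real output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the standard CKAN Action API on the Data.gov catalog.
API_BASE = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a package_search URL for a free-text query."""
    return f"{API_BASE}?{urlencode({'q': query, 'rows': rows})}"


def summarize(payload):
    """Extract (title, sorted resource formats) pairs from a search response."""
    results = payload.get("result", {}).get("results", [])
    summary = []
    for pkg in results:
        formats = sorted(
            {r["format"].upper() for r in pkg.get("resources", []) if r.get("format")}
        )
        summary.append((pkg.get("title", ""), formats))
    return summary


def fetch(query, rows=5):
    """Run the search over the network (defined for completeness, not called here)."""
    with urlopen(build_search_url(query, rows), timeout=30) as resp:
        return json.load(resp)


# Hand-written payload in the shape CKAN returns; not real Data.gov output.
sample = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "title": "Example dataset",
                "resources": [
                    {"format": "CSV", "url": "https://example.org/data.csv"},
                    {"format": "json", "url": "https://example.org/data.json"},
                ],
            }
        ],
    },
}

print(build_search_url("air quality"))
print(summarize(sample))
```

Calling fetch("air quality") would issue the live request; check the current API documentation first, since endpoints and rate limits can change.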

HealthData.gov

This site is dedicated to making high-value health data more accessible to entrepreneurs, researchers, and policymakers in the hopes of better health outcomes for all. On HealthData.gov, you can find data on a wide range of topics, including environmental health, medical devices, social services, community health, mental health, and substance abuse. The data is collected and supplied by agencies of the U.S. Department of Health and Human Services and by state partners. These include the Centers for Medicare and Medicaid Services, the Centers for Disease Control and Prevention, the Food and Drug Administration, and the Agency for Healthcare Research and Quality.

NLP Index

The NLP Index is a brand-new resource for NLP code discovery, combining and indexing more than 3,000 paper and code pairs at launch. Created and curated by Quantum Stat, it is handy for NLP research since it contains datasets for various natural language processing tasks. Each entry in the index includes a paper title, its abstract, the authors, and links to the paper itself and to a corresponding GitHub repository.

Key takeaways

An ML model can't perform well without a proper dataset. Keep in mind that the dataset should be in line with your project requirements. A few core considerations include the number of instances, how balanced the dataset is, and whether it contains all the elements you need to label. Then, leverage these dataset finders and make sure to pick the one that provides the best data foundation for the AI model you're building.
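The considerations above (number of instances, class balance, and coverage of the classes you need) can be checked quickly once you have a candidate dataset's labels. The following is a minimal sketch; the dataset_report function, the imbalance ratio it reports, and the toy labels are illustrative assumptions, not a standard metric or real data.

```python
from collections import Counter


def dataset_report(labels, required_classes):
    """Summarize size, class balance, and missing classes for a list of labels."""
    counts = Counter(labels)
    # Imbalance ratio: largest class size divided by smallest (1.0 = perfectly balanced).
    ratio = max(counts.values()) / min(counts.values()) if counts else None
    return {
        "instances": len(labels),
        "per_class": dict(counts),
        "imbalance_ratio": ratio,
        "missing_classes": sorted(set(required_classes) - set(counts)),
    }


# Toy labels for illustration only.
labels = ["cat"] * 6 + ["dog"] * 3 + ["bird"] * 1
report = dataset_report(labels, required_classes=["cat", "dog", "bird", "fish"])
print(report)
```

A high imbalance ratio or a non-empty list of missing classes suggests the dataset needs to be supplemented or rebalanced before training.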