Machine learning (ML) is a part of artificial intelligence (AI) that studies computer algorithms that automatically improve through processing data. ML algorithms use historical data as input to output new prediction values. Nowadays, ML is a central part of world-leading companies such as Facebook, Google, Amazon, etc., and it’s nearly impossible to estimate the number of domains that ML contributes to.
It’s no secret that data is the core component of any ML project. To make sure we receive quality output predictions, we have to make sure our inputs are crystal clear in the first place, meaning we should collect the data thoroughly and carefully. However, this may be a tedious task to complete and, often, the most expensive one. Imagine developing a face recognition algorithm – you may need hundreds of thousands of various images, including faces of different shapes, colors, lighting conditions, etc. And the good news? Someone already did it for you and was kind enough to put out this collection of images, videos, texts, or data tables on the internet. In this article, we’ll introduce a list of free-to-use datasets for machine learning.
What is a machine learning dataset?
In order to craft ML models, we usually need huge amounts of data that are typically grouped into what we call a dataset. In other words, a dataset is essentially a collection of information records referring to a specific subject.
Top public machine learning datasets
To make it easier for you to filter the dataset you might be looking for, we’ve decided to organize our list into groups representing different ML problems a specific dataset helps to solve or the industry where this dataset belongs, so feel free to jump to any section of your preference.
Face recognition, the task of identifying and verifying people in photographs by their facial features, is an area where models now reach very high, arguably superhuman, accuracy. Such models are trained on vast datasets, whether publicly available or collected manually. Let's dive into some of the open face recognition datasets to help you get started.
- With more than 200K celebrity images and 40 attribute annotations for each image, CelebFaces Attributes Dataset (CelebA) provides a good starting point for face recognition, face detection, landmark (or facial part) localization, and face editing & synthesis projects. On top of that, the images in this dataset cover large pose variations and background clutter.
- CelebAMask-HQ is a dataset of 30,000 high-resolution face images with manually-annotated masks and 19 classes including facial components such as skin, nose, eyes, eyebrows, ears, mouth, lip, hair, hat, eyeglass, earring, necklace, neck, and cloth. The dataset can be used to train and evaluate algorithms of face parsing, face recognition, and GANs for face generation and editing.
- CelebA-Spoof is again a CelebA-based dataset that has 625,537 images from 10,177 subjects, which includes 43 rich attributes, where three attributes belong to spoof images including spoof types, environments, and illumination conditions and the rest are facial parts and accessories.
- Google facial expression comparison dataset consists of around 500,000 image triplets with 156,000 face images. It’s important to notice that each triplet in this dataset was annotated by six or more human raters. This dataset supports projects related to facial expression analysis such as expression-based image retrieval, expression-based photo album summarization, emotion classification, expression synthesis, etc. One needs to fill out a short form to access the dataset.
- IMDb-Face is a large-scale noise-controlled dataset for face recognition research, and as the name suggests, all images were obtained from the IMDb website. The dataset contains about 1.7 million faces, 59k identities, which were manually cleaned from 2.0 million raw images.
- The Facial Deformable Models of Animals (FDMA) project aims to challenge current approaches to human facial landmark detection and tracking and to offer new algorithms that can deal with the much larger variability typical of animal facial features. The algorithms presented by the project proved capable of detecting and tracking landmarks on human faces while handling variations caused by changes in facial expression or pose, partial occlusions, and illumination.
Public object detection datasets are widespread, each addressing a particular research problem in scene understanding. Below we provide a list to cut down your search time.
- DOTA (Dataset of Object deTection in Aerial images) is a large-scale dataset for object detection that contains 15 common categories (e.g., ship, plane, vehicle, etc.), 1,411 training images, and 458 validation images.
- COCO (Common Objects in Context) is one of the most popular and common large-scale image datasets that works well for object detection, keypoint detection, semantic segmentation, panoptic segmentation, and image captioning tasks.
- Pascal Visual Object Classes (VOC) is a collection of patterned image and annotation datasets for object detection, class recognition, and instance/semantic segmentation. The dataset is used to assist in standard evaluation procedures and allows for the evaluation and comparison of different methods.
- The Pascal3D+ multi-view dataset consists of images in the wild, i.e., images of object categories exhibiting high variability, captured under uncontrolled settings, in cluttered scenes, and under many different poses. Pascal3D+ contains 12 categories of rigid objects selected from the PASCAL VOC 2012 dataset. These objects are annotated with pose information (azimuth, elevation, and distance to the camera). Pascal3D+ also adds pose-annotated images of these 12 categories from the ImageNet dataset.
- LVIS (pronounced 'el-vis') is a dataset for Large Vocabulary Instance Segmentation that aims to collect around 2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images.
- MOT (The Multiple Object Tracking) is a dataset for multiple object tracking with indoor and outdoor scenes of public places with pedestrians as the objects of interest. A video for each scene is divided into two clips, one for training and the other for testing. The dataset provides detections of objects in the video frames with three detectors, namely SDP, Faster-RCNN, and DPM.
- Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, 17 questions per image on average. Compared to the Visual Question Answering dataset, Visual Genome represents a more balanced distribution over 6 question types: What, Where, When, Who, Why, and How. The Visual Genome dataset also presents 108K images with densely annotated objects, attributes, and relationships.
- The MPII Human Pose Dataset for single person pose estimation is composed of about 25K images, of which 15K are training samples, 3K are validation samples, and 7K are testing samples (whose labels are withheld by the authors). The images are taken from YouTube videos covering 410 different human activities, and the poses are manually annotated with up to 16 body joints.
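Pose annotations such as the azimuth, elevation, and camera distance mentioned for Pascal3D+ are often converted into a Cartesian camera position before they are used. Here is a minimal sketch of that conversion. The function name, the use of degrees, and the axis convention (object at the origin, z pointing up, azimuth measured in the x-y plane from the x-axis) are assumptions made for illustration, not the dataset's official definition, so check the dataset documentation before relying on them.

```python
import math

def camera_position(azimuth_deg, elevation_deg, distance):
    """Convert a spherical pose to an (x, y, z) camera position.

    Assumed convention (not taken from the dataset documentation):
    the object sits at the origin, z points up, and azimuth is
    measured in the x-y plane starting from the x-axis.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.cos(az)
    y = distance * math.cos(el) * math.sin(az)
    z = distance * math.sin(el)
    return x, y, z

# Hypothetical pose: camera level with the object, a quarter turn around it.
x, y, z = camera_position(azimuth_deg=90.0, elevation_deg=0.0, distance=2.0)
print(round(x, 6), round(y, 6), round(z, 6))
```

Whatever convention a dataset uses, the returned point should always lie at the annotated distance from the origin, which makes for a quick sanity check.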
Collecting and loading an entire dataset into local storage is both time-consuming and impractical, especially if you’re dealing with video. Using open datasets can trim down cumbersome and labor-intensive processes in your data pipeline. We’ve listed the top open-source video datasets.
- BDD100K is one of the most diverse open video datasets collected by a driving platform. As the name suggests, the dataset consists of 100K videos, 40 seconds each. The videos also contain GPS/IMU information to feature approximate route trajectories.
- Cityscapes is one of the most popular large-scale datasets of stereo videos featuring urban scenes. It contains recordings from 50 different cities of Germany with pixel-accurate annotations providing GPS coordinates, outside temperature, ego-motion data, and right stereo views as well.
- VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps with pixel-wise annotation of salient objects.
- The Kinetics dataset is among the most popular video datasets representing a large-scale, high-quality dataset for human action recognition. It consists of around 500,000 video clips covering 600 human action classes, with at least 600 video clips for each action class. The videos are collected from YouTube, each clip lasts around 10 seconds and is labeled with a single action class.
- UCF101 dataset consists of 13,320 video clips, which are classified into 101 categories. These 101 categories can be classified into five types: body motion, human-human interactions, human-object interactions, playing musical instruments, and sports. The videos are collected from YouTube with a total length of 27 hours.
- The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset is composed of 6,849 video clips from 51 action categories, such as jump, kiss, and laugh, with each category containing at least 101 clips.
- The Densely Annotated Video Segmentation dataset (DAVIS) contains 50 video sequences with 3455 pixel-level densely annotated frames, where 30 videos with 2079 frames are for training, and 20 videos with 1376 frames are for validation.
- KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets to use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Various researchers have manually annotated parts of the dataset to fit their necessities and perfected the dataset over time.
Machine learning models perceive and contextualize the world around them not only through computer vision but also through sound and audio. Today, anyone can train speech-enabled applications in the craziest sound environments you could think of by using open-source audio databases.
- The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are a part of the LibriVox project. Most of the audiobooks come from Project Gutenberg. The training data is split into three partitions of 100hr, 360hr, and 500hr sets, while the dev and test data are around 5hr in audio length.
- The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. Version 2.7, released in 2020, consists of 183 treebanks over 104 languages. The annotation consists of universal part-of-speech tags, dependency heads, and universal dependency labels.
- VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities extracted from videos uploaded to YouTube. The dataset can be applied for speaker identification, speech separation, emotion recognition, and more.
- VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. The dataset is audio-visual, so is also useful for a number of other applications, for example – visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to complement existing face recognition datasets.
- Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound could be annotated as different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog.
- The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.
- The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of various complexity, such as asking about the weather or booking a restaurant. The training set contains 13,084 utterances, and the validation and test sets contain 700 utterances each, with 100 queries per intent.
- Common Voice is an audio dataset in which each entry consists of a unique MP3 and a corresponding text file. There are 9,283 recorded hours in the dataset, of which 7,335 hours are validated, across 60 languages. The dataset also includes demographic metadata like age, sex, and accent.
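With a hierarchical ontology like the one described for Audioset, a single sound can carry a label at every level of the tree. The sketch below shows one way to expand a leaf label into its full chain of ancestors. The four-entry child-to-parent map is a toy assumption built from the barking example above; the real ontology is far larger and may be structured differently.

```python
# Toy child -> parent map; the real Audioset ontology is far larger and
# these four entries are illustrative assumptions only.
PARENT = {
    "Bark": "Dog",
    "Dog": "Pets",
    "Pets": "Animal",
    "Animal": None,
}

def with_ancestors(label, parent=PARENT):
    """Return the label followed by all of its ancestors, leaf first."""
    chain = []
    while label is not None:
        chain.append(label)
        label = parent.get(label)  # unknown labels have no parent
    return chain

print(with_ancestors("Bark"))  # ['Bark', 'Dog', 'Pets', 'Animal']
```

Expanding labels this way lets a classifier be trained or evaluated at whichever level of granularity a project needs.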
Public domain datasets with textual data find applications both in natural language processing (NLP) and optical character recognition (OCR). These datasets cover documents as diverse as reviews, emails, books, and so forth. Let’s go through several examples below:
- The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of the Wall Street Journal (WSJ), is one of the best-known and most widely used corpora for the evaluation of models for sequence labeling. The task consists of annotating each word with its part-of-speech tag. The corpus is also commonly used for character-level and word-level language modeling.
- The Stanford Question Answering Dataset (SQuAD) is a diverse collection of question-answer pairs derived from Wikipedia articles. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 un-answerable questions written adversarially by crowd workers in forms that are similar to the answerable ones.
- Visual Question Answering (VQA) is a dataset containing open-ended questions about images. These questions require an understanding of vision, language, and commonsense knowledge to answer. The first version of the dataset was released in October 2015. VQA v2.0 was released in April 2017.
- The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.
- ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose.
- The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence pairs manually labeled as entailment, contradiction, and neutral. Premises are image captions from Flickr30k, while hypotheses were generated by crowd-sourced annotators who were shown a premise and asked to generate entailing, contradicting, and neutral sentences.
- CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects, where each image comes with a number of highly compositional questions that fall into different categories. The dataset consists of a training set of 70,000 images and 700,000 questions, a validation set of 15,000 images and 150,000 questions, and a test set of 15,000 images and 150,000 questions. Answers, scene graphs, and functional programs are provided for all train and validation images and questions.
There is a steady increase in the use of ML and deep learning techniques in healthcare. If you would like to practice and see what working with such data looks like, the datasets below are a good choice.
- Breast Cancer Wisconsin (Diagnostic) Data Set helps to predict whether the cancer is benign or malignant based on digitized images of a fine needle aspirate (FNA) of a breast mass. The dataset is composed of 569 examples, which include 357 benign and 212 malignant instances.
- The objective of the Pima Indians Diabetes Database is to predict whether or not a patient has diabetes based on certain diagnostic measurements. This dataset contains 768 observations, with eight input features and one output feature. It is not a balanced dataset, and it is assumed that missing values are replaced with 0.
- SOCR Data contains the heights and weights of 25,000 different humans of 18 years of age. This dataset is a good fit for building a model that can predict the height or weight of a human.
- International Collaboration on Cancer Reporting (ICCR) presents a collection of 12 datasets arranged according to 12 anatomical sites where cancer occurs. The aim is to ensure that the datasets produced for different tumor types have a consistent style and content and contain all the parameters needed for the management and prognosis of individual cancers.
- This Heart disease dataset helps recognize the presence of heart disease in a patient based on 76 attributes such as age, sex, chest pain type, resting blood pressure, etc. With 303 instances, the database aims at simply attempting to distinguish the presence of a disease (values 1,2,3,4) from absence (value 0).
- Published by the Centers for Disease Control and Prevention, U.S. Chronic Disease Indicators (CDI) allows states, territories, and large metropolitan areas to uniformly collect and report chronic disease data that are important to public health practice. It’s a relatively new dataset, last updated in April 2021, with information covering 2008-2019 and data tables about gender, disease, mortality outcome, and more.
- Heart failure is a common event caused by cardiovascular diseases, and this heart failure dataset contains 12 features to predict mortality by heart failure. It’s free to download as a CSV file with columns covering age, sex, diabetes presence, blood pressure, etc.
- MIMIC-III is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with around 40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
- Ocular Disease Intelligent Recognition (ODIR) is a structured ophthalmic database of 5,000 patients with age, color fundus photographs from left and right eyes, and diagnostic keywords from doctors. This dataset represents a real-life set of patient information collected by Shanggong Medical Technology Co., Ltd. from different hospitals/medical centers in China. Annotations were labeled by trained human readers with quality control management.
- The Fetal Health Classification aims at classifying the health of a fetus as normal, suspect or pathological using CTG data in order to prevent child and maternal mortality. Here, 2126 fetal CTGs were processed and labeled by three expert obstetricians. The set is suitable either for 10-class or 3-class experiments.
More publicly-available healthcare datasets can be found here.
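When a dataset encodes missing measurements as 0, as noted for the Pima Indians Diabetes Database, a common first step is to treat those zeros as gaps and fill them in. The sketch below does this with median imputation. The rows, column names, and the choice of the median are assumptions for illustration; they do not come from the dataset's documentation.

```python
from statistics import median

# Made-up rows for illustration; the column names are assumptions.
rows = [
    {"glucose": 148, "blood_pressure": 72, "bmi": 33.6},
    {"glucose": 0,   "blood_pressure": 66, "bmi": 26.6},
    {"glucose": 183, "blood_pressure": 0,  "bmi": 23.3},
    {"glucose": 89,  "blood_pressure": 64, "bmi": 0.0},
]

def impute_zeros(rows, columns):
    """Replace 0 in the given columns with the median of the non-zero values."""
    cleaned = [dict(r) for r in rows]  # copy so the input is left untouched
    for col in columns:
        observed = [r[col] for r in rows if r[col] != 0]
        fill = median(observed)  # raises StatisticsError if a column is all zeros
        for r in cleaned:
            if r[col] == 0:
                r[col] = fill
    return cleaned

cleaned = impute_zeros(rows, ["glucose", "blood_pressure", "bmi"])
for r in cleaned:
    print(r)
```

Only pass columns where a zero cannot be a real measurement; in a count-style column, zero may be a legitimate value and should be left alone.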
The pandemic continues to have a devastating effect on the health and well-being of the global population and ML engineers have been trying to support the ongoing research as well as to suggest solutions that may facilitate the treatment processes.
- The COVID-19 Medical Face Mask Detection Dataset is a refined combination of two sets – Medical Masks Dataset published by Mikolaj Witkowski with 682 pictures and over 3k medical masked faces, and Face Mask Dataset with 853 images. After eliminating low-quality images and redundancy, this dataset now contains 1415 images and is a great fit for mask detection models.
- Created by Linda Wang and Alexander Wong and published in 2020, COVID-Net is a tailored deep convolutional neural network design that helps detect COVID-19 cases based on chest X-ray images. The chest radiography dataset leveraged to train COVID-Net is referred to as COVIDx and comprises 16,756 chest radiography images across 13,645 patient cases.
- Face masks are one way of limiting the spread of COVID-19. In this context, efficient recognition systems are expected to check whether people are masked properly in regulated areas. MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (133,783 images) based on the dataset Flickr-Faces-HQ (FFHQ).
- CORD-19 is an AI research challenge and a dataset representing an extensive machine-readable coronavirus literature collection available for data mining. This allows the worldwide AI research community to apply text and data mining approaches and analyze the fresh content in support of the ongoing COVID-19 response efforts worldwide. Kaggle is sponsoring a $1,000 prize per task, so it’s safe to say the dataset is fresh and constantly updated.
- CT scans play a supportive role in the diagnosis of COVID-19. Models that can find evidence of COVID-19 and characterize its findings can contribute to optimizing diagnosis and treatment, especially with a shortage of expert radiologists. This COVID-19 CT scans dataset contains 20 CT scans of patients diagnosed with COVID-19 as well as segmentation of lungs and infections made by experts.
- The United States COVID-19 County Level of Community Transmission as Originally Posted public use dataset has 7 data elements reflecting community transmission levels (low, moderate, substantial, or high) of the infection. Currently, there are two versions of COVID-19 county-level community transmission level data: this dataset with the levels as originally posted, updated daily with the most recent data, and a historical dataset with the county-level transmission data from January 1, 2021 (Historical Changes dataset).
- The COVID-19 dataset by Our World in Data offers country-level vaccination information along with a locations data file that includes information on vaccination sources. The dataset is special because it’s updated every day and contains data on the total number of vaccinations on a specific day, the total number of people vaccinated, the number of people fully vaccinated, and more.
The role of machine learning in agriculture has certainly been increasing, so datasets in this domain are more useful than ever for practicing ML model building.
- The wine quality dataset holds a number of chemical criteria about the wine, including fixed acidity, volatile acidity, residual sugar, chlorides, and more. The goal is to design a model that can predict whether the wine is of poor, normal, or excellent quality. With 4898 instances, this dataset is suitable for classification and regression tasks.
- The Food and Agriculture Organization (FAO) of the United Nations provides free access to food and agriculture data for over 245 countries and territories from 1961-2013. One of their projects, the Food Balance Sheets dataset shares insights on our worldwide food production - focusing on a comparison between food produced for human consumption and feed produced for animals.
- The dataset of wildfires in the United States includes wildfire data for the period of 1992-2015 compiled from US federal, state, and local reporting systems. With a total of 3 updates, it was last updated a year ago. This dataset is an SQLite database that contains information on the fire name, code, year, its longitude and latitude, and more.
- FAOSTAT is a dataset by the Food and Agriculture Organization (FAO) of the United Nations where one can filter, view, and download data on demographics, hunger and food insecurity, food access and utilization, and more. The charts contain data from 1990 - 2019 and fit well for building prediction models.
- The Crop Recommendation Dataset is a relatively fresh dataset that was published in 2020, aiming to maximize agricultural yield by recommending appropriate crops. The dataset was built by augmenting Indian rainfall, climate, and fertilizer datasets, allowing users to build a predictive model to recommend the most suitable crops for a particular farm based on various parameters, such as rainfall, humidity level, soil pH value, and more.
Security and fraud
Today it would be really hard to imagine the industry of security and surveillance without machine learning and computer vision, so we’ve reasoned that introducing a list of security datasets would be beyond useful.
- Fake and real news dataset contains two CSV lists for fake and real news. The list with fake news consists of information about 17903 news articles, whereas the one for real news comprises 20826 unique values. The news is narrowed down to United States politics.
- Being heavily used in literature, the Spam SMS dataset is a good choice to practice spam detection and text classification. With a total of 5574 instances, the set represents a text file with the tag (ham or spam) followed by the raw messages collected from multiple sources.
- An important aspect of security is bank security, namely the recognition of fraudulent credit card transactions. The Credit Card Fraud Detection dataset consists of 284,807 transactions made by European cardholders in September 2013. Of these, only 492 are fraudulent, making the dataset highly unbalanced. However, just recently a simulator for transaction data has been released as part of the practical handbook on Machine learning for credit card fraud detection, so check it out if interested.
- Another fraud detection dataset is the Synthetic financial dataset targeted at mobile money transactions. The dataset is generated using the simulator called PaySim that aggregates data from the private dataset and generates a synthetic dataset that resembles the real transactions. Besides, the simulator injects malicious behavior to later evaluate the performance of fraud detection methods.
- Credit scoring is yet another pressing issue, and the Credit card approval prediction dataset lays the groundwork for models that can differentiate between reliable and unreliable clients according to historical data. The dataset consists of two lists – application records and credit records.
- Global Terrorism Database is an open-source CSV file with more than 180,000 terrorist attacks worldwide from 1970 through 2017. Information such as the country, region, occurrence date, attack and target types, and even resolution is presented in the set.
- The SIXray dataset contains 1,059,231 X-ray images collected from subway stations and annotated by human security inspectors with a purpose to detect six common categories of prohibited items – gun, knife, wrench, pliers, scissors, and hammer. In addition, there are manually added bounding boxes on the testing sets for each prohibited item to evaluate the performance of object localization.
- The Handgun Detection dataset is aimed at contributing to the improvement of public safety by detecting handguns within images. This dataset was first published by the University of Granada. It contains 2986 images and 3448 labels across a single annotation class, which includes pistols, pistols in hand, and various types of gun images.
- FireNet is a real-time fire detection project, which aims to ensure that ML systems can be trained to detect fires instantly and eliminate false alerts. It contains an annotated dataset, pre-trained models, and inference code. The project contains 502 images split into 412 images for training and 90 images for validation.
- The name of the US Accidents dataset already reveals the content of the project. It’s a countrywide car accident dataset covering 49 states of the USA with data from February 2016 to December 2020. Currently, there are about 1.5 million accident records in this dataset. It has been collected in real-time, using multiple traffic APIs. These APIs broadcast traffic data collected by various entities, for example, US and state departments of transportation, law enforcement agencies, and traffic cameras.
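To see what "highly unbalanced" means in practice for the credit card fraud figures quoted above (492 fraudulent transactions out of 284,807, roughly 0.17%), here is a small arithmetic sketch. It computes the fraud rate, the accuracy of a model that never flags fraud, and inverse-frequency class weights. The weighting formula is one common heuristic chosen for illustration; it is not something the dataset prescribes.

```python
total = 284_807   # transactions, from the description above
fraud = 492       # fraudulent transactions, from the description above
legit = total - fraud

fraud_rate = fraud / total
# Accuracy of a useless model that labels every transaction as legitimate.
baseline_accuracy = legit / total

# Inverse-frequency ("balanced") class weights: total / (n_classes * count).
weights = {
    "legit": total / (2 * legit),
    "fraud": total / (2 * fraud),
}

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"class weights:     {weights}")
```

Because the do-nothing baseline already scores above 99.8% accuracy, metrics such as precision, recall, or the area under the precision-recall curve are usually more informative for data like this.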
Flora and fauna
ML and computer vision are integral parts of the world we live in. Below, we share some of the most popular and useful datasets that are well suited to image classification, recognition, or segmentation tasks.
- Iris Data Set is perhaps the best-known database to be found in the pattern recognition literature due to R.A. Fisher’s classic paper that’s referenced frequently to this day. The data set contains three classes of 50 instances each, where each class refers to a type of iris plant.
- Published in 2020, the large-scale fish dataset contains image samples of nine different seafood types: gilt head bream, red sea bream, sea bass, red mullet, horse mackerel, black sea sprat, striped red mullet, trout, and shrimp. This dataset was collected in order to carry out segmentation, feature extraction, and classification tasks and proved to perform well.
- The INRIA-Horse dataset consists of 170 horse images annotated with bounding boxes and 170 images without horses. The main challenges it offers are clutter, intra-class shape variability, and scale changes. As for computer vision, the dataset is suitable for object detection, edge detection, and classification tasks.
- The Stanford Dogs dataset contains 20,580 images of 120 dog breeds from around the world. This dataset has been built using images and bounding box labeled annotations from ImageNet for fine-grained image categorization.
- Originally donated to the UCI Machine Learning repository, the Mushroom dataset was designed to predict whether a specific mushroom is safe or poisonous to eat. The set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.
- Animals-10 is a basic dataset of 10 animal categories: dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, elephant. With 28 thousand medium-quality animal images, this set is great for testing image recognition or classification tasks.
- Here’s another animal dataset with just three categories (dog, cat, wildlife) but high-quality images at 512×512 resolution. The set consists of 16,130 images with more than 5000 for each category. Again, a very useful dataset for image classification or recognition.
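For balanced classification sets such as Iris (three classes of 50 instances each), it is common to split the data so that every class keeps the same share in the training and test parts. Here is a minimal standard-library sketch of such a stratified split. The placeholder class names, the 80/20 ratio, and the fixed seed are assumptions for illustration, and the code works on indices only, so no real samples are involved.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Placeholder labels mirroring the 3 x 50 layout described above.
labels = ["class_a"] * 50 + ["class_b"] * 50 + ["class_c"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The returned indices can then be used to select rows from whatever structure holds the actual features.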
Because ML is becoming more and more popular in nearly every industry and our daily life, the number of resources and information about it is growing accordingly. Ready-made public datasets provide a perfect ground for beginners to start building AI models and help seasoned ML engineers to save some time and focus on other aspects of their project. You can save this post and return to it when looking for your next dataset, and if there are more sets you’d like us to add, feel free to let us know.