Enterprises and research centres generate large amounts of data that they can use for machine learning research. For an individual or a small business, however, collecting quality data often takes more time and money than is available.
Today we will share some of the best open resources where you can find quality datasets for almost any type of research. They are free and easy to access. For your future analysis, check this guide to data pre-processing in machine learning.
Google Dataset Search
Google Dataset Search lets you find open datasets on almost any topic. It is easy to use: enter a keyword and start searching. You can filter the results by last update, format, license, and broad topic. Most of the datasets come from organizations such as Statista, WHO, and the New York Times rather than from individuals.
Kaggle
This is one of the most important online resources dedicated to machine learning and data science. In the “Data” tab, you will see more than 60 thousand datasets. They are created both by large organizations such as Google, Johns Hopkins University, and Airbnb, and by individual contributors who list their sources. The information is already structured, so you can download it and start working. Kaggle also lets you explore the data directly in the browser: just click the “New Notebook” button.
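As a minimal sketch of that first look, assuming the dataset has been downloaded as a CSV file, the Python snippet below uses only the standard library to count the rows and guess a type for each column. The sample text and the `guess_type` and `summarize_csv` helpers are hypothetical illustrations, not part of Kaggle's tooling.

```python
import csv
import io

# Hypothetical stand-in for a downloaded file; with a real dataset you
# would pass open("your_file.csv", newline="") instead.
SAMPLE = """city,price,rating
Paris,120,4.5
Rome,95,4.1
Oslo,150,
"""


def guess_type(values):
    """Return 'number' if every non-empty value parses as a float, else 'text'."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "number"


def summarize_csv(handle):
    """Count the rows and guess a type for each column of a CSV file."""
    rows = list(csv.DictReader(handle))
    columns = list(rows[0].keys()) if rows else []
    # Short rows yield None for missing fields, so fall back to "".
    types = {col: guess_type([row[col] or "" for row in rows]) for col in columns}
    return len(rows), types


row_count, column_types = summarize_csv(io.StringIO(SAMPLE))
print(row_count, column_types)
```

A summary like this is only a starting point, but it quickly shows whether a column you expect to be numeric actually contains text.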
GitHub
GitHub is one of the most popular platforms for hosting code and managing collaborative projects. Any data scientist can post whatever they feel like sharing there, so the quality of the datasets varies: some are good, many are not. It is worth exploring the platform, and you may well find a needle in a haystack, but nobody can guarantee the quality of the data.
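Because quality is not guaranteed, a quick sanity check before any analysis can save time. The sketch below, written in Python with the standard library only, counts missing values per column and exact duplicate rows in a CSV file. The sample data and the `quality_report` helper are hypothetical, and a real check would likely need rules specific to the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a CSV file found in a repository.
SAMPLE = """id,name,score
1,Ann,0.9
2,Bob,
2,Bob,
3,,0.4
"""


def quality_report(handle):
    """Count data rows, empty cells per column, and exact duplicate rows."""
    rows = list(csv.reader(handle))
    if not rows:
        return {"rows": 0, "missing": {}, "duplicate_rows": 0}
    header, data = rows[0], rows[1:]
    missing = Counter()
    for row in data:
        for name, value in zip(header, row):
            if not value.strip():
                missing[name] += 1
    # A row counts as a duplicate if an identical row appeared before it.
    duplicates = len(data) - len({tuple(row) for row in data})
    return {"rows": len(data), "missing": dict(missing), "duplicate_rows": duplicates}


report = quality_report(io.StringIO(SAMPLE))
print(report)
```

If the report shows many empty cells or repeated rows, that is a signal to look for a better-maintained source before investing more effort.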
Amazon Public Datasets
Amazon is one of the leading players in the machine learning market. They create off-the-shelf software and private databases for companies that want to do ML in a few clicks; some of their solutions, however, are public.
On their website, you can access 203 datasets covering different topics. While this number might not seem impressive, the quality of the data is high. The datasets are published by organizations such as Facebook, NASA, and the Space Telescope Science Institute. So, if your research is in a scientific domain, this collection is well worth exploring.
UCI Machine Learning Repository
The UC Irvine Machine Learning Repository hosts a large open library of datasets that you might want to explore. If you are looking for a dataset on diseases, forest fires, wine quality, car evaluation, or other topics, you will find it here.
The library contains not only tabular data but also datasets for computer vision, image processing, speech recognition, and speech generation.
Conclusion
Preparing quality data is hard for one person or a small group. To get true-to-life results, you would have to invest a lot of money and time in building a good dataset. However, you can instead use a ready-made dataset that you find online. These might not be licensed for commercial use, but for personal research or learning they are generally fine.