Google is known for its vast collection of online information. When you want the answer to a question, or to learn more about virtually anything, you Google it. To build that collection, the company crawled the web, amassing an enormous index of online content.
Crawling the web means using software to visit pages across the internet, copying and indexing their content as it goes. Common Crawl is now doing the same thing with its own web crawler. Unlike Google, however, Common Crawl is a nonprofit organization, and its goal is to make the data it collects freely available to everyone. The organization states on its website that the web is the “largest and most diverse collection of information in human history” and that its vision is to provide “universal open access to information.”
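At its core, the crawling process described above is a simple loop: fetch a page, save its content, extract its links, and queue those links for fetching. The sketch below shows that loop in Python, using only the standard library. It is an illustration, not Common Crawl's actual crawler: the `fetch` callable stands in for real HTTP requests, and a production crawler would also respect robots.txt, deduplicate URLs properly, and handle errors.

```python
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: visit pages, store their content, follow links.

    `fetch` is any callable mapping a URL to its HTML (or None if the
    page is unreachable). Returns a dict of url -> page content --
    in effect, a tiny copy of the web ready for indexing.
    """
    seen, queue, index = set(), deque([start_url]), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        index[url] = html
        parser = LinkParser()
        parser.feed(html)
        queue.extend(parser.links)
    return index
```

With a dictionary standing in for the web, `crawl("a", web.get)` starts at page "a" and discovers every page it links to.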
So far, Common Crawl has indexed more than 5 billion pages. Anyone who wants the data can access it through Amazon’s cloud computing services. That kind of access matters for researchers and entrepreneurs alike, and it was rarely available before, since few people have the resources to run a crawl of that scale. Now researchers, educators, and entrepreneurs can test their ideas against the data directly rather than turning to Google, and this new accessibility has already inspired several startups.
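Concretely, Common Crawl publishes its raw crawl data in Amazon's cloud storage as WARC (Web ARChive) files: each record is a block of "Name: value" headers, a blank line, then the captured content. The sketch below parses one such record; the sample record is an illustrative fabrication in the WARC style, not actual Common Crawl data.

```python
def parse_warc_record(record):
    """Split one WARC-style record into version line, headers, and payload.

    WARC records separate their header block from the captured content
    with a blank line (CRLF CRLF).
    """
    head, _, payload = record.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return version, headers, payload

# Illustrative record in the WARC style (not real crawl data):
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 20\r\n"
    "\r\n"
    "<html>example</html>"
)
version, headers, payload = parse_warc_record(sample)
```

Here `headers["WARC-Target-URI"]` tells you which page was captured and `payload` holds the page itself, which is what makes the archives directly usable for downstream analysis.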
Common Crawl also sponsors contests to generate new ideas. Since a search engine is one of the things that can be built on top of an index of the web, it is even possible that one of the startups inspired by Common Crawl could become the next Google.
[Image via searchengineland]