GitHub today announced that it’s releasing activity data for 2.8 million open source code repositories and making it available for people to analyze with the Google BigQuery cloud-based data warehousing tool.
The data set is free to explore: BigQuery's free tier lets you process up to one terabyte of queries each month at no charge.
This new 3TB data set includes information on “more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions,” Arfon Smith, program manager for open source data at GitHub, wrote in a blog post.
To get people started, Smith has put together some starter queries. Felipe Hoffa, a Google developer advocate who focuses on BigQuery, has shared tips for working with the data sets in a Medium post.
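For readers who want a feel for what such a query looks like, here is a minimal sketch, not one of GitHub's or Google's official starter queries, that uses the google-cloud-bigquery Python client against the public `bigquery-public-data.github_repos` dataset. It assumes the dataset's `sample_files` table with `repo_name` and `path` columns, and that Google Cloud credentials are already configured in your environment.

```python
# Hypothetical example: count Python source files per repository in the
# public GitHub dataset, using a regular expression on file paths.
# Table and column names are assumptions based on the public
# bigquery-public-data.github_repos dataset.
from google.cloud import bigquery

client = bigquery.Client()  # queries count against your project's free 1TB/month quota

query = r"""
    SELECT repo_name, COUNT(*) AS python_files
    FROM `bigquery-public-data.github_repos.sample_files`
    WHERE REGEXP_CONTAINS(path, r'\.py$')
    GROUP BY repo_name
    ORDER BY python_files DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.repo_name, row.python_files)
```

The `sample_*` tables are smaller subsets of the full data, which keeps exploratory queries like this one comfortably inside the free monthly quota; the same query can be pointed at the full tables once you know what you want to measure.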
The data set could be useful to anyone who wants to get a sense of trends in open source software use on GitHub, and it's simpler than tinkering with the GitHub application programming interface (API). To be sure, GitHub, with more than 15 million users, isn't the only home for open source software on the Internet (GitLab is another), but it is a very popular one, perhaps the most popular.
Today’s move effectively amounts to an expansion of the GitHub Archive, which was first introduced by Google web performance engineer Ilya Grigorik in 2012.
GitHub will update the data set every week, a spokesperson told VentureBeat in an email.