ISBN: 978-981-11-3671-9 DOI: 10.18178/wcse.2017.06.025
Kraken: A Continuous Incremental Data Acquisition System for GitHub and Git Repositories
Abstract— With the rapid development of open source software, a large quantity of software is produced in the
open source community (OSC) [1]. Much research has been launched to study the internal patterns of
OSC [2], [3]. GitHub is one of the most famous open source communities and hosts thousands of software
projects. As a result, GitHub holds massive and abundant data on software development activities. To
offer an accurate and efficient dataset of GitHub, this paper proposes Kraken, a
continuous incremental data acquisition system for GitHub. Kraken contains three main modules that are
independent of each other. Kraken obtains GitHub data in two ways: from git repositories and through the REST API.
The final result shows that Kraken can extract the commit information of git repositories and get pull
requests (PRs) and issues through the REST API. The commit information contains the detailed development history
of the software, while the feedback and wisdom of software engineers are shown through PRs and issues.
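The two acquisition paths described above can be illustrated with a minimal Python sketch. It is not Kraken's implementation, which the abstract does not detail. It assumes that "incremental" means fetching only commits after the last stored commit hash and only issues and PRs updated since the last stored timestamp. The function names (`git_log_command`, `parse_git_log`, `issues_url`) and the stored checkpoint values are hypothetical. The git options (`--pretty=format:`, the `A..HEAD` range) and the GitHub issues endpoint with its `state`, `since`, `per_page`, and `page` parameters are standard.

```python
from urllib.parse import urlencode

# Unit-separator byte between fields; unlikely to occur in commit subjects.
FIELD_SEP = "\x1f"
# %H = commit hash, %an = author name, %aI = author date (strict ISO 8601), %s = subject
GIT_LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


def git_log_command(last_seen_sha=None):
    """Build a `git log` command; with a checkpoint, list only newer commits.

    `last_seen_sha` is a hypothetical checkpoint saved by a previous run.
    """
    cmd = ["git", "log", f"--pretty=format:{GIT_LOG_FORMAT}"]
    if last_seen_sha:
        # Range syntax: commits reachable from HEAD but not from the checkpoint.
        cmd.append(f"{last_seen_sha}..HEAD")
    return cmd


def parse_git_log(output):
    """Parse output produced with GIT_LOG_FORMAT into a list of dicts."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, author, date, subject = line.split(FIELD_SEP, 3)
        commits.append(
            {"sha": sha, "author": author, "date": date, "subject": subject}
        )
    return commits


def issues_url(owner, repo, since=None, page=1):
    """Build a GitHub REST API URL listing issues (PRs are included as issues).

    `since` is an ISO 8601 timestamp; only items updated at or after it
    are returned, which is what makes the request incremental.
    """
    params = {"state": "all", "per_page": 100, "page": page}
    if since:
        params["since"] = since
    return f"https://api.github.com/repos/{owner}/{repo}/issues?{urlencode(params)}"
```

In such a design, a run would execute the command from `git_log_command` inside a local clone (for example with `subprocess.run`), feed its standard output to `parse_git_log`, then request `issues_url` page by page and store the newest hash and timestamp as the next checkpoint.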
Index Terms— GitHub, open source software, data extraction, REST API
Lingbin Zeng, Gang Yin, Tao Wang, Yue Yu, Qiang Fan, Zhi-Xing Li, Jie Yu, H. M. Wang
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, CHINA
Cite: Lingbin Zeng, Gang Yin, Tao Wang, Yue Yu, Qiang Fan, Zhi-Xing Li, Jie Yu, H. M. Wang, "Kraken: A Continuous Incremental Data Acquisition System for GitHub and Git Repositories," Proceedings of 2017 the 7th International Workshop on Computer Science and Engineering, pp. 144-149, Beijing, 25-27 June, 2017.