GitHub Releases ML-Based “Good First Issues” Recommendations
GitHub has shipped an updated version of its good first issues feature, which combines a machine learning (ML) model that identifies “easy” issues with a hand-curated list of issues that project maintainers have labeled “easy.” Both new and seasoned open source contributors can use the feature to find and tackle easy issues in a project.
To avoid the challenging and tedious task of manually labeling a training set for a supervised ML model, GitHub opted for a weakly supervised model. The process starts by automatically inferring labels for hundreds of thousands of candidate samples from existing issues across GitHub repositories. Multiple criteria are used to identify positive (“easy”) training samples: issues whose labels match a curated list of 300 labels, issues that were closed by a pull request submitted by a new contributor, and issues that were closed by pull requests with tiny diffs in a single file. Further processing removes duplicate issues and splits the samples into training, validation, and test sets across repositories.
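The label-inference criteria and the repository-level split described above can be sketched in Python. This is only an illustration under stated assumptions, not GitHub's code: the `ClosedIssue` fields, the short `EASY_LABELS` set standing in for the curated list, the `MAX_TINY_DIFF_LINES` cutoff, and the 80/10/10 hash-based split are all hypothetical, and the sketch reads "across repositories" as assigning each repository to exactly one split.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-in for the curated list of 300 labels.
EASY_LABELS = {"good first issue", "beginner", "easy", "low-hanging-fruit"}

# Assumed cutoff for a "tiny" diff; the article does not give a number.
MAX_TINY_DIFF_LINES = 10


@dataclass
class ClosedIssue:
    repo: str
    title: str
    labels: List[str] = field(default_factory=list)
    closed_by_new_contributor_pr: bool = False
    pr_files_changed: Optional[int] = None  # None if no PR closed the issue
    pr_lines_changed: Optional[int] = None


def infer_positive(issue: ClosedIssue) -> bool:
    """Return True if any of the three weak-labeling criteria matches."""
    if any(label.lower() in EASY_LABELS for label in issue.labels):
        return True
    if issue.closed_by_new_contributor_pr:
        return True
    return (
        issue.pr_files_changed == 1
        and issue.pr_lines_changed is not None
        and issue.pr_lines_changed <= MAX_TINY_DIFF_LINES
    )


def split_for_repo(repo: str) -> str:
    """Assign a whole repository to one split (assumed 80/10/10 ratio)."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "validation" if bucket == 8 else "test"
```

Hashing the repository name keeps every issue from one repository in the same split, so near-identical issues from a single project cannot appear in both the training and the test data.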
At the moment, GitHub trains the model on preprocessed issue titles and bodies, one-hot encoded as features. The model is implemented in TensorFlow. Training uses regularization, a common practice, together with text data augmentation and early stopping.
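To make the feature representation concrete, here is a minimal sketch of one-hot encoding preprocessed issue text in plain Python. The article does not describe the actual preprocessing or vocabulary, so the lowercase alphanumeric tokenizer, the `<unk>` slot for unseen words, and the function names are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List


def preprocess(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(documents: Iterable[str]) -> Dict[str, int]:
    """Map each distinct token to an index; index 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for doc in documents:
        for token in preprocess(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(text: str, vocab: Dict[str, int]) -> List[List[int]]:
    """Encode each token as a vector with a single 1 at its vocabulary index."""
    rows = []
    for token in preprocess(text):
        row = [0] * len(vocab)
        row[vocab.get(token, 0)] = 1
        rows.append(row)
    return rows


vocab = build_vocab(["Fix typo in README"])
encoded = one_hot("Fix docs", vocab)
# "fix" maps to index 1; "docs" is unseen, so it falls back to index 0.
```

Each issue becomes a sequence of sparse vectors, one per token, which a model can then consume.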
To rank and present good first issues to the user, the model is run to classify all open issues. If the classifier's probability score for an issue is above a designated threshold, the issue, along with its probability score, is added to the bucket of issues slated for recommendation. Next, all open issues with labels that match any label in the curated list are added to the same bucket. These label-based issues are assigned higher scores. Finally, all issues in the bucket are ranked by score, with a penalty based on issue age.
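The ranking steps above can be sketched as follows. The article gives neither the threshold, the score assigned to label-based issues, nor the form of the age penalty, so `THRESHOLD`, `LABEL_SCORE`, `AGE_PENALTY_PER_DAY`, and the `OpenIssue` type are hypothetical values chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import List

THRESHOLD = 0.5             # assumed classifier cutoff
LABEL_SCORE = 2.0           # assumed score, above any probability (<= 1.0)
AGE_PENALTY_PER_DAY = 0.01  # assumed strength of the age penalty


@dataclass
class OpenIssue:
    number: int
    age_days: float
    has_curated_label: bool
    ml_probability: float


def rank_issues(issues: List[OpenIssue]) -> List[int]:
    """Return issue numbers ordered from most to least recommended."""
    bucket = []
    for issue in issues:
        if issue.has_curated_label:
            base = LABEL_SCORE           # label-based issues score higher
        elif issue.ml_probability > THRESHOLD:
            base = issue.ml_probability  # ML-detected issues keep their score
        else:
            continue                     # below threshold: not recommended
        # Older issues are pushed down by dividing by a factor that grows with age.
        score = base / (1.0 + AGE_PENALTY_PER_DAY * issue.age_days)
        bucket.append((score, issue.number))
    bucket.sort(reverse=True)
    return [number for _, number in bucket]


ranked = rank_issues([
    OpenIssue(number=1, age_days=0, has_curated_label=True, ml_probability=0.2),
    OpenIssue(number=2, age_days=0, has_curated_label=False, ml_probability=0.9),
    OpenIssue(number=3, age_days=0, has_curated_label=False, ml_probability=0.3),
    OpenIssue(number=4, age_days=400, has_curated_label=True, ml_probability=0.1),
])
```

With these assumed constants, a fresh labeled issue outranks a fresh ML-detected one, while a labeled issue that has been open for 400 days drops below both.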
Currently, GitHub trains the model and runs inference on open issues offline. The machine learning pipeline is scheduled and managed using Argo Workflows.
Compared with the first release of the good first issues feature in May 2019, the percentage of recommended repositories that have easy issues has increased from 40 percent to 70 percent. Moreover, the burden on project maintainers to triage and label open issues has been reduced.
In the future, GitHub plans to improve the issue recommendations by iterating on the training data and the ML model. In addition, project maintainers will be given an interface to enable or remove ML-based recommendations.