How GitHub’s CodeSearchNet Challenge Can Improve Semantic Code Search
Recently, researchers from GitHub announced the CodeSearchNet Challenge evaluation environment and leaderboard. CodeSearchNet is a collection of datasets and benchmarks that explores the problem of code retrieval using natural language. GitHub partnered with Weights & Biases, a California-based startup that builds ML development tools, to improve code search using modern machine learning techniques.
About CodeSearchNet Corpus Dataset
The fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3. It includes:
- Six million methods overall
- Two million of which have associated documentation (docstrings, JavaDoc, and more)
- Metadata that indicates the original location (repository or line number, for example) where the data was found.
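To make the shape of such a corpus concrete, here is a minimal Python sketch of how one might represent a method record and select the documented subset. The `Method` class, its field names, and the sample records are hypothetical and chosen for illustration only; they are not the actual schema of the CodeSearchNet files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Method:
    """Hypothetical record for one method; not the real corpus schema."""
    code: str
    docstring: Optional[str]  # None when the method has no documentation
    repository: str           # where the method was found
    line_number: int          # location inside the source file


def with_documentation(methods: List[Method]) -> List[Method]:
    """Keep only methods that carry a non-empty docstring."""
    return [m for m in methods if m.docstring and m.docstring.strip()]


# Made-up sample data, for illustration only.
SAMPLE = [
    Method("def add(a, b): return a + b", "Add two numbers.", "example/repo", 10),
    Method("def noop(): pass", None, "example/repo", 42),
    Method("def blank(): pass", "   ", "example/other", 7),
]

documented = with_documentation(SAMPLE)
```

Only the documented methods can be paired with natural language text, which is why that subset matters for training and evaluating search models.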
Similar Model & Dataset
Researchers from Facebook AI Research have pursued a similar line of work. They developed code search tools that apply natural language processing (NLP) and information retrieval (IR) techniques directly to source code text. The tools, called Neural Code Search (NCS) and UNIF, accept natural language queries and return relevant code fragments retrieved directly from the code corpus.
The researchers also open-sourced the Neural Code Search Evaluation Dataset, which is composed of 287 Stack Overflow question-and-answer pairs. Each pair contains the Stack Overflow post ID, the title and URL of the post, and a code snippet that answers the question.
NCS is an unsupervised model for neural code search developed at Facebook that uses only word embeddings derived from a code corpus. UNIF is a supervised extension of the base NCS technique: it uses a supervised neural network model to improve performance when good supervision data is available for training.
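The general idea of embedding-based retrieval can be sketched in a few lines of Python. This is a toy illustration and not Facebook's implementation: the three-dimensional vectors in `EMBEDDINGS` are invented by hand, the snippet names are hypothetical, and the sketch uses a plain average of word vectors with cosine similarity, whereas a real system would learn its embeddings from a large code corpus.

```python
import math

# Hypothetical hand-written word embeddings, for illustration only.
EMBEDDINGS = {
    "read": [0.9, 0.1, 0.0],
    "open": [0.8, 0.2, 0.0],
    "file": [0.7, 0.0, 0.3],
    "sort": [0.0, 0.9, 0.1],
    "list": [0.1, 0.8, 0.2],
    "http": [0.0, 0.1, 0.9],
    "request": [0.1, 0.0, 0.9],
}
DIMENSIONS = 3


def embed(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, snippets, top_k=1):
    """Rank snippets by similarity between the query and their tokens."""
    query_vector = embed(query.lower().split())
    scored = [
        (cosine(query_vector, embed(tokens)), name)
        for name, tokens in snippets.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical code snippets, each reduced to a bag of tokens.
SNIPPETS = {
    "read_file": ["open", "file", "read"],
    "sort_list": ["sort", "list"],
    "fetch_url": ["http", "request"],
}

best_match = search("read a file", SNIPPETS)
```

Because queries and code are mapped into the same vector space, no labeled query-code pairs are needed, which is what makes this style of approach unsupervised. A supervised extension would instead learn the mapping from such labeled pairs.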
Recent years have seen several substantial successes in working with natural language datasets. However, researchers sometimes struggle to apply deep learning models to highly structured data. According to the researchers, searching for an image or a document using natural language is easier than searching for code. To mitigate these problems, researchers from GitHub introduced the CodeSearchNet Challenge along with the large corpus. In a blog post, they added that they will expand the evaluation dataset to include more languages, queries, and annotations in the future.