How GitHub’s CodeSearchNet Challenge Can Improve Semantic Code Search

Source: analyticsindiamag.com
Recently, researchers from GitHub announced the CodeSearchNet Challenge evaluation environment and leaderboard. CodeSearchNet is a collection of datasets and benchmarks for exploring the problem of code retrieval using natural language. GitHub partnered with Weights & Biases, a California-based startup that builds ML development tools, to improve code search using modern machine learning techniques.

About CodeSearchNet Corpus Dataset

The CodeSearchNet Corpus is a large dataset of functions with associated documentation, written in Go, Java, JavaScript, PHP, Python, and Ruby and drawn from open source projects on GitHub. It is collected from publicly available, non-fork open-source GitHub repositories, using libraries.io to identify all projects that are used by at least one other project and sorting them by “popularity” as indicated by the number of stars and forks. The CodeSearchNet Challenge consists of 99 natural language queries with about 4k expert relevance annotations of likely results from the CodeSearchNet Corpus.

The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). For 2 million of these functions it also contains automatically generated, query-like natural language, obtained by mechanically scraping and preprocessing the associated function documentation. In other words, the dataset is built programmatically by scraping open-source repositories and pairing individual functions with their (processed) documentation as a natural language annotation. At 2 million data points, it is large enough to enable the training of high-capacity deep neural models on the task.
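
The pairing step described above can be sketched for Python source using the standard library's ast module. This is only an illustration of the idea, not the researchers' actual pipeline: the function name extract_pairs, the record keys, the choice to keep only the first docstring paragraph, and the three-token minimum are all assumptions made for this example.

```python
import ast

def extract_pairs(source: str, min_doc_tokens: int = 3):
    """Pair each documented function in `source` with a query-like summary.

    The first-paragraph rule and the token threshold are illustrative
    assumptions, not settings confirmed by the article.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)  # None when the function is undocumented
        if not doc:
            continue
        summary = doc.split("\n\n")[0].strip()  # keep the first paragraph only
        if len(summary.split()) < min_doc_tokens:
            continue  # too short to resemble a search query
        pairs.append({
            "func_name": node.name,
            "line": node.lineno,  # metadata: where the function was found
            "code": ast.get_source_segment(source, node),
            "query": summary,
        })
    return pairs

SAMPLE = '''
def add(a, b):
    """Return the sum of two numbers.

    Longer notes that a query would not contain.
    """
    return a + b

def helper(x):
    return x * 2
'''

pairs = extract_pairs(SAMPLE)
for pair in pairs:
    print(pair["func_name"], "<-", pair["query"])
```

Only add yields a pair here, since helper has no documentation. This mirrors why roughly a third of the corpus (2 million of 6 million functions) carries a natural language annotation.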

The fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3 and includes:

Six million methods overall
Two million of which have associated documentation (docstrings, JavaDoc, and more)
Metadata that indicates the original location (repository or line number, for example) where the data was found.

Furthermore, the researchers released their data preprocessing pipeline so that other researchers and developers can use it as a starting point for applying machine learning to code. The pipeline currently supports six programming languages: Python, Java, Go, PHP, Ruby, and JavaScript. For Python, it also parses function calls and links them with their definitions.
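
To make the idea of linking function calls with their definitions concrete, here is a minimal sketch using Python's ast module. It is not the released pipeline: the name link_calls is hypothetical, and the sketch assumes the simplest case, resolving only direct calls by name to functions defined in the same file.

```python
import ast

def link_calls(source: str):
    """Map each function to the same-file functions it calls.

    Returns {caller: {callee: line where the callee is defined}}.
    Only plain-name calls such as `foo(x)` are resolved; method calls,
    imports, and built-ins are ignored in this simplified sketch.
    """
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    defined = {fn.name: fn.lineno for fn in functions}
    links = {}
    for fn in functions:
        callees = {}
        for sub in ast.walk(fn):
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Name)
                    and sub.func.id in defined):
                callees[sub.func.id] = defined[sub.func.id]
        links[fn.name] = callees
    return links

SAMPLE = '''
def normalize(text):
    return text.strip().lower()

def tokenize(text):
    return normalize(text).split()

def count(text):
    return len(tokenize(text))
'''

links = link_calls(SAMPLE)
for caller, callees in links.items():
    print(caller, "->", callees)
```

A real implementation would also need to handle imports, methods, and name shadowing, which this sketch deliberately leaves out.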

Similar Model & Dataset

Researchers from Facebook AI Research have done similar work. They developed code search tools that apply natural language processing (NLP) and information retrieval (IR) techniques directly to source code text. The tools, called Neural Code Search (NCS) and UNIF, accept natural language queries and return relevant code fragments retrieved directly from the code corpus.

The researchers also open-sourced the Neural Code Search Evaluation Dataset, which is composed of 287 Stack Overflow question and answer pairs. Each pair contains the Stack Overflow post ID, the title and URL of the post, and a code snippet that answers the question.

NCS is an unsupervised model for neural code search developed at Facebook that uses only word embeddings derived from a code corpus. UNIF is a supervised extension of the base NCS technique: it uses a supervised neural network model to improve performance when good supervision data is available for training.
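
The following sketch illustrates the general idea of embedding-based retrieval behind an unsupervised approach like NCS. It is a simplified assumption for illustration, not Facebook's implementation: the three-dimensional vectors are invented toy values rather than embeddings trained on a code corpus, and plain averaging with cosine similarity is one possible way to combine and compare them.

```python
import math

# Toy word vectors (hypothetical values, NOT trained embeddings).
EMBEDDINGS = {
    "read": [0.9, 0.1, 0.0],
    "file": [0.8, 0.2, 0.1],
    "open": [0.7, 0.3, 0.0],
    "sort": [0.0, 0.9, 0.1],
    "list": [0.1, 0.8, 0.2],
    "http": [0.0, 0.1, 0.9],
    "request": [0.1, 0.0, 0.9],
}

# Hypothetical corpus: function name -> tokens extracted from its code.
CORPUS = {
    "load_text": ["open", "file", "read"],
    "sort_items": ["sort", "list"],
    "fetch_url": ["http", "request"],
}

def embed(tokens):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def search(query, top_k=2):
    """Rank corpus functions by similarity between query and code embeddings."""
    q = embed(query.lower().split())
    scored = [(cosine(q, embed(tokens)), name) for name, tokens in CORPUS.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

print(search("read a file"))
print(search("http request"))
```

No labeled query and code pairs are needed here, which is what makes the approach unsupervised. A supervised extension in the spirit of UNIF would instead learn from such pairs how to weight and combine the token vectors.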

Outlook

Recent years have seen several substantial successes in working with natural language datasets. However, researchers still run into difficulty when applying deep learning models to highly structured data. According to the researchers, searching for an image or a document with natural language is easier than searching for code. To help address this problem, researchers from GitHub introduced the CodeSearchNet Challenge along with the large corpus. In a blog post, they further stated that they will expand the evaluation dataset to include more languages, queries, and annotations in the future.