Automatic intent classification

As manual classification of commit intents is time-intensive, an automated approach is needed.

We evaluate an automatic approach in a multi-class setting for our labels (other, perfective, corrective). With a suitable automated approach, we can classify the rest of our data, which lets us include more data when answering what changes when developers increase quality.

We used the manually labeled data we created in this study to evaluate an automatic approach against a simple baseline.

For the baseline, we use a Random Forest classifier with a simple pipeline consisting of a CountVectorizer and a TfidfTransformer, essentially following the scikit-learn text-classification examples. In our approach, we evaluate whether we can fine-tune a state-of-the-art NLP deep neural network architecture (BERT) that was pre-trained on software engineering data (seBERT).
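The baseline fits in a few lines of scikit-learn code. The following is a minimal sketch, assuming the labeled commit messages are available as a list of strings (messages) with their intent labels in a parallel list (labels):

# Minimal sketch of the baseline, assuming `messages` (list of commit
# messages) and `labels` ("other", "perfective", "corrective").
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

baseline = Pipeline([
    ("count", CountVectorizer()),       # token counts from the commit message
    ("tfidf", TfidfTransformer()),      # re-weight counts by tf-idf
    ("clf", RandomForestClassifier()),  # multi-class Random Forest
])

baseline.fit(messages, labels)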

Evaluation

We use 10-fold cross-validation to evaluate the models. To mitigate random effects when selecting the evaluation data for the seBERT model, we repeat this loop 10 times. Each model therefore produces 100 sets of predictions, which we evaluate.
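One way to express this 10x10 scheme for the baseline is scikit-learn's RepeatedStratifiedKFold; this sketch reuses the `messages`, `labels`, and `baseline` names from above, and the seBERT fine-tuning loop would be evaluated on the same kind of splits:

# Sketch of the repeated 10-fold cross-validation (10 repeats = 100 runs),
# scored here with MCC as one of the reported metrics.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, matthews_corrcoef

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(baseline, messages, labels, cv=cv,
                         scoring=make_scorer(matthews_corrcoef))
print(f"MCC over {len(scores)} runs: median={np.median(scores):.2f}")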

[Box plots: MCC, accuracy, macro-averaged F-score, macro-averaged precision, and macro-averaged recall over the 100 runs]

As the plots show, seBERT performs well and outperforms the Random Forest baseline.

Moreover, with a median MCC of around 0.70 and a median F1 of 0.80, the performance is good enough to let the model predict the commit intent on the rest of the data.
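As an illustration of this last step, the following hypothetical sketch labels the remaining commits with a fine-tuned sequence-classification model via the Hugging Face transformers library; the checkpoint path ./sebert-intent, the label names, and the example output are placeholders, not the actual artifacts from the study:

# Hypothetical sketch: classify unlabeled commit messages with the
# fine-tuned model; `./sebert-intent` is a placeholder checkpoint path.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("./sebert-intent")
model = AutoModelForSequenceClassification.from_pretrained("./sebert-intent")
classify = pipeline("text-classification", model=model, tokenizer=tokenizer)

print(classify("fix off-by-one error in pagination"))
# e.g. [{'label': 'corrective', 'score': 0.97}]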