Preprint: https://arxiv.org/abs/2309.03044
Postprint: https://ieeexplore.ieee.org/document/10301266
This artifact contains all data (including the data gathering step), code, and scripts required to run the paper's experiment to reproduce the results. The structure of folders and files is as follows:
The experiments folder contains all scripts and code (specific to this paper) required to re-run the training and testing of our models (including classic models, CodeBERT, ConcatInline, and ConcatCLS). The structure of this folder is:
+-- data (contains paper full dataset and preprocessing step script)
| +-- preprocessing.sh (splitting dataset and scaling values)
+-- dataset (contains a small subset of the dataset after preprocessing for the getting started section)
+-- models
| +-- code_metrics (contains code for training and testing our classic models)
| | +-- train_test.sh (training and testing the models)
| +-- code_representation
| | +-- codebert
| | | +-- CodeBertModel.py (code for CodeBERT model)
| | | +-- ConcatInline.py (code for ConcatInline model)
| | | +-- ConcatCLS.py (code for ConcatCLS model)
| | | +-- train.sh (script for training the models)
| | | +-- inference.sh (script for testing the models)
| +-- evaluation
| | +-- evaluation.py (evaluation metrics)
+-- utils (constant file)
The data folder contains bugs from the Defects4J and Bugs.jar datasets. This folder contains a preprocessing script that unifies bug severity values, scales the source code metrics, and creates train, val, and test splits.
Running this script with the bash preprocessing.sh command generates 6 files containing the train, val, and test splits in jsonl (compatible with CodeBERT experiments) and csv (compatible with source code metrics experiments) formats.
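The following is a minimal sketch of the kind of processing described above (splitting, scaling, and writing both formats). It is not the repository's preprocessing script. The metric column names, the split fractions, the seed, the output file names, and the choice of min-max scaling fitted on the train split are all assumptions made for illustration.

```python
import csv
import json
import random
from pathlib import Path

# Hypothetical metric column names; the real dataset may use different ones.
METRIC_COLUMNS = ["loc", "complexity"]


def split_rows(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a copy of the rows and cut it into train/val/test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[n_test + n_val:],
        "val": shuffled[n_test:n_test + n_val],
        "test": shuffled[:n_test],
    }


def fit_min_max(rows, columns):
    """Return {column: (min, max)} computed on the given rows only."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}


def apply_min_max(rows, bounds):
    """Return copies of the rows with each bounded column rescaled.

    Values inside the fitted range map to [0, 1]; values outside it
    (possible in val/test) fall outside that interval.
    """
    scaled = []
    for row in rows:
        new_row = dict(row)
        for col, (low, high) in bounds.items():
            span = high - low
            new_row[col] = (row[col] - low) / span if span else 0.0
        scaled.append(new_row)
    return scaled


def write_splits(splits, out_dir):
    """Write each split as both NAME.jsonl and NAME.csv (3 splits, 6 files)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        fieldnames = list(rows[0]) if rows else []
        with open(out_dir / f"{name}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```

Fitting the scaling bounds on the train split and reusing them for val and test keeps information from the held-out splits out of training; whether the actual script does this is not stated here.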
The files in the dataset folder are the data for the getting started section (a small subset of the data). To reproduce the paper's results, copy the generated files from the data folder into the dataset folder, which is the one used by the model training scripts.
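The copy step can be done by hand; the sketch below shows one way to script it. The function name is hypothetical, and it assumes that every jsonl and csv file directly inside the data folder is a generated split. If the data folder also holds raw files with the same extensions, filter by file name instead.

```python
import shutil
from pathlib import Path


def copy_generated_splits(data_dir, dataset_dir, suffixes=(".jsonl", ".csv")):
    """Copy every file with one of the given suffixes from data_dir to dataset_dir.

    Returns the sorted list of copied file names.
    """
    data_dir, dataset_dir = Path(data_dir), Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(data_dir.iterdir()):
        if path.is_file() and path.suffix in suffixes:
            shutil.copy2(path, dataset_dir / path.name)  # overwrites same-named files
            copied.append(path.name)
    return copied


if __name__ == "__main__":
    # Paths are relative to the repository root.
    if Path("experiments/data").is_dir():
        print(copy_generated_splits("experiments/data", "experiments/dataset"))
```

Note that this overwrites the small getting-started files in the dataset folder when names collide, so keep a copy if you still need them.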
The models folder contains all code and scripts for all of the experiments, including classic models, CodeBERT models, ConcatInline, and ConcatCLS.
The data_gathering folder contains all code required to gather the data, including issue scraping, method extraction, and metric extraction. While this step is outside the scope of this paper, the steps required to reproduce the data are included in these instructions. There are many directories and files in this folder; the following tree shows only the 3 files that need to be run.
+-- issue_scraper
| +-- main.py
+-- MetricsExtractor
| +-- method_extractor
| | +-- MethodExtractorMain.java
| +-- metric_extractor
| | +-- MetricCalculatorMain.java
For Getting Started:
- Operating System: The provided artifact is tested on Linux (Ubuntu 20.04.6 LTS) and macOS (Ventura 13.5).
- GPU: A GPU is recommended for running the experiments; on CPU they may take a long time.
- CPU/RAM: There is no strict minimum on these.
- Python: Python 3 is required.
This section only sets up the artifact and validates its general functionality using a small example dataset (the complete dataset for the classic models, but only the first 50 rows for the CodeBERT models).
- Clone the repository
  git clone git@github.com:EhsanMashhadi/ISSRE2023-BugSeverityPrediction.git
- Install dependencies (using the requirements.txt file) or manually:
  pip install pandas==1.4.2
  pip install jira
  pip install beautifulsoup4
  pip install lxml
  pip install transformers==4.18.0
  pip install torch==1.11.0
  This should be enough for running on CPU, but install the next one for running on GPU:
  pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
  pip install scikit-learn==1.1.1
  pip install xgboost==1.6.1
  pip install seaborn==0.11.2
- Adding the project root folder to the PYTHONPATH
  export PYTHONPATH=$PYTHONPATH:*/rootpath/you/clone/the/project*/experiments
  e.g., export PYTHONPATH=$PYTHONPATH:/Users/ehsan/workspace/ISSRE2023-BugSeverityPrediction/experiments
- RQ1:
  - cd ISSRE2023-BugSeverityPrediction/experiments/models/code_metrics
  - bash train_test.sh
  - Results are generated in the log folder
- RQ2:
  - cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
  - Set CodeBERT as the model_arch parameter's value in the train.sh and inference.sh files.
  - bash train.sh for training the model
  - bash inference.sh for evaluating the model with the test split
  - Results are generated in the log folder
- RQ3:
  - cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
  - Set ConcatInline or ConcatCLS as the model_arch parameter's value in the train.sh and inference.sh files.
  - bash train.sh for training the model
  - bash inference.sh for evaluating the model with the test split
  - Results are generated in the log folder
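If the scripts above fail with import errors, it may be worth confirming that the PYTHONPATH step took effect in the current shell. The helper below is a small sketch for that check; the function name is hypothetical and it only compares resolved paths, nothing more.

```python
import os
from pathlib import Path


def on_pythonpath(directory, pythonpath=None):
    """Return True if `directory` is one of the entries of PYTHONPATH.

    `pythonpath` defaults to the PYTHONPATH environment variable of the
    current process; entries are compared after resolving to absolute paths.
    """
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    target = Path(directory).expanduser().resolve()
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return any(Path(p).expanduser().resolve() == target for p in entries)


if __name__ == "__main__":
    # Run from the directory that contains the cloned repository.
    experiments = "ISSRE2023-BugSeverityPrediction/experiments"
    print("experiments folder on PYTHONPATH:", on_pythonpath(experiments))
```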
- Clone the repository
  git clone git@github.com:EhsanMashhadi/ISSRE2023-BugSeverityPrediction.git
- Install dependencies (you may need to change the torch version for running on your GPU/CPU)
  - Experiments:
    - It is better to install these dependencies in a virtual env (you can also use requirements.txt)
      pip install pandas==1.4.2
      pip install jira
      pip install beautifulsoup4
      pip install lxml
      pip install transformers==4.18.0
      pip install torch==1.11.0
      This should be enough for running on CPU, but install the next one for running on GPU:
      pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
      pip install scikit-learn==1.1.1
      pip install xgboost==1.6.1
      pip install seaborn==0.11.2
- Adding the project root folder to the PYTHONPATH
  export PYTHONPATH=$PYTHONPATH:*/rootpath/you/clone/the/project*/experiments
  e.g., export PYTHONPATH=$PYTHONPATH:/Users/ehsan/workspace/ISSRE2023-BugSeverityPrediction/experiments
- Running data preprocessing
  - cd ISSRE2023-BugSeverityPrediction/experiments/data
  - bash preprocessing.sh
  - Copy the generated jsonl and csv files into the dataset folder
- RQ1:
  - cd ISSRE2023-BugSeverityPrediction/experiments/models/code_metrics
  - bash train_test.sh
  - Results are generated in the log folder
- RQ2:
  - cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
  - Set CodeBERT as the model_arch parameter's value in the train.sh file
  - bash train.sh for training the model
  - bash inference.sh for evaluating the model with the test split
  - Results are generated in the log folder
- RQ3:
  - cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
  - Set ConcatInline or ConcatCLS as the model_arch parameter's value in the train.sh file
  - bash train.sh for training the model
  - bash inference.sh for evaluating the model with the test split
  - Results are generated in the log folder
- You can change/add different hyperparameters/configs in the train.sh and inference.sh files.
- Check the CUDA and PyTorch compatibility.
- Assign the correct values for CUDA_VISIBLE_DEVICES, gpu_rank, and world_size based on the number of GPUs you have, in all scripts.
- Run on CPU by removing the gpu_rank and world_size options in all scripts.
- Refer to the CodeBERT repo for common issues.
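As a quick way to act on the compatibility note above, the sketch below prints the installed PyTorch version, the CUDA version it was built against, and how many GPUs it can see. The function name is hypothetical and the script is not part of the artifact; it only uses torch.__version__, torch.version.cuda, torch.cuda.is_available(), and torch.cuda.device_count(), and it still runs if torch is not installed.

```python
import importlib.util
import sys


def describe_torch_environment():
    """Collect Python/PyTorch/CUDA details; safe to call when torch is missing."""
    info = {
        "python": sys.version.split()[0],
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    if importlib.util.find_spec("torch") is None:
        return info  # torch is not installed in this environment
    import torch

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a machine with a GPU, the installed torch build and the local CUDA driver are a likely place to look; the reported gpu_count can help when choosing CUDA_VISIBLE_DEVICES.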
The tools below should be installed and configured correctly, otherwise, this step won't work. It may take a long time to do this step and can be skipped (recommended).
- Java: Java 18 is required (only for running data gathering step).
- Git: (brew, apt, ... based on your OS)
- SVN: (brew, apt, ... based on your OS)
- Defects4J (Follow all the steps in the provided installation guide).
- Bugs.jar (You must install this in the data_gathering directory).
- cd ISSRE2023-BugSeverityPrediction/data_gathering/issue_scraper
- python main.py
For the steps below, it can be easier to use gradlew or simply open the project in IntelliJ IDEA to run the Java files.
- cd ISSRE2023-BugSeverityPrediction/data_gathering/MetricsExtractor/src/main/java/software/ehsan/severityprediction/method_extractor
- run MethodExtractorMain.java
- cd ISSRE2023-BugSeverityPrediction/data_gathering/MetricsExtractor/src/main/java/software/ehsan/severityprediction/metric_extractor
- run MetricCalculatorMain.java