This is where the science happens. Which model is best? Is it actually beneficial to use EditOperations over raw token sequences when training an NMT model?

class HephaestusModelEvaluation[source]

HephaestusModelEvaluation(model:HephaestusModel, testSourceMethods:List[AbstractMethod], testTargetMethods:List[AbstractMethod], isControl:bool=False)

Helper class to centralize computations when evaluating the effectiveness of a HephaestusModel.

Required Arguments:

  • model: The HephaestusModel to evaluate
  • testSourceMethods: A list of AbstractMethods representing the buggy abstract methods in the testing set. These methods should not appear at all in the training or validation sets that were used to train the model.
  • testTargetMethods: A list of AbstractMethods representing the fixed abstract methods in the testing set. These methods should not appear at all in the training or validation sets that were used to train the model.

Optional Arguments:

  • isControl: Set to True if the model represents the control, i.e. the model was trained purely with AbstractMethods and not with EditOperations. Defaults to False.

Once created, the HephaestusModelEvaluation will contain the following attributes, which can be freely accessed (a sketch of how the summary counts could be derived follows the list):

  • outputMethods: A list of AbstractMethods which were translated from the given buggy (source) AbstractMethods. These represent the model's predictions of the fixed methods. Some entries may be None, representing failed predictions.
  • numPerfectPredictions: The number of outputMethods whose tokens exactly match the actual fixed method according to the provided testTargetMethods.
  • perfectPredictionRatio: The ratio of perfect predictions to total predictions, in the range [0, 1].
  • numFailedPredictions: The number of methods which could not be predicted, due to ill-formed EditOperations or other factors. These are represented by None values in outputMethods.
  • failedPredictionRatio: The ratio of failed predictions to total predictions, in the range [0, 1].
  • avgEditDistDecrease: A value representing how much the model "helped" in reducing the Levenshtein edit distance to the testTargetMethods. For example, a value of 3 means that, on average, the edit distance from the output methods to the target methods was 3 less than the edit distance from the source methods to the target methods. Negative values mean that the model's output methods ended up further from the target methods than the source methods originally were.
  • trainingStats: A Pandas DataFrame representing model information while it was training, as in HephaestusModel.getTrainingStats.
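
For reference, here is a minimal sketch of how the summary counts could be derived from outputMethods. This is not the library's actual implementation, and it assumes that two AbstractMethods with identical tokens compare equal with ==:

def summarizePredictions(outputMethods, testTargetMethods):
    # Failed predictions appear as None in outputMethods.
    total = len(outputMethods)
    numFailed = outputMethods.count(None)
    # A prediction is perfect when the output's tokens exactly match the target's.
    numPerfect = sum(
        1 for output, target in zip(outputMethods, testTargetMethods)
        if output is not None and output == target
    )
    return {
        "numPerfectPredictions": numPerfect,
        "perfectPredictionRatio": numPerfect / total,
        "numFailedPredictions": numFailed,
        "failedPredictionRatio": numFailed / total,
    }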

Helper functions

annotateBarPlot[source]

annotateBarPlot(plot)

Annotates the bars of the given plot with their values.

plotBar[source]

plotBar(data:DataFrame, title:str, xLabel:str, yLabel:str, annotate:bool=True)

Plots an annotated bar graph with the given parameters and returns the data DataFrame.
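
For illustration, here is a minimal sketch of what such a bar-plotting helper might look like using pandas and matplotlib. This is a hypothetical re-implementation, not the library's actual code, and it folds the annotation step of annotateBarPlot into the same function:

import pandas as pd
import matplotlib.pyplot as plt

def plotBarSketch(data: pd.DataFrame, title: str, xLabel: str, yLabel: str, annotate: bool = True) -> pd.DataFrame:
    # Draw the DataFrame as a bar chart with the given title and axis labels.
    ax = data.plot.bar(legend=False)
    ax.set_title(title)
    ax.set_xlabel(xLabel)
    ax.set_ylabel(yLabel)
    if annotate:
        # Label each bar with its value, mirroring what annotateBarPlot does.
        for patch in ax.patches:
            ax.annotate(f"{patch.get_height():.2f}",
                        (patch.get_x() + patch.get_width() / 2, patch.get_height()),
                        ha="center", va="bottom")
    plt.show()
    return data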

plotTrainingAccuracies[source]

plotTrainingAccuracies(evaluations:List[HephaestusModelEvaluation], lineLabels:List[str], title:str)

Plots the training accuracies of the given evaluations as a line graph over training steps, with one line per evaluation. The lines are labeled with the given lineLabels and the graph is titled with the given title.

plotPerfectPredictionAccuracies[source]

plotPerfectPredictionAccuracies(evaluations:List[HephaestusModelEvaluation], xLabels:List[str], title:str)

Plots the perfect prediction accuracies of the given evaluations as a bar graph. The x-axis labels are the given xLabels and the graph title is the given title. Returns the DataFrame used to create the graph.

plotAllPerfectPredictionAccuracies[source]

plotAllPerfectPredictionAccuracies(evaluations:Dict[str, List[HephaestusModelEvaluation]], xLabels:List[str])

Plots the perfect prediction accuracies of all given evaluations as a nested bar graph. The x-axis labels are the given xLabels. Returns the DataFrame used to create the graph.

plotAvgEditDistDecreases[source]

plotAvgEditDistDecreases(evaluations:List[HephaestusModelEvaluation], xLabels:List[str], title:str)

Plots the average edit distance decreases of the given evaluations as a bar graph. The x-axis labels are the given xLabels and the graph title is the given title. Returns the DataFrame used to create the graph.

plotAllAvgEditDistDecreases[source]

plotAllAvgEditDistDecreases(evaluations:Dict[str, List[HephaestusModelEvaluation]], xLabels:List[str])

Plots the average edit distance decreases of all given evaluations as a nested bar graph. The x-axis labels are the given xLabels. Returns the DataFrame used to create the graph.

plotFailedPredictionRates[source]

plotFailedPredictionRates(evaluations:List[HephaestusModelEvaluation], xLabels:List[str], title:str)

Plots the failed prediction rates of the given evaluations as a bar graph. The x-axis labels are the given xLabels and the graph title is the given title. Returns the DataFrame used to create the graph.

Preparation

Collect buggy and fixed methods from the testing dataset. These AbstractMethods do not appear at all in the training or validation data.

testBuggyMethods = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_BUGGY)
testFixedMethods = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_FIXED)

Default Parameters

Get the HephaestusModelEvaluation for each model which was trained with the default parameters.

defaultControlEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_DEFAULT_CONTROL),
    testBuggyMethods,
    testFixedMethods,
    isControl = True
)
[2021-05-16 23:56:12,602 INFO] Translating shard 0.
[2021-05-16 23:56:51,650 INFO] PRED AVG SCORE: -0.0585, PRED PPL: 1.0603
defaultBasicEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_DEFAULT_BASIC),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-16 23:56:59,451 INFO] Translating shard 0.
[2021-05-16 23:58:13,628 INFO] PRED AVG SCORE: -0.0683, PRED PPL: 1.0707
defaultStrictEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_DEFAULT_STRICT),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-16 23:58:22,377 INFO] Translating shard 0.
[2021-05-16 23:58:43,836 INFO] PRED AVG SCORE: -0.4507, PRED PPL: 1.5694
defaultLooseEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_DEFAULT_LOOSE),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-16 23:58:51,889 INFO] Translating shard 0.
[2021-05-16 23:59:10,698 INFO] PRED AVG SCORE: -0.4528, PRED PPL: 1.5728

Training Accuracies

plotTrainingAccuracies(
    evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
    lineLabels =  ["control",          "basic",          "strict",          "loose"],
    title = "Training Accuracies of Models Trained with Default Parameters"
)

Perfect prediction accuracies

plotPerfectPredictionAccuracies(
    evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
    xLabels =     ["control",          "basic",          "strict",          "loose"],
    title = "Perfect Prediction Accuracies of Models Trained with Default Parameters"
)
acc%
control 14.721508
basic 7.986290
strict 7.146530
loose 7.180805

Average Edit Distance Decreases

The edit distance decrease is a value representing how much the HephaestusModel "helped" in reducing the Levenshtein edit distance to the fixed methods. For example, a value of 3 means that, on average, the edit distance from the model's output methods to the actual fixed methods was 3 less than the edit distance from the original buggy methods to the actual fixed methods. Negative values mean that the model's output methods ended up further from the fixed methods than the original buggy methods were. Therefore, a higher value is better.
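
Here is a minimal sketch of how this metric could be computed, assuming a levenshteinDistance helper over token lists and an AbstractMethod.getTokens() accessor (both hypothetical here); the actual computation inside HephaestusModelEvaluation may differ, for example in how failed predictions are handled:

def averageEditDistanceDecrease(sourceMethods, outputMethods, targetMethods, levenshteinDistance):
    # For each successful prediction, measure how much closer the output is to the
    # target than the original buggy source was, then average over those predictions.
    decreases = []
    for source, output, target in zip(sourceMethods, outputMethods, targetMethods):
        if output is None:  # skip failed predictions
            continue
        baseline = levenshteinDistance(source.getTokens(), target.getTokens())
        achieved = levenshteinDistance(output.getTokens(), target.getTokens())
        decreases.append(baseline - achieved)
    return sum(decreases) / len(decreases)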

plotAvgEditDistDecreases(
    evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
    xLabels =     ["control",          "basic",          "strict",          "loose"],
    title = "Average Edit Distance Decreases of Models Trained with Default Parameters"
)
decrease
control -1.528192
basic -2.402546
strict -1.654924
loose -1.843648

Failed prediction rates

How often did the models fail to output an AbstractMethod?

plotFailedPredictionRates(
    evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
    xLabels =     ["control",          "basic",          "strict",          "loose"],
    title = "Failed Prediction Rates of Models Trained with Default Parameters"
)
rate%
control 0.000000
basic 0.377035
strict 0.274207
loose 0.582691

Parameter group 1

group1ControlEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP1_CONTROL),
    testBuggyMethods,
    testFixedMethods,
    isControl = True
)
[2021-05-16 23:59:19,660 INFO] Translating shard 0.
[2021-05-16 23:59:58,010 INFO] PRED AVG SCORE: -0.0522, PRED PPL: 1.0535
group1BasicEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP1_BASIC),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-17 00:00:06,502 INFO] Translating shard 0.
[2021-05-17 00:01:20,760 INFO] PRED AVG SCORE: -0.0668, PRED PPL: 1.0691
group1StrictEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP1_STRICT),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-17 00:01:29,530 INFO] Translating shard 0.
[2021-05-17 00:01:51,968 INFO] PRED AVG SCORE: -0.4827, PRED PPL: 1.6204
group1LooseEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP1_LOOSE),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-17 00:02:00,110 INFO] Translating shard 0.
[2021-05-17 00:02:17,748 INFO] PRED AVG SCORE: -0.4820, PRED PPL: 1.6193

Training Accuracies

plotTrainingAccuracies(
    evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
    lineLabels =  ["control",         "basic",         "strict",         "loose"],
    title = "Training Accuracies of Models Trained with Group 1 Parameters"
)

Perfect prediction accuracies

plotPerfectPredictionAccuracies(
    evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
    xLabels =     ["control",         "basic",         "strict",         "loose"],
    title = "Perfect Prediction Accuracies of Models Trained with Group 1 Parameters"
)
acc%
control 13.898886
basic 7.626392
strict 8.071979
loose 8.260497

Average Edit Distance Decreases

plotAvgEditDistDecreases(
    evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
    xLabels =     ["control",         "basic",         "strict",         "loose"],
    title = "Average Edit Distance Decreases of Models Trained with Group 1 Parameters"
)
decrease
control -0.891174
basic -2.691330
strict -1.456986
loose -1.596552

Failed prediction rates

plotFailedPredictionRates(
    evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
    xLabels =     ["control",         "basic",         "strict",         "loose"],
    title = "Failed Prediction Rates of Models Trained with Group 1 Parameters"
)
rate%
control 0.000000
basic 1.559554
strict 0.394173
loose 0.599829

Parameter group 2

The control model for this parameter group is the same as the control model for the default parameter group.

group2ControlEval = defaultControlEval
group2BasicEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP2_BASIC),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-17 00:02:26,390 INFO] Translating shard 0.
[2021-05-17 00:03:39,084 INFO] PRED AVG SCORE: -0.0638, PRED PPL: 1.0659
group2StrictEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP2_STRICT),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-17 00:03:47,603 INFO] Translating shard 0.
[2021-05-17 00:04:08,439 INFO] PRED AVG SCORE: -0.4644, PRED PPL: 1.5910
group2LooseEval = HephaestusModelEvaluation(
    HephaestusModel(MODEL_GROUP2_LOOSE),
    testBuggyMethods,
    testFixedMethods
)
[2021-05-17 00:04:16,486 INFO] Translating shard 0.
[2021-05-17 00:04:35,999 INFO] PRED AVG SCORE: -0.4754, PRED PPL: 1.6086

Training Accuracies

plotTrainingAccuracies(
    evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
    lineLabels =  ["control",         "basic",         "strict",         "loose"],
    title = "Training Accuracies of Models Trained with Group 1 Parameters"
)

Perfect prediction accuracies

plotPerfectPredictionAccuracies(
    evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
    xLabels =     ["control",         "basic",         "strict",         "loose"],
    title = "Perfect Prediction Accuracies of Models Trained with Group 1 Parameters"
)
acc%
control 14.721508
basic 7.523565
strict 6.820908
loose 7.523565

Average Edit Distance Decreases

plotAvgEditDistDecreases(
    evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
    xLabels =     ["control",         "basic",         "strict",         "loose"],
    title = "Average Edit Distance Decreases of Models Trained with Group 2 Parameters"
)
decrease
control -1.528192
basic -2.541187
strict -1.580217
loose -1.387632

Failed prediction rates

plotFailedPredictionRates(
    evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
    xLabels =     ["control",         "basic",         "strict",         "loose"],
    title = "Failed Prediction Rates of Models Trained with Group 1 Parameters"
)
rate%
control 0.000000
basic 0.342759
strict 0.548415
loose 0.788346

Failure cases

If a HephaestusModel outputs malformed EditOperations, then those operations cannot be applied to the input method and thus an output method cannot be produced. Therefore, for each model trained with EditOperations, there is a small chance that a method prediction will fail due to malformed EditOperations.

This section looks more closely at the distribution and causes of these failures. We will use the model with the highest failed prediction rate as a benchmark: the model trained on basic condensed EditOperations from parameter group 1, which has a failure rate of about 1.6%.

Failure distribution

There are two ways in which a HephaestusModel can fail to predict a method:

  • The output machine string contains a syntax error and cannot be parsed into a valid CompoundOperation.
  • The syntax is correct, but the resulting CompoundOperations reference token indices that do not exist in the input AbstractMethod.

We can determine the failure distribution of the outputted EditOperations by looking at how many failures were caused by syntax errors and how many were caused by index errors.

First, we extract the output machine strings from the raw_output.txt file and convert them to lists of CompoundOperations. Any machine strings that could not be converted by the readCompoundOperationsFromFile function contained syntax errors and appear as None in the returned list.

compoundOpsLists = readCompoundOperationsFromFile(os.path.join(MODEL_GROUP1_BASIC, "raw_output.txt"))
numSyntaxFailures = compoundOpsLists.count(None)
numSyntaxFailures
5

Next, we attempt to apply all the well-formed CompoundOperations to the buggy AbstractMethods and count the number of thrown IndexErrors.

indexFailureMethods = []

for method, compoundOpsList in zip(testBuggyMethods, compoundOpsLists):
    
    methodCopy = deepcopy(method)
    
    if compoundOpsList is not None:
        try:
            methodCopy.applyEditOperations(compoundOpsList)
        except IndexError:
            indexFailureMethods.append(method)

numIndexFailures = len(indexFailureMethods)
numIndexFailures
86
numTotalFailures = numSyntaxFailures + numIndexFailures
numTotalFailures
91

We see that there were 91 total prediction failures: 5 due to syntax errors and 86 due to index errors. We can verify that this matches the number of methods that could not be predicted, as reported by the HephaestusModelEvaluation.

assert numTotalFailures == group1BasicEval.outputMethods.count(None)

Syntax Failures

Now we look more closely at the cause of the syntax failures by examining the malformed machine strings directly.

syntaxFailureLineIndices = [i for i, ops in enumerate(compoundOpsLists) if ops is None]

with open(os.path.join(MODEL_GROUP1_BASIC, "raw_output.txt"), "r") as f:
    lines = f.readlines()
malformedMachineStrings = [lines[i].strip() for i in syntaxFailureLineIndices]
for string in malformedMachineStrings:
    print(string + "\n")
<op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> </op> <op> 7 8 <sep> true

<op> 8 9 <sep> </op> <op> 8 9 <sep> </op> <op> 8 9 <sep> return </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep>

<op> 8 9 <sep> </op> <op> 8 9 <sep> </op> <op> 8 9 <sep> return </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep> </op> <op> 15 16 <sep>

<op> 6 7 <sep> </op> <op> 6 7 <sep> return </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep> </op> <op> 12 13 <sep>

<op> 6 7 <sep> </op> <op> 6 7 <sep> return </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep> </op> <op> 29 30 <sep>

For all of the malformed machine strings, the syntax failures happen because the last represented CompoundOperation is cut off. Moreover, all of these strings appear to have a similar number of tokens.

[len(string.split()) for string in malformedMachineStrings]
[100, 100, 100, 100, 100]

Indeed, all of the malformed machine strings are 100 tokens long. 100 is the default maximum output sequence length for OpenNMT translation, so we can be almost certain that this cap is why the machine strings appear to be cut off at the end. Since syntax failures accounted for only 5 of the 91 total failures, it is acceptable to leave the cap in place. A maximum output sequence length is also generally desirable, since it prevents the model from producing very long sequences that waste processing power.

Index Failures

Next, we can look at the buggy AbstractMethods that caused the model to output EditOperations resulting in IndexErrors. Perhaps there is a difference between the lengths of these methods and the lengths of the buggy AbstractMethods overall.

lengths = [len(method) for method in indexFailureMethods]
avgIndexFailureMethodLength = sum(lengths) / len(lengths)

avgIndexFailureMethodLength
29.558139534883722
lengths = [len(method) for method in testBuggyMethods]
avgBuggyMethodLength = sum(lengths) / len(lengths)

avgBuggyMethodLength
31.759211653813196
avgIndexFailureMethodLength - avgBuggyMethodLength
-2.201072118929474

On average, an AbstractMethod which caused an index failure was 2.20 tokens shorter than the typical input AbstractMethod. This is evidence that shorter AbstractMethods are more likely to cause prediction failures, likely because a shorter AbstractMethod has a smaller range of valid indices, so the model more often generates EditOperations that reference out-of-bounds indices.

It appears that the probability of prediction failure is influenced by the length of the input AbstractMethod. The failure rate could therefore likely be decreased by subdividing the data into length ranges and training a separate model on each range. However, this would dramatically reduce the amount of training data available to each model, so it is probably not worthwhile.
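
If one did want to explore that route, grouping the dataset by length could look something like the following sketch; the bucket size of 10 tokens is an arbitrary illustrative choice:

from collections import defaultdict

def bucketMethodsByLength(methods, bucketSize=10):
    # Group AbstractMethods into length ranges: [0, 10), [10, 20), [20, 30), ...
    buckets = defaultdict(list)
    for method in methods:
        buckets[len(method) // bucketSize].append(method)
    return buckets

# For example, the test methods could be grouped like this (training and validation
# data would be grouped the same way), at the cost of far fewer examples per model.
lengthBuckets = bucketMethodsByLength(testBuggyMethods)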

Overall Statistics

Perfect Prediction Accuracies

plotAllPerfectPredictionAccuracies(
    evaluations = {
        "default params": [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
        "group1 params":  [group1ControlEval,  group1BasicEval,  group1StrictEval,  group1LooseEval],
        "group2 params":  [group2ControlEval,  group2BasicEval,  group2StrictEval,  group2LooseEval]
    },
    xLabels =             ["control",          "basic",          "strict",          "loose"]
)
          default params   group1 params   group2 params
control        14.721508       13.898886       14.721508
basic           7.986290        7.626392        7.523565
strict          7.146530        8.071979        6.820908
loose           7.180805        8.260497        7.523565

Average Edit Distance Decreases

plotAllAvgEditDistDecreases(
    evaluations = {
        "default params": [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
        "group1 params":  [group1ControlEval,  group1BasicEval,  group1StrictEval,  group1LooseEval],
        "group2 params":  [group2ControlEval,  group2BasicEval,  group2StrictEval,  group2LooseEval]
    },
    xLabels =             ["control",          "basic",          "strict",          "loose"]
)
          default params   group1 params   group2 params
control        -1.528192       -0.891174       -1.528192
basic          -2.402546       -2.691330       -2.541187
strict         -1.654924       -1.456986       -1.580217
loose          -1.843648       -1.596552       -1.387632

Analysis

The most important performance metric is perfect prediction accuracy; it describes how well the model can actually function as a "bug fixer". The control model (trained only with AbstractMethods) performs significantly better in this regard than any of the models trained with EditOperations, regardless of training parameters. These results give evidence that training NMT models with EditOperations is not beneficial over the standard approach. There is a negligible difference between the accuracies of models trained with basic, strictly, and loosely condensed EditOperations, which suggests that the level of condensation does not affect perfect prediction accuracy.

Another model performance metric is edit distance decrease (EDD). The greater this value, the closer the model's predictions are to the true fixed methods (i.e. higher is better). Surprisingly, every single model had a negative average EDD, meaning that all the models predicted methods that were further from the true fixed methods than the original buggy methods were. The control models had the highest average EDDs overall, although the strict and loose models were very comparable to them. Furthermore, the strict and loose models performed much better than the basic models in this regard, suggesting that strictly or loosely condensing the EditOperations before training is preferable to basic condensation. However, since the average EDD for every model was negative, EDD may not be a helpful metric by which to measure model performance.

Conclusions

The evidence suggests that...

  • using EditOperations to train NMT models does not offer advantages over the standard approach.
  • if the NMT models are trained with EditOperations, then it is advantageous to train using strictly or loosely condensed EditOperations over the basic condensed EditOperations.