testBuggyMethods = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_BUGGY)
testFixedMethods = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_FIXED)
Default Parameters
Get the HephaestusModelEvaluation for each model which was trained with the default parameters.
defaultControlEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_DEFAULT_CONTROL),
testBuggyMethods,
testFixedMethods,
isControl = True
)
defaultBasicEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_DEFAULT_BASIC),
testBuggyMethods,
testFixedMethods
)
defaultStrictEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_DEFAULT_STRICT),
testBuggyMethods,
testFixedMethods
)
defaultLooseEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_DEFAULT_LOOSE),
testBuggyMethods,
testFixedMethods
)
plotTrainingAccuracies(
evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
lineLabels = ["control", "basic", "strict", "loose"],
title = "Training Accuracies of Models Trained with Default Parameters"
)
plotPerfectPredictionAccuracies(
evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Perfect Prediction Accuracies of Models Trained with Default Parameters"
)
Average Edit Distance Decreases
The edit distance decrease is a value representing how much the HephaestusModel "helped" in reducing the Levenshtein edit distance to the fixed methods. For example, a value of 3 means that, on average, the edit distance from the model's outputted methods to the actual fixed methods was 3 less than the edit distance from the original buggy methods to the actual fixed methods. Negative values mean that the model moved the output methods further away from the fixed methods than the buggy methods originally were. Therefore, a higher value is better.
plotAvgEditDistDecreases(
evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Average Edit Distance Decreases of Models Trained with Default Parameters"
)
Failed prediction rates
How often did the models fail to output an AbstractMethod?
plotFailedPredictionRates(
evaluations = [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Failed Prediction Rates of Models Trained with Default Parameters"
)
group1ControlEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP1_CONTROL),
testBuggyMethods,
testFixedMethods,
isControl = True
)
group1BasicEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP1_BASIC),
testBuggyMethods,
testFixedMethods
)
group1StrictEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP1_STRICT),
testBuggyMethods,
testFixedMethods
)
group1LooseEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP1_LOOSE),
testBuggyMethods,
testFixedMethods
)
plotTrainingAccuracies(
evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
lineLabels = ["control", "basic", "strict", "loose"],
title = "Training Accuracies of Models Trained with Group 1 Parameters"
)
plotPerfectPredictionAccuracies(
evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Perfect Prediction Accuracies of Models Trained with Group 1 Parameters"
)
plotAvgEditDistDecreases(
evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Average Edit Distance Decreases of Models Trained with Group 1 Parameters"
)
plotFailedPredictionRates(
evaluations = [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Failed Prediction Rates of Models Trained with Group 1 Parameters"
)
# Group 2 reuses the default control model's evaluation as its control.
group2ControlEval = defaultControlEval
group2BasicEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP2_BASIC),
testBuggyMethods,
testFixedMethods
)
group2StrictEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP2_STRICT),
testBuggyMethods,
testFixedMethods
)
group2LooseEval = HephaestusModelEvaluation(
HephaestusModel(MODEL_GROUP2_LOOSE),
testBuggyMethods,
testFixedMethods
)
plotTrainingAccuracies(
evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
lineLabels = ["control", "basic", "strict", "loose"],
title = "Training Accuracies of Models Trained with Group 1 Parameters"
)
plotPerfectPredictionAccuracies(
evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Perfect Prediction Accuracies of Models Trained with Group 1 Parameters"
)
plotAvgEditDistDecreases(
evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Average Edit Distance Decreases of Models Trained with Group 2 Parameters"
)
plotFailedPredictionRates(
evaluations = [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval],
xLabels = ["control", "basic", "strict", "loose"],
title = "Failed Prediction Rates of Models Trained with Group 1 Parameters"
)
Failure cases
If a HephaestusModel outputs malformed EditOperations, then those operations cannot be applied to the input method, and thus an output method cannot be produced. Therefore, for each model trained with EditOperations, there is a small chance that a method prediction will fail due to malformed EditOperations.
This section takes a more in-depth look at the distribution and causes of these failures. We will use the model with the highest failed prediction rate as a benchmark: the model trained on basic condensed EditOperations from parameter group 1, which has a failure rate of 1.6%.
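As a quick sanity check, that failure rate can be recomputed directly from the evaluation object; this sketch assumes (as the assertion later in this section does) that failed predictions appear as None in outputMethods.
failedRate = group1BasicEval.outputMethods.count(None) / len(group1BasicEval.outputMethods)
print("failed prediction rate: {:.1%}".format(failedRate))  # expected to be roughly 1.6%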
Failure distribution
There are two ways in which a HephaestusModel can fail to predict a method:
- The outputted machine string contains a syntax error such that it literally cannot be parsed into a valid CompoundOperation.
- The syntax is correct, but the indices of the resulting CompoundOperation operate on tokens whose indices do not exist in the inputted AbstractMethod (a small illustration follows below).
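The second failure mode is conceptually just an out-of-bounds access at the token level; the token list and index here are made up purely for illustration.
tokens = ["public", "int", "METHOD_1", "(", ")", "{", "return", "0", ";", "}"]
try:
    tokens[42] = "VAR_1"  # an edit targeting a token index that does not exist
except IndexError:
    print("index failure")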
We can determine the failure distribution of the outputted EditOperations by looking at how many failures were caused by syntax errors and how many were caused by index errors.
First, we extract the outputted machine strings from the raw_output.txt file and convert them to lists of CompoundOperations. If any machine strings could not be converted by the readCompoundOperationsFromFile function, then they had syntax errors and will appear as None in the returned list.
compoundOpsLists = readCompoundOperationsFromFile(os.path.join(MODEL_GROUP1_BASIC, "raw_output.txt"))
numSyntaxFailures = compoundOpsLists.count(None)
numSyntaxFailures
Next, we attempt to apply all the well-formed CompoundOperations to the buggy AbstractMethods and count the number of thrown IndexErrors.
indexFailureMethods = []
for method, compoundOpsList in zip(testBuggyMethods, compoundOpsLists):
    methodCopy = deepcopy(method)
    if compoundOpsList is not None:
        try:
            methodCopy.applyEditOperations(compoundOpsList)
        except IndexError:
            indexFailureMethods.append(method)
numIndexFailures = len(indexFailureMethods)
numIndexFailures
numTotalFailures = numSyntaxFailures + numIndexFailures
numTotalFailures
We see that there were 91 total prediction failures, 5 of which were due to syntax errors and 86 of which were due to index errors. We can verify that this is indeed the number of methods that could not be predicted, as reported by the HephaestusModelEvaluation.
assert(numTotalFailures == group1BasicEval.outputMethods.count(None))
syntaxFailureLineIndices = []
for i in range(len(compoundOpsLists)):
    if compoundOpsLists[i] is None:
        syntaxFailureLineIndices.append(i)
malformedMachineStrings = []
with open(os.path.join(MODEL_GROUP1_BASIC, "raw_output.txt"), "r") as f:
    lines = f.readlines()
    malformedMachineStrings = [lines[i].strip() for i in syntaxFailureLineIndices]
for string in malformedMachineStrings:
    print(string + "\n")
For all of the malformed machine strings, the syntax failures happen because the last represented CompoundOperation is cut off. Moreover, all of these strings appear to have a similar number of tokens.
[len(string.split()) for string in malformedMachineStrings]
Indeed, all of the malformed machine strings are of length 100. Since 100 is the default maximum output sequence length for OpenNMT translation, we can be almost certain that this is why the machine strings are cut off at the end. Since syntax failures accounted for only 5 of the 91 total failures, it is acceptable to leave this cap in place; a maximum output sequence length is also useful in general, since it prevents outputted sequences from growing very long and hogging processing power.
Index Failures
Next, we can look at the buggy AbstractMethods that caused the model to output EditOperations resulting in IndexErrors. Perhaps the lengths of these methods differ from the lengths of the buggy AbstractMethods overall.
lengths = [len(method) for method in indexFailureMethods]
avgIndexFailureMethodLength = sum(lengths) / len(lengths)
avgIndexFailureMethodLength
lengths = [len(method) for method in testBuggyMethods]
avgBuggyMethodLength = sum(lengths) / len(lengths)
avgBuggyMethodLength
avgIndexFailureMethodLength - avgBuggyMethodLength
On average, the length of an AbstractMethod which caused an index failure was 2.20 less than that of the typical inputted AbstractMethod. Therefore, there is evidence that shorter AbstractMethods are more likely to cause prediction failures. This is likely because a shorter AbstractMethod has a smaller range of valid indices, so the model can more easily generate EditOperations which reference out-of-bounds indices.
It appears that the probability of prediction failure is influenced by the length of the inputted AbstractMethod. Thus, the failure rate could likely be decreased by further subdividing the data into length ranges and training each model on only one range, as sketched below. However, this would dramatically reduce the amount of training data available to each model, so ultimately doing this is probably not a good idea.
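For reference, a rough sketch of what such length-based grouping could look like; the bucket boundaries here are arbitrary illustrative choices, not values used anywhere else in this notebook.
lengthBuckets = {(0, 25): [], (25, 50): [], (50, 1000): []}  # hypothetical length ranges
for buggy, fixed in zip(testBuggyMethods, testFixedMethods):
    for (low, high), bucket in lengthBuckets.items():
        if low <= len(buggy) < high:
            bucket.append((buggy, fixed))
            break
{bounds: len(bucket) for bounds, bucket in lengthBuckets.items()}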
plotAllPerfectPredictionAccuracies(
evaluations = {
"default params": [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
"group1 params": [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
"group2 params": [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval]
},
xLabels = ["control", "basic", "strict", "loose"]
)
plotAllAvgEditDistDecreases(
evaluations = {
"default params": [defaultControlEval, defaultBasicEval, defaultStrictEval, defaultLooseEval],
"group1 params": [group1ControlEval, group1BasicEval, group1StrictEval, group1LooseEval],
"group2 params": [group2ControlEval, group2BasicEval, group2StrictEval, group2LooseEval]
},
xLabels = ["control", "basic", "strict", "loose"]
)
Analysis
The most important performance metric is perfect prediction accuracy; it describes how well the model can actually function as a "bug fixer". The control model (trained only with AbstractMethods) performs significantly better in this regard than any of the models trained with EditOperations, no matter the models' training parameters. These results give evidence that training NMT models with EditOperations is not beneficial over the standard approach. There is a negligible difference between the accuracies of models trained with basic, strictly, and loosely condensed EditOperations, which suggests that condensing the EditOperations does not affect the perfect prediction accuracy.
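For intuition, this is roughly what perfect prediction accuracy measures for a single model; the sketch assumes that AbstractMethod supports equality comparison and that outputMethods is aligned with testFixedMethods, rather than reflecting how HephaestusModelEvaluation computes it internally.
perfectCount = sum(
    1 for output, fixed in zip(group1BasicEval.outputMethods, testFixedMethods)
    if output is not None and output == fixed
)
perfectCount / len(testFixedMethods)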
Another model performance metric is edit distance decrease (EDD). The greater this value, the closer the models' predictions come to the true fixed methods (i.e. higher is better). Surprisingly, every single model had a negative average EDD, meaning that all the models predicted methods that were further from the true fixed methods than the original buggy methods were. The control models had the highest average EDDs overall, although the average EDDs of the strict and loose models are very comparable to those of the control models. Furthermore, the strict and loose models performed much better than the basic models in this regard, suggesting that it is helpful to condense the EditOperations before training. However, since the average EDD for all models was negative, EDD may not be a helpful metric by which to measure model performance.
Conclusions
The results provide evidence that...
- using EditOperations to train NMT models does not offer advantages over the standard approach.
- if the NMT models are trained with EditOperations, then it is advantageous to train using strictly or loosely condensed EditOperations over the basic condensed EditOperations.