Encapsulates NMT operations on AbstractMethods.

class HephaestusModel[source]

HephaestusModel(modelDir:str)

The HephaestusModel is the means through which buggy AbstractMethods are translated into fixed ones. Each HephaestusModel occupies a directory which contains stored models, vocabularies, and configuration files.

Required args:

  • modelDir: The directory which stores files pertaining to the model. You can use a directory which already contains the necessary files (previously generated by a different HephaestusModel), in which case the model will not have to be trained again. If you provide a directory that does not exist, the HephaestusModel will attempt to create it.

HephaestusModel.train[source]

HephaestusModel.train(trainSource:str, trainTarget:str, validSource:str, validTarget:str, numCheckpoints:int=10, numGPUs:int=1, embeddingSize:int=512, rnnType:str='LSTM', rnnSize:int=256, numLayers:int=2, numTrainingSteps:int=50000, numValidations:int=10, dropout:float=0.2)

Trains the model with the given parameters. Files containing AbstractMethods should contain one method per line, with tokens separated by spaces. 'source' files must contain AbstractMethods; 'target' files may contain either AbstractMethods or CompoundOperations in machine string format.
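
For illustration, here are two hypothetical lines of a source file, each holding one tokenized AbstractMethod (made-up data in the same style as the example usage below):

private TYPE_1 getType ( TYPE_2 VAR_1 ) { return VAR_1 . METHOD_1 ( ) ; }
public void METHOD_1 ( ) { VAR_1 = null ; }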

As training progresses, checkpoint model files are created which follow the format model_step_#.pt, where # corresponds to the training step number. Once training is complete, the finalized model is written to model_final.pt. In addition, training command output is written to train_output.txt.

Default parameter values are chosen to resemble the most successful NMT model in this paper as closely as possible.

Parameters:

  • Data and vocabulary:
    • trainSource: Required. File name containing training source data. Must be buggy AbstractMethods.
    • trainTarget: Required. File name containing training target data. Can be either non-buggy AbstractMethods or CompoundOperations in machine string format.
    • validSource: Required. File name containing validation source data. Must be buggy AbstractMethods.
    • validTarget: Required. File name containing validation target data. Must be the same type of data which is contained in the file denoted by trainTarget.
  • General options:
    • numCheckpoints: Number of times a checkpoint model is saved; e.g. if numTrainingSteps is 50,000 and numCheckpoints is 10, then a checkpoint will be saved after every 5,000 training steps. Defaults to 10.
    • numGPUs: Number of GPUs to use concurrently during training. If set to 0, then the CPU is used. Defaults to 1.
  • Model options:
    • embeddingSize: Word embedding size for source and target. Defaults to 512.
  • Encoder/decoder options:
    • rnnType: Gate type to use in RNN encoder and decoder. Can be "LSTM" or "GRU". Defaults to "LSTM".
    • rnnSize: Size of encoder and decoder RNN hidden states. Defaults to 256.
    • numLayers: Number of layers each in the encoder and decoder. Defaults to 2.
  • Learning options:
    • numTrainingSteps: Number of training steps to perform. Defaults to 50,000.
    • numValidations: Number of validations to perform during training; e.g. if numTrainingSteps is 50,000 and numValidations is 10, then validation will occur after every 5,000 training steps. Defaults to 10.
    • dropout: Dropout probability. Defaults to 0.2.

HephaestusModel.getTrainingStats[source]

HephaestusModel.getTrainingStats()

Returns a pandas DataFrame describing training statistics; the DataFrame has the following columns:

  • step: The training step in increments of 50
  • trainAccuracy: Model accuracy with respect to the training set
  • validAccuracy: Validation accuracy. These values will likely not be present for every row.
  • crossEntropy: Cross-entropy value

HephaestusModel.translate[source]

HephaestusModel.translate(buggy:Union[str, AbstractMethod, List[AbstractMethod]], modelFile:str=None, applyEditOperations:bool=True)

Translates the given buggy AbstractMethods into supposedly fixed AbstractMethods, writes them to <model_directory>/postprocessed_output.txt, and then returns them. The raw output of the model is written to <model_directory>/raw_output.txt in case you want to access that as well. Depending on what type of value is passed to buggy, the return value of this method changes according to the following:

buggy type            Return type
--------------------  ------------------------------
str (a file)          List[Optional[AbstractMethod]]
AbstractMethod        Optional[AbstractMethod]
List[AbstractMethod]  List[Optional[AbstractMethod]]

A None return value means that the model was unable to translate that abstract method correctly. This could be due to the model outputting ill-formed CompoundOperations, among other things. These will appear as blank lines in postprocessed_output.txt.
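
For example, translating an entire file of buggy methods and filtering out failed translations might look like the following minimal sketch (the file path is hypothetical, and model is a trained HephaestusModel):

outputMethods = model.translate("test_data/buggy.txt")  # List[Optional[AbstractMethod]]
fixedMethods = [method for method in outputMethods if method is not None]
print(f"{len(fixedMethods)}/{len(outputMethods)} methods translated successfully")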

Optional args:

  • modelFile: A .pt file which is used for translation instead of the default model_final.pt
  • applyEditOperations: When set to True, the model output is interpreted as CompoundOperations, and a postprocessing stage applies the outputted CompoundOperations to the inputted AbstractMethods. When set to False, the raw output is interpreted as AbstractMethods and returned without a postprocessing stage; in this case, the contents of raw_output.txt and postprocessed_output.txt are identical. If the model was trained with EditOperations, applyEditOperations should be True; if the model was trained on plain AbstractMethods, as with the control group, it should be False (see the sketch below). Defaults to True.
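
As a sketch of both options, using the model and buggyMethod names from the example usage below (controlModel is hypothetical: a model trained on plain AbstractMethods rather than CompoundOperations):

# Translate with the checkpoint from step 400 instead of model_final.pt
outputMethod = model.translate(buggyMethod, modelFile="test_model_loose/model_step_400.pt")

# controlModel was trained directly on fixed AbstractMethods, so its raw
# output is returned as-is, without applying CompoundOperations
outputMethod = controlModel.translate(buggyMethod, applyEditOperations=False)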

Example Usage

Let's create a small test model.

model = HephaestusModel("test_model_loose")

Now there is a directory called test_model_loose which will be populated with files once the model is trained. We will train the model with the loosely condensed edit operations dataset in general form. Variables such as DATA_SMALL_METHODS_TRAIN_BUGGY describe the path to data files, and are defined in the DatasetConstruction module. Since this is just an example, a very small number of training steps will be used.

model.train(
    DATA_SMALL_METHODS_TRAIN_BUGGY,
    DATA_SMALL_OPS_GENERAL_LOOSE_TRAIN,
    DATA_SMALL_METHODS_VALID_BUGGY,
    DATA_SMALL_OPS_GENERAL_LOOSE_VALID,
    numCheckpoints = 5,
    numTrainingSteps = 500,
    numValidations = 5
)
[2021-05-15 01:29:17,840 INFO] Counter vocab from -1 samples.
[2021-05-15 01:29:17,840 INFO] n_sample=-1: Build vocab on full datasets.
[2021-05-15 01:29:17,845 INFO] corpus_1's transforms: TransformPipe()
[2021-05-15 01:29:17,846 INFO] Loading ParallelCorpus(../data/small/abstract_methods/train_buggy.txt, ../data/small/edit_ops/general/loose/train.txt, align=None)...
[2021-05-15 01:29:18,347 INFO] Counters src:429
[2021-05-15 01:29:18,347 INFO] Counters tgt:444
[2021-05-15 01:29:18,347 WARNING] path test_model_loose/save_data.vocab.src exists, may overwrite...
[2021-05-15 01:29:18,349 WARNING] path test_model_loose/save_data.vocab.tgt exists, may overwrite...
[2021-05-15 01:29:19,179 INFO] Parsed 2 corpora from -data.
[2021-05-15 01:29:19,179 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-05-15 01:29:19,179 INFO] Loading vocab from text file...
[2021-05-15 01:29:19,179 INFO] Loading src vocabulary from test_model_loose/save_data.vocab.src
[2021-05-15 01:29:19,181 INFO] Loaded src vocab has 429 tokens.
[2021-05-15 01:29:19,182 INFO] Loading tgt vocabulary from test_model_loose/save_data.vocab.tgt
[2021-05-15 01:29:19,184 INFO] Loaded tgt vocab has 444 tokens.
[2021-05-15 01:29:19,184 INFO] Building fields with vocab in counters...
[2021-05-15 01:29:19,185 INFO]  * tgt vocab size: 448.
[2021-05-15 01:29:19,185 INFO]  * src vocab size: 431.
[2021-05-15 01:29:19,185 INFO]  * src vocab size = 431
[2021-05-15 01:29:19,185 INFO]  * tgt vocab size = 448
[2021-05-15 01:29:19,187 INFO] Building model...
[2021-05-15 01:29:31,740 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(431, 512, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(512, 256, num_layers=2, dropout=0.2)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(448, 512, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.2, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.2, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(768, 256)
        (1): LSTMCell(256, 256)
      )
    )
    (attn): GlobalAttention(
      (linear_context): Linear(in_features=256, out_features=256, bias=False)
      (linear_query): Linear(in_features=256, out_features=256, bias=True)
      (v): Linear(in_features=256, out_features=1, bias=False)
      (linear_out): Linear(in_features=512, out_features=256, bias=True)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=256, out_features=448, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2021-05-15 01:29:31,741 INFO] encoder: 1535488
[2021-05-15 01:29:31,741 INFO] decoder: 2184384
[2021-05-15 01:29:31,741 INFO] * number of parameters: 3719872
[2021-05-15 01:29:31,743 INFO] Starting training on GPU: [0]
[2021-05-15 01:29:31,743 INFO] Start training loop and validate every 100 steps...
[2021-05-15 01:29:31,744 INFO] corpus_1's transforms: TransformPipe()
[2021-05-15 01:29:31,745 INFO] Loading ParallelCorpus(../data/small/abstract_methods/train_buggy.txt, ../data/small/edit_ops/general/loose/train.txt, align=None)...
[2021-05-15 01:29:41,975 INFO] Step 50/  500; acc:  18.38; ppl: 170.46; xent: 5.14; lr: 0.00010; 10041/3951 tok/s;     10 sec
[2021-05-15 01:29:51,557 INFO] Step 100/  500; acc:  25.32; ppl: 35.93; xent: 3.58; lr: 0.00010; 10430/4158 tok/s;     20 sec
[2021-05-15 01:29:51,557 INFO] valid's transforms: TransformPipe()
[2021-05-15 01:29:51,559 INFO] Loading ParallelCorpus(../data/small/abstract_methods/valid_buggy.txt, ../data/small/edit_ops/general/loose/valid.txt, align=None)...
[2021-05-15 01:29:59,493 INFO] Validation perplexity: 26.3849
[2021-05-15 01:29:59,493 INFO] Validation accuracy: 28.6959
[2021-05-15 01:29:59,495 INFO] Saving checkpoint test_model_loose/model_step_100.pt
[2021-05-15 01:30:10,007 INFO] Step 150/  500; acc:  30.41; ppl: 25.32; xent: 3.23; lr: 0.00010; 5468/2203 tok/s;     38 sec
[2021-05-15 01:30:19,192 INFO] Step 200/  500; acc:  41.50; ppl: 16.08; xent: 2.78; lr: 0.00010; 11150/4410 tok/s;     47 sec
[2021-05-15 01:30:19,194 INFO] Loading ParallelCorpus(../data/small/abstract_methods/valid_buggy.txt, ../data/small/edit_ops/general/loose/valid.txt, align=None)...
[2021-05-15 01:30:27,140 INFO] Validation perplexity: 11.4145
[2021-05-15 01:30:27,140 INFO] Validation accuracy: 44.7773
[2021-05-15 01:30:27,142 INFO] Saving checkpoint test_model_loose/model_step_200.pt
[2021-05-15 01:30:37,243 INFO] Step 250/  500; acc:  45.01; ppl: 11.18; xent: 2.41; lr: 0.00010; 5593/2228 tok/s;     65 sec
[2021-05-15 01:30:46,838 INFO] Step 300/  500; acc:  45.41; ppl:  9.96; xent: 2.30; lr: 0.00010; 10616/4130 tok/s;     75 sec
[2021-05-15 01:30:46,839 INFO] Loading ParallelCorpus(../data/small/abstract_methods/valid_buggy.txt, ../data/small/edit_ops/general/loose/valid.txt, align=None)...
[2021-05-15 01:30:54,776 INFO] Validation perplexity: 8.9979
[2021-05-15 01:30:54,777 INFO] Validation accuracy: 46.2528
[2021-05-15 01:30:54,779 INFO] Saving checkpoint test_model_loose/model_step_300.pt
[2021-05-15 01:31:05,099 INFO] Step 350/  500; acc:  46.06; ppl:  9.41; xent: 2.24; lr: 0.00010; 5678/2190 tok/s;     93 sec
[2021-05-15 01:31:15,092 INFO] Step 400/  500; acc:  46.47; ppl:  9.05; xent: 2.20; lr: 0.00010; 10291/4053 tok/s;    103 sec
[2021-05-15 01:31:15,094 INFO] Loading ParallelCorpus(../data/small/abstract_methods/valid_buggy.txt, ../data/small/edit_ops/general/loose/valid.txt, align=None)...
[2021-05-15 01:31:23,041 INFO] Validation perplexity: 8.38134
[2021-05-15 01:31:23,041 INFO] Validation accuracy: 47.2777
[2021-05-15 01:31:23,043 INFO] Saving checkpoint test_model_loose/model_step_400.pt
[2021-05-15 01:31:33,206 INFO] Step 450/  500; acc:  46.45; ppl:  8.88; xent: 2.18; lr: 0.00010; 5548/2213 tok/s;    121 sec
[2021-05-15 01:31:43,081 INFO] Step 500/  500; acc:  47.22; ppl:  8.83; xent: 2.18; lr: 0.00010; 10405/4080 tok/s;    131 sec
[2021-05-15 01:31:43,082 INFO] Loading ParallelCorpus(../data/small/abstract_methods/valid_buggy.txt, ../data/small/edit_ops/general/loose/valid.txt, align=None)...
[2021-05-15 01:31:51,020 INFO] Validation perplexity: 8.06989
[2021-05-15 01:31:51,020 INFO] Validation accuracy: 48.0582
[2021-05-15 01:31:51,022 INFO] Saving checkpoint test_model_loose/model_step_500.pt

Suppose we want to view information about the training process of the model without having to scroll through all the output above; we can use the HephaestusModel.getTrainingStats method, which returns a pandas DataFrame containing such information:

model.getTrainingStats()
step trainAccuracy validAccuracy crossEntropy
0 50 18.38 NaN 5.14
1 100 25.32 28.6959 3.58
2 150 30.41 NaN 3.23
3 200 41.50 44.7773 2.78
4 250 45.01 NaN 2.41
5 300 45.41 46.2528 2.30
6 350 46.06 NaN 2.24
7 400 46.47 47.2777 2.20
8 450 46.45 NaN 2.18
9 500 47.22 48.0582 2.18
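
The returned DataFrame also lends itself to quick visualization. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

stats = model.getTrainingStats()
plt.plot(stats["step"], stats["trainAccuracy"], label="train")
# validAccuracy is only present on validation rows, so drop the NaNs first
validStats = stats.dropna(subset=["validAccuracy"])
plt.plot(validStats["step"], validStats["validAccuracy"], label="valid")
plt.xlabel("training step")
plt.ylabel("accuracy (%)")
plt.legend()
plt.show()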

Now that the model is trained, we can test it out. This gets the first buggy AbstractMethod from the testing data.

buggyMethod = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_BUGGY)[0]
buggyMethod
private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 ) ; }

Then translate the method into a supposedly fixed version using HephaestusModel.translate.

outputMethod = model.translate(buggyMethod)
[2021-05-15 01:34:26,819 INFO] Translating shard 0.
[2021-05-15 01:34:26,831 INFO] PRED AVG SCORE: -1.1982, PRED PPL: 3.3140

There is a possibility that the model was unable to translate the buggy method correctly, e.g. if the model outputted ill-formed EditOperations that could not be parsed and applied to the buggy method. Therefore, we should check that the outputted method is not None.

assert outputMethod is not None

View the contents of the outputted AbstractMethod:

outputMethod
private TYPE_1 getType ( TYPE_2 VAR_1 ) { new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 ) ; }

We can determine what exactly changed from the buggy method to the outputted method by getting the EditOperations between the two, then condensing them for easier readability.

observedOperations = getCondensedLoose(buggyMethod.getEditOperationsTo(outputMethod))
observedOperations
[COMPOUND_DELETE 8:11]

So it seems that the changes were deletions on tokens in the index range 8:11. We can verify that these were the actual edit operations applied by the model by looking at raw_output.txt directly.

appliedOperations = readCompoundOperationsFromFile("test_model_loose/raw_output.txt")[0]
appliedOperations
[COMPOUND_DELETE 8:11]
appliedOperations == observedOperations
True

Nice! But what was the correct answer, and how far off were we?

actualFixedMethod = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_FIXED)[0]
actualFixedMethod
private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 , this , VAR_1 ) ; }
modelDistance = outputMethod.getEditDistanceTo(actualFixedMethod)
modelDistance
7
actualDistance = buggyMethod.getEditDistanceTo(actualFixedMethod)
actualDistance
4

Since modelDistance is higher than actualDistance, our outputted method is actually further away from the actual fixed method than the original buggy method is! Oof. But keep in mind that this only demonstrates example usage, and that the model was trained for a laughably small number of steps.
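
To get a sense of performance beyond a single example, we could translate the entire test set and count perfect predictions. A rough sketch using the same helper functions as above, where an output counts as perfect when its edit distance to the known fix is zero:

buggyMethods = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_BUGGY)
actualFixedMethods = readAbstractMethodsFromFile(DATA_SMALL_METHODS_TEST_FIXED)
outputMethods = model.translate(DATA_SMALL_METHODS_TEST_BUGGY)

# Count outputs that exactly match the known fixed method
numPerfect = sum(
    1 for output, fixed in zip(outputMethods, actualFixedMethods)
    if output is not None and output.getEditDistanceTo(fixed) == 0
)
print(f"Perfect predictions: {numPerfect}/{len(actualFixedMethods)}")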