class Dynamic_Dataset[source]

Dynamic_Dataset(ground_truth, path, isZip)

This class lazily 'stores' a dataset: only the list of filenames and a mapping from each filename to its ground truth value are held in memory. File contents are read from disk only when requested.

This class supports indexing, slicing, and iteration.

A user can treat an instance of this class exactly as they would a list. Indexing an instance returns a tuple of the ground truth value and the file contents for the filename at that index.

A user can request the filename at an index with get_id(index).

Example:

dataset = Dynamic_Dataset(ground_truth, path, isZip=False)

print(dataset.get_id(0))
    -> gitlab_79.txt

print(dataset[0])
    -> ('(1,0)', 'The currently used Rails version, in the stable ...

for x in dataset[2:4]:
    print(x)
        -> ('(1,0)', "'In my attempt to add 2 factor authentication ...
        -> ('(1,0)', 'We just had an admin accidentally push to a ...
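
For intuition, here is a minimal sketch of the lazy-loading pattern described above. The class name, attribute names, and plain-directory file layout are illustrative assumptions, not the library's actual implementation (the isZip case is ignored):

import os

class LazyDataset:
    """Sketch: keep only filenames and labels; read file contents on access."""
    def __init__(self, ground_truth, path):
        self._ids = list(ground_truth.keys())   # filenames only
        self._labels = ground_truth             # filename -> ground truth value
        self._path = path

    def get_id(self, index):
        return self._ids[index]

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        filename = self._ids[index]
        with open(os.path.join(self._path, filename)) as f:
            return (self._labels[filename], f.read())   # contents read only now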

Dynamic_Dataset.get_id[source]

Dynamic_Dataset.get_id(index)

Return the name of the file at the specified index.

Dynamic_Dataset.__len__[source]

Dynamic_Dataset.__len__()

Return the number of files contained in the dataset.
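
Because __len__ is implemented, Python's built-in len works on a dataset instance (continuing the example above):

num_files = len(dataset)   # number of issue files in the dataset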

Processing_Dataset

A class to wrap up processing functions

class Processing_Dataset[source]

Processing_Dataset(path)

This class wraps up processing and matches each issue in our data corpus with its ground truth value. It also creates our train/test split.

Processing_Dataset.get_issue[source]

Processing_Dataset.get_issue(filename)

Return the contents of the specified file.
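
A short usage sketch; the filename is taken from the Dynamic_Dataset example above and is assumed to exist under path:

processing = Processing_Dataset(path)
issue_text = processing.get_issue('gitlab_79.txt')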

Processing_Dataset.get_ground_truth[source]

Processing_Dataset.get_ground_truth()

Returns a dictionary mapping each filename in the path this class was initialized with to its ground truth value.
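
Based on the outputs shown in the Dynamic_Dataset example, the returned mapping pairs filenames with label strings (filename and label below are taken from those outputs):

ground_truth = processing.get_ground_truth()
print(ground_truth['gitlab_79.txt'])
    -> (1,0)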

Processing_Dataset.get_test_and_training[source]

Processing_Dataset.get_test_and_training(ground_truth, test_ratio=0.1, isZip=False)

Given the input ground truth dictionary, generate a test/train split with the given test ratio (default 0.1). If isZip is True, the data are read as a zip archive; otherwise they are read as ordinary files. Returns a tuple of the form (test, train), where test and train are instances of Dynamic_Dataset.
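
A typical end-to-end usage, following the signatures documented above (path is assumed to point at the issue corpus):

processing = Processing_Dataset(path)
ground_truth = processing.get_ground_truth()
test, train = processing.get_test_and_training(ground_truth, test_ratio=0.1)
print(len(test), len(train))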

class Embeddings[source]

Embeddings()

The Embeddings class is responsible for cleaning, normalizing, and vectorizing a given corpus.

Embeddings.preprocess[source]

Embeddings.preprocess(sentence, vocab_set=None)

Preprocess a given sentence string by cleaning and normalizing each token.
We tokenize, filter stopwords, apply stemming, and filter out remove-terms. Returns a list of tokens.
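
A minimal sketch of a comparable pipeline using NLTK; the exact tokenizer, stopword list, stemmer, and remove-term list used by this class are assumptions:

import re
from nltk.corpus import stopwords        # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
REMOVE_TERMS = {'http', 'https'}         # hypothetical remove-term list
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = re.findall(r'[a-z]+', sentence.lower())    # tokenize and clean
    tokens = [t for t in tokens if t not in STOPWORDS]  # filter stopwords
    tokens = [stemmer.stem(t) for t in tokens]          # normalize via stemming
    return [t for t in tokens if t not in REMOVE_TERMS] # drop remove-terms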

Embeddings.get_embeddings_dict[source]

Embeddings.get_embeddings_dict(embeddings_filename)

Return a dictionary representation of the embeddings from the given embeddings CSV file located at the input path.
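
A sketch of loading such a file, assuming each CSV row holds a token followed by its vector components with no header row (the actual column layout is an assumption):

import csv
import numpy as np

def load_embeddings(embeddings_filename):
    embeddings = {}
    with open(embeddings_filename, newline='') as f:
        for row in csv.reader(f):
            # row = [token, component_1, component_2, ...]
            embeddings[row[0]] = np.array(row[1:], dtype=float)
    return embeddings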

Embeddings.vectorize[source]

Embeddings.vectorize(sentence, embeddings_dict)

Takes an input sentence as a string, preprocesses it, then vectorizes it against the input embeddings dictionary. Returns a NumPy matrix representing the vectorized sentence.
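
A sketch of this final step, reusing the preprocess sketch above; skipping out-of-vocabulary tokens is an assumption about the class's behavior:

import numpy as np

def vectorize(sentence, embeddings_dict):
    tokens = preprocess(sentence)   # preprocess sketch from above
    vectors = [embeddings_dict[t] for t in tokens if t in embeddings_dict]
    # one row per kept token -> matrix of shape (num_tokens, embedding_dim)
    return np.vstack(vectors) if vectors else np.empty((0, 0))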