class Dynamic_Dataset[source]

Dynamic_Dataset(ground_truth, path, isZip)

This class lazily 'stores' a dataset: only the list of filenames and a mapping from each filename to its ground truth value are held in memory. File contents are read from disk only when requested.

This class supports indexing, slicing, and iteration.

A user can treat an instance of this class exactly as they would a list. Indexing an instance returns a tuple of the ground truth value and the file contents for the filename at that index.

A user can request the filename at an index with get_id(index).

Example:

dataset = Dynamic_Dataset(ground_truth, path, isZip=False)

print(dataset.get_id(0))
    -> gitlab_79.txt

print(dataset[0])
    -> ('(1,0)', 'The currently used Rails version, in the stable ...

for x in dataset[2:4]:
    print(x)
        -> ('(1,0)', "'In my attempt to add 2 factor authentication ...
        -> ('(1,0)', 'We just had an admin accidentally push to a ...
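
For intuition, here is a minimal sketch of the lazy-loading pattern described above. The class name, attribute names, and plain-directory file layout are illustrative assumptions, not the library's actual implementation (the isZip case is ignored):

import os

class LazyDataset:
    """Sketch: keep only filenames and labels; read file contents on access."""
    def __init__(self, ground_truth, path):
        self._ids = list(ground_truth.keys())   # filenames only
        self._labels = ground_truth             # filename -> ground truth value
        self._path = path

    def get_id(self, index):
        return self._ids[index]

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        filename = self._ids[index]
        with open(os.path.join(self._path, filename)) as f:
            return (self._labels[filename], f.read())   # contents read only now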

Dynamic_Dataset.get_id[source]

Dynamic_Dataset.get_id(index)

Return the name of the file at the specified index.

Dynamic_Dataset.__len__[source]

Dynamic_Dataset.__len__()

Return the number of files contained in the dataset.
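
Because __len__ is implemented, Python's built-in len works on a dataset instance (continuing the example above):

num_files = len(dataset)   # number of issue files in the dataset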

Processing_Dataset

A class to wrap up processing functions

class Processing_Dataset[source]

Processing_Dataset(path)

This class wraps up processing and matches each issue in our data corpus with its ground truth value. It also creates our train/test split.

Processing_Dataset.get_issue[source]

Processing_Dataset.get_issue(filename)

Return the contents of the specified file.
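
A short usage sketch; the filename is taken from the Dynamic_Dataset example above and is assumed to exist under path:

processing = Processing_Dataset(path)
issue_text = processing.get_issue('gitlab_79.txt')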

Processing_Dataset.get_ground_truth[source]

Processing_Dataset.get_ground_truth()

Returns a dictionary mapping each filename in the path this class was initialized with to its ground truth value.
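
Based on the outputs shown in the Dynamic_Dataset example, the returned mapping pairs filenames with label strings (filename and label below are taken from those outputs):

ground_truth = processing.get_ground_truth()
print(ground_truth['gitlab_79.txt'])
    -> (1,0)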

Processing_Dataset.get_test_and_training[source]

Processing_Dataset.get_test_and_training(ground_truth, test_ratio=0.1, isZip=False)

Given the input ground truth dictionary, generate a test/train split with the given test ratio (default 0.1). If isZip is True, the data are read as a zip archive; otherwise they are read as ordinary files. Returns a tuple of the form (test, train), where test and train are instances of Dynamic_Dataset.
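
A typical end-to-end usage, following the signatures documented above (path is assumed to point at the issue corpus):

processing = Processing_Dataset(path)
ground_truth = processing.get_ground_truth()
test, train = processing.get_test_and_training(ground_truth, test_ratio=0.1)
print(len(test), len(train))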

class Embeddings[source]

Embeddings()

The Embeddings class is responsible for cleaning, normalizing, and vectorizing a given corpus.

Embeddings.preprocess[source]

Embeddings.preprocess(sentence, vocab_set=None)

Preprocess a given sentence string by cleaning and normalizing each token.
We tokenize, filter stopwords, apply stemming, and filter out remove-terms. Returns a list of tokens.
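
A minimal sketch of a comparable pipeline using NLTK; the exact tokenizer, stopword list, stemmer, and remove-term list used by this class are assumptions:

import re
from nltk.corpus import stopwords        # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
REMOVE_TERMS = {'http', 'https'}         # hypothetical remove-term list
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = re.findall(r'[a-z]+', sentence.lower())    # tokenize and clean
    tokens = [t for t in tokens if t not in STOPWORDS]  # filter stopwords
    tokens = [stemmer.stem(t) for t in tokens]          # normalize via stemming
    return [t for t in tokens if t not in REMOVE_TERMS] # drop remove-terms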

Embeddings.get_embeddings_dict[source]

Embeddings.get_embeddings_dict(embeddings_filename)

Return a dictionary representation of the embeddings from the given embeddings CSV file located at the input path.
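
A sketch of loading such a file, assuming each CSV row holds a token followed by its vector components with no header row (the actual column layout is an assumption):

import csv
import numpy as np

def load_embeddings(embeddings_filename):
    embeddings = {}
    with open(embeddings_filename, newline='') as f:
        for row in csv.reader(f):
            # row = [token, component_1, component_2, ...]
            embeddings[row[0]] = np.array(row[1:], dtype=float)
    return embeddings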

Embeddings.vectorize[source]

Embeddings.vectorize(sentence, embeddings_dict)

Takes an input sentence as a string, preprocesses it, then vectorizes it against the input embeddings dictionary. Returns a NumPy matrix representing the vectorized sentence.
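
A sketch of this final step, reusing the preprocess sketch above; skipping out-of-vocabulary tokens is an assumption about the class's behavior:

import numpy as np

def vectorize(sentence, embeddings_dict):
    tokens = preprocess(sentence)   # preprocess sketch from above
    vectors = [embeddings_dict[t] for t in tokens if t in embeddings_dict]
    # one row per kept token -> matrix of shape (num_tokens, embedding_dim)
    return np.vstack(vectors) if vectors else np.empty((0, 0))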