SCARGC

Here are some functions in the SCARGC module.

SCARGC — Module

Main module for SCARGC.jl, the Julia implementation of the SCARGC algorithm.

From this module, two functions are exported for public use:

scargc_1NN. The 1-Nearest Neighbor SCARGC implementation.
extractValuesFromFile. A function to extract the data values from a dataset file.

Predicting functions

SCARGC.scargc_1NN — Function

scargc_1NN(
    dataset         -> "dataset used in the algorithm"
    percentTraining -> "amount of data that is going to be used as training data"
    maxPoolSize     -> "maximum number of instances that the pool can store"
    K               -> "number of clusters"
)

SCARGC implementation with the Nearest Neighbor classifier for label prediction. The function prints the final accuracy and returns the vector of all predicted labels together with the final accuracy.

The function starts by getting the labeled and unlabeled data and uses them to create the initial centroids. Then a loop runs over the unlabeled data, storing each instance and its predicted label (predicted with the classification model) in the pool.
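The prediction loop described above can be sketched roughly as follows. This is a minimal illustration, not the package's code: `predict1nn`, `sqdist`, `nTraining`, and the toy `data` matrix are hypothetical names introduced here, and the sketch assumes the label sits in the last column. How `percentTraining` maps to a row count is not stated here, so the sketch takes the number of training rows directly.

```julia
# Hypothetical helper: squared Euclidean distance between two feature vectors.
sqdist(a, b) = sum(abs2, a .- b)

# Hypothetical 1-NN prediction: return the label of the closest labeled row.
function predict1nn(features::AbstractMatrix, labels::AbstractVector, x::AbstractVector)
    dists = [sqdist(view(features, i, :), x) for i in 1:size(features, 1)]
    return labels[argmin(dists)]
end

# Toy data (assumption): two features per row, label in the last column.
data = [0.0 0.0 1;
        0.1 0.2 1;
        5.0 5.0 2;
        5.1 4.9 2;
        0.2 0.1 1;
        4.8 5.2 2]

nTraining = 4                          # hypothetical number of labeled rows
labeled   = data[1:nTraining, :]
unlabeled = data[nTraining+1:end, :]

trainX = labeled[:, 1:end-1]
trainY = Int.(labeled[:, end])

# The pool stores each instance together with its predicted label.
pool = Vector{Tuple{Vector{Float64},Int}}()
for i in 1:size(unlabeled, 1)
    x = unlabeled[i, 1:end-1]
    push!(pool, (x, predict1nn(trainX, trainY, x)))
end
```

In the real function the clustering step would be triggered once `length(pool)` reaches `maxPoolSize`; the sketch stops before that step.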

When the pool reaches its maximum size, given by maxPoolSize, a clustering step is run on the data stored in the pool to obtain the centroids of the current iteration (tempCurrentCentroids before they receive labels, currentCentroids afterwards). Using the centroids from the past iteration (centroids), we find the labels of the current iteration's centroids and create the intermediary centroids, intermed. These store the median between the past and current centroids together with the labels of the current iteration's centroids, which captures the drift between the past and the current iteration.
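As a rough sketch of this labeling step, assume each past centroid is a row of features followed by its label, and that a current centroid takes the label of its nearest past centroid. The function name `labelCentroids` and the toy values are hypothetical, and the nearest-centroid rule is an assumption made for illustration. With only two points, the median of each coordinate is their midpoint.

```julia
using Statistics

# Hypothetical helper: label the new centroids and build the intermediary ones.
# `pastCentroids` has features plus a label in the last column;
# `tempCurrentCentroids` has features only.
function labelCentroids(pastCentroids::AbstractMatrix, tempCurrentCentroids::AbstractMatrix)
    k, nfeat = size(tempCurrentCentroids)
    currentCentroids = zeros(k, nfeat + 1)
    intermed = zeros(k, nfeat + 1)
    for i in 1:k
        c = tempCurrentCentroids[i, :]
        # Assumption: the nearest past centroid provides the label.
        dists = [sum(abs2, pastCentroids[j, 1:nfeat] .- c) for j in 1:size(pastCentroids, 1)]
        nearest = argmin(dists)
        label = pastCentroids[nearest, end]
        currentCentroids[i, :] = vcat(c, label)
        # Median between the past and the current centroid, per feature.
        mid = [median([pastCentroids[nearest, f], c[f]]) for f in 1:nfeat]
        intermed[i, :] = vcat(mid, label)
    end
    return currentCentroids, intermed
end

pastCentroids = [0.0 0.0 1;
                 5.0 5.0 2]
tempCurrentCentroids = [5.4 5.2;
                        0.4 0.2]

currentCentroids, intermed = labelCentroids(pastCentroids, tempCurrentCentroids)
```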

After getting the labels, the past iteration's centroids receive the values stored in intermed, and new labels are found using the centroids from both the current and past iterations. These labels are used to compute the concordance between them and the labels stored in the pool, which determines whether the model is updated. Finally, after calculating the concordance, the classification model is updated if the concordance is different from 100%.
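The concordance check can be illustrated with a small sketch. It assumes concordance is the percentage of pool instances whose stored label matches the newly found label; the helper `concordance` and the label vectors are hypothetical.

```julia
# Hypothetical helper: percentage of positions where both label vectors agree.
concordance(poolLabels, newLabels) =
    100 * count(poolLabels .== newLabels) / length(poolLabels)

poolLabels = [1, 2, 1, 1, 2]   # labels stored in the pool (toy values)
newLabels  = [1, 2, 2, 1, 2]   # labels found from the centroids (toy values)

c = concordance(poolLabels, newLabels)

# The model is only updated when the two labelings disagree somewhere.
needsUpdate = c != 100
```

Here four of the five labels agree, so `c` is 80.0 and `needsUpdate` is true.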

For example, let's predict some data in the 1CHT dataset.

We start by loading the dataset and reading the values from it.

using SCARGC

# loading dataset
dataset = "datasets/synthetic/1CHT.txt"

# extracting values from the dataset file and storing into `data`
data = extractValuesFromFile(dataset, 16000, 3)

# printing the read dataset
println(data)

# predicting labels and storing them in `predictedLabels` and the accuracy in `accuracy`
predictedLabels, accuracy = scargc_1NN(data, 0.3125, 300, 2)

If we print the results, we get:

# predicted labels
11000-element Array{Int64,1}:
 1
 2
 1
 1
 1
 2
 1
 1
 ⋮
 2
 1
 2
 1
 2
 2
 2

# accuracy ~
99.73040752351096

The accuracy is only available when the dataset contains the actual labels of all instances, since the function compares each prediction with the actual label to compute it.


Utilities

SCARGC.extractValuesFromFile — Function

extractValuesFromFile(
    fileName -> "name of the file that you're trying to read"
    rows     -> "number of rows in the file"
    columns  -> "number of columns in the file"
)

The function reads a file and creates a matrix with the file's values. Using the functions readuntil and parse, it converts each comma-separated value.

The function returns a matrix with the values in the file.
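A minimal sketch of this reading approach, using `readuntil` and `parse` from Julia's Base, might look like the following. The name `readMatrix` is hypothetical, and the sketch reads from any `IO` object (an in-memory `IOBuffer` here) rather than opening a file, so it is not the package's implementation.

```julia
# Hypothetical helper: read `rows` x `columns` comma-separated numbers from `io`.
function readMatrix(io::IO, rows::Int, columns::Int)
    matrix = zeros(Float64, rows, columns)
    for r in 1:rows, c in 1:columns
        # The last value of a line ends at the newline, the others at a comma.
        delim = c == columns ? '\n' : ','
        matrix[r, c] = parse(Float64, strip(readuntil(io, delim)))
    end
    return matrix
end

# Two lines in the same format as the example file, held in memory.
io = IOBuffer("2.91120, 2.94941, 3\n4.97443, 2.31345, 1\n")
m = readMatrix(io, 2, 3)
```

For a file on disk, the same helper could be called inside `open("example.txt") do io ... end`.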

For example, let's suppose we have a text file, called "example.txt", with 10 lines and 7 columns.

2.91120, 2.94941, 2.92468, 0.88324, 6.00512, 5.88021, 3
4.97443, 2.31345, 5.21826, 8.13938, 7.47390, 8.24444, 1
3.13772, 7.87192, 8.35722, 7.41002, 6.95894, 5.20002, 2
3.74301, 6.39082, 3.89166, 8.76716, 1.66374, 6.75246, 3
8.98766, 4.42780, 4.94278, 7.05217, 5.21106, 3.69790, 1
4.38337, 4.91497, 6.16593, 8.73148, 6.17557, 0.90185, 2
0.97263, 2.70122, 0.00343, 7.20105, 1.63296, 8.26112, 1
0.50329, 3.54209, 8.47787, 2.59826, 3.93825, 2.58200, 3
6.75223, 4.59601, 2.02472, 8.10523, 3.65602, 6.91874, 1
8.98924, 2.33177, 8.34892, 4.03544, 8.57646, 7.93690, 1

We run the function with the following values:

fileName is going to receive the path to the file. In this case, it's just "example.txt";
rows is going to receive 10 and
columns is going to receive 7

Then, the function reads each value separated by "," and stores these values in a matrix. This matrix is returned at the end of the function, in the format:

10×7 Array{Float64,2}:
2.91120, 2.94941, 2.92468, 0.88324, 6.00512, 5.88021, 3
4.97443, 2.31345, 5.21826, 8.13938, 7.47390, 8.24444, 1
3.13772, 7.87192, 8.35722, 7.41002, 6.95894, 5.20002, 2
3.74301, 6.39082, 3.89166, 8.76716, 1.66374, 6.75246, 3
8.98766, 4.42780, 4.94278, 7.05217, 5.21106, 3.69790, 1
4.38337, 4.91497, 6.16593, 8.73148, 6.17557, 0.90185, 2
0.97263, 2.70122, 0.00343, 7.20105, 1.63296, 8.26112, 1
0.50329, 3.54209, 8.47787, 2.59826, 3.93825, 2.58200, 3
6.75223, 4.59601, 2.02472, 8.10523, 3.65602, 6.91874, 1
8.98924, 2.33177, 8.34892, 4.03544, 8.57646, 7.93690, 1

This matrix is then ready to be used throughout the code.
