WEKA Software for Machine Learning

James Gray, MS, CCBA
6 min read · Sep 26, 2022


WEKA is free, open-source software packed with data mining tools and machine learning (ML) algorithms. It gives researchers and machine learning engineers a no-code option for testing models on a variety of datasets. Tasks carried out on the platform include data preprocessing, classification, clustering, regression, and visualization. WEKA is popular in academic settings and in machine learning research projects; graduate coursework often requires students to use the platform. Some of the commonly used models and key features of the software are discussed below.

Machine learning is the use of statistical models and algorithms to find patterns in data. "Data" is used broadly throughout the field and can represent many things: numbers, currency, words, letters, pictures, figures, symbols, and so on. Pretty much anything that can be tracked and collected fits into this category. Machine learning is a subfield of artificial intelligence, and deep learning is in turn a subfield of machine learning.

Machine Learning Algorithms

Machine learning algorithms are commonly grouped into three paradigms: supervised, unsupervised, and reinforcement learning. Supervised learning uses labeled data, meaning tags that identify each data point, so the model learns a mapping in which the outcome, y, depends on the input, x. Unsupervised learning likewise takes input data and produces a machine-generated output, but it works on untagged data and must find structure on its own. Reinforcement learning relies not on an input/output relationship but on a cumulative reward quantity: the algorithm tries actions, receives rewards, and settles on whichever behavior accumulates the highest reward. A human can implement optimizations based on the results, and the cycle is repeated for as many trials as needed.
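The reward-driven idea behind reinforcement learning can be sketched with a tiny pure-Python example. This is an illustrative two-armed bandit with made-up, deterministic payoffs (all names and numbers here are hypothetical, not from WEKA): the learner mostly exploits the arm whose reward estimate is highest, explores at random occasionally, and its cumulative reward drives it toward the better arm.

```python
import random

def run_bandit(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy action selection on a simple bandit:
    estimate each arm's value from observed rewards, usually pick
    the arm with the highest estimate (exploit), but try a random
    arm with probability epsilon (explore)."""
    rng = random.Random(seed)
    counts = [0] * len(true_rewards)
    estimates = [0.0] * len(true_rewards)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rewards))   # explore
        else:
            arm = estimates.index(max(estimates))    # exploit
        reward = true_rewards[arm]                   # deterministic payoff
        counts[arm] += 1
        # incremental average update of this arm's value estimate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# Arm 1 pays more, so its estimate should end up highest.
estimates, total = run_bandit([0.2, 0.8])
```

The cumulative `total_reward` is the quantity the paragraph above describes: the learner's behavior is shaped entirely by maximizing it, not by labeled input/output pairs.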

Common ML Techniques

Regression is a common tool within machine learning. There are two main types: linear and logistic. Linear regression predicts numerical values, while logistic regression handles categorical outcomes and is therefore used for classification tasks. An example could be determining whether each of 100 pictures displays a cat or a dog.
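The numerical-versus-categorical distinction can be made concrete with a small pure-Python sketch (the data points are invented for illustration): linear regression fits a line by ordinary least squares, while logistic regression passes a linear score through the logistic (sigmoid) function so the output is a probability between 0 and 1.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sigmoid(z):
    """Logistic function: squashes any real-valued score into (0, 1),
    which is what lets logistic regression output class probabilities."""
    return 1.0 / (1.0 + math.exp(-z))

# Roughly linear toy data: y grows by about 2 per unit of x.
slope, intercept = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
```

A score of 0 maps to `sigmoid(0) == 0.5`, the usual decision boundary between the two classes.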

Beyond logistic regression there is an array of classification techniques. Some to note are Neural Networks, K-Nearest Neighbor (K-NN), Decision Trees, Random Forests, Support Vector Machines (SVM), and Naive Bayes. K-NN, Decision Trees, and Naive Bayes are supervised learning algorithms. Random Forests and SVMs have both supervised and unsupervised variants, and neural networks appear across all three paradigms: supervised, unsupervised, and reinforcement learning. Decision trees are a very common feature in the WEKA environment; they repeatedly split the dataset into subsets, and the result is a decision tree made up of decision nodes and leaves.
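Of the classifiers listed above, K-NN is the simplest to sketch in a few lines of pure Python (the toy 2-D points and labels below are invented for illustration): a query point is labeled by majority vote among its k nearest training points.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k closest
    training points. `train` is a list of ((features...), label)."""
    dists = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy 2-D data: class "a" clusters near the origin, "b" near (5, 5).
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
```

Because the labels drive the vote, K-NN only works with tagged data, which is why it sits squarely in the supervised camp.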

Clustering techniques are similar in spirit to classification, but have more moving parts than previously described. The purpose of clustering in machine learning is to recognize groups of similar objects and produce a model that can differentiate among those objects based on their characteristics. Two of the primary clustering techniques are K-Means and anomaly detection. K-Means is an unsupervised method, while anomaly detection can be approached with either supervised or unsupervised learning.

The last area within machine learning worth covering is a process referred to as feature reduction, which is tasked with simplifying ML models. Dimensionality reduction is a very similar simplification process: it works by reducing the number of variables present in a dataset, just as feature reduction does. Simplified machine learning models mean improved workflows and more efficient use of computing power, resulting in less computation time and fewer resources used to complete a task. Some of the most popular feature reduction techniques are Linear Discriminant Analysis, Principal Component Analysis, and Correlation Analysis.
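Correlation analysis is the easiest of these to sketch in pure Python (the feature values below are invented): if two features are almost perfectly correlated, one of them carries essentially no new information and can be dropped, shrinking the model.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two feature columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Feature B is just feature A rescaled, so it adds no information;
# a correlation-based filter would drop one of the pair.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.0, 4.0, 6.0, 8.0, 10.0]
feature_c = [5.0, 1.0, 4.0, 2.0, 3.0]

redundant = abs(pearson(feature_a, feature_b)) > 0.95
```

The 0.95 threshold is an arbitrary illustrative cutoff; in practice the threshold is a tuning choice.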

Installing and Loading WEKA

Running WEKA requires an installation of Java, and of course the WEKA software itself. WEKA is written in Java and therefore needs a Java Runtime Environment (JRE) to run.

Download pages for Java and WEKA, for each operating system, can be found at the links below.

https://www.java.com/en/download/manual.jsp

https://waikato.github.io/weka-wiki/downloading_weka/

On the WEKA download page, you have the option to download a zip file containing the software, under “Other platforms”.

Once Java and the zip file are downloaded, one can easily launch the WEKA graphical user interface (GUI) from the command line. Open a terminal, start in the home directory, and run the commands below to load the software.

Note: these commands are for macOS and assume the zip was extracted into the Downloads folder; adjust the folder name to match your WEKA version.

cd Downloads
cd weka-3-8-6
java -jar weka.jar

The WEKA GUI Chooser then opens, from which the user can enter the Explorer environment.

Loading Data

The first step in testing a model is to select and import a dataset. WEKA’s native format is the Attribute-Relation File Format, with the .arff extension, which describes a list of instances sharing a set of attributes. This ensures that the WEKA software can identify the correct number of instances and attributes, which is required for accurate testing. If your data comes as a comma-separated values (CSV) file, you will have to convert its format; this can be accomplished in a text editor or a spreadsheet program such as Excel or Google Sheets. Once the data has been prepared and successfully loaded into the environment, you should be able to view all the attributes and their values in the Preprocess tab. If your data contains a class attribute with two or more labels, WEKA also displays this in the Preprocess tab without your having to run a model, which confirms that the data is in the proper format and models are ready to be run.
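As a sketch of what the format looks like, a minimal ARFF file for a tiny weather-style dataset might read as follows (the relation, attribute names, and values are illustrative, not taken from any bundled dataset):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
```

The header declares each attribute and its type (nominal values in braces, or numeric), and each line under @data is one instance, with values in attribute order.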

Adding a Model

To test a model on your dataset, head over to the Classify tab and choose a classifier. There is a range of them to select from. For instance, a J48 model, which builds a decision tree classifier, can be used to determine the misclassification rate across your instances; the result comes with a classification tree visualization. You also have control over how the model interacts with the data, by varying the model’s confidence factor and switching between pruned and unpruned options. Other options include changing the number of folds for cross-validation, which determines how the data is repeatedly partitioned into training and test sets.
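The fold mechanic behind that cross-validation option can be sketched in a few lines of pure Python (index-based, with made-up sizes): the data is split into k folds, and each fold takes one turn as the test set while the remaining k-1 folds form the training set.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as
    the test set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        splits.append((train, test))
    return splits

# 10-fold cross-validation (WEKA's default) over 100 instances:
# ten train/test splits of 90 and 10 instances each.
splits = kfold_indices(n=100, k=10)
```

Averaging a model's error over all ten splits gives a more reliable estimate than a single train/test split, since every instance is tested exactly once.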

Adding a Classifier

To add a meta-learner to the model, click the bolded name of the model. This opens a selection of learners to pair with the original model; an example could be pairing a cost-sensitive classifier with a bagging or boosting learner. Combining multiple learners on one dataset is a process known as meta-learning. The theory behind this method is that it increases the effectiveness of the model, producing more accurate predictions, and WEKA provides this feature.
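Bagging, one of the meta-learning schemes mentioned above, can be sketched in pure Python. This is an illustrative toy, not WEKA's implementation: the base learner here is a deliberately weak nearest-centroid classifier, and the invented dataset is tiny. Each base learner is trained on a bootstrap resample of the data, and the ensemble predicts by majority vote.

```python
import math
import random
from collections import Counter

def train_centroid(sample):
    """Weak base learner: store the mean point (centroid) of each class."""
    by_label = {}
    for features, label in sample:
        by_label.setdefault(label, []).append(features)
    return {label: tuple(sum(c) / len(pts) for c in zip(*pts))
            for label, pts in by_label.items()}

def predict_centroid(model, query):
    """Predict the class whose centroid is closest to the query."""
    return min(model, key=lambda label: math.dist(model[label], query))

def bagging_predict(data, query, n_learners=15, seed=0):
    """Bagging: train each base learner on a bootstrap resample
    of the data, then combine them by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]   # bootstrap resample
        model = train_centroid(sample)
        votes.append(predict_centroid(model, query))
    return Counter(votes).most_common(1)[0][0]

data = [((0, 0), "a"), ((1, 1), "a"), ((0, 1), "a"),
        ((7, 7), "b"), ((8, 8), "b"), ((7, 8), "b")]
```

The vote smooths out the quirks of any single resample, which is the intuition behind pairing learners: the ensemble is steadier than its parts.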

Deciding whether to run a model on the default settings or to change the n-fold cross-validation is up to the one performing the experiment. Once your preferred settings are in place, make sure the class attribute is selected and run your model.

Analyzing Results

Once the paired model and learners have successfully run on the data, the results appear under the Classifier Output section of the screen, and this output can be saved to its own separate file. The time it took to run the model on the dataset is also reported; more complicated learner schemes take longer to run, but that information is readily available. Each unique run is pinned to the left side of the Explorer screen for convenient access.

If a classification model such as the J48 classifier was used, the resulting tree, either pruned or unpruned, is shown. Right-clicking an entry in the list of learner schemes on the left side of the Classify tab enables one to view or save the tree that was created. A more pronounced view of visualizations is available in the Visualize tab.

Last Remarks

There are numerous ML algorithms and models to choose from within the WEKA environment. This can often be an obstacle for researchers and professionals considering the software. But through trial and error, and the familiarity gained from running models on different datasets, WEKA can be a practical alternative to writing standard Python code for machine learning. This article has explained some of the common features of the software, how to maneuver through it, and how to get started in the environment.

If you found this article helpful, make sure to give it some applause and follow for more posts like this.

Resources

Witten, Ian, et al. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 4th Edition, 2017.
