






Show evidence of understanding of clustering and modelling concepts by implementing the requested algorithms on real datasets. Implementation is performed in the R environment, and students are expected to evaluate their results critically.


Learning Outcomes Covered in this Assignment:

This assignment contributes towards the following Learning Outcomes (LOs):

  • LO2 fully implement data mining/machine learning projects, focused on problem analysis, data pre-processing and data post-processing, by choosing and implementing appropriate algorithms;
  • LO4 fully implement, encode and test data mining and machine learning algorithms using a programming language (such as Python) and standard packages and toolkits (such as R);
  • LO6 perform critical evaluation of performance metrics for data mining and machine learning algorithms for a given domain/application.

Expected deliverables


Submit only one PDF file on Blackboard containing the required details. All implemented code should be included in your documentation together with the results/analysis/discussion.


Electronic submission on Blackboard (BB) via a link provided close to the submission time.


Feedback will be provided on BB on 15th December 2020 (15 working days).







  • 7.1.6 Use appropriate processes
  • 7.1.7 Investigate and define a problem
  • 7.1.8 Apply principles of supporting disciplines
  • 8.1.1 Systematic understanding of knowledge of the domain with depth in particular areas
  • 8.1.2 Comprehensive understanding of essential principles and practices
  • 8.2.2 Tackling a significant technical problem
  • 10.1.2 Comprehensive understanding of the scientific techniques



















Instructions for this coursework


During the marking period, all coursework assessments will be compared in order to detect possible cases of plagiarism/collusion. For each question, show all the steps of your work (code/results). Please note that, although clarifications for coursework questions can be provided during tutorials, the coursework itself has to be carried out outside tutorial sessions.



Coursework Description


Clustering Part


In this assignment, we consider a set of observations on a number of silhouettes related to different types of vehicles, using a set of features extracted from each silhouette. Each vehicle may be viewed from one of many different angles. The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale-independent features utilising both classical moment-based measures, such as scaled variance, skewness and kurtosis about the major/minor axes, and heuristic measures, such as hollows, circularity, rectangularity and compactness. Four model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab and an Opel Manta. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the cars.


One dataset (vehicles.xls) is available and has 846 observations/samples. There are 19 variables: 18 numerical features and one nominal variable defining the class of the objects.


Description of attributes:


  1. Comp: Compactness
  2. Circ: Circularity
  3. D.Circ: Distance Circularity
  4. Rad.Ra: Radius ratio
  5. Pr.Axis.Ra: pr.axis aspect ratio
  6. Max.L.Ra: max.length aspect ratio
  7. Scat.Ra: scatter ratio
  8. Elong: elongatedness
  9. Pr.Axis.Rect: pr.axis rectangularity
  10. Max.L.Rect: max.length rectangularity
  11. Sc.Var.Maxis: scaled variance along major axis
  12. Sc.Var.maxis: scaled variance along minor axis
  13. Ra.Gyr: scaled radius of gyration
  14. Skew.Maxis: skewness about major axis
  15. Skew.maxis: skewness about minor axis
  16. Kurt.maxis: kurtosis about minor axis
  17. Kurt.Maxis: kurtosis about major axis
  18. Holl.Ra: hollows ratio
  19. Class: type of vehicle


In this clustering part you need to use the first 18 attributes in your calculations.



1st Objective (partitioning clustering)


You need to conduct a k-means clustering analysis of the vehicle dataset problem. Find the ideal number of clusters (please justify your answer). Choose the best two possible numbers of clusters and perform the k-means algorithm for both candidates. Validate which clustering test is more accurate. For the winning test, get the mean of each attribute (i.e. the centres) of each group. Before conducting the k-means, please investigate whether you need to add any pre-processing task to your code (scaling and/or outlier detection) and justify your answer. Write code in RStudio to address all the above issues (code/results need to be included in your report). In your report you need to check the consistency of your produced cluster outcome against the information obtained from the 19th column and provide the related results/discussion (evidence of a "confusion" matrix and the information extracted from it). At the end of your report, also provide the full code you developed as an Appendix. The usage of the kmeans R function is compulsory.
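
The workflow described above (scaling, comparing two candidate values of k, extracting centres and cross-tabulating against the class column) might look roughly like the sketch below. It is only an illustration under stated assumptions: it uses R's built-in iris data as a stand-in for vehicles.xls, and the candidate values k = 2 and k = 3 are chosen for that stand-in, not for the vehicle data.

```r
# Stand-in data: built-in iris (4 numeric features + class), NOT the vehicles dataset
features <- scale(iris[, 1:4])   # z-score scaling so no single attribute dominates distances
labels   <- iris$Species

set.seed(42)                     # k-means starts from random centres

# Elbow method: total within-cluster sum of squares for k = 2..8
ks  <- 2:8
wss <- sapply(ks, function(k) kmeans(features, centers = k, nstart = 25)$tot.withinss)
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Fit the two candidate values of k (assumed here for the stand-in data)
fit2 <- kmeans(features, centers = 2, nstart = 25)
fit3 <- kmeans(features, centers = 3, nstart = 25)

# Between-cluster / total sum of squares: one simple indicator only.
# It always rises with k, so read it alongside the elbow plot.
ratios <- c(k2 = fit2$betweenss / fit2$totss, k3 = fit3$betweenss / fit3$totss)
print(ratios)

# Cluster centres: mean of each (scaled) attribute per cluster
print(fit3$centers)

# "Confusion" matrix: true classes (rows) against cluster labels (columns)
conf <- table(Class = labels, Cluster = fit3$cluster)
print(conf)
```

Cluster numbers are arbitrary labels, so the table has to be read by matching each cluster to the class that dominates its column rather than by expecting a diagonal.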


(Marks 40)


Forecasting Part

Time series analysis can be used in a multitude of business applications for forecasting a quantity into the future and explaining its historical patterns. An exchange rate is the price of one country's currency expressed in terms of the currency of another country. In the modern world, the exchange rates of the most successful countries tend to be floating: the rate is set by the foreign exchange market through supply and demand for that particular currency in relation to other currencies. Exchange rate prediction is one of the challenging applications of modern time series forecasting and is very important for the success of many businesses and financial institutions. The rates are inherently noisy, non-stationary and deterministically chaotic. One general assumption made in such cases is that the historical data incorporate all of this behaviour. As a result, the historical data is the major input to the prediction process. Forecasting exchange rates poses many challenges, since they are influenced by many economic factors. Like other economic time series, an exchange rate has trend, cycle and irregularity components. Classical time series analysis does not perform well on finance-related time series. Hence, the idea of applying Neural Networks (NN) to forecast exchange rates has been considered as an alternative solution. An NN tries to emulate human learning capabilities, creating models that represent the neurons in the human brain. In addition, research has also been directed to the Support Vector Machine (SVM), which has emerged as a new and powerful technique for learning from data and in particular for solving classification and regression problems with better performance. The main advantage of SVM is its ability to minimize structural risk, as opposed to the empirical risk minimization employed by the NN system.


In this forecasting part you need to use an MLP-NN and an SVM-based regression (SVR) model to predict the next step-ahead exchange rate of GBP/EUR. Daily data (exchangeGBP.xls) have been collected from January 2010 until December 2011 (500 data points). The first 400 of them have to be used as training data, while the remaining ones form the testing set. Use only the 2nd column from the .xls file, which corresponds to the exchange rates.


2nd Objective (MLP)


You need to construct an MLP neural network for this problem. You need to consider the appropriate input vector (time series), as well as the internal network structure (such as hidden layers, nodes, learning rate). You may consider any de-trending scheme if you feel it is necessary. Write code in RStudio to address all these requirements. You need to show the performance of your network both graphically and in terms of the following statistical indices: RMSE, MAE and MAPE.

Suggestion: Experiment with various network structures as well as various input vectors and show a comparison table of their performances (using these specific statistical indices). This will be a good justification for your final network choice. Show all your working steps (code & results, including comparison results from models with different input vectors and internal structures). As everyone will have different forecasting results, emphasis in the marking scheme will be given to the adopted methodology and the explanation/justification of the various decisions you have taken in order to provide a solution that is acceptable in terms of performance. The input selection problem is very important. Experiment with various options (i.e. how many past values you need to consider as potential network inputs). Full details of your results/code/discussion are needed in your report. At the end of your report, also provide the full code you developed as an Appendix. The usage of the neuralnet R function for MLP modelling is compulsory.
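
To make the input selection problem and the three requested indices concrete, here is a minimal base-R sketch. It assumes a synthetic random-walk series in place of exchangeGBP.xls, and the helper names (`make_lagged`, `normalise`, `denormalise`, `rmse`, `mae`, `mape`) and the choice of three lags are illustrative, not prescribed. A data frame such as `train` could then be passed to `neuralnet()` with a formula like `target ~ lag1 + lag2 + lag3`; the sketch stops short of training and scores a naive baseline instead.

```r
# Synthetic stand-in series (NOT the exchangeGBP data): a noisy random walk of 500 points
set.seed(1)
rate <- 1.15 + cumsum(rnorm(500, sd = 0.003))

# Build an input/output table from the previous `lags` values using embed().
# For lags = 3, each row of embed(x, 4) is (x[t], x[t-1], x[t-2], x[t-3]).
make_lagged <- function(x, lags) {
  df <- as.data.frame(embed(x, lags + 1))
  names(df) <- c("target", paste0("lag", 1:lags))
  df
}

# Min-max normalisation based on the training range only, plus its inverse
normalise   <- function(x, lo, hi) (x - lo) / (hi - lo)
denormalise <- function(z, lo, hi) z * (hi - lo) + lo

# The three requested indices, computed on the original (de-normalised) scale
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))

lags <- 3
lo <- min(rate[1:400]); hi <- max(rate[1:400])
scaled <- normalise(rate, lo, hi)
lagged <- make_lagged(scaled, lags)

# Rows whose target falls in the first 400 points train; the rest test
n_train <- 400 - lags
train <- lagged[1:n_train, ]
test  <- lagged[(n_train + 1):nrow(lagged), ]

# Naive baseline: predict the next rate with the most recent one (lag1)
actual   <- denormalise(test$target, lo, hi)
baseline <- denormalise(test$lag1, lo, hi)
print(c(RMSE = rmse(actual, baseline),
        MAE  = mae(actual, baseline),
        MAPE = mape(actual, baseline)))
```

Repeating this with different values of `lags` gives one row per input configuration for the comparison table, and the naive baseline gives a reference point against which any trained network can be judged.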


(Marks 35)


3rd Objective (SVR)


You need to construct an SVR model to address this forecasting problem. You need to consider the appropriate input vector. Write code in RStudio to implement this SVR scheme. You need to show the performance of your model both graphically and in terms of the following statistical indices: MSE, RMSE and MAPE.

Suggestion: The input selection problem is very important. Experiment with various SVR parameters. Show all your working steps (code & results, including comparison results from models with different input vectors). As everyone will have different forecasting results, emphasis in the marking scheme will be given to the adopted methodology and the explanation/justification of the various decisions you have taken in order to provide a solution that is acceptable in terms of performance. Full details of your results/code/discussion are needed in your report. At the end of your report, also provide the full code you developed as an Appendix.
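
One possible way to organise an SVR parameter comparison is sketched below. The brief does not name an SVR package, so the use of `svm()` from the e1071 package is an assumption, as are the synthetic series, the two-lag input vector and the small kernel/cost grid. Only RMSE is tabulated here for brevity; MSE and MAPE can be added in the same way.

```r
# Synthetic stand-in series (NOT the exchangeGBP data)
set.seed(2)
rate <- 1.15 + cumsum(rnorm(500, sd = 0.003))

lags <- 2
m <- embed(rate, lags + 1)             # columns: x[t], x[t-1], x[t-2]
y <- m[, 1]                            # next-step target
x <- m[, -1, drop = FALSE]             # lagged inputs
colnames(x) <- paste0("lag", 1:lags)

n_train   <- 400 - lags                # rows whose target lies in the first 400 points
train_idx <- 1:n_train
test_idx  <- (n_train + 1):nrow(m)

# Hypothetical grid of kernels and cost values to compare
grid <- expand.grid(kernel = c("linear", "radial"), cost = c(1, 10),
                    stringsAsFactors = FALSE)
grid$RMSE <- NA_real_

# Assumption: e1071 is installed; if not, the RMSE column simply stays NA
if (requireNamespace("e1071", quietly = TRUE)) {
  for (i in seq_len(nrow(grid))) {
    fit <- e1071::svm(x[train_idx, , drop = FALSE], y[train_idx],
                      type = "eps-regression",
                      kernel = grid$kernel[i], cost = grid$cost[i],
                      epsilon = 0.01)
    pred <- predict(fit, x[test_idx, , drop = FALSE])
    grid$RMSE[i] <- sqrt(mean((y[test_idx] - pred)^2))
  }
}
print(grid)
```

Each row of `grid` corresponds to one line of the comparison table; extending the loop over several values of `lags` covers the input-vector comparison as well.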


Coursework Marking scheme

The Coursework will be marked based on the following marking criteria:

1st Objective (partitioning clustering)

  • Find the ideal number of clusters and justify it by showing all necessary steps/methods (via manual & automated tools)
  • Perform k-means with the best two numbers of clusters
  • Find the mean of each attribute for the winning clustering
  • Check the consistency of your results against the 19th column and provide relevant discussion
  • Check for any pre-processing tasks (scaling, outliers)


2nd Objective (MLP)

(Marks 25)

  • Discuss the input selection problem for time series prediction and propose various input configurations (Suggestion: consult the literature for system identification configurations) [8 marks]
  • Perform any pre-processing steps (such as normalisation) before training [5 marks]
  • Implement a number of MLPs, using various structures (layers/nodes) / input parameters / network parameters, and show their performance comparison in a table (based on testing data) through the provided statistical indices (6 marks for structures with different input parameters, 6 marks for different internal NN structures and 4 marks for the comparison table) [16 marks]
  • Provide your best results both graphically (your prediction output vs. desired output) and via performance indices (3 marks for the graphical display and 3 marks for showing the requested statistical indices) [6 marks]

3rd Objective (SVR)

  • Discuss the input selection problem and propose various input configurations
  • Design an SVR and use various structures/parameters (incl. linear/nonlinear kernels) / input parameters, and show their performance comparison in a table (based on testing data) through the provided statistical indices (6 marks for structures with different input parameters, 6 marks for different internal SVM structures/parameters and 4 marks for the comparison table) [16 marks]
  • Provide your best results both graphically (your prediction output vs. desired output) and via performance indices (3 marks for the graphical display and 3 marks for showing the requested statistical indices)







