Dr. Gary D. Boetticher

 

CSCI 5833 -- Data Mining Tools and Techniques

STAT 5931 -- Research Topics in Statistics
Updated May 5, 2024

 

Office and Addresses

Delta 171 Phone 281.283.3805
email: boetticher@uhcl.edu
Secretary: Ms. Caroline Johnson, Delta 151 281.283.3860

Class Hours (Face-to-Face or Online)

This class is a hybrid class. The only face-to-face sessions are the orientation, midterm and final.

Orientation: Monday, June 3 at noon in Delta 241
Midterm: Wednesday, July 3 at noon in Delta 241
Final: Wednesday, July 24 at noon in Delta 241

Office Hours

Email me to set up an appointment or to arrange a Zoom session. My Zoom details can be found in the Google folder.

Teaching Assistant

Ms. Srutha Dasjou
Email: DasojuS4283@UHCL.edu
TA Hours: Bold = Zoom

   June: Monday 5 to 9 PM; Tuesday 5 to 9 PM; Wednesday 9 to Noon; Thursday 9 to Noon

   July:  Monday 5 to 9 PM; Tuesday 5 to 9 PM; Wednesday 9 to Noon; Thursday 9 to Noon

   

Course Description

Data mining has emerged as one of the most exciting and dynamic fields in computer science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; as a result, the market for data mining software alone is expected to exceed $10 billion by the end of this year.

Although the theoretical underpinnings of data mining have existed for a while (e.g., pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. Data mining seeks to detect 'interesting' and significant nuggets of relationships/knowledge buried within data. It seeks to discover association rules, episode rules, sequential rules, etc., and it is concerned with efficient data structures and algorithms for data examination that scale well.

There have been several success stories in this relatively young area: the SKICAT system for automatic cataloguing of sky surveys (JPL), the Advanced Scout system for mining NBA data (IBM), the QuakeFinder system for geoscientific data mining (UCLA/JPL) and the PYTHIA system for mining information from performance evaluation of scientific software (Purdue). Case studies from various domains (financial, bioinformatics, etc.) will be presented.

In a 15-week semester (Fall, Spring), you are expected to commit 15 to 20 hours per week to this course. This 8-week class runs in about half the time of a 15-week class, so you are expected to commit 30 to 40 hours per week.

Course Goals

 

By the end of the course, you will

  • Understand the data mining process.

  • Have a working knowledge of different data mining tools and techniques.  

  • Have an understanding of various Machine Learners (ML).

  • Have a working knowledge of some of the more significant current research in the area of data mining and ML.

  • Be aware of various data mining data repositories for the study of data mining.

  • Be able to effectively apply a number of data mining algorithms (e.g., neural networks, genetic algorithms) to solve data mining problems from various problem domains including Financial and Bioinformatics.

  • Be familiar with several successful applications of data mining.

Prerequisites

A course in artificial intelligence, machine learning, pattern recognition, algorithms, or statistics would be helpful, but is not required. Programming experience (or at least one course) in C, C++, C#, Delphi, Java, Pascal, or VB (using Visual Studio) is required. If you do not meet the prerequisites, then you need to drop this course!

Methodology

Lecture, seminar, case studies, and interactive problem solving.

Appraisal:

 Assignments:  20% of the total
 Midterm:  40% of the total
 Final: 40% of the total

Grades will be based solely on criteria listed above. No other factors will be considered.

Grading Scale

    93+ = A; 90+ = A-; 87+ = B+; 83+ = B; 80+ = B-;

      77+ = C+; 73+ = C; 70+ = C-; 67+ = D+; 63+ = D; 60+ = D-; 0+ = F
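
As a rough illustration of how the appraisal weights and the grading scale combine, here is a minimal Python sketch. It is not an official calculator. It assumes that every score is expressed as a percentage from 0 to 100, that homework scores are averaged equally, that no rounding is applied, and that each cutoff is an inclusive lower bound; none of these details are stated above. The names course_percentage and letter_grade are hypothetical.

```python
# Illustrative sketch only; not an official grade calculator.
# Assumptions (not stated in the syllabus): every score is a percentage
# from 0 to 100, homework scores are averaged equally, no rounding is
# applied, and each cutoff is an inclusive lower bound.

# Weights from the Appraisal section.
WEIGHTS = {"assignments": 0.20, "midterm": 0.40, "final": 0.40}

# Cutoffs from the Grading Scale section, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"),
    (60, "D-"), (0, "F"),
]


def course_percentage(homework_scores, midterm, final):
    """Return the weighted course percentage (20% / 40% / 40%)."""
    assignments = sum(homework_scores) / len(homework_scores)
    return (WEIGHTS["assignments"] * assignments
            + WEIGHTS["midterm"] * midterm
            + WEIGHTS["final"] * final)


def letter_grade(percentage):
    """Return the first letter whose cutoff the percentage reaches."""
    for cutoff, letter in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical scores for four homeworks, the midterm, and the final.
    pct = course_percentage([100, 90, 80, 90], midterm=85, final=92)
    print(f"{pct:.1f} -> {letter_grade(pct)}")
```

Because the two exams carry 80% of the total between them, a change of one point on either exam moves the course percentage twice as much as a one-point change in the homework average.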

My motto:

Foster disciplined, altruistic passion.

Required Textbook

     None.

Reference Books

1. Aggarwal, Charu C. Data mining: the textbook. Springer, 2015.

2. Berry and Linoff, Data Mining Techniques, Wiley, 2000.

3. Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.

4. Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.

5. Witten, Ian H., and Eibe Frank. "Data mining: practical machine learning tools and techniques with Java implementations." ACM Sigmod Record 31.1 (2002): 76-77.

Other Reference Materials

Conferences, Journals, and Organizations

Data Resources

 

Bioinformatic and biological databases:  

Santa Fe dataset  

 

Data Mining Software

  • R version 4.3.0 (Windows): A programming language and software environment for statistical computing and graphics. Third-party packages support machine learners. It is available on the Google Drive.
  • RStudio (Desktop version 2023.03.0+386): A GUI for R. It is available on the Google Drive.
  • Python 3.9.1: Available for various platforms. The link is for a Windows environment.
  • WEKA 3.9: Waikato Environment for Knowledge Analysis, contains many different learners. Versions 3.6.8 through 3.9 are available in the Google Folder. It is also installed in the NT lab in the Delta building.
  • GDB Net: A Backpropagation Neural Network Program. It is available on the Google Drive.
  • GDB GP: A Genetic Program Software Package. It is available on the Google Drive.
  • RapidMiner Studio: With RapidMiner Studio, you can access, load and analyze any type of data, both traditional structured data and unstructured data like text, images, and media. It can also extract information from these types of data and transform unstructured data into structured data.
  • Orange: You will fall in love with this tool’s visual programming and Python scripting. It also has components for machine learning, add-ons for bioinformatics and text mining. It’s packed with features for data analytics.
  • KNIME: Data preprocessing has three main components:  extraction, transformation and loading. KNIME does all three. It gives you a graphical user interface to allow for the assembly of nodes for data processing. It is an open source data analytics, reporting and integration platform. KNIME also integrates various components for machine learning and data mining through its modular data pipelining concept and has caught the eye of business intelligence and financial data analysis.

    Written in Java and based on Eclipse, KNIME is easy to extend and to add plugins. Additional functionalities can be added on the go. Plenty of data integration modules are already included in the core version.

  • NLTK: When it comes to language processing tasks, nothing can beat NLTK. NLTK provides a pool of language processing tools covering data mining, machine learning, data scraping, sentiment analysis and various other language processing tasks. All you need to do is install NLTK, pull a package for your favorite task and you are ready to go. Because it’s written in Python, you can build applications on top of it, customizing it for small tasks.
  • Software for Data Mining: List of software and tools maintained by KDnuggets.
  • IBM Data Warehouse Edition: Leading-edge DM software.
  • Teradata Warehouse Miner
  • Oracle Data Mining: FAQ on Oracle 12c data mining tools.
  • Microsoft Data Analytics
  • SPSS Data Mining
  • Data Mining Products: A list of companies and their products in the area of data mining.
  • 43 Top Data Mining Software Programs
  • Six of the Best Open Source Data Mining Tools
  • Free Data Mining Tools
  • List of data mining and learning analytics tools

 

Schedule (Tentative)

Jun 03 – Course overview, What is Data Mining? The Data Mining Process

************************************************************************

***   All course materials are located in the Google Drive folder.   ***

***   Send me your Gmail address and I will add you to the folder.   ***

************************************************************************

    

Assign Homework 1

Point value: 100 points

Due date:  Thursday, June 13th, 5 PM (Central Time) via email

 

FOR THIS WEEK (IF NOT SOONER)       

            Blue Color = Available in the Google Folder

                It is the student's responsibility to download and print the notes.

·   Read:  Syllabus

·   Read documents in:  WK00 Notes - Orientation Data Mining - 20240507.pdf

·   Read documents in:  WK01 Notes - What is Data Mining and the DM Process.zip

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in:  WK02 Notes - The Data in Data Mining.zip

 

 

Jun 05 – The Data in Data Mining, General Types of Data Mining Problems

  

Assign Homework 2

Point value: 100 points

Due date:  Thursday, June 27th, 5 PM (Central Time)

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in:  WK03 Notes - The Experimental Process.zip

 

Jun 10 – The Experimental Process

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in:  WK04 Notes - Data Preprocessing - Introduction.zip

 

 

Jun 12 –  Data Preprocessing - Introduction

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in:  WK05 Notes - Data Preprocessing - Attribute and Dimension Reduction.zip 

 

Jun 13 –  Assignment 1 is due 5 PM Central Time

 

Jun 17 – Data Preprocessing - Attribute and Dimension Reduction

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in: WK06 and WK07 Notes - Decision Trees.zip

 

Jun 19 –  Decision Trees

FOR NEXT WEEK (IF NOT SOONER)

 

Jun 24 – Decision trees continued, Review

 

FOR NEXT WEEK (IF NOT SOONER)  

·  Submit:  Midterm questions by Monday, July 1st, 2 PM or sooner. This is optional.

                Use the template found on the Google Drive

                Leave out any identifying information (Your name, Student ID number)

                Specify whether you want me to share your questions with your fellow students.

·   Study!

 

Jun 27 – Assignment 2 is due 5 PM Central Time

 

Jul 03 – Midterm  Exam - Noon Central Time

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in: WK08 and 9 Notes - Clustering, Instance Based Learning, SVM, Ensemble Learning.zip

·   Review the following K-Means applet:

      http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

 ·   Read:  No assigned readings for next week

 ·   Review the following SVM applet:

      http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

 

 Background reading (not required)

 Aha, D., Kibler, D., and Albert, M., Instance-Based Learning Algorithms, Machine Learning, Kluwer Publishers, 6, 1991, Pp. 37-66.

        

******** July 8 – Last day to withdraw ********

 

Jul 08 – Ensemble Learning, Random Forests, Clustering, Instance-Based Learning, SVM

 

Assign Homework 3

Point value: 100 points

Due date:  Thursday, July 11th, 5 PM (Central Time)

 

FOR NEXT WEEK (IF NOT SOONER)

·  Review the following tutorial on Genetic Algorithms:

            http://www.obitko.com/tutorials/genetic-algorithms/

·   Read documents in: WK11 Notes - Genetic Algorithms.zip

Background reading (not required)

Koza, John, Genetic Programming, Dept. of CS, Stanford  University, 1997, Pp. 1 – 26.

Koza, John, Future Work and Practical Applications of Genetic Programming, Handbook of Evolutionary Computation, June, 1996, Pp. 1 – 7.  

Koza, John, Riccardo Poli, A Genetic Programming Tutorial, Stanford University

Mitchell, Tom M., Machine Learning and Data Mining, Communications of the ACM, 42(11), November 1999.

Whitley, Darrell, A Genetic Algorithm Tutorial, TR CS-93-103, Dept. of CS, Colorado State University, Pp. 1 – 38.

 

Jul 10 – Genetic Algorithms, Genetic Programs

 

Assign Homework 4

Point value: 100 points

Due date:  Tuesday, July 23rd,  5 PM (Central Time)

 

FOR NEXT WEEK (IF NOT SOONER)

·   Review one (or more) of the following online neural network tutorials:

NN Tutorial 1

NN Tutorial 2

NN Tutorial 3

·   Read documents in: WK12 Notes - Neural Networks.zip

·   Run:    WK12 R Example - Self Organizing Maps

·   Download, install, and try GDB_Net (a neural network software package).

·   Try out this Kohonen Self-Organizing Map applet

                 Click here for the zipped code of this applet.

Background reading (not required)

Gerstner, Wulfram, Supervised Learning for Neural Networks: A Tutorial with Java exercises. The corresponding Java applets for this tutorial are available at:     http://diwww.epfl.ch/mantra/tutorial/english/

 

Jul 11 – Assignment 3 is due 5 PM Central Time

 

Jul 15 – Neural Networks including Perceptron, Backprop., Self-Organizing Map (SOM)

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in: WK13 Notes and Papers - Evaluating Results.zip

Jul 17 – Evaluating Data Mining results

       

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in: WK14 - Notes - Neural Networks - CNNs and GANs.zip

 

Jul 22 – Neural Networks - Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Final Exam Review

FOR NEXT WEEK (IF NOT SOONER)  

Background reading (not required)

Anand, Sarbjot, et al., The Role of Domain Knowledge in Data Mining, CIKM, Baltimore, Maryland, 1995.

Glymour, Clark, et al., Statistical Inference and Data Mining, Communications of the ACM, 39(11), November 1996.

Elder, John, et al., A Statistical Perspective on Knowledge Discovery in Databases

Friedman, Jerome H., Data Mining and Statistics: What's the Connection?, Department of Statistics, Stanford

 

·   Submit:   Final questions by Monday, Jul 22nd, 2 PM or earlier. This is optional.

                Use the template found on the Google Drive

                Strip out any identifying information (Your name, Student ID number)

                Specify whether you want me to share your questions with your fellow students.

 

·   Study!

Jul 23 – Assignment 4 is due 5 PM Central Time

 

Jul 24 – Final Exam - Noon Central Time

 

Other Policies

This class has 6 simple rules:

1) Be respectful of others.

2) Be very passionate about your learning and do your best.

3) Be fearless - ask lots of questions in class.

4) Don't be late on anything.

5) Don't ever cheat.

6) Have fun!

 

Miscellaneous

  • Any person with a disability who requires a special accommodation should inform me and contact the Disability services office or call 281 283 2627 as soon as possible.

  • You are expected to come fully prepared to every class!

  • If there is any religious observance that may interfere with any scheduled exam, homework due date, or attending class, please notify me of the situation during the first 2 weeks of class so that adjustments can be made at that time.

  • Please turn off all cell phones and pagers prior to the start of class.

  • I am willing to provide letters of recommendation/references only if you have attained an 'A' in one of my classes, or an 'A-' in each of two of my classes.

  • I highly recommend that you seek out your advisor and complete your Candidate Plan of Study (CPS) as soon as possible. I am normally not available for advising during the summer months.




2700 Bay Area Boulevard
Delta Building. Office 171
Houston, Texas 77058
Voice: 281-283-3805
Fax: 281-283-3869
boetticher@uhcl.edu


© 2002-2024 Boetticher: Data Mining Course, All Rights Reserved.
