Dr. Gary D. Boetticher


 

CINF/CSCI 5931 -- Big Data Analytics: Tools, Techniques and Applications
Updated  September 26, 2018

Face-to-Face Class Hours

Wednesday 7:00 - 9:50, Delta 242

Office Hours

Wed. 1 - 4, Thur. 1 - 4,  or by appointment. If the suite door is locked, then call my extension (last 4 digits) using the phone in the hallway.

Teaching Assistant

Ms. Rekha Sampangiramaiah
Email: sampangirama@uhcl.edu

Hours: Monday 5 to 10; Tuesday 3 - 6 and 7 to 10; Wednesday 1 - 4

 

 

 

 

Course Description

Data generated by people, organizations, and ubiquitous machines are rapidly increasing in both size and complexity. This phenomenon of Big Data has created tremendous opportunities to derive value using advanced analytics frameworks and techniques. The McKinsey Global Institute named Big Data as one of the top five catalysts for US economic growth, especially for the retail, manufacturing, health care, and government services sectors. This course teaches students the core technologies used to manipulate, store, and especially analyze big data. Students will acquire the essential skills required for a typical Data Science project. In this class, we couple hands-on labs/projects with lectures/readings. The hands-on activities familiarize students with Hadoop for storage (HDFS) and Spark as the computing engine. Students will learn to apply typical machine learning techniques (using Spark MLlib) and other analytics techniques, such as graph processing (using Spark GraphX), to big data. Python is the main programming language for this course.

The traditional graduate student load is 3 courses. Be prepared to commit 15 to 20 hours per week to this course!

Course Goals

 

By the end of the course, you will

  • Clearly define what big data entails and its major characteristics.

  • Accurately describe a typical data science project lifecycle.

  • Be able to provide an in-depth explanation of the enabling technologies of Hadoop.

  • Precisely describe how Spark complements MapReduce as the big data computing engine.

  • Possess a broad knowledge of Spark components and their utilities in big data computing.

  • Acquire a solid understanding of major Machine Learning algorithms and models.

  • Precisely define a graph and its major properties.

  • Using Cloudera Distribution Hadoop, set up a four-node cluster.

  • Develop Spark applications in Python.

  • Program a basic Spark application.

  • Use Spark SQL to develop basic data processing applications.

  • Using MLlib, apply various machine learning algorithms, including Regression, Classification, and Clustering, to data sets from different domains.

  • Use GraphX to perform large graph (network) analysis.

Prerequisites

An undergraduate course in Database Management Systems. Experience in one programming language, preferably Python. If you do not meet the prerequisites, then you need to drop this course!  

Methodology

Lecture, seminar, case studies, and interactive problem solving.

Appraisal:

 Homework:  15%
 Quizzes and Participation:  5%
 Best 2 out of 3:
           Term Project:  40%
           Midterm:  40%
           Final:  40%
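
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes every component is scored on a 0 to 100 scale and that the two highest of the term project, midterm, and final each count for 40%. The function name course_score is made up for this example and is not part of any course tool.

```python
def course_score(homework, quizzes, term_project, midterm, final):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    # Keep the two highest of the three 40% components (best 2 out of 3).
    best_two = sorted([term_project, midterm, final], reverse=True)[:2]
    # Homework 15%, quizzes and participation 5%, best two at 40% each.
    return 0.15 * homework + 0.05 * quizzes + 0.40 * sum(best_two)


if __name__ == "__main__":
    # Hypothetical student: the midterm (70) is the lowest and is dropped.
    print(round(course_score(90, 100, 80, 70, 95), 2))
```

Under these assumptions the weights sum to 100% (15 + 5 + 40 + 40), so the result stays on the same 0 to 100 scale as the inputs.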

Grading:

    93+ = A; 90+ = A-; 87+ = B+; 83+ = B; 80+ = B-;

      77+ = C+; 73+ = C; 70+ = C-; 67+ = D+; 63+ = D; 60+ = D-; 0+ = F
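
The scale above can be read as a simple threshold lookup. The sketch below is only an illustration: it assumes the course average is on a 0 to 100 scale, that each entry is a lower bound (including the 90 and 70 entries), and that no rounding is applied before the lookup. The names GRADE_CUTOFFS and letter_grade are invented for this example.

```python
# (lower bound, letter) pairs, highest first, taken from the grading scale.
GRADE_CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
    (0, "F"),
]


def letter_grade(score):
    """Return the letter for the first cutoff the score meets or exceeds."""
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"  # Fallback for negative inputs, which should not occur.


if __name__ == "__main__":
    for s in (95, 90, 88.5, 72.9, 59.9):
        print(s, letter_grade(s))
```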

My motto:

Foster disciplined, altruistic passion.

Required Textbook

     None.

Reference Books 

  • Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale by Tom White. O’Reilly. 4th Edition. ISBN-10: 1491901632, ISBN-13: 978-1491901632.

  • Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis by Mohammed Guller. Apress Publishing. 1st Edition. ISBN-10: 1484209656, ISBN-13: 978-1484209653.

  • Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. O’Reilly, 1st Edition. ISBN-10: 1449358624, ISBN-13: 978-1449358624.

  • Machine Learning with Spark by Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath. Packt Publishing, 2nd edition. ISBN-10: 1785889931, ISBN-13: 978-1785889936.

  • Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph by David Loshin. ISBN 9780124173194. An electronic copy of this book is available through the UHCL Neumann Library.

Recommended Readings (Electronic versions are available on the Google Drive)

  • J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," presented at the Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6, San Francisco, CA, 2004.

  • S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS operating systems review, 2003, pp. 29-43.

  • M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, and A. Ghodsi, "Spark sql: Relational data processing in spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383-1394.

  • X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, and S. Owen, "Mllib: Machine learning in apache spark," Journal of Machine Learning Research, vol. 17, pp. 1-7, 2016.

  • A. Rabkin, C. Reiss, R. Katz, and D. Patterson, "Using clouds for MapReduce measurement assignments," Trans. Comput. Educ., vol. 13, pp. 1-18, 2013.

  • R. S. Xin, D. Crankshaw, A. Dave, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "Graphx: Unifying data-parallel and graph-parallel analytics," arXiv preprint arXiv:1402.2394, 2014.

  • R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, "Shark: SQL and rich analytics at scale," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of data, 2013, pp. 13-24.

  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, 2012, pp. 2-2.

  • M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," HotCloud, vol. 10, p. 95, 2010.

  • M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams: Fault-tolerant streaming computation at scale," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013, pp. 423-438.

Online Tutorials/Documentation/Resources

Software:

  • Cloudera Distribution for Hadoop (CDH5)

  • Spark (Apache Spark is included with CDH5)

  • Python

  • R

Hardware

  • Students need access to laptops with a 64-bit operating system and at least 8 GB of RAM (more is strongly advised).

Schedule (Tentative)

 

Aug 29 Introduction to Big Data Analytics

 

************************************************************************

***   All course materials are located on the Google Drive.          ***

***   You are expected to bring a copy of the notes to all lectures. ***

***   I highly recommend you place the notes in a 3-ring binder.     ***

************************************************************************

 

FOR THIS WEEK (IF NOT SOONER)       

            Blue Color = Available on the Google Drive

                It is the student's responsibility to download the notes, print the notes, and bring them to class.

·   Read:  Syllabus

·   Read documents in:  WK01 - Notes - Intro to Big Data Analytics.zip

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in:  WK02 - Notes - Hadoop Ecosystem, HDFS, YARN, Installing Hadoop.zip

 

 

Sep 05  The Hadoop Ecosystem, HDFS, YARN

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in:  WK03 - Notes and Reference Papers on MapReduce.zip

 

Sep 12  MapReduce

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Sep 19 Installing Hadoop, Hands on MapReduce

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in:  WK04 - Notes - Hive, Pig, HBase, Flume, Sqoop, Hadoop 3.0.zip

 

Sep 26 Hive, Pig, HBase, Flume, Sqoop, Hadoop 3.0

 

Identify your partner for the term project via email (boetticher@uhcl.edu)

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Oct 03   Hands on Hive, Pig, HBase, Review for the Midterm

 

FOR NEXT WEEK (IF NOT SOONER)  

·  Submit:  Midterm questions by Tuesday, October 9th, 7 PM.

                Use the template found on the Google Drive

                Strip out any identifying information (Your name, Student ID number)

                Specify whether you want me to post your questions on the Google Drive.

·   Study!

 

Oct 10 Midterm

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Oct 17 Data Analysis

 

Submit a description of the data for part 1 of the term project (boetticher@uhcl.edu)

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Oct 24 Introduction to Spark, Jupyter Notebook Demo (Anaconda Navigator)

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Oct 31 Classification and Regression

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Nov 07 Clustering-Application, Random Forest, Bagging/Boosting Algorithms

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

******** November 12 – Last day to withdraw ********

 

Nov 14 – Installing Spark, Hands on: Spark, Jupyter Notebook, MLlib

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Nov 18 – Part 2 of the term project is due at 7 PM via email  (boetticher@uhcl.edu)

 

Nov 21 – Thanksgiving - No class

 

Nov 28 Introduction to graph analysis, Hands on GraphX

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read: TBD

 

Nov 25 – Term project report due at 7 PM via email  (boetticher@uhcl.edu)

 

Dec 05 – Term Project Presentations, Review for the final

 

FOR NEXT WEEK (IF NOT SOONER)

·   Submit:   Final questions by Tuesday, December 11th, 7 PM. This is optional.

                Use the template found on the Google Drive

                Strip out any identifying information (Your name, Student ID number)

                Specify whether you want me to post your questions on the Google Drive.

 

·   Study!

 

Dec 12  Final Exam

 

 

Other Policies

Homework, Projects, Research Paper

  • Homework and projects are due exactly at the prescribed time (usually the beginning of class). As soon as a homework or project is collected, all others are considered 1 day late (even if it is only 3 minutes late). If you might be running late, consider emailing the assignment. Also, when preparing your assignment, be mindful of possible backlogs at the printer, a jammed printer, a printer out of toner, etc.

  • Late homework/projects are accepted with a penalty of 10% deduction per 24-hour period after the due date. No late project will be accepted one week after the due date. The last homework/project cannot be late.

  • There will be no extra-credit homework or projects in this course.

  • All homework and projects must be typed, not handwritten.

  • A cover page is expected for all homework and projects.

  • VERY IMPORTANT! In certain classes students are encouraged to work in groups. For this class, you are expected to work individually on most homework and projects. Students may not discuss, use, email, show, give, buy, sell, borrow, trade, steal, download from the Internet, etc., in whole or in part, any of the homework or projects in any manner not prescribed by the instructor. This condition applies even after you complete this course! The penalty for cheating will be extremely severe and will result in at least a one-letter-grade reduction in your final grade. It could result in an F for this course. Cheating can also result in losing a scholarship, a TA position, or an RA position. There may be some group assignments for this class. If there is inappropriate sharing among two or more groups, then all students will be considered guilty. Choose your partner very carefully!

  • Handing in an assignment for another student is considered cheating. Penalty for cheating will be extremely severe and may result in an F for this course. 

  • VERY IMPORTANT! Failing to report to the instructor any incident in which a student witnesses an alleged violation of the Academic Honesty Code is considered a violation of the academic honesty code. Please see me to discuss any incidents.

  • VERY IMPORTANT! Purchasing, or otherwise acquiring and submitting as one's own work any research paper or any other writing assignment prepared by others constitutes cheating. Penalty for cheating will be extremely severe and may result in an F for this course.

  • VERY IMPORTANT! Plagiarism on either an abstract, draft of a paper, or final paper will result in a 0 for all three parts (abstract, draft version, final paper). Please review the following links regarding plagiarism very carefully: https://www.indiana.edu/~istd/definition.html

  • http://www.hamilton.edu/style/avoiding-plagiarism

  • http://www.writing.utoronto.ca/advice/using-sources/how-not-to-plagiarize

  • Standard academic honesty procedure will be followed. See the UHCL Academic Honesty Policy.
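
The late-work rule earlier in this list (a 10% deduction per 24-hour period, nothing accepted after one week) can be sketched in Python as follows. This is only one possible reading: it assumes any lateness counts as a full period (as in the "even 3 minutes" rule), that the 10% is taken from the earned score, and that work more than 7 days late earns zero. It does not model the rule that the last homework/project cannot be late. The function name late_adjusted_score is hypothetical.

```python
import math


def late_adjusted_score(raw_score, hours_late):
    """Apply an assumed late penalty to an earned score."""
    if hours_late <= 0:
        return raw_score  # On time: no deduction.
    if hours_late > 7 * 24:
        return 0.0  # Assumed: not accepted more than one week late.
    # Any started 24-hour period counts as a full period.
    periods = math.ceil(hours_late / 24)
    return raw_score * max(0.0, 1 - 0.10 * periods)


if __name__ == "__main__":
    # Hypothetical cases: 3 minutes late, 25 hours late, 8 days late.
    for hours in (0.05, 25, 8 * 24):
        print(hours, round(late_adjusted_score(80, hours), 2))
```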

 

Tests and Quizzes

  • There are no make-up tests except in verified medical emergencies and with immediate notification. Rescheduling a final exam to catch a flight home without a significant reason and corresponding documentation is unacceptable. Make-up exams are harder than, and different from, the original exams.

  • There are no make-up quizzes. Allow plenty of additional time in the event that Blackboard crashes.

  • You are responsible for all required readings assigned throughout the semester.

  • Students are to work on tests and quizzes individually. Students may not discuss, show, give, sell, borrow, trade, share, etc. their tests or quizzes. The penalty for cheating will be extremely severe. Standard academic honesty procedure will be followed.

  • VERY IMPORTANT! Providing answers for any assigned work or examination when not specifically authorized by the instructor to do so, or informing any person or persons of the contents of any examination prior to the time the examination is given, is considered cheating. Penalty for cheating will be extremely severe and may result in an F for this course.

  • VERY IMPORTANT! Failing to report to the instructor any incident in which a student witnesses an alleged violation of the Academic Honesty Code is considered a violation of the academic honesty code. Please see me to discuss any incidents.

Miscellaneous

  • Any person with a disability who requires a special accommodation should inform me and contact the Disability Services office, or call 281 283 2627, as soon as possible.

  • You are expected to come fully prepared to every class!

  • Incomplete grades or administrative withdrawals occur only under extremely rare situations.

  • You need to bring a hard copy of the notes to class. Laptops will be permitted only during software demos such as WEKA.

  • For all lecture-based classes, please turn off your laptops.

  • The ringing, beeping, buzzing of cell phones, watches, and/or pagers during class time is extremely rude and disruptive to your fellow students and to the class flow.

    Also, sending and/or receiving text messages during class is extremely rude and disruptive. Please turn off all cell phones, watches, and pagers prior to the start of class.

    If I see (even if the cell phone is off) or hear a cell phone during class, or see a student texting during class,   then 3 points will be deducted for each infraction from your final course average.

  • Attendance Policy:

    Face-to-face: You are expected to attend every class. If you miss more than 1 class, then your course grade will be reduced by 2 points for each lecture missed. Coming late to class on a regular basis will impact your course participation grade.

     

    Pure Web-based: You do not need to attend any lectures on campus. Also, you do not need to show up in person to take the exams. However, you may attend any or all of the face-to-face lectures and/or exams. In my experience, students who attend class on a regular basis do better on tests than those who don't. If you will be off-campus during the exams, please make the necessary arrangements with me as soon as possible.

  • I am willing to provide letters of recommendation/references only if you have attained an 'A' in one of my classes, or two 'A-' in two of my classes.

  • I highly recommend that you seek out your advisor and complete your Candidate Plan of Study (CPS) as soon as possible. I am normally not available for advising during the summer months.

  • Pay very careful attention to your email correspondence. It reflects on your communication skills. Below is a compilation of email errors I have received during the past year.

    Dear boeticher,

    Is there any chance of regrading my final grade. As i'am very nervous in exam i couldn't be able to attempt properly. you know how attentive in class and can u please grade me considering my class participation also or do i have a chance of re exam because c grade draws my gpa low which results in loosing my scholorship, Please consider my request.

    Thanks and Regards

    Some Student

    Common problems:

       *   bcoz instead of because

       *   r instead of are

       *   u instead of you

       *   lowercase i instead of I

       *   starting a sentence with a lowercase letter

       *   doubt instead of question

     

  • I immediately discard anonymous emails.




2700 Bay Area Boulevard
Delta Building. Office 171
Houston, Texas 77058
Voice: 281-283-3805
Fax: 281-283-3869
boetticher@uhcl.edu

