Big Data and Spark

Track: Enterprise Data Scientist (EDS)
The objective of this course is twofold: first, to explain the theoretical foundations of distributed computing and distributed storage; second, to teach how to use Apache Spark effectively to analyze large datasets.

Analyze Large Datasets


Distributed storage and processing are needed to handle Big Data. Apache Spark is a widely used engine for distributing data processing and machine learning workloads across a cluster. To use Apache Spark well, one needs to understand its architecture and design as well as how to apply it in practice.

Learning outcome:

Upon completion, participants should be able to communicate results and demonstrate each of the following:
  1. Know what big data means
  2. Know what cloud computing means and how to use it
  3. Understand the fundamentals of distributed storage and computing
  4. Be able to process big data on a Spark cluster from Python (using PySpark)

Who should attend:

Professionals who work with data

2 days of in-depth learning

Face to face with an experienced Data Scientist.

Course Methodology

This course will utilize a combination of Presentations and Workshops.

CADS Certification

Earn certification upon completion.

Pre-requisite:
Python Programming II, Database Management Systems
Minimum Qualification:
Undergraduate Degree

Training Track

Enterprise Data Scientist (EDS)

Big Data and Spark is one of the modules under our Enterprise Data Scientist (EDS) programme. EDS is a 42-day training programme that equips participants to be key leaders and contributors in a data science team and to analyze data to drive informed business decisions.

Details of Subject

Day 1
  1. On-
    • Introduction to Big Data
    • Apache Hadoop overview
    • HDFS architecture
    • Distributed processing
    • Hadoop MapReduce
    • RDDs (Resilient Distributed Datasets)
    • Apache Spark
  2. Cloud Computing – Introduction to cloud computing platforms: AWS (Amazon Web Services), GCP (Google Cloud Platform), Microsoft Azure
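
As a rough illustration of the MapReduce model listed under Day 1, the sketch below runs the three phases (map, shuffle, reduce) of a word count in plain Python on a single machine. It is not course material and uses no Hadoop or Spark API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. On a real cluster, the map and reduce phases would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big Spark"]))
```

Keeping each phase as a separate function mirrors how the framework, rather than the programmer, takes care of grouping intermediate pairs between the map and reduce steps.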
Day 2
  1. DataFrames and Spark SQL –
    • Creating and transforming DataFrames; groupBy and aggregate functions
    • DataFrames and RDDs
  2. Spark MLlib –
    • Introduction to Machine Learning
    • Using Machine Learning with Spark

Lead Instructor

Mohammad Mehdi Lotfinejad
Mehdi is an experienced computer scientist who is technically proficient in software development and database management. His expertise includes data mining, web development, and data engineering, and he is also knowledgeable in project management. He has been active in academia as well: he worked as a lecturer at Payame Noor University, published papers in many conferences and journals, and authored technical textbooks on computer science subjects.

CADS Certification

EDS CADS Certified Enterprise Data Scientist

Certification information for this module & track will be made available soon.
