Introduction

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R.

Prerequisites

  • Python 3.8+
  • Java 11 or 17
  • pip package manager

Step 1: Install PySpark

    # Install PySpark
    pip install pyspark

    # Verify installation
    python -c "import pyspark; print(pyspark.__version__)"

Step 2: Create Your First SparkSession

    from pyspark.sql import SparkSession

    # Initialize Spark
    spark = SparkSession.builder \
        .appName("MyFirstPipeline") \
        .master("local[*]") \
        .getOrCreate()

    print(f"Spark version: {spark.version}")

Step 3: Load and Explore Data

    # Create sample data
    data = [
        ("Alice", "Engineering", 85000),
        ("Bob", "Marketing", 72000),
        ("Charlie", "Engineering", 92000),
        ("Diana", "Marketing", 68000),
        ("Eve", "Engineering", 95000),
    ]
    columns = ["name", "department", "salary"]

    df = spark.createDataFrame(data, columns)

    # Show the data
    df.show()
    df.printSchema()

Step 4: Transform Data

    from pyspark.sql import functions as F

    # Average salary by department
    avg_salary = df.groupBy("department") \
        .agg(
            F.avg("salary").alias("avg_salary"),
            F.count("*").alias("count"),
            F.max("salary").alias("max_salary")
        ) \
        .orderBy(F.desc("avg_salary"))

    avg_salary.show()

    Output:

    +-----------+-----------------+-----+----------+
    | department|       avg_salary|count|max_salary|
    +-----------+-----------------+-----+----------+
    |Engineering|90666.66666666667|    3|     95000|
    |  Marketing|          70000.0|    2|     72000|
    +-----------+-----------------+-----+----------+
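
To convince yourself the numbers in that table are right, the same aggregation is easy to reproduce in plain Python over the five sample rows (this is only a sanity check, not how you would aggregate in Spark):

```python
from collections import defaultdict

rows = [
    ("Alice", "Engineering", 85000),
    ("Bob", "Marketing", 72000),
    ("Charlie", "Engineering", 92000),
    ("Diana", "Marketing", 68000),
    ("Eve", "Engineering", 95000),
]

# Group salaries by department, then compute (avg, count, max) per group
by_dept = defaultdict(list)
for _, dept, salary in rows:
    by_dept[dept].append(salary)

stats = {dept: (sum(s) / len(s), len(s), max(s)) for dept, s in by_dept.items()}
print(stats)
```

Engineering averages 272000 / 3 ≈ 90666.67 across 3 people with a max of 95000, and Marketing averages 70000.0 across 2 people with a max of 72000, matching the Spark output.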

Step 5: Read and Write Files

    # Write to Parquet
    df.write.mode("overwrite").parquet("output/employees.parquet")

    # Read from Parquet
    df_loaded = spark.read.parquet("output/employees.parquet")
    df_loaded.show()

    # Write to CSV
    df.write.mode("overwrite").option("header", True).csv("output/employees.csv")

Step 6: SQL Queries

    # Register as temporary view
    df.createOrReplaceTempView("employees")

    # Run SQL
    result = spark.sql("""
        SELECT department,
               AVG(salary) AS avg_salary,
               COUNT(*) AS headcount
        FROM employees
        GROUP BY department
        HAVING AVG(salary) > 75000
    """)

    result.show()

Performance Tips

  • Use cache() for frequently accessed DataFrames, and unpersist() when done
  • Prefer Parquet over CSV for large datasets: it is columnar, compressed, and carries its schema
  • Use repartition() to increase parallelism, or coalesce() to reduce partition count
  • Avoid Python UDFs when built-in functions exist; built-ins run inside the JVM and are much faster
  • Monitor jobs in the Spark UI at http://localhost:4040

Conclusion

You've built your first Spark data pipeline! From here, explore Structured Streaming for real-time data and MLlib for machine learning.