Introduction
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R.
Prerequisites
- Python 3.8+
- Java 11 or 17
- pip package manager
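If you are unsure whether your environment matches, a quick check like the following can help. This is a minimal sketch: it only confirms that a java binary is on your PATH, not its exact version.
# Quick environment check: Python version and Java availability
import shutil
import sys

print(f"Python: {sys.version.split()[0]}")  # expect 3.8 or newer
java_path = shutil.which("java")
if java_path:
    print(f"Java found at: {java_path}")
else:
    print("Java not found -- install JDK 11 or 17 and set JAVA_HOME")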
Step 1: Install PySpark
# Install PySpark
pip install pyspark
# Verify installation
python -c "import pyspark; print(pyspark.__version__)"
Step 2: Create Your First SparkSession
from pyspark.sql import SparkSession
# Initialize Spark
spark = SparkSession.builder \
    .appName("MyFirstPipeline") \
    .master("local[*]") \
    .getOrCreate()
print(f"Spark version: {spark.version}")
Step 3: Load and Explore Data
# Create sample data
data = [
("Alice", "Engineering", 85000),
("Bob", "Marketing", 72000),
("Charlie", "Engineering", 92000),
("Diana", "Marketing", 68000),
("Eve", "Engineering", 95000),
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
# Show the data
df.show()
df.printSchema()
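createDataFrame infers column types from the Python values. If you want explicit control, you can pass a schema instead. Here is a sketch using the same sample data:
# Optional: declare the schema explicitly instead of relying on inference
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), False),
    StructField("department", StringType(), False),
    StructField("salary", IntegerType(), False),
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()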
Step 4: Transform Data
from pyspark.sql import functions as F
# Average salary by department
avg_salary = df.groupBy("department") \
    .agg(
        F.avg("salary").alias("avg_salary"),
        F.count("*").alias("count"),
        F.max("salary").alias("max_salary")
    ) \
    .orderBy(F.desc("avg_salary"))
avg_salary.show()
Output:
+-----------+-----------------+-----+----------+
| department|       avg_salary|count|max_salary|
+-----------+-----------------+-----+----------+
|Engineering|90666.66666666667|    3|     95000|
|  Marketing|          70000.0|    2|     72000|
+-----------+-----------------+-----+----------+
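Aggregations are only one kind of transformation. Row-level operations like filter and withColumn chain the same way; the 10% bonus below is purely illustrative:
# Row-level transformations: keep high earners and derive a new column
high_earners = df.filter(F.col("salary") > 80000) \
    .withColumn("bonus", F.col("salary") * 0.10)
high_earners.show()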
Step 5: Read and Write Files
# Write to Parquet
df.write.mode("overwrite").parquet("output/employees.parquet")
# Read from Parquet
df_loaded = spark.read.parquet("output/employees.parquet")
df_loaded.show()
# Write to CSV
df.write.mode("overwrite").option("header", True).csv("output/employees.csv")
Step 6: SQL Queries
# Register as temporary view
df.createOrReplaceTempView("employees")
# Run SQL
result = spark.sql("""
    SELECT department,
           AVG(salary) AS avg_salary,
           COUNT(*) AS headcount
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 75000
""")
result.show()
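The result of spark.sql is an ordinary DataFrame, so you can keep chaining transformations or pull small results back to the driver:
# SQL output is a DataFrame: collect small results as Row objects
for row in result.collect():
    print(row["department"], row["avg_salary"], row["headcount"])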
Performance Tips
- Use cache() for frequently accessed DataFrames (see the sketch below)
- Prefer Parquet over CSV for large datasets
- Use repartition() to optimize parallelism
- Monitor running jobs in the Spark UI at http://localhost:4040
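A minimal sketch of the caching and repartitioning tips above; the partition count of 4 is illustrative:
# Cache a frequently reused DataFrame; the first action materializes it
df.cache()
df.count()

# Redistribute data across 4 partitions, clustered by department
df_repartitioned = df.repartition(4, "department")
print(df_repartitioned.rdd.getNumPartitions())

# Release the cached data when finished
df.unpersist()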
Conclusion
You've built your first Spark data pipeline! From here, explore Spark Streaming for real-time data and Spark MLlib for machine learning.