In this project, I designed and implemented a real-time, end-to-end streaming data engineering pipeline that captures real estate listings from Zoopla using the Bright Data API. The data flows through an Apache Kafka cluster, which acts as the message broker and manages the movement of data from the source to the storage sink, in this case Cassandra. Apache Spark handles the large-scale stream processing efficiently. The setup is engineered for real estate market analysis, providing a robust tool for dynamic and precise market evaluation. For an in-depth look at the project, you are welcome to visit my GitHub repository.
This project exemplifies the power of integrating multiple technologies to transform raw data into a valuable strategic asset, driving forward the capabilities of real estate market analytics.
Step 1. Clone the repository to your local machine:
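For example (the repository URL below is a placeholder; substitute the actual link from my GitHub profile):

```bash
# Clone the project (replace the URL with the real repository link)
git clone https://github.com/<your-username>/realestate-data-pipeline.git
cd realestate-data-pipeline
```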
Step 2. Build the Docker image:
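A typical build command looks like the following; the image name is illustrative, and if the repository ships a `docker-compose.yml`, the Compose variant builds every service at once:

```bash
# Build the image from the project's Dockerfile (image tag is illustrative)
docker build -t realestate-pipeline .

# Or, if the project uses Docker Compose:
docker compose build
```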
Step 3. Start the Docker containers (make sure the Docker daemon is up and running on your machine first!):
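Assuming the stack (Kafka, Spark, Cassandra) is defined in a Compose file, you can bring everything up in the background and verify it:

```bash
# Bring up the full stack in detached mode
docker compose up -d

# Verify that all services are running
docker compose ps
```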
Step 4. Start the data ingestion process:
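A minimal sketch of how this might be launched; the script name and environment variable below are assumptions, so check the repository for the exact entry point:

```bash
# Provide your Bright Data credentials (variable name is illustrative)
export BRIGHTDATA_API_KEY="<your-api-key>"

# Run the producer that fetches Zoopla listings via the Bright Data API
# and publishes them to Kafka (script name is illustrative)
python main.py
```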
Step 5. Start the Spark consumer:
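The job needs the Kafka source and Cassandra connector packages on its classpath. A typical submission looks like this; the script name is an assumption and the package versions are illustrative, so match them to your Spark installation:

```bash
# Submit the streaming consumer with the Kafka and Cassandra connectors
# (versions shown are for Spark 3.5 / Scala 2.12 -- adjust as needed)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  spark_consumer.py
```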
In this example, I showcase how Apache Spark serves as an efficient consumer by extracting data from Apache Kafka and storing it in Cassandra:
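Below is a minimal sketch of such a consumer. The topic name, keyspace, table, schema fields, and connection hosts are assumptions for illustration, not the exact values used in the repository:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Create a Spark session configured with the Cassandra connection
# (host and app name are illustrative)
spark = (
    SparkSession.builder
    .appName("RealEstateConsumer")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Schema for the incoming listing events (fields are assumptions)
schema = StructType([
    StructField("id", StringType()),
    StructField("title", StringType()),
    StructField("address", StringType()),
    StructField("price", DoubleType()),
    StructField("bedrooms", StringType()),
])

# Read the raw event stream from Kafka (topic name is illustrative)
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "zoopla_listings")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into columns
listings = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*")
)

# Write each micro-batch to Cassandra (keyspace/table are illustrative
# and must exist before the stream starts)
query = (
    listings.writeStream
    .foreachBatch(
        lambda batch_df, _: batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="real_estate", table="listings")
        .mode("append")
        .save()
    )
    .option("checkpointLocation", "/tmp/checkpoints/listings")
    .start()
)

query.awaitTermination()
```

The `foreachBatch` sink is used here because it lets the well-tested batch Cassandra writer handle each micro-batch, while the checkpoint location gives the stream exactly-once bookkeeping across restarts.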