Dr. Jody-Ann S. Jones

End-to-End Image Classification App

The End-to-End Cancer Classification App uses TensorFlow to train a binary classification model on a dataset of chest CT images. Images fall into two classes: adenocarcinoma and normal cases. Using VGG16 as a pre-trained base model, I built a convolutional neural network (CNN) that assigns each image to one of these classes. The model is served through a frontend built with the Gradio framework, which offers an intuitive, user-friendly interface for straightforward interactions such as uploading images or pasting them directly from the web or local storage. Detailed information about this project is available in my GitHub repository.
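As a rough illustration of the transfer-learning setup described above, the model can be assembled along the following lines. This is only a minimal sketch using standard Keras APIs; the dense-head size, learning rate, and layer-freezing choice shown here are illustrative assumptions rather than the exact configuration used in the repository.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained VGG16 convolutional base with ImageNet weights, without its classifier head
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the base; the fine-tuning strategy here is an illustrative choice

# Small classification head with two output units: adenocarcinoma vs. normal
model = models.Sequential([
    base_model,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # head size is an assumption
    layers.Dense(2, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)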

Project Overview

This project was conceived to harness the power of deep learning algorithms for the precise classification of medical imaging, thereby enhancing diagnostic accuracy. One significant application of this model is to support healthcare professionals, such as radiologists, by offering faster and more precise assessments critical during the treatment planning phase.

Beyond its initial application in aiding radiologists, the potential of the project extends to several other fields. For instance, it could be adapted for use in pathology to enhance the detection and analysis of histopathological slides, significantly speeding up the process of diagnosing diseases at the cellular level. Additionally, the model could be utilized in emergency medicine to quickly analyze images in critical care settings, facilitating faster decision-making where time is of the essence. Furthermore, its adaptability means it could also support research in epidemiology by providing insights into disease patterns and prevalence through large-scale image analysis, thereby contributing to public health strategies and preventive medicine. This breadth of applications highlights the versatility and impact of advanced machine learning models in various aspects of healthcare and research.

You may view a screenshot of the app by clicking the following link: Screenshot: Chest CT Scan Classifier

Technologies Employed

Getting Started

Installation

Interested parties can replicate or review the project by following these steps:

Step 1. Clone the repository by typing the following in your terminal:

git clone https://github.com/drjodyannjones/End-to-End-Cancer-Classification-Project.git

Step 2. Create a virtual environment.
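For example, using Python's built-in venv module (the environment name venv is just a convention):

python -m venv venv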

Step 3. Activate the newly created virtual environment.
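On macOS or Linux, for example (on Windows, use venv\Scripts\activate instead):

source venv/bin/activate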

Step 4. Navigate to the project directory and install necessary dependencies by typing the following in your terminal:

pip install -r requirements.txt

Step 5. Run the app by typing the following in your terminal:

gradio app.py

Sample Code Snippet: prediction.py

Below is a sample code snippet from the project, illustrating how the trained TensorFlow model is loaded and how predictions are generated.

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image
import os

class PredictionPipeline:
    def __init__(self):
        # Load the trained Keras model from the training artifacts directory
        self.model = load_model(os.path.join("artifacts", "training", "model.h5"))

    def predict(self, filename):
        try:
            test_image = image.load_img(filename, target_size=(224, 224))
            test_image = image.img_to_array(test_image)
            test_image = np.expand_dims(test_image, axis=0)
            result = np.argmax(self.model.predict(test_image), axis=1)

            if result[0] == 1:
                prediction = "Normal"
            else:
                prediction = "Adenocarcinoma Cancer"

            return [{"image": prediction}]
        except Exception as e:
            print(f"Error during prediction: {e}")
            return [{"image": "Prediction error"}]
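The Gradio frontend mentioned earlier wraps this pipeline. The actual app.py lives in the repository; the snippet below is only a minimal sketch of how such a wrapper could look, assuming a filepath-based image input and that PredictionPipeline is importable as shown (adjust the import to the project's real module path).

import gradio as gr

from prediction import PredictionPipeline  # hypothetical import path; adjust to the project layout

pipeline = PredictionPipeline()

def classify(image_path):
    # Run the pipeline on the uploaded image and return the predicted label
    return pipeline.predict(image_path)[0]["image"]

demo = gr.Interface(
    fn=classify,
    inputs=gr.Image(type="filepath", label="Chest CT scan"),
    outputs=gr.Label(label="Prediction"),
    title="Chest CT Scan Classifier",
)

if __name__ == "__main__":
    demo.launch()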

Acknowledgements


Streaming Real Estate Data Engineering Application

In this project, I designed and implemented a real-time, end-to-end streaming data engineering pipeline that captures real estate listings from Zoopla using the BrightData API. The data flows through a Kafka cluster acting as the message broker, which manages the movement of data from the source to the storage sink, in this case Cassandra. Apache Spark handles the large-scale data processing within the pipeline. The setup is engineered to support dynamic and precise real estate market analysis. For an in-depth look at the project, you are welcome to visit my GitHub repository.
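On the ingestion side (main.py in the repository), each scraped listing is published to Kafka before Spark picks it up. Purely as an illustration of that pattern, a producer could look like the sketch below; it assumes the kafka-python client, a broker reachable at kafka-broker:9092, and the properties topic consumed by the Spark job shown further down. The real ingestion code uses the BrightData API and will differ in its details.

import json

from kafka import KafkaProducer  # kafka-python client, assumed here for illustration

# Serialize each listing as JSON before handing it to the broker
producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# A single scraped listing; field names follow the schema used by the Spark consumer,
# and the values here are illustrative only
listing = {
    "price": 450000.0,
    "title": "2 bed flat for sale",
    "link": "https://www.zoopla.co.uk/for-sale/details/example",
    "pictures": [],
    "address": "Example Street, London",
    "bedrooms": "2",
    "bathrooms": "1",
}

producer.send("properties", listing)
producer.flush()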

Project Overview

The Real Estate Data Engineering project leverages cutting-edge data processing technologies to generate actionable insights from comprehensive real estate datasets. This initiative enhances decision-making for stakeholders by providing sophisticated tools for analyzing market trends, property valuations, and investment opportunities in real time. By integrating real-time data streaming with advanced analytical processing, the project supports a wide range of applications across these areas.

This project exemplifies the power of integrating multiple technologies to transform raw data into a valuable strategic asset, driving forward the capabilities of real estate market analytics.

Technologies Employed

Getting Started

Installation

Step 1. Clone the repository to your local machine:

git clone https://github.com/drjodyannjones/RealEstateDataEngineering.git

Step 2. Build the Docker image:

docker build -t my-custom-spark:3.5.0 .

Step 3. Start the Docker containers (make sure Docker is up and running on your machine first):

docker compose up -d

Step 4. Start the data ingestion process:

python main.py

Step 5. Start the Spark consumer:

docker exec -it realestatedataengineering-spark-master-1 spark-submit \
    --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 \
    jobs/spark-consumer.py

Sample Code Snippet: spark-consumer.py

In this example, I showcase how Apache Spark serves as an efficient consumer by extracting data from Apache Kafka and subsequently storing it in CassandraDB:

import logging
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, FloatType, ArrayType

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def get_cassandra_session():
    """Retrieve or create a Cassandra session."""
    if 'cassandra_session' not in globals():
        cluster = Cluster(["cassandra"])
        globals()['cassandra_session'] = cluster.connect()
    return globals()['cassandra_session']

def setup_cassandra(session):
    """Set up the keyspace and table in Cassandra."""
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS property_streams
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
    """)
    logger.info("Keyspace created successfully!")

    session.execute("""
        CREATE TABLE IF NOT EXISTS property_streams.properties (
            price text, title text, link text, pictures list<text>, floor_plan text,
            address text, bedrooms text, bathrooms text, receptions text, epc_rating text,
            tenure text, time_remaining_on_lease text, service_charge text,
            council_tax_band text, ground_rent text, PRIMARY KEY (link)
        );
    """)
    logger.info("Table created successfully!")

def insert_data(**kwargs):
    """Insert data into the Cassandra table using a session created on the executor."""
    session = get_cassandra_session()
    session.execute("""
        INSERT INTO property_streams.properties (
            price, title, link, pictures, floor_plan, address, bedrooms, bathrooms,
            receptions, epc_rating, tenure, time_remaining_on_lease, service_charge, council_tax_band, ground_rent
        ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """, (
        kwargs['price'], kwargs['title'], kwargs['link'], kwargs['pictures'],
        kwargs['floor_plan'], kwargs['address'], kwargs['bedrooms'], kwargs['bathrooms'],
        kwargs['receptions'], kwargs['epc_rating'], kwargs['tenure'], kwargs['time_remaining_on_lease'],
        kwargs['service_charge'], kwargs['council_tax_band'], kwargs['ground_rent']
    ))
    logger.info("Data inserted successfully!")

def define_kafka_to_cassandra_flow(spark):
    """Define the data flow from Kafka to Cassandra using Spark."""
    schema = StructType([
        StructField("price", FloatType(), True),
        StructField("title", StringType(), True),
        StructField("link", StringType(), True),
        StructField("pictures", ArrayType(StringType()), True),
        StructField("floor_plan", StringType(), True),
        StructField("address", StringType(), True),
        StructField("bedrooms", StringType(), True),
        StructField("bathrooms", StringType(), True),
        StructField("receptions", StringType(), True),
        StructField("epc_rating", StringType(), True),
        StructField("tenure", StringType(), True),
        StructField("time_remaining_on_lease", StringType(), True),
        StructField("service_charge", StringType(), True),
        StructField("council_tax_band", StringType(), True),
        StructField("ground_rent", StringType(), True)
    ])

    kafka_df = (spark
                .readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka-broker:9092")
                .option("subscribe", "properties")
                .option("startingOffsets", "earliest")
                .load()
                .selectExpr("CAST(value AS STRING) as value")
                .select(from_json(col("value"), schema).alias("data"))
                .select("data.*"))

    kafka_df.writeStream.foreachBatch(
        lambda batch_df, _: batch_df.foreach(
            lambda row: insert_data(**row.asDict())
        )
    ).start().awaitTermination()

def main():
    spark = SparkSession.builder.appName("RealEstateConsumer").config(
        "spark.cassandra.connection.host", "cassandra"
    ).config(
        "spark.jars.packages",
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0"
    ).getOrCreate()

    session = get_cassandra_session()
    setup_cassandra(session)
    define_kafka_to_cassandra_flow(spark)

if __name__ == "__main__":
    main()

Azure Data Management Pipeline Application

In this project, I designed and implemented a robust data management pipeline using Microsoft Azure’s cloud services. Setup of the infrastructure was managed with Terraform. The complete project can be reviewed in my GitHub repository.

Project Overview

This Azure Data Management Pipeline project focuses on the integration of various Azure services to create a scalable and efficient pipeline for data ingestion, processing, and storage. The pipeline facilitates advanced data analysis and is tailored to support enterprises in making agile, informed business decisions. Terraform is utilized to programmatically create, modify, and remove resources on the Microsoft Azure cloud platform.

Technologies Employed

Execution Instructions

To engage with this project, follow these steps:

  1. Clone the repository: git clone https://github.com/drjodyannjones/azure-data-management-pipeline.git
  2. Set up the Azure services as detailed in the project’s documentation.
  3. Deploy the Azure Data Factory pipelines and monitor the workflow execution within Azure Portal.

Sample Code Snippet: main.tf

Here is the main Terraform script that manages the infrastructure:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0.2"
    }
  }

  required_version = ">= 1.1.0"
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "rg" {
  name     = var.resource_group_name
  location = var.location
  tags     = var.tags
}

module "storage_account" {
  source = "./modules/storage_account/storage_account"

  resource_group_name     = var.resource_group_name
  storage_account_name    = var.storage_account_name
  location                = var.location
  source_folder_name      = var.source_folder_name
  destination_folder_name = var.destination_folder_name

  depends_on = [
    azurerm_resource_group.rg
  ]
}

module "data_factory" {
  source = "./modules/data_factory/data_factory"

  df_name              = var.df_name
  location             = var.location
  resource_group_name  = var.resource_group_name
  storage_account_name = var.storage_account_name

  depends_on = [
    module.storage_account
  ]
}
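To provision the resources defined above, the standard Terraform workflow applies; this assumes you are already authenticated against Azure (for example via the Azure CLI) and that the module variables are supplied through a tfvars file or environment variables.

terraform init
terraform plan
terraform apply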