DataFrame Factory

Introduction

About a year ago, I made a switch at work and hopped onto the data science team. It was a bit of a wild ride getting into the AI scene and figuring out how things are built and coded.

As time rolled on, I noticed something odd – not many folks were big on testing. It was like the norm was to cook up your code and give it a spin with production data. Now, coming from a software development background, that felt a bit off. How on earth do you make sure you cover all those possible use cases without a solid testing game? I’ve got a shortlist of things I can’t stand in this world, and right at the top are messy organization, a lack of basic object-oriented principles, and, you guessed it, no testing.

So, I went on a bit of a quest to figure out why testing wasn’t getting the love it deserves. Turns out, the culprit was those pesky DataFrame objects – a real headache to test. But hey, lightbulb moment – I had this idea to make life easier for everyone by helping them test their code with some nifty DataFrame generation.

Introducing the DataFrame Factory

In the world of software development, testing and ensuring the quality of code are paramount. One common challenge developers face is the need for realistic data to thoroughly test their applications. This is where the DataFrame Factory comes into play. As an avid user of the pytest and factory-boy libraries for testing and for generating data for ORM models, I was particularly inspired by factory-boy's factory implementation, which many of you may already be familiar with.
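
For anyone who hasn't used it, here is roughly what a factory-boy factory looks like (a minimal sketch; `User` and its fields are made up for illustration):

import factory


class User:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name


class UserFactory(factory.Factory):
    class Meta:
        model = User

    first_name = factory.Faker("first_name")
    last_name = factory.Faker("last_name")


user = UserFactory()  # a User instance with generated names

The goal of the DataFrame Factory is to bring that same declarative style to DataFrames.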

The Foundation: FactoryMetaClass

At the core of this library is the FactoryMetaClass, a metaclass that dynamically constructs classes based on provided options. It takes care of handling the meta information and options necessary for creating instances of the factory.

from abc import abstractmethod
from dataclasses import fields


class FactoryMetaClass(type):
    def __new__(mcs, class_name, bases, attrs):
        # Pull the Meta class out of the attributes and keep a reference to it
        meta = attrs.pop("Meta", None)
        attrs["_meta"] = meta
        options = FactoryOptions()
        attrs["_options"] = options
        new_class = super().__new__(mcs, class_name, bases, attrs)
        # Build the column generators from the Meta model and the class attributes
        options.construct_class(meta, attrs)
        return new_class

Configuration with FactoryOptions

The FactoryOptions class is responsible for interpreting the meta information and constructing the necessary columns for the factory. It intelligently handles default values and allows customization of the columns through the factory parameters.

class BaseMeta:
    abstract = True
    model = None


class FactoryOptions:
    def __init__(self):
        self.meta = None
        self.columns = {}
        self.factory_params = None

    def construct_class(self, meta, attrs):
        self.meta = meta
        self.factory_params = attrs
        if meta and meta.model:
            self.construct_columns()
            self.override_columns()

    def construct_columns(self):
        # One generator per dataclass field, seeded with the field's default
        for attr in fields(self.meta.model):
            self.columns[attr.name] = self.create_default_column_generator(
                attr.name, attr.default
            )

    def override_columns(self):
        # Factory-level attributes (e.g. faker providers) take precedence
        # over the dataclass defaults
        for param, default in self.factory_params.items():
            if not param.startswith("_"):
                self.columns[param] = self.create_default_column_generator(
                    param, default
                )

    def create_default_column_generator(self, attr, default):
        if callable(default):
            return default
        return lambda: default
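
To see how the metaclass and the options work together, here is a quick sanity check with a toy dataclass (`Point` and `PointFactory` are made up for illustration; this assumes the classes above are in scope):

from dataclasses import dataclass


@dataclass
class Point:
    x: int = 0
    y: int = 0


class PointFactory(metaclass=FactoryMetaClass):
    class Meta:
        abstract = False
        model = Point

    x = lambda: 42  # factory param overriding the dataclass default


print(list(PointFactory._options.columns))  # ['x', 'y']
print(PointFactory._options.columns["x"]())  # 42
print(PointFactory._options.columns["y"]())  # 0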

Crafting Data with BaseFactory

The BaseFactory class sets the groundwork for creating instances of the factory. It enforces that the class cannot be instantiated directly, emphasizing the usage of the create method. The _create method, marked as abstract, is intended for customization in the derived factories.

class BaseFactory:
    def __new__(cls, *args, **kwargs):
        """Called if someone tries to instantiate the class directly."""
        raise Exception("You cannot instantiate BaseFactory")

    @classmethod
    def create(
        cls,
        size=10,
        perc_na=None,
        perc_row_na=None,
        perc_col_na=None,
        **kwargs,
    ):
        cls.columns_name = list(cls._options.columns)
        # Call each column generator `size` times to build the raw data
        cls.data = {
            attr: [cls._options.columns[attr]() for _ in range(size)]
            for attr in cls._options.columns
        }
        return cls._create(
            size,
            perc_na,
            perc_row_na=perc_row_na,
            perc_col_na=perc_col_na,
            **kwargs,
        )

    @classmethod
    @abstractmethod
    def _create(
        cls,
        size,
        perc_na,
        perc_row_na=None,
        perc_col_na=None,
        **kwargs,
    ):
        raise NotImplementedError

Concrete Implementation

PandasDataFrameFactory

The provided PandasDataFrameFactory demonstrates how to extend the base factory to create Pandas DataFrames. It leverages the power of Pandas and NumPy to generate synthetic data based on the configured columns and options.
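
One piece the snippets so far don't show is the `Factory` class that `PandasDataFrameFactory` inherits from. Presumably it simply wires `BaseFactory` to the metaclass; a minimal sketch under that assumption:

class Factory(BaseFactory, metaclass=FactoryMetaClass):
    # Assumed glue class: not shown in the original snippets
    class Meta:
        abstract = True
        model = None

With that in place, the concrete pandas factory looks like this: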

import numpy as np
import pandas as pd


class PandasDataFrameFactory(Factory):
    class Meta:
        abstract = False
        model = None

    @classmethod
    def _create(
            cls,
            size=10,
            perc_na=None,
            **kwargs,
    ):
        df = pd.DataFrame(columns=cls.columns_name, data=cls.data)
        if perc_na:
            # Randomly mask roughly `perc_na` of the cells with NaN
            mask = np.random.choice(
                [True, False], size=df.shape, p=[perc_na, 1 - perc_na]
            )
            return df.mask(mask)

        return df
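
As a quick illustration of the masking (with a hypothetical `MyFactory` standing in for any concrete subclass):

# Roughly 20% of all cells end up as NaN
df = MyFactory.create(size=100, perc_na=0.2)
print(df.isna().mean().mean())  # ~0.2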

PySparkDataFrameFactory

Now, let’s dive into a Spark-powered version of our DataFrame Factory – the PySparkDataFrameFactory. I’m kinda on the fence about it. The catch is, it needs Spark to do its magic, and that gives me a bit of pause. I mean, it’s cool and all, but the Spark dependency makes me think twice. Still, I get that there are situations, especially in integration tests, where you gotta roll with Spark DataFrames.

from dataclasses import fields

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()


class DataFrameFactoryBase:
    # Minimal stand-in: the base class isn't shown in the original snippets
    @classmethod
    def create_column_generator(cls, attr):
        # Fall back to the dataclass field's default value
        return lambda: attr.default


class PySparkDataFrameFactory(DataFrameFactoryBase):
    @classmethod
    def create_column_generator(cls, attr):
        if attr.name == "column1":
            return lambda: "Custom Value"
        return super().create_column_generator(attr)

    @classmethod
    def generate(cls, size=10, **kwargs):
        model_class = cls.Meta.model
        columns = {}

        for attr in fields(model_class):
            columns[attr.name] = cls.create_column_generator(attr)

        # Override model field values with kwargs
        for field, value in kwargs.items():
            if field not in columns:
                raise ValueError(f"Invalid field '{field}' in kwargs")
            # Bind the current value: a bare `lambda: value` would capture
            # the loop variable by reference
            columns[field] = lambda value=value: value

        schema = StructType(
            [
                StructField(attr.name, StringType(), nullable=False)
                for attr in fields(model_class)
            ]
        )

        # Call each generator `size` times and assemble the Spark DataFrame
        data = [
            tuple(columns[attr.name]() for attr in fields(model_class))
            for _ in range(size)
        ]
        return spark.createDataFrame(data, schema)
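
Usage would look something like this (a sketch; `UserModel` and `UserSparkFactory` are made up for illustration):

from dataclasses import dataclass


@dataclass
class UserModel:
    column1: str = ""
    column2: str = ""


class UserSparkFactory(PySparkDataFrameFactory):
    class Meta:
        model = UserModel


sdf = UserSparkFactory.generate(size=3, column2="fixed")
sdf.show()  # 3 rows: column1="Custom Value", column2="fixed"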

Example

from dataclasses import dataclass
from datetime import date
from typing import Optional

import pytest
import pytest_check as check


@dataclass
class SimpleModel:
    first_name: str = ""
    last_name: str = ""
    birthday: Optional[date] = None


@pytest.fixture
def simple_pandas_factory_model(faker):
    class SimplePandasFactoryModel(PandasDataFrameFactory):
        class Meta:
            model = SimpleModel

        # Factory params live on the class itself, not inside Meta,
        # so that override_columns picks them up
        first_name = faker.first_name
        last_name = faker.last_name
        birthday = faker.date

    return SimplePandasFactoryModel


def test_simple_dataframe_creation(simple_pandas_factory_model):
    models = simple_pandas_factory_model.create()
    check.equal(len(models), 10)

Conclusion

Bear in mind, this is a prototype. I don't know where I'm going with this library yet, but I'm sure it can be useful, unless there's already a library out there that does something similar and I'm just not aware of it. Also, my implementation only works with Python dataclasses right now; I'm using the type hints to guess the attribute type.

Feel free to share your thoughts!