DataFrame Factory
Introduction
About a year ago, I made a switch at work and joined the data science team. It was a bit of a wild ride getting into the AI scene and figuring out how things are built and coded.
As time rolled on, I noticed something odd – not many folks were big on testing. It was like the norm was to cook up your code and give it a spin with production data. Now, coming from a software development background, that felt a bit off. How on earth do you make sure you cover all those possible use cases without a solid testing game? I’ve got a shortlist of things I can’t stand in this world, and right at the top are messy organization, a lack of basic object-oriented principles, and, you guessed it, no testing.
So, I went on a bit of a quest to figure out why testing wasn’t getting the love it deserves. Turns out, the culprit was those pesky DataFrame objects – a real headache to test. But hey, lightbulb moment – I had this idea to make life easier for everyone by helping them test their code with some nifty DataFrame generation.
Introducing the DataFrame Factory
In the world of software development, testing and ensuring the quality of code are paramount. One common challenge developers face is the need for realistic data to thoroughly test their applications. This is where the DataFrame Factory comes into play. I'm an avid user of pytest and factory-boy for testing and for generating data for ORM models, and this library is inspired by the factory-boy implementation in particular, so parts of it may look familiar to many of you.
The Foundation: FactoryMetaClass
At the core of this library is the FactoryMetaClass, a metaclass that dynamically constructs classes based on provided options. It takes care of handling the meta information and options necessary for creating instances of the factory.
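from abc import abstractmethod
from dataclasses import fields


class FactoryMetaClass(type):
    def __new__(mcs, class_name, bases, attrs):
        # Pull the inner Meta class out of the class body and keep it as _meta.
        meta = attrs.pop("Meta", None)
        attrs["_meta"] = meta
        # Every factory class gets its own FactoryOptions instance.
        options = FactoryOptions()
        attrs["_options"] = options
        new_class = super().__new__(mcs, class_name, bases, attrs)
        # Build the column generators from Meta and the class attributes.
        options.construct_class(meta, attrs)
        return new_class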
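To see the mechanism in isolation, here is a minimal, self-contained sketch of a metaclass that pops an inner Meta class out of the class body and stores it under _meta. The names DemoMeta and PersonFactory are made up for this illustration and are not part of the library.

```python
class DemoMeta(type):
    """Illustrative metaclass: moves the inner `Meta` class to `_meta`."""

    def __new__(mcs, class_name, bases, attrs):
        # `attrs` is the class body as a dict; remove `Meta` before the class is built.
        attrs["_meta"] = attrs.pop("Meta", None)
        return super().__new__(mcs, class_name, bases, attrs)


class PersonFactory(metaclass=DemoMeta):  # hypothetical factory for the demo
    class Meta:
        model = dict  # hypothetical model, just to have something to read back

    first_name = "Ada"


print(hasattr(PersonFactory, "Meta"))  # False
print(PersonFactory._meta.model)       # <class 'dict'>
```

Because the metaclass runs when the class statement is executed, the options are ready as soon as the factory class is defined, before anyone calls it.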
Configuration with FactoryOptions
The FactoryOptions class is responsible for interpreting the meta information and constructing the necessary columns for the factory. It reads the model's dataclass fields to build the columns and their default values, and lets you customize columns through the attributes declared on the factory class.
class BaseMeta:
    abstract = True
    model = None


class FactoryOptions:
    def __init__(self):
        self.meta = None
        self.columns = {}
        self.factory_params = None

    def construct_class(self, meta, attrs):
        self.meta = meta
        self.factory_params = attrs
        # A factory may be declared without a Meta class or without a model.
        if getattr(meta, "model", None):
            self.construct_columns()
        self.override_columns()

    def construct_columns(self):
        # One column per dataclass field, keyed by field name, using its default.
        for attr in fields(self.meta.model):
            self.columns[attr.name] = self.create_default_column_generator(
                attr.name, attr.default
            )

    def override_columns(self):
        # Public class attributes override (or extend) the model's columns.
        for param, default in self.factory_params.items():
            if not param.startswith("_"):
                self.columns[param] = self.create_default_column_generator(
                    param, default
                )

    def create_default_column_generator(self, attr, default):
        # Callables are used as generators; plain values are wrapped in one.
        if callable(default):
            return default
        return lambda: default
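The idea of turning dataclass defaults into column generators and then applying overrides can be tried out on its own. The sketch below is a simplified stand-in, not the library code: Person, as_generator and build_columns are hypothetical names, and I'm assuming that a field without a default should simply produce None.

```python
from dataclasses import MISSING, dataclass, fields


@dataclass
class Person:  # hypothetical model for the demo
    first_name: str = ""
    age: int = 0


def as_generator(default):
    """Return callables unchanged; wrap plain values in a zero-argument function."""
    return default if callable(default) else (lambda: default)


def build_columns(model, overrides):
    # Start from the dataclass defaults, keyed by field name.
    columns = {
        f.name: as_generator(None if f.default is MISSING else f.default)
        for f in fields(model)
    }
    # Overrides win, but model columns that are not overridden are kept.
    columns.update({name: as_generator(value) for name, value in overrides.items()})
    return columns


columns = build_columns(Person, {"first_name": lambda: "Ada"})
row = {name: generate() for name, generate in columns.items()}
print(row)  # {'first_name': 'Ada', 'age': 0}
```

Storing every column as a zero-argument callable means the code that builds the rows never has to care whether a value is fixed or random.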
Crafting Data with BaseFactory
The BaseFactory class sets the groundwork for creating instances of the factory. It enforces that the class cannot be instantiated directly, emphasizing the usage of the create method. The _create method, marked as abstract, is intended for customization in the derived factories.
class BaseFactory:
    def __new__(cls, *args, **kwargs):
        """Would be called if trying to instantiate the class."""
        raise TypeError(f"You cannot instantiate {cls.__name__}; use create() instead")

    @classmethod
    def create(
        cls,
        size=10,
        perc_na=None,
        perc_row_na=None,
        perc_col_na=None,
        **kwargs,
    ):
        # Generate `size` values per column by calling each column generator.
        cls.columns_name = list(cls._options.columns)
        cls.data = {
            attr: [cls._options.columns[attr]() for _ in range(size)]
            for attr in cls._options.columns
        }
        # Pass the caller's NA settings through to the concrete factory.
        return cls._create(
            size,
            perc_na,
            perc_row_na=perc_row_na,
            perc_col_na=perc_col_na,
            **kwargs,
        )

    @classmethod
    @abstractmethod
    def _create(
        cls,
        size,
        perc_na,
        perc_row_na=None,
        perc_col_na=None,
        **kwargs,
    ):
        raise NotImplementedError
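Stripped of the DataFrame details, the pattern here is a class that refuses to be instantiated and exposes a create classmethod that does the shared work before handing over to a subclass hook. A rough sketch, with DemoBase and TupleFactory as made-up names and a much smaller signature than the real one:

```python
class DemoBase:
    def __new__(cls, *args, **kwargs):
        # Instantiation is blocked; callers are pushed towards create().
        raise TypeError(f"{cls.__name__} cannot be instantiated; call create() instead")

    @classmethod
    def create(cls, size=3):
        # Shared step: generate the raw values once...
        data = [cls.value() for _ in range(size)]
        # ...then let the subclass decide what container to return.
        return cls._create(data)

    @classmethod
    def _create(cls, data):
        raise NotImplementedError


class TupleFactory(DemoBase):  # hypothetical concrete factory
    @staticmethod
    def value():
        return 1

    @classmethod
    def _create(cls, data):
        return tuple(data)


print(TupleFactory.create())  # (1, 1, 1)
```

A Pandas or Spark factory only has to swap the body of _create; the value generation stays in one place.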
Concrete Implementation
PandasDataFrameFactory
The provided PandasDataFrameFactory demonstrates how to extend the base factory to create Pandas DataFrames. It leverages the power of Pandas and NumPy to generate synthetic data based on the configured columns and options.
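import numpy as np
import pandas as pd


# Assumption: the article never shows `Factory`; here it is taken to be
# BaseFactory combined with FactoryMetaClass.
class Factory(BaseFactory, metaclass=FactoryMetaClass):
    class Meta(BaseMeta):
        pass


class PandasDataFrameFactory(Factory):
    class Meta:
        abstract = False
        model = None

    @classmethod
    def _create(
        cls,
        size=10,
        perc_na=None,
        **kwargs,
    ):
        df = pd.DataFrame(columns=cls.columns_name, data=cls.data)
        if perc_na:
            # Each cell is replaced by NaN independently with probability perc_na.
            mask = np.random.choice(
                [True, False], size=df.shape, p=[perc_na, 1 - perc_na]
            )
            return df.mask(mask)
        return df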
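The perc_na part is easy to play with outside the factory. Here is a small sketch of the same masking idea; note that I swapped np.random.choice for a seeded NumPy Generator so the result is reproducible in a test, and that inject_na is a hypothetical helper name, not something the library provides.

```python
import numpy as np
import pandas as pd


def inject_na(df, perc_na, seed=None):
    """Return a copy of df where each cell is NaN with probability perc_na."""
    rng = np.random.default_rng(seed)
    # rng.random draws from [0, 1), so `< perc_na` is True with probability perc_na.
    mask = rng.random(df.shape) < perc_na
    return df.mask(mask)


df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [1.0, 2.0, 3.0, 4.0]})
noisy = inject_na(df, perc_na=0.5, seed=0)
print(noisy.isna().mean().mean())  # share of missing cells, around 0.5 on average
```

With a fixed seed the same cells go missing on every run, which helps when a test fails and you want to reproduce it.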
PySparkDataFrameFactory
Now, let’s dive into a Spark-powered version of our DataFrame Factory – the PySparkDataFrameFactory. I’m kinda on the fence about it. The catch is, it needs Spark to do its magic, and that gives me a bit of pause. I mean, it’s cool and all, but the Spark dependency makes me think twice. Still, I get that there are situations, especially in integration tests, where you gotta roll with Spark DataFrames.
from dataclasses import fields

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()


# Note: DataFrameFactoryBase is not shown in this article; it is assumed to
# provide a default create_column_generator(attr) classmethod.
class PySparkDataFrameFactory(DataFrameFactoryBase):
    @classmethod
    def create_column_generator(cls, attr):
        if attr.name == "column1":
            return lambda: "Custom Value"
        else:
            return super().create_column_generator(attr)

    @classmethod
    def generate(cls, size=10, **kwargs):
        model_class = cls.Meta.model
        columns = {}
        for attr in fields(model_class):
            columns[attr.name] = cls.create_column_generator(attr)
        # Override model field values with kwargs
        for field, value in kwargs.items():
            if field not in columns:
                raise ValueError(f"Invalid field '{field}' in kwargs")
            # Bind value now; a bare `lambda: value` would see only the last one.
            columns[field] = lambda value=value: value
        schema = StructType(
            [
                StructField(attr.name, StringType(), nullable=False)
                for attr in fields(model_class)
            ]
        )
        # Assumption: rows are built like this; every column is typed as a
        # string in the schema, so values are cast with str().
        rows = [
            tuple(str(generate()) for generate in columns.values())
            for _ in range(size)
        ]
        return spark.createDataFrame(rows, schema=schema)
Example
from dataclasses import dataclass
from datetime import date
from typing import Optional

import pytest
from pytest_check import check


@dataclass
class SimpleModel:
    first_name: str = ""
    last_name: str = ""
    birthday: Optional[date] = None


@pytest.fixture
def simple_pandas_factory_model(faker):
    # `faker` is the fixture provided by the Faker pytest plugin.
    class SimplePandasFactoryModel(PandasDataFrameFactory):
        class Meta:
            model = SimpleModel

        first_name = faker.first_name
        last_name = faker.last_name
        birthday = faker.date_object

    return SimplePandasFactoryModel


def test_simple_dataframe_creation(simple_pandas_factory_model):
    models = simple_pandas_factory_model.create()
    check.equal(len(models), 10)
Conclusion
Bear in mind, this is a prototype. I don't know where I'm going with this library, but I'm sure it can be useful, unless there's already a library I'm not aware of that does something similar. Also, my implementation only works with Python dataclasses right now; I'm using the type hints to guess the attribute type.
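To give an idea of what guessing from type hints could look like, here is a rough sketch using only the standard library. It is not the code from the prototype: the GENERATORS mapping, the helper names and the random value ranges are all assumptions made for the example.

```python
import random
import string
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Union, get_args, get_origin, get_type_hints


def random_word():
    return "".join(random.choices(string.ascii_lowercase, k=8))


def random_date():
    return date(2000, 1, 1) + timedelta(days=random.randint(0, 9000))


# Hypothetical mapping from an annotated type to a value generator.
GENERATORS = {str: random_word, int: lambda: random.randint(0, 100), date: random_date}


def unwrap_optional(hint):
    """Turn Optional[X] into X; leave other hints untouched."""
    if get_origin(hint) is Union:
        args = [a for a in get_args(hint) if a is not type(None)]
        if len(args) == 1:
            return args[0]
    return hint


def generators_for(model):
    # Look up one generator per annotated field of the dataclass.
    hints = get_type_hints(model)
    return {name: GENERATORS[unwrap_optional(hint)] for name, hint in hints.items()}


@dataclass
class SimpleModel:
    first_name: str = ""
    birthday: Optional[date] = None


row = {name: generate() for name, generate in generators_for(SimpleModel).items()}
print(row)  # one random string and one random date
```

A type that is missing from the mapping raises a KeyError here, which is probably what you want in a test helper: better to fail loudly than to fill a column with something arbitrary.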
Feel free to share your thoughts!