Documentation for formatters Module of DataAnalysisToolkit

The formatters module in DataAnalysisToolkit offers tools for transforming and standardizing data in a DataFrame. It’s designed to prepare your data for analysis, ensuring consistency and quality.

Data Formatter (data_formatter.py)

Overview

The DataFormatter class is a versatile tool for performing various data formatting tasks on a pandas DataFrame. It can standardize date formats, normalize numeric data, categorize columns, and more.

Usage

formatter = DataFormatter(df)
formatter.standardize_dates('date_column')
formatter.categorize_columns(['category_column1', 'category_column2'])
formatter.normalize_numeric(['numeric_column1', 'numeric_column2'])

Methods

  • __init__(self, data): Initialize the formatter with a DataFrame.

  • standardize_dates(self, date_column, date_format='%Y-%m-%d'): Standardize the format of a date column.

  • categorize_columns(self, columns): Convert specified columns to categorical data types for efficiency.

  • normalize_numeric(self, numeric_columns): Normalize numeric columns by scaling to a mean of 0 and standard deviation of 1.

  • fill_missing_values(self, column, fill_value=None, method=None): Fill missing values in a column either with a specified value or using a method like forward-fill or backward-fill.

  • encode_categorical_variables(self, columns): Perform one-hot encoding on categorical variables to transform them into a format suitable for machine learning models.

  • custom_transform(self, column, transform_func): Apply a custom transformation function to a specified column, allowing for flexible data transformations.

Examples

Here are some examples demonstrating how to use the DataFormatter class:

Standardizing a Date Column:

formatter = DataFormatter(df)
formatter.standardize_dates('date_column')

Normalizing Numeric Data:

formatter.normalize_numeric(['age', 'income'])

Encoding Categorical Variables:

formatter.encode_categorical_variables(['gender', 'occupation'])

Custom Transformations:

formatter.custom_transform('price', lambda x: x * 1.2)

The formatters module is essential for ensuring data consistency and quality, making it easier to perform reliable analysis. By providing a range of methods for data transformation, this module helps streamline the preprocessing stage of your data analysis projects.