Am I Your Type?

Christopher Sanchez
Programming Languages · Systems & Low Level · Core Development

Today, data is more valuable than gold. Companies have been building revolutionary machine learning and artificial intelligence products for decades now. Algorithms make our lives easier in countless ways, from identifying disease, to predicting the weather, to recommending movies. AI is reshaping the world, and data is the fuel that powers it. Recently, with the release of large language models like ChatGPT, Claude, and Gemini, the world has seen the power of AI and data science firsthand. Millions of people now have a seemingly endless amount of information at their fingertips. We wouldn't have any of it without data.

Everywhere we look, everywhere we go, everything we see is data. Data is becoming the lifeblood of the world and the foundation of everything we do, from the way we communicate to the way we work, heck, even the way we drive. Data is everywhere, it isn't going anywhere, and it is growing at an exponential rate. With the rise of big data, data science, and machine learning, the importance of data has never been more obvious. But what is data? In the simplest terms, data is information, and information can be anything from numbers to text to images. However, data is more than just information; every piece of data also has a type. Data types are crucial for understanding how to work with data, and they play a significant role in programming.

Introduction to Data Typing

In programming, data types are used to distinguish different kinds of values, such as integers, strings, and booleans. Data typing is the process of assigning a type to a variable, which tells the computer how to interpret the data stored in that variable. It is incredibly important to type your variables properly, as it can have a huge impact on the performance, memory management, and security of your program.

There are literally dozens of data types spread across dozens of programming languages, but the most common ones are integers, floats, strings, and booleans. An integer is a whole number, a float is a number with a decimal point, a string is a sequence of characters, and a boolean is a value that is either true or false. These data types are the building blocks of programming, and they are used to represent everything from numbers to text to logic.

For example:

# Integers (whole numbers)
age = 25
year = 2024
temperature = -5

# Floats (decimal numbers)
height = 5.11
pi = 3.14159
bank_balance = -123.45

# Strings (text)
name = "Chris"
message = 'I am the master of the universe!'
address = "123 Data Science St."

# Booleans (True/False)
is_student = True
has_license = False
is_raining = True

On the surface, learning data types is pretty straightforward: if it's a whole number, use an integer; if it's a decimal, use a float; if it's text, use a string; and if it's a true or false statement, use a boolean. However, once you get a little bit deeper into programming, you'll realize that data types are much more complex than that. There are many different kinds of integers, floats, and strings, and each one has its own unique properties and behaviors.

For example, NumPy (the library that underpins much of Python's data ecosystem, including Pandas) provides fixed-width integer types such as int8, int16, int32, and int64; Python's own built-in int, by contrast, simply grows to fit whatever value you give it. Each of these fixed-width types has a different range of values it can represent, and each one takes up a different amount of memory. Similarly, there are different float types, such as float32 and float64.

Properly typing your data can have massive effects on performance, memory usage, and security. For example, using a smaller integer type like int8 instead of int64 can save memory and improve performance. Similarly, using a strongly typed array instead of a dynamic array can improve memory usage and reduce the risk of security vulnerabilities.
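
To see these sizes for yourself, here is a minimal sketch using NumPy (assumed to be installed; it supplies the fixed-width types mentioned above) that asks each dtype how many bytes it occupies:

import numpy as np

# Each fixed-width type reports its size in bytes via itemsize
for name in ("int8", "int16", "int32", "int64", "float32", "float64"):
    print(f"{name}: {np.dtype(name).itemsize} byte(s)")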

Good ol' Memory

If you've been around computers for a long time, it's likely you've seen the evolution of memory. From the days of kilobytes to megabytes to gigabytes to terabytes, memory has come a long way. Memory is a crucial aspect of programming, and it is important to understand how data types impact memory usage. When you create a variable in a program, the computer allocates memory to store the data associated with that variable. The amount of memory allocated depends on the data type of the variable.

Modern computers use either a 32-bit or a 64-bit architecture, which means the processor handles data in 32-bit or 64-bit chunks. This matters because different data types require different amounts of memory to store. For example, an int8 variable requires 8 bits of memory, while an int64 variable requires 64 bits, so an int8 uses one-eighth of the memory an int64 needs to store the same value. Depending on your environment, these memory savings can be absolutely crucial; they can be the difference between being able to work with data and not even being able to load it into memory.
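
To make the savings concrete, here is a small sketch (again with NumPy) comparing how much memory one million values occupy at different integer widths:

import numpy as np

million_int8 = np.zeros(1_000_000, dtype=np.int8)
million_int64 = np.zeros(1_000_000, dtype=np.int64)

# nbytes reports the raw buffer size of each array
print(million_int8.nbytes / 1024**2)   # ~0.95 MB
print(million_int64.nbytes / 1024**2)  # ~7.63 MB, eight times more for the same number of values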

I've created a couple of tools to help ingrain how memory is stored. Please feel free to play around with the data type visualizer below to get an understanding of how data types are stored in memory.


[Interactive tool: Data Type Memory Visualizer. Pick a data type to see its size on the stack, a description of the type (for example, a 32-bit signed integer is 4 bytes and spans -2,147,483,648 to 2,147,483,647), and the expected input format.]

Here is also a handy chart to help you understand how much memory different data types use:

Data Types Reference

Type              Size
int8_t (char)     1 byte  (8 bits)
uint8_t           1 byte  (8 bits)
int16_t           2 bytes (16 bits)
uint16_t          2 bytes (16 bits)
int32_t           4 bytes (32 bits)
uint32_t          4 bytes (32 bits)
int64_t           8 bytes (64 bits)
uint64_t          8 bytes (64 bits)
float             4 bytes (32 bits)
double            8 bytes (64 bits)
char              1 byte  (8 bits)
bool              1 byte  (8 bits)
pointer           8 bytes (64 bits)
Array             n × element_size bytes  (n × element_bits)
Dynamic Array     12 + (n × element_size) bytes  (96 + (n × element_bits))
List              24 + (n × (element_size + ptr_size)) bytes  (192 + (n × (element_bits + 64)))
Tuple             sum(element_sizes) bytes  (sum(element_bits))
Set               16 + (n × element_size) bytes  (128 + (n × element_bits))
Map/Dictionary    24 + (n × (key_size + value_size)) bytes  (192 + (n × (key_bits + value_bits)))
String            8 + n + 1 bytes  (64 + (n × 8) + 8)
Struct            sum(field_sizes) bytes  (sum(field_bits))
Union             max(member_sizes) bytes  (max(member_bits))
Class             8 + sum(field_sizes) bytes  (64 + sum(field_bits))
Enum              1-4 bytes  (8-32 bits)

Size Notation

n: Number of elements

element_size: Size of each element in bytes

element_bits: Size of each element in bits

ptr_size: Size of a pointer (typically 8 bytes on 64-bit systems)

field_sizes: Combined size of all fields in a structure

YeeYeeYeeYee Live Action

Let's dive into a real-world example to see how data typing can directly impact the performance and memory usage of a program. We'll use Python and Pandas to load a dataset and optimize its data types to reduce memory usage, and we'll look at how the different types are represented in memory and how they affect performance. This example is specific to Python and Pandas, but the concepts apply to any programming language and data processing library. Also keep in mind that Pandas automatically assigns data types to columns when loading data, defaulting to wide types such as int64 for integer columns even when a smaller type like int8 would do, so it's important to check and optimize the types yourself. Let's get started!

The dataset that will be used for this example is a dataset of NFL player statistics sourced from Kaggle. If you would like to follow along, the dataset can be downloaded from here: NFL Player Stats

There are two datasets included, but we will focus specifically on the game statistics.

Our dataset contains 46 columns and 1,024,164 rows, so it is fairly large and will be a good example to demonstrate how memory optimization is very important and can lead to extreme cost and time savings.
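
If you are following along, loading the data looks something like this (the file name here is just a placeholder; use whatever the games CSV is called in your Kaggle download):

import pandas as pd

# Hypothetical path to the games CSV from the Kaggle download
nfl_game_stats_df = pd.read_csv("nfl_games.csv", parse_dates=["date"])

print(nfl_game_stats_df.shape)  # expect (1024164, 46)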

Here is a quick look at the available columns in the dataset:

player_id, date, game_number, age,
passing_attempts, passing_completions, passing_yards, passing_rating,
passing_touchdowns, passing_interceptions, passing_sacks, passing_sacks_yards_lost,
rushing_attempts, rushing_yards, rushing_touchdowns, receiving_targets,
receiving_receptions, receiving_yards, receiving_touchdowns, kick_return_attempts,
kick_return_yards, kick_return_touchdowns, punt_return_attempts, punt_return_yards,
punt_return_touchdowns, defense_sacks, defense_tackles, defense_tackle_assists,
defense_interceptions, defense_interception_yards, defense_interception_touchdowns, defense_safeties,
point_after_attemps, point_after_makes, field_goal_attempts, field_goal_makes,
punting_attempts, punting_yards, punting_blocked, team,
game_location, opponent, game_won, year,
player_team_score, opponent_score

To get an initial look at the memory usage of the dataset, we can use the following code:

nfl_game_stats_df.memory_usage(deep=True).sum() / 1024**2

The dataset initially uses 525.47 MB of memory. That is a fairly large footprint, and we can reduce it by converting the columns to more appropriate types. For example, we can convert the int64 columns to int8 or int16 if their values will never exceed the maximum those types can hold, and we can convert the object columns to category if they contain a finite, small number of unique values.
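
One caution before downcasting (a small sketch using NumPy, which backs Pandas' numeric columns): casting a value that doesn't fit the target type silently wraps around rather than raising an error, so always check a column's min and max first:

import numpy as np

yards = np.array([100, 200, 300])
print(yards.astype(np.int8))   # [100 -56  44]  -- 200 and 300 wrap around and corrupt the data
print(yards.astype(np.int16))  # [100 200 300]  -- int16 comfortably holds the full range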

Let's take a deeper look at the data types of the columns in the dataset and their memory usage to see how we can optimize them:

import numpy as np
import pandas as pd

# Get memory usage of each column and convert from bytes to MB (deep=True includes object columns)
mem_usage = nfl_game_stats_df.memory_usage(deep=True) / 1024**2
# Get data types of each column
dtypes = nfl_game_stats_df.dtypes

# Split the data into two halves for easier viewing
left, right = np.array_split(mem_usage, 2)
left = left[1:]  # drop the "Index" entry that memory_usage() prepends
left = left.reset_index()
left.columns = ["Columns", "Values"]

right = right.reset_index()
right.columns = ["Columns", "Values"]

df1 = pd.DataFrame({
    'Features': left['Columns'],
    'Memory (MB)': left['Values'],
    'Dtype': [dtypes[col] for col in left['Columns']]
})

df2 = pd.DataFrame({
    'Features': right['Columns'],
    'Memory (MB)': right['Values'],
    'Dtype': [dtypes[col] for col in right['Columns']]
})

pd.concat([df1, df2], axis=1)
    Features                   Memory (MB)  Dtype           Features                          Memory (MB)  Dtype
0   player_id                  7.814        int64           receiving_receptions              7.814        int64
1   year                       7.814        int64           receiving_yards                   7.814        int64
2   date                       7.814        datetime64[ns]  receiving_touchdowns              7.814        int64
3   game_number                7.814        int64           kick_return_attempts              7.814        int64
4   age                        53.717       object          kick_return_yards                 7.814        int64
5   team                       50.789       object          kick_return_touchdowns            7.814        int64
6   game_location              48.836       object          punt_return_attempts              7.814        int64
7   opponent                   50.789       object          punt_return_yards                 7.814        int64
8   game_won                   0.977        bool            punt_return_touchdowns            7.814        int64
9   player_team_score          7.814        int64           defense_sacks                     7.814        float64
10  opponent_score             7.814        int64           defense_tackles                   7.814        int64
11  passing_attempts           7.814        int64           defense_tackle_assists            7.814        int64
12  passing_completions        7.814        int64           defense_interceptions             7.814        int64
13  passing_yards              7.814        int64           defense_interception_yards        7.814        int64
14  passing_rating             7.814        float64         defense_interception_touchdowns   7.814        int64
15  passing_touchdowns         7.814        int64           defense_safeties                  7.814        int64
16  passing_interceptions      7.814        int64           point_after_attemps               7.814        int64
17  passing_sacks              7.814        int64           point_after_makes                 7.814        int64
18  passing_sacks_yards_lost   7.814        int64           field_goal_attempts               7.814        int64
19  rushing_attempts           7.814        int64           field_goal_makes                  7.814        int64
20  rushing_yards              7.814        int64           punting_attempts                  7.814        int64
21  rushing_touchdowns         7.814        int64           punting_yards                     7.814        int64
22  receiving_targets          7.814        int64           punting_blocked                   7.814        int64

There are a few things that immediately stand out in the output:

  1. The object columns are using a significant amount of memory.
  2. The int64 columns are using more memory than necessary.
  3. All of the 64-bit columns use the same amount of memory, which is a great visual reminder that it's the data type, not the value, that determines memory usage.

Let's take a look at the object columns first, since they are using by far the most memory:

nfl_game_stats_df[['age', 'team', 'game_location', 'opponent']].head()
      age  team  game_location  opponent
0  23-120  SEA   A               CHI
1  23-127  SEA   H               RAI
2  23-134  SEA   A               DEN
3  23-142  SEA   H               CIN
4  23-148  SEA   A               NWE

You might have been wondering: why is the age column an object? Shouldn't it be an integer? The output makes it clear that age isn't a whole number; it likely represents age as years and days, so the first player is 23 years and 120 days old. This is a perfect example of why it's important to check the data types of your columns and make sure they are appropriate for the data they contain. The team, game_location, and opponent columns are all strings, so they are correctly typed as objects; however, they can be better typed as category.

What does a category do? A category is a data type used to represent a fixed number of unique values. In simpler terms, if it's a categorical variable, type it as such. It is much more memory efficient than the object data type, especially when the number of unique values is small. A quick way to check is Pandas' .nunique() method:

nfl_game_stats_df[['age', 'team', 'game_location', 'opponent']].nunique()
               count
age             8125
team              42
game_location      3
opponent          42

We know that the dataset contains 1,024,164 rows, and since all four of these columns have very few unique values relative to that, we can treat them as categorical variables. We can convert them to the category data type to save memory:

memory_data = []
columns = ['age', 'team', 'game_location', 'opponent']

for col in columns:
   before = nfl_game_stats_df[col].memory_usage(deep=True) / 1024**2
   after = nfl_game_stats_df[col].astype("category").memory_usage(deep=True) / 1024**2
   memory_data.append({
       'column': col,
       'before_mb': before,
       'after_mb': after,
       'mb_saved': before - after
   })

memory_df = pd.DataFrame(memory_data).set_index('column').rename_axis(None)

memory_df.loc['total'] = [memory_df['before_mb'].sum(), memory_df['after_mb'].sum(), memory_df['mb_saved'].sum()]
               before_mb  after_mb  mb_saved
age               53.718     2.632    51.086
team              50.790     0.980    49.810
game_location     48.836     0.977    47.859
opponent          50.790     0.980    49.810
total            204.133     5.569   198.564

By making this one simple change, we saved 198.56 MB of memory, which is significant considering the whole dataset started at 525.47 MB. That's nearly 38% of the total memory usage of the dataset! Strings/objects can really be detrimental to memory usage, so it's important to handle them appropriately. We brought the total memory usage down to 326.9 MB, which is a huge improvement.
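
Why does this work so well? Roughly speaking, a category column stores each unique string only once and replaces every row with a small integer code that points at it, which you can see directly:

team_cat = nfl_game_stats_df['team'].astype('category')

print(len(team_cat.cat.categories))  # 42 unique team strings, stored once
print(team_cat.cat.codes.dtype)      # int8 -- one byte per row instead of a full Python string object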

Next, let's take a look at the integer columns and see if we can optimize them as well. We can use the same approach as before to check the memory usage of the integer columns and see if we can convert them to a smaller data type. I wrote a function that checks the min and max of each integer column and determines the optimal integer type based on the range of values:

def optimize_ints_report(df):
   memory_data = []
   int_cols = df.select_dtypes(include=['int64']).columns

   for col in int_cols:
       col_min = df[col].min()
       col_max = df[col].max()
       before = df[col].memory_usage(deep=True) / 1024**2

       # Determine optimal type
       if col_min >= 0:
           dtype = ('uint8' if col_max < 256 else
                   'uint16' if col_max < 65536 else
                   'uint32' if col_max < 4294967296 else
                   'uint64')
       else:
           dtype = ('int8' if -128 <= col_min and col_max < 128 else
                   'int16' if -32768 <= col_min and col_max < 32768 else
                   'int32' if -2147483648 <= col_min and col_max < 2147483648 else
                   'int64')

       after = df[col].astype(dtype).memory_usage(deep=True) / 1024**2

       memory_data.append({
           'column': col,
           'before_mb': before,
           'after_mb': after,
           'mb_saved': before - after,
           'optimal_type': dtype
       })

   memory_df = pd.DataFrame(memory_data).set_index('column').rename_axis(None)
   memory_df.loc['total'] = [memory_df['before_mb'].sum(),
                            memory_df['after_mb'].sum(),
                            memory_df['mb_saved'].sum(),
                            '']

   return memory_df
                                  before_mb  after_mb  mb_saved  optimal_type
player_id                             7.814     1.954     5.860  uint16
year                                  7.814     1.954     5.860  uint16
game_number                           7.814     0.977     6.837  uint8
player_team_score                     7.814     0.977     6.837  uint8
opponent_score                        7.814     0.977     6.837  uint8
passing_attempts                      7.814     0.977     6.837  uint8
passing_completions                   7.814     0.977     6.837  uint8
passing_yards                         7.814     1.954     5.860  int16
passing_touchdowns                    7.814     0.977     6.837  uint8
passing_interceptions                 7.814     0.977     6.837  uint8
passing_sacks                         7.814     0.977     6.837  uint8
passing_sacks_yards_lost              7.814     0.977     6.837  uint8
rushing_attempts                      7.814     0.977     6.837  uint8
rushing_yards                         7.814     1.954     5.860  int16
rushing_touchdowns                    7.814     0.977     6.837  uint8
receiving_targets                     7.814     0.977     6.837  uint8
receiving_receptions                  7.814     0.977     6.837  uint8
receiving_yards                       7.814     1.954     5.860  int16
receiving_touchdowns                  7.814     0.977     6.837  uint8
kick_return_attempts                  7.814     0.977     6.837  uint8
kick_return_yards                     7.814     1.954     5.860  int16
kick_return_touchdowns                7.814     0.977     6.837  uint8
punt_return_attempts                  7.814     0.977     6.837  uint8
punt_return_yards                     7.814     1.954     5.860  int16
punt_return_touchdowns                7.814     0.977     6.837  uint8
defense_tackles                       7.814     0.977     6.837  uint8
defense_tackle_assists                7.814     0.977     6.837  uint8
defense_interceptions                 7.814     0.977     6.837  uint8
defense_interception_yards            7.814     1.954     5.860  int16
defense_interception_touchdowns       7.814     0.977     6.837  uint8
defense_safeties                      7.814     0.977     6.837  uint8
point_after_attemps                   7.814     0.977     6.837  uint8
point_after_makes                     7.814     0.977     6.837  uint8
field_goal_attempts                   7.814     0.977     6.837  uint8
field_goal_makes                      7.814     0.977     6.837  uint8
punting_attempts                      7.814     0.977     6.837  uint8
punting_yards                         7.814     1.954     5.860  int16
punting_blocked                       7.814     0.977     6.837  uint8
total                               296.927    45.911   251.017

It's like magic, really. By converting the integer columns to their optimal types, we saved an additional 251.02 MB of memory, bringing the total memory usage of the dataset down to 75.89 MB. That's an 85.6% reduction from where we started, which is absolutely incredible.

To test out some of these memory savings, we will put the dataset through a gauntlet of tests comparing the unoptimized dataset with the optimized one. We will time value counts, complex groupby operations, string operations, and complex filtering on both versions.
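
The benchmark below compares the original DataFrame against an optimized copy, nfl_game_stats_df_optimized. One way to build that copy (a sketch that applies the conversions discussed above using Pandas' built-in downcasting, rather than the reporting helper) is:

nfl_game_stats_df_optimized = nfl_game_stats_df.copy()

# Low-cardinality string columns become categories
for col in ['age', 'team', 'game_location', 'opponent']:
    nfl_game_stats_df_optimized[col] = nfl_game_stats_df_optimized[col].astype('category')

# Integer columns are downcast to the smallest type that fits their range
for col in nfl_game_stats_df_optimized.select_dtypes(include=['int64']).columns:
    signed = nfl_game_stats_df_optimized[col].min() < 0
    nfl_game_stats_df_optimized[col] = pd.to_numeric(
        nfl_game_stats_df_optimized[col],
        downcast='integer' if signed else 'unsigned'
    )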

import time

def benchmark_extended_ops():
    # Value counts
    start = time.time()
    nfl_game_stats_df['team'].value_counts()
    t1 = time.time() - start

    start = time.time()
    nfl_game_stats_df_optimized['team'].value_counts()
    t2 = time.time() - start
    print(f"Value counts - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # Multiple groupby operations
    start = time.time()
    nfl_game_stats_df.groupby(['team', 'year'])['passing_yards'].mean()
    t1 = time.time() - start

    start = time.time()
    nfl_game_stats_df_optimized.groupby(['team', 'year'], observed=True)['passing_yards'].mean()
    t2 = time.time() - start
    print(f"Complex groupby - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # String operations
    start = time.time()
    nfl_game_stats_df['team'].str.contains('A')
    t1 = time.time() - start

    start = time.time()
    nfl_game_stats_df_optimized['team'].str.contains('A')
    t2 = time.time() - start
    print(f"String ops - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # Complex filtering
    start = time.time()
    nfl_game_stats_df[(nfl_game_stats_df['passing_yards'] > 200) &
                      (nfl_game_stats_df['rushing_yards'] > 50)]
    t1 = time.time() - start

    start = time.time()
    nfl_game_stats_df_optimized[(nfl_game_stats_df_optimized['passing_yards'] > 200) &
                               (nfl_game_stats_df_optimized['rushing_yards'] > 50)]
    t2 = time.time() - start
    print(f"Complex filter - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

benchmark_extended_ops()
Operation         Original (s)  Optimized (s)
Value counts      0.13          0.00
Complex groupby   0.12          0.02
String ops        0.24          0.00
Complex filter    0.02          0.00

The proof is in the pudding: by optimizing the data types of the columns in the dataset, we were able to significantly reduce the time it takes to perform operations on the data. Smaller data types are faster to process and require less memory, which leads to faster execution times. When you're working with huge datasets or performing a lot of operations, these time savings add up very quickly.

All Good Things Come To An End

In conclusion, data typing is a crucial aspect of programming that can have a huge impact on the performance, memory usage, and security of your program. By properly typing your data, you can make your program run faster, use less memory, and be more secure. It's important to understand the different data types available to you and how they can be used to represent your data. By taking the time to optimize the data types of your columns, and of your projects in general, you can save a significant amount of memory and improve performance. Remember, data is the lifeblood of the world, and data typing is the key to unlocking its full potential.