# Am I Your Type?
Today, data is more valuable than gold. Companies have been building revolutionary machine learning and artificial intelligence products for decades, and algorithms make our lives easier in countless ways, from identifying disease, to predicting the weather, to recommending movies. AI is reshaping the world, and data is the fuel that powers it. Recently, with the release of large language models like ChatGPT, Claude, and Gemini, the world has seen the power of AI and data science firsthand: millions of people now have a seemingly endless amount of information at their fingertips. We wouldn't have any of it without data.
Everywhere we look, everywhere we go, everything we see is data. Data is becoming the lifeblood of the world and the foundation of everything we do, from the way we communicate to the way we work, heck, even the way we drive. Data is everywhere, it's not going anywhere, and it's growing at an exponential rate. With the rise of big data, data science, and machine learning, the importance of data has never been more obvious. But what is data? At its simplest, data is information, and information can be anything from numbers to text to images. Data is more than just information, though: every piece of it has a type. Data types are crucial for understanding how to work with data, and they play a significant role in programming.
## Introduction to Data Typing
In programming, data types classify the values a program works with, such as integers, strings, and booleans. It is incredibly important to type your variables properly, as it can have a huge impact on the performance, memory management, and security of your program. Data typing is the process of assigning a type to a variable, which tells the computer how to interpret the data stored in that variable.
There are literally dozens of data types spread across dozens of programming languages, but the most common ones are integers, floats, strings, and booleans. An integer is a whole number, a float is a number with a decimal point, a string is a sequence of characters, and a boolean is a value that is either true or false. These data types are the building blocks of programming, and they are used to represent everything from numbers to text to logic.
For example:
```python
# Integers (whole numbers)
age = 25
year = 2024
temperature = -5

# Floats (decimal numbers)
height = 5.11
pi = 3.14159
bank_balance = -123.45

# Strings (text)
name = "Chris"
message = 'I am the master of the universe!'
address = "123 Data Science St."

# Booleans (True/False)
is_student = True
has_license = False
is_raining = True
```
On the surface, learning data types is pretty straightforward: if it's a whole number, use an integer; if it's a decimal, use a float; if it's text, use a string; and if it's true or false, use a boolean. Once you get a little deeper into programming, however, you'll realize that data types are more complex than that. Numeric types in particular come in several variations, and each one has its own unique properties and behaviors.
For example, Python itself has a single arbitrary-precision int type, but numerical libraries like NumPy and Pandas (which we'll use later) expose four fixed-width integer types: int8, int16, int32, and int64. Each of these types has a different range of values it can represent, and each one takes up a different amount of memory. Floats come in similar fixed-width flavors.
Properly typing your data can have massive effects on performance, memory usage, and security. For example, using a smaller integer type like int8 instead of int64 can save memory and improve performance. Similarly, using a strongly typed array instead of a dynamic array can improve memory usage and reduce the risk of security vulnerabilities.
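To make the typed-versus-dynamic point concrete, here is a minimal sketch of my own (not from any particular library) comparing a plain Python list of integers to Python's built-in array module, which stores raw fixed-width values:

```python
import array
import sys

values = list(range(1000))

# A dynamic list stores pointers to full Python int objects,
# so we count the container plus every boxed integer it references.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A typed array stores raw 1-byte signed integers ('b' is comparable to int8),
# so each element costs exactly one byte.
typed = array.array('b', [v % 128 for v in values])
typed_bytes = sys.getsizeof(typed)

print(f"list:  {list_bytes:,} bytes")
print(f"array: {typed_bytes:,} bytes")
```

On CPython the gap is roughly an order of magnitude; the exact numbers vary by version and platform.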
## Good ol' Memory
If you've been around computers a long time, you've likely seen the evolution of memory firsthand: from the days of kilobytes to megabytes to gigabytes to terabytes, memory has come a long way. Memory is a crucial aspect of programming, and it's important to understand how data types impact its usage. When you create a variable in a program, the computer allocates memory to store the data associated with that variable, and the amount of memory allocated depends on the data type of the variable.
Modern computers use either a 32-bit or 64-bit architecture, meaning the processor natively handles data in 32-bit or 64-bit chunks. This matters because different data types require different amounts of memory to store. For example, an int8 variable requires 8 bits of memory, while an int64 variable requires 64 bits, so an int8 uses one-eighth the memory of an int64 to store the same (sufficiently small) value. Depending on your environment, these memory savings can be absolutely crucial; they can be the difference between being able to work with your data and not even being able to load it into memory.
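A quick NumPy sketch makes the difference tangible; the byte counts below assume a typical 64-bit platform:

```python
import numpy as np

million = np.arange(1_000_000)

as_int64 = million.astype(np.int64)
as_int8 = (million % 100).astype(np.int8)  # keep values inside int8's range

print(as_int64.nbytes)  # 8,000,000 bytes (~8 MB)
print(as_int8.nbytes)   # 1,000,000 bytes (~1 MB)
```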
I've created a couple of tools to help ingrain how memory is stored. Please feel free to play around with the data type visualizer here to get a feel for how data types are stored in memory.
[Interactive widget: Data Type Memory Visualizer]
Here is also a handy chart to help you understand how much memory different data types use:
Data Types Reference
| Type | Size | Range | Common Usage |
|---|---|---|---|
| int8_t (char) | 1 byte (8 bits) | -128 to 127 | Small integers, ASCII characters |
| uint8_t | 1 byte (8 bits) | 0 to 255 | Byte values, small positive numbers |
| int16_t | 2 bytes (16 bits) | -32,768 to 32,767 | Medium-range integers |
| uint16_t | 2 bytes (16 bits) | 0 to 65,535 | Port numbers, medium positive numbers |
| int32_t | 4 bytes (32 bits) | -2,147,483,648 to 2,147,483,647 | General purpose integers |
| uint32_t | 4 bytes (32 bits) | 0 to 4,294,967,295 | Large positive numbers, RGB colors |
| int64_t | 8 bytes (64 bits) | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | Very large integers, timestamps |
| uint64_t | 8 bytes (64 bits) | 0 to 18,446,744,073,709,551,615 | File sizes, very large positive numbers |
| float | 4 bytes (32 bits) | ±1.18e-38 to ±3.4e+38 | Basic decimal calculations |
| double | 8 bytes (64 bits) | ±2.23e-308 to ±1.80e+308 | Precise decimal calculations |
| char | 1 byte (8 bits) | 0 to 255 | Single characters, ASCII values |
| bool | 1 byte (8 bits) | false / true | True/false values |
| pointer | 8 bytes (64 bits) | 0x0 to 0xFFFFFFFFFFFFFFFF | Memory addresses |
| Array | n × element_size bytes | 0 elements to memory limit | Fixed-size sequential collections |
| Dynamic Array | 12 + (n × element_size) bytes | 0 elements to memory limit | Resizable sequential collections |
| List | 24 + (n × (element_size + ptr_size)) bytes | 0 elements to memory limit | Linked data structures |
| Tuple | sum(element_sizes) bytes | Fixed size | Mixed-type fixed collections |
| Set | 16 + (n × element_size) bytes | 0 elements to memory limit | Unique value collections |
| Map/Dictionary | 24 + (n × (key_size + value_size)) bytes | 0 pairs to memory limit | Key-value associations |
| String | 8 + n + 1 bytes | Empty string to memory limit | Text storage |
| Struct | sum(field_sizes) bytes | Fixed size | Custom data grouping |
| Union | max(member_sizes) bytes | Largest member | Memory-efficient variants |
| Class | 8 + sum(field_sizes) bytes | vtable + fields | Object-oriented types |
| Enum | 1-4 bytes (8-32 bits) | 0 to 2^32 - 1 | Named constants |
Size Notation
- n: number of elements
- element_size: size of each element in bytes
- ptr_size: size of a pointer (typically 8 bytes on 64-bit systems)
- field_sizes: combined size of all fields in a structure
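If you want to sanity-check a few of the scalar rows yourself, Python's struct module reports the packed size of each C type; on a typical 64-bit system you should see the same byte counts as the chart:

```python
import struct

print(struct.calcsize('b'))  # int8_t   -> 1
print(struct.calcsize('h'))  # int16_t  -> 2
print(struct.calcsize('i'))  # int32_t  -> 4
print(struct.calcsize('q'))  # int64_t  -> 8
print(struct.calcsize('f'))  # float    -> 4
print(struct.calcsize('d'))  # double   -> 8
print(struct.calcsize('P'))  # pointer  -> 8 on 64-bit builds
```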
## YeeYeeYeeYee Live Action
Let's dive into a real-world example to see how data typing can directly impact the performance and memory usage of a program. We'll use Python and Pandas to load a dataset and optimize its data types to reduce memory usage, and we'll look at how different data types are represented in memory and how they affect performance. This example is specific to Python and Pandas, but the concepts apply to any programming language and data processing library. Also keep in mind that Pandas automatically assigns data types to columns when loading data, defaulting to wide types such as int64 for integers even when a smaller type like int8 would do, so it's important to check and optimize the dtypes yourself. Let's get started!
The dataset that will be used for this example is a dataset of NFL player statistics sourced from Kaggle. If you would like to follow along, the dataset can be downloaded from here: NFL Player Stats
There are two datasets included, but we will focus specifically on the games data.
Our dataset contains 46 columns and 1,024,164 rows, so it is fairly large and a good example of how memory optimization can lead to serious cost and time savings.
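If you're following along, a minimal loading sketch looks like this; the file name and date parsing are assumptions on my part, so adjust them to match your copy of the Kaggle download:

```python
import pandas as pd

# File name is an assumption; rename to match your download
nfl_game_stats_df = pd.read_csv("games.csv", parse_dates=["date"])
nfl_game_stats_df.info(memory_usage="deep")
```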
Here is a quick look at the available columns in the dataset:
```text
player_id                  date                        game_number                      age
passing_attempts           passing_completions         passing_yards                    passing_rating
passing_touchdowns         passing_interceptions       passing_sacks                    passing_sacks_yards_lost
rushing_attempts           rushing_yards               rushing_touchdowns               receiving_targets
receiving_receptions       receiving_yards             receiving_touchdowns             kick_return_attempts
kick_return_yards          kick_return_touchdowns      punt_return_attempts             punt_return_yards
punt_return_touchdowns     defense_sacks               defense_tackles                  defense_tackle_assists
defense_interceptions      defense_interception_yards  defense_interception_touchdowns  defense_safeties
point_after_attemps        point_after_makes           field_goal_attempts              field_goal_makes
punting_attempts           punting_yards               punting_blocked                  team
game_location              opponent                    game_won                         year
player_team_score          opponent_score
```
To get an initial look at the memory usage of the dataset, we can use the following code:
```python
nfl_game_stats_df.memory_usage(deep=True).sum() / 1024**2
```
The dataset initially uses 525.47 MB of memory. That's a fairly large footprint, and we can shrink it by converting columns to more appropriate types. For example, we can convert int64 columns to int8 or int16 when their values will never exceed the maximum those types support, and we can convert object columns to category when a column has a finite number of unique values.
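As a side note, Pandas can do some of this downcasting for you: pd.to_numeric with the downcast argument picks the smallest numeric type that fits the values. A tiny sketch with made-up values:

```python
import pandas as pd

s = pd.Series([0, 12, 255])  # stored as int64 by default

print(pd.to_numeric(s, downcast="unsigned").dtype)  # uint8
print(pd.to_numeric(s, downcast="integer").dtype)   # int16 (smallest signed fit)
```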
Let's take a deeper look at the data types of the columns in the dataset and their memory usage to see how we can optimize them:
```python
import numpy as np
import pandas as pd

# Get memory usage of each column and convert bytes to MB
# (deep=True includes the contents of object columns)
mem_usage = nfl_game_stats_df.memory_usage(deep=True) / 1024**2

# Get the data type of each column
dtypes = nfl_game_stats_df.dtypes

# Split the data into two halves for easier side-by-side viewing
left, right = np.array_split(mem_usage, 2)
left = left[1:]  # drop the Index entry
left = left.reset_index()
left.columns = ["Columns", "Values"]
right = right.reset_index()
right.columns = ["Columns", "Values"]

df1 = pd.DataFrame({
    'Features': left['Columns'],
    'Memory (MB)': left['Values'],
    'Dtype': [dtypes[col] for col in left['Columns']]
})
df2 = pd.DataFrame({
    'Features': right['Columns'],
    'Memory (MB)': right['Values'],
    'Dtype': [dtypes[col] for col in right['Columns']]
})
pd.concat([df1, df2], axis=1)
```
| | Features | Memory (MB) | Dtype | Features | Memory (MB) | Dtype |
|---|---|---|---|---|---|---|
| 0 | player_id | 7.814 | int64 | receiving_receptions | 7.814 | int64 |
| 1 | year | 7.814 | int64 | receiving_yards | 7.814 | int64 |
| 2 | date | 7.814 | datetime64[ns] | receiving_touchdowns | 7.814 | int64 |
| 3 | game_number | 7.814 | int64 | kick_return_attempts | 7.814 | int64 |
| 4 | age | 53.717 | object | kick_return_yards | 7.814 | int64 |
| 5 | team | 50.789 | object | kick_return_touchdowns | 7.814 | int64 |
| 6 | game_location | 48.836 | object | punt_return_attempts | 7.814 | int64 |
| 7 | opponent | 50.789 | object | punt_return_yards | 7.814 | int64 |
| 8 | game_won | 0.977 | bool | punt_return_touchdowns | 7.814 | int64 |
| 9 | player_team_score | 7.814 | int64 | defense_sacks | 7.814 | float64 |
| 10 | opponent_score | 7.814 | int64 | defense_tackles | 7.814 | int64 |
| 11 | passing_attempts | 7.814 | int64 | defense_tackle_assists | 7.814 | int64 |
| 12 | passing_completions | 7.814 | int64 | defense_interceptions | 7.814 | int64 |
| 13 | passing_yards | 7.814 | int64 | defense_interception_yards | 7.814 | int64 |
| 14 | passing_rating | 7.814 | float64 | defense_interception_touchdowns | 7.814 | int64 |
| 15 | passing_touchdowns | 7.814 | int64 | defense_safeties | 7.814 | int64 |
| 16 | passing_interceptions | 7.814 | int64 | point_after_attemps | 7.814 | int64 |
| 17 | passing_sacks | 7.814 | int64 | point_after_makes | 7.814 | int64 |
| 18 | passing_sacks_yards_lost | 7.814 | int64 | field_goal_attempts | 7.814 | int64 |
| 19 | rushing_attempts | 7.814 | int64 | field_goal_makes | 7.814 | int64 |
| 20 | rushing_yards | 7.814 | int64 | punting_attempts | 7.814 | int64 |
| 21 | rushing_touchdowns | 7.814 | int64 | punting_yards | 7.814 | int64 |
| 22 | receiving_targets | 7.814 | int64 | punting_blocked | 7.814 | int64 |
There are a few things that immediately stand out in the output:
- The object columns are using a significant amount of memory.
- The int64 columns are using more memory than necessary.
- All of the 64-bit columns use the same amount of memory, a great visual reminder that memory usage is determined by the data type, not by the values stored in it.
Let's take a look at the object columns first, since they use by far the most memory:
```python
nfl_game_stats_df[['age', 'team', 'game_location', 'opponent']].head()
```
| | age | team | game_location | opponent |
|---|---|---|---|---|
| 0 | 23-120 | SEA | A | CHI |
| 1 | 23-127 | SEA | H | RAI |
| 2 | 23-134 | SEA | A | DEN |
| 3 | 23-142 | SEA | H | CIN |
| 4 | 23-148 | SEA | A | NWE |
You might be wondering: why is the age column an object? Shouldn't it be an integer? The output makes it clear that age isn't a whole number; it represents age in years and days, so the first player is 23 years and 120 days old. This is a perfect example of why it's important to check the data types of your columns and make sure they are appropriate for the data they contain. The team, game_location, and opponent columns are all strings, so they are correctly typed as objects; however, they can be better typed as categories.
What does a category do? A category is a data type used to represent a fixed number of unique values; in simpler terms, if a variable is categorical, type it as such. It is far more memory efficient than the object data type, especially when the number of unique values is small.
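Under the hood, a category column stores each unique value once and keeps a small integer code per row, which is where the savings come from. Here's a toy sketch with made-up values:

```python
import pandas as pd

teams = pd.Series(["SEA", "CHI", "SEA", "DEN", "SEA"]).astype("category")

print(teams.cat.categories)      # Index(['CHI', 'DEN', 'SEA'], dtype='object')
print(teams.cat.codes.tolist())  # [2, 0, 2, 1, 2]
```

A quick way to check whether a column is a good category candidate is Pandas' .nunique() method: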
```python
nfl_game_stats_df[['age', 'team', 'game_location', 'opponent']].nunique()
```
| | count |
|---|---|
| age | 8125.000 |
| team | 42.000 |
| game_location | 3.000 |
| opponent | 42.000 |
We know that the dataset contains 1,024,164 rows, and since all four of these columns have a very low number of unique values relative to that (even age, at 8,125), we can treat them as categorical variables and convert them to the category data type to save memory:
```python
memory_data = []
columns = ['age', 'team', 'game_location', 'opponent']

for col in columns:
    # Memory in MB before and after converting the column to category
    before = nfl_game_stats_df[col].memory_usage(deep=True) / 1024**2
    after = nfl_game_stats_df[col].astype("category").memory_usage(deep=True) / 1024**2
    memory_data.append({
        'column': col,
        'before_mb': before,
        'after_mb': after,
        'mb_saved': before - after
    })

memory_df = pd.DataFrame(memory_data).set_index('column').rename_axis(None)
memory_df.loc['total'] = [memory_df['before_mb'].sum(),
                          memory_df['after_mb'].sum(),
                          memory_df['mb_saved'].sum()]
memory_df
```
| | before_mb | after_mb | mb_saved |
|---|---|---|---|
| age | 53.718 | 2.632 | 51.086 |
| team | 50.790 | 0.980 | 49.810 |
| game_location | 48.836 | 0.977 | 47.859 |
| opponent | 50.790 | 0.980 | 49.810 |
| total | 204.133 | 5.569 | 198.564 |
By making this one simple change, we saved 198.56 MB of memory, which is significant considering the entire dataset is only 525.47 MB; that's nearly 38% of its total memory usage. Strings/objects can really be detrimental to memory usage, so it's important to handle them appropriately. This brings the total memory usage down to 326.9 MB, a huge improvement.
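Here is a short sketch of applying the conversion. I'm working on a copy so the original frame stays around for the benchmark comparison later; whether you convert in place or on a copy is up to you:

```python
nfl_game_stats_df_optimized = nfl_game_stats_df.copy()
for col in ['age', 'team', 'game_location', 'opponent']:
    nfl_game_stats_df_optimized[col] = nfl_game_stats_df_optimized[col].astype('category')

# Re-check the total: ~326.9 MB after this step
print(nfl_game_stats_df_optimized.memory_usage(deep=True).sum() / 1024**2)
```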
Next, let's take a look at the integer columns and see whether we can optimize them as well, using the same approach to check memory usage and convert to smaller types where possible. I wrote a function that checks the min and max of each integer column and determines the optimal integer type based on the range of values:
```python
def optimize_ints_report(df):
    memory_data = []
    int_cols = df.select_dtypes(include=['int64']).columns

    for col in int_cols:
        col_min = df[col].min()
        col_max = df[col].max()
        before = df[col].memory_usage(deep=True) / 1024**2

        # Determine the smallest integer type that can hold the column's range
        if col_min >= 0:
            dtype = ('uint8' if col_max < 256 else
                     'uint16' if col_max < 65536 else
                     'uint32' if col_max < 4294967296 else
                     'uint64')
        else:
            dtype = ('int8' if -128 <= col_min and col_max < 128 else
                     'int16' if -32768 <= col_min and col_max < 32768 else
                     'int32' if -2147483648 <= col_min and col_max < 2147483648 else
                     'int64')

        after = df[col].astype(dtype).memory_usage(deep=True) / 1024**2
        memory_data.append({
            'column': col,
            'before_mb': before,
            'after_mb': after,
            'mb_saved': before - after,
            'optimal_type': dtype
        })

    memory_df = pd.DataFrame(memory_data).set_index('column').rename_axis(None)
    memory_df.loc['total'] = [memory_df['before_mb'].sum(),
                              memory_df['after_mb'].sum(),
                              memory_df['mb_saved'].sum(),
                              '']
    return memory_df

optimize_ints_report(nfl_game_stats_df)
```
| | before_mb | after_mb | mb_saved | optimal_type |
|---|---|---|---|---|
| player_id | 7.814 | 1.954 | 5.860 | uint16 |
| year | 7.814 | 1.954 | 5.860 | uint16 |
| game_number | 7.814 | 0.977 | 6.837 | uint8 |
| player_team_score | 7.814 | 0.977 | 6.837 | uint8 |
| opponent_score | 7.814 | 0.977 | 6.837 | uint8 |
| passing_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| passing_completions | 7.814 | 0.977 | 6.837 | uint8 |
| passing_yards | 7.814 | 1.954 | 5.860 | int16 |
| passing_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| passing_interceptions | 7.814 | 0.977 | 6.837 | uint8 |
| passing_sacks | 7.814 | 0.977 | 6.837 | uint8 |
| passing_sacks_yards_lost | 7.814 | 0.977 | 6.837 | uint8 |
| rushing_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| rushing_yards | 7.814 | 1.954 | 5.860 | int16 |
| rushing_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| receiving_targets | 7.814 | 0.977 | 6.837 | uint8 |
| receiving_receptions | 7.814 | 0.977 | 6.837 | uint8 |
| receiving_yards | 7.814 | 1.954 | 5.860 | int16 |
| receiving_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| kick_return_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| kick_return_yards | 7.814 | 1.954 | 5.860 | int16 |
| kick_return_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| punt_return_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| punt_return_yards | 7.814 | 1.954 | 5.860 | int16 |
| punt_return_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| defense_tackles | 7.814 | 0.977 | 6.837 | uint8 |
| defense_tackle_assists | 7.814 | 0.977 | 6.837 | uint8 |
| defense_interceptions | 7.814 | 0.977 | 6.837 | uint8 |
| defense_interception_yards | 7.814 | 1.954 | 5.860 | int16 |
| defense_interception_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| defense_safeties | 7.814 | 0.977 | 6.837 | uint8 |
| point_after_attemps | 7.814 | 0.977 | 6.837 | uint8 |
| point_after_makes | 7.814 | 0.977 | 6.837 | uint8 |
| field_goal_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| field_goal_makes | 7.814 | 0.977 | 6.837 | uint8 |
| punting_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| punting_yards | 7.814 | 1.954 | 5.860 | int16 |
| punting_blocked | 7.814 | 0.977 | 6.837 | uint8 |
| total | 296.927 | 45.911 | 251.017 | |
It's like magic, really. By converting the integer columns to their optimal integer types, we saved an additional 251.02 MB of memory, bringing the total memory usage of the dataset down to 75.89 MB. That's an 85.6% reduction in memory usage, which is absolutely incredible.
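To actually apply those recommendations, one approach (a sketch reusing optimize_ints_report() from above) is to turn the report's optimal_type column into a dtype mapping and hand it to astype:

```python
# Drop the 'total' summary row, then map each column to its optimal type
report = optimize_ints_report(nfl_game_stats_df_optimized).drop('total')
dtype_map = report['optimal_type'].to_dict()

nfl_game_stats_df_optimized = nfl_game_stats_df_optimized.astype(dtype_map)
```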
To put these memory savings to the test, we'll run both the unoptimized and the optimized dataset through a gauntlet of benchmarks: value counts, a complex groupby, string operations, and complex filtering.
```python
import time

def benchmark_extended_ops():
    # Value counts
    start = time.time()
    nfl_game_stats_df['team'].value_counts()
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized['team'].value_counts()
    t2 = time.time() - start
    print(f"Value counts - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # Multiple groupby operations
    start = time.time()
    nfl_game_stats_df.groupby(['team', 'year'])['passing_yards'].mean()
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized.groupby(['team', 'year'], observed=True)['passing_yards'].mean()
    t2 = time.time() - start
    print(f"Complex groupby - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # String operations
    start = time.time()
    nfl_game_stats_df['team'].str.contains('A')
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized['team'].str.contains('A')
    t2 = time.time() - start
    print(f"String ops - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # Complex filtering
    start = time.time()
    nfl_game_stats_df[(nfl_game_stats_df['passing_yards'] > 200) &
                      (nfl_game_stats_df['rushing_yards'] > 50)]
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized[(nfl_game_stats_df_optimized['passing_yards'] > 200) &
                                (nfl_game_stats_df_optimized['rushing_yards'] > 50)]
    t2 = time.time() - start
    print(f"Complex filter - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

benchmark_extended_ops()
```
| Operation | Original (s) | Optimized (s) |
|---|---|---|
| Value counts | 0.13 | 0.00 |
| Complex groupby | 0.12 | 0.02 |
| String ops | 0.24 | 0.00 |
| Complex filter | 0.02 | 0.00 |
The proof is in the pudding: by optimizing the column data types, we significantly reduced the time it takes to perform operations on the data. Smaller data types are faster to process and require less memory to move around, which leads to faster execution times. When you're working with huge datasets or performing many operations, these time savings add up very quickly.
## All Good Things Come To An End
In conclusion, data typing is a crucial aspect of programming that can have a huge impact on the performance, memory usage, and security of your programs. By properly typing your data, you can make your programs run faster, use less memory, and be more secure. It's important to understand the data types available to you and how they can represent your data. By taking the time to optimize the data types in your columns, and in your projects in general, you can save a significant amount of memory and improve performance. Remember, data is the lifeblood of the world, and data typing is the key to unlocking its full potential.