# Am I Your Type?
Today, data is more valuable than gold. Companies have been building revolutionary machine learning and artificial intelligence products for decades, and algorithms make our lives easier in countless ways, from identifying disease, to predicting the weather, to recommending movies. AI is reshaping the world, and data is the fuel that powers it. Recently, with the release of large language models like ChatGPT, Claude, and Gemini, the world has seen the power of AI and data science firsthand: millions of people now have a seemingly endless amount of information at their fingertips. We wouldn't have any of it without data.
Everywhere we look, everywhere we go, everything we see is data. Data is becoming the lifeblood of the world and the foundation of everything we do, from the way we communicate to the way we work, heck, even the way we drive. Data is everywhere, it's not going anywhere, and it's growing at an exponential rate. With the rise of big data, data science, and machine learning, the importance of data has never been more obvious. But what is data? At its simplest, data is information, and information can be anything from numbers to text to images. Data is more than just information, though: every piece of it has a type. Data types are crucial for understanding how to work with data, and they play a significant role in programming.
## Introduction to Data Typing
In programming, data types classify the values a program works with, such as integers, strings, and booleans. It is incredibly important to type your variables properly, as it can have a huge impact on the performance, memory management, and security of your program. Data typing is the process of assigning a type to a variable, which tells the computer how to interpret the data stored in that variable.
There are literally dozens of data types spread across dozens of programming languages, but the most common ones are integers, floats, strings, and booleans. An integer is a whole number, a float is a number with a decimal point, a string is a sequence of characters, and a boolean is a value that is either true or false. These data types are the building blocks of programming, and they are used to represent everything from numbers to text to logic.
For example:
```python
# Integers (whole numbers)
age = 25
year = 2024
temperature = -5

# Floats (decimal numbers)
height = 5.11
pi = 3.14159
bank_balance = -123.45

# Strings (text)
name = "Chris"
message = 'I am the master of the universe!'
address = "123 Data Science St."

# Booleans (True/False)
is_student = True
has_license = False
is_raining = True
```
On the surface, learning data types is pretty straightforward: if it's a whole number, use an integer; if it's a decimal, use a float; if it's text, use a string; and if it's true or false, use a boolean. Once you get a little deeper into programming, however, you'll realize that data types are more complex than that. Numeric types in particular come in several variations, and each one has its own unique properties and behaviors.
For example, Python itself has a single arbitrary-precision int type, but numerical libraries like NumPy and Pandas (which we'll use later) expose four fixed-width integer types: int8, int16, int32, and int64. Each of these types has a different range of values it can represent, and each one takes up a different amount of memory. Floats come in similar fixed-width flavors.
Properly typing your data can have massive effects on performance, memory usage, and security. For example, using a smaller integer type like int8 instead of int64 can save memory and improve performance. Similarly, using a strongly typed array instead of a dynamic array can improve memory usage and reduce the risk of security vulnerabilities.
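To make the typed-versus-dynamic point concrete, here is a minimal sketch of my own (not from any particular library) comparing a plain Python list of integers to Python's built-in array module, which stores raw fixed-width values:

```python
import array
import sys

values = list(range(1000))

# A dynamic list stores pointers to full Python int objects,
# so we count the container plus every boxed integer it references.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A typed array stores raw 1-byte signed integers ('b' is comparable to int8),
# so each element costs exactly one byte.
typed = array.array('b', [v % 128 for v in values])
typed_bytes = sys.getsizeof(typed)

print(f"list:  {list_bytes:,} bytes")
print(f"array: {typed_bytes:,} bytes")
```

On CPython the gap is roughly an order of magnitude; the exact numbers vary by version and platform.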
## Good ol' Memory
If you've been around computers a long time, you've likely seen the evolution of memory firsthand: from the days of kilobytes to megabytes to gigabytes to terabytes, memory has come a long way. Memory is a crucial aspect of programming, and it's important to understand how data types impact its usage. When you create a variable in a program, the computer allocates memory to store the data associated with that variable, and the amount of memory allocated depends on the data type of the variable.
Modern computers use either a 32-bit or 64-bit architecture, meaning the processor natively handles data in 32-bit or 64-bit chunks. This matters because different data types require different amounts of memory to store. For example, an int8 variable requires 8 bits of memory, while an int64 variable requires 64 bits, so an int8 uses one-eighth the memory of an int64 to store the same (sufficiently small) value. Depending on your environment, these memory savings can be absolutely crucial; they can be the difference between being able to work with your data and not even being able to load it into memory.
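A quick NumPy sketch makes the difference tangible; the byte counts below assume a typical 64-bit platform:

```python
import numpy as np

million = np.arange(1_000_000)

as_int64 = million.astype(np.int64)
as_int8 = (million % 100).astype(np.int8)  # keep values inside int8's range

print(as_int64.nbytes)  # 8,000,000 bytes (~8 MB)
print(as_int8.nbytes)   # 1,000,000 bytes (~1 MB)
```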
I've created a couple of tools to help ingrain how memory is stored. Please feel free to play around with the data type visualizer here to get a feel for how data types are stored in memory.
[Interactive widget: Data Type Memory Visualizer]
Here is also a handy chart to help you understand how much memory different data types use:
Data Types Reference
| Type | Size | Range | Common Usage |
|---|---|---|---|
| int8_t (char) | 1 byte (8 bits) | -128 to 127 | Small integers, ASCII characters |
| uint8_t | 1 byte (8 bits) | 0 to 255 | Byte values, small positive numbers |
| int16_t | 2 bytes (16 bits) | -32,768 to 32,767 | Medium-range integers |
| uint16_t | 2 bytes (16 bits) | 0 to 65,535 | Port numbers, medium positive numbers |
| int32_t | 4 bytes (32 bits) | -2,147,483,648 to 2,147,483,647 | General purpose integers |
| uint32_t | 4 bytes (32 bits) | 0 to 4,294,967,295 | Large positive numbers, RGB colors |
| int64_t | 8 bytes (64 bits) | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | Very large integers, timestamps |
| uint64_t | 8 bytes (64 bits) | 0 to 18,446,744,073,709,551,615 | File sizes, very large positive numbers |
| float | 4 bytes (32 bits) | ±1.18e-38 to ±3.4e+38 | Basic decimal calculations |
| double | 8 bytes (64 bits) | ±2.23e-308 to ±1.80e+308 | Precise decimal calculations |
| char | 1 byte (8 bits) | 0 to 255 | Single characters, ASCII values |
| bool | 1 byte (8 bits) | false / true | True/false values |
| pointer | 8 bytes (64 bits) | 0x0 to 0xFFFFFFFFFFFFFFFF | Memory addresses |
| Array | n × element_size bytes | 0 elements to memory limit | Fixed-size sequential collections |
| Dynamic Array | 12 + (n × element_size) bytes | 0 elements to memory limit | Resizable sequential collections |
| List | 24 + (n × (element_size + ptr_size)) bytes | 0 elements to memory limit | Linked data structures |
| Tuple | sum(element_sizes) bytes | Fixed size | Mixed-type fixed collections |
| Set | 16 + (n × element_size) bytes | 0 elements to memory limit | Unique value collections |
| Map/Dictionary | 24 + (n × (key_size + value_size)) bytes | 0 pairs to memory limit | Key-value associations |
| String | 8 + n + 1 bytes | Empty string to memory limit | Text storage |
| Struct | sum(field_sizes) bytes | Fixed size | Custom data grouping |
| Union | max(member_sizes) bytes | Largest member | Memory-efficient variants |
| Class | 8 + sum(field_sizes) bytes | vtable + fields | Object-oriented types |
| Enum | 1-4 bytes (8-32 bits) | 0 to 2^32 - 1 | Named constants |
Size Notation
- n: number of elements
- element_size: size of each element in bytes
- ptr_size: size of a pointer (typically 8 bytes on 64-bit systems)
- field_sizes: combined size of all fields in a structure
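If you want to sanity-check a few of the scalar rows yourself, Python's struct module reports the packed size of each C type; on a typical 64-bit system you should see the same byte counts as the chart:

```python
import struct

print(struct.calcsize('b'))  # int8_t   -> 1
print(struct.calcsize('h'))  # int16_t  -> 2
print(struct.calcsize('i'))  # int32_t  -> 4
print(struct.calcsize('q'))  # int64_t  -> 8
print(struct.calcsize('f'))  # float    -> 4
print(struct.calcsize('d'))  # double   -> 8
print(struct.calcsize('P'))  # pointer  -> 8 on 64-bit builds
```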
## YeeYeeYeeYee Live Action
Let's dive into a real-world example to see how data typing can directly impact the performance and memory usage of a program. We'll use Python and Pandas to load a dataset and optimize its data types to reduce memory usage, and we'll look at how different data types are represented in memory and how they affect performance. This example is specific to Python and Pandas, but the concepts apply to any programming language and data processing library. Also keep in mind that Pandas automatically assigns data types to columns when loading data, defaulting to wide types such as int64 for integers even when a smaller type like int8 would do, so it's important to check and optimize the dtypes yourself. Let's get started!
The dataset that will be used for this example is a dataset of NFL player statistics sourced from Kaggle. If you would like to follow along, the dataset can be downloaded from here: NFL Player Stats
There are two datasets included, but we will focus specifically on the games data.
Our dataset contains 46 columns and 1,024,164 rows, so it is fairly large and a good example of how memory optimization can lead to serious cost and time savings.
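If you're following along, a minimal loading sketch looks like this; the file name and date parsing are assumptions on my part, so adjust them to match your copy of the Kaggle download:

```python
import pandas as pd

# File name is an assumption; rename to match your download
nfl_game_stats_df = pd.read_csv("games.csv", parse_dates=["date"])
nfl_game_stats_df.info(memory_usage="deep")
```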
Here is a quick look at the available columns in the dataset:
```text
player_id                  date                        game_number                      age
passing_attempts           passing_completions         passing_yards                    passing_rating
passing_touchdowns         passing_interceptions       passing_sacks                    passing_sacks_yards_lost
rushing_attempts           rushing_yards               rushing_touchdowns               receiving_targets
receiving_receptions       receiving_yards             receiving_touchdowns             kick_return_attempts
kick_return_yards          kick_return_touchdowns      punt_return_attempts             punt_return_yards
punt_return_touchdowns     defense_sacks               defense_tackles                  defense_tackle_assists
defense_interceptions      defense_interception_yards  defense_interception_touchdowns  defense_safeties
point_after_attemps        point_after_makes           field_goal_attempts              field_goal_makes
punting_attempts           punting_yards               punting_blocked                  team
game_location              opponent                    game_won                         year
player_team_score          opponent_score
```
To get an initial look at the memory usage of the dataset, we can use the following code:
```python
nfl_game_stats_df.memory_usage(deep=True).sum() / 1024**2
```
The dataset initially uses 525.47 MB of memory. That's a fairly large footprint, and we can shrink it by converting columns to more appropriate types. For example, we can convert int64 columns to int8 or int16 when their values will never exceed the maximum those types support, and we can convert object columns to category when a column has a finite number of unique values.
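As a side note, Pandas can do some of this downcasting for you: pd.to_numeric with the downcast argument picks the smallest numeric type that fits the values. A tiny sketch with made-up values:

```python
import pandas as pd

s = pd.Series([0, 12, 255])  # stored as int64 by default

print(pd.to_numeric(s, downcast="unsigned").dtype)  # uint8
print(pd.to_numeric(s, downcast="integer").dtype)   # int16 (smallest signed fit)
```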
Let's take a deeper look at the data types of the columns in the dataset and their memory usage to see how we can optimize them:
```python
import numpy as np
import pandas as pd

# Get memory usage of each column and convert bytes to MB
# (deep=True includes the contents of object columns)
mem_usage = nfl_game_stats_df.memory_usage(deep=True) / 1024**2

# Get the data type of each column
dtypes = nfl_game_stats_df.dtypes

# Split the data into two halves for easier side-by-side viewing
left, right = np.array_split(mem_usage, 2)
left = left[1:]  # drop the Index entry
left = left.reset_index()
left.columns = ["Columns", "Values"]
right = right.reset_index()
right.columns = ["Columns", "Values"]

df1 = pd.DataFrame({
    'Features': left['Columns'],
    'Memory (MB)': left['Values'],
    'Dtype': [dtypes[col] for col in left['Columns']]
})
df2 = pd.DataFrame({
    'Features': right['Columns'],
    'Memory (MB)': right['Values'],
    'Dtype': [dtypes[col] for col in right['Columns']]
})
pd.concat([df1, df2], axis=1)
```
| | Features | Memory (MB) | Dtype | Features | Memory (MB) | Dtype |
|---|---|---|---|---|---|---|
| 0 | player_id | 7.814 | int64 | receiving_receptions | 7.814 | int64 |
| 1 | year | 7.814 | int64 | receiving_yards | 7.814 | int64 |
| 2 | date | 7.814 | datetime64[ns] | receiving_touchdowns | 7.814 | int64 |
| 3 | game_number | 7.814 | int64 | kick_return_attempts | 7.814 | int64 |
| 4 | age | 53.717 | object | kick_return_yards | 7.814 | int64 |
| 5 | team | 50.789 | object | kick_return_touchdowns | 7.814 | int64 |
| 6 | game_location | 48.836 | object | punt_return_attempts | 7.814 | int64 |
| 7 | opponent | 50.789 | object | punt_return_yards | 7.814 | int64 |
| 8 | game_won | 0.977 | bool | punt_return_touchdowns | 7.814 | int64 |
| 9 | player_team_score | 7.814 | int64 | defense_sacks | 7.814 | float64 |
| 10 | opponent_score | 7.814 | int64 | defense_tackles | 7.814 | int64 |
| 11 | passing_attempts | 7.814 | int64 | defense_tackle_assists | 7.814 | int64 |
| 12 | passing_completions | 7.814 | int64 | defense_interceptions | 7.814 | int64 |
| 13 | passing_yards | 7.814 | int64 | defense_interception_yards | 7.814 | int64 |
| 14 | passing_rating | 7.814 | float64 | defense_interception_touchdowns | 7.814 | int64 |
| 15 | passing_touchdowns | 7.814 | int64 | defense_safeties | 7.814 | int64 |
| 16 | passing_interceptions | 7.814 | int64 | point_after_attemps | 7.814 | int64 |
| 17 | passing_sacks | 7.814 | int64 | point_after_makes | 7.814 | int64 |
| 18 | passing_sacks_yards_lost | 7.814 | int64 | field_goal_attempts | 7.814 | int64 |
| 19 | rushing_attempts | 7.814 | int64 | field_goal_makes | 7.814 | int64 |
| 20 | rushing_yards | 7.814 | int64 | punting_attempts | 7.814 | int64 |
| 21 | rushing_touchdowns | 7.814 | int64 | punting_yards | 7.814 | int64 |
| 22 | receiving_targets | 7.814 | int64 | punting_blocked | 7.814 | int64 |
There are a few things that immediately stand out in the output:
- The object columns are using a significant amount of memory.
- The int64 columns are using more memory than necessary.
- All of the 64-bit columns use the same amount of memory, a great visual reminder that memory usage is determined by the data type, not by the values stored in it.
Let's take a look at the object columns first, since they use by far the most memory:
```python
nfl_game_stats_df[['age', 'team', 'game_location', 'opponent']].head()
```
| | age | team | game_location | opponent |
|---|---|---|---|---|
| 0 | 23-120 | SEA | A | CHI |
| 1 | 23-127 | SEA | H | RAI |
| 2 | 23-134 | SEA | A | DEN |
| 3 | 23-142 | SEA | H | CIN |
| 4 | 23-148 | SEA | A | NWE |
You might be wondering: why is the age column an object? Shouldn't it be an integer? The output makes it clear that age isn't a whole number; it represents age in years and days, so the first player is 23 years and 120 days old. This is a perfect example of why it's important to check the data types of your columns and make sure they are appropriate for the data they contain. The team, game_location, and opponent columns are all strings, so they are correctly typed as objects; however, they can be better typed as categories.
What does a category do? A category is a data type used to represent a fixed number of unique values; in simpler terms, if a variable is categorical, type it as such. It is far more memory efficient than the object data type, especially when the number of unique values is small.
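Under the hood, a category column stores each unique value once and keeps a small integer code per row, which is where the savings come from. Here's a toy sketch with made-up values:

```python
import pandas as pd

teams = pd.Series(["SEA", "CHI", "SEA", "DEN", "SEA"]).astype("category")

print(teams.cat.categories)      # Index(['CHI', 'DEN', 'SEA'], dtype='object')
print(teams.cat.codes.tolist())  # [2, 0, 2, 1, 2]
```

A quick way to check whether a column is a good category candidate is Pandas' .nunique() method: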
```python
nfl_game_stats_df[['age', 'team', 'game_location', 'opponent']].nunique()
```
| | count |
|---|---|
| age | 8125.000 |
| team | 42.000 |
| game_location | 3.000 |
| opponent | 42.000 |
We know that the dataset contains 1,024,164 rows, and since all four of these columns have a very low number of unique values relative to that (even age, at 8,125), we can treat them as categorical variables and convert them to the category data type to save memory:
```python
memory_data = []
columns = ['age', 'team', 'game_location', 'opponent']

for col in columns:
    # Memory in MB before and after converting the column to category
    before = nfl_game_stats_df[col].memory_usage(deep=True) / 1024**2
    after = nfl_game_stats_df[col].astype("category").memory_usage(deep=True) / 1024**2
    memory_data.append({
        'column': col,
        'before_mb': before,
        'after_mb': after,
        'mb_saved': before - after
    })

memory_df = pd.DataFrame(memory_data).set_index('column').rename_axis(None)
memory_df.loc['total'] = [memory_df['before_mb'].sum(),
                          memory_df['after_mb'].sum(),
                          memory_df['mb_saved'].sum()]
memory_df
```
| | before_mb | after_mb | mb_saved |
|---|---|---|---|
| age | 53.718 | 2.632 | 51.086 |
| team | 50.790 | 0.980 | 49.810 |
| game_location | 48.836 | 0.977 | 47.859 |
| opponent | 50.790 | 0.980 | 49.810 |
| total | 204.133 | 5.569 | 198.564 |
By making this one simple change, we saved 198.56 MB of memory, which is significant considering the entire dataset is only 525.47 MB; that's nearly 38% of its total memory usage. Strings/objects can really be detrimental to memory usage, so it's important to handle them appropriately. This brings the total memory usage down to 326.9 MB, a huge improvement.
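Here is a short sketch of applying the conversion. I'm working on a copy so the original frame stays around for the benchmark comparison later; whether you convert in place or on a copy is up to you:

```python
nfl_game_stats_df_optimized = nfl_game_stats_df.copy()
for col in ['age', 'team', 'game_location', 'opponent']:
    nfl_game_stats_df_optimized[col] = nfl_game_stats_df_optimized[col].astype('category')

# Re-check the total: ~326.9 MB after this step
print(nfl_game_stats_df_optimized.memory_usage(deep=True).sum() / 1024**2)
```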
Next, let's take a look at the integer columns and see whether we can optimize them as well, using the same approach to check memory usage and convert to smaller types where possible. I wrote a function that checks the min and max of each integer column and determines the optimal integer type based on the range of values:
```python
def optimize_ints_report(df):
    memory_data = []
    int_cols = df.select_dtypes(include=['int64']).columns

    for col in int_cols:
        col_min = df[col].min()
        col_max = df[col].max()
        before = df[col].memory_usage(deep=True) / 1024**2

        # Determine the smallest integer type that can hold the column's range
        if col_min >= 0:
            dtype = ('uint8' if col_max < 256 else
                     'uint16' if col_max < 65536 else
                     'uint32' if col_max < 4294967296 else
                     'uint64')
        else:
            dtype = ('int8' if -128 <= col_min and col_max < 128 else
                     'int16' if -32768 <= col_min and col_max < 32768 else
                     'int32' if -2147483648 <= col_min and col_max < 2147483648 else
                     'int64')

        after = df[col].astype(dtype).memory_usage(deep=True) / 1024**2
        memory_data.append({
            'column': col,
            'before_mb': before,
            'after_mb': after,
            'mb_saved': before - after,
            'optimal_type': dtype
        })

    memory_df = pd.DataFrame(memory_data).set_index('column').rename_axis(None)
    memory_df.loc['total'] = [memory_df['before_mb'].sum(),
                              memory_df['after_mb'].sum(),
                              memory_df['mb_saved'].sum(),
                              '']
    return memory_df

optimize_ints_report(nfl_game_stats_df)
```
| | before_mb | after_mb | mb_saved | optimal_type |
|---|---|---|---|---|
| player_id | 7.814 | 1.954 | 5.860 | uint16 |
| year | 7.814 | 1.954 | 5.860 | uint16 |
| game_number | 7.814 | 0.977 | 6.837 | uint8 |
| player_team_score | 7.814 | 0.977 | 6.837 | uint8 |
| opponent_score | 7.814 | 0.977 | 6.837 | uint8 |
| passing_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| passing_completions | 7.814 | 0.977 | 6.837 | uint8 |
| passing_yards | 7.814 | 1.954 | 5.860 | int16 |
| passing_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| passing_interceptions | 7.814 | 0.977 | 6.837 | uint8 |
| passing_sacks | 7.814 | 0.977 | 6.837 | uint8 |
| passing_sacks_yards_lost | 7.814 | 0.977 | 6.837 | uint8 |
| rushing_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| rushing_yards | 7.814 | 1.954 | 5.860 | int16 |
| rushing_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| receiving_targets | 7.814 | 0.977 | 6.837 | uint8 |
| receiving_receptions | 7.814 | 0.977 | 6.837 | uint8 |
| receiving_yards | 7.814 | 1.954 | 5.860 | int16 |
| receiving_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| kick_return_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| kick_return_yards | 7.814 | 1.954 | 5.860 | int16 |
| kick_return_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| punt_return_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| punt_return_yards | 7.814 | 1.954 | 5.860 | int16 |
| punt_return_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| defense_tackles | 7.814 | 0.977 | 6.837 | uint8 |
| defense_tackle_assists | 7.814 | 0.977 | 6.837 | uint8 |
| defense_interceptions | 7.814 | 0.977 | 6.837 | uint8 |
| defense_interception_yards | 7.814 | 1.954 | 5.860 | int16 |
| defense_interception_touchdowns | 7.814 | 0.977 | 6.837 | uint8 |
| defense_safeties | 7.814 | 0.977 | 6.837 | uint8 |
| point_after_attemps | 7.814 | 0.977 | 6.837 | uint8 |
| point_after_makes | 7.814 | 0.977 | 6.837 | uint8 |
| field_goal_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| field_goal_makes | 7.814 | 0.977 | 6.837 | uint8 |
| punting_attempts | 7.814 | 0.977 | 6.837 | uint8 |
| punting_yards | 7.814 | 1.954 | 5.860 | int16 |
| punting_blocked | 7.814 | 0.977 | 6.837 | uint8 |
| total | 296.927 | 45.911 | 251.017 | |
It's like magic, really. By converting the integer columns to their optimal integer types, we saved an additional 251.02 MB of memory, bringing the total memory usage of the dataset down to 75.89 MB. That's an 85.6% reduction in memory usage, which is absolutely incredible.
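To actually apply those recommendations, one approach (a sketch reusing optimize_ints_report() from above) is to turn the report's optimal_type column into a dtype mapping and hand it to astype:

```python
# Drop the 'total' summary row, then map each column to its optimal type
report = optimize_ints_report(nfl_game_stats_df_optimized).drop('total')
dtype_map = report['optimal_type'].to_dict()

nfl_game_stats_df_optimized = nfl_game_stats_df_optimized.astype(dtype_map)
```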
To put these memory savings to the test, we'll run both the unoptimized and the optimized dataset through a gauntlet of benchmarks: value counts, a complex groupby, string operations, and complex filtering.
```python
import time

def benchmark_extended_ops():
    # Value counts
    start = time.time()
    nfl_game_stats_df['team'].value_counts()
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized['team'].value_counts()
    t2 = time.time() - start
    print(f"Value counts - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # Multiple groupby operations
    start = time.time()
    nfl_game_stats_df.groupby(['team', 'year'])['passing_yards'].mean()
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized.groupby(['team', 'year'], observed=True)['passing_yards'].mean()
    t2 = time.time() - start
    print(f"Complex groupby - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # String operations
    start = time.time()
    nfl_game_stats_df['team'].str.contains('A')
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized['team'].str.contains('A')
    t2 = time.time() - start
    print(f"String ops - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

    # Complex filtering
    start = time.time()
    nfl_game_stats_df[(nfl_game_stats_df['passing_yards'] > 200) &
                      (nfl_game_stats_df['rushing_yards'] > 50)]
    t1 = time.time() - start
    start = time.time()
    nfl_game_stats_df_optimized[(nfl_game_stats_df_optimized['passing_yards'] > 200) &
                                (nfl_game_stats_df_optimized['rushing_yards'] > 50)]
    t2 = time.time() - start
    print(f"Complex filter - Original: {t1:.2f}s, Optimized: {t2:.2f}s")

benchmark_extended_ops()
```
| Operation | Original (s) | Optimized (s) |
|---|---|---|
| Value counts | 0.13 | 0.00 |
| Complex groupby | 0.12 | 0.02 |
| String ops | 0.24 | 0.00 |
| Complex filter | 0.02 | 0.00 |
The proof is in the pudding: by optimizing the column data types, we significantly reduced the time it takes to perform operations on the data. Smaller data types are faster to process and require less memory to move around, which leads to faster execution times. When you're working with huge datasets or performing many operations, these time savings add up very quickly.
## All Good Things Come To An End
In conclusion, data typing is a crucial aspect of programming that can have a huge impact on the performance, memory usage, and security of your programs. By properly typing your data, you can make your programs run faster, use less memory, and be more secure. It's important to understand the data types available to you and how they can represent your data. By taking the time to optimize the data types in your columns, and in your projects in general, you can save a significant amount of memory and improve performance. Remember, data is the lifeblood of the world, and data typing is the key to unlocking its full potential.