Am I Your Type?

Christopher Sanchez
Programming Languages, Systems & Low Level, Core Development

Today, data is more valuable than gold. Many companies have been building revolutionary machine learning and artificial intelligence products for decades now. Algorithms make our lives easier in countless ways, from identifying disease to predicting the weather to recommending movies. AI is revolutionizing the world, and data is the fuel that powers it. Recently, with the release of large language models like ChatGPT, Claude, and Gemini, the world has seen the power of AI and data science. Millions of people now have a seemingly endless amount of information at their fingertips. We wouldn't have any of it without data.

Everywhere we look, everywhere we go, everything we see is data. Data is becoming the lifeblood of the world. It is becoming the foundation of everything we do, from the way we communicate to the way we work and, heck, even the way we drive. Data is everywhere, it's not going anywhere, and it is growing at an exponential rate. With the rise of big data, data science, and machine learning, the importance of data has never been more obvious. But what is data? To make it as simple as possible, data is information, and information can be anything from numbers to text to images. However, data is more than just information; every piece of data also has a type. Data types are crucial for understanding how to work with data, and they play a significant role in programming.

Introduction to Data Typing

In programming, data types are used to separate different types of data, such as integers, strings, and booleans. It is incredibly important to ensure you are properly typing your variables, as it can have a huge impact on the performance, memory management, and security of your program. Data typing is the process of assigning a type to a variable, which tells the computer how to interpret the data stored in that variable.
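To make "how to interpret the data" concrete, here is a minimal sketch using Python's standard struct module. It stores the float 1.0 as four bytes and then reads those same four bytes back under two different type labels. The variable names are only for illustration.

```python
import struct

# Pack the float 1.0 into its 4-byte IEEE 754 single-precision form (little-endian).
raw = struct.pack("<f", 1.0)

# Read the very same 4 bytes back under two different type labels.
as_float = struct.unpack("<f", raw)[0]
as_int = struct.unpack("<i", raw)[0]

print(raw.hex())   # 0000803f
print(as_float)    # 1.0
print(as_int)      # 1065353216
```

The bytes never change; only the type label does, and that label alone decides whether the computer sees 1.0 or 1,065,353,216.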

There are literally dozens of data types spread across dozens of programming languages, but the most common ones are integers, floats, strings, and booleans. An integer is a whole number, a float is a number with a decimal point, a string is a sequence of characters, and a boolean is a value that is either true or false. These data types are the building blocks of programming, and they are used to represent everything from numbers to text to logic.

For example:

# Integers (whole numbers)
age = 25
year = 2024
temperature = -5

# Floats (decimal numbers)
height = 5.11
pi = 3.14159
bank_balance = -123.45

# Strings (text)
name = "Chris"
message = 'I am the master of the universe!'
address = "123 Data Science St."

# Booleans (True/False)
is_student = True
has_license = False
is_raining = True

On the surface, learning data types is pretty straightforward: if it's a whole number use an integer, if it's a decimal use a float, if it's text use a string, and if it's a true or false statement use a boolean. However, once you get a little bit deeper into programming, you'll realize that data types are much more complex than that. There are many different types of integers, floats, strings, and booleans, and each one has its own unique properties and behaviors.
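A few quick checks in plain Python hint at those unique behaviors. This is just a small sketch; the variable names are made up for the example.

```python
# Each literal gets its type from how it is written.
print(type(25).__name__, type(5.11).__name__, type("Chris").__name__, type(True).__name__)
# int float str bool

# Floats are binary approximations, so decimal arithmetic is not always exact.
float_sum_is_exact = (0.1 + 0.2 == 0.3)   # False

# In Python, bool is a subclass of int, so True behaves like 1 in arithmetic.
true_count = True + True                   # 2

# "+" means concatenation for strings but addition for integers.
joined = "5" + "5"                         # "55"
added = 5 + 5                              # 10

print(float_sum_is_exact, true_count, joined, added)
```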

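For example, Python's built-in int is a single type that can grow as large as it needs to, but numeric libraries like NumPy and Pandas (and languages like C) offer four fixed-width signed integer types: int8, int16, int32, and int64. Each of these types has a different range of values that it can represent, and each one takes up a different amount of memory. Similarly, there are different widths of floats, such as float32 and float64, and different ways of storing strings and booleans.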
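The range of a fixed-width signed integer follows directly from its number of bits. The sketch below computes those ranges, shows that Python's built-in int is not bound by them, and uses the standard ctypes module to show what happens when a value doesn't fit in 8 bits. The helper signed_range is a made-up name for this example.

```python
import ctypes

def signed_range(bits):
    """Return (min, max) for a two's-complement signed integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for bits in (8, 16, 32, 64):
    low, high = signed_range(bits)
    print(f"int{bits}: {low:,} to {high:,}")

# Python's built-in int is not limited to any of these widths.
big = 2 ** 64 + 1

# ctypes does no overflow checking, so 200 wraps around in an 8-bit signed slot.
wrapped = ctypes.c_int8(200).value
print(wrapped)   # -56
```

Every extra bit doubles the range, which is why int8 tops out at 127 while int32 reaches a little over 2.1 billion.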

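Properly typing your data can have massive effects on performance, memory usage, and security. For example, using a smaller integer type like int8 instead of int64 can save memory and improve performance, as long as every value fits in the smaller type. Similarly, using a typed array, where every element has the same fixed-width type, instead of a general-purpose dynamic list can reduce memory usage and rule out a class of bugs, some of them security-relevant, in which a value of the wrong type or size slips into the data.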
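Here is a rough sketch of that idea using Python's standard array module, which stores elements in a fixed-width type. The ages list is hypothetical data for the example.

```python
from array import array

ages = [25, 31, 47, 62, 18]   # hypothetical values, all within -128..127

as_int64 = array("q", ages)   # signed 64-bit elements
as_int8 = array("b", ages)    # signed 8-bit elements

bytes_int64 = as_int64.itemsize * len(as_int64)   # 8 bytes per element
bytes_int8 = as_int8.itemsize * len(as_int8)      # 1 byte per element
print(bytes_int64, bytes_int8)   # 40 5

# A typed array refuses values outside its range instead of storing them silently.
try:
    as_int8.append(200)
    rejected = False
except OverflowError:
    rejected = True
print(rejected)   # True
```

The same five numbers take 40 bytes as 64-bit integers and 5 bytes as 8-bit integers, and the 8-bit array raises an error rather than accepting a value it cannot represent.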

Let's take a look at an example to see how data typing can directly impact the performance and memory usage of a program. Pretend you have a data set that contains the demographic information of 1 million people. Each person has an ID, a name, and an age. ID is an integer, name is a string, and age is an integer. If the data is loaded with a library like Pandas, the default data types will typically be int64 for the ID, object for the name, and int64 for the age. This is because Pandas defaults to wide, general-purpose data types that are safe for almost any value, which can lead to wasted memory and slower performance. For example, if the IDs simply count up from 1 to 1 million, we can use an int32 for the ID column, since no value in it will ever come close to 2,147,483,647, the maximum supported by a 32-bit signed integer. For the age column we can use an int8, since the age of a person will never surpass 127, the maximum an int8 can hold. For the name column we can use a category data type, which stores each distinct name only once and works well when many people share the same names.

By using the correct data types, we can reduce the memory usage of the data set quite dramatically. Going from int64 to int32 takes each ID from 8 bytes to 4, a 50% reduction, and going from int64 to int8 takes each age from 8 bytes to 1, an 87.5% reduction. The memory used for the name column can drop by around 90%, although the exact figure depends on how often names repeat. This can have a huge impact on the performance of the program, since there is less data to hold in memory, process, and analyze.
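Here is a minimal sketch of that example in Pandas. It assumes a made-up data set (sequential IDs, five repeating names, ages 0 to 99) and uses 100,000 rows instead of 1 million so it runs quickly; the percentages for the integer columns do not depend on the row count. The column names and the name pool are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

N_ROWS = 100_000  # fewer rows than the article's 1 million so the sketch runs quickly

# Hypothetical data set: sequential IDs, names drawn from a small pool, ages 0-99.
name_pool = ["Alice", "Bob", "Carol", "Dave", "Erin"]
df = pd.DataFrame({
    "id": np.arange(N_ROWS, dtype="int64"),
    "name": pd.Series([name_pool[i % len(name_pool)] for i in range(N_ROWS)], dtype="object"),
    "age": np.arange(N_ROWS, dtype="int64") % 100,
})

# Bytes used per column, counting the Python string objects behind "name".
before = df.memory_usage(index=False, deep=True)

# Downcast each column to the narrowest type that still fits its values.
optimized = df.astype({"id": "int32", "name": "category", "age": "int8"})
after = optimized.memory_usage(index=False, deep=True)

savings_pct = ((1 - after / before) * 100).round(1)
print(savings_pct)   # id: 50.0, age: 87.5; the name figure depends on how often names repeat
```

With only five distinct names the category column shrinks a lot. A column where almost every name is unique would save far less, so it is worth measuring on your own data before converting.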

Feel free to play around with the tool below to see how different data types are represented in memory:


Data Type Memory Visualizer

Type: 32-bit signed integer (-2,147,483,648 to 2,147,483,647)
Size in memory: 4 bytes
Memory: stack
Input format: a whole number within the valid range


Number Base Converter

Input base: Base-10 (valid characters: 0123456789)

Binary (Base-2): 101010
Octal (Base-8): 52
Decimal (Base-10): 42
Hexadecimal (Base-16): 2A

Each digit is multiplied by its position value (10^position) and then summed:

2 × 10^0 = (2 × 1) = 2
4 × 10^1 = (4 × 10) = 40
Total: 42 (decimal)
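The same conversions can be reproduced in a few lines of Python. This is only a sketch of the arithmetic the converter shows; positional_value is a made-up helper name, not part of the tool.

```python
n = 42

# Format the same integer in each base.
binary = format(n, "b")   # "101010"
octal = format(n, "o")    # "52"
hexa = format(n, "X")     # "2A"
print(binary, octal, hexa)

def positional_value(digits, base):
    """Sum digit * base**position, with position 0 at the rightmost digit."""
    return sum(int(d, base) * base ** pos for pos, d in enumerate(reversed(digits)))

print(positional_value("42", 10))      # 42  (4 × 10 + 2 × 1)
print(positional_value("101010", 2))   # 42  (32 + 8 + 2)
print(positional_value("2A", 16))      # 42  (2 × 16 + 10 × 1)
```

Python's built-in int("2A", 16) does the same parsing; the helper is spelled out only to show the digit-times-position-value sum.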

int x = 42;
int y = x;

Step 1 of 4: Load immediate value into register

The value 42 is first loaded into the RAX register. On modern architectures, all operations typically go through registers first. The value 42 is represented as 0x2A in hexadecimal. The register shows the full 64-bit representation (padded with zeros).

Registers

Register  Value       Kind
rax       0x00000000  general
rbx       0x00000000  general
rcx       0x00000000  general
rdx       0x00000000  general
rsi       0x00000000  pointer
rdi       0x00000000  pointer
xmm0      0x00000000  simd
xmm1      0x00000000  simd

Data Types Reference

Type            Size (bytes)                            Size (bits)
int8_t (char)   1                                       8
uint8_t         1                                       8
int16_t         2                                       16
uint16_t        2                                       16
int32_t         4                                       32
uint32_t        4                                       32
int64_t         8                                       64
uint64_t        8                                       64
float           4                                       32
double          8                                       64
char            1                                       8
bool            1                                       8
pointer         8                                       64
Array           n × element_size                        n × element_bits
Dynamic Array   12 + (n × element_size)                 96 + (n × element_bits)
List            24 + (n × (element_size + ptr_size))    192 + (n × (element_bits + 64))
Tuple           sum(element_sizes)                      sum(element_bits)
Set             16 + (n × element_size)                 128 + (n × element_bits)
Map/Dictionary  24 + (n × (key_size + value_size))      192 + (n × (key_bits + value_bits))
String          8 + n + 1                               64 + (n × 8) + 8
Struct          sum(field_sizes)                        sum(field_bits)
Union           max(member_sizes)                       max(member_bits)
Class           8 + sum(field_sizes)                    64 + sum(field_bits)
Enum            1-4                                     8-32

Size Notation

n: Number of elements

element_size: Size of each element in bytes

element_bits: Size of each element in bits

ptr_size: Size of a pointer (typically 8 bytes on 64-bit systems)

field_sizes: Combined size of all fields in a structure
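If you want to check some of these sizes on your own machine, Python's standard ctypes module exposes the fixed-width C types. The sketch below prints a few of them and then builds a small struct to show that sum(field_sizes) is best read as a lower bound, since compilers may add padding to keep fields aligned. The Record struct and its field names are hypothetical.

```python
import ctypes

# Fixed-width C types as exposed by ctypes; sizes are in bytes.
sizes = {
    "int8_t": ctypes.sizeof(ctypes.c_int8),
    "int16_t": ctypes.sizeof(ctypes.c_int16),
    "int32_t": ctypes.sizeof(ctypes.c_int32),
    "int64_t": ctypes.sizeof(ctypes.c_int64),
    "float": ctypes.sizeof(ctypes.c_float),
    "double": ctypes.sizeof(ctypes.c_double),
    "pointer": ctypes.sizeof(ctypes.c_void_p),  # 8 on 64-bit systems, 4 on 32-bit
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size * 8} bits)")

class Record(ctypes.Structure):
    """Hypothetical struct with a 1-byte field followed by a 4-byte field."""
    _fields_ = [("flag", ctypes.c_int8), ("count", ctypes.c_int32)]

field_sum = ctypes.sizeof(ctypes.c_int8) + ctypes.sizeof(ctypes.c_int32)  # 5
actual = ctypes.sizeof(Record)  # typically 8: padding aligns count to 4 bytes
print(field_sum, actual)
```

On most platforms the struct occupies 8 bytes even though its fields add up to 5, because three padding bytes are inserted after the 1-byte field.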