Data types are the building blocks of machine learning, and choosing the right one is crucial for success. This blog post delves into the world of data types, exploring their impact on supervised, unsupervised, and reinforcement learning. Whether dealing with dependent and independent variables or navigating the realms of qualitative and quantitative data, a nuanced grasp of these concepts is essential for making informed decisions in the world of artificial intelligence and data science.
Data Types: A Bird’s Eye View
Broadly, data falls into two categories: qualitative and quantitative.
- Qualitative data: Think text, images, videos, and sounds. It’s descriptive and non-numerical, representing qualities or categories. It can be further classified into:
- Structured Data: Structured qualitative data lives in databases like MySQL or MSSQL and can be further categorized:
- Nominal:
- Categorical data like gender (male/female) or loan approval status (approved/rejected).
- It is used to classify or identify an element.
- Its values can be thought of as labels or names.
- A label can be a number or text that represents a particular category.
- Example:
- 0 and 1 can represent male and female or vice-versa
- 0 and 1 can represent loan given or not or vice-versa
- M and F letters can represent male and female respectively
- “male” and “female” words represent male and female gender respectively
- “0”, “1–3” and “>3” can represent the number of children: “0” means no children, “1–3” means one to three children, and “>3” means more than three children.
- Ordinal:
- Ranked data with order but no fixed intervals, like exam grades (1st, 2nd, 3rd).
- In addition to the characteristics of the Nominal scale, it carries order or rank, which allows different elements to be compared.
- Elements can be arranged as per their order or rank.
- Example:
- Exam rank: “1st Rank” has a higher score than “2nd Rank”, and “2nd Rank” has a higher score than “3rd Rank”, but we cannot say how large the score difference between 1st rank and 2nd rank is.
- The income of people can be categorized as “High”, “Medium” and “Low”: people with “High” income earn more than those in the “Medium” and “Low” categories, but the earning difference between a high-income person and a medium-income person is not known.
- Unstructured Data: Unstructured qualitative data is often stored in NoSQL databases like MongoDB or Cassandra. As mentioned above, it can be anything from free text to videos, images and sound.
- Quantitative data: Numbers rule here. This data type is numerical and can be further classified into:
- Interval Scale:
- Data with order and equal differences between values, like temperature in Celsius or Fahrenheit.
- It contains characteristics of both Nominal & Ordinal scales.
- It lacks an absolute zero, which means the origin (base) is not fixed.
- Because there is no fixed origin, negative values are allowed.
- It allows comparison with magnitude, i.e. we can say that the 1st rank person is 20 marks ahead of the 2nd rank person.
- It allows only 2 mathematical operations, i.e. addition and subtraction.
- On an interval scale, equal differences between values mean the same thing anywhere on the scale. For example:
- The difference between 10°C and 5°C is equal to the difference between 23°C and 18°C, namely 5°C, but we cannot say that 10°C is twice as warm as 5°C, because Celsius and Fahrenheit have no absolute zero.
- We cannot say that temperature does not exist at 0°C or 0°F, because the zero point on these scales is arbitrary.
- Ratio Scale:
- Similar to interval data, but with an absolute zero point, allowing calculations like ratios and proportions.
- Due to the presence of the origin, negative values are not allowed.
- It allows all 4 mathematical operations i.e. addition, subtraction, multiplication & division.
- It contains characteristics of all the 3 scales i.e. Nominal, Ordinal & Interval.
- Examples include temperature in Kelvin, height, weight, and blood pressure.
- We can say that the difference between 323.15 K and 298.15 K is equal to the difference between 353.15 K and 328.15 K (25 K in both cases). We can also say that 200 K is twice as hot as 100 K.
- At the origin, the measured quantity is absent. For example:
- A blood pressure of 0 means the heartbeat has stopped.
- An age, weight or height of 0 means there is nothing to measure.
- At 0 kelvin, the thermal motion of molecules reaches its minimum.
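To make the four scales above concrete, here is a minimal Python sketch of which operations are meaningful on each one. The sample values, the `INCOME_ORDER` mapping and the `celsius_to_kelvin` helper are illustrative assumptions, not part of any dataset discussed here.

```python
from collections import Counter

# Nominal: labels only. Counting and the mode make sense, ordering does not.
genders = ["M", "F", "F", "M", "F"]  # made-up sample
mode_gender = Counter(genders).most_common(1)[0][0]

# Ordinal: labels with a rank. The explicit order below is an assumption
# we supply; it enables sorting and a median, but not differences.
INCOME_ORDER = {"Low": 0, "Medium": 1, "High": 2}
incomes = ["High", "Low", "Medium", "Low", "High"]  # made-up sample
sorted_incomes = sorted(incomes, key=INCOME_ORDER.get)
median_income = sorted_incomes[len(sorted_incomes) // 2]

# Interval vs ratio: differences survive a change of zero point, ratios do not.
def celsius_to_kelvin(celsius):
    """Shift the zero point from the Celsius origin to absolute zero."""
    return celsius + 273.15

diff_celsius = 10.0 - 5.0
diff_kelvin = celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0)
ratio_celsius = 10.0 / 5.0  # 2.0, but not a physically meaningful ratio
ratio_kelvin = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)  # about 1.018

print(mode_gender, median_income)
print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

The temperature difference is the same on both scales, while the ratio only carries meaning once the zero point is absolute, which is why 10°C is not "twice as warm" as 5°C.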
Quantitative Data: Unveiling the Numbers Game
Quantitative data, the world of numbers, gets even more interesting with:
- Continuous data: Can take any value within a range, like height or weight.
- Discrete data: Whole numbers only, like the number of children in a family.
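As a rough illustration, the check below flags a list of numbers as discrete when every value is a whole number. The function name `looks_discrete` and the sample lists are hypothetical, and the check is only a heuristic: a continuous variable that happens to be recorded in whole numbers would also pass.

```python
def looks_discrete(values):
    """Return True if every value is a whole number (a heuristic, not a proof)."""
    return all(float(v).is_integer() for v in values)

children_per_family = [0, 2, 3, 1]   # counts: whole numbers only
heights_cm = [170.2, 165.0, 181.7]   # measurements: any value in a range

print(looks_discrete(children_per_family))
print(looks_discrete(heights_cm))
```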
Data Types and Learning Paradigms: Finding the Perfect Match
The type of data you have dictates the learning approach you can use:
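- Supervised learning: Deals with labeled data, where you have both independent (input) and dependent (output) variables. Think predicting house prices based on size and location. A quantitative output leads to a regression problem, while a qualitative output (such as loan approved/rejected) leads to a classification problem.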
- Unsupervised learning: Makes sense of unlabeled data, grouping it into hidden patterns or structures. Think clustering customer segments based on their purchase history. Both qualitative and quantitative data can be used.
- Reinforcement learning: Learns through trial and error, interacting with an environment to maximize rewards. Think training a robot to walk by rewarding successful steps. Both qualitative and quantitative data can be used, depending on the environment and sensors involved.
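The contrast between the first two paradigms can be sketched in plain Python. This is a toy illustration under stated assumptions: the numbers are made up, `fit_line` is a hand-written least-squares fit on a single input, and `two_means_1d` is a simplified one-dimensional version of k-means with two clusters. Neither stands in for a real library implementation, and reinforcement learning is left out for brevity.

```python
def fit_line(xs, ys):
    """Supervised: learn slope and intercept from labeled (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_means_1d(values, iterations=10):
    """Unsupervised: split unlabeled numbers into two groups by nearest centroid."""
    low, high = min(values), max(values)  # initial centroids
    group_low, group_high = list(values), []
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_high:  # all values identical: nothing to split
            break
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


# Labeled data: house size (input) and price (output), made-up values.
sizes = [50, 80, 120]
prices = [150, 240, 360]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 100 + intercept

# Unlabeled data: purchase amounts with no target column, made-up values.
purchases = [10, 12, 11, 95, 100, 105]
small_spenders, big_spenders = two_means_1d(purchases)

print(round(predicted_price, 2))
print(small_spenders, big_spenders)
```

The supervised function needs the `prices` column to learn from, while the clustering function only ever sees the inputs and has to discover the two groups on its own.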
Remember:
- Data types can be subjective, depending on how data is collected and formatted.
- If the temperature is recorded in Celsius or Fahrenheit, it is on the Interval Scale, but the same temperature is on the Ratio Scale if recorded in Kelvin.
- If temperature ranges are grouped together, the data becomes Categorical, e.g. “[0–25)”, “[25–50)”, “[50–75)”, “[75–100)”, “[100 and above)”. Here a square bracket means the value is included and a parenthesis means the value is excluded.
- Quantitative data can be converted to qualitative, but not vice versa.
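The temperature grouping above can be sketched with the standard library's `bisect` module. The `EDGES` and `LABELS` names and the `bin_temperature` helper are assumptions made for this example. Each bin includes its lower edge and excludes its upper edge.

```python
from bisect import bisect_right

EDGES = [0, 25, 50, 75, 100]  # lower edge of each bin
LABELS = ["[0-25)", "[25-50)", "[50-75)", "[75-100)", "[100 and above)"]


def bin_temperature(value):
    """Map a numeric temperature to its category label."""
    if value < EDGES[0]:
        raise ValueError("value is below the first bin")
    # bisect_right counts how many lower edges are <= value.
    return LABELS[bisect_right(EDGES, value) - 1]


readings = [0, 24.9, 25, 99.5, 100, 150]
categories = [bin_temperature(r) for r in readings]
print(categories)
```

Once a reading has been replaced by its label, the exact value cannot be recovered: both 100 and 150 map to the same category. That is why the conversion only works in one direction.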
Mastering Data Types: The Key to Machine Learning Success
By understanding data types and their interplay with different learning paradigms, you unlock the true potential of machine learning. Choose wisely, and watch your algorithms soar!