Datasets

A dataset is a collection of data, normally presented in tabular form. Each column describes a particular variable, and each row corresponds to a given member of the data set. Working with such collections is a part of data management. A data set lists the values of each variable, such as the height, weight, temperature, or volume of an object, or the values of random numbers. Each individual value in the set is known as a datum. The data set holds data for one or more members, with one row per member. In this article, let us learn the definition of a dataset, the different types of datasets, their properties, and so on, with solved examples.

Dataset Meaning

A data set is an ordered collection of data. Data, as we know, is information obtained through observation, measurement, study, or analysis. It can include facts, numbers, figures, names, or even basic descriptions of objects. For study, data can be organized in the form of graphs, charts, or tables. Data scientists analyse gathered data through techniques such as data mining.

A dataset is a set of numbers or values that pertain to a specific topic. For example, the test scores of each student in a certain class form a dataset. Datasets can be written as a list of numbers in any order, as a table, or enclosed in curly brackets. Data sets are normally labelled so that you understand what the data represents. However, when working with a data set you do not always know what the data stands for, and you do not necessarily need to know in order to solve the problem.
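As a rough illustration of the forms described above, the following Python sketch stores the same test scores as a plain list and as a small table. The student labels and scores are invented for illustration only.

```python
# Hypothetical test scores for one class (values invented for illustration).
scores = [72, 85, 85, 90, 64]

# The same data as a table: each row is one member (a student),
# each column is one variable (name, score).
table = [
    {"name": "A", "score": 72},
    {"name": "B", "score": 85},
    {"name": "C", "score": 85},
    {"name": "D", "score": 90},
    {"name": "E", "score": 64},
]

# Reading one column of the table gives back the list form.
score_column = [row["score"] for row in table]
print(score_column == scores)  # True
```

Note that a Python set literal also uses curly brackets but discards repeated values, so a list is the safer choice when repeated values matter, as they do for the mode.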

Types of Datasets

In Statistics, we have different types of data sets available for different types of information. They are:

  • Numerical data sets
  • Bivariate data sets
  • Multivariate data sets
  • Categorical data sets
  • Correlation data sets

Also, check out: Types of Data in Statistics.

Let us discuss all these data sets with examples.

Numerical Datasets

A numerical data set is a data set in which the data are expressed in numbers rather than natural language. Numerical data is sometimes called quantitative data, and the set of all such data is called the numerical data set. Because numerical data is always in number form, we can perform arithmetic operations on it. Examples of numerical data include:

  • Weight and height of a person
  • The count of RBC in a medical report
  • Number of pages present in a book

Bivariate Datasets

A data set that has two variables is called a bivariate data set. It deals with the relationship between the two variables, so a bivariate dataset usually contains two types of related data.

Examples:

  • The percentage score and age of the students in a class. Here score and age are the two variables.
  • The sales of ice cream versus the temperature on that day. Here the two variables are ice cream sales and temperature.

(Note: If you have only one variable, say temperature, then the data set is called a univariate dataset.)

Multivariate Datasets

A multivariate dataset is a data set with multiple variables. When the dataset contains three or more data types (variables), it is called a multivariate dataset. In other words, a multivariate dataset consists of individual measurements that are acquired as a function of three or more variables.

Example: If we have to measure the length, width, height, and volume of a rectangular box, we have to use multiple variables to distinguish between those quantities.

Categorical Datasets

Categorical data sets represent features or characteristics of a person or an object. A categorical dataset consists of categorical variables, also called qualitative variables. A categorical variable that can take exactly two values is termed a dichotomous variable. Categorical variables with more than two possible values are called polytomous variables. Qualitative/categorical variables are often assumed to be polytomous unless otherwise specified.

Example:

  • A person’s gender (male or female)
  • Marital status (married/unmarried)
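A small Python sketch can make the dichotomous/polytomous distinction concrete by counting the distinct categories that appear in a variable. The function name `describe` and the sample responses (including the blood-group variable) are hypothetical, and the classification is based only on the categories observed in the sample.

```python
from collections import Counter

# Hypothetical survey responses (invented for illustration).
marital_status = ["married", "unmarried", "married", "married", "unmarried"]
blood_group = ["A", "O", "B", "O", "AB", "O"]

def describe(values):
    """Return the category counts and a label based on how many categories occur."""
    counts = Counter(values)
    if len(counts) == 2:
        kind = "dichotomous"      # exactly two categories observed
    elif len(counts) > 2:
        kind = "polytomous"       # more than two categories observed
    else:
        kind = "single category"  # nothing to compare
    return counts, kind

print(describe(marital_status)[1])  # dichotomous
print(describe(blood_group)[1])     # polytomous
```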

Correlation Datasets

A set of values that demonstrate some relationship with each other is called a correlation data set. Here the values are found to be dependent on each other.

Generally, correlation is defined as a statistical relationship between two entities/variables. In some scenarios, you might have to estimate the correlation between two quantities, so it is essential to understand how correlation works. Correlation is classified into three types. They are:

  • Positive correlation – Two variables move in the same direction (either both go up or both go down).
  • Negative correlation – Two variables move in opposite directions (when one goes up, the other goes down, and vice versa).
  • No or zero correlation – No relationship between two variables.

Example: A tall person tends to be heavier than a short person. So here the weight and height variables are related to each other, which is a positive correlation.
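One common way to put a number on this relationship is the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation). The article does not name a specific measure, so the use of Pearson's coefficient here is an assumption, and the heights and weights below are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical heights (cm) and weights (kg), invented for illustration.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 77, 85]

r = pearson_r(heights, weights)
print(r > 0)  # True: height and weight rise together in this sample
```

A value of r close to +1 for this sample reflects the positive correlation described in the example. Two variables that move in opposite directions would give a negative r.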

Mean, Median, Mode and Range of Datasets

The mean, median and mode, along with the range, are major topics in Statistics. Calculating the mean, median, and mode are three common ways of summarising a data set. The mean, mode, and range can be computed from the data in any order, but to find the median we must first rewrite the data set in order, for example ascending from least to greatest. Sorting the data first also makes repeated values and the extremes easier to spot.

The mean of a dataset is the average of all the observations present in the table. It is the ratio of the sum of the observations to the total number of elements present in the data set. The formula for the mean is given by:

Mean = Sum of Observations / Total Number of Elements in Data Set

The median of a dataset is the middle value of the collection of data when arranged in ascending or descending order. If the data set has an even number of elements, the median is the average of the two middle values.

The mode of a dataset is the value that is repeated the maximum number of times in the set.

Range of a dataset is the difference between the maximum value and minimum value.

Range = Maximum Value – Minimum Value
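The four definitions above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names and sample values are hypothetical, and the median uses the common convention of averaging the two middle values when the number of elements is even.

```python
from collections import Counter

def mean(data):
    """Sum of observations divided by the number of elements."""
    return sum(data) / len(data)

def median(data):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(data)  # sort first so the middle value can be picked
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(data):
    """All values that occur the maximum number of times."""
    counts = Counter(data)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

def value_range(data):
    """Maximum value minus minimum value."""
    return max(data) - min(data)

sample = [4, 1, 3, 1, 6]  # hypothetical values
print(mean(sample), median(sample), modes(sample), value_range(sample))
# 3.0 3 [1] 5
```

Returning a list from `modes` is a design choice: a data set can have more than one most frequent value, or every value can occur equally often.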

Properties of Dataset

Before performing any statistical analysis, it is essential to understand the nature of the data. We can use different Exploratory Data Analysis (EDA) techniques to identify the properties of the data, so that appropriate statistical methods can be applied to it. With the help of EDA techniques, we can check the following properties of the dataset.

  • Centre of data
  • Skewness of data
  • Spread among the data members
  • Presence of outliers
  • Correlation among the data
  • Type of probability distribution that the data follows
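A few of these properties can be checked with Python's standard `statistics` module, as in the sketch below. The specific choices are assumptions not prescribed by this article: the population standard deviation for spread, the moment coefficient for skewness, and the 1.5 × IQR rule for flagging outliers. The function name `summarize` and the data values are hypothetical.

```python
import statistics

def summarize(data):
    """Return centre, spread, skewness and outliers for a list of numbers."""
    centre = statistics.mean(data)
    spread = statistics.pstdev(data)  # population standard deviation
    # Moment coefficient of skewness: positive means a longer right tail.
    skew = sum((x - centre) ** 3 for x in data) / (len(data) * spread ** 3)
    # Quartiles, then the 1.5 * IQR rule (an assumed convention) for outliers.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low or x > high]
    return {"centre": centre, "spread": spread, "skew": skew, "outliers": outliers}

data = [2, 4, 6, 8, 2, 10, 12, 60]  # invented values with one extreme entry
summary = summarize(data)
print(summary["outliers"])  # [60]
```

Correlation and the type of probability distribution need more than one variable or further tests, so they are left out of this sketch.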

Datasets Example

Example 1:

Find the mean, mode, median and range of the given data set.

{2, 4, 6, 8, 2, 10, 12}

Solution:

Given, {2, 4, 6, 8, 2, 10, 12} is a set of data.

Mean = (2+4+6+8+2+10+12)/7 = 44/7 ≈ 6.29

To find the median, we first have to arrange the given data in ascending or descending order:

{2, 2, 4, 6, 8, 10, 12}

There are seven values, so the median is the fourth (middle) value. Thus,

Median = 6

Mode = 2 (the only value that appears more than once)

Range = 12-2 = 10
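The results of Example 1 can be cross-checked with Python's standard library. This is only a verification sketch, and the variable names are arbitrary. `Fraction` is used so that the mean is shown exactly as 44/7 rather than as a rounded decimal.

```python
import statistics
from fractions import Fraction

data = [2, 4, 6, 8, 2, 10, 12]

mean = Fraction(sum(data), len(data))  # exact value of 44/7
median = statistics.median(data)       # sorts internally, picks the middle value
mode = statistics.mode(data)           # most frequent value
data_range = max(data) - min(data)     # maximum minus minimum

print(mean, median, mode, data_range)  # 44/7 6 2 10
```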

Example 2: 

Find the mode for the given data set: 2, 3, 3, 4, 6, 7

Solution:

Given data set: 2, 3, 3, 4, 6, 7

We know that the mode is the most frequently repeated value in the data set.

From the given data set, it is observed that the value 3 appears twice, while every other value appears only once.

Hence, the mode for the given data set is 3.

Practice Problems

Solve the following problems:

  1. Find the mean for the dataset: 5, 3, 1, 6, 8, 9.
  2. Find the median for the dataset: 6, 2, 4, 5, 7.
  3. Find the mode and range for the following dataset: 3, 9, 12, 23, 7, 16, 5.

Frequently Asked Questions on Dataset

Q1

What is meant by dataset?

A set or collection of data is called a dataset. In other words, a dataset is an ordered collection of data.

Q2

What are the different characteristics used to measure the dataset?

In statistics, the different characteristics used to measure the dataset are mean, median, mode, range, and so on.

Q3

How to calculate the range of the given dataset?

The range of the given data set is the difference between the maximum and minimum value of the data set.

Q4

What are the different types of datasets?

The different types of datasets are:

  • Numerical dataset
  • Bivariate dataset
  • Multivariate dataset
  • Categorical dataset
  • Correlation dataset

Q5

What is the median of the dataset?

The median is the middle value of the dataset when the data are arranged in ascending or descending order.