Intro to T-SQL for Data Science
Data Science uses multitudes of scientific methods, algorithms, and processes to extract knowledge and insight from data and uses it across wide range of domains. Data Science helps to extract patterns from raw data which would have been invisible beforehand. In order to access and experiment with these data, a Database is essential. Thus, today we’ll learn about the Database as a whole and use examples and features of T-SQL to learn to find meaning out of raw data.
Check out other articles on SQL.
T-SQL is one of the mostly widely used SQL Derivatives. It is known as a transactional language used to define how things are to be done. T SQL is one of the easiest and most effective way to get things done. It is importantly used to create applications and also to add business login into the application the backend systems.
- Cloud VS On-Premise:
What is Cloud?
Cloud or Cloud Computing can be defined as the on-demand availability of the ecosystem of computer resources ranging from data storage (cloud storage) to computer power without the user having to actively manage the system. These are the data centers which are available to users across the internet. We can say, Cloud is the architecture of the system where someone else owns the hardware such as AWS, Azure and Alibaba Cloud and the user can use it for their need at the time of convenience. This can also be called as Off-Premise.
On-premises is the scenario where the user owns the Hardware and the software systems are deployed on the very premises of the user or of the organization instead of remote facility for instance, Cloud Computing or Server Farm.
If you want to install in your own machine,
Download and Install SQL Server Express
-> SQL Server 2019 Express Edition
Editions of SQL Server
- Express — which is free
Tools of the Trade (IDE)
SQL Server Management Studio (SSMS) is an application software which is widely used to configure, manage and administer different components within the Microsoft SQL Server.
Visual Studio (VS) Code is a free software developed by Microsoft for Windows, Linux and macOS which developers and engineering write and edit code, debug, refactor and provides multiple other functionalities.
What is an IDE?
In layman’s terms, this is an environment where you can write your code and/or perform your analysis. For E.g., Microsoft SQL Server Management Studio, VS Code, Azure Data Studio.
We would highly suggest you to try out Azure Data Studio to start out T-SQL journey as it is free and to learn from the experience.
Database can be defined as the collections of information which is organized such that the information could be accessed and worked upon. It is often controlled by a database management system which all together with database is known as a database system, in short form just called database.
Database Management System (DBMS)
A DBMS acts as the interface between the database and its end-users such that the users can access, manage and update the information.
Database objects can be understood as the objects in a database which are used in order to store or refer to the data. Tables, Views, Sequences, Indexes, Clusters and Synonyms are all examples of the database objects.
A set of related data stored in our database
A sheet in Excel. Just like Excel contains spreadsheets, relational databases are composed of tables.
Let us take sample data (car_data.csv from Kaggle).
What is a Column?
A set of data of a particular type, generally there is a value for each row in our table.
What is a Row?
A collection of fields that make up a record.
What is a Field?
The smallest source of data in a database
Logical Processing Order
The logical processing order of a SELECT statement explains the methods of how the query will be process and the final result will be achieved.
- GROUP BY
- WITH CUBE or WITH ROLLUP
- ORDER BY
Working with Data:
This is the process of retrieving one or many rows or columns from one or many tables in a database. E.g., Returning 20 records with our 11 selected columns from the Cars Table
This is the process of excluding data based on predefined rules. E.g., Returning 12 records with our 11 selected columns from the Cars Table where the cars have a body style of type “convertible”
This is the process of sorting a dataset based on a specified sort order. E.g., Returning 12 records with our 11 selected columns from the Cars Table where the cars have a body style of type “convertible” and ordering them from the widest based on the Wheelbase.
Grouping data within a dataset typically over 1 or more columns. E.g., Returning All the records from the Cars dataset, grouping the data over the Make to see how many Models each Make has that is of Body_Style Convertible and what is the Average Price for each Model.
Permanent VS Temporary data modifications. For Permanent we use the UPDATE statement, for temporary we can use what best suite our needs. E.g., We are correcting the spelling error of ‘Audi’ in the Make the top statement will only display the corrected data, where the bottom will permanently change the underlying data.
Working with Multiple datasets
Joining 2 or more datasets together on a common “join” key. The SQL join allows us to collect two or more tables using the common identifiers. There are different types of join such as: Inner Join, Outer Join, Cross Join and Semi-join.
If you want to dive deeper into T — SQL, watch this video embedded below,
The SET Operations like Union, Intersect, Minus help us get meaningful outputs from the data stored in table under various special conditions.
Check out my full article at: https://www.c-sharpcorner.com/article/intro-to-t-sql-for-data-science/