Introduction to SQL Injection

Introduction

SQL Injection attack, or flaws that affect the communication between an application and its database, it is perhaps the most wanted and exploited attack in both penetration testes and malicious actions (such as an intrusion test tries to reflect). This category of attacks allows the malicious agent to impact its victim in various ways, from manipulating security mechanisms, such as authentication functions, to extracting data stored in that database, or even executing commands at the level of the host operating system where the database technology is operating. Therefore, it is important to know how a database works, what the idea behind its flow of operation is, so that we can find such flaws that often affect critical systems, and mitigate their exploitation in the best way for such a usage scenario.

In this article, I intend to present some basic notions about databases, discuss a little about the functioning mode of the SQL language, the reason why it works in such a different way from other languages, and the reason why attacks of this type are possible and difficult to mitigate.

I write because I believe there is a need for content of this type, which not only talks about a security flaw or exploitation technique in a superficial way without commitment to the reader's understanding. Here I will use more realistic arguments and not based solely on "axiomatic" notions, which we simply assume that it works without much questioning.

All of this boils down to "we will not discuss which payload to use or how, but rather develop the reasoning to create a more efficient one.”

Security Flaw

I like to define a security flaw as computer science applied in its purest form.

A flaw is something unexpected in the initial design of the flow of operation of a computable code, which allows for unexpected actions such as causing a runtime error from manipulating some point of interaction of this code with the external. However, this corruption in the execution flow of a program may enable control of that execution flow, manipulating the next instructions to be executed.

Structured Query Language

SQL or Structured Query Language is a standard programming language for relational databases, however DBMS (Data Base Management System) implement it by applying their variations, which can facilitate in cases of use for which that DBMS was created to supply. Therefore, there are many variations of the language, but they all follow the same principle.

SQL is a declarative language, which expresses itself in the flow of a calculation, a different paradigm from the imperative languages we are used to dealing with. A simple example to expose this difference would be to say that, in imperative languages we structure the data and then define what we will do with it, going through functions to process it, a more explicit flow based on functions to refine the information. SQL does not, we first define what we will do, and then we go after this data that will undergo the action.

Operations

Operators are defined for manipulating data sets. With operators, it is possible to create subsets of data by applying filtering methods and joining parts of other data sets (tables) to create information.

Undoubtedly, the most commonly used operator for manipulating data contained in databases (relational or not) is $\sigma$ (select), which defines the select operation. Applying $\sigma$ to a set of values $X$ generates a subset of $X$ or $X$ itself. The $\sigma$ operator generates a subset of $X$ based on some defined property, a condition. $\sigma$ can be applied to only one set.

We define a set $X={1,2,3,4,5,6,7}$ . If we create a subset $A$ of $X$ , we could define the operation $\sigma x$ is even in $X$ , then $A=\{x \in X|x\%2=0\}$ , where $\%$ represents a modular operation. This generates a subset $A$ of $X$ such that $A={2,4,6}$ with $A$ containing only even values. Note that $A$ is a subset of $X$ , therefore, always, any value in $A$ is also in $X$ .

A \subseteq X \Longleftrightarrow \forall x,\ \text{if}\ x \in A\ \text{then}\ x \in X

In SQL, we define the operation using the SELECT command:

SELECT x FROM X WHERE mod(x,2) = 0;

Relationship Between Data

The most important part of SQL databases it is the relational propriety. We must think on it as a tool to trace relationship among data in many tables and so compile the information from this.

The language is based on relational algebra, which is a set of properties that support the theoretical foundation of relational databases. It enables the operator to write instructions for the system in a procedural way, and the system performs operations on the specified groups of data and calculates the result. Relational algebra is used to define operations between existing data (without altering them), such as acquiring data stored in two tables and creating a third table by relating the data based on some common property.

This makes a lot of sense because we are always using fundamental operations from relational algebra to treat data.

Imagine two datasets, the set $X={1,3,4,6,9}$ and the set $Y={2,3,5,6,8}$ . The union of those two groups of data, will be the joining of all symbols conteined in each set or group, such that $X\cup Y = {1,2,3,4,5,6,8,9}$ .

Por outro lado, a intersecção entre conjuntos resultará em, apenas aqueles valores presentes em ambos os conjuntos, dessa forma $X \cap Y = {3,6}$ . Parece familiar, não?

SELECT x FROM X
INNER JOIN Y
ON x=y;

When utilizing intersections, we create the third table which grows horizontally because usually it is added to the resulting table a column from the first table $X$ and another column from the second table $Y$ , so does not matter the length of the table $X$ or $Y$ , or even what is being selected. However, when the union between those two tables/sets is queried, we can only do this operation if and only if the two tables have the compatibility property to union, the tables must be equivalent (this idea is extremely important when using "UNION" in SQLI payload), because the resulting table grows vertically.

If both tables are not compatible in the number of columns, we can select the same number of columns to create this necessary compatibility.

Imagine two sets of value, $X$ and $Y$ , such that:

X=\begin{bmatrix} a & 1 & true\\ b & 2 & false \end{bmatrix} Y=\begin{bmatrix} m & 8 \\ n & 9 \end{bmatrix}

If I select through the following way, giving to the first column of $X$ the symbol $x1$ , $x2$ for the second column and so on. Analogously with $Y$ that has the first column expressed as $y1$ .

SELECT x1, x2, x3 FROM X UNION SELECT y1, y2 FROM X;

Isso não seria possível, pois a matriz $Z$ resultante seria algo como:

Z=\begin{bmatrix} a & 1 & true \\ m & 8 & ?? \\ b & 2 & false \\ n & 9 & ?? \end{bmatrix}

Union will combine the selected result sets from two or more tables and create a third set, so the number of columns must be the same, as well as the data types. Set $Y$ does not have the third column, so it would not be possible to create one out from nowhere. To work around this problem:

SELECT x1, x2 FROM X UNION SELECT y1, y2 FROM X;

Z=\begin{bmatrix} a & 1 \\ m & 8 \\ b & 2 \\ n & 9 \end{bmatrix}

Select the same amount of column.

SQL Command Injection

The security flaw that allows SQL injection occurs when an SQL command is constructed by concatenating it with data entered by the user through some point of interaction in the application, usually forms, or some location easily manipulated by the user, such as query strings in the URL. The attacker is able to escape from the flow of execution of the initially idealized SQL query, and execute an arbitrary command.

So, if there is an area in the application where the user can submit data that will be saved or used to index new information returned in the response body, if that point is not well sanitized, it may be subject to SQL command injection.

Imagine a URL that receives as argument an integer for the parameter called id, so the URL would look like https://domain.com/page?id=5.

Here is written a pseudocode that has the goal to process the inserted data end and return something from the database.

id = URL.Query["id"]
query = "SELECT x,y,z FROM anyTable WHERE id=" + id + " LIMIT 10;"
rows = DBConnection.Query(query)
...

Note that, at one point in the code, a SQL command is constructed, concatenated with the id, so the resultng code would be domething like SELECT x,y,z FROM anyTable WHERE id=5 LIMIT 10;. This data should be an integer, but at no point it is realy validated, so a user could input anything, such as a letter: SELECT x,y,z FROM anyTable WHERE id=a LIMIT 10;. Of course, this would return an SQL error, as there are no delimiters for a char type. This query would return an SQL error that an attacker could take as parameter to exploit it.

Abusing Concatenated Functions

Now let's try to build an authentication function. A form should manipulate the browser to perform a POST with the parameters uname for the username, and pass for the user's password. The following pseudocode represents a query for user validation, it should return the ID of this user so that the session configuration will be successfully created if authentication goes as expected.

...
query = 'SELECT id FROM User
    WHERE uname="'+ POST.Param["uname"] +'"
    AND passwd=MD5("'+ POST.Param["pass"] +'");'
...

When a POST request is performed by the user, a query is built as the following.

SELECT id FROM User WHERE uname="3lf" AND passwd=MD5("3lfp455");

Again, concatenation is happening incorrectly, and any information entered in the form will be added to the SQL command, which will then be interpreted.

The SQL command has a conditions structure that will validate if the requested data in such standards (username and password) really exists, it will do this by comparing with each row of the selected table, so the username and the password must exist and be written exactly the same way, on the same row for the validation to occur successfully. We could do it in the following way in C, to make it easier to understand. Note that there will be a loop performing the comparison of all data contained in the matrix, just like a DBMS does, but in a common DBMS the programmer does not need to run a loop of verifications:

int main(void) {
  char *uname = "3lf";
  char *passwd = "3lfp455";
  char *User[3][2] = {
    {"admin","p4ssWd"},
    {"user","Pa$$w6"},
    {"3lf","3lfp455"},
  };

  for (int i=0; i<3; i++) {
    if (strcmp(User[i][0],uname)==0 && strcmp(User[i][1],passwd)==0)
      printf("Correct!\n");
  }
  return 0;
}

Due to the fact that the form is not properly sanitized, the attacker can guess that a string is being concatenated with the command, close this string with delimiters that commonly are single or double quotes, and then execute something another code right after.

Abusing Comments

In a programming language, a comment is a sequence of characters that will not be executed in the program's flow, so this feature can be used to add organization to the code. With this, it is possible to disable parts of the code that were only used for testing or debugging. In the SQL command shown above, it would be possible to "comment out" the entire part of the code that deals with user password validation, thus discarding that part. The malicious agent can terminate the string with its delimiter and then add the command that specifies which part of the code is a comment. Each DBMS will have a different way of dealing with comments, but generally it is a sequence of two hyphens "--".

SELECT id FROM User WHERE uname="3lf";-- -AND passwd=md5("3lfp455");

Note that all that code appearing after the "name" has been commented, now it is just useless data and can do nothing. Therefore, just the code SELECT id FROM User WHERE uname="3lf"; would be executed leading with unauthorized access being given to the adversary.

Logical Manipulation

When it comes to injecting SQL commands, probably the most famous payload is:

1\ OR\ 1\ =\ 1--

It is used to manipulate the query based on Boolean algebra through logical congruences or incoherences, thus applying Boolean functions to two input values to obtain an output value, which drives the code execution flow. For example, $X \in R$ or $1=1$ . The result of the first expression does not matter, as $1$ always equals $1$ . Therefore, it is possible to play with this when the goal is to manipulate a query, but it is not known how the query is organized, what values should be compared, or what tables these data are extracted from. Therefore, the logical operator OR is usually used because it is extremely flexible.

Here is a simple OR truth table, where $A\ OR\ B=C$ :

\begin{matrix} A & B & C\\ 0 & 0 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\\ 1 & 1 & 1 \end{matrix}

The following payload would return the first user registered in the database, usually the administrator:

SELECT id FROM User WHERE uname="" OR 1=1 LIMIT 1-- -AND passwd=md5("a");

This happens because the value of uname is not known, so a second condition is inserted with the OR operator, which will always be true. "Select id from the User table, where uname is "" or $1=1$ *. For each row in the table, uname will never be equal to "" (nothing), however for each row in the table $1=1$ , so the row is returned as fitting the condition, consequently all rows will be returned. Finally, a LIMIT 1 is executed, so only the first comparison that fits the condition reaches the limit, which is usually the administrator's account or a test account.

The second most used logical operator is AND, which results in a positive outcome only when both input values are positive. Here's a simple truth table for AND, where $A\ AND\ B=C$ :

\begin{matrix} A & B & C\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 1 & 1 \end{matrix}

The AND operator can be useful in brute force processes, especially when employing SQL command injection techniques for time-based attacks (which will not be discussed in this article in depth). The following query will use the MySQL function SLEEP(int) (the name will be different in other DBMS) to suspend the execution of the command for a certain period of time. We can assign a binary meaning to this delay in the response from the manipulated server.

SELECT id FROM User WHERE uname="any" AND SLEEP(5)-- -...;

The execution of the command will be suspended for $5$ seconds if uname="any" and SLEEP(5). The SLEEP(5) function itself does not return a positive boolean value, it is called so that something is returned and evaluated. It would be possible to test a list of usernames, one at a time, assigning a wait time, for example, for each uname tested if there is a limit of attempts:

SELECT id FROM User WHERE uname="admin" AND SLEEP(2);
SELECT id FROM User WHERE uname="user" AND SLEEP(5);
SELECT id FROM User WHERE uname="anyuser" AND SLEEP(8);
SELECT id FROM User WHERE uname="3lf" AND SLEEP(11);

If the application takes 11 seconds to return a response, we know that the user 3lf exists. Of course, this method would not be used in this way, but it is a good illustration of the idea.

A Real Case I came across

To better illustrate, I will share a situation that happened to me during a pentest, where the use of the mentioned logical manipulation was essential to exploit a Time-Based SQL Injection flaw. I didn't have any indication in the response body that the application had an implementation flaw in the database interface, but some requests that I passed random characters to, in an attempt to expose some corruption of the mechanism, took a little longer, as if the application had to think a little more to interpret it. So, I started injecting payloads for relational and non-relational databases, and eventually something worked.

After some testing period, I managed to find a pattern that allowed me to send commands within this sequence of characters, and my commands were interpreted. So, I sent the following query:

SELECT CASE WHEN 1=1 THEN PG_SLEEP(5) ELSE PG_SLEEP(0) END;

The DBMS in use was PostgreSQL. I used a conditional structure that varied the response time to symbolize a response for something I wanted to know - whether my query was being executed or not. If $1=1$ , wait for $5$ seconds, if $1\neq 1$ , respond immediately. This request took a little over $5$ seconds to return any data.

For a second check, I sent another query, but with something I knew was absurd, so the response should be instantaneous:

SELECT CASE WHEN 1=5 THEN PG_SLEEP(5) ELSE PG_SLEEP(0) END;

With that, the structure jumped to the else block and the server responded to me in less than $1$ second.

Using this method I was able to dump some information from the database.

All tests to expose this flaw, both the proof of the possibility of escaping the developed execution flow and the actual exploitation, the database dump (yes, to make it more interesting I had to extract some data, but that's for a more SQLI-focused text), had to be done manually. The pattern to break the string and make my command execute without errors was not something trivial, so it was not possible at any time to use tools like SQLMap for detection and exploitation of the flaw.

Conclusion

The main objective of this article was to try to explain with algebra a little bit about why SQL works the way we know it today, what are the foundations of this important language, and how these principles are employed when the goal is to find a security flaw. Almost every application uses SQL in some way, and mastering techniques for detecting and exploiting vulnerabilities that exploit this category of vulnerability is crucial for any pentester. The goal is not to run an automated tool that creates proof of concepts and exploits automatically, these tools are not prepared for all cases and will eventually fail, so knowing about the technology in order to then build exploitation tactics is the most important solution for security problems.