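A URL (Uniform Resource Locator) is a universal naming format used to designate a resource on the Internet. It is a string of printable ASCII characters that breaks down into five parts: the protocol, an optional user name and password, the server name, the port number, and the path to the resource.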
A URL therefore has the following structure:
Protocol | User and password (optional) | Server name | Port (optional if 80) | Path |
---|---|---|---|---|
http:// | user:password@ | www.commentcamarche.net | :80 | /glossair/glossair.php3 |
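
As a rough illustration of this breakdown, the sketch below splits the example URL into its parts with Python's standard `urllib.parse` module. The variable names `url` and `parts` are chosen here for illustration only, and the host is written without a trailing dot.

```python
from urllib.parse import urlsplit

# Example URL assembled from the five parts in the table.
url = "http://user:password@www.commentcamarche.net:80/glossair/glossair.php3"

parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.username)  # user name (optional part)
print(parts.password)  # password (optional part)
print(parts.hostname)  # server name
print(parts.port)      # port number, as an integer
print(parts.path)      # path to the resource
```

When the URL carries no credentials or no explicit port, `username`, `password` and `port` are simply `None`.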
For example, the following protocols may be used in a URL:
The name of the file in the URL can be followed by a question mark and then data in ASCII format. This additional data is sent as parameters to an application on the server (a CGI script, for example). The URL then looks like this:
http://en.kioskea.net/forum/?cat=1&page=2
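
To make the role of the question mark concrete, here is a minimal sketch that extracts the parameters from the URL above using Python's standard `urllib.parse` module. The names `query` and `params` are illustrative, not part of the original text.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://en.kioskea.net/forum/?cat=1&page=2"

# Everything after the question mark is the query string.
query = urlsplit(url).query

# parse_qs splits on "&" and "=", and maps each name to a list of values,
# because the same parameter name may appear more than once.
params = parse_qs(query)

print(query)
print(params)
```

Each value is returned as a list of strings, so `params["cat"]` is a list even though `cat` appears only once.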
Because a URL is a means of sending information over the Internet (to send data to a CGI script, for example), it must be able to carry special characters, yet a URL cannot contain such characters literally. In addition, certain characters are reserved because they have a meaning: the slash separates sub-directories, and the characters & and ? are used to send data via forms. Finally, URLs can be included in an HTML document, which makes it difficult to insert characters such as < or > in the URL.
That is why encoding is necessary! Encoding consists of replacing each special character with the character % (which thereby becomes a special character itself) followed by the ASCII code of the character, written as two hexadecimal digits.
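
The rule can be sketched in a few lines of Python. The helper name `percent_encode_char` is hypothetical, and the sketch assumes a single ASCII character as input; it is meant to show the mechanics rather than replace a library routine.

```python
def percent_encode_char(ch: str) -> str:
    """Return '%' followed by the two-digit hexadecimal ASCII code of ch."""
    code = ord(ch)
    if len(ch) != 1 or code > 127:
        # Assumption of this sketch: only single ASCII characters are handled.
        raise ValueError("expected a single ASCII character")
    return "%{:02X}".format(code)


encoded_space = percent_encode_char(" ")
encoded_lt = percent_encode_char("<")
encoded_tab = percent_encode_char("\t")

print(encoded_space, encoded_lt, encoded_tab)
```

A space has ASCII code 32, which is 20 in hexadecimal, hence `%20`.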
Here is a list of characters that may require encoding, along with their encoded form:
Character | URL encoding |
---|---|
Tabulation | %09 |
Space | %20 |
" | %22 |
# | %23 |
% | %25 |
& | %26 |
( | %28 |
) | %29 |
+ | %2B |
, | %2C |
. | %2E |
/ | %2F |
: | %3A |
; | %3B |
< | %3C |
= | %3D |
> | %3E |
? | %3F |
@ | %40 |
[ | %5B |
\ | %5C |
] | %5D |
^ | %5E |
` | %60 |
{ | %7B |
\| | %7C |
} | %7D |
~ | %7E |
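
In practice, a library routine usually performs this substitution. The sketch below uses `quote` and `unquote` from Python's standard `urllib.parse` module; the sample string `raw` is made up for illustration.

```python
from urllib.parse import quote, unquote

# A made-up string containing several characters from the table.
raw = "a b&c=d/e?f<g>"

# safe="" asks quote() to encode "/" as well, which it leaves alone by default.
encoded = quote(raw, safe="")

# unquote() reverses the operation.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

Note that `quote` leaves letters, digits, and a few characters such as `.`, `-` and `_` untouched, so a dot comes back unchanged even though `%2E` is a valid encoding of it.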
The format of URLs is defined by RFC 1738.