Characters in Lisp: a tutorial
Character repertoires in lisp
A character repertoire, such as ASCII, Latin alphabet 1, or Unicode, is a set of characters.
Each character in a repertoire, such as a, is assigned a code point. A code point is just a number. For example, in ASCII, the code point for a is 97 in decimal, or 61 in hexadecimal.
Each code point has an encoding. An encoding is the representation of the code point in the computer. For example, in ASCII, the character a is encoded the same as its code point, which is 61 in hexadecimal, or 0110 0001 in binary.
That said, the characters that can be used in a Lisp implementation are defined by the character repertoire, or repertoires, that the implementation supports.
At a minimum, all Lisp implementations must support some basic characters, called the standard characters. The set formed by these characters is called the standard character repertoire, standard-char.
The characters that all Lisp implementations must support are: a-z, A-Z, 0-9, ! ? $ " ' ` . : , ; * + - / | \ ~ ^ < = > # % @ & ( ) [ ] { } _, the non-graphic character newline, and the graphic character space.
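The size of the standard repertoire can be checked from Lisp itself. The following is a minimal sketch: count-standard-chars is a hypothetical helper name, and the sketch assumes an implementation with ASCII-compatible character codes, so that every standard character has a code below 128.

```lisp
;; Sketch: count the standard characters among the first 128 codes.
;; Assumption: the implementation uses ASCII-compatible codes, so all
;; standard characters have a code below 128.
(defun count-standard-chars ()
  (loop for code from 0 below 128
        for ch = (code-char code)      ; may be NIL if no such character
        count (and ch (standard-char-p ch))))

(count-standard-chars) ; 96 on an ASCII-compatible implementation
```

The count is 52 letters, 10 digits, 32 punctuation characters, the space, and the newline.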
Character types in lisp
Characters in Lisp can be used to write a program, or they can represent themselves as data. The Lisp reader reads the typed characters one by one. Each character has a type, which the standard calls its syntax type, and this type affects how the Lisp reader interprets the character.
Characters can be of the following types:
A macro character. The standard macro characters in Lisp are: ( ) ' ` , ; " #.
A macro character affects the parsing of the characters that follow it. When a macro character is encountered, its associated reader macro function is called. The reader macro function can read the following characters, and it returns an object, or nothing at all, as is the case for a comment.
For example, the double quote " macro character returns a string literal, as in "Hello world". The sharp sign # macro character, followed by a backslash \ and a single character or a character name, returns a literal character, as in #\a. Characters following the semicolon macro character ; are ignored up to the end of the line. The single quote macro character ' wraps the object that follows it, such as a symbol or a cons, in a quote form, so the read-eval-print loop does not evaluate that symbol or cons. For example:
> '(+ 1 2 )
;;;The cons (+ 1 2 )
;;; is not evaluated .
(+ 1 2)
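The effect of each macro character can also be observed without evaluation, by calling the reader directly. This is a small sketch using the standard function read-from-string; read-one is a hypothetical helper name introduced only for this illustration.

```lisp
;; Read exactly one object from TEXT, without evaluating it.
(defun read-one (text)
  (values (read-from-string text)))

(read-one "\"Hello world\"")            ; double quote: a string
(read-one "#\\a")                       ; sharp sign and backslash: a character
(read-one "'(+ 1 2)")                   ; single quote: the list (QUOTE (+ 1 2))
(read-one (format nil "; ignored~%42")) ; semicolon: comment skipped, reads 42
```

Since nothing is evaluated here, the quoted form shows up as a list whose first element is the symbol QUOTE.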
An escape character. An escape character is not a macro character, so no reader macro function is called, and no object is returned.
The backslash \ is the single escape character. It escapes a single character, the one that follows it. An escaped character is treated as alphabetic, and its case is preserved. For example:
> \'
;;;The macro character ' is escaped ,
;;;It is now treated as an alphabetic
;;;character , instead of being
;;;a macro character .
;;;One or more alpha characters ,
;;;can form a symbol .
;;; The symbol is evaluated , by the
;;;read eval loop , this will cause an
;;;error , as no value is associated
;;;with the symbol '.
*** - SYSTEM::READ-EVAL-PRINT: variable |'| has no value
The following restarts are available:
The vertical bar | is the multiple escape character. A pair of vertical bars escapes the characters enclosed between them. Enclosed characters are treated as alphabetic, and their case is preserved. A vertical bar or a backslash that appears between vertical bars must itself be escaped, using the single escape character.
> |a(`;b|
;;;Characters in vertical bars
;;;are escaped, treated as
;;;alphabetic, and their case is
;;;preserved. The symbol
;;;|a(`;b| has no value, and
;;;this will cause an error.
*** - SYSTEM::READ-EVAL-PRINT: variable |a(`;b| has no value
The following restarts are available:
A constituent character. The constituent characters are Rubout (the delete character in ASCII), Backspace, and A-Z a-z 0-9 ! $ % & * + - . / : < = > ? @ [ ] ^ _ { } ~.
The Lisp reader forms tokens from the constituent characters and the escaped characters that it reads. A token formed by the Lisp reader can either be a number, such as 1 or 1.2, or a symbol, such as a or \'. An escaped character is considered a constituent character.
A number does not have a name, and it evaluates to itself. A symbol has a name, which is what the Lisp reader read, and it can have values associated with it, such as a variable value or a function.
The Lisp reader converts the case of unescaped constituent characters according to the value returned by (readtable-case *readtable*). This value is :upcase by default, so constituent characters are converted to uppercase. An escaped constituent character keeps its case.
> (readtable-case *readtable*)
;;;Print the case for constituent characters
:UPCASE
> 'Ab
;;;b is converted to
;;;uppercase.
AB
> 'A\b
;;;The case of b is preserved.
|Ab|
A whitespace character. Among the standard characters, the whitespace characters are the space and the newline. The semi-standard characters Tab, Page, Return, and Linefeed are also whitespace in standard syntax. Whitespace characters are used to separate tokens.
> '|a b|
;;;One token is formed, since
;;;vertical bar escape characters
;;;are used.
|a b|
> (+ 2 3 )
;;;A list of three tokens:
;;;+ 2 and 3.
5
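How the reader turns characters into a token can be seen by reading a string and looking at the name of the resulting symbol. The sketch below assumes the default readtable case, :upcase; token-name is a hypothetical helper name.

```lisp
;; Assumes the default readtable case, :upcase.
(defun token-name (text)
  "Read TEXT as one token and return the name of the resulting symbol."
  (symbol-name (read-from-string text)))

(token-name "Ab")     ; => "AB"   unescaped characters are upcased
(token-name "A\\b")   ; => "Ab"   the escaped b keeps its case
(token-name "|a b|")  ; => "a b"  the space is escaped, so one token
```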
Literal characters
Characters can represent themselves as data. These are the literal characters.
A literal character is written using #\ followed by a single character, or by a character name.
If #\ is followed by a single character, the case of that character is preserved. For example, the capital letter A is written #\A, and the lowercase letter a is written #\a.
If #\ is followed by a character name, the name is matched without regard to case, and the literal denotes the character that has this name. Names other than the few defined by the standard are implementation-defined. For example, in an implementation that provides Unicode-style names, #\LATIN_CAPITAL_LETTER_A represents A, and #\LATIN_SMALL_LETTER_A represents a.
Depending on whether printer escaping is enabled or disabled, a literal character is output either with #\ preceding it, or just as is. Whether printer escaping is enabled can be checked using *print-escape*.
> *print-escape*
T
> #\a
;;;input is #\a, and output
;;;is #\a
#\a
> #\A
;;;input is #\A, and output
;;;is #\A
#\A
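The two output forms can be compared side by side with the standard functions prin1-to-string, which prints with escaping enabled, and princ-to-string, which prints with escaping disabled. This is only an illustrative sketch; char-both-ways is a hypothetical helper name.

```lisp
;; prin1-to-string binds *print-escape* to true,
;; princ-to-string binds it to false.
(defun char-both-ways (ch)
  (list (prin1-to-string ch) (princ-to-string ch)))

(char-both-ways #\a) ; => ("#\\a" "a")
```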
Character categories
A character in Lisp can belong to one or more categories.
A graphic character is a character that has a glyph used to display it. All standard characters besides the newline character are graphic characters.
The predicate function graphic-char-p can be used to check whether a character is graphic.
> (graphic-char-p #\Space )
;;;The space character is a graphic
;;;character.
T
> (graphic-char-p #\Newline )
;;;The newline character is not graphic.
;;;It is informally called a control character.
NIL
An alphabetic character is also a graphic character. The standard characters that are alphabetic in Lisp are: A-Z, a-z.
For an implementation-defined character, if it has case, it must be alphabetic. If it does not have case, whether it is alphabetic is implementation-defined.
To check whether a character is alphabetic, the predicate function alpha-char-p can be used.
> (alpha-char-p #\1 )
;;;The character 1 is not alphabetic
NIL
> (alpha-char-p #\a )
;;;The character a is alphabetic
T
A numeric character is also a graphic character. The standard numeric characters are 0-9. An implementation may define other numeric characters.
An alphanumeric character is a graphic character that is either numeric or alphabetic. The standard characters that are alphanumeric are: a-z, A-Z, and 0-9.
The predicate function alphanumericp can be used to check whether a character is alphanumeric.
> (alphanumericp #\a )
;;;Check if a is alphanumeric,
;;;returns true.
T
> (alphanumericp #\1 )
;;;Check if 1 is alphanumeric,
;;;returns true.
T
> (and (alphanumericp #\1 ) (not (alpha-char-p #\1 ) ))
;;;Check if a character is numeric, by checking
;;;if it is alphanumeric and is not alphabetic.
;;;Here, 1 is numeric.
T
A cased character is an alphabetic character that is either uppercase or lowercase, and that has a counterpart character in the other case.
The standard uppercase characters are A-Z, and the standard lowercase characters are a-z.
The predicate functions upper-case-p and lower-case-p can be used to check whether a character is uppercase or lowercase.
The both-case-p predicate function can be used to check whether a character has both an uppercase and a lowercase version. Some non-standard characters might not be cased. For example, Arabic characters do not have case.
The functions char-downcase and char-upcase can be used to get the lowercase and the uppercase version of a character.
> (upper-case-p #\A )
;;;Check if the character A
;;;is uppercase.
T
> (lower-case-p #\a )
;;;Check if the character a
;;;is lowercase.
T
> (both-case-p #\a )
;;;Check if the character a
;;;can be uppercase and
;;;lowercase.
T
> (both-case-p #\1 )
;;;Check if the character 1
;;;can be uppercase and lowercase.
NIL
> (both-case-p #\ARABIC_LETTER_ALEF )
;;;Check if the Arabic character alef
;;;has uppercase and lowercase.
NIL
> (char-upcase #\a )
;;;Get the uppercase version
;;;of the character a.
#\A
> (char-downcase #\A )
;;;Get the lowercase version
;;;of the character A.
#\a
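The case predicates and the case conversion functions combine naturally. As a small sketch, the following function flips the case of every cased character in a string; swap-case is a hypothetical name, not a standard function.

```lisp
(defun swap-case (string)
  "Return a copy of STRING with the case of every cased character flipped."
  (map 'string
       (lambda (ch)
         (cond ((upper-case-p ch) (char-downcase ch))
               ((lower-case-p ch) (char-upcase ch))
               (t ch)))              ; uncased characters pass through
       string))

(swap-case "Hello, World 1") ; => "hELLO, wORLD 1"
```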
A digit character is a digit in a given radix. For example, A is a digit in base 16. The standard radices are between 2 and 36 inclusive, and the digits run from 0 to Z, where Z has the weight 35. Digits are case insensitive.
The predicate function digit-char-p can be used to check whether a character is a digit in a given radix.
> (digit-char-p #\9 )
;;;If no radix is specified, then the
;;;default radix, which is 10, is used.
;;;digit-char-p returns either the
;;;weight of the digit in the radix,
;;;or false if the character is
;;;not a digit in the provided
;;;radix.
9
> (digit-char-p #\F 16 )
15
The function digit-char can be used to get the character that represents a digit in a given radix.
> (digit-char 9 )
;;;If no radix is specified, the default
;;;one is base 10. The number 9 is a
;;;digit in base 10, hence digit-char
;;;returns its representing character.
#\9
> (digit-char 10 )
;;;The number 10 is not a digit
;;;in base 10, hence the digit-char
;;;function returns false.
NIL
> (digit-char 35 36 )
;;;The number 35 is a digit
;;;in base 36, hence the function
;;;digit-char returns its representing
;;;character, capital Z.
#\Z
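Because digit-char-p returns the weight of a digit, it can serve as the building block of a number parser. The following is a minimal sketch under that idea; parse-digits is a hypothetical helper written for illustration, and the standard function parse-integer with its :radix argument is what would normally be used.

```lisp
(defun parse-digits (string &optional (radix 10))
  "Return the integer spelled by STRING in RADIX, or NIL if some
character of STRING is not a digit in that radix."
  (let ((value 0))
    (loop for ch across string
          for weight = (digit-char-p ch radix)
          do (if weight
                 (setf value (+ (* value radix) weight))
                 (return-from parse-digits nil)))
    value))

(parse-digits "ff" 16) ; => 255
(parse-digits "12a")   ; => NIL, a is not a digit in base 10
```

Note that this sketch returns 0 for an empty string, which may or may not be what a real parser wants.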
Character attributes
In Lisp, the only attribute that a character must have is its code point.
As discussed earlier, an implementation can define characters besides the standard characters. Each character in a character repertoire has a code point. An implementation can use, for example, the Unicode character repertoire, in which case the characters it makes available through this repertoire have their Unicode code points as their codes.
To get the code point of a character in Lisp, the function char-code can be used:
> (char-code #\a )
;;;Get the code point of the character
;;;a in decimal .
97
By default, the print base in Lisp is decimal. To see the code point of a character as it is usually written in Unicode, the print base can be set to hexadecimal as follows.
> (setq *print-base* 16 )
;;;Set the print base to base 16.
;;;The output is 10, which is 16
;;;written in base 16.
10
> (char-code #\a )
;;;Get the code point of the character
;;;a in hexadecimal.
61
To get the character represented by a given code point, the function code-char can be used.
> (setq *read-base* 16 )
;;;Set the read base to base 16, instead
;;;of the default base 10. The print base
;;;in this session is the default, decimal.
16
> (code-char 61 )
;;;Enter the character code in hexadecimal.
;;;The implementation uses the Unicode
;;;character repertoire, as such 61 in
;;;hexadecimal is the code point for
;;;the character a.
#\a
> (setq *read-base* A )
;;;Set the read base back to the default
;;;base 10, which is written A in base 16.
10
> (code-char 97 )
#\a
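Changing *print-base* and *read-base* affects everything that is printed or read afterwards. As an alternative sketch, a single value can be shown in hexadecimal with the ~X directive of format, and a hexadecimal integer can be written with the #x prefix, without touching the global settings. The helper name char-code-hex is hypothetical, and the values shown assume ASCII-compatible character codes.

```lisp
(defun char-code-hex (ch)
  "Return the code of CH as a string of hexadecimal digits."
  (format nil "~X" (char-code ch)))

(char-code-hex #\a)   ; => "61" with ASCII-compatible codes
(code-char #x61)      ; => #\a, #x reads an integer in base 16
```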
An implementation might define additional attributes besides the code point. Historically, the font attribute and the bits attribute existed as part of Common Lisp, but they were removed in the process of standardizing Common Lisp. Instead, an implementation can define additional attributes of its own.
The font attribute was used to specify the style of a character, for example bold or italics.
Character name , encoding , and other meta information
A character can have a name. The only standard characters for which the standard defines a name are the Newline and the Space characters.
The standard also defines names for the following semi-standard characters, which an implementation may or may not support: Rubout, Page, Tab, Backspace, Return, Linefeed.
All non-graphic characters must have a name, unless they have some non-null implementation-defined attribute. The name of a character can be obtained as a string using the function char-name, and a character can be obtained from a string holding its name using the function name-char.
> (char-name #\a )
;;;Get the name of the character
;;;a. This name is the one given by
;;;the implementation used here.
"LATIN_SMALL_LETTER_A"
> (name-char "LATIN_SMALL_LETTER_A" )
;;;Get the character which has
;;;a name of LATIN_SMALL_LETTER_A.
#\a
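The two names that the standard does define can be used portably. The following short sketch shows them, along with the fact that name-char looks names up without regard to case.

```lisp
(char-name #\Space)      ; => "Space"
(char-name #\Newline)    ; => "Newline"
(name-char "newline")    ; => #\Newline, the lookup ignores case
(name-char "SPACE")      ; => #\Space
```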
A character has an encoding, which is how it is numerically represented in the machine. The encoding of a character in Lisp can be obtained using the char-int function.
> (char-int #\a )
;;;Get the integer value of the encoding
;;;of a character , the print base
;;;is decimal , hence the output for
;;;the character a is 97 .
97
The Lisp standard specifies that if a character has no implementation-defined attributes, then the char-int function must return the same value as char-code.
The constant variable char-code-limit holds the upper bound, not inclusive, on the code that a character might have in a given Lisp implementation.
> char-code-limit
1114112
Character comparison
To compare two or more characters for equality or inequality using their attributes, the predicate functions char= and char/= can be used. char= returns true if all the characters are the same, and char/= returns true if all the characters are different from one another. Otherwise they return false.
> (char= #\a #\LATIN_SMALL_LETTER_A #\A )
;;;Compare a, a, and A for equality
;;;using their attributes.
NIL
> (char/= #\a #\LATIN_SMALL_LETTER_A #\A )
;;;Compare a, a, and A for difference
;;;using their attributes. The first
;;;two are the same, so the result is false.
NIL
> (char= #\a #\LATIN_SMALL_LETTER_A #\a )
;;;Compare a, a, a for equality, using
;;;their attributes.
T
> (char/= #\a #\b #\c )
;;;Compare for difference a, b, and
;;;c using their attributes.
T
eql and equal can also be used when comparing only two characters. They yield the same result as char=.
eq can also be used to compare two characters, but it does not compare them as characters. It works at a lower, implementation-defined level. eq might yield false when char= yields true, but if eq yields true when comparing two characters, then char= must yield true.
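The agreement between char=, eql, and equal can be checked directly. The sketch below also includes equalp, which compares characters while ignoring case, and deliberately leaves out eq, since its result on characters depends on the implementation. same-char-report is a hypothetical helper name.

```lisp
(defun same-char-report (a b)
  "Compare the characters A and B with char=, eql, equal, and equalp."
  (list (char= a b) (eql a b) (equal a b) (equalp a b)))

(same-char-report #\a #\a) ; => (T T T T)
(same-char-report #\a #\A) ; => (NIL NIL NIL T), equalp ignores case
```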
To compare two or more characters for order, using their code points when all other attributes are the same, the predicate functions char<, char<=, char>, and char>= can be used.
The standard characters have the following ordering.
A<B<C<D<E<F<G<H<I<J<K<L<M<N<O<P<Q<R<S<T<U<V<W<X<Y<Z
;;;Ordering of uppercase characters
a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z
;;;Ordering of lowercase characters
0<1<2<3<4<5<6<7<8<9
;;;Ordering of digits
The standard guarantees only this partial ordering. It requires that the digits are not interleaved with the letters of either case, but it does not say how uppercase and lowercase letters compare with one another.
An example of using these functions:
> (char< #\A #\B #\C )
;;;Compare A, B, C for
;;;ascending order.
T
> (char> #\z #\f #\d )
;;;Compare z, f, d for
;;;descending order.
T
> (char<= #\1 #\DIGIT_ONE #\4 )
;;;Compare for less or equal
;;;order 1, 1, 4.
T
> (char>= #\4 #\3 #\5 )
;;;Compare for larger or equal
;;;order 4, 3, and 5.
NIL
If characters are to be compared for equality or for order while ignoring case, and ignoring any other attributes that the implementation chooses to ignore, then the functions char-equal, char-not-equal, char-greaterp, char-not-greaterp, char-lessp, and char-not-lessp can be used.
> (char-equal #\a #\A #\Latin_capital_letter_A )
;;;Compare for equality , ignoring case ,
;;;the characters a , A , and A .
T
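The case-insensitive predicates are convenient as sorting predicates. As a minimal sketch, the following sorts a list of characters alphabetically regardless of case; sort-ignoring-case is a hypothetical helper name. The list is copied first because sort is destructive.

```lisp
(defun sort-ignoring-case (chars)
  "Return a new list of CHARS sorted with char-lessp, which ignores case."
  (sort (copy-list chars) #'char-lessp))

(sort-ignoring-case (list #\b #\C #\a)) ; => (#\a #\b #\C)
```

With char< instead of char-lessp, the position of #\C relative to the lowercase letters would depend on the implementation.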
Converting a character designator to a character
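The function character can be used to convert a character designator to its corresponding character. The standard defines a character designator as a character, a string of length one, or a symbol whose name has length one. For example:
> (character 65 )
;;;An integer is not a character designator
;;;in the standard. Accepting it is an
;;;extension of the implementation used here.
;;;Portable code can use (code-char 65 ) instead.
#\A
> (character "a" )
#\a
> (character 'a )
;;;The reader converted the symbol name to A.
#\A
> (character #\A )
#\A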
standard-char , base-char , extended-char , character
A standard-char is any one of the 96 standard characters discussed earlier.
Whether a character is a standard character can be checked using standard-char-p.
> (standard-char-p #\a )
;;;Check if the character a is
;;;a standard character, returns
;;;true.
T
> (standard-char-p #\POUND_SIGN )
;;;Check if the pound sign character
;;;is a standard character, returns
;;;false.
NIL
A base-char is a supertype of standard-char. The set of base-char characters can contain additional characters.
This relates to how an implementation encodes characters. If an implementation has, for example, two encodings for characters, one of 8 bits and one of 16 bits, then base-char could correspond to the characters encoded using 8 bits, and the characters encoded using 16 bits would be the extended characters.
An extended-char is simply a character that is not a base-char.
The character type is a supertype of both the base-char and the extended-char types. In an implementation where all characters are base-char, there can be no extended-char.
The predicate function characterp can be used to check whether an object is a character.
> (characterp #\POUND_SIGN )
;;;Check if the character pound sign
;;;is a character .
T
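The relationships between these types can be explored with the standard functions typep and subtypep. The following is a small sketch; char-type-report is a hypothetical helper name.

```lisp
(defun char-type-report (ch)
  "List which of the four character types CH belongs to."
  (remove-if-not (lambda (type) (typep ch type))
                 '(standard-char base-char extended-char character)))

(char-type-report #\a)                 ; => (STANDARD-CHAR BASE-CHAR CHARACTER)
(subtypep 'standard-char 'base-char)   ; => T, T
(subtypep 'base-char 'character)       ; => T, T
```

For a non-standard character, the report depends on where the implementation draws the line between base-char and extended-char.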
Originally published at https://twiserandom.com on February 21, 2021.