Characters in Lisp: a tutorial

mohamad wael
Feb 21, 2021

Character repertoires in lisp

A character repertoire, such as ASCII, Latin alphabet No. 1, or Unicode, is a set of characters.

Each character in a repertoire, such as a, is assigned a code point. A code point is just a number. For example, in ASCII, the code point for a is 97 in decimal, or 61 in hexadecimal.

Each code point has an encoding. An encoding is the representation of the code point in the computer. For example, in ASCII, the character a is encoded the same as its code point, which is 61 in hexadecimal, or 0110 0001 in binary.

That being said, the characters that can be used in a Lisp implementation are defined by the character repertoire, or repertoires, that the implementation supports.

At a minimum, all Lisp implementations must support some basic characters, called the standard characters. The set formed by these characters is called the standard character repertoire, standard-char.

The characters that all Lisp implementations must support are: a-z, A-Z, 0-9, ! ? $ " ' ` . : , ; * + - / | \ ~ ^ _ < = > # % @ & ( ) [ ] { }, the non-graphic character newline, and the graphic character space.

Character types in lisp

Characters in Lisp can be used to write a program, or they can represent themselves, as data. The Lisp reader reads the typed-in characters one by one. Each character read has a type, and that type affects how the Lisp reader interprets the character.

Characters can be of the following types:

A macro character. The standard macro characters in Lisp are: ( ) ' ` , ; " #.

A macro character affects the parsing of the characters that follow it. When a macro character is encountered, its associated reader macro function is called. The reader macro function reads the following characters, and returns an entity.

For example, the double quote macro character " returns a string literal, as in "Hello world". The sharp sign macro character #, followed by a backslash \ and a single character or a character name, returns a literal character, as in #\a. Characters following the semicolon macro character ; are ignored up to the end of the line. The single quote macro character ' wraps the object that follows it in quote, so if that object is a symbol or a cons, the read-eval-print loop will not evaluate it. For example:

> '(+ 1 2 )
;;;The cons (+ 1 2 )
;;; is not evaluated .
(+ 1 2)
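
What the reader macro functions return can be inspected without evaluating anything, by handing the text to read-from-string. The following is only a minimal sketch using standard functions; the input strings are made up for illustration.

```lisp
;; The ' macro character wraps the next object in a QUOTE form.
(first (read-from-string "'(+ 1 2)"))             ; => QUOTE

;; The " macro character returns a string.
(read-from-string "\"Hello world\"")              ; => "Hello world"

;; #\ followed by a single character returns a character.
(read-from-string "#\\a")                         ; => #\a

;; Everything after ; is skipped up to the end of the line.
(read-from-string (format nil "; ignored~%42"))   ; => 42
```

read-from-string also returns a second value, the position where reading stopped, which is left out of the comments here.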

An escape character. An escape character is not a macro character, so there is no associated reader macro function to call, and no entity is returned.

The backslash \, the single escape character, escapes a single character: the one just following it. An escaped character is treated as alphabetic, and its case is preserved. For example:

> \'
;;;The macro character ' is escaped ,
;;;It is now treated as an alphabetic
;;;character , instead of being
;;;a macro character .
;;;One or more alpha characters ,
;;;can form a symbol .
;;; The symbol is evaluated , by the
;;;read eval loop , this will cause an
;;;error , as no value is associated
;;;with the symbol '.
*** - SYSTEM::READ-EVAL-PRINT: variable |'| has no value
The following restarts are available:

The vertical bars |, the multiple escape characters, escape the characters enclosed between them. Enclosed characters are treated as alphabetic, and their case is preserved. A single escape or multiple escape character appearing between vertical bars must itself be escaped, using the single escape character.

> |a(`;b|
;;;Characters in vertical bars ,
;;;are escaped , and treated as
;;;alpha , and their case is
;;;preserved . The symbol
;;;|a(`;b| , has no value , and
;;;this will cause an error .
*** - SYSTEM::READ-EVAL-PRINT: variable |a(`;b| has no value
The following restarts are available:

A constituent character. The constituent characters are Rubout (the delete character in ASCII), Backspace, and A-Z a-z 0-9 ! $ % & * + - . / : < = > ? @ [ ] ^ _ { } ~.

From the constituent characters and the escaped characters that it reads, the Lisp reader forms tokens. A token can either be a number, such as 1 or 1.2, or a symbol, such as a or \'. An escaped character is considered a constituent character.

A number does not have a name, and it evaluates to itself. A symbol has a name, which is what is read by the Lisp reader, and it has zero or more values associated with it.

The Lisp reader converts the case of constituent characters according to the value of (readtable-case *readtable*). By default this value is :upcase, so constituent characters are converted to uppercase. An escaped constituent character keeps its case.

> (readtable-case *readtable*)
;;;Print the case for constituent characters
:UPCASE
> 'Ab
;;;b is converted to
;;;uppercase .
AB
> 'A\b
;;;b case is preserved .
|Ab|
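
The same behaviour can be observed through symbol-name, and the case conversion can be changed on a copy of the readtable. This is a sketch that assumes the code itself is read with the default :upcase setting; the symbol names are arbitrary.

```lisp
;; With the default :upcase setting, unescaped letters are upcased.
(symbol-name 'Ab)                          ; => "AB"

;; An escaped letter keeps its case.
(symbol-name 'A\b)                         ; => "Ab"

;; Work on a copy of the standard readtable, so the change stays local.
(let ((*readtable* (copy-readtable nil)))
  (setf (readtable-case *readtable*) :preserve)
  (symbol-name (read-from-string "Ab")))   ; => "Ab"
```

Binding *readtable* inside let means the global readtable is untouched once the form returns.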

A whitespace character. Among the standard characters, the whitespace characters are the space and the newline characters. The semi-standard characters Tab, Page, Return, and Linefeed are also whitespace. Whitespace characters are used to separate tokens.

> '|a b|
;;;One token is formed , since
;;;vertical bar escape characters
;;;are used .
|a b|
> (+ 2 3 )
;;;a list of three tokens
;;;+ 2 and 3 .
5

Literal characters

Characters can represent themselves as data; these are the literal characters.

A literal character is input by using #\ followed by a single character, or by a character name.

If #\ is followed by a single character, the character's case is preserved. For example, the literal character capital A is written #\A, and the lowercase character a is written #\a.

If #\ is followed by a character name, the name is converted to uppercase, and the character that has this name is the resulting character literal. For example, #\LATIN_CAPITAL_LETTER_A represents A, and #\LATIN_SMALL_LETTER_A represents a. These particular names are implementation-defined.

Depending on whether printer escaping is enabled or disabled, a literal character is output either with a preceding #\, or just as is. Whether printer escaping is enabled can be checked by using *print-escape*.

> *print-escape*
T
> #\a
;;;input is #\a , and output
;;;is #\a
#\a
> #\A
;;;input is #\A , and output
;;;is #\A
#\A
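
The effect of printer escaping can also be seen without touching the global variable. In this small sketch, prin1-to-string prints with escaping enabled, princ-to-string prints with it disabled, and write-to-string takes an explicit :escape argument.

```lisp
;; Escaping enabled: the #\ prefix is printed.
(prin1-to-string #\a)               ; => "#\\a"  (the three characters #\a)

;; Escaping disabled: only the character itself is printed.
(princ-to-string #\a)               ; => "a"

;; The same choice, made explicitly.
(write-to-string #\A :escape t)     ; => "#\\A"
(write-to-string #\A :escape nil)   ; => "A"
```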

Character categories

A character in Lisp can belong to one or more categories.

A graphic character is a character which has a glyph, used to display the character. All standard characters besides the newline character are graphic characters.

The predicate function graphic-char-p can be used to check whether a character is graphic or not.

> (graphic-char-p #\ )
;;;The space character is a graphic
;;;character.
T
> (graphic-char-p #\
)
;;;The new line character is not graphical
;;;It is informally called a control character
NIL

An alphabetic character is a character which is also graphic. The standard characters which are alphabetic in Lisp are: A-Z, a-z.

An implementation-defined character that has case must be alphabetic. If it has no case, it is implementation-defined whether it is alphabetic or not.

To check whether a character is alphabetic, the predicate function alpha-char-p can be used.

> (alpha-char-p #\1 )
;;;The character 1 is not alphabetic
NIL
> (alpha-char-p #\a )
;;;The character a is alphabetic
T

A numeric character is also a graphic character. The standard numeric characters are 0-9. An implementation may define other numeric characters.

An alphanumeric character is a graphic character which is either numeric or alphabetic. The standard characters which are alphanumeric are: a-z, A-Z, and 0-9.

The predicate function alphanumericp can be used to check whether a character is alphanumeric.

> (alphanumericp #\a )
;;;Check if a is alphanumeric ,
;;;returns true .
T
> (alphanumericp #\1 )
;;;Check if 1 , is alphanumeric ,
;;;returns true.
T
> (and (alphanumericp #\1 ) (not (alpha-char-p #\1 ) ))
;;;Check if a character is numeric , by checking
;;;if it is alphanumeric , and is not alphabetic .
;;;Check if 1 is numeric
T

A cased character is an alphabetic character which is either uppercase or lowercase, and which has a counterpart character in the other case.

The standard uppercase characters are A-Z, and the standard lowercase characters are a-z.

The predicate functions upper-case-p and lower-case-p can be used to check whether a character is uppercase or lowercase.

The both-case-p predicate function can be used to check whether a character has both an uppercase and a lowercase version. Some non-standard characters might not be cased; for example, Arabic characters do not have a case.

The functions char-downcase and char-upcase can be used to get the lowercase and the uppercase version of a character.

> (upper-case-p #\A )
;;;Check if the character A ,
;;;is uppercase .
T
> (lower-case-p #\a )
;;;Check if the character a ,
;;; is lowercase .
T
> (both-case-p #\a )
;;;Check if the character a ,
;;;can be uppercase and
;;;lowercase .
T
> (both-case-p #\1 )
;;;Check if the character 1 ,
;;;can be uppercase and lowercase .
NIL
> (both-case-p #\ARABIC_LETTER_ALEF )
;;;Check if the arabic character alef
;;;has uppercase , and lowercase .
NIL
> (char-upcase #\a )
;;;Get the uppercase version
;;;of the character a .
#\A
> (char-downcase #\A )
;;;Get the lowercase version ,
;;;of the character A .
#\a

A digit character is a digit in a given radix. For example, A in base 16 is considered to be a digit. The valid radices are between 2 and 36 inclusive, and the digits of a radix can range from 0 to Z, where Z has the weight 35. The digits of a radix are case-insensitive.

The predicate function digit-char-p can be used to check whether a character is a digit in a given radix.

> (digit-char-p #\9 )
;;;If no radix is specified , then the
;;;default radix , which is used is 10 .
;;;digit-char-p , returns either the
;;;weight of the digit in the radix ,
;;;or false if the character is
;;;not a digit in the provided
;;;radix .
9
> (digit-char-p #\F 16 )
15

The function digit-char can be used to get the character that represents a digit in a given radix.

> (digit-char 9 )
;;;If no radix is specified , the default
;;;one is base 10 . The number 9 , is a
;;;digit in base 10 , hence digit-char
;;;returns its representing character .
#\9
> (digit-char 10 )
;;;The number 10 , is not a digit
;;;in base 10 , hence the digit-char
;;;function returns false .
NIL
> (digit-char 35 36 )
;;;The number 35 , is a digit ,
;;;in base 36 , hence the function
;;;digit-char returns its representing
;;;character capital Z .
#\Z
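
Since digit-char-p returns the weight of a digit, it can be used to turn a string of digits into a number. The function parse-digits is a hypothetical helper written for this illustration, not a standard function; in practice the standard parse-integer with its :radix argument does this job.

```lisp
(defun parse-digits (string &optional (radix 10))
  "Hypothetical helper: return the integer written in STRING using RADIX,
or NIL if some character is not a digit in that radix."
  (let ((value 0))
    (loop for ch across string
          for weight = (digit-char-p ch radix)
          do (unless weight
               (return-from parse-digits nil))
             ;; Shift the digits read so far, then add the new one.
             (setf value (+ (* value radix) weight)))
    value))

(parse-digits "ff" 16)   ; => 255
(parse-digits "ZZ" 36)   ; => 1295
(parse-digits "12a")     ; => NIL, a is not a digit in base 10
```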

Character attributes

In Lisp, the only attribute that a character must have is its code point.

As discussed earlier, an implementation can define characters besides the standard characters. Each character in a character repertoire has a code point. An implementation can use, for example, the Unicode character repertoire, and all the characters made available by the implementation through this repertoire will have the Unicode code points as their code points.

To get the code point of a character in Lisp, the function char-code can be used:

> (char-code #\a )
;;;Get the code point of the character
;;;a in decimal .
97

By default, the print base in Lisp is decimal. To see the code point of a character in hexadecimal, as it is usually written in Unicode, the print base can be set to 16 as follows.

> (setq *print-base* 16 )
;;;Set the print base to base 16 .
;;;outputs 10 in base 16 , to indicate
;;;the set base .
10
> (char-code #\a )
;;;Get the code point of the character
;;;a in hexadecimal .
61

To get the character represented by a given code point, the function code-char can be used.

> (setq *read-base* 16 )
;;;Set the read base , to base 16 , instead
;;;of the default base 10 . The print base
;;;is assumed to be back to 10 here .
16
> (code-char 61 )
;;;Enter , the character code in hexadecimal .
;;;The implementation uses , the unicode ,
;;;character repertoire , as such 61 in
;;;hexadecimal , is the code point for
;;;the character a .
#\a
> (setq *read-base* A )
;;;Set the read base to the default
;;; base 10 .
10
> (code-char 97 )
#\a
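
Changing *print-base* or *read-base* affects everything that is printed or read afterwards. As a sketch of an alternative, format can print a single number in hexadecimal, and the #x syntax can read one. The expected values assume an implementation whose character codes follow ASCII or Unicode, as in the examples above.

```lisp
;; Print the code point in hexadecimal, without changing *print-base*.
(format nil "~X" (char-code #\a))          ; => "61"

;; Zero padded to four digits, in the usual Unicode notation.
(format nil "U+~4,'0X" (char-code #\a))    ; => "U+0061"

;; Read a hexadecimal code point, without changing *read-base*.
(code-char #x61)                           ; => #\a

;; char-code and code-char are inverses of each other.
(char= (code-char (char-code #\a)) #\a)    ; => T
```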

An implementation might define additional attributes besides the code point. Historically, the font attribute and the bits attribute existed as part of Common Lisp, but they were removed in the process of standardizing Common Lisp. Instead, an implementation can define additional attributes.

The font attribute was used to specify the style of a character, for example bold or italics.

Character name, encoding, and other meta information

A character can have a name. The only standard characters for which the standard defines a name are the Newline and the Space characters.

The standard also defines names for the following semi-standard characters: Rubout, Page, Tab, Backspace, Return, Linefeed.

All non-graphic characters must have a name, unless they have some non-null implementation-defined attribute. The name of a character can be obtained as a string using the function char-name, and a character can be obtained from a string holding its name using the function name-char.

> (char-name #\a )
;;;Get the name of the character
;;;a .
"LATIN_SMALL_LETTER_A"
> (name-char "LATIN_SMALL_LETTER_A" )
;;;Get the character which has
;;;a name of LATIN_SMALL_LETTER_A .
#\a
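
Names such as LATIN_SMALL_LETTER_A depend on the implementation, so a portable sketch is limited to the names that the standard defines. The string passed in the last call is an arbitrary one, chosen so that it names no character.

```lisp
;; Space and Newline are the named standard characters.
(char-name #\Space)                  ; => "Space"

;; name-char ignores the case of the name.
(name-char "Newline")                ; => #\Newline
(name-char "newline")                ; => #\Newline

;; A string which names no character yields NIL.
(name-char "not-a-character-name")   ; => NIL
```

For a graphic character such as a, char-name may return NIL or an implementation-defined name.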

A character has an encoding, which is how it is numerically represented in a computing machine. The encoding of a character in Lisp can be obtained by using the char-int function.

> (char-int #\a )
;;;Get the integer value of the encoding
;;;of a character , the print base
;;;is decimal , hence the output for
;;;the character a is 97 .
97

The Lisp standard specifies that if the implementation does not define any implementation-specific attributes for a character, then the char-int function must return the same value as char-code.

The constant variable char-code-limit is an exclusive upper bound on the code that a character can have in a given Lisp implementation.

> char-code-limit 
1114112

Character comparison

To compare two or more characters for equality or inequality using their attributes, the predicate functions char= and char/= can be used. char= returns true if all the characters are the same, and char/= returns true if all the characters are different from one another. Otherwise, they return false.

> (char=  #\a #\LATIN_SMALL_LETTER_A #\A )
;;;compare a , a and A for equality
;;;using their attributes .
NIL
> (char/= #\a #\LATIN_SMALL_LETTER_A #\A )
;;;Compare a , a , and A , for difference
;;;using their attributes .
NIL
> (char= #\a #\LATIN_SMALL_LETTER_A #\a )
;;;Compare a , a , a for equality , using
;;;their attributes .
T
> (char/= #\a #\b #\c )
;;;Compare for difference , a , b , and
;;;c using their attributes .
T
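
With more than two arguments, char/= checks every pair of characters, not only the neighbouring ones. A short sketch with arbitrary characters:

```lisp
;; The first and the last characters are the same, so not all differ.
(char/= #\a #\b #\a)   ; => NIL

;; All three characters differ from one another.
(char/= #\a #\b #\c)   ; => T

;; char= requires all the characters to be the same.
(char= #\a #\a #\b)    ; => NIL
(char= #\a #\a #\a)    ; => T
```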

eql and equal can also be used when comparing two characters only; they will yield the same result as char=.

eq can be used to compare two characters for equality, but it does not use the character concept for comparison; it works at a lower, implementation-defined level. eq might yield false when char= yields true, but if eq yields true when comparing two characters, then char= must yield true.

To compare two or more characters for order using their code points, when all other attributes are the same, the predicate functions char<, char<=, char>, and char>= can be used.

The standard characters have the following ordering.

A<B<C<D<E<F<G<H<I<J<K<L<M<N<O<P<Q<R<S<T<U<V<W<X<Y<Z
;;;Ordering of capital case characters
a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z
;;;Ordering of lowercase characters
0<1<2<3<4<5<6<7<8<9
;;;Ordering of numbers

An example of using these functions:

> (char< #\A #\B #\C )
;;;Compare A , B , C for
;;;ascending order .
T
> (char> #\z #\f #\d )
;;;Compare z , f , d for
;;;descending order .
T
> (char<= #\1 #\DIGIT_ONE #\4 )
;;;Compare for less or equal
;;;order 1 , 1 , 4 .
T
> (char>= #\4 #\3 #\5 )
;;;Compare for larger or equal
;;;order 4 , 3 and 5 .
NIL

If characters are to be compared for equality or for order, ignoring case and any other attributes that the implementation chooses to ignore, then the functions char-equal, char-not-equal, char-greaterp, char-not-greaterp, char-lessp, and char-not-lessp can be used.

> (char-equal #\a #\A #\Latin_capital_letter_A )
;;;Compare for equality , ignoring case ,
;;;the characters a , A , and A .
T
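
The case-insensitive predicates can be put next to their case-sensitive counterparts, and they can also be passed to sort. This is a sketch with arbitrary characters; copy-seq is used because sort is destructive and a literal string should not be modified.

```lisp
;; Case is ignored by char-equal, but not by char=.
(char-equal #\a #\A)                   ; => T
(char= #\a #\A)                        ; => NIL

;; Ordering that ignores case.
(char-lessp #\a #\B)                   ; => T
(char-not-greaterp #\a #\A #\b)        ; => T

;; Sort the characters of a string, ignoring case.
(sort (copy-seq "bAc") #'char-lessp)   ; => "Abc"
```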

Converting a character designator to a character

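The function character can be used to convert a character designator to its corresponding character. The standard defines a character designator as a character, a string of length one, or a symbol whose name has length one. Accepting an integer, as the first call in the example does, goes beyond the standard, so that call may signal an error in other implementations. For example: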

> (character 65 )
#\A
> (character "a" )
#\a
> (character 'a )
#\A
> (character #\A )
#\A

standard-char, base-char, extended-char, character

A standard-char is any of the 96 standard characters discussed earlier.

Whether a character is a standard character can be checked using standard-char-p.

> (standard-char-p  #\a )
;;;Check if the character a , is
;;;a standard character , returns
;;;true .
T
> (standard-char-p #\POUND_SIGN )
;;;Check if the pound sign character
;;;is a standard character , returns
;;;false .
NIL

A base-char is a supertype of standard-char. The set of base-char characters can contain additional characters.

This is more related to how an implementation encodes characters. If an implementation, for example, has two encodings of characters, one which is 8 bits and a second one which is 16 bits, then base-char corresponds to the characters which are encoded using 8 bits, and the characters encoded using 16 bits are called extended characters.

An extended-char is simply a character which is not a base-char.

The character type is a supertype of both the base-char and the extended-char types. In an implementation where all characters are base-char, there can be no extended-char.

The predicate function characterp can be used to check whether an object is a character.

> (characterp #\POUND_SIGN )
;;;Check if the character pound sign
;;;is a character .
T
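
The relations between these types can also be checked with typep and subtypep. This is a small sketch using only standard type names; subtypep returns a second value as well, which is left out of the comments.

```lisp
;; The standard character a belongs to all three types.
(typep #\a 'standard-char)             ; => T
(typep #\a 'base-char)                 ; => T
(typep #\a 'character)                 ; => T

;; standard-char is a subtype of base-char, which is a subtype of character.
(subtypep 'standard-char 'base-char)   ; => T
(subtypep 'base-char 'character)       ; => T
(subtypep 'extended-char 'character)   ; => T

;; A string of length one is not a character.
(characterp "a")                       ; => NIL
```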

Originally published at https://twiserandom.com on February 21, 2021.
