PHP Cross Reference of DokuWiki

/inc/ -> utf8.php (summary)

UTF8 helper functions

Author: Andreas Gohr
License: LGPL 2.1 (http://www.gnu.org/copyleft/lesser.html)
File Size: 1772 lines (90 kb)
Included or required: 1 time
Referenced: 0 times
Includes or requires: 0 files

Defines 1 class

utf8_entity_decoder:: (9 methods):
  __construct()
  makeutf8()
  decode()
  utf8_to_unicode()
  unicode_to_utf8()
  utf8_to_utf16be()
  utf16be_to_utf8()
  utf8_bad_replace()
  utf8_correctIdx()

Defines 22 functions

  utf8_isASCII()
  utf8_strip()
  utf8_check()
  utf8_basename()
  utf8_strlen()
  utf8_substr()
  utf8_substr_replace()
  utf8_ltrim()
  utf8_rtrim()
  utf8_trim()
  utf8_strtolower()
  utf8_strtoupper()
  utf8_ucfirst()
  utf8_ucwords()
  utf8_ucwords_callback()
  utf8_deaccent()
  utf8_romanize()
  utf8_stripspecials()
  utf8_strpos()
  utf8_tohtml()
  utf8_unhtml()
  utf8_decode_numeric()

Class: utf8_entity_decoder  - X-Ref

Encapsulate HTML entity decoding tables

__construct()   X-Ref
Initializes the decoding tables


makeutf8($c)   X-Ref
Wrapper around unicode_to_utf8()

param: string $c
return: string|false

decode($ent)   X-Ref
Decodes any HTML entity to its correct UTF-8 char equivalent

param: string $ent An entity
return: string|false

utf8_to_unicode($str,$strict=false)   X-Ref
Takes a UTF-8 string and returns an array of ints representing the
Unicode characters. Astral planes are supported, i.e. the ints in the
output can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates
are not allowed.

If $strict is set to true, the function returns false if the input
string isn't a valid UTF-8 octet sequence and raises a PHP error at
level E_USER_WARNING.

Note: this function has been modified slightly in this library to
trigger errors on encountering bad bytes

param: string  $str UTF-8 encoded string
param: boolean $strict Check for invalid sequences?
link: http://hsivonen.iki.fi/php-utf8/
link: http://sourceforge.net/projects/phputf8/
see: unicode_to_utf8
author: <hsivonen@iki.fi>
author: Harry Fuecks <hfuecks@gmail.com>
return: mixed array of unicode code points or false if UTF-8 invalid
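
Example (a minimal sketch, not the DokuWiki implementation): the decoding
described above can be illustrated with a hypothetical helper,
sketch_utf8_to_codepoints(). It assumes PCRE's /u validation to reject bad
sequences and surrogates, and it returns false on invalid input in both
modes, which may differ from how the original behaves when $strict is false.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_utf8_to_codepoints(string $str, bool $strict = false)
{
    // With the /u modifier, preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        if ($strict) {
            trigger_error('Invalid UTF-8 octet sequence', E_USER_WARNING);
        }
        return false;
    }

    $out = [];
    foreach ($chars as $ch) {
        $b = array_values(unpack('C*', $ch)); // bytes of one character
        switch (count($b)) {
            case 1:
                $cp = $b[0];
                break;
            case 2:
                $cp = (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
                break;
            case 3:
                $cp = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
                break;
            default:
                $cp = (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                    | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
        }
        if ($cp === 0xFEFF) {
            continue; // occurrences of the BOM are ignored
        }
        $out[] = $cp;
    }
    return $out;
}
```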

unicode_to_utf8($arr,$strict=false)   X-Ref
Takes an array of ints representing the Unicode characters and returns
a UTF-8 string. Astral planes are supported, i.e. the ints in the
input can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates
are not allowed.

If $strict is set to true the function returns false if the input
array contains ints that represent surrogates or are outside the
Unicode range and raises a PHP error at level E_USER_WARNING

Note: this function has been modified slightly in this library to use
output buffering to concatenate the UTF-8 string (faster) as well as
reference the array by its keys

param: array $arr of unicode code points representing a string
param: boolean $strict Check for invalid sequences?
link: http://hsivonen.iki.fi/php-utf8/
link: http://sourceforge.net/projects/phputf8/
see: utf8_to_unicode
author: <hsivonen@iki.fi>
author: Harry Fuecks <hfuecks@gmail.com>
return: string|false UTF-8 string or false if array contains invalid code points
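
Example (a minimal sketch, not the DokuWiki implementation): the encoding
direction can be illustrated with a hypothetical helper,
sketch_codepoints_to_utf8(). It uses plain string concatenation instead of
the output buffering mentioned above, and it returns false for surrogates
and out-of-range values in both modes, which is an assumption.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_codepoints_to_utf8(array $arr, bool $strict = false)
{
    $out = '';
    foreach ($arr as $cp) {
        if ($cp === 0xFEFF) {
            continue; // occurrences of the BOM are ignored
        }
        $isSurrogate = ($cp >= 0xD800 && $cp <= 0xDFFF);
        if ($cp < 0 || $cp > 0x10FFFF || $isSurrogate) {
            if ($strict) {
                trigger_error('Invalid code point in input array', E_USER_WARNING);
            }
            return false;
        }
        if ($cp < 0x80) {
            $out .= chr($cp);                                  // 1 byte
        } elseif ($cp < 0x800) {
            $out .= chr(0xC0 | ($cp >> 6))                     // 2 bytes
                  . chr(0x80 | ($cp & 0x3F));
        } elseif ($cp < 0x10000) {
            $out .= chr(0xE0 | ($cp >> 12))                    // 3 bytes
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
        } else {
            $out .= chr(0xF0 | ($cp >> 18))                    // 4 bytes
                  . chr(0x80 | (($cp >> 12) & 0x3F))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
        }
    }
    return $out;
}
```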

utf8_to_utf16be(&$str, $bom = false)   X-Ref
UTF-8 to UTF-16BE conversion.

May really produce UCS-2 when mb_string is unavailable, due to utf8_to_unicode limits

param: string $str
param: bool $bom
return: string

utf16be_to_utf8(&$str)   X-Ref
UTF-16BE to UTF-8 conversion.

May really handle only UCS-2 when mb_string is unavailable, due to utf8_to_unicode limits

param: string $str
return: false|string
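
Example (a minimal sketch under stated assumptions): the UCS-2 remark in the
two conversions above concerns code points beyond 0xFFFF, which UTF-16 has to
encode as a surrogate pair. The hypothetical helper below,
sketch_codepoints_to_utf16be(), takes already decoded code points and is not
the DokuWiki code. It only shows what a full UTF-16BE encoder has to do.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_codepoints_to_utf16be(array $codepoints, bool $bom = false): string
{
    $out = $bom ? "\xFE\xFF" : ''; // optional big-endian byte order mark
    foreach ($codepoints as $cp) {
        if ($cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit unit (this is all UCS-2 covers).
            $out .= pack('n', $cp);
        } else {
            // Astral planes: split into a high and a low surrogate.
            $cp -= 0x10000;
            $out .= pack('nn', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
        }
    }
    return $out;
}
```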

utf8_bad_replace($str, $replace = '')   X-Ref
Replace bad bytes with an alternative character

ASCII character is recommended for replacement char

PCRE Pattern to locate bad bytes in a UTF-8 string
Comes from W3 FAQ: Multilingual Forms
Note: modified to include full ASCII range including control chars

param: string $str to search
param: string $replace to replace bad bytes with (defaults to '') - use ASCII
see: http://www.w3.org/International/questions/qa-forms-utf-8
author: Harry Fuecks <hfuecks@gmail.com>
return: string
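
Example (a minimal sketch, not the DokuWiki implementation): the bad-byte
replacement can be illustrated with a pattern modelled on the W3C FAQ
expression, extended to the full ASCII range as noted above. The helper name
sketch_bad_replace() and the use of preg_replace_callback() are assumptions.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_bad_replace(string $str, string $replace = ''): string
{
    $valid =
        '[\x00-\x7F]'                        // ASCII, including control chars
        . '|[\xC2-\xDF][\x80-\xBF]'          // non-overlong 2-byte
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'      // 3-byte, excluding overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}' // straight 3-byte
        . '|\xED[\x80-\x9F][\x80-\xBF]'      // 3-byte, excluding surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'   // planes 1-3
        . '|[\xF1-\xF3][\x80-\xBF]{3}'       // planes 4-15
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';  // plane 16

    // Group 1 matches a valid sequence, group 2 matches a single stray byte.
    $pattern = '/(' . $valid . ')|([\x80-\xFF])/';

    return (string) preg_replace_callback(
        $pattern,
        function (array $m) use ($replace): string {
            return ($m[2] ?? '') !== '' ? $replace : $m[0];
        },
        $str
    );
}
```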

utf8_correctIdx(&$str,$i,$next=false)   X-Ref
Adjusts a byte index into a UTF-8 string to a UTF-8 character boundary

param: string $str   utf8 character string
param: int    $i     byte index into $str
param: bool   $next  direction to search for boundary
author: chris smith <chris@jalakai.co.uk>
return: int            byte index into $str now pointing to a utf8 character boundary
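
Example (a minimal sketch, not the DokuWiki implementation): in UTF-8 every
continuation byte has the bit pattern 10xxxxxx, so a boundary can be found by
stepping over such bytes. The helper name sketch_correct_idx() is
hypothetical, and the meaning chosen for $next (true searches forward, false
searches backward) is an assumption.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_correct_idx(string $str, int $i, bool $next = false): int
{
    $limit = strlen($str);
    if ($i <= 0) {
        return 0;
    }
    if ($i >= $limit) {
        return $limit;
    }
    if ($next) {
        // Move forward past continuation bytes (10xxxxxx).
        while ($i < $limit && (ord($str[$i]) & 0xC0) === 0x80) {
            $i++;
        }
    } else {
        // Move backward to the lead byte of the current character.
        while ($i > 0 && (ord($str[$i]) & 0xC0) === 0x80) {
            $i--;
        }
    }
    return $i;
}
```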

Functions
Functions that are not part of a class:

utf8_isASCII($str)   X-Ref
Checks if a string contains 7-bit ASCII only

param: string $str
author: Andreas Haerter <andreas.haerter@dev.mail-node.com>
return: bool

utf8_strip($str)   X-Ref
Strips all high-byte chars

Returns a pure 7-bit ASCII string

param: string $str
author: Andreas Gohr <andi@splitbrain.org>
return: string
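
Example (a minimal sketch, not the DokuWiki implementation): both
utf8_isASCII() and utf8_strip() come down to looking for bytes with the high
bit set. The helper names below are hypothetical.

```php
<?php
// Hypothetical helpers for illustration only.

// True when no byte in the range 0x80-0xFF occurs in the string.
function sketch_is_ascii(string $str): bool
{
    return preg_match('/[\x80-\xFF]/', $str) === 0;
}

// Removes every byte in the range 0x80-0xFF, leaving 7-bit ASCII.
function sketch_strip_highbytes(string $str): string
{
    return (string) preg_replace('/[\x80-\xFF]+/', '', $str);
}
```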

utf8_check($Str)   X-Ref
Tries to detect whether a string is UTF-8 encoded

param: string $Str
link: http://php.net/manual/en/function.utf8-encode.php
author: <bmorel@ssi.fr>
return: bool
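
Example (a sketch using a different technique from the byte-level check the
original is likely to perform): PCRE validates its subject when the /u
modifier is set, so an empty pattern can serve as a UTF-8 validity test. The
helper name sketch_is_utf8() is hypothetical.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_is_utf8(string $str): bool
{
    // preg_match() returns false instead of 1 when the subject is not valid UTF-8.
    return preg_match('//u', $str) === 1;
}
```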

utf8_basename($path, $suffix='')   X-Ref
A locale independent basename() implementation

Works around a bug in PHP's basename() implementation

param: string $path     A path
param: string $suffix   If the name component ends in suffix this will also be cut off
link: https://bugs.php.net/bug.php?id=37738
see: basename()
return: string
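
Example (a minimal sketch, not the DokuWiki implementation): a basename that
works purely on bytes and path separators never consults the locale, so
multibyte file names pass through untouched. The helper name
sketch_basename() and its exact edge-case behaviour are assumptions.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_basename(string $path, string $suffix = ''): string
{
    // Drop trailing separators, then keep what follows the last separator.
    $path  = rtrim($path, '/\\');
    $slash = strrpos($path, '/');
    $back  = strrpos($path, '\\');
    $pos   = max($slash === false ? -1 : $slash, $back === false ? -1 : $back);
    $name  = substr($path, $pos + 1);

    // Cut off the suffix if the name ends in it (but is not equal to it).
    $len = strlen($suffix);
    if ($len > 0 && $name !== $suffix && substr($name, -$len) === $suffix) {
        $name = substr($name, 0, -$len);
    }
    return $name;
}
```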

utf8_strlen($string)   X-Ref
Unicode aware replacement for strlen()

utf8_decode() converts characters that are not in ISO-8859-1
to '?', which is fine for the purpose of counting. It is
even faster than mb_strlen.

param: string $string
see: strlen()
see: utf8_decode()
author: <chernyshevsky at hotmail dot com>
return: int
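
Example (a sketch using a different technique from the utf8_decode() trick
described above): in valid UTF-8 each character contributes exactly one byte
that is not a continuation byte, so counting those bytes gives the character
count. The helper name sketch_strlen() is hypothetical, and the result is only
meaningful for valid UTF-8 input.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_strlen(string $string): int
{
    // Count every byte outside the continuation range 0x80-0xBF.
    return (int) preg_match_all('/[^\x80-\xBF]/', $string);
}
```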

utf8_substr($str, $offset, $length = null)   X-Ref
UTF-8 aware alternative to substr

Return part of a string given character offset (and optionally length)

param: string $str
param: int $offset number of UTF-8 characters offset (from left)
param: int $length (optional) length in UTF-8 characters from offset
author: Harry Fuecks <hfuecks@gmail.com>
author: Chris Smith <chris@jalakai.co.uk>
return: string
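
Example (a minimal sketch, not the DokuWiki implementation): character-based
offsets can be illustrated by splitting the string into characters first. The
helper name sketch_substr() is hypothetical, and splitting the whole string is
simple rather than fast.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_substr(string $str, int $offset, ?int $length = null): string
{
    // Split into an array of UTF-8 characters (false on invalid UTF-8).
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';
    }
    // array_slice() follows the same offset/length rules as substr().
    return implode('', array_slice($chars, $offset, $length));
}
```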

utf8_substr_replace($string, $replacement, $start , $length=0 )   X-Ref
Unicode aware replacement for substr_replace()

param: string $string      input string
param: string $replacement the replacement
param: int    $start       the replacing will begin at the start'th offset into string.
param: int    $length      If given and is positive, it represents the length of the portion of string which is to be replaced.
see: substr_replace()
author: Andreas Gohr <andi@splitbrain.org>
return: string

utf8_ltrim($str,$charlist='')   X-Ref
Unicode aware replacement for ltrim()

param: string $str
param: string $charlist
see: ltrim()
author: Andreas Gohr <andi@splitbrain.org>
return: string

utf8_rtrim($str,$charlist='')   X-Ref
Unicode aware replacement for rtrim()

param: string $str
param: string $charlist
see: rtrim()
author: Andreas Gohr <andi@splitbrain.org>
return: string

utf8_trim($str,$charlist='')   X-Ref
Unicode aware replacement for trim()

param: string $str
param: string $charlist
see: trim()
author: Andreas Gohr <andi@splitbrain.org>
return: string
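
Example (a minimal sketch, not the DokuWiki implementation): the three trim
variants above differ only in which end they anchor to. A two-sided version
can be illustrated with a /u character class built from $charlist. The helper
name sketch_trim() and the use of preg_quote() for escaping are assumptions.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_trim(string $str, string $charlist = ''): string
{
    if ($charlist === '') {
        return trim($str); // default whitespace trimming is byte-safe
    }
    // Escape the characters so they are literal inside a character class.
    $class = '[' . preg_quote($charlist, '/') . ']+';
    // \A anchors at the very start, \z at the very end of the subject.
    return (string) preg_replace('/\A' . $class . '|' . $class . '\z/u', '', $str);
}
```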

utf8_strtolower($string)   X-Ref
This is a unicode aware replacement for strtolower()

Uses mb_string extension if available

param: string $string
see: strtolower()
see: utf8_strtoupper()
author: Leo Feyer <leo@typolight.org>
return: string

utf8_strtoupper($string)   X-Ref
This is a unicode aware replacement for strtoupper()

Uses mb_string extension if available

param: string $string
see: strtoupper()
see: utf8_strtolower()
author: Leo Feyer <leo@typolight.org>
return: string
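
Example (a minimal sketch, not the DokuWiki implementation): "uses mb_string
extension if available" suggests a feature check with a table-driven
fallback. The helper name sketch_strtolower() is hypothetical, and the
three-entry table is only an illustrative subset of a real case map.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_strtolower(string $string): string
{
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($string, 'UTF-8');
    }
    // Fallback: ASCII via strtolower(), a few non-ASCII letters via a table.
    static $map = [
        "\xC3\x84" => "\xC3\xA4", // A with diaeresis
        "\xC3\x96" => "\xC3\xB6", // O with diaeresis
        "\xC3\x9C" => "\xC3\xBC", // U with diaeresis
    ];
    return strtr(strtolower($string), $map);
}
```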

utf8_ucfirst($str)   X-Ref
UTF-8 aware alternative to ucfirst
Make a string's first character uppercase

param: string $str
author: Harry Fuecks
return: string with first character as upper case (if applicable)

utf8_ucwords($str)   X-Ref
UTF-8 aware alternative to ucwords
Uppercase the first character of each word in a string

param: string $str
see: http://php.net/ucwords
author: Harry Fuecks
return: string with first char of each word uppercase

utf8_ucwords_callback($matches)   X-Ref
Callback function for preg_replace_callback call in utf8_ucwords
You don't need to call this yourself

param: array $matches matches corresponding to a single word
see: utf8_ucwords
see: utf8_strtoupper
author: Harry Fuecks
return: string with first char of the word in uppercase

utf8_deaccent($string,$case=0)   X-Ref
Replace accented UTF-8 characters by unaccented ASCII-7 equivalents

Use the optional parameter to just deaccent lower ($case = -1) or upper ($case = 1)
letters. Default is to deaccent both cases ($case = 0)

param: string $string
param: int $case
author: Andreas Gohr <andi@splitbrain.org>
return: string
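
Example (a minimal sketch, not the DokuWiki implementation): the $case switch
can be illustrated with two small lookup tables applied through strtr(). The
helper name sketch_deaccent() is hypothetical, and the tables are an
illustrative subset whose replacements may differ from the real mappings.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_deaccent(string $string, int $case = 0): string
{
    // Illustrative subset: a grave, e acute, n tilde.
    $lower = ["\xC3\xA0" => 'a', "\xC3\xA9" => 'e', "\xC3\xB1" => 'n'];
    $upper = ["\xC3\x80" => 'A', "\xC3\x89" => 'E', "\xC3\x91" => 'N'];

    if ($case <= 0) {
        $string = strtr($string, $lower); // -1 or 0: lower-case letters
    }
    if ($case >= 0) {
        $string = strtr($string, $upper); // 1 or 0: upper-case letters
    }
    return $string;
}
```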

utf8_romanize($string)   X-Ref
Romanize a non-latin string

param: string $string
author: Andreas Gohr <andi@splitbrain.org>
return: string

utf8_stripspecials($string,$repl='',$additional='')   X-Ref
Removes special characters (nonalphanumeric) from a UTF-8 string

This function adds the control characters 0x00 to 0x19 to the array of
stripped chars (they are not included in $UTF8_SPECIAL_CHARS)

param: string $string     The UTF8 string to strip of special chars
param: string $repl       Replace special with this string
param: string $additional Additional chars to strip (used in regexp char class)
author: Andreas Gohr <andi@splitbrain.org>
return: string

utf8_strpos($haystack, $needle, $offset=0)   X-Ref
This is a Unicode-aware replacement for strpos

param: string  $haystack
param: string  $needle
param: integer $offset
see: strpos()
author: Leo Feyer <leo@typolight.org>
return: integer
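
Example (a minimal sketch, not the DokuWiki implementation): a character
position can be obtained by running the byte-based strpos() and then counting
the characters in front of the hit. The helper names are hypothetical, and
returning false when the needle is not found is an assumption modelled on
strpos().

```php
<?php
// Hypothetical helpers for illustration only.

// Character count of valid UTF-8: one non-continuation byte per character.
function sketch_char_count(string $s): int
{
    return (int) preg_match_all('/[^\x80-\xBF]/', $s);
}

function sketch_strpos(string $haystack, string $needle, int $offset = 0)
{
    $chars = preg_split('//u', $haystack, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // invalid UTF-8
    }
    // Translate the character offset into a byte offset.
    $byteOffset = strlen(implode('', array_slice($chars, 0, $offset)));
    $pos = strpos($haystack, $needle, $byteOffset);
    if ($pos === false) {
        return false;
    }
    // Translate the byte position of the hit back into characters.
    return sketch_char_count(substr($haystack, 0, $pos));
}
```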

utf8_tohtml($str, $all = false)   X-Ref
Encodes UTF-8 characters to HTML entities

param: string $str
param: bool $all Encode non-utf8 char to HTML as well
link: http://php.net/manual/en/function.utf8-decode.php
author: Tom N Harris <tnharris@whoopdedo.org>
author: <vpribish at shopping dot com>
return: string
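
Example (a minimal sketch, not the DokuWiki implementation): encoding to
entities amounts to replacing each non-ASCII character with a numeric
reference to its code point. The helper names are hypothetical, the decimal
form of the entity is an assumption, and the $all flag is not modelled.

```php
<?php
// Hypothetical helpers for illustration only.

// Code point of a single, valid UTF-8 character.
function sketch_char_to_codepoint(string $ch): int
{
    $b = array_values(unpack('C*', $ch));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

function sketch_tohtml(string $str): string
{
    // With /u the class matches whole characters above U+007F.
    // Invalid UTF-8 makes preg_replace_callback() return null, cast to ''.
    return (string) preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . sketch_char_to_codepoint($m[0]) . ';';
        },
        $str
    );
}
```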

utf8_unhtml($str, $entities=null)   X-Ref
Decodes HTML entities to UTF-8 characters

Converts any &#..; entity to a codepoint.
By default only numeric entities are decoded.
Pass HTML_ENTITIES as the entities flag and named entities, including
&amp; &lt; etc., are handled as well. This avoids the problem that would
occur if you had to decode "&amp;#38;&#38;amp;#38;"

unhtmlspecialchars(utf8_unhtml($s)) -> "&#38;&#38;"
utf8_unhtml(unhtmlspecialchars($s)) -> "&&amp#38;"
what it should be                   -> "&#38;&amp#38;"

param: string  $str      UTF-8 encoded string
param: boolean $entities Flag controlling decoding of named entities.
author: Tom N Harris <tnharris@whoopdedo.org>
return: string  UTF-8 encoded string with numeric (and named) entities replaced.
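
Example (a minimal sketch, not the DokuWiki implementation): the default,
numeric-only decoding can be illustrated with a single pass of
preg_replace_callback(), so that text produced by one replacement is never
scanned again. The helper names are hypothetical, and named entities are not
handled here.

```php
<?php
// Hypothetical helpers for illustration only.

// UTF-8 bytes for one code point, or false if it is not encodable.
function sketch_cp_to_utf8(int $cp)
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return false;
    }
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

function sketch_unhtml_numeric(string $str): string
{
    // Group 1: hexadecimal digits, group 2: decimal digits.
    $pattern = '/&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));/i';
    return (string) preg_replace_callback(
        $pattern,
        function (array $m): string {
            $cp = ($m[1] ?? '') !== '' ? (int) hexdec($m[1]) : (int) $m[2];
            $utf8 = sketch_cp_to_utf8($cp);
            // Leave entities that do not map to a valid code point untouched.
            return $utf8 === false ? $m[0] : $utf8;
        },
        $str
    );
}
```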

utf8_decode_numeric($ent)   X-Ref
Decodes numeric HTML entities to their correct UTF-8 characters

param: string $ent A numeric entity
return: string|false