Fighting Microsoft Word autoformat

We've all been there: the total agony when users paste in text from Word, or when we have to import third-party data that was created on some Windows platform. In these situations we tend to get stuck with characters that just cannot seem to be rendered properly. These are the typographic display characters from the Windows-1252 (CP1252) code page.

Example: you type three dots in Word, and they are replaced by a single character displaying, well, three dots. As an HTML entity this would be &hellip;. Or you type a quotation mark and it is replaced by a curly quotation mark (&ldquo; and &rdquo;).

When dealing with data, these characters would best be replaced by their semantic counterparts. So: three dots would just be replaced by... three dots. A posh curly quotation mark would be replaced by a simple straight quotation mark. And so on, and so forth.

Here's a little function that does just that. I borrowed it from a friend, who borrowed it from a friend. You can find it on GitHub, too: https://gist.github.com/4419014

Happy 2013!

<?php
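/**
 * Replace the CP1252-only bytes 0x80-0x9F with plain look-alikes.
 * Keys need the "\x" escape inside double quotes to denote a single byte.
 */
function transcribe_cp1252_to_latin1 ($cp1252)
{
    return strtr($cp1252, array(
        "\x80" => "e",    // euro sign
        "\x81" => " ",    // unused in CP1252
        "\x82" => "'",    // single low-9 quote
        "\x83" => 'f',    // f with hook
        "\x84" => '"',    // double low-9 quote
        "\x85" => "...",  // horizontal ellipsis
        "\x86" => "+",    // dagger
        "\x87" => "#",    // double dagger
        "\x88" => "^",    // circumflex accent
        "\x89" => "0/00", // per mille sign
        "\x8A" => "S",    // S with caron
        "\x8B" => "<",    // single left angle quote
        "\x8C" => "OE",   // OE ligature
        "\x8D" => " ",    // unused in CP1252
        "\x8E" => "Z",    // Z with caron
        "\x8F" => " ",    // unused in CP1252
        "\x90" => " ",    // unused in CP1252
        "\x91" => "`",    // left single curly quote
        "\x92" => "'",    // right single curly quote
        "\x93" => '"',    // left double curly quote
        "\x94" => '"',    // right double curly quote
        "\x95" => "*",    // bullet
        "\x96" => "-",    // en dash
        "\x97" => "--",   // em dash
        "\x98" => "~",    // small tilde
        "\x99" => "(TM)", // trade mark sign
        "\x9A" => "s",    // s with caron
        "\x9B" => ">",    // single right angle quote
        "\x9C" => "oe",   // oe ligature
        "\x9D" => " ",    // unused in CP1252
        "\x9E" => "z",    // z with caron
        "\x9F" => "Y",    // Y with diaeresis
        "…" => "..."      // a UTF-8 ellipsis, assuming this file is saved as UTF-8
    ));
}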
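
To see the idea in action, here is a small sketch you can run on its own. It uses a cut-down map under a made-up name, `demo_transcribe`, purely for illustration (the name and the sample string are my assumptions, not part of the gist). For real data you would call the full function instead.

```php
<?php
// Cut-down stand-in for the full map (hypothetical name, illustration only).
function demo_transcribe(string $cp1252): string
{
    // strtr() with an array replaces each key once and never rescans its own output.
    return strtr($cp1252, array(
        "\x85" => "...",  // horizontal ellipsis
        "\x93" => '"',    // left double curly quote
        "\x94" => '"',    // right double curly quote
        "\x96" => "-",    // en dash
    ));
}

// Raw CP1252 bytes, roughly what a paste from Word might contain.
$pasted = "\x93Wait\x85\x94 pages 3\x965";

echo demo_transcribe($pasted), "\n";
// prints: "Wait..." pages 3-5
```

Note that the keys are written with double quotes and a backslash: in PHP, "\x93" is one byte, while 'x93' or "x93" would be the three characters x, 9 and 3.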