Quick Reference: Most Common Encoding Issues

Top 20 Patterns by Language

German 🇩🇪

Corrupted Fixed Example Names
ü ü Müller, Günther
ö ö Schröder, Böhm
ä ä Bäcker, Schäfer
ß ß Straße, Groß
Ä Ä Ärzte, Äpfel
Ö Ö Österreich
Ü Ü Über, Tür

French 🇫🇷

Corrupted Fixed Example Names
é é René, Café
è è Père, Système
ê ê Tête, Forêt
ç ç François, Garçon
à à À, Voilà
ô ô Côte, Hôtel
ë ë Noël, Citroën

Spanish 🇪🇸

Corrupted Fixed Example Names
ñ ñ Señor, España
á á García, Martínez
é é José, Pérez
í í María, Díaz
ó ó López, Gómez
ú ú Raúl, Perú
¿ ¿ ¿Cómo estás?
¡ ¡ ¡Hola!

Polish 🇵🇱

Corrupted Fixed Example Names
Å‚ ł Wałęsa, Michał
Ä… ą Dąbrowski
Ä™ ę Będziński
ó ó Wróbel, Kraków
Ä‡ ć Maćkowiak
Å„ ń Gdańsk
Å› ś Jaśko
Åº ź Woźniak
Å¼ ż Różycki

Swedish/Norwegian 🇸🇪 🇳🇴

Corrupted Fixed Example Names
Ã¥ å Håkan, Luleå
Ã¤ ä Mäkinen, Täby
Ã¶ ö Lindström, Malmö
ø ø København, Strømstad
Ø Ø Østergaard
Ã… Å Åsa, Ångström
Æ Æ Ærø

Czech/Slovak 🇨🇿 🇸🇰

Corrupted Fixed Example Names
Ä + unprintable character (byte 0x8D) č Kováč, Horáček
Å¡ š Jakubíšek
Å™ ř Jiří, Příbram
Å¾ ž Žižkov
ý ý Nový
á á Bratislava
é é René

Common Punctuation Issues

Quotes

Corrupted Fixed Usage
â€œ " Left double quote
â€ " Right double quote (the third corrupted character, byte 0x9D, is usually invisible)
â€˜ ' Left single quote
â€™ ' Right single quote/apostrophe

Dashes

Corrupted Fixed Usage
â€“ – En dash (ranges)
â€” — Em dash (breaks)
â€• ― Horizontal bar

Other Symbols

Corrupted Fixed Usage
â€¦ … Ellipsis
â€¢ • Bullet point
â‚¬ € Euro symbol
© © Copyright
® ® Registered
â„¢ ™ Trademark
° ° Degree

HTML Entities

Corrupted Fixed Usage
&#39; ' Apostrophe
&quot; " Quote
&amp; & Ampersand
&nbsp; (space) Non-breaking space
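
HTML entities are a separate problem from byte-level corruption, and Python's standard library already decodes them. The following is a minimal sketch, assuming that a non-breaking space should become an ordinary space as the table suggests; `decode_entities` is a hypothetical helper name.

```python
import html


def decode_entities(text: str) -> str:
    """Decode HTML entities, then turn non-breaking spaces into plain spaces."""
    return html.unescape(text).replace("\u00a0", " ")
```

For example, `decode_entities("Smith &amp; Sons")` gives `Smith & Sons`.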

Recognition Patterns

How to Spot Encoding Issues:

  1. Ã followed by special characters → Usually accented letters
  2. †followed by anything → Usually punctuation or quotes
  3. Å or Ä followed by special chars → Usually Eastern European
  4. Multiple special chars where one should be → Encoding problem

Examples:

  • é = 2 characters that should be 1 → é
  • ’ = 3 characters that should be 1 → '
  • é = 4 characters that should be 1 → é (double-encoded)

Quick Test

Is This an Encoding Issue?

✅ YES if you see:

  • Multiple weird characters where an accent should be
  • � (replacement character)
  • Patterns like Ã+special char
  • â€+anything
  • Names that look "garbled" but you can guess what they should be

❌ NO if you see:

  • Random unrelated characters
  • Numbers in place of letters
  • Complete gibberish with no pattern
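
The YES checklist above can be approximated with a regular expression. The sketch below flags Â, Ã, Ä or Å followed by a character that Windows-1252 assigns to bytes 0x80 to 0xBF, the â€ family of punctuation prefixes, and the replacement character. It is a heuristic with hypothetical names (`looks_garbled`), not the script's own detection logic, and it can produce false positives on unusual but legitimate text.

```python
import re

# Characters that cp1252 assigns to bytes 0x80-0xBF, i.e. what the second
# byte of a two-byte UTF-8 sequence turns into when it is misread.
_TAIL = "".join(
    bytes([b]).decode("cp1252", errors="ignore") for b in range(0x80, 0xC0)
)

_SUSPICIOUS = re.compile(
    "[\u00c2\u00c3\u00c4\u00c5][" + re.escape(_TAIL) + "]"  # Â/Ã/Ä/Å + tail char
    "|\u00e2[\u20ac\u201a\u201e]"                           # â€, â‚, â„ prefixes
    "|\ufffd"                                               # replacement character
)


def looks_garbled(text: str) -> bool:
    """Return True if the text contains a typical encoding-damage pattern."""
    return _SUSPICIOUS.search(text) is not None
```

Correctly encoded names such as `Ångström` or `Ärzte` are not flagged, because the letter after Å or Ä is an ordinary ASCII character.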

Language Coverage

The script handles these language families:

Western European:

  • German, French, Spanish, Portuguese, Italian, Dutch

Nordic:

  • Swedish, Norwegian, Danish, Icelandic, Finnish

Eastern European:

  • Polish, Czech, Slovak, Romanian, Hungarian, Croatian

Baltic:

  • Latvian, Lithuanian, Estonian

Other:

  • Turkish, Albanian

When to Run

Run the script after:

  • ✅ Importing CSV files
  • ✅ Exporting from Salesforce/HubSpot/Dynamics
  • ✅ Receiving data from international offices
  • ✅ Migrating between systems
  • ✅ Copy/paste from emails or web

Pro Tips

  1. Always keep the log file - it's your audit trail
  2. Run on a copy first if you're nervous
  3. Check the summary - if it says 0 fixes, your data was already clean
  4. Look for patterns - if many names from one country are broken, they'll all fix the same way
  5. Share the script - your international colleagues will love you
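
Tips 1 to 3 describe a workflow: work on a copy, keep a log, and read the fix count. The sketch below shows one way that workflow could look for a CSV file. It is not the script itself; `repair_csv`, `repair_cell`, and the log format are assumptions made for illustration, and the input is assumed to be readable as UTF-8.

```python
import csv
from pathlib import Path


def repair_cell(value: str) -> str:
    """Try the cp1252 -> UTF-8 round trip; leave the cell alone if it fails."""
    try:
        return value.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return value


def repair_csv(src: Path, dst: Path, log: Path) -> int:
    """Write a repaired copy of src to dst, log every change, return the count."""
    fixes = 0
    with src.open(newline="", encoding="utf-8") as fin, \
            dst.open("w", newline="", encoding="utf-8") as fout, \
            log.open("w", encoding="utf-8") as flog:
        writer = csv.writer(fout)
        for row_no, row in enumerate(csv.reader(fin), start=1):
            fixed_row = [repair_cell(cell) for cell in row]
            for col_no, (old, new) in enumerate(zip(row, fixed_row), start=1):
                if old != new:
                    fixes += 1
                    flog.write(f"row {row_no}, col {col_no}: {old!r} -> {new!r}\n")
            writer.writerow(fixed_row)
    return fixes
```

The original file is never modified, and a return value of 0 corresponds to the "0 fixes" summary mentioned in tip 3.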

Need help? Check the full documentation or send me new patterns you encounter!