How to fix MYSQL Character Encoding Issues After Migration?

Published by Supun Kavinda on

We recently performed a major migration that was stressful but helped us learn a lot of things in one day. One of the main problems we faced was MYSQL character encoding issues after the migration.

Here’s how it happened. First, we exported our database. In this export file, non-ASCII characters were changed into some weird characters like “Венчур Д. Simply, all non-English comments were shown in this way. So, do we need to ask all non-English sites that use Hyvor Talk to “forget about their previous comments”?

No. We found a way to fix this issue.

Finding the Issue

First, we figured out what has happened.

  • There were no problems with integer columns.
  • Columns with defined strings (Eg: enums) aren’t changed as they are mostly ASCII – You know, developers rarely use non-English characters for data.
  • Columns with user-defined strings were messed up. In our case, these included comments, usernames, website names, and more.

Then, we figured out, our previous database used MYSQL 5 and the default charset was latin1, which isn’t good but worked fine for us. Because it worked fine, we never thought about it again.

Our new database uses MYSQL 8, where the default charset is utf8mb4 (It is what you should use in 2020).

latin1 and utf8mb4 behaves differently in encoding characters. However, all ASCII character seems to be fine in both databases. So, we first queried the database to find out how many comments were there with non-ASCII characters.

You can use this SQL query to do that. (Source) (body is the column we are checking for).

SELECT id, body
FROM comments 
WHERE body <> CONVERT(body USING ASCII)

If we had only 10 or so columns we could change them manually. Unfortunately, there were thousands of comments with non-ASCII characters. Hence, we needed a way to automate the process.

Before that, we need a way to get the correct comment out of that weird encoding. It was simple than we thought. Setting the session charset to LATIN1 before the query gave us the correct results.

SET NAMES 'LATIN1';
SELECT body FROM comments WHERE id = 204; -- returns a nice comments without weird chars

Even you can do this in your application before fetching data, it is always better to completely move to UTF8MB4, which can store characters like 💩 in your database. And, it’s the recommended way in MYSQL8.

Fixing the Issue

First of all, backup your database in case anything goes wrong.

We used a PHP script to fix this issue. You can use whatever language you use to connect to MYSQL.

The logic is simple.

  • First, we fetch the data from MYSQL in LATIN1 encoding and save them in a PHP variable.
  • Then, we update the connection charset to UTF8MB4 and update the record.
<?php
$mysqli = new mysqli(HOST, USERNAME, PWORD, DB);
// set connection charset to latin
$mysqli -> query("SET NAMES 'LATIN1'");
// select non-ascii rows
$rows = $mysqli -> query('SELECT id, body FROM comments WHERE body <> CONVERT(body USING ASCII)');
// set connection charset to UTF8MB4
$mysqli -> query("SET NAMES 'UTF8MB4'");
foreach ($rows as $row) {
     // before updatin the rows, verify your data is correct
     echo $row['id'] . ' ' . $row['body'] . '<br>';
}

In the for loop, we echo the data to verify everything is perfect. Then, update the rows.

foreach ($rows as $row) {
    $stmt = $mysqli -> prepare('UPDATE comments SET body = ? WHERE id = ? LIMIT 1');
    $stmt -> bind_param('si', $row['body'], $row['id']);
    $stmt -> execute();
}

Make sure to use prepared queries! This will update the records. You’ll see the characters correctly when you fetch data from the database next time with UTF8MB4 connection charset.

Caution! Do not run this script more than once for the same column. It will break everything.

Next, you can do the same to all the columns with encoding issues.

Finally: Set up all databases, tables, and columns to UTFMB4

The issue we fixed occurred because we used latin1 character encoding in our database. You have to change it to UTF8MB4 to avoid any errors in the future and store almost any Unicode character in your database.

This is a great resource on how to do that.

You have to change three things. Set character set to utf8mb4 and collation to utf8mb4_unicode_ci of

  • all databases
  • all tables
  • all string columns

You don’t have to be afraid to make these changes. However, always make sure to keep a backup.

Conclusion

In this article, I explained how we fixed the issues we faced when migrating the MYSQL5 database with LATIN1 encoding to the MYSQL8 database with UTF8MB4 encoding. We always know that issues are not always the same in all machines. So, if you encounter a different type of issue in migrating, feel free to leave a comment below.

Need a privacy-focused commenting platform for your website?

Subscribe to our Newsletter

We write a lot of blog posts to help you grow your audience on your website. Subscribe to get the best blog posts right into your inbox.

Categories: Uncategorized