UTF-8 Round Trip for Perl, MySQL and node.js

Doing things the un*x way means keeping a supply of “go-to” tools for the various tasks that spring up during development. For me, that’s a lot of bash, python and Perl on the dev machine and recently node.js on the server.

While scraping multi-lingual (Unicode) data for a project, I had to make sure I kept the correct utf-8 encoding all along the processing pipeline. One screw-up and it’s garbage out. Here are some tips:

Acquire Content in Perl

Use the Perl modules pQuery and Redis to fetch and store the content. Everything will work in utf-8, no worries.
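The fetch-and-store step looks roughly like this — a sketch only, where the URL, selector and key scheme are placeholders for whatever you’re scraping:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use pQuery;
use Redis;

# Sketch: fetch a page, pull out some text, stash it in Redis.
# pQuery decodes the page for you, so these arrive as proper
# utf-8 Perl strings.
my $redis = Redis->new;    # localhost:6379 by default

pQuery('http://example.com/articles')->find('h2.title')->each(sub {
    my $i     = shift;                 # each() passes the index first
    my $title = pQuery($_)->text;      # $_ is the current element
    $redis->set( "title:$i", $title );
});
```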

Normalize the Data as CSV

Use the Perl module Text::CSV::Unicode; your CSV file will be encoded in utf-8.
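A minimal writer sketch — the file name and fields are placeholders; the part that actually keeps the bytes right is opening the handle with an explicit :encoding(utf8) layer:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV::Unicode;

my $csv = Text::CSV::Unicode->new({ binary => 1 });

# The :encoding(utf8) layer guarantees utf-8 on disk.
open my $fh, '>:encoding(utf8)', 'data.csv' or die $!;
for my $row ( ['Tōkyō', 'Japan'], ['Zürich', 'Switzerland'] ) {
    $csv->combine(@$row) or die $csv->error_diag;
    print {$fh} $csv->string, "\n";
}
close $fh;
```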

Read CSV into MySQL Database

Using Text::CSV::Unicode again, insert the data into a MySQL database. But here’s the kicker: on your database handle you need

$dbh->{'mysql_enable_utf8'} = 1;
$dbh->do(qq{SET NAMES 'utf8';});

… if you don’t, your utf-8 strings will be re-encoded into the local character set. Not good.
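Putting those two lines in context, a minimal loader might look like this — the database name, credentials, table and columns are all placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Text::CSV::Unicode;

my $dbh = DBI->connect( 'DBI:mysql:database=mydb', 'user', 'password',
    { RaiseError => 1 } );

# The two lines that keep your utf-8 intact:
$dbh->{'mysql_enable_utf8'} = 1;
$dbh->do(qq{SET NAMES 'utf8';});

my $csv = Text::CSV::Unicode->new({ binary => 1 });
open my $fh, '<:encoding(utf8)', 'data.csv' or die $!;

my $sth = $dbh->prepare('INSERT INTO items (title, body) VALUES (?, ?)');
while ( my $row = $csv->getline($fh) ) {
    $sth->execute(@$row);
}
close $fh;
$dbh->disconnect;
```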

Also create a file at ~/.my.cnf with this content, so the mysql command-line client talks utf-8 as well:

[mysql]
default-character-set=utf8

Serving Up The Data

I’m using Express with a node.js server. I found that the documented default behaviour of responding with utf-8 wasn’t working, so I needed to set it manually:

res.charset = 'utf-8';
res.send(output);
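You can sanity-check that behaviour without spinning up a server. This sketch wraps the fix in a handler and exercises it against a stub response object — the stub is not real Express, just enough to show where the charset ends up:

```javascript
// The fix from above, as a small handler.
function sendUtf8(res, output) {
  res.charset = 'utf-8';   // Express folds this into the Content-Type header
  res.send(output);
}

// Stub response: mimics Express merging res.charset into Content-Type.
const headers = {};
const res = {
  charset: null,
  send(body) {
    headers['Content-Type'] = 'text/html; charset=' + this.charset;
    this.body = body;
  },
};

sendUtf8(res, '<p>héllo wörld</p>');
console.log(headers['Content-Type']); // text/html; charset=utf-8
```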

There you go… I hope this helps somebody! It gave me quite a headache for a day… =D
