Doing things the un*x way means keeping a supply of “go-to” tools for the various tasks that spring up during development. For me, that’s a lot of bash, Python and Perl on the dev machine and, more recently, node.js on the server.
While scraping multilingual (Unicode) data for a project, I had to make sure I kept the correct utf-8 encoding all along the processing pipeline. One screw-up and it’s garbage out. Here are some tips:
Acquire Content in Perl
Use the Perl modules pQuery and Redis to fetch and store the content. Everything will work in utf-8, no worries.
Normalize the Data as CSV
Use the Perl module Text::CSV::Unicode; your CSV file will be encoded in utf-8.
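If you want a quick sanity check between pipeline stages, here is a minimal node.js sketch that tests whether a chunk of bytes is valid utf-8. The helper name isValidUtf8 is my own invention for illustration, not part of any of the modules above.

```javascript
// Hypothetical helper: returns true when the bytes decode cleanly as utf-8.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes decode() throw instead of silently substituting U+FFFD
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

const good = Buffer.from('caf\u00e9', 'utf8');      // é stored as the two bytes c3 a9
const bad = Buffer.from([0x63, 0x61, 0x66, 0xe9]);  // é stored as a lone Latin-1 byte

console.log(isValidUtf8(good)); // true
console.log(isValidUtf8(bad));  // false
```

To check a real file you could pass it the Buffer returned by fs.readFileSync. Keep in mind this only catches bytes that are not utf-8 at all; text that was encoded twice is still technically valid utf-8 and will pass.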
Read CSV into MySQL Database
Using Text::CSV::Unicode again, insert the data into a MySQL database. But here’s the kicker, use:
$dbh->{'mysql_enable_utf8'} = 1;
$dbh->do(qq{SET NAMES 'utf8';});
… if you don’t, your utf-8 strings will be re-encoded into the local character set. Not good.
Also create a file at ~/.my.cnf with this content:
[mysql]
default-character-set=utf8
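To see what that re-encoding does to your data, here is a small node.js sketch. It assumes the local character set is Latin-1, which is only a guess at a typical setup; the variable names are made up for illustration.

```javascript
// Assumption for illustration: the "local character set" is Latin-1.
const original = 'caf\u00e9'; // "café"

// Correct utf-8 bytes for the string (é becomes the two bytes c3 a9).
const utf8Bytes = Buffer.from(original, 'utf8');

// A layer that wrongly assumes Latin-1 reads every byte as its own character...
const misread = utf8Bytes.toString('latin1');

// ...and then converts that text to utf-8 again, double-encoding it.
const doubleEncoded = Buffer.from(misread, 'utf8');

console.log(utf8Bytes.toString('hex'));     // 636166c3a9
console.log(misread);                       // cafÃ©
console.log(doubleEncoded.toString('hex')); // 636166c383c2a9
```

The five original bytes turn into seven, and every later stage sees “cafÃ©” instead of “café”. That is the garbage the two settings above are there to prevent.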
Serving Up The Data
I’m using Express with a node.js server. I found that the documented default behaviour of responding with utf-8 wasn’t working, so I needed to set it manually:
res.charset = 'utf-8';
res.send(output);
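If you would rather not depend on how a particular Express version handles the charset, you can spell it out in the Content-Type header yourself. This is a rough sketch against the plain node.js response object, not Express's own API; withUtf8Charset and sendUtf8 are hypothetical helper names.

```javascript
// Hypothetical helper: build a Content-Type value that names the charset explicitly.
function withUtf8Charset(mimeType) {
  return `${mimeType}; charset=utf-8`;
}

// Hypothetical helper: encode the body as utf-8 bytes, then declare the
// charset and the byte length (not the character count) in the headers.
function sendUtf8(res, mimeType, text) {
  const body = Buffer.from(text, 'utf8');
  res.writeHead(200, {
    'Content-Type': withUtf8Charset(mimeType),
    'Content-Length': body.length,
  });
  res.end(body);
}
```

Inside a handler created with http.createServer you would call something like sendUtf8(res, 'text/html', output). Sending a Buffer rather than a string keeps Content-Length in step with the actual bytes on the wire.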
There you go… I hope this helps somebody! It gave me quite a headache for a day… =D