Saturday, June 25, 2011

Publishing API and a new service could make translating Connexions modules easy

Specialized tools for translating and publishing OER are one possible use of an API for publishing to open education repositories. Repositories may have general-purpose editors for creating content, but they aren’t likely to have great facilities for translating content.

Carl Scheffler and I spent some time in the Geneva airport investigating whether Google Translator Toolkit could be the translation editor of choice for Connexions modules. Translator Toolkit has to be convinced and helped along, because it was designed for HTML (web pages) rather than for the structured XML format of Connexions modules. It just might be possible, however, and advice and comments would be most welcome.

The workflow would be just a bit more complicated than the normal route for translation and would look something like this:
  1. Find a module that you want to translate on Connexions and record its ID. Let's say the module is Electric Circuits - Grade 10, http://cnx.org/content/m32830/latest/. Then the ID is “m32830”.
  2. Open Google Translator Toolkit and select a URL something like this: http://www.coolhelperservice.org/cnxtranslate/m32830. This would fetch the module in a format that Google Translate can use well.
  3. Translate it using the Translator Toolkit.
  4. Save the file to your laptop.
  5. Go to something like http://www.coolhelperservice.org/cnxpublish and upload the saved file. Fill out a bit of information and then push a button to sign the license and publish it to Connexions.
Although it would be more straightforward to enter a cnx.org web address into the Translator Toolkit and then publish straight from the toolkit, we don't have the technical hooks into the Translator Toolkit to be able to do that. So instead, we would create this new “coolhelperservice” that would know how to format Connexions content for Translator Toolkit and how to take translations and reformat them and publish them to Connexions.

Does that workflow seem reasonable? Is there a better workflow that you can think of and suggest?

Here are some technical details for those who are interested. Those who aren't can safely stop here and still give feedback on the process from a translator's perspective.

Google Translator Toolkit doesn't work with XML formats. But Connexions does produce an HTML format for modules that can be converted back into Connexions XML without any loss. So the “coolhelperservice” needs to retrieve the module, format it in HTML for the Translator Toolkit, and then do the opposite transform (HTML → CNXML) on the way back into Connexions.

To get the HTML for the body of a module from Connexions, you append “/body” to the module URL. The module metadata (title and such) is available by appending “/metadata” to the module URL. So with the module ID, the “coolhelperservice” can put together a nice package of HTML for the translator to use, and still be able to reconstruct the XML to publish the translated version.
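As a rough illustration of that step, here is a minimal Python sketch of how a helper service might build the two URLs from a module ID. The function names and the "latest" default are assumptions made for this example, and the fetch helper is only a plain HTTP GET; none of this is existing Connexions or “coolhelperservice” code.

```python
from urllib.request import urlopen

# Base address taken from the example module URL in the post.
CNX_BASE = "http://cnx.org/content"


def module_urls(module_id, version="latest"):
    """Build the module, body and metadata URLs for a module ID.

    The "/body" and "/metadata" suffixes are the ones described above;
    the function itself is a hypothetical helper.
    """
    base = f"{CNX_BASE}/{module_id}/{version}"
    return {
        "module": base + "/",
        "body": base + "/body",
        "metadata": base + "/metadata",
    }


def fetch(url):
    """Fetch a URL and return its content as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    for name, url in module_urls("m32830").items():
        print(name, url)
```

A service along these lines would call fetch() on the "body" and "metadata" URLs and wrap the results in a single HTML page for the translator.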

One tricky bit is that Google Translator Toolkit makes a mess of the mathematics that comes in from Connexions, so the math has to be protected somehow. Carl and I experimented with a few ideas for how to do that, and the toolkit didn't cooperate with most of them, but Carl came up with the idea of putting all the math into an HTML id. Amazingly, that worked. It comes out all escaped, but that is good enough. (The toolkit won't keep around a random attribute, so “id” was the way to go.) Carl is pretty sure that there is a web service that will take a snippet of MathML and give back an image. He is going to investigate that further. So in principle, you can stuff the math into an image id (so it doesn't get lost) and point the image at a URL on this service that will render the math. The translator won't be able to translate words that were inside the math, but Carl had previously looked around and found that such words aren't very common, so this might just be good enough.
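To make the id trick concrete, here is a minimal Python sketch of the round trip: each MathML element is swapped for an image whose id carries the escaped MathML, and the math is restored after translation. The rendering service URL is a placeholder (no such service has been identified yet), the function names are made up for this example, and the regular expressions assume well-formed, non-nested math elements and that the toolkit leaves the id attribute first in the tag.

```python
import html
import re
from urllib.parse import quote

# Hypothetical rendering service; a placeholder, not a real endpoint.
RENDER_SERVICE = "http://render.example.org/mathml"

# Matches <math>...</math>, with an optional namespace prefix such as m:.
MATH_RE = re.compile(r"<((?:\w+:)?)math\b.*?</\1math>", re.DOTALL)

# Matches only the images created below: their id starts with escaped MathML.
IMG_RE = re.compile(r'<img\s+id="(&lt;(?:\w+:)?math[^"]*)"[^>]*>')


def protect_math(body_html):
    """Replace each math element with an <img> that carries it in its id."""
    def to_img(match):
        mathml = match.group(0)
        escaped = html.escape(mathml, quote=True)
        src = RENDER_SERVICE + "?mathml=" + quote(mathml, safe="")
        return f'<img id="{escaped}" src="{src}"/>'
    return MATH_RE.sub(to_img, body_html)


def restore_math(translated_html):
    """Turn the placeholder images back into the original MathML."""
    return IMG_RE.sub(lambda m: html.unescape(m.group(1)), translated_html)


if __name__ == "__main__":
    sample = "<p>Ohm: <math><mi>V</mi><mo>=</mo><mi>I</mi><mi>R</mi></math></p>"
    protected = protect_math(sample)
    print(protected)
    print(restore_math(protected) == sample)
```

Whether the toolkit preserves very long id values is still an open question, so a real implementation might need a fallback such as a short key that maps to MathML stored on the service.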

At the end, the “coolhelperservice” will use a publishing API (SWORD V2) to publish the translation back to Connexions. Implementing that API is part of my fellowship work so it is coming later this year. There will have to be a bit of license signing back at Connexions, but the “coolhelperservice” can make that smoother also.

I think something like this could work. What do you think? And did we miss some clever idea or service that could be of help? Actually, I am sure we did, since this was a two-hour experiment. So send help, advice, etc. Carl will keep investigating, and maybe we will have some screenshots to clarify all this for a future post.

13 comments:

  1. The tool I was referring to is at http://latex.codecogs.com/ For example, try http://latex.codecogs.com/png.latex?\sqrt{2}x

    Unfortunately (for our purposes) it takes LaTeX equations rather than MathML. LaTeX to MathML converters seem sparsely distributed, but I'll have a look into those.

    Cheers,
    Carl

  2. @Carl -- how about LaTeXML? In any case, we tried to do a comparison about 2 years ago on the arXiv.org corpus -- http://kwarc.info/kohlhase/submit/dml09.pdf

    One of my colleagues is actually working right now on a RESTful interface for LaTeXML.

    Contact me if you want to know more :)

    Catalin David

  3. Oops... sorry for the confusion, but I seem to have mixed up the requirements in my earlier comment. We need a MathML to LaTeX converter (and not the other way around).

    @CDavid: Thanks for the pointers. I had a quick read through your article, but it looks like all the converters covered there go the wrong way around for our purposes. It's strange that none of them are bi-directional.

    Carl

  4. Hey, Carl!

    Actually I haven't heard of anybody that would require a MathML to LaTeX transform...

    I think the problem is that LaTeX -> HTML + MathML -> LaTeX will almost never be the same at the ends due to the complexity of the (La)TeX system -- I really like the example of David Carlisle http://www.ctan.org/pkg/xii -- look for file xii.tex ... That file evaluates to a full poem ("Twelve days of Christmas") when run through a TeX -> PDF compiler (or, for that matter, a TeX -> HTML compiler like LaTeXML). From that, there is absolutely no way to get back to the exact same source unless you actually keep the source embedded in the HTML.

    It might be relatively easy to get a variant similar to the MathML by using a PHP / Python / Perl script for parsing the Presentation MathML into a DOM and, based on some rules, create a LaTeX source. But I think nobody has done that until now.

    Catalin

  5. Connexions already does a MathML to LaTeX transform as part of the PDF generation system, so it ought to be possible to isolate and reuse that part of the xsl.

    You don't need it to be bidirectional, because the purpose of the MathML -> LaTeX transform here is just to get a picture for displaying to translators so that they have the proper mathematical context for translating the words.

    Or perhaps there is a service that makes and serves an image out of MathML. Perhaps MathJax knows of some place?

  6. Ok, how about this: http://xsltml.sourceforge.net/ ? I haven't tried it, but it looks pretty good.

    Otherwise, I found MathParser: http://www.tilman.de/programme/mathparser/anleitung_en.html which also does MathML to LaTeX, but it's written in Java and seems to lack a CLI.

    Catalin

  7. This comment has been removed by the author.

  8. Hi All

    Regarding the MathML -> png - just use the pipeline Roche built for the Connexions mobile front-end (mobile.cnx.org). It's been tested with a ton of modules on Connexions and is being used by the android application as well so we'd know if there were big problems by now.

    I'll ask him to post a comment here, but it's all done in Python using, I think, libsvg, but check.

    On the mobile front-end we're even caching the images using a md5sum of the MathML so repeated equations need only be stored once.

    About storing the MathML in the id - is there no limit to the id length?

    Mark

  9. As Mark pointed out, we already do a lot of MathML to image conversion for the Connexions mobile site. I always point people to the Artificial satellites module, since it has around 30 formulas that need to be converted and shows off the conversion nicely. I have developed a package called upfront.mathmlimage, currently living inside the Rhaptos SVN repository at https://software.cnx.rice.edu/svn/devsets/mobile-frontend/buildout/deliverance/src/upfront.mathmlimage that makes the conversion as simple as calling a method named "convert" with a MathML string, e.g. convert(mathmlstring). We can pretty easily turn this into a web service with an hour or two's hacking :-)

  10. Sorry, I forgot to give more details. Basically we use SVGMath (http://grigoriev.ru/svgmath/) to convert MathML to SVG and then we use ImageMagick's convert command to convert it to a PNG. If we have content MathML, we transform it to presentation MathML using an XSL stylesheet before we pass it to SVGMath. Hope that helps.

  11. I made some notes here:

    http://code.google.com/p/oer-roadmap/wiki/GoogleTranslatorWorkflow

    The Google Translator Toolkit apparently also accepts Wikipedia mark-up and seems to handle it losslessly (which is not the case for HTML).

    Cheers,
    Carl

  12. Really interesting post - thank you!

    It's worth noting that in the long human trajectory towards openness, open translation data is an important component of OER. Translator toolkit could be more open.

    Thanks - George.

  13. Does anyone still need a "MathML to LaTeX" or "MathML to image" (PNG, BMP, EMF, PDF, or even SVG) converter?

    I can create one in C/C++, Windows preferably but Linux is also possible.

    If you're interested, please let me know. Thanks.
