Translate files using Google Translate directly from Java

Google provides an API for its translation service. The API is designed mainly for web pages, but a Java wrapper is also available. The problem is that the API translates only String items (either a single item or an array of items), and the size of the input is limited. The Google Translate page, on the other hand, can convert very large texts through file upload, and it handles HTML pages quite well too.

I’ve used Apache HttpClient to call the Google Translate service directly through its web page rather than through the API. The service converts the uploaded document to an HTML page, which must be extracted from an inner frame. Two things need additional clean-up. The first is the encoding of the HTML document: it is usually not Unicode, while we usually want to save documents in Unicode, so my code changes the declared charset to UTF-8. The second is that Google places both the original and the translated text in the document. The original is hidden and shown via JavaScript on mouse-over, but it still exists in the document source, and a search engine may flag the translated version as duplicate content. I remove the original as well.

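This is the source code of the method that translates a file. Besides Apache HttpClient (with its HttpMime module for the multipart upload) it uses the Jericho HTML Parser (Source, Element, OutputDocument) and StringUtils from Apache Commons Lang. The httpClient and currentPage fields belong to the enclosing class, which is not shown here: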

  public String translate(File file, String sourceLang, String destLang) throws Exception {
    Element form = currentPage.getElementById("text_form");
    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    nvps.add(new BasicNameValuePair("old_sl", sourceLang));
    nvps.add(new BasicNameValuePair("old_tl", destLang));
    // "Tłumacz" is Polish for "Translate": the label of the submit button
    nvps.add(new BasicNameValuePair("old_submit", "Tłumacz"));
    nvps.add(new BasicNameValuePair("sl", sourceLang));
    nvps.add(new BasicNameValuePair("tl", destLang));
    addMissingFields(form, nvps);
    
    String actionUri = "http://translate.googleusercontent.com/translate_f?sl="
      +sourceLang+"&tl="+destLang;
    HttpPost post = new HttpPost(actionUri);
    MultipartEntity multipart = new MultipartEntity();
    for (NameValuePair nvp : nvps) {
      multipart.addPart(nvp.getName(), new StringBody(nvp.getValue(), Charset.forName("UTF-8")));
    }
    multipart.addPart("file", new FileBody(file));
    
    post.setEntity(multipart);
    HttpResponse response = httpClient.execute(post);
    HttpEntity entity = response.getEntity();
    Source page = new Source(entity.getContent());
    // convert charsets to UTF-8
    OutputDocument output = new OutputDocument(page);
    for (Element meta : page.getAllElements("meta")) {
      String httpEquiv = meta.getAttributeValue("http-equiv");
      String content = meta.getAttributeValue("content");
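      if (StringUtils.equalsIgnoreCase("content-type", httpEquiv) && content != null) {
        // the pattern consumes the closing quote, so the replacement must restore it;
        // (?i) also catches lower-case charset names such as iso-8859-2
        String newAttr = meta.toString().replaceAll("(?i)charset=[a-z0-9_\\-]*\"", "charset=UTF-8\"");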
        output.replace(meta, newAttr);
      }
    }
    // now remove all spans with class "google-src-text"
    for (Element googleSrc : page.getAllElementsByClass("google-src-text")) {
      output.replace(googleSrc, "");
    }
    // clean up scripts
    for (Element script : page.getAllElements("script")) {
      output.replace(script, "");
    }
    // remove google's iframe
    for (Element iframe : page.getAllElements("iframe")) {
      if (StringUtils.containsIgnoreCase(iframe.getAttributeValue("src"), ".google."))
        output.replace(iframe, "");
    }
    page = new Source(output.toString());
    for (StartTag main : page.getAllStartTagsByClass("main")) {
      if ("div".equals(main.getName())) {
        int end = main.getEnd();
        return page.subSequence(end, page.length()).toString();
      }
    }
    return page.toString();
  }
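
The charset rewrite can be tried in isolation, without HttpClient or the HTML parser. What follows is only a minimal sketch using java.util.regex. The class name MetaCharset and the method toUtf8 are made up for illustration and are not part of the code above. Unlike the pattern in translate, this one does not consume the closing quote, so the replacement does not have to restore it.

```java
import java.util.regex.Pattern;

final class MetaCharset {
    // Matches "charset=<name>" case-insensitively, e.g. charset=ISO-8859-2
    // or charset=windows-1250. The closing quote is not part of the match.
    private static final Pattern CHARSET =
        Pattern.compile("charset=[a-z0-9_\\-]*", Pattern.CASE_INSENSITIVE);

    private MetaCharset() {
    }

    /** Returns the given meta tag with its declared charset set to UTF-8. */
    static String toUtf8(String metaTag) {
        return CHARSET.matcher(metaTag).replaceAll("charset=UTF-8");
    }
}
```

Only the declaration inside the tag changes. The bytes written to disk still depend on the encoding used when the result is saved.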

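Added 12th May 2011: the addMissingFields helper, which copies every remaining input of the form that was not set explicitly: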

  private void addMissingFields(Element elem, List<NameValuePair> nvps) {
    f1: for (Element input : elem.getAllElements(HTMLElementName.INPUT)) {
      String name = input.getAttributeValue("name");
      String value = input.getAttributeValue("value");
      if (name == null || value == null)
        continue;
      // skip fields that are already in the list
      for (int i=0;i<nvps.size();i++) {
        if (nvps.get(i).getName().equals(name))
          continue f1;
      }
      nvps.add(new BasicNameValuePair(name, value));
    }
  }
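
Since translate returns a String, the encoding of the saved document is decided by whoever writes that String out. Below is a small sketch of writing it as UTF-8 with the standard library only, so that the bytes agree with the rewritten charset declaration. The class Utf8Writer and its methods are hypothetical names, not part of the original code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8Writer {
    private Utf8Writer() {
    }

    /** Writes the text to the stream as UTF-8 bytes. */
    static void write(String html, OutputStream out) {
        try {
            out.write(html.getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Saves the text to a file as UTF-8, replacing any existing content. */
    static void save(String html, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            write(html, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Assuming an instance of the enclosing class is available as translator, a call could look like Utf8Writer.save(translator.translate(file, "pl", "en"), Paths.get("out.html")).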