| author | Dan Allen <dan.j.allen@gmail.com> | 2022-07-06 23:54:40 -0600 |
|---|---|---|
| committer | Dan Allen <dan.j.allen@gmail.com> | 2022-07-07 02:46:19 -0600 |
| commit | cd9b07c75323be97bb8364554f9a4ee095baec33 (patch) | |
| tree | 87d09a709d16f7a8a73baca0eb0fa4080e078438 /docs/modules | |
| parent | 0b4da859def75d71eb46b0f8b2f197383ccd2cc7 (diff) | |
document how to extract text from HTML using w3m or elinks
Diffstat (limited to 'docs/modules')
| -rw-r--r-- | docs/modules/migrate/pages/asciidoc-py.adoc | 59 |
1 files changed, 58 insertions, 1 deletions
diff --git a/docs/modules/migrate/pages/asciidoc-py.adoc b/docs/modules/migrate/pages/asciidoc-py.adoc
index e8e69505..87898cd8 100644
--- a/docs/modules/migrate/pages/asciidoc-py.adoc
+++ b/docs/modules/migrate/pages/asciidoc-py.adoc
@@ -344,7 +344,7 @@ Instead, Asciidoctor provides an xref:extensions:register.adoc[extension API] th
 
 === Localization
 
-AsciiDoc.py had built-in [.path]_.conf_ files that translated built-in labels.
+AsciiDoc.py has built-in [.path]_.conf_ files that translated built-in labels.
 In Asciidoctor, you must define the translations for these labels explicitly.
 See xref:ROOT:localization-support.adoc[] for details.
 
@@ -378,6 +378,63 @@ a|
 AsciiDoc.py custom extensions are Python commands, so they don't work with Asciidoctor.
 Depending on the Asciidoctor processor you choose, you can re-write your xref:extensions:index.adoc[extensions in Ruby, Java, or JavaScript].
 
+== Extract text
+
+AsciiDoc.py provides a frontend to the DocBook toolchain named a2x.py.
+This script can produce various output formats from an AsciiDoc document.
+One of those formats is text (aka "`plain text`").
+In order to extract the text, the DocBook toolchain first converts the AsciiDoc to HTML, then extracts the text from that document using lynx.
+
+There are numerous approaches to extracting text from AsciiDoc in Asciidoctor.
+One way is to write an Asciidoctor converter mapped to the `text` backend that xref:convert:custom.adoc#convert-to-text-only[converts AsciiDoc to text only].
+Another approach is to convert the AsciiDoc to HTML, then extract text from the HTML output document using a text-based browser, just like the DocBook toolchain does.
+
+Before continuing, it's worth noting that there's no universal definition of "`plain text`".
+It all depends on what information you are trying to extract.
+That's why you won't find a text backend provided by Asciidoctor core.
+Let's consider what tools are available.
+
+As an alternative to lynx, the text-based browser w3m does a nice job of extracting text from an HTML document.
+For example:
+
+ $ w3m -dump -cols 120 doc.html > doc.txt
+
+You can set the number of columns so lines aren't hard wrapped at a fixed line width.
+The upper bounds for this value is MAX_INT (2147483647).
+You can retrieve that value dynamically using Perl.
+
+ $ w3m -dump -cols $(perl -MPOSIX -e 'print INT_MAX') doc.html > doc.txt
+
+It doesn't seem possible to configure w3m to preserve markup that indicates headings.
+However, the text-based browser `elinks` offers this behavior by default through indentation.
+
+ $ elinks -dump 1 -no-references -no-numbering -dump-width 50000 doc.html > doc.txt
+
+Yet another option is the https://www.npmjs.com/package/html-to-text[html-to-text^] module for Node.js, which parses HTML and returns beautiful text.
+
+If you want to extract the text during AsciiDoc conversion, you can do so using an Asciidoctor postprocessor extension.
+
+[,ruby]
+----
+require 'open3'
+
+Asciidoctor::Extensions.register do
+  postprocessor do
+    process do |doc, output|
+      outfile = (doc.attr 'outfile').sub %r/\.\S+$/, '.txt'
+      Open3.popen2 'elinks -dump 1 -no-references -no-numbering' do |is, os|
+        is.print output
+        is.close
+        File.write outfile, os.read
+      end
+      output
+    end
+  end
+end
+----
+
+This extension will write a file with a .txt extension adjacent to the document written by the converter.
+
 == Doctest
 
 AsciiDoc.py `--doctest` ran its unit tests.
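The postprocessor added in this commit derives the text file name from the `outfile` attribute and pipes the converted HTML through `elinks`. As a minimal sketch, and not part of the commit, the same steps could be pulled into standalone Ruby helpers. The names `text_outfile`, `dump_command`, and `html_to_text` are hypothetical, and the w3m `-T text/html` arguments for reading HTML from standard input are an assumption; the elinks flags are the ones that appear in the diff. The sketch also swaps the commit's regex for `File.extname`, assuming a dot in a directory name should be left alone.

```ruby
require 'open3'

# Hypothetical helper: replace the file extension of the converter's output
# path with .txt. Uses File.extname instead of the regex in the commit
# (an assumption, so dots in directory names are not touched).
def text_outfile(outfile)
  outfile.delete_suffix(File.extname(outfile)) + '.txt'
end

# Hypothetical helper: build the argument list for a text-based browser
# that reads HTML from standard input and writes text to standard output.
def dump_command(tool, width: 120)
  case tool
  when :w3m
    # -T text/html tells w3m how to interpret standard input (assumed usage)
    ['w3m', '-dump', '-T', 'text/html', '-cols', width.to_s]
  when :elinks
    # flags taken from the commands shown in the diff
    ['elinks', '-dump', '1', '-no-references', '-no-numbering', '-dump-width', width.to_s]
  else
    raise ArgumentError, "unsupported tool: #{tool}"
  end
end

# Hypothetical helper: pipe an HTML string through the chosen tool and
# return the extracted text. Requires the tool to be installed.
def html_to_text(html, tool: :elinks, width: 120)
  text, status = Open3.capture2(*dump_command(tool, width: width), stdin_data: html)
  raise "#{tool} exited with status #{status.exitstatus}" unless status.success?
  text
end
```

Under these assumptions, a call such as `File.write text_outfile('doc.html'), html_to_text(File.read('doc.html'), tool: :w3m)` would write `doc.txt` next to `doc.html`, provided the chosen browser is on the `PATH`.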
