| Age | Commit message | Author |
|
Normally these occur outside the table element itself, but they
should still be parsed as captions in this case.
Closes #9518.
|
|
The styleId can change depending on the localization.
Partially resolves #9518.
|
|
Header and footer references may be absolute in the reference.docx.
E.g. editing it with dotnet's Open-XML-SDK causes this error:
```
+ pandoc test.md -t docx --reference-doc referenceh.docx -o test.docx
word//word/header1.xml missing in reference docx
```
There was already code in pandoc to handle relative vs absolute paths in
references, so use it.
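The `word//word/header1.xml` path in the error above is what one would get by prepending `word/` to a target that is already absolute. A minimal sketch of the distinction follows. `resolveTarget` is a hypothetical helper written for illustration, not pandoc's actual function, and it assumes relative targets are resolved against `word/`:

```haskell
import Data.List (stripPrefix)

-- | Resolve a relationship target to a path inside the docx archive.
-- Hypothetical helper for illustration; not pandoc's real code.
resolveTarget :: FilePath -> FilePath
resolveTarget target =
  case stripPrefix "/" target of
    -- Absolute target: already rooted at the archive, drop the leading slash.
    Just fromRoot -> fromRoot
    -- Relative target: assumed to be relative to the word/ directory.
    Nothing       -> "word/" ++ target
```

With this sketch, both `/word/header1.xml` and `header1.xml` resolve to `word/header1.xml`.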
Signed-off-by: Edwin Török <edwin@etorok.net>
|
|
The argument can apparently be omitted, and then we just have
a fragment URL. Closes #9246.
|
|
* #9214 text in shape format test document
* #9214 support Text in Shape Format
* #9214 remove irrelevant code
|
|
Add T.P.Readers.Docx.Symbols. This gives us a table to use to
resolve characters included in docx via w:sym element.
Use this table to resolve characters when symbol fonts are specified.
Closes #9220.
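As a rough illustration of the lookup idea, here is a sketch with a tiny table keyed by font name and character code. The names `symbolTable` and `resolveSym` are hypothetical, the three entries are only examples, and the handling of codes in the F000-F0FF range is an assumption about how such codes are commonly stored; the real module may be organized differently:

```haskell
import qualified Data.Map.Strict as M
import Data.Char (toLower)
import Numeric (readHex)

-- | (lower-cased font name, character code) -> Unicode character.
type SymbolTable = M.Map (String, Int) Char

-- | Illustrative entries only; a real table would be much larger.
symbolTable :: SymbolTable
symbolTable = M.fromList
  [ (("symbol", 0x61), '\x03B1')  -- Greek small alpha
  , (("symbol", 0x62), '\x03B2')  -- Greek small beta
  , (("symbol", 0x70), '\x03C0')  -- Greek small pi
  ]

-- | Resolve a font name and a hexadecimal character code (as they might
-- appear on a w:sym element) to a Unicode character, if known.
resolveSym :: String -> String -> Maybe Char
resolveSym font hexCode =
  case readHex hexCode of
    [(n, "")] -> M.lookup (map toLower font, normalize n) symbolTable
    _         -> Nothing
  where
    -- Assumption: codes in F000-F0FF are private-use aliases of 00-FF.
    normalize n
      | n >= 0xF000 && n <= 0xF0FF = n - 0xF000
      | otherwise                  = n
```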
|
|
Closes #9002.
|
|
Previously the backup PNG was exported even if an SVG was
present, but the SVG should be preferred.
Closes #7244.
|
|
LibreOffice tags images slightly differently than Word; this change lets
the parser take that difference into account when looking for an image
description (alt text).
|
|
Closes #8483.
The problem is that oMathPara can either occur at the block-level
(child of w:body) or at the inline level (child of w:p, potentially
with other content). We need to handle both cases.
Previously the code just assumed that if we had a w:p with an oMathPara,
the math would be the sole content.
This patch removes OMathPara as a constructor of BodyPart
and adds it as a constructor of ParPart.
|
|
This will no doubt produce a bunch of warnings and hence CI
failures, which we'll need to work around with explicit imports.
|
|
We were exporting Parser, ParserT as synonyms of Parsec, ParsecT.
There is no good reason for this and it can cause confusion.
Also, when possible, we replace imports of Text.Parsec with
T.P.Parsing. The idea is to make it easier, at some point,
to switch to megaparsec or another parsing engine if we want to.
T.P.Parsing new exports: Stream(..), updatePosString, SourceName,
Parsec, ParsecT [API change].
Removed exports: Parser, ParserT [API change].
|
|
If a document uses numbered headings, then headings without numbers are
marked with class `unnumbered`, the default class used by pandoc to
convey this kind of information. The class is not added if none of
the headings in the document are numbered. This change ensures good conversion
results when converting with `--number-sections`.
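The rule can be sketched as follows. The `Heading` record and `markUnnumbered` are hypothetical stand-ins for illustration and do not mirror the reader's real types:

```haskell
-- | Simplified heading; field names are illustrative only.
data Heading = Heading
  { headingText :: String
  , isNumbered  :: Bool
  , classes     :: [String]
  } deriving (Eq, Show)

-- | Add the "unnumbered" class to headings without numbers, but only
-- when at least one heading in the document is numbered.
markUnnumbered :: [Heading] -> [Heading]
markUnnumbered hs
  | any isNumbered hs = map mark hs
  | otherwise         = hs
  where
    mark h
      | isNumbered h = h
      | otherwise    = h { classes = "unnumbered" : classes h }
```

A document with no numbered headings passes through unchanged, so the class is never added to every heading.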
Closes: #8148
|
|
These are supported in the same way as Zotero citations,
using the same code. As with Zotero, enable the `citations`
extension on `docx` to parse these as native citations.
Closes #7840.
|
|
as FieldInfo.
|
|
So far this just adds a constructor for FieldInfo;
we'll need to adjust the rest of the reader code to
parse the JSON and do something with it.
See #7840.
|
|
Closes #7786.
|
|
...instead of ParPart.
Also remove NullParPart constructor, as it is no longer
needed.
This will allow us to handle elements that contain multiple
ParParts, e.g. w:drawing elements with multiple pic:pic.
See #7786.
|
|
We were using `splitPath` in two places in the code
where `splitDirectories` should have been used.
This led to a test for `..` in paths in `extractMedia`
failing, so that images with `..` in the path name
could be extracted outside the directory specified
by `extractMedia`.
It also led a test for `media` in resource paths to fail
in the docx reader.
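The difference between the two functions can be shown with a short sketch. The two predicate names are made up for this example; `splitPath` and `splitDirectories` are the real functions from `System.FilePath`:

```haskell
import System.FilePath (splitDirectories, splitPath)

-- | Buggy check: splitPath keeps trailing separators, so the component
-- is "../" rather than "..", and the test never matches.
hasDotDotViaSplitPath :: FilePath -> Bool
hasDotDotViaSplitPath fp = ".." `elem` splitPath fp

-- | Fixed check: splitDirectories drops the separators, so ".."
-- appears as a component of its own.
hasDotDotViaSplitDirectories :: FilePath -> Bool
hasDotDotViaSplitDirectories fp = ".." `elem` splitDirectories fp
```

For a path such as `media/../../evil.png`, the first predicate returns `False` and the second returns `True`.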
|
|
This fixes a regression introduced in pandoc 2.15 by PR #7606.
Closes #7655.
|
|
Some fields have only an instrText and no content. Pandoc didn't
understand these, which caused other fields to be misparsed because it
seemed like a field was still open when it wasn't.
|
|
These fields, often used in tables of contents, can be a hyperlink.
|
|
Fields delimited by fldChar elements can contain other fields. Before,
the nested fields would be ignored, except for their end markers, which
would be taken as the end of the parent field.
To fix this issue, fields needed to be treated as containing ParParts
instead of Runs, since a Run can't represent sufficiently complex structures.
This also impacted Hyperlinks since they can originate from a field.
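To make the nesting problem concrete, here is a small sketch of a recursive parse over a flattened token stream. `Token`, `Part`, and the parsing functions are hypothetical simplifications, not the reader's actual types, and the sketch only handles nested fields in the result part of a field, not in its instruction:

```haskell
-- | Flattened view of a paragraph: fldChar markers, instruction text,
-- and ordinary text. Hypothetical types for illustration.
data Token = Begin | Separate | End | Instr String | Txt String
  deriving (Eq, Show)

-- | A part is plain text or a field whose result may hold more parts.
data Part = PlainText String | Field String [Part]
  deriving (Eq, Show)

-- | Parse parts until the End that closes the innermost open field.
parseParts :: [Token] -> ([Part], [Token])
parseParts [] = ([], [])
parseParts (End : rest) = ([], rest)
parseParts (Begin : rest) =
  let (field, rest')  = parseField rest
      (more,  rest'') = parseParts rest'
  in (field : more, rest'')
parseParts (Txt t : rest) =
  let (more, rest') = parseParts rest
  in (PlainText t : more, rest')
parseParts (_ : rest) = parseParts rest  -- stray Instr/Separate: skipped

-- | Parse one field, starting just after its Begin marker.
parseField :: [Token] -> (Part, [Token])
parseField toks =
  let (instrs, afterInstr) = span isInstr toks
      instr = concat [s | Instr s <- instrs]
  in case afterInstr of
       Separate : rest -> let (content, rest') = parseParts rest
                          in (Field instr content, rest')
       End : rest      -> (Field instr [], rest)  -- instruction only
       rest            -> (Field instr [], rest)  -- unterminated field
  where
    isInstr (Instr _) = True
    isInstr _         = False

-- | Parse a whole token stream, skipping any stray top-level End.
parseAll :: [Token] -> [Part]
parseAll toks = case parseParts toks of
  (ps, [])   -> ps
  (ps, rest) -> ps ++ parseAll rest
```

Because each `End` closes only the innermost open field, the end of a nested field is no longer mistaken for the end of its parent.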
|
|
When a paragraph has an indentation different from the parent (named)
style, it used to be considered a blockquote. But this only makes sense
when the paragraph has more indentation. So this commit adds a check
for the indentation of the parent style.
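A simplified sketch of the check, assuming only the left indentation matters and that a missing value on the style means zero. `isBlockQuote` here is a hypothetical function; the reader's real logic may also account for hanging indents:

```haskell
import Data.Maybe (fromMaybe)

-- | Decide whether a paragraph should be treated as a block quote.
-- First argument: left indent of the parent (named) style, if any.
-- Second argument: left indent set directly on the paragraph, if any.
isBlockQuote :: Maybe Int -> Maybe Int -> Bool
isBlockQuote parentLeft paraLeft =
  case paraLeft of
    -- No explicit indent: the paragraph simply inherits the style's.
    Nothing -> False
    -- Only strictly more indentation than the style counts.
    Just p  -> p > fromMaybe 0 parentLeft
```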
|
|
The docx reader made a couple of assumptions about how docx
containers were laid out that were not always true, with
the result that some images in documents did not get
found/extracted.
Closes #7511.
|
|
Closes #7374.
|
|
* Column spans
* Row spans
  - The spec says that if the `val` attribute is omitted, its value
    should be assumed to be `continue`, and that its values are
    restricted to {`restart`, `continue`}. If the attribute has any other
    value, it seems reasonable to default it to `continue`. This
    might cause problems if the spec is extended in the future by adding
    a third possible value, in which case this would probably give
    incorrect behaviour, and wouldn't error.
* Allow multiple header rows
* Include table description in simple caption
  - The table description element is like alt text for a table (along
    with the table caption element). It seems like we should include
    this somewhere, but I’m not 100% sure how – I’m pairing it with the
    simple caption for the moment. (Should it maybe go in the block
    caption instead?)
* Detect table captions
  - Check for caption paragraph style /and/ either the simple or
    complex table field. This means the caption detection fails for
    captions which don’t contain a field, as in an example doc I added
    as a test. However, I think it’s better to be too conservative: a
    missed table caption will still show up as a paragraph next to the
    table, whereas if I incorrectly classify something else as a table
    caption it could cause havoc by pairing it up with a table it’s
    not at all related to, or dropping it entirely.
* Update tests and add new ones
Partially fixes: #6316
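The defaulting rule for the row-span `val` attribute described above can be sketched like this. `VMerge` and `parseVMerge` are hypothetical names for illustration, not necessarily those used in the reader:

```haskell
-- | The two values the spec allows for the vertical-merge attribute.
data VMerge = Restart | Continue
  deriving (Eq, Show)

-- | Interpret the optional `val` attribute of a vertical-merge element.
-- An omitted, explicit "continue", or unrecognized value all map to
-- Continue, as discussed above.
parseVMerge :: Maybe String -> VMerge
parseVMerge (Just "restart") = Restart
parseVMerge _                = Continue
```

The catch-all case is what makes an unknown future value silently behave like `continue` instead of raising an error.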
|
|
They represent images, the same way as other images in vml format.
|
|
This gives a speedup of about 5-10%.
The reader is now approximately twice as fast as in the last
release.
|
|
...and add new definitions isomorphic to xml-light's, but with
Text instead of String. This allows us to keep most of the code in
existing readers that use xml-light, but avoid lots of unnecessary
allocation.
We also add versions of the functions from xml-light's
Text.XML.Light.Output and Text.XML.Light.Proc that operate
on our modified XML types, and functions that convert
xml-light types to our types (since some of our dependencies,
like texmath, use xml-light).
Update golden tests for docx and pptx.
OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`.
Docx: Do a manual traversal to unwrap sdt and smartTag.
This is faster, and needed to pass the tests.
Benchmarks:
A = prior to 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8)
B = as of 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8)
C = this commit
| Reader | A | B | C |
| ------- | ----- | ------ | ----- |
| docbook | 18 ms | 12 ms | 10 ms |
| opml | 65 ms | 62 ms | 35 ms |
| jats | 15 ms | 11 ms | 9 ms |
| docx | 72 ms | 69 ms | 44 ms |
| odt | 78 ms | 41 ms | 28 ms |
| epub | 64 ms | 61 ms | 56 ms |
| fb2 | 14 ms | 5 ms | 4 ms |
|
|
This exports functions that use xml-conduit's parser to
produce an xml-light Element or [Content]. This allows
existing pandoc code to use a better parser without
much modification.
The new parser is used in all places where xml-light's
parser was previously used. Benchmarks show a significant
performance improvement in parsing XML-based formats
(especially ODT and FB2).
Note that the xml-light types use String, so the
conversion from xml-conduit types involves a lot
of extra allocation. It would be desirable to
avoid that in the future by gradually switching
to using xml-conduit directly. This can be done
module by module.
The new parser also reports errors, which we report
when possible.
A new constructor PandocXMLError has been added to
PandocError in T.P.Error [API change].
Closes #7091, which was the main stimulus.
These changes revealed the need for some changes
in the tests. The docbook-reader.docbook test
lacked definitions for the entities it used; these
have been added. And the docx golden tests have been
updated, because the new parser does not preserve
the order of attributes.
Add entity defs to docbook-reader.docbook.
Update golden tests for docx.
|
|
* Remove unused LANGUAGE pragmata
* Apply HLint suggestions
* Configure HLint to ignore some warnings
* Lint code when committing to master
|
|
For security reasons, some legal firms delete the date from comments and
tracked changes.
* Make date optional (Maybe) in tracked changes and comments datatypes
* Add tests
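A sketch of what an optional date could look like in such a datatype. `ChangeInfo` and `parseChangeInfo` are illustrative names and the attribute list is a simplified stand-in for real XML attributes; only the id and author are treated as required here:

```haskell
-- | Metadata attached to a tracked change or comment (simplified).
data ChangeInfo = ChangeInfo
  { changeId     :: String
  , changeAuthor :: String
  , changeDate   :: Maybe String  -- may have been stripped from the file
  } deriving (Eq, Show)

-- | Build a ChangeInfo from (name, value) attribute pairs.
-- Fails only if id or author is missing; a missing date is fine.
parseChangeInfo :: [(String, String)] -> Maybe ChangeInfo
parseChangeInfo attrs = do
  cid    <- lookup "id" attrs
  author <- lookup "author" attrs
  pure (ChangeInfo cid author (lookup "date" attrs))
```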
|
|
* Fix hlint suggestions, update hlint.yaml
Most suggestions were redundant brackets. Some required
LambdaCase.
The .hlint.yaml file had a small typo, and didn't ignore camelCase
suggestions in certain modules.
|
|
[Docx Reader] Only use bCs/iCs on runs with rtl or cs property
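One plausible reading of this rule, sketched with a hypothetical `RunProps` record. Treating the complex-script properties as a replacement for (not an addition to) the plain ones is an assumption of this sketch, not something the commit message states:

```haskell
-- | Simplified run properties; each field mirrors one docx property.
data RunProps = RunProps
  { bold          :: Bool  -- b
  , boldCs        :: Bool  -- bCs
  , italic        :: Bool  -- i
  , italicCs      :: Bool  -- iCs
  , rightToLeft   :: Bool  -- rtl
  , complexScript :: Bool  -- cs
  } deriving (Eq, Show)

-- | True when the run is marked as right-to-left or complex script.
usesComplexScript :: RunProps -> Bool
usesComplexScript rp = rightToLeft rp || complexScript rp

-- | Consult bCs only for rtl/cs runs; otherwise use b.
isBold :: RunProps -> Bool
isBold rp = if usesComplexScript rp then boldCs rp else bold rp

-- | Consult iCs only for rtl/cs runs; otherwise use i.
isItalic :: RunProps -> Bool
isItalic rp = if usesComplexScript rp then italicCs rp else italic rp
```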
|