path: root/src/Text/Pandoc/Readers/Docx
Age  Commit message  Author
2024-02-28Docx reader: ensure that table captions are counted.John MacFarlane
Normally these occur outside the table element itself, but they should still be parsed as captions in this case. Closes #9518.
2024-02-28Docx reader: detect caption by style name not id.John MacFarlane
The styleId can change depending on the localization. Partially resolves #9518.
2023-12-26fix(docx): support absolute header/footer pathsEdwin Török
Header and footer references may be absolute in the reference.docx. E.g. editing it with dotnet's Open-XML-SDK causes this error:

```
+ pandoc test.md -t docx --reference-doc referenceh.docx -o test.docx
word//word/header1.xml missing in reference docx
```

There was already code in pandoc to handle relative vs absolute paths in references, so use it.

Signed-off-by: Edwin Török <edwin@etorok.net>
2023-12-18Docx reader: fix HYPERLINK with only switch and no argument.John MacFarlane
The argument can apparently be omitted, and then we just have a fragment URL. Closes #9246.
2023-12-11Whitespace fix.John MacFarlane
2023-11-29Docx reader: unwrap content of shaped textboxes...Stephan Meijer
* #9214 text in shape format test document
* #9214 support Text in Shape Format
* #9214 remove irrelevant code
2023-11-28Docx reader: Improve handling of w:sym.John MacFarlane
Add T.P.Readers.Docx.Symbols. This gives us a table to use to resolve characters included in docx via w:sym element. Use this table to resolve characters when symbol fonts are specified. Closes #9220.
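The table lookup described above can be sketched roughly as follows. This is only an illustration, not the contents of T.P.Readers.Docx.Symbols: the names `symbolTable`, `resolveSym`, and `foldPUA` are hypothetical, only two sample entries are shown, and the folding of `w:char` values from the F000..F0FF range down to 00..FF is an assumption about how symbol-font code points are stored.

```haskell
import qualified Data.Map as M
import Data.Char (toLower)
import Numeric (readHex)

-- Hypothetical table keyed by (lowercased font name, code point in the font).
-- Only two sample entries are shown; a real table would be much larger.
symbolTable :: M.Map (String, Int) Char
symbolTable = M.fromList
  [ (("symbol", 0x61), '\x03B1')  -- Greek small letter alpha
  , (("symbol", 0xB7), '\x2022')  -- bullet
  ]

-- Assumption: values in F000..F0FF name the same glyphs as 00..FF.
foldPUA :: Int -> Int
foldPUA n
  | n >= 0xF000 && n <= 0xF0FF = n - 0xF000
  | otherwise                  = n

-- Resolve the w:font and w:char attributes of a w:sym element.
-- Returns Nothing for unparseable codes or unknown (font, code) pairs.
resolveSym :: String -> String -> Maybe Char
resolveSym font hexCode =
  case readHex hexCode of
    [(n, "")] -> M.lookup (map toLower font, foldPUA n) symbolTable
    _         -> Nothing
```

For example, `resolveSym "Symbol" "F0B7"` looks up the bullet entry, while an unknown font yields `Nothing`, so a caller can fall back to whatever it did before.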
2023-11-28Correct comment.John MacFarlane
2023-08-18Docx reader: omit "Table NN" from caption.John MacFarlane
Closes #9002.
2023-07-14Docx reader: use SVG version of image if present.John MacFarlane
Previously the backup PNG was exported even if an SVG was present, but the SVG should be preferred. Closes #7244.
2023-02-18Docx reader: parse image alt texts in LibreOffice generated filesAlbert Krewinkel
LibreOffice tags images slightly differently than Word; this change lets the parser take that difference into account when looking for an image description (alt text).
2023-01-10Update copyright years, it's 2023!Albert Krewinkel
2022-12-11Docx reader: fix handling of oMathPara in w:p with other content.John MacFarlane
Closes #8483. The problem is that oMathPara can either occur at the block-level (child of w:body) or at the inline level (child of w:p, potentially with other content). We need to handle both cases. Previously the code just assumed that if we had a w:p with an oMathPara, the math would be the sole content. This patch removes OMathPara as a constructor of BodyPart and adds it as a constructor of ParPart.
2022-11-19Docx reader: Support parsing of highlighted text.John MacFarlane
2022-10-31First stab at mtl 2.3 compliance.John MacFarlane
This will no doubt produce a bunch of warnings and hence CI failures, which we'll need to work around with explicit imports.
2022-10-16T.P.Parsing: Remove gratuitous renaming of Parsec types.John MacFarlane
We were exporting Parser, ParserT as synonyms of Parsec, ParsecT. There is no good reason for this and it can cause confusion. Also, when possible, we replace imports of Text.Parsec with T.P.Parsing. The idea is to make it easier, at some point, to switch to megaparsec or another parsing engine if we want to. T.P.Parsing new exports: Stream(..), updatePosString, SourceName, Parsec, ParsecT [API change]. Removed exports: Parser, ParserT [API change].
2022-10-15Minor code cleanups.John MacFarlane
2022-09-27Fix small whitespace things.John MacFarlane
2022-08-30Docx reader: mark unnumbered headings with class 'unnumbered'Albert Krewinkel
If a document uses numbered headings, then headings without numbers are marked with class `unnumbered`, the default class used by pandoc to convey this kind of information. The class is not added if none of the headings in a document are numbered. This change ensures good conversion results when converting with `--number-sections`. Closes: #8148
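The rule above can be sketched as a small pure function. The `Heading` type and the name `markUnnumbered` are hypothetical simplifications for illustration; the reader's actual data types are richer.

```haskell
-- Hypothetical, simplified heading type.
data Heading = Heading
  { hLevel    :: Int
  , hNumbered :: Bool      -- does the heading carry numbering in the docx?
  , hClasses  :: [String]
  , hText     :: String
  } deriving (Show, Eq)

-- Add "unnumbered" to headings without numbers, but only when the
-- document contains at least one numbered heading.
markUnnumbered :: [Heading] -> [Heading]
markUnnumbered hs
  | any hNumbered hs = map mark hs
  | otherwise        = hs
  where
    mark h
      | hNumbered h = h
      | otherwise   = h { hClasses = "unnumbered" : hClasses h }
```

A document with no numbered headings passes through unchanged, so it is not cluttered with classes that carry no information.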
2022-02-04Docx reader: parse EN.CITE and EN.REFLIST fields.John MacFarlane
2022-02-04Support embedded Mendeley citations in docx.John MacFarlane
These are supported in the same way as Zotero citations, using the same code. As with Zotero, enable the `citations` extension on `docx` to parse these as native citations. Closes #7840.
2022-01-19Docx reader: parse both zotero citation and bibliography...John MacFarlane
as FieldInfo.
2022-01-19Docx reader: add skeleton for parsing zotero ADDINs.John MacFarlane
So far this just adds a constructor for FieldInfo; we'll need to adjust the rest of the reader code to parse the JSON and do something with it. See #7840.
2022-01-17Fix some haddock errors.John MacFarlane
2022-01-02Copyright notices: update for 2022Albert Krewinkel
2021-12-30Docx reader: handle multiple pic elements inside a drawing.John MacFarlane
Closes #7786.
2021-12-30Docx reader: change elemToParPart to return [ParPart]John MacFarlane
...instead of ParPart. Also remove NullParPart constructor, as it is no longer needed. This will allow us to handle elements that contain multiple ParParts, e.g. w:drawing elements with multiple pic:pic. See #7786.
2021-12-30Fix ghc 9.2.1 warnings.John MacFarlane
2021-12-28Use `splitDirectories` instead of `splitPath`.John MacFarlane
We were using `splitPath` in two places in the code where `splitDirectories` should have been used. As a result, the check for `..` in paths in `extractMedia` did not work, so images with `..` in the path name could be extracted outside the directory specified by `extractMedia`. The check for `media` in resource paths in the docx reader failed for the same reason.
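The difference between the two functions is easy to see in isolation. The helper names below are hypothetical and this is not the code from pandoc; the sketch uses `System.FilePath.Posix` so that the behaviour does not depend on the host platform.

```haskell
import System.FilePath.Posix (splitDirectories, splitPath)

-- Hypothetical check: does the path contain a ".." component?
-- splitDirectories drops separators, so components compare cleanly.
hasParentRef :: FilePath -> Bool
hasParentRef = elem ".." . splitDirectories

-- The same check written with splitPath. It misses ".." in the middle
-- of a path, because splitPath keeps the trailing separator on each
-- piece ("../" rather than "..").
hasParentRefBuggy :: FilePath -> Bool
hasParentRefBuggy = elem ".." . splitPath
```

For `"media/../../etc/passwd"`, `splitPath` yields pieces such as `"media/"` and `"../"`, so the buggy check returns `False`, while `splitDirectories` yields `"media"` and `".."` and the check succeeds.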
2021-11-02Docx reader: don't let first line indents trigger block quotes.John MacFarlane
This fixes a regression introduced in pandoc 2.15 by PR #7606. Closes #7655.
2021-10-18Docx reader: fix handling of empty fieldsMilan Bracke
Some fields have only an instrText and no content. Pandoc didn't understand these, so it behaved as if the field were still open when it wasn't, which caused other fields to be misunderstood.
2021-10-18Docx parser: implement PAGEREF fieldsMilan Bracke
These fields, often used in tables of contents, can be a hyperlink.
2021-10-18Docx reader: fix handling of nested fieldsMilan Bracke
Fields delimited by fldChar elements can contain other fields. Before, the nested fields would be ignored, except for the end, which would be considered the end of the parent field. To fix this issue, fields needed to be considered containing ParParts instead of Runs, since a Run can't represent complex enough structures. This also impacted Hyperlinks since they can originate from a field.
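A minimal sketch of the idea, assuming a simplified token stream: the types `Tok` and `Part` and the functions below are hypothetical and much simpler than the reader's actual `ParPart` and `Run` types. It assumes nested fields appear only in the result part of the enclosing field. A field's result is parsed recursively, so each end marker closes the innermost open field, and a field with no separator and no result is accepted as empty.

```haskell
-- Hypothetical, simplified tokens for the contents of a paragraph.
data Tok = FldBegin | FldInstr String | FldSeparate | FldEnd | Text String
  deriving (Show, Eq)

-- A field holds its instruction and a result that may contain more fields.
data Part = Plain String | Field String [Part]
  deriving (Show, Eq)

-- Collect the instruction text that follows a field's begin marker.
takeInstr :: [Tok] -> (String, [Tok])
takeInstr (FldInstr s : ts) = let (s', ts') = takeInstr ts in (s ++ s', ts')
takeInstr ts                = ("", ts)

-- Parse until the input ends or an unmatched FldEnd is reached;
-- the unmatched FldEnd is left for the enclosing field to consume.
parseParts :: [Tok] -> ([Part], [Tok])
parseParts [] = ([], [])
parseParts (FldEnd : rest) = ([], FldEnd : rest)
parseParts (Text s : rest) =
  let (ps, r) = parseParts rest in (Plain s : ps, r)
parseParts (FldBegin : rest) =
  let (instr, afterInstr) = takeInstr rest
      (body, afterBody) = case afterInstr of
        FldSeparate : r -> parseParts r
        r               -> ([], r)      -- empty field: no result part
      afterEnd = case afterBody of
        FldEnd : r -> r
        r          -> r                 -- tolerate a missing end marker
      (ps, r') = parseParts afterEnd
  in (Field instr body : ps, r')
parseParts (_ : rest) = parseParts rest -- ignore stray instr/separate

-- Top-level entry point: skip any unmatched end markers.
parseFields :: [Tok] -> [Part]
parseFields toks = case parseParts toks of
  (ps, FldEnd : rest) -> ps ++ parseFields rest
  (ps, _)             -> ps
```

Because a field's result is a list of `Part` values rather than plain text, a hyperlink field can hold a page-reference field inside it without the inner end marker closing the outer field.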
2021-10-10Avoid blockquote when parent style has more indentMilan Bracke
When a paragraph has an indentation different from the parent (named) style, it used to be considered a blockquote. But this only makes sense when the paragraph has more indentation. So this commit adds a check for the indentation of the parent style.
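The added check amounts to comparing the paragraph's indentation with that of its parent style. The function name and the use of plain `Maybe Int` values below are assumptions made for this sketch, not the reader's actual representation.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical check. Both arguments are left indents in the same unit;
-- Nothing means no indent is specified.
-- A paragraph counts as a block quote only if it is indented more
-- than its parent style, not merely differently.
isBlockQuoteIndent :: Maybe Int -> Maybe Int -> Bool
isBlockQuoteIndent parentLeft paraLeft =
  case paraLeft of
    Nothing -> False
    Just p  -> p > fromMaybe 0 parentLeft
```

With a parent style indented by 720, a paragraph indented by 360 is no longer treated as a block quote, while one indented by 1440 still is.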
2021-09-30Docx reader: Add placeholder for word diagramEzwal
2021-08-19Improve docx reader's robustness in extracting images.John MacFarlane
The docx reader made a couple assumptions about how docx containers were laid out that were not always true, with the result that some images in documents did not get found/extracted. Closes #7511.
2021-06-12Docx reader: handle absolute URIs in Relationship Target.John MacFarlane
Closes #7374.
2021-05-28Docx reader: Support new table features.Emily Bourke
* Column spans
* Row spans
  - The spec says that if the `val` attribute is omitted, its value should be assumed to be `continue`, and that its values are restricted to {`restart`, `continue`}. If the attribute has any other value, I think it seems reasonable to default it to `continue`. It might cause problems if the spec is extended in the future by adding a third possible value, in which case this would probably give incorrect behaviour, and wouldn't error.
* Allow multiple header rows
* Include table description in simple caption
  - The table description element is like alt text for a table (along with the table caption element). It seems like we should include this somewhere, but I’m not 100% sure how – I’m pairing it with the simple caption for the moment. (Should it maybe go in the block caption instead?)
* Detect table captions
  - Check for caption paragraph style /and/ either the simple or complex table field. This means the caption detection fails for captions which don’t contain a field, as in an example doc I added as a test. However, I think it’s better to be too conservative: a missed table caption will still show up as a paragraph next to the table, whereas if I incorrectly classify something else as a table caption it could cause havoc by pairing it up with a table it’s not at all related to, or dropping it entirely.
* Update tests and add new ones

Partially fixes: #6316
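The row-span defaulting rule can be sketched as below. The type `VMerge` and the functions `parseVMerge` and `rowSpans` are hypothetical names for illustration. The sketch assumes the attribute in question is the `val` of a vertical-merge marker on a cell, and that each column is processed as one entry per row.

```haskell
-- Hypothetical representation of a cell's vertical-merge marker.
data VMerge = Restart | Continue
  deriving (Show, Eq)

-- Interpret the val attribute of a merge marker that is present.
-- Nothing means the attribute is omitted; it and any unrecognised
-- value default to Continue, as described above.
parseVMerge :: Maybe String -> VMerge
parseVMerge (Just "restart") = Restart
parseVMerge _                = Continue

-- Row spans for a single column, one input entry per row.
-- Nothing means the cell has no merge marker at all.
-- Each cell that starts a span absorbs the Continue cells below it.
rowSpans :: [Maybe VMerge] -> [Int]
rowSpans [] = []
rowSpans (_ : rest) =
  let (conts, others) = span (== Just Continue) rest
  in (1 + length conts) : rowSpans others
```

Defaulting unknown values to `Continue` keeps the function total, at the cost of silently accepting values a future version of the spec might define differently, which is the trade-off the commit message points out.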
2021-05-28Docx reader: Read table column widths.Emily Bourke
2021-05-25Allow compilation with base 4.15Albert Krewinkel
2021-04-29Docx reader: add handling of vml image objects (jgm#4735) (#7257)mbrackeantidot
They represent images, the same way as other images in vml format.
2021-03-15Use foldl' instead of foldl everywhere.John MacFarlane
2021-02-17Docx reader: use Map instead of list for Namespaces.John MacFarlane
This gives a speedup of about 5-10%. The reader is now approximately twice as fast as in the last release.
2021-02-16Rename Text.Pandoc.XMLParser -> Text.Pandoc.XML.Light...John MacFarlane
..and add new definitions isomorphic to xml-light's, but with Text instead of String. This allows us to keep most of the code in existing readers that use xml-light, but avoid lots of unnecessary allocation. We also add versions of the functions from xml-light's Text.XML.Light.Output and Text.XML.Light.Proc that operate on our modified XML types, and functions that convert xml-light types to our types (since some of our dependencies, like texmath, use xml-light).

Update golden tests for docx and pptx.

OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`.

Docx: Do a manual traversal to unwrap sdt and smartTag. This is faster, and needed to pass the tests.

Benchmarks:

A = prior to 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8)
B = as of 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8)
C = this commit

| Reader  | A     | B     | C     |
| ------- | ----- | ----- | ----- |
| docbook | 18 ms | 12 ms | 10 ms |
| opml    | 65 ms | 62 ms | 35 ms |
| jats    | 15 ms | 11 ms | 9 ms  |
| docx    | 72 ms | 69 ms | 44 ms |
| odt     | 78 ms | 41 ms | 28 ms |
| epub    | 64 ms | 61 ms | 56 ms |
| fb2     | 14 ms | 5 ms  | 4 ms  |
2021-02-10Add new unexported module T.P.XMLParser.John MacFarlane
This exports functions that use xml-conduit's parser to produce an xml-light Element or [Content]. This allows existing pandoc code to use a better parser without much modification.

The new parser is used in all places where xml-light's parser was previously used. Benchmarks show a significant performance improvement in parsing XML-based formats (especially ODT and FB2).

Note that the xml-light types use String, so the conversion from xml-conduit types involves a lot of extra allocation. It would be desirable to avoid that in the future by gradually switching to using xml-conduit directly. This can be done module by module.

The new parser also reports errors, which we report when possible. A new constructor PandocXMLError has been added to PandocError in T.P.Error [API change].

Closes #7091, which was the main stimulus.

These changes revealed the need for some changes in the tests. The docbook-reader.docbook test lacked definitions for the entities it used; these have been added. And the docx golden tests have been updated, because the new parser does not preserve the order of attributes.

Add entity defs to docbook-reader.docbook. Update golden tests for docx.
2021-01-08Update copyright notices for 2021 (#7012)Albert Krewinkel
2020-11-07Lint code in PRs and when committing to master (#6790)Albert Krewinkel
* Remove unused LANGUAGE pragmata
* Apply HLint suggestions
* Configure HLint to ignore some warnings
* Lint code when committing to master
2020-10-06DOCX reader: Allow empty dates in comments and tracked changes (#6726)Diego Balseiro
For security reasons, some legal firms delete the date from comments and tracked changes.

* Make date optional (Maybe) in tracked changes and comments datatypes
* Add tests
2020-09-13Fix hlint suggestions, update hlint.yaml (#6680)Christian Despres
* Fix hlint suggestions, update hlint.yaml

Most suggestions were redundant brackets. Some required LambdaCase. The .hlint.yaml file had a small typo, and didn't ignore camelCase suggestions in certain modules.
2020-07-13Merge pull request #6527 from lierdakil/fix-6514John MacFarlane
[Docx Reader] Only use bCs/iCs on runs with rtl or cs property