pandoc - MSM's mirror of Pandoc

Age	Commit message	Author
2024-02-28	Docx reader: ensure that table captions are counted.	John MacFarlane
	Normally these occur outside the table element itself, but they should still be parsed as captions in this case. Closes #9518.
2024-02-28	Docx reader: detect caption by style name not id.	John MacFarlane
	The styleId can change depending on the localization. Partially resolves #9518.
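The lookup described above can be sketched in Python. This is only an illustration of the idea, not pandoc's Haskell code: the `styles` mapping, the `is_caption_style` helper, and the sample style ids are hypothetical, and the sketch assumes the style name of interest is `caption`.

```python
# Hypothetical style table: styleId -> style name.
# Assumption: the id varies with localization, while the name stays the same.
styles = {
    "Caption": "caption",        # id as it might appear in an English document
    "Beschriftung": "caption",   # id as it might appear in a German document
    "Heading1": "heading 1",
}


def is_caption_style(style_id, styles):
    """Return True if the style's *name* (not its id) is 'caption'."""
    name = styles.get(style_id, "")
    return name.strip().lower() == "caption"
```

Matching on the name means both sample ids resolve to a caption, whereas a comparison against the literal id `Caption` would miss the second one.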
2023-12-26	fix(docx): support absolute header/footer paths	Edwin Török
	Header and footer references may be absolute in the reference.docx. E.g. editing it with dotnet's Open-XML-SDK causes this error:
	```
	+ pandoc test.md -t docx --reference-doc referenceh.docx -o test.docx
	word//word/header1.xml missing in reference docx
	```
	There was already code in pandoc to handle relative vs absolute paths in references, so use it.
	Signed-off-by: Edwin Török <edwin@etorok.net>
2023-12-18	Docx reader: fix HYPERLINK with only switch and no argument.	John MacFarlane
	The argument can apparently be omitted, and then we just have a fragment URL. Closes #9246.
2023-12-11	Whitespace fix.	John MacFarlane

2023-11-29	Docx reader: unwrap content of shaped textboxes...	Stephan Meijer
	* #9214 text in shape format test document
	* #9214 support Text in Shape Format
	* #9214 remove irrelevant code
2023-11-28	Docx reader: Improve handling of w:sym.	John MacFarlane
	Add T.P.Readers.Docx.Symbols. This gives us a table to use to resolve characters included in docx via w:sym element. Use this table to resolve characters when symbol fonts are specified. Closes #9220.
2023-11-28	Correct comment.	John MacFarlane

2023-08-18	Docx reader: omit "Table NN" from caption.	John MacFarlane
	Closes #9002.
2023-07-14	Docx reader: use SVG version of image if present.	John MacFarlane
	Previously the backup PNG was exported even if an SVG was present, but the SVG should be preferred. Closes #7244.
2023-02-18	Docx reader: parse image alt texts in LibreOffice generated files	Albert Krewinkel
	LibreOffice tags images slightly differently than Word; this change lets the parser take that difference into account when looking for an image description (alt text).
2023-01-10	Update copyright years, it's 2023!	Albert Krewinkel

2022-12-11	Docx reader: fix handling of oMathPara in w:p with other content.	John MacFarlane
	Closes #8483. The problem is that oMathPara can either occur at the block-level (child of w:body) or at the inline level (child of w:p, potentially with other content). We need to handle both cases. Previously the code just assumed that if we had a w:p with an oMathPara, the math would be the sole content. This patch removes OMathPara as a constructor of BodyPart and adds it as a constructor of ParPart.
2022-11-19	Docx reader: Support parsing of highlighted text.	John MacFarlane

2022-10-31	First stab at mtl 2.3 compliance.	John MacFarlane
	This will no doubt produce a bunch of warnings and hence CI failures, which we'll need to work around with explicit imports.
2022-10-16	T.P.Parsing: Remove gratuitous renaming of Parsec types.	John MacFarlane
	We were exporting Parser, ParserT as synonyms of Parsec, ParsecT. There is no good reason for this and it can cause confusion. Also, when possible, we replace imports of Text.Parsec with T.P.Parsing. The idea is to make it easier, at some point, to switch to megaparsec or another parsing engine if we want to. T.P.Parsing new exports: Stream(..), updatePosString, SourceName, Parsec, ParsecT [API change]. Removed exports: Parser, ParserT [API change].
2022-10-15	Minor code cleanups.	John MacFarlane

2022-09-27	Fix small whitespace things.	John MacFarlane

2022-08-30	Docx reader: mark unnumbered headings with class 'unnumbered'	Albert Krewinkel
	If a document uses numbered headings, then headings without numbers are marked with class `unnumbered`, the default class used by pandoc to convey this kind of information. The classes are not added if none of the headings in a document are numbered. This change ensures good conversion results when converting with `--number-sections`. Closes: #8148
2022-02-04	Docx reader: parse EN.CITE and EN.REFLIST fields.	John MacFarlane

2022-02-04	Support embedded Mendeley citations in docx.	John MacFarlane
	These are supported in the same way as Zotero citations, using the same code. As with Zotero, enable the `citations` extension on `docx` to parse these as native citations. Closes #7840.
2022-01-19	Docx reader: parse both zotero citation and bibliography...	John MacFarlane
	as FieldInfo.
2022-01-19	Docx reader: add skeleton for parsing zotero ADDINs.	John MacFarlane
	So far this just adds a constructor for FieldInfo; we'll need to adjust the rest of the reader code to parse the JSON and do something with it. See #7840.
2022-01-17	Fix some haddock errors.	John MacFarlane

2022-01-02	Copyright notices: update for 2022	Albert Krewinkel

2021-12-30	Docx reader: handle multiple pic elements inside a drawing.	John MacFarlane
	Closes #7786.
2021-12-30	Docx reader: change elemToParPart to return [ParPart]	John MacFarlane
	...instead of ParPart. Also remove NullParPart constructor, as it is no longer needed. This will allow us to handle elements that contain multiple ParParts, e.g. w:drawing elements with multiple pic:pic. See #7786.
2021-12-30	Fix ghc 9.2.1 warnings.	John MacFarlane

2021-12-28	Use `splitDirectories` instead of `splitPath`.	John MacFarlane
	We were using `splitPath` in two places in the code where `splitDirectories` should have been used. This led to a test for `..` in paths in `extractMedia` failing, so that images with `..` in the path name could be extracted outside the directory specified by `extractMedia`. It also caused a test for `media` in resource paths to fail in the docx reader.
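A Python sketch of why the `..` test failed. The two helpers below are hypothetical stand-ins that imitate the Haskell functions only for simple relative paths with single `/` separators; they are not the `filepath` library itself. The point is that `splitPath`-style components keep their trailing separator, so a literal `..` never matches.

```python
def split_path(path):
    """Imitate splitPath: each component keeps its trailing '/'."""
    parts, current = [], ""
    for ch in path:
        current += ch
        if ch == "/":
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def split_directories(path):
    """Imitate splitDirectories: components without separators."""
    return [part for part in path.split("/") if part]


path = "media/../secret.png"  # hypothetical image path inside a container

# Components are "media/", "../", "secret.png": the check for ".." misses.
unsafe_check = ".." in split_path(path)

# Components are "media", "..", "secret.png": the check for ".." succeeds.
safe_check = ".." in split_directories(path)
```

With the separator-free components, a membership test for `..` (or for `media`) behaves as intended.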
2021-11-02	Docx reader: don't let first line indents trigger block quotes.	John MacFarlane
	This fixes a regression introduced in pandoc 2.15 by PR #7606. Closes #7655.
2021-10-18	Docx reader: fix handling of empty fields	Milan Bracke
	Some fields have only an instrText and no content. Pandoc didn't understand these, which caused other fields to be misunderstood because it seemed like a field was still open when it wasn't.
2021-10-18	Docx parser: implement PAGEREF fields	Milan Bracke
	These fields, often used in tables of contents, can be a hyperlink.
2021-10-18	Docx reader: fix handling of nested fields	Milan Bracke
	Fields delimited by fldChar elements can contain other fields. Before, the nested fields would be ignored, except for the end, which would be considered the end of the parent field. To fix this issue, fields needed to be considered containing ParParts instead of Runs, since a Run can't represent complex enough structures. This also impacted Hyperlinks since they can originate from a field.
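The nesting problem can be sketched in Python with a stack. This is an illustration under simplifying assumptions, not the reader's actual Haskell implementation: the token tuples, the dictionary layout, and the `parse_fields` name are all hypothetical, and nested fields are only handled in a field's content, not in its instruction.

```python
def parse_fields(tokens):
    """Group a flat token stream into nested fields.

    Tokens are (kind, value) pairs with kind in
    {"begin", "instr", "separate", "text", "end"}.
    """
    root = {"instr": None, "content": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "begin":
            field = {"instr": "", "content": []}
            stack[-1]["content"].append(field)
            stack.append(field)          # the new field is now innermost
        elif kind == "instr" and len(stack) > 1:
            stack[-1]["instr"] += value
        elif kind == "end" and len(stack) > 1:
            stack.pop()                  # closes only the innermost field
        elif kind == "text":
            stack[-1]["content"].append(value)
        # "separate" needs no action in this simplified model
    return root["content"]


tokens = [
    ("begin", None), ("instr", "HYPERLINK x"), ("separate", None),
    ("text", "a"),
    ("begin", None), ("instr", "PAGEREF y"), ("end", None),  # empty nested field
    ("text", "b"),
    ("end", None),
    ("text", "c"),
]
result = parse_fields(tokens)
```

Because each `end` pops only the innermost open field, the inner field's end no longer terminates its parent, and a field with no content closes cleanly.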
2021-10-10	Avoid blockquote when parent style has more indent	Milan Bracke
	When a paragraph has an indentation different from the parent (named) style, it used to be considered a blockquote. But this only makes sense when the paragraph has more indentation. So this commit adds a check for the indentation of the parent style.
2021-09-30	Docx reader: Add placeholder for word diagram	Ezwal

2021-08-19	Improve docx reader's robustness in extracting images.	John MacFarlane
	The docx reader made a couple assumptions about how docx containers were laid out that were not always true, with the result that some images in documents did not get found/extracted. Closes #7511.
2021-06-12	Docx reader: handle absolute URIs in Relationship Target.	John MacFarlane
	Closes #7374.
2021-05-28	Docx reader: Support new table features.	Emily Bourke
	* Column spans
	* Row spans
	  - The spec says that if the `val` attribute is omitted, its value should be assumed to be `continue`, and that its values are restricted to {`restart`, `continue`}. If the attribute has any other value, I think it seems reasonable to default it to `continue`. It might cause problems if the spec is extended in the future by adding a third possible value, in which case this would probably give incorrect behaviour, and wouldn't error.
	* Allow multiple header rows
	* Include table description in simple caption
	  - The table description element is like alt text for a table (along with the table caption element). It seems like we should include this somewhere, but I’m not 100% sure how – I’m pairing it with the simple caption for the moment. (Should it maybe go in the block caption instead?)
	* Detect table captions
	  - Check for caption paragraph style /and/ either the simple or complex table field. This means the caption detection fails for captions which don’t contain a field, as in an example doc I added as a test. However, I think it’s better to be too conservative: a missed table caption will still show up as a paragraph next to the table, whereas if I incorrectly classify something else as a table caption it could cause havoc by pairing it up with a table it’s not at all related to, or dropping it entirely.
	* Update tests and add new ones
	Partially fixes: #6316
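The row-span defaulting rule described in this commit message can be written as a small Python sketch. The `normalize_vmerge` name is hypothetical and this is not pandoc's code; it only restates the stated behaviour: an omitted `val` means `continue`, and any unrecognized value also falls back to `continue` without raising an error.

```python
def normalize_vmerge(val):
    """Map a raw `val` attribute (None if absent) to 'restart' or 'continue'."""
    if val == "restart":
        return "restart"
    # Omitted, "continue", or any unrecognized value: treat as "continue".
    return "continue"
```

As the message notes, the silent fallback trades strictness for robustness: a future third value would be read as `continue` rather than reported.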
2021-05-28	Docx reader: Read table column widths.	Emily Bourke

2021-05-25	Allow compilation with base 4.15	Albert Krewinkel

2021-04-29	Docx reader: add handling of vml image objects (jgm#4735) (#7257)	mbrackeantidot
	They represent images, the same way as other images in vml format.
2021-03-15	Use foldl' instead of foldl everywhere.	John MacFarlane

2021-02-17	Docx reader: use Map instead of list for Namespaces.	John MacFarlane
	This gives a speedup of about 5-10%. The reader is now approximately twice as fast as in the last release.
2021-02-16	Rename Text.Pandoc.XMLParser -> Text.Pandoc.XML.Light...	John MacFarlane
	..and add new definitions isomorphic to xml-light's, but with Text instead of String. This allows us to keep most of the code in existing readers that use xml-light, but avoid lots of unnecessary allocation.
	We also add versions of the functions from xml-light's Text.XML.Light.Output and Text.XML.Light.Proc that operate on our modified XML types, and functions that convert xml-light types to our types (since some of our dependencies, like texmath, use xml-light).
	Update golden tests for docx and pptx.
	OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`.
	Docx: Do a manual traversal to unwrap sdt and smartTag. This is faster, and needed to pass the tests.
	Benchmarks:
	A = prior to 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8)
	B = as of 8ca191604dcd13af27c11d2da225da646ebce6fc (Feb 8)
	C = this commit
	| Reader  | A     | B     | C     |
	| ------- | ----- | ----- | ----- |
	| docbook | 18 ms | 12 ms | 10 ms |
	| opml    | 65 ms | 62 ms | 35 ms |
	| jats    | 15 ms | 11 ms | 9 ms  |
	| docx    | 72 ms | 69 ms | 44 ms |
	| odt     | 78 ms | 41 ms | 28 ms |
	| epub    | 64 ms | 61 ms | 56 ms |
	| fb2     | 14 ms | 5 ms  | 4 ms  |
2021-02-10	Add new unexported module T.P.XMLParser.	John MacFarlane
	This exports functions that use xml-conduit's parser to produce an xml-light Element or [Content]. This allows existing pandoc code to use a better parser without much modification.
	The new parser is used in all places where xml-light's parser was previously used. Benchmarks show a significant performance improvement in parsing XML-based formats (especially ODT and FB2).
	Note that the xml-light types use String, so the conversion from xml-conduit types involves a lot of extra allocation. It would be desirable to avoid that in the future by gradually switching to using xml-conduit directly. This can be done module by module.
	The new parser also reports errors, which we report when possible. A new constructor PandocXMLError has been added to PandocError in T.P.Error [API change].
	Closes #7091, which was the main stimulus.
	These changes revealed the need for some changes in the tests. The docbook-reader.docbook test lacked definitions for the entities it used; these have been added. And the docx golden tests have been updated, because the new parser does not preserve the order of attributes.
	Add entity defs to docbook-reader.docbook.
	Update golden tests for docx.
2021-01-08	Update copyright notices for 2021 (#7012)	Albert Krewinkel

2020-11-07	Lint code in PRs and when committing to master (#6790)	Albert Krewinkel
	* Remove unused LANGUAGE pragmata
	* Apply HLint suggestions
	* Configure HLint to ignore some warnings
	* Lint code when committing to master
2020-10-06	DOCX reader: Allow empty dates in comments and tracked changes (#6726)	Diego Balseiro
	For security reasons, some legal firms delete the date from comments and tracked changes.
	* Make date optional (Maybe) in tracked changes and comments datatypes
	* Add tests
2020-09-13	Fix hlint suggestions, update hlint.yaml (#6680)	Christian Despres
	* Fix hlint suggestions, update hlint.yaml
	Most suggestions were redundant brackets. Some required LambdaCase. The .hlint.yaml file had a small typo, and didn't ignore camelCase suggestions in certain modules.
2020-07-13	Merge pull request #6527 from lierdakil/fix-6514	John MacFarlane
	[Docx Reader] Only use bCs/iCs on runs with rtl or cs property