Embedding MuPDF

This document describes MuPDF from the point of view of a developer wanting to integrate it into their own embedded system. While the information here may be of interest to people wanting to use MuPDF on more standard desktop OSs (such as Windows, Linux or macOS) or on mobile OSs (such as iOS or Android), the exact details of the example viewers for these OSs are not covered here.

What is MuPDF?

MuPDF is, at heart, a portable C library for opening, manipulating and rendering PDF, XPS and other format files. It is, as far as possible, completely platform agnostic. It assumes just a simple C runtime, with no threading, allocation, or graphics library dependencies.

MuPDF does rely on some third-party components (such as jpeglib, freetype, openjpeg, zlib, etc.), but these are linked in statically by default and do not need to be provided by the underlying system. If, however, the system is already using these libraries, they can be shared, thus saving program memory (referred to henceforth as ROM).

In addition to the core C library, we provide various tools that use the library to perform common tasks.

These utilities are typically thin veneers over functionality provided within the core library - i.e. all these features can be easily accessed by an integrator.

By using well-defined interfaces throughout, MuPDF is designed to be easily extensible: integrators can add new input and output formats, new rendering and extraction devices, custom operations on PDF files, and custom sources (enabling DRM and HTTP-based operations).

Allocation

MuPDF performs all its allocations through a set of standard allocators. The caller can override these to control memory use finely. In addition, MuPDF uses a tunable caching system whereby decoded objects (such as bitmaps or fonts) can be held ready for reuse. If the system runs short of memory, these cached objects are freed automatically, and decoded again only if they are needed later.

This allows the integrator to be in total control of the memory usage of the system, and to get the best possible use out of even the smallest hardware.
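As a rough illustration of the control this gives, the sketch below shows a capped allocator written as callbacks that take an opaque user pointer. The names (capped_heap, capped_malloc, capped_free) are hypothetical and are not part of the MuPDF API; the assumption is only that an integrator supplies functions of this general shape in place of the standard allocators. It uses C11 for max_align_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for a heap with a hard byte budget. */
typedef struct {
    size_t limit; /* maximum bytes that may be held at once */
    size_t used;  /* bytes currently handed out */
} capped_heap;

/* Each block is prefixed with its size so that freeing it can return
 * the bytes to the budget. The union keeps the payload aligned. */
typedef union {
    size_t size;
    max_align_t align;
} block_header;

/* Allocate size bytes, or return NULL if the budget would be exceeded. */
void *capped_malloc(void *user, size_t size)
{
    capped_heap *heap = user;
    block_header *hdr;

    assert(heap != NULL);
    if (size > heap->limit - heap->used)
        return NULL;
    hdr = malloc(sizeof *hdr + size);
    if (hdr == NULL)
        return NULL;
    hdr->size = size;
    heap->used += size;
    return hdr + 1;
}

/* Release a block obtained from capped_malloc; NULL is ignored. */
void capped_free(void *user, void *ptr)
{
    capped_heap *heap = user;
    block_header *hdr;

    if (ptr == NULL)
        return;
    hdr = (block_header *)ptr - 1;
    heap->used -= hdr->size;
    free(hdr);
}
```

With a 100 byte budget, a 60 byte request succeeds, a further 50 byte request is refused, and freeing the first block makes the whole budget available again.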

Threading

While MuPDF itself does not require a threading library, it can make use of one if it is there. This can enable multi-core CPUs to achieve impressive performance, either when rendering a single page in bands, or when rendering several pages together.

Graphics libraries

MuPDF is not tied to any underlying graphics library, but can be integrated at either a high or low level with any system required.

At the simplest possible level, the default drawing device within MuPDF produces simple in-memory pixmaps (either in grayscale, RGB, or CMYK). These trivially map down onto images in almost every graphics library. Monochrome operation can be obtained by an inbuilt post-process filter that will halftone the pixmaps to true bitmaps.

MuPDF also offers integrators a chance to produce higher level output. A key component of MuPDF's operation is its 'device' interface. All graphical operations are broken down into calls to this interface (including "fill image", "plot path", "place text", etc.). The standard MuPDF rendering engine is just one of these devices; others are supplied that include the ability to output PDF and XPS format files, or to extract rich structured textual data from input files. Third parties have exploited this freedom and have implemented their own devices, including one that does rendering using Windows GDI calls.

Integrators who are so inclined could implement their own devices to hand off graphically intensive tasks to their own GPUs or ASICs.
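To make the idea of a device more concrete, here is a minimal sketch of an interface built from a table of callbacks, together with a trivial device that only counts the operations it receives. Everything in it (sketch_device, op_counts, run_sample_page and the three callbacks) is a hypothetical illustration; the real MuPDF device interface has many more operations and different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device: one callback per graphical operation. A NULL
 * callback means the device is not interested in that operation. */
typedef struct sketch_device sketch_device;
struct sketch_device {
    void *user; /* device-private state */
    void (*fill_path)(sketch_device *dev, const char *path);
    void (*fill_text)(sketch_device *dev, const char *text);
    void (*fill_image)(sketch_device *dev, int width, int height);
};

/* Private state for an example device that only counts operations. */
typedef struct {
    int paths, texts, images;
} op_counts;

static void count_path(sketch_device *dev, const char *path)
{
    (void)path;
    ((op_counts *)dev->user)->paths++;
}

static void count_text(sketch_device *dev, const char *text)
{
    (void)text;
    ((op_counts *)dev->user)->texts++;
}

static void count_image(sketch_device *dev, int width, int height)
{
    (void)width;
    (void)height;
    ((op_counts *)dev->user)->images++;
}

/* Fill in the callback table for the counting device. */
void init_counting_device(sketch_device *dev, op_counts *counts)
{
    assert(dev != NULL && counts != NULL);
    dev->user = counts;
    dev->fill_path = count_path;
    dev->fill_text = count_text;
    dev->fill_image = count_image;
}

/* Stand-in for the interpreter: turns a made-up page into device calls. */
void run_sample_page(sketch_device *dev)
{
    if (dev->fill_path)
        dev->fill_path(dev, "M 0 0 L 10 10");
    if (dev->fill_text)
        dev->fill_text(dev, "Hello");
    if (dev->fill_image)
        dev->fill_image(dev, 640, 480);
    if (dev->fill_path)
        dev->fill_path(dev, "M 5 5 L 20 5");
}
```

The point of the design is that the interpreter never needs to know what a device does with each call, so a GPU-backed or ASIC-backed device could be swapped in by supplying a different set of callbacks.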

Interaction

MuPDF is suitable for a range of tasks of varying interactivity.

At the simplest level, it can be used within a printer to just render pages from a PDF with no interaction at all.

At a more complex level, it could be used to first render pages on demand for an LCD panel on a printer to allow a user to select pages for print, and then to do the actual render (at a much higher resolution and/or different color model) for the printer.

At the ultimate level, it could be used to do a full viewer app, including panning and zooming of pages, and the ability to fill in forms with full JavaScript validation.

JavaScript is only required for this ultimate level of interactivity. MuPDF is not tied to any particular JavaScript implementation, and we have simple coupling logic to integrate it with different JavaScript engines. We do, however, recommend the use of MuJS, our own custom JavaScript engine. MuJS was developed out of a dissatisfaction with the alternatives; while many offerings focus on speed, none focus as effectively on portability, completeness and small size as MuJS does.

How large is MuPDF?

The exact size of MuPDF will depend on how it is configured.

Various factors affect the size, notably:

Fonts
    PDF support requires a set of 14 'Base' fonts; these weigh in at around 360K. Adding more fonts increases this size; a typical fallback font with decent Unicode and CJKV coverage can add 4M-5M to this figure.

CMaps
    In order to cope with different language encodings, the PDF standard uses CMap files. An integrator is free to choose what subset of these they would like to support.

Static or system libraries
    If the system already has (for example) zlib, jpeglib or freetype, then sharing the same instance can save on the memory footprint.

Removal of unwanted devices
    If the system does not need to output structured text, or PWG rasters, or do PCL extraction, then simply by not calling the relevant functions the devices will be omitted, saving memory.

As a benchmark, a release build of the 'mudraw' binary on Windows, including a 5.3M fallback font, the standard 14 base fonts, a full set of CMaps (around 4M uncompressed) and support for every output device we have, weighs in at just 9.5M. The Android APK for MuPDF is 5M in size, which includes a 10M mupdf shared library (with a similar configuration of CMaps, fonts, etc.).

How much memory does MuPDF use?

The exact memory use of MuPDF will depend on a wide range of factors, including the size and contents of files that are to be opened, the resolution at which they are to be displayed, and exactly what level of interaction is desired with the file.

The simplest way to drive MuPDF is to open a PDF file, load a page from that file, and render it to a pixmap.

A certain amount of memory is required by MuPDF to hold the details of the 'structure' of the file. This memory will be required as soon as a file is opened, and is resolution and color depth independent.

As pages are accessed from the file, more details of the file's structure are loaded into memory. The bulk of this memory is returned as each page is finished with, but some small amount will remain in use.

During the actual rendering process, we obviously need pixmaps to be rendered onto. These will generally account for by far the largest amount of memory usage. In cases of files which use the transparency features of PDF, multiple layers of pixmaps may be required.
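To give a feel for the numbers involved, the following sketch estimates the size of a single page pixmap. It assumes 8 bits per component, no row padding and page dimensions given in whole points (1/72 inch); the function name pixmap_bytes is hypothetical and the real library's layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated bytes for one page pixmap, assuming 8 bits per component
 * and tightly packed rows. Page dimensions are in points (1/72 inch);
 * pixel dimensions are rounded up to whole pixels. */
size_t pixmap_bytes(unsigned width_pt, unsigned height_pt,
                    unsigned dpi, unsigned components)
{
    size_t w, h;

    assert(dpi > 0 && components > 0);
    w = ((size_t)width_pt * dpi + 71) / 72;
    h = ((size_t)height_pt * dpi + 71) / 72;
    return w * h * components;
}
```

Under these assumptions, a US Letter page (612 by 792 points) needs 1,454,112 bytes as a 72 dpi pixmap with three components, and 33,660,000 bytes at 300 dpi with four components. Each additional transparency layer of the same size adds a similar amount again.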

Finally, as we meet resources throughout the file (including Images and Fonts), we will use more memory to unpack these into. For optimal performance we would like to keep these in their decoded form as they may be used multiple times throughout the document.

All of this can add up to far more memory than an embedded system might easily have to hand, but there are various schemes in place to reduce the actual memory impact.

Rendering in bands

Rather than rendering an entire page bitmap at once, it can frequently make sense to render 'bands' across the page. For inkjet printers in particular this can make sense, as the printer only actually needs a set number of raster lines at once. This reduces not only the main pixmap memory overhead but also, correspondingly, the memory overhead for transparency buffers.
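The arithmetic of splitting a page into bands can be sketched as follows. The band type and both function names are hypothetical helpers for illustration, not MuPDF API; rows are counted in device pixels.

```c
#include <assert.h>

/* Hypothetical band description: rows [y0, y1) of the page. */
typedef struct {
    int y0, y1;
} band;

/* Number of bands needed to cover page_height rows. */
int band_count(int page_height, int band_height)
{
    assert(page_height >= 0 && band_height > 0);
    return (page_height + band_height - 1) / band_height;
}

/* Rows covered by band number index; the last band may be shorter. */
band band_rows(int page_height, int band_height, int index)
{
    band b;

    assert(index >= 0 && index < band_count(page_height, band_height));
    b.y0 = index * band_height;
    b.y1 = b.y0 + band_height;
    if (b.y1 > page_height)
        b.y1 = page_height;
    return b;
}
```

For a page 3300 rows high rendered in bands of 256 rows, this gives 13 bands, the last covering rows 3072 to 3300. If that page is 2550 pixels wide with four 8-bit components, a band buffer needs 2,611,200 bytes rather than the 33,660,000 bytes of a full-page pixmap.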

Rendering using a displaylist

If we are to render in bands, this means that the page contents must be traversed multiple times. We can do this by re-interpreting the page for each band, but this can be slow. A faster way to work is to interpret the page once to a displaylist that can be held in memory. We can then quickly render this out multiple times without having to decompress from the file again (and with the ability to quickly skip over parts of the file that do not touch the band we are using).

Obviously, the displaylist itself takes some memory, but this is generally small compared to the pixmap being rendered into - and we can always fall back to simple reinterpretation if memory is too tight.
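The record-once, replay-per-band idea can be sketched in a few lines. This is a toy model under stated assumptions: each recorded item keeps only an operation identifier and the range of rows it touches, the list has a fixed capacity, and all names (display_list, list_record, list_replay) are hypothetical rather than MuPDF API.

```c
#include <assert.h>
#include <stddef.h>

#define LIST_CAPACITY 64

typedef struct {
    int op;     /* identifier of the recorded drawing operation */
    int y0, y1; /* rows [y0, y1) of the page that it touches */
} list_item;

typedef struct {
    list_item items[LIST_CAPACITY];
    size_t count;
} display_list;

/* Record one operation; returns 1 on success, 0 if the list is full. */
int list_record(display_list *list, int op, int y0, int y1)
{
    assert(list != NULL && y0 <= y1);
    if (list->count == LIST_CAPACITY)
        return 0;
    list->items[list->count].op = op;
    list->items[list->count].y0 = y0;
    list->items[list->count].y1 = y1;
    list->count++;
    return 1;
}

/* Replay the list onto band rows [band_y0, band_y1). Items that do not
 * touch the band are skipped without calling draw. draw may be NULL,
 * in which case touching items are only counted. Returns the number of
 * items that touched the band. */
size_t list_replay(const display_list *list, int band_y0, int band_y1,
                   void (*draw)(const list_item *item, void *user),
                   void *user)
{
    size_t i, replayed = 0;

    for (i = 0; i < list->count; i++) {
        const list_item *item = &list->items[i];
        if (item->y0 >= band_y1 || item->y1 <= band_y0)
            continue; /* entirely outside this band */
        if (draw)
            draw(item, user);
        replayed++;
    }
    return replayed;
}
```

The cheap bounds test is what makes replay fast: work is only done for the items that can affect the band being rendered.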

Image subsampling

Images within PDF are frequently sent at very high quality - often far higher quality than is required for rendering on inkjets (especially in draft mode). MuPDF includes the ability to decode images at a subsampled level. This level of subsampling is chosen so that there is no loss in final quality, but this can save large amounts of memory.
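One plausible way to choose such a level is sketched below: keep halving the image while the result is still at least as large as the area it will be drawn into. This is an assumed rule for illustration, with a hypothetical function name; the exact rule MuPDF applies is not described here.

```c
#include <assert.h>

/* Returns log2 of the subsample factor: 0 means decode at full size,
 * 1 means every second pixel in each direction, and so on. The
 * subsampled image is never smaller than the destination, so no
 * detail that could appear in the output is thrown away. */
int subsample_log2(int src_w, int src_h, int dst_w, int dst_h)
{
    int l2 = 0;

    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    while ((src_w >> (l2 + 1)) >= dst_w && (src_h >> (l2 + 1)) >= dst_h)
        l2++;
    return l2;
}
```

For example, a 4000 by 3000 image drawn into a 500 by 375 pixel area gives a result of 3, that is, a factor of 8 in each direction and 64 times fewer decoded pixels to hold in memory.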

Resource caching

MuPDF will hold resources that it encounters in a "store" (cache) in the hopes they will be useful again. This can drastically speed rendering in many cases, but does increase the memory usage. The library takes several steps to avoid this memory use becoming a problem.

Firstly, the integrator can specify a limit to the size of the store. Secondly, the store is integrated with the allocators that MuPDF calls. If an allocation ever fails, MuPDF will automatically free elements from the store and then retry. It will only ever fail an allocation if there are no more objects that can be freed, and the allocator still fails.

This means that in many systems, the store can be set to be of unlimited size, and as much as possible will be cached, with the system recovering memory 'just in time' for its rendering needs.
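The retry behaviour described above can be modelled with a small simulation. The sketch assumes a heap with a fixed byte budget and a store kept as a simple linked list; all names (heap_sim, store_put, alloc_with_scavenge, release) are hypothetical, and the eviction order is deliberately naive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One cached (evictable) object held in the store. */
typedef struct cached {
    struct cached *next;
    void *data;
    size_t size;
} cached;

/* Simulated heap with a byte budget, plus the store of cached objects.
 * List nodes themselves are not counted against the budget. */
typedef struct {
    size_t limit;
    size_t used;
    cached *store;
} heap_sim;

/* Plain allocation: fails when the budget would be exceeded. */
static void *raw_alloc(heap_sim *h, size_t size)
{
    void *p;

    if (size > h->limit - h->used)
        return NULL;
    p = malloc(size ? size : 1);
    if (p)
        h->used += size;
    return p;
}

/* Drop the most recently cached object; returns 0 if the store is empty.
 * A real store would choose its victim more carefully. */
static int evict_one(heap_sim *h)
{
    cached *c = h->store;

    if (c == NULL)
        return 0;
    h->store = c->next;
    h->used -= c->size;
    free(c->data);
    free(c);
    return 1;
}

/* Allocate, evicting from the store and retrying on failure. Fails only
 * when nothing is left to evict and the allocation still does not fit. */
void *alloc_with_scavenge(heap_sim *h, size_t size)
{
    void *p;

    while ((p = raw_alloc(h, size)) == NULL)
        if (!evict_one(h))
            return NULL;
    return p;
}

/* Cache an object of the given size; returns 1 on success. */
int store_put(heap_sim *h, size_t size)
{
    cached *c = malloc(sizeof *c);

    if (c == NULL)
        return 0;
    c->data = alloc_with_scavenge(h, size);
    if (c->data == NULL) {
        free(c);
        return 0;
    }
    c->size = size;
    c->next = h->store;
    h->store = c;
    return 1;
}

/* Return a non-cached allocation of the given size to the budget. */
void release(heap_sim *h, void *p, size_t size)
{
    h->used -= size;
    free(p);
}
```

With a 100 byte budget and two 40 byte cached objects, a 50 byte request succeeds after one eviction; a later 60 byte request evicts the remaining object and then fails, because nothing else can be freed.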

Example memory use figures

As a torture test, we use a 1300+ page document, the "PDF Language Reference Manual v1.7".

The peak usage for opening the document, and loading and rendering a variable number of pages, at different resolutions, with and without the displaylist is as follows:

Display list   Resolution (dpi)   Pages    Peak memory use (MB)
No             1                  1        13.2
No             1                  1-100    30
No             1                  1-1310   72.5
No             72                 1        13.9
No             72                 1-100    29.5
No             72                 1-1310   68.1
Yes            1                  1        13.1
Yes            1                  1-100    31.2
Yes            1                  1-1310   80.3
Yes            72                 1        13.8
Yes            72                 1-100    30.9
Yes            72                 1-1310   75.7

Color Management

Currently, MuPDF does not make any use of Color Management, but the hooks are broadly in place for us to add this later. This is not currently a priority, but this could change with suitable customer interest.

CPU specific optimisations

While MuPDF is written in portable C, certain speed-critical sections (such as bitmap scaling and color conversion) have been optimised in ARM assembler. Similar assembly coding of these small hotspots for other CPU architectures is a potential optimisation that may be driven by customer interest.

Halftoning

In order to support 1bpc output, we provide routines to post process the contone (8bpc) output. These take a threshold array per component and produce a bitmap output. There is potential for hardware acceleration or CPU specific optimisation here.
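As an illustration of thresholding against an array, the sketch below halftones one row of a single 8-bit component into packed 1-bit output. The threshold_tile type and halftone_row function are hypothetical; the bit order (most significant bit first) and the polarity (a bit is set where the sample is at or above the threshold) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tiled threshold array for one colour component. */
typedef struct {
    int width, height;           /* tile dimensions in pixels */
    const unsigned char *levels; /* width * height thresholds, row-major */
} threshold_tile;

/* Halftone row y of an image: n 8-bit samples in, packed bits out.
 * The tile repeats in both directions. out must hold (n + 7) / 8 bytes. */
void halftone_row(const threshold_tile *tile, int y,
                  const unsigned char *in, size_t n, unsigned char *out)
{
    const unsigned char *row;
    size_t x;

    assert(tile->width > 0 && tile->height > 0 && y >= 0);
    row = tile->levels + (size_t)(y % tile->height) * (size_t)tile->width;
    memset(out, 0, (n + 7) / 8);
    for (x = 0; x < n; x++)
        if (in[x] >= row[x % (size_t)tile->width])
            out[x / 8] |= (unsigned char)(0x80 >> (x % 8));
}
```

The inner loop is a simple compare and pack per pixel, which is why it is a natural candidate for hardware acceleration or SIMD optimisation.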

Custom sources

MuPDF reads files through a defined public API (fz_stream). The most typical implementations of this simply read from a file on backing store, or from a block of memory.

Because this is a public API, other implementations can also be provided. We have one customer that uses this to fetch and decrypt data so as to provide their own custom DRM mechanism. Alternatives might include secure communication with cloud filesystems, or custom streaming protocols.
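The layering this allows can be sketched with a generic read callback. The byte_source type below is a hypothetical stand-in and not the fz_stream API itself; a single-byte XOR is used purely as a placeholder for a real cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source: a read callback plus private state. */
typedef struct byte_source byte_source;
struct byte_source {
    void *state;
    /* Copy up to len bytes into buf; returns bytes copied, 0 at end. */
    size_t (*read)(byte_source *src, unsigned char *buf, size_t len);
};

/* A source that reads from a block of memory. */
typedef struct {
    const unsigned char *data;
    size_t size, pos;
} memory_state;

static size_t memory_read(byte_source *src, unsigned char *buf, size_t len)
{
    memory_state *m = src->state;
    size_t n = m->size - m->pos;

    if (n > len)
        n = len;
    if (n > 0)
        memcpy(buf, m->data + m->pos, n);
    m->pos += n;
    return n;
}

byte_source memory_source(memory_state *m, const unsigned char *data,
                          size_t size)
{
    byte_source src;

    m->data = data;
    m->size = size;
    m->pos = 0;
    src.state = m;
    src.read = memory_read;
    return src;
}

/* A source that wraps another one and "decrypts" what it reads. */
typedef struct {
    byte_source *inner;
    unsigned char key;
} xor_state;

static size_t xor_read(byte_source *src, unsigned char *buf, size_t len)
{
    xor_state *x = src->state;
    size_t i, n = x->inner->read(x->inner, buf, len);

    for (i = 0; i < n; i++)
        buf[i] ^= x->key;
    return n;
}

byte_source xor_source(xor_state *x, byte_source *inner, unsigned char key)
{
    byte_source src;

    assert(inner != NULL);
    x->inner = inner;
    x->key = key;
    src.state = x;
    src.read = xor_read;
    return src;
}
```

Because the consumer only ever calls read, it cannot tell whether the bytes came straight from memory, from a decrypting wrapper, or from some other transport layered in the same way.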

This functionality is key to allowing efficient operation with files served over HTTP.

Progressive display

MuPDF has the ability to display pages 'progressively' as data arrives over a slow link (such as a connection to an HTTP web server). Feedback from the engine can also be used to tune which parts of the file are sent in what order (using HTTP byte range requests), thus minimising the time to display.

For printers with a preview ability, this functionality opens up the possibility of displaying files before they are entirely sent to the printer.

For low end printers, an extension to this idea would allow pages from a PDF to be printed even when the PDF itself is too large to fit in the memory of the printer.

Artifex: MuPDF/Embedding (last edited 2016-05-25 09:53:47 by TorAndersson)