First8 stands for craftsmanship. All our colleagues are big supporters of open source, and of the Java platform in particular. We specialize in pragmatically developing business-critical Java applications in which system integration, strict security requirements and high transaction volumes play an important role. On this page you will find our blogs.

Pagination: a bad solution to a problem you should not have

Every developer has encountered (or will encounter) the phenomenon called pagination. In this blog post I want to challenge you, as a developer or product owner, to reconsider using that pattern. Simply put: it is a bad solution to a problem that you shouldn’t have. (Of course, since this is a blog post, there may be some nuances left out. Take this rant with a grain of salt :))

The problem that pagination tries to solve is that showing a large result set to a user, or delivering it to a client via an API, takes a long time. Not only do you (or rather, your software) need to gather a lot of data, that same data also needs to be transported via the internet to your client. The larger the set, the slower your webpage or API becomes. And that leads to frustration, timeouts and other things you’d like to avoid.

Pagination is the idea that you simply chunk the result set into “pages”, and only deliver the records on the selected page. Typically you start on page 0 (or 1 if you are talking to a human), select the top n records (where n is your page size) and return those. If the client wants more, you select the next n records. In a user interface you will typically see buttons with previous/next labels and some page numbers. In a (REST JSON) API, you will see something similar in the form of links in the body. See e.g. the JSON-LD or JSON Hyper-Schema conventions for those.
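
To make the pattern concrete, here is a minimal sketch of page-based chunking over an in-memory list. The `paginate` function, the `/records` URL and the `page`/`size` query parameters are made up for illustration; a real API would follow whatever link convention it has chosen.

```python
from typing import Any


def paginate(records: list[Any], page: int, page_size: int) -> dict[str, Any]:
    """Return one zero-based page of `records` plus prev/next links."""
    start = page * page_size
    items = records[start:start + page_size]

    # Hypothetical link format, only meant to mimic "links in the body".
    links = {"self": f"/records?page={page}&size={page_size}"}
    if page > 0:
        links["prev"] = f"/records?page={page - 1}&size={page_size}"
    if start + page_size < len(records):
        links["next"] = f"/records?page={page + 1}&size={page_size}"
    return {"items": items, "links": links}


records = list(range(1, 11))  # ten records, already in their final order
first = paginate(records, page=0, page_size=4)
last = paginate(records, page=2, page_size=4)
```

The first page carries a `next` link but no `prev` link; the last page is shorter than the page size and has no `next` link.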

Problems

Pagination has a few problems. Before we dive into this, take note that if you want to chunk a result set, it is generally assumed that there is some order in those records. Commonly, the most recent records or the most relevant are shown first.

First of all, there is the problem of stability: typically, you query your database with a start record and some page size. For the first page, you simply take the top N records and return those. After a while the client decides it needs the second page. But if new records have appeared in the meantime, they could pop up on the first page and push older records down. So what do you do now if the client asks for the second page? Do you pop in those new records? Do you generate the second page results based on the new data? Whatever you choose, the client gets a weird ordering or runs the risk of missing or duplicate records. One way of solving it is introducing a cursor that keeps track of the original result set. But how long do you keep that cursor? And are you willing to spend the resources to build that cursor solution and run it? Remember, we are doing pagination to optimise performance, not to reduce it.
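
The stability problem is easy to reproduce. The sketch below assumes records are plain integers shown newest (largest) first; `fetch_page` is a hypothetical stand-in for an offset-based database query.

```python
def fetch_page(records: list[int], page: int, page_size: int) -> list[int]:
    """Offset-based page over `records`, ordered newest (largest) first."""
    ordered = sorted(records, reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]


# Case 1: a new record arrives between two page requests.
records = [10, 20, 30, 40, 50, 60]
page0 = fetch_page(records, 0, 3)        # the three newest records
records.append(70)                       # a newer record is inserted
page1 = fetch_page(records, 1, 3)        # the offset now points one record "earlier"
duplicates = set(page0) & set(page1)     # a record the client sees twice

# Case 2: a record is deleted between two page requests.
records2 = [10, 20, 30, 40, 50, 60]
page0_b = fetch_page(records2, 0, 3)
records2.remove(60)                      # a record from page 0 disappears
page1_b = fetch_page(records2, 1, 3)
missing = set(records2) - set(page0_b) - set(page1_b)  # never shown to the client
```

With an insert, record 40 shows up on both pages; with a delete, record 30 is skipped entirely even though the client walked through every page.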

Secondly, pagination is fine for reducing traffic, but actually the implementation of pagination is not that light-weight in a distributed system. You’ll encounter something called the top-n problem. To summarize it quickly: assume there are multiple nodes or databases that each maintain a part of your data set, and you want to build pagination by querying across all of them, ordering the result and returning the top N records. You have to ask all of the M nodes for their local top N records, merge those N x M records and return the global top N, throwing away the rest. You need to do that because you don’t know whether one node happens to contain most of the global top N, so you have to assume any one of them could contain all of it. This is sort of fine for the first page, but it gets worse for later pages. Then, aside from some minor optimizations, you would have to query each node for N x PageNumber results. The total traffic quickly adds up to N x M x PageNumber records, just to deliver N records. There are more efficient solutions, but those are harder to generalize.
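
The naive scatter-gather approach described above can be sketched as follows. This is an illustration under simplifying assumptions, not a production design: each node is modelled as a list that is already sorted in descending order, `global_page` is a made-up name, and page numbers are one-based so that the traffic matches the N x M x PageNumber formula.

```python
import heapq
from itertools import islice


def global_page(nodes: list[list[int]], page_number: int, page_size: int):
    """Return (items, transferred) for a one-based page across all nodes.

    Every node must ship its local top (page_size * page_number) records,
    because any single node could hold the entire global page.
    """
    needed = page_size * page_number
    partials = [node[:needed] for node in nodes]      # what each node sends over the wire
    transferred = sum(len(part) for part in partials)

    # Merge the descending partial lists and keep only the requested window.
    merged = heapq.merge(*partials, reverse=True)
    start = page_size * (page_number - 1)
    items = list(islice(merged, start, start + page_size))
    return items, transferred


nodes = [
    [99, 98, 97, 96, 95, 94],   # this node happens to hold most of the global top
    [50, 40, 30, 20, 10, 5],
    [93, 60, 55, 45, 35, 25],
]
page1, traffic1 = global_page(nodes, page_number=1, page_size=2)
page3, traffic3 = global_page(nodes, page_number=3, page_size=2)
```

For page 1, six records cross the network to deliver two. For page 3, eighteen records are transferred and sixteen of them are thrown away.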

Thirdly, a really nice and friendly pagination solution shows you how many pages (or records) there are. To compute that total you have to count every matching record, so depending on your data structures you may already have touched a large part of the data you would need to return the entire data set. This is why some solutions do not show you the total number of records and skip the ‘last’ button.
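
One common way to drop the total (and the ‘last’ button) while still knowing whether a ‘next’ button makes sense is to fetch one record more than the page size. The sketch below is just one option and assumes an in-memory list; `page_without_total` is a hypothetical helper name.

```python
def page_without_total(records: list[int], page: int, page_size: int):
    """Return (items, has_next) for a zero-based page without counting all records."""
    start = page * page_size
    # Ask for one extra record: if it exists, there is at least one more page.
    window = records[start:start + page_size + 1]
    has_next = len(window) > page_size
    return window[:page_size], has_next


records = list(range(10))
items0, more0 = page_without_total(records, page=0, page_size=4)
items2, more2 = page_without_total(records, page=2, page_size=4)
```

The query only ever touches `page_size + 1` records past the offset, so the cost no longer depends on the size of the full result set, at the price of not being able to show a page count.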

Alternative

So, what is the alternative? That is where I would like to challenge you: why on earth do you have a use case that requires a transport of so many records? If you are building an application that requires a user to browse through dozens of pages, then your UX (user experience) design is wrong. Add some proper defaults, filtering and sorting, or rethink the entire use of your page. Maybe your user is exploring your dataset: offer him facets that give him insight into how to zoom in on the dataset, instead of dumbing the interaction down to a next page-next page-next page experience.
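
As a rough illustration of what facets could look like, the sketch below counts records per value of a field, so the user can see where the data sits before zooming in. The `orders` data, its field names and the `facet_counts` helper are invented for this example.

```python
from collections import Counter


def facet_counts(records: list[dict], field: str) -> Counter:
    """Count how many records fall under each value of `field`."""
    return Counter(record[field] for record in records)


# Hypothetical dataset the user is exploring.
orders = [
    {"status": "open", "country": "NL"},
    {"status": "open", "country": "DE"},
    {"status": "shipped", "country": "NL"},
    {"status": "open", "country": "NL"},
]

by_status = facet_counts(orders, "status")
by_country = facet_counts(orders, "country")

# The user picks two facet values and lands on a small, relevant result set.
narrowed = [o for o in orders if o["status"] == "open" and o["country"] == "NL"]
```

Showing the counts next to each facet value tells the user up front how much each filter will narrow the set, which is information a row of page numbers never gives.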

If you are building an API which requires these huge datasets to be transported, why not prepare the result as a batch, up front or on demand, and deliver it as a single transport? Or simply make sure that you can handle reasonably large queries. It is 2020 after all.

Pagination is lazy interaction design.