Craig Jurney, Chief Solutions Architect
In this series of posts, we share insight into the advanced and comprehensive functionality that underpins our end-to-end scholarly publishing platform.
Not only do we have innovative solutions; we also build them on the best technology available to the industry.
In this month’s post, we explore Apache Kafka®, a distributed streaming platform which we’ve built into the HighWire platform over the past four years.
Kafka has supercharged our hosting capability, helping our publishers speed up publication and discovery. Making new articles available faster – and being able to instantaneously update existing content – means readers can access the latest research sooner. This not only provides a better service to the academic community, but helps boost publishers’ commercial performance.
Read on to learn how we adopted a technology pioneered by social networks, to make this a reality.
What was the problem we decided to solve?
Take this example: a publisher knows in human terms that their next issue is due to be published next Thursday at 8:00am. But in machine terms, it’s an asynchronous event – digital content is constantly coming in.
Historically, when content came into the HighWire platform we would run script A, then script B, then script C and so on. There was a linear chain with a long list of actions taking place in a prescribed order.
But publication workflows are no longer linear. We needed a solution that could handle the complexity of multiple use-cases, building in flexibility for different publisher needs, in an automated and streamlined manner. The system needed to react as quickly as possible to inbound content from a variety of sources and do multiple things at once. To solve this problem, we decided to integrate Kafka as an event-driven message bus.
What’s a message bus?
Put simply, a message bus allows you to peel off and run other activities in parallel. This means actions can happen faster.
When an action takes place – for example, action A takes place following action B – we publish an event to the message bus, which is a conduit for a flow of events. The actions of many different agents can be triggered by a particular event or piece of content. These agents take their action on a piece of content and state what they’ve done: “I ingested” or “I published” etc., so other parts of the platform can immediately read and react to that action in real time.
Take, for example, an ever-growing chain of actions. With Kafka, action 27 can happen directly after action 5, even though a linear chain would place 21 other actions between them. This decoupled approach would not have been possible in the past.
Publishers want content on site as soon as possible – we can do it sooner because of this robust and real-time technology.
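The fan-out described above can be sketched in a few lines of Python. This is a toy in-memory stand-in, not the Kafka API and not HighWire's implementation: the `MessageBus` class, the topic names, the agent functions and the DOI are all hypothetical, and handlers here run one after another, whereas real consumers would run in parallel.

```python
from collections import defaultdict
from typing import Callable


class MessageBus:
    """Toy in-memory event bus (illustrative only, not the Kafka API)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[tuple[str, dict]] = []  # every event, in arrival order

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.log.append((topic, event))
        # Every agent subscribed to this topic is triggered by the same event.
        for handler in list(self._subscribers[topic]):
            handler(event)


bus = MessageBus()


# Each hypothetical agent acts on the content, then states what it has done.
def publish_to_site(event: dict) -> None:
    bus.publish("article.published", {"doi": event["doi"], "by": "site-publisher"})


def index_for_search(event: dict) -> None:
    bus.publish("article.indexed", {"doi": event["doi"], "by": "search-indexer"})


def send_alert(event: dict) -> None:
    bus.publish("alert.sent", {"doi": event["doi"], "by": "alerting"})


bus.subscribe("article.ingested", publish_to_site)
bus.subscribe("article.ingested", index_for_search)
bus.subscribe("article.published", send_alert)

# One inbound event triggers two agents; one of their reports triggers a third.
bus.publish("article.ingested", {"doi": "10.1234/example.1", "by": "ingest"})
topics_seen = [topic for topic, _ in bus.log]
```

Note that the ingest step knows nothing about the agents that react to it. Adding or removing a subscriber changes what happens next without touching the code that published the event.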
What’s the background to Kafka?
The Apache Software Foundation describes Kafka as a distributed streaming platform. It is an asynchronous messaging technology that allows developers to build distributed applications.
It was pioneered by LinkedIn, who donated it to Apache for open-source use. Very high-volume event streaming of this kind will be familiar to users of social networks like LinkedIn and Facebook. These platforms process billions of events a day. For example, somebody just ‘liked’ or ‘shared’ this post – this information needs to be available to, and acted upon by, multiple systems, both internal and external.
A ‘like’ is an event: this page, post or image was liked. That event flows into the platform’s message bus, and tools such as aggregators look at it simultaneously. The ‘like’ is then shared or displayed in multiple places – in your home feed, on the user’s profile page, in the back-end analytics of the piece of content that was ‘liked’. This is what fuels ‘trending topics’, ‘events you might be interested in’ and other methods of ensuring we see the content that matters to us in real time.
Key business benefits for publishers
- Real-time dissemination of and reaction to your content.
- A message log that shows the ‘in-flight’ status of the platform, with point-in-time snapshots of your publication’s status, for example “how many articles do we have?” This can enable faster decision-making and greater insight to help you determine when you need to publish an issue.
- Publishers with large volumes of content can see what is going live, and when, by plugging into Kafka. Using our API, you can build a dashboard that tracks all activity and gives you a consolidated view of all inbound content. This live status is your “canary in the coal mine”, giving you the ability to investigate immediately when you need to.
- ‘Platform-as-a-Service’ (PaaS). When publishers use our platform and we host their content, they’re also using our PaaS distribution and aggregation service via our API. Commercial advantages include:
  - Improved content arrangements with third parties. For example, medical journals and pharmaceutical companies will often bundle up collections of articles and send these across to third-party business partners. With Kafka, this process is both more automated and more proactive. You can create an action (“As soon as the article is up, send it”) so that it’s distributed to third parties at the same time as it goes live on your own site. This brings competitive advantage as you can get new content to your business partner rapidly and reliably.
  - Better content reselling relationships. The same applies – you can enable resellers’ members to browse and buy at the same time the content is published to your own site.
- Quick responses to future events – for example, a typo correction in an author name. Within Kafka, an event is created showing that there’s been an update. In turn, this update is immediately propagated to all uses of that content.
- A detailed audit trail, enabling you to track where workflows go awry or manage compliance.
- New capabilities that can be plugged in as they’re developed.
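Several of the benefits above (status snapshots, audit trails, reacting to corrections) come down to reading an ordered log of events. A minimal sketch, assuming a simplified record shape: the `offset`, `article` and `action` fields and the sample events are invented for illustration and are not HighWire's actual schema.

```python
from collections import Counter

# Hypothetical event records; field names and values are illustrative only.
events = [
    {"offset": 0, "article": "a1", "action": "ingested"},
    {"offset": 1, "article": "a2", "action": "ingested"},
    {"offset": 2, "article": "a1", "action": "published"},
    {"offset": 3, "article": "a3", "action": "ingested"},
    {"offset": 4, "article": "a1", "action": "corrected"},
]


def snapshot(log, up_to_offset):
    """Latest known action per article, using only events up to an offset."""
    latest = {}
    for event in log:
        if event["offset"] > up_to_offset:
            break
        latest[event["article"]] = event["action"]
    return latest


def audit_trail(log, article):
    """Every action recorded for one article, in the order it happened."""
    return [e["action"] for e in log if e["article"] == article]


now = snapshot(events, up_to_offset=4)      # status after the latest event
earlier = snapshot(events, up_to_offset=1)  # status at an earlier point in time
status_counts = Counter(now.values())       # e.g. how many articles per state
trail = audit_trail(events, "a1")
```

Because the log is never rewritten, the same data answers both "what is the status now?" and "what was the status then?", which is the property a dashboard or compliance report would rely on.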
How are we evolving our use of Kafka?
We continue to develop new functionality and plug it into the message bus.
For example, we slotted an entirely new service – our taxonomy solution, as used by McGraw-Hill – into the existing event-stream powered by Kafka. This means when new content is added we can immediately pull it out and apply a taxonomy. We didn’t need to go back and retool any of the existing content-ingest tool-chain to do this, as with Kafka we can plug in a whole new function without having to revisit and disturb the existing function. This ability helps prevent faults and interruptions to services.
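One way to picture why a new service can be added without disturbing existing ones is an append-only log in which each consumer keeps its own read position. The sketch below is a toy model under that assumption; the `EventLog` class and the consumer names are hypothetical and do not represent HighWire's code or the Kafka client API.

```python
class EventLog:
    """Append-only log with one read position per named consumer (toy model)."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> index of the next unread event

    def append(self, event):
        self._events.append(event)

    def poll(self, consumer):
        """Return the events this consumer has not yet seen and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


log = EventLog()
log.append({"article": "a1", "action": "ingested"})
log.append({"article": "a2", "action": "ingested"})

# An existing consumer reads what is there so far.
first_batch = log.poll("site-publisher")

log.append({"article": "a3", "action": "ingested"})

# A brand-new consumer is added later. Nothing upstream changes, and it can
# still work through the events that arrived before it existed.
tagged = [f"{e['article']}: taxonomy applied" for e in log.poll("taxonomy")]

# The existing consumer is unaffected: it only sees the one new event.
second_batch = log.poll("site-publisher")
```

In this toy model a new consumer always starts from the beginning of the log. In Kafka itself, where a new consumer group starts reading is governed by the `auto.offset.reset` setting.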
Another example scenario is updating images, such as converting .tif files to .jpeg for use elsewhere. You can write a process that watches for the relevant upstream event, performs the conversion, and then reports downstream that it was done. This allows the conversion process to be decoupled from the content ingestion process itself.
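The image scenario might look something like the following sketch. The event shape, the action names and the file paths are assumptions made for illustration, and no real image conversion takes place: the agent only shows the pattern of watching for an upstream event and reporting a downstream one.

```python
from pathlib import PurePosixPath


def conversion_agent(event):
    """React to an upstream image event; return a downstream event, or None."""
    if event.get("action") != "image.ingested":
        return None  # not an event this agent cares about
    source = PurePosixPath(event["path"])
    if source.suffix.lower() not in {".tif", ".tiff"}:
        return None  # only TIFF images need converting
    target = source.with_suffix(".jpeg")
    # A real agent would convert the file here; this sketch only reports the result.
    return {"action": "image.converted", "source": str(source), "path": str(target)}


# Hypothetical upstream events produced by content ingestion.
upstream = [
    {"action": "image.ingested", "path": "content/a1/fig1.tif"},
    {"action": "image.ingested", "path": "content/a1/fig2.png"},
    {"action": "article.ingested", "path": "content/a1/a1.xml"},
]

# The agent ignores everything except the events it was written for.
downstream = [out for e in upstream if (out := conversion_agent(e)) is not None]
```

The ingestion step never calls the converter directly. It simply reports what arrived, and the converter decides for itself which events are relevant.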
It’s all about being faster, automated and trackable.
- Find out more about Kafka: https://kafka.apache.org/intro
- Learn about HighWire Hosting and get in touch with one of our experts: https://www.highwirepress.com/solutions/highwire-hosting/
- Check out our other recent product spotlights on Microsoft Academic Graph and email alerting
Article image credit: Kafka.apache.org