Mobile payment has achieved great success in China, and a transaction can be done within seconds by scanning a QR code. While it brings convenience to our daily lives, mobile payment also brings huge challenges to the risk control infrastructure. I gave a talk at O’Reilly Strata Data Conference New York 2019 and shared how we leveraged Apache Pulsar to boost the efficiency of risk indicator development within Orange Financial.
Orange Financial (China Telecom Bestpay Co., Ltd.) is an affiliate of China Telecom. Established in March 2011, Orange Financial received the “Payment Business License” issued by the People’s Bank of China. Its subsidiaries include Bestpay, Orange Wealth, Orange Insurance, Orange Credit, Orange Financial Cloud, and others. Bestpay has become the third-largest payment provider in China, closely following Alipay and WeChat Payment.
With 500 million registered users and 41.9 million active users, the transaction volume of Orange Financial reached 1.13 trillion CNY ($18.37 billion) in 2018.
China is currently the largest mobile payment market, and it is still growing each year. According to data from a research institute, the number of mobile payment users was 462 million in 2016 and reached 733 million in 2019, and it is expected to grow further by 2020. The total volume of mobile payment transactions was 22 trillion USD in 2016 and has reached 45 trillion USD this year.
In China, the number of economic activities carried out through mobile payment is surging, and people are less likely to use cash or credit cards than before. The high industry penetration rate of mobile payment in China shows how closely it is tied to daily life. You can do almost everything with a QR code on your smartphone: order food, take a taxi or the metro, rent a bike, buy coffee, and so on.
Mobile payment has achieved great success in China because it is convenient and fast. A transaction can be done within seconds by scanning a QR code. This speed has accelerated the adoption of mobile payments in e-commerce, financial services, transport, retail, and other businesses.
With greater ease of use come greater threats. While mobile payment brings convenience to our daily lives, it also brings huge challenges to the risk control infrastructure. An instant transaction involves thousands of rules running against the transaction to prevent potential financial fraud. RSA reports that the top fraud types include phishing, rogue applications, Trojan attacks, and brand abuse. Financial threats in the mobile payment era go beyond these and include account or identity theft, merchant fraud, money laundering, and so on. According to a survey from China UnionPay, 60% of the 105,000 interviewees reported that they had encountered mobile-payment security threats.
We have a robust risk management system that helps us detect and prevent these attacks, and we have been doing quite well in protecting the assets of our customers in recent years, yet we still face many challenges.
The core of a risk management system is the decision. A decision is a combination of one or more indicators, such as the geographic coordinates of a user’s logins or the transaction volume of a retailer. For example, if the geographic coordinates of a user’s recent logins are always exactly the same, that is suspicious: the account is probably a bot or a simulator. Likewise, if the transaction volume of a fruit stall is normally around $300 a day and suddenly rises to $3000, our alert is triggered.
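To make the idea of indicators feeding a decision concrete, here is a minimal sketch in Python. It is not our production rule engine: the function names, the spike factor of 5, the minimum of 5 logins, and the "any indicator triggers an alert" policy are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def volume_spike(history: List[float], today: float, factor: float = 5.0) -> bool:
    """Indicator: today's volume exceeds `factor` times the historical daily mean.

    `factor` is a hypothetical threshold, not a value from the article.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(history) / len(history)
    return today > factor * baseline


def repeated_login_location(coords: List[Tuple[float, float]], min_logins: int = 5) -> bool:
    """Indicator: all recent logins report exactly the same coordinates."""
    return len(coords) >= min_logins and len(set(coords)) == 1


def decide(indicators: List[bool]) -> str:
    """Decision: combine indicators; here any triggered indicator raises an alert."""
    return "ALERT" if any(indicators) else "PASS"


# A fruit stall that normally sells about $300 a day suddenly reports $3000.
stall_history = [300.0] * 29
print(decide([volume_spike(stall_history, 3000.0)]))
```

A real decision would weigh many more indicators, but the shape is the same: each indicator is a small computation over historical and current data, and the decision combines them.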
Developing risk control indicators requires both historical data and real-time data. For example, to compute the total transaction volume of a merchant over the past month (30 days), we first calculate the volume of the previous 29 days in batch mode, and then add the value returned by a streaming task over the data collected on the current day since 12 am.
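The 30-day example can be sketched as follows. This is a simplified illustration in plain Python rather than the actual batch or streaming jobs; the function names and the in-memory data structures are assumptions.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, Tuple


def batch_volume(daily_totals: Dict[date, float], today: date, days: int = 29) -> float:
    """Batch part: sum precomputed daily totals for the `days` full days before today."""
    return sum(daily_totals.get(today - timedelta(days=i), 0.0) for i in range(1, days + 1))


def streaming_volume(events: Iterable[Tuple[datetime, float]], today: date) -> float:
    """Streaming part: sum the amounts of events seen today, i.e. since 12 am."""
    midnight = datetime.combine(today, datetime.min.time())
    return sum(amount for ts, amount in events if ts >= midnight)


def volume_30d(daily_totals: Dict[date, float],
               events: Iterable[Tuple[datetime, float]],
               today: date) -> float:
    """The 30-day indicator is the batch result plus the streaming result."""
    return batch_volume(daily_totals, today) + streaming_volume(events, today)
```

Note that the two halves read from different places (daily totals versus raw events), which is exactly the split that a Lambda-style setup forces on indicator developers.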
Most internet companies deploy a Lambda Architecture to solve similar challenges, since it is effective and keeps a good balance of speed and reliability. Previously, we also adopted the Lambda Architecture, which has three layers: the batch layer, the streaming layer, and the serving layer. The batch layer handles historical data computation, with data stored in Hive and Spark as the dominant batch computation engine. The streaming layer handles real-time computation, with Flink as the computing engine consuming data persisted in Kafka. The serving layer retrieves the final result for serving.
However, in our practice the Lambda Architecture turned out to be complex and hard to maintain. First, we have to split our business logic into many segments, which is difficult to maintain and increases communication overhead. Second, the data is duplicated in two different systems, and we have to move data between systems for processing. As the business grows, our data processing stack becomes very complex: we have to maintain three different software stacks (batch layer, streaming layer, and serving layer). This means maintaining multiple clusters (Kafka, Hive, Spark, Flink, and HBase) as well as a diverse engineering team with different skill sets. As a result, our current data processing stack has become very expensive to maintain.
We had been seeking alternatives when we came across Apache Pulsar. With Apache Pulsar, we made a bold attempt to refactor our data processing stack. Our goal is to simplify the stack, improve production efficiency, reduce cost, and accelerate decision-making in our risk management system.
To deal with those challenges, simplify business processes, and keep good control of financial risks, we began to investigate Apache Pulsar.
Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo! and now part of the Apache Software Foundation. After our investigation, we summarized the reasons why Pulsar is the best fit for our business.
Cloud-native architecture and segment-centric storage
Apache Pulsar adopts a layered architecture and segment-based storage (with Apache BookKeeper). An Apache Pulsar cluster is composed of two layers: a stateless serving layer, made up of a set of brokers that receive and deliver messages, and a stateful persistence layer, made up of a set of Apache BookKeeper storage nodes called bookies that store messages durably. This design gives Apache Pulsar high availability, strong consistency, and low latency.
Pulsar stores messages by topic partition. Each topic partition is assigned to one of the live brokers in the cluster, which is called the owner broker of that topic partition. The owner broker serves message reads from the partition and message writes to it. If a broker fails, Pulsar automatically moves the topic partitions it owned to the remaining available brokers in the cluster. Since brokers are “stateless”, Pulsar only transfers ownership from one broker to another during node failure or broker cluster expansion; no data is copied during this time.
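The ownership transfer described above can be modeled with a small toy in Python. This is not Pulsar's load manager: the `Cluster` class and the round-robin reassignment are assumptions for illustration. The point it shows is that failover only rewrites the ownership map, while the stored messages stay where they are.

```python
from typing import Dict, List


class Cluster:
    """Toy model: brokers own partitions, message data lives in a separate storage layer."""

    def __init__(self, brokers: List[str], partitions: List[str]) -> None:
        self.brokers = list(brokers)
        # Initial round-robin assignment of partitions to brokers (an assumption).
        self.owner: Dict[str, str] = {
            p: self.brokers[i % len(self.brokers)] for i, p in enumerate(partitions)
        }
        # Storage is keyed by partition, not by broker, so brokers hold no data.
        self.storage: Dict[str, List[bytes]] = {p: [] for p in partitions}

    def publish(self, partition: str, message: bytes) -> None:
        """The owner broker would serve this write; the data lands in the storage layer."""
        self.storage[partition].append(message)

    def fail_broker(self, broker: str) -> None:
        """Reassign the failed broker's partitions; storage is never touched."""
        self.brokers.remove(broker)
        if not self.brokers:
            raise RuntimeError("no brokers left to take over ownership")
        orphaned = sorted(p for p, b in self.owner.items() if b == broker)
        for i, partition in enumerate(orphaned):
            self.owner[partition] = self.brokers[i % len(self.brokers)]
```

For example, after `Cluster(["b1", "b2", "b3"], ["p0", "p1", "p2", "p3"])` and `fail_broker("b1")`, the partitions `p0` and `p3` get new owners, and every message published earlier is still readable from `storage`.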
Messages on a Pulsar topic partition are stored in a distributed log, and the log is further divided into segments. Each segment is stored as an Apache BookKeeper ledger that is distributed across multiple bookies in the cluster. A new segment is created when the previous segment has been written for longer than a configured interval (time-based rolling), when the size of the previous segment has reached a configured threshold (size-based rolling), or whenever the ownership of the topic partition changes. With segmentation, the messages in a topic partition can be evenly distributed and balanced across all the bookies in the cluster, which means the capacity of a topic partition is not limited by the capacity of one node. Instead, it can scale up to the total capacity of the whole BookKeeper cluster.
Layered architecture and segment-centric storage (with Apache BookKeeper) are two key design principles. They give Apache Pulsar several significant benefits, such as unlimited topic partition storage, instant scaling without data rebalancing, and independent scalability of the serving and storage clusters.
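The three rolling conditions can be sketched as a small Python model. This is an illustration only, not BookKeeper's or Pulsar's implementation; `PartitionLog`, `Segment`, and the parameter names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    opened_at: float                      # time the segment was created, in seconds
    size: int = 0                         # bytes written so far
    messages: List[bytes] = field(default_factory=list)


class PartitionLog:
    """Toy log for one topic partition that rolls segments by time, size, or ownership."""

    def __init__(self, max_segment_bytes: int, max_segment_age_s: float) -> None:
        self.max_segment_bytes = max_segment_bytes
        self.max_segment_age_s = max_segment_age_s
        self.segments: List[Segment] = []

    def _roll(self, now: float) -> None:
        self.segments.append(Segment(opened_at=now))

    def on_ownership_change(self, now: float) -> None:
        """A change of owner broker always starts a new segment."""
        self._roll(now)

    def append(self, message: bytes, now: float) -> None:
        current = self.segments[-1] if self.segments else None
        if (
            current is None
            or now - current.opened_at > self.max_segment_age_s   # time-based rolling
            or current.size >= self.max_segment_bytes             # size-based rolling
        ):
            self._roll(now)
        segment = self.segments[-1]
        segment.messages.append(message)
        segment.size += len(message)
```

Because each closed segment is an independent unit, a storage layer is free to place different segments on different nodes, which is what lets a partition grow beyond a single machine.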
Apache Pulsar provides two types of read API: pub-sub for streaming, segment for batch processing
Apache Pulsar follows the general pub-sub pattern: a producer publishes a message to a topic, and a consumer subscribes to the topic, processes each received message, and sends an acknowledgment (ack) once the message is processed. A subscription is a named configuration rule that determines how messages are delivered to consumers. Pulsar supports four types of subscriptions (exclusive, failover, shared, and key-shared) that can coexist on the same topic, distinguished by subscription name:
Key-shared subscriptions: multiple consumers can attach to the same subscription, and messages with the same key or the same ordering key are delivered to only one consumer.
For batch processing, Pulsar relies on its segment-centric storage and reads data directly from the storage layer (BookKeeper or tiered storage).
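The key-shared delivery guarantee can be illustrated with a short simulation. This is a deliberately simplified model, not Pulsar's dispatcher: mapping keys to consumers with `crc32(key) % len(consumers)` is an assumption made for the sketch, and the function name is hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def dispatch_key_shared(
    messages: List[Tuple[str, str]], consumers: List[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Deliver (key, payload) messages so that each key goes to exactly one consumer."""
    delivered: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for key, payload in messages:
        # A stable hash ensures the same key is always routed to the same consumer.
        index = zlib.crc32(key.encode("utf-8")) % len(consumers)
        delivered[consumers[index]].append((key, payload))
    return dict(delivered)


# All messages for one user key land on a single consumer, in publish order.
result = dispatch_key_shared(
    [("user-1", "login"), ("user-2", "pay"), ("user-1", "pay")],
    ["consumer-a", "consumer-b"],
)
```

This property is useful for risk indicators computed per user or per merchant, since one consumer sees the complete, ordered history for each key it handles.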
After further investigation, we decided to adopt Apache Pulsar to build a new unified data processing stack: Pulsar as the unified data store, and Spark as the unified computing engine.
Since Spark 2.2.0, Structured Streaming has provided a solid foundation for both batch and stream processing. You can read data in Pulsar through Spark Structured Streaming, and query historical data in Pulsar through Spark SQL.
Apache Pulsar addresses these messy operational problems by storing data in segmented streams. Data is appended to topics (streams) as it arrives, then segmented and stored in a scalable log storage, Apache BookKeeper. Since only one copy of the data is stored (the source of truth), this solves the inconsistency problem of the Lambda Architecture. Meanwhile, we can access the data in streams via unified pub/sub messaging, and in segments for elastic parallel batch processing. Together with a unified computing engine like Spark, Apache Pulsar is a perfect unified messaging and storage solution for building a unified data processing stack. We decided to adopt Apache Pulsar to re-architect our stack for our business.
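The idea of one stored copy with two access paths can be sketched in plain Python. This toy does not use Pulsar or Spark; `SegmentedStream` and its methods are hypothetical names, and the thread pool merely stands in for parallel batch readers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


class SegmentedStream:
    """Toy single-copy store: one ordered log, physically split into segments."""

    def __init__(self, segment_size: int) -> None:
        self.segment_size = segment_size
        self.segments: List[List[float]] = [[]]

    def append(self, amount: float) -> None:
        # Start a new segment once the current one is full.
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])
        self.segments[-1].append(amount)

    def read_stream(self, start: int = 0) -> Iterator[float]:
        """Streaming access: iterate messages in order from a given position."""
        flat = [amount for segment in self.segments for amount in segment]
        yield from flat[start:]

    def read_batch_sum(self, workers: int = 4) -> float:
        """Batch access: process segments in parallel, then combine partial results."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(sum, self.segments))


stream = SegmentedStream(segment_size=3)
for amount in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    stream.append(amount)
# Both paths read the same data, so they cannot disagree.
assert sum(stream.read_stream()) == stream.read_batch_sum()
```

With a single source of truth, the batch and streaming halves of an indicator no longer need to be reconciled across two separately maintained copies of the data.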
To adopt Apache Pulsar for our business, we need to upgrade our data processing stack. The upgrade is done in two steps.
First, import data from the old Lambda-based data processing stack into Pulsar. Our data is composed of historical data and real-time streaming data. For real-time streaming data, we leverage pulsar-io-kafka to read data from Kafka and write it to Pulsar while keeping the schema information unchanged. For historical data, we use pulsar-spark to query data stored in Hive with Spark and store the result in Pulsar in a schema (AVRO) format. Both pulsar-io-kafka and pulsar-spark are already open-sourced by StreamNative.
Second, move our computation jobs to process the records stored in Pulsar. We use Spark Structured Streaming for real-time processing, and Spark SQL for batch processing and interactive queries.
The new Apache Pulsar-based solution unifies the computing engine, data storage, and programming language. Compared with the Lambda Architecture, the new solution reduces complexity dramatically.
Apache Pulsar is a cloud-native messaging system with a layered architecture and segment-centric storage, which makes it a perfect choice for building a unified data processing stack. Together with a unified computing engine like Spark, Apache Pulsar boosts the efficiency of our risk-control decision deployment. As a result, we are able to provide merchants and consumers with safe, convenient, and efficient services.
Pulsar is a young and promising project, and the Apache Pulsar community is growing fast. We have invested heavily in the new Pulsar-based unified data stack. We’d like to contribute our practices back to the Pulsar community and help companies with similar challenges solve their problems.