With the release of Apache Spark 2.4 in November 2018, the Data Source V2 APIs became more stable. The design and the work for this major improvement is documented in Jira ticket SPARK-15689. If you trace through the design document and the sub-tasks of the aforementioned Jira ticket, you will discover that the work was started before the release of Apache Spark 2.3 and some of the APIs enhancements were released in Apache Spark 2.3. To my surprise, there were breaking changes in the Data Source V2 APIs when Apache Spark 2.4 came out.
Why am I writing this blog? Well, I maintain a few Apache Spark Streaming data sources (Wiki edits and Twitter tweets) that are regularly used in a few Structured Streaming exercises in the Introduction Apache Spark with Scala course I teach at UCSC Extension school, and recently I had to upgrade those data sources in order for them to work with Apache Spark 2.4.
In this blog I would like to what I learned as well as a few very useful resources I found extremely valuable during this exercise.
To get a feel for API changes between Apache Spark 2.3 and 2.4, I would highly recommend you spend a few minutes to read the awesome blog called “Migrating to Spark 2.4 Data Source API”. The author of this blog graciously shared the git commit that contains the necessary code changes to upgrade his data sources. I found it extremely helpful to see what was changed.
By reading through that blog and the git commit, I was able to successfully upgrade my data sources without much trouble. One thing that I encountered a bit of trouble is in dealing with the InternalRow class, which is used as a container to hold the values of each row. It requires each string to be in UTF8 format and there is a handy utility call UTF8String we can leverage to convert a regular string to UTF8 string by calling fromString function. The other thing that InternalRow class requires is the Timestamp type columns need to be encoded as SQLTimestamp type instead of long type. Luckily there is a utility class called DateTimeUtils that we can use to convert from milliseconds by calling the fromMillis function.
That is pretty it. Hope you find this blog useful in your journey of migrating your data sources to Spark 2.4 Data Source V2 APIs.
Here is the github that contains the streaming data sources I mentioned and here the commit that I submitted.