
pyspark

Getting Started

from pyspark.sql import SparkSession

def init_spark():
  # Reuse an existing session or create a local one; uncomment the master()
  # line to attach to a standalone cluster instead.
  spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
  # spark = SparkSession.builder.master("spark://192.168.2.99:7077").getOrCreate()
  sc = spark.sparkContext
  return spark, sc

def main():
  spark, sc = init_spark()
  nums = sc.parallelize([1, 2, 3, 4])
  print(nums.map(lambda x: x * x).collect())  # [1, 4, 9, 16]


if __name__ == '__main__':
  main()

Submit

python - Pyspark - Error related to SparkContext - no attribute _jsc - Stack Overflow
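To run the example above on a cluster, save it (e.g. as `hello.py`, a placeholder name) and pass it to `spark-submit`, for instance `spark-submit --master spark://192.168.2.99:7077 hello.py`; without `--master` it runs locally.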

TODO

Cool Sources

wiki.software.list.json

Cast as new column

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))
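A self-contained sketch of the same cast, with a made-up `joindf` so it can be run end to end:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("CastExample").getOrCreate()

# Hypothetical input: the "show" column arrives as strings.
joindf = spark.createDataFrame([("1.5",), ("2.0",)], ["show"])

# Cast the string column to double into a new "label" column; the original column is untouched.
changedTypedf = joindf.withColumn("label", F.col("show").cast("double"))
changedTypedf.printSchema()
changedTypedf.show()
```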

Lists

JOIN
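Nothing captured here yet; as a starting point, a minimal join sketch with made-up tables and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

# Hypothetical lookup and fact tables sharing a user_id key.
users = spark.createDataFrame([(1, "alice"), (2, "bob")], ["user_id", "name"])
events = spark.createDataFrame([(1, "login"), (1, "logout"), (3, "login")], ["user_id", "event"])

# Inner join on the shared key; swap how= for "left", "right", "full", "left_anti", etc.
users.join(events, on="user_id", how="inner").show()
```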

Notes

* [How to copy columns to a new Pandas DataFrame in Python](https://www.adamsmith.haus/python/answers/how-to-copy-columns-to-a-new-pandas-dataframe-in-python)
* Set column name in select query with `alias`

  ```python
  import pyspark.sql.functions as F

  unioned_df.select(F.to_timestamp(F.from_unixtime(F.col("ts") / 1000)).alias('time_stamp')).show()
  ```

* Cast dataframe pyspark

* [python - How to change a dataframe column from String type to Double type in PySpark? - Stack Overflow](https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark)
* List data already in Spark

  ```python
  spark.sql('show databases').show()
  spark.sql("show tables in default").show()
  spark.catalog.listTables('default')
  ```

  * [List Tables & Databases in Apache Spark | by Swaroop | Medium](https://medium.com/@durgaswaroop/list-tables-and-databases-in-spark-2d03594d2883)
* Save dataframe to a Spark SQL table

  ```python
  unioned_df.write.saveAsTable('default.log_data')
  ```

* Access data already in Spark

  ```python
  spark.sql("SELECT * FROM log_data").toPandas()
  ```

* Timestamp stuff (sketch below)
  * scala - How to convert unix timestamp to date in Spark - Stack Overflow
* Case (sketch below)
  * pyspark - Apache spark dealing with case statements - Stack Overflow
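The "Timestamp stuff" and "Case" links above boil down to `from_unixtime`/`to_timestamp` for epoch columns and `when`/`otherwise` for CASE logic. A small self-contained sketch, with a hypothetical `ts` column in epoch milliseconds and a `status` code:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NotesExamples").getOrCreate()

# Made-up log rows: epoch-millisecond timestamp plus an HTTP status code.
df = spark.createDataFrame([(1600000000000, 200), (1600000060000, 500)], ["ts", "status"])

df.select(
    # ms -> seconds -> timestamp, same pattern as the alias note above.
    F.to_timestamp(F.from_unixtime(F.col("ts") / 1000)).alias("time_stamp"),
    # CASE WHEN status >= 500 THEN 'error' ELSE 'ok' END
    F.when(F.col("status") >= 500, "error").otherwise("ok").alias("outcome"),
).show(truncate=False)
```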

Dataframe Examples
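No examples collected here yet; a minimal sketch (made-up schema and rows) for experimenting:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataframeExamples").getOrCreate()

# Column types (string, long) are inferred from the Python values.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

df.printSchema()
df.filter(df.age > 40).show()
```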

S3
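Nothing recorded here yet; a sketch of reading from S3 through the s3a connector, assuming `hadoop-aws` (and its AWS SDK dependency) is on the classpath, e.g. via `spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4`; bucket, prefix, and credentials below are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("S3Example")
    # Placeholder credentials; an instance profile or environment variables work too.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Hypothetical bucket and prefix.
df = spark.read.parquet("s3a://my-bucket/logs/")
df.show()
```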

Documentation

Work

Does not work

TODO

kavgan/nlp-in-practice: Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.