# pyspark

## Getting Started
```python
from pyspark.sql import SparkSession

def init_spark():
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    # spark = SparkSession.builder.master("spark://192.168.2.99:7077").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def main():
    spark, sc = init_spark()
    nums = sc.parallelize([1, 2, 3, 4])
    print(nums.map(lambda x: x * x).collect())

if __name__ == '__main__':
    main()
```
## Submit

- python - Pyspark - Error related to SparkContext - no attribute _jsc - Stack Overflow
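A script like the one in Getting Started can be handed to Spark with `spark-submit`; a minimal sketch of the invocation, assuming the script is saved as `hello_world.py` (the file name and master URLs here are assumptions, the standalone URL is the one commented out in the script above):

```shell
# Run locally with 2 cores
spark-submit --master "local[2]" hello_world.py

# Or target the standalone master from the commented-out line above
# spark-submit --master spark://192.168.2.99:7077 hello_world.py
```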
## TODO
- PySpark Union and UnionAll Explained — SparkByExamples
- udacity/nd027-c3-data-lakes-with-spark
- AlexIoannides/pyspark-example-project: Example project implementing best practices for PySpark ETL jobs and applications.
- https://sparkbyexamples.com/pyspark-tutorial/
- Running PySpark as a Spark standalone job — Anaconda documentation
- First Steps With PySpark and Big Data Processing – Real Python
- Docker/Jupyter PySpark - charlesreid1
- aljavier/spark-docker-setup: Small setup of development environment for Apache Spark with docker
- python - How divide or multiply every non-string columns of a PySpark dataframe with a float constant? - Stack Overflow
- apache spark - Pyspark dataframe write to single json file with specific name - Stack Overflow
- python - How do I select rows from a DataFrame based on column values? - Stack Overflow
- python - Concatenate two PySpark dataframes - Stack Overflow
## Cool Sources
- 7 Bilingual PySpark: blending Python and SQL code - Data Analysis with Python and PySpark MEAP V14
- INSERT INTO - Spark 3.2.1 Documentation
wiki.software.list.json
- Working with JSON in Apache Spark | by Neeraj Bhadani | Expedia Group Technology | Medium
- "column_name" function - Stack Overflow
- apache spark - Pyspark dataframe write to single json file with specific name - Stack Overflow
## Cast as new column

```python
changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))
```
- python - How to change a dataframe column from String type to Double type in PySpark? - Stack Overflow
- [[dentropydaemon-wiki/Software/List/Spark|Spark]] - Stack Overflow
- python - Mapping columns from one dataframe to another to create a new column - Stack Overflow
## Lists

## JOIN

## Notes
- [[dentropydaemon-wiki/Software/List/pyspark|pyspark]] - Stack Overflow
- Select Column
```python
df[["column_name"]]
```
- Do basic math in the middle of a query
- Check a dataframe's schema
```python
df.schema
df.printSchema()
```
- scala - How to check the schema of DataFrame? - Stack Overflow
- Append Dataframe / Add a Row
- scala - How to add a Spark Dataframe to the bottom of another dataframe? - Stack Overflow
- Add Row to Dataframe in Pandas – thisPointer
- Copy columns, possibly into another table
  - [How to copy columns to a new Pandas DataFrame in Python](https://www.adamsmith.haus/python/answers/how-to-copy-columns-to-a-new-pandas-dataframe-in-python)
- Set column name in select query with `alias`
```python
import pyspark.sql.functions as F

unioned_df.select(F.to_timestamp(F.from_unixtime(F.col("ts") / 1000)).alias('time_stamp')).show()
```
- Cast a dataframe column in pyspark
  - [python - How to change a dataframe column from String type to Double type in PySpark? - Stack Overflow](https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark)
- List data already in spark
```python
spark.sql('show databases').show()
spark.sql("show tables in default").show()
spark.catalog.listTables('default')
```
  - [List Tables & Databases in Apache Spark | by Swaroop | Medium](https://medium.com/@durgaswaroop/list-tables-and-databases-in-spark-2d03594d2883)
- Save dataframe to a spark sql table
```python
unioned_df.write.saveAsTable('default.log_data')
```
- Access data already in spark
```python
spark.sql("SELECT * FROM log_data").toPandas()
```
* Timestamp stuff
* scala - How to convert unix timestamp to date in Spark - Stack Overflow
* Case
* pyspark - Apache spark dealing with case statements - Stack Overflow
## Dataframe Examples
- Example Data Frames
- PySpark - Create an Empty DataFrame & RDD — SparkByExamples
- PySpark - Create DataFrame from List - GeeksforGeeks
- Put a list of JSON files or CSVs into a single data frame
- python - Merging multiple data frames row-wise in PySpark - Data Science Stack Exchange
```python
import glob
import os
import functools

log_files = glob.glob(os.getcwd() + '/log-data/*')

list_of_dfs = []
for log_file in log_files:
    list_of_dfs.append(spark.read.json(log_file))

def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

unioned_df = unionAll(list_of_dfs)
```
## S3

## Documentation

### Work
- spark-examples/pyspark-examples: Pyspark RDD, DataFrame and Dataset Examples in Python language
- From sparkbyexamples.com
- mahmoudparsian/pyspark-tutorial: PySpark-Tutorial provides basic algorithms using PySpark
- Shows examples for the CLI; no jobs or Jupyter notebooks
### Does not work
- [[dentropydaemon-wiki/Software/List/pyspark|pySpark]] tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
- Seems to use Python2
- 6 Years Old
- Includes command to start spark
- Lots of jupyter notebooks