Build pipelines with Pandas using “pdpipe”



The second method looks for the string drop in the Price_tag column and drops the rows that match. Finally, the third method removes the Price_tag column, cleaning up the DataFrame. After all, the Price_tag column was only needed temporarily to tag specific rows, and should be removed once it has served its purpose.
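For reference, the user-defined tagging function passed to ApplyByCols could look something like the sketch below. This is a hypothetical version for illustration only; the actual price_tag function defined earlier in the article may use a different rule or threshold.

# A hypothetical tagging function (the threshold is made up):
# rows tagged 'drop' are then removed by the ValDrop stage
def price_tag(price):
    return 'keep' if price > 250000 else 'drop'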

All of this is done by simply chaining stages of operations on the same pipeline!

At this point, we can look back and see what our pipeline does to the DataFrame right from the beginning:

  • drops a specific column
  • one-hot-encodes a categorical data column for modeling
  • tags data based on a user-defined function
  • drops rows based on the tag
  • drops the temporary tagging column

All of this is accomplished with the following five lines of code:

import pdpipe as pdp

# Build the pipeline by chaining stages with the + operator
pipeline = pdp.ColDrop('Avg. Area House Age')
pipeline += pdp.OneHotEncode('House_size')
pipeline += pdp.ApplyByCols('Price', price_tag, 'Price_tag', drop=False)
pipeline += pdp.ValDrop(['drop'], 'Price_tag')
pipeline += pdp.ColDrop('Price_tag')

# Apply all stages to the DataFrame in a single call
df5 = pipeline(df)
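As a side note, pdpipe also supports building the same pipeline from a list of stages via the PdPipeline class (assuming a pdpipe version that exposes it), which some may find more readable than repeated += chaining:

# Equivalent construction from a list of stages
pipeline = pdp.PdPipeline([
    pdp.ColDrop('Avg. Area House Age'),
    pdp.OneHotEncode('House_size'),
    pdp.ApplyByCols('Price', price_tag, 'Price_tag', drop=False),
    pdp.ValDrop(['drop'], 'Price_tag'),
    pdp.ColDrop('Price_tag'),
])
df5 = pipeline(df)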

There are many more useful and intuitive DataFrame manipulation methods available in pdpipe. However, we just want to show that even some operations from the Scikit-learn and NLTK packages are included in pdpipe for making awesome pipelines.

Scaling estimator from Scikit-learn

One of the most common tasks in building machine learning models is scaling the data. Scikit-learn offers a few different types of scaling, such as min-max scaling or standardization (where the mean of the data set is subtracted from each value, followed by division by the standard deviation).
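To make the standardization formula concrete, here is a minimal standalone sketch using Scikit-learn's StandardScaler directly on made-up numbers:

# Standardization: subtract the column mean, then divide by the
# column standard deviation (StandardScaler uses the population std)
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
print(StandardScaler().fit_transform(x).ravel())
# [-1.34164079 -0.4472136   0.4472136   1.34164079]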

We can chain such a scaling operation directly into a pipeline. The following code demonstrates the use:

pipeline_scale = pdp.Scale('StandardScaler',
                           exclude_columns=['House_size_Medium', 'House_size_Small'])
df6 = pipeline_scale(df5)

Here we applied the StandardScaler estimator from the Scikit-learn package to transform the data for clustering or neural network fitting. We can selectively exclude columns that do not need such scaling, as we have done here for the indicator columns House_size_Medium and House_size_Small.
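As a quick sanity check (a hypothetical snippet, assuming the df6 produced above), the scaled numeric columns should now have a mean near 0 and a standard deviation near 1, while the excluded indicator columns stay untouched:

# Inspect per-column mean and standard deviation after scaling
print(df6.describe().loc[['mean', 'std']])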

Tokenizer from NLTK

We note that the Address field in our DataFrame is pretty useless right now. However, if we can extract the zip code or the state from those strings, they might be useful for some kind of visualization or machine learning task.

We can use a word tokenizer for this purpose. NLTK is a popular and powerful Python library for text mining and natural language processing (NLP), and it offers a range of tokenizer methods. Here, we can use one such tokenizer to split up the text in the Address field and extract the name of the state from it. We note that the name of the state is the penultimate word in the address string. Therefore, the following chained pipeline will do the job for us:

# Extract the state: the penultimate token of the tokenized address
def extract_state(token):
    return str(token[-2])

pipeline_tokenize = pdp.TokenizeWords('Address')
pipeline_state = pdp.ApplyByCols('Address', extract_state,
                                 result_columns='State')

# Chain the two stages and apply them to the DataFrame
pipeline_state_extract = pipeline_tokenize + pipeline_state
df7 = pipeline_state_extract(df6)
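To see what the tokenizer stage feeds into extract_state, here is a minimal standalone sketch with a made-up address in the same format (assuming NLTK and its punkt tokenizer data are installed):

from nltk import word_tokenize  # may require nltk.download('punkt') first

sample = '123 Main Street\nSpringfield, CA 90001'
tokens = word_tokenize(sample)
print(tokens)      # ['123', 'Main', 'Street', 'Springfield', ',', 'CA', '90001']
print(tokens[-2])  # 'CA' -- the penultimate token is the state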

The resulting DataFrame looks like the following:


