Facts and Figures about H1B Employees in the US: Exploratory Analyses. A Pilot Study

Highlights

The row file containing R-codes and Output can be accessed here: RPubs - Facts and Figures of H1B Employees in the US: Exploratory and Trend Analyses

This is a small study (pilot) research on the facts and figures of of H1B employees in the United States using a quarter long actual data. It will be followed by a much bigger data set if the findings suggest that it is worthy of inquiring. The data came from the online repository of the Department of Labor. This study doesn’t tell anything whether somebody applied or got the H1B visa, nonetheless, it provides information about the critical step before actual H1B application called Labor Condition Application (LCA). In addition, it provides insights into the hiring company, job type, prevailing wage, state etc. I am going to dive deep down, however, I don’t really have any agenda here. I am open to exploration and surprises!! I simply want to be able to answer the questions like:

Which state, and city hire the most H1B employees?
Is there any particular zip code that hires the most?
What percentage of the total LCA applications are approved, denied, or withdrawn?
What kind of jobs do the H1B employees fill?
What is the average wage? And is there any differences on wages based on the job type?
What is the most recent trend of applying for LCA at the Department of Labor?

If I am able to answer these questions carefully, it may be useful for the international people interested in pursuing career in the United States. It may give them some ideas about job distribution by state, zip code, company, and even prevailing wages, which can play determining role in planning for their career. In addition, the comparison between academic and non-academic areas would shed light in the present context of discrepancy between these areas.

A. Setting Basic Stuff Straight

A.i. Loading Required Datafiles

For this project I am going to use the LCA Approval data that I downloaded form the Department of Labor website (https://www.dol.gov/agencies/eta/foreign-labor/performance). The data is from the last quarter of 2021, i.e., three months period from 10-2021 to 12-2021. There are some historical data that goes all the way to October 2019, but they are very limited.

Short Description of the Data file: Any US employer that wants to hire an international employee under the so called “specialty occupations” need to go through a standard procedure. One of such stages is famously called Labor Condition Application (LCA), which is filed by the employer at the Department of Labor(DoL). It is is pretty much like a written affidavit of the company saying that the company would pay and treat the new international hire fairly. It takes roughly 7-days to get decision on the application, which opens up a door for employers to file Non-immigrant temporary visa called H1B at the United States Citizenship and Immigration Services (USCIS).

Please note that it is a raw file and it is the first time I am dealing with this data. Let’s upload the data and explore them.

A.ii. Uploading the data and checking the variables

Checking the dimension of the dataset using dim() function, which gives us the number of rows and columns as an output.

dim(LCA_Approval)

[1] 122608     96

OMG!! Our data table has 122,608 data points. There are total of 96 variables. I am going to check what these variables are and decide how many of them are needed for me to successfully answer the questions above. Let’s check what are those variables:

names(LCA_Approval)

 [1] "CASE_NUMBER"                    "CASE_STATUS"                   
 [3] "RECEIVED_DATE"                  "DECISION_DATE"                 
 [5] "ORIGINAL_CERT_DATE"             "VISA_CLASS"                    
 [7] "JOB_TITLE"                      "SOC_CODE"                      
 [9] "SOC_TITLE"                      "FULL_TIME_POSITION"            
[11] "BEGIN_DATE"                     "END_DATE"                      
[13] "TOTAL_WORKER_POSITIONS"         "NEW_EMPLOYMENT"                
[15] "CONTINUED_EMPLOYMENT"           "CHANGE_PREVIOUS_EMPLOYMENT"    
[17] "NEW_CONCURRENT_EMPLOYMENT"      "CHANGE_EMPLOYER"               
[19] "AMENDED_PETITION"               "EMPLOYER_NAME"                 
[21] "TRADE_NAME_DBA"                 "EMPLOYER_ADDRESS1"             
[23] "EMPLOYER_ADDRESS2"              "EMPLOYER_CITY"                 
[25] "EMPLOYER_STATE"                 "EMPLOYER_POSTAL_CODE"          
[27] "EMPLOYER_COUNTRY"               "EMPLOYER_PROVINCE"             
[29] "EMPLOYER_PHONE"                 "EMPLOYER_PHONE_EXT"            
[31] "NAICS_CODE"                     "EMPLOYER_POC_LAST_NAME"        
[33] "EMPLOYER_POC_FIRST_NAME"        "EMPLOYER_POC_MIDDLE_NAME"      
[35] "EMPLOYER_POC_JOB_TITLE"         "EMPLOYER_POC_ADDRESS1"         
[37] "EMPLOYER_POC_ADDRESS2"          "EMPLOYER_POC_CITY"             
[39] "EMPLOYER_POC_STATE"             "EMPLOYER_POC_POSTAL_CODE"      
[41] "EMPLOYER_POC_COUNTRY"           "EMPLOYER_POC_PROVINCE"         
[43] "EMPLOYER_POC_PHONE"             "EMPLOYER_POC_PHONE_EXT"        
[45] "EMPLOYER_POC_EMAIL"             "AGENT_REPRESENTING_EMPLOYER"   
[47] "AGENT_ATTORNEY_LAST_NAME"       "AGENT_ATTORNEY_FIRST_NAME"     
[49] "AGENT_ATTORNEY_MIDDLE_NAME"     "AGENT_ATTORNEY_ADDRESS1"       
[51] "AGENT_ATTORNEY_ADDRESS2"        "AGENT_ATTORNEY_CITY"           
[53] "AGENT_ATTORNEY_STATE"           "AGENT_ATTORNEY_POSTAL_CODE"    
[55] "AGENT_ATTORNEY_COUNTRY"         "AGENT_ATTORNEY_PROVINCE"       
[57] "AGENT_ATTORNEY_PHONE"           "AGENT_ATTORNEY_PHONE_EXT"      
[59] "AGENT_ATTORNEY_EMAIL_ADDRESS"   "LAWFIRM_NAME_BUSINESS_NAME"    
[61] "STATE_OF_HIGHEST_COURT"         "NAME_OF_HIGHEST_STATE_COURT"   
[63] "WORKSITE_WORKERS"               "SECONDARY_ENTITY"              
[65] "SECONDARY_ENTITY_BUSINESS_NAME" "WORKSITE_ADDRESS1"             
[67] "WORKSITE_ADDRESS2"              "WORKSITE_CITY"                 
[69] "WORKSITE_COUNTY"                "WORKSITE_STATE"                
[71] "WORKSITE_POSTAL_CODE"           "WAGE_RATE_OF_PAY_FROM"         
[73] "WAGE_RATE_OF_PAY_TO"            "WAGE_UNIT_OF_PAY"              
[75] "PREVAILING_WAGE"                "PW_UNIT_OF_PAY"                
[77] "PW_TRACKING_NUMBER"             "PW_WAGE_LEVEL"                 
[79] "PW_OES_YEAR"                    "PW_OTHER_SOURCE"               
[81] "PW_OTHER_YEAR"                  "PW_SURVEY_PUBLISHER"           
[83] "PW_SURVEY_NAME"                 "TOTAL_WORKSITE_LOCATIONS"      
[85] "AGREE_TO_LC_STATEMENT"          "H1B_DEPENDENT"                 
[87] "WILLFUL_VIOLATOR"               "SUPPORT_H1B"                   
[89] "STATUTORY_BASIS"                "APPENDIX_A_ATTACHED"           
[91] "PUBLIC_DISCLOSURE"              "PREPARER_LAST_NAME"            
[93] "PREPARER_FIRST_NAME"            "PREPARER_MIDDLE_INITIAL"       
[95] "PREPARER_BUSINESS_NAME"         "PREPARER_EMAIL"

The output lists the names of all variables in the LCA_Approval data table. The overall variables can be classified into 4-big buckets:

Job Related Variables like job_title, soc_code etc.
Hiring Company Related variables like employer_name, employer_address, etc.,
Attorney and/or Preparer Related Variables like attorney_phone, preparer_phone, etc., and
Wage Related Variables like wage_level, unit_of_pay, etc.,

In relation to the guiding questions above, I shortlisted 16 variables, for now. They come from all of the buckets. Table below provides a brief description of the variables:

Table 1: Variables and Their Brief Description.
Variable_Names	Description
case_number	LCA application case number
case_status	Application outcome/decision
lca_application_date	Application date
lca_decision_data	Date of application decision
visa_class	Employees' proposed visa type e.g., H1B
soc_title	Job category, e.g. computer system analyst
job_type	Fulltime or a parttime job
job_begin_date	First date of work
total_worker	Total H1B workers in the company
employer_name	Name of the hiring company
employer_state	Employer's state
employer_zip	Employer's zip code
attorney_representation	If an attorney to filed the case
prevailing_wage	Proposed wage for the employment
pay_wage_level	Wage level: ranges between I to IV
H1B_dependent	Whether the employee need H1B to stay in the US

Beside these variables, I may have to create some new variables as the analyses progress further. It is time to subset the data file and move forward.

A.iii. Subsetting the data

As mentioned above, I am going to select only 16 columns from 96 using select() function, and save the new data in as a new object named LCA_final.

It is confusing to have all of the variable names in upper case. They are case sensitive. It’s convention/preferred lower case letters to denote variables. Thus, following the R recommendation, I am going to change them in the lower case using the clean_names() function from the {janitor} package.

I got the clean names. Some of the names are unnecessarily long. I would like to work with simple intuitive mostly two-word column names. I can change the long names using rename() function.

 [1] "case_number"             "case_status"            
 [3] "received_date"           "decision_date"          
 [5] "visa_class"              "job_title"              
 [7] "job_type"                "begin_date"             
 [9] "total_worker"            "employer_name"          
[11] "employer_state"          "employer_zip"           
[13] "attorney_representation" "prevailing_wage"        
[15] "h1b_dependent"           "pw_unit_of_pay"

There are 16 variables, most of them have two letters separated using an underscore (_) sing. So far, the data have been uploaded, required variables are selected, they are cleaned and renamed. The next task is to look into the data. Let’s check what is our data look like.

B. Exploring Data

Now, I am going to conduct some of the preliminary inquiry of the data set and check if there’s any inconsistencies. In addition, I am going to study my data by conducting some descriptive analyses and visualizing them.

B.i. Summarizing the data

case_number        case_status        received_date                
 Length:122608      Length:122608      Min.   :2019-10-01 00:00:00  
 Class :character   Class :character   1st Qu.:2021-10-12 00:00:00  
 Mode  :character   Mode  :character   Median :2021-11-05 00:00:00  
                                       Mean   :2021-10-22 08:17:44  
                                       3rd Qu.:2021-12-01 00:00:00  
                                       Max.   :2021-12-31 00:00:00  
 decision_date                  visa_class         job_title        
 Min.   :2021-10-01 00:00:00   Length:122608      Length:122608     
 1st Qu.:2021-10-22 00:00:00   Class :character   Class :character  
 Median :2021-11-16 00:00:00   Mode  :character   Mode  :character  
 Mean   :2021-11-15 05:40:47                                        
 3rd Qu.:2021-12-08 00:00:00                                        
 Max.   :2021-12-31 00:00:00                                        
   job_type           begin_date                   total_worker
 Length:122608      Min.   :2019-10-14 00:00:00   Min.   :  1  
 Class :character   1st Qu.:2021-11-01 00:00:00   1st Qu.:  1  
 Mode  :character   Median :2021-12-06 00:00:00   Median :  1  
                    Mean   :2021-12-06 05:59:52   Mean   :  2  
                    3rd Qu.:2022-01-10 00:00:00   3rd Qu.:  1  
                    Max.   :2022-06-25 00:00:00   Max.   :100  
 employer_name      employer_state     employer_zip      
 Length:122608      Length:122608      Length:122608     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
 attorney_representation prevailing_wage  h1b_dependent      pw_unit_of_pay    
 Length:122608           Min.   :     7   Length:122608      Length:122608     
 Class :character        1st Qu.: 76856   Class :character   Class :character  
 Mode  :character        Median : 95243   Mode  :character   Mode  :character  
                         Mean   : 97169                                        
                         3rd Qu.:119933                                        
                         Max.   :810850

The output shows the summary of the all 16 variables. Looks like most them are listed as characters, a few as date, and some as numeric variables. There are total of 122,608 data points. When I look at the received_date variable, I know that the data come all the way from 2019 October 1 through 2021 December 31. However, decision_date doesn’t tell the same story. Looks like some of the people waited very long time to get decisions on their LCA applications. I will look into this variable in a moment. For now, let’s move towards the zip_code column.

B.ii. Extracting only 5-digit zip codes

Looks like there is some problem in the employer_zip variable because it cannot have any string or character value in it. The problem may arise because of some minor differences while entering this data. My guess is some of the zip-codes might have a dash (-) followed by 4-digit numbers after the first 5-digit zip codes. I will go ahead and select only first 6-digits using the substr() function, and check what happens.

chr [1:122608] "98052" "77845" "77845" "56623" "08854" "77845" "08030" ...

As seen in the output, I now have 5-digit zip codes. It is still listed as a character vector, but its OK. I will change the class together with other vectors.

B.iii. Subsetting the data table

Based on the above summary table, I have traced some problem in the prevailing_wage column. The minimum wage is just $7 while the highest is $810,850, and the mean of $97,169. Looks like some of the wages are in hourly, weekly, bi-weekly or even monthly basis while others in yearly basis. I can either change those rates into yearly salary, or get rid of them. Because, I have a huge data set, I am simply going to get rid of the hourly, weekly, and bi-weekly rates using filter() function. In addition, I am also going to get rid of the part time employees and consider analyses for the full time employees only.

Now, we don’t have any variability in terms of employees’ job_type and unit_of_pay. All we have are full time people having annual wages provided in their LCA applications. Thus, there is no need of the job_type and pw_unit_of_pay variables. I am going to get rid of them:

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15080   78894   97760  102465  121014  810850

The summary of the data set shows that I successfully got rid of the job_type & pw_unit_of_pay variables. The summary shows that the total data points boiled down to 116,252 from 122,608, which is approximately 95% of the total data.

B.iv. Focusing on Prevailing Wage Distribution

The prevailing_wage is still troubling because of the range between the maximum and minimum wages. Average salary increased to $ 102,014 having the third quartile salary $121,014. I want to check the distribution of this variable using a histogram and will take necessary actions afterward.

Wow! What a distribution. Most of the prevailing wages are below $200,000/year. There are only a few that exceed $250,000 level. Though, really subtle, we can see a small bump in the $400,00 and $600,000 range and probably only one at $850,000 level. Clearly, they have skewed the average prevailing wage. We will probably get rid of these outliers but before that I want to zoom-in and check the 250 to 800 range.

There we go! Looks like there are good number of people between $250,000 to $ 300,000 range, who we cannot exclude from our analyses. However, it looks like there are one person each whose prevailing wage is around $350,000, $400,000, $500,000 & $800,000 per year. They are for sure the outliers and we will be better off getting them out of this study. However, I would like to make sure, I am right in doing so. For this, I will index them out and study.

B.iv.1. Indexing the Prevailing Wage and Doing Further Study of the Outliers

I am going to check for the total number of employees who were proposed more than $300,000 yearly wages. In addition, I want to see their job title and some other information.

# A tibble: 4 x 5
  job_title                         
  <chr>                             
1 Physicians and Surgeons, All Other
2 Physicians and Surgeons, All Other
3 Hospitalists                      
4 Physicians and Surgeons, All Other
  employer_name                                                   employer_state
  <chr>                                                           <chr>         
1 Casper Cardiology, LLC                                          WY            
2 The Moses H. Cone Memorial Hospital Operating Corporation, Inc. NC            
3 Monument Health Rapid City Hospital, Inc.                       SD            
4 The Moses H. Cone Memorial Hospital Operating Corporation, Inc. NC            
  prevailing_wage case_status          
            <dbl> <chr>                
1          306530 Certified            
2          810850 Certified - Withdrawn
3          398925 Certified            
4          494260 Certified

All of these people come from the health department. Two of the highest prevailing wages were from the same hospital located in North Carolina. Both of these employees were Physicians and Surgeons. Remaining two highest salaries were offered to the employee at Casper Cardiology, WY, and Monument Health Rapid City Hospital in South Dakota. One of the employees was again, a Physician and Surgeon, and while the other was reported to be a Hospitalist. Interestingly, the LCAs for the persons with the highest Prevailing Wages was first certified and withdrawn, afterward.

Based on these information, Physicians and Surgeons seem to be the highest paying jobs for the H1B employees. If you are still trying to decide on your major, go this route!!

B.iv.2. Filtering Data Trimming Outliers

As there are not many outliers, I am going to get rid of these four data points and move forward.

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15080   78894   97760  102451  121014  295797

The average prevailing wages increased slightly, i.e., ($102,451 - $102,014 = 437) after getting rid of the wages over $300,000/year. The minimum prevailing wages are $15,080, while the highest is $295,797.

Rows: 116,248
Columns: 14
$ case_number             <chr> "I-200-21270-606885", "I-200-21270-606909", "I~
$ case_status             <chr> "Certified", "Certified", "Certified", "Certif~
$ received_date           <dttm> 2021-09-26, 2021-09-26, 2021-09-26, 2021-09-2~
$ decision_date           <dttm> 2021-10-01, 2021-10-01, 2021-10-01, 2021-10-0~
$ visa_class              <chr> "H-1B", "H-1B", "H-1B", "H-1B", "H-1B", "H-1B1~
$ job_title               <chr> "Computer Systems Analysts", "Computer Systems~
$ begin_date              <dttm> 2022-02-17, 2022-03-09, 2021-10-01, 2021-10-1~
$ total_worker            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ employer_name           <chr> "COGNIZANT TECHNOLOGY SOLUTIONS US CORP", "COG~
$ employer_state          <chr> "TX", "TX", "MN", "NJ", "TX", "NJ", "FL", "MN"~
$ employer_zip            <chr> "77845", "77845", "56623", "08854", "77845", "~
$ attorney_representation <chr> "No", "No", "Yes", "No", "No", "No", "No", "Ye~
$ prevailing_wage         <dbl> 105040, 78187, 53851, 82202, 93600, 49338, 755~
$ h1b_dependent           <chr> "Yes", "Yes", "No", "Yes", "Yes", "N/A", "N/A"~

The output for the glimpse() command shows that there are now 14 columns and 116,248 rows of data. The variables are listed either as character, date-time, or numeric-double. Now, let’s focus on the date and time columns.

B.v. Focusing on the Date Variables

Looking at the classes of the data, most of the factor varibles are saved as character vectors. Before I change these vectors, I want to check if my date variables are in right format and if I can calculate days between these dates.

 X.class.LCA_final.received_date.. X.class.LCA_final.decision_date..
1                           POSIXct                           POSIXct
2                            POSIXt                            POSIXt
  X.class.LCA_final.begin_date..
1                        POSIXct
2                         POSIXt

Above table shows that the date variables received_date, decision_date, & begin_date are all in right format. Now, I am checking if I could calculate total number of days between the application date and the decision date.

Time differences in days
[1] 5 5 5 5 5 5

For some reasons, R calculated the difference in seconds. I can change it to day but I would like to stick with it and divide the outcome by 86,400 (60 X 60 X 24) to change them into days. And it worked. This calculation confirms that these variables are ready to take whatever comes their way.

B.vi. Changing the Class of the Varaibles

Now, I am nearing the phase where I can plot some diagrams and conduct analyses. But, before that, I want to change the class of some of the variables to factor using the as_factor function from {sjlabelled} package.

class(LCA_final$employer_name)

[1] "factor"

And that worked!

B.vii. Creating New Variables

Now, I want to create a few more variables before I move to analysis phase.

decision_time: it will be a continuous variable representing the total days the DoL took to provide decision on the LCA application.
days_towork: it will be the difference in days between the date of LCA approval and actual first day of working. And finally,
wage_thousand: a continuous variable showing the prevailing wages in thousand. The prevailing_wage variable is in regular dollar format which covers really huge range. It is easy to plot the data if our range is smaller.

names(LCA_final)

 [1] "case_number"             "case_status"            
 [3] "received_date"           "decision_date"          
 [5] "visa_class"              "job_title"              
 [7] "begin_date"              "total_worker"           
 [9] "employer_name"           "employer_state"         
[11] "employer_zip"            "attorney_representation"
[13] "prevailing_wage"         "h1b_dependent"          
[15] "decision_time"           "days_towork"            
[17] "wage_thousand"

head(LCA_final$days_towork)

[1] 139 159   0  10   4  19

The output shows that there are now 17 variables in total, including the three that I just created. Further check confirms that the values have been stored in the new columns, respectively.

   Min. 1st Qu. Median Mean 3rd Qu. Max.
[1,]    0       7      7   24       8  813
[2,] -804      -2      6   22      45  183
[3,]   15      79     98  102     121  296

The summary of the wage_thousand looks fine. However, I am thrown off guard by the statistics of decision time and days towork variables, especially, by the numbers in max and min column. Maximum reported decision time is 813 days. Really? I don’t know if it were that big of a deal, however, when mean decision time is 23.5 days then 813 days (more than 2-years) seem really unrealistic. It may be possible, though. On the other hand, the total days gap between the day of LCA approval and the first day of work is 804 days. Surprisingly enough, not after the date of LCA approval, the person was supposed to work more than 2-years prior to his/her LCA was approved by the Department of Labor.

In addition, LCA approval doesn’t mean one can start working. The hiring organization has to apply for H1B employees’ behalf at USCIS and wait until the application is approved, which takes months based on the information provided on the USCIS website(https://www.uscis.gov/working-in-the-united-states/temporary-workers/h-1b-specialty-occupations-and-fashion-models/h-1b-electronic-registration-process). If the company pays premium processing fees, the H1B decision time boils down to 15 working days, but still, this is something really interesting! IS IT POSSIBLE?

Having said that, I want to make sure I have this information correct. Thus, I am going to look into these issues more closely.

B.viii. Focusing on Issues Relating to Decision Time and First Day of Work

First of all, I am going to check if they represent a single or two different entries. Whet ever they are, I am going to index them first.

Amazingly, they hint to consecutive records, i.e., 112494 & 112495. Let’s pull further information about these records:

# A tibble: 2 x 17
  case_number        case_status           received_date      
  <chr>              <fct>                 <dttm>             
1 I-200-19287-085384 Certified - Withdrawn 2019-10-14 00:00:00
2 I-200-19280-075004 Certified - Withdrawn 2019-10-07 00:00:00
  decision_date       visa_class job_title                        
  <dttm>              <fct>      <fct>                            
1 2021-12-28 00:00:00 H-1B       Database Administrators          
2 2021-12-28 00:00:00 H-1B       Software Developers, Applications
  begin_date          total_worker employer_name                  employer_state
  <dttm>                     <dbl> <fct>                          <fct>         
1 2019-10-16 00:00:00            1 FRANKLIN TEMPLETON INVESTMENTS CA            
2 2019-10-21 00:00:00            1 InnoCore Solutions, Inc        TX            
  employer_zip attorney_representation prevailing_wage h1b_dependent
  <fct>        <fct>                             <dbl> <fct>        
1 94403        Yes                               77002 No           
2 75063        No                                94307 Yes          
  decision_time days_towork wage_thousand
          <dbl>       <dbl>         <dbl>
1           806        -804          77.0
2           813        -799          94.3

When I parsed them out, I came to realize that they are genuine cases. There is no data entry mistake. I am perplexed to know that it took so long for the DoL to approve their LCAs. In both cases, the LCAs were filled in October 2019, which were approved in December 2021, which is fairly recently. One of the person was hired by Franklin Templeton Investment located in California, while the other by InnoCore Solutions in Texas. One of them was dependent on H1B during the time of application while the other wasn’t. They had somewhat similar job titles, e.g., Database Administrator and Software Developer. Both of them were from the IT sector.

Looks like it is the common convention. The cases were first certified and withdrawn afterward. I am not sure how to explain it. I used the corresponding LCA reference number and checked for the status on DoL website (https://flag.dol.gov/case-status-search). The status columns shows that they are withdrawn. The reason may be, COVID-19. When they are not active, there is no use of having them in the studies. It may be cases that most of the LCAs that took too long to get decision are withdrawn afterward. I am going to get rid of certified_withdrawn cases from my final study: https://rpubs.com/nirmal/885691

They seem to be outliers but let me check the overall status using a frequency polygon. I want to make a combined plot for decision_time and days_towork.

Looks like the findings are pretty consistent because the distribution looks normal with a huge spike at 0. We can see a couple of small spikes on days_towork and that makes a complete sense. If I zoom in between -25 to +190 on x-axis, I will have clearer picture. Let me do just that:

The plot is pretty consistent. Decision time seem to starts earlier than the first day of work. Most of the decision seem to have been received before 20th day of the application. Most of the first day of work also seem to align with the decision time. Some first days of work seem to have set before the decision date, however, I guess it is just the formality. Or maybe the date was picked at the time the forms were filled and there was a gap between the picked date and the date they actually sent the application. Not a big of a deal!!

So far, so good!

Let’s Do Some Real Digging

Just to remind myself. Below are the questions that I primarily wanted to answer using my data. I guess, I have been able to provide tentative partial answers to the question number 5 and 6. Now, I want to take on these questions in order they appear:

Which state, and city hire the most H1B employees?
Is there any particular zip code that hires the most?
What percentage of the total LCA applications are approved, denied, or withdrawn?
What kind of jobs do the H1B employees fill?
What is the average wage? And is there any differences on wages based on the job type?
What is the most recent trend of applying for LCA at the Department of Labor?

I will use both visual and computational skills to answer all of these questions. I will try to diversify tools and see how successful I become.

Q.N.1. Which state, and city hire the most H1B employees?

I have to use the group_by() function to be able to answer this question. I will calculate the total number of employees by states and save the statistics in an object named employees_by_state.

I plotted the percent of LCAs filed at the Department of Labor during the aforementioned period per state on the United States Map. Darker red suggest higher percentage of filed LCAs, and it changes less dark as the percent decreases, ultimately fading into solid white. California has the bold red shade followed by the Texas and a few other states. Below is the list of top 6 and bottom 13 states interms of filed LCAs on behalf of H1B employees.

Let’s check states on the top and bottom of the list:

# A tibble: 19 x 3
# Groups:   state [19]
   state     n state_percent
   <fct> <int>         <dbl>
 1 CA    26088      22.4    
 2 TX    11244       9.67   
 3 NJ    10895       9.37   
 4 WA     8755       7.53   
 5 IL     7897       6.79   
 6 NY     7788       6.70   
 7 GU      109       0.0938 
 8 ME       84       0.0723 
 9 MS       71       0.0611 
10 WV       51       0.0439 
11 ND       46       0.0396 
12 VT       46       0.0396 
13 MT       34       0.0292 
14 SD       33       0.0284 
15 HI       31       0.0267 
16 WY       29       0.0249 
17 PR       14       0.0120 
18 AK       11       0.00946
19 VI        5       0.00430

Now, the employees are classified based on the states where they are supposed to work. Looking at the top 6 and bottom 13 states, I can see that there is a huge gap. California tops the list of the states with total of 26,088 (22.4%) LCAs filed at DOL within October - December 2021, followed by Texas, with 11,244 (9.67%) LCAs.

For me the discovery is that the foreign employees who want to work in Virgin Island (VI), Guam (GU), and Puerto Rico(PR) on H1B are also in the list. They go through the same process. Good to know!

Finally, there were total of 13 US states and territories that shared less than 0.1% of the LCAs. Virgin Island submitted total of 5 LCAs during the period of 3 months, followed by Alaska with 11 LCAs.

I want to represent state-wise percentage in a horizontal bar graph. However, the range between the highest and smallest states can be problematic. Thus, I am going to create a row named `other’ summing up the values less than 0.5% to represent combined ratio of the small US states and territories.

Creating two different objects and binding them together to come up with what I need

sum.small_states.n. sum.small_states.state_percent.
1                5025                             4.3

I created big_states and small_states tibbles. There were total of 5025 LCAs filed from these small states and territories and it made 4.3% of total cases. Now, I am creating a new data frame x with these values and bind it with the big_states.

head((big_states), 10)

# A tibble: 10 x 3
# Groups:   state [10]
   state     n state_percent
   <chr> <dbl>         <dbl>
 1 CA    26088         22.4 
 2 TX    11244          9.67
 3 NJ    10895          9.37
 4 WA     8755          7.53
 5 IL     7897          6.79
 6 NY     7788          6.70
 7 other  5025          4.3 
 8 MA     4388          3.77
 9 PA     4265          3.67
10 MD     3558          3.06

Everything worked. Other is in the 7^th position from the top. Now, I am going to plot them in a horizontal bar graph using ggplot2.

The above bar diagram provides a quick insights into the percentage of LCA filed at the DoL within the period of 3 months between 10-2021 and 12-2021. California tops the list with 26,088 LCAs filed in the period of 28 months, i.e., approximately 932 LCAs every month or 31 everyday. Indiana is at the bottom of the big states that filed LCAs within the same period. I know that Indiana is 26 from the top in the big_state dataframe. Thus, just to remind myself, I am going to index Indiana and see further details.

big_states[26, ] # extracts the values in the 26th rows from the big_state tibble

# A tibble: 1 x 3
# Groups:   state [1]
  state     n state_percent
  <chr> <dbl>         <dbl>
1 IN      683         0.588

There were total of 683 LCAs that came from the state of Indiana which roughly made .59%. In other words, approximately 228 LCAs a month or 38 LCAs every five days were filed during the period of 3 months.

Q.N.2. Is there any particular zip code that hires the most?

This questions wants me to continue digging the number of proposed H1B employees during the aforementioned period. The easy way is to subset the the LCA_final dataframe.

summary(employees_by_zip_percent)

   total_LCAs    zip_percent    total_zip   
 Min.   :   1   Min.   :0.0   Min.   :   1  
 1st Qu.:  60   1st Qu.:0.1   1st Qu.:   1  
 Median : 132   Median :0.1   Median :   2  
 Mean   : 310   Mean   :0.3   Mean   :  20  
 3rd Qu.: 264   3rd Qu.:0.2   3rd Qu.:   4  
 Max.   :5057   Max.   :4.4   Max.   :1664

Further digging of the data by Zip code shows that there were 1664 Zip codes that filed just 1 LCA; 673 with 2; 391 with 3; 248 with 4; 165 with 5; 138 with 6; and 102 with 7 LCAs within the aforementioned period. The output table is not shown here because of it takes a lot of space (239 rows of data). There were total of 38 zip codes which filed more than 500 LCAs each in this period, with a maximum of 5057 LCAs (i.e., 4.4%) from a single zip-code.

If you are thinking of applying for a job that has potential to sponsor H1B, here is the list of 38 top most zip codes you want to focus:

The bar diagram shows the percentage of LCA filed at the DoL. Zip code 98121 applied for approximately 4.35% of LCA during the study period. This zip code is in the state of Washington and it covers King County, where Amazon Head Quarter is located. Similarly, another 1.92% percent LCAs were filed from the zip code 98052, which also lies in the same county. Although, California is the most H1B friendly state, 98121 is the most popular zip codes in terms of number of filed LCAs, and probably H1B employees.

Note: The plot shows the actual zip codes, thus, feel free to check them for further details.

That being said, I want to make sure I have them correct. Let’s check the data stored in the column employer_name and match them with the above zip codes and find out ten most popular companies. Let’s quickly check the inputs in the column employer_name.

head((LCA_final$employer_name), 3)

[1] COGNIZANT TECHNOLOGY SOLUTIONS US CORP
[2] COGNIZANT TECHNOLOGY SOLUTIONS US CORP
[3] ANIP Acquisitions, Inc.               
19732 Levels: '47 Brand, LLC ... Zywie Inc.

Top 10 Companies in terms of Total LCAs Filed During 10/2019 and 12/2021

Above output shows that the input is messy. Some names are in upper case and some in title case format. If proceeded without changing them, we may end up having the same company repeat multiple times in my analysis. Thus, I would like to change all of the company names into title case using str_to_title() from {stringr} package.

Now, we know the top most companies that hire the most H1B employees. States are color coded, suggesting same color represent same state. For example, Amazon and Ernst companies are dark red and they both are located in the State of Washington.

Looking at the diagram itself, there is no surprise because these companies make a complete sense. I am a little bit surprised by the companies like Cognizant & Tata Consultancy in the top 10 list, though.

Q.N.3. What percentage of the total LCA applications are approved, denied, or withdrawn?

This question should be fairly easy to answer. Answer to this question will help me decide whether I would like to include all data in my final study. Let’s check the variable and how the values are stored.

str(LCA_final$case_status)

 Factor w/ 4 levels "Certified","Certified - Withdrawn",..: 1 1 1 1 1 1 1 1 1 1 ...

As seen above, it had four possibilities, i.e., Certified, Certified-Withdrawn, Denied, Withdrawn. Now, I am going to plot a pie chart using pie3D() function from {plotrix} package and check their comparative status:

Fair enough. Approximately 92% of the LCA filings are certified, followed by roughly 6.47% first certified and then denied afterward. Less than 1% are outright rejected while 1.64% of filings are withdrawn. These findings strengthen the idea that I can get rid of Certified-Withdrawn, Denied, & Withdrawn cases from my final study.

Finally, let’s check if the decisions have anything to do with the companies.

# A tibble: 15 x 3
# Groups:   employer_name, case_status [15]
   employer_name                          case_status               n
   <chr>                                  <fct>                 <int>
 1 Amazon.com Services Llc                Certified              3322
 2 Cognizant Technology Solutions Us Corp Certified              2546
 3 Microsoft Corporation                  Certified              2154
 4 Tata Consultancy Services Limited      Certified              2136
 5 Google Llc                             Certified              2051
 6 Ernst & Young U.s. Llp                 Certified              2046
 7 Tekorg Inc.                            Certified - Withdrawn  1591
 8 Apple Inc.                             Certified              1339
 9 Deloitte Consulting Llp                Certified              1137
10 Infosys Limited                        Certified              1128
11 Amazon Web Services, Inc.              Certified               965
12 Intel Corporation                      Certified               919
13 Facebook, Inc.                         Certified               859
14 Capgemini America Inc                  Certified               746
15 Meta Platforms, Inc.                   Certified               737

As expected, most of the LCAs filed during the period were certified and most of the big companies retained the rate. However, the only thing that stood out was Tekorg Inc., which had total of 1591 cases certified and then withdrawn!! I would like to see how many LCAs did this company filed and what’s the status:

# A tibble: 3 x 3
# Groups:   case_status [3]
  case_status               n decision_percent
  <fct>                 <int>            <dbl>
1 Certified               384            19.0 
2 Certified - Withdrawn  1591            78.7 
3 Withdrawn                46             2.28

Amazing! In last 3 months or may be last 29 months, this company filled slightly higher than 2000 LCAs for full time employees at the Department of Labor. It looks like none of the applications were denied, 46 were withdrawn before receiving decision. Approximately, 19% of the applications were certified and probably they are still active. The most striking finding is approximately 79% filings were first certified and then withdrawn. What is the reason? One possible reason can be COVID-19, or these LCAs did not make through the lottery process, or their H1B applications got rejected. Pretty unfortunate! Or, they may be the ones that took very long time to get decisions on. There’s 65000 cap every year.

Finally, I would like to subset the Certified-Withdrawn cases from my data set and check their decision time.

The average LCA decision time for the Certified-Withdrawn cases are approximately 258 days, which is 10 times higher than all cases. No doubt that the average decision time for all cases has been dragged much longer by the inclusion of Certified-Withdrawn cases in the analysis. The second plot clearly shows that most of the cases received decision within the first 10 days of application.

Q.N.4. What kind of jobs do the H1B employees fill?

First of all, I want to check if the values in the job_title are input consistently. If not, I have to find a way to do so. Let’s change them into the title case and check:

# head((LCA_final$job_title),10)
str(LCA_final$job_title)

 Factor w/ 490 levels "Accountants",..: 111 111 68 279 111 14 371 68 437 77 ...

The answer is, these people fill all kind of jobs in the United States. More precisely, they were supposed to fill 490 different types of jobs as mentioned in the LCAs filed at the Department of Labor during the last 3-months of 2021.

Now, job_title is a factor variable with 490 types of jobs. I would like to group them using group_by() function by job_title and count the number for each types of jobs. Because I have 490 different job titles, publishing all of those would make no sense. Thus, I am going to identify top 20 jobs and use dumbbell plot, which is a combination of geom_segment() and geom_line(), to display what they are.

he outcome shows that most of the top 20 jobs based on the filed LCAs during 10/2021-12/2021 were IT related. The Software Developers,Application was by far the most popular job_title. Roughly 38,000 LCAs were filed for the same title. Likewise, Software Developers, System Software was the second popular with approximately 10,000 LCAs within 3 months period. Computer System Analyst was top 3rd with approximately 7,000 LCAs.

i. Looking into Academic H1B Facts and Figures

Because I am from academia, there are many folks who are either Assistant Professors, or Postdoctoral Associates, I would like to know their population, job title, place of work and some other information. Job title is pretty big, thus I would like to use patterns and pass grep() function to index these entities.

First, I am going to create an object named pattern_edu that contains the most common words referring to academia
Second, I will pass the index through LCA_final data and create and index index_2,

str(index_2)

 int [1:4604] 33 61 205 207 259 317 383 436 479 505 ...

The output shows that I have successfully identified and isolated 4,604 instances where some University and/or Colleges filed at least 1 LCA during period of 10/2021-12/2022. In other words, the LCAs filed by an academic institute was roughly below 4% during the period of interest.

Third, pass index_2 through the data set to subset it to academic only data set, and
Draw summary of the Academic Institution and number of filed LCAs during the project period.
I will limit the display to top 20 academic institutions

summary_acad[, -4] # don't give the 4th column for the sake of space

# A tibble: 20 x 3
# Groups:   employer_name [20]
   employer_state employer_name                              num_LCA
   <fct>          <chr>                                        <int>
 1 CA             The Leland Stanford, Jr University             130
 2 MI             University Of Michigan                         118
 3 MD             Johns Hopkins University                       105
 4 NY             Columbia University                             84
 5 WI             University Of Wisconsin System                  78
 6 GA             Emory University                                73
 7 AL             The University Of Alabama At Birmingham         69
 8 CT             Yale University                                 67
 9 CA             University Of California, San Francisco         62
10 MA             Harvard University                              62
11 PA             Trustees Of The University Of Pennsylvania      61
12 MO             Washington University                           60
13 CA             University Of California, San Diego             59
14 IL             Northwestern University                         56
15 NC             Duke University                                 55
16 IN             Purdue University                               53
17 PA             University Of Pittsburgh                        52
18 TX             The University Of Texas At Austin               50
19 MN             University Of Minnesota                         49
20 FL             University Of Florida                           48

Above table shows the top 20 universities that filed the most LCAs during the period of October 2021 through December 2021. The output also gives where the universities are located. Now, I am going to visualize top 10 universities using geom_tile() function.

Darkblue tiles show higher percentage of LCAs which which fades into purple as the percent decreases. As can be seen the Leland Stanford, Jr University and University of Michigan were the top H1B employers during the last quarter of 2021.

Q.N.5. What is the average wage? And is there any differences on wages based on the job type?

Answer to this question can be tricky, especially when we have a big range between the minimum and maximum wages.

First thing first, I would like to calculate the differences between the average academic and non-academic prevailing wages. I have a feeling that academic H1B have lower average prevailing wages than non-academic ones.
My second step would be figure out some other ways to group the data points and calculate the average wages. I have already calculated the overall average mean, (a) $102,465 with the outliers included, and (b) $102,451 after getting rid of the outliers.

I already have academic_LCA for academic dataset and I can use index_2 to create nacademic_LCA dataset.

I now have academic_LCA data frame with 4,604 data points and nacademic_LCA with 111,644 data points. The mean outcome shows that the average prevailing wages for non_academic LCAs were $104,000/year compared to $62,000/year for academic LCAs filed during the aforementioned period.

Now, let’s create a combined density plot of average wages between these groups and check the differences.

Wow! Out of this world. The gap in mean prevailing wages between academic and non-academic is simply nacademic_mean - academic_mean = $42,000/year. Not a surprise why many people aim for companies straight out of colleges and universities!!

The last thing, I want to do is, to identify the mean prevailing wages for top 20 jobs based on the LCAs filed within project period and plot them to check for differences in average prevailing wages. I am going to use the prior techniques to identify top 20 jobs.

head((highest_wage), 20)

                             job_title total_LCA min_wage mean_wage max_wage
1      Oral And Maxillofacial Surgeons         1   228176    228176   228176
2                         Radiologists         7   160826    215025   285189
3                    Security Managers         1   211349    211349   211349
4     Computer Hardware Engineers, R&D         1   208000    208000   208000
5                        Orthodontists         2   208000    208000   208000
6       Preventive Medicine Physicians         1   208000    208000   208000
7      Obstetricians And Gynecologists        13   142022    203938   283358
8                     Chief Executives       131    65146    201864   292427
9  Physical Medicine And Rehabilitatio         2   176426    192213   208000
10                       Psychiatrists        24    57179    190763   293307
11 Financial Managers, Branch Or Depar        10   105300    178515   259022
12                   Anesthesiologists        14    59093    176091   208000
13                        Hospitalists        81    49067    175939   275933
14                            Surgeons        33    57470    173307   285106
15 Electronics Engineers, Except Compu         1   172245    172245   172245
16                  Financial Managers       350    58490    168096   259022
17 Hospitalist Physician And Practice          1   162802    162802   162802
18            Investment Fund Managers        17    86091    161661   211349
19 Computer And Information Systems Ma      3248    72093    161511   285304
20                  Nurse Anesthetists         1   161304    161304   161304

The output shows top 20 job titles having the highest prevailing wages mentioned in the LCAs. If we check the total_LCA column, we see that there are some with 1, 2, and 7 or more. We have to see them in-terms of total data points, i.e., 111,644. I can say without any doubt that these values are not enough to establish a pattern or at least they don’t represent the actual population. For example, Oral and Maxillofacial Surgeons’ prevailing wage was $228,176/year but it has only one entry. So, there is no use of this data point. It says that I have to find some other ways to group the data.

As mentioned earlier, this study is just an exploratory account of the available data, I am going to focus on the most common job types. I am going to establish the cutoff point of total_LCA to 499, which I believe is a fair number. We should not forget that these data represent some 3-months of LCA filing at the Department of Labor. I am first going to filter the data and arrange them in descending order.

The above plot packs whole lot of information. Minimum wages are represented by the circles towards the y-axis, maximum wages by the circles away from it, and the red diamonds show the mean prevailing wages for the job title. The black vertical line, on the other hand shows the average prevailing wages among these popular job titles.

As can be seen, some the of the jobs have bigger range compared to others. One interesting discovery, at least for me is, none of the minimum wages are above $75,000/year. Two thirds of the minimum wages shown in the plot are below 50,000/year. These findings hint that the prevailing wages are not just based on the job title. We may want to inquire employees preparation, knowledge, academic achievement, company that hires them, maybe states they work in, etc. It’s a complete different study. Finally, Majority of the jobs pay lower than the average wages. Medical scientists position seem to pay the least average wages compared to other among top 30 most popular job titles. The plot speaks for itself.

Furthermore, because I come from and academia. I want to figure out different jobs that H1B employees fill in academia and compare the average prevailing wages.

I tried to group the academic data by job titles. I came up with 243 unique job titles. Some of them overlap, but I think they are listed different for a reason. Sorting them in a descending order based on mean wage did not work because, like in the non-academic scenario, some of the high paying academic job titles did not have enough data points to establish the fact. For example Hospitalist has the highest mean prevailing wage ($181,466) but the big variation between minimum_wage, i.e.78,478 and maximum_wage, i.e., 271,690, and just a few number of total_LCAs, i.e., 6 cast a doubt about generalizing the average annual price tag on this job title. Thus, I am going to revert to the old tactics of using a cutoff point (20 or more) to track most popular job titles based on the number of LCAs filed at the Department of Labor between October 2019 and December 2021 and compare their prevailing wages.

# mean(top_45_academic$mean_wage) # $61,394
glimpse(top_45_academic)

Rows: 45
Columns: 5
$ job_title    <chr> "Medical Scientists, Except Epi", "Biochemists And Biophy~
$ total_LCA    <int> 517, 437, 270, 190, 177, 149, 130, 114, 114, 94, 84, 77, ~
$ minimum_wage <dbl> 27602, 33446, 19200, 32198, 21280, 38106, 22090, 17320, 3~
$ mean_wage    <dbl> 53445, 51509, 87960, 50770, 67430, 54073, 59062, 63939, 6~
$ maximum_wage <dbl> 124613, 98900, 233740, 100600, 117080, 136800, 113589, 14~

The output gave me 45 unique job_titles. Along with these job titles, I now have, the minimum, maximum and the mean prevailing wages mentioned in the LCAs. I have calculated the mean prevailing wages of the mean wages, which gave me average mean ($ 61,394) among 45-most popular academic jobs for H1B employers during the year 2020 and 2021. Let’s plot them in a dumbbell plot.

The highest mean wages are slightly over $127,000/year and they go to the Physicians and Surgeons. It’s completely consistent with what I figured out at the opening of this study, where I got rid of 4 data points because they offered pretty high prevailing wages. However, the range is too big! Soil and Plant Scientists have pretty narrow range and they make slightly less than the average wages a year. Most of the mean prevailing wages are below 90,000/year. Please feel free to extract information depicted by the plot as per your interest.

Q.N.6. What is the most recent trend of applying for LCA at the Department of Labor?

Most of the data in the LCA_final data set come from last 3-months of 2021. Having at least 5-time series would give us better views about the long term trend. Thus, I would refer you to the final study. Here’s the link: https://rpubs.com/nirmal/885691

You Made to the End of the Report

Search This Blog

Data-Driven Education Insights